Putting the “Science” in Data Science

Data Science is the new “Big Data” – everyone is talking about it, everyone thinks everyone else is doing it, but few are realizing its full potential. Done well, data science transforms your business into not merely a data-driven company but a model-driven one. The distinction is that a model-driven company creates systems that provide the best possible intelligence and inform business decisions in real time, as opposed to relying on ad-hoc explanations and business solutions. At The General® we use data science to drive change and dominate the non-standard auto insurance market.

Science as a Process

To put the science in data science, we need to clarify what we mean by ‘science’. It’s tempting to think of science in concrete terms – as the stuff it creates: Teslas, iPhones, etc. But we think science is better thought of as a verb – it’s a process we engage in.

Science as a discovery process has been developing for thousands of years.  Defining science is still a matter of debate – one we’ll leave to philosophers. And the more technical aspects of science such as hypothesis testing, A/B testing, etc., are better left for a future post.  Our aim here is to distill a modern view of the scientific method down to a handful of useful concepts that illustrate how to use the process to navigate business problems.

We all likely took at least one high school science class, so we won’t bore the reader with an exhaustive A-Z recap of the scientific method. This seemingly simple process is complicated by the humans who engage in it, with our biases, blind spots, and limitations. Keeping a few principles in mind can help us avoid errors and keep us from unintentionally loitering in pseudo-science territory.

First, remember science is inherently a discovery process. As humans, we are filled with biases, and we can and should challenge them thoughtfully rather than seek to confirm them. When developing a hypothesis, if you find yourself treading down well-worn roads, or feel uncomfortable when others suggest a new line of inquiry, you might be falling into confirmation bias. And if your forecast seems too good to be true, it likely is, and you should consider investigating your model for overfitness (did we just make a word up?).
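To make the "too good to be true" check concrete, here is a minimal Python sketch of one common way to look for overfitting: compare a model's error on the data it was trained on with its error on data it has never seen. Everything in it is an assumption made up for illustration, including the synthetic data, the function names, and the 1.5 ratio threshold; it is a sketch of the idea, not a recipe.

```python
import random

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(xs, ys):
    """Deliberately overfit model: return the y of the nearest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def make_data(n, rng):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def train_and_holdout_error(fit, train, holdout):
    """Fit on the training set only, then score on both sets."""
    model = fit(*train)
    train_err = mse(train[1], [model(x) for x in train[0]])
    holdout_err = mse(holdout[1], [model(x) for x in holdout[0]])
    return train_err, holdout_err

def looks_overfit(train_err, holdout_err, max_ratio=1.5):
    """Flag a model whose holdout error far exceeds its training error.

    The 1.5 threshold is an arbitrary illustrative choice.
    """
    return holdout_err > max_ratio * train_err

rng = random.Random(0)
train, holdout = make_data(500, rng), make_data(500, rng)
for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    train_err, holdout_err = train_and_holdout_error(fit, train, holdout)
    print(name, round(train_err, 2), round(holdout_err, 2),
          looks_overfit(train_err, holdout_err))
```

The memorizer reproduces its training data perfectly, which is exactly the kind of result that should raise an eyebrow; the holdout set is what exposes it. The straight line looks less impressive on the training data but holds up on new data.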

Second, the essence of science is testing, confirmation, falsification, and prediction. Our results should be testable and replicable, and should produce actionable predictions. Nobel Laureate Richard Feynman put it this way:

If it disagrees with experiment, it’s wrong. In that simple statement is the key to science. It doesn’t make any difference how beautiful your guess is, it doesn’t matter how smart you are, who made the guess, or what his name is … If it disagrees with experiment, it’s wrong. That’s all there is to it.

And third, effectively communicating results is essential to a successful project. As we’ve highlighted previously, the work we do is useless if we cannot get buy-in from the business owners to implement our solutions.

Science in the “Real World”

It’s all well and good to [briefly] play armchair philosopher, and understanding the language of science – as with any subject – is important if you’re to learn it. But science is about doing stuff and producing insights, not just pondering (but if you enjoy pondering, read this) – a distinction that will be appreciated by your boss and company, who want you to produce. Taking science from the abstract to the daily grind, let’s discuss some of the roadblocks that can present themselves while doing science in the “real world”.

Managing Internal Expectations

The key to doing science in a practical environment may be to manage your ideals and realize that at times there are other priorities to consider. You should reflect on your own and your team’s values and goals and how they align with your company’s overall strategy. It’s helpful to list them in order and understand how much more you value the first over the second, the second over the third, and so on. Realize there are costs and benefits to placing science above all else, just as there are to prioritizing politics and congeniality above all else.

One area in which compromise may be needed is the tools and data you use. When we’re learning, we use prepared datasets that allow for nice, clean models leading to fast results. Real-world problems are messy and rarely allow for this experience. You may have limited usable data; or you may have an amazing model, but resource constraints may require you to use something far simpler. Worse yet, the best possible model may not live up to expectations, and a satisfactory solution may not exist. The best course in these situations is to accept the limitations for the time being and recognize that sometimes “good enough” is okay.

What counts as good enough depends on your situation. For example, if your data science team is relatively new and can grab low-hanging fruit, getting a working model into production is itself a big win; if your team is mature and situated within a large company, good enough may require extra time and resources to achieve. Be sure to communicate with leadership and business owners so that realistic, productive objectives are clearly defined at the outset of each project.

Gaining Buy-In from Competing Perspectives

Keep in mind that not everyone in your company may share your zeal for inquiry.  When setting out on a project, you may encounter individuals who see the project as an opportunity to confirm what they already know.  Their position likely has merit, but as data scientists we’re here to test ideas.  In this situation, make your case as diplomatically as possible – think long-term with your relationships, use analogies to communicate complex ideas, and try to make a case about the pitfalls of moving forward solely on the basis of gut instincts.  If business pressures dictate that you must confirm existing ideas, avoid internalizing these limitations and move towards incremental improvements when possible.  

Managing External Expectations

One of the more difficult tasks for a data scientist can be managing the expectations of leadership and business partners. As terms like Big Data and Machine Learning have moved into the mainstream – a symptom of success – their buzzworthiness may be leading to unrealistic expectations of what data science team members are capable of. Someone who is an expert in GLMs, machine learning, natural language processing, deep learning, and has a decade of experience in Java, Python, and R is a Unicorn – we haven’t found one yet, and if we did we probably couldn’t afford them anyway! It’s important to level-set expectations of what team members can and cannot do, and of what data science itself can do. While we like to believe we can solve any problem, we still cannot predict the future!

The General’s® Data Science Ecosystem

Our Data Science team is organized into two groups: one specializing in engineering, made up of Machine Learning Engineers, and the other focusing on model building, made up of Data Scientists. Our Machine Learning Engineers generally have backgrounds in software engineering with strong skills in Java, but some started from a data or cloud engineering perspective, and their interest in data science led them to their current role. The Data Scientists on the team have varied backgrounds, from economics to bioinformatics to mathematics, and the paths taken to their current roles include experience in consulting, academia, and industry. Of course, not every Data Scientist is an expert in all areas of data science, and they don’t need to be. When we hire new team members, we look to add knowledge in areas the team doesn’t currently cover. The combination of our strengths in natural language processing, machine learning, supervised and unsupervised learning models, and traditional GLMs and statistics makes the whole greater than the sum of its parts.

At our organization, data science currently sits within the Product department. That’s not to say it’s the only or best way to organize a data science team; in your organization it may make more sense to be aligned with IT or Marketing & Sales, to report directly to the CEO, or to roll up through some other department. Our most important charge is speed to market, and sitting so close – figuratively and literally – to our business partners helps facilitate this objective. We touch projects in all areas of the business, from claims to underwriting, and it’s important to be central to business decisions. (In a previous post we touched on the importance of building trust with your organization, and we would be remiss not to mention how imperative those relationships are.)

Prioritizing Data Science Projects

With limited resources and a lengthy backlog, it is important to prioritize our work and add the most value in the shortest amount of time.  As such, we are selective in where we focus our efforts, and have an internal steering committee made up of members of the senior leadership team to advise us on project adoption.  Any projects we take on need to have clearly defined objectives and focus on optimizing one of three areas of improvement: process, product, or revenue. 

Perhaps more importantly, the business needs to be ready to implement the solution. If the resources or capabilities do not exist to productionalize the final output, we will not take on the project. Oftentimes the outcome of a large analysis by an analytics team is the business partner saying, “Great insight! That’s wonderful to know!” and then simply carrying that knowledge forward without acting on it. We call these project outcomes “nice to knows” and we avoid them at all costs. The outcomes of our projects need to be actionable and implementable with scalable solutions. We are in the business of productionalizing models. Of course, data analytics still occurs – it’s just not our area of focus. Each department throughout the organization has embedded data analysts who are domain experts and handle much of the day-to-day number crunching and reporting.

Our endeavors would be nothing without a great support network, and we work hand-in-hand with many other teams who are imperative to our success. The Data Warehouse team is responsible for ingesting data from internal and external sources and storing it in a usable and readily accessible format. We work closely with our Data Services team, who among other duties are responsible for helping business partners put together beautiful reports and dashboards in Tableau to see how any number of decision engines are performing. Also heavily involved in our processes is our DevOps team, who are experts in setting up secure environments within our cloud infrastructure; and of course we have a strong relationship with our IT Security team. Collaborating with these teams allows us to be laser-focused on generating revenue.

The Road Ahead

We have an ambitious three-year plan for data science at The General, including creating an internship program in 2019 with two interns and expanding it in coming years, and continuing to build partnerships with area universities and the Nashville Software School. Staying focused on projects that add the most value will encourage the business to keep investing resources in data science and allow us to grow the team, working on more projects that touch more areas of the business and drive more value. This year we have hired an additional Machine Learning Engineer and have plans to hire two additional Data Scientists. Perhaps in the future we will even have our own DevOps or Cloud Security resources. For now, though, we are happy to concentrate our attention on projects that make our customers’ lives easier and move The General toward becoming a model-driven company.