Data Science Technology at The General®: An Introduction

Data science is a broad field that spans all types of business and research questions. Because its applications are so varied, data scientists have a wealth of tools available across different languages and compute environments. This post is a high-level introduction to the technologies employed by The General®.

Data scientists at The General use many languages and technologies, including R, Python, Java, DataRobot, and others we will highlight. We are also in the cloud with AWS and Snowflake. That list probably looks familiar to anyone who has read a data science blog post or job posting. While most of us know this tech stack to one degree or another, what is more interesting is how many different ways these tools can be used and why one technology is selected over another. Why use Python over R for model development? Why develop a model with code when you could use DataRobot or H2O.ai? Future blog posts will address these questions. For now, we will discuss a few details about each technology and how we’re using it here.

We use Python in almost every part of our data science workflow. From retrieving data to building machine learning models, exposing those models via RESTful APIs, automating tasks, and visualizing data, Python is firmly ingrained in our everyday work. For example, natural language processing (NLP) is a major area of our work, and we have found Python provides the most efficient solutions for our needs. Packages like Gensim and spaCy make many NLP tasks easier and more performant.
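
As a small illustration of why those libraries are attractive, here is a minimal spaCy sketch for pulling named entities out of free text. The model name and the sample sentence are placeholders, not taken from our production pipelines:

```python
import spacy

# Load a small pretrained English pipeline (tokenizer, tagger, parser, NER).
nlp = spacy.load("en_core_web_sm")

# Hypothetical claim note, standing in for the kind of text we process.
claim_note = "Policyholder reported a collision on I-40 near Nashville on June 3rd."

# One pass through the pipeline yields tokens, tags, and named entities.
doc = nlp(claim_note)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # e.g., place and date entities with labels like 'GPE' or 'DATE'
```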

Machine learning engineers develop APIs in Java for web-based model deployment. Java provides consistent logging, validation, and a web service wrapper around the various modeling frameworks used throughout our infrastructure.

To loosely paraphrase Alison Hill, a self-described Data Scientist, Professional Educator, and instructor at DataCamp (check out her new tidyverse course!), “you don’t learn R, you learn how to do things in R.” At The General, we are continually improving how we use R and exploring its capabilities in creative ways. Currently, we use it most often to manipulate and visualize data with the dplyr and ggplot2 libraries, build traditional predictive models (e.g., GLMs including GAMs and Poisson regression, survival analysis, decision trees), and create readouts for our business owners with R Markdown.

SQL (Structured Query Language) is used in multiple dialects (such as T-SQL and ANSI SQL) to query the various database systems at The General. SQL often appears in the first steps of pipeline development, generating datasets from a database for processing. The processing and analysis of the SQL-generated dataset is typically done in other languages and tools, such as Python or R.
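
A typical first step looks something like the sketch below, which pulls a SQL result set into a pandas DataFrame for downstream work in Python. The connection string, table, and column names are illustrative only; in practice credentials come from a secrets store:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical SQL Server connection via a preconfigured ODBC data source.
engine = create_engine("mssql+pyodbc://example_user:example_password@example_dsn")

query = """
    SELECT policy_id, quote_date, premium, converted
    FROM quotes
    WHERE quote_date >= '2020-01-01'
"""

# Pull the result set into a DataFrame, then continue the pipeline in Python or R.
df = pd.read_sql(query, engine)
print(df.shape)
```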

Data scientists and machine learning engineers use a plethora of AWS services. Instances can be started and scaled on the fly, which allows data scientists to test and train models and determine the compute resources needed to deploy them. Training data from external vendors is loaded to AWS for centralized access. Our use of AWS services is constantly evolving.
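
A common pattern, sketched below with the boto3 SDK, is staging vendor data in S3 and spinning up an instance sized for a training run. The bucket name, AMI ID, and instance type are placeholders:

```python
import boto3

# Stage an external vendor's training file in S3 for centralized access.
s3 = boto3.client("s3")
s3.upload_file("vendor_extract.csv", "example-training-data-bucket", "vendor/vendor_extract.csv")

# Launch a temporary EC2 instance for a model-training experiment.
ec2 = boto3.resource("ec2")
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="m5.2xlarge",         # scaled up or down per experiment
    MinCount=1,
    MaxCount=1,
)
print(instances[0].id)
```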

Snowflake’s scalability and API access allow for flexibility in model training and deployment. Its Python connector gives us module-level control of the compute resources behind the queries that train models or perform ETL jobs. Data loaded to Snowflake can also be accessed by the Data Services team for reporting to business users.
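
The sketch below shows the Snowflake Python connector pattern we are describing, where each job can point at a warehouse sized for its workload. Account, warehouse, database, and table names are placeholders:

```python
import snowflake.connector

# Connect with a warehouse chosen for this job (e.g., a larger one for training pulls).
conn = snowflake.connector.connect(
    account="example_account",
    user="example_user",
    password="example_password",
    warehouse="TRAINING_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

cur = conn.cursor()
cur.execute("SELECT policy_id, feature_1, feature_2, target FROM training_view LIMIT 100000")
rows = cur.fetchall()

cur.close()
conn.close()
```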

DataRobot is a subscription-based platform specializing in automated machine learning. Its strength lies in building and comparing hundreds of complex models simultaneously. We’re using DataRobot to automate some of our model-building processes, as well as for some model deployment and monitoring.
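
As a rough sketch of that workflow using DataRobot’s Python client (the project name, file, and target column are hypothetical, and the exact client calls can vary by version):

```python
import datarobot as dr
import pandas as pd

# Authenticate against the DataRobot API (endpoint and token are placeholders).
dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

df = pd.read_csv("training_data.csv")

# Kick off autopilot: DataRobot builds and ranks many candidate models against the target.
project = dr.Project.start(df, target="converted", project_name="quote_conversion_demo")
project.wait_for_autopilot()

# Compare the top of the leaderboard of fitted models.
for model in project.get_models()[:5]:
    print(model.model_type, model.metrics)
```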

H2O, from H2O.ai, is open-source software for predictive model development and scoring. At The General, H2O has been used to build customer lifetime value models that inform web-based marketing initiatives. In the future, we plan to use H2O in conjunction with other technologies to implement real-time web-based model scoring.
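
A minimal H2O sketch of that kind of supervised model is below; the file, feature names, and lifetime-value target are hypothetical stand-ins:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

# Start (or connect to) a local H2O cluster.
h2o.init()

# Hypothetical training table with customer features and a lifetime-value target.
frame = h2o.import_file("customer_ltv.csv")
train, valid = frame.split_frame(ratios=[0.8], seed=42)

features = ["tenure_months", "policy_count", "avg_premium", "channel"]
target = "lifetime_value"

# Fit a gradient boosting model and check validation error.
model = H2OGradientBoostingEstimator(ntrees=200, max_depth=5, seed=42)
model.train(x=features, y=target, training_frame=train, validation_frame=valid)
print(model.rmse(valid=True))
```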

Airflow is used for script orchestration on AWS, coordinating a complex process that moves data between systems and through models. Airflow DAG files schedule Python scripts as workflows and verify that one script has completed before the next in the process begins.
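
A stripped-down version of that pattern might look like the DAG sketch below. The task names and the callables they wrap are placeholders, and the operator import path differs slightly between Airflow 1.x and 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def extract_data():
    ...  # pull source data, e.g., from Snowflake or S3


def score_model():
    ...  # run the trained model against the extracted data


def publish_results():
    ...  # write scores back out for downstream systems


with DAG(
    dag_id="model_scoring_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_data", python_callable=extract_data)
    score = PythonOperator(task_id="score_model", python_callable=score_model)
    publish = PythonOperator(task_id="publish_results", python_callable=publish_results)

    # Each task starts only after the previous one has completed successfully.
    extract >> score >> publish
```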

The applications of these technologies are constantly changing at The General. We are dedicated to learning how to apply them in new ways and to developing best practices. As the title implies, this post is the introduction to an ongoing series in which we discuss how we implement our tech stack and leverage these tools to solve specific business problems.

Jack Pitts