Data Science Technology at The General®: Python

Introduction

This post is a continuation of a series that Jack kicked off about the data science tech stack at The General®. In this post, I’ll focus on Python and why we use it over comparable tools. Throughout our entire data science pipeline, from data gathering to prediction generation to presentations, you’ll find Python employed to some extent. There are many reasons we use Python for certain segments of this pipeline, be it the nature of the task, the strengths of the programmer, network externalities, or convenience. For example, I began programming in Python, and since Python is already ingrained in our team’s day-to-day tasks, I’ll usually choose Python over comparable tools.

Before we begin, let me say what this post is not. This is not meant to be an R vs. Python post, or a DataRobot vs. built-from-scratch post. Those posts already exist. This article only highlights why our team uses certain tools. The best advice I can give you for deciding between tools is to choose what you’re most comfortable with and what your current tech stack can support.

So let’s go through our pipeline and discuss where and why we use Python for certain tasks over other tools. Our data science pipeline at The General roughly follows CRISP-DM, so it should be familiar to most.

Data Retrieval

When retrieving data from a SQL database, I use SQLAlchemy, an object-relational mapper (ORM) that supports most relational database systems. Executing raw SQL (within SQLAlchemy) is fine for ad hoc queries, but for highly transactional systems it may be better to use the ORM capabilities of SQLAlchemy. An ORM is great because it lets you convert and persist data between two incompatible type systems while also taking advantage of the language’s (Python’s, in this case) OOP features, making database queries more Pythonic, natural, and modular.
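To make that concrete, here is a minimal sketch of the ORM pattern. The table, columns, and connection string are hypothetical, not our actual schema:

```python
from sqlalchemy import Column, Integer, Numeric, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Policy(Base):
    """Maps a hypothetical 'policies' table to a plain Python class."""
    __tablename__ = "policies"
    id = Column(Integer, primary_key=True)
    state = Column(String)
    premium = Column(Numeric)

# Connection string is a placeholder; swap in your actual database URI
engine = create_engine("postgresql://user:password@host/dbname")
Session = sessionmaker(bind=engine)

session = Session()
# Query with Python objects and methods instead of raw SQL strings
tn_policies = (
    session.query(Policy)
    .filter(Policy.state == "TN")
    .order_by(Policy.premium.desc())
    .all()
)
session.close()
```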

When I have to read flat files, I usually choose pandas because it will convert the data directly to a DataFrame, which is the format I do most of my data cleaning and feature engineering in. If I have to retrieve data over HTTP (or other transfer protocols), I also use Python because tools like requests make this easy.
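Both cases look something like the sketch below; the file name, URL, and column names are placeholders:

```python
import pandas as pd
import requests

# Flat file straight into a DataFrame (path and columns are hypothetical)
claims = pd.read_csv("claims_extract.csv", parse_dates=["loss_date"])

# Data over HTTP with requests (URL and params are placeholders)
resp = requests.get("https://example.com/api/quotes", params={"state": "TN"})
resp.raise_for_status()
quotes = pd.DataFrame(resp.json())
```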

So why choose Python for these tasks? For me, the main reason is that I’m already familiar with Python and that other tasks related to data retrieval are written in Python. But other tools exist for these tasks. I’ve used R’s odbc library before, and it was straightforward. However, I haven’t found a good ORM alternative for R, so I’d say Python is an easy choice for your ORM needs.

Data Cleansing

A big part of a data scientist’s job is cleaning data to make it consumable by downstream tasks. Choosing between the R and Python ecosystems for this task boils down to language preference. I’ve used pandas/numpy and the tidyverse, and I like them both. I’ve come across data cleansing tasks I prefer in pandas and some I prefer in dplyr. If data cleansing weren’t connected with other tasks in my workflow, I’d be indifferent between the two. But since most of my workflow is in Python, that’s what I typically use for data cleansing.
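As a rough illustration, a typical pandas cleansing pass might look something like this; the file and column names are hypothetical:

```python
import pandas as pd

claims = pd.read_csv("claims_extract.csv")

claims = (
    claims
    .drop_duplicates(subset=["claim_id"])      # remove duplicate records
    .dropna(subset=["claim_amount"])           # drop rows missing a key field
    .assign(
        state=lambda d: d["state"].str.strip().str.upper(),     # normalize categorical text
        claim_amount=lambda d: d["claim_amount"].astype(float),  # enforce a numeric dtype
    )
)
```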

Modeling

Here at The General, we use quite a few tools for modeling, depending on the business question and the delivery method. For many projects, we use R. For others, we use DataRobot. And we have data scientists on the team who will discuss why they use R (or DataRobot) in future posts. I mostly use Python for a variety of reasons.

Most of my work is in natural language processing and machine learning. I have found that these tasks are easier to implement with tools in the Python ecosystem. For example, many of my data science projects depend on word embeddings. I haven’t found tools outside of Python that support this as smoothly as various Python libraries. Because of this, the work I do in word embeddings and topic modeling is done using gensim, a great library that allows you to quickly train NLP models and utilize them in downstream tasks.
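To give a flavor of why gensim is convenient, here is a minimal word-embedding sketch. The toy corpus and parameters are assumptions for illustration, not our production setup:

```python
from gensim.models import Word2Vec

# Tiny tokenized corpus purely for illustration; real corpora are far larger
sentences = [
    ["vehicle", "rear", "ended", "at", "stop", "light"],
    ["hail", "damage", "to", "hood", "and", "roof"],
    ["windshield", "cracked", "by", "road", "debris"],
]

# gensim 4.x API; older releases use size= instead of vector_size=
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

vector = model.wv["hail"]                        # 100-dimensional embedding
similar = model.wv.most_similar("hail", topn=3)  # nearest neighbors in vector space
```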

I also use Python to convert these word embeddings into features that are used in many prediction tasks. For example, we build a lot of classifiers that use word embeddings as features. Using Python to build the classifiers is an obvious choice since Python was the tool that generated the word embeddings, which are usually the most salient features.
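One common way to turn embeddings into classifier features (not necessarily the only approach we take) is to average the word vectors in each document. This sketch continues from the model above and assumes tokenized_docs is a list of token lists:

```python
import numpy as np

def document_vector(tokens, keyed_vectors):
    """Average the embeddings of the tokens the model actually knows about."""
    vectors = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    if not vectors:
        return np.zeros(keyed_vectors.vector_size)
    return np.mean(vectors, axis=0)

# tokenized_docs is a hypothetical list of token lists, one per document
X = np.vstack([document_vector(doc, model.wv) for doc in tokenized_docs])
```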

We experimented with a couple R libraries for topic modeling (topicmodels and tidytext) but found that 1) we got better results with gensim and 2) the gensim models were easier to put into production given our tech stack.

The primary package I use for ML modeling is scikit-learn. scikit-learn’s architecture is beautifully designed and makes modeling with different algorithms intuitive and easy. There are certainly R alternatives, like caret, that are as intuitive as scikit-learn, but since I use Python for NLP tasks and those NLP tasks feed into my ML models, I use a Python tool for my ML modeling as well.
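A bare-bones scikit-learn workflow over those features might look like this; the labels y are hypothetical, and logistic regression is a stand-in for whatever algorithm the problem calls for:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# X is the embedding feature matrix from above; y is a hypothetical label array
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```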

Model Deployment

This is probably the most important piece in our data science pipeline and, coincidentally, the hardest to build. We have a wide variety of compute-intensive models solving a host of business problems, and they require an exposable solution that gives data scientists the freedom to define their own inputs while easily accessing the output. Easy, right? Not so much. There are many tools out there that work great for most problems, and we explored many, including AWS SageMaker, AWS Lambda, AWS API Gateway, Alteryx Promote, and ad hoc Flask apps. These tools are great, and we encourage you to try them out to see which is best for your projects. Unfortunately, we found that our needs were too diverse for any one of these tools to cover.

To overcome these challenges, we have begun building our own model deployment/model management tool from scratch. Python, along with R and JavaScript, plays a huge role in developing this new tool. Dubbed Splinter, its backend architecture (including the translation layer) is written in Java, with the ability for a data scientist to expose a model written in Python, JavaScript, or R. With this tool, we control everything from the ground up and can add new functionality as we see fit without paying additional costs for premium features.

Scott Carr, a machine learning engineer, comments on why this solution works for us:

With this solution, our data scientists get to focus on what they’re good at and don’t have to worry about engineering solutions to problems. They simply write a function to tell us what to do with the data and we take care of the rest programmatically.

Michael Kehayes, also a machine learning engineer on our team, put it this way:

With Splinter, we get to do it our way and support features based on our needs. We’re not constricted by out of the box solutions.

The freedom to implement our own solutions is important to us as we continue to learn how diverse our needs are and how complicated some project deliverables can be. With our own tools, we are able to implement specific features on our own timeline.

Of course, not every data science team has the resources to build their own automated model deployment system from scratch. Don’t fret! You can easily put models into production using only R or Python and a few lines of code. The use case and skillset will dictate which language is best. If your goal is to build a beautiful dashboard but you don’t have the JavaScript, HTML, or CSS skills required to do so with the wonderful d3.js library, then use R Shiny. It’s a great way to create and publish visualizations with minimal web development skills. If your goal is to build a RESTful API that requires speed and scalability, then use Python’s Falcon framework, which boasts of being a “bare-metal Python web API framework for building very fast app backends and microservices.” Falcon can serve hundreds of thousands of requests per second while maintaining the simplicity of a Flask app.
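For reference, a minimal Falcon service looks roughly like the sketch below; the route, resource name, and score() function are placeholders for whatever model you load at startup:

```python
import falcon

class PredictResource:
    def on_post(self, req, resp):
        payload = req.media                 # parsed JSON request body
        # score() is a stand-in for your model's prediction function
        resp.media = {"prediction": score(payload)}

# falcon.App() in Falcon 3.x; earlier releases use falcon.API()
app = falcon.App()
app.add_route("/predict", PredictResource)
```

You would then serve the app with a WSGI server such as gunicorn (for example, gunicorn app:app) rather than running it directly.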

Presentation

We use many tools for project presentations, depending on what the project is and to whom we are presenting. Some business owners like to see Excel sheets, so that’s what we show them. Others prefer graphs and BANs. My go-to choice for presentations is Jupyter Notebooks with reveal.js, which makes converting notebooks to slideshows easy. I prefer this option because I can run live code during a presentation. Although, as anyone who’s given a demo can attest, anytime you’re demoing something you built with people around, it’s bound to break! But, hey! It’s not a real demo unless you’re debugging halfway through, right?

Others on the data science team use R Markdown, which is also a great option that produces some really smooth-looking presentations. There are some features, like the sidebar table of contents, that I think make R Markdown a better option; and other features, like live inline code execution, that make Jupyter Notebooks a better option. If I were more comfortable in R, I’d probably choose R Markdown. But due to familiarity bias, I typically choose Jupyter Notebooks for presentations.

Final Thoughts

I advise you to choose the tools that work best for you and your team. If that means using Alteryx for modeling or Perl for data munging, then go for it! I prefer Python for most of my work, but there are others on the team who are equally (if not more) productive than I am and who use R for most tasks. Work with the tools you are most comfortable with and that make you most productive.

Stay tuned for the next iteration of our data science tech stack deep dive and how we work with certain tools. Part III of this series will be from the engineering perspective, and their tools are very different!

Tim Dobbins