Gaining Deeper Insights with Data Science


If you’ve seen any tech job listings within the past year or so, odds are you’ve seen a description that contains the words data science or data scientist. Data science isn’t as much of a buzzword these days as it is a source of confusion. What exactly does a data scientist do? Do they wear lab coats and work under fluorescent lights all day? The answer is probably different than our imaginations would lead us to believe.

Data science, simply put, is the discipline of taking large amounts of data and deriving useful patterns and conclusions from it. I want to lay out some practical ways for us not only to understand the value of data science but also to apply it within our development teams and projects.

Why Data Science?

You might hear an initial pitch about why using data science matters and think, “Can’t I already access that information somewhat readily?”

While the answer may be fairly obvious for some problems, you’ll inevitably run into something that you can’t just solve with a short database query. Data science isn’t necessarily about running a few queries to gather data metrics. Rather, it’s more about taking those metrics over time and finding patterns and trends within them.

What’s a practical example of this, though?

Let’s say you work on a CI solution like Codeship and you’re looking at the database that contains all the information about your project builds. You might be asking yourself and your team, “Which three languages do we build, test, and deploy for most often?”

To figure this out, we need to gather data. Some pieces of data are easy to fetch, like the primary language associated with a build. However, multiple languages can be associated with a single build. You could be testing a Ruby application that also includes an embedded JavaScript application that’s nearly as big!

Information like this may not be obvious at first glance, which is why we need to programmatically examine each build and extract the languages it uses. We’d then aggregate the data points from our analysis: summing them, averaging them, and looking at how they’re distributed.
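As a rough sketch of that per-build examination, here’s one way it could look, assuming builds are represented as lists of file paths and using a made-up extension-to-language mapping (both are illustrative, not Codeship’s actual schema):

```python
from collections import Counter

# Hypothetical mapping from file extensions to languages.
EXTENSION_LANGUAGES = {".rb": "Ruby", ".js": "JavaScript", ".py": "Python", ".go": "Go"}

def languages_in_build(file_paths):
    """Return the set of languages detected in one build's file listing."""
    langs = set()
    for path in file_paths:
        for ext, lang in EXTENSION_LANGUAGES.items():
            if path.endswith(ext):
                langs.add(lang)
    return langs

def top_languages(builds, n=3):
    """Count how many builds use each language; return the n most common."""
    counts = Counter()
    for file_paths in builds:
        counts.update(languages_in_build(file_paths))
    return counts.most_common(n)

# A Ruby build with embedded JavaScript counts toward both languages.
builds = [
    ["app.rb", "assets/widget.js"],
    ["main.go"],
    ["server.rb", "spec/server_spec.rb"],
]
print(top_languages(builds))
```

The key point is that each build can contribute to more than one language’s count, which a single “primary language” column would miss.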

The findings from this question can help our company understand how to better optimize its users’ experiences with the platform. If a large number of users are testing JavaScript, then it’s worthwhile for the engineering team to invest in making JavaScript builds faster.

Before looking at the data, we really wouldn’t fully understand the need for this. Again, data science is at work helping us use our data to understand usage patterns and creating ways for us to optimize our applications effectively.

I believe that practicing data science gives engineering teams the business justification to tackle seemingly abstract problems. We might find an application to be slow, but data science lets us pinpoint exactly which parts are slow! We often don’t know which problems to solve until data science discovers them for us.

With that in mind, let’s explore how we can practically invest in data science with our development teams.

How to Invest in Data Science

Simply put, digging through data and analyzing it takes time. In many cases, it requires a developer’s full time and attention. That’s why I believe that the best way to invest in data science is to create full-time positions dedicated to it on your team.

This position doesn’t need any kind of label or specific title to actually be effective at utilizing data science. You just need a developer or a team of developers dedicated to the discipline of gathering deeper conclusions from your data. We do need to make sure they’re equipped with certain tools and skills, though.

I’ve seen that most data scientists utilize three basic skillsets:

  • experience in some sort of scripting language
  • a background or interest in math, specifically statistics
  • proficiency in SQL

Data scientists tend to use scripting languages as a means of stringing together various calls for data. This can be as simple as a one-off script or as complex as generating reports or websites that display insights. Either way, the script’s purpose is to orchestrate smaller functions that fetch and present data.
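A minimal sketch of that orchestration role: one function stands in for fetching data (in practice it might query a database or an internal API; the numbers here are made up), and another presents it as a CSV report.

```python
import csv
import io

def fetch_language_counts():
    """Stand-in for a real data source; returns hypothetical counts."""
    return [("Ruby", 1204), ("JavaScript", 987), ("Go", 512)]

def render_report(rows):
    """Present the fetched data points as a small CSV report."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["language", "build_count"])
    writer.writerows(rows)
    return out.getvalue()

# The script itself just glues fetching and presenting together.
print(render_report(fetch_language_counts()))
```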

A background or interest in the math behind statistics is really important to data science, because your effectiveness depends on the kinds of insights and calculations you’re able to think of and carry out. Not everything is going to be an average or a summation. You may need to leverage probability distributions and other statistical tools to gain deeper insights from your data.
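A quick illustration of why a plain average can mislead, using Python’s standard `statistics` module and some made-up build durations: a single slow outlier drags the mean upward, while the median stays close to a typical build.

```python
import statistics

# Hypothetical build durations in seconds; one slow outlier skews the data.
durations = [42, 45, 44, 47, 43, 46, 44, 300]

mean = statistics.mean(durations)      # pulled upward by the 300s outlier
median = statistics.median(durations)  # robust to the outlier
stdev = statistics.stdev(durations)    # spread around the mean

print(f"mean={mean:.1f}s median={median:.1f}s stdev={stdev:.1f}s")
```

Here the mean is over 76 seconds while the median is 44.5 seconds, which is why knowing which statistic answers your question matters as much as computing it.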

SQL or any kind of database language proficiency is also crucial for gathering data to analyze. There are numerous tools to help make this easier. However, knowing how to properly query a database to grab the data you need is critical to data science.
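As a sketch of the kind of query involved, here’s the earlier “builds per language” question expressed in SQL, run against an in-memory SQLite database standing in for a hypothetical builds table (the table layout and data are invented for illustration):

```python
import sqlite3

# In-memory database standing in for a hypothetical builds table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE builds (id INTEGER PRIMARY KEY, language TEXT, duration REAL)"
)
conn.executemany(
    "INSERT INTO builds (language, duration) VALUES (?, ?)",
    [("Ruby", 44.0), ("JavaScript", 61.5), ("Ruby", 47.2), ("Go", 12.9)],
)

# Aggregate query: build count and average duration per language,
# most common language first.
rows = conn.execute(
    """
    SELECT language, COUNT(*) AS build_count, AVG(duration) AS avg_duration
    FROM builds
    GROUP BY language
    ORDER BY build_count DESC
    """
).fetchall()

for language, build_count, avg_duration in rows:
    print(f"{language}: {build_count} builds, avg {avg_duration:.1f}s")
```

`GROUP BY` with aggregate functions like `COUNT` and `AVG` is the bread and butter of this kind of analysis; the same query shape works against production databases, not just SQLite.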

At the end of the day, these developers have to be comfortable diving deep into your application to collect data. They also have to be self-guided enough to know exactly what to do with that data. Some data science projects may take weeks or months to show any significant conclusion because they require large, carefully measured data sets.

Let’s say you don’t have the resources to have a developer or team dedicated to data science. There are a couple of solutions you can check out to help leverage these disciplines and conclusions for your team.

Automated Data Science

The biggest automated data science solutions out there tend to be based around error reporting or application performance analysis. Some products tend to cover parts of both ideas. However, I’d argue that their strength likely lies in only one of them.

Error reporting solutions help you track how many exceptions or bugs occur in your software. Popular solutions like Rollbar, Bugsnag, and Sentry can help alert your developers when errors occur but also offer more specific details about how often and why they occur. In my job, these tools are lifesavers when trying to track down a deeper, more complex issue.

Application analysis solutions help you analyze the slowest parts of your applications. Tools like Skylight and New Relic help developers understand their application’s performance and scalability in ways that might not be obvious through logs or errors. These tools will help carve out your workload when it comes to optimizing your products.

The downside to any of these solutions is that they only offer specific answers to specific problems.

In an ideal world, you have both developers and automated solutions working together. That way, human developers can work on the more complex issues while automated tools deliver a steady stream of conclusions to act on.
