Written by Scott W. Strong

In the world of data science, we have many (almost too many) options available to us when trying to solve a problem. Much of the time, people get caught up in the “deep” this and the “Bayesian” that and forget the foundational process required to solve many data science problems. In the most general sense, data science problems can be broken into two different camps: 1) a problem that requires the exploration of a dataset, and 2) a problem that has a specific task, but an undefined dataset or starting point.

To explain further, consider a situation where we want to understand the landscape of parking tickets in New York City. This cleanly falls into the first category: we have a dataset provided by the local government, but no clear goal other than to understand “what’s going on” in it. Other examples of this type of problem include looking at national electricity usage to gain insights, or understanding how fraud is perpetrated in a calling card company (more on this in my next post).

Another example is a task where we want to predict tornadoes in Oklahoma. This clearly falls into the second type of data science problem, where we have a goal, but don’t yet have data to support that goal. Other examples of this type of problem include determining stock market movement or predicting the outcome of March Madness.

The issue for many aspiring data scientists, as in many other fields, is that novices struggle to know where to begin solving problems.

So, How Do We Get Started?

As I have moved through my career (both in school and on the job), I have found that the key to becoming an expert in any field is asking the right questions. Going from an aerospace engineer, to a financial quantitative analyst, to a data scientist detecting phone fraud, you can imagine there is a lot of field-specific knowledge that needs to be learned in order to become an expert. Although this is true, there is a common thread through each of these fields — Problem Solving. As such, I realized that knowing the right questions to ask when solving a new problem is critical to success. To that end, let’s discuss the key questions that should be asked for the two types of data science problems we face.

Exploring a Dataset

For the situation where we have data, but we don’t yet know how it’s useful (think New York City parking tickets), here is a set of questions to ask as you begin exploring the data:

  1. What labels/information is available?
    • Ticket location, time, cost of ticket, etc.
    • Is the data continuous, discrete, categorical, or a combination?
  2. What can we gain from the information provided?
    • Which fire hydrants produce the most tickets?
    • Is there a time or place that you should never try and park in NYC?
  3. Would additional data be helpful?
    • What if we had Vehicle IDs, types, and colors
      • Are specific vehicles being targeted?
      • Do brighter cars get ticketed more often?
  4. What can we use to understand the structure of the data?
    • Are there basic statistics that will be informative?
    • Are there visualization techniques that will reveal structure?
      • Graphs, clustering, etc.
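The exploration steps above can be sketched in a few lines of pandas. This is a minimal sketch on a toy sample of ticket records; the column names (`borough`, `hour`, `fine`) are hypothetical placeholders, not the actual NYC open-data schema:

```python
import pandas as pd

# Toy sample of parking-ticket records (made-up values).
tickets = pd.DataFrame({
    "borough": ["Manhattan", "Brooklyn", "Manhattan", "Queens", "Manhattan"],
    "hour": [9, 14, 9, 17, 9],
    "fine": [65.0, 45.0, 115.0, 65.0, 65.0],
})

# 1. What labels/information are available, and of what type?
#    dtypes distinguishes categorical (object) from continuous (float) fields.
print(tickets.dtypes)

# 2. Basic statistics: where and when are tickets concentrated?
by_borough = tickets.groupby("borough")["fine"].agg(["count", "sum"])
peak_hour = tickets["hour"].mode()[0]
print(by_borough)
print("Peak ticketing hour:", peak_hour)
```

On a real dataset, the same `groupby`/`agg` pattern answers questions like “which locations produce the most tickets?”, and a quick `dtypes` pass tells you which visualization or clustering techniques are even applicable.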

If you’re interested in seeing what a data scientist, Ben Wellington, actually found in this data, here is a link to a great TEDx Talk and the related NPR article.

Prediction Task, But No Data

For situations where we have a specific goal in mind, but no data to support that goal (think of predicting tornadoes), here is a set of questions to ask as you start making predictions:

  1. What do we know about the system?
    • What are the contributing factors for tornadoes?
    • Why do they occur?
  2. What should we predict?
    • Probability of a tornado
      • Right now, in an hour, or tomorrow?
  3. What data is available to support our predictions?
    • NOAA Severe Weather Database
    • Historical weather reports or Satellite Imagery
  4. Which prediction techniques should we use?
    • Should we be using classification or regression?
    • How much representational capacity should the method have?
  5. How good are our predictions?
    • Are we doing better than a “dumb” learner?
      • Simply predicting “no tornadoes” all of the time would give very high accuracy, but would be completely useless!
    • What metrics best capture prediction performance?
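The “dumb learner” point above can be made concrete with a few lines of Python. This is a sketch with made-up numbers: on rare events like tornadoes, a baseline that always predicts “no tornado” scores high accuracy but zero recall, which is why accuracy alone is a poor metric here:

```python
# Made-up labels: 2 tornado days out of 100 (1 = tornado, 0 = no tornado).
actual = [0] * 98 + [1] * 2
# The "dumb" baseline: always predict "no tornado".
dumb = [0] * 100

# Accuracy: fraction of days where the prediction matched reality.
accuracy = sum(p == a for p, a in zip(dumb, actual)) / len(actual)
# Recall: fraction of actual tornadoes the predictor caught.
recall = sum(p == 1 and a == 1 for p, a in zip(dumb, actual)) / sum(actual)

print(f"accuracy = {accuracy:.2f}")  # high, because tornadoes are rare
print(f"recall   = {recall:.2f}")    # zero: it catches no tornadoes at all
```

Any real model has to beat this baseline on a metric that penalizes missed tornadoes, such as recall or F1, before its accuracy number means anything.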

In either of these cases, it is very important to make time to review the answers to each of these questions and iterate over them. By really digging into each of these questions and understanding the environment you are operating in, you will be well on your way to solving even the most complex of data science problems.

In my next two posts I will be breaking each problem down with real examples and diving into my thought process for how to explore fraud at a calling card company and how to predict March Madness. Please feel free to leave your comments below.

Read parts two and three of this series.
