The Data Science Process

Harrison Gu
5 min read · Jul 3, 2021

Over the past two months, I’ve been fully immersed in Flatiron School’s data science boot camp. To say that I’ve learned a lot would be an understatement. Starting from no coding knowledge at all, let alone Python or data science, I’ve learned the fundamentals of Python, its applications to data science, and even how to build, train, and test basic models. One of the key things I’ve been taught is the data science process, which you can think of as the blueprint for building a data science model. Although there are a few different frameworks, the one we have focused on at Flatiron School is called OSEMN, which stands for Obtain, Scrub, Explore, Model, iNterpret. In this blog, I will walk through how I used this process for my first class project and reflect on a few things I could’ve done differently.

The prompt for the project was to identify key characteristics that help movies generate the highest return on investment (ROI). Following OSEMN, the first step in the process would be to obtain data relevant to the prompt. This step was pretty straightforward, as a few datasets were already provided to us. However, I noticed that one of the features that I wanted to investigate wasn’t available in the provided datasets. To address this issue, I used The Movie Database’s API to obtain the extra information that I was looking for.
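To give a sense of what that looked like, here’s a minimal sketch of how one might pull extra movie details from the TMDb API with Python’s requests library. The API key, the example movie ID, and the specific response fields shown are placeholders rather than exact code from my project.

```python
import requests

API_KEY = "your_tmdb_api_key"  # placeholder; you get a key from TMDb
BASE_URL = "https://api.themoviedb.org/3/movie/{}"

def fetch_movie_details(movie_id):
    """Request details for a single movie and return the JSON payload."""
    response = requests.get(BASE_URL.format(movie_id), params={"api_key": API_KEY})
    response.raise_for_status()  # fail loudly on a bad request
    return response.json()

# Example: collect a couple of fields assumed to be in the response
details = fetch_movie_details(550)
print(details.get("runtime"), details.get("budget"))
```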

Once I had data on all the different features I wanted to explore, the next step was to scrub the data. In data science, scrubbing refers to cleaning up the data so that it is usable, since data is rarely entered perfectly, especially when it comes from different sources. For example, one of the problems I ran into was that dates were entered in different formats. Some had the format MM-DD-YYYY, others had YYYY-MM-DD, and still others had variations like the month spelled out. To address formatting issues, you first need to select a single standard format, then convert all the other formats to the one you’ve selected. Another common scrubbing task is deciding what to do with empty entries. In general, if there are few empty entries in a category relative to the total number of data points, it is acceptable to just drop those entries altogether. However, if the number of empty entries is a substantial part of your data, dropping all of them could hurt the accuracy of your model. In that case, there are a few ways to approach the problem. You can examine whether the feature is important for answering the prompt; if not, you can drop the feature as a whole. For example, if one of the features in the movie dataset was the length of the credits, and a substantial number of movies were missing that information, it would be acceptable to just get rid of it, because the length of the credits likely has no effect on ROI. If the feature is important, one method is to replace all missing entries with either the mean or the median value; you just need to note how this affects the distribution of the data.
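As a rough illustration (not my actual project code), here’s how those two scrubbing tasks might look in pandas, using a made-up dataframe with mixed date formats and a missing budget value:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with inconsistent date formats and a missing budget
df = pd.DataFrame({
    "release_date": ["05-21-2019", "2018-11-02", "July 4, 2017"],
    "budget": [150_000_000, np.nan, 90_000_000],
})

# Parse each date individually so the mixed formats all end up
# as one standard datetime representation
df["release_date"] = df["release_date"].apply(pd.to_datetime)

# If only a few rows were missing a value, we could simply drop them:
# df = df.dropna(subset=["budget"])

# Otherwise, fill with the median (and note how it shifts the distribution)
df["budget"] = df["budget"].fillna(df["budget"].median())
```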

The next step in the process is to explore the data. After scrubbing, it is important to explore the remaining data to check for outliers, distribution, scale, and linearity (if you are doing a linear regression). Depending on what type of model or analysis you are doing, it is also important to check for multicollinearity and to understand the measures of central tendency (mean, median, mode) and spread (standard deviation). Because this was my first project and I hadn’t learned about modeling yet, I took a very simple approach to this step: I just removed outliers and compared the measures of central tendency. However, if I were to do this project again, I would want to build a regression model, which would require checking the other requirements stated above. If the data didn’t have a normal distribution, I would apply a log transform. If certain features were highly correlated with others, I would eliminate one of them. I also know that the scales of the different features were not the same; for example, budget was measured in millions of dollars, while runtime was measured in minutes. To account for this inconsistency, I would scale all of the features so that they ranged from -1 to 1.
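Sketching those extra exploration steps with toy data (the column names and values here are invented, not from the real movie dataset), they might look something like this:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for the real movie features
df = pd.DataFrame({
    "budget": [160_000_000, 5_000_000, 90_000_000, 40_000_000],
    "runtime": [148, 92, 121, 105],
})

# Check pairwise correlation to spot multicollinearity between features
print(df.corr())

# Log-transform a right-skewed feature (budget spans orders of magnitude)
df["log_budget"] = np.log1p(df["budget"])

# Put every feature on the same -1 to 1 scale
cols = ["budget", "runtime", "log_budget"]
scaler = MinMaxScaler(feature_range=(-1, 1))
df[cols] = scaler.fit_transform(df[cols])
```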

Again, because of my limited knowledge at the time, I did not do any modeling for my project. If I were to build a regression model, I would first run a baseline multiple regression with all the data remaining after scrubbing and exploring. From there, I would look for ways to improve upon the baseline. There are many different steps you could take to improve your model. One is removing variables with high p-values; generally, a p-value greater than 0.05 is considered high. The reason you want to remove these variables is that a high p-value implies lower statistical significance, meaning the variable does not influence the dependent variable in a statistically significant way. You can also look to combine variables that have strong interactions, or run a polynomial regression for variables whose relationship is not linear. After each change to the model, I would compare two statistics against the baseline/previous model: R-squared and mean squared error (MSE). R-squared tells you what proportion of the variation in the outcome is explained by your model, and the goal is to get it as close to 1 as possible. MSE is obtained after training and testing your model, and it tells you how “off” your predictions are on new testing data. The closer your MSE is to 0, the better your model is at predicting outcomes.
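Here is a rough sketch of that workflow with statsmodels and scikit-learn. The feature matrix, column names, and target are synthetic stand-ins; in the real project they would be the scrubbed movie features and each movie’s ROI.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (made-up feature names)
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["budget", "runtime", "genre_score"])
y = 2.0 * X["budget"] + 0.5 * X["runtime"] + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline multiple regression with every feature
baseline = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(baseline.summary())  # shows R-squared and per-feature p-values

# Drop features whose p-value exceeds 0.05, then refit
keep = baseline.pvalues.drop("const").loc[lambda p: p <= 0.05].index
refit = sm.OLS(y_train, sm.add_constant(X_train[keep])).fit()

# Evaluate on held-out data with mean squared error
preds = refit.predict(sm.add_constant(X_test[keep]))
print("Test MSE:", mean_squared_error(y_test, preds))
```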

After you’re done tweaking your model, the only thing left to do is interpret the results. This part of the process will look different depending on what business question you’re trying to answer and what type of model or analysis was performed. For my project, I recommended that movie producers use the characteristics with the highest means and medians, and I used standard deviation to assess risk. If I had run a regression model, I would have used the coefficient (slope) of each feature to explain how that feature affects ROI.
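Continuing the hypothetical model from the sketch above, that interpretation would boil down to reading off the fitted coefficients:

```python
# Each coefficient estimates the change in ROI for a one-unit change in
# that feature, holding the other features constant
for feature, coef in refit.params.drop("const").items():
    print(f"A one-unit increase in {feature} changes predicted ROI by {coef:.2f}")
```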

As you can see, in the three weeks between completing my project and writing this blog, I’ve already learned many useful tools to improve my data science process. I know that I’ve only scratched the surface, and I’m excited to see what else there is to discover!
