Data science projects help you learn new skills, build your resume, and market yourself to land job interviews.
BUT, the key is: The project needs to be ORIGINAL.
You can jump on Kaggle and participate in a competition. Or find a dataset in the UCI Machine Learning Repository and build a classifier. BUT those projects have been done before… a lot. They’re great for learning, but they won’t make hiring managers say, “I want to work with this person”.
INSTEAD, if your end goal is to land a job, your project needs to be NOVEL. You have to start from scratch. If you do that, it’s much more likely people will take notice.
Where should you start? In this guide, we cover the steps needed to create your own original data science project, from formulating a problem statement to using the finished project to market yourself and land an interview.
A good data science project starts with an interesting idea. And that’s usually the hardest step. But here’s a suggestion: Find the overlap between your interests and the available data online.
If you start with a topic you’re really interested in, and know there’s data available to investigate it, you’ll be much more motivated to do the work and find an answer.
Here’s an example: Early in his data science career, our founder Jay worked on an NBA analytics problem. The leading question: Does the 2-for-1 strategy matter in NBA games? Fortunately, all the data needed to investigate it is available online, thanks to the play-by-play stats on Basketball Reference that go back decades.
So start with something you’re interested in: Comic books, movies, sports, public health… And then think about the data that might be available on that subject.
Successful data science and machine learning projects start with clear, measurable hypotheses. But you might be wondering: What makes a good hypothesis in data science?
Here are some mistakes to avoid:
Be Specific: A common mistake is choosing a topic that’s so broad that it becomes impossible to measure.
Take a question like: How many people are leaving cities because they’re too expensive? That’s really broad, and finding data to answer it would be nearly impossible, or would take so much time that it isn’t worth the effort.
Plus, a question like that makes another critical mistake: There are too many unknowable variables.
A better hypothesis is really clear and investigable. So a better question about housing for a data science project would be something like: What effect did the 2008 recession have on the population of San Francisco?
This question is much more specific and measurable, and there’s a better chance you can find quality data to investigate it.
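To make "measurable" concrete, here is a minimal sketch of how a question like the San Francisco one could be turned into a number: compare the average annual growth rate in the years before an event with the rate in the years after it. The function names and the input format (a dict of year to population) are assumptions for illustration, and the figures at the bottom are made-up placeholders, not real census data.

```python
def average_annual_growth(population_by_year, start_year, end_year):
    """Compound annual growth rate between two years."""
    years = end_year - start_year
    if years <= 0:
        raise ValueError("end_year must be after start_year")
    start = population_by_year[start_year]
    end = population_by_year[end_year]
    return (end / start) ** (1 / years) - 1


def compare_growth(population_by_year, event_year, window):
    """Growth rate in the `window` years before vs. after `event_year`."""
    before = average_annual_growth(
        population_by_year, event_year - window, event_year
    )
    after = average_annual_growth(
        population_by_year, event_year, event_year + window
    )
    return {"before": before, "after": after, "change": after - before}


# Placeholder numbers for illustration only; NOT real San Francisco data.
placeholder_population = {2006: 1_000_000, 2008: 1_040_400, 2010: 1_061_312}
result = compare_growth(placeholder_population, event_year=2008, window=2)
print(result)
```

A comparison like this only describes what changed around the event. Showing that the recession caused the change would take more work, such as comparing against similar cities.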
You can’t solve a data science problem without… quality data. (You never would have guessed that, right?)
So how do you do it? There are really three options for finding interesting data for machine learning projects:
So, there are tons of places you can look for data. The hard part is processing and cleaning it. Once you’ve got that down, you can start your project and investigate your hypothesis.
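As a rough illustration of what "processing and cleaning" can involve, here is a small sketch using only Python's standard library. It assumes a CSV with hypothetical `year` and `population` columns, and it handles three common problems: stray whitespace and thousands separators, missing or non-numeric values, and duplicate rows. Real datasets will have their own quirks, so treat this as a starting point rather than a recipe.

```python
import csv
import io


def clean_population_rows(raw_csv_text):
    """Parse CSV text into {year: population}, skipping unusable rows.

    The column names `year` and `population` are assumptions for this example.
    """
    cleaned = {}
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    for row in reader:
        year_text = (row.get("year") or "").strip()
        # Remove thousands separators such as "1,000,000".
        pop_text = (row.get("population") or "").strip().replace(",", "")
        if not year_text.isdigit() or not pop_text.isdigit():
            continue  # missing or non-numeric value
        year = int(year_text)
        if year in cleaned:
            continue  # duplicate year: keep the first occurrence
        cleaned[year] = int(pop_text)
    return cleaned


# Made-up sample input showing the kinds of mess the function handles.
raw = (
    "year,population\n"
    '2006,"1,000,000"\n'
    "2007,\n"
    "2008, 1040400 \n"
    "2008,1040400\n"
    "n/a,500\n"
)
cleaned = clean_population_rows(raw)
print(cleaned)
```

Silently dropping bad rows keeps the example short. In a real project, it is worth counting and logging what gets dropped so you can document it later.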
Now, it’s time for the fun part: doing the data science project. Typically, there are a few common steps:
While you work on the project, remember to DOCUMENT everything. Create a code repository to store code and visualizations, and take notes throughout the process (which will be helpful in the next step).
If you really want to use a data science project to gain exposure, you have to market it. You’re investing a lot of energy… you don’t want it to go unnoticed.
Fortunately, there are tons of easy ways to get your project in front of eyeballs. Some that we’ve had success with are:
If you’ve got a great idea, people will want to read about it. And you don’t have to write a thesis. Just focus on the problem, the data you used, challenges, and the conclusions you drew.
If you really want to go all out, you can write a press release and pay for distribution (which is useful for topics with broad audiences like housing, crime, or public health). Or you can email a short press release to journalists who cover the niche your project falls in.
Need some help creating a data science project from scratch? You’ll find a lot of helpful project ideas and dataset resources on Interview Query: