Whether you are just getting started in data science or have a wealth of experience, entering data science competitions is an excellent way to practice and demonstrate your skill with real world datasets.
In the past few years, more and more organizations, companies and academic institutions have started offering contests for various data science projects, with prizes ranging from virtual recognition in online forums to hundreds of thousands of dollars.
Amazingly, as of this writing, Kaggle (a major data science competition website owned by Google) had two $100,000 competitions and four competitions offering $30,000 or more in prize money. That’s a lot of money for using data science skills and writing code to solve a real world problem!
We have extensively researched data science competitions across the web, and we are excited to share the strategies used by winning projects. For a great resource on current data science competitions from many providers, check out this super helpful website: mlcontests.com.
Here are our top 5 insider tips to help you win data science competitions:
1. Make sure you are using the latest data science packages by reading code from recent winners on GitHub or other public code repositories
Whenever you are competing in a data science competition, you can expect your competitors to be up-to-date on the most recent packages in either R or Python. First, make sure you update all of your packages to their latest versions, which can avoid nasty bugs and can sometimes add new features.
Next, check out contest winners by looking up their username on sites listing former competitions (e.g., on Kaggle’s “Complete Contests” page) and then search on GitHub for their personal pages. The Open Source software community is very collaborative, so you just might find the full code files from prior contest winners posted publicly on GitHub!
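Before you update anything, it helps to know which versions you actually have installed. Here is a minimal Python sketch of that check, using only the standard library. The helper name installed_versions and the package names in the example list are our own placeholders, so swap in whatever packages your competition code relies on.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


# Example package names only; replace them with the ones you use.
for name, version in installed_versions(["numpy", "pandas", "xgboost"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing this output against the versions used in a recent winner's repository is a quick way to spot packages you should update or install.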
2. Triple check all code and script files on the version of R or Python required for the competition
This might go without saying, but having errors in your code will exclude you from winning any competition. This is where a great team comes in handy. When coding and especially after an all-day or late night coding session, you might develop attention blindness, which means you might miss errors or typos in your own code.
To make sure you don’t get disqualified for code bugs or errors in formulas, assemble a great team of fellow data scientists to help you check code and give you feedback. Even if your team members are less experienced than you, having a second or third pair of eyes on your code is always a helpful step in the programming process.
Taking this one step further, triple check that your version of R or Python (and the versions of your packages) match what the competition requires. While this sounds simple, different versions of a programming language can give you different answers than the version the competition requires. For example, starting with R 4.0.0, strings in new data frames are kept as character strings by default instead of being converted to factors.
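If you are working in Python, the version check described above can be automated so it runs at the top of your script. The sketch below is one possible approach, not an official competition tool. The required versions shown (REQUIRED_PYTHON and REQUIRED_PACKAGES) are made-up placeholders, and the function names are our own; copy the real requirements from the competition rules.

```python
import sys
from importlib import metadata

# Hypothetical requirements for illustration; use the competition's real ones.
REQUIRED_PYTHON = (3, 10)
REQUIRED_PACKAGES = {"numpy": "1.26.4"}


def python_matches(required, actual=None):
    """Compare the major.minor Python version against the required one."""
    if actual is None:
        actual = sys.version_info
    return tuple(actual[:2]) == tuple(required)


def package_mismatches(required, lookup=metadata.version):
    """Return {package: (wanted, found)} for every package that differs."""
    problems = {}
    for name, wanted in required.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # not installed at all
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if not python_matches(REQUIRED_PYTHON):
    print("Python version differs from the required", REQUIRED_PYTHON)
for name, (wanted, found) in package_mismatches(REQUIRED_PACKAGES).items():
    print(f"{name}: required {wanted}, found {found}")
```

The lookup parameter exists only so the check can be exercised without installing anything; in normal use you can leave it at its default.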
3. Clear your environment before running your code from start to finish (and run it on 2 computers)
Again, most data scientists know to clear their environment before running their code file, but this can trip up even the most experienced data scientists.
When you run a code file, it can pick up variables and objects (e.g., data.frames, matrices, vectors, etc.) that are already in your environment. If old objects are still lying around with the same names your code uses, the code may run fine in your current session but behave differently, or not run at all, in a clean environment.
So, just like you clean your house or car, make sure to fully clean and clear your environment before running your final script from top to bottom, and run your code on two different machines before submitting it.
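For Python users, one simple way to get a guaranteed clean environment is to run the final script in a brand new interpreter process. The sketch below shows the idea with a tiny made-up script that quietly depends on a leftover variable. The names run_in_fresh_interpreter, demo, final_script.py and leftover_total are all hypothetical and only there for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_fresh_interpreter(script_path):
    """Run a script in a new Python process so nothing from this session leaks in."""
    return subprocess.run(
        [sys.executable, str(script_path)],
        capture_output=True,
        text=True,
    )


def demo():
    # A script that uses a variable it never defines.
    source = "print(leftover_total + 1)\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "final_script.py"
        script.write_text(source)

        # Simulate a long-running session where the variable still exists:
        # the code appears to work.
        session = {"leftover_total": 41}
        exec(source, session)

        # In a fresh process the same script fails with a NameError.
        return run_in_fresh_interpreter(script)


result = demo()
print("exit code:", result.returncode)
print(result.stderr.strip().splitlines()[-1])
```

A non-zero exit code from the fresh run is exactly the kind of problem you want to catch on your own machine rather than after you submit.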
4. Read up on the most accurate machine learning algorithms out there, which change often
Data science is one of the most dynamic and rapidly changing fields in science. Right now (as of October 2020), the XGBoost package is very hot because it uses gradient boosting with parallel tree construction to create highly accurate prediction models that run efficiently on many types of computers.
However, given the dynamic and innovative nature of machine learning and data science, next month’s or next year’s popular package will probably be different, so take some time to stay up-to-date on the latest machine learning packages that are commonly used in data science competitions. In fact, following Tip 1 above is a perfect way to learn about the latest algorithms, which are usually posted on GitHub for the world’s Open Source software community.
Checking our site frequently and following us on social media is another great way to stay current on the latest in data science. 🙂
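If you are curious what "gradient boosting" means under the hood, here is a deliberately tiny sketch in plain Python. It is not XGBoost's actual algorithm, which adds regularization, second-order gradient information and many engineering optimizations. It only illustrates the core idea for squared error: start from the average, then repeatedly fit a small tree (here a one-split "stump") to the current errors and add a shrunken version of it to the model. All function names and the toy data are our own.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, rounds=50, learning_rate=0.1):
    """Boost stumps on the residuals (the negative gradient of squared error)."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - predict_boosted(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: a simple step from 1 to 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, 1, 5, 5, 5, 5]
baseline = fit_boosted(xs, ys, rounds=0)  # predicts the average only
boosted = fit_boosted(xs, ys, rounds=50)
print("baseline error:", mean_squared_error(baseline, xs, ys))
print("boosted error:", mean_squared_error(boosted, xs, ys))
```

In a real competition you would reach for a tested library rather than code like this, but seeing the loop written out makes it easier to understand what the tuning parameters (number of rounds, learning rate) are actually doing.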
5. Make sure you have programming skills in more than one data science and statistical analysis language
While many competitions are exclusively in either R or Python, some are not specific about the programming language that you can use. This is where knowing more than one programming language is really important.
Because one programming language might not offer the package or specific functionality you need, having skills in multiple languages can help you address the requirements of the particular project you are working on for a competition.
To get you started on your learning, we offer a variety of software tutorials and articles on statistical packages, including the Top 10 Packages for Data Science in R and a comparison of pros and cons of different data science languages.
The bottom line
While these insider tips won’t guarantee that you’ll win a data science competition, they will go a long way to helping you be competitive and have a chance of winning the (sometimes) huge monetary prizes offered for winners of data science competitions.
If you found this article helpful, make sure to check out our other resources to help you learn and succeed as a data scientist, including our list of top books for learning data science, our reviews of online data science courses, and our guide to must-have technology for data science.