All Hail the Kaggle-Bot 🤖
For those of you who haven’t encountered it, Kaggle is a website chock-full of nerdy data science competitions. These competitions often attract world-class data scientists and machine learning researchers, who compete for fame and fortune by seeing who can come up with the best solutions to an eclectic selection of real-world problems. Past competitions have included:
- Predicting Molecular Properties
- Classifying Crime in SF
- Predicting Tube prices
- And efficiently delivering presents for Santa?
According to their self-published dataset, ‘Meta Kaggle’, there have been just north of 5500 competitions hosted, with almost 13 million submissions vying for the top spot. Each of these submissions is scored via one of many evaluation metrics and ranked on a leaderboard from highest to lowest performing. Well, as of two weeks ago, Kaggle collated all of the publicly shared notebook solutions to its competitions and released them in a new dataset known as ‘Meta Kaggle Code’.
This is a very cool dataset, especially in the context of the advent of LLMs, open-sourced or otherwise. Google (the owner of Kaggle) was likely interested in compiling this for its own LLM endeavors in recent years, so I would be surprised if this data hadn’t made it into the commercial offerings of PaLM or GPT by now. That said, I think this will add a lot of momentum to the open source community that is currently crossing the moats Google and OpenAI have spent so much time digging.
Why Is This Exciting?
Having a curated collection of:
- Problem statements
- Datasets
- Solutions
- Scores
- The ability to test updated notebooks against hidden test data
creates a perfect dojo of sorts for building data science bots on top of existing pre-trained models. The dataset is a collection of thousands of smart people showing their work as they tackle ambiguous problems, taking pains to avoid pitfalls like overfitting along the way.
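To make the "dojo" idea a little more concrete, here is a minimal Python sketch of how scored notebooks could be grouped by competition and ranked, so that the strongest solutions can be picked out as examples for a model to learn from. The `Submission` class and its field names are hypothetical and are not the actual Meta Kaggle schema; the sketch also assumes every submission in a competition shares the same metric direction.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    # All field names here are illustrative, not the real Meta Kaggle columns.
    competition: str               # competition identifier
    notebook_source: str           # the notebook's code as plain text
    score: float                   # leaderboard score for this submission
    higher_is_better: bool = True  # metric direction (e.g. False for RMSE)


def top_solutions(submissions, k=1):
    """Return the k best-scoring submissions for each competition."""
    by_competition = {}
    for sub in submissions:
        by_competition.setdefault(sub.competition, []).append(sub)

    best = {}
    for competition, subs in by_competition.items():
        # Assumption: all submissions in a competition use the same metric.
        descending = subs[0].higher_is_better
        ranked = sorted(subs, key=lambda s: s.score, reverse=descending)
        best[competition] = ranked[:k]
    return best


if __name__ == "__main__":
    demo = [
        Submission("crime-sf", "model_a", 0.71),
        Submission("crime-sf", "model_b", 0.84),
        Submission("molecules", "model_c", 1.90, higher_is_better=False),
        Submission("molecules", "model_d", 0.45, higher_is_better=False),
    ]
    for competition, subs in top_solutions(demo).items():
        print(competition, [s.notebook_source for s in subs])
```

Pairing each competition's problem statement with its top-ranked notebooks is only one way to slice the data; lower-ranked notebooks could be just as useful as contrasting examples.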
Challenges
That said, there’s no guarantee that something like this will work, and there are a number of known challenges right out of the gate. To start, Kaggle leaderboards are well known to be a flawed measure of the quality of the data scientists who submit to them. Leading notebooks are often copied and pasted with minimal changes, ‘private’ testing datasets are often leaked, and some of the anonymized features can be reverse-engineered. All of this means that the leaderboard metric will be a noisy indicator of success. Still, I’m interested to see where this dataset can lead if we find a Kaggler or two who is willing to help out their competition.
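One of these problems, copy-pasted notebooks, can at least be partially mitigated before treating scores as a signal. The sketch below uses Python's standard `difflib.SequenceMatcher` to drop notebooks whose source is nearly identical to one already kept. The function name, the `(name, source)` input format, and the 0.9 similarity threshold are all assumptions chosen for illustration, not anything prescribed by Kaggle.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(notebooks, threshold=0.9):
    """Keep only the first notebook from each group of near-identical sources.

    notebooks: list of (name, source) tuples, ordered earliest first.
    threshold: similarity ratio (0 to 1) at or above which a notebook
               is treated as a copy of one already kept (assumed value).
    """
    kept = []
    for name, source in notebooks:
        is_copy = any(
            SequenceMatcher(None, source, kept_source).ratio() >= threshold
            for _, kept_source in kept
        )
        if not is_copy:
            kept.append((name, source))
    return kept


if __name__ == "__main__":
    original = (
        "import pandas as pd\n"
        "df = pd.read_csv('train.csv')\n"
        "print(df.head())\n"
    )
    # A "fork" that changes a single argument.
    fork = original.replace("head()", "head(10)")
    different = (
        "from sklearn.linear_model import LinearRegression\n"
        "model = LinearRegression()\n"
    )
    notebooks = [("original", original), ("fork", fork), ("different", different)]
    print([name for name, _ in drop_near_duplicates(notebooks)])
```

Pairwise comparison like this scales poorly, so it would only be practical within a single competition or after a cheaper first-pass filter. It also does nothing about leaked test sets or reverse-engineered features, which need separate handling.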