Tired of Kaggle and FiveThirtyEight? Here are alternative strategies I use to get high-quality and unique datasets
The key to a great data science project is a great dataset, but finding great data is much easier said than done.
I remember doing my Master’s in Data Science a year ago. Throughout the course, I found that coming up with project ideas was the easy part; it was finding a good dataset that I struggled with the most. I spent hours on the internet, pulling my hair out, trying to find juicy data sources, and came up with nothing.
Since then, I’ve come a long way in my approach, and in this article I want to share with you 5 strategies I use to find datasets. If you’re bored with standard sources like Kaggle and FiveThirtyEight, these strategies will allow you to get data that is unique and much more tailored to specific use cases.
Yes, believe it or not, creating your own data is actually a legitimate strategy. It even has a fancy technical name (“synthetic data generation”).
If you’re testing a new idea or have very specific data requirements, creating synthetic data is a fantastic way to get original and customized data sets.
For example, let’s say you’re trying to build a churn prediction model—a model that can predict how likely a customer is to leave a company. Churn is a fairly common “operational problem” that many companies face, and solving a problem like this is a great way to show employers that you can use ML to solve commercially relevant problems, as I discussed earlier:
However, if you search the web for “churn datasets”, you will find that (at the time of writing) there are only two major datasets clearly available to the public: the Bank Customer Churn Dataset and the Telecom Churn Dataset. These datasets are a fantastic place to start, but they may not reflect the kind of data you would need to model churn in other industries.
Instead, you can try to create synthetic data that is more tailored to your requirements.
If this sounds too good to be true, here’s an example dataset I created with a short prompt to that old chestnut, ChatGPT:
Of course, ChatGPT is limited in the speed and size of the datasets it can generate, so if you want to scale this technique up, I recommend using a Python library like faker or scikit-learn’s sklearn.datasets.make_classification and sklearn.datasets.make_regression functions. These tools are a fantastic way to programmatically generate huge datasets in the blink of an eye, and they’re ideal for building proof-of-concept models without spending ages hunting for the perfect dataset.
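To give a flavor of what this can look like in practice, here’s a minimal sketch that combines make_classification (for the numeric features and churn label) with faker (for plausible-looking customer details). The column names and class weights are made-up choices for illustration, not anything standard:

```python
# Sketch: build a synthetic "churn" dataset by combining scikit-learn's
# make_classification with Faker-generated customer details.
import pandas as pd
from faker import Faker
from sklearn.datasets import make_classification

fake = Faker()
n_customers = 1_000

# Numeric features plus a binary churn label; weights make churn the
# minority class, as it usually is in real life.
X, y = make_classification(
    n_samples=n_customers,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    weights=[0.8, 0.2],
    random_state=42,
)

# Hypothetical column names chosen for this example.
df = pd.DataFrame(X, columns=["tenure", "monthly_spend", "support_tickets", "usage_score"])
df["churned"] = y

# Add fake (but realistic-looking) customer details.
df["name"] = [fake.name() for _ in range(n_customers)]
df["email"] = [fake.email() for _ in range(n_customers)]
df["signup_date"] = [fake.date_between(start_date="-3y", end_date="today") for _ in range(n_customers)]

print(df.head())
```

A few lines of code, and you have a thousand “customers” to prototype against.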
In practice, I’ve rarely needed synthetic data generation to create an entire dataset (and, as I’ll explain in a moment, you should exercise caution if you plan to do this). Instead, I find it a really useful technique for generating counterexamples or adding noise to existing datasets, allowing me to test my models’ weaknesses and build more robust versions (a quick sketch of this follows below). But however you use it, it’s an incredibly useful tool to have at your disposal.
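For instance, continuing from the sketch above, one rough way to stress-test a model is to corrupt a small fraction of the training labels and compare how the scores degrade; the 10% flip rate below is an arbitrary choice for illustration:

```python
# Sketch: flip ~10% of training labels as synthetic "noise" and compare
# a model trained on clean labels vs. one trained on noisy labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
y_noisy = y_train.copy()
flip = rng.random(len(y_noisy)) < 0.10   # corrupt ~10% of the labels
y_noisy[flip] = 1 - y_noisy[flip]

clean_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
noisy_model = RandomForestClassifier(random_state=0).fit(X_train, y_noisy)

print("trained on clean labels:", clean_model.score(X_test, y_test))
print("trained on noisy labels:", noisy_model.score(X_test, y_test))
```

If a small amount of label noise tanks your test score, that’s a useful warning sign before you ever touch real data.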
Creating synthetic data is a good solution for situations where you can’t find the type of data you’re looking for, but the obvious problem is that you have no guarantee the data is a good representation of the real-world population.
If you want to guarantee that your data is realistic, the best way to do that is, surprise surprise…
… to actually go out and find some real data.
One way to do this is to contact companies that might hold such data and ask whether they’d be interested in sharing it with you. At the risk of stating the obvious, no company is going to hand over data that is too sensitive, or give you anything if you plan to use it for commercial or unethical purposes. That would just be stupid.
However, if you intend to use the data for research (e.g., a university project), you may find that companies are willing to provide data as part of a quid pro quo: a collaborative research agreement.
What do I mean by that? It’s actually quite simple: an agreement where they provide you with some (anonymized/desensitized) data and you use that data to conduct research that benefits them. For example, if you’re interested in churn modeling, you might write a proposal to compare different churn prediction techniques, then share the proposal with a few companies and ask whether there’s potential to work together. If you’re persistent and cast a wide net, you’ll likely find a company willing to provide data for your project, as long as you share your findings with them so they can benefit from the research.
If that sounds too good to be true, you might be surprised to hear that this is exactly what I did in graduate school. I approached several companies with proposals for how I could use their data in research that would benefit them, signed a few documents confirming that I wouldn’t use the data for any other purpose, and did a really fun project using some real data. It really can be done.
Another thing I particularly like about this strategy is that it gives you practice at fairly broad skills that matter in data science. You need to communicate well, demonstrate commercial awareness, and manage stakeholder expectations, all essential skills in the day-to-day life of a data scientist.
Many datasets used in academic research are not published on platforms such as Kaggle, but are still publicly available for other researchers to use.
One of the best ways to find datasets like these is to look at the repositories associated with academic journal articles. Why? Because many journals require their authors to make their underlying data publicly available. For example, two of the data sources I used during my Master’s (the Fragile Families Database and the Hate Speech Data website) weren’t available on Kaggle; I found them through academic papers and their associated code repositories.
How can you find these repositories? It’s actually surprisingly easy: I start by opening paperswithcode.com, searching for papers in the area I’m interested in, and browsing the available datasets until I find something that looks interesting. In my experience, this is a really good way to find datasets that haven’t already been done to death by the masses on Kaggle.
I honestly have no idea why more people aren’t using BigQuery Public Datasets. There are literally hundreds of datasets, covering everything from Google Search Trends to London bike hires to the genomic sequencing of cannabis.
One of the things I particularly like about this source is that many of the datasets are incredibly commercially relevant. You can say goodbye to niche academic topics like flower classification and digit prediction; in BigQuery, there are datasets on real business problems like ad performance, website visits, and economic forecasts.
Many people shy away from these datasets because they require SQL skills to load them. But even if you don’t know SQL and only know a language like Python or R, I’d still encourage you to spend an hour or two learning some basic SQL and then start querying these datasets. It doesn’t take long to get up and running, and this truly is a treasure trove of high-value data assets.
To use BigQuery Public Datasets, you can sign up for a completely free account and create a sandbox project by following the instructions here. You don’t need to enter your credit card details or anything like that; just your name, your email, a little information about the project, and you’re good to go. If you need more computing power later on, you can upgrade the project to a paid one and access GCP compute resources and advanced BigQuery features, but I’ve personally never needed to do this and have found the sandbox more than sufficient.
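Once the sandbox is set up, you can run queries in the BigQuery console, or, if you’d rather stay in Python, use the google-cloud-bigquery client to pull results straight into a DataFrame. Here’s a rough sketch against the London bike hire dataset mentioned above; the project name is a placeholder, and it’s worth double-checking the exact table and column names in the BigQuery console before running:

```python
# Sketch: query a BigQuery public dataset from Python and load the
# result into a pandas DataFrame. Assumes you have created a sandbox
# project and authenticated (e.g. `gcloud auth application-default login`).
from google.cloud import bigquery

client = bigquery.Client(project="my-sandbox-project")  # placeholder project ID

# London cycle hires: ten busiest start stations.
# Table and column names should be verified in the BigQuery console.
sql = """
    SELECT start_station_name, COUNT(*) AS n_hires
    FROM `bigquery-public-data.london_bicycles.cycle_hire`
    GROUP BY start_station_name
    ORDER BY n_hires DESC
    LIMIT 10
"""

df = client.query(sql).to_dataframe()
print(df)
```

A couple of lines of SQL wrapped in Python, and you’re analyzing millions of real bike hires instead of another toy CSV.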
My last tip is to try using a data search engine. These are incredible tools that have only emerged in the last few years, and they make it very easy to quickly see what’s out there. My three favorites are:
In my experience, searching with these tools can be a much more efficient strategy than using generic search engines, because you’re often given metadata about the datasets and can rank them by things like usage frequency and publication date. A pretty good approach, if you ask me.