Most of the time, we want a ready-made dataset to work with when analyzing database queries. It is quite hard to find a good dataset repository for projects in areas like databases, data science, and machine learning. If you have ever worked on a database project, you have probably spent much of your time browsing the internet for one. It can be both fun and frustrating to sift through a large number of datasets to find the perfect one for your project.
In this post, we will walk through good places to find datasets for databases, data science, and machine learning.
1. U.S. Government’s open data
Data.gov is a dataset aggregator and the home of the U.S. Government’s open data. At the time of writing, it lists 236,476 datasets in fields such as Agriculture, Climate, Education, Finance, and Health. A search box helps you find the data you are looking for. The datasets are public, and they can be downloaded in different formats. The site itself is maintained in a public GitHub repository.
2. Kaggle
Kaggle hosts a large number (approx. 22,325) of real-life datasets of all sizes and in many different formats. Each dataset comes with associated “kernels”, most of which are written in Python. These kernels are notebooks that help data scientists analyze the data. Some of them also include algorithms for prediction problems.
3. Google Dataset Search
Google Dataset Search is a search tool that finds datasets by name or keyword. It brings together thousands of different dataset repositories in one place, which makes their data discoverable. This is what Google is good at!
4. Open Government Data (OGD) Platform India
Open Government Data (OGD) Platform India is a single point of access to datasets in open formats published by Ministries and Departments. It offers real-life datasets of all shapes and sizes, along with their APIs and visualizations. The datasets are available for public use.
5. Microsoft Research Open Data
Microsoft, together with the external research community, launched a repository called “Microsoft Research Open Data” in July 2018. It consists of curated datasets that were used in published research studies. The datasets cover fields such as Computer Science, Biology, Healthcare, and Mathematics, and they can be downloaded in a wide variety of formats.
6. Socrata Open Data
Socrata Open Data is a portal that hosts datasets on a broad range of topics, which makes it attractive and useful to data scientists and other researchers. You can browse the data in tabular form in the browser or use the built-in visualization tools.
7. Quandl
As its tagline says, “The world’s most powerful data lives on Quandl.” Quandl is a repository mainly for financial and economic data, and it is probably most valuable for those working on machine learning projects. The datasets are clean, which helps models make more accurate predictions. However, not all of the datasets are free; only some of them are public.
8. UCI Machine Learning Repository
The UCI Machine Learning Repository is one of the most famous data repositories. If you are looking for machine learning datasets, it should be your first choice. It currently contains 487 datasets from different fields, each labeled with attributes such as its domain and the type of problem it suits, for example classification or regression.
9. Academic Torrents
Academic Torrents is not a mainstream repository, yet it is a powerful way to share data. It was created to make academic datasets and research papers available via BitTorrent. Its main focus, however, is sharing the datasets behind research papers.
10. Reddit or r/datasets
Reddit is a popular social news site, but it also acts as a discussion board for sharing datasets. Its discussion boards are called subreddits, and r/datasets is the one dedicated to datasets. It is a place to share, find, and discuss datasets. However, the quality of the datasets may vary because different users submit them.
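Once you have downloaded a dataset from any of these sources, a quick way to start running queries against it is to load it into a database. The sketch below is one possible approach using only the Python standard library: it reads a CSV into an in-memory SQLite table and runs an aggregate query. The sample rows, the table name `harvest`, the column names, and the helper `load_csv` are all made up for illustration; a real download would be opened with `open(path, newline="")` instead of `io.StringIO`.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = """state,year,crop,yield_tons
Iowa,2019,corn,2500
Iowa,2020,corn,2300
Ohio,2019,corn,1200
Ohio,2020,soy,800
"""


def load_csv(conn, table, csv_file):
    """Create `table` from the CSV header row and insert every data row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_csv(conn, "harvest", io.StringIO(SAMPLE_CSV))

# Values arrive as text, so cast before aggregating.
totals = conn.execute(
    "SELECT state, SUM(CAST(yield_tons AS INTEGER)) "
    "FROM harvest GROUP BY state ORDER BY state"
).fetchall()
print(totals)  # [('Iowa', 4800), ('Ohio', 2000)]
```

Because the columns are created without declared types, every value is stored as text; for serious query analysis you would probably want to define the schema explicitly rather than infer it from the header.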