Datasets

Links

- Google Dataset Search (HN) (HN)
- Tencent ML-Images - Largest multi-label image database; ResNet-101 model; 80.73% top-1 acc on ImageNet.
- Mathematics Dataset - Dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty.
- Moving autonomous vehicles forward, together. Dataset by Lyft
- CodeSearchNet - Datasets, tools, and benchmarks for representation learning of code.
- Introducing the CodeSearchNet challenge (2019) (HN)
- Facets - Visualizations for machine learning datasets.
- skdata - Data sets for machine learning in Python.
- TensorFlow Datasets - Collection of datasets ready to use with TensorFlow.
- Awesome Public Datasets
- Awesome Public Datasets Core - Next iteration of APD project.
- LORIS - Web-accessible database solution for longitudinal multi-site studies.
- ProteinNet - Standardized data set for machine learning of protein structure.
- Registry of Open Data on AWS (Code)
- List of datasets for machine-learning research
- Syndetic - Replaces static data dictionaries with a live data profiling system. Annotate, measure, and monitor your datasets. Share the results. (HN)
- FaceForensics++ - Learning to Detect Manipulated Facial Images.
- Scale AI - High quality training and validation data for AI applications.
- Audio Datasets for Machine Learning (HN)
- Collection of large datasets for conversational response selection
- NSFW data source URLs - Collection of NSFW image URLs for the purposes of training an NSFW Image Classifier.
- Lambdagram - Tiny Cloud Service to Build Image Datasets with Instagram.
- HN Stories and comments since 2006
- My Giant Data Quality Checklist (2020)
- LabelImg - Graphical image annotation tool.
- Common Voice - Mozilla's initiative to help teach machines how real people speak.
- Replica Dataset - Dataset of high quality reconstructions of a variety of indoor spaces.
- Using Decision Trees for charting ill-behaved datasets (2020)
- Human parsing datasets
- Data Programming: Creating Large Training Sets, Quickly (2016)
- Announcing Artifacts (2020)
- DataHub - Provide various solutions to Publish and Deploy your Data with power and simplicity.
- Core Data - Important, commonly-used data as high quality, easy-to-use & open data packages. (Code)
- Awesome collections on DataHub
- Label Studio - Multi-type data labeling and annotation tool with standardized output format. (Code) (Time Series Data Labeling)
- Heartex - Data Management Platform for Machine Learning.
- Clothing Dataset: Call for Action (2020)
- Unsplash Dataset - 2,000,000+ Unsplash images made available for research and machine learning. (Web)
- 100k+ Rows Topic Labeled News Dataset (2020)
- Fashion-MNIST - MNIST-like fashion product database.
- FiveThirtyEight Datasets
- Books in .txt format for AI training purposes (HN)
- Sweetviz - Visualize and compare datasets, target values and associations, with one line of code.
- SuperAnnotate - Fastest annotation platform for training AI.
- Activeloop Hub - Fastest way to access and manage datasets for PyTorch and TensorFlow. (Web) (Docs) (Reddit)
- Objectron Dataset - Dataset of short object-centric video clips with pose annotations.
- Google Research Datasets
- matorage - Efficient way to store/load and manage dataset, model and optimizer for deep learning.
- HN Posts datasets (HN)
- Hypersim Toolkit - Set of tools for generating photorealistic synthetic datasets from V-Ray scenes.
- mirdata - Interoperable Dataset Loaders for Music Information Retrieval (MIR).
- MetFaces Dataset - Image dataset of human faces extracted from works of art.
- Lionbridge AI - Provides human-labeled data for hundreds of use cases.
- Traditional Chinese Landscape Painting Dataset
- Awesome Satellite Imagery Datasets
- Wikimedia Downloads - Download the Entire Wikimedia Database. (HN)
- Wikipedia: Database download
- How to shuffle a big dataset (2018) (Reddit)
- ESC-50: Dataset for Environmental Sound Classification
- Booking.com WSDM challenge - Training dataset consists of over a million anonymized hotel reservations, based on real data.
- Computer Vision Datasets
- Voicebook Datasets - Comprehensive list of open-source datasets for voice and sound computing (50+ datasets).
- The Pile - 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.
- doccano - Open source text annotation tool for machine learning practitioners. (Web)
- Weather and Climate Datasets for AI Research (Code)
- NLP Datasets
- Total Text Dataset - Consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.
- Datasets collected for network science, deep learning and general machine learning research
- MER and SER Data sets - Data sets for Music Emotion Recognition and Speech Emotion Recognition.
- Common Voice Datasets - Multi-language dataset of voices that anyone can use to train speech-enabled applications. (Code)
- Label a Dataset with a Few Lines of Code (2021) (HN)
- Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples (2020) (Code)
- Datasets should behave like git repositories (2021)
- The Stanford Question Answering Dataset (Visual Explorer)
- Data.gov - Home of the U.S. Government’s open data.
- Visualizing Data Timeliness at Airbnb (2021)
- The Next Evolution of Data Catalogs: Data Discovery Platforms (2021)
- DeepLabel - Cross-platform image annotation tool for machine learning.
- WIT: Wikipedia-based Image Text Dataset
- Harry Potter Dataset
- DocRED: A Large-Scale Document-Level Relation Extraction Dataset (2019) (Code)
- Synthetic Data: Even Better than the Real Thing? (2021)
- Google C4 dataset - Colossal, cleaned version of Common Crawl's web crawl corpus.
- Finding a standard dataset format for machine learning (2020) (HN)
- Hashing techniques to compare large datasets? (2021)
- Machine Learning Datasets | Papers With Code (Twitter)
- Ocean Market - Marketplace to find, publish and trade data sets. (Code)
- Ocean Protocol - Tools for the Web3 Data Economy. (Contracts) (GitHub)
- Generating Datasets with Pretrained Language Models (2021)
- nbodykit - Analysis kit for large-scale structure datasets, the massively parallel way.
- Dataset Inference: Ownership Resolution in Machine Learning (2021) (Tweet)
- Diffgram - Data Labeling Software for Machine Learning. (Code)
- Data Profiler - Python library designed to make data analysis, monitoring and sensitive data detection easy.
- Tonic - Fake Data Company. (GitHub)
- Datasets for Google Cloud (Article)
- SQLite Data Starter Packs
- GitHub Collection: Open data - Examples of using GitHub to store, publish, and collaborate on open, machine-readable datasets.
- Scientific Data Repositories (HN)
- CatMeows: A Publicly-Available Dataset of Cat Vocalizations (2020) (HN)
- ir_datasets - Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc.
- SEDE (Stack Exchange Data Explorer) - Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data. (Article)
- List of Medical (Imaging) Datasets
- musescore.com dataset - Dataset of all music sheets and users on musescore.com.
- generatedata.com - Random data generator. (Code)
- MTData - Tool that automates collection and preparation of machine translation datasets.
- The MIT Supercloud Dataset (2021)
- Datasheets for Datasets (2018) (Markdown Datasheet for Datasets)
- Lightly - Label only the data which improves your ML model. (HN)
- Small Open Datasets - Collection of automatically-updated, ready-to-use and open-licensed datasets.
- DataQA - Labelling platform for text using distant supervision.
- COCO - Common Objects in Context - Large-scale object detection, segmentation, and captioning dataset. (API)
- img2dataset - Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
- How to fit any dataset with a single parameter (2019) (HN)
- Single-dataset Experts for Multi-dataset Question Answering (2021) (Code)
- LabelFlow - Open standard platform for image labeling. (Code)
- Face Synthetics dataset
- Toloka - Fast and efficient way to collect and label large data sources for machine learning and other business purposes. (Code) (GitHub)
- PlainTextWikipedia - Convert Wikipedia database dumps into plaintext files.
- Discovering Anomalous Data with Self-Supervised Learning (2021)
- Resources to get you the best quality of ML datasets (2021)
- Hugging Face Datasets
- SDMetrics - Metrics to evaluate quality and efficacy of synthetic datasets.
- doubtlab - General tricks that may help you find bad, or noisy, labels in your dataset.
- Gretel Synthetics - Synthetic data generators for structured and unstructured text, featuring differentially private learning.
- Great datasets to teach with (2021)
- A Cartel of Influential Datasets Are Dominating Machine Learning Research (HN)
- The Toxicity Dataset
- Data Linter - Identifies potential issues (lints) in your ML training data.
- Cloud Annotations - Fast, easy and collaborative open source image annotation tool for teams and individuals. (Web)
- pyjanitor - Clean APIs for data cleaning. Python implementation of R package Janitor.
- face2comics datasets
- arXiv public datasets
- AIST++ Dance Motion Dataset (API Code)
- TheAudioDB.com - Community Database of audio artwork and metadata with a JSON API.
- Awesome Video Datasets
- Conceptual 12M - Dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
- Colliding Circles Toy Datasets
- Sieve - Transform raw video into high quality datasets in minutes. (HN) (HN)
- IKEA 3D Assembly Dataset
- Imbalanced Dataset Sampler - PyTorch imbalanced dataset sampler for oversampling low-frequency classes and undersampling high-frequency ones.
- ADE20K Dataset - Composed of more than 27K images from the SUN and Places databases. (Code)
- Datasets of Automatic Keyphrase Extraction
- Awesome Forests - Curated list of ground-truth forest datasets for the machine learning and forestry community.
- PushShift Data Dumps
- DeepEcho - Synthetic Data Generation for mixed-type, multivariate time series.
- deduplify - Python tool to search for and remove duplicated files in messy datasets.
- CSVtoTable - Simple command-line utility to convert CSV files to a searchable and sortable HTML table.
- Kubric - Data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
- ASPset-510 - Large-scale video dataset for the training and evaluation of 3D human pose estimation models.
- Self-Distilled Internet Photos (SDIP) Dataset
- Fake News Corpus
- Sniffer - Lightweight Python application for sorting images in your dataset.
- Dataset Distillation by Matching Training Trajectories (2022) (Code)
- BeeRef - Simple Reference Image Viewer.
- BookSum: A Collection of Datasets for Long-form Narrative Summarization (2021) (Code)
- HierText Dataset - Dataset featuring hierarchical annotations of text in natural scenes and documents.
- Google Research Datasets
- MetaShift: A Dataset of Datasets for Evaluating Distribution Shifts and Training Conflicts (2022)
- CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus
- Squirrel Datasets Core
- GTA-3D Dataset - Dataset of 2D imagery, 3D point cloud data, and 3D vehicle bounding box labels all generated using the Grand Theft Auto 5 game engine.
- Relative Human (RH) - Multi-person in-the-wild RGB images with rich human annotations.
- CSV Base - Turn CSV files into read+write APIs. (Code)
- A Dataset and Explorer for 3D Signed Distance Functions (2022) (Code)
- Vega Datasets - Collection of datasets used in Vega and Vega-Lite examples.
- Azimuth - Open-source dataset and error analysis tool for text classification.
- audio2dataset - Easily turn large sets of audio urls to an audio dataset.
- Datasets for Entity Recognition - Collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
- AudioLoader - PyTorch Dataset for Speech and Music audio.
- Awesome Training Data
- MIDI Dataset - Code for creating a dataset of MIDI ground truth.
- Labelbox - Fastest way to annotate data to build and ship computer vision applications. (Code)
- Bamboo - Mega-scale and information-dense dataset for classification and detection pre-training.
- The How2 Dataset - Multimodal collection of instructional videos with English subtitles. (Code)
- Unity Dataset Insights - Python package for downloading, parsing and analyzing synthetic datasets generated using the Unity Perception package.
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (2022) (Code)
- Perceptual Image Processing ALgorithms (PIPAL) (Code)
- Hover - Label data at scale. Fun and precision included.
- How do you share big datasets with your team and others? (2022)
- Simulacra Aesthetic Captions - Dataset of over 238000 synthetic images generated with AI models such as CompVis latent GLIDE and Stable Diffusion from over forty thousand user submitted prompts.
- Audio Dataset Project - Audio Dataset for training CLAP and other models.
- Bulk - Quick developer tool to apply some bulk labels.
- stopes - Library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
- MisInfoText - Datasets for fake news and misinformation detection.
- Awesome Dataset Distillation
- Cleaning data with sqlite-utils and Datasette
- Starter code for working with the YouTube-8M dataset
- BigLAM (Libraries, Archives and Museums) - Open source, community resource of LAM datasets.
- Data Measurements Tool - Developing tools to automatically analyze datasets.
- Cleanlab Vizzy - Learn how to automatically find label errors and out-of-distribution data. (Lobsters)
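
Several entries above concern cleaning up messy datasets; deduplify, for instance, is described as a tool to search for and remove duplicated files. The following is a minimal sketch of that general idea using only the Python standard library. It is not deduplify's actual implementation or API; the function names `file_digest` and `find_duplicates` are hypothetical, and it assumes that "duplicate" means byte-identical content.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under `root` whose contents are byte-identical.

    Hypothetical helper for illustration: returns one list per group
    of two or more files that share the same content digest.
    """
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicates(Path("my_dataset"))` (the directory name is a placeholder) returns groups of identical files; the sketch only reports them, leaving the decision about which copy to delete to the caller.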