Top Sklearn Datasets for Machine Learning Projects

Overview

Used by 1 in every 10 developers worldwide in 2024, scikit-learn (sklearn) is among the most popular open-source machine-learning libraries in Python. It is built on other scientific Python libraries, chiefly NumPy and SciPy for numerical and scientific computation, and matplotlib for plotting.

Reflective of its popularity and importance, sklearn offers many tools for building, evaluating, and deploying machine learning models. Data enthusiasts primarily use sklearn for:

  1. Preprocessing data - using utilities such as normalization, standardization, encoding categorical variables, and feature selection.
  2. Supervised learning - through support vector machines (SVMs), linear regression, ridge regression, lasso, gradient boosting, random forests, decision trees, and logistic regression.
  3. Unsupervised learning - using clustering and dimensionality reduction techniques.
  4. Model selection - through tools for cross-validation and hyperparameter tuning.
  5. Ease of use - its well-documented API and easy integration with other libraries make it beginner-friendly yet powerful enough for advanced use cases.
  6. Datasets - through the sklearn.datasets module, which includes multiple datasets you can use in any personal machine learning project.
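
As a rough illustration of how these pieces fit together, the sketch below chains preprocessing, a supervised model, and hyperparameter search on a bundled dataset. The estimator (an SVM) and the parameter grid are arbitrary choices made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A bundled dataset, returned directly as (features, labels).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling and the classifier share one pipeline, so the scaler is
# fitted only on the training folds inside cross-validation.
pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; the values are arbitrary starting points.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
test_accuracy = search.score(X_test, y_test)
print("Held-out accuracy:", round(test_accuracy, 3))
```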

What Are Sklearn Datasets?

scikit-learn datasets are the preprocessed datasets and data utilities offered by the sklearn.datasets module to help you build your machine-learning projects. They fall into three groups: toy datasets bundled with the library (load_* functions), real-world datasets downloaded on first use (fetch_* functions), and generated datasets (make_* functions). They are designed to help users access and work with various data types without extensive preprocessing or manual downloading.
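
A minimal sketch of the common interface, using one loader and one generator so that it runs offline. The choice of load_wine and make_regression here is only for illustration.

```python
from sklearn.datasets import load_wine, make_regression

# Loaders (load_*) read small datasets shipped with scikit-learn and
# return a Bunch: a dict-like object that also supports attribute access.
wine = load_wine()
print(wine.data.shape, wine.target.shape)  # feature matrix and labels
print(wine.feature_names[:3])              # column names
print(list(wine.target_names))             # class names

# return_X_y=True skips the Bunch and returns the two arrays directly.
X, y = load_wine(return_X_y=True)

# Generators (make_*) create synthetic data on demand;
# random_state makes the result reproducible.
X_syn, y_syn = make_regression(n_samples=50, n_features=4, random_state=0)
print(X_syn.shape, y_syn.shape)

# Fetchers (fetch_*) follow the same pattern but download the data on
# first use and cache it locally, so they need a network connection.
```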

Iris

Overview: A classification dataset with measurements of 150 iris flowers, split evenly across three species: Setosa, Versicolor, and Virginica. This small, balanced dataset has no missing values, making it ideal for beginners interested in machine learning. Setosa is linearly separable from the other two species, while Versicolor and Virginica overlap slightly, so the dataset works well for demonstrating basic classification algorithms like k-nearest neighbors (KNN), decision trees, and naive Bayes.

  • Features:
    • sepal length (cm): The length of the sepal (the leaf-like part that encloses the bud) in centimeters.
    • sepal width (cm): The width of the sepal in centimeters.
    • petal length (cm): The length of the petal in centimeters.
    • petal width (cm): The width of the petal in centimeters.
  • Target:
    • 0: Setosa
    • 1: Versicolor
    • 2: Virginica

You may load the dataset by using:

from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
print(iris.DESCR)
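
To make the classification use case concrete, here is a small KNN sketch on Iris. The split ratio, the random seed, and k=5 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Hold out 30% of the flowers, keeping the three species balanced.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print("Test accuracy:", round(accuracy, 3))

# Classify one new flower: sepal length, sepal width, petal length, petal width.
predicted = iris.target_names[knn.predict([[5.1, 3.5, 1.4, 0.2]])]
print("Predicted species:", predicted[0])
```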

Diabetes

Overview: A regression dataset with 442 samples and 10 baseline features used to predict disease progression one year after baseline. By default, load_diabetes returns the features already mean-centered and scaled so that each column's sum of squares is 1; pass scaled=False to get the raw values, which differ significantly in range and show why feature scaling matters. You may build an ML project experimenting with regression algorithms to determine the best-performing model with this dataset.

  • Features:
    • Age: The patient’s age, scaled to a unitless form.
    • Sex: The gender of the patient, encoded as a feature.
    • BMI: Body mass index (BMI), a measure of body fat.
    • BP: Blood pressure averaged across various measurements.
    • S1–S6: Six blood serum measurements (e.g., cholesterol, glucose levels).

Target:

  • A continuous variable representing the progression of diabetes after one year.

You may load the dataset by using:

from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
print(diabetes.DESCR)
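
One way to compare regression algorithms on this dataset is cross-validation. The sketch below uses three linear models; the alpha values are arbitrary assumptions rather than tuned settings.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Candidate models; the regularization strengths are illustrative only.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.1),
    "lasso": Lasso(alpha=0.1),
}

# Mean R^2 over 5 folds for each model (higher is better).
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")

best = max(scores, key=scores.get)
print("Best model in this comparison:", best)
```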

Digits

Overview: The Digits dataset contains 1,797 grayscale images (8x8 pixels) of handwritten digits (0–9) for classification tasks. This dataset is excellent for exploring image recognition or dimensionality reduction. With it, you can develop a Python application that can recognize handwritten digits in real-time using a webcam.

Features:

  • A total of 64 features representing pixel intensity values of an 8x8 grayscale handwritten digit image.

Target:

  • Digit: The digit represented in the image, ranging from 0 to 9.

You may load the dataset by using:

from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target
print(digits.DESCR)
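
As a sketch of the dimensionality-reduction idea, the example below compresses the 64 pixel features with PCA before classifying. Keeping 16 components and using an SVM are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

digits = load_digits()
print(digits.images.shape)  # (1797, 8, 8): the same pixels as 2D images
print(digits.data.shape)    # (1797, 64): each image flattened to one row

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, stratify=digits.target, random_state=0
)

# Reduce 64 pixel features to 16 principal components, then classify.
model = make_pipeline(PCA(n_components=16, random_state=0), SVC())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("Test accuracy with 16 components:", round(accuracy, 3))
```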

Linnerud

Overview: It’s a multivariate regression dataset linking exercise performance (inputs) to physiological variables (outputs). It contains data on physical exercise and physiological measurements of 20 middle-aged men. An interesting project could be a Python-based system that provides personalized fitness recommendations based on a user’s physiological measurements.

Features:

  1. Chin-ups: The number of chin-ups performed in one attempt.
  2. Sit-ups: The number of sit-ups performed in one attempt.
  3. Jumps: The number of jumps completed.

Target:

  1. Weight: The individual’s body weight (the recorded values are in pounds).
  2. Waist: Waist circumference (the recorded values are in inches).
  3. Pulse: Pulse rate.

You may load the dataset by using:

from sklearn.datasets import load_linnerud
linnerud = load_linnerud()
X, y = linnerud.data, linnerud.target
print(linnerud.DESCR)
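
Because the target has three columns, this is a multi-output regression problem. A minimal sketch with plain linear regression follows; fitting and predicting on the same 20 rows is only meant to show the shapes involved, not to measure accuracy.

```python
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

linnerud = load_linnerud()
X, Y = linnerud.data, linnerud.target
print(linnerud.feature_names)  # the three exercise columns
print(linnerud.target_names)   # the three physiological columns

# LinearRegression handles a 2D target natively: one linear model per output.
model = LinearRegression().fit(X, Y)

# Predict all three physiological values for the first participant.
prediction = model.predict(X[:1])
print(dict(zip(linnerud.target_names, prediction[0].round(1))))
```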

Wine

Overview: The Wine dataset in sklearn is a classic, consisting of 178 samples and 13 chemical properties of wine from three different cultivars. This dataset can be used for various machine-learning projects, like classifying the cultivars with the chemical properties of a wine.

Features:

  1. Alcohol: Alcohol content in the wine.
  2. Malic Acid: Malic acid content in grams.
  3. Ash: Ash content in grams.
  4. Alkalinity of Ash: Measure of alkalinity in ash.
  5. Magnesium: Magnesium content in milligrams.
  6. Total Phenols: Total phenolic content.
  7. Flavanoids: Flavanoid content.
  8. Nonflavanoid Phenols: Non-flavanoid phenolic content.
  9. Proanthocyanins: Proanthocyanin content.
  10. Color Intensity: Intensity of the wine’s color.
  11. Hue: A measure of the wine’s hue.
  12. OD280/OD315: Ratio of OD280 and OD315 values.
  13. Proline: Proline content in milligrams.

Target:

  • Wine Class: The cultivar the wine comes from (three possible values: 0, 1, 2).

You may load the dataset by using:

from sklearn.datasets import load_wine
wine = load_wine()
X, y = wine.data, wine.target
print(wine.DESCR)
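
The chemical properties are measured on very different scales, so distance-based models tend to benefit from standardization here. The sketch below compares KNN with and without scaling; the classifier and fold count are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Column ranges differ by orders of magnitude.
print("Smallest column maximum:", X.max(axis=0).min())
print("Largest column maximum:", X.max(axis=0).max())

# Same classifier, with and without standardizing the features.
unscaled = KNeighborsClassifier()
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())

unscaled_acc = cross_val_score(unscaled, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled, X, y, cv=5).mean()
print(f"KNN without scaling: {unscaled_acc:.3f}")
print(f"KNN with scaling:    {scaled_acc:.3f}")
```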

Breast Cancer

Overview: It’s a binary classification dataset of 569 samples for breast tumor diagnosis (malignant or benign). It provides a well-defined problem where the goal is to predict whether a tumor is malignant or benign from features computed on digitized images of cell nuclei taken from a fine needle aspirate of the breast mass, such as radius, texture, and smoothness. You may build a breast cancer prediction model with it.

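Features - 30 variables describing the cell nuclei, including:

  1. Mean Radius: Mean distance from the center of a nucleus to points on its perimeter.
  2. Mean Texture: Standard deviation of the gray-scale values.
  3. Mean Perimeter: Average perimeter of the nuclei.
  4. Mean Area: Average area of the nuclei.
  5. Mean Smoothness: Local variation in radius lengths.

The remaining features cover compactness, concavity, concave points, symmetry, and fractal dimension. Each of the ten measurements is reported as a mean, a standard error, and a worst (largest) value.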

Target:

  • Diagnosis:
    • 0: Malignant
    • 1: Benign

You may load the dataset by using:

from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
print(cancer.DESCR)
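
Note that the label encoding puts malignant at 0, so metrics that assume class 1 is the positive class need an explicit pos_label. The sketch below, with logistic regression as an assumed example model, reports recall for the malignant class.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

# Class balance, keyed by class name (index 0 = malignant, 1 = benign).
counts = dict(zip(cancer.target_names, np.bincount(cancer.target)))
print(counts)

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.25, stratify=cancer.target, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# pos_label=0 measures how many malignant tumors were caught.
malignant_recall = recall_score(y_test, model.predict(X_test), pos_label=0)
print("Recall on malignant cases:", round(malignant_recall, 3))
```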

Olivetti Faces

Overview: The Olivetti Faces dataset is a popular choice for various machine learning tasks, especially those related to image processing and computer vision. It’s a dataset of grayscale images of 40 individuals (10 images/person) used for image recognition tasks. Support vector machine (SVM) projects, like a face recognition program, can be built with this.

Features:

  • 4,096 pixel intensity values for grayscale images (64x64 resolution).

Target:

  • Person ID: Integer labels (0–39) representing 40 individuals.

You may load the dataset by using:

from sklearn.datasets import fetch_olivetti_faces
olivetti = fetch_olivetti_faces()
X, y = olivetti.data, olivetti.target
print(olivetti.DESCR)

20 Newsgroups

Overview: 20 Newsgroups is a text classification dataset of roughly 18,000 newsgroup posts on 20 different topics, useful for NLP tasks. You can build a news categorization model that automatically sorts incoming articles by topic, making it easier for users to find relevant content.

Features:

  • Raw text of the newsgroup posts.

Target:

  • Integer label for one of 20 categories (e.g., sci.space, comp.graphics).

You may load the dataset by using:

from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
X, y = news.data, news.target  # X is a list of raw text strings
print(news.DESCR)

LFW People

Overview: It’s a collection of grayscale face images for face recognition models. Labeled Faces in the Wild data are captured from the web, offering a challenging real-world scenario for facial recognition algorithms. Try building an ML project to identify the person in the query image from a database of known faces.

Features:

  • Pixel intensity values of grayscale images of faces.

Target:

  • Person ID: Labels for individuals in the dataset.

You may load the dataset by using:

from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X, y = lfw_people.data, lfw_people.target
print(lfw_people.DESCR)

Covtype

Overview: The Covtype dataset comprises 581,012 samples, each representing a 30x30 meter patch of forest cover in the US. You may use it to predict the forest cover type based on cartographic features.

The dataset is characterized by 54 features, which can be broadly categorized into:

Features:

  1. Elevation: Elevation of the location in meters.
  2. Aspect: Aspect (compass direction) of the terrain (0–360).
  3. Slope: Slope of the terrain in degrees.
  4. Horizontal Distance to Hydrology: Distance to nearest surface water feature.
  5. Vertical Distance to Hydrology: Elevation difference to the nearest surface water feature.
  6. Hillshade at 9 am, Noon, and 3 pm: Three hillshade indices (0–255, a measure of shadow) at different times of day.
  7. Horizontal Distance to Roadways: Distance to the nearest road.
  8. Horizontal Distance to Fire Points: Distance to the nearest wildfire ignition point.
  9. Wilderness Area: 4 binary indicator columns.
  10. Soil Type: 40 binary indicator columns.

Target:

  • Cover Type: Classifies forest cover into one of 7 classes (1 to 7).

You may load the dataset by using:

from sklearn.datasets import fetch_covtype
covtype = fetch_covtype()
X, y = covtype.data, covtype.target
print(covtype.DESCR)

RCV1

Overview: The RCV1 (Reuters Corpus Volume 1) dataset is a large text classification dataset used for multi-class, multi-label classification tasks, representing news articles labeled across multiple categories. Its massive collection comprises over 800,000 news articles, making it a valuable resource for various text mining and natural language processing tasks.

Features:

  • A sparse matrix of TF-IDF features (47,236 per document) derived from the article text. The raw text itself is not included.

Target:

  • Categories: Multiple categories, with each article labeled with multiple categories. The categories are part of a hierarchical structure.

You may load the dataset by using:

from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()
X, y = rcv1.data, rcv1.target
print(rcv1.DESCR)

Kddcup99

Overview: The KDD Cup 1999 dataset is a classic in the field of network security, used to train and evaluate intrusion detection systems.

It contains many network connection records, each characterized by various features such as duration, protocol type, service, flag, and numerous numeric attributes. The target variable indicates whether a connection is normal or an attack, with various attack types like denial-of-service, user-to-root, and probing. As a project, use advanced machine learning to classify network connections as normal or malicious accurately.

Features:

  • Basic Features: Attributes of each individual connection, such as duration, protocol type, service, and flag.
  • Content Features: Attributes derived from the data payload, such as the number of failed login attempts.
  • Time-based Traffic Features: Attributes computed over a short time window, such as the number of connections to the same host in the past two seconds.

Target:

  • Attack Class: The type of attack (e.g., denial of service, probe, remote to local), or “normal” for non-malicious traffic.

You may load the dataset by using:

from sklearn.datasets import fetch_kddcup99
kddcup = fetch_kddcup99(subset='SA')
print(kddcup.DESCR)

California Housing

Overview: The California Housing dataset is a classic regression dataset widely used in machine learning. Each of its 20,640 rows describes one census block group in California, making it an excellent resource for exploring various machine-learning techniques. You may train various regression models to predict median house values based on the given features.

Features:

  1. MedInc: Median income in the block group.
  2. HouseAge: Median house age in the block group.
  3. AveRooms: Average number of rooms per household.
  4. AveBedrms: Average number of bedrooms per household.
  5. Population: Total population of the block group.
  6. AveOccup: Average number of people per household.
  7. Latitude: Latitude of the block group.
  8. Longitude: Longitude of the block group.

Target:

  • Median House Value: Median house value in the block group (in $100,000s).

You may load the dataset by using:

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X, y = housing.data, housing.target
print(housing.DESCR)

Species Distribution

Overview: The species distribution dataset in sklearn contains information about the geographic distribution of two species, namely Bradypus variegatus, the brown-throated sloth, and Microryzomys minutus, the forest small rice rat. The dataset also includes a grid of environmental variables, such as temperature, precipitation, and vegetation cover, which influence the distribution of these species. If you’re interested, you may build a model to predict areas of suitable habitat for the given species based on environmental variables.

Features:

  • Environmental variables: Latitude, longitude, and other environmental characteristics that can help predict species distribution.

Target:

  • Species Occurrence: The train and test records list the locations where each species was observed (presence-only data, with no explicit absence records).

You may load the dataset by using:

from sklearn.datasets import fetch_species_distributions
species = fetch_species_distributions()
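# The returned object has no DESCR attribute, so inspect its contents instead.
print(species.coverages.shape)  # grid of environmental variables
print(species.train[:3])        # occurrence records: species name and coordinates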

OpenML

Overview: The OpenML datasets available through scikit-learn allow you to retrieve datasets from OpenML’s repository directly into Python. The fetch_openml function from sklearn.datasets loads datasets hosted on OpenML into your Python environment.

Features:

  • Features vary widely depending on the dataset chosen from OpenML.

Target:

  • The target variable depends on the dataset you fetch (classification, regression, or other machine learning tasks).

You may choose the datasets by using:

from sklearn.datasets import fetch_openml
dataset = fetch_openml(data_id=1464)  # Example dataset ID
print(dataset.DESCR)

Classification

Overview: Generates synthetic datasets for classification tasks with customizable class separations.

Features:

  • Features are generated using a specific number of informative and redundant features.

Target:

  • Classes: Binary or multi-class targets based on how the dataset is configured.

You may generate the dataset by using:

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2)
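
The generator exposes parameters for the number of informative and redundant features and for class separation. The values in this sketch are arbitrary assumptions; random_state is set so that repeated calls return identical data.

```python
import numpy as np
from sklearn.datasets import make_classification

params = dict(
    n_samples=1000,
    n_features=10,
    n_informative=4,         # features that actually carry class signal
    n_redundant=2,           # linear combinations of the informative ones
    n_classes=3,
    n_clusters_per_class=1,
    class_sep=1.5,           # larger values push the classes further apart
    random_state=0,          # fixed seed for reproducibility
)

X, y = make_classification(**params)
X_again, y_again = make_classification(**params)

print(X.shape, y.shape)
print("Samples per class:", np.bincount(y))
print("Reproducible:", np.array_equal(X, X_again))
```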

Blobs

Overview: Generates isotropic Gaussian blobs for clustering tasks, often used in unsupervised learning algorithms like k-means.

Features:

  • Points in 2D or higher-dimensional space, generated around multiple centers.

Target:

  • Cluster Assignment: The cluster to which each data point belongs.

You may generate the dataset by using:

from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0)
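
Since the generator also returns the true cluster of each point, you can check how well a clustering algorithm recovers it. In this sketch the three centers are fixed by hand (an assumption, chosen so that the blobs are well separated) and k-means is scored with the adjusted Rand index.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Explicit, well-separated centers instead of randomly placed ones.
centers = [[-5, -5], [0, 5], [5, -5]]
X, y_true = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y_pred = kmeans.fit_predict(X)

# The adjusted Rand index ignores how clusters are numbered;
# 1.0 means the recovered grouping matches the true one exactly.
ari = adjusted_rand_score(y_true, y_pred)
print("Adjusted Rand index:", round(ari, 3))
```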

Moons and Circles

Overview: Generates two types of synthetic datasets (moons and circles) for classification tasks. These datasets are useful for testing algorithms that deal with non-linear decision boundaries.

Features:

  • Points in 2D space that follow the shape of either two interlocking half-moons or concentric circles.

Target:

  • Class Label: Each point is assigned to one of two classes.

You may generate the dataset by using:

from sklearn.datasets import make_moons, make_circles
X_moons, y_moons = make_moons(n_samples=300, noise=0.2)
X_circles, y_circles = make_circles(n_samples=300, noise=0.1)
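
To see why these shapes are useful for testing non-linear methods, the sketch below compares a linear classifier with an RBF-kernel SVM on both datasets. The model choices, the factor=0.5 setting for the circles, and the seeds are assumptions made for illustration.

```python
from sklearn.datasets import make_circles, make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

datasets = {
    "moons": make_moons(n_samples=300, noise=0.2, random_state=0),
    # factor is the ratio of the inner circle's radius to the outer one's.
    "circles": make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=0),
}

results = {}
for name, (X, y) in datasets.items():
    # A straight-line boundary versus a flexible, kernel-based one.
    linear_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
    results[name] = (linear_acc, rbf_acc)
    print(f"{name}: linear = {linear_acc:.3f}, RBF SVM = {rbf_acc:.3f}")
```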

Multilabel

Overview: A synthetic dataset for multilabel classification, where each instance can belong to multiple classes.

Features:

  • Randomly generated features, with labels for each sample indicating which classes are relevant.

Target:

  • Multiple Labels: Each sample is associated with a set of binary labels indicating the classes it belongs to.

You may generate the dataset by using:

from sklearn.datasets import make_multilabel_classification
X, y = make_multilabel_classification(n_samples=1000, n_features=20, n_classes=5)
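
The target here is a binary indicator matrix with one column per class, not a single label vector. A minimal sketch of training on it follows, using a one-vs-rest wrapper around logistic regression as an assumed example model.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(
    n_samples=1000, n_features=20, n_classes=5, n_labels=2, random_state=0
)
print(Y.shape)  # one row per sample, one 0/1 column per class
print(Y[:3])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# One binary classifier per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

# Hamming loss: the fraction of individual label slots predicted wrongly.
loss = hamming_loss(Y_test, Y_pred)
print("Hamming loss:", round(loss, 3))
```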

Sparse Coded Signal

Overview: A dataset that simulates sparse signals using a dictionary to model sparse coding problems.

Features:

  • Signal: The signal is represented as a sparse combination of dictionary atoms.

Target:

  • Sparse Code: The sparse code that explains the signal.

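You may generate the dataset by using:

from sklearn.datasets import make_sparse_coded_signal
# n_features is required, and the function returns three arrays:
# the signal, the dictionary, and the sparse code.
signal, dictionary, code = make_sparse_coded_signal(n_samples=100, n_components=50, n_features=20, n_nonzero_coefs=10)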
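
As a sanity check on what the generator returns, the sketch below rebuilds the signal from the dictionary and the sparse code. The array orientation has changed between scikit-learn releases, so the code picks the multiplication order from the shapes rather than assuming one layout; n_features=20 and the seed are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_sparse_coded_signal

signal, dictionary, code = make_sparse_coded_signal(
    n_samples=100,
    n_components=50,     # number of dictionary atoms
    n_features=20,       # length of each signal
    n_nonzero_coefs=10,  # active atoms per sample
    random_state=0,
)
print(signal.shape, dictionary.shape, code.shape)

# Depending on the installed version, samples are stored as rows or as
# columns; choose the product whose inner dimensions line up.
if code.shape[1] == dictionary.shape[0]:
    reconstructed = code @ dictionary
else:
    reconstructed = dictionary @ code

print("Signal equals code x dictionary:", np.allclose(reconstructed, signal))
print("Non-zero coefficients in total:", np.count_nonzero(code))
```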

The Bottom Line

scikit-learn offers a great collection of datasets to jumpstart your machine learning projects. Whether you’re working on simple tasks like classifying flowers with the Iris dataset or tackling more complex challenges like predicting house prices with California Housing, there’s something for everyone. These datasets are easy to access and help you dive into experimenting with models without spending time on data cleaning. All the best!