Used by 1 in every 10 developers worldwide in 2024, scikit-learn (imported as sklearn) is among the most popular open-source machine-learning libraries in Python. It is built on top of other Python libraries, chiefly NumPy and SciPy, which supply the numerical and scientific computing routines underneath sklearn, and it pairs naturally with Matplotlib for data visualization.
Reflective of its popularity and importance, sklearn offers many tools for building, evaluating, and deploying machine learning models. Data enthusiasts primarily use sklearn for:
scikit-learn datasets are utilities and preprocessed datasets offered by the sklearn.datasets module to help you build your machine-learning projects. They are categorized into toy datasets (bundled with the library and loaded with load_* functions), real-world datasets (downloaded on first use with fetch_* functions), and generated datasets (synthesized with make_* functions). They are designed to help users easily access and work with various data types without requiring extensive data preprocessing or manual downloading.
Overview: A classification dataset with four measurements (sepal and petal length and width) for each of 150 iris flowers, split evenly across three species: setosa, versicolor, and virginica. This small, balanced dataset has no missing values, making it ideal for beginners interested in machine learning. Setosa is linearly separable from the other two species, while versicolor and virginica overlap slightly, so the dataset works well for demonstrating basic classification algorithms like k-nearest neighbors (KNN), decision trees, and naive Bayes.
You may load the dataset by using:
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
print(iris.DESCR)
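As a rough illustration of the KNN use case mentioned above, the sketch below holds out part of the data and scores a k-nearest neighbors classifier on it. The 70/30 split, n_neighbors=5, and random_state=0 are arbitrary choices made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Hold out 30% of the flowers, keeping the three species balanced in both splits
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)

# Fit a 5-nearest-neighbors classifier and score it on the held-out flowers
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```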
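Overview: A regression dataset with 442 samples and 10 baseline features (age, sex, body mass index, average blood pressure, and six blood serum measurements) used to predict disease progression after one year. Note that load_diabetes returns the features already mean-centered and scaled by default, so the differences in range that make feature scaling matter only show up in the raw values (available with scaled=False in recent scikit-learn releases). You may build an ML project experimenting with regression algorithms to determine the best-performing model with this dataset.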
Target:
You may load the dataset by using:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
print(diabetes.DESCR)
Overview: The Digits dataset contains 1,797 grayscale images (8x8 pixels) of handwritten digits (0–9) for classification tasks. This dataset is excellent for exploring image recognition or dimensionality reduction. With it, you can develop a Python application that can recognize handwritten digits in real-time using a webcam.
Features:
Target:
You may load the dataset by using:
from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target
print(digits.DESCR)
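For the dimensionality-reduction use mentioned above, one possible starting point is projecting the 64 pixel features onto two principal components. This is only a sketch; PCA with two components is an illustrative choice, and other techniques would work as well.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
print(digits.images.shape)  # (1797, 8, 8): one 8x8 image per sample
print(digits.data.shape)    # (1797, 64): the same images flattened to 64 features

# Project the 64 pixel intensities down to 2 components for plotting or exploration
pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(digits.data)
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape)
print(f"Variance retained by 2 components: {explained:.2f}")
```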
Overview: It’s a multivariate regression dataset linking exercise performance (inputs) to physiological variables (outputs). It contains data on physical exercise and physiological measurements of 20 middle-aged men. An interesting project could be a Python-based system that provides personalized fitness recommendations based on a user’s physiological measurements.
Features:
Target:
You may load the dataset by using:
from sklearn.datasets import load_linnerud
linnerud = load_linnerud()
X, y = linnerud.data, linnerud.target
print(linnerud.DESCR)
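Because the target here has three columns rather than one, a quick way to see the multivariate setup is to fit a single model that predicts all three outputs at once. The sketch below uses plain linear regression purely for illustration; with only 20 samples, the fit itself should not be read as meaningful.

```python
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

linnerud = load_linnerud()
X, Y = linnerud.data, linnerud.target
print(linnerud.feature_names)  # exercise variables: Chins, Situps, Jumps
print(linnerud.target_names)   # physiological variables: Weight, Waist, Pulse

# LinearRegression handles a 2-D target directly, fitting one set of coefficients per output
model = LinearRegression().fit(X, Y)
Y_pred = model.predict(X)
print(Y_pred.shape)
```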
Overview: The Wine dataset in sklearn is a classic, consisting of 178 samples and 13 chemical properties of wine from three different cultivars. This dataset can be used for various machine-learning projects, like classifying the cultivars with the chemical properties of a wine.
Features:
Target:
You may load the dataset by using:
from sklearn.datasets import load_wine
wine = load_wine()
X, y = wine.data, wine.target
print(wine.DESCR)
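The cultivar classification project described above could start from something like the following sketch. Standardizing the features and using logistic regression with 5-fold cross-validation are assumptions made for this example; any classifier could be swapped in.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The 13 chemical properties are on different scales, so standardize before fitting
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy for predicting the cultivar
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.2f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking information from the validation fold.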
Overview: It’s a binary classification dataset for breast tumor diagnosis (malignant or benign). It provides a well-defined problem where the goal is to predict whether a tumor is malignant or benign based on a set of features like tumor size, shape, and texture. You may build a breast cancer prediction model with it.
Features - 30 variables, including:
Target:
You may load the dataset by using:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
print(cancer.DESCR)
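When building a prediction model on this dataset, it is worth confirming how the labels are encoded before interpreting any metric: in scikit-learn's copy, 0 means malignant and 1 means benign. The sketch below prints the class counts and then reports recall for the malignant class; the random forest and the 75/25 split are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

# Count samples per class; index 0 is malignant, index 1 is benign
counts = np.bincount(cancer.target)
class_counts = {str(name): int(n) for name, n in zip(cancer.target_names, counts)}
print(class_counts)

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.25, stratify=cancer.target, random_state=0
)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
y_pred = forest.predict(X_test)

accuracy = forest.score(X_test, y_test)
# Recall on the malignant class (label 0): the share of malignant tumors that were caught
malignant_recall = recall_score(y_test, y_pred, pos_label=0)
print(f"Accuracy: {accuracy:.2f}, malignant recall: {malignant_recall:.2f}")
```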
Overview: The Olivetti Faces dataset is a popular choice for various machine learning tasks, especially those related to image processing and computer vision. It’s a dataset of grayscale images of 40 individuals (10 images/person) used for image recognition tasks. Support vector machine (SVM) projects, like a face recognition program, can be built with this.
Features:
Target:
You may load the dataset by using:
from sklearn.datasets import fetch_olivetti_faces
olivetti = fetch_olivetti_faces()
X, y = olivetti.data, olivetti.target
print(olivetti.DESCR)
Overview: 20 Newsgroups is a text classification dataset of roughly 18,000 newsgroup posts spread across 20 topics, useful for NLP tasks. You can build a news categorization model that automatically sorts incoming articles by topic, making it easier for users to find relevant content.
Features:
Target:
You may load the dataset by using:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
print(news.DESCR)
Overview: It’s a collection of grayscale face images for face recognition models. Labeled Faces in the Wild data are captured from the web, offering a challenging real-world scenario for facial recognition algorithms. Try building an ML project to identify the person in the query image from a database of known faces.
Features:
Target:
You may load the dataset by using:
from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X, y = lfw_people.data, lfw_people.target
print(lfw_people.DESCR)
Overview: The Covtype dataset comprises 581,012 samples, each representing a 30x30 meter patch of forest cover in the US. You may use it to predict the forest cover type based on cartographic features.
The dataset is characterized by 54 features, which can be broadly categorized into:
Features:
Target:
You may load the dataset by using:
from sklearn.datasets import fetch_covtype
covtype = fetch_covtype()
X, y = covtype.data, covtype.target
print(covtype.DESCR)
Overview: The RCV1 (Reuters Corpus Volume 1) dataset is a large text classification dataset used for multi-class, multi-label classification tasks, representing news articles labeled across multiple categories. Its massive collection comprises over 800,000 news articles, making it a valuable resource for various text mining and natural language processing tasks.
Features:
Target:
You may load the dataset by using:
from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()
X, y = rcv1.data, rcv1.target
print(rcv1.DESCR)
Overview: The KDD Cup 1999 dataset is a classic in the field of network security, used to train and evaluate intrusion detection systems.
It contains many network connection records, each characterized by various features such as duration, protocol type, service, flag, and numerous numeric attributes. The target variable indicates whether a connection is normal or an attack, with various attack types like denial-of-service, user-to-root, and probing. As a project, use advanced machine learning to classify network connections as normal or malicious accurately.
Features:
Target:
You may load the dataset by using:
from sklearn.datasets import fetch_kddcup99
kddcup = fetch_kddcup99(subset='SA')
print(kddcup.DESCR)
Overview: The California Housing dataset is a classic regression dataset widely used in machine learning. It provides a wealth of information about housing districts in California, making it an excellent resource for exploring various machine-learning techniques. You may train various regression models to predict median house values based on the given features.
Features:
Target:
You may load the dataset by using:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X, y = housing.data, housing.target
print(housing.DESCR)
Overview: The species distribution dataset in sklearn contains information about the geographic distribution of two species: Bradypus variegatus, the brown-throated sloth, and Microryzomys minutus, the forest small rice rat. The dataset includes environmental variables, such as temperature, precipitation, and vegetation cover, which influence the distribution of these species. If you’re interested, you may build a model to predict areas of suitable habitat for the given species based on environmental variables.
Features:
Target:
You may load the dataset by using:
from sklearn.datasets import fetch_species_distributions
species = fetch_species_distributions()
print(species.keys())  # the returned Bunch has no DESCR attribute, so list its fields instead
Overview: scikit-learn can retrieve datasets from OpenML’s repository directly into Python. The fetch_openml function from sklearn.datasets downloads a dataset hosted on OpenML, selected by name or numeric ID, into your Python environment.
Features:
Target:
You may choose the datasets by using:
from sklearn.datasets import fetch_openml
dataset = fetch_openml(data_id=1464) # Example dataset ID
print(dataset.DESCR)
Overview: Generates synthetic datasets for classification tasks with customizable class separations.
Features:
Target:
You may generate the dataset by using:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2)
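The customizable class separation mentioned above is controlled by the class_sep parameter. The sketch below generates the same problem at two separation levels and compares how well a simple linear classifier does on each; n_informative=4, n_clusters_per_class=1, and the two class_sep values are assumptions picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scores = {}
for class_sep in (0.5, 2.0):
    # Larger class_sep pushes the class clusters further apart, making the task easier
    X, y = make_classification(
        n_samples=1000,
        n_features=10,
        n_informative=4,
        n_redundant=2,
        n_classes=2,
        n_clusters_per_class=1,
        class_sep=class_sep,
        random_state=0,  # fixed seed so the generated data is reproducible
    )
    clf = LogisticRegression(max_iter=1000)
    scores[class_sep] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"class_sep={class_sep}: mean accuracy {scores[class_sep]:.2f}")
```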
Overview: Generates isotropic Gaussian blobs for clustering tasks, often used in unsupervised learning algorithms like k-means.
Features:
Target:
You may generate the dataset by using:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0)
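To connect the blobs to k-means, the following sketch clusters generated points and compares the result with the labels that make_blobs returns. The explicit, widely spaced centers are an assumption made here so the clusters are easy to recover; with centers=3, scikit-learn picks random centers instead.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Hypothetical, well-separated centers chosen for this illustration
centers = [[-5, -5], [0, 5], [5, -5]]
X, y_true = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# k-means never sees y_true; it only groups points by distance to the cluster centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Adjusted Rand index: 1.0 means the clusters match the true groups up to relabeling
ari = adjusted_rand_score(y_true, labels)
print(f"Adjusted Rand index: {ari:.2f}")
```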
Overview: Generates two types of synthetic datasets (moons and circles) for classification tasks. These datasets are useful for testing algorithms that deal with non-linear decision boundaries.
Features:
Target:
You may generate the datasets by using:
from sklearn.datasets import make_moons, make_circles
X_moons, y_moons = make_moons(n_samples=300, noise=0.2)
X_circles, y_circles = make_circles(n_samples=300, noise=0.1)
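To see why these shapes are useful for testing non-linear decision boundaries, the sketch below compares a linear model with an RBF-kernel SVM on the circles data. Setting factor=0.5 (the inner circle at half the outer radius) is an assumption made here to keep the two rings clearly apart.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the inner ring from the outer one
X, y = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=0)

# A linear decision boundary versus a non-linear (RBF kernel) one
linear_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()

print(f"Logistic regression accuracy: {linear_acc:.2f}")
print(f"RBF SVM accuracy: {rbf_acc:.2f}")
```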
Overview: A synthetic dataset for multilabel classification, where each instance can belong to multiple classes.
Features:
Target:
You may generate the dataset by using:
from sklearn.datasets import make_multilabel_classification
X, y = make_multilabel_classification(n_samples=1000, n_features=20, n_classes=5)
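The multilabel target comes back as a binary indicator matrix with one column per class, which is easy to miss when unpacking it into a variable named y. The sketch below inspects that matrix and fits one binary classifier per label; the one-vs-rest wrapper around logistic regression is just one possible approach.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(
    n_samples=1000, n_features=20, n_classes=5, random_state=0
)

# Y is a 0/1 indicator matrix: row i has a 1 in every column whose label applies to sample i
print(Y.shape)
print(Y[:3])
print(f"Average labels per sample: {Y.sum(axis=1).mean():.2f}")

# One-vs-rest trains an independent binary classifier for each of the 5 labels
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
Y_pred = clf.predict(X)
print(Y_pred.shape)
```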
Overview: A dataset that simulates sparse signals using a dictionary to model sparse coding problems.
Features:
Target:
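You may generate the dataset by using:
from sklearn.datasets import make_sparse_coded_signal
# n_features is a required argument; 20 is an arbitrary illustrative value
# Three arrays are returned: the signals, the dictionary, and the sparse codes
data, dictionary, code = make_sparse_coded_signal(n_samples=100, n_components=50, n_features=20, n_nonzero_coefs=10)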
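make_sparse_coded_signal returns three arrays (the signals, the dictionary, and the sparse codes) and requires n_features to be given. The sketch below checks the sparsity of the generated codes; n_features=20 is an arbitrary value chosen for illustration, and the orientation of the returned arrays differs between older and newer scikit-learn releases, so only orientation-independent properties are inspected.

```python
import numpy as np
from sklearn.datasets import make_sparse_coded_signal

n_samples, n_nonzero = 100, 10

# Each signal is built from exactly n_nonzero dictionary atoms out of 50
data, dictionary, code = make_sparse_coded_signal(
    n_samples=n_samples,
    n_components=50,
    n_features=20,  # illustrative value; required by the function
    n_nonzero_coefs=n_nonzero,
    random_state=0,
)
print(data.shape, dictionary.shape, code.shape)

# Every sample has n_nonzero active coefficients, so the code matrix is mostly zeros
nonzero_total = np.count_nonzero(code)
print(f"Non-zero coefficients: {nonzero_total} of {code.size}")
```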
scikit-learn offers a great collection of datasets to jumpstart your machine learning projects. Whether you’re working on simple tasks like classifying flowers with the Iris dataset or tackling more complex challenges like predicting house prices with California Housing, there’s something for everyone. These datasets are easy to access and help you dive into experimenting with models without spending time on data cleaning. All the best!