Cluster

This page provides pylibraft class references for the publicly-exposed elements of the pylibraft.cluster package.

KMeans

KMeansParams specifies the hyperparameters for the k-means algorithm.

Parameters:

n_clusters: int, optional: The number of clusters to form, which is also the number of centroids to generate.
max_iter: int, optional: Maximum number of iterations of the k-means algorithm for a single run.
tol: float, optional: Relative tolerance with regard to inertia to declare convergence.
verbosity: int, optional
seed: int, optional: Seed for the random number generator.
metric: str, optional: Name of the metric to use for distance computation; see pylibraft.distance.pairwise_distance() for valid values.
init: InitMethod, optional
n_init: int, optional: Number of times the k-means algorithm will be run with different seeds.
oversampling_factor: float, optional: Oversampling factor for use in the k-means algorithm.
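To make the roles of max_iter and tol concrete, the following is a minimal CPU sketch of Lloyd's algorithm in plain Python. It is not the pylibraft implementation: the function name `lloyd_kmeans`, the explicit `init_centroids` argument, the default values, and the exact form of the stopping rule (relative change in inertia at most tol) are assumptions made for illustration only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, init_centroids, max_iter=300, tol=1e-4):
    """Illustrative CPU k-means loop (hypothetical helper, not pylibraft).

    The defaults for max_iter and tol are placeholders for this sketch.
    """
    centroids = [list(c) for c in init_centroids]
    n_clusters = len(centroids)
    prev_inertia = None
    for n_iter in range(1, max_iter + 1):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(n_clusters),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Inertia of this assignment against the current centroids.
        inertia = sum(
            squared_distance(p, centroids[lab])
            for p, lab in zip(points, labels)
        )
        # Assumed stopping rule: relative change in inertia at most tol.
        if (prev_inertia is not None
                and abs(prev_inertia - inertia) <= tol * prev_inertia):
            break
        prev_inertia = inertia
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid here
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, inertia, n_iter


# Two well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, inertia, n_iter = lloyd_kmeans(
    points, [(0.0, 0.0), (11.0, 11.0)], max_iter=50, tol=1e-4
)
print(centroids, inertia, n_iter)
# prints: [[0.5, 0.5], [10.5, 10.5]] 4.0 3
```

In this sketch, lowering max_iter caps the loop regardless of convergence, while raising tol lets the loop stop earlier on a coarser solution.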

Attributes:

batch_centroids
batch_samples
inertia_check
init
max_iter
n_clusters
oversampling_factor
seed
tol
verbosity

pylibraft.cluster.kmeans.fit(KMeansParams params, X, centroids=None, sample_weights=None, handle=None)

Find clusters with the k-means algorithm

Parameters:

params: KMeansParams: Parameters to use to fit the KMeans model
X: Input CUDA array interface compliant matrix, shape (m, k)
centroids: Optional writable CUDA array interface compliant matrix, shape (n_clusters, k)
sample_weights: Optional input CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None
handle: Optional RAFT resource handle for reusing CUDA resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you must synchronize explicitly by calling handle.sync() before accessing the output.

Returns:

centroids: raft.device_ndarray: The computed centroids for each cluster
inertia: float: Sum of squared distances of samples to their closest cluster center
n_iter: int: The number of iterations used to fit the model

Examples

>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import fit, KMeansParams
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)

>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)

pylibraft.cluster.kmeans.cluster_cost(X, centroids, handle=None)

Compute cluster cost given an input matrix and existing centroids

Parameters:

X: Input CUDA array interface compliant matrix, shape (m, k)
centroids: Input CUDA array interface compliant matrix, shape (n_clusters, k)
handle: Optional RAFT resource handle for reusing CUDA resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you must synchronize explicitly by calling handle.sync() before accessing the output.

Examples

>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import cluster_cost
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                      dtype=cp.float32)
>>> inertia = cluster_cost(X, centroids)
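As a point of comparison, the quantity can be sketched on the CPU in plain Python. This assumes the cluster cost is the inertia described for fit, that is, the sum of squared distances from each sample to its closest centroid; `reference_cluster_cost` is a hypothetical helper for checking small inputs by hand, not part of pylibraft.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reference_cluster_cost(X, centroids):
    """Sum over samples of the squared distance to the nearest centroid.

    Assumption: this mirrors the inertia definition given for fit().
    """
    return sum(
        min(squared_distance(row, c) for c in centroids)
        for row in X
    )


X = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
centroids = [(0.0, 1.0), (10.0, 10.0)]
cost = reference_cluster_cost(X, centroids)
print(cost)  # prints: 2.0
```

The first two samples each lie at squared distance 1 from the first centroid and the third coincides with the second centroid, which gives the total of 2.0.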

pylibraft.cluster.kmeans.compute_new_centroids(X, centroids, labels, new_centroids, sample_weights=None, weight_per_cluster=None, handle=None)

Compute new centroids given an input matrix and existing centroids

Parameters:

X: Input CUDA array interface compliant matrix, shape (m, k)
centroids: Input CUDA array interface compliant matrix, shape (n_clusters, k)
labels: Input CUDA array interface compliant matrix, shape (m, 1)
new_centroids: Writable CUDA array interface compliant matrix, shape (n_clusters, k)
sample_weights: Optional input CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None
weight_per_cluster: Optional writable CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None
batch_samples: Optional integer specifying the batch size for X to compute distances in batches. Default: m
batch_centroids: Optional integer specifying the batch size for centroids to compute distances in batches. Default: n_clusters
handle: Optional RAFT resource handle for reusing CUDA resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you must synchronize explicitly by calling handle.sync() before accessing the output.

Examples

>>> import cupy as cp
>>> from pylibraft.common import Handle
>>> from pylibraft.cluster.kmeans import compute_new_centroids
>>> # A single RAFT handle can optionally be reused across
>>> # pylibraft functions.
>>> handle = Handle()
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                     dtype=cp.float32)
>>> labels = cp.random.randint(0, high=n_clusters, size=n_samples,
...                            dtype=cp.int32)
>>> new_centroids = cp.empty((n_clusters, n_features),
...                          dtype=cp.float32)
>>> compute_new_centroids(
...     X, centroids, labels, new_centroids, handle=handle
... )
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
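For intuition about the result, the unweighted centroid update can be sketched on the CPU in plain Python: each new centroid is the mean of the samples carrying that label. `reference_new_centroids` is a hypothetical helper, not part of pylibraft. It returns a new list rather than writing into a preallocated output, it ignores sample weights, and keeping the previous centroid for an empty cluster is an assumption of this sketch rather than documented library behavior.

```python
def reference_new_centroids(X, centroids, labels):
    """Per-label mean of the rows of X (unweighted sketch, not pylibraft)."""
    n_clusters = len(centroids)
    n_features = len(centroids[0])
    sums = [[0.0] * n_features for _ in range(n_clusters)]
    counts = [0] * n_clusters
    # Accumulate per-cluster feature sums and member counts.
    for row, label in zip(X, labels):
        counts[label] += 1
        for j, value in enumerate(row):
            sums[label][j] += value
    new_centroids = []
    for c in range(n_clusters):
        if counts[c] > 0:
            new_centroids.append([s / counts[c] for s in sums[c]])
        else:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(list(centroids[c]))
    return new_centroids


X = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
centroids = [(5.0, 5.0), (6.0, 6.0), (7.0, 7.0)]
labels = [0, 0, 1]
new_centroids = reference_new_centroids(X, centroids, labels)
print(new_centroids)
# prints: [[1.0, 0.0], [10.0, 10.0], [7.0, 7.0]]
```

Cluster 2 has no members in this input, so its centroid is carried over unchanged.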