Cluster

This page provides the pylibraft API reference for the publicly exposed elements of the pylibraft.cluster package.

KMeans

class pylibraft.cluster.kmeans.KMeansParams(n_clusters: int | None = None, max_iter: int | None = None, tol: float | None = None, verbosity: int | None = None, seed: int | None = None, metric: str | None = None, init: InitMethod | None = None, n_init: int | None = None, oversampling_factor: float | None = None, batch_samples: int | None = None, batch_centroids: int | None = None, inertia_check: bool | None = None)

Specifies hyperparameters for the k-means algorithm.

Parameters:
n_clusters : int, optional

The number of clusters to form, which is also the number of centroids to generate.

max_iter : int, optional

Maximum number of iterations of the k-means algorithm for a single run.

tol : float, optional

Relative tolerance with regard to inertia used to declare convergence.

verbosity : int, optional

seed : int, optional

Seed to the random number generator.

metric : str, optional

Name of the metric to use for distance computation; see pylibraft.distance.pairwise_distance() for valid values.

init : InitMethod, optional

n_init : int, optional

Number of times the k-means algorithm will be run with different seeds.

oversampling_factor : float, optional

Oversampling factor for use in the k-means algorithm.

Attributes:
batch_centroids
batch_samples
inertia_check
init
max_iter
n_clusters
oversampling_factor
seed
tol
verbosity

pylibraft.cluster.kmeans.fit(KMeansParams params, X, centroids=None, sample_weights=None, handle=None)

Find clusters with the k-means algorithm.

Parameters:
params : KMeansParams

Parameters to use to fit the KMeans model

X : Input CUDA array interface compliant matrix of shape (m, k)

centroids : Optional writable CUDA array interface compliant matrix of shape (n_clusters, k)

sample_weights : Optional input CUDA array interface compliant matrix of shape (m, 1), one weight per sample in X. default: None

handle : Optional RAFT resource handle for reusing CUDA resources.

If a handle is not supplied, CUDA resources are allocated inside this function and synchronized before the function exits. If a handle is supplied, you must synchronize explicitly by calling handle.sync() before accessing the output.

Returns:
centroids : raft.device_ndarray

The computed centroids for each cluster

inertia : float

Sum of squared distances of samples to their closest cluster center

n_iter : int

The number of iterations used to fit the model

Examples

>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import fit, KMeansParams
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
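
To make the roles of max_iter, tol, and the returned (centroids, inertia, n_iter) triple concrete, the following is a minimal plain-Python sketch of a Lloyd-style k-means loop that runs without a GPU. It is not the RAFT implementation: the function names lloyd_kmeans and squared_distance are hypothetical, the initial centroids are passed in explicitly rather than chosen by an init method, and the stopping rule (relative change in inertia at most tol) is only one plausible reading of the tol description above.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(X, init_centroids, max_iter=300, tol=1e-4):
    """Illustrative k-means loop (hypothetical helper, not pylibraft's fit)."""
    centroids = [list(c) for c in init_centroids]
    prev_inertia = None
    inertia = 0.0
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Assignment step: attach each sample to its nearest centroid and
        # accumulate the inertia (sum of squared distances).
        labels = []
        inertia = 0.0
        for row in X:
            dists = [squared_distance(row, c) for c in centroids]
            best = min(range(len(centroids)), key=dists.__getitem__)
            labels.append(best)
            inertia += dists[best]

        # Update step: move each non-empty cluster's centroid to the mean
        # of its members.
        for k in range(len(centroids)):
            members = [row for row, label in zip(X, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]

        # Assumed convergence rule: stop once the relative change in
        # inertia between iterations is at most tol.
        if prev_inertia is not None and abs(prev_inertia - inertia) <= tol * prev_inertia:
            break
        prev_inertia = inertia
    return centroids, inertia, n_iter


# Two well-separated pairs of 2-D points.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
init_centroids = [[0.0, 0.0], [10.0, 0.0]]

centroids, inertia, n_iter = lloyd_kmeans(X, init_centroids)
print(centroids)        # [[0.0, 1.0], [10.0, 1.0]]
print(inertia, n_iter)  # 4.0 3
```

In this sketch the inertia reported for an iteration is measured against the centroids that entered that iteration, so a run needs one extra pass after the centroids settle before the relative change drops below tol.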

pylibraft.cluster.kmeans.cluster_cost(X, centroids, handle=None)

Compute cluster cost given an input matrix and existing centroids

Parameters:
X : Input CUDA array interface compliant matrix of shape (m, k)

centroids : Input CUDA array interface compliant matrix of shape (n_clusters, k)

handle : Optional RAFT resource handle for reusing CUDA resources.

If a handle is not supplied, CUDA resources are allocated inside this function and synchronized before the function exits. If a handle is supplied, you must synchronize explicitly by calling handle.sync() before accessing the output.

Examples

>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import cluster_cost
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                      dtype=cp.float32)
>>> inertia = cluster_cost(X, centroids)
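
As a CPU-side reference for small inputs, the cost can be sketched in plain Python. This assumes the cluster cost is the inertia as described for fit (the sum of squared Euclidean distances from each sample to its closest centroid), which the variable name in the example above suggests but this page does not state outright. The helper names are hypothetical and are not part of pylibraft.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_cost_reference(X, centroids):
    """Sum over samples of the squared distance to the nearest centroid.

    Hypothetical CPU reference; assumes squared Euclidean distance.
    """
    return sum(min(squared_distance(row, c) for c in centroids) for row in X)


# Four 2-D samples and two centroids, chosen so the result is easy to verify:
# every sample sits exactly 1 unit from its nearest centroid.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids = [[0.0, 1.0], [10.0, 1.0]]

cost = cluster_cost_reference(X, centroids)
print(cost)  # 4.0
```

A reference like this can be compared against the GPU result on a tiny input, allowing for float32 rounding on the device side.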

pylibraft.cluster.kmeans.compute_new_centroids(X, centroids, labels, new_centroids, sample_weights=None, weight_per_cluster=None, handle=None)

Compute new centroids given an input matrix and existing centroids

Parameters:
X : Input CUDA array interface compliant matrix of shape (m, k)

centroids : Input CUDA array interface compliant matrix of shape (n_clusters, k)

labels : Input CUDA array interface compliant matrix of shape (m, 1)

new_centroids : Writable CUDA array interface compliant matrix of shape (n_clusters, k)

sample_weights : Optional input CUDA array interface compliant matrix of shape (m, 1), one weight per sample in X. default: None

weight_per_cluster : Optional writable CUDA array interface compliant matrix of shape (n_clusters, 1). default: None

batch_samples : Optional integer specifying the batch size for X to compute distances in batches. default: m

batch_centroids : Optional integer specifying the batch size for centroids to compute distances in batches. default: n_clusters

handle : Optional RAFT resource handle for reusing CUDA resources.

If a handle is not supplied, CUDA resources are allocated inside this function and synchronized before the function exits. If a handle is supplied, you must synchronize explicitly by calling handle.sync() before accessing the output.

Examples

>>> import cupy as cp
>>> from pylibraft.common import Handle
>>> from pylibraft.cluster.kmeans import compute_new_centroids
>>> # A single RAFT handle can optionally be reused across
>>> # pylibraft functions.
>>> handle = Handle()
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                      dtype=cp.float32)
>>> labels = cp.random.randint(0, high=n_clusters, size=n_samples,
...                            dtype=cp.int32)
>>> new_centroids = cp.empty((n_clusters, n_features),
...                          dtype=cp.float32)
>>> compute_new_centroids(
...     X, centroids, labels, new_centroids, handle=handle
... )
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
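
The update itself can be illustrated with a plain-Python sketch of the standard k-means centroid step: each new centroid is the weighted mean of the samples assigned to it. Several points here are assumptions rather than documented behavior: that sample_weights holds one weight per sample, that weight_per_cluster receives the summed weight of each cluster, and that a cluster with no assigned samples keeps its previous centroid. The function name update_centroids_reference is hypothetical, and unlike compute_new_centroids it returns its results instead of writing into preallocated arrays.

```python
def update_centroids_reference(X, centroids, labels, sample_weights=None):
    """Weighted-mean centroid update (hypothetical CPU reference)."""
    n_clusters = len(centroids)
    n_features = len(centroids[0])
    if sample_weights is None:
        # Unweighted case: every sample counts once.
        sample_weights = [1.0] * len(X)

    # Accumulate the weighted feature sums and total weight per cluster.
    weighted_sums = [[0.0] * n_features for _ in range(n_clusters)]
    weight_per_cluster = [0.0] * n_clusters
    for row, label, weight in zip(X, labels, sample_weights):
        weight_per_cluster[label] += weight
        for j, value in enumerate(row):
            weighted_sums[label][j] += weight * value

    new_centroids = []
    for k in range(n_clusters):
        if weight_per_cluster[k] > 0:
            new_centroids.append(
                [s / weight_per_cluster[k] for s in weighted_sums[k]]
            )
        else:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(list(centroids[k]))
    return new_centroids, weight_per_cluster


X = [[0.0, 0.0], [2.0, 0.0], [10.0, 4.0]]
centroids = [[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]]
labels = [0, 0, 1]  # the third cluster receives no samples

new_centroids, weight_per_cluster = update_centroids_reference(
    X, centroids, labels
)
print(new_centroids)       # [[1.0, 0.0], [10.0, 4.0], [5.0, 5.0]]
print(weight_per_cluster)  # [2.0, 1.0, 0.0]
```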