Cluster
This page provides pylibraft class references for the publicly-exposed elements of the pylibraft.cluster
package.
KMeans
- class pylibraft.cluster.kmeans.KMeansParams(n_clusters: int | None = None, max_iter: int | None = None, tol: float | None = None, verbosity: int | None = None, seed: int | None = None, metric: str | None = None, init: InitMethod | None = None, n_init: int | None = None, oversampling_factor: float | None = None, batch_samples: int | None = None, batch_centroids: int | None = None, inertia_check: bool | None = None)
Specifies hyper-parameters for the kmeans algorithm.
- Parameters:
- n_clusters: int, optional
The number of clusters to form as well as the number of centroids to generate.
- max_iter: int, optional
Maximum number of iterations of the k-means algorithm for a single run.
- tol: float, optional
Relative tolerance with regard to inertia to declare convergence.
- verbosity: int, optional
- seed: int, optional
Seed for the random number generator.
- metric: str, optional
Name of the metric to use for distance computation; see
pylibraft.distance.pairwise_distance()
for valid values.
- init: InitMethod, optional
- n_init: int, optional
Number of times the k-means algorithm will be run with different seeds.
- oversampling_factor: float, optional
Oversampling factor for use in the k-means algorithm.
- Attributes:
- batch_centroids
- batch_samples
- inertia_check
- init
- max_iter
- n_clusters
- oversampling_factor
- seed
- tol
- verbosity
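The interaction between max_iter and tol can be illustrated with a small pure-Python sketch. The stopping rule used here (stop once the relative change in inertia is at most tol, or once max_iter iterations have run) is an assumption drawn from the parameter descriptions above and may not match the exact criterion RAFT applies. The helper count_iterations and the inertia trace are hypothetical and are not part of pylibraft.

```python
def count_iterations(inertias, max_iter, tol):
    """Return how many iterations run before the assumed stopping rule fires.

    `inertias` is a made-up per-iteration inertia trace, not pylibraft output.
    Assumed rule: stop once the relative change in inertia is at most `tol`,
    or once `max_iter` iterations have run.
    """
    prev = None
    n_iter = 0
    for cur in inertias[:max_iter]:
        n_iter += 1
        # Relative change is only defined once a previous value exists.
        if prev is not None and prev > 0 and abs(prev - cur) / prev <= tol:
            break
        prev = cur
    return n_iter


# Illustrative inertia values only.
trace = [100.0, 60.0, 50.0, 49.99, 49.99]

converged_at = count_iterations(trace, max_iter=300, tol=1e-3)
capped_at = count_iterations(trace, max_iter=2, tol=1e-3)
print(converged_at, capped_at)
```

With a generous max_iter the loop ends when the inertia stops changing appreciably; with a small max_iter the iteration cap ends it first.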
- pylibraft.cluster.kmeans.fit(KMeansParams params, X, centroids=None, sample_weights=None, handle=None)
Find clusters with the k-means algorithm
- Parameters:
- params: KMeansParams
Parameters to use to fit the KMeans model.
- X: Input CUDA array interface compliant matrix, shape (m, k)
- centroids: Optional writable CUDA array interface compliant matrix, shape (n_clusters, k)
- sample_weights: Optional input CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None
- handle: Optional RAFT resource handle for reusing CUDA resources.
If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.
- Returns:
- centroids: raft.device_ndarray
The computed centroids for each cluster.
- inertia: float
Sum of squared distances of samples to their closest cluster center.
- n_iter: int
The number of iterations used to fit the model.
Examples
>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import fit, KMeansParams
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
- pylibraft.cluster.kmeans.cluster_cost(X, centroids, handle=None)
Compute cluster cost given an input matrix and existing centroids
- Parameters:
- X: Input CUDA array interface compliant matrix, shape (m, k)
- centroids: Input CUDA array interface compliant matrix, shape (n_clusters, k)
- handle: Optional RAFT resource handle for reusing CUDA resources.
If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.
Examples
>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import cluster_cost
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                     dtype=cp.float32)
>>> inertia = cluster_cost(X, centroids)
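For small inputs, the value returned by cluster_cost can be sanity-checked against a CPU reference. The sketch below assumes the cluster cost is the same quantity documented as inertia for fit, that is, the sum of squared Euclidean distances from each sample to its closest centroid. The function cluster_cost_reference is a hypothetical helper operating on plain Python lists, not part of pylibraft.

```python
def cluster_cost_reference(X, centroids):
    """Sum of squared Euclidean distances from each row of X to its
    closest centroid (assumed definition of the cluster cost)."""
    total = 0.0
    for x in X:
        # Squared distance from this sample to every centroid; keep the smallest.
        total += min(
            sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids
        )
    return total


# Tiny hand-checkable input: shape (m=3, k=2) with n_clusters=2.
X = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
centroids = [[0.0, 0.0], [4.0, 3.0]]

cost = cluster_cost_reference(X, centroids)
print(cost)
```

Here the three samples contribute 0, 1 and 1 respectively, so the reference cost is 2.0.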
- pylibraft.cluster.kmeans.compute_new_centroids(X, centroids, labels, new_centroids, sample_weights=None, weight_per_cluster=None, handle=None)
Compute new centroids given an input matrix and existing centroids
- Parameters:
- X: Input CUDA array interface compliant matrix, shape (m, k)
- centroids: Input CUDA array interface compliant matrix, shape (n_clusters, k)
- labels: Input CUDA array interface compliant matrix, shape (m, 1)
- new_centroids: Writable CUDA array interface compliant matrix, shape (n_clusters, k)
- sample_weights: Optional input CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None
- weight_per_cluster: Optional writable CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None
- batch_samples: Optional integer specifying the batch size for X to compute distances in batches. Default: m
- batch_centroids: Optional integer specifying the batch size for centroids to compute distances in batches. Default: n_clusters
- handle: Optional RAFT resource handle for reusing CUDA resources.
If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.
Examples
>>> import cupy as cp
>>> from pylibraft.common import Handle
>>> from pylibraft.cluster.kmeans import compute_new_centroids
>>> # A single RAFT handle can optionally be reused across
>>> # pylibraft functions.
>>> handle = Handle()
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                     dtype=cp.float32)
>>> labels = cp.random.randint(0, high=n_clusters, size=n_samples,
...                            dtype=cp.int32)
>>> new_centroids = cp.empty((n_clusters, n_features),
...                          dtype=cp.float32)
>>> compute_new_centroids(
...     X, centroids, labels, new_centroids, handle=handle
... )
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
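As a rough mental model of the update step, the following pure-Python sketch computes each new centroid as the weighted mean of the samples carrying that label. This is the conventional k-means update and is an assumption about what compute_new_centroids produces, not a description of RAFT's implementation. The helper new_centroids_reference is hypothetical; it also assumes one weight per row of X, and that a cluster with no assigned samples keeps its previous centroid.

```python
def new_centroids_reference(X, centroids, labels, sample_weights=None):
    """Weighted-mean centroid update (conventional k-means step).

    Returns (new_centroids, weight_per_cluster). Assumes one weight per
    sample; unweighted input is treated as weight 1.0 for every sample.
    """
    n_clusters = len(centroids)
    n_features = len(centroids[0])
    if sample_weights is None:
        sample_weights = [1.0] * len(X)

    sums = [[0.0] * n_features for _ in range(n_clusters)]
    weight_per_cluster = [0.0] * n_clusters

    # Accumulate the weighted feature sums and total weight per cluster.
    for x, label, w in zip(X, labels, sample_weights):
        weight_per_cluster[label] += w
        for j, xj in enumerate(x):
            sums[label][j] += w * xj

    new_centroids = []
    for k in range(n_clusters):
        if weight_per_cluster[k] > 0:
            new_centroids.append([s / weight_per_cluster[k] for s in sums[k]])
        else:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(list(centroids[k]))
    return new_centroids, weight_per_cluster


X = [[0.0, 0.0], [2.0, 0.0], [4.0, 4.0]]
old_centroids = [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]]
labels = [0, 0, 1]

updated, weights = new_centroids_reference(X, old_centroids, labels)
print(updated)
print(weights)
```

In this toy input the first two samples average to the first new centroid, the third sample alone defines the second, and the third cluster receives no samples.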