cima.detection package

Submodules

cima.detection.beads_identification module

class cima.detection.beads_identification.BeadsFinder

Bases: object

fit(seg: Segment, n_jobs: int = 1, radius: int = 40, min_persistence: float = 0.75, eps: int = 30, min_samples: int = 100, plot_beads: bool = False, plot_path: str = '', plot_filename: str = '', plot_format: str = 'pdf')

Finds the beads in a given Segment, labels them, and adds that information to the Segment under graphs_df. Also adds a debeaded segment for ease of use.

Parameters:
  • seg (Segment) – The Segment object to find beads in.

  • n_jobs (int, optional) – The number of jobs to run in parallel, by default 1

  • radius (int, optional) – The radius to use for finding neighbors, by default 40

  • min_persistence (float, optional) – The minimum persistence ratio to classify a localization as bead-originated, by default 0.75

  • eps (int, optional) – The maximum distance between two samples for them to be considered as in the same neighborhood in the DBSCAN clustering, by default 30

  • min_samples (int, optional) – The number of samples in a neighborhood for a point to be considered as a core point in the DBSCAN clustering, by default 100

  • plot_beads (bool, optional) – Whether to plot the beads, by default False

  • plot_path (str, optional) – The directory in which to save the plot, by default “” (which saves it in the same directory as the segment’s filename). Do not include the filename.

  • plot_filename (str, optional) – The filename to save the plot under, by default “” (which generates a filename based on the segment’s filename). Do not include the format extension.

  • plot_format (str, optional) – The format to save the plot, by default “pdf”

Returns:

self – The fitted BeadsFinder instance.

Return type:

BeadsFinder
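As a rough illustration of how a persistence ratio could separate beads from blinking fluorophores, the sketch below computes, for each localization, the fraction of frames that contain at least one localization within `radius`. This definition, the function name `persistence_ratio`, and the brute-force neighbor search are assumptions made for illustration; this reference does not show how the library actually computes persistence.

```python
import numpy as np

def persistence_ratio(coords, frames, radius=40.0):
    """Fraction of frames with a localization within `radius` of each point.

    Assumed definition, for illustration only.
    """
    coords = np.asarray(coords, dtype=float)
    frames = np.asarray(frames)
    n_frames = np.unique(frames).size
    # Brute-force pairwise distances (suitable for small inputs only).
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    near = dists <= radius  # each point is its own neighbor
    ratios = np.empty(len(coords))
    for i in range(len(coords)):
        # Count the distinct frames found among the neighbors of point i.
        ratios[i] = np.unique(frames[near[i]]).size / n_frames
    return ratios

# A bead seen in all 4 frames, plus one isolated localization in frame 2.
coords = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [500, 500, 0]])
frames = np.array([0, 1, 2, 3, 2])
ratios = persistence_ratio(coords, frames, radius=40)
is_bead = ratios >= 0.75  # threshold analogous to min_persistence
```

Under this assumed definition, a localization that reappears at the same position in most frames gets a ratio close to 1, while a sparse blinking event stays close to 1 / n_frames.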

show(path2save: str | None = None, n_jobs: int = 1, radius: int = 40)

Displays a representation of the connection matrices, first by considering all localizations and then considering one localization per frame. Then it also displays the persistence of each localization.

Parameters:
  • path2save (Optional[str], optional) – If provided, saves the plots to the specified path instead of displaying them, by default None

  • n_jobs (int, optional) – The number of jobs to run in parallel for the nearest neighbors calculation, by default 1

  • radius (int, optional) – The radius to use for finding neighbors in the nearest neighbors calculation, by default 40

cima.detection.clusters module

cima.detection.clusters.DBscan(StructureObj: Segment, epsilon2test=0, minpoints=100, n_jobs=8)

The function takes a StructureObj object and performs DBSCAN clustering with user-defined parameters.

Parameters:
  • StructureObj (Segment) – Structure Object to cluster

  • epsilon2test (int) – Epsilon value for DBSCAN

  • minpoints (int) – Minimum points for DBSCAN

  • n_jobs (int) – Number of parallel jobs to run during the DBSCAN computation

Returns:

Clustered Structure Object with clusterID assigned.

Return type:

Segment
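For readers unfamiliar with the two DBSCAN parameters, here is a minimal sketch that runs scikit-learn's `DBSCAN` directly on a synthetic coordinate array. It does not use `Segment` or `cima.detection.clusters.DBscan`; whether the library wraps scikit-learn internally is an assumption, and the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two compact blobs of 150 points each, far apart, plus one distant outlier.
blob_a = rng.normal(loc=0.0, scale=5.0, size=(150, 3))
blob_b = rng.normal(loc=500.0, scale=5.0, size=(150, 3))
outlier = np.array([[5000.0, 5000.0, 5000.0]])
coords = np.vstack([blob_a, blob_b, outlier])

# eps plays the role of the epsilon value, min_samples the role of minpoints.
labels = DBSCAN(eps=30, min_samples=100, n_jobs=1).fit_predict(coords)

# scikit-learn marks noise points with the label -1.
n_clusters = len(set(labels.tolist()) - {-1})
```

Each entry of `labels` is the cluster ID assigned to the corresponding row of `coords`, which is the kind of per-point assignment the returned Segment is described as carrying.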

class cima.detection.clusters.DBscan_grid_search_stable

Bases: object

Selects the pair of min_points and eps parameters that gives the clustering with the highest stability, meaning the clustering that changes the least when the parameters change. In the process, computes DBSCAN labels for all combinations of the specified parameters. Also computes the grids of ARI scores between each labeling and its neighbors on the grid, and the variation of those ARI scores.
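The stability idea can be sketched as follows: for every cell of a parameter grid, compare its labels with those of the surrounding cells using the adjusted Rand index (ARI), then summarize the scores. The helper names, the square neighborhood, and the choice of median and variance as summaries are assumptions for illustration; the library's actual computation may differ, and noise handling (`consider_noise`) is not modeled here.

```python
import numpy as np

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same points."""
    _, ai = np.unique(np.asarray(a), return_inverse=True)
    _, bi = np.unique(np.asarray(b), return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=np.int64)
    np.add.at(table, (ai, bi), 1)  # contingency table

    def comb2(x):
        return x * (x - 1) / 2.0  # number of pairs

    sum_ij = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()
    sum_b = comb2(table.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(len(ai))
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:  # both labelings are trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def neighborhood_ari(label_grid, n_neighbors=2):
    """label_grid has shape (n_min_pts, n_eps, n_points)."""
    n_rows, n_cols, _ = label_grid.shape
    median = np.zeros((n_rows, n_cols))
    var = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            scores = []
            for ii in range(max(0, i - n_neighbors), min(n_rows, i + n_neighbors + 1)):
                for jj in range(max(0, j - n_neighbors), min(n_cols, j + n_neighbors + 1)):
                    if (ii, jj) != (i, j):
                        scores.append(adjusted_rand_index(label_grid[i, j], label_grid[ii, jj]))
            median[i, j] = np.median(scores)
            var[i, j] = np.var(scores)
    return median, var

# A 1 x 3 grid of labelings for 4 points.
label_grid = np.array([[[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]])
ari_median, ari_var = neighborhood_ari(label_grid, n_neighbors=1)
```

A cell whose neighbors produce nearly identical labelings scores close to 1, which is what "changes the least when the parameters change" means in practice.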

copy()

Return an identical copy of the object.

fit(segment: SegmentXYZ, min_pts_param: tuple[int, int, int] = (10, 300, 10), eps_param: tuple[int, int, int] = (0, 0, 0), consider_noise: bool = True, n_neighbors: int = 2, conv: bool = False, verbose: bool = False, n_jobs: int = 8, downsampling_rate: float = 1.0, limit_density: bool = False, random_seed: int = 0)

Computes labels, ARI and variance of ARI grids. Computes the rank of parameter combinations according to stability. Saves best labels and a copy of segment with them as clusterIDs.

Parameters:
  • segment (SegmentXYZ) – SegmentXYZ containing the coordinates on which to run the clustering

  • min_pts_param (tuple[int, int, int], optional) – Tuple of the form (min, max, step) defining the range of minimum number of points to consider in DBSCAN

  • eps_param (tuple[int, int, int], optional) – Tuple of the form (min, max, step) defining the range of eps values to provide to DBSCAN

  • consider_noise (bool, optional) – Whether to consider noise points in the ARI calculation, by default True. Excluding them may be useful when the majority of points are classified as noise, because it concentrates the comparison on the signal part.

  • n_neighbors (int, optional) – Number of neighbors to consider when computing the rank of stability, by default 2.

  • conv (bool, optional) – Whether to apply a blurring on the grid before computing the rank of stability, by default False.

  • n_jobs (int, optional) – Number of CPUs to use, by default 8.

  • downsampling_rate (float, optional) – Fraction of points to subsample before running DBSCAN, which decreases computation time. When the rate is < 1, the min_pts_param range is adjusted so that the pattern on the grid stays very similar to the one that would be obtained with rate = 1. The results become less comparable as the rate decreases towards 0. By default 1.0 (no downsampling).

  • limit_density (bool, optional) – Limit the search for stability among those parameters defining a density threshold between the 25th and 75th density percentile of coordinates. By default False.

  • random_seed (int, optional) – Used in the random selection of downsampled coords, to make it reproducible. By default 0.
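The interplay of downsampling_rate, random_seed and the min_pts range can be sketched as below. The helper names are hypothetical, and scaling the (min, max, step) tuple linearly by the rate is an assumption about how the range "is adjusted"; the reference does not state the actual rule.

```python
import numpy as np

def downsample_coords(coords, downsampling_rate=1.0, random_seed=0):
    """Reproducibly keep a random fraction of the rows of `coords`."""
    coords = np.asarray(coords)
    if not 0 < downsampling_rate <= 1:
        raise ValueError("downsampling_rate must be in (0, 1]")
    n_keep = max(1, int(round(len(coords) * downsampling_rate)))
    rng = np.random.default_rng(random_seed)
    # Sample without replacement and keep the original row order.
    keep = np.sort(rng.choice(len(coords), size=n_keep, replace=False))
    return coords[keep], keep

def scale_min_pts(min_pts_param, downsampling_rate):
    """Assumed adjustment: scale (min, max, step) by the rate."""
    return tuple(max(1, int(round(v * downsampling_rate))) for v in min_pts_param)

coords = np.arange(300, dtype=float).reshape(100, 3)
sub_a, keep_a = downsample_coords(coords, downsampling_rate=0.5, random_seed=0)
sub_b, keep_b = downsample_coords(coords, downsampling_rate=0.5, random_seed=0)
scaled = scale_min_pts((10, 300, 10), 0.5)
```

Fixing the seed makes the selected subset identical across runs, so two fits with the same settings remain comparable.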

mergeOtherDBSCANGrid(other_scanner: DBscan_grid_search_stable, conv: bool = False, consider_noise: bool = True, n_neighbors: int = 2, verbose: bool = False)

Integrates the grid of precomputed labels contained in other_scanner into this object. Then computes everything else from them. Useful when you want to extend the grid without having to recompute the labels that you already have.

Parameters:
  • other_scanner (DBscan_grid_search_stable) – The scanner (already fitted) to integrate into this one

  • conv (bool, optional) – Whether to apply a blurring on the grid before computing the rank of stability, by default False

  • consider_noise (bool, optional) – This may be useful when the majority of points are classified as noise, because it concentrates the comparison on the signal part, by default True

  • n_neighbors (int, optional) – The number of neighbors to consider for the ARI calculation, by default 2

  • verbose (bool, optional) – Whether to print progress messages, by default False

Returns:

The updated DBscan_grid_search_stable object.

Return type:

DBscan_grid_search_stable

Raises:
  • ValueError – If self is not fitted.

  • ValueError – If the other_scanner is not fitted.

  • ValueError – If the segment coordinates don’t match.
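A minimal sketch of the merge logic and its three error conditions follows. It assumes, purely for illustration, that each scanner's precomputed labels can be represented as a dictionary keyed by (min_pts, eps); the function name and this representation are hypothetical and do not reflect the library's internals.

```python
import numpy as np

def merge_label_grids(own_coords, own_labels, other_coords, other_labels):
    """Merge two dicts mapping (min_pts, eps) -> label array."""
    if not own_labels:
        raise ValueError("this scanner is not fitted")
    if not other_labels:
        raise ValueError("the other scanner is not fitted")
    if not np.array_equal(own_coords, other_coords):
        raise ValueError("segment coordinates do not match")
    merged = dict(other_labels)
    merged.update(own_labels)  # own labels win on overlapping parameter pairs
    return merged

coords = np.zeros((4, 3))
own = {(10, 20): np.array([0, 0, 1, 1])}
other = {(10, 30): np.array([0, 0, 0, 1]), (20, 20): np.array([0, 1, 1, 1])}
merged = merge_label_grids(coords, own, coords, other)
```

Once the grids are merged, the ARI and ARI-variation grids would be recomputed over the enlarged set of parameter pairs, without rerunning DBSCAN for pairs that already have labels.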

plotAriGrid()

Plots the grid of ARI values. It will put a red dot on the best combination of parameters.

plotAriVarGrid()

Plots the grid of ARI variation values. It will put a red dot on the best combination of parameters.

saveLog(outfile)

Saves the ARI and ARI variation values in a single CSV file, in decreasing order of stability.

class cima.detection.clusters.HDBSCAN_stable

Bases: object

Selects the value of min_cluster_size which gives the clustering with the highest stability, meaning that it changes less when changing the parameter. In the process computes HDBSCAN labels for all the specified parameters. Also computes the grids of ARI scores among neighborhood of labels, and the variation of those ARI scores.

ari_median

1D array containing the median ARI values for each min_cluster_size

Type:

np.ndarray

ari_var

1D array containing the ARI variance values for each min_cluster_size

Type:

np.ndarray

all_labels

2D array containing the clustering labels for each min_cluster_size

Type:

np.ndarray

min_cluster_size_range

List of min_cluster_size values used

Type:

list[int]

segment

SegmentXYZ containing the coordinates on which the clustering was run

Type:

SegmentXYZ

ordered_params_df

DataFrame containing the ordered min_cluster_size, ARI and ARI variation values

Type:

pd.DataFrame

best_mcs

The best min_cluster_size value

Type:

int

best_ind

The index of the best min_cluster_size value in min_cluster_size_range

Type:

int

labels_

1D array containing the clustering labels for the best min_cluster_size

Type:

np.ndarray

optimal_segment

Copy of segment with clusterIDs set to labels_.

Type:

SegmentXYZ

copy()

Returns a copy of this object

fit(segment: SegmentXYZ, min_cluster_size_range: list[int] = [], n_neighbors=2, n_jobs=8, conv=False, verbose=False, consider_noise=True)

Computes labels, ARI and ARI_var grids. Computes the rank of parameter combinations according to stability. Saves best labels and a copy of segment with them as clusterIDs.

Parameters:
  • segment (SegmentXYZ) – SegmentXYZ containing the coordinates on which to run the clustering

  • min_cluster_size_range (list[int], optional) – List of min_cluster_size values to provide to HDBSCAN

  • n_neighbors (int, optional) – The number of neighbors to use for the ARI computation, by default 2
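How the per-parameter scores might translate into best_ind and best_mcs can be sketched as follows. The ranking rule used here (highest median ARI first, ties broken by lowest ARI variance) and the function name are assumptions; the reference only says that combinations are ranked according to stability.

```python
import numpy as np

def rank_by_stability(min_cluster_size_range, ari_median, ari_var):
    """Order parameter indices from most to least stable (assumed rule)."""
    ari_median = np.asarray(ari_median, dtype=float)
    ari_var = np.asarray(ari_var, dtype=float)
    # np.lexsort uses the LAST key as the primary key:
    # highest median ARI first, then lowest variance.
    order = np.lexsort((ari_var, -ari_median))
    best_ind = int(order[0])
    best_mcs = min_cluster_size_range[best_ind]
    return order, best_ind, best_mcs

order, best_ind, best_mcs = rank_by_stability(
    [10, 20, 30, 40],
    ari_median=[0.80, 0.95, 0.95, 0.60],
    ari_var=[0.02, 0.03, 0.01, 0.05],
)
```

In this example two values tie on median ARI, and the one with the smaller variance is ranked first.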

mergeOtherHDBSCANGrid(other_scanner, conv: bool = False, consider_noise: bool = True, n_neighbors: int = 2, verbose: bool = False)

Integrates the grid of precomputed labels contained in other_scanner into this object. Then computes everything else from them. Useful when you want to extend the grid without having to recompute the labels that you already have.

Parameters:
  • other_scanner (HDBSCAN_stable) – The scanner (already fitted) to integrate into this one

  • conv (bool, optional) – Whether to apply convolution to the ARI scores, by default False

  • consider_noise (bool, optional) – Whether to consider noise points in the ARI calculation, by default True

  • n_neighbors (int, optional) – The number of neighbors to use for the ARI computation, by default 2

  • verbose (bool, optional) – Whether to print verbose output, by default False

plotAriGrid()

Plots the grid of ARI values.

plotAriVarGrid()

Plots the grid of ARI variation values.

saveLog(outfile)

Saves the ARI and ARI variation values in a single CSV file, in decreasing order of stability.

Parameters:

outfile (str) – Path to the output CSV file

class cima.detection.clusters.ThresholdClusterFilter

Bases: object

Class that filters clusters in a Segment based on computed features using specified thresholds.

features_to_use

which features were used for the filtering

Type:

list

feats_df

dataframe with the computed features for all clusters

Type:

pd.DataFrame

limits

the limits used for the filtering

Type:

dict

where_retain

boolean array indicating which clusters were retained

Type:

np.ndarray

retained_clusters_ids

list of the ids of the retained clusters

Type:

list

transformed_segment

the segment after filtering

Type:

Segment

new_labels

array of the new cluster labels for the original segment

Type:

np.ndarray
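Two of the feature names accepted by fit, radius_of_gyration and numerosity, have standard definitions that can be sketched per cluster as below. The function name, the output layout, and the treatment of label -1 as noise are assumptions; volume is left out because the reference does not say how it is computed.

```python
import numpy as np

def cluster_features(coords, cluster_ids):
    """Per-cluster radius of gyration and point count."""
    coords = np.asarray(coords, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    feats = {}
    for cid in np.unique(cluster_ids):
        if cid == -1:  # assumed noise label, skipped
            continue
        pts = coords[cluster_ids == cid]
        centroid = pts.mean(axis=0)
        # Root mean squared distance of the points from their centroid.
        rg = np.sqrt(((pts - centroid) ** 2).sum(axis=1).mean())
        feats[int(cid)] = {"radius_of_gyration": float(rg), "numerosity": len(pts)}
    return feats

coords = np.array([
    [-1, 0, 0], [1, 0, 0],                          # cluster 0
    [2, 0, 0], [-2, 0, 0], [0, 2, 0], [0, -2, 0],   # cluster 1
    [100, 100, 100],                                # noise
])
cluster_ids = np.array([0, 0, 1, 1, 1, 1, -1])
feats = cluster_features(coords, cluster_ids)
```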

fit(segment: SegmentXYZ, features_to_use: list[str] = ['radius_of_gyration', 'volume', 'numerosity'], method: str = 'proportional', threshold: float = 0.2, custom_limits: dict = {}, n_jobs: int = 1, verbose: bool = False) → None

Find clusters to retain based on their features and the specified thresholds.

Parameters:
  • segment (SegmentXYZ) – Segment to filter

  • features_to_use (list) – which features to use. Any subset of radius_of_gyration, volume, numerosity

  • method (str) – ‘proportional’ or ‘percentile’ or ‘custom’

  • threshold (float) – threshold to use for the filtering. If method is ‘proportional’, it is the fraction of the maximum value of each feature to use as limit. If method is ‘percentile’, it is the percentile to use as limit (0-100)

  • custom_limits (dict) – in case of ‘custom’ method, which limits should be used for the features. It is expected to be a dictionary with feature names as keys and floats as values

  • n_jobs (int) – how many cpus to use for the computation
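The three ways of deriving limits described above can be sketched as follows. The helper names are hypothetical, and treating each limit as a lower bound (a cluster is retained when every feature is at or above its limit) is an assumption; the reference does not state the direction of the comparison.

```python
import numpy as np

def compute_limits(features, method="proportional", threshold=0.2, custom_limits=None):
    """features: dict mapping feature name -> per-cluster values."""
    limits = {}
    for name, values in features.items():
        values = np.asarray(values, dtype=float)
        if method == "proportional":
            limits[name] = float(threshold * values.max())    # fraction of the maximum
        elif method == "percentile":
            limits[name] = float(np.percentile(values, threshold))  # threshold in 0-100
        elif method == "custom":
            limits[name] = float(custom_limits[name])
        else:
            raise ValueError(f"unknown method: {method!r}")
    return limits

def retained_mask(features, limits):
    """Assumed rule: retain clusters with every feature >= its limit."""
    n_clusters = len(next(iter(features.values())))
    mask = np.ones(n_clusters, dtype=bool)
    for name, values in features.items():
        mask &= np.asarray(values, dtype=float) >= limits[name]
    return mask

features = {
    "numerosity": [120, 15, 60, 500],
    "radius_of_gyration": [30.0, 5.0, 40.0, 50.0],
}
prop_limits = compute_limits(features, method="proportional", threshold=0.2)
where_retain = retained_mask(features, prop_limits)
pct_limits = compute_limits(features, method="percentile", threshold=50)
```

With the proportional method and threshold 0.2, the numerosity limit is 20% of the largest cluster's point count, so small clusters are dropped regardless of their other features.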

plot(label_clusters: bool = False, show_all: bool = False)

Display plots representing the clusters features and which have been filtered out.

Parameters:
  • label_clusters (bool) – whether to label the clusters on the plots, by default False

  • show_all (bool) – whether to show all clusters or just the retained ones, by default False

writeMRCs(mrc_path: str, filename_pattern: str = 'cluster', verbose: bool = False)

Write MRC files for every cluster found in the segment

Parameters:
  • mrc_path (str) – Path to the directory where to save the MRC files

  • filename_pattern (str, optional) – Pattern for naming the files, by default ‘cluster’

  • verbose (bool, optional) – Whether to print progress messages, by default False

cima.detection.clusters.getPointwiseDensity(coords: ndarray, radius: float, n_jobs: int = 1, verbose: bool = False) → ndarray

Estimates the pointwise density of a set of points.

Parameters:
  • coords (np.ndarray) – A 2-dimensional numpy array with each row representing the location of a point

  • radius (float) – Radius inside of which to count neighbors

  • n_jobs (int) – Number of CPUs to use for the computation, by default 1

Returns:

A numpy array representing the density of each point in coords, estimated as the count of neighbors inside the specified radius divided by the volume of the corresponding sphere.

Return type:

np.ndarray
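The density estimate described above (neighbor count inside the radius divided by the volume of the sphere) can be written compactly with NumPy. This brute-force sketch is an illustration only: the function name is hypothetical, and excluding the point itself from its own neighbor count is an assumption.

```python
import numpy as np

def pointwise_density_sketch(coords, radius):
    """Neighbors within `radius`, divided by the sphere volume."""
    coords = np.asarray(coords, dtype=float)
    # Brute-force pairwise distances (O(n^2) memory; small inputs only).
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    counts = (dists <= radius).sum(axis=1) - 1  # assumption: exclude the point itself
    volume = 4.0 / 3.0 * np.pi * radius ** 3
    return counts / volume

coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
density = pointwise_density_sketch(coords, radius=2.0)
```

For large point sets a spatial index (for example a k-d tree) would replace the pairwise distance matrix, which is presumably where a parallel n_jobs setting becomes useful.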

cima.detection.clusters.search_epsilon(xyz: ndarray, n_neighbors: int = 0, show: bool = True, n_cpus: int = 1) → int

Given point coordinates, this function estimates the optimal value of the epsilon parameter for the DBSCAN clustering algorithm. Run for a varied number of neighbors to find the optimal epsilon value.

Parameters:
  • xyz (np.ndarray) – Array of shape (n_samples, 3) containing the 3D coordinates of the points.

  • n_neighbors (int, optional) – The number of neighbors to use for the nearest neighbors search, by default 0. If 0, it is set to 2 * len(xyz[0]) - 1.

  • show (bool, optional) – If True, the function plots a graph of the sorted distances and the estimated elbow point, by default True

  • n_cpus (int, optional) – The number of jobs to run in parallel, by default 1

Returns:

The estimated optimal value of the epsilon parameter for the DBSCAN clustering algorithm.

Return type:

int
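A common way to estimate epsilon is the k-distance heuristic: sort each point's distance to its k-th nearest neighbor and look for the elbow of that curve. The sketch below follows that idea with the default k = 2 * len(xyz[0]) - 1 stated above. The elbow rule used here (largest deviation from the chord joining the curve's endpoints) and both function names are assumptions, not necessarily what search_epsilon does.

```python
import numpy as np

def k_distance_curve(xyz, n_neighbors=0):
    """Sorted distances from each point to its k-th nearest neighbor."""
    xyz = np.asarray(xyz, dtype=float)
    if n_neighbors == 0:
        n_neighbors = 2 * xyz.shape[1] - 1  # 5 for 3D coordinates
    dists = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    # Column 0 of each sorted row is the point itself (distance 0).
    kth = np.sort(dists, axis=1)[:, n_neighbors]
    return np.sort(kth)

def elbow_index(curve):
    """Index with the largest deviation from the chord (assumed rule)."""
    curve = np.asarray(curve, dtype=float)
    x = np.linspace(0.0, 1.0, len(curve))
    span = curve[-1] - curve[0]
    y = (curve - curve[0]) / span if span > 0 else np.zeros_like(curve)
    return int(np.argmax(np.abs(y - x)))

# Seven points evenly spaced along the x axis.
line = np.array([[float(i), 0.0, 0.0] for i in range(7)])
curve = k_distance_curve(line)             # default k = 5
curve_k1 = k_distance_curve(line, n_neighbors=1)
elbow = elbow_index([0, 0, 0, 0, 10])
```

The epsilon estimate would then be the curve value at the elbow index, converted to an integer to match the documented return type.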