ComScan.neurocombat module

Author: Alexandre CARRE
Created on: Jan 14, 2021
class ComScan.neurocombat.AutoCombat(features, sites_features=None, sites=None, size_min=10, metric='distortion', use_ref_site=False, scaler_clustering=StandardScaler(), discrete_cluster_features=None, continuous_cluster_features=None, features_reduction=None, n_components=2, threshold_missing_sites_features=25, drop_site_columns=False, discrete_combat_covariates=None, continuous_combat_covariates=None, empirical_bayes=True, parametric=True, mean_only=False, return_only_features=False, n_jobs=1, random_state=123, copy=True)[source]

Bases: ComScan.neurocombat.Combat

Harmonize/normalize features using Combat’s parametric empirical Bayes framework.

Combat needs well-defined acquisition sites or scanners to harmonize features. An acquisition site can be hard to define, for example when two sites use very similar imaging parameters. ComScan can determine the number of sites and assign samples to them automatically by clustering imaging features (e.g. DICOM tags). ComScan can therefore be applied to data not seen during training: it predicts which of the scanners seen during training best matches each new sample.
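The idea of finding sites by clustering can be sketched with plain scikit-learn. This is an illustration only, not ComScan's implementation: the feature values are synthetic, the number of clusters is chosen with the silhouette score, and the plain KMeans used here does not enforce the size_min constraint.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical acquisition parameters (e.g. EchoTime, RepetitionTime), one row per scan.
rng = np.random.default_rng(0)
scanner_a = rng.normal(loc=[30.0, 2000.0], scale=[1.0, 20.0], size=(20, 2))
scanner_b = rng.normal(loc=[90.0, 6000.0], scale=[1.0, 20.0], size=(20, 2))
train = np.vstack([scanner_a, scanner_b])

# Scale the continuous features so that no single tag dominates the distance.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)

# Try several cluster counts and keep the one with the best silhouette score.
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=123).fit_predict(train_scaled)
    scores[k] = silhouette_score(train_scaled, labels)
best_k = max(scores, key=scores.get)

# Refit with the selected k; the cluster labels play the role of sites.
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=123).fit(train_scaled)

# An unseen scan is assigned to the closest cluster found during training.
new_scan = np.array([[31.0, 2010.0]])
predicted_site = kmeans.predict(scaler.transform(new_scan))[0]
```

The same fitted scaler and clustering model are reused for the new scan, which is why the sites features must be identical between fit and transform.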

Parameters
  • features (Target features to be harmonized.) –

  • sites_features (Target variables used to define the sites (acquisition sites or scanners) by clustering.) –

  • sites (Target variable for ComScan problems (e.g. acquisition sites or scanner)) – This argument is optional. If it is provided, traditional Combat is run; otherwise AutoCombat is run. When sites is provided, the arguments sites_features, size_min, metric, scaler_clustering, discrete_cluster_features, continuous_cluster_features, threshold_missing_sites_features and drop_site_columns are unused.

  • size_min (Minimum size of a site (cluster), imposed as a constraint during clustering.) –

  • metric ("distortion", "silhouette" or "calinski_harabasz".) – Metric used to select the optimal number of clusters. Default: distortion.

  • use_ref_site (Use a reference site for batch adjustment.) – The reference site is the cluster with the minimal inertia, i.e. the smallest within-cluster sum of squares. Default is False.

  • scaler_clustering (Scaler to use for continuous site features. Must be a scikit-learn scaler.) – Default is StandardScaler().

  • discrete_cluster_features (Categorical sites_features to one-hot encode (e.g. ManufacturerModelName)) –

  • continuous_cluster_features (Continuous sites_features to scale (e.g. EchoTime)) –

  • features_reduction (Method for reduction of the embedded space with n_components. Can be 'pca' or 'umap'.) – Default is None.

  • n_components (Dimension of the embedded space for features reduction.) – Default is 2.

  • threshold_missing_sites_features (Maximum acceptable percentage of missing values for a sites feature used in clustering.) – A value of 25 means that 75% of all samples must have this feature. Default is 25.

  • drop_site_columns (Drop the site columns found by clustering from the returned data.) – Default is False.

  • discrete_combat_covariates (Target covariates which are categorical (e.g. male or female)) –

  • continuous_combat_covariates (Target covariates which are continuous (e.g. age)) –

  • empirical_bayes (Perform empirical Bayes.) – Default is True.

  • parametric (Perform parametric adjustments.) – Default is True.

  • mean_only (Adjust only the mean (no scaling)) – Default is False.

  • return_only_features (Return only features.) – Default is False.

  • n_jobs (The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.) – If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. Default is 1.

  • random_state (int, RandomState instance or None, optional, default: 123) – If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.

  • copy (Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array)) – Default is True.
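As an illustration of threshold_missing_sites_features, the following pandas sketch drops sites features with too many missing values and fills the remaining gaps with the column mean (compare the clustering_data_features_mean_ attribute). The column names and values are hypothetical, and treating the threshold as inclusive is an assumption rather than documented behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical sites features extracted from DICOM tags.
sites_features = pd.DataFrame({
    "EchoTime": [30.0, 32.0, np.nan, 31.0],       # 25% missing: kept
    "FlipAngle": [np.nan, np.nan, 90.0, np.nan],  # 75% missing: dropped
})
threshold_missing = 25  # percentage of missing values tolerated per feature

# Percentage of missing values per column.
missing_percent = sites_features.isna().mean() * 100

# Assumption: a feature is kept when its missing percentage does not exceed the threshold.
kept = sites_features.loc[:, missing_percent <= threshold_missing]

# Fill the remaining gaps with the training mean of each kept feature.
feature_means = kept.mean().to_dict()
imputed = kept.fillna(feature_means)
```

Storing feature_means at fit time allows the same values to be reused when new data is transformed.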

cls_
Type

clustering classifier object

info_clustering_
Type

Dictionary that stores clustering info from sites_features with cluster_nb, labels, ref_label, wicss_clusters, best_wicss_cluster

cls_feature_reduction_
Type

feature reduction object

clustering_data_features_mean_
Type

dict of mean for clustering data (use for imputation)

X_hat_
Type

array after fit

clustering_data_features_
Type

column features for clustering from train (after encoding + scaling)

clustering_data_discrete_features_
Type

column features for clustering after one-hot encoding

dict_cls_fitted
Type

dict of columns of fitted cls used for fitted clustering data
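When use_ref_site is enabled, the reference site is described as the cluster with the minimal inertia. A minimal numpy sketch of that selection follows; the helper name is hypothetical, and whether ComScan normalizes the sum of squares by cluster size is not stated here, so the raw sum is an assumption.

```python
import numpy as np

def within_cluster_sum_of_squares(data, labels):
    """Return {label: sum of squared distances to that cluster's centroid}."""
    wicss = {}
    for label in np.unique(labels):
        members = data[labels == label]
        centroid = members.mean(axis=0)
        wicss[int(label)] = float(((members - centroid) ** 2).sum())
    return wicss

# Two toy clusters: cluster 0 is spread out, cluster 1 is compact.
data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 10.2]])
labels = np.array([0, 0, 1, 1])

wicss_clusters = within_cluster_sum_of_squares(data, labels)

# The most compact cluster is taken as the reference site.
ref_label = min(wicss_clusters, key=wicss_clusters.get)
```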

Examples

>>> import pandas as pd
>>> from ComScan.neurocombat import AutoCombat
>>> data = pd.DataFrame([{"features_1": 0.97, "site_features_0": 2, "site_features_1": 0},
...                      {"features_1": 1.35, "site_features_0": 1.01, "site_features_1": 1},
...                      {"features_1": 1.43, "site_features_0": 1.09, "site_features_1": 1},
...                      {"features_1": 0.85, "site_features_0": 2.3, "site_features_1": 0}])
>>> auto_combat = AutoCombat(features=["features_1"], sites_features=["site_features_0", "site_features_1"],
...                          continuous_cluster_features=["site_features_0", "site_features_1"], size_min=2)
>>> print(auto_combat.fit(data))
AutoCombat(continuous_cluster_features=['site_features_0', 'site_features_1'],
       discrete_cluster_features=[], features=['features_1'],
       sites=['sites'],
       sites_features=['site_features_0', 'site_features_1'], size_min=2)

Notes

NaN values are not handled.

Warning

Make sure the sites features are the same for fit and transform. By design, no input format is imposed, so neither column names nor slices are checked.
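Because nothing is checked, it can help to align the columns on the caller's side. The following pandas sketch remembers the one-hot columns produced at fit time and reindexes new data to them; the model names are hypothetical and this is not ComScan's own mechanism.

```python
import pandas as pd

train = pd.DataFrame({"ManufacturerModelName": ["Aera", "Skyra", "Aera"]})
new = pd.DataFrame({"ManufacturerModelName": ["Skyra", "Prisma"]})

# One-hot encode at fit time and remember the resulting columns.
train_encoded = pd.get_dummies(train, columns=["ManufacturerModelName"], dtype=int)
fitted_columns = list(train_encoded.columns)

# At transform time, align to the fitted columns: categories unseen during
# fit are dropped and categories absent from the new data are filled with 0.
new_encoded = pd.get_dummies(new, columns=["ManufacturerModelName"], dtype=int)
new_aligned = new_encoded.reindex(columns=fitted_columns, fill_value=0)
```

A scan from an unseen model ("Prisma" here) ends up with all-zero indicator columns, which is worth flagging before clustering rather than silently accepting.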

fit(X, *y)[source]

Compute sites and ref_site using clustering. Then compute the standardized mean, pooled variance, gamma star and delta star that Combat uses later to adjust the data.

Parameters
  • X (array-like or DataFrame of shape (n_samples, n_features)) – Requires the columns needed by the ComScan(). The data used to find adjustments.

  • *y (y in scikit learn: None) – Ignored.

Returns

self – Fitted ComScan estimator.

Return type

object

transform(X)[source]

Scale features of X according to combat estimator.

Parameters

X (array-like or DataFrame of shape (n_samples, n_features) Requires the columns needed by the Combat()) – Input data that will be transformed.

Returns

Xt – Transformed data.

Return type

array-like of shape (n_samples, n_features)

class ComScan.neurocombat.Combat(features, sites, discrete_covariates=None, continuous_covariates=None, ref_site=None, empirical_bayes=True, parametric=True, mean_only=False, return_only_features=False, raise_ref_site=True, copy=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Harmonize/normalize features using Combat’s parametric empirical Bayes framework

Parameters
  • features (Target features to be harmonized) –

  • sites (Target variable for ComScan problems (e.g. acquisition sites or scanner)) –

  • discrete_covariates (Target covariates which are categorical (e.g. male or female)) –

  • continuous_covariates (Target covariates which are continuous (e.g. age)) –

  • ref_site (Variable value (acquisition sites or scanner) to be used as reference for batch adjustment.) – Default is None.

  • empirical_bayes (Perform empirical Bayes.) – Default is True.

  • parametric (Perform parametric adjustments.) – Default is True.

  • mean_only (Adjust only the mean (no scaling)) – Default is False.

  • return_only_features (Return only features.) – Default is False.

  • raise_ref_site (Raise an error when the reference site passed as argument does not exist; otherwise fall back to no reference.) – Default is True.

  • copy (Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array)) – Default is True.

info_dict_fit_
Type

dictionary that stores batch info of fitted data with: batch_levels, ref_level, n_batch, n_sample, sample_per_batch, batch_info

stand_mean_

Standardized mean

Type

array-like

var_pooled_

Variance pooled

Type

array-like

mod_mean_

Mod mean

Type

array-like

gamma_star_

Adjustment gamma star

Type

array-like

delta_star_

Adjustment delta star

Type

array-like

info_dict_transform_
Type

dictionary that stores batch info of transformed data with: batch_levels, ref_level, n_batch, n_sample, sample_per_batch, batch_info

Examples

>>> import pandas as pd
>>> from ComScan.neurocombat import Combat
>>> data = pd.DataFrame([{"features_1": 0.97, "features_2": 2, "sites": 0},
...                      {"features_1": 1.35, "features_2": 1.01, "sites": 1},
...                      {"features_1": 1.43, "features_2": 1.09, "sites": 1},
...                      {"features_1": 0.85, "features_2": 2.3, "sites": 0}])
>>> combat = Combat(features=["features_1", "features_2"], sites=["sites"], ref_site=1)
>>> print(combat.fit(data))
Combat(continuous_covariates=[], discrete_covariates=[],
   features=['features_1', 'features_2'], ref_site=1, sites=['sites'])
>>> print(combat.gamma_star_)
[[-11.85476756  27.30493785]
[  0.           0.        ]]
>>> print(combat.transform(data))
[[1.40593957 1.01395564 0.        ]
[1.35       1.01       1.        ]
[1.43       1.09       1.        ]
[1.37064296 1.08999992 0.        ]]

Notes

NaN values are not handled.

fit(X, *y)[source]

Compute the standardized mean, pooled variance, gamma star and delta star used later to adjust the data.

Parameters
  • X (array-like or DataFrame of shape (n_samples, n_features) Requires the columns needed by the Combat()) – The data used to find adjustments.

  • *y (y in scikit learn: None) – Ignored.

Returns

self – Fitted combat estimator.

Return type

object
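To make the fitted quantities more concrete, here is a simplified numpy sketch of a location/scale adjustment in the spirit of Combat. It is deliberately reduced: no covariates, no reference site and no empirical Bayes shrinkage, so the raw per-site estimates stand in for gamma star and delta star. The function names are hypothetical, and the sketch is not expected to reproduce ComScan's numerical output.

```python
import numpy as np

def fit_location_scale(features, sites):
    """Estimate the grand mean, pooled variance and per-site shift/scale."""
    site_levels = np.unique(sites)
    stand_mean = features.mean(axis=0)  # grand mean per feature

    # Pooled variance: residual variance after removing each site's mean.
    residuals = features.astype(float)
    for site in site_levels:
        rows = sites == site
        residuals[rows] -= features[rows].mean(axis=0)
    var_pooled = (residuals ** 2).mean(axis=0)

    # Per-site mean (gamma) and variance (delta) of the standardized data.
    standardized = (features - stand_mean) / np.sqrt(var_pooled)
    gamma_hat = np.vstack([standardized[sites == s].mean(axis=0) for s in site_levels])
    delta_hat = np.vstack([standardized[sites == s].var(axis=0, ddof=1) for s in site_levels])
    return site_levels, stand_mean, var_pooled, gamma_hat, delta_hat

def transform_location_scale(features, sites, fitted):
    """Remove the per-site shift and scale, then restore the original units."""
    site_levels, stand_mean, var_pooled, gamma_hat, delta_hat = fitted
    standardized = (features - stand_mean) / np.sqrt(var_pooled)
    adjusted = np.full_like(standardized, np.nan)  # rows of unknown sites stay NaN
    for index, site in enumerate(site_levels):
        rows = sites == site
        adjusted[rows] = (standardized[rows] - gamma_hat[index]) / np.sqrt(delta_hat[index])
    return adjusted * np.sqrt(var_pooled) + stand_mean

# Toy data: one feature, two sites with different offsets and spreads.
features = np.array([[1.0], [2.0], [3.0], [10.0], [12.0], [14.0]])
sites = np.array([0, 0, 0, 1, 1, 1])
fitted = fit_location_scale(features, sites)
harmonized = transform_location_scale(features, sites, fitted)
```

After the adjustment both sites share the grand mean and the pooled variance. The sketch assumes every feature has non-zero variance within each site.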

load_fit(filepath)[source]

Load the fitted model attributes info_dict_fit_, stand_mean_, var_pooled_, gamma_star_ and delta_star_.

Parameters

filepath (str) – filepath of the pkl file to load

Return type

None

save_fit(filepath)[source]

Save the fitted model attributes info_dict_fit_, stand_mean_, var_pooled_, gamma_star_ and delta_star_.

Parameters

filepath (str) – Path where the model is saved. If the path has no .pkl extension, it is added.

Return type

None
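The save/load behaviour described above can be approximated with the standard pickle module. This is a sketch under assumptions, not ComScan's code: the helper names are hypothetical, a SimpleNamespace stands in for a fitted estimator, and unlike save_fit the save helper returns the final path for convenience.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace

FITTED_ATTRIBUTES = ("info_dict_fit_", "stand_mean_", "var_pooled_",
                     "gamma_star_", "delta_star_")

def save_fitted(estimator, filepath):
    """Pickle the fitted attributes, appending .pkl when it is missing."""
    path = Path(filepath)
    if path.suffix != ".pkl":
        path = path.with_name(path.name + ".pkl")
    state = {name: getattr(estimator, name) for name in FITTED_ATTRIBUTES}
    with path.open("wb") as handle:
        pickle.dump(state, handle)
    return path

def load_fitted(estimator, filepath):
    """Restore previously pickled fitted attributes onto an estimator."""
    with Path(filepath).open("rb") as handle:
        state = pickle.load(handle)
    for name, value in state.items():
        setattr(estimator, name, value)

# Stand-in for a fitted estimator, with made-up values.
fitted = SimpleNamespace(info_dict_fit_={"n_batch": 2}, stand_mean_=[1.0],
                         var_pooled_=[0.5], gamma_star_=[[0.1]], delta_star_=[[1.2]])

with tempfile.TemporaryDirectory() as folder:
    saved_path = save_fitted(fitted, Path(folder) / "combat_model")
    saved_name = saved_path.name
    restored = SimpleNamespace()
    load_fitted(restored, saved_path)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be loaded this way.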

transform(X)[source]

Scale features of X according to combat estimator.

Parameters

X (array-like or DataFrame of shape (n_samples, n_features) Requires the columns needed by the Combat()) – Input data that will be transformed.

Returns

Xt – Transformed data.

Return type

array-like of shape (n_samples, n_features)

class ComScan.neurocombat.ImageCombat(image_path, sites_features=None, sites=None, save_path_fit='fit_data', save_path_transform='transform_data', size_min=10, method='silhouette', use_ref_site=False, scaler_clustering=StandardScaler(), discrete_cluster_features=None, continuous_cluster_features=None, features_reduction=None, n_components=2, threshold_missing_sites_features=25, drop_site_columns=True, discrete_combat_covariates=None, continuous_combat_covariates=None, empirical_bayes=True, parametric=True, mean_only=False, random_state=123, flattened_dtype=<class 'numpy.float16'>, output_dtype=<class 'numpy.float32'>, copy=True)[source]

Bases: ComScan.neurocombat.AutoCombat

Harmonize/normalize features using Combat’s parametric empirical Bayes framework directly on image.

ImageCombat harmonizes/normalizes a set of NIfTI images. All images must have the same dimensions and orientation. A common mask is created based on a heuristic proposed by T. Nichols. Images are then vectorized for ComScan. ImageCombat can be used either as Combat (well-defined sites) or as AutoCombat (sites found by clustering).
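The masking and vectorizing steps can be pictured with a small numpy sketch. Synthetic arrays stand in for NIfTI volumes, and the mask rule used here (voxels that are non-zero in every image) is a simplifying assumption, not the heuristic attributed to T. Nichols. The dtypes mirror the flattened_dtype and output_dtype defaults in the signature.

```python
import numpy as np

# Three hypothetical 4x4x4 volumes with identical shape and orientation.
rng = np.random.default_rng(0)
volumes = rng.uniform(1.0, 2.0, size=(3, 4, 4, 4)).astype(np.float32)
volumes[:, 0, :, :] = 0.0  # background slab shared by all images

# Assumption: the common mask keeps voxels that are non-zero in every image.
mask = np.all(volumes != 0, axis=0)

# Vectorize: one row per image, one column per voxel inside the mask.
flattened = volumes[:, mask].astype(np.float16)

# ... harmonization would operate on `flattened` here ...

# Write the (harmonized) vectors back into image space.
restored = np.zeros(volumes.shape, dtype=np.float32)
restored[:, mask] = flattened.astype(np.float32)
```

Storing the flattened array as float16 halves memory use compared with float32, at the cost of a small rounding error in each voxel value.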

Parameters
  • image_path (Path of the NIfTI files.) –

  • sites_features (Target variables used to define the sites (acquisition sites or scanners) by clustering.) –

  • sites (Target variable for ComScan problems (e.g. acquisition sites or scanner)) – This argument is optional. If it is provided, traditional Combat is run. In this case the arguments sites_features, size_min, method, scaler_clustering, discrete_cluster_features, continuous_cluster_features, threshold_missing_sites_features and drop_site_columns are unused.

  • size_min (Minimum size of a site (cluster), imposed as a constraint during clustering.) –

  • method ("silhouette" or "elbow". Method used to select the optimal number of clusters.) – Default is silhouette.

  • use_ref_site (Use a reference site for batch adjustment.) – The reference site is the cluster with the minimal inertia, i.e. the smallest within-cluster sum of squares. Default is False.

  • scaler_clustering (Scaler to use for continuous site features. Must be a scikit-learn scaler.) – Default is StandardScaler().

  • discrete_cluster_features (Categorical sites_features to one-hot encode (e.g. ManufacturerModelName)) –

  • continuous_cluster_features (Continuous sites_features to scale (e.g. EchoTime)) –

  • features_reduction (Method for reduction of the embedded space with n_components. Can be 'pca' or 'umap'.) – Default is None.

  • n_components (Dimension of the embedded space for features reduction.) – Default is 2.

  • threshold_missing_sites_features (Maximum acceptable percentage of missing values for a sites feature used in clustering.) – A value of 25 means that 75% of all samples must have this feature. Default is 25.

  • drop_site_columns (Drop the site columns found by clustering from the returned data.) – Default is True.

  • discrete_combat_covariates (Target covariates which are categorical (e.g. male or female)) –

  • continuous_combat_covariates (Target covariates which are continuous (e.g. age)) –

  • empirical_bayes (Perform empirical Bayes.) – Default is True.

  • parametric (Perform parametric adjustments.) – Default is True.

  • mean_only (Adjust only the mean (no scaling)) – Default is False.

  • random_state (int, RandomState instance or None, optional, default: 123) – If int, random_state is the seed used by the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • copy (Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array)) – Default is True.

mask_
Type

array-like of the common brain mask

flattened_array_
Type

flattened array of all the training set

Notes

NaN values are not handled.

fit(X, *y)[source]

Compute sites and ref_site using clustering. Then compute the standardized mean, pooled variance, gamma star and delta star that Combat uses later to adjust the data.

Parameters
  • X (array-like or DataFrame of shape (n_samples, n_features)) – Requires the columns needed by the ComScan(). The data used to find adjustments.

  • *y (y in scikit learn: None) – Ignored.

Returns

self – Fitted ComScan estimator.

Return type

object

transform(X)[source]

Scale the images according to the Combat estimator and save them.

Parameters

X (array-like or DataFrame of shape (n_samples, n_features) Requires the columns needed by the ImageCombat()) – Input data that will be transformed.

Returns

None. The transformed images are saved.

Return type

None