ComScan.utils module

Author: Alexandre CARRE
Created on: Jan 14, 2021
ComScan.utils.check_exist_vars(df, _vars)[source]

Check that a list of column names exists in a DataFrame.

Parameters
  • df (DataFrame) – a DataFrame

  • _vars (List) – list of column names to check

Return type

ndarray

Returns

indices of the column names

Raise

ValueError if any of the requested features is missing
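
The behaviour described above can be sketched with plain pandas. This is a minimal illustration, not the ComScan implementation: the name check_exist_vars_sketch, the error message, and the sample DataFrame are hypothetical.

```python
from typing import List

import numpy as np
import pandas as pd


def check_exist_vars_sketch(df: pd.DataFrame, _vars: List[str]) -> np.ndarray:
    """Return the positional index of each name in _vars, or raise ValueError."""
    missing = [name for name in _vars if name not in df.columns]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Index.get_indexer maps labels to integer positions (-1 when absent,
    # which cannot happen here because missing names were rejected above).
    return df.columns.get_indexer(_vars)


# Hypothetical sample data
df = pd.DataFrame({"age": [31, 45], "site": ["A", "B"], "volume": [1.2, 0.9]})
print(check_exist_vars_sketch(df, ["volume", "age"]))
```

The indices are returned in the order of the query, not in the order of the DataFrame columns.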

ComScan.utils.check_is_nii_exist(input_file_path)[source]

Check if a NIfTI file (.nii or .nii.gz) exists.

Parameters

input_file_path (str) – path to the .nii or .nii.gz file.

Return type

str

Returns

the input path if the file exists; otherwise an error is raised.

Raise

FileNotFoundError or FileExistsError
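
A possible shape for such a check, using only the standard library. This page does not say which condition triggers which exception, so the split below is an assumption, and check_nii_exists_sketch is a hypothetical name.

```python
import os
import tempfile


def check_nii_exists_sketch(input_file_path: str) -> str:
    """Return the path unchanged if it is an existing .nii or .nii.gz file."""
    if not os.path.isfile(input_file_path):
        raise FileNotFoundError(f"No such file: {input_file_path}")
    # Assumption: an existing file with another extension is reported with
    # FileExistsError; the real function may split the two errors differently.
    if not input_file_path.endswith((".nii", ".nii.gz")):
        raise FileExistsError(f"Not a .nii or .nii.gz file: {input_file_path}")
    return input_file_path


with tempfile.TemporaryDirectory() as tmp_dir:
    nii_path = os.path.join(tmp_dir, "scan.nii.gz")
    open(nii_path, "wb").close()  # empty placeholder, not a valid image
    found = check_nii_exists_sketch(nii_path)
    print(found == nii_path)
    try:
        check_nii_exists_sketch(os.path.join(tmp_dir, "absent.nii"))
    except FileNotFoundError as err:
        print("missing:", type(err).__name__)
```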

ComScan.utils.column_var_dtype(df, identify_dtypes=('object'))[source]

Identify the columns of a DataFrame that have the given dtypes.

Parameters
  • df (DataFrame) – input dataframe

  • identify_dtypes (Sequence[str]) – pandas dtypes to identify

Return type

DataFrame

Returns

summary DataFrame with the column index and column name of every column whose dtype is in identify_dtypes

ComScan.utils.fix_columns(df, columns, inplace=False, extra_nans=False)[source]

Fix the columns of the test set when the training set was encoded with pd.get_dummies.

Parameters
  • df (DataFrame) – input dataframe

  • columns (List[str]) – columns of the original dataframe

  • inplace (bool) – If False, return a copy. Otherwise, do operation inplace and return None

  • extra_nans (bool) – set extra columns to NaN based on the one-hot-encoded columns

Return type

DataFrame

Returns

the corrected DataFrame for the test set
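
The problem this function addresses can be reproduced with plain pandas: encoding the training and test sets separately yields different dummy columns. The sketch below aligns them with DataFrame.reindex, which is one common approach and not necessarily what fix_columns does internally; the extra_nans option is not covered, and the data are hypothetical.

```python
import pandas as pd

# Hypothetical train/test sets: site "D" only appears in the test set,
# sites "B" and "C" only in the training set.
train = pd.DataFrame({"site": ["A", "B", "C"], "age": [31, 45, 52]})
test = pd.DataFrame({"site": ["A", "D"], "age": [28, 60]})

train_encoded = pd.get_dummies(train, columns=["site"])
test_encoded = pd.get_dummies(test, columns=["site"])

# Align the test set on the training columns: dummy columns missing from the
# test set are added and filled with 0, columns unseen in training (site_D)
# are dropped, and the column order follows the training set.
test_fixed = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_fixed.columns))
```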

ComScan.utils.get_column_index(df, query_cols)[source]

Get column indices from column names.

Parameters
  • df (DataFrame) – input dataframe

  • query_cols (List[str]) – list of column names

Return type

ndarray

Returns

array of column indices

ComScan.utils.load_nifty_volume_as_array(input_path_file)[source]

Load a NIfTI image into a numpy array with [z, y, x] axis order. The output array has shape [Depth, Height, Width].

Parameters

input_path_file (str) – path of the input file; should be ‘.nii’ or ‘.nii.gz’

Return type

Tuple[ndarray, Tuple[Tuple, Tuple, Tuple]]

Returns

the numpy data array and its header tuple

ComScan.utils.mat_to_bytes(nrows, ncols, dtype=32, out='GB')[source]

Calculate the memory size of a numpy array (matrix), expressed in the given unit.

Parameters
  • nrows (int) – the number of rows of the matrix.

  • ncols (int) – the number of columns of the matrix.

  • dtype (int) – the size of each element in the matrix, in bits. Defaults to 32 bits.

  • out (str) – the output unit. Defaults to gigabytes (GB)

Return type

float

Returns

the size of the matrix in the given unit
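
The underlying arithmetic is rows × columns × bits per element ÷ 8, converted to the requested unit. A minimal sketch, assuming decimal (SI) units; whether ComScan uses powers of 1000 or 1024 is not stated here, and mat_size_sketch and its unit table are hypothetical.

```python
def mat_size_sketch(nrows: int, ncols: int, dtype: int = 32, out: str = "GB") -> float:
    """Size of an nrows x ncols matrix whose elements take `dtype` bits each."""
    # Assumption: decimal (SI) units; binary units would use powers of 1024.
    bytes_per_unit = {"B": 1, "KB": 1e3, "MB": 1e6, "GB": 1e9}
    n_bytes = nrows * ncols * dtype / 8  # 8 bits per byte
    return n_bytes / bytes_per_unit[out]


# 100000 * 10000 elements * 4 bytes each = 4e9 bytes
print(mat_size_sketch(100_000, 10_000))
```

Such an estimate is useful before allocating a large array, for example to decide whether the data fit in memory at 32 or 64 bits per element.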

ComScan.utils.one_hot_encoder(df, columns, drop_column=True, dummy_na=False, add_nan_columns=False, inplace=False)[source]

One-hot encode categorical features in the DataFrame, with the option of keeping NaN values. The categorical feature indices and names come from the cat_var function. These columns must have the “object” dtype.

Parameters
  • df (DataFrame) – input dataframe

  • columns (List[str]) – List of columns to encode

  • drop_column (bool) – Set to True to drop the original column after encoding. Defaults to True.

  • dummy_na (bool) – Add a column to indicate NaNs; if False, NaNs are ignored.

  • add_nan_columns (bool) – Add an empty NaN column if one was not created (can be used as an “other” category)

  • inplace (bool) – If False, return a copy. Otherwise, do operation inplace and return None

Return type

DataFrame

Returns

new DataFrame in which the columns are one-hot encoded
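
The effect of dummy_na can be illustrated with pd.get_dummies, which exposes a parameter of the same name. That one_hot_encoder builds on pd.get_dummies is an assumption suggested by the fix_columns entry, not something this page states; the data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing category
df = pd.DataFrame({"site": ["A", np.nan, "B"], "age": [31, 45, 52]})

# dummy_na=False: the NaN row gets 0 in every dummy column.
ignored = pd.get_dummies(df, columns=["site"], dummy_na=False)

# dummy_na=True: an extra site_nan column flags the missing value.
flagged = pd.get_dummies(df, columns=["site"], dummy_na=True)

print(list(ignored.columns))
print(list(flagged.columns))
```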

ComScan.utils.save_to_nii(im, header, output_dir, filename, mode='image', gzip=True)[source]

Save a numpy array to a NIfTI file (.nii, or .nii.gz when gzip is True).

Parameters
  • im (ndarray) – numpy array to save

  • header (Tuple[Tuple, Tuple, Tuple]) – header metadata (origin, spacing, direction).

  • output_dir (str) – Output directory.

  • filename (str) – Filename of the output file.

  • mode (str) – save as ‘image’ or ‘label’

  • gzip (bool) – compress the output with gzip (i.e., write .nii.gz)

Return type

None

ComScan.utils.scaler_encoder(df, columns, scaler=StandardScaler(), inplace=False)[source]

Apply a scikit-learn scaler to the given columns.

Parameters
  • df (DataFrame) – input dataframe

  • columns (List[str]) – List of columns to scale

  • scaler – scaler object from sklearn

  • inplace (bool) – If False, return a copy. Otherwise, do operation inplace and return None

Return type

DataFrame

Returns

  • df: the scaled DataFrame

  • dict_cls_fitted: dictionary mapping each column to its fitted scaler

ComScan.utils.split_filename(file_name)[source]

Split file_name into folder path name, basename, and extension name.

Parameters

file_name (str) – full path

Return type

Tuple[str, str, str]

Returns

path name, basename, extension name
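
A minimal sketch of such a split with os.path. Treating ‘.nii.gz’ as a single extension is an assumption motivated by the NIfTI helpers in this module, and split_filename_sketch and the example path are hypothetical.

```python
import os
from typing import Tuple


def split_filename_sketch(file_name: str) -> Tuple[str, str, str]:
    """Split a path into (directory, base name, extension)."""
    path, file = os.path.split(file_name)
    base, ext = os.path.splitext(file)
    # Assumption: a compressed NIfTI name keeps '.nii.gz' as one extension.
    if ext == ".gz" and base.endswith(".nii"):
        base, ext = base[: -len(".nii")], ".nii.gz"
    return path, base, ext


print(split_filename_sketch("/data/sub-01/T1.nii.gz"))
```

Without the special case, os.path.splitext alone would return ‘T1.nii’ as the base name and ‘.gz’ as the extension.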

ComScan.utils.tsne(df, columns, n_components=2, random_state=123, n_jobs=-1)[source]

t-distributed Stochastic Neighbor Embedding.

t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.

Parameters
  • df (DataFrame) – input dataframe

  • columns (List[str]) – List of columns to use

  • n_components (int) – Dimension of the embedded space. Default 2.

  • random_state (Optional[int]) – int, RandomState instance or None, optional, default: 123. If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.

  • n_jobs (Optional[int]) – default=-1. The number of parallel jobs to run for neighbors search. This parameter has no impact when metric="precomputed" or (metric="euclidean" and method="exact"). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

Returns

array-like with projections

ComScan.utils.u_map(df, columns, n_components=2, random_state=123, n_jobs=-1)[source]

Just like t-SNE, UMAP is a dimensionality reduction technique specifically designed for visualizing complex data in low dimensions (2D or 3D). As the number of data points increases, UMAP becomes more time-efficient than t-SNE.

Parameters
  • df (DataFrame) – input dataframe

  • columns (List[str]) – List of columns to use

  • n_components (int) – Dimension of the embedded space. Default 2.

  • random_state (Optional[int]) – int, RandomState instance or None, optional, default: 123. If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.

  • n_jobs (Optional[int]) – default=-1. The number of parallel jobs to run for neighbors search. This parameter has no impact when metric="precomputed" or (metric="euclidean" and method="exact"). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

Returns

array-like with projections