ComScan.utils module

Author: Alexandre CARRE

Created on: Jan 14, 2021

ComScan.utils.check_exist_vars(df, _vars)

Check that a list of column names exists in a DataFrame.

Parameters

df (DataFrame) – a DataFrame
_vars (List) – list of column names to check

Return type

ndarray

Returns

indices of the column names

Raises

ValueError if any of the requested features are missing
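
As an illustration only, the check described above can be approximated with plain pandas. The helper name check_exist_vars_sketch is hypothetical, and the use of Index.get_indexer is an assumption; the actual ComScan implementation may differ.

```python
import pandas as pd


def check_exist_vars_sketch(df, _vars):
    """Hypothetical stand-in: return the positions of `_vars` in df.columns."""
    missing = [name for name in _vars if name not in df.columns]
    if missing:
        # Mirrors the documented behaviour: missing features raise ValueError.
        raise ValueError(f"Missing features: {missing}")
    # get_indexer returns an ndarray of integer positions, in query order.
    return df.columns.get_indexer(_vars)


df = pd.DataFrame({"age": [61, 47], "sex": ["M", "F"], "site": ["A", "B"]})
idx = check_exist_vars_sketch(df, ["site", "age"])
```

The positions follow the order of the query list, not the order of the DataFrame columns.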

ComScan.utils.check_is_nii_exist(input_file_path)

Check that a NIfTI file exists.

Parameters: input_file_path (str) – path to the nii or nii.gz file.
Return type: str
Returns: the path string if the file exists; otherwise an error is raised.
Raises: FileNotFoundError or FileExistsError

ComScan.utils.column_var_dtype(df, identify_dtypes=('object'))

Identify the columns of a DataFrame that have the given dtypes.

Parameters

df (DataFrame) – input dataframe
identify_dtypes (Sequence[str]) – pandas dtypes to identify

Note

See https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes for the list of pandas dtypes.

Return type: DataFrame
Returns: summary DataFrame with the column index and column name of every column whose dtype is in identify_dtypes
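
A minimal sketch of such a summary, assuming it simply lists the position and name of each matching column. The helper name and the output column names col_index and col_name are hypothetical, not taken from ComScan. Note that ('object') without a trailing comma is a plain string in Python, so the sketch passes a real tuple.

```python
import pandas as pd


def column_var_dtype_sketch(df, identify_dtypes=("object",)):
    """Hypothetical stand-in: list columns whose dtype name is requested."""
    rows = [
        (position, name)
        for position, (name, dtype) in enumerate(df.dtypes.items())
        if str(dtype) in identify_dtypes
    ]
    # Assumption: the summary has one row per matching column.
    return pd.DataFrame(rows, columns=["col_index", "col_name"])


df = pd.DataFrame(
    {
        "age": pd.Series([61.0, 47.0], dtype="float64"),
        "site": pd.Series(["A", "B"], dtype="object"),
        "volume": pd.Series([1.2, 3.4], dtype="float64"),
    }
)
object_cols = column_var_dtype_sketch(df)
float_cols = column_var_dtype_sketch(df, identify_dtypes=("float64",))
```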

ComScan.utils.fix_columns(df, columns, inplace=False, extra_nans=False)

Fix the columns of the test set when the training set was encoded with pd.get_dummies.

Note

Inspired by: http://fastml.com/how-to-use-pd-dot-get-dummies-with-the-test-set

Parameters

df (DataFrame) – input dataframe
columns (List[str]) – columns of the original dataframe
inplace (bool) – If False, return a copy. Otherwise, perform the operation in place and return None
extra_nans (bool) – set extra columns to NaN, based on the one-hot-encoded columns

Return type

DataFrame

Returns

the corrected DataFrame for the test set
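
The problem this function addresses can be reproduced with pandas alone: encoding the training and test sets separately yields different dummy columns when their categories differ. The sketch below realigns the test set with DataFrame.reindex, which is an assumed substitute for illustration and not necessarily how fix_columns works internally; the data are made up.

```python
import pandas as pd

train = pd.DataFrame({"x": [1.0, 2.0], "color": ["red", "blue"]})
test = pd.DataFrame({"x": [3.0, 4.0], "color": ["red", "green"]})

# Encoding each set on its own gives mismatched columns.
train_enc = pd.get_dummies(train, columns=["color"])
test_enc = pd.get_dummies(test, columns=["color"])

# reindex drops the test-only column (color_green), adds the train-only
# column (color_blue) filled with 0, and restores the training column order.
test_fixed = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

After realignment the test set can be passed to a model fitted on the training columns.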

ComScan.utils.get_column_index(df, query_cols)

Get column indices from column names.

Parameters

df (DataFrame) – input dataframe
query_cols (List[str]) – list of column names

Return type

ndarray

Returns

array of column indices

ComScan.utils.load_nifty_volume_as_array(input_path_file)

Load a NIfTI image into a numpy array with [z, y, x] axis order. The output array has shape [Depth, Height, Width].

Parameters: input_path_file (str) – input file path; should end in ‘.nii’ or ‘.nii.gz’
Return type: Tuple[ndarray, Tuple[Tuple, Tuple, Tuple]]
Returns: a numpy data array and the header

ComScan.utils.mat_to_bytes(nrows, ncols, dtype=32, out='GB')

Calculate the size of a numpy array in bytes.

Note

code from: https://gist.github.com/dimalik/f4609661fb83e3b5d22e7550c1776b90

Parameters

nrows (int) – the number of rows of the matrix.
ncols (int) – the number of columns of the matrix.
dtype (int) – the size in bits of each element in the matrix. Defaults to 32 bits.
out (str) – the output unit. Defaults to gigabytes (GB)

Return type

float

Returns

the size of the matrix in the given unit
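
The arithmetic behind this estimate is short enough to sketch. The version below is a hypothetical re-implementation, not the ComScan code: it assumes dtype is given in bits, that units are binary (1 KB = 1024 bytes), and that the accepted unit names are the ones listed in the dictionary.

```python
def mat_to_bytes_sketch(nrows, ncols, dtype=32, out="GB"):
    """Hypothetical stand-in: memory footprint of an nrows x ncols matrix."""
    # Assumption: binary units, so each step up divides by 1024.
    exponents = {"B": 0, "KB": 1, "MB": 2, "GB": 3, "TB": 4}
    n_bytes = nrows * ncols * dtype / 8  # dtype is a size in bits
    return n_bytes / 1024 ** exponents[out]


# A 100,000 x 1,000 matrix of 64-bit floats takes roughly 0.745 GB.
size_gb = mat_to_bytes_sketch(100_000, 1_000, dtype=64, out="GB")
```

An estimate like this is useful before allocating a large feature matrix, to check that it fits in memory.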

ComScan.utils.one_hot_encoder(df, columns, drop_column=True, dummy_na=False, add_nan_columns=False, inplace=False)

Encode categorical features in the DataFrame, with the option of keeping NaN. The categorical feature indices and names come from the cat_var function. These columns need to be of “object” dtype.

Parameters

df (DataFrame) – input dataframe
columns (List[str]) – List of columns to encode
drop_column (bool) – Set to True to drop the original column after encoding. Defaults to True.
dummy_na (bool) – Add a column to indicate NaNs; if False, NaNs are ignored.
add_nan_columns (bool) – Add an empty NaN column if one was not created (can be used as an "other" category)
inplace (bool) – If False, return a copy. Otherwise, perform the operation in place and return None

Return type

DataFrame

Returns

new DataFrame in which the columns are one-hot encoded
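
The effect of dummy_na can be seen directly with pd.get_dummies, which accepts a parameter of the same name. This sketch uses pandas only and made-up data; whether one_hot_encoder forwards the flag unchanged is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [61, 47, 55], "site": ["A", np.nan, "B"]})

# dummy_na=False: the NaN row gets zeros in every indicator column.
ignored = pd.get_dummies(df, columns=["site"], dummy_na=False)

# dummy_na=True: an extra indicator column (site_nan) flags the NaN row.
flagged = pd.get_dummies(df, columns=["site"], dummy_na=True)
```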

ComScan.utils.save_to_nii(im, header, output_dir, filename, mode='image', gzip=True)

Save a numpy array to a NIfTI file (nii.gz by default).

Parameters

im (ndarray) – numpy array
header (Tuple[Tuple, Tuple, Tuple]) – header metadata (origin, spacing, direction).
output_dir (str) – Output directory.
filename (str) – Filename of the output file.
mode (str) – save as ‘image’ or ‘label’
gzip (bool) – gzip the output file (i.e., save as nii.gz)

Return type

None

ComScan.utils.scaler_encoder(df, columns, scaler=StandardScaler(), inplace=False)

Apply sklearn scaler to columns.

Parameters

df (DataFrame) – input dataframe
columns (List[str]) – List of columns to scale
scaler – scaler object from sklearn
inplace (bool) – If False, return a copy. Otherwise, perform the operation in place and return None

Return type

DataFrame

Returns

df: the scaled DataFrame
dict_cls_fitted: dict mapping each column to its fitted scaler
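
A minimal sketch of per-column scaling, assuming one scaler is fitted independently for each column and stored in a dict keyed by column name. The helper name is hypothetical and the use of sklearn.base.clone is an assumption about how a fresh scaler is obtained per column.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler


def scaler_encoder_sketch(df, columns, scaler=None):
    """Hypothetical stand-in: scale each column with its own fitted scaler."""
    scaler = StandardScaler() if scaler is None else scaler
    out = df.copy()  # leave the caller's DataFrame untouched
    fitted = {}
    for col in columns:
        col_scaler = clone(scaler)  # unfitted copy with the same settings
        out[col] = col_scaler.fit_transform(out[[col]]).ravel()
        fitted[col] = col_scaler
    return out, fitted


df = pd.DataFrame({"height": [150.0, 160.0, 170.0], "site": ["A", "B", "C"]})
scaled, fitted = scaler_encoder_sketch(df, ["height"])
```

Keeping the fitted scalers makes it possible to apply the same transformation to new data later with fitted["height"].transform(...).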

ComScan.utils.split_filename(file_name)

Split file_name into folder path name, basename, and extension name.

Parameters: file_name (str) – full path
Return type: Tuple[str, str, str]
Returns: path name, basename, extension name
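
A sketch of this kind of split using only os.path. The helper name is hypothetical, and treating ".nii.gz" as a single extension is an assumption based on the NIfTI files handled elsewhere in this module.

```python
import os


def split_filename_sketch(file_name):
    """Hypothetical stand-in: split a path into (folder, basename, extension)."""
    path = os.path.dirname(file_name)
    base, ext = os.path.splitext(os.path.basename(file_name))
    if ext == ".gz":
        # Assumption: keep a compound extension such as ".nii.gz" together.
        base, inner_ext = os.path.splitext(base)
        ext = inner_ext + ext
    return path, base, ext


parts = split_filename_sketch("/data/sub-01/t1.nii.gz")
```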

ComScan.utils.tsne(df, columns, n_components=2, random_state=123, n_jobs=-1)

t-distributed Stochastic Neighbor Embedding.

t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.

Parameters

df (DataFrame) – input dataframe
columns (List[str]) – List of columns to use
n_components (int) – Dimension of the embedded space. Default 2.
random_state (Optional[int]) – int, RandomState instance or None, optional, default: 123. If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.
n_jobs (Optional[int]) – default=-1. The number of parallel jobs to run for neighbors search. This parameter has no impact when metric="precomputed" or (metric="euclidean" and method="exact"). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

Returns

array-like with projections
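
The parameters above match those of scikit-learn's TSNE estimator, so a projection of selected columns can be sketched as follows. That ComScan.utils.tsne wraps sklearn.manifold.TSNE in this way is an assumption, and the data are random values generated for the example.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
columns = ["f1", "f2", "f3", "f4"]
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=columns)
df["site"] = "A"  # non-numeric column that is left out of the projection

# Fixing random_state makes the non-convex optimisation start from the
# same initial state on every run.
model = TSNE(n_components=2, random_state=123, n_jobs=-1)
embedding = model.fit_transform(df[columns].to_numpy())
```

The result has one row per sample and n_components columns. The default perplexity of TSNE must be smaller than the number of samples, which is why the example uses 60 rows.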

ComScan.utils.u_map(df, columns, n_components=2, random_state=123, n_jobs=-1)

Just like t-SNE, UMAP is a dimensionality reduction technique specifically designed for visualizing complex data in low dimensions (2D or 3D). As the number of data points increases, UMAP becomes more time-efficient than t-SNE.

Parameters

df (DataFrame) – input dataframe
columns (List[str]) – List of columns to use
n_components (int) – Dimension of the embedded space. Default 2.
random_state (Optional[int]) – int, RandomState instance or None, optional, default: 123. If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.
n_jobs (Optional[int]) – default=-1. The number of parallel jobs to run for neighbors search. This parameter has no impact when metric="precomputed" or (metric="euclidean" and method="exact"). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

Returns

array-like with projections