ComScan.utils module
-
ComScan.utils.check_exist_vars(df, _vars)
Check that a list of column names exists in a DataFrame.
- Parameters
df (DataFrame) – a DataFrame
_vars (List) – list of column names to check
- Return type
ndarray
- Returns
indices of the column names
- Raise
ValueError if features are missing
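As an illustration of the check described above, here is a minimal sketch written directly against pandas. The helper name `check_columns_exist` and the error message are hypothetical; this is not the library's implementation.

```python
from typing import List

import numpy as np
import pandas as pd


def check_columns_exist(df: pd.DataFrame, names: List[str]) -> np.ndarray:
    """Hypothetical helper: return positional indices of `names` or raise ValueError."""
    missing = [name for name in names if name not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Index.get_indexer maps column labels to their integer positions.
    return df.columns.get_indexer(names)


df = pd.DataFrame({"age": [31, 45], "site": ["A", "B"], "volume": [1.2, 0.9]})
indices = check_columns_exist(df, ["volume", "age"])
```

The indices follow the order of the requested names, not the order of the columns in the DataFrame.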
-
ComScan.utils.check_is_nii_exist(input_file_path)
Check that a NIfTI file exists.
- Parameters
input_file_path (str) – path of the nii or nii.gz file
- Return type
str
- Returns
the path string if the file exists; otherwise an error is raised
- Raise
FileNotFoundError or FileExistsError
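A check of this kind can be sketched with the standard library alone. The helper name `require_nifti_file` is hypothetical, and raising ValueError for a wrong extension is an assumption made for this example; the reference above does not say which condition triggers which exception.

```python
import os
import tempfile


def require_nifti_file(path: str) -> str:
    """Hypothetical helper: return `path` if it is an existing .nii/.nii.gz file."""
    # Assumption: a wrong extension is reported as ValueError in this sketch.
    if not path.endswith((".nii", ".nii.gz")):
        raise ValueError(f"Not a .nii or .nii.gz path: {path}")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No such file: {path}")
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    existing = os.path.join(tmp_dir, "scan.nii.gz")
    open(existing, "wb").close()  # create an empty placeholder file
    found = require_nifti_file(existing) == existing
```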
-
ComScan.utils.column_var_dtype(df, identify_dtypes=('object'))
Identify the columns of a DataFrame that have the given dtypes.
- Parameters
df (DataFrame) – input dataframe
identify_dtypes (Sequence[str]) – pandas dtypes to identify
Note
see https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes for pandas dtypes
- Return type
DataFrame
- Returns
summary DataFrame with the column index and column name of every column whose dtype is in identify_dtypes
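The kind of summary described above could be built as follows. This is a sketch in plain pandas; the output column names `col_index` and `col_name` are assumptions for illustration, not the library's actual output format.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [31, 45],
    "site": pd.Series(["A", "B"], dtype=object),
    "volume": [1.2, 0.9],
    "sex": pd.Series(["F", "M"], dtype=object),
})

# A one-element tuple needs the trailing comma; ("object") is just a string.
wanted = ("object",)

# Keep the columns whose dtype name is one of the requested dtypes.
matches = [name for name in df.columns if str(df[name].dtype) in wanted]
summary = pd.DataFrame({
    "col_index": df.columns.get_indexer(matches),
    "col_name": matches,
})
```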
-
ComScan.utils.fix_columns(df, columns, inplace=False, extra_nans=False)
Fix the columns of the test set when the training set was encoded with pd.get_dummies.
Note
inspired by: http://fastml.com/how-to-use-pd-dot-get-dummies-with-the-test-set
- Parameters
df (DataFrame) – input dataframe
columns (List[str]) – columns of the original dataframe
inplace (bool) – If False, return a copy. Otherwise, do the operation in place and return None.
extra_nans (bool) – set extra columns to NaN based on the one-hot encoded columns
- Return type
DataFrame
- Returns
the corrected version of the DataFrame for the test set
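The problem this function addresses is that pd.get_dummies produces different columns when the test set contains different categories than the training set. One common way to align them, shown here as a sketch with plain pandas rather than as the library's implementation, is to reindex the test columns on the training columns.

```python
import pandas as pd

train = pd.DataFrame({"site": ["A", "B", "C"]})
test = pd.DataFrame({"site": ["A", "D"]})

train_encoded = pd.get_dummies(train, columns=["site"], dtype=int)
test_encoded = pd.get_dummies(test, columns=["site"], dtype=int)

# Align the test columns with the training columns: columns missing from the
# test set are added and filled with 0, columns unseen in training are dropped.
test_fixed = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
```

Here `site_D` is dropped because the training set never saw it, while `site_B` and `site_C` are added as all-zero columns.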
-
ComScan.utils.get_column_index(df, query_cols)
Get column indices from column names.
- Parameters
df (DataFrame) – input dataframe
query_cols (List[str]) – list of column names
- Return type
ndarray
- Returns
array of column indices
-
ComScan.utils.load_nifty_volume_as_array(input_path_file)
Load a NIfTI image into a numpy array with [z, y, x] axis order. The output array shape is [Depth, Height, Width].
- Parameters
input_path_file (str) – input file path, should be ‘.nii’ or ‘.nii.gz’
- Return type
Tuple[ndarray, Tuple[Tuple, Tuple, Tuple]]
- Returns
a numpy data array and its header
-
ComScan.utils.mat_to_bytes(nrows, ncols, dtype=32, out='GB')
Calculate the size of a numpy array in memory, expressed in the requested unit.
Note
code from: https://gist.github.com/dimalik/f4609661fb83e3b5d22e7550c1776b90
- Parameters
nrows (int) – the number of rows of the matrix.
ncols (int) – the number of columns of the matrix.
dtype (int) – the size of each element in the matrix, in bits. Defaults to 32 bits.
out (str) – the output unit. Defaults to gigabytes (GB).
- Return type
float
- Returns
the size of the matrix in the given unit
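The arithmetic behind such an estimate is rows × columns × bits per element, divided by 8 to get bytes, then scaled to the output unit. The sketch below re-derives it independently; the function name `matrix_size`, the unit labels, and the use of binary units (1 KB = 1024 bytes) are assumptions, not taken from the library.

```python
def matrix_size(nrows: int, ncols: int, bits_per_element: int = 32, unit: str = "GB") -> float:
    """Hypothetical re-derivation: size of an nrows x ncols matrix in `unit`."""
    exponents = {"B": 0, "KB": 1, "MB": 2, "GB": 3, "TB": 4}
    total_bytes = nrows * ncols * bits_per_element / 8
    # Assumption: binary units, i.e. 1 KB = 1024 bytes.
    return total_bytes / 1024 ** exponents[unit]


# A 32768 x 32768 matrix of 32-bit elements holds 2**30 elements of 4 bytes each.
size_gb = matrix_size(32768, 32768, bits_per_element=32, unit="GB")
```

Under these assumptions the example matrix occupies 4 GB, which is a quick way to decide whether a dense array fits in memory before allocating it.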
-
ComScan.utils.one_hot_encoder(df, columns, drop_column=True, dummy_na=False, add_nan_columns=False, inplace=False)
Encode categorical features in the dataframe, with the option to keep NaN. The categorical feature indices and names come from the cat_var function. These columns need to have the “object” dtype.
- Parameters
df (DataFrame) – input dataframe
columns (List[str]) – list of columns to encode
drop_column (bool) – Set to True to drop the original column after encoding. Defaults to True.
dummy_na (bool) – Add a column to indicate NaNs; if False, NaNs are ignored.
add_nan_columns (bool) – Add an empty NaN column if one was not created (can be used as an “other” category)
inplace (bool) – If False, return a copy. Otherwise, do the operation in place and return None.
- Return type
DataFrame
- Returns
new dataframe where the columns are one-hot encoded
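The effect of a NaN indicator column can be seen with pd.get_dummies, which has a parameter of the same name. This is a sketch of the underlying pandas behavior only; it makes no claim about how one_hot_encoder is implemented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"site": ["A", "B", np.nan], "age": [31, 45, 52]})

# dummy_na=True adds an indicator column for missing values;
# with dummy_na=False, rows with NaN get zeros in every dummy column.
encoded = pd.get_dummies(df, columns=["site"], dummy_na=True, dtype=int)
ignored = pd.get_dummies(df, columns=["site"], dummy_na=False, dtype=int)
```

The original `site` column is replaced by its dummy columns in both results; `encoded` has one more column than `ignored`.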
-
ComScan.utils.save_to_nii(im, header, output_dir, filename, mode='image', gzip=True)
Save a numpy array to nii.gz format for submission.
- Parameters
im (ndarray) – numpy array
header (Tuple[Tuple, Tuple, Tuple]) – header metadata (origin, spacing, direction).
output_dir (str) – Output directory.
filename (str) – Filename of the output file.
mode (str) – save as ‘image’ or ‘label’
gzip (bool) – compress the output (i.e., write nii.gz)
- Return type
None
-
ComScan.utils.scaler_encoder(df, columns, scaler=StandardScaler(), inplace=False)
Apply an sklearn scaler to columns.
- Parameters
df (DataFrame) – input dataframe
columns (List[str]) – list of columns to scale
scaler – scaler object from sklearn
inplace (bool) – If False, return a copy. Otherwise, do the operation in place and return None.
- Return type
DataFrame
- Returns
df: the scaled DataFrame
dict_cls_fitted: dict of the fitted scaler objects, keyed by column
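Column-wise scaling that also keeps the fitted scalers can be sketched as below. Fitting one StandardScaler per column is an assumption inferred from the description of dict_cls_fitted; the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [30.0, 40.0, 50.0], "site": ["A", "B", "C"]})
columns = ["age"]

scaled = df.copy()  # leave the input untouched, as with inplace=False
fitted = {}
for name in columns:
    scaler = StandardScaler()
    # fit_transform expects a 2D input, hence the double brackets.
    scaled[name] = scaler.fit_transform(scaled[[name]]).ravel()
    fitted[name] = scaler
```

Keeping the fitted scalers makes it possible to apply the same transformation to new data later with `fitted["age"].transform(...)`.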
-
ComScan.utils.split_filename(file_name)
Split file_name into folder path name, basename, and extension name.
- Parameters
file_name (str) – full path
- Return type
Tuple[str, str, str]
- Returns
path name, basename, extension name
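Splitting a NIfTI path needs care because `.nii.gz` is a double extension. The sketch below uses only the standard library; the helper name `split_path` and the choice to return the extension with its leading dot are assumptions, not the library's documented behavior.

```python
import os
from typing import Tuple


def split_path(file_name: str) -> Tuple[str, str, str]:
    """Hypothetical helper: split into (folder, base name, extension)."""
    folder, tail = os.path.split(file_name)
    base, ext = os.path.splitext(tail)
    # os.path.splitext only strips the last suffix, so handle ".nii.gz" explicitly.
    if ext == ".gz":
        base, inner_ext = os.path.splitext(base)
        ext = inner_ext + ext
    return folder, base, ext


parts = split_path("/data/sub-01/scan.nii.gz")
```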
-
ComScan.utils.tsne(df, columns, n_components=2, random_state=123, n_jobs=-1)
t-distributed Stochastic Neighbor Embedding.
t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
- Parameters
df (DataFrame) – input dataframe
columns (List[str]) – list of columns to use
n_components (int) – Dimension of the embedded space. Default 2.
random_state (Optional[int]) – int, RandomState instance or None, optional, default: 123. If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.
n_jobs (Optional[int]) – default=-1. The number of parallel jobs to run for neighbors search. This parameter has no impact when metric="precomputed" or (metric="euclidean" and method="exact"). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- Returns
array-like with projections
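For orientation, a projection of selected DataFrame columns can be computed with scikit-learn's TSNE directly, as sketched below. The data is random, and the `perplexity` value is an assumption chosen to suit the tiny sample; the reference above does not say which perplexity the library uses.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
columns = ["f1", "f2", "f3", "f4"]
df = pd.DataFrame(rng.normal(size=(30, 4)), columns=columns)

# perplexity must be smaller than the number of samples; 5 suits this tiny example.
model = TSNE(n_components=2, random_state=123, perplexity=5, n_jobs=-1)
projection = model.fit_transform(df[columns].to_numpy())
```

Fixing `random_state` matters here because the cost function is not convex, so different seeds can give different embeddings.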
-
ComScan.utils.u_map(df, columns, n_components=2, random_state=123, n_jobs=-1)
Just like t-SNE, UMAP is a dimensionality reduction technique specifically designed for visualizing complex data in low dimensions (2D or 3D). As the number of data points increases, UMAP becomes more time-efficient than t-SNE.
- Parameters
df (DataFrame) – input dataframe
columns (List[str]) – list of columns to use
n_components (int) – Dimension of the embedded space. Default 2.
random_state (Optional[int]) – int, RandomState instance or None, optional, default: 123. If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.
n_jobs (Optional[int]) – default=-1. The number of parallel jobs to run for neighbors search. This parameter has no impact when metric="precomputed" or (metric="euclidean" and method="exact"). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- Returns
array-like with projections