Package documentation open_nipals

open_nipals.nipalsPCA module

Code for calculating the PCA Loadings and Scores using NIPALS algorithm.

One of the most concise definitions can be found in this paper on page 7: Geladi, P.; Kowalski, B. R. Partial Least-Squares Regression: A Tutorial. Analytica Chimica Acta 1986, 185, 1–17. https://doi.org/10.1016/0003-2670(86)80028-9.

For the transformation part also see: Nelson, P. R. C.; Taylor, P. A.; MacGregor, J. F. Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems 1996, 35(1), 45-65.
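The NIPALS iteration referred to above can be sketched in a few lines of NumPy. This is a minimal illustration for complete (no missing values), mean-centered data, not the package's implementation; the function name `nipals_pca_sketch`, the starting vector, and the convergence test are assumptions made for the example.

```python
import numpy as np


def nipals_pca_sketch(X, n_components=2, max_iter=10000, tol=1e-6):
    """Hypothetical helper: NIPALS PCA on complete, mean-centered data."""
    residual = np.array(X, dtype=float)  # working copy, deflated per component
    n_rows, n_cols = residual.shape
    scores = np.zeros((n_rows, n_components))
    loadings = np.zeros((n_cols, n_components))

    for a in range(n_components):
        # Start from the residual column with the largest variance (assumption).
        t = residual[:, np.argmax(residual.var(axis=0))].copy()
        for _ in range(max_iter):
            p = residual.T @ t / (t @ t)   # regress columns on the score
            p /= np.linalg.norm(p)         # unit-length loading
            t_new = residual @ p           # regress rows on the loading
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        scores[:, a] = t
        loadings[:, a] = p
        residual -= np.outer(t, p)         # deflate before the next component

    return scores, loadings


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3)) * np.array([3.0, 2.0, 1.0])
X -= X.mean(axis=0)                        # mean-center the columns
T, P = nipals_pca_sketch(X, n_components=3)
```

With as many components as features, the loadings form an orthonormal basis and the product of scores and transposed loadings reproduces the centered data.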

class open_nipals.nipalsPCA.NipalsPCA(n_components: int = 2, max_iter: int = 10000, tol_criteria: float = 1e-06, mean_centered: bool = True)

Bases: BaseEstimator, TransformerMixin

A custom-built class for performing PCA with the NIPALS algorithm, i.e., the same algorithm used in SIMCA.

Attributes:

n_components (int): The number of principal components.
max_iter (int): The max number of iterations for the fitting step.
tol_criteria (float): The convergence tolerance criterion.
loadings (np.ndarray): The loadings vectors of the PCA model.
fit_scores (np.ndarray): The fitted scores of the PCA model.
fit_data (np.ndarray): The data used to fit the model.
mean_centered (bool): Whether or not the original data is mean-centered.
fitted_components (int): The number of LVs currently in the model (0 if not yet fitted).
explained_variance_ratio_ (np.ndarray): The explained variance ratios per fitted component.

Methods:

transform: Transform input data to scores.
fit: Fit PCA model on input data.
fit_transform: Fit PCA model on input data, then transform said data to scores.
inverse_transform: Obtain approximation of input data given fitted model and scores.
calc_imd: Calculate within-model distance.
calc_oomd: Calculate out-of-model distance.
calc_limit: Calculate suitable distance threshold given fitted data.
set_components: Change the number of model components.
get_explained_variance_ratio: Calculate explained variances as ratio of total explained variance.

calc_imd(input_scores: ndarray | None = None, input_array: ndarray | None = None, metric: str = 'HotellingT2', covariance: str = 'diag') → ndarray

Calculate in-model distance (IMD) of observations. This is the distance from the center of the hyperplane to the projected observation.

Parameters:

input_scores (Optional[np.ndarray], optional) – The scores from which to calculate the distance. Defaults to None.
input_array (Optional[np.ndarray], optional) – The input data in original space from which to calculate the distance. Defaults to None.
metric (str, optional) – The metric to use. Valid options are {‘HotellingT2’}. Defaults to ‘HotellingT2’.
covariance (str, optional) – Method to compute covariance. Valid options are {‘diag’, ‘full’, ‘ledoit_wolf’}. Defaults to ‘diag’ (quick version). ‘full’ uses the entire covariance matrix computed by numpy. ‘ledoit_wolf’ uses the full covariance matrix computed by Ledoit-Wolf shrinkage.

Raises:

NotFittedError – Model has not been fit yet.
ValueError – Neither scores nor input data provided.
ValueError – Input scores are inconsistent with n_components of model.
NotImplementedError – Any metric that has not yet been implemented.

Returns:

The calculated within-model distance for each observation (row).

Return type:

np.ndarray
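As a rough illustration of what the ‘diag’ and ‘full’ covariance options mean for Hotelling's T2, the sketch below computes both from a set of fitted scores. It is not the package's code: the helper name `hotelling_t2_sketch` is hypothetical, the scores are assumed to be centered at zero, and the sample covariance (ddof=1) is an assumption. The ‘ledoit_wolf’ option is not covered.

```python
import numpy as np


def hotelling_t2_sketch(scores, fit_scores, covariance="diag"):
    """Hypothetical helper: Hotelling's T2 of each row of `scores`."""
    scores = np.asarray(scores, dtype=float)
    fit_scores = np.asarray(fit_scores, dtype=float)
    if covariance == "diag":
        # Treat the components as uncorrelated: only score variances are used.
        variances = fit_scores.var(axis=0, ddof=1)
        return np.sum(scores**2 / variances, axis=1)
    if covariance == "full":
        # Use the complete covariance matrix of the fitted scores.
        cov = np.atleast_2d(np.cov(fit_scores, rowvar=False))
        return np.einsum("ij,jk,ik->i", scores, np.linalg.inv(cov), scores)
    raise NotImplementedError(f"Unknown covariance option: {covariance}")


fit_scores = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
t2_diag = hotelling_t2_sketch(fit_scores, fit_scores, covariance="diag")
t2_full = hotelling_t2_sketch(fit_scores, fit_scores, covariance="full")
```

Because the two score columns in this toy example are uncorrelated, both options give the same value for every row.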

calc_limit(metric: str = 'HotellingT2', n: int | None = None, num_lvs: int | None = None, m: int | None = None, alpha: float = 0.95) → float

Calculate the limit for the in-model distance (IMD) or the out-of-model distance (OOMD). The calculation rests on assumptions about the shape of the distribution; in practice, the user should judge whether the resulting limit is appropriate.

Parameters:

metric (str, optional) – The metric to use. Valid options are {‘HotellingT2’, ‘DModX’}. Defaults to ‘HotellingT2’.
n (Optional[int], optional) – The number of observations. Defaults to None, which results in the n of the fitted scores to be used.
num_lvs (Optional[int], optional) – The number of latent variables (principal components). Defaults to None, which results in the number of columns of the fitted scores being used.
m (Optional[int], optional) – The number of original features. Defaults to None, which results in the number of features in the original data/fitted loadings to be used.
alpha (float, optional) – The confidence value to use. Defaults to 0.95.

Returns:

The limit threshold for the given metric and confidence value.

Return type:

float

calc_oomd(input_array: ndarray, metric: str = 'QRes') → ndarray

Calculate the out-of-model distance (OOMD) of observations.

Parameters:

input_array (np.ndarray) – The data for which to calculate the OOMD.
metric (str, optional) – The metric to use. Valid options are {‘QRes’, ‘DModX’}. Defaults to ‘QRes’.

Raises:

NotImplementedError – Unknown metric.

Returns:

The distances for the provided observations.

Return type:

np.ndarray
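The idea behind a Q-residual style out-of-model distance can be sketched as the squared length of whatever the model cannot reconstruct. The helper `q_residual_sketch` below is hypothetical and assumes complete, already-centered data and orthonormal loadings; the package may scale or normalize the value differently, and ‘DModX’ is not covered here.

```python
import numpy as np


def q_residual_sketch(X, loadings):
    """Hypothetical helper: squared residual length per row of X."""
    X = np.asarray(X, dtype=float)
    scores = X @ loadings                 # project onto the model plane
    residual = X - scores @ loadings.T    # part of X the model cannot explain
    return np.sum(residual**2, axis=1)


loadings = np.array([[1.0], [0.0]])       # one component along the first feature
X_new = np.array([[3.0, 4.0], [1.0, 0.0]])
q = q_residual_sketch(X_new, loadings)
```

The first row has a component of length 4 outside the model plane, so its value is 16; the second row lies in the plane and gets 0.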

property explained_variance_ratio_: ndarray

Calculate the explained variance ratios per fitted component.

Parameters: in_data (np.array, optional) – Alternative input data. Defaults to None.
Raises: ValueError – if in_data not mean centered.
Returns: explained variances
Return type: np.ndarray

fit(X: ndarray, verbose: bool = False) → NipalsPCA

Fits PCA model to input data.

Parameters:

X (np.ndarray) – The input data to fit on.
verbose (bool, optional) – Whether or not to print out additional convergence information. Defaults to False.

Returns:

A reference to the object.

Return type:

NipalsPCA

fit_transform(X: ndarray, verbose: bool = False) → ndarray

Fit, then transform input data. This function is equivalent to:

    >>> P = NipalsPCA()
    >>> P.fit(X)
    >>> T = P.transform(X)

Parameters:

X (np.ndarray) – The input data to fit on and to transform.
verbose (bool, optional) – Whether or not to print out additional convergence information. Defaults to False.

Raises:

ValueError – Model has already been fit.

Returns:

The corresponding scores.

Return type:

np.ndarray

property fitted_components: int

Get the total number of LVs in the model. This may differ from self.n_components, which is the number of components used by the model.

Returns: Number of fitted components
Return type: int

get_explained_variance_ratio(in_data: array = None) → ndarray

Calculate the explained variance ratios per fitted component.

Parameters: in_data (np.array, optional) – Alternative input data. Defaults to None.
Raises: ValueError – if in_data not mean centered.
Returns: explained variances
Return type: np.ndarray
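One plausible reading of an explained variance ratio is the share of the total sum of squares of the mean-centered data captured by each component. The sketch below follows that reading; it is an assumption, since the text here does not state which denominator the package uses, and `explained_variance_ratio_sketch` is a hypothetical name.

```python
import numpy as np


def explained_variance_ratio_sketch(X_centered, scores, loadings):
    """Hypothetical helper: share of total sum of squares per component."""
    total_ss = np.sum(np.asarray(X_centered, dtype=float) ** 2)
    # Sum of squares of each rank-one term t_a p_a^T.
    component_ss = np.sum(scores**2, axis=0) * np.sum(loadings**2, axis=0)
    return component_ss / total_ss


X_centered = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
loadings = np.eye(2)                      # toy loadings aligned with the axes
scores = X_centered @ loadings
ratios = explained_variance_ratio_sketch(X_centered, scores, loadings)
```

In this toy case the first component carries 8 of the 10 units of squared variation, so the ratios are 0.8 and 0.2.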

inverse_transform(X: ndarray) → ndarray

Approximate original data from scores.

Parameters:

X (np.ndarray) – An array containing the scores.

Raises:

NotFittedError – PCA model has not been fit yet.
ValueError – Shape of provided scores does not match n_components in model.

Returns:

The approximation of the original data.

Return type:

np.ndarray

set_components(n_component: int, verbose: bool = False)

Method for setting the number of components in an already-constructed model. It checks to make sure that loadings exist for all of the set components and will fit extras if not. Note that in case of decreasing the number of components, previously fitted components are internally stored. In case you prefer a clean model, create a new model object and fit it with the desired number of components.

Parameters:

n_component (int) – the desired number of components.
verbose (bool) – Whether or not to print out additional convergence information. Defaults to False.

Raises:

TypeError – if n_component is not an int
ValueError – if n_component < 1

set_fit_request(*, verbose: bool | None | str = '$UNCHANGED$') → NipalsPCA

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
Returns: self – The updated object.
Return type: object

set_transform_request(*, method: bool | None | str = '$UNCHANGED$') → NipalsPCA

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: method (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for method parameter in transform.
Returns: self – The updated object.
Return type: object

transform(X: ndarray, method: str = 'naive') → ndarray

This function takes an input array and projects it based on a fitted model.

Parameters:

X (np.ndarray) – The nxm input array in the original feature space to be projected.
method (str, optional) – The method to use for the projection. See reference listed in module docstring. Valid options are {‘naive’,’projection’,’conditional_mean’} Defaults to ‘naive’.

Raises:

NotFittedError – If model has not been fit yet (no loadings).
ValueError – Method ‘conditional_mean’ is selected but fit_data is not available.

Returns:

The corresponding scores.

Return type:

np.ndarray
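For observations with missing values, one approach discussed in the missing-data literature cited in the module docstring is to regress the observed values on the matching rows of the loadings. The sketch below illustrates that idea only; whether it corresponds exactly to the ‘projection’ option of this method is an assumption, and `project_with_missing_sketch` is a hypothetical name.

```python
import numpy as np


def project_with_missing_sketch(x, loadings):
    """Hypothetical helper: scores of one observation that may contain NaNs."""
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    # Least-squares fit of the observed values on the matching loading rows.
    t, *_ = np.linalg.lstsq(loadings[observed], x[observed], rcond=None)
    return t


loadings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
loadings[:, 0] /= np.sqrt(2.0)            # make the first loading unit length
x = np.array([1.0, np.nan, 5.0])          # second feature is missing
t = project_with_missing_sketch(x, loadings)
```

Here the complete observation would have been [1, 1, 5], and the scores recovered from the two observed features match those of the complete observation.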

open_nipals.nipalsPLS module

Algorithm implemented from Chapter 6 of Chiang, Leo H., Evan L. Russell, and Richard D. Braatz. Fault detection and diagnosis in industrial systems. Springer Science & Business Media, 2000.

Alternative algorithm derivation from: Geladi, P.; Kowalski, B. R. Partial Least-Squares Regression: A Tutorial. Analytica Chimica Acta 1986, 185, 1–17. https://doi.org/10.1016/0003-2670(86)80028-9.

For the transformation part also see: Nelson, P. R. C.; Taylor, P. A.; MacGregor, J. F. Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems 1996, 35(1), 45-65.
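A compact NumPy sketch of a NIPALS PLS fit is given below for orientation. It follows a common variant in which the inner regression coefficient is folded into the y loadings, assumes complete and mean-centered X and Y, and is not the package's implementation; the name `nipals_pls_sketch` and the convergence test are assumptions.

```python
import numpy as np


def nipals_pls_sketch(X, Y, n_components=2, max_iter=10000, tol=1e-6):
    """Hypothetical helper: NIPALS PLS on complete, mean-centered X and Y."""
    Xr = np.array(X, dtype=float)          # X residual, deflated per component
    Yr = np.array(Y, dtype=float)          # Y residual, deflated per component
    W = np.zeros((Xr.shape[1], n_components))  # X weights
    P = np.zeros((Xr.shape[1], n_components))  # X loadings
    Q = np.zeros((Yr.shape[1], n_components))  # Y loadings

    for a in range(n_components):
        u = Yr[:, 0].copy()                # initial Y score
        t = np.zeros(Xr.shape[0])
        for _ in range(max_iter):
            w = Xr.T @ u / (u @ u)
            w /= np.linalg.norm(w)         # unit-length weight
            t_new = Xr @ w                 # X score
            q = Yr.T @ t_new / (t_new @ t_new)
            u = Yr @ q / (q @ q)           # updated Y score
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        p = Xr.T @ t / (t @ t)
        Xr -= np.outer(t, p)               # deflate X
        Yr -= np.outer(t, q)               # deflate Y
        W[:, a], P[:, a], Q[:, a] = w, p, q

    # Regression matrix mapping centered X to centered Y.
    return W @ np.linalg.inv(P.T @ W) @ Q.T


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X -= X.mean(axis=0)
Y = X @ np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])  # noise-free, centered
B = nipals_pls_sketch(X, Y, n_components=3)
```

With noise-free data and as many components as X has columns, the regression matrix reproduces Y exactly; with fewer components the fit is a low-rank approximation.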

class open_nipals.nipalsPLS.NipalsPLS(n_components: int = 2, max_iter: int = 10000, tol_criteria: float = 1e-06, mean_centered: bool = True, force_include: bool = False)

Bases: BaseEstimator, TransformerMixin, RegressorMixin

The custom-built class to use PLS using the NIPALS algorithm, i.e., the same algorithm used in SIMCA.

Attributes:

n_components (int): The number of principal components.
max_iter (int): The max number of iterations for the fitting step.
tol_criteria (float): The convergence tolerance criterion.
mean_centered (bool): Whether or not the original data is mean-centered.
force_include (bool): True will force including the data which has all nans in y-block. Defaults to False.
fit_data_x (np.ndarray): The X data used to fit the model.
fit_data_y (np.ndarray): The y data used to fit the model.
loadings_x (np.ndarray): The X loadings vectors of the PLS model.
loadings_y (np.ndarray): The y loadings vectors of the PLS model.
fit_scores_x (np.ndarray): The fitted X scores of the PLS model.
fit_scores_y (np.ndarray): The fitted y scores of the PLS model.
regression_matrix (np.ndarray): The regression matrix of the PLS model.
fitted_components (int): The number of LVs currently in the model (0 if not yet fitted).
explained_variance_ratio_ (np.ndarray): The explained variance ratios per fitted component.

Methods:

transform: Transform input data to scores.
fit: Fit PLS model on input data.
fit_transform: Fit PLS model on input data, then transform said data to scores.
inverse_transform: Obtain approximation of X input data given fitted model and X scores.
calc_imd: Calculate within-model distance.
calc_oomd: Calculate out-of-model distance.
predict: Obtain prediction for y data given model and X data.
set_components: Change the number of model components.
get_reg_vector: Give regression vector of the model.
get_explained_variance_ratio: Calculate explained variances as ratio of total explained variance.

calc_imd(input_scores: array | None = None, input_array: array | None = None, metric: str = 'HotellingT2', covariance: str = 'diag')

Calculate the in-model distance (IMD) of observations. This will take in an input array OR scores and return Hotelling’s T2 value for each row. In theory you could expand to include a Y-block in-model distance, but the value is limited for the typical use cases.

Parameters:

input_scores (Optional[np.array], optional) – Scores array. Defaults to None.
input_array (Optional[np.array], optional) – Data array. Defaults to None.
metric (str, optional) – In-model-distance to compute. Must be one of set {‘HotellingT2’}. Defaults to ‘HotellingT2’.
covariance (str, optional) – Method to compute covariance. Valid options are {‘diag’, ‘full’, ‘ledoit_wolf’}. Defaults to ‘diag’ (quick version). ‘full’ uses the entire covariance matrix computed by numpy. ‘ledoit_wolf’ uses the full covariance matrix computed by Ledoit-Wolf shrinkage.

Raises:

NotFittedError – If model has not been fit.
ValueError – If neither scores nor data are provided.
ValueError – If the shape of the input scores does not match the model.
NotImplementedError – If unknown metric was requested.

Returns:

The within-model distance(s).

Return type:

float

calc_oomd(input_array: array, metric: str = 'QRes') → array

Calculate the out-of-model distance (OOMD) of observations. In theory can be used for Y-block, but the value in typical use is limited.

Parameters:

input_array (np.array) – The X input data for which to calculate the OOMD.
metric (str, optional) – The metric to compute. Supported metrics are: {‘QRes’,’DModX’}. Defaults to ‘QRes’.

Raises:

ValueError – If input metric is unknown.

Returns:

The out-of-model distance(s).

Return type:

np.array

property explained_variance_ratio_: np.ndarray, np.ndarray

Calculate the explained variance ratios for X and y arrays per fitted component.

Parameters:

in_x_data (np.array, optional) – Alternative input X data. Defaults to None.
in_y_data (np.array, optional) – Alternative input y data. Defaults to None.

Raises:

ValueError – If in_x_data not mean centered.
ValueError – If in_y_data not mean centered.

Returns:

explained variance ratios for X and y

Return type:

(np.ndarray, np.ndarray)

fit(X: array, y: array, verbose: bool = False) → NipalsPLS

Function to fit PLS model from X/Y Data.

Parameters:

X (np.array) – Input X data.
y (np.array) – Input Y data.
verbose (bool, optional) – Turn verbosity on and off. Defaults to False.

Raises:

NotFittedError – Model has not yet been fit

Returns:

A reference to the object.

Return type:

NipalsPLS

fit_transform(X: array, y: array) → Tuple[array, array]

Combine fit and transform methods into one command, sklearn style.

Parameters:

X (np.array) – The X-data.
y (np.array) – The Y-data.

Raises:

ValueError – If attempt to use fit_transform on already fitted data.

Returns:

A tuple containing the fitted scores for X and Y.

Return type:

Tuple[np.array, np.array]

property fitted_components: int

Get the total number of LVs in the model. This may differ from self.n_components, which is the number of components used by the model.

Returns: Number of fitted components
Return type: int

get_explained_variance_ratio(in_x_data: np.array = None, in_y_data: np.array = None)

Calculate the explained variance ratios for X and y arrays per fitted component.

Parameters:

in_x_data (np.array, optional) – Alternative input X data. Defaults to None.
in_y_data (np.array, optional) – Alternative input y data. Defaults to None.

Raises:

ValueError – If in_x_data not mean centered.
ValueError – If in_y_data not mean centered.

Returns:

explained variance ratios for X and y

Return type:

(np.ndarray, np.ndarray)

get_reg_vector() → array

Give the user the regression vector for the model.

Raises: NotFittedError – If the model has not been fit.
Returns: The regression vector.
Return type: np.array

inverse_transform(X: array) → array

Given a set of scores, return the simulated data.

Parameters:

X (np.array) – The scores to transform back.

Raises:

NotFittedError – If model has not been fit.
ValueError – If the shape of the input scores does not match the model.

Returns:

The simulated data.

Return type:

np.array

predict(X: array = None, scores_x: array = None) → array

Predict y from data or scores.

Parameters:

X (np.array, optional) – Input X data. Defaults to None.
scores_x (np.array, optional) – Input scores. Defaults to None.

Raises:

NotFittedError – If model has not been fit.
ValueError – If neither data nor scores are provided.

Returns:

The predicted y-values.

Return type:

np.array

set_components(n_component: int, verbose: bool = False)

Method for setting the number of components in an already-constructed model. It checks to make sure that loadings exist for all of the set components and will fit extras if not. Note that in case of decreasing the number of components, previously fitted components are internally stored. In case you prefer a clean model, create a new model object and fit it with the desired number of components.

Parameters:

n_component (int) – the desired number of components.
verbose (bool) – Whether or not to print out additional convergence information. Defaults to False.

Raises:

TypeError – if n_component is not an int
ValueError – if n_component < 1

set_fit_request(*, verbose: bool | None | str = '$UNCHANGED$') → NipalsPLS

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
Returns: self – The updated object.
Return type: object

set_predict_request(*, scores_x: bool | None | str = '$UNCHANGED$') → NipalsPLS

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: scores_x (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for scores_x parameter in predict.
Returns: self – The updated object.
Return type: object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → NipalsPLS

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
Returns: self – The updated object.
Return type: object

transform(X: array, y: array | None = None) → array | Tuple[array, array]

Compute scores using model.

Parameters:

X (np.array) – X-data.
y (np.array, optional) – Y-data. Defaults to None.

Raises:

NotFittedError – If model is not fit.

Returns:

Either the scores for X (when no Y-data is provided) or a tuple of two scores arrays.

Return type:

Union[np.array, Tuple[np.array, np.array]]

open_nipals.utils module

Code for pipelining a data arrangement in sklearn. Allows for StandardScaler() followed by other methods.

2020: Ryan Wall, revised 2025: Calvin Ristad, Niels Schlusser

class open_nipals.utils.ArrangeData(var_dict: dict | DataFrame = None)

Bases: TransformerMixin

The ArrangeData class creates an sklearn-style transformer object that puts the columns of a dataframe in the correct order.

Attributes:

var_dict (dict): A dictionary indicating which column should be in which position.

Methods:

var_dict_from_df: Infer var_dict from template dataframe.
fit_transform: concatenation of fit and transform
fit: Applies var_dict_from_df in a consistent manner with sklearn nomenclature
transform: Takes a new dataframe or np.array + variable dictionary and arranges to be consistent with stored var_dict

fit(input_data: DataFrame | ndarray, input_var_dict: dict | None = None)

The function which will fit the var_dict object.

Parameters:

input_data (Union[pd.DataFrame, np.ndarray]) – The input data to use for fitting.
input_var_dict (Optional[dict], optional) – Required if input_data is an array, this dictionary contains the headers as {column Name: column index}.

Raises:

ValueError – input_data is a numpy array and input_var_dict is missing.
ValueError – input_data is a numpy array and its shape does not match input_var_dict.
ValueError – An unknown error occurred.

fit_transform(input_data: DataFrame | ndarray, input_var_dict: dict | None = None) → DataFrame

Fit a data/column model and then transform data.

Parameters:

input_data (Union[pd.DataFrame, np.ndarray]) – The input data to transform. Note that if this is a dataframe it _will_ be used to fit var_dict. This might not be what you want.
input_var_dict (Optional[dict], optional) – The var_dict to be used to fit. Defaults to None.

Returns:

transformed input frame

Return type:

pd.DataFrame

transform(input_data: DataFrame | ndarray, input_var_dict: dict | None = None) → ndarray

Transform input data based on stored data model.

Parameters:

input_data (Union[pd.DataFrame, np.ndarray]) – The input_data to transform.
input_var_dict (Optional[dict], optional) – Required if input_data is an array, this dictionary contains the headers as {column Name: column index}. Defaults to None.

Raises:

ValueError – ArrangeData object has not yet been fit.
ValueError – input_data is a numpy array and no input_var_dict has been provided.
ValueError – input_data is a numpy array and its shape does not match input_var_dict

Returns:

The transformed (column-rearranged) data.

Return type:

np.ndarray
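The column rearrangement described above can be illustrated for the array case with a short NumPy sketch. It is not the package's code: the helper `arrange_columns_sketch` and the column names are hypothetical, and how the class itself treats extra or missing columns is not stated here.

```python
import numpy as np


def arrange_columns_sketch(data, input_var_dict, stored_var_dict):
    """Hypothetical helper: reorder array columns to match a stored layout."""
    data = np.asarray(data)
    if data.shape[1] != len(input_var_dict):
        raise ValueError("Shape of data does not match input_var_dict.")
    # Column names in the order the stored layout expects them.
    target_names = sorted(stored_var_dict, key=stored_var_dict.get)
    # For each target position, look up where that column currently sits.
    source_index = [input_var_dict[name] for name in target_names]
    return data[:, source_index]


stored_var_dict = {"temp": 0, "pressure": 1, "flow": 2}   # layout seen at fit time
input_var_dict = {"flow": 0, "temp": 1, "pressure": 2}    # layout of the new array
new_data = np.array([[30.0, 10.0, 20.0]])
arranged = arrange_columns_sketch(new_data, input_var_dict, stored_var_dict)
```

Both dictionaries use the {column name: column index} convention, so the new array's columns end up in the order temp, pressure, flow.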

var_dict_from_df(input_df: DataFrame) → dict

Infer the var_dict from a template dataframe.

Parameters: input_df (pd.DataFrame) – A dataframe that has the desired order of columns.
Returns: The corresponding var_dict. {column Name: column index}
Return type: dict