Package documentation open_nipals

open_nipals.nipalsPCA module

Code for calculating the PCA Loadings and Scores using NIPALS algorithm.

One of the most concise definitions can be found in this paper on page 7: Geladi, P.; Kowalski, B. R. Partial Least-Squares Regression: A Tutorial. Analytica Chimica Acta 1986, 185, 1–17. https://doi.org/10.1016/0003-2670(86)80028-9.

For the transformation part also see: Nelson, P. R. C.; Taylor, P. A.; MacGregor, J. F. Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems 1996, 35(1), 45-65.

(c) 2020-2021: Ryan Wall (lead), David Ochsenbein; revised 2024: Niels Schlusser

class open_nipals.nipalsPCA.NipalsPCA(n_components: int = 2, max_iter: int = 10000, tol_criteria: float = 1e-06, mean_centered: bool = True)

Bases: BaseEstimator, TransformerMixin

A custom-built class that performs PCA with the NIPALS algorithm, i.e., the same algorithm used in SIMCA.

Attributes:

n_components : int

The number of principal components.

max_iter : int

The maximum number of iterations for the fitting step.

tol_criteria : float

The convergence tolerance criterion.

loadings : np.ndarray

The loadings vectors of the PCA model.

fit_scores : np.ndarray

The fitted scores of the PCA model.

fit_data : np.ndarray

The data used to fit the model.

mean_centered : bool

Whether or not the original data is mean-centered.

fitted_components : int

The number of latent variables (LVs) currently in the model (0 if not yet fitted).

explained_variance_ratio_ : np.ndarray

The explained variance ratios per fitted component.

Methods:

transform

Transform input data to scores.

fit

Fit PCA model on input data.

fit_transform

Fit PCA model on input data, then transform said data to scores.

inverse_transform

Obtain approximation of input data given fitted model and scores.

calc_imd

Calculate within-model distance.

calc_oomd

Calculate out-of-model distance.

calc_limit

Calculate suitable distance threshold given fitted data.

set_components

Change the number of model components.

get_explained_variance_ratio

Calculate explained variances as ratio of total explained variance.
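The NIPALS iteration behind the class can be sketched in a few lines of NumPy. This is a minimal illustration for complete (no missing values), mean-centered data and is not the package's implementation; the function name nipals_pca and the start-vector choice are assumptions made for the example, while the argument names mirror the constructor parameters.

```python
import numpy as np


def nipals_pca(X, n_components=2, max_iter=10000, tol=1e-6):
    """Sketch of NIPALS PCA on complete, mean-centered data X (n x m)."""
    X_res = np.array(X, dtype=float)  # working copy that gets deflated
    n, m = X_res.shape
    scores = np.zeros((n, n_components))
    loadings = np.zeros((m, n_components))
    for a in range(n_components):
        # Assumed start vector: the residual column with the largest variance.
        t = X_res[:, np.argmax(X_res.var(axis=0))].copy()
        for _ in range(max_iter):
            p = X_res.T @ t / (t @ t)   # regress the columns of X on t
            p /= np.linalg.norm(p)      # normalise the loading to unit length
            t_new = X_res @ p           # regress the rows of X on p
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        scores[:, a] = t
        loadings[:, a] = p
        X_res -= np.outer(t, p)         # deflate: remove this component
    return scores, loadings


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 6))  # rank-4 data
X -= X.mean(axis=0)
T, P = nipals_pca(X, n_components=3)
```

Because each component is removed from the residual before the next one is computed, the loadings come out mutually orthogonal, and using as many components as the rank of X reproduces X from the scores and loadings.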

calc_imd(input_scores: ndarray | None = None, input_array: ndarray | None = None, metric: str = 'HotellingT2', covariance: str = 'diag') ndarray

Calculate in-model distance (IMD) of observations. This is the distance from the center of the hyperplane to the projected observation.

Parameters:
  • input_scores (Optional[np.ndarray], optional) – The scores from which to calculate the distance. Defaults to None.

  • input_array (Optional[np.ndarray], optional) – The input data in original space from which to calculate the distance. Defaults to None.

  • metric (str, optional) – The metric to use. Valid options are {‘HotellingT2’}. Defaults to ‘HotellingT2’.

  • covariance (str, optional) – Method to compute covariance. Valid options are {‘diag’, ‘full’, ‘ledoit_wolf’}. Defaults to ‘diag’ (quick version). ‘full’ uses the entire covariance matrix computed by numpy. ‘ledoit_wolf’ uses the full covariance matrix computed by Ledoit-Wolf shrinkage.

Raises:
  • NotFittedError – Model has not been fit yet.

  • ValueError – Neither scores nor input data provided.

  • ValueError – Input scores are inconsistent with n_components of model.

  • NotImplementedError – Any metric that has not yet been implemented.

Returns:

The calculated within-model distance for each observation (row).

Return type:

np.ndarray
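As an illustration of the ‘diag’ and ‘full’ covariance options, Hotelling's T2 can be sketched as below. The helper hotelling_t2 is hypothetical and assumes the scores are centered at zero; the ‘ledoit_wolf’ option is not shown, and the package's own scaling may differ.

```python
import numpy as np


def hotelling_t2(scores, fit_scores, covariance="diag"):
    """T2 of each row of `scores`, scaled by the spread of `fit_scores`."""
    scores = np.atleast_2d(scores)
    if covariance == "diag":
        # Quick version: ignore cross terms, use per-component variances.
        var = fit_scores.var(axis=0, ddof=1)
        return np.sum(scores**2 / var, axis=1)
    if covariance == "full":
        cov = np.atleast_2d(np.cov(fit_scores, rowvar=False))
        # Row-wise quadratic form t_i^T S^-1 t_i.
        return np.einsum("ij,jk,ik->i", scores, np.linalg.inv(cov), scores)
    raise NotImplementedError(f"Unknown covariance option: {covariance}")


rng = np.random.default_rng(1)
A = rng.normal(size=(30, 2))
A -= A.mean(axis=0)
Q, _ = np.linalg.qr(A)                   # orthogonal, zero-mean columns
fit_scores = Q * np.array([3.0, 1.0])    # uncorrelated scores with different spread
t2_diag = hotelling_t2(fit_scores, fit_scores, "diag")
t2_full = hotelling_t2(fit_scores, fit_scores, "full")
```

For uncorrelated scores, as in this example, both options give the same distances; they only differ when the score columns are correlated.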

calc_limit(metric: str = 'HotellingT2', n: int | None = None, num_lvs: int | None = None, m: int | None = None, alpha: float = 0.95) float

Calculate the limit for the in-model distance (IMD) or the out-of-model distance (OOMD). The calculation rests on assumptions about the shape of the distribution, so in practice the user should judge whether the resulting limit is appropriate.

Parameters:
  • metric (str, optional) – The metric to use. Valid options are {‘HotellingT2’, ‘DModX’}. Defaults to ‘HotellingT2’.

  • n (Optional[int], optional) – The number of observations. Defaults to None, in which case the number of rows of the fitted scores is used.

  • num_lvs (Optional[int], optional) – The number of latent variables (principal components). Defaults to None, in which case the number of columns of the fitted scores is used.

  • m (Optional[int], optional) – The number of original features. Defaults to None, in which case the number of features in the original data/fitted loadings is used.

  • alpha (float, optional) – The confidence value to use. Defaults to 0.95.

Returns:

The limit threshold for the given metric and confidence value.

Return type:

float
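For orientation, one textbook form of the Hotelling's T2 limit is based on the F-distribution. It is shown here only as an example of the kind of distributional assumption involved; the exact expression used by calc_limit is not stated in this documentation and may differ. Here n is the number of observations, A corresponds to num_lvs, and F_alpha denotes the alpha-quantile of the F-distribution with A and n - A degrees of freedom.

```latex
T^2_{\mathrm{lim}} = \frac{A\,(n^2 - 1)}{n\,(n - A)}\; F_{\alpha}(A,\; n - A)
```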

calc_oomd(input_array: ndarray, metric: str = 'QRes') ndarray

Calculate the out-of-model distance (OOMD) of observations.

Parameters:
  • input_array (np.ndarray) – The data for which to calculate the OOMD.

  • metric (str, optional) – The metric to use. Valid options are {‘QRes’, ‘DModX’}. Defaults to ‘QRes’.

Raises:

NotImplementedError – Unknown metric.

Returns:

The distances for the provided observations.

Return type:

np.ndarray
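A minimal sketch of a Q-residual style distance, assuming orthonormal loadings and complete data: each observation is projected onto the model plane and the squared length of what is left over is summed per row. The helper q_residuals is hypothetical; the ‘DModX’ metric involves an additional normalisation that is not shown here.

```python
import numpy as np


def q_residuals(X, loadings):
    """Squared distance of each row of X to the plane spanned by `loadings`."""
    scores = X @ loadings                 # project onto the model plane
    residuals = X - scores @ loadings.T   # part of X the model does not explain
    return np.sum(residuals**2, axis=1)


# Model plane: the x-y plane in 3-D (orthonormal loadings).
P = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
X = np.array([[2.0, -1.0, 0.0],   # lies in the plane
              [0.5, 0.5, 3.0]])   # 3 units above the plane
q = q_residuals(X, P)
```

The first observation has a distance of 0 and the second a squared distance of 9, matching its offset of 3 units from the plane.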

property explained_variance_ratio_: ndarray

Calculate the explained variance ratios per fitted component.

Parameters:

in_data (np.array, optional) – Alternative input data. Defaults to None.

Raises:

ValueError – if in_data not mean centered.

Returns:

explained variances

Return type:

np.ndarray

fit(X: ndarray, verbose: bool = False) NipalsPCA

Fits PCA model to input data.

Parameters:
  • X (np.ndarray) – The input data to fit on.

  • verbose (bool, optional) – Whether or not to print out additional convergence information. Defaults to False.

Returns:

A reference to the object.

Return type:

NipalsPCA

fit_transform(X: ndarray, verbose: bool = False) ndarray

Fit, then transform input data. This function is equivalent to:

>>> P = NipalsPCA()
>>> P.fit(X)
>>> T = P.transform(X)

Parameters:
  • X (np.ndarray) – The input data to fit on and to transform.

  • verbose (bool, optional) – Whether or not to print out additional convergence information. Defaults to False.

Raises:

ValueError – Model has already been fit.

Returns:

The corresponding scores.

Return type:

np.ndarray

property fitted_components: int

Get the total number of LVs in the model. This may differ from self.n_components, which is the number of components the model currently uses.

Returns:

Number of fitted components

Return type:

int

get_explained_variance_ratio(in_data: array = None) ndarray

Calculate the explained variance ratios per fitted component.

Parameters:

in_data (np.array, optional) – Alternative input data. Defaults to None.

Raises:

ValueError – if in_data not mean centered.

Returns:

explained variances

Return type:

np.ndarray
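One common way to compute such ratios is to compare the sum of squares captured by each component with the total sum of squares of the mean-centered data. The sketch below follows that convention as an assumption; the package may normalise differently. The helper explained_variance_ratio is hypothetical, and its mean-centering check mirrors the ValueError documented above.

```python
import numpy as np


def explained_variance_ratio(X, scores, loadings):
    """Share of the total sum of squares of X captured by each component."""
    if not np.allclose(X.mean(axis=0), 0.0, atol=1e-8):
        raise ValueError("Input data must be mean-centered.")
    total_ss = np.sum(X**2)
    comp_ss = [
        np.sum(np.outer(scores[:, a], loadings[:, a]) ** 2)
        for a in range(scores.shape[1])
    ]
    return np.array(comp_ss) / total_ss


rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
X -= X.mean(axis=0)
# Scores and loadings taken from an SVD purely to keep the example short.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
T, P = U * S, Vt.T
ratios = explained_variance_ratio(X, T, P)
```

With all components included, the ratios sum to one; with fewer components the remainder is the unexplained share.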

inverse_transform(X: ndarray) ndarray

Approximate original data from scores.

Parameters:

X (np.ndarray) – An array containing the scores.

Raises:
  • NotFittedError – PCA model has not been fit yet.

  • ValueError – Shape of provided scores does not match n_components in model.

Returns:

The approximation of the original data.

Return type:

np.ndarray
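In matrix terms, approximating the data from scores amounts to multiplying the scores by the transposed loadings. The sketch below assumes mean-centered data and orthonormal loadings, and the helper name inverse_transform_sketch is hypothetical; it is not the package's code.

```python
import numpy as np


def inverse_transform_sketch(scores, loadings):
    """Approximate the (mean-centered) data as scores @ loadings.T."""
    if scores.shape[1] != loadings.shape[1]:
        raise ValueError("Number of score columns does not match the loadings.")
    return scores @ loadings.T


P = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # model plane: x-y plane
X = np.array([[1.0, 2.0, 0.0],
              [3.0, 4.0, 5.0]])
T = X @ P                               # scores of the two observations
X_hat = inverse_transform_sketch(T, P)
```

The first row lies in the model plane and is recovered exactly; the second loses its out-of-plane part, which is the portion measured by the out-of-model distance.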

set_components(n_component: int, verbose: bool = False)

Method for setting the number of components in an already-constructed model. It checks to make sure that loadings exist for all of the set components and will fit extras if not. Note that in case of decreasing the number of components, previously fitted components are internally stored. In case you prefer a clean model, create a new model object and fit it with the desired number of components.

Parameters:
  • n_component (int) – the desired number of components.

  • verbose (bool) – Whether or not to print out additional convergence information. Defaults to False.

Raises:
  • TypeError – if n_component is not an int

  • ValueError – if n_component < 1

set_fit_request(*, verbose: bool | None | str = '$UNCHANGED$') NipalsPCA

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, method: bool | None | str = '$UNCHANGED$') NipalsPCA

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

method (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for method parameter in transform.

Returns:

self – The updated object.

Return type:

object

transform(X: ndarray, method: str = 'naive') ndarray

This function takes an input array and projects it based on a fitted model.

Parameters:
  • X (np.ndarray) – The n x m input array in the original feature space to be projected.

  • method (str, optional) – The method to use for the projection. See the reference listed in the module docstring. Valid options are {‘naive’, ‘projection’, ‘conditional_mean’}. Defaults to ‘naive’.

Raises:
  • NotFittedError – If model has not been fit yet (no loadings).

  • ValueError – Method ‘conditional_mean’ is selected but fit_data is not available.

Returns:

The corresponding scores.

Return type:

np.ndarray
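The idea behind projecting an incomplete observation onto the model plane, as discussed in the Nelson et al. reference, can be sketched as a least-squares fit that uses only the observed variables. The helper project_with_missing is hypothetical and is not claimed to match the package's ‘projection’ option in detail.

```python
import numpy as np


def project_with_missing(x, loadings):
    """Scores for one observation x (length m) that may contain NaNs."""
    observed = ~np.isnan(x)
    P_obs = loadings[observed, :]   # loadings rows of the observed variables
    # Least-squares fit of the observed values on the reduced loadings.
    t, *_ = np.linalg.lstsq(P_obs, x[observed], rcond=None)
    return t


rng = np.random.default_rng(3)
P, _ = np.linalg.qr(rng.normal(size=(4, 2)))  # orthonormal loadings: 4 features, 2 LVs
t_true = np.array([1.5, -0.5])
x = P @ t_true      # an observation lying exactly in the model plane
x[2] = np.nan       # one measurement is missing
t_est = project_with_missing(x, P)
```

Because the example observation lies exactly in the model plane, the scores are recovered despite the missing value; with noisy data the estimate is only approximate.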

open_nipals.nipalsPLS module

Algorithm implemented from Chapter 6 of Chiang, Leo H., Evan L. Russell, and Richard D. Braatz. Fault detection and diagnosis in industrial systems. Springer Science & Business Media, 2000.

Alternative algorithm derivation from: Geladi, P.; Kowalski, B. R. Partial Least-Squares Regression: A Tutorial. Analytica Chimica Acta 1986, 185, 1–17. https://doi.org/10.1016/0003-2670(86)80028-9.

For the transformation part also see: Nelson, P. R. C.; Taylor, P. A.; MacGregor, J. F. Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems 1996, 35(1), 45-65.

(C) 2020-2021: Ryan Wall (lead), David Ochsenbein, YBaranwal; revised 2024: Niels Schlusser

class open_nipals.nipalsPLS.NipalsPLS(n_components: int = 2, max_iter: int = 10000, tol_criteria: float = 1e-06, mean_centered: bool = True, force_include: bool = False)

Bases: BaseEstimator, TransformerMixin, RegressorMixin

A custom-built class that performs PLS with the NIPALS algorithm, i.e., the same algorithm used in SIMCA.

Attributes:

n_components : int

The number of principal components.

max_iter : int

The maximum number of iterations for the fitting step.

tol_criteria : float

The convergence tolerance criterion.

mean_centered : bool

Whether or not the original data is mean-centered.

force_include : bool

If True, observations whose y-block consists entirely of NaNs are still included. Defaults to False.

fit_data_x : np.ndarray

The X data used to fit the model.

fit_data_y : np.ndarray

The y data used to fit the model.

loadings_x : np.ndarray

The X loadings vectors of the PLS model.

loadings_y : np.ndarray

The y loadings vectors of the PLS model.

fit_scores_x : np.ndarray

The fitted X scores of the PLS model.

fit_scores_y : np.ndarray

The fitted y scores of the PLS model.

regression_matrix : np.ndarray

The regression matrix of the PLS model.

fitted_components : int

The number of latent variables (LVs) currently in the model (0 if not yet fitted).

explained_variance_ratio_ : np.ndarray

The explained variance ratios per fitted component.

Methods:

transform

Transform input data to scores.

fit

Fit PLS model on input data.

fit_transform

Fit PLS model on input data, then transform said data to scores.

inverse_transform

Obtain approximation of X input data given fitted model and X scores.

calc_imd

Calculate within-model distance.

calc_oomd

Calculate out-of-model distance.

predict

Obtain prediction for y data given model and X data.

set_components

Change the number of model components.

get_reg_vector

Give regression vector of the model.

get_explained_variance_ratio

Calculate explained variances as ratio of total explained variance.
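A compact sketch of NIPALS PLS in the spirit of the Geladi and Kowalski tutorial, for complete, mean-centered data. It is not the package's implementation: the function name nipals_pls, the start vector, and the unnormalised Y loadings are assumptions made for the example.

```python
import numpy as np


def nipals_pls(X, Y, n_components=2, max_iter=10000, tol=1e-6):
    """Sketch of NIPALS PLS on complete, mean-centered X (n x m) and Y (n x k)."""
    X_res, Y_res = np.array(X, dtype=float), np.array(Y, dtype=float)
    m, k = X_res.shape[1], Y_res.shape[1]
    W = np.zeros((m, n_components))   # X weights
    P = np.zeros((m, n_components))   # X loadings
    Q = np.zeros((k, n_components))   # Y loadings
    for a in range(n_components):
        u = Y_res[:, 0].copy()        # assumed start: first Y column
        t_old = np.zeros(X_res.shape[0])
        for _ in range(max_iter):
            w = X_res.T @ u / (u @ u)
            w /= np.linalg.norm(w)
            t = X_res @ w             # X score
            q = Y_res.T @ t / (t @ t)
            u = Y_res @ q / (q @ q)   # Y score
            if np.linalg.norm(t - t_old) < tol * np.linalg.norm(t):
                break
            t_old = t
        p = X_res.T @ t / (t @ t)
        X_res -= np.outer(t, p)       # deflate X
        Y_res -= np.outer(t, q)       # deflate Y
        W[:, a], P[:, a], Q[:, a] = w, p, q
    # Regression matrix such that Y_hat = X @ B.
    B = W @ np.linalg.inv(P.T @ W) @ Q.T
    return W, P, Q, B


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X -= X.mean(axis=0)
Y = X @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(40, 2))
Y -= Y.mean(axis=0)
W, P, Q, B = nipals_pls(X, Y, n_components=3)
```

When the number of components equals the number of (linearly independent) X columns, as here, the regression matrix coincides with the ordinary least-squares solution; using fewer components yields a regularised model.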

calc_imd(input_scores: array | None = None, input_array: array | None = None, metric: str = 'HotellingT2', covariance: str = 'diag')

Calculate the in-model distance (IMD) of observations. This will take in an input array OR scores and return Hotelling’s T2 value for each row. In theory you could expand to include a Y-block in-model distance, but the value is limited for the typical use cases.

Parameters:
  • input_scores (Optional[np.array], optional) – Scores array. Defaults to None.

  • input_array (Optional[np.array], optional) – Data array. Defaults to None.

  • metric (str, optional) – In-model-distance to compute. Must be one of set {‘HotellingT2’}. Defaults to ‘HotellingT2’.

  • covariance (str, optional) – Method to compute covariance. Valid options are {‘diag’, ‘full’, ‘ledoit_wolf’}. Defaults to ‘diag’ (quick version). ‘full’ uses the entire covariance matrix computed by numpy. ‘ledoit_wolf’ uses the full covariance matrix computed by Ledoit-Wolf shrinkage.

Raises:
  • NotFittedError – If model has not been fit.

  • ValueError – If neither scores nor data are provided.

  • ValueError – If the shape of the input scores does not match the model.

  • NotImplementedError – If unknown metric was requested.

Returns:

The within-model distance(s).

Return type:

float

calc_oomd(input_array: array, metric: str = 'QRes') array

Calculate the out-of-model distance (OOMD) of observations. In theory this can be used for the Y-block, but its value in typical use is limited.

Parameters:
  • input_array (np.array) – The X input data for which to calculate the OOMD.

  • metric (str, optional) – The metric to compute. Supported metrics are: {‘QRes’, ‘DModX’}. Defaults to ‘QRes’.

Raises:

ValueError – If input metric is unknown.

Returns:

The out-of-model distance(s).

Return type:

np.array

property explained_variance_ratio_: np.ndarray, np.ndarray

Calculate the explained variance ratios for the X and y arrays per fitted component.

Parameters:
  • in_x_data (np.array, optional) – Alternative input X data. Defaults to None.

  • in_y_data (np.array, optional) – Alternative input y data. Defaults to None.

Raises:
  • ValueError – If in_x_data not mean centered.

  • ValueError – If in_y_data not mean centered.

Returns:

explained variance ratios for X and y

Return type:

(np.ndarray, np.ndarray)

fit(X: array, y: array, verbose: bool = False) NipalsPLS

Fit the PLS model to X/Y data.

Parameters:
  • X (np.array) – Input X data.

  • y (np.array) – Input Y data.

  • verbose (bool, optional) – Turn verbosity on and off. Defaults to False.

Raises:

NotFittedError – Model has not yet been fit

Returns:

A reference to the object.

Return type:

NipalsPLS

fit_transform(X: array, y: array) Tuple[array, array]

Combine fit and transform methods into one command, sklearn style.

Parameters:
  • X (np.array) – The X-data.

  • y (np.array) – The Y-data.

Raises:

ValueError – If attempt to use fit_transform on already fitted data.

Returns:

A tuple containing the fitted scores for X and Y.

Return type:

Tuple[np.array, np.array]

property fitted_components: int

Get the total number of LVs in the model. This may differ from self.n_components, which is the number of components the model currently uses.

Returns:

number of fitted components

Return type:

int

get_explained_variance_ratio(in_x_data: np.array = None, in_y_data: np.array = None)

Calculate the explained variance ratios for the X and y arrays per fitted component.

Parameters:
  • in_x_data (np.array, optional) – Alternative input X data. Defaults to None.

  • in_y_data (np.array, optional) – Alternative input y data. Defaults to None.

Raises:
  • ValueError – If in_x_data not mean centered.

  • ValueError – If in_y_data not mean centered.

Returns:

explained variance ratios for X and y

Return type:

(np.ndarray, np.ndarray)

get_reg_vector() array

Give the user the regression vector for the model.

Raises:

NotFittedError – If the model has not been fit.

Returns:

The regression vector.

Return type:

np.array

inverse_transform(X: array) array

Given a set of scores, return the simulated data.

Parameters:

X (np.array) – The scores to transform back.

Raises:
  • NotFittedError – If model has not been fit.

  • ValueError – If the shape of the input scores does not match the model.

Returns:

The simulated data.

Return type:

np.array

predict(X: array = None, scores_x: array = None) array

Predict y from data or scores.

Parameters:
  • X (np.array, optional) – Input X data. Defaults to None.

  • scores_x (np.array, optional) – Input scores. Defaults to None.

Raises:
  • NotFittedError – If model has not been fit.

  • ValueError – If neither data nor scores are provided.

Returns:

The predicted y-values.

Return type:

np.array

set_components(n_component: int, verbose: bool = False)

Method for setting the number of components in an already-constructed model. It checks to make sure that loadings exist for all of the set components and will fit extras if not. Note that in case of decreasing the number of components, previously fitted components are internally stored. In case you prefer a clean model, create a new model object and fit it with the desired number of components.

Parameters:
  • n_component (int) – the desired number of components.

  • verbose (bool) – Whether or not to print out additional convergence information. Defaults to False.

Raises:
  • TypeError – if n_component is not an int

  • ValueError – if n_component < 1

set_fit_request(*, verbose: bool | None | str = '$UNCHANGED$') NipalsPLS

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, scores_x: bool | None | str = '$UNCHANGED$') NipalsPLS

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

scores_x (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for scores_x parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') NipalsPLS

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

transform(X: array, y: array | None = None) array | Tuple[array, array]

Compute scores using model.

Parameters:
  • X (np.array) – X-data.

  • y (np.array, optional) – Y-data. Defaults to None.

Raises:

NotFittedError – If model is not fit.

Returns:

Either the scores for X (when no Y-data is provided) or a tuple of two scores-arrays.

Return type:

Union[np.array, Tuple[np.array, np.array]]

open_nipals.utils module

Code for pipelining a data arrangement step in sklearn. Allows for StandardScaler() followed by other methods.

(c) 2020: Ryan Wall, revised 2025: Calvin Ristad, Niels Schlusser

class open_nipals.utils.ArrangeData(var_dict: dict | DataFrame = None)

Bases: TransformerMixin

The ArrangeData class creates a sklearn-style transformer object that puts the columns of a dataframe in the correct order.

Attributes:

var_dict : dict

A dictionary indicating which column should be in which position.

Methods:

var_dict_from_df

Infer var_dict from template dataframe.

fit_transform

Concatenation of fit and transform.

fit

Applies var_dict_from_df in a manner consistent with sklearn nomenclature.

transform

Takes a new dataframe, or an np.array plus a variable dictionary, and arranges it to be consistent with the stored var_dict.

fit(input_data: DataFrame | ndarray, input_var_dict: dict | None = None)

The function which will fit the var_dict object.

Parameters:
  • input_data (Union[pd.DataFrame, np.ndarray]) – The input data to use for fitting.

  • input_var_dict (Optional[dict], optional) – Required if input_data is an array, this dictionary contains the headers as {column Name: column index}.

Raises:
  • ValueError – input_data is a numpy array and input_var_dict is missing.

  • ValueError – input_data is a numpy array and its shape does not match input_var_dict.

  • ValueError – An unknown error occurred.

fit_transform(input_data: DataFrame | ndarray, input_var_dict: dict | None = None) DataFrame

Fit a data/column model and then transform data.

Parameters:
  • input_data (Union[pd.DataFrame, np.ndarray]) – The input data to transform. Note that if this is a dataframe it _will_ be used to fit var_dict. This might not be what you want.

  • input_var_dict (Optional[dict], optional) – The var_dict to be used to fit. Defaults to None.

Returns:

transformed input frame

Return type:

pd.DataFrame

transform(input_data: DataFrame | ndarray, input_var_dict: dict | None = None) ndarray

Transform input data based on stored data model.

Parameters:
  • input_data (Union[pd.DataFrame, np.ndarray]) – The input_data to transform.

  • input_var_dict (Optional[dict], optional) – Required if input_data is an array, this dictionary contains the headers as {column Name: column index}. Defaults to None.

Raises:
  • ValueError – ArrangeData object has not yet been fit.

  • ValueError – input_data is a numpy array and no input_var_dict has been provided.

  • ValueError – input_data is a numpy array and its shape does not match input_var_dict.

Returns:

The transformed (column-rearranged) data.

Return type:

np.ndarray
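The column rearrangement for array input can be sketched as follows. The helper arrange_columns and the column names are hypothetical, and the sketch assumes every column named in the stored var_dict is present in the new data; how the package treats missing or extra columns is not specified here.

```python
import numpy as np


def arrange_columns(data, input_var_dict, var_dict):
    """Reorder the columns of `data` so that they follow `var_dict`.

    Both dictionaries map column name -> column index.
    """
    if data.shape[1] != len(input_var_dict):
        raise ValueError("Shape of data does not match input_var_dict.")
    # Walk through the stored order and look up where each column sits now.
    ordered_names = sorted(var_dict, key=var_dict.get)
    order = [input_var_dict[name] for name in ordered_names]
    return data[:, order]


# Stored order from fitting (hypothetical column names).
var_dict = {"temp": 0, "pressure": 1, "flow": 2}
# New data arrives with its columns in a different order.
new_var_dict = {"flow": 0, "temp": 1, "pressure": 2}
new_data = np.array([[10.0, 1.0, 5.0],
                     [20.0, 2.0, 6.0]])
arranged = arrange_columns(new_data, new_var_dict, var_dict)
```

After the call, the columns of arranged follow the stored order (temp, pressure, flow), so a downstream model sees each feature in the position it was fitted on.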

var_dict_from_df(input_df: DataFrame) dict

Infer the var_dict from a template dataframe.

Parameters:

input_df (pd.DataFrame) – A dataframe that has the desired order of columns.

Returns:

The corresponding var_dict. {column Name: column index}

Return type:

dict