Package documentation open_nipals
open_nipals.nipalsPCA module
Code for calculating the PCA Loadings and Scores using NIPALS algorithm.
One of the most concise definitions can be found in this paper on page 7: Geladi, P.; Kowalski, B. R. Partial Least-Squares Regression: A Tutorial. Analytica Chimica Acta 1986, 185, 1–17. https://doi.org/10.1016/0003-2670(86)80028-9.
For the transformation part also see: Nelson, P. R. C.; Taylor, P. A.; MacGregor, J. F. Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems 1996, 35(1), 45-65.
(c) 2020-2021: Ryan Wall (lead), David Ochsenbein, revised 2024: Niels Schlusser
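As a rough illustration of the algorithm referenced above, the following is a minimal sketch of NIPALS PCA for complete, mean-centered data. It is not the package's implementation: the function name nipals_pca, the choice of starting vector, and the convergence test are assumptions made for this example, and missing-data handling is omitted.

```python
import numpy as np


def nipals_pca(X, n_components=2, max_iter=10000, tol=1e-6):
    """Sketch of NIPALS PCA on complete, mean-centered data (hypothetical helper)."""
    X_res = np.array(X, dtype=float)      # residual matrix, deflated after each component
    n, m = X_res.shape
    scores = np.zeros((n, n_components))
    loadings = np.zeros((m, n_components))
    for a in range(n_components):
        # Assumed start: the residual column with the largest variance.
        t = X_res[:, np.argmax(X_res.var(axis=0))].copy()
        for _ in range(max_iter):
            p = X_res.T @ t / (t @ t)     # regress the columns of X on the score vector
            p /= np.linalg.norm(p)        # normalize the loading to unit length
            t_new = X_res @ p             # project the rows of X onto the loading
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        scores[:, a] = t
        loadings[:, a] = p
        X_res = X_res - np.outer(t, p)    # deflate: remove the fitted component
    return scores, loadings


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)                    # mean-center before fitting
T, P = nipals_pca(X, n_components=2)
```

In this sketch the loadings come out orthonormal and the scores equal X @ P, which is the property the distance calculations further down rely on.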
- class open_nipals.nipalsPCA.NipalsPCA(n_components: int = 2, max_iter: int = 10000, tol_criteria: float = 1e-06, mean_centered: bool = True)
Bases: BaseEstimator, TransformerMixin
A custom-built class for PCA using the NIPALS algorithm, i.e., the same algorithm used in SIMCA.
Attributes:
- n_components: int
The number of principal components.
- max_iter: int
The max number of iterations for the fitting step.
- tol_criteria: float
The convergence tolerance criterion.
- loadings: np.ndarray
The loadings vectors of the PCA model.
- fit_scores: np.ndarray
The fitted scores of the PCA model.
- fit_data: np.ndarray
The data used to fit the model.
- mean_centered: bool
Whether or not the original data is mean-centered.
- fitted_components: int
The number of current LVs in the model (0 if not fitted yet).
- explained_variance_ratio_: np.ndarray
The explained variance ratios per fitted component.
Methods:
- transform
Transform input data to scores.
- fit
Fit PCA model on input data.
- fit_transform
Fit PCA model on input data, then transform said data to scores.
- inverse_transform
Obtain approximation of input data given fitted model and scores.
- calc_imd
Calculate within-model distance.
- calc_oomd
Calculate out-of-model distance.
- calc_limit
Calculate suitable distance threshold given fitted data.
- set_components
Change the number of model components.
- get_explained_variance_ratio
Calculate explained variances as ratio of total explained variance.
- calc_imd(input_scores: ndarray | None = None, input_array: ndarray | None = None, metric: str = 'HotellingT2', covariance: str = 'diag') → ndarray
Calculate in-model distance (IMD) of observations. This is the distance from the center of the hyperplane to the projected observation.
- Parameters:
input_scores (Optional[np.ndarray], optional) – The scores from which to calculate the distance. Defaults to None.
input_array (Optional[np.ndarray], optional) – The input data in original space from which to calculate the distance. Defaults to None.
metric (str, optional) – The metric to use. Valid options are {‘HotellingT2’}. Defaults to ‘HotellingT2’.
covariance (str, optional) – Method to compute covariance. Valid options are {‘diag’, ‘full’, ‘ledoit_wolf’}. Defaults to ‘diag’ (quick version). ‘full’ uses the entire covariance matrix computed by numpy. ‘ledoit_wolf’ uses the full covariance matrix computed by Ledoit-Wolf shrinkage.
- Raises:
NotFittedError – Model has not been fit yet.
ValueError – Neither scores nor input data provided.
ValueError – Input scores are inconsistent with n_components of model.
NotImplementedError – Any metric that has not yet been implemented.
- Returns:
The calculated within-model distance for each observation (row).
- Return type:
np.ndarray
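To make the ‘diag’ option concrete, here is a minimal sketch of Hotelling's T2 that uses only the per-component variance of the fitted scores. The helper name hotelling_t2_diag and the use of ddof=1 are assumptions of this example; the package may normalize differently.

```python
import numpy as np


def hotelling_t2_diag(scores, fit_scores):
    """T2 per row, scaling each component by the variance of the fitted scores."""
    var = fit_scores.var(axis=0, ddof=1)   # ddof=1 is an assumption of this sketch
    return np.sum(scores**2 / var, axis=1)


fit_scores = np.array([[2.0, 1.0], [-2.0, -1.0], [0.0, 1.0], [0.0, -1.0]])
t2 = hotelling_t2_diag(fit_scores, fit_scores)
# t2 is [2.25, 2.25, 0.75, 0.75]
```

A ‘full’ variant would replace the per-component variances with the inverse of the whole score covariance matrix, which matters when the scores are correlated.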
- calc_limit(metric: str = 'HotellingT2', n: int | None = None, num_lvs: int | None = None, m: int | None = None, alpha: float = 0.95) → float
Calculate the limits for IMD and OOMD. Assumptions on the distribution shape underpin this calculation; in practice limits should be judged by the user.
- Parameters:
metric (str, optional) – The metric to use. Valid options are {‘HotellingT2’, ‘DModX’}. Defaults to ‘HotellingT2’.
n (Optional[int], optional) – The number of observations. Defaults to None, in which case the number of rows (n) of the fitted scores is used.
num_lvs (Optional[int], optional) – The number of latent variables (principal components). Defaults to None, in which case the number of components of the fitted scores is used.
m (Optional[int], optional) – The number of original features. Defaults to None, in which case the number of features in the original data/fitted loadings is used.
alpha (float, optional) – The confidence value to use. Defaults to 0.95.
- Returns:
The limit threshold for the given metric and confidence value.
- Return type:
float
- calc_oomd(input_array: ndarray, metric: str = 'QRes') → ndarray
Calculate the out-of-model distance (OOMD) of observations.
- Parameters:
input_array (np.ndarray) – The data for which to calculate the OOMD.
metric (str, optional) – The metric to use. Valid options are {‘QRes’, ‘DModX’}. Defaults to ‘QRes’.
- Raises:
NotImplementedError – Unknown metric.
- Returns:
The distances for the provided observations.
- Return type:
np.ndarray
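The ‘QRes’ idea can be sketched as the sum of squared residuals left after projecting each observation onto the model plane. This is a simplified illustration assuming complete data and orthonormal loadings; the helper name q_residual is hypothetical, and the ‘DModX’ normalization is not shown.

```python
import numpy as np


def q_residual(X, loadings):
    """Sum of squared residuals per row after projection onto the loadings."""
    scores = X @ loadings                  # valid for orthonormal loadings, complete data
    residual = X - scores @ loadings.T     # part of X the model does not capture
    return np.sum(residual**2, axis=1)


P = np.array([[1.0], [0.0], [0.0]])        # one component along the first feature
X = np.array([[3.0, 0.0, 0.0],             # lies exactly on the model plane
              [1.0, 2.0, 2.0]])            # has off-plane parts in features 2 and 3
q = q_residual(X, P)
# q is [0.0, 8.0]
```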
- property explained_variance_ratio_: ndarray
Calculate the explained variance ratios per fitted component.
- Parameters:
in_data (np.array, optional) – Alternative input data. Defaults to None.
- Raises:
ValueError – if in_data not mean centered.
- Returns:
explained variances
- Return type:
np.ndarray
- fit(X: ndarray, verbose: bool = False) → NipalsPCA
Fits PCA model to input data.
- Parameters:
X (np.ndarray) – The input data to fit on.
verbose (bool, optional) – Whether or not to print out additional convergence information. Defaults to False.
- Returns:
A reference to the object.
- Return type:
NipalsPCA
- fit_transform(X: ndarray, verbose: bool = False) → ndarray
Fit, then transform input data. This function is equivalent to:
>>> P = NipalsPCA()
>>> P.fit(X)
>>> T = P.transform(X)
- Parameters:
X (np.ndarray) – The input data to fit on and to transform.
verbose (bool, optional) – Whether or not to print out additional convergence information. Defaults to False.
- Raises:
ValueError – Model has already been fit.
- Returns:
The corresponding scores.
- Return type:
np.ndarray
- property fitted_components: int
Get the total number of LVs fitted in the model. This may differ from self.n_components, which is the number of components used by the model.
- Returns:
Number of fitted components
- Return type:
int
- get_explained_variance_ratio(in_data: array = None) → ndarray
Calculate the explained variance ratios per fitted component.
- Parameters:
in_data (np.array, optional) – Alternative input data. Defaults to None.
- Raises:
ValueError – if in_data not mean centered.
- Returns:
explained variances
- Return type:
np.ndarray
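One common way to obtain such ratios, sketched below, is to divide the sum of squares captured by each rank-one term (score vector times loading vector) by the total sum of squares of the mean-centered data. Whether the package uses exactly this normalization is an assumption; the helper name is hypothetical.

```python
import numpy as np


def explained_variance_ratio(X, scores, loadings):
    """Fraction of the total sum of squares of X captured by each component."""
    total_ss = np.sum(X**2)
    # Squared Frobenius norm of t_a p_a^T equals |t_a|^2 * |p_a|^2.
    component_ss = np.sum(scores**2, axis=0) * np.sum(loadings**2, axis=0)
    return component_ss / total_ss


X = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])  # mean-centered
P = np.eye(2)                    # loadings aligned with the two features
T = X @ P                        # corresponding scores
ratios = explained_variance_ratio(X, T, P)
# ratios is [0.9, 0.1]
```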
- inverse_transform(X: ndarray) → ndarray
Approximate original data from scores.
- Parameters:
X (np.ndarray) – An array containing the scores.
- Raises:
NotFittedError – PCA model has not been fit yet.
ValueError – Shape of provided scores does not match n_components in model.
- Returns:
The approximation of the original data.
- Return type:
np.ndarray
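Conceptually, the approximation is the product of the scores and the transposed loadings. The sketch below shows this together with the shape check described under Raises; the function name is hypothetical and any re-scaling of the data is left out.

```python
import numpy as np


def inverse_transform_sketch(scores, loadings):
    """Approximate the (centered) data as scores @ loadings.T."""
    if scores.shape[1] != loadings.shape[1]:
        raise ValueError("scores must have one column per model component")
    return scores @ loadings.T


P = np.array([[0.6], [0.8]])      # one unit-length loading vector, two features
T = np.array([[5.0], [10.0]])     # scores of two observations
X_hat = inverse_transform_sketch(T, P)
# X_hat is [[3.0, 4.0], [6.0, 8.0]]
```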
- set_components(n_component: int, verbose: bool = False)
Method for setting the number of components in an already-constructed model. It checks to make sure that loadings exist for all of the set components and will fit extras if not. Note that in case of decreasing the number of components, previously fitted components are internally stored. In case you prefer a clean model, create a new model object and fit it with the desired number of components.
- Parameters:
n_component (int) – the desired number of components.
verbose (bool) – Whether or not to print out additional convergence information. Defaults to False.
- Raises:
TypeError – if n_component is not an int
ValueError – if n_component < 1
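The caching behavior described above can be pictured with a small stand-in class: every fitted component is kept, and n_components only selects how many of them are active. The class ComponentCache and its placeholder fitting step are purely hypothetical and do not reflect the package internals.

```python
class ComponentCache:
    """Hypothetical stand-in: keeps every fitted component, exposes the first n."""

    def __init__(self, n_components=2):
        self.n_components = n_components
        self._fitted = []                       # one entry per fitted component

    def _fit_one(self):
        # Placeholder for fitting one additional component.
        self._fitted.append(f"component_{len(self._fitted) + 1}")

    def set_components(self, n_component):
        if not isinstance(n_component, int):
            raise TypeError("n_component must be an int")
        if n_component < 1:
            raise ValueError("n_component must be >= 1")
        while len(self._fitted) < n_component:  # fit extras only if they are missing
            self._fit_one()
        self.n_components = n_component

    @property
    def active(self):
        # Components currently used by the model.
        return self._fitted[: self.n_components]


model = ComponentCache(n_components=2)
model.set_components(3)   # fits three components
model.set_components(1)   # shrinks the active set, keeps the fitted ones stored
```

After the second call, one component is active while three remain stored, which is why a fresh model object is the way to get a clean state.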
- set_fit_request(*, verbose: bool | None | str = '$UNCHANGED$') → NipalsPCA
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_transform_request(*, method: bool | None | str = '$UNCHANGED$') → NipalsPCA
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
method (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for method parameter in transform.
- Returns:
self – The updated object.
- Return type:
object
- transform(X: ndarray, method: str = 'naive') → ndarray
Project an input array onto a fitted model.
- Parameters:
X (np.ndarray) – The n x m input array in the original feature space to be projected.
method (str, optional) – The method to use for the projection. See reference listed in module docstring. Valid options are {‘naive’, ‘projection’, ‘conditional_mean’}. Defaults to ‘naive’.
- Raises:
NotFittedError – If model has not been fit yet (no loadings).
ValueError – Method ‘conditional_mean’ is selected but fit_data is not available.
- Returns:
The corresponding scores.
- Return type:
np.ndarray
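For rows with missing values, one approach discussed in the Nelson et al. reference is to estimate the scores by least squares using only the observed features. The sketch below illustrates that idea; it is an assumption that the ‘projection’ option works along these lines, and the helper name project_with_missing is hypothetical.

```python
import numpy as np


def project_with_missing(x, loadings):
    """Least-squares scores for one row x, using only its observed (non-NaN) entries."""
    observed = ~np.isnan(x)
    P_obs = loadings[observed, :]          # loading rows of the observed features
    t, *_ = np.linalg.lstsq(P_obs, x[observed], rcond=None)
    return t


P = np.array([[0.6], [0.8], [0.0]])        # one unit-length loading vector
x = np.array([3.0, np.nan, 0.0])           # true row is 5 * P[:, 0]; second entry missing
t = project_with_missing(x, P)
zero_filled = np.nan_to_num(x) @ P         # simply replacing NaN with 0 shrinks the score
```

Here the least-squares estimate recovers the true score of 5, while zero-filling the gap yields 1.8, which shows why the choice of method matters for incomplete observations.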
open_nipals.nipalsPLS module
Algorithm implemented from Chapter 6 of Chiang, Leo H., Evan L. Russell, and Richard D. Braatz. Fault detection and diagnosis in industrial systems. Springer Science & Business Media, 2000.
Alternative algorithm derivation from: Geladi, P.; Kowalski, B. R. Partial Least-Squares Regression: A Tutorial. Analytica Chimica Acta 1986, 185, 1–17. https://doi.org/10.1016/0003-2670(86)80028-9.
For the transformation part also see: Nelson, P. R. C.; Taylor, P. A.; MacGregor, J. F. Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems 1996, 35(1), 45-65.
(C) 2020-2021: Ryan Wall (lead), David Ochsenbein, YBaranwal, revised 2024: Niels Schlusser
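For orientation, the following is a minimal sketch of NIPALS PLS in the special case of a single, mean-centered response and complete data, where no inner iteration is needed. It is not the package's implementation: the function name nipals_pls1 and the variable names are assumptions, and the multi-response case and missing-data handling are omitted.

```python
import numpy as np


def nipals_pls1(X, y, n_components):
    """Sketch of NIPALS PLS for one mean-centered response vector y."""
    X_res = np.array(X, dtype=float)       # X residual, deflated per component
    y_res = np.array(y, dtype=float)       # y residual, deflated per component
    m = X_res.shape[1]
    W = np.zeros((m, n_components))        # X weights
    P = np.zeros((m, n_components))        # X loadings
    q = np.zeros(n_components)             # y loadings
    for a in range(n_components):
        w = X_res.T @ y_res                # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = X_res @ w                      # X scores
        tt = t @ t
        p = X_res.T @ t / tt
        q[a] = y_res @ t / tt
        X_res = X_res - np.outer(t, p)     # deflate X
        y_res = y_res - q[a] * t           # deflate y
        W[:, a], P[:, a] = w, p
    # Regression vector in the (centered) original feature space.
    return W @ np.linalg.solve(P.T @ W, q)


rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
y -= y.mean()
b = nipals_pls1(X, y, n_components=3)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # reference: ordinary least squares
```

With as many components as features and a full-rank X, the PLS regression vector coincides with the ordinary least-squares solution, which makes for a convenient sanity check; predictions then follow as X @ b.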
- class open_nipals.nipalsPLS.NipalsPLS(n_components: int = 2, max_iter: int = 10000, tol_criteria: float = 1e-06, mean_centered: bool = True, force_include: bool = False)
Bases: BaseEstimator, TransformerMixin, RegressorMixin
A custom-built class for PLS using the NIPALS algorithm, i.e., the same algorithm used in SIMCA.
Attributes:
- n_components: int
The number of principal components.
- max_iter: int
The max number of iterations for the fitting step.
- tol_criteria: float
The convergence tolerance criterion.
- mean_centered: bool
Whether or not the original data is mean-centered.
- force_include: bool
True forces inclusion of observations whose y-block is all NaN. Defaults to False.
- fit_data_x: np.ndarray
The X data used to fit the model.
- fit_data_y: np.ndarray
The y data used to fit the model.
- loadings_x: np.ndarray
The X loadings vectors of the PLS model.
- loadings_y: np.ndarray
The y loadings vectors of the PLS model.
- fit_scores_x: np.ndarray
The fitted X scores of the PLS model.
- fit_scores_y: np.ndarray
The fitted y scores of the PLS model.
- regression_matrix: np.ndarray
The regression matrix of the PLS model.
- fitted_components: int
The number of current LVs in the model (0 if not fitted yet).
- explained_variance_ratio_: np.ndarray
The explained variance ratios per fitted component.
Methods:
- transform
Transform input data to scores.
- fit
Fit PLS model on input data.
- fit_transform
Fit PLS model on input data, then transform said data to scores.
- inverse_transform
Obtain approximation of X input data given fitted model and X scores.
- calc_imd
Calculate within-model distance.
- calc_oomd
Calculate out-of-model distance.
- predict
Obtain prediction for y data given model and X data.
- set_components
Change the number of model components.
- get_reg_vector
Give regression vector of the model.
- get_explained_variance_ratio
Calculate explained variances as ratio of total explained variance.
- calc_imd(input_scores: array | None = None, input_array: array | None = None, metric: str = 'HotellingT2', covariance: str = 'diag')
Calculate the in-model distance (IMD) of observations. This will take in an input array OR scores and return Hotelling’s T2 value for each row. In theory you could expand to include a Y-block in-model distance, but the value is limited for the typical use cases.
- Parameters:
input_scores (Optional[np.array], optional) – Scores array. Defaults to None.
input_array (Optional[np.array], optional) – Data array. Defaults to None.
metric (str, optional) – In-model-distance to compute. Must be one of set {‘HotellingT2’}. Defaults to ‘HotellingT2’.
covariance (str, optional) – Method to compute covariance. Valid options are {‘diag’, ‘full’, ‘ledoit_wolf’}. Defaults to ‘diag’ (quick version). ‘full’ uses the entire covariance matrix computed by numpy. ‘ledoit_wolf’ uses the full covariance matrix computed by Ledoit-Wolf shrinkage.
- Raises:
NotFittedError – If model has not been fit.
ValueError – If neither scores nor data are provided.
ValueError – If the shape of the input scores does not match the model.
NotImplementedError – If unknown metric was requested.
- Returns:
The within-model distance(s).
- Return type:
float
- calc_oomd(input_array: array, metric: str = 'QRes') → array
Calculate the out-of-model distance (OOMD) of observations. In theory this can be used for the Y-block, but the value in typical use is limited.
- Parameters:
input_array (np.array) – The X input data for which to calculate the OOMD.
metric (str, optional) – The metric to compute. Supported metrics are: {‘QRes’, ‘DModX’}. Defaults to ‘QRes’.
- Raises:
ValueError – If input metric is unknown.
- Returns:
The out-of-model distance(s).
- Return type:
np.array
- property explained_variance_ratio_: np.ndarray, np.ndarray
Calculate the explained variance ratios for X and y arrays per fitted component.
- Parameters:
in_x_data (np.array, optional) – Alternative input X data. Defaults to None.
in_y_data (np.array, optional) – Alternative input y data. Defaults to None.
- Raises:
ValueError – If in_x_data not mean centered.
ValueError – If in_y_data not mean centered.
- Returns:
explained variance ratios for X and y
- Return type:
(np.ndarray, np.ndarray)
- fit(X: array, y: array, verbose: bool = False) → NipalsPLS
Fit the PLS model from X/Y data.
- Parameters:
X (np.array) – Input X data.
y (np.array) – Input Y data.
verbose (bool, optional) – Turn verbosity on and off. Defaults to False.
- Raises:
NotFittedError – Model has not yet been fit
- Returns:
A reference to the object.
- Return type:
NipalsPLS
- fit_transform(X: array, y: array) → Tuple[array, array]
Combine fit and transform methods into one command, sklearn style.
- Parameters:
X (np.array) – The X-data.
y (np.array) – The Y-data.
- Raises:
ValueError – If attempt to use fit_transform on already fitted data.
- Returns:
A tuple containing the fitted scores for X and Y.
- Return type:
Tuple[np.array, np.array]
- property fitted_components: int
Get the total number of LVs fitted in the model. This may differ from self.n_components, which is the number of components used by the model.
- Returns:
number of fitted components
- Return type:
int
- get_explained_variance_ratio(in_x_data: np.array = None, in_y_data: np.array = None)
Calculate the explained variance ratios for X and y arrays per fitted component.
- Parameters:
in_x_data (np.array, optional) – Alternative input X data. Defaults to None.
in_y_data (np.array, optional) – Alternative input y data. Defaults to None.
- Raises:
ValueError – If in_x_data not mean centered.
ValueError – If in_y_data not mean centered.
- Returns:
explained variance ratios for X and y
- Return type:
(np.ndarray, np.ndarray)
- get_reg_vector() → array
Give the user the regression vector for the model.
- Raises:
NotFittedError – If the model has not been fit.
- Returns:
The regression vector.
- Return type:
np.array
- inverse_transform(X: array) → array
Given a set of scores, return the simulated data.
- Parameters:
X (np.array) – The scores to transform back.
- Raises:
NotFittedError – If model has not been fit.
ValueError – If the shape of the input scores does not match the model.
- Returns:
The simulated data.
- Return type:
np.array
- predict(X: array = None, scores_x: array = None) → array
Predict y from data or scores.
- Parameters:
X (np.array, optional) – Input X data. Defaults to None.
scores_x (np.array, optional) – Input scores. Defaults to None.
- Raises:
NotFittedError – If model has not been fit.
ValueError – If neither data nor scores are provided.
- Returns:
The predicted y-values.
- Return type:
np.array
- set_components(n_component: int, verbose: bool = False)
Method for setting the number of components in an already-constructed model. It checks to make sure that loadings exist for all of the set components and will fit extras if not. Note that in case of decreasing the number of components, previously fitted components are internally stored. In case you prefer a clean model, create a new model object and fit it with the desired number of components.
- Parameters:
n_component (int) – the desired number of components.
verbose (bool) – Whether or not to print out additional convergence information. Defaults to False.
- Raises:
TypeError – if n_component is not an int
ValueError – if n_component < 1
- set_fit_request(*, verbose: bool | None | str = '$UNCHANGED$') → NipalsPLS
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, scores_x: bool | None | str = '$UNCHANGED$') → NipalsPLS
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
scores_x (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for scores_x parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → NipalsPLS
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object
- transform(X: array, y: array | None = None) → array | Tuple[array, array]
Compute scores using model.
- Parameters:
X (np.array) – X-data.
y (np.array, optional) – Y-data. Defaults to None.
- Raises:
NotFittedError – If model is not fit.
- Returns:
Either the scores for X (when no Y-data is provided) or a tuple of two scores-arrays.
- Return type:
Union[np.array, Tuple[np.array, np.array]]
open_nipals.utils module
Code for pipelining a data arrangement in sklearn. Allows for StandardScaler() followed by other methods.
2020: Ryan Wall, revised 2025: Calvin Ristad, Niels Schlusser
- class open_nipals.utils.ArrangeData(var_dict: dict | DataFrame = None)
Bases: TransformerMixin
The ArrangeData class creates a sklearn-style transformer object that orders the columns in a dataframe correctly.
Attributes:
- var_dict: dict
A dictionary indicating which column should be in which position.
Methods:
- var_dict_from_df
Infer var_dict from template dataframe.
- fit_transform
Concatenation of fit and transform.
- fit
Applies var_dict_from_df in a manner consistent with sklearn nomenclature.
- transform
Takes a new dataframe, or np.array plus variable dictionary, and arranges it to be consistent with the stored var_dict.
- fit(input_data: DataFrame | ndarray, input_var_dict: dict | None = None)
The function which will fit the var_dict object.
- Parameters:
input_data (Union[pd.DataFrame, np.ndarray]) – The input data to use for fitting.
input_var_dict (Optional[dict], optional) – Required if input_data is an array, this dictionary contains the headers as {column Name: column index}.
- Raises:
ValueError – input_data is a numpy array and input_var_dict is missing.
ValueError – input_data is a numpy array and its shape does not match input_var_dict.
ValueError – An unknown error occurred.
- fit_transform(input_data: DataFrame | ndarray, input_var_dict: dict | None = None) → DataFrame
Fit a data/column model and then transform data.
- Parameters:
input_data (Union[pd.DataFrame, np.ndarray]) – The input data to transform. Note that if this is a dataframe it _will_ be used to fit var_dict. This might not be what you want.
input_var_dict (Optional[dict], optional) – The var_dict to be used to fit. Defaults to None.
- Returns:
transformed input frame
- Return type:
pd.DataFrame
- transform(input_data: DataFrame | ndarray, input_var_dict: dict | None = None) → ndarray
Transform input data based on stored data model.
- Parameters:
input_data (Union[pd.DataFrame, np.ndarray]) – The input_data to transform.
input_var_dict (Optional[dict], optional) – Required if input_data is an array, this dictionary contains the headers as {column Name: column index}. Defaults to None.
- Raises:
ValueError – ArrangeData object has not yet been fit.
ValueError – input_data is a numpy array and no input_var_dict has been provided.
ValueError – input_data is a numpy array and its shape does not match input_var_dict
- Returns:
The transformed (column-rearranged) data.
- Return type:
np.ndarray
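The rearrangement for array input can be pictured as a column permutation driven by two dictionaries of the form {column name: column index}. The sketch below assumes every stored column is present in the input; the function name arrange_columns and the example column names are hypothetical.

```python
import numpy as np


def arrange_columns(data, input_var_dict, var_dict):
    """Reorder the columns of data (described by input_var_dict) to match var_dict."""
    if data.shape[1] != len(input_var_dict):
        raise ValueError("shape of data does not match input_var_dict")
    # Column names in their stored target order.
    target_names = sorted(var_dict, key=var_dict.get)
    # Position of each target column in the incoming array.
    order = [input_var_dict[name] for name in target_names]
    return data[:, order]


var_dict = {"temp": 0, "pressure": 1, "flow": 2}          # stored (fitted) order
input_var_dict = {"flow": 0, "temp": 1, "pressure": 2}    # order of the new data
data = np.array([[30.0, 1.0, 2.0],
                 [31.0, 4.0, 5.0]])
arranged = arrange_columns(data, input_var_dict, var_dict)
# arranged is [[1.0, 2.0, 30.0], [4.0, 5.0, 31.0]]
```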
- var_dict_from_df(input_df: DataFrame) → dict
Infer the var_dict from a template dataframe.
- Parameters:
input_df (pd.DataFrame) – A dataframe that has the desired order of columns.
- Returns:
The corresponding var_dict. {column Name: column index}
- Return type:
dict