Linear Model

The following linear models are available in scikit-learn for regression and classification tasks. In what follows, $y$ is the target variable, $x$ is the feature vector, and $w$ is the weight vector:

  • $y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p$
  • $w_0$ is the `intercept_`
  • $w_1, w_2, \dots, w_p$ are the `coef_`

Linear Regression

Fits a linear model with coefficients $w = (w_1, \dots, w_p)$ to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.

  • $\min_{w} \| X w - y \|_2^2$
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
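A minimal sketch with hypothetical toy data generated from $y = 1 + 2x$, showing how the fitted `intercept_` and `coef_` recover $w_0$ and $w_1$:

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data (assumed for illustration): y = 1 + 2*x, noise-free
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * X.ravel()
lr = LinearRegression().fit(X, y)
print(lr.intercept_)  # ~1.0 (w_0)
print(lr.coef_)       # ~[2.0] (w_1)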

Ridge Regression

Applies L2 regularization to reduce the complexity of the model and prevent overfitting.

  • $\min_{w} \| X w - y \|_2^2 + \alpha \|w\|_2^2$
  • Hyperparameter: $\alpha$
  • if $\alpha = 0$, the model is the same as Linear Regression
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)

RidgeCV selects $\alpha$ by built-in cross-validation over a grid of candidate values:

from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])

Lasso Regression

Applies L1 regularization to reduce the complexity of the model and prevent overfitting.

  • $\min_{w} \| X w - y \|_2^2 + \alpha \|w\|_1$
  • Hyperparameter: $\alpha$
  • if $\alpha = 0$, the model is the same as Linear Regression
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1.0)

Elastic Net Regression

Applies both L1 and L2 regularization to reduce the complexity of the model and prevent overfitting.

  • $\min_{w} \| X w - y \|_2^2 + \alpha \rho \|w\|_1 + \frac{\alpha(1-\rho)}{2} \|w\|_2^2$
  • Hyperparameters: $\alpha$ and `l1_ratio` ($\rho$ in the objective)
  • if $\alpha = 0$, the model is the same as Linear Regression (whatever the value of `l1_ratio`)
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)

Polynomial Regression

Generates polynomial features and fits a linear model to the transformed data.

  • $y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 + \dots$
  • Hyperparameter: `degree`
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

poly = PolynomialFeatures(degree=2)
poly_reg = make_pipeline(poly, LinearRegression())
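As a sketch, fitting the pipeline on hypothetical data following $y = x^2$ shows that the degree-2 features let a linear model capture the curve:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# toy data (assumed for illustration): y = x^2
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = X.ravel() ** 2
poly_reg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_reg.fit(X, y)
print(poly_reg.predict([[5.0]]))  # ~[25.], the quadratic is recovered exactly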

Logistic Regression

Use when you want to predict a binary outcome (0 or 1, yes or no, true or false) given a set of independent variables.

  • $y = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p)}}$
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
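A minimal sketch on hypothetical 1-D data: `predict` returns the hard 0/1 label, while `predict_proba` returns the sigmoid probabilities from the formula above.

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data (assumed for illustration): class 1 for larger x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
log_reg = LogisticRegression().fit(X, y)
print(log_reg.predict([[1.5]]))        # hard class label
print(log_reg.predict_proba([[1.5]]))  # [P(y=0), P(y=1)]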

Stochastic Gradient Descent

Use when you want to train on large datasets; the model is updated incrementally, one sample (or mini-batch) at a time.

  • $w_{t+1} = w_t - \eta \nabla Q_i(w_t)$
  • Hyperparameter: `eta0`, the learning rate
from sklearn.linear_model import SGDClassifier, SGDRegressor
sgd_clf = SGDClassifier()
sgd_reg = SGDRegressor()
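For data that does not fit in memory, `partial_fit` updates the model one mini-batch at a time. A sketch with hypothetical random batches; with `learning_rate='constant'`, `eta0` is the step size $\eta$:

import numpy as np
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(learning_rate='constant', eta0=0.1)
rng = np.random.default_rng(0)
for _ in range(5):
    # hypothetical mini-batches, e.g. streamed from disk or a database
    X_batch = rng.normal(size=(32, 4))
    y_batch = rng.integers(0, 2, size=32)
    # classes must be declared on the first call to partial_fit
    sgd_clf.partial_fit(X_batch, y_batch, classes=[0, 1])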

Bayesian Ridge Regression

Bayesian Ridge Regression is similar to Ridge Regression, but it introduces a prior on the weights $w$.

  • The original algorithm is detailed in the book *Bayesian Learning for Neural Networks*
  • Hyperparameters: `alpha_1`, `alpha_2`, `lambda_1`, `lambda_2`
from sklearn.linear_model import BayesianRidge
bayesian_ridge = BayesianRidge()

Passive Aggressive

Passive Aggressive algorithms are a family of algorithms for large-scale learning. They are similar to the Perceptron in that they do not require a learning rate, but unlike the Perceptron they include a regularization parameter `C`.

from sklearn.linear_model import PassiveAggressiveClassifier, PassiveAggressiveRegressor
passive_aggressive_clf = PassiveAggressiveClassifier()
passive_aggressive_reg = PassiveAggressiveRegressor()

RANSAC Regression

RANSAC (RANdom SAmple Consensus) is an iterative algorithm for the robust estimation of parameters from a subset of inliers from the complete data set.

from sklearn.linear_model import RANSACRegressor
ransac_reg = RANSACRegressor()
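A sketch with hypothetical data on the line $y = 2x$ plus two injected outliers; `inlier_mask_` flags which samples ended up in the consensus set:

import numpy as np
from sklearn.linear_model import RANSACRegressor

# toy data (assumed for illustration): y = 2x with two gross outliers
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()
y[[2, 7]] = 50.0
ransac_reg = RANSACRegressor(random_state=0).fit(X, y)
print(ransac_reg.inlier_mask_)      # False at the injected outliers
print(ransac_reg.estimator_.coef_)  # ~[2.0], fitted on inliers only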


Missing data

Keep in mind that machine learning models cannot work with null values, so they must be replaced with some value.

Drop

If a column or row has many null values, it can be dropped; keep in mind that important information may be lost.

df.dropna()

Impute

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
imputer.fit_transform(X)
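A minimal sketch, assuming a single numeric column: the NaN is replaced by the mean of the observed values.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))  # [[1.], [2.], [3.]]: NaN -> mean of 1 and 3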

Create an indicator column

from sklearn.impute import MissingIndicator
indicator = MissingIndicator()
indicator.fit_transform(X)

Encoding data

Dummy

| color | color_rojo | color_verde | color_azul |
| ----- | ---------- | ----------- | ---------- |
| rojo  | 1          | 0           | 0          |
| verde | 0          | 1           | 0          |
| azul  | 0          | 0           | 1          |
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoder.fit_transform(X)
import pandas as pd
pd.get_dummies(X)
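Applied to the color column above (a sketch; `sparse_output=False` requires scikit-learn >= 1.2, older versions use `sparse=False`). Note that the encoder sorts categories alphabetically, so the column order differs from the table above:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['rojo', 'verde', 'azul']})
encoder = OneHotEncoder(sparse_output=False)  # dense array instead of sparse matrix
print(encoder.fit_transform(df[['color']]))
print(encoder.get_feature_names_out())  # ['color_azul' 'color_rojo' 'color_verde']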

Label

| color | label |
| ----- | ----- |
| rojo  | 1     |
| verde | 2     |
| azul  | 0     |

(codes follow the sorted category order: azul = 0, rojo = 1, verde = 2)
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['rojo', 'verde', 'azul']})
encoder = LabelEncoder()
encoder.fit_transform(df['color'])        # array([1, 2, 0])
df['color'].astype('category').cat.codes  # same codes via pandas

Scaling and Centering Data

StandardScaler

  • $\frac{x - \mu}{\sigma}$
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit_transform(X)

MinMaxScaler

  • $\frac{x - \min}{\max - \min}$
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit_transform(X)

RobustScaler

  • $\frac{x - Q_2}{Q_3 - Q_1}$, where $Q_2$ is the median and $Q_3 - Q_1$ the interquartile range
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaler.fit_transform(X)

Normalizer

  • L1: $\frac{x}{\sum_{i=1}^n |x_i|}$
  • L2: $\frac{x}{\sqrt{\sum_{i=1}^n x_i^2}}$
  • max: $\frac{x}{\max_i |x_i|}$

from sklearn.preprocessing import Normalizer
# L1, L2, max
scaler = Normalizer(norm='l2')
scaler.fit_transform(X)
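Note that, unlike the scalers above, Normalizer works per sample (row), not per feature (column): each row is rescaled to unit norm. A minimal check with a 3-4-5 triangle:

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0]])
print(Normalizer(norm='l2').fit_transform(X))  # [[0.6, 0.8]], unit L2 norm per row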

Feature engineering

PolynomialFeatures

  • $x_1, x_2 \rightarrow x_1^2, x_1 x_2, x_2^2$
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly.fit_transform(X)
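A sketch showing exactly which columns get generated (`get_feature_names_out` requires scikit-learn >= 1.0):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))         # [[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out())  # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']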

Binning

This process is used to discretize continuous variables, that is, to convert them into categorical variables by grouping their values into intervals.

  • $x \rightarrow \{0, 1, 2, \dots, n\}$
from sklearn.preprocessing import KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
discretizer.fit_transform(X)
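A sketch with hypothetical values: with `strategy='uniform'` the bin edges split the observed range into equal-width intervals, and `bin_edges_` exposes them.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [2.0], [9.0]])
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
print(discretizer.fit_transform(X))  # [[0.] [0.] [0.] [2.]]
print(discretizer.bin_edges_)        # per-feature edges over [0, 9]: [0., 3., 6., 9.]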

Feature selection

VarianceThreshold

Removes features whose variance is below a given threshold, i.e., features that are nearly constant and carry little information.
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
selector.fit_transform(X)

SelectKBest

Keeps the $k$ features with the highest scores under a univariate statistical test (note that `chi2` requires non-negative feature values).
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
selector = SelectKBest(chi2, k=2)
selector.fit_transform(X, y)

SelectFromModel

Selects the features whose importance in a fitted estimator (its coefficients or feature importances) exceeds a threshold.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
selector = SelectFromModel(estimator=LogisticRegression())
selector.fit_transform(X, y)

RFE

Recursive Feature Elimination: the estimator is fitted repeatedly, discarding the weakest features at each step until the desired number remains.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
selector = RFE(estimator=LogisticRegression(), n_features_to_select=2)
selector.fit_transform(X, y)
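A sketch on a hypothetical synthetic dataset; `support_` and `ranking_` show which features survived the elimination:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# hypothetical data: 5 features, only 2 of them informative
X, y = make_classification(n_samples=100, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
selector = RFE(estimator=LogisticRegression(), n_features_to_select=2).fit(X, y)
print(selector.support_)  # boolean mask of the 2 selected features
print(selector.ranking_)  # 1 = selected; larger values were eliminated earlier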