Feature Engineering: Fix Bad Data Before It Ruins Models (2026)

Learning Objectives

Handle missing values, categorical variables, and outliers
Apply the right encoding strategy for different categorical types
Create new features that improve model performance
Scale and normalize features correctly
Select the most informative features and remove noise

Why Feature Engineering Matters

In practice, feature engineering often contributes more to model performance than algorithm selection. A simple model on great features beats a complex model on poor features.

The feature engineering pipeline is: raw data → cleaned data → encoded data → scaled data → selected features → model input.

Handling Missing Values

Strategy 1: Drop

Drop rows or columns with missing values. Only safe if missingness is rare and random.

Python

df.dropna(subset=['important_column'], inplace=True)  # drop rows
df.drop(columns=['mostly_null_column'], inplace=True)  # drop columns
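
Before dropping anything, it helps to measure how much is actually missing. The following is a minimal sketch on a toy frame; the column names and the 0.5 cutoff are illustrative assumptions, not values from a real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are hypothetical.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, 58_000],
    "fax_number": [np.nan, np.nan, np.nan, "555-0100"],
})

# Share of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)

# Assumed cutoff: drop columns that are more than half empty.
max_missing = 0.5
cols_to_drop = missing_share[missing_share > max_missing].index.tolist()
df_reduced = df.drop(columns=cols_to_drop)

print(missing_share)
print("Dropped:", cols_to_drop)
```

Whether the remaining gaps are random still has to be judged from domain knowledge; the share alone does not say why values are missing.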

Strategy 2: Impute with Statistics

Python

from sklearn.impute import SimpleImputer
import pandas as pd

# Numerical: fill with median (robust to outliers)
num_imputer = SimpleImputer(strategy='median')
df[['age', 'income']] = num_imputer.fit_transform(df[['age', 'income']])

# Categorical: fill with most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['city']] = cat_imputer.fit_transform(df[['city']])
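
When the data is split into training and test sets, the imputer should learn its statistics from the training rows only. A minimal sketch, assuming a small hypothetical frame and a simple random split:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "age": [22, 35, np.nan, 41, 58, np.nan, 30, 47],
    "income": [30_000, 52_000, 48_000, np.nan, 90_000, 61_000, np.nan, 75_000],
})
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)

num_imputer = SimpleImputer(strategy="median")

# fit_transform on training rows: medians are learned here.
train_filled = pd.DataFrame(
    num_imputer.fit_transform(train_df),
    columns=train_df.columns, index=train_df.index,
)
# transform only on test rows: the training medians are reused.
test_filled = pd.DataFrame(
    num_imputer.transform(test_df),
    columns=test_df.columns, index=test_df.index,
)

# statistics_ holds the per-column medians learned from the training rows.
train_medians = num_imputer.statistics_
```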

Strategy 3: Add a Missingness Indicator

Missing data itself can be informative. Add a binary flag before imputing.

Python

df['age_was_missing'] = df['age'].isna().astype(int)
# Assign the result back; chained inplace fillna is deprecated in pandas 2.x
df['age'] = df['age'].fillna(df['age'].median())
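
scikit-learn can produce the flag and the imputed value in one step through the add_indicator option of SimpleImputer. A small sketch on a toy array (the values are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: age, income. Only age has a missing value.
X = np.array([
    [25.0, 50_000.0],
    [np.nan, 62_000.0],
    [40.0, 58_000.0],
])

# add_indicator=True appends one 0/1 column per feature
# that had missing values when the imputer was fitted.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out)
```

The output has three columns here: imputed age, income, and the indicator for age. Income gets no indicator because it had no gaps during fitting.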

Strategy 4: Model-Based Imputation

Python

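# This import is required to enable the experimental IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# IterativeImputer needs numeric input, so restrict to numeric columns
num_cols = df.select_dtypes(include='number').columns
iter_imputer = IterativeImputer(max_iter=10, random_state=42)
df_imputed = pd.DataFrame(iter_imputer.fit_transform(df[num_cols]), columns=num_cols, index=df.index)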

Encoding Categorical Variables

Label Encoding

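Assigns each category an integer. scikit-learn's LabelEncoder is designed for target labels and numbers the classes in sorted (alphabetical) order, so the codes carry no meaningful ranking. For ordered input features, use OrdinalEncoder with an explicit category order instead.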

Python

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['education_encoded'] = le.fit_transform(df['education'])
# Codes follow alphabetical order: "bachelor" → 0, "high school" → 1, "master" → 2

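Warning: Don't use label encoding for nominal categories with linear or distance-based models; the integer codes imply a false ordering. Tree ensembles are more tolerant of arbitrary codes, but they may still need extra splits to isolate a single category.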
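
The code assignment is easy to inspect. This short sketch fits a LabelEncoder on a few education levels and reads the mapping back from classes_:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["high school", "bachelor", "master", "bachelor"])

# classes_ is sorted; a label's position in it is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}

print(mapping)  # {'bachelor': 0, 'high school': 1, 'master': 2}
print(codes.tolist())  # [1, 0, 2, 0]
```

The codes follow alphabetical order, not education level, which is why an explicit category order is needed for ordinal features.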

One-Hot Encoding

Creates a binary column for each category. Use for nominal categories with low cardinality.

Python

df_encoded = pd.get_dummies(df, columns=['city', 'department'], drop_first=True)

# Or with scikit-learn:
from sklearn.preprocessing import OneHotEncoder
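# Note: with drop='first', an unknown category encodes as all zeros, the same as the dropped category
ohe = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')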
encoded = ohe.fit_transform(df[['city']])

Ordinal Encoding

For ordered categories (low/medium/high, bronze/silver/gold).

Python

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
# fit_transform returns an (n, 1) array; flatten it to a 1-D column
df['risk_encoded'] = oe.fit_transform(df[['risk_level']]).ravel()

Target Encoding (Mean Encoding)

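Replace each category with the mean of the target variable for that category. Powerful for high-cardinality features but prone to overfitting: compute the category means out-of-fold, or on training data only, so that a row's own target value never leaks into its feature.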

Python

# Naive manual implementation: compute the means on training rows only,
# otherwise the target leaks into the feature
target_mean = df.groupby('city')['churn'].mean()
df['city_target_encoded'] = df['city'].map(target_mean)
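
One way to combine target encoding with cross-validation is to encode each fold using means computed on the other folds. The sketch below is one possible implementation, not a canonical one: the helper name target_encode_oof, the smoothing value of 10, and the random toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def target_encode_oof(df, col, target, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold target encoding, smoothed toward the global mean."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        prior = fit_part[target].mean()  # global mean of the fitting folds
        stats = fit_part.groupby(col)[target].agg(["sum", "count"])
        # Small categories are pulled toward the prior.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior.
        values = df.iloc[enc_idx][col].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Hypothetical toy data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice(["Paris", "Lima", "Oslo"], size=60),
    "churn": rng.integers(0, 2, size=60),
})
df["city_target_encoded"] = target_encode_oof(df, "city", "churn")
```

scikit-learn 1.3 and later also provide sklearn.preprocessing.TargetEncoder, which applies a similar cross-fitting scheme inside fit_transform.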

Frequency Encoding

Replace each category with how often it appears in the dataset.

Python

freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)
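
As with other encoders, the frequencies are best learned on training data and then reused. A small sketch, assuming a hypothetical train/test pair in which the test set contains an unseen city:

```python
import pandas as pd

train = pd.DataFrame({"city": ["Lima", "Lima", "Oslo", "Paris"]})
test = pd.DataFrame({"city": ["Lima", "Cairo"]})  # "Cairo" is not in train

# Learn relative frequencies from the training rows only.
freq = train["city"].value_counts(normalize=True)

train["city_freq"] = train["city"].map(freq)
# Unseen categories map to NaN; treating them as frequency 0 is an assumption.
test["city_freq"] = test["city"].map(freq).fillna(0.0)

print(test)
```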

Scaling Numerical Features

StandardScaler (Z-score normalization)

Centers to mean=0, std=1. Best for normally distributed features and algorithms sensitive to scale (linear models, SVM, neural networks).

Python

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use fit from training set only

MinMaxScaler

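Scales each feature to a fixed range, [0, 1] by default. Zeros are preserved only when the feature's minimum is 0 (MaxAbsScaler preserves zeros and sparsity in general). Sensitive to outliers, because a single extreme value sets the range.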

Python

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

RobustScaler

Uses median and IQR instead of mean and std. Best when your data has many outliers.

Python

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()

Tree-based models (Random Forest, XGBoost) don't require scaling. Neural networks, linear models, and SVMs do.
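
The effect of an outlier on the three scalers above is easiest to see side by side. A minimal sketch on five made-up values, one of them extreme:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Four ordinary values and one outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "standard": StandardScaler().fit_transform(X).ravel(),
    "minmax": MinMaxScaler().fit_transform(X).ravel(),
    "robust": RobustScaler().fit_transform(X).ravel(),
}

for name, values in scaled.items():
    print(f"{name:>8}: {np.round(values, 3)}")
```

MinMaxScaler squeezes the four ordinary values into a narrow band near 0, and StandardScaler compresses them to nearly identical z-scores, because the outlier inflates both the range and the standard deviation. RobustScaler keeps them spread from -1 to 0.5, since the median and IQR are barely affected.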

Handling Outliers

Detect Outliers

Python

# IQR method
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['income'] < lower) | (df['income'] > upper)]
print(f"Outliers: {len(outliers)}")

Treatment Options

Cap (Winsorize): Clip values to [lower, upper]
Log transform: Compress skewed distributions
Remove: Only if outliers are data entry errors

Python

# Winsorize
df['income'] = df['income'].clip(lower=lower, upper=upper)

# Log transform (for right-skewed data)
import numpy as np
df['income_log'] = np.log1p(df['income'])  # log1p handles zeros
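
In a train/test setting, the clipping bounds are usually computed on the training data and then applied unchanged to new data. A sketch under that assumption; the helper name iqr_bounds and the numbers are illustrative:

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


train_income = pd.Series([30.0, 40.0, 50.0, 60.0, 70.0])
test_income = pd.Series([45.0, 500.0, -20.0])

# Bounds come from the training data only.
lower, upper = iqr_bounds(train_income)
test_capped = test_income.clip(lower=lower, upper=upper)

print(lower, upper)          # 10.0 90.0
print(test_capped.tolist())  # [45.0, 90.0, 10.0]
```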

Creating New Features

Date/Time Features

Python

df['date'] = pd.to_datetime(df['date'])
df['year']        = df['date'].dt.year
df['month']       = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek  # 0=Monday
df['is_weekend']  = df['day_of_week'].isin([5, 6]).astype(int)
df['hour']        = df['date'].dt.hour

Interaction Features

Python

df['income_per_dependent'] = df['income'] / (df['dependents'] + 1)
df['age_times_income']     = df['age'] * df['income']
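
Whether an interaction feature earns its place can be checked by comparing cross-validated scores with and without it. The sketch below uses synthetic data built so that the label depends only on the product of two features; the data, the model choice, and the 5-fold setup are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_base = rng.normal(size=(500, 2))
# The label depends on the sign of the product, which no single feature reveals.
y = (X_base[:, 0] * X_base[:, 1] > 0).astype(int)

# Same features plus the explicit interaction term.
X_inter = np.column_stack([X_base, X_base[:, 0] * X_base[:, 1]])

model = LogisticRegression(max_iter=1000)
score_base = cross_val_score(model, X_base, y, cv=5).mean()
score_inter = cross_val_score(model, X_inter, y, cv=5).mean()

print(f"without interaction: {score_base:.3f}")
print(f"with interaction:    {score_inter:.3f}")
```

On real data the gain is rarely this clear-cut, so the same comparison is worth running for each candidate feature.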

Binning / Discretization

Python

# Manual bins
df['age_group'] = pd.cut(df['age'],
    bins=[0, 18, 35, 50, 65, 100],
    labels=['teen', 'young_adult', 'adult', 'middle_age', 'senior'])

# Quantile bins (equal-frequency)
df['income_quartile'] = pd.qcut(df['income'], q=4, labels=['Q1','Q2','Q3','Q4'])
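
Bin edges deserve a quick check, because pd.cut uses right-closed intervals by default and returns NaN for values outside the bins. A small sketch with the same bins as above and a few boundary ages:

```python
import pandas as pd

ages = pd.Series([0, 18, 19, 65, 66, 120])
bins = [0, 18, 35, 50, 65, 100]
labels = ["teen", "young_adult", "adult", "middle_age", "senior"]

# Default: intervals are (0, 18], (18, 35], ... so 0 itself is excluded.
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left.
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": default_cut, "inclusive": inclusive_cut}))
```

Age 0 is NaN in the default version and "teen" with include_lowest=True; age 120 is NaN in both because it lies beyond the last edge.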

Polynomial Features

Python

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
X_poly = poly.fit_transform(X[['age', 'income']])

Feature Selection

Filter Methods — Statistical Tests

Python

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Select top 10 features by ANOVA F-value
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print(selected_features.tolist())
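
The ANOVA F-test mainly detects differences in class means, so it can miss non-linear relationships that mutual_info_classif picks up. The sketch below builds a synthetic case to illustrate the difference; the data and the threshold in the label rule are assumptions.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
x_nonlinear = rng.uniform(-1, 1, n)  # determines the label through |x|
x_noise = rng.uniform(-1, 1, n)      # unrelated to the label
y = (np.abs(x_nonlinear) > 0.5).astype(int)
X = np.column_stack([x_nonlinear, x_noise])

f_scores, _ = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)

print("F-scores:          ", np.round(f_scores, 3))
print("Mutual information:", np.round(mi_scores, 3))
```

Both classes have nearly the same mean for x_nonlinear, so its F-score is not expected to stand out, while its mutual information is clearly higher than that of the noise feature. Passing score_func=mutual_info_classif to SelectKBest uses this criterion for selection.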

Wrapper Method — Recursive Feature Elimination

Python

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

rfe = RFE(estimator=RandomForestClassifier(n_estimators=100), n_features_to_select=10)
rfe.fit(X_train, y_train)
print(X_train.columns[rfe.support_].tolist())  # index with the frame that was fitted

Embedded Method — Feature Importance

Python

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
importances = pd.Series(model.feature_importances_, index=X_train.columns)
top_features = importances.nlargest(15)
print(top_features)

Permutation Importance (Most Reliable)

Python

from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
perm_imp = pd.Series(result.importances_mean, index=X_test.columns).sort_values(ascending=False)
print(perm_imp.head(15))

Building a Feature Pipeline

Combine all preprocessing steps into a reproducible pipeline:

Python

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

numeric_features = ['age', 'income', 'tenure']
categorical_features = ['city', 'plan_type']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', drop='first')),
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
])

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier()),
])

full_pipeline.fit(X_train, y_train)
score = full_pipeline.score(X_test, y_test)
print(f"Accuracy: {score:.3f}")
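
A pipeline can also be passed directly to cross_val_score, in which case every preprocessing step is refitted on the training part of each fold. The sketch below is self-contained and uses synthetic data; the column names, the label rule, and the 5-fold setup are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with some missing ages.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10, 0.5, n),
    "city": rng.choice(["Lima", "Oslo", "Paris"], n),
})
X.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
y = (X["income"] > X["income"].median()).astype(int)

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", GradientBoostingClassifier(random_state=0)),
])

# Imputers, scaler, and encoder are refitted inside each of the 5 folds.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```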

Troubleshooting

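Model performance doesn't improve after adding features → Check each new feature's relationship with the target. Low linear correlation does not rule out a non-linear signal, so also score features with mutual information (SelectKBest with mutual_info_classif) or permutation importance before discarding them.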

One-hot encoding creates too many columns → Use target encoding or frequency encoding for high-cardinality categoricals (>20 unique values).

Train/test performance gap is large despite regularization → Check for data leakage — ensure no future information sneaks into features.

Key Takeaways

Feature quality determines model ceiling — a perfect algorithm on bad features underperforms a simple algorithm on good features; fix data before tuning hyperparameters
Fit all preprocessing (scalers, imputers, encoders) on training data only — applying fit_transform on the full dataset leaks test set statistics into training, inflating performance estimates
Use sklearn.pipeline.Pipeline to chain preprocessing and modeling: inside cross-validation it refits every preprocessing step on each training fold, which prevents preprocessing leakage and makes deployment reproducible
One-hot encoding is correct for nominal categories with no ordering; ordinal encoding is correct for ordered categories (low/medium/high); high-cardinality categories (>20 values) need target encoding or frequency encoding
Log-transform right-skewed features (income, price, count) before training linear models or neural networks — they are sensitive to scale and skew in ways tree-based models are not
Interaction features (multiplying or concatenating two features) can capture non-linear relationships that individual features miss — but validate each with cross-validation before keeping
Datetime features should be decomposed into components (hour, day of week, month, is_weekend) — the raw timestamp integer is meaningless to most models
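Consider dropping features with more than 50% missing values before imputing (a rule of thumb, not a hard limit): imputed values for mostly empty columns often add more noise than signal, although a missingness indicator may still be worth keeping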

FAQ

When should I do feature selection? After basic preprocessing. Use feature importance from a quick baseline model to identify candidates for removal. Always validate that removing a feature does not hurt performance before dropping it permanently.

Does feature engineering matter for deep learning? Less so for raw data like images and text — deep learning learns features automatically from the raw signal. For tabular data, yes. Good feature engineering still matters even with deep learning; neural networks on tabular data often benefit from the same encoding and normalization steps used for classical models.

What is the difference between StandardScaler and MinMaxScaler? StandardScaler transforms to zero mean and unit variance (z-score normalization). MinMaxScaler scales to a fixed range, typically [0, 1]. StandardScaler is a reasonable default for most scale-sensitive models and is less distorted by outliers than MinMaxScaler, but it is not robust to them, because the mean and standard deviation are themselves pulled by extreme values; use RobustScaler when outliers are common. Use MinMaxScaler when your model requires inputs in a specific range, such as a neural network with sigmoid activations.

How do I handle missing values in test data when I fitted an imputer on training data? Call imputer.transform(X_test) — not imputer.fit_transform(X_test). The imputer was already fit on training statistics. Calling fit_transform on test data re-estimates statistics from test data, which constitutes leakage. Using a Pipeline ensures this is handled correctly automatically.

What is target encoding and when should I use it? Target encoding replaces a categorical value with the mean of the target variable for that category. For example, city "New York" gets encoded as the mean churn rate of New York customers. Use it for high-cardinality categoricals where one-hot encoding would create hundreds of columns. Always compute target encoding means on training data only, and apply a smoothing factor to prevent overfitting to small categories.

How many features is too many? There is no universal threshold, but more features require more training data to avoid overfitting. A rough heuristic: if you have fewer than 10 samples per feature, you likely have too many. Use variance inflation factor (VIF) to detect multicollinearity and SelectKBest or recursive feature elimination to reduce dimensionality.
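
VIF can be computed without extra dependencies, because the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below uses that identity on synthetic data; the helper name and the column names are hypothetical, and the approach assumes numeric features that are not perfectly collinear.

```python
import numpy as np
import pandas as pd


def variance_inflation_factors(X: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_noise": a + 0.1 * rng.normal(size=500),  # nearly a copy of "a"
})

vif = variance_inflation_factors(X)
print(vif.round(1))
```

Here "a" and "a_plus_noise" get large values while "b" stays close to 1. Cutoffs of 5 or 10 are commonly quoted rules of thumb rather than strict limits.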

Should I always normalize features? Not always. Tree-based models (decision trees, random forests, gradient boosting) are scale-invariant: they split on thresholds that depend only on the ordering of a feature's values, not on its scale. Linear models, SVMs, and neural networks are scale-sensitive and usually need normalization. Always check which algorithm you are using before adding a scaling step.
