Python for ML: The Stack Engineers Actually Use (2026)

Learning Objectives

Use NumPy for fast numerical computation
Manipulate and analyze data with pandas
Visualize distributions and relationships with matplotlib/seaborn
Apply scikit-learn's consistent API for training and evaluation
Write clean, reproducible ML code

Setup

Bash

pip install numpy pandas scikit-learn matplotlib seaborn jupyter pyarrow openpyxl

NumPy: Fast Numerical Computation

NumPy is the foundation of scientific Python. pandas and scikit-learn are built on NumPy arrays, and deep learning frameworks such as PyTorch and TensorFlow convert to and from them.

Creating Arrays

Python

import numpy as np

# From lists
a = np.array([1, 2, 3, 4, 5])
b = np.array([[1, 2, 3], [4, 5, 6]])  # 2D

# Special arrays
zeros = np.zeros((3, 4))      # 3×4 matrix of zeros
ones  = np.ones((2, 3))       # 2×3 matrix of ones
eye   = np.eye(4)              # 4×4 identity matrix
rand  = np.random.randn(3, 3)  # random normal

print(a.shape, b.shape, a.dtype)

Array Operations

Python

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Element-wise (no loops needed)
print(a + b)      # [11 22 33 44]
print(a * b)      # [10 40 90 160]
print(a ** 2)     # [1 4 9 16]
print(np.sqrt(a)) # [1.   1.41 1.73 2.  ]

# Matrix multiplication
A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
C = A @ B   # (3, 2) result — use @ not np.dot for readability

Slicing and Indexing

Python

X = np.random.randn(100, 5)

X[0]       # first row
X[:, 2]    # third column (all rows)
X[10:20]   # rows 10-19
X[X > 0]   # all positive values (boolean indexing)

# Fancy indexing
indices = np.array([0, 5, 10])
X[indices]  # rows 0, 5, 10

Broadcasting

Python

# Subtract column means from each column
X = np.random.randn(100, 5)
col_means = X.mean(axis=0)  # shape (5,)
X_centered = X - col_means  # broadcasts: (100,5) - (5,) → (100,5)
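
The same broadcasting rule covers full standardization and row-wise operations. The sketch below uses synthetic data (the seed and shapes are illustrative assumptions) and shows why `keepdims=True` is needed when the reduced axis is the last one.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for illustration
X = rng.normal(loc=5.0, scale=2.0, size=(100, 5))

# Column-wise standardization: (100, 5) combined with (5,) broadcasts across rows
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Row-wise centering: plain row means have shape (100,), which does not
# broadcast against (100, 5). keepdims=True yields (100, 1), which does.
row_means = X.mean(axis=1, keepdims=True)
X_row_centered = X - row_means

print(X_std.std(axis=0))   # approximately 1 for every column
print(row_means.shape)     # (100, 1)
```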

Statistics

Python

print(X.mean(axis=0))   # mean of each column
print(X.std(axis=0))    # std of each column
print(X.min(), X.max())
print(np.percentile(X, [25, 50, 75]))  # quartiles over all values; pass axis=0 for per-column quartiles

Pandas: Data Manipulation

Pandas DataFrames are how you'll work with tabular data before feeding it to ML models.

Loading Data

Python

import pandas as pd

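# Each reader returns a DataFrame; use the one that matches your file
df = pd.read_csv('data.csv')
df = pd.read_json('data.json')
df = pd.read_excel('data.xlsx')  # .xlsx files require the openpyxl package

# Quick overview
print(df.shape)        # (rows, cols)
print(df.head())       # first 5 rows
df.info()              # column types and non-null counts (prints directly, returns None)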
print(df.describe())   # statistics for numeric columns

Selection and Filtering

Python

# Select columns
df['age']                        # single column → Series
df[['age', 'income', 'churn']]  # multiple columns → DataFrame

# Filter rows
df[df['age'] > 30]
df[(df['age'] > 30) & (df['churn'] == 1)]
df.query('age > 30 and churn == 1')  # cleaner syntax

# .loc (label-based) and .iloc (integer-based)
df.loc[0:5, 'age':'income']   # rows 0-5, columns age through income
df.iloc[0:5, 2:6]             # rows 0-4, columns 2-5
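
The end-point difference between `.loc` and `.iloc` is easiest to see on a frame whose index is not 0..n-1. The small DataFrame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 32, 47, 51], "income": [40, 55, 80, 62], "churn": [0, 1, 0, 1]},
    index=[10, 11, 12, 13],  # non-default labels make the difference visible
)

by_label = df.loc[10:12, "age":"income"]  # labels 10, 11, 12: the end label is included
by_position = df.iloc[0:2, 0:2]           # positions 0 and 1: the end position is excluded

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 2)
```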

Common Operations

Python

# Sorting
df.sort_values('income', ascending=False)

# New columns
df['income_per_age'] = df['income'] / df['age']

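# Apply function (runs a Python call per element; prefer a vectorized method when one exists)
df['name_upper'] = df['name'].str.upper()  # vectorized equivalent of .apply(lambda x: x.upper())
df['age_group'] = df['age'].apply(lambda x: 'senior' if x > 60 else 'adult')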

# String operations
df['email_domain'] = df['email'].str.split('@').str[1]
df[df['name'].str.contains('Smith', case=False)]

GroupBy and Aggregation

Python

# Average income by city
df.groupby('city')['income'].mean()

# Multiple aggregations
df.groupby('city').agg({
    'income': ['mean', 'median', 'std'],
    'age':    ['mean', 'min', 'max'],
    'churn':  'sum'
}).round(2)

# Pivot table
pd.pivot_table(df, values='income', index='city', columns='age_group', aggfunc='mean')

Handling Missing Data

Python

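print(df.isnull().sum())              # null count per column
print(df.isnull().sum() / len(df))    # null fraction per column (multiply by 100 for a percentage)

df = df.dropna(subset=['income'])     # drop rows where income is null; dropna returns a new DataFrame
df['age'] = df['age'].fillna(df['age'].median())  # fill with median; assign back instead of inplace=True on a column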

Merging DataFrames

Python

users    = pd.read_csv('users.csv')
orders   = pd.read_csv('orders.csv')

merged = pd.merge(users, orders, on='user_id', how='left')
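
A left join can silently multiply rows or leave unmatched keys. As a sketch on made-up data, the `validate` and `indicator` parameters of `pd.merge` make both visible.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"user_id": [1, 1, 3], "amount": [20.0, 35.0, 12.5]})

merged = pd.merge(
    users,
    orders,
    on="user_id",
    how="left",
    validate="one_to_many",  # raises MergeError if user_id is duplicated in users
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

unmatched = (merged["_merge"] == "left_only").sum()
print(len(users), len(merged))  # 3 4 (user 1 has two orders)
print(unmatched)                # 1 (user 2 has no orders)
```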

Data Visualization

Matplotlib Basics

Python

import matplotlib.pyplot as plt

# Histogram
plt.figure(figsize=(8, 4))
plt.hist(df['income'], bins=30, edgecolor='black')
plt.xlabel('Income')
plt.ylabel('Count')
plt.title('Income Distribution')
plt.tight_layout()
plt.show()

# Scatter plot
plt.scatter(df['age'], df['income'], alpha=0.3, c=df['churn'], cmap='coolwarm')
plt.colorbar(label='Churn')
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()

Seaborn for Statistical Plots

Python

import seaborn as sns

# Distribution + box plots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(df['income'], kde=True, ax=axes[0])
sns.boxplot(x='churn', y='income', data=df, ax=axes[1])
plt.tight_layout()
plt.show()

# Correlation heatmap
corr = df.select_dtypes(include='number').corr()  # correlations between numeric columns only
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.show()

# Pairplot (shows all pairwise relationships)
sns.pairplot(df[['age', 'income', 'tenure', 'churn']], hue='churn')
plt.show()

Scikit-learn Core API

Scikit-learn has a consistent API: every estimator has fit(). Predictors (classifiers and regressors) add predict() and score(), and transformers add transform().

The Estimator Pattern

Python

from sklearn.ensemble import RandomForestClassifier

# 1. Instantiate with hyperparameters
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# 2. Fit on training data (X_train, y_train, X_test, y_test come from an earlier train/test split)
model.fit(X_train, y_train)

# 3. Predict on new data
y_pred  = model.predict(X_test)        # class labels
y_proba = model.predict_proba(X_test)  # class probabilities

# 4. Score
accuracy = model.score(X_test, y_test)
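
For a version of the four steps that runs end to end, here is a minimal sketch. It assumes synthetic data from `make_classification` and an 80/20 stratified split; neither comes from the original example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)          # one label per test row
y_proba = model.predict_proba(X_test)   # one probability per class per test row
accuracy = model.score(X_test, y_test)  # mean accuracy for classifiers

print(y_pred.shape, y_proba.shape)  # (100,) (100, 2)
print(f"accuracy: {accuracy:.3f}")
```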

Transformers Follow the Same Pattern

Python

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)              # learn mean and std from training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)  # use training statistics

# Shortcut: fit_transform on training data
X_train_scaled = scaler.fit_transform(X_train)

Pipelines

Python

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
# The pipeline applies scaler.transform automatically at prediction time

Cross-Validation

Python

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')  # 'f1' assumes binary labels; use 'f1_macro' or 'f1_weighted' for multiclass
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")
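
Because the whole pipeline is passed to `cross_val_score`, the scaler is refit on the training portion of every fold, which keeps the held-out fold out of the scaling statistics. Below is a self-contained sketch. The synthetic data and the explicit `StratifiedKFold` with shuffling are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Shuffled, stratified folds with a fixed seed so the split is repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")

print(len(scores))  # 5, one score per fold
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```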

Reproducibility Checklist

Python

import random

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fix random seeds
np.random.seed(42)
random.seed(42)

# Always set random_state in scikit-learn estimators
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Save and load models (call model.fit(...) first so the file holds a trained model)
joblib.dump(model, 'model.pkl')
model = joblib.load('model.pkl')

# Save DataFrames
df.to_csv('processed_data.csv', index=False)
df.to_parquet('processed_data.parquet')  # faster for large files; requires pyarrow or fastparquet
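
`np.random.seed` sets global state, so any other code that draws random numbers changes what you get next. NumPy's `Generator` API offers a local alternative. The sketch below shows that two generators built from the same seed produce identical draws; the seed value is arbitrary.

```python
import numpy as np

# Two independent generators with the same seed yield the same stream
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.normal(size=5)
sample_b = rng_b.normal(size=5)
print(np.array_equal(sample_a, sample_b))  # True

# A seeded permutation, e.g. for a repeatable shuffle of row indices
perm = np.random.default_rng(42).permutation(10)
print(sorted(perm.tolist()) == list(range(10)))  # True
```

Passing the generator object into the functions that need randomness keeps each experiment's random stream isolated from the rest of the program.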

Troubleshooting

Shape mismatch error

Python

print(X_train.shape, y_train.shape)  # check dimensions
# Sklearn expects X shape (n_samples, n_features)
# Ensure y is 1D: y.reshape(-1) if needed

ValueError: Input contains NaN → Check np.isnan(X_train).sum() for a NumPy array (or X_train.isna().sum() for a DataFrame) and impute missing values before training.

ConvergenceWarning in LogisticRegression → Increase max_iter (for example, max_iter=1000) or standardize features.

Key Takeaways

NumPy's array operations are often orders of magnitude faster than equivalent Python loops because they are implemented in C and operate on contiguous memory; vectorize instead of iterating over array elements
Pandas DataFrames are the usual starting point for tabular data, and scikit-learn accepts them directly; learn groupby, merge, pivot, and fillna early, since these operations cover most day-to-day data wrangling
scikit-learn's fit/transform/predict API is consistent across all estimators — learn it once and you can use any algorithm, preprocessing step, or pipeline without relearning the interface
Always separate exploratory notebooks from production code — use Jupyter for EDA and visualization, then migrate working logic to .py files for reproducibility and version control
Setting random seeds (numpy.random.seed, random_state= in sklearn) is not optional for reproducible results — experiments that cannot be reproduced are impossible to debug
Use joblib for saving models, not pickle directly — it handles numpy arrays more efficiently and is the standard for scikit-learn model persistence
Matplotlib creates publication-quality plots but seaborn produces better-looking statistical visualizations with fewer lines — use seaborn for correlation matrices, distribution plots, and categorical comparisons
Virtual environments (venv or conda) are mandatory for ML projects — ML dependency conflicts are common and a clean environment per project prevents hours of debugging
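
The first takeaway, vectorizing instead of looping, can be checked on your own machine with a rough timing sketch like the one below. The array size is an arbitrary choice, and the measured ratio depends on hardware and NumPy build, so no specific speedup is claimed here.

```python
import timeit

import numpy as np

values = np.arange(200_000, dtype=np.float64)

def loop_sum_of_squares(arr):
    """Sum of squares using a Python-level loop over array elements."""
    total = 0.0
    for v in arr:
        total += v * v
    return total

def vectorized_sum_of_squares(arr):
    """Same quantity computed by a single NumPy call."""
    return float(np.dot(arr, arr))

loop_time = timeit.timeit(lambda: loop_sum_of_squares(values), number=1)
vec_time = timeit.timeit(lambda: vectorized_sum_of_squares(values), number=1)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.6f}s")
```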

FAQ

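Python 2 or Python 3? Python 3 only. Python 2 reached end-of-life in 2020 and is unsupported by all major ML libraries. Python 3.10 and 3.11 are widely supported, but library releases drop older interpreters over time, so check the supported-version list of each library you depend on before pinning a version.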

Should I use Jupyter notebooks or Python scripts? Both, for different purposes. Notebooks are ideal for exploration, EDA, visualization, and sharing results — the cell-by-cell execution model matches the exploratory workflow. Scripts are required for reusable pipeline code, scheduled jobs, and anything that goes into production. A common pattern: prototype in a notebook, then refactor the working logic into clean .py modules.

What about GPU acceleration? For classical ML (sklearn, XGBoost), CPU is fine. For deep learning, use PyTorch with CUDA. For large-scale tabular data on GPU, consider RAPIDS cuML: it provides a scikit-learn compatible API that runs on NVIDIA GPUs, with reported speedups of 10–100x for some algorithms, depending on data size and hardware.

What is the difference between fit_transform and fit + transform separately? fit_transform is equivalent to calling fit and then transform on the same data, combined for convenience. Use fit_transform on training data. Use only transform (not fit_transform) on validation and test data — you want to apply the same transformation parameters learned from training, not refit on the new data. Fitting on test data is data leakage.
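
A short check of the equivalence described above, sketched on synthetic data (the shapes and seed are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=3.0, size=(200, 3))
X_test = rng.normal(loc=10.0, scale=3.0, size=(50, 3))

# fit_transform on the training data ...
a = StandardScaler().fit_transform(X_train)

# ... matches fit followed by transform on the same data
scaler = StandardScaler().fit(X_train)
b = scaler.transform(X_train)
print(np.allclose(a, b))  # True

# The test set reuses the training mean and std. Its scaled column means
# are close to zero but not exactly zero, which is expected.
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.mean(axis=0))
```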

When should I use Pandas vs NumPy directly? Use Pandas for labeled, heterogeneous data — DataFrames handle mixed types, named columns, and missing values well. Use NumPy for homogeneous numerical arrays where performance matters — matrix operations, custom distance computations, and feeding data into PyTorch or TensorFlow. Most ML workflows start in Pandas and convert to NumPy arrays before model training.
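
The hand-off from Pandas to NumPy is usually a column selection followed by `to_numpy`. This is a minimal sketch on a made-up frame; `feature_cols` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40_000.0, 55_000.0, 80_000.0],
    "city": ["Austin", "Boston", "Austin"],  # non-numeric, left out of the array
})

feature_cols = ["age", "income"]  # hypothetical feature list
X = df[feature_cols].to_numpy(dtype=np.float64)

print(X.shape, X.dtype)  # (3, 2) float64
```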

What is the best way to handle missing data in Python? Use sklearn.impute.SimpleImputer inside a Pipeline. For numeric columns, median imputation is more robust than mean imputation when outliers are present. For categorical columns, use most_frequent or a constant placeholder. For time-series data, forward-fill or backward-fill is often more appropriate. Always impute after splitting data — fit the imputer on training data only.
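
One way to follow this advice is sketched below, with median imputation as the first pipeline step so that it is fit on training data only. The tiny arrays are made up, and the step names are arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [100.0, 20.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[np.nan, 25.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # medians learned in fit()
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Medians come from the training columns only: 2.5 and 20.0
medians = pipe.named_steps["imputer"].statistics_
print(medians)

# The missing test value is filled with the training median before scaling
print(pipe.predict(X_test).shape)  # (1,)
```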

How do I profile which part of my ML code is slow? Use line_profiler or cProfile for Python code. The most common bottlenecks are: applying Python functions row-by-row in Pandas (use vectorized operations instead), loading data in a loop (batch-load with pd.read_csv once), and repeated model re-fitting (cache fitted models with joblib). For data loading, switching from CSV to Parquet often makes reads several times faster (5–10x is typical, depending on the data and the Parquet engine).
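
As a starting point for the cProfile route, here is a minimal standard-library sketch. The function `slow_sum_of_squares` is a hypothetical stand-in for whatever code you want to measure.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Deliberately loop-heavy stand-in for real workload code."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum_of_squares(200_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

The function with the largest cumulative time is usually the place to look first for a row-by-row loop that could be vectorized.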
