Statistics for Machine Learning: Practical Concepts for Developers
What You Actually Need
This isn't a statistics textbook. These are the concepts that come up repeatedly when you're building, evaluating, and debugging ML models.
Probability Fundamentals
Probability Distributions
A distribution describes how likely different values are. The most important ones:
Normal (Gaussian) Distribution
import numpy as np
import matplotlib.pyplot as plt
# Many natural phenomena follow this bell curve
# Parameters: mean (μ) and standard deviation (σ)
data = np.random.normal(loc=0, scale=1, size=10000)
# ~68% of data falls within 1 std dev of mean
# ~95% within 2 std devs
# ~99.7% within 3 std devs (the "three-sigma rule")
plt.hist(data, bins=50, density=True)
plt.title("Normal Distribution")
plt.show()
Why it matters for ML: Model weights in neural networks are often initialized from normal distributions. Residuals (prediction errors) often follow normal distributions in regression.
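As a sketch of the initialization point: He-style initialization (common for ReLU networks) draws weights from a normal distribution whose scale depends on the layer's fan-in. The layer sizes below are arbitrary.

```python
import numpy as np

# He initialization: weights ~ N(0, sqrt(2 / fan_in)) so that activations
# keep a roughly stable variance as signals pass through the layer.
rng = np.random.default_rng(0)
fan_in, fan_out = 256, 512
W = rng.normal(loc=0.0, scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

print(W.mean())  # close to 0
print(W.std())   # close to sqrt(2/256) ~ 0.088
```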
Bernoulli Distribution — binary outcomes (0 or 1). Probability of a coin flip. Basis of binary classification: the model predicts P(y=1 | x).
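A Bernoulli variable is easy to simulate; a minimal sketch (the 0.3 success probability is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.3  # probability of a 1, e.g. P(y=1 | x) from a classifier
# A Bernoulli trial is a binomial with n=1
flips = rng.binomial(n=1, p=p, size=100_000)
print(flips.mean())  # converges to p as the sample grows
```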
Softmax output — converts raw scores to a probability distribution over classes. Each output is between 0–1 and all outputs sum to 1.
import numpy as np
def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating
    e = np.exp(logits - logits.max())
    return e / e.sum()
logits = np.array([2.0, 1.0, 0.5, -1.0]) # raw scores for 4 classes
probs = softmax(logits)
print(probs) # [0.609, 0.224, 0.136, 0.030]
print(probs.sum()) # 1.0
Expected Value and Variance
Expected value (mean) — the average outcome over many trials.
Variance — how spread out values are. High variance = high uncertainty.
Standard deviation — √variance, in the same units as your data.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print(f"Mean: {data.mean():.2f}") # 5.0
print(f"Std: {data.std():.2f}") # 2.0
print(f"Var: {data.var():.2f}") # 4.0
# For model predictions
predictions = np.array([0.9, 0.7, 0.85, 0.6, 0.95])
print(f"Avg confidence: {predictions.mean():.2f}")
print(f"Confidence spread: {predictions.std():.2f}")
Bias-variance tradeoff — a fundamental ML concept:
- High bias (underfitting): model too simple, misses patterns
- High variance (overfitting): model too complex, memorizes training data
- Goal: minimize both
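A quick way to see the tradeoff is to fit polynomials of increasing degree to the same noisy data; the degrees and noise level below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy sine

mses = {}
for degree in (1, 4, 10):
    coeffs = np.polyfit(x, y, degree)
    mses[degree] = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree {degree:2d}: train MSE = {mses[degree]:.4f}")

# Training error only falls as the degree grows (high bias -> low bias),
# but the degree-10 fit is chasing the noise; its error on NEW data
# would be worse (high variance).
```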
Bayes' Theorem
The most important formula in probabilistic ML:
P(A|B) = P(B|A) × P(A) / P(B)
In English: "Given that B happened, what's the probability of A?"
Practical example — spam filter:
# Prior: 30% of emails are spam
p_spam = 0.30
# Likelihood: "free money" appears in 80% of spam, 2% of ham
p_word_given_spam = 0.80
p_word_given_ham = 0.02
# P(word) = P(word|spam)*P(spam) + P(word|ham)*P(ham)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
# Posterior: P(spam | word)
p_spam_given_word = (p_word_given_spam * p_spam) / p_word
print(f"P(spam | 'free money'): {p_spam_given_word:.2%}") # 94.5%
Bayesian thinking appears in many places: Naive Bayes classifiers, Bayesian optimization for hyperparameter tuning, uncertainty estimation.
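As one concrete appearance, scikit-learn's MultinomialNB applies exactly this update over word counts; a sketch with a made-up four-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free money now", "meeting at noon", "free lottery win", "project update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(docs)          # bag-of-words counts
clf = MultinomialNB().fit(X, labels) # learns P(word | class) with smoothing

pred = clf.predict(vec.transform(["free money win"]))
print(pred)  # [1] -- all three words appear only in the spam documents
```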
Correlation and Causation
Correlation measures the linear relationship between two variables (range: -1 to +1).
import pandas as pd
df = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "test_score": [50, 55, 60, 65, 70, 75, 80, 85],
    "shoe_size": [9, 8, 10, 7, 9, 8, 10, 9],  # irrelevant
})
print(df.corr(numeric_only=True))
# study_hours ↔ test_score: 1.0 (perfectly linear in this toy data)
# shoe_size ↔ test_score: ~0.17 (weak; just noise)
Critical point: Correlation ≠ causation. Ice cream sales correlate with drowning deaths (both increase in summer). Never assume a correlated feature causes the outcome.
Hypothesis Testing: Does Your Model Actually Improve?
When you change something (add a feature, change architecture), you need to know if the improvement is real or random noise.
from statsmodels.stats.proportion import proportions_ztest
# Old model: 500 test examples, 75% accuracy
# New model: 500 test examples, 77% accuracy
# Is the 2% improvement real or noise?
old_correct = 375  # 75% of 500
new_correct = 385  # 77% of 500
n = 500
# Two-proportion z-test
counts = [new_correct, old_correct]
nobs = [n, n]
z, p_value = proportions_ztest(counts, nobs)
print(f"Z-statistic: {z:.2f}")
print(f"P-value: {p_value:.4f}")
print(f"Improvement is {'statistically significant' if p_value < 0.05 else 'NOT significant'}")
A p-value below 0.05 means that, if there were truly no difference between the models, a result at least this extreme would occur less than 5% of the time. It's a common but imperfect threshold.
Practical rule: With fewer than ~500 test examples, small improvements (1-2%) are rarely meaningful. Always test on large, held-out datasets.
Information Theory: Entropy and Cross-Entropy
Entropy measures uncertainty/information content:
import numpy as np
def entropy(probs):
    """Shannon entropy: lower = more certain."""
    return -sum(p * np.log2(p) for p in probs if p > 0)
# Completely uncertain (uniform distribution over 4 classes)
uniform = [0.25, 0.25, 0.25, 0.25]
print(f"Uniform entropy: {entropy(uniform):.2f} bits") # 2.0 bits (max)
# Completely certain
certain = [1.0, 0.0, 0.0, 0.0]
print(f"Certain entropy: {entropy(certain):.2f} bits") # 0.0 bits
Cross-entropy loss — the loss function used in almost all classification models:
def cross_entropy_loss(y_true, y_pred):
    """
    y_true: true class probabilities (one-hot or soft labels)
    y_pred: model's predicted probabilities
    """
    eps = 1e-10  # prevent log(0)
    return -np.sum(y_true * np.log(y_pred + eps))
# Perfect prediction
y_true = np.array([0, 1, 0, 0]) # true class is index 1
y_pred = np.array([0.02, 0.90, 0.05, 0.03])
print(f"Good prediction loss: {cross_entropy_loss(y_true, y_pred):.3f}") # ~0.105
# Bad prediction
y_pred_bad = np.array([0.25, 0.25, 0.25, 0.25])
print(f"Random prediction loss: {cross_entropy_loss(y_true, y_pred_bad):.3f}") # ~1.386
Why it matters: when you see "loss" in training curves, it's usually cross-entropy. Lower is better.
Evaluating Classification Models
Beyond accuracy, you need these metrics:
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    confusion_matrix, roc_auc_score
)
y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0]
# Precision: of predicted positives, what % are actually positive?
# (Low precision = many false alarms)
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
# Recall: of actual positives, what % did we catch?
# (Low recall = missed many positives)
print(f"Recall: {recall_score(y_true, y_pred):.2f}")
# F1: harmonic mean of precision and recall
print(f"F1 Score: {f1_score(y_true, y_pred):.2f}")
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(f"Confusion Matrix:\n{cm}")
# Predicted
# 0 1
# True 0 [TN, FP]
# 1 [FN, TP]
Rule of thumb:
- Medical diagnosis → prioritize recall (don't miss sick patients)
- Spam filter → prioritize precision (don't block real emails)
- Search ranking → use NDCG or MRR
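roc_auc_score, imported above but not shown, works on predicted probability scores rather than hard 0/1 labels; a brief sketch with made-up scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
# Probability scores, not hard predictions (values are illustrative)
y_scores = [0.10, 0.90, 0.35, 0.20, 0.80, 0.70, 0.60, 0.15, 0.85, 0.05]

auc = roc_auc_score(y_true, y_scores)
print(f"ROC AUC: {auc:.2f}")  # 0.96
# AUC = probability a random positive is scored above a random negative;
# here 24 of the 25 (positive, negative) pairs are ordered correctly.
```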
Statistical Concepts in LLM Development
Perplexity — measures how well a language model predicts text. Lower is better. A model with perplexity 20 is less "surprised" by text than one with perplexity 100.
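Perplexity is the exponential of the average cross-entropy per token; a minimal sketch (the probabilities below are made up):

```python
import numpy as np

def perplexity(token_probs):
    """exp of the average negative log-probability per token."""
    nll = -np.mean(np.log(token_probs))
    return np.exp(nll)

# Probabilities the model assigned to each actual next token
probs = np.array([0.3, 0.1, 0.25, 0.05, 0.2])
print(f"Perplexity: {perplexity(probs):.1f}")

# A model that always gives the true token probability 1/20
# has perplexity exactly 20 -- "as surprised as a fair 20-sided die"
print(perplexity(np.full(100, 1 / 20)))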
Temperature sampling — not technically statistics, but uses the probability distribution output:
- temperature=0 → argmax (always take the most likely token)
- temperature=1 → sample from the distribution as-is
- temperature=2 → boost low-probability tokens (more random)
Top-p (nucleus) sampling — sample only from the smallest set of tokens whose cumulative probability ≥ p. Prevents sampling from the "long tail" of unlikely tokens.
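Both ideas fit in a few lines; a sketch, where the function name and logits are illustrative (note temperature must be > 0 here; temperature → 0 approaches argmax):

```python
import numpy as np

def sample_distribution(logits, temperature=1.0, top_p=1.0):
    """Return the post-temperature, top-p-filtered sampling distribution."""
    # Temperature rescales logits before the softmax
    scaled = np.asarray(logits, dtype=float) / temperature
    e = np.exp(scaled - scaled.max())
    probs = e / e.sum()
    # Nucleus (top-p): keep the smallest set of tokens whose
    # cumulative probability reaches top_p, zero out the rest
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample_distribution(logits, temperature=0.1))  # nearly all mass on token 0
print(sample_distribution(logits, top_p=0.8))        # tail tokens zeroed out
```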
What to Learn Next
- Machine learning in practice → ML Project Workflow
- Model evaluation → Model Evaluation and Metrics
- Deep learning → Deep Learning Fundamentals