🤖 Machine Learning

Quick references for Machine Learning

📚 Fundamentals
Learning Types

Supervised Learning

Learn from labeled data to predict outcomes

Use when: You have labeled input-output pairs

Examples:

  • Classification (spam detection)
  • Regression (price prediction)

Algorithms:

  • Linear Regression
  • Decision Trees
  • Neural Networks

Unsupervised Learning

Discover patterns in unlabeled data

Use when: Exploring the structure of unlabeled data

Examples:

  • Customer segmentation
  • Anomaly detection

Algorithms:

  • K-means Clustering
  • DBSCAN
  • Autoencoders

Reinforcement Learning

Learn through trial-and-error with rewards

Use when: Sequential decision-making is needed

Examples:

  • Game AI (AlphaGo)
  • Robotic control

Algorithms:

  • Q-Learning
  • Deep Q-Networks (DQN)
  • Multi-Agent RL
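
A minimal tabular Q-Learning sketch; the toy 5-state chain environment below is illustrative, not from any particular library:

import numpy as np

# Toy 1-D chain: 5 states, actions 0=left / 1=right, reward 1 for reaching the last state
n_states, n_actions = 5, 2

def step(state, action):
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.2  # learning rate, discount factor, exploration rate

for episode in range(500):
    state = 0
    for t in range(100):                      # cap episode length
        if np.random.rand() < epsilon:        # explore
            action = np.random.randint(n_actions)
        else:                                 # exploit (random tie-break)
            action = np.random.choice(np.flatnonzero(Q[state] == Q[state].max()))
        next_state, reward, done = step(state, action)
        # Core Q-learning update: move Q(s, a) toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state
        if done:
            break

print(Q)  # learned action values; "right" (action 1) should dominate near the goal
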
Train/Validation/Test Split

Why Split Data?

Prevent overfitting and get honest performance estimates.

Typical Split

  • Training (60-80%): Learn patterns
  • Validation (10-20%): Tune hyperparameters
  • Test (10-20%): Final performance evaluation

Quick Code

from sklearn.model_selection import train_test_split

# Split into train and temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Split temp into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
Rule: Never train on test data!
Bias-Variance Tradeoff

The Balance

Total Error = Bias² + Variance + Irreducible Error

High Bias (Underfitting)

  • Model too simple
  • Poor performance on training AND test data
  • Fix: Add features, increase model complexity, reduce regularization

High Variance (Overfitting)

  • Model too complex
  • Great on training, poor on test data
  • Fix: Get more data, reduce features, increase regularization, use ensemble methods

Sweet Spot

Balance both to minimize total error. Use cross-validation to find it!
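
One way to find the sweet spot: sweep a complexity parameter with cross-validation. A sketch using scikit-learn's validation_curve (sweeping a decision tree's max_depth is just an illustrative choice; X and y are assumed to be defined as in the other snippets):

from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

depths = range(1, 21)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name='max_depth', param_range=depths, cv=5
)

# Low depth: both scores low (high bias). High depth: train high, validation drops (high variance).
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.3f}  val={va:.3f}")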

Goal: Generalize well to new data
Overfitting/Underfitting

How to Detect

Underfitting:

  • Training accuracy is low (<80%)
  • Validation accuracy similar to training
  • Learning curves plateau early

Overfitting:

  • Training accuracy very high (>95%)
  • Large gap between training and validation accuracy
  • Validation loss increases while training loss decreases

Solutions

For Underfitting: More features, complex model, less regularization

For Overfitting: More data, dropout, early stopping, regularization (L1/L2)
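
A sketch for plotting learning curves with scikit-learn (model, X, y assumed defined): two low, flat curves suggest underfitting; a persistent train/validation gap suggests overfitting.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)

plt.plot(sizes, train_scores.mean(axis=1), label='train')
plt.plot(sizes, val_scores.mean(axis=1), label='validation')
plt.xlabel('Training set size'); plt.ylabel('Score'); plt.legend(); plt.show()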

Monitor both train and validation metrics
🔧 Key Algorithms
Linear/Logistic Regression

Linear Regression

Predicts continuous values: y = mx + b

  • When: Linear relationship between features and target
  • Assumptions: Linearity, independence, homoscedasticity, normality
  • Pros: Fast, interpretable, works with small data
  • Cons: Assumes linearity, sensitive to outliers
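
A matching quick snippet for linear regression (same pattern as the logistic example below):

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(model.coef_, model.intercept_)  # learned slope(s) m and intercept b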

Logistic Regression

Binary classification using sigmoid function

  • When: Binary outcomes (yes/no, 0/1)
  • Output: Probability between 0 and 1
  • Pros: Probabilistic output, fast, interpretable
  • Cons: Linear decision boundary
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Decision Trees/Random Forests

Decision Trees

Tree structure of if-else decisions

  • Pros: Easy to interpret, handles non-linear relationships, no scaling needed
  • Cons: Prone to overfitting, unstable (small changes → different tree)

Random Forests

Ensemble of many decision trees (bagging)

  • How: Build multiple trees on random subsets, average predictions
  • Pros: Reduces overfitting, handles missing values, feature importance
  • Cons: Less interpretable, slower than single tree
  • Best for: Tabular data, when you need robust performance
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=10)
rf.fit(X_train, y_train)

# Feature importance
importances = rf.feature_importances_
Great baseline model for tabular data
Support Vector Machines

Core Concept

Find the hyperplane that maximizes margin between classes

The Kernel Trick

Transform data to higher dimensions without computing coordinates

  • Linear: For linearly separable data
  • RBF (Radial Basis Function): Most common, handles non-linear
  • Polynomial: For polynomial relationships

When to Use

  • High-dimensional spaces (text, images)
  • Clear margin of separation
  • Small to medium datasets

Pros & Cons

Pros: Effective in high dimensions, memory efficient

Cons: Slow on large datasets, requires feature scaling

from sklearn.svm import SVC

svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
Best for: Text classification, image recognition
Neural Networks

Architecture Components

  • Input Layer: Receives features
  • Hidden Layers: Learn representations (deep = many layers)
  • Output Layer: Produces predictions
  • Activation Functions: ReLU (hidden), Sigmoid/Softmax (output)

Key Concepts

  • Backpropagation: Update weights using gradient descent
  • Learning Rate: How big each update step is (0.001-0.01 typical)
  • Epochs: Full passes through training data
  • Batch Size: Samples processed before updating weights

When to Use

Complex patterns, images, text, audio, large datasets

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation='relu', input_shape=(10,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X_train, y_train, epochs=50, batch_size=32)
Deep learning powerhouse
k-NN, k-Means, Naive Bayes

k-Nearest Neighbors (k-NN)

Classify based on k closest training examples

  • Pros: Simple, no training phase, works for multi-class
  • Cons: Slow prediction, sensitive to scale and irrelevant features
  • Tip: Always scale features, try k=3,5,7

k-Means Clustering

Partition data into k clusters (unsupervised)

  • How: Assign points to nearest centroid, update centroids, repeat
  • Use for: Customer segmentation, data compression
  • Choosing k: Elbow method (plot within-cluster sum of squares)
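
A sketch of the elbow method using KMeans' inertia_ attribute (X assumed defined):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)   # within-cluster sum of squares

plt.plot(ks, inertias, marker='o')  # look for the "elbow" where the curve flattens
plt.xlabel('k'); plt.ylabel('Inertia'); plt.show()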

Naive Bayes

Probabilistic classifier using Bayes' theorem

  • Assumption: Features are independent (rarely true but works anyway)
  • Best for: Text classification (spam detection, sentiment)
  • Pros: Fast, works with small data, handles high dimensions
# k-NN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

# k-Means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
Deep Learning Frameworks

TensorFlow/Keras

Google's production-ready deep learning framework

  • Best for: Production deployment, mobile (TensorFlow Lite), research
  • Pros: Industry standard, excellent documentation, TensorBoard visualization
  • Keras: High-level API for TensorFlow (easy to use)
  • Use when: Need production deployment, mobile apps, or serving at scale
import tensorflow as tf
from tensorflow import keras

# Sequential API (simple)
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=5)]
)

# Functional API (complex architectures)
inputs = keras.Input(shape=(10,))
x = keras.layers.Dense(64, activation='relu')(inputs)
x = keras.layers.Dense(32, activation='relu')(x)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs=inputs, outputs=outputs)

PyTorch

Facebook's research-focused deep learning framework

  • Best for: Research, experimentation, dynamic models
  • Pros: Pythonic, dynamic computation graphs, easier debugging
  • Popular in: Academic research, NLP (Hugging Face), computer vision
  • Use when: Need flexibility, research, or custom architectures
import torch
import torch.nn as nn
import torch.optim as optim

# Define model
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x

model = NeuralNet()
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop (X_train and y_train must be float tensors; y shaped [N, 1] for BCELoss)
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

TensorFlow vs PyTorch

Aspect | TensorFlow | PyTorch
Ease of Use | Keras makes it easy | More Pythonic, intuitive
Learning Curve | Moderate | Easier for Python devs
Deployment | Excellent (TF Serving, Lite) | Good (TorchServe)
Research | Good | Dominant in academia
Debugging | Harder (static graphs) | Easier (dynamic graphs)
Community | Large, industry-focused | Large, research-focused

Common Use Cases

  • Computer Vision: Both (PyTorch slightly preferred)
  • NLP: PyTorch (Hugging Face Transformers)
  • Production/Mobile: TensorFlow
  • Research Papers: PyTorch
  • Time Series: Both

Key Libraries

  • TensorFlow: Keras, TensorBoard, TF Data, TF Lite
  • PyTorch: torchvision, torchtext, Lightning (wrapper)
  • Both: ONNX (model interchange format)
Start with Keras for simplicity, PyTorch for research
📊 Model Evaluation
Classification Metrics

Accuracy

Correct predictions / Total predictions

  • When: Balanced classes
  • Misleading when: Imbalanced data (e.g., 95% class A, 5% class B)

Precision

True Positives / (True Positives + False Positives)

  • Question: Of predicted positives, how many are correct?
  • Use when: False positives are costly (spam filter)

Recall (Sensitivity)

True Positives / (True Positives + False Negatives)

  • Question: Of actual positives, how many did we catch?
  • Use when: False negatives are costly (disease detection)

F1-Score

Harmonic mean of precision and recall: 2 × (Precision × Recall) / (Precision + Recall)

  • Use when: Balance between precision and recall matters

ROC-AUC

Area Under the Receiver Operating Characteristic curve

  • Plots True Positive Rate vs False Positive Rate
  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random guessing
  • Use when: Comparing models across thresholds
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred))
y_pred_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
auc = roc_auc_score(y_test, y_pred_proba)
Choose metric based on business impact
Regression Metrics

Mean Squared Error (MSE)

Average of squared differences: Σ(actual - predicted)² / n

  • Penalizes large errors heavily
  • Same units as target variable squared

Root Mean Squared Error (RMSE)

Square root of MSE: √MSE

  • Same units as target variable
  • Most common regression metric
  • More interpretable than MSE

Mean Absolute Error (MAE)

Average of absolute differences: Σ|actual - predicted| / n

  • Less sensitive to outliers than MSE/RMSE
  • Same units as target variable
  • More robust metric

R² (Coefficient of Determination)

Proportion of variance explained: 1 - (SS_res / SS_tot)

  • R² = 1.0: Perfect predictions
  • R² = 0.0: As good as predicting mean
  • Can be negative for bad models
  • Scale-independent (compare across datasets)
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
RMSE for magnitude, R² for model quality
Cross-Validation

Why Cross-Validation?

Get more reliable performance estimates using all data for both training and validation

k-Fold Cross-Validation

  • Split data into k folds (typically k=5 or 10)
  • Train on k-1 folds, validate on remaining fold
  • Repeat k times, average results
  • Pros: Every sample used for both training and validation

Stratified k-Fold

  • Maintains class distribution in each fold
  • Use for: Imbalanced classification problems

Leave-One-Out (LOO)

  • k = n (number of samples)
  • Use for: Very small datasets
  • Con: Computationally expensive

Time Series Split

  • Respects temporal ordering
  • Critical for: Sequential data (stocks, sales)
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Simple k-fold
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Stratified k-fold
skf = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X, y, cv=skf)
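
For sequential data, a time-series split sketch (assumes rows of X and y are in chronological order):

from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on the past and validates on the future; no shuffling
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)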
Always use CV for model selection
Confusion Matrix

The Matrix

                Predicted
                 Pos    Neg
Actual  Pos     TP     FN
        Neg     FP     TN

Understanding Each Cell

  • True Positive (TP): Correctly predicted positive
  • True Negative (TN): Correctly predicted negative
  • False Positive (FP): Incorrectly predicted positive (Type I error)
  • False Negative (FN): Incorrectly predicted negative (Type II error)

What to Look For

  • High FP? Model too aggressive (raise the decision threshold)
  • High FN? Model too conservative (lower the decision threshold)
  • Imbalanced diagonal? Class imbalance or poor model

Quick Code

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
Always visualize your confusion matrix
🔄 Data Preprocessing
Feature Scaling

Why Scale?

Distance- and gradient-based algorithms (k-NN, SVM, Neural Networks) are sensitive to feature magnitude

Normalization (Min-Max Scaling)

Scale to [0, 1]: (x - min) / (max - min)

  • Use when: Bounded range needed, distribution not Gaussian
  • Sensitive to: Outliers

Standardization (Z-score)

Scale to mean=0, std=1: (x - mean) / std

  • Use when: Features roughly Gaussian, algorithm assumes this
  • Better for: Algorithms with no bounded range assumption
  • More robust to: Outliers (compared to normalization)

Robust Scaling

Use median and IQR: (x - median) / IQR

  • Use when: Heavy outliers present

When NOT to Scale

  • Tree-based models (Random Forest, XGBoost)
  • Already on same scale
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same params!

# Normalization
normalizer = MinMaxScaler()
X_train_norm = normalizer.fit_transform(X_train)
⚠️ Fit only on training data!
Handling Missing Data

Detection

df.isnull().sum()  # Count missing per column
df.isnull().sum() / len(df) * 100  # Percentage

Strategy 1: Delete

  • Drop rows: When <5% rows affected
  • Drop columns: When >50% values missing
  • Risk: Lose valuable information

Strategy 2: Imputation

Mean/Median:

  • Use mean for normal distribution
  • Use median for skewed or with outliers

Mode:

  • For categorical variables

Forward/Backward Fill:

  • For time series data

KNN Imputation:

  • Use similar samples to estimate
  • More sophisticated but slower

Strategy 3: Add Indicator

  • Create binary "was_missing" column
  • Preserves information about missingness
from sklearn.impute import SimpleImputer, KNNImputer

# Mean imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)
X_imputed = knn_imputer.fit_transform(X)
Understand WHY data is missing
Encoding Categorical Variables

Label Encoding

Convert categories to integers: Red→0, Blue→1, Green→2

  • Use for: Ordinal data (Low, Medium, High)
  • Don't use for: Nominal data (implies ordering)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])

One-Hot Encoding

Create binary column for each category

  • Use for: Nominal data with few categories (<20)
  • Pros: No artificial ordering
  • Cons: High dimensionality with many categories
import pandas as pd

# Pandas
df_encoded = pd.get_dummies(df, columns=['color'], drop_first=True)

# Scikit-learn
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first', sparse_output=False)  # use sparse=False on scikit-learn < 1.2
encoded = ohe.fit_transform(df[['color']])

Target Encoding

Replace category with mean of target variable

  • Use for: High cardinality features (zip codes, user IDs)
  • Risk: Can cause overfitting (use smoothing/cross-validation)

Frequency Encoding

Replace with frequency/count of each category

  • Simple and effective for high cardinality
Drop first column to avoid multicollinearity
Feature Engineering Tips

Create New Features

  • Interactions: Feature1 × Feature2 (e.g., income × age)
  • Polynomials: x², x³ for non-linear relationships
  • Ratios: price/sqft, sales/employees
  • Aggregations: sum, mean, std of related features

Time-Based Features

  • Hour, day of week, month, quarter
  • Is weekend? Is holiday?
  • Days since last event
  • Cyclical encoding (sin/cos for hours, months)
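
A sketch of the cyclical encoding mentioned above (assumes an hour column with values 0-23):

import numpy as np

# Encode hour of day on a circle so that hour 23 sits next to hour 0
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)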

Text Features

  • Length of text
  • Number of words, sentences
  • Presence of keywords
  • Sentiment scores
  • TF-IDF for important terms

Domain-Specific

  • Use domain knowledge to create meaningful features
  • Example (housing): age of house, distance to city center
  • Example (finance): moving averages, volatility

Feature Selection

  • Remove low variance features
  • Remove highly correlated features (>0.95)
  • Use feature importance from tree models
  • Recursive Feature Elimination (RFE)
  • L1 regularization (Lasso)
from sklearn.feature_selection import SelectKBest, f_classif

# Select top k features
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# From tree model
feature_imp = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
Domain knowledge > automated methods
💡 Practical Tips
Hyperparameter Tuning

Grid Search

Try every combination of specified parameters

  • Pros: Exhaustive, guaranteed to find best in grid
  • Cons: Exponentially slow with more parameters
  • Use when: Few parameters, small ranges
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_

Random Search

Sample random combinations

  • Pros: Faster, explores more space
  • Cons: May miss optimal
  • Use when: Many parameters, large ranges
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [10, 20, 30, 40, None],
    'min_samples_split': [2, 5, 10, 15]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_dist,
    n_iter=20,  # Number of random combinations
    cv=5,
    n_jobs=-1
)

random_search.fit(X_train, y_train)

Key Hyperparameters by Algorithm

Random Forest: n_estimators, max_depth, min_samples_split

SVM: C (regularization), kernel, gamma

Neural Networks: learning_rate, batch_size, hidden_layers, neurons

XGBoost: learning_rate, max_depth, n_estimators, subsample

Start with defaults, then tune most important params
Algorithm Selection Guide

By Problem Type

Binary Classification:

  • Logistic Regression (baseline)
  • Random Forest (robust)
  • XGBoost (high performance)
  • Neural Networks (complex patterns)

Multi-class Classification:

  • Random Forest
  • XGBoost
  • Naive Bayes (text)

Regression:

  • Linear Regression (baseline)
  • Random Forest
  • XGBoost
  • Neural Networks

Clustering:

  • K-Means (spherical clusters)
  • DBSCAN (arbitrary shapes, outliers)
  • Hierarchical (dendrograms)

By Data Characteristics

Small Data (<10k samples):

  • Logistic Regression, Naive Bayes
  • Simple models to avoid overfitting

Large Data (>100k samples):

  • Neural Networks, XGBoost
  • Can learn complex patterns

High Dimensional (many features):

  • Regularized models (Lasso, Ridge)
  • Random Forest (handles many features)
  • Feature selection first

Imbalanced Classes:

  • Random Forest with class_weight='balanced'
  • XGBoost with scale_pos_weight
  • SMOTE for oversampling
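
A sketch of two of these options; class weights are plain scikit-learn, while SMOTE assumes the imbalanced-learn package is installed:

from sklearn.ensemble import RandomForestClassifier

# Option 1: penalize mistakes on the minority class more heavily
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)

# Option 2: oversample the minority class with SMOTE (requires imbalanced-learn)
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
rf.fit(X_res, y_res)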

Quick Decision Tree

Need interpretability? → Logistic Regression or Decision Tree

Need high accuracy? → XGBoost or Random Forest

Have images/text? → Neural Networks (CNN/RNN)

Limited time? → Start with Random Forest

Always try multiple algorithms
Common Pitfalls & Debugging

Data Leakage

Information from test set leaks into training

  • Example: Scaling before train/test split
  • Fix: Always split first, then preprocess
  • Example: Using future information in time series
  • Fix: Use time-based split
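
A sketch of the leakage-safe pattern: put preprocessing inside a Pipeline so the scaler is re-fit on each training fold only:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler is fit inside each CV fold, so validation folds never influence preprocessing
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)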

Class Imbalance

One class dominates dataset (e.g., 95% vs 5%)

  • Symptom: High accuracy but poor recall on minority class
  • Solutions:
    • Use stratified sampling
    • Oversample minority class (SMOTE)
    • Undersample majority class
    • Use class weights
    • Change evaluation metric (F1, AUC instead of accuracy)

Poor Performance Checklist

  • ✓ Check for data leakage
  • ✓ Verify train/test split is correct
  • ✓ Look for missing values
  • ✓ Check feature scaling
  • ✓ Examine class distribution
  • ✓ Plot learning curves (more data needed?)
  • ✓ Try different algorithms
  • ✓ Engineer better features

Model Not Learning

  • Neural Networks: Learning rate too high/low, bad initialization
  • All models: Features not informative, need more data

Overfitting Signs

  • Training accuracy >> test accuracy (gap >10%)
  • Performance degrades on new data
  • Model too complex for data size
# Check for data leakage
from sklearn.model_selection import cross_val_score

# If the cross-validation score is much worse than the training score → suspect leakage
train_score = model.score(X_train, y_train)   # model previously fit on X_train, y_train
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV: {cv_scores.mean():.3f}, Train: {train_score:.3f}")
⚠️ Always validate on unseen data
Quick Reference Code

Complete ML Pipeline

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# 1. Load data
df = pd.read_csv('data.csv')

# 2. Basic exploration
print(df.info())
print(df.describe())
print(df.isnull().sum())

# 3. Prepare features and target
X = df.drop('target', axis=1)
y = df['target']

# 4. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 6. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# 7. Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# 8. Cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

Pandas Essentials

# Load data
df = pd.read_csv('file.csv')

# Exploration
df.head()
df.shape
df.dtypes
df.describe()
df.isnull().sum()

# Selection
df['column']
df[['col1', 'col2']]
df[df['age'] > 30]

# Missing values
df.dropna()
df.fillna(df.mean(numeric_only=True))  # mean-impute numeric columns only

# Encoding
pd.get_dummies(df, columns=['category'])

# Group by
df.groupby('category')['value'].mean()
Bookmark this for quick reference!
🏗️ MLOps Foundational Practices
📦 Version Control for ML
  • Git: Code versioning (branches, commits, merges)
  • DVC: Data Version Control - track datasets and models
  • MLflow: Experiment tracking, parameter logging
  • Weights & Biases: Visualization and collaboration
  • Best Practice: Version data, code, and models together
🔄 Data Pipeline Management
  • Airflow: Workflow orchestration with DAGs
  • Prefect: Modern workflow automation
  • Data Validation: Great Expectations, Pydantic
  • Feature Engineering: Automated feature stores (Feast)
  • Pipeline: Ingestion → Validation → Transform → Store
🧪 Model Training & Experimentation
  • Experiment Tracking: Log metrics, parameters, artifacts
  • Hyperparameter Tuning: Optuna, Ray Tune, Hyperopt
  • Reproducibility: Fix random seeds, document environment
  • Distributed Training: Ray, Horovod for multi-GPU
  • Checkpointing: Save model states during training
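
A minimal experiment-tracking sketch with MLflow (run name, values, and the fitted model variable are illustrative assumptions):

import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name="rf-baseline"):
    mlflow.log_param("n_estimators", 100)       # hyperparameters used for this run
    mlflow.log_metric("val_accuracy", 0.93)     # illustrative metric value
    mlflow.sklearn.log_model(model, "model")    # persist the trained model with the run
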
🏷️ Model Registry & Versioning
  • Centralized Storage: Single source of truth for models
  • Metadata: Track metrics, parameters, dependencies
  • Lineage: Data → Training → Model connections
  • Stages: Development → Staging → Production
  • Tools: MLflow Registry, Neptune.ai, Weights & Biases
🚀 Model Serving & Deployment
  • REST APIs: FastAPI, Flask for HTTP endpoints
  • Batch Inference: Process large datasets offline
  • Real-time Serving: TensorFlow Serving, TorchServe
  • Edge Deployment: TensorFlow Lite, ONNX Runtime
  • Load Balancing: Handle multiple requests efficiently
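
A minimal REST endpoint sketch with FastAPI (the model path and feature format are assumptions):

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")   # hypothetical path to a trained scikit-learn model

class Features(BaseModel):
    values: list[float]               # one row of input features

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])
    return {"prediction": prediction.tolist()}

# Run with: uvicorn app:app --reload   (assuming this file is app.py)
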
📊 Monitoring & Observability
  • Performance Metrics: Accuracy, latency, throughput
  • Data Drift: Monitor input distribution changes
  • Concept Drift: Track output/prediction patterns
  • Alerting: PagerDuty, Opsgenie for anomalies
  • Tools: Evidently AI, WhyLabs, Prometheus + Grafana
🏗️ Infrastructure as Code
  • Terraform: Cloud-agnostic infrastructure provisioning
  • CloudFormation: AWS-specific IaC
  • Pulumi: IaC using programming languages
  • Benefits: Reproducible, versionable, auditable
  • State Management: Track infrastructure changes
🐳 Containerization
  • Docker: Package code, dependencies, models together
  • Dockerfile: Define build steps, base image
  • Multi-stage Builds: Optimize image size
  • Container Registry: Docker Hub, ECR, GCR
  • Benefits: Consistency across dev/staging/prod
🔄 CI/CD Workflow
🌳 Source Control & Branching
  • GitFlow: feature/develop/release/hotfix branches
  • Trunk-Based: Short-lived branches, frequent merges
  • Pull Requests: Code review, approval workflows
  • Branch Protection: Enforce tests, reviews before merge
  • Merge Strategies: Merge commit, squash, rebase
⚙️ Continuous Integration (CI)
  • Automated Testing: Run tests on every commit
  • Linting: flake8, pylint, black for code quality
  • Code Coverage: pytest-cov, coverage.py
  • Build Automation: Compile, package, create artifacts
  • Tools: Jenkins, GitHub Actions, GitLab CI, CircleCI
Automated Testing
  • Unit Tests: Test individual functions (pytest, unittest)
  • Integration Tests: Test component interactions
  • E2E Tests: Test complete workflows (Selenium, Playwright)
  • Model Tests: Validate predictions, data quality
  • Test Pyramid: Many unit, some integration, few E2E
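
A sketch of a simple model test with pytest (file name, model path, and feature count are illustrative):

# test_model.py -- run with: pytest
import joblib
import numpy as np

def test_predict_proba_rows_sum_to_one():
    model = joblib.load("model.joblib")         # hypothetical trained binary classifier
    X_sample = np.random.rand(10, 4)            # feature count must match the model
    proba = model.predict_proba(X_sample)
    assert proba.shape == (10, 2)
    assert np.allclose(proba.sum(axis=1), 1.0)  # each row is a probability distribution
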
📦 Artifact Management
  • Container Registry: Store Docker images (ECR, GCR, ACR)
  • Model Registry: Store trained models with metadata
  • Package Registry: PyPI, npm for dependencies
  • Versioning: Semantic versioning (v1.2.3)
  • Caching: Speed up builds with dependency caching
🚢 Continuous Deployment (CD)
  • Blue-Green: Two identical environments, instant switch
  • Canary Release: Gradual rollout to subset of users
  • Rolling Update: Replace instances incrementally
  • Rollback: Quick revert to previous version
  • Tools: Spinnaker, ArgoCD, Flux for GitOps
🌍 Environment Management
  • Dev: Development, rapid iteration, debugging
  • Staging: Production-like, final testing
  • Production: Live environment serving users
  • Parity: Keep environments identical
  • Secrets: Vault, AWS Secrets Manager, env variables
🔗 Pipeline Orchestration
  • Multi-stage: Build → Test → Deploy stages
  • Dependencies: Stage order, parallel execution
  • Pipeline as Code: YAML (.github/workflows), Jenkinsfile
  • Triggers: On push, PR, schedule, manual
  • Artifacts: Pass outputs between stages
📈 Monitoring & Feedback
  • Pipeline Metrics: Success rate, build time, failure rate
  • Deployment Tracking: DORA metrics (lead time, frequency)
  • Failure Analysis: Root cause, trends over time
  • Notifications: Slack, email, PagerDuty on failures
  • Rollback Triggers: Auto-rollback on errors