Missing value handling
| Strategy | When to use |
|---|---|
| Mean/median imputation | Numeric columns, few missing values |
| Mode imputation | Categorical columns |
| Forward/backward fill | Time series data |
| KNN imputation | Values depend on other features |
| Drop rows | Small percentage missing, large dataset |
| Indicator column | Missingness itself is informative |
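
As a rough illustration of the strategies in the table, the sketch below applies several of them to a toy column using only the Python standard library. The helper names (`impute_constant`, `forward_fill`, `missing_indicator`) and the sample data are hypothetical and chosen for this example; a real pipeline would more likely rely on a dataframe or ML library.

```python
from statistics import mean, median, mode


def impute_constant(values, fill):
    """Replace every None with a single fill value."""
    return [fill if v is None else v for v in values]


def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def missing_indicator(values):
    """Return 1 where the value was missing, otherwise 0."""
    return [1 if v is None else 0 for v in values]


# Hypothetical numeric column with two gaps.
ages = [22, None, 35, 41, None, 29]
observed_ages = [v for v in ages if v is not None]  # the "drop rows" strategy

mean_filled = impute_constant(ages, mean(observed_ages))
median_filled = impute_constant(ages, median(observed_ages))
forward_filled = forward_fill(ages)
was_missing = missing_indicator(ages)

# Hypothetical categorical column: fill with the most frequent value.
colors = ["red", None, "blue", "red"]
observed_colors = [c for c in colors if c is not None]
mode_filled = impute_constant(colors, mode(observed_colors))
```

The indicator column is computed from the original data, before any imputation, so the information about which rows were missing is preserved alongside the filled values.
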
Encoding categorical variables
| Method | When to use |
|---|---|
| One-hot encoding | Nominal categories (color, city) with low cardinality |
| Label (ordinal) encoding | Ordinal categories with a natural order (low/medium/high) |
| Target encoding | High-cardinality categories (zip code, product ID) |
| Binary encoding | High-cardinality categories where one-hot encoding would create too many columns |
| Frequency encoding | How often a category occurs is itself predictive |
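
To make the differences between these encodings concrete, here is a minimal standard-library sketch of each one. The function names and toy data are assumptions made for illustration, not an existing library API, and dedicated encoder libraries may differ in details such as column ordering or how unseen categories are handled.

```python
from collections import Counter, defaultdict


def one_hot(values):
    """One 0/1 column per distinct category (columns in sorted order)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def ordinal_encode(values, order):
    """Map each category to its rank in an explicit, caller-supplied order."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]


def binary_encode(values):
    """Write each category's integer index as a fixed-width list of bits."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    width = max(1, (len(categories) - 1).bit_length())
    return [[int(b) for b in format(index[v], f"0{width}b")] for v in values]


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]


def target_encode(values, targets):
    """Replace each category with the mean target observed for it."""
    sums, counts = defaultdict(float), Counter(values)
    for v, t in zip(values, targets):
        sums[v] += t
    return [sums[v] / counts[v] for v in values]


# Hypothetical toy data.
colors = ["red", "blue", "red", "green"]
levels = ["low", "high", "medium", "low"]
zip_codes = ["A", "B", "A", "B"]
purchased = [1, 0, 0, 0]

one_hot_rows, one_hot_columns = one_hot(colors)
level_codes = ordinal_encode(levels, order=["low", "medium", "high"])
color_bits = binary_encode(colors)
color_freq = frequency_encode(colors)
zip_target = target_encode(zip_codes, purchased)
```

Note that binary encoding needs only 2 columns for 3 categories here, where one-hot needs 3. The target encoding above uses every row for simplicity; in practice the category means should be computed on training data only, otherwise the target leaks into the feature.
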
Scaling and normalization
| Method | When to use |
|---|---|
| StandardScaler (Z-score) | Default choice for most ML algorithms; roughly normally distributed features |
| MinMaxScaler (0-1) | Neural networks, distance-based algorithms |
| RobustScaler | Data with outliers |
| Log transform | Right-skewed, non-negative distributions (income, prices) |
| Box-Cox transform | Strictly positive data with varying degrees of skew; makes the distribution closer to normal |
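
The arithmetic behind each method in the table can be sketched with the standard library alone. The function names and sample data below are hypothetical, and the code only illustrates the formulas; it does not reproduce any particular library's implementation. The Box-Cox sketch takes a fixed `lam` as an assumption, whereas real implementations typically estimate it from the data.

```python
import math
from statistics import mean, median, pstdev, quantiles


def standardize(xs):
    """Z-score: subtract the mean, divide by the population std deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def min_max(xs):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


def log_transform(xs):
    """log(1 + x), which is defined at x = 0; requires x > -1."""
    return [math.log1p(x) for x in xs]


def box_cox(xs, lam):
    """Box-Cox with a fixed lambda; requires strictly positive inputs."""
    if any(x <= 0 for x in xs):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


# Hypothetical right-skewed column with one large outlier.
incomes = [1, 2, 3, 4, 100]

z_scores = standardize(incomes)
unit_range = min_max(incomes)
robust = robust_scale(incomes)
logged = log_transform(incomes)
box_cox_half = box_cox([1, 4], lam=0.5)
```

On this data the outlier squeezes the first four min-max values into the range 0 to about 0.03, while robust scaling keeps them spread between -1 and 0.5, which is why the table recommends it for data with outliers.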