Feature engineering turns raw data into inputs that improve model performance. Sourcetable automates common transformations through the AI assistant.

Missing value handling

"Handle missing values in the dataset: use median for numeric and mode for categorical"

| Strategy | When to use |
| --- | --- |
| Mean/median imputation | Numeric columns, few missing values |
| Mode imputation | Categorical columns |
| Forward/backward fill | Time series data |
| KNN imputation | Values depend on other features |
| Drop rows | Small percentage missing, large dataset |
| Indicator column | Missingness itself is informative |
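
For readers who want to see what the median, mode, and indicator strategies amount to, here is a minimal pandas sketch. The column names and sample values are hypothetical, and this is an illustration rather than Sourcetable's own implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Austin", "Boston", None, "Austin"],
})

# Record missingness before filling, in case it is informative on its own.
df["age_missing"] = df["age"].isna().astype(int)

# Median for the numeric column, mode (most frequent value) for the categorical one.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])
```

The indicator column is created first because the information about which rows were missing is lost once the gaps are filled.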

Encoding categorical variables

"Encode all categorical columns for the ML model"

| Method | When to use |
| --- | --- |
| One-hot encoding | Nominal categories (color, city) with low cardinality |
| Label (ordinal) encoding | Ordinal categories (low/medium/high), with the category order stated explicitly |
| Target encoding | High-cardinality categories (zip code, product ID) |
| Binary encoding | High-cardinality categories where one-hot encoding would create too many columns |
| Frequency encoding | When category frequency matters |
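
As a rough illustration of three of these methods, the sketch below applies one-hot, ordinal, and frequency encoding with pandas. The columns `color`, `size`, and `zip_code` and the mapping `size_order` are made up for the example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["low", "high", "medium", "low"],
    "zip_code": ["78701", "02108", "78701", "78701"],
})

# One-hot: one 0/1 indicator column per nominal category.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

# Ordinal: map categories to integers in an explicitly stated order.
size_order = {"low": 0, "medium": 1, "high": 2}
encoded["size"] = encoded["size"].map(size_order)

# Frequency: replace each category with its share of rows.
zip_freq = df["zip_code"].value_counts(normalize=True)
encoded["zip_code_freq"] = df["zip_code"].map(zip_freq)
encoded = encoded.drop(columns=["zip_code"])
```

Spelling out the order in `size_order` matters: an encoder that assigns integers alphabetically would place "high" before "low".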

Scaling and normalization

"Normalize all numeric features to 0-1 range"

| Method | When to use |
| --- | --- |
| StandardScaler (Z-score) | Most ML algorithms, normally distributed data |
| MinMaxScaler (0-1) | Neural networks, distance-based algorithms |
| RobustScaler | Data with outliers |
| Log transform | Right-skewed, positive-valued distributions (income, prices) |
| Box-Cox transform | Strictly positive data that should be closer to normal, across various skew levels |
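
The formulas behind the first four methods are short enough to write out directly. The sketch below computes them with pandas and NumPy on a hypothetical `income` column; the scikit-learn classes named in the table apply the same arithmetic per column.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed column with one large outlier.
df = pd.DataFrame({"income": [20_000.0, 40_000.0, 60_000.0, 1_000_000.0]})
x = df["income"]

# Min-max: rescale to the 0-1 range.
df["income_minmax"] = (x - x.min()) / (x.max() - x.min())

# Z-score: zero mean, unit variance (population std, as StandardScaler uses).
df["income_z"] = (x - x.mean()) / x.std(ddof=0)

# Robust: center on the median and scale by the interquartile range.
iqr = x.quantile(0.75) - x.quantile(0.25)
df["income_robust"] = (x - x.median()) / iqr

# Log transform: log(1 + x) compresses the long right tail.
df["income_log"] = np.log1p(x)
```

With an outlier like the last row, min-max scaling squeezes the other values close to 0, which is the situation the robust and log variants are meant to handle.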

Feature creation

Date/time features

"Extract year, month, day of week, and hour from the timestamp column"
Creates: year, quarter, month, week, day of week, day of month, hour, minute, is_weekend, is_holiday, days_since_start.
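
A minimal pandas sketch of several of these date/time features follows, assuming a datetime column named `timestamp` (hypothetical data). `is_holiday` is left out because it depends on a holiday calendar for a specific region.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-06 09:30:00",  # a Saturday
        "2024-03-11 17:45:00",  # a Monday
    ])
})

ts = df["timestamp"].dt
df["year"] = ts.year
df["quarter"] = ts.quarter
df["month"] = ts.month
df["day_of_week"] = ts.dayofweek  # Monday=0 ... Sunday=6
df["hour"] = ts.hour
df["is_weekend"] = ts.dayofweek >= 5

# Whole days elapsed since the earliest timestamp in the column.
df["days_since_start"] = (df["timestamp"] - df["timestamp"].min()).dt.days
```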

Aggregation features

"Create customer-level features: total purchases, average order value, days since last order"
Group-by aggregations: sum, mean, median, count, min, max, std, first, last.
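
The customer-level prompt above could be approximated with a pandas group-by, as in this sketch. The `orders` table, its column names, and the reference date `as_of` are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical order-level data.
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "b"],
    "order_value": [20.0, 40.0, 10.0, 30.0, 50.0],
    "order_date": pd.to_datetime(
        ["2024-01-01", "2024-01-10", "2024-01-05", "2024-01-15", "2024-01-20"]
    ),
})

as_of = pd.Timestamp("2024-01-31")  # hypothetical "today" for recency features

# One row per customer, one named aggregation per output column.
customers = orders.groupby("customer_id").agg(
    total_purchases=("order_value", "count"),
    total_spend=("order_value", "sum"),
    avg_order_value=("order_value", "mean"),
    last_order=("order_date", "max"),
)
customers["days_since_last_order"] = (as_of - customers["last_order"]).dt.days
customers = customers.drop(columns=["last_order"])
```

Fixing `as_of` to a stated date, rather than the current time, keeps the recency feature reproducible between runs.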

Interaction features

"Create interaction features between price and quantity"
Products, ratios, and polynomial features between numeric columns.
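
For two numeric columns, these interaction features reduce to simple column arithmetic. The sketch below uses hypothetical `price` and `quantity` values; the output column names are illustrative.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"price": [10.0, 4.0, 2.5], "quantity": [3, 5, 8]})

df["price_x_quantity"] = df["price"] * df["quantity"]     # product
df["price_per_quantity"] = df["price"] / df["quantity"]   # ratio
df["price_squared"] = df["price"] ** 2                    # degree-2 polynomial term
```

Ratios need care when the denominator can be zero; in that case the result is infinite or undefined and usually has to be handled before modeling.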

Text features

"Extract features from the product description column"
Creates: word count, character count, average word length, sentiment score, TF-IDF vectors.
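
The count-based text features can be sketched with pandas string methods, as below, assuming a text column named `description` (hypothetical data). Sentiment scores and TF-IDF vectors need a dedicated model or vectorizer and are not shown.

```python
import pandas as pd

# Hypothetical sample data, including an empty description.
df = pd.DataFrame({
    "description": ["Soft cotton shirt", "Durable steel water bottle", ""],
})

text = df["description"].fillna("")
words = text.str.split()  # list of whitespace-separated words per row

df["char_count"] = text.str.len()
df["word_count"] = words.str.len()

# Average word length, defined as 0 for rows with no words.
total_letters = words.apply(lambda ws: sum(len(w) for w in ws))
df["avg_word_length"] = (total_letters / df["word_count"]).where(
    df["word_count"] > 0, 0.0
)
```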

Feature selection

Importance-based

"Rank features by importance using a random forest"
Trains a model and ranks features by their contribution to predictions.
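
One common way to do this outside Sourcetable is scikit-learn's random forest and its `feature_importances_` attribute. The sketch below uses synthetic data in which the target depends strongly on `signal`, weakly on `weak_signal`, and not at all on `noise`; all names and coefficients are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: y is driven mostly by "signal".
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "weak_signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 5.0 * X["signal"] + 0.5 * X["weak_signal"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort to get a ranking.
ranking = pd.Series(model.feature_importances_, index=X.columns).sort_values(
    ascending=False
)
```

Impurity-based importances can favor high-cardinality or correlated features, so a ranking like this is a starting point rather than a final answer.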

Variance Inflation Factor (VIF)

"Check for multicollinearity and drop highly correlated features"
Calculates VIF for each feature. A VIF above 5 is a common warning threshold, and a VIF above 10 usually indicates problematic multicollinearity.
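
For reference, the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all the other features. The same numbers appear on the diagonal of the inverse correlation matrix, which the NumPy sketch below uses. The data is synthetic, with `a_plus_b` built to be nearly collinear with `a` and `b`.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_plus_b" is almost a linear combination of "a" and "b".
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_b": a + b + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),  # independent of the others
})

# VIF_i = 1 / (1 - R_i^2) = i-th diagonal entry of the inverse correlation matrix.
corr = np.corrcoef(X.to_numpy(), rowvar=False)
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)
```

Here `a`, `b`, and `a_plus_b` all come out far above the usual thresholds, while `c` stays close to 1. Dropping one of the collinear columns and recomputing is the usual next step.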

Recursive Feature Elimination (RFE)

"Use RFE to find the optimal set of features for the model"
Iteratively refits the model and removes the least important feature, stopping once removing more features no longer improves performance.
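
scikit-learn's `RFE` class implements the elimination loop, as sketched below on synthetic data where only `f0` and `f3` influence the target. Note one difference from the description above: plain `RFE` stops at a fixed `n_features_to_select`, while the related `RFECV` class chooses the count by cross-validated score.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends only on f0 and f3.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(200, 5)), columns=["f0", "f1", "f2", "f3", "f4"]
)
y = 3.0 * X["f0"] - 2.0 * X["f3"] + rng.normal(scale=0.1, size=200)

# Drop one feature per iteration (step=1) until two remain.
selector = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X, y)
selected = X.columns[selector.support_].tolist()
```

After fitting, `selector.ranking_` gives the elimination order, with 1 marking the features that were kept.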