There are numerous types of data and each has an encoding mechanism that can improve ML results.
Binary is boolean data. It’s pretty easy to determine binary data - the unique value count is 1, 2, or 3.
Numerical - Continuous
Numeric is data that can be measured on a continuous quantified scale. It is also called interval data. Examples include:
- Heart Rate
Order matters but the distance between the value is not quantified. Includes measures like ranking, ratings, & range bins. Think: t-shirt size.
The range of values is not ordered; may includes things like color, species, and food preferences.
Text data is words, sentences, or documents.
Temporal data is date, time, or order data - time series. Time series data can form trends and can be seasonal OR both. Seasonality + Trends + Noise = Time series.
There are three approaches to missing data: Ask the domain expert Delete the data Impute the data using a mean or regression.
Useless data is data that has no statistical relevance to the problem being solved. Often useless data is called “high cardinality” data because it has no relationship to the label or other inferred value. To determine if the data is useful/useless ask a data domain expert or do a data study to determine any sort of correlation. Common useless data includes:
- Unique identifiers (like Account numbers)
- Names, Email address
Unbalanced data is data that lacks, important rare data. There are three approaches to missing data:
- Ask the domain expert
- Choose the right measure of success - accuracy is less the point of unbalanced data; sensitivity and specificity, or the combined score of these elements an F1 score, are better measures.
- undersample, oversample, or synthesis - undersampling is simply removing some of the overweighted class of data; oversample is including more of the underweighted data (which can be implemented using SMOTE - Synthetic Minority Oversampling Technique - in Data Wrangler); synthesis is just creating some sample data.