Logistic Regression
A binary classification algorithm in which the input feature (X) is continuous and the target (y) is categorical. Fitting this kind of logistic regression produces a sigmoid curve that maps inputs to probabilities between 0 and 1.
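A minimal sketch, assuming scikit-learn and a synthetic one-feature dataset (both assumptions, not from the notes), showing that the fitted model's predicted probabilities follow a sigmoid of the linear score:

```python
# Minimal sketch: logistic regression on one continuous feature (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))                                      # continuous feature
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)    # binary label

model = LogisticRegression().fit(X, y)

scores = model.decision_function(X)           # linear score w*x + b
probs = model.predict_proba(X)[:, 1]          # sigmoid of the score: 1 / (1 + exp(-score))
```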
Linear Regression & Stochastic Gradient Descent
To compute the best fit for a linear regression, algorithms use loss functions. Loss function (cost function) - a measure of the difference between predicted values and real values. A basic loss function is the Residual Sum of Squares (RSS); minimizing it by directly searching the parameter space is largely brute force and impractical at scale. A better approach is stochastic gradient descent (SGD): the algorithm estimates the gradient of the loss from randomly sampled data points and takes small steps along it.
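A rough illustration of the idea, assuming a plain NumPy implementation and synthetic data (both assumptions); each update samples one point and steps down the gradient of its squared error:

```python
# Rough sketch: fit y = w*x + b by stochastic gradient descent on squared error,
# using one randomly sampled point per update instead of the whole dataset.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=500)

w, b, lr = 0.0, 0.0, 0.01
for step in range(5000):
    i = rng.integers(len(x))            # sample a single point
    err = (w * x[i] + b) - y[i]         # prediction error for that sample
    w -= lr * err * x[i]                # gradient of 0.5*err^2 w.r.t. w
    b -= lr * err                       # gradient of 0.5*err^2 w.r.t. b

print(w, b)                             # approaches the true slope 3.0 and intercept 2.0
```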
Linear Learner
Expects normalized, shuffled data. Uses SGD and L1/L2 regularization.
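Linear Learner itself is invoked through the SageMaker SDK; as a rough local stand-in (an assumption, not the SageMaker API), scikit-learn's SGDClassifier shows the same ingredients: normalized, shuffled data trained by SGD with combined L1/L2 regularization.

```python
# Local stand-in for the Linear Learner recipe (not the SageMaker API):
# normalize, shuffle, then fit a linear model with SGD + L1/L2 regularization.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

model = make_pipeline(
    StandardScaler(),                          # normalize features
    SGDClassifier(loss="log_loss",             # logistic loss, trained by SGD
                  penalty="elasticnet",        # mixes L1 and L2 regularization
                  l1_ratio=0.15,
                  shuffle=True,                # reshuffle data each epoch
                  random_state=0),
)
model.fit(X, y)
```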
Support Vector Machines (SVM)
SVMs enable binary classification. The points in each class that lie closest to the other class are the support vectors, and the optimal hyperplane sits midway between the support vectors, maximizing the margin. The SVM with a Radial Basis Function (RBF) kernel is a variation of the linear SVM used to separate non-linearly separable data. This is implemented as "Factorization Machines" on SageMaker.
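A sketch with scikit-learn on synthetic concentric-circles data (both assumptions), comparing a linear SVM to the RBF-kernel variant on data that is not linearly separable:

```python
# Sketch: linear SVM vs. RBF-kernel SVM on data with a non-linear boundary.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)            # handles the non-linear boundary

print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:   ", rbf_svm.score(X, y))
print("support vectors:", rbf_svm.support_vectors_.shape)   # points that define the margin
```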
Random Forests
Useful for anomaly (& outlier) detection; the Random Cut Forest variant is implemented in QuickSight, Kinesis Data Analytics, and SageMaker
- take a random sample of the data (via bootstrapping: a dataset of the same size selected at random, with replacement, from the original dataset; the rows left behind form the out-of-bag (OOB) dataset) and a random subset of features
- create many trees, one per bootstrap sample
- test each tree against its OOB samples to generate the out-of-bag error rate, used to tune the model (e.g. the number of features considered per split)
- have the trees vote on the final prediction (see the sketch below)
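A sketch of the steps above using scikit-learn's RandomForestClassifier on synthetic data (an assumption, not an AWS implementation):

```python
# Sketch: bootstrapped samples, random feature subsets, many trees,
# OOB error for tuning, and majority voting for predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,        # create many trees
    max_features="sqrt",     # random subset of features at each split
    bootstrap=True,          # each tree sees a bootstrap sample; the rest is OOB
    oob_score=True,          # evaluate on the out-of-bag samples
    random_state=0,
).fit(X, y)

print("OOB accuracy:", forest.oob_score_)    # 1 - out-of-bag error rate
preds = forest.predict(X)                    # trees vote on the final prediction
```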
K-means Clustering
unsupervised; groups the dataset into K clusters; ideally K is the "knee" in the total-variation curve
- decide how many clusters (K)
- randomly select K samples as the initial cluster centers
- for each sample, determine the nearest cluster and assign the sample to it
- recompute each cluster's center point (and repeat the previous step until assignments stop changing)
- measure the total variation (the sum of distances from each point to its cluster's center); the sketch below repeats this for several values of K
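A sketch, assuming scikit-learn and synthetic blob data (both assumptions), that reruns k-means for several K and prints the total variation (scikit-learn calls it inertia) so the knee can be spotted:

```python
# Sketch: run k-means for several K and look for the "knee" in total variation.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # drops sharply until K reaches the true cluster count, then flattens
```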
K-Nearest Neighbors
supervised; answers: given the K closest neighbors, how should a new data point be classified? The optimal K is best determined by experimentation; use caution if the value of K approaches the number of samples in a class
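A sketch with scikit-learn's KNeighborsClassifier on the Iris dataset (an assumption); Iris has 50 samples per class, so K=45 illustrates the caution above:

```python
# Sketch: classify with K nearest neighbors and compare a few values of K.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in (1, 5, 15, 45):     # experiment with K; K near the class size washes out small classes
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(k, round(score, 3))
```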
Text Algorithms
Latent Dirichlet Allocation (LDA)
Topic analysis on a bag-of-words representation of a corpus of documents; a document can include one or more topics; a topic is a distribution over words
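A sketch, assuming scikit-learn and a tiny made-up corpus (both assumptions), of LDA over a bag-of-words count matrix:

```python
# Sketch: LDA over a bag of words; each fitted topic is a distribution over words.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cats and dogs are popular pets",
    "dogs chase cats in the yard",
    "stocks and bonds moved higher today",
    "the market rallied as bond yields fell",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)   # bag of words
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

doc_topics = lda.transform(counts)        # per-document topic mixture
topic_words = lda.components_             # per-topic word weights (unnormalized distribution)
```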
Principal Component Analysis
Principal Component Analysis (PCA) (unsupervised) - combines highly correlated data into a smaller number of features; it isn't dropping columns that are useless or have a weak relationship with the label/target. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. PCA is commonly used as an exploratory data analysis tool.
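A sketch with scikit-learn's PCA on the Iris features (an assumption), reducing correlated columns to two uncorrelated components:

```python
# Sketch: project correlated features onto a few uncorrelated components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)            # composite, uncorrelated features
print(pca.explained_variance_ratio_)           # variance captured by each component
```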