## Logistic Regression

A binary classification algorithm: the input features are continuous and the target y is categorical. Fitting a logistic regression produces a sigmoid (S-shaped) curve that maps inputs to class probabilities.
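A minimal sketch of the idea in NumPy (the weight `w` and bias `b` values are illustrative, not fitted): a linear score is squashed through the sigmoid into a probability, then thresholded into a class.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear score w*x + b is squashed into a class probability.
w, b = 2.0, -1.0
x = np.array([-2.0, 0.0, 0.5, 3.0])
probs = sigmoid(w * x + b)
labels = (probs >= 0.5).astype(int)  # threshold at 0.5 for binary classification
```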

## Linear Regression & Stochastic Gradient Descent

To compute the best fit for a linear regression, algorithms minimize a loss function. A loss function (also called a cost function) measures the difference between predicted values and real values. A basic loss function is the Residual Sum of Squares (RSS); minimizing it by exhaustively searching parameter values is largely brute force and impractical at scale. A better approach is stochastic gradient descent (SGD), in which the algorithm steps downhill along the gradient using randomly sampled data points.
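A minimal SGD sketch in NumPy, assuming a single-feature model `y = w*x + b` and synthetic data: each step updates the parameters using the gradient of the squared error for one randomly drawn sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 2 plus a little noise.
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + 2.0 + rng.normal(0, 0.1, size=200)

w, b = 0.0, 0.0
lr = 0.1  # learning rate: size of each step along the gradient
for epoch in range(50):
    for i in rng.permutation(len(X)):  # visit samples in a new random order each epoch
        err = (w * X[i] + b) - y[i]
        # gradient of the squared error for this single sample
        w -= lr * err * X[i]
        b -= lr * err
```

Each update is cheap because it touches one sample, which is what makes SGD practical at scale compared with recomputing RSS over the whole dataset.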

### Linear Learner

Expects normalized, shuffled data. Uses SGD with L1/L2 regularization.
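As an open-source stand-in for Linear Learner's behavior (an assumption, not the SageMaker implementation itself), scikit-learn's `SGDRegressor` fits a linear model with SGD and an elastic-net penalty combining L1 and L2, after normalizing the inputs:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3

# Normalize features, then fit with SGD and an elastic-net (L1 + L2) penalty;
# SGDRegressor shuffles the training data each epoch by default.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty="elasticnet", alpha=1e-4, l1_ratio=0.15, random_state=1),
)
model.fit(X, y)
```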

## Support Vector Machines (SVM)

SVMs enable binary classification. The points in each class that lie closest to the other class are the *support vectors*, and the optimal hyperplane is the maximum-margin boundary midway between the support vectors. The SVM with a Radial Basis Function (RBF) kernel is a variation of the linear SVM used to separate non-linear data. This is implemented as "Factorization Machines" on SageMaker.
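A short scikit-learn sketch of why the RBF kernel matters: two concentric rings cannot be split by any straight line, so a linear SVM fails while the RBF kernel separates them cleanly.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by any straight line in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)  # kernel trick lifts the data so a hyperplane can separate it
```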

## Random Forests

Useful for anomaly (and outlier) detection; implemented in QuickSight, Kinesis Data Analytics, and SageMaker

- Take a random sample of the data using *bootstrapping* (a dataset of the same size selected at random, with replacement, from the original dataset; the data left behind is the *out-of-bag (OOB)* dataset), select a random subset of features, and build many trees
- Test each tree with its OOB samples to generate the *out-of-bag error* rate, used to tune the forest
- Combine the trees' predictions by voting
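The steps above can be sketched with scikit-learn's `RandomForestClassifier`, which exposes the OOB estimate directly via `oob_score=True` (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# bootstrap=True: each tree trains on a bootstrap sample; oob_score=True:
# each sample is scored only by the trees that never saw it (the OOB estimate).
forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
forest.fit(X, y)
oob_accuracy = forest.oob_score_  # held-out-style accuracy without a separate test set
```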

## K-means Clustering

Unsupervised; groups the dataset into K clusters; ideally K is the elbow (knee) in the plot of total variation versus K

- Decide how many clusters (K)
- Randomly pick K samples as the initial cluster centers
- For each sample, assign it to the nearest cluster center
- Recompute each cluster's center point (and repeat the previous step until assignments stop changing)
- Measure the total variation (the sum of distances from each point to its assigned cluster's center)
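The loop above is what scikit-learn's `KMeans` runs internally; its `inertia_` attribute is the total within-cluster variation, so sweeping K and looking for the elbow is a few lines (blob data with 4 true clusters, for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=0)

# Total within-cluster variation (inertia) for each candidate K;
# the "elbow" where the curve flattens suggests the best K.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 8)
}
```

Plotting `inertias` against K, the drop from K=3 to K=4 is large and the drop from K=4 to K=5 is small, so the elbow sits at the true cluster count.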

## K-Nearest Neighbors

Supervised; answers the question: given the K closest neighbors, how should a new data point be classified? The optimal K is best determined by experimentation; use caution if the value of K gets near the number of samples in a class.
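Determining K by experimentation can be done with cross-validation, sketched here with scikit-learn on synthetic data (the candidate K values are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Try a few values of K and keep the one with the best cross-validated accuracy.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in (1, 3, 5, 11, 25)
}
best_k = max(scores, key=scores.get)
```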

# Text Algorithms

## Latent Dirichlet Allocation (LDA)

Topic analysis on a bag of words (a corpus of documents); a document can contain one or more topics; a topic is a distribution over words
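A toy sketch with scikit-learn (the three documents and the choice of 2 topics are made up for illustration): documents become bag-of-words counts, and LDA returns each document's mixture over topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cats purr and cats sleep",
    "dogs bark and dogs fetch",
    "cats and dogs are pets",
]

# Bag-of-words counts, then LDA with 2 topics: each document becomes a
# mixture over topics, and each topic is a distribution over words.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape (3 docs, 2 topics); each row sums to 1
```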

## Principal Component Analysis

Principal Component Analysis (PCA) (unsupervised) combines highly correlated features into a smaller number of features; it is not simply dropping columns that are useless or have a weak relationship with the label/target. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. PCA is commonly used as an exploratory data analysis tool.
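A small scikit-learn sketch on synthetic data: two of the three features are nearly duplicates, so PCA folds them into one component and two components retain almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two highly correlated features plus one independent feature.
base = rng.normal(size=500)
X = np.column_stack([base, base + rng.normal(0, 0.05, 500), rng.normal(size=500)])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # components are uncorrelated composites of the originals
```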