Applied Modeling

Discussion of Kaggle

A discussion of the project for January.

Regression vs. Classification:

Discrete, non-orderable values -> classification

Discrete, orderable, low-cardinality -> regression or classification

Discrete, orderable, high-cardinality -> regression

Continuous values can be converted to discrete with comparison operators

Discrete values can be grouped to produce binary classifications
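A minimal sketch of both conversions, assuming pandas. The column names, the data, and the thresholds (100 and 4) are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical data: a continuous price and a discrete, orderable rating.
df = pd.DataFrame({
    "price": [12.5, 250.0, 99.9, 100.0, 43.0],
    "rating": [1, 5, 3, 4, 2],
})

# Continuous -> discrete: a comparison operator yields a boolean column.
df["is_expensive"] = df["price"] >= 100

# Discrete -> binary: ratings 4 and 5 are grouped together as "high".
df["high_rating"] = df["rating"] >= 4

print(df)
```

Either boolean column can then serve as a binary classification target.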

Predicted probability: zone between regression and classification

Conversion of cardinality greater than 2 to binary
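One way to reduce a target with more than two classes to a binary one, sketched with pandas. The color labels reuse the example values from these notes; which classes count as "positive" is an assumption made for illustration.

```python
import pandas as pd

# Target with three classes (cardinality 3).
y = pd.Series(["red", "blue", "purple", "red", "blue"])

# One-vs-rest: a single class against everything else.
y_binary = y == "red"

# Grouping: several classes merged into the positive class.
y_grouped = y.isin(["red", "purple"])

print(y_binary.tolist())
print(y_grouped.tolist())
```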

Drop rows: drop rows that are missing values in a certain column
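A short sketch using pandas `dropna` with its `subset` parameter. The DataFrame and the column names `target` and `feature` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "target": [1.0, np.nan, 0.0, 1.0],
    "feature": [10.0, 20.0, np.nan, 40.0],
})

# Drop only the rows where "target" is missing;
# rows with a missing "feature" are kept.
cleaned = df.dropna(subset=["target"])

print(cleaned)
```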

Target distribution: how is the target distributed?

Classification: how many classes do I have?

  • For classification problems:

y.unique() - returns array of unique values - ['red', 'blue', 'purple']

y.nunique() - gives count of unique values = 3

y.value_counts() - gives counts of values

y.value_counts(normalize=True) - gives counts as proportion of all

y.value_counts(normalize=True).max() - gives the proportion of the most frequent class only
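The calls above, run on a small made-up target. The Series `y` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical classification target.
y = pd.Series(["red", "blue", "red", "purple", "red", "blue"])

classes = y.unique()                          # unique values, in order of appearance
n_classes = y.nunique()                       # number of unique values
counts = y.value_counts()                     # count per class
proportions = y.value_counts(normalize=True)  # share per class
majority_share = proportions.max()            # share of the most frequent class

print(list(classes), n_classes, majority_share)
```

The majority share is a useful baseline: a model that always predicts the most frequent class reaches that accuracy.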

  • Precision vs. Recall

Accuracy - percentage of correct predictions (of all predictions)

Precision - TP / (TP + FP)

Precision: ability of a classification model to return only relevant instances

Recall (also True Positive Rate) - TP / (TP + FN)

Recall: ability of a classification model to identify all relevant instances

False Positive Rate - FP / (FP + TN)

F1 score: single metric that combines recall and precision using the harmonic mean
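The definitions above can be sketched in plain Python. The function name `classification_metrics`, the 0/1 label encoding, and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch, not part of the notes.

```python
def classification_metrics(y_true, y_pred):
    """Metrics for binary labels, where 1 is positive and 0 is negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": ratio(fp, fp + tn),
        # Harmonic mean of precision and recall.
        "f1": ratio(2 * precision * recall, precision + recall),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example TP = 3, FN = 1, FP = 1, TN = 3, so accuracy, precision, recall, and F1 are all 0.75 and the false positive rate is 0.25.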

Regression vs. Classification

  • Convert all strings to lower case
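A sketch of lower-casing a string column with the pandas `.str.lower()` accessor. The column name `color` and the data are hypothetical. One likely reason for this step is that labels such as 'Red' and 'red' would otherwise be counted as different classes.

```python
import pandas as pd

# Hypothetical column with inconsistent capitalization.
df = pd.DataFrame({"color": ["Red", "BLUE", "red", "Blue"]})

n_before = df["color"].nunique()   # capitalization variants count separately

# Normalize every string in the column to lower case.
df["color"] = df["color"].str.lower()

n_after = df["color"].nunique()

print(n_before, n_after)
```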



Know how to make

Simple histogram

Simple scatter

Categorical vs. numeric plot

Cross tab
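One possible way to produce each of these with pandas and matplotlib. The data, the column names, and the choice of a bar chart of group means for the categorical vs. numeric plot are all assumptions of this sketch. The "Agg" backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data set.
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63],
    "income": [28, 52, 61, 75, 40, 58],
    "segment": ["a", "b", "b", "a", "a", "b"],
    "churned": ["no", "no", "yes", "no", "yes", "yes"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Simple histogram of one numeric column.
axes[0].hist(df["age"], bins=5)
axes[0].set_title("Histogram of age")

# Simple scatter of two numeric columns.
axes[1].scatter(df["age"], df["income"])
axes[1].set_title("age vs. income")

# Categorical vs. numeric: mean income per segment as a bar chart.
means = df.groupby("segment")["income"].mean()
axes[2].bar(list(means.index), list(means.values))
axes[2].set_title("Mean income by segment")

fig.tight_layout()

# Cross tab: counts for each combination of two categorical columns.
table = pd.crosstab(df["segment"], df["churned"])
print(table)
```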

How to do template