In statistics and machine learning, classification is the problem of identifying to which of a set of categories – also called sub-populations – a new observation belongs.
Classification is an example of pattern recognition, a branch of machine learning that focuses on the recognition of patterns and regularities in data.
Classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, which involves grouping data into categories based on some measure of inherent similarity or distance.
A common subclass of classification is probabilistic classification. Algorithms of this nature use statistical inference to find the best class for a given instance. Unlike other algorithms, which simply output a “best” class, probabilistic algorithms output a probability of the instance being a member of each of the possible classes. The best class is normally then selected as the one with the highest probability.
Such algorithms have several advantages over non-probabilistic classifiers:
- It can output a confidence value associated with its choice (in general, a classifier that can do this is known as a confidence-weighted classifier).
- Correspondingly, it can abstain when its confidence of choosing any particular output is too low.
- Because of the probabilities which are generated, probabilistic classifiers can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of error propagation.
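The abstention behavior described above can be sketched in a few lines. The threshold and the toy probability values below are illustrative assumptions, not taken from any particular model:

```python
# Sketch: using predicted class probabilities to abstain on low-confidence
# inputs. The probabilities and the 0.7 threshold are made up for illustration.

def classify_with_reject(probs, threshold=0.7):
    """Return the most probable class label, or None (abstain) when the
    top probability falls below the confidence threshold."""
    best_class = max(probs, key=probs.get)
    if probs[best_class] < threshold:
        return None  # abstain: no class is confident enough
    return best_class

classify_with_reject({"cat": 0.9, "dog": 0.1})  # confident -> "cat"
classify_with_reject({"cat": 0.5, "dog": 0.5})  # too uncertain -> None (abstain)
```

A non-probabilistic classifier, which only outputs a single "best" class, cannot implement this reject option without additional machinery.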
A probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right, or when combining classifiers into ensembles.
Usually, a classifier is a rule or function that assigns to a sample x a class label ŷ:

ŷ = f(x)

The samples come from some set X, while the class labels form a finite set Y defined prior to training. Probabilistic classifiers generalize this notion: instead of functions, they are conditional distributions Pr(Y | X), meaning that for a given x ∈ X, they assign probabilities to all y ∈ Y (and these probabilities sum to one). The predicted class is then the one with the highest probability.
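This conditional-distribution view can be sketched directly: a model returns a distribution over the label set, and the hard prediction is its argmax. The toy distribution below is hand-written purely for illustration:

```python
# Sketch: a probabilistic classifier as a conditional distribution Pr(Y | X).
# The toy model below is a hypothetical stand-in for a trained classifier.

def predict_proba(x):
    """Hypothetical model: returns Pr(Y | X = x) over the finite label set Y."""
    if x >= 0:
        return {"positive": 0.8, "negative": 0.2}
    return {"positive": 0.3, "negative": 0.7}

def predict(x):
    """Hard prediction: the class with the highest conditional probability."""
    probs = predict_proba(x)
    assert abs(sum(probs.values()) - 1.0) < 1e-9  # probabilities sum to one
    return max(probs, key=probs.get)
```

A non-probabilistic classifier would expose only `predict`; the probabilistic one also exposes the full distribution via `predict_proba`.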
Binary probabilistic classifiers are also called binomial regression models in statistics.
A large number of algorithms for classification can be phrased in terms of a linear function that assigns a score to each possible category k by combining the feature vector of an instance with a vector of weights, using a dot product. The predicted category is the one with the highest score.
This type of score function is known as a linear predictor function and has the following general form:

score(Xi, k) = βk · Xi

where Xi is the feature vector for instance i, βk is the vector of weights corresponding to category k, and score(Xi, k) is the score associated with assigning instance i to category k.
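A minimal sketch of this scoring scheme, with weight vectors chosen arbitrarily for illustration:

```python
# Sketch of a linear classifier: score(Xi, k) = beta_k . Xi for each
# category k, then pick the category with the highest score.
# The weight vectors below are arbitrary, not learned from data.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def predict(x, weights):
    """weights maps each category k to its weight vector beta_k.
    The predicted category is the one with the highest score."""
    scores = {k: dot(x, beta_k) for k, beta_k in weights.items()}
    return max(scores, key=scores.get)

weights = {"spam": [2.0, -1.0], "ham": [-1.0, 3.0]}
predict([1.0, 0.0], weights)  # scores: spam = 2.0, ham = -1.0 -> "spam"
```

In practice the weight vectors βk are learned from training data (e.g. by logistic regression or a perceptron); only the scoring rule is shown here.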
Algorithms with this basic setup are known as linear classifiers. More generally, a concrete implementation of a classification algorithm is known as a classifier. The term “classifier” sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.
An example of a classification algorithm is the decision tree algorithm: it uses a decision tree – as a predictive model – to go from observations about an item to conclusions about the item’s target value. Observations about the item are represented in the branches, while the item’s target value is represented in the leaves.
Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
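A classification tree can be sketched as nested feature tests: each branch is a test on a feature, and each leaf carries a class label. The features, thresholds, and labels below are invented for illustration, not learned from data:

```python
# Sketch: a tiny hand-written classification tree. Internal nodes test
# features (the branches); leaves return class labels. The "height" and
# "weight" features and all thresholds are hypothetical.

def classify(item):
    """item is a dict of features, e.g. {"height": 1.8, "weight": 70}."""
    if item["height"] > 1.7:       # branch: first feature test...
        if item["weight"] > 80:    # ...conjoined with a second test...
            return "large"         # ...leads to a leaf: a class label
        return "medium"
    return "small"
```

A regression tree would have the same structure, but its leaves would hold continuous values (e.g. a predicted price) instead of class labels; learning algorithms such as CART construct these tests automatically from training data.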
