There are common ways that you can use to calculate a baseline result.
A baseline result is the simplest possible prediction. For some
problems, this may be a random result, and in others in may be the most common
prediction.
- Classification: If you have a
classification problem, you can select the class that has the most
observations and use that class as the result for all predictions. If
the number of observations are equal for all classes in your training
dataset, you can select a specific class or enumerate each class and see
which gives the better result in your test harness.
- Regression: If you are working on a
regression problem, you can use a central tendency measure as the result
for all predictions, such as the mean or the median.
- Optimization: If you are working on an
optimization problem, you can use a fixed number of random samples in the
domain.
A baseline is a method that uses
heuristics, simple summary statistics, randomness, or machine learning to
create predictions for a dataset. You can use these predictions to measure the
baseline's performance (e.g., accuracy)-- this metric will then become what you
compare any other machine learning algorithm against.
In more detail:
A machine learning algorithm tries to
learn a function that models the relationship between the input (feature) data
and the target variable (or label). When you test it, you will typically
measure performance in one way or another. For example, your algorithm may be
75% accurate. But what does this mean? You can infer this meaning by comparing
with a baseline's performance.
Typical baselines include those
supported by scikit-learn's "dummy" estimators:
Classification baselines:
·
“stratified”: generates predictions by respecting the training set’s
class distribution.
·
“most_frequent”: always predicts the most frequent label in the training
set.
·
“prior”: always predicts the class that maximizes the class prior.
·
“uniform”: generates predictions uniformly at random.
·
“constant”: always predicts a constant label that is provided by the
user. This is useful for metrics that evaluate a non-majority class.
Regression baselines:
“median”: always predicts the median
of the training set
“quantile”: always predicts a
specified quantile of the training set,provided with the quantile parameter.
“constant”: always predicts a
constant value that is provided by the user.
No comments:
Post a Comment