PRTools
A Matlab toolbox for pattern recognition (imported pages from 37Steps)


The error in the error

Post - Tuesday, July 23rd, 2013


How large is the classification error? What is the performance of the recognition system? In the end this is the main question, in applications, in proposing novelties and in comparative studies. But how trustworthy is the number that is measured, how accurate is the error estimate?

The most common way to estimate the error of a classification system is by using a labeled test set and counting. There are a few obvious, but often neglected, conditions for obtaining a reliable, unbiased estimate of the error of the system when it is applied in practice. The test set should be representative for the future application. The best way to achieve this is to take a random sample of the objects that have to be recognized in the application and have them labeled by an expert. This can be formulated as conditions on the test set:

- it should be a random sample of the objects the system will encounter in the application;
- its objects should be labeled correctly, e.g. by an expert;
- it should be independent of the data used for training the system.

An additional and very frequently violated condition is:

- the test set should be used just once.

The worn-out test set

In many practical applications it is almost impossible to satisfy this last condition, as often a finite design set has been supplied by the customer. We will split it into parts for training and testing. The latter part will be used multiple times during system design (or the split is performed multiple times, which is virtually the same). The result is that we get an optimistically biased idea of the performance of the system under construction. When the system is delivered and the customer runs a performance test himself, on a test set that he has put aside for this purpose, we will be negatively surprised. The delivered system will perform worse than we estimated ourselves, as we have used our test set many times in order to improve the system, while the test set of our client is used just once. Our test set has been worn out by its use.
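In PRTools terms a minimal sketch of the intended protocol could look as follows; gendatb merely creates an artificial two-class design set here, whereas in practice A would be the labeled data supplied by the customer. The test part S is put aside and consulted exactly once, for the final evaluation, and testc simply counts the fraction of test objects that receive a wrong label:

A = gendatb([100 100]);  % artificial two-class design set, standing in for the customer's data
[T,S] = gendat(A,0.5);   % split it: T is used for training, S is put aside for testing
w = ldc(T);              % design a classifier on the training part only
e = testc(S*w)           % single, final error estimate: the misclassified fraction of S

Every further consultation of S to steer design choices turns it into a worn-out test set and makes its error estimate optimistically biased.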

Determining the error in the error

Suppose all conditions are fulfilled, including the single use of the test set. How accurate is the error estimate obtained by counting the number of classification mistakes? Well, if the true error is \epsilon then the probability that a particular object is wrongly classified is \epsilon. The error estimated by the misclassified fraction of N test objects is:

\hat{\epsilon} = \frac{1}{N} \sum_i U(\hat{\lambda}_i, \lambda_i)

in which \hat{\lambda}_i is the label assigned by the classifier to test object i, \lambda_i is its true label, U(x,y) = 0 for x = y and U(x,y) = 1 if x \ne y. The expectation of U is \epsilon and its variance is \epsilon (1-\epsilon). The expectation of the estimated error is, using the fact that the expectation of a sum is the sum of the expectations:

E(\hat{\epsilon}) = \frac{1}{N} \sum_i E(U(\hat{\lambda}_i, \lambda_i)) = \frac{1}{N} N \epsilon = \epsilon

which proves that the estimate is unbiased. The error in the estimate can be expressed by its standard deviation. First we compute the variance, which is the square of the standard deviation. We make use of the fact that the variance of a sum is the sum of the variances if the terms are independent:

Var(\hat{\epsilon}) = Var\left(\frac{1}{N} \sum_i U(\hat{\lambda}_i, \lambda_i)\right) = \frac{1}{N^2} N \, Var(U(\hat{\lambda}_i, \lambda_i)) = \frac{1}{N} \epsilon (1-\epsilon)

Consequently the standard deviation \sigma_{\hat{\epsilon}} of the error estimate \hat{\epsilon} is

\sigma_{\hat{\epsilon}} = \sqrt{\frac{\epsilon (1-\epsilon)}{N}}
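This binomial standard deviation is easily checked numerically. Below is a small Matlab sketch, with parameter values chosen just for the example, that simulates many independent test sets of size N and compares the observed spread of \hat{\epsilon} with the formula:

epsilon = 0.05; N = 100; R = 10000;     % assumed true error, test set size, repetitions
errs = mean(rand(N,R) < epsilon);       % R simulated error estimates \hat{\epsilon}
observed = std(errs)                    % empirical standard deviation, about 0.022
predicted = sqrt(epsilon*(1-epsilon)/N) % the formula gives about 0.022 as well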

It is nice to have an equation, but it is better to have a look at the actual values. The table below lists the standard deviations for various true errors \epsilon and sizes N of the test set.

         \epsilon=0.01   \epsilon=0.02   \epsilon=0.05   \epsilon=0.10   \epsilon=0.20
N=10     0.031           0.044           0.069           0.095           0.126
N=25     0.020           0.028           0.044           0.060           0.080
N=100    0.010           0.014           0.022           0.030           0.040
N=250    0.006           0.009           0.014           0.019           0.025
N=1000   0.0031          0.0044          0.0069          0.0095          0.0126
N=2500   0.0020          0.0028          0.0044          0.0060          0.0080
N=10000  0.0010          0.0014          0.0022          0.0030          0.0040
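These values follow directly from the formula; in Matlab the whole table can be reproduced by a few lines:

epsilon = [0.01 0.02 0.05 0.10 0.20];           % true errors (columns of the table)
N = [10 25 100 250 1000 2500 10000]';           % test set sizes (rows of the table)
sigma = sqrt((1./N) * (epsilon.*(1-epsilon)))   % 7 x 5 table of standard deviations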

The standard deviation, however, is only a useful measure for the accuracy of the error estimate when the distribution of the estimate is symmetric. This holds with sufficient accuracy when N \epsilon \ge 10, in which case the distribution of \hat{\epsilon} is to a good approximation Gaussian. An interval of two standard deviations on either side of the estimate then has a probability of about 95%. The entries in the table for which N \epsilon < 10 (the upper-left part) cannot be used in this way, as the condition for normality is not fulfilled. Moreover, a symmetric interval would often include negative error values. For these small sample sizes more advanced techniques are needed. It is clear, however, that for these error rates the given test set sizes are insufficient to obtain any reliable error estimate.
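For the cases in which the Gaussian approximation holds, such an interval follows directly from an observed estimate. A sketch with assumed numbers:

e_hat = 0.10; N = 1000;             % observed error estimate and test set size, N*e_hat >= 10
s = sqrt(e_hat*(1-e_hat)/N);        % estimated standard deviation, about 0.0095
interval = [e_hat-2*s, e_hat+2*s]   % rough 95% interval, about [0.081, 0.119]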

In conclusion, to avoid a large error in the error, the size of the test set should be at least in the hundreds and preferably in the thousands. For small class overlaps, \epsilon < 0.05, this is even a must.