Solving Multiclass Learning Problems via Error-Correcting Output Codes

Jan 1, 1995

T. G. Dietterich, G. Bakiri

Introduction

The task of learning from examples is to find an approximate definition for an unknown function f(x) given training examples of the form ⟨x_i, f(x_i)⟩. For cases in which f takes only the values {0, 1} (binary functions), there are many algorithms available. For example, decision-tree methods such as C4.5 (Quinlan, 1993) and CART (Breiman, Friedman, Olshen, & Stone, 1984) can construct trees whose leaves are labeled with binary values.


Most artificial neural network algorithms, such as the perceptron algorithm (Rosenblatt, 1958) and the error backpropagation (BP) algorithm (Rumelhart, Hinton, & Williams, 1986), are best suited to learning binary functions. Theoretical studies of learning have focused almost entirely on learning binary functions (Valiant, 1984; Natarajan, 1991).


Real-World Learning Tasks

In many real-world learning tasks, however, the unknown function f often takes values from a discrete set of "classes": {c_1, ..., c_k}. For example, in medical diagnosis, the function might map a description of a patient to one of k possible diseases. In digit recognition (e.g., LeCun, Boser, Denker, Henderson, Howard, Hubbard, & Jackel, 1989), the function maps each hand-printed digit to one of k = 10 classes. Phoneme recognition systems (e.g., Waibel, Hanazawa, Hinton, Shikano, & Lang, 1989) typically classify a speech segment into one of 50 to 60 phonemes.


The Direct Multiclass Approach

Decision-tree algorithms can be easily generalized to handle these "multiclass" learning tasks. Each leaf of the decision tree can be labeled with one of the k classes, and internal nodes can be selected to discriminate among these classes. We will call this the direct multiclass approach.
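As a concrete illustration of the direct multiclass approach, the sketch below trains a single tree on a toy three-class problem. Note that the paper uses C4.5; scikit-learn's CART-style DecisionTreeClassifier and the toy data are stand-ins chosen for illustration only.

```python
# Minimal sketch of the direct multiclass approach: a single decision tree
# whose leaves are labeled with one of the k classes. C4.5 (used in the paper)
# is replaced here by scikit-learn's CART-style tree as a stand-in.
from sklearn.tree import DecisionTreeClassifier

# Toy three-class problem (hypothetical data, for illustration only).
X_train = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.9], [0.8, 0.1], [0.2, 0.7], [0.6, 0.5]]
y_train = ["flu", "cold", "allergy", "cold", "allergy", "flu"]

tree = DecisionTreeClassifier()    # internal nodes discriminate among the k classes
tree.fit(X_train, y_train)         # leaves end up labeled with class names

print(tree.predict([[0.3, 0.8]]))  # predicted class for a new case
```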


Other Approaches

Connectionist algorithms are more difficult to apply to multiclass problems. The standard approach is to learn k individual binary functions f_1, ..., f_k, one for each class. To assign a new case x to one of these classes, each of the f_i is evaluated on x, and x is assigned the class j of the function f_j that returns the highest activation (Nilsson, 1965). We will call this the one-per-class approach, since one binary function is learned for each class.
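A minimal sketch of the one-per-class approach follows, again on toy three-class data. In the paper each f_i is an output unit of a backpropagation network; LogisticRegression is used here only as a convenient stand-in binary learner.

```python
# Minimal sketch of the one-per-class approach: one binary function per class,
# prediction by highest activation. LogisticRegression is an assumed stand-in
# for the paper's backpropagation outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy three-class problem (hypothetical data, for illustration only).
X_train = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.9], [0.8, 0.1], [0.2, 0.7], [0.6, 0.5]]
y_train = ["flu", "cold", "allergy", "cold", "allergy", "flu"]
classes = ["flu", "cold", "allergy"]

# Learn one binary function f_i per class: target is 1 for class i, 0 otherwise.
learners = {}
for c in classes:
    targets = (np.asarray(y_train) == c).astype(int)
    learners[c] = LogisticRegression().fit(X_train, targets)

# To classify a new case x, evaluate every f_i and take the highest activation.
x_new = [0.3, 0.8]
activations = {c: clf.predict_proba([x_new])[0, 1] for c, clf in learners.items()}
print(max(activations, key=activations.get))
```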


Alternatives

An alternative approach explored by some researchers is to employ a distributed output code. This approach was pioneered by Sejnowski and Rosenberg (1987) in their widely known NETtalk system. Each class is assigned a unique binary string of length n; we will refer to these strings as "code words." Then n binary functions are learned, one for each bit position in these binary strings. During training for an example from class i, the desired outputs of these n binary functions are specified by the code word for class i.
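The sketch below illustrates this training setup: each class is assigned a code word, and one binary function is learned per bit position. The 4-bit code words are made up for illustration (NETtalk's code words encoded meaningful phonetic features), and LogisticRegression again stands in for the paper's learners.

```python
# Sketch of training with a distributed output code: each class gets an n-bit
# code word, and one binary function is learned per bit position. Code words
# and learner choice are assumptions made for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.9], [0.8, 0.1], [0.2, 0.7], [0.6, 0.5]]
y_train = ["flu", "cold", "allergy", "cold", "allergy", "flu"]

# Hypothetical code words of length n = 4, one per class.
CODE_WORDS = {
    "flu":     [1, 0, 1, 0],
    "cold":    [0, 1, 1, 0],
    "allergy": [0, 0, 0, 1],
}
n_bits = 4

# For bit position b, the desired output of the b-th binary function on an
# example from class i is bit b of class i's code word.
bit_learners = []
for b in range(n_bits):
    targets = np.array([CODE_WORDS[label][b] for label in y_train])
    bit_learners.append(LogisticRegression().fit(X_train, targets))
```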


Conclusion

The error-correcting output code approach builds on this idea by choosing code words that are well separated in Hamming distance, so that errors made by individual binary functions can be corrected when the predicted bit string is decoded to a class. In this paper, we compare the performance of the error-correcting code approach to the three existing approaches: the direct multiclass method (using decision trees), the one-per-class method, and (in the NETtalk task only) the meaningful distributed output representation approach. We show that error-correcting codes produce uniformly better generalization performance across a variety of multiclass domains for both the C4.5 decision-tree learning algorithm and the backpropagation neural network learning algorithm.
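For concreteness, here is a minimal sketch of how a distributed code is decoded in the error-correcting setting, reusing the hypothetical bit_learners and CODE_WORDS from the previous sketch. A real error-correcting code would use longer code words with a large minimum Hamming distance d, allowing up to floor((d - 1) / 2) single-bit errors to be corrected; the toy 4-bit code carries no such guarantee.

```python
# Sketch of decoding with an error-correcting output code, reusing the
# hypothetical bit_learners and CODE_WORDS defined above. The predicted bit
# string is mapped to the class whose code word is nearest in Hamming distance.

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def predict_ecoc(bit_learners, code_words, x):
    # Evaluate each of the n binary functions on x to get a predicted bit string.
    predicted_bits = [int(clf.predict([x])[0]) for clf in bit_learners]
    # Assign the class whose code word is closest to the predicted string.
    return min(code_words, key=lambda c: hamming_distance(code_words[c], predicted_bits))

print(predict_ecoc(bit_learners, CODE_WORDS, [0.3, 0.8]))
```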
