A multi-class SVM classifier utilizing binary decision tree

Informatica, May, 2009 by Gjorgji Madzarov, Dejan Gjorgjevikj, Ivan Chorbev

1 Introduction

The recent results in pattern recognition have shown that support vector machine (SVM) classifiers often have superior recognition rates in comparison to other classification methods. However, the SVM was originally developed for binary decision problems, and its extension to multi-class problems is not straightforward. How to effectively extend it for solving multiclass classification problem is still an on-going research issue. The popular methods for applying SVMs to multiclass classification problems usually decompose the multi-class problems into several two-class problems that can be addressed directly using several SVMs.

For the readers' convenience, we introduce the SVM briefly in section 2. A brief introduction to several widely used multi-class classification methods that utilize binary SVMs is given in section 3. The Kernel-based clustering introduced to convert the multi-class problem into SVM-based binary decision-tree architecture is explained in section 4. In section 5, we discuss related works and compare SVM-BDT with other multi-class SVM methods via theoretical analysis and empirical estimation. The experimental results in section 6 are presented to compare the performance of the proposed SVM-BDT with traditional multi-class approaches based on SVM, ensemble of decision trees and neural network. Section 7 gives a conclusion of the paper.

2 Support vector machines for pattern recognition

The support vector machine is originally a binary classification method developed by Vapnik and colleagues at Bell laboratories [1][2], with further algorithm improvements by others [3]. For a binary problem, we have training data points: {[x.sub.i], [y.sub.i]}, i=1, ..., l, [y.sub.i][member of]{-1, 1}, [x.sub.i][member of][R.sup.d]. Suppose we have some hyperplane which separates the positive from the negative examples (a "separating hyperplane"). The points x which lie on the hyperplane satisfy w x x b = 0, where w is normal to the hyperplane, [absolute value of b]/[parallel]w[parallel] is the perpendicular distance from the hyperplane to the origin, and [parallel]w[parallel] is the Euclidean norm of w. Let [d.sub. ] ([d.sub.-]) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Define the "margin" of a separating hyperplane to be [d.sub. ] [d.sub.-]. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints:

[x.sub.i] x w b [greater than or equal to] 1 for [Y.sub.i] = 1, (1)

[x.sub.i] x w b [less than or equal to] 1 for [Y.sub.i] = - 1, (2)

These can be combined into one set of inequalities:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)

Now consider the points for which the equality in Eq. (I) holds (requiring that there exists such a point) is equivalent to choosing a scale for w and b. These points lie on the hyperplane [H.sub.1]: [x.sub.i] x w b = 1 with normal w and perpendicular distance from the origin [absolute value of 1-b]/[parallel]w[parallel]. Similarly, the points for which the equality in Eq. (2) holds lie on the hyperplane H2: x;. w b = -1, with normal again w and perpendicular distance from the origin [absolute value of -1-b]/[parallel]w[parallel]. Hence [d.sub. ] = [d.sub.-] = 1/[parallel]w[parallel] and the margin is simply 2/[parallel]w[parallel].

[FIGURE 1 OMITTED]

Note that [H.sub.1] and [H.sub.2] are parallel (they have the same normal) and that no training points fall between them. Thus we can find the pair of hyperplanes which gives the maximum margin by minimizing [parallel]w[[parallel].sup.2], subject to constraints (3).

Thus we expect the solution for a typical two dimensional case to have the form shown on Fig. I. We introduce nonnegative Lagrange multipliers [[alpha].sub.i], i = 1, ..., l, one for each of the inequality constraints (3). Recall that the rule is that for constraints of the form [c.sub.i] [greater than or equal to] 0, the constraint equations are multiplied by nonnegative Lagrange multipliers and subtracted from the objective function, to form the Lagrangian. For equality constraints, the Lagrange multipliers are unconstrained. This gives Lagrangian:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (4)

We must now minimize [L.sub.p], with respect to w, b, and maximize with respect to all [[alpha].sub.i] at the same time, all subject to the constraints [[alpha].sub.i] [greater than or equal to] 0 (let's call this particular set of constraints C.). Now this is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set (any linear constraint defines a convex set, and a set of N simultaneous linear constraints defines the intersection of N convex sets, which is also a convex set). This means that we can equivalently solve the following "dual" problem: maximize [L.sub.p], subject to the constraints that the gradient of [L.sub.p] with respect to w and b vanish, and subject also to the constraints that the [[alpha].sub.i] [greater than or equal to] 0 (let's call that particular set of constraints [C.sub.2]). This particular dual formulation of the problem is called the Wolfe dual [4]. It has the property that the maximum of [L.sub.p], subject to constraints [C.sub.2], occurs at the same values of the w, b and [alpha], as the minimum of [L.sub.p], subject to constraints [C.sub.1].

 

BNET TalkbackShare your ideas and expertise on this topic

Please add your comment:

  1. You are currently: a Guest |
  2.  

Basic HTML tags that work in comments are: bold (<b></b>), italic (<i></i>), underline (<u></u>), and hyperlink (<a href></a)

advertisement
advertisement
  • Click Here
  • Click Here
  • Click Here
advertisement

Content provided in partnership with Thompson Gale