Thursday,
October 15, 12:00 p.m. ECG 317
Wang Juh Chen
School of Math & Stats
A New SVM Model
Abstract
We propose a new formulation of the Support Vector Machine (SVM). It is
based on the development of ideas from the method of total least squares, in
which assumed errors in measured data (errors-in-features) are incorporated
in the model design. For example, genetic data measured from microarrays are
contaminated by noise. Moreover, for genetic data, the number of features is far
greater than the sample size because of not only the high cost of the
experiment but also the requirement of collecting patients with the
necessary conditions. Traditional classification methods cannot be applied
directly because of the "curse of dimensionality": the feature space
(parameters) is high-dimensional, but there are too few observations to obtain
good estimates. SVM-based algorithms, which employ dual methods and a kernel
data mapping, have the potential to overcome this difficulty. In our method,
we introduce Lagrange multipliers
and solve for the dual variables. Instead of finding the optimal value of
the Lagrange function, we solve the nonlinear system of equations obtained
from the Karush-Kuhn-Tucker (KKT) conditions. We also implement
complementarity constraints and incorporate weighting of the linear system
by the inverse covariance matrix of the measured data. To improve accuracy
of the classification we introduce regularization for the ill-posed linear
problem that arises during the calculation. Other improvements to the
algorithm are also considered, such as the choice of initial point and methods
to avoid over-fitting.
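As background for the KKT-based approach described above, the sketch below is not the speaker's new formulation but a simplified, standard illustration: a least-squares SVM, whose KKT conditions collapse to a single linear system that can be solved directly rather than by iterating on the Lagrange function. The data, the kernel choice, and the regularization parameter gamma are all hypothetical.

```python
import numpy as np

# Illustrative sketch (not the proposed method): least-squares SVM, where
# the KKT conditions of the dual problem form one linear system.
rng = np.random.default_rng(0)

# Two noisy Gaussian classes, mimicking "errors in features"
n = 40
X = np.vstack([rng.normal(-1.0, 0.6, (n, 2)), rng.normal(1.0, 0.6, (n, 2))])
y = np.hstack([-np.ones(n), np.ones(n)])

gamma = 10.0                      # regularization parameter (assumed value)
K = X @ X.T                       # linear (dot-product) kernel
Omega = (y[:, None] * y[None, :]) * K

# KKT system for the LS-SVM dual:
# [ 0   y^T              ] [ b     ]   [ 0 ]
# [ y   Omega + I/gamma  ] [ alpha ] = [ 1 ]
A = np.zeros((2 * n + 1, 2 * n + 1))
A[0, 1:] = y
A[1:, 0] = y
A[1:, 1:] = Omega + np.eye(2 * n) / gamma
rhs = np.hstack([0.0, np.ones(2 * n)])
sol = np.linalg.solve(A, rhs)
b, alpha = sol[0], sol[1:]

# Decision values and training accuracy
scores = (alpha * y) @ K + b
acc = np.mean(np.sign(scores) == y)
print(f"training accuracy: {acc:.2f}")
```

In the full soft-margin SVM the complementarity constraints make the KKT system nonlinear, which is the setting the talk addresses; the linear system here is the degenerate case obtained when the inequality constraints are replaced by equalities.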
We apply the proposed algorithm to several public microarray data sets and
Positron Emission Tomography (PET) images. The results indicate that the
proposed algorithm is competitive with the standard SVM and performs better
in some cases. It also succeeds when applied with the dot-product kernel data
mapping, demonstrating the ability to classify data with millions of features,
such as PET images, which is classically very difficult. The algorithm
classifies the data sets better even when there are errors in the features,
and gives improved results and higher sensitivity when classifying a set of
Alzheimer's Disease (AD) PET images.