# upload_10133_note2.tex

## File contents

%% This document created by Scientific Word (R) Version 3.5 \documentclass{amsart}% \usepackage{amsmath} \usepackage{graphicx}% \usepackage{amsfonts}% \usepackage{amssymb} %TCIDATA{OutputFilter=latex2.dll} %TCIDATA{LastRevised=Thursday, January 03, 2002 16:54:56} %TCIDATA{<META NAME="GraphicsSave" CONTENT="32">} \theoremstyle{plain} \newtheorem{acknowledgement}{Acknowledgement} \newtheorem{algorithm}{Algorithm} \newtheorem{case}{Case} \newtheorem{claim}{Claim} \newtheorem{conclusion}{Conclusion} \newtheorem{condition}{Condition} \newtheorem{criterion}{Criterion} \newtheorem{notation}{Notation} \newtheorem{problem}{Problem} \newtheorem{solution}{Solution} \newtheorem{summary}{Summary} \numberwithin{equation}{section} \begin{document} \title{A Note on Applications of Support Vector Machine} \author{Seung-chan Ahn} \address{Fermilab, MS 360, P.O. Box 500, Batavia, IL 60510} \author{Gene Kim} \address{93B Taylor Ave, East Brunswick, NJ 08816} \author{MyungHo Kim} \address{93B Taylor Ave, East Brunswick, NJ 08816} \maketitle \begin{abstract} We describe in a rudimentary fashion how \emph{SVM}(support vector machine) plays the role of classifier in a mathematical setting. We then discuss its application in the study of multiple \emph{SNP}(single nucleotide polymorphism) variations. Also presented is a set of preliminary test results with clinical data. \end{abstract} \section{Introduction} It is a generally accepted wisdom that the causes of biological effects can be divided into two categories - inhritable(genes from parents) and environmental(food, gravity, sunlight, surroundings etc). In this paper, we focus on inheritable factors. Our suggestion to multiple \emph{SNP} variations is based on the following general assumptions(For more details, see \cite{KK}). \begin{description} \item[1] \emph{Suppose all the SNPs are known and there are no environmental factors. Then each human is determined by a complete set of SNP variations uniquely.} \end{description} Its consequences are: identical twins are exactly the same. Thus it is possible to classify \emph{SNP} data sets into several subgroups. Classification(grouping or clustering) is one of basic and important generic method for distinguishing one from another. \begin{description} \item[2] \emph{To classify objects we are interested in, the most powerful technique people developed is to numericalize them, in other words, finding a way of representation into numbers and the collection of numbers into vectors in a Euclidean space.} \end{description} The two assumptions are separately common senses among researchers. The new twist is that the two assumptions were not considered in the same scope and \emph{SVM} offers a powerful machinery to tackle the problem of classification in a rigorous and systematic way. \section{Support Vector Machine and Its Analogy} The concept of \emph{SVM}(Support Vector Machine) was introduced by Vapnik(\cite{Va}) in the late 1970's. Since then the idea of \emph{SVM} found its application in many diverse fields such as machine learning, gene expression data analysis, high energy physics experiment at \emph{CERN} (European Organization for Nuclear Research). Why the idea of \emph{SVM} has been used in such diverse and unrelated fields ? The reason is clear and obvious: \emph{SVM}, based on a solid mathematical foundation, attempts to solve a universal problem of classification, i.e., we need to know which belongs to which group. The basic idea of \emph{SVM} is deceptively simple. Given a collection of vectors in $R^{n},$ labeled +1 or -1 that are separable by a hyperplane, \emph{SVM} finds the hyperplane with the maximal margin. More precisely, the distance between the closest labeled vectors to the hyperplane is maximal.(Vapnik, cleverly, connected this distance problem to an optimization problem by using Kuhn-Tucker condition, \cite{Si}). This hyperplane could be used to determine to which group an unlabeled vector belongs. This machine fits with inductive scientific method. To give you a definite flavor of \emph{SVM} in everyday experience, let's consider about familiar concepts , speed limit, height,weight, blood pressure, Lipid measurements in blood etc. When the speed limit, critical values for blood pressure of normal people, Lipid mesurements are determined, people mainly depend on experimenatal data in the past. As a toy model, we considered an analogy or correspondence between finding the speed limit on the road and using Support Vector Machine for a criterion to determine an association between a given set of multiple SNP variations and a disease or trait. In mathematical setting, car speed is a point in $R^{1},$while a set of numbers consisting of SNP variations(or anything we count several variables at the same time) is represented as a point of $R^{n}$. \begin{picture}(124,45) \put(54,10){\oval(124,36)}% \put(54,10){\makebox(0,15){\sl Speed}}% \put(54,10){\makebox(0,-10){\sl A Single Number}} \end{picture} $\Longleftrightarrow$\begin{picture}(124,45) \put(90,10){\oval(124,36)}% \put(90,10){\makebox(0,15){\sl Feature Vector}}% \put(90,10){\makebox(0,-10){\sl A Set of Numbers}} \end{picture} \vspace{0.2in} \hspace{1.0in}$\Downarrow$\hspace{2.3in}$\Downarrow$ \newline \hspace{1.0in} \begin{picture}% (72,45) \put(5,-15){\framebox(110,50){{\sl By Simple Statistic}}% } \put(5,0){\framebox(110,50){{\sl Speed Limit}}} \end{picture} \hspace{1.0in}$\Longleftrightarrow$\begin{picture}% (72,45) \put(30,-15){\framebox(150,50){{\sl By Support Vector Machine}}% } \put(30, 0){\framebox(150,50){{\sl Hyperplane of Criterion}}}% \end{picture} \hspace{1.5in} \hspace{1.5in} \newline \newline \begin{description} \item[Conclusion] We come to the conclusion that we have to find out a way of representation of \emph{SNP} variataions at each position. This subject is open and could be adjusted with experiments for better performance.\footnote{After we found out to use the \emph{SVM} to classify multiple SNP variations, Honki Kim, statistician, pointed out that Classification tree(or decision tree) might work as well.} Suppose we want to express east, west, south and north(or DNA letters, \emph{A, C, G, T}). Then we may represent them as $\{(1,0,0,0),(0,1,0,0),(0,0,1,0)$, $(0,0,0,1)\}$ or $\{0.2,0.4,0.6,0.8\}$. This way, at each SNP location, we have a number depending on genotype in a consistent way, which give us a vector. \end{description} \section{Test Results with clinical data} We generated feature vectors of cardio-patient records by using the same principle described in \textbf{section 2}. Height, age, sex, weight, ethnic background, medical history, birth place, blood pressure(systolic and diastolic), Lipid measurements etc are numericalized and we labeled +1 for a patient who had a history of either heart attack, stroke or heart failure, otherwise -1. We used Thorsten Joachims' implementation of \emph{SVM}, which gives us the following results(See \cite{Jo} and, for a different implementation, \cite{Van}). The results strongly indicate that \emph{SVM} works as intended to separate the data set into two classes. \begin{table}[ptb] \begin{center}% \begin{tabular} [c]{|l|l|l|l|l|l|l|l|}\hline Test & No of Patients & +1 labeled & -1 labeled & C(bound) & Misclassified & postoneg & negtopos\\\hline 1 & 1000 & \ \ 212 & \ \ 788 & \ \ 1 & \ \ 56 & \ \ 32 & \ \ 24\\ 2 & 1000 & \ \ 212 & \ \ 788 & \ \ 2 & \ \ 41 & \ \ 23 & \ \ 18\\ 3 & 1000 & \ \ 409 & \ \ 591 & \ \ 1 & \ \ 153 & \ \ 37 & \ \ 116\\ 4 & 4000 & \ \ 1055 & \ \ 2945 & \ \ 1 & \ \ 438 & \ \ 168 & \ \ 270\\\hline \end{tabular} \end{center} \caption{Tests with clinical data}% \label{tb:slopes}% \end{table} For the summary of tests, see the Table 1. \begin{description} \item[1] Postoneg means the number of +1 labeled vectors in the group of -1 labeled majority, while negtopos the number of -1 labeled vectors in the group of +1 labeled. \item[2] Test 1 and 2 are the same data with different C values. \item[3] Test 1 and 3 are different. \item[4] Test 3 is contained in Test 4. \end{description} \section{Implication} Support Vector Machine can be applied for diagnosis of diseases and drug adverse. If, for each possible patient, we input all the test results as a vector, the status of a disease and its prescription could be determined from the past disease records. It should be noted that the data is not limited to numerical ones and it could include visual data such as X-ray or MRI image and possibly other sources. For example, in the image data, one extracts area, length, its topological invariant and others for the totality of input data. Due to its generic nature of \emph{SVM} already found its application in diverse field and it may find even more application elsewhere. (Depending on the users's insight and intuitions, for example, putting genotypes with phenotypes, drug and phenotypes or genotyes etc.) \section{\textsc{Acknowledgments}} We are grateful to thank to Profs Larry Shepp of Rugers University, Chul Ahn of University of Texas at Houston and Dr. Honki Kim of Cedent Corp. for comments and criticism. Special thanks are due to Prof. Vanderbei at Princeton University for demonstration of his software implementation of \emph{SVM}. \begin{thebibliography}{9} % \bibitem[1]{Jo}Joachims, Thorsten, Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Sch\"{o}lkopf and C. Burges and A. Smola (ed.), MIT Press, 1999. \bibitem[2]{KK}Kim Gene and Kim, MyungHo. Application of Support Vector Machine to detect an association between a disease or trait and multiple SNP variations, (http://xxx.lanl.gov/abs/cs.CC/0104015) \bibitem[3]{Si}Simmons, Donald, Nonlinear programming for operations research, Prentice Hall, 1975 \bibitem[4]{Van}Vanderbei, Robert. http://www.princeton.edu/~ rvdb \bibitem[5]{Va}Vapnik, Vladimir. The Nature of Statistical Learning, Springer, New York, 1995 \end{thebibliography} \end{document}