Full Article on SVM: From Classification to Kernel Selection to Outlier Detection, with Code in R and Python

Basic Information:

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.

The advantages of support vector machines are:

  • Effective in high dimensional spaces.
  • Still effective in cases where number of dimensions is greater than the number of samples.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  • Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

  • If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
  • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the snippet after this list).
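For example, in scikit-learn you can request probability estimates explicitly; doing so triggers the extra cross-validation step mentioned above. A minimal sketch, using the iris data that also appears later in this article:

from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
# probability=True fits Platt scaling with an internal cross-validation,
# which makes training noticeably slower
clf = svm.SVC(kernel='rbf', probability=True).fit(X, y)
print(clf.predict_proba(X[:3]))  # class-membership probabilities for 3 samples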

The support vector machines in scikit-learn (the Python ML package) support both dense (numpy.ndarray and anything convertible to it by numpy.asarray) and sparse (any scipy.sparse) sample vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data. For optimal performance, use a C-ordered numpy.ndarray (dense) or scipy.sparse.csr_matrix (sparse) with dtype=float64.
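As a quick illustration of the dense and sparse input formats (a small sketch, not from the original article):

import numpy as np
from scipy import sparse
from sklearn import svm

X_dense = np.random.rand(20, 5)         # C-ordered float64 dense array
y = np.random.randint(0, 2, size=20)
X_sparse = sparse.csr_matrix(X_dense)   # the same data as a CSR sparse matrix

clf = svm.SVC(kernel='linear').fit(X_sparse, y)  # fit on sparse data...
print(clf.predict(X_sparse[:3]))                 # ...so it can also predict on sparse data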

What is a Support Vector Machine?

The support vector machine algorithm finds a hyperplane in an N-dimensional space (N is the number of features) that distinctly classifies the data points.

Watch this video by Caltech's Professor Yaser Abu-Mostafa to understand the math behind it.

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
In addition to performing linear classification, SVMs can efficiently perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces. You may wonder how a linear method can solve a non-linear problem; a simple example makes this clear.

Consider a typical plot of points in the (x, y) plane that cannot be separated by a straight line; if you map the same points into the (x², y²) plane, the plot changes and the two classes become linearly separable.

Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.

HYPERPLANE:

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
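Formally, a hyperplane in feature space is the set of points x satisfying

w \cdot x + b = 0

where w is the normal vector to the hyperplane and b is the bias term; with 2 features this reduces to the equation of a line, and with 3 features to a plane.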

How Does an SVM Classifier Work?

For a dataset consisting of a feature set and a label set, an SVM classifier builds a model to predict classes for new examples. It assigns new examples/data points to one of the classes. If there are only 2 classes, it is called a binary SVM classifier.

There are 2 kinds of SVM classifiers:

  1. Linear SVM Classifier
  2. Non-Linear SVM Classifier

SVM Linear Classifier:

In the linear classifier model, we assume that the training examples, plotted in space, are separated by an apparent gap. The model predicts a straight hyperplane dividing the 2 classes. The primary focus while drawing the hyperplane is on maximizing the distance from the hyperplane to the nearest data point of either class. The drawn hyperplane is called a maximum-margin hyperplane.
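For linearly separable data, this margin maximization is usually written in its hard-margin form as

\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, n

which makes the margin 2 / ||w|| between the two classes as large as possible.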

SVM Non-Linear Classifier:

In the real world, our data is generally dispersed to some extent. In that case, separating the data into different classes with a straight linear hyperplane is not a good choice. For this, Vapnik suggested creating non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. In non-linear SVM classification, the data points are plotted in a higher-dimensional space.

A typical case of how non-linear separation happens:

Linear Support Vector Machine Classifier

In a linear classifier, a data point is considered as a p-dimensional vector (a list of p numbers) and we separate points using a (p-1)-dimensional hyperplane. There can be many hyperplanes separating the data, but the best hyperplane is the one which maximizes the margin, i.e., the distance between the hyperplane and the closest data point of either class.

The maximum-margin hyperplane is determined by the data points that lie nearest to it, since we have to maximize the distance between the hyperplane and the data points. The data points which influence our hyperplane are known as support vectors.

Non-Linear Support Vector Machine Classifier

It often happens that our data points are not linearly separable in a (finite) p-dimensional space. To solve this, it was proposed to map the p-dimensional space into a much higher-dimensional space, where we can draw customized/non-linear hyperplanes using the kernel trick.
Every kernel is defined by a non-linear kernel function.

This function helps to build a high dimensional feature space. There are many kernels that have been developed. Some standard kernels are:

  1. Polynomial (homogeneous) kernel: the polynomial kernel function can be represented by the following expression:

k(\vec{x_i}, \vec{x_j}) = (\vec{x_i} \cdot \vec{x_j})^{d}

     where k(x_i, x_j) is the kernel function, x_i and x_j are vectors in feature space, and d is the degree of the polynomial.

  2. Polynomial (non-homogeneous) kernel: in the non-homogeneous kernel, a constant term is also added:

K(x, y) = (x^\mathsf{T} y + c)^{d}

The constant term c is also known as a free parameter; it influences the combination of features. Here x and y are vectors in feature space.
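In scikit-learn, both polynomial kernels map onto the degree and coef0 parameters of SVC (a small sketch, not part of the original article; note that scikit-learn also multiplies the dot product by gamma):

from sklearn import svm

# homogeneous polynomial kernel: (x . y)^3, i.e. constant term c = 0
homog_poly = svm.SVC(kernel='poly', degree=3, coef0=0.0)

# non-homogeneous polynomial kernel: (x . y + 1)^3, coef0 plays the role of c
nonhomog_poly = svm.SVC(kernel='poly', degree=3, coef0=1.0)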

  3. Radial Basis Function (RBF) kernel: this is one of the most popular kernels. It uses the squared Euclidean distance as the distance metric and can draw completely non-linear hyperplanes.

K(\mathbf{x}, \mathbf{x'}) = \exp\left(-\frac{\lVert \mathbf{x} - \mathbf{x'} \rVert^{2}}{2\sigma^{2}}\right)


where x & x’ are vectors of feature space.  is a free parameter. Selection of parameters is a critical choice. Using a typical value of the parameter can lead to overfitting our data.

Support Vector Machine Libraries / Packages:

To implement a support vector machine on a dataset, we can use libraries. There are many libraries or packages available that can help us implement SVM smoothly; we just need to call functions with parameters according to our needs.

In Python, we can use libraries like sklearn. For classification, Sklearn provides functions like SVC, NuSVC & LinearSVC.

SVC() and NuSVC() are almost similar, with some differences in their parameters. We pass values for the kernel parameter, gamma, the C parameter, etc. By default the kernel parameter uses "rbf" as its value, but we can pass values like "poly", "linear", "sigmoid" or a callable function.

LinearSVC() is an SVC for classification that uses only a linear kernel. In LinearSVC() we don't pass a kernel value, since it is specifically for linear classification.
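The three classifiers are constructed in very similar ways (a short illustration; the parameter values here are arbitrary):

from sklearn.svm import SVC, NuSVC, LinearSVC

clf1 = SVC(kernel='rbf', gamma=0.5, C=1.0)      # default kernel is 'rbf'
clf2 = NuSVC(kernel='poly', degree=3, nu=0.5)   # nu replaces C as the regularization control
clf3 = LinearSVC(C=1.0)                         # linear kernel only, no kernel argument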

In the R programming language, we can use packages like "e1071" or "caret". To use a package, we need to install it first. To install "e1071", we can type install.packages("e1071") in the console.
e1071 provides an svm() method that can be used for both regression and classification. The svm() method accepts the data, gamma value, kernel, etc.

Cost Function and Gradient Of SVM:

We are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is the hinge loss. The cost is 0 if the predicted value and the actual value are of the same sign and the example lies outside the margin; otherwise we calculate the loss value. We also add a regularization term to the cost function, whose objective is to balance margin maximization and loss. After adding the regularization term, the cost function looks as below.

J(w) = \lambda \lVert w \rVert^{2} + \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\, 1 - y_i\,(w \cdot x_i)\right)

(cost with regularization)

Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using the gradients, we can update our weights.
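A minimal NumPy sketch of this hinge loss and its (sub)gradient for a binary problem with labels y in {-1, +1} (an illustrative implementation, not the loss_grad_svm_vectorized function used later):

import numpy as np

def hinge_loss_and_grad(w, X, y, lam):
    """X: (n, d) features, y: (n,) labels in {-1, +1}, lam: regularization strength."""
    margins = 1 - y * (X @ w)                   # 1 - y_i * <w, x_i> for each sample
    loss = lam * np.dot(w, w) + np.maximum(0, margins).mean()
    # subgradient: -y_i * x_i for samples that violate the margin, 0 otherwise
    violated = (margins > 0).astype(float)
    grad = 2 * lam * w - (X * (violated * y)[:, None]).mean(axis=0)
    return loss, grad

# a single (sub)gradient descent update would then be:  w -= learning_rate * grad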

Now let's do some coding to implement what we have learned:

PYTHON:

Let's start with a baby example: the iris dataset. If you have followed the decision tree article, you already have a pretty good idea of how you can classify the different species in a hierarchical format; the only problem arises when you have to separate overlapping classes (for that we have neural nets). But first, let's look at the distribution of petal and sepal lengths and widths to get motivated, and then we will use the different kernels specified above to get the SVM classification.
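A quick way to look at those distributions (a small matplotlib sketch; it plots sepal length against sepal width, coloured by species):

import matplotlib.pyplot as plt
from sklearn import datasets

iris_dataset = datasets.load_iris()
X = iris_dataset.data
y = iris_dataset.target

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.xlabel(iris_dataset.feature_names[0])
plt.ylabel(iris_dataset.feature_names[1])
plt.title('Iris sepal measurements by species')
plt.show()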

and finally use the SVM classifiers:

from sklearn import datasets, svm

iris_dataset = datasets.load_iris()
X = iris_dataset.data[:, :]
y = iris_dataset.target
C = 1.0  # SVM regularization parameter
 
# SVC with linear kernel
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
# LinearSVC (linear kernel)
lin_svc = svm.LinearSVC(C=C).fit(X, y)
# SVC with RBF kernel
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
# SVC with polynomial (degree 3) kernel
poly_svc3 = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)

# SVC with polynomial (degree 4) kernel
poly_svc4 = svm.SVC(kernel='poly', degree=4, C=C).fit(X, y)
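To get a rough feel for how the different kernels behave, you can compare their accuracy on the training data (illustrative only; a proper evaluation would use a held-out test set):

for name, clf in [('linear SVC', svc), ('LinearSVC', lin_svc),
                  ('RBF SVC', rbf_svc), ('poly deg 3', poly_svc3),
                  ('poly deg 4', poly_svc4)]:
    print('%s training accuracy: %.3f' % (name, clf.score(X, y)))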

Now we will work with an image classification problem: CIFAR-10 from cs.toronto.edu. Here is how the dataset looks.

Run get_datasets.sh in a terminal to download the dataset, or download it directly from Alex Krizhevsky's page.

get_datasets.sh

# Get CIFAR10
wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xzvf cifar-10-python.tar.gz
rm cifar-10-python.tar.gz 

The result of the download is shown in the following figure.

First we need code to load the data; after that, you can write custom visualization code.
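The loading code itself is not shown here; a minimal sketch, assuming the standard CIFAR-10 python batches (pickled dicts with 'data' and 'labels' entries), could look like this:

import os
import pickle
import numpy as np

def load_cifar10_batch(filename):
    """Load one CIFAR-10 batch and return images as (N, 32, 32, 3) plus labels as (N,)."""
    with open(filename, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
    data = batch[b'data'].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1).astype('float64')
    labels = np.array(batch[b'labels'])
    return data, labels

# e.g. the first training batch (the path assumes the archive was extracted in place)
X_batch, y_batch = load_cifar10_batch(os.path.join('cifar-10-batches-py', 'data_batch_1'))

With the batches loaded, the visualization helper below can display a few samples per class.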

import numpy as np
import matplotlib.pyplot as plt

def visualize_sample(X_train, y_train, classes, samples_per_class=7):
    """Visualize some samples from the training dataset."""
    num_classes = len(classes)
    for y, cls in enumerate(classes):
        idxs = np.flatnonzero(y_train == y) # get all the indexes of cls
        idxs = np.random.choice(idxs, samples_per_class, replace=False)
        for i, idx in enumerate(idxs): # plot the image one by one
            plt_idx = i * num_classes + y + 1 # i*num_classes and y+1 determine the row and column respectively
            plt.subplot(samples_per_class, num_classes, plt_idx)
            plt.imshow(X_train[idx].astype('uint8'))
            plt.axis('off')
            if i == 0:
                plt.title(cls)
    plt.show()
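Assuming X_train and y_train have been assembled from the batches loaded above, the function can be called with the ten CIFAR-10 class names:

classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
visualize_sample(X_train, y_train, classes)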

Then you can write the classifier function

# Test the loss and gradient
from algorithms.classifiers import loss_grad_svm_vectorized
import time

# generate random weights W
W = np.random.randn(10, X_train.shape[0]) * 0.001

tic = time.time()
loss_vec, grad_vect = loss_grad_svm_vectorized(W, X_train, y_train, 0)
toc = time.time()
print('Vectorized loss: %f, and gradient: computed in %fs' % (loss_vec, toc - tic))

and after computing gradient loss you can check the performance on test dataset

y_test_predict_result = best_svm.predict(X_test)
y_test_predict = y_test_predict_result[0]
test_accuracy = np.mean(y_test == y_test_predict)
print('The test accuracy is: %f' % test_accuracy)

R

To use SVM in R, we need the e1071 package.

Let's start with the Titanic dataset; if you are not familiar with the Titanic dataset, you can read about it here.

We will take only the Age and Fare columns to predict survival:

train_f <- train[c(6, 10)]  # columns 6 and 10 are Age and Fare

We will use a linear kernel for prediction (the features are combined with the Survived column into train_c before fitting):

classifier = svm(formula = Survived ~ .,
                 data = train_c,
                 type = 'C-classification',
                 kernel = 'linear')
Calling classifier will give us details about the fit:

classifier

Call:
svm(formula = Survived ~ ., data = train_c, type = "C-classification", kernel = "linear")

Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 1
gamma: 0.5

Number of Support Vectors: 517

The support vector boundary will look like this:

What can you do now?

  1. Download the repositories: git clone https://github.com/MachineLearningWithHuman/R and git clone https://github.com/MachineLearningWithHuman/python
  2. Go to the respective SVM folders, play with the datasets, and see how your SVM kernel, support vectors and boundary change.

