The advantages of support vector machines are:
- Effective in high dimensional spaces.
- Still effective in cases where the number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
- If the number of features is much greater than the number of samples, avoiding over-fitting through the choice of kernel function and regularization term is crucial.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
The support vector machines in scikit-learn (a Python machine learning package) support both dense (numpy.ndarray and convertible to that by numpy.asarray) and sparse (any scipy.sparse) sample vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data. For optimal performance, use C-ordered numpy.ndarray (dense) or scipy.sparse.csr_matrix (sparse) with
What is a Support Vector Machine?
The support vector machine algorithm finds a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
In addition to performing linear classification, SVMs can efficiently perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces. You may wonder how a linear classifier can solve a non-linear problem; a simple example makes the idea clear.
Take a typical plot of data in the x, y plane. If you change the plane, for example to the x^2, y^2 plane, the same points are laid out differently, and classes that no straight line could separate before can become linearly separable.
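To make the change of plane concrete, here is a minimal sketch. The toy data (two concentric rings), the `ring` helper and the threshold of 5 are assumptions made for illustration, not something taken from the original figures. It maps each point (x, y) to (x^2, y^2) and checks that a straight line separates the two classes in the new plane.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """Sample n points on a circle of the given radius (hypothetical toy data)."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

inner = ring(1.0, 50)  # class 0: not separable from class 1 by a line in (x, y)
outer = ring(3.0, 50)  # class 1

def feature_map(points):
    """Map each point (x, y) to (x^2, y^2)."""
    return points ** 2

# In the mapped plane (u, v) = (x^2, y^2), a ring of radius r lies on the
# straight line u + v = r^2, so the line u + v = 5 separates the two classes.
inner_sum = feature_map(inner).sum(axis=1)
outer_sum = feature_map(outer).sum(axis=1)
separable = bool(np.all(inner_sum < 5.0) and np.all(outer_sum > 5.0))
print(separable)
```

A kernel SVM never builds the mapped coordinates explicitly; this sketch does so only to show why a linear boundary in the new space corresponds to a curved one in the original space.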
Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
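As a small illustration of a hyperplane acting as a decision boundary, the sketch below classifies points by the sign of w . x + b. The weights `w`, the offset `b`, the sample points and the `classify` helper are all hypothetical values chosen for the example.

```python
import numpy as np

# Hypothetical hyperplane for 2 features: the line x1 + x2 - 3 = 0.
w = np.array([1.0, 1.0])
b = -3.0

def classify(points, w, b):
    """Return +1 or -1 depending on which side of w . x + b = 0 a point falls."""
    return np.where(points @ w + b >= 0, 1, -1)

points = np.array([[0.0, 0.0], [4.0, 4.0], [1.0, 1.5]])
labels = classify(points, w, b)
print(labels)
```

With 2 features the boundary is a line; with 3 features the same code works unchanged, and `w` simply gains a third component.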
How Does an SVM Classifier Work?
For a dataset consisting of a feature set and a label set, an SVM classifier builds a model to predict classes for new examples. It assigns each new example or data point to one of the classes. If there are only 2 classes, it is called a binary SVM classifier.
There are 2 kinds of SVM classifiers:
- Linear SVM Classifier
- Non-Linear SVM Classifier
SVM Linear Classifier:
In the linear classifier model, we assume that the training examples are plotted in space. These data points are expected to be separated by an apparent gap. The model predicts a straight hyperplane dividing the 2 classes. The primary focus while drawing the hyperplane is on maximizing the distance from the hyperplane to the nearest data point of either class. The resulting hyperplane is called the maximum-margin hyperplane.
SVM Non-Linear Classifier:
In the real world, our dataset is generally dispersed to some extent. In that case, separating the data into different classes with a straight linear hyperplane cannot be considered a good choice. For this, Vapnik suggested creating non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. In non-linear SVM classification, the data points are plotted in a higher-dimensional space.
Linear Support Vector Machine Classifier
In a linear classifier, a data point is considered a p-dimensional vector (a list of p numbers), and we separate points using a (p-1)-dimensional hyperplane. There can be many hyperplanes that separate the data linearly, but the best hyperplane is considered to be the one that maximizes the margin, i.e., the distance between the hyperplane and the closest data point of either class.
The maximum-margin hyperplane is determined by the data points that lie nearest to it, since it is the distance between the hyperplane and these points that we have to maximize. The data points that influence our hyperplane are known as support vectors.
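The following sketch shows how support vectors and the margin can be inspected with scikit-learn. The six toy points are hypothetical, and the very large C is an assumption used to approximate a hard-margin classifier on separable data. For a linear SVM the margin width is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates the hard-margin case described above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                          # normal vector of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin lines

print(clf.support_vectors_)               # the points that determine the hyperplane
print(margin_width)
```

Removing any point that is not listed in `support_vectors_` and refitting should leave the hyperplane unchanged, which is what "determined by the nearest points" means in practice.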
Non-Linear Support Vector Machine Classifier
It often happens that our data points are not linearly separable in a p-dimensional (finite) space. To solve this, it was proposed to map the p-dimensional space into a much higher-dimensional space. We can then draw customized, non-linear decision boundaries using the kernel trick.
Each kernel is defined by a kernel function, which is usually non-linear.
This function helps to build a high-dimensional feature space. Many kernels have been developed. Some standard kernels are:
- Polynomial (homogeneous) kernel: k(xi, xj) = (xi · xj)^d, where k(xi, xj) is the kernel function, xi and xj are vectors in feature space, and d is the degree of the polynomial.
- Polynomial (non-homogeneous) kernel: k(x, y) = (x · y + c)^d. In the non-homogeneous kernel, a constant term is also added. The constant term c is also known as a free parameter; it influences the combination of features. x and y are vectors in feature space.
Radial Basis Function Kernel:
It is also known as the RBF kernel and is one of the most popular kernels. It uses the squared Euclidean distance as its distance metric, and it is used to draw completely non-linear hyperplanes. A common form is k(x, x') = exp(-gamma * ||x - x'||^2),
where x and x' are vectors in feature space and gamma is a free parameter. Selecting this parameter is a critical choice: a poorly chosen value can lead to overfitting the data.
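The kernels listed above can be written in a few lines of NumPy. This is a minimal sketch: the function names, the sample vectors and the `gamma` parameterization of the RBF kernel (the form scikit-learn uses for its `gamma` argument) are assumptions for illustration.

```python
import numpy as np

def polynomial_kernel(x, y, degree=2, c=0.0):
    """Polynomial kernel (x . y + c)^degree; c = 0 gives the homogeneous form."""
    return (np.dot(x, y) + c) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - y||^2), based on squared Euclidean distance."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

k_hom = polynomial_kernel(x, y, degree=2)           # (1*3 + 2*4)^2
k_inhom = polynomial_kernel(x, y, degree=2, c=1.0)  # (1*3 + 2*4 + 1)^2
k_rbf = rbf_kernel(x, y, gamma=0.5)                 # exp(-0.5 * 8)

print(k_hom, k_inhom, k_rbf)
```

A larger `gamma` makes the RBF kernel fall off faster with distance, so each training point influences a smaller neighbourhood, which is one way a badly chosen value leads to overfitting.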
Support Vector Machine Libraries / Packages:
To implement a support vector machine on a dataset, we can use libraries. Many libraries and packages are available that make implementing an SVM straightforward; we just need to call functions with parameters suited to our needs.
In Python, we can use libraries like sklearn. For classification, sklearn provides classes such as SVC, NuSVC and LinearSVC.
SVC and NuSVC are similar, with some differences in their parameters. We pass values for parameters such as kernel, gamma and C (NuSVC takes nu in place of C). By default the kernel parameter is "rbf", but we can pass values like "poly", "linear", "sigmoid" or a callable function.
LinearSVC is a support vector classifier that uses only a linear kernel. In LinearSVC, we don't pass a kernel value, since it is specifically for linear classification.
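Since the kernel parameter also accepts a callable, here is a minimal sketch of passing one to SVC. The `linear_gram` function is a hypothetical custom kernel that simply reproduces the built-in linear kernel, so the two classifiers should agree on almost every prediction; the iris data is used only as a convenient example.

```python
import numpy as np
from sklearn import datasets, svm

def linear_gram(X, Y):
    """Hypothetical custom kernel: Gram matrix of plain dot products between rows."""
    return X @ Y.T

iris = datasets.load_iris()
X, y = iris.data, iris.target

# SVC calls the function with two data matrices and expects the kernel matrix back.
custom = svm.SVC(kernel=linear_gram, C=1.0).fit(X, y)
builtin = svm.SVC(kernel="linear", C=1.0).fit(X, y)

agreement = np.mean(custom.predict(X) == builtin.predict(X))
print(agreement)
```

Any function that returns a valid kernel matrix for two sets of samples can be swapped in the same way.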
In the R programming language, we can use packages like "e1071" or "caret". To use a package, we need to install it first. To install "e1071", we can type install.packages("e1071") in the console.
e1071 provides an svm() function that can be used for both regression and classification. svm() accepts the data, gamma values, the kernel and so on.
Cost Function and Gradient Of SVM:
We are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is the hinge loss. The cost is 0 if the predicted value and the actual value have the same sign and the point lies outside the margin, that is, y * f(x) >= 1. Otherwise, we calculate the loss value, max(0, 1 - y * f(x)). We also add a regularization parameter to the cost function. The objective of the regularization parameter is to balance margin maximization and loss. After adding the regularization parameter lambda, one common form of the cost function is J(w) = lambda * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * (w · x_i)).
Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using the gradients, we can update our weights.
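The cost and gradient described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used later in the article: labels are in {-1, +1}, the cost is lam * ||w||^2 plus the mean hinge loss, the bias is folded into the weights through a constant column, and the function names, toy data, learning rate and epoch count are all hypothetical.

```python
import numpy as np

def hinge_cost_and_grad(w, X, y, lam):
    """Regularized hinge cost and its (sub)gradient.

    X has shape (n, d), y holds labels in {-1, +1}, w has shape (d,).
    """
    margins = y * (X @ w)
    cost = lam * np.dot(w, w) + np.maximum(0.0, 1.0 - margins).mean()
    active = margins < 1.0  # points inside the margin or misclassified
    grad = 2.0 * lam * w - (X[active].T @ y[active]) / len(y)
    return cost, grad

def train(X, y, lam=0.01, lr=0.1, epochs=200):
    """Plain gradient descent on the regularized hinge cost."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        _, grad = hinge_cost_and_grad(w, X, y, lam)
        w -= lr * grad
    return w

# Hypothetical toy data; the last column of ones acts as the bias term.
X = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = train(X, y)
predictions = np.sign(X @ w)
print(predictions)
```

Only the points with margin below 1 contribute to the data part of the gradient; for the others, the update comes from the regularization term alone.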
Now let's write some code to put what we have learned into practice:
Let's start with a small example, the iris dataset. If you have followed the decision tree material, you already have a good idea of how the different species can be classified in a hierarchical way; the problem arises when classes overlap. Neural nets are one option for that, but first let's look at the distribution of the petal and sepal lengths and widths to get motivated, and then use the different kernels described above for SVM classification.
Finally, use the SVM classifiers:
    from sklearn import datasets, svm

    iris_dataset = datasets.load_iris()
    X = iris_dataset.data[:, :]
    y = iris_dataset.target
    C = 1.0  # SVM regularization parameter

    # SVC with linear kernel
    svc = svm.SVC(kernel='linear', C=C).fit(X, y)
    # LinearSVC (linear kernel); max_iter raised so the solver can converge
    lin_svc = svm.LinearSVC(C=C, max_iter=10000).fit(X, y)
    # SVC with RBF kernel
    rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
    # SVC with polynomial (degree 3) kernel
    poly_svc3 = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)
    # SVC with polynomial (degree 4) kernel
    poly_svc4 = svm.SVC(kernel='poly', degree=4, C=C).fit(X, y)
Now we will work on an image classification problem, CIFAR-10 from cs.toronto. Here is how the dataset looks:
Run get_datasets.sh in terminal to download the datasets, or download from Alex Krizhevsky.
    # Get CIFAR10
    wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
    tar -xzvf cifar-10-python.tar.gz
    rm cifar-10-python.tar.gz
The result of the download is shown in the following figure.
After writing code for loading data you can write custom visualization code
    import numpy as np
    import matplotlib.pyplot as plt

    def visualize_sample(X_train, y_train, classes, samples_per_class=7):
        """Visualize some samples from the training dataset."""
        num_classes = len(classes)
        for y, cls in enumerate(classes):
            idxs = np.flatnonzero(y_train == y)  # get all the indexes of cls
            idxs = np.random.choice(idxs, samples_per_class, replace=False)
            for i, idx in enumerate(idxs):  # plot the images one by one
                # i * num_classes and y + 1 determine the row and column respectively
                plt_idx = i * num_classes + y + 1
                plt.subplot(samples_per_class, num_classes, plt_idx)
                plt.imshow(X_train[idx].astype('uint8'))
                plt.axis('off')
                if i == 0:
                    plt.title(cls)
        plt.show()
Then you can write the classifier function
    # Test the loss and gradient
    import time

    import numpy as np
    from algorithms.classifiers import loss_grad_svm_vectorized

    # Generate random weights W, one row per class.
    # Assumption: X_train is stored as (features, samples), so shape[0] is the feature count.
    W = np.random.randn(10, X_train.shape[0]) * 0.001
    tic = time.time()
    loss_vec, grad_vect = loss_grad_svm_vectorized(W, X_train, y_train, 0)
    toc = time.time()
    print('Vectorized loss: %f, and gradient: computed in %fs' % (loss_vec, toc - tic))
After computing the loss and gradient and training the classifier, you can check the performance on the test dataset:
    # best_svm is the classifier selected during training
    y_test_predict_result = best_svm.predict(X_test)
    y_test_predict = y_test_predict_result
    test_accuracy = np.mean(y_test == y_test_predict)
    print('The test accuracy is: %f' % test_accuracy)
To use SVM in R, we need the e1071 package.
Let's start with the Titanic dataset. If you are not familiar with the Titanic dataset, you can see here.
We will take only the age and fare columns to predict survival.
We will use a linear kernel for prediction:
    library(e1071)

    classifier = svm(formula = Survived ~ .,
                     data = train_c,
                     type = 'C-classification',
                     kernel = 'linear')
Printing the classifier gives us the details of the call:
    svm(formula = Survived ~ ., data = train_c, type = "C-classification", kernel = "linear")
    Number of Support Vectors: 517
The support vector boundary will look like this:
What can you do now?
- Download the repositories: git clone https://github.com/MachineLearningWithHuman/R and git clone https://github.com/MachineLearningWithHuman/python
- Go to the respective SVM folders, play with the datasets, and see how the support vectors and decision boundary change with different kernels.