A Practical Approach to Stochastic Gradient Descent (SGD)

Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It is called stochastic because the method uses randomly selected (or shuffled) samples to evaluate the gradients, hence SGD can be regarded as a stochastic approximation of gradient descent optimization.

Background

Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:

$$Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w),$$

where the parameter $w$ that minimizes $Q(w)$ is to be estimated. Each summand function $Q_i$ is typically associated with the $i$-th observation in the data set (used for training).

In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has long been recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation.[2] Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).

The sum-minimization problem also arises for empirical risk minimization. In this case, $Q_i(w)$ is the value of the loss function at the $i$-th example, and $Q(w)$ is the empirical risk.

When used to minimize the above function, a standard (or “batch”) gradient descent method would perform the following iterations:

$$w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla Q_i(w),$$

where $\eta$ is the step size (sometimes called the learning rate in machine learning).

In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations.

However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions’ gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.

In stochastic (or “on-line”) gradient descent, the true gradient of $Q(w)$ is approximated by a gradient at a single example:

$$w := w - \eta \nabla Q_i(w).$$

As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges.

In pseudocode, stochastic gradient descent can be presented as follows:

  • Choose an initial vector of parameters $w$ and learning rate $\eta$.
  • Repeat until an approximate minimum is obtained:
    • Randomly shuffle examples in the training set.
    • For $i = 1, 2, \ldots, n$, do:
      • $w := w - \eta \nabla Q_i(w)$.
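
Translated directly into Python, the pseudocode might look like the following generic sketch; the function name sgd, its arguments, and the list-of-floats representation of the parameters are my own illustrative choices, not a library API:

    import random

    def sgd(w, grad_i, examples, eta=0.01, n_epochs=100):
        """Minimize the average of Q_i(w) over the examples.

        grad_i(w, example) must return the gradient of Q_i at w
        as a list with one entry per parameter.
        """
        for _ in range(n_epochs):
            random.shuffle(examples)       # reshuffle each pass to prevent cycles
            for example in examples:
                g = grad_i(w, example)     # gradient at a single example
                w = [wj - eta * gj for wj, gj in zip(w, g)]
        return w
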
Let’s suppose we want to fit a straight line $\hat{y} = w_1 + w_2 x$ to a training set with observations $(x_1, x_2, \ldots, x_n)$ and corresponding responses $(y_1, y_2, \ldots, y_n)$ using least squares, so that the estimated responses are $(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)$ with $\hat{y}_i = w_1 + w_2 x_i$. The objective function to be minimized is:

$$Q(w) = \sum_{i=1}^{n} Q_i(w) = \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2 = \sum_{i=1}^{n} \left(w_1 + w_2 x_i - y_i\right)^2.$$

The last line in the above pseudocode for this specific problem will become:

$$\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} := \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial}{\partial w_1}(w_1 + w_2 x_i - y_i)^2 \\ \frac{\partial}{\partial w_2}(w_1 + w_2 x_i - y_i)^2 \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} 2(w_1 + w_2 x_i - y_i) \\ 2 x_i (w_1 + w_2 x_i - y_i) \end{bmatrix}.$$

Note that in each iteration (also called an update), the gradient is evaluated only at a single point $x_i$ instead of at the full set of samples.

The key difference compared to standard (batch) gradient descent is that only one piece of data from the dataset is used to calculate the step, and that piece is picked randomly at each step.

An example to show how effective SGD is:

Now that we have covered the definition of SGD, let's explore it in more detail with a practical approach, implementing it in Python.
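
To make this concrete, here is a minimal NumPy sketch of SGD for the straight-line fit derived above. The synthetic data, learning rate, and epoch count are illustrative assumptions, not values from any library:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data from y = 2 + 3x plus noise (illustrative only)
    x = rng.uniform(-1.0, 1.0, size=100)
    y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)

    w1, w2 = 0.0, 0.0   # initial parameters
    eta = 0.1           # learning rate
    n_epochs = 50       # passes over the training set

    for _ in range(n_epochs):
        for i in rng.permutation(len(x)):      # shuffle each pass to prevent cycles
            residual = w1 + w2 * x[i] - y[i]
            # Gradient of Q_i(w) = (w1 + w2*x_i - y_i)^2, as derived above
            w1 -= eta * 2.0 * residual
            w2 -= eta * 2.0 * x[i] * residual

    print(w1, w2)  # should end up close to (2, 3)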

Stochastic Gradient Descent in a nutshell

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. It has received a considerable amount of attention recently in the context of large-scale learning.

SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale to problems with more than 10^5 training examples and more than 10^5 features.
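
As an illustration of this sparse, large-scale setting, here is a hedged sketch of a text-classification pipeline; the documents, labels, and hyperparameters are made up for demonstration:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    docs = ["free money now", "meeting at noon", "win a big prize", "project status update"]
    labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (illustrative)

    # HashingVectorizer produces high-dimensional sparse features,
    # which SGDClassifier handles efficiently
    vec = HashingVectorizer(n_features=2**20)
    X = vec.transform(docs)

    clf = SGDClassifier(loss="hinge")
    clf.fit(X, labels)
    print(clf.predict(vec.transform(["free prize now"])))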

The advantages of Stochastic Gradient Descent are:

  • Efficiency.
  • Ease of implementation (lots of opportunities for code tuning).

The disadvantages of Stochastic Gradient Descent include:

  • SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
  • SGD is sensitive to feature scaling.

The concrete loss function can be set via the loss parameter. SGDClassifier supports the following loss functions:

  • loss="hinge": (soft-margin) linear Support Vector Machine,
  • loss="modified_huber": smoothed hinge loss,
  • loss="log": logistic regression,
  • and all regression losses below.

Using loss="log" or loss="modified_huber" enables the predict_proba method

The concrete penalty can be set via the penalty parameter. SGD supports the following penalties:

  • penalty="l2": L2 norm penalty on coef_.
  • penalty="l1": L1 norm penalty on coef_.
  • penalty="elasticnet": Convex combination of L2 and L1; (1 - l1_ratio) * L2 + l1_ratio * L1

All of the above deals with classification, but what about regression?

Regression

The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet.

The concrete loss function can be set via the loss parameter. SGDRegressor supports the following loss functions:

  • loss="squared_loss": Ordinary least squares,
  • loss="huber": Huber loss for robust regression,
  • loss="epsilon_insensitive": linear Support Vector Regression.

The Huber and epsilon-insensitive loss functions can be used for robust regression. The width of the insensitive region has to be specified via the parameter epsilon. This parameter depends on the scale of the target variables.
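
A minimal sketch of robust regression with the Huber loss follows; the synthetic data and the epsilon value are illustrative and should be adapted to the scale of your targets:

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(200, 1))
    y = 2.0 + 3.0 * X.ravel() + rng.normal(scale=0.1, size=200)
    y[:5] += 10.0  # a few outliers, which the Huber loss downweights

    reg = SGDRegressor(loss="huber", epsilon=1.0, max_iter=1000)
    reg.fit(X, y)
    print(reg.intercept_, reg.coef_)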

Note:

The sparse implementation produces slightly different results than the dense implementation due to a shrunk learning rate for the intercept.

Practical Tips:

  • Stochastic Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. This can be easily done using StandardScaler:

        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        scaler.fit(X_train)  # Don’t cheat – fit only on training data
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)  # apply same transformation to test data

    If your attributes have an intrinsic scale (e.g. word frequencies or indicator features), scaling is not needed.
  • Finding a reasonable regularization term α is best done using GridSearchCV, usually in the range 10.0**-np.arange(1,7); see the sketch after this list.
  • Empirically, we found that SGD converges after observing approximately 10^6 training samples. Thus, a reasonable first guess for the number of iterations is max_iter = np.ceil(10**6 / n), where n is the size of the training set.
  • If you apply SGD to features extracted using PCA, we found that it is often wise to scale the feature values by some constant c such that the average L2 norm of the training data equals one.
  • We found that Averaged SGD works best with a larger number of features and a higher eta0.
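
As mentioned in the second tip, here is a hedged sketch of tuning the regularization term with GridSearchCV; the toy data exists only to make the snippet self-contained:

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV

    # Toy data, purely to make the example runnable
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Search alpha over 10^-1 ... 10^-6, as suggested above
    params = {"alpha": 10.0 ** -np.arange(1, 7)}
    search = GridSearchCV(SGDClassifier(loss="hinge"), params, cv=5)
    search.fit(X, y)
    print(search.best_params_)
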
[Figure: plot of the loss functions supported by SGD.]

