Before we start our journey, let’s explore what Spark is, what TensorFlow is, and why we want to combine them.
Apache Spark™ is a unified analytics engine for large-scale data processing.
Speed: Run workloads 100x faster. Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. (DAG stands for directed acyclic graph.)
Figure: Logistic regression in Hadoop and Spark.
Ease of Use: Write applications quickly in Java, Scala, Python, R, and SQL. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.
To learn how to write ML scripts in Spark using the Python and R APIs, click here (will be updated soon).
    # Spark's Python DataFrame API
    # Read JSON files with automatic schema inference
    df = spark.read.json("logs.json")
    df.where("age > 21").select("name.first").show()
Runs Everywhere: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.
TensorFlow makes it easy for beginners and experts to create machine learning models for desktop, mobile, web, and cloud.
Easy model building: TensorFlow offers multiple levels of abstraction so you can choose the right one for your needs. Build and train models by using the high-level Keras API, which makes getting started with TensorFlow and machine learning easy.
If you need more flexibility, eager execution allows for immediate iteration and intuitive debugging. For large ML training tasks, use the Distribution Strategy API for distributed training on different hardware configurations without changing the model definition.
Robust ML production anywhere: TensorFlow has always provided a direct path to production. Whether it’s on servers, edge devices, or the web, TensorFlow lets you train and deploy your model easily, no matter what language or platform you use.
See the TensorFlow tutorials for more.
Now we understand that we need TensorFlow as a strong deep learning framework, and Spark for high-speed, in-memory parallel computation. But why do we need both for deep learning? Because you can use Spark and a cluster of machines to improve deep learning pipelines with TensorFlow:
- Hyperparameter Tuning: use Spark to find the best set of hyperparameters for neural network training, leading to a 10X reduction in training time and a 34% lower error rate.
- Deploying models at scale: use Spark to apply a trained neural network model on a large amount of data.
An example of a deep learning machine learning (ML) technique is artificial neural networks. They take a complex input, such as an image or an audio recording, and then apply complex mathematical transforms on these signals. The output of this transform is a vector of numbers that is easier to manipulate by other ML algorithms. Artificial neural networks perform this transformation by mimicking the neurons in the visual cortex of the human brain (in a much-simplified form).
Just as humans learn to interpret what they see, artificial neural networks need to be trained to recognize specific patterns that are ‘interesting’. These can be simple patterns such as edges and circles, but they can also be much more complicated. Here, we are going to use a classical dataset put together by NIST and train a neural network to recognize handwritten digits:
The TensorFlow library automates the creation of training algorithms for neural networks of various shapes and sizes. The actual process of building a neural network, however, is more complicated than just running some function on a dataset. There are typically a number of very important hyperparameters (configuration parameters in layman’s terms) to set, which affect how the model is trained. Picking the right parameters leads to high performance, while bad parameters can lead to prolonged training and bad performance. In practice, machine learning practitioners rerun the same model multiple times with different hyperparameters in order to find the best set. This is a classical technique called hyperparameter tuning.
When building a neural network, there are many important hyperparameters to choose carefully. For example:
- Number of neurons in each layer: Too few neurons will reduce the expression power of the network, but too many will substantially increase the running time and return noisy estimates.
- Learning rate: If it is too high, the neural network will only focus on the last few samples seen and disregard all the experience accumulated before. If it is too low, it will take too long to reach a good state.
The interesting thing here is that even though TensorFlow itself is not distributed, the hyperparameter tuning process is “embarrassingly parallel” and can be distributed using Spark. In this case, we can use Spark to broadcast the common elements such as data and model description, and then schedule the individual repetitive computations across a cluster of machines in a fault-tolerant manner.
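The idea can be sketched as follows. This is a minimal sketch under stated assumptions, not the code behind the results quoted in this post: `evaluate` is a hypothetical stand-in for training one TensorFlow model and returning its test error, and `build_grid` and `tune` are hypothetical helpers. When a SparkContext is passed in, every hyperparameter setting becomes one Spark task through `parallelize(...).map(...).collect()`; without one, the same settings are evaluated serially so the logic can be checked on a single machine.

```python
from itertools import product


def build_grid(learning_rates, layer_sizes):
    """Return every (learning_rate, num_neurons) combination as a dict."""
    return [
        {"learning_rate": lr, "num_neurons": n}
        for lr, n in product(learning_rates, layer_sizes)
    ]


def evaluate(params):
    """Hypothetical stand-in for training one model.

    A real version would build and train a TensorFlow model with `params`
    and measure its error on held-out data. A made-up formula is used here
    so that the sketch runs anywhere.
    """
    lr_penalty = abs(params["learning_rate"] - 0.1)
    size_penalty = abs(params["num_neurons"] - 100) / 1000.0
    return {"params": params, "test_error": lr_penalty + size_penalty}


def tune(grid, sc=None):
    """Evaluate every setting and return the one with the lowest test error.

    If `sc` is a SparkContext, one Spark task is created per setting;
    otherwise the settings are evaluated one after another.
    """
    if sc is not None:
        results = sc.parallelize(grid, len(grid)).map(evaluate).collect()
    else:
        results = [evaluate(p) for p in grid]
    return min(results, key=lambda r: r["test_error"])


grid = build_grid([0.001, 0.01, 0.1, 1.0], [50, 100, 500])
best = tune(grid)  # pass sc=spark.sparkContext to run on a cluster
print(best["params"])
```

Because each call to `evaluate` is independent of the others, no communication between tasks is needed, which is what makes the problem embarrassingly parallel.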
How does using Spark improve the accuracy? The accuracy with the default set of hyperparameters is 99.2%. Our best result with hyperparameter tuning has a 99.47% accuracy on the test set, which is a 34% reduction of the test error. Distributing the computations scaled linearly with the number of nodes added to the cluster: using a 13-node cluster, we were able to train 13 models in parallel, which translates into a 7x speedup compared to training the models one at a time on one machine. Here is a graph of the computation times (in seconds) with respect to the number of machines on the cluster:
More importantly, though, we get insights into the sensitivity of the training procedure to various hyperparameters of training. For example, we plot the final test performance with respect to the learning rate, for different numbers of neurons:
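The 34% figure quoted above can be checked with a little arithmetic: it is the relative drop in test error, not in accuracy. The sketch below only uses the two accuracy values given in the text; the variable names are chosen for illustration.

```python
default_accuracy = 99.2   # percent, from the text
tuned_accuracy = 99.47    # percent, from the text

# Test error is whatever is left over after accuracy.
default_error = 100.0 - default_accuracy   # about 0.8 percentage points
tuned_error = 100.0 - tuned_accuracy       # about 0.53 percentage points

# Relative reduction: how much of the original error was removed.
relative_reduction = (default_error - tuned_error) / default_error
print(f"relative error reduction: {relative_reduction:.1%}")
```

The result is roughly one third, which rounds to the 34% reduction stated above.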
This shows a typical tradeoff curve for neural networks:
- The learning rate is critical: if it is too low, the neural network does not learn anything (high test error). If it is too high, the training process may oscillate randomly and even diverge in some configurations.
- The number of neurons is not as important for getting good performance, and networks with many neurons are much more sensitive to the learning rate. This is the Occam’s Razor principle: simpler models tend to be “good enough” for most purposes. If you have the time and resources to go after the missing 1% test error, you must be willing to invest a lot of resources in training, and to find the proper hyperparameters that will make the difference.
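The effect of the learning rate can be seen even on a toy problem. The sketch below is not a neural network: as a simplifying assumption it runs plain gradient descent on f(x) = x², whose minimum is at x = 0, with three hypothetical learning rates. It only illustrates the "too low, about right, too high" behaviour described in the list above.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimise f(x) = x**2 by gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x               # derivative of x**2
        x = x - learning_rate * gradient
    return x


# Distance from the minimum after 50 steps for each learning rate.
for lr in (0.001, 0.4, 1.1):
    print(lr, abs(gradient_descent(lr)))
```

With the smallest rate the iterate has barely moved from its starting point, with the middle rate it is essentially at the minimum, and with the largest rate each step overshoots further than the last, so the iterate diverges.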
By using a sparse sample of parameters, we can zero in on the most promising sets of parameters.
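The text does not say how the sparse sample was drawn. One common choice, shown here purely as an assumption, is random sampling with the learning rate drawn log-uniformly so that each order of magnitude is equally likely. The ranges, the candidate layer sizes, and the helper name `sample_settings` are all hypothetical.

```python
import random


def sample_settings(num_samples, seed=0):
    """Draw a sparse random sample of hyperparameter settings."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    settings = []
    for _ in range(num_samples):
        settings.append({
            # log-uniform between 1e-4 and 1: each decade is equally likely
            "learning_rate": 10 ** rng.uniform(-4, 0),
            "num_neurons": rng.choice([50, 100, 200, 500, 1000]),
        })
    return settings


candidates = sample_settings(13)  # e.g. one candidate per cluster node
for setting in candidates:
    print(setting)
```

After a first round, the ranges can be narrowed around the best candidates and the sampling repeated.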
How do I use it?
Since TensorFlow can use all the cores on each worker, we only run one task at a time on each worker, and we batch them together to limit contention. The TensorFlow library can be installed on Spark clusters as a regular Python library, following the instructions on the TensorFlow website.
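One possible way to get one task per worker is to make every task reserve all the cores of its executor. The sketch below assumes one executor per worker and 8 cores per machine (a hypothetical number). `spark.executor.cores` and `spark.task.cpus` are standard Spark properties; the helper `to_submit_flags` is hypothetical and simply renders them as `--conf` arguments for `spark-submit`.

```python
CORES_PER_WORKER = 8  # assumption: adjust to the actual machines

# With task.cpus equal to executor.cores, an executor has room for
# exactly one running task at a time.
spark_settings = {
    "spark.executor.cores": str(CORES_PER_WORKER),
    "spark.task.cpus": str(CORES_PER_WORKER),
}


def to_submit_flags(settings):
    """Render settings as `--conf key=value` arguments for spark-submit."""
    flags = []
    for key, value in sorted(settings.items()):
        flags.extend(["--conf", f"{key}={value}"])
    return flags


print(" ".join(to_submit_flags(spark_settings)))
```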
Deploying Models at Scale
TensorFlow models can be embedded directly within pipelines to perform complex recognition tasks on datasets. The model is first distributed to the workers of the cluster, using Spark’s built-in broadcasting mechanism. Then this model is loaded on each node and applied to images.
To illustrate, we will use an example taken from Databricks.
A short tutorial on how to use Spark in deep learning:
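The broadcast-then-apply pattern can be sketched as follows. This is a minimal sketch under stated assumptions: the "model" is a hypothetical toy classifier rather than a TensorFlow network, and a `SimpleNamespace` stands in for a Spark broadcast variable so the code runs without a cluster. The commented lines at the end show how the same function would be used with `sc.broadcast` and `mapPartitions`, assuming a SparkContext `sc` and an RDD `images` already exist.

```python
from types import SimpleNamespace


def make_partition_scorer(broadcast_model):
    """Build a function suitable for `rdd.mapPartitions`."""
    def score_partition(rows):
        model = broadcast_model.value   # fetched once per partition
        for row in rows:
            yield model(row)
    return score_partition


def toy_model(pixel_mean):
    """Hypothetical model: label an image by its mean brightness."""
    return "bright" if pixel_mean > 0.5 else "dark"


# Local stand-in for a Spark broadcast variable: anything with `.value`.
fake_broadcast = SimpleNamespace(value=toy_model)

scorer = make_partition_scorer(fake_broadcast)
labels = list(scorer([0.1, 0.9, 0.7]))
print(labels)  # → ['dark', 'bright', 'bright']

# On a cluster (assuming an existing SparkContext `sc` and RDD `images`):
#   model_bc = sc.broadcast(toy_model)
#   predictions = images.mapPartitions(make_partition_scorer(model_bc))
```

Reading the broadcast value once per partition, rather than once per row, is what keeps the cost of loading the model small compared to the cost of scoring.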
Working with images in Spark
The first step in applying deep learning to images is the ability to load the images. Spark and Deep Learning Pipelines include utility functions that can load millions of images into a Spark DataFrame and decode them automatically in a distributed fashion, allowing manipulation at scale. Using Spark’s ImageSchema:

    from pyspark.ml.image import ImageSchema

    image_df = ImageSchema.readImages("/data/myimages")
    image_df.show()

The resulting DataFrame contains a single column named “image”, which holds an image struct whose schema matches ImageSchema.
Applying deep learning models at scale
    from pyspark.ml.image import ImageSchema
    from sparkdl import DeepImagePredictor

    image_df = ImageSchema.readImages(sample_img_dir)

    predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels",
                                   modelName="InceptionV3", decodePredictions=True,
                                   topK=10)
    predictions_df = predictor.transform(image_df)
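To make the `topK` argument concrete, here is what "keep the top K predictions" means in general. This is not sparkdl's internal implementation; the helper `top_k`, the class names, and the probabilities are made up for illustration.

```python
def top_k(probabilities, class_names, k):
    """Return the k (class_name, probability) pairs with highest probability."""
    ranked = sorted(zip(class_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


probs = [0.05, 0.70, 0.20, 0.05]
names = ["cat", "daisy", "tulip", "dog"]
print(top_k(probs, names, k=2))  # → [('daisy', 0.7), ('tulip', 0.2)]
```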
Use TensorFlow:
Deep Learning Pipelines provides an MLlib Transformer that will apply the given TensorFlow Graph to a DataFrame containing a column of images (e.g. loaded using the utilities described in the previous section). Here is a very simple example of how a TensorFlow Graph can be used with the Transformer. In practice, the TensorFlow Graph will likely be restored from files before calling the Transformer.
    from pyspark.ml.image import ImageSchema
    from sparkdl import TFImageTransformer
    import sparkdl.graph.utils as tfx  # strip_and_freeze_until was moved from sparkdl.transformers to sparkdl.graph.utils in 0.2.0
    from sparkdl.transformers import utils
    import tensorflow as tf

    graph = tf.Graph()
    with tf.Session(graph=graph) as sess:
        image_arr = utils.imageInputPlaceholder()
        resized_images = tf.image.resize_images(image_arr, (299, 299))
        # the following step is not necessary for this graph, but can be for graphs with variables, etc
        frozen_graph = tfx.strip_and_freeze_until([resized_images], graph, sess,
                                                  return_graph=True)

    transformer = TFImageTransformer(inputCol="image", outputCol="predictions",
                                     graph=frozen_graph, inputTensor=image_arr,
                                     outputTensor=resized_images, outputMode="image")

    image_df = ImageSchema.readImages(sample_img_dir)
    processed_image_df = transformer.transform(image_df)