Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Security in Spark is OFF by default. This could mean you are vulnerable to attack by default. Spark uses Hadoop’s client libraries for HDFS and YARN. Users can also download a “Hadoop free” binary and run Spark with any Hadoop version by augmenting Spark’s classpath. Scala and Java users can include Spark in their projects using its Maven coordinates and in the future Python users can also install Spark from PyPI.
Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It’s easy to run locally on one machine — all you need is to have
java installed on your system
PATH, or the
JAVA_HOME environment variable pointing to a Java installation.
Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.4.4 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).
Path on a Cluster
The Spark cluster mode overview explains the key concepts in running on a cluster. Spark can run both by itself, or over several existing cluster managers. It currently provides several options for deployment:
- Standalone Deploy Mode: simplest way to deploy Spark on a private cluster
- Apache Mesos
- Hadoop YARN
Spark applications run as independent sets of processes on a cluster, coordinated by the
SparkContext object in your main program (called the driver program).
step 1: Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications.
step 2: Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.
Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
useful things to note about this architecture:
Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
spark-submit script in Spark’s
bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application especially for each one.
Bundling Your Application’s Dependencies
If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as
provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the
bin/spark-submit script as shown here while passing your jar.
For Python, you can use the
--py-files argument of
spark-submit to add
.egg files to be distributed with your application. If you depend on multiple Python files use
Launching Applications with spark-submit
Once a user application is bundled, it can be launched using the
bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports:
./bin/spark-submit \ --class <main-class> \ --master <master-url> \ --deploy-mode <deploy-mode> \ --conf <key>=<value> \ ... # other options <application-jar> \ [application-arguments]
Some of the commonly used options are:
--class: The entry point for your application (e.g.
--master: The master URL for the cluster (e.g.
--deploy-mode: Whether to deploy your driver on the worker nodes (
cluster) or locally as an external client (
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an
hdfs://path or a
file://path that is present on all nodes.
# Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master local \ /path/to/examples.jar \ 100 # Run on a Spark standalone cluster in client deploy mode ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master spark://220.127.116.11:7077 \ --executor-memory 20G \ --total-executor-cores 100 \ /path/to/examples.jar \ 1000 # Run on a Spark standalone cluster in cluster deploy mode with supervise ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master spark://18.104.22.168:7077 \ --deploy-mode cluster \ --supervise \ --executor-memory 20G \ --total-executor-cores 100 \ /path/to/examples.jar \ 1000 # Run on a YARN cluster export HADOOP_CONF_DIR=XXX ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master yarn \ --deploy-mode cluster \ # can be client for client mode --executor-memory 20G \ --num-executors 50 \ /path/to/examples.jar \ 1000 # Run a Python application on a Spark standalone cluster ./bin/spark-submit \ --master spark://22.214.171.124:7077 \ examples/src/main/python/pi.py \ 1000 # Run on a Mesos cluster in cluster deploy mode with supervise ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master mesos://126.96.36.199:7077 \ --deploy-mode cluster \ --supervise \ --executor-memory 20G \ --total-executor-cores 100 \ http://path/to/examples.jar \ 1000 # Run on a Kubernetes cluster in cluster deploy mode ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master k8s://xx.yy.zz.ww:443 \ --deploy-mode cluster \ --executor-memory 20G \ --num-executors 50 \ http://path/to/examples.jar \ 1000
Each driver program has a web UI, typically on port 4040, that displays information about running tasks, executors, and storage usage. Simply go to
http://<driver-node>:4040 in a web browser to access this UI.
Spark gives control over resource allocation both across applications (at the level of the cluster manager) and within applications (if multiple computations are happening on the same SparkContext).
|Application||User program built on Spark. Consists of a driver program and executors on the cluster.|
|Application jar||A jar containing the user’s Spark application. In some cases users will want to create an “uber jar” containing their application along with its dependencies. The user’s jar should never include Hadoop or Spark libraries, however, these will be added at runtime.|
|Driver program||The process running the main() function of the application and creating the SparkContext|
|Cluster manager||An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)|
|Deploy mode||Distinguishes where the driver process runs. In “cluster” mode, the framework launches the driver inside of the cluster. In “client” mode, the submitter launches the driver outside of the cluster.|
|Worker node||Any node that can run application code in the cluster|
|Executor||A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.|
|Task||A unit of work that will be sent to one executor|
|Job||A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. |
|Stage||Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you’ll see this term used in the driver’s logs.|