.NET Framework and Apache Spark

Why choose .NET for Apache Spark?

.NET for Apache Spark empowers developers with .NET experience or code bases to participate in the world of big data analytics. .NET for Apache Spark provides high-performance APIs for using Spark from C# and F#. With C# and F#, you can access:

  • DataFrame and SparkSQL for working with structured data.
  • Spark Structured Streaming for working with streaming data.
  • Spark SQL for writing queries with SQL syntax.
  • Machine learning integration for faster training and prediction (that is, use .NET for Apache Spark alongside ML.NET).

.NET for Apache Spark is compliant with .NET Standard, a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code, allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.

.NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core. It also runs on Windows using .NET Framework. You can deploy your applications to all major cloud providers including Azure HDInsight Spark, Amazon EMR Spark, Azure Databricks, and Databricks on AWS.

.NET for Apache Spark architecture

The C#/F# language binding to Spark is written on a new Spark interop layer that offers easier extensibility. This interop layer was written using best practices for language extension and is optimized for interop and performance. Long term, this extensibility can be used to add support for other languages in Spark.


.NET for Apache Spark performance

When compared against Python and Scala using the TPC-H benchmark, .NET for Apache Spark performs well in most cases and is 2x faster than Python when user-defined function performance is critical. There is an ongoing effort to improve and benchmark performance.

Get started with .NET for Apache Spark

This tutorial teaches you how to run a .NET for Apache Spark app using .NET Core on Windows.

In this tutorial, you learn how to:

  • Prepare your Windows environment for .NET for Apache Spark
  • Write your first .NET for Apache Spark application
  • Build and run your simple .NET for Apache Spark application

Prepare your environment

Before you begin writing your app, you need to set up some prerequisite dependencies. If you can run dotnet, java, mvn, and spark-shell from your command line environment, then your environment is already prepared and you can skip to the next section. If you cannot run any or all of these commands, complete the following steps.
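If you'd rather script this check, here is a small illustrative sketch in Python (not part of the tutorial's toolchain) that reports which of the four tools are missing from your PATH:

```python
import shutil

def missing_tools(commands):
    """Return the subset of commands that are not found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]

# The four tools this tutorial expects on PATH.
required = ["dotnet", "java", "mvn", "spark-shell"]

if __name__ == "__main__":
    not_found = missing_tools(required)
    if not_found:
        print("Missing prerequisites:", ", ".join(not_found))
    else:
        print("All prerequisites found; skip to the next section.")
```

If any tool is reported missing, the corresponding numbered step below installs it.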

1. Install .NET

To start building .NET apps, you need to download and install the .NET SDK (Software Development Kit).

Download and install the .NET Core SDK. Installing the SDK adds the dotnet toolchain to your PATH.

Once you’ve installed the .NET Core SDK, open a new command prompt and run dotnet.

If the command runs and prints out information about how to use dotnet, you can move to the next step. If you receive a 'dotnet' is not recognized as an internal or external command error, make sure you opened a new command prompt before running the command.

2. Install Java

Install Java 8.

Select the appropriate version for your operating system. For example, select jdk-8u201-windows-x64.exe for a Windows x64 machine. Then, run the java -version command to verify the installation.


3. Install 7-zip

Apache Spark is downloaded as a compressed .tgz file. Use an extraction program, like 7-zip, to extract the file.

  • Visit 7-Zip downloads.
  • In the first table on the page, select the 32-bit x86 or 64-bit x64 download, depending on your operating system.
  • When the download completes, run the installer.

4. Install Apache Spark

Download and install Apache Spark. You'll need to select version 2.3.*, 2.4.0, 2.4.1, 2.4.3, or 2.4.4 (.NET for Apache Spark is not compatible with other versions of Apache Spark).

The commands used in the following steps assume you have downloaded and installed Apache Spark 2.4.1. If you wish to use a different version, replace 2.4.1 with the appropriate version number. Then, extract the .tar file and the Apache Spark files.

To extract the nested .tar file:

  • Locate the spark-2.4.1-bin-hadoop2.7.tgz file that you downloaded.
  • Right click on the file and select 7-Zip -> Extract here.
  • spark-2.4.1-bin-hadoop2.7.tar is created alongside the .tgz file you downloaded.

To extract the Apache Spark files:

  • Right click on spark-2.4.1-bin-hadoop2.7.tar and select 7-Zip -> Extract files…
  • Enter C:\bin in the Extract to field.
  • Uncheck the checkbox below the Extract to field.
  • Select OK.
  • The Apache Spark files are extracted to C:\bin\spark-2.4.1-bin-hadoop2.7\

Run the following commands to set the environment variables used to locate Apache Spark:

setx HADOOP_HOME C:\bin\spark-2.4.1-bin-hadoop2.7\
setx SPARK_HOME C:\bin\spark-2.4.1-bin-hadoop2.7\

Once you’ve installed everything and set your environment variables, open a new command prompt and run the following command:

%SPARK_HOME%\bin\spark-submit --version

If the command runs and prints version information, you can move to the next step.

If you receive a 'spark-submit' is not recognized as an internal or external command error, make sure you opened a new command prompt.

5. Install .NET for Apache Spark

Download the Microsoft.Spark.Worker release from the .NET for Apache Spark GitHub. For example, if you're on a Windows machine and plan to use .NET Core, download the Windows x64 netcoreapp2.1 release.

To extract the Microsoft.Spark.Worker:

  • Locate the Microsoft.Spark.Worker.netcoreapp2.1.win-x64-0.6.0.zip file that you downloaded.
  • Right click and select 7-Zip -> Extract files….
  • Enter C:\bin in the Extract to field.
  • Uncheck the checkbox below the Extract to field.
  • Select OK.

6. Install WinUtils

.NET for Apache Spark requires WinUtils to be installed alongside Apache Spark. Download winutils.exe. Then, copy WinUtils into C:\bin\spark-2.4.1-bin-hadoop2.7\bin.

 Note

If you are using a different version of Hadoop, which is annotated at the end of your Spark install folder name, select the version of WinUtils that’s compatible with your version of Hadoop.
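For example, the Hadoop version is the suffix after "hadoop" in the install folder name. A quick illustrative sketch (plain Python, not part of the tutorial) that reads it off:

```python
# The Spark install folder name ends with the bundled Hadoop version.
folder = "spark-2.4.1-bin-hadoop2.7"

# Everything after the final "hadoop" is the Hadoop version string.
hadoop_version = folder.rsplit("hadoop", 1)[-1]
print(hadoop_version)  # 2.7
```

Match that version string against the WinUtils builds available for download.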

7. Set DOTNET_WORKER_DIR and check dependencies

Run the following command to set the DOTNET_WORKER_DIR environment variable, which is used by .NET apps to locate .NET for Apache Spark:

setx DOTNET_WORKER_DIR "C:\bin\Microsoft.Spark.Worker-0.6.0"

Finally, double-check that you can run dotnet, java, mvn, and spark-shell from your command line before you move to the next section.

Write a .NET for Apache Spark app

1. Create a console app

In your command prompt, run the following commands to create a new console application:

dotnet new console -o mySparkApp
cd mySparkApp

The dotnet command creates a new application of type console for you. The -o parameter creates a directory named mySparkApp where your app is stored and populates it with the required files. The cd mySparkApp command changes the directory to the app directory you just created.

2. Install NuGet package

To use .NET for Apache Spark in an app, install the Microsoft.Spark package. In your command prompt, run the following command:

dotnet add package Microsoft.Spark --version 0.6.0

3. Code your app

Open Program.cs in Visual Studio Code, or any text editor, and replace all of the code with the following:

using Microsoft.Spark.Sql;

namespace MySparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a Spark session.
            var spark = SparkSession
                .Builder()
                .AppName("word_count_sample")
                .GetOrCreate();

            // Create initial DataFrame.
            DataFrame dataFrame = spark.Read().Text("input.txt");

            // Count words.
            var words = dataFrame
                .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                .Select(Functions.Explode(Functions.Col("words")).Alias("word"))
                .GroupBy("word")
                .Count()
                .OrderBy(Functions.Col("count").Desc());

            // Show results.
            words.Show();

            // Stop Spark session.
            spark.Stop();
        }
    }
}

4. Add data file

Your app processes a file containing lines of text. Create an input.txt file in your mySparkApp directory, containing the following text:

Hello World
This .NET app uses .NET for Apache Spark
This .NET app counts words with Apache Spark
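The pipeline in Program.cs splits each line of input.txt on spaces, explodes the words into one row each, groups by word, counts, and sorts by count descending. The same logic, sketched in plain Python (illustrative only; this is not the .NET for Apache Spark API):

```python
from collections import Counter

def word_count(lines):
    """Mirror the Spark pipeline: split each line on spaces, explode into
    one row per word, group by word, count, and sort by count descending."""
    counts = Counter()
    for line in lines:
        counts.update(line.split(" "))
    # OrderBy(count desc); rows that tie on count come back in an
    # unspecified order, just as in Spark.
    return sorted(counts.items(), key=lambda kv: -kv[1])

lines = [
    "Hello World",
    "This .NET app uses .NET for Apache Spark",
    "This .NET app counts words with Apache Spark",
]
print(word_count(lines))  # ".NET" appears 3 times and sorts first
```

With this input, ".NET" tops the count at 3, followed by "This", "app", "Apache", and "Spark" at 2 each.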

Run your .NET for Apache Spark app

  1. Run the following command to build your application:

     dotnet build

  2. Run the following command to submit your application to run on Apache Spark:

     %SPARK_HOME%\bin\spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin\Debug\netcoreapp3.0\microsoft-spark-2.4.x-0.6.0.jar dotnet bin\Debug\netcoreapp3.0\mySparkApp.dll

  3. When your app runs, the word count data of the input.txt file is written to the console.

Congratulations! You successfully authored and ran a .NET for Apache Spark app.
