Now that we've written a basic word count program using Apache Spark™ APIs, let's package the application and deploy it to the cluster that we created in the first step.
Hey! Hold on. What's the need for packaging?
Good question! The main reason is that your application is going to be run
at a specific interval by a scheduler like Apache Airflow, rather than
interactively as you just ran it in the REPL. For this purpose, Apache Spark™
exposes a command,
spark-submit, that allows us to run the application from the
command line and schedule it using a cron job or some other scheduler.
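To get a feel for the command, here is its general shape (the application file name is illustrative):

```shell
# General form: spark-submit [options] <app jar | python file> [app args]
spark-submit --master "local[*]" word_count.py

# List every option spark-submit accepts:
spark-submit --help
```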
If you want to list all the available Spark commands, type
spark and press the Tab key.
Before we package the application, let's understand the frequently used options
spark-submit provides to configure the deployment.
- --name : Provide the name of the Spark application.
- --master : Specifies the cluster manager to use when submitting the Spark application. A cluster manager is responsible for distributing the job across multiple workers in the cluster. Specifying
local will result in a single executor running in the same JVM as the driver.
For this example, we will use
local, as this doesn't require any Spark cluster running on the desktop.
Whoa, whoa, whoa! Now what are workers, executors, and drivers?
Spark, when running in Standalone mode, manages the cluster resources using a master-worker architecture. Each worker is responsible for taking jobs from the master and scheduling them on that node using executors. Each executor is a separate JVM process that is assigned specific heap memory and cores when it starts.
The driver is the process responsible for interacting
with the cluster manager through the SparkContext object. Once you've created a
SparkSession in the program, you can connect to data sources and implement
your data pipeline. Internally, it uses the SparkContext to allocate
resources from the configured cluster manager to run the job. When using
local as the Spark master, the driver also acts as the executor responsible
for running the data pipeline. This is ideal for quick testing with a small dataset.
Now let's get back to the other spark-submit options.
- --py-files : You can include multiple Python files (comma separated) when submitting the application.
- --conf : This allows you to configure the runtime of your Spark application,
which we will look at in greater detail tomorrow. A quick example is
--conf spark.executor.memory=2g, which instructs the cluster manager to allocate 2GB of heap memory per executor.
- --class : Specifies the entry point of the application in the jar.
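Putting the options above together, a submission might look like this (the file names and the memory value are illustrative):

```shell
spark-submit \
  --name word-count \
  --master local \
  --conf spark.executor.memory=2g \
  --py-files helpers.py \
  word_count.py
```

For a jar, you would pass --class with the fully qualified entry point instead of --py-files.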
For this guide, we'll go with VS Code, although there are other alternatives like PyCharm. VS Code supports multiple languages through plugins. Refer to the official guide to install and set up the Python plugin.
Once you've installed the plugin, create a project with the directory structure
spark-samples/src/main/py and paste the following into a requirements.txt file.
Assuming you're in the directory
spark-samples, create a virtual environment
and activate it.
You can now install the above dependencies.
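The steps above can be sketched as follows (run from inside the spark-samples directory; the activation path differs on Windows):

```shell
python3 -m venv venv               # create the virtual environment
source venv/bin/activate           # activate it (Windows: venv\Scripts\activate)
pip install -r requirements.txt    # install the dependencies listed above
```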
Make sure you've selected the correct interpreter in VS Code: press
CTRL+SHIFT+P on Windows or
CMD+SHIFT+P on Mac and choose the interpreter
located in venv.
Follow the steps below to start writing and debugging code from your IDE.
- Download and install IntelliJ
- Set up the Scala Plugin
- The next step depends on which build tool you're more familiar with.
Maven is a commonly used build tool for Java and other JVM-language projects. To set up Maven, download and install it. Once the installation is complete, you can create a Maven project using this guide.
Now, paste the following content into your pom.xml file to configure the project.
SBT is a build tool for Scala projects, though you can also use it for Java projects. To set up SBT, follow the steps outlined in the official guide and create a simple SBT project.
In the build.sbt file, paste the following content to configure the project.
Now, we're ready to implement the program, which reads a text file and collects a list of words that are longer than two characters.
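For the Python version, a minimal sketch of such a program might look like this (the input and output paths are assumptions; adapt them to your project layout):

```python
from pyspark.sql import SparkSession

# With --master local, everything below runs in a single JVM.
spark = SparkSession.builder.appName("word-count").getOrCreate()

# Read the text file, split each line into words, and keep only
# words longer than two characters.
words = (
    spark.read.text("input/words.txt").rdd   # assumed input path
    .flatMap(lambda row: row.value.split())
    .filter(lambda word: len(word) > 2)
)

# Write the result to the output directory.
words.saveAsTextFile("output/words")          # assumed output path

spark.stop()
```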
As a last step, let's package the application using the respective build tools.
Python doesn't require packaging in this case, given it's a single file. Let's deploy it to the cluster.
We can use either SBT or Maven to build the project. In this case, let's go ahead with SBT.
The above command should produce a jar named
spark-samples.jar in the target directory. Let's deploy this jar locally.
Go to the project folder
/path/to/spark-samples and run the following command
to build the project.
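Assuming you went with SBT as suggested above, the build step would look like this:

```shell
cd /path/to/spark-samples
# `package` compiles the sources and produces the jar under target/
sbt package
```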
Deploy the application to the cluster locally
Run the following command to deploy the application to the cluster.
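A local deployment of the packaged jar could look like this (the class name is an assumption; use your application's entry point):

```shell
spark-submit \
  --name word-count \
  --master local \
  --class com.example.WordCount \
  target/spark-samples.jar
```

For the Python version, pass the .py file directly in place of the jar and drop the --class option.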
Today we've looked at how to develop, package, and deploy the program in Java, Scala, and Python. If everything works, you should be able to view the list of words generated in the output directory specified in the program.
The entire source code is available on GitHub.
For any queries or issues that you face, feel free to discuss in the Slack workspace.
Tomorrow, we'll be exploring how to deploy the Spark cluster in
and customize the runtime with different configuration options available.