Day 03 - Package your application
Now that we've written a basic word count program using the Apache Spark™ APIs, let's package the application and deploy it to the cluster that we created in the first step.
Hey! Hold on. What's the need for packaging?
Good question! The main reason is that your application will be run at a specific interval defined by a scheduler like Apache Airflow, not interactively as you just ran it in the REPL. For this purpose, Apache Spark™ exposes a command, `spark-submit`, that allows us to run the application from the command line and schedule it using a cron job or some other scheduler.
If you want to list all the available Spark commands, type `spark` and press the Tab key.
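To give a feel for it, a minimal invocation might look like the sketch below; the script name and the paths in the cron example are placeholders.

```bash
# A minimal sketch of a spark-submit invocation; word_count.py is a placeholder script
spark-submit word_count.py

# The same invocation could be scheduled from cron, e.g. every night at 02:00
# (both paths are placeholders):
# 0 2 * * * /path/to/spark/bin/spark-submit /path/to/word_count.py
```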
# spark-submit command options
Before we package the application, let's understand the frequently used options that `spark-submit` provides to configure the deployment.
- `--name`: Provides the name of the Spark application.
- `--master`: Specifies the cluster manager used when submitting the Spark application. A cluster manager is responsible for distributing the job across the workers in the cluster. Specifying `local` results in a single executor running in the same JVM as the driver. For this example, we will use `local`, as it doesn't require a Spark cluster running on the desktop.
Whoa, whoa, whoa! Now what are workers, executors and drivers?
Spark, when running in standalone mode, manages cluster resources using a master-slave architecture. Each slave, or worker, is responsible for taking jobs from the master and scheduling them on that node using executors. Each executor is a separate JVM process that is assigned specific heap memory and cores when it starts.
The driver is the process responsible for interacting with the cluster manager through the SparkContext object. Once you've created a SparkSession in the program, you can connect to data sources and implement your data pipeline. Internally, it uses the SparkContext to allocate resources through the configured cluster manager to run the job. When using `local` as the Spark master, the driver also acts as the executor responsible for running the data pipeline. This is ideal for quick testing with small datasets.
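As a quick, minimal sketch (not part of the word count program itself), this is how a SparkSession is created against the `local` master in PySpark; the application name is arbitrary.

```python
from pyspark.sql import SparkSession

# Build a SparkSession that runs driver and executor in a single local JVM
spark = (
    SparkSession.builder
    .appName("word-count")   # arbitrary application name
    .master("local")         # single executor in the same JVM as the driver
    .getOrCreate()
)

print(spark.sparkContext.master)  # prints: local
spark.stop()
```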
Now let's get back to the other `spark-submit` options (a combined example follows the list):
- `--py-files`: Lets you include additional Python files, comma separated, when submitting the application.
- `--conf`: Allows you to configure the runtime of your Spark application, which we'll look at in greater detail tomorrow. A quick example is `--conf spark.executor.memory=2g`, which instructs the cluster manager to allocate 2 GB of heap memory to each executor.
- `--class`: Specifies the entry point of the application in the jar.
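Putting these options together, a submission might look like the sketch below; the script name, helper file and memory value are placeholders.

```bash
# Illustrative sketch only; file names and values are placeholders
spark-submit \
  --name word-count \
  --master local \
  --conf spark.executor.memory=2g \
  --py-files helpers.py \
  word_count.py
```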
# Setup development environment for Python
For this guide, we'll go with VS Code, although there are other alternatives like PyCharm. VS Code supports multiple languages through plugins. Refer to the official guide to set up VS Code and install the Python plugin.
Once you've installed the plugin, create a project with the directory structure spark-samples/src/main/py and paste the following into a requirement.txt file.
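As a rough sketch, a minimal requirement.txt would typically just list pyspark, ideally pinned to the Spark version you installed earlier in this series:

```text
# requirement.txt -- minimal sketch; pin pyspark to your installed Spark version
pyspark
```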
Assuming you're in the spark-samples directory, create a virtual environment and activate it.
You can now install the above dependencies.
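On Linux or macOS, the whole setup might look like the following sketch, assuming python3 and pip are on your PATH:

```bash
# Create and activate a virtual environment inside spark-samples,
# then install the dependencies listed in requirement.txt
python3 -m venv venv
source venv/bin/activate        # on Windows: venv\Scripts\activate
pip install -r requirement.txt
```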
Make sure you've selected the correct interpreter in VS Code by pressing CTRL+SHIFT+P on Windows or CMD+SHIFT+P on Mac and choosing the interpreter located in the venv.
# Setup development environment for Scala or Java
Follow the steps below to start writing and debugging code from your IDE.
- Download and install IntelliJ
- Set up the Scala Plugin
- The next step depends on which build tool you're more familiar with.
- Maven
- SBT
Maven is a commonly used build tool for Java and other JVM-language projects. To set up Maven, download and install it. Once the installation is done, you can create a Maven project using this guide.
Now, paste the following content into your pom.xml.
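As a rough sketch, a minimal pom.xml with the Spark SQL dependency might look like the following; the group/artifact ids and the Spark and Scala versions are placeholders to match to your installation.

```xml
<!-- Minimal sketch of a pom.xml; ids and versions below are placeholders -->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.example</groupId>
  <artifactId>spark-samples</artifactId>
  <version>1.0</version>

  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>

  <dependencies>
    <!-- Match the Scala suffix (_2.12) and Spark version to your installation -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>3.1.2</version>
    </dependency>
  </dependencies>
</project>
```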
SBT is a build tool for Scala projects, though you can also use it for Java projects. To set up SBT, follow the steps outlined in the official guide and create a simple SBT project.
In your build.sbt
file, paste the following content to configure the project.
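As a rough sketch, a minimal build.sbt might look like this; the project name and versions are placeholders to align with your Spark installation.

```scala
// Minimal sketch of a build.sbt; name and versions are placeholders
name := "spark-samples"
version := "1.0"
scalaVersion := "2.12.15"

// Match the Spark version to your installation
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2"
```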
# Complete program
Now we're ready to implement the program, which reads a text file and collects a list of words that are longer than two characters. A sketch of the Python version follows the list of languages below.
- Python
- Scala
- Java
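As an illustration, a Python version might look like the sketch below; the input path and output directory are passed as command-line arguments and are placeholders.

```python
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Placeholder arguments: path to the input text file and the output directory
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("word-count").getOrCreate()

    # Split each line into words and keep only words longer than two characters
    words = (
        spark.read.text(input_path)
        .rdd.flatMap(lambda row: row.value.split(" "))
        .filter(lambda word: len(word) > 2)
    )

    # Write the resulting list of words to the output directory
    words.saveAsTextFile(output_path)
    spark.stop()
```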
# Package and deploy the application
As a last step, let's package the application using the respective build tools.
- Python
- Scala
- Java
The Python application doesn't need to be packaged in this case, given it's a single file. Let's deploy it to the cluster.
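For example, a hedged sketch of the deployment command; the script location and the data paths are placeholders.

```bash
# Sketch only: adjust the script path, input file and output directory
spark-submit \
  --name word-count \
  --master local \
  src/main/py/word_count.py \
  /path/to/input.txt /path/to/output
```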
We can either use SBT or Maven for building the project. In this case, let's go ahead with SBT.
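A typical invocation from the project root would look something like this; the exact jar name and its location under target/ depend on your build.sbt settings.

```bash
# Build the project; the jar lands under the target directory
sbt clean package
```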
The above command should produce a jar named spark-samples.jar in the target directory. Let's deploy this jar locally (see the spark-submit sketch at the end of this section).
Go to the project folder /path/to/spark-samples and run the following command to build the project using mvn.
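A typical invocation would be the following; the resulting jar is named after the artifactId and version configured in your pom.xml.

```bash
# Build the project; the jar lands under the target directory
mvn clean package
```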
Finally, deploy the application to the local cluster by running the following command.
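As a sketch, the command might look like this; the main class name is hypothetical, and the jar path depends on how your build tool names the artifact.

```bash
# Sketch only: adjust the class name, jar path, input file and output directory
spark-submit \
  --name word-count \
  --master local \
  --class com.example.WordCount \
  target/spark-samples.jar \
  /path/to/input.txt /path/to/output
```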
# Summary
Today, we've looked at how to develop, package and deploy the program in Java, Scala and Python. If everything works, you should be able to view the list of words generated in the output directory specified in the program.
The entire source code is available on GitHub.
For any queries or issues that you face, feel free to discuss in the Slack workspace.
Tomorrow, we'll explore how to deploy the application to a Spark cluster in cluster mode and customize the runtime with the different configuration options available.