Day 04 - Configure the runtime of a Spark application
Yesterday we packaged our application that generates and writes words longer than two characters. This app was then deployed locally, where the driver runs the job with only a single executor.
# Today's Objective

Today we'll learn how to leverage the power of distributed computing by using the cluster mode when running the application on a Spark Standalone cluster.
# Spark Overall Architecture

Although we've written only a basic program, it's important to understand how it gets executed by looking at the overall architecture of Spark.
Spark consists of workers, and each worker runs 'N' executors. Every time we submit a Spark application, a driver program initializes a session, called a SparkSession, which gives us access to the SparkContext object. This object can be leveraged to upload and distribute jars, interact with the cluster manager to allocate resources to the job, and receive events as the job progresses.
# Cluster Manager

Apache Spark™ supports different types of cluster managers - Standalone, YARN, Mesos and Kubernetes. Each cluster manager is responsible for allocating resources to the job.
Today we'll be using the Spark Standalone cluster manager to submit the job. In order to do that, let's start the Spark cluster through the Gigahex Dashboard by following the steps below.
- Select the spark cluster from the list
- Click on the start button and wait for the cluster to start
- Now submit the application jar using the command sketched just after this list, which restricts the job to a single executor core.
- You can view the list of Spark applications in the cluster history section, as shown below.
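Below is a minimal sketch of what the spark-submit command could look like for the Scala and Python versions of the application. The master URL, main class and artifact names are placeholders, so adjust them to your own build; one way to cap the job at a single core on a standalone cluster is the --total-executor-cores flag, which is why only one executor shows up.

```bash
# Hypothetical submission of the Scala build (class and jar names are placeholders)
spark-submit \
  --master spark://localhost:7077 \
  --total-executor-cores 1 \
  --class com.example.WordsApp \
  target/scala-2.12/words-app.jar

# Hypothetical submission of the Python version (file name is a placeholder)
spark-submit \
  --master spark://localhost:7077 \
  --total-executor-cores 1 \
  words_app.py
```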
Click on the application ID link to open the Spark UI, and click on the Executors tab to view the list of executors as shown below.
As you can see, only one core was allocated to the executor, and the storage memory was 434.4 MiB.
# Spark Memory Management

Every executor is a JVM process that is allocated heap memory. This heap memory is sub-divided as shown in the diagram below.
The storage memory shown, 434.4 MB, amounts to 60% (the default spark.memory.fraction) of the usable memory, i.e. the heap minus the 300 MB of reserved memory. Therefore total memory = 434.4/0.6 + 300 MB (reserved memory) = 724 + 300 = 1024 MB. Hence we can conclude that by default, each executor is assigned 1 GB of heap memory.
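As a quick sanity check, the same arithmetic can be expressed as a shell one-liner (the 0.6 fraction and the 300 MB of reserved memory are Spark's defaults; 434.4 MB is the value read from the Spark UI above):

```bash
# Derive the executor heap from the "Storage Memory" shown in the Spark UI,
# assuming the default spark.memory.fraction of 0.6 and 300 MB of reserved memory.
awk 'BEGIN { storage = 434.4; print storage / 0.6 + 300 " MB" }'   # prints 1024 MB
```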
Let's try changing the executor memory through the spark-submit command.
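Here's a sketch of the same submission with the executor heap lowered to 512 MB using the --executor-memory flag (master URL, class and jar names remain placeholders; the Python variant takes the same flag):

```bash
# Cap each executor's heap at 512 MB while keeping the single-core setup
spark-submit \
  --master spark://localhost:7077 \
  --total-executor-cores 1 \
  --executor-memory 512m \
  --class com.example.WordsApp \
  target/scala-2.12/words-app.jar
```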
With this change in the spark-submit command, let's take another look at the executor's resources.
Now the storage memory for Executor ID 0 has dropped to 127.2 MB. Using the same formula as above, we can work out the memory that was allocated to the executor.
Total memory = 127.2/0.6 + 300 MB (reserved memory) = 212 + 300 = 512 MB
Let's assume that our application needs more memory for execution and storage, so let's bump up spark.memory.fraction to 0.8. Below is the spark-submit command for increasing the memory fraction.
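A sketch of the submission with the memory fraction raised to 0.8, passed via --conf (placeholders as before; the Python variant takes the same options):

```bash
# Keep the 512 MB heap, but give 80% of the usable memory to execution and storage
spark-submit \
  --master spark://localhost:7077 \
  --total-executor-cores 1 \
  --executor-memory 512m \
  --conf spark.memory.fraction=0.8 \
  --class com.example.WordsApp \
  target/scala-2.12/words-app.jar
```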
Let's check the executor storage memory again.
We can see that the storage memory has increased. Let's confirm our assumption using the same formula, this time with the 0.8 fraction.
Total memory = 169.6/0.8 + 300 MB (reserved memory) = 212 + 300 = 512 MB
# Summary

Today we have learnt how to configure the executor's total memory, the storage memory fraction, and the maximum CPU cores allocated to the executor. We have also looked at the Spark UI to better understand how executors are allocated memory. There are many other configurations available, and we'll explore them in the upcoming guide.
The entire source code is available on GitHub.
For any queries or issues that you face, feel free to discuss in the Slack workspace.
Tomorrow, we'll explore the Spark UI in detail, which helps us optimize performance and identify the root cause of issues.