In this tutorial we will develop a Spark listener to capture runtime metrics of Spark jobs, stages, and tasks.
Apache Spark is an analytics engine for processing large-scale data stored in a variety of file systems and databases.
Spark listeners allow you to hook custom code into the various events emitted while a Spark application is running. These events let us capture metrics that can be very helpful for debugging and optimizing code.
- A Spark cluster running on YARN or standalone. To quickly spin up a local cluster, check out this tutorial.
- Basic knowledge of the Scala programming language and Spark APIs.
- Eclipse or IntelliJ IDEA with Scala language support.
- Create a Scala project and add the Spark dependencies to the `build.sbt` file.
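The original dependency snippet is not included here; a minimal `build.sbt` might look like the following. The project name, Scala version, and Spark version are assumptions, so match them to your cluster:

```scala
// build.sbt -- versions are illustrative; use the ones matching your cluster
name := "spark-listener-demo"

scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.0",
  "org.apache.spark" %% "spark-sql"  % "3.5.0"
)
```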
- Next, create a listener class that extends the `SparkListener` abstract class, which in turn implements `SparkListenerInterface`. This class will contain the overridden methods that receive the metrics once a job, stage, task, or the entire application ends.
- We need to initialize some variables that will store the metrics.
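As a sketch of these two steps (the class name and counter names are my own, not from the original post), the listener might start like this. Listener callbacks can fire from scheduler threads, so atomic counters are used rather than plain `var`s:

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}

import org.apache.spark.scheduler.SparkListener

// Illustrative skeleton of a custom listener.
class CustomSparkListener extends SparkListener {
  // Counters that accumulate metrics as events arrive.
  val jobsCompleted: AtomicInteger  = new AtomicInteger(0)
  val stagesCompleted: AtomicInteger = new AtomicInteger(0)
  val tasksCompleted: AtomicInteger  = new AtomicInteger(0)
  val recordsRead: AtomicLong        = new AtomicLong(0L)

  // The overridden event methods are implemented in the next step.
}
```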
- The listener interface lets us hook custom code into specific events. For this basic example, we want to capture metrics once a job, stage, task, or the application completes. The relevant events are:
- Override these methods to capture and log the metrics.
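Putting the pieces together, a full listener might look like the sketch below. The class name, counters, and log messages are my own; the callback signatures (`onJobEnd`, `onStageCompleted`, `onTaskEnd`, `onApplicationEnd`) come from Spark's `SparkListener` API:

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}

import org.apache.spark.scheduler._

// Illustrative implementation of a metrics-capturing listener.
class CustomSparkListener extends SparkListener {
  val jobsCompleted   = new AtomicInteger(0)
  val stagesCompleted = new AtomicInteger(0)
  val tasksCompleted  = new AtomicInteger(0)
  val recordsRead     = new AtomicLong(0L)

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    jobsCompleted.incrementAndGet()
    println(s"Job ${jobEnd.jobId} ended with result ${jobEnd.jobResult}")
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    stagesCompleted.incrementAndGet()
    println(s"Stage ${info.stageId} completed: ${info.numTasks} tasks")
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    tasksCompleted.incrementAndGet()
    // taskMetrics can be null for failed tasks, so guard the access.
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      recordsRead.addAndGet(metrics.inputMetrics.recordsRead)
    }
  }

  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    println(s"Application ended: jobs=${jobsCompleted.get}, " +
      s"stages=${stagesCompleted.get}, tasks=${tasksCompleted.get}, " +
      s"recordsRead=${recordsRead.get}")
  }
}
```

Each callback receives an event object carrying the metrics for that unit of work; here we simply count completions and accumulate records read, but the same hooks can capture timing, shuffle, and memory metrics.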
- Now it's time to use the Spark listener created above in a Spark application. For simplicity, create a basic Spark application that reads a text file and counts the number of lines in it. One way to add your custom listener is to register it programmatically on the `SparkContext`. You can also pass a list of custom listeners via the `spark.extraListeners` configuration, as described in the official Spark configuration documentation.
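A minimal driver program might look like this. It assumes the listener class is named `CustomSparkListener`; the object name, input path, and app name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of a driver that counts lines in a text file.
object LineCountApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("line-count-with-listener")
      .master("local[*]") // for a quick local run; drop this when submitting to a cluster
      .getOrCreate()

    // Register the custom listener programmatically...
    spark.sparkContext.addSparkListener(new CustomSparkListener())
    // ...or pass it via configuration instead, e.g.:
    //   --conf spark.extraListeners=com.example.CustomSparkListener

    val lineCount = spark.read.textFile("input.txt").count()
    println(s"Line count: $lineCount")

    spark.stop()
  }
}
```

Registering via `spark.extraListeners` has the advantage that the listener is attached before any jobs run, so no early events are missed.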
- Let's run the application and check the logs to see the captured metrics.