# Day 02 - Save output to a file
## Objective

Filter all the words that are longer than two characters and write them to a file in a specified directory.
## Read and generate a dataset of words

We will read the file created in the last program and generate a list of the words present in the text file.
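Here is a minimal Scala sketch of this step, assuming a `spark-shell` session (so a `SparkSession` named `spark` is available) and that the previous program wrote its output to `input/words.txt`, a hypothetical path:

```scala
// Assumes spark-shell, where `spark` (a SparkSession) is predefined.
import spark.implicits._

// Read the text file as a Dataset[String] of lines, then split each
// line on whitespace to get a dataset of individual words.
// "input/words.txt" is a hypothetical path for the previous day's output.
val words = spark.read.textFile("input/words.txt")
  .flatMap(line => line.split("\\s+"))
```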
## Filter the words

Now that we have the list of all the words, let's filter it down to the words that are longer than two characters.
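Here is a minimal Scala sketch of this step, reusing the `words` dataset from the previous section, with `output/large-words` as a hypothetical target directory:

```scala
// Keep only the words longer than two characters.
val largeWords = words.filter(word => word.length > 2)

// Write the filtered words as text files into the target directory
// ("output/large-words" is a hypothetical path); Spark creates the
// directory and writes one part file per partition.
largeWords.write.text("output/large-words")
```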
We use the `filter` method defined on the dataset to apply the above constraint. We store the result in another variable, `largeWords`, and call `write`, which materializes the contents of the dataset in the specified directory.
In PySpark, we use the `filter` method defined on the RDD to apply the same constraint. We then convert the RDD to a DataFrame by calling the `map` function with a lambda that transforms each `str` value into a `Row`, which needs to be imported from `pyspark.sql`. We store this DataFrame in another variable, `large_words_df`, and calling its `write` method materializes the contents of the DataFrame in the specified directory.
You can verify the output by going to the directory mentioned above.
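Alternatively, you can spot-check the result from the same shell by reading the directory back; a small sketch, again using the hypothetical path from above:

```scala
// Read the freshly written directory back into a dataset and display
// a few rows to spot-check the output.
spark.read.textFile("output/large-words").show(5)
```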
## Transformations and actions

Spark exposes APIs like `filter`, shown above, to define transformations based on the business requirement. Once an action like `write` is executed, these transformations are converted into a DAG (Directed Acyclic Graph), a representation of how the execution of the entire job will happen. This allows Spark to minimize the data that is read and distributed across nodes during the shuffle stage of a job.
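To see this laziness concretely, here is a small sketch that reuses the `words` dataset from above:

```scala
// Transformations are lazy: filter only records a step in the logical
// plan; no file is read and no data moves yet.
val largeWords = words.filter(word => word.length > 2)

// explain() prints the plan Spark has built so far, without running it.
largeWords.explain()

// An action such as count() (or the write call above) triggers the
// actual execution: Spark turns the plan into a DAG of stages and tasks.
println(largeWords.count())
```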
## Understanding the DAG

Let's view the DAG for the above Spark job.
Navigate to the Gigahex dashboard to view the cluster details. Then open the history tab and click on the application ID. You should see the Spark UI, which shows the executors and a job with the description `text at <console>:27`. Click on the Job ID and you will be taken to the Stages page, where you can view the DAG as shown below.
The above diagram helps you understand how your four lines of code were transformed into a DAG that runs in a single stage. Each Spark application is composed of one or more jobs, each job consists of one or more stages, and each stage consists of multiple tasks. For the program we've written, the application consists of a single job with a single stage.
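If you want a text rendering of the same lineage without opening the UI, one option (a sketch, using the `largeWords` dataset from above) is `toDebugString` on the underlying RDD:

```scala
// toDebugString on the underlying RDD prints a textual rendering of the
// lineage that the Spark UI's DAG visualization is built from.
println(largeWords.rdd.toDebugString)
```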