Filter all the words that are longer than two characters and write them to a file in a specified directory.
We will read the file created in the last program and generate the list of words present in the text file.
Now that we have the list of all the words, let's filter for those words that are longer than two characters.
We use the filter method defined on the Dataset to apply the above condition. We store this DataFrame in another variable, largeWords, and call the write method, which materializes the contents of the DataFrame in the specified directory.
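As a rough sketch of what this filter-then-write pipeline does, here is the same logic in plain Python (the word list and output file name are made up for illustration; the actual program uses the Spark Dataset API and writes partition files):

```python
import os
import tempfile

# Hypothetical word list, standing in for the words read from the previous program's file.
words = ["a", "to", "spark", "is", "fast", "and", "general"]

# Equivalent of the Dataset filter: keep only words longer than two characters.
large_words = [w for w in words if len(w) > 2]

# Equivalent of the write action: materialize the result into a target directory.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "part-00000.txt")
with open(out_path, "w") as f:
    f.write("\n".join(large_words))

print(large_words)  # ['spark', 'fast', 'and', 'general']
```

The key point is the predicate `len(w) > 2`: every word failing it is dropped before anything is written out.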
We use the filter method defined on the PySpark RDD to apply the above condition. We then convert the RDD to a DataFrame by calling the map function and passing a lambda that transforms each str value into a DataFrame Row, which needs to be imported. We store this DataFrame in another variable, large_words_df, and calling the write method materializes the contents of the DataFrame in the specified directory.
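To make the RDD steps concrete, here is a plain-Python sketch of the same chain (in real PySpark this would be `rdd.filter(...).map(lambda w: Row(word=w)).toDF()`; the namedtuple below is only a stand-in for `pyspark.sql.Row`, and the word list is invented for illustration):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which the real program imports from pyspark.sql.
Row = namedtuple("Row", ["word"])

words = ["an", "example", "of", "filtering", "rdd", "words"]

# Equivalent of rdd.filter(lambda w: len(w) > 2)
large_words = filter(lambda w: len(w) > 2, words)

# Equivalent of .map(lambda w: Row(word=w)): each str becomes a Row before toDF()
rows = list(map(lambda w: Row(word=w), large_words))

print(rows)
```

Wrapping each string in a Row gives the records a named column, which is what lets Spark build a DataFrame schema out of a plain RDD of strings.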
You can verify the output by going to the directory mentioned above.
Spark exposes APIs like filter, as shown above, to define transformations based on the business requirement. Once an action like write is executed, these transformations are converted into a DAG (Directed Acyclic Graph). The DAG is a representation of how the execution of the entire job will happen, and it helps minimize the data read and transferred across nodes during the shuffle stage of a job.
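A loose analogy for this lazy execution model is a Python generator expression: like a Spark transformation, it does no work until something consumes it (Spark's DAG scheduler of course goes much further, planning stages and shuffles, but the deferred-until-action behavior is similar):

```python
words = ["a", "to", "spark", "fast"]

# "Transformation": building the generator computes nothing yet.
large = (w for w in words if len(w) > 2)

# "Action": consuming it (like write or collect) triggers the computation.
result = list(large)
print(result)  # ['spark', 'fast']
```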
Let's view the DAG for the above Spark job.
Navigate to the Gigahex dashboard to view the cluster details.
Navigate to the history tab and click on the application ID.
You should be able to view the Spark UI, which shows the executors and a job with the description
text at <console>:27. Click on the Job ID and you will be navigated to the Stages page, where you can view the DAG as shown below.
The above diagram helps you understand how your four lines of code were transformed into a DAG that runs in a single stage. Each Spark application is composed of multiple jobs, each job consists of multiple stages, and each stage consists of multiple tasks. For the program that we've written, the application consists of a single job and a single stage.