Day 01 - Your First Program
# Hello World Program

Let's write our first program, one that counts the number of lines in a file. We need to create a file named hello.in in your current working directory and read it from the Spark (Scala) or PySpark shell. Copy the following text and paste it into the file.
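The original sample text isn't reproduced here; any few lines of plain text will do, for example:

```
Hello World
This is my first Spark program
It counts the lines in this file
```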
Now you can read the file using the Dataset (Scala) or DataFrame (Python) API in Spark, as shown below.
- Scala
- Python
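The snippets for the two versions aren't reproduced above; below is a minimal sketch of each, assuming the file is hello.in in the current working directory and that spark is the SparkSession the shell provides.

```scala
// Scala: read the file as a Dataset[String]; each element is one line of the file.
val text = spark.read.textFile("hello.in")

// count is an action: it triggers Spark jobs that read the file and count its lines.
text.count()
```

```python
# Python: read the file as a DataFrame with a single "value" column; each row is one line.
text = spark.read.text("hello.in")

# count is an action: it triggers Spark jobs that read the file and count its lines.
text.count()
```

Both spark-shell and pyspark create the spark SparkSession for you when they start, so no extra setup is needed.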
Congratulations! You've written your first Spark program.

The above program stores the Dataset in a variable text, and once the action count is called on the Dataset, Spark launches jobs to read and count all the lines in that particular file.
A Dataset is a strongly typed representation of a collection of objects, which can be of any type, like String, Int, Long, or any complex data type. This distributed collection of objects can then be processed using different functional and relational operations like filter, count, dropDuplicates, and many others.
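For example, here is a small sketch of chaining such operations on the text Dataset from above; the particular predicate is just an illustration.

```scala
// Keep only non-empty lines, drop exact duplicate lines, then count what is left.
val nonEmpty = text.filter(line => line.nonEmpty)
val distinctLines = nonEmpty.dropDuplicates()
distinctLines.count()
```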
Congratulations! You've written your first Spark program.

The above program stores the DataFrame in a variable text, and once the action count is called on the DataFrame, Spark launches jobs to read and count all the lines in that particular file.
A DataFrame is a representation of a collection of objects, which can be of any type, like String, Int, Long, or any complex data type. This distributed collection of objects can then be processed using different functional and relational operations like filter, count, dropDuplicates, and many others.
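For example, here is a small sketch of chaining such operations on the text DataFrame from above; it assumes the default value column produced by spark.read.text, and the particular filter is just an illustration.

```python
from pyspark.sql import functions as F

# Keep only non-empty lines, drop exact duplicate rows, then count what is left.
non_empty = text.filter(F.length("value") > 0)
distinct_lines = non_empty.dropDuplicates()
distinct_lines.count()
```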
# Count the words

Now that we've counted the total number of lines, let's count the total number of words in the above file. To do that, we need to split the text on spaces and generate a collection of words. Once we have the collection of words, we can run a count on it to get the total number of words.
- Scala
- Python
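Again, the snippets for the two versions aren't reproduced above; below is a minimal sketch of each, building on the text variable from the previous example.

```scala
// Scala: the implicits provide the Encoder needed to build a Dataset of Strings.
import spark.implicits._

// Split every line on spaces and flatten the result into a Dataset of words.
val words = text.flatMap(line => line.split(" "))
words.count()
```

```python
# Python: convert the DataFrame to an RDD, split every line on spaces,
# and flatten the result into an RDD of words.
words = text.rdd.flatMap(lambda row: row.value.split(" "))
words.count()
```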
Before we start writing the program, a couple of things to keep in mind:

- Every Dataset is strongly typed, which means that every object has a specific type that must be known to the compiler in advance.
- Using implicit conversions, Spark can automatically infer the type through an Encoder. With import spark.implicits._, we get this encoder to work for common data types like String.
- Using flatMap, we transform each line into a collection of words, thereby giving us a Dataset of words.
Before we start writing the program, a couple of things to keep in mind:

- Every DataFrame can be converted into an RDD. An RDD is a representation of the distributed dataset in memory that is resilient, meaning it can sustain failures and be re-computed when required. The RDD is the internal representation used when working with the Dataset API in Scala or the DataFrame API in Python.
- flatMap is a method defined on the RDD that allows us to take a value from the dataset and generate a collection of values. In this case we transform each line into a collection of words, and then call the count() method to get the total number of words.