We will develop a sample Spark application in Scala that reads a JSON file from S3, performs a basic calculation, and writes the result back to S3 in CSV format.
S3 is AWS's managed, distributed object storage service, used for a wide variety of scenarios such as video storage, static file hosting, data warehouse storage, and more.
Before we start writing the program, we will declare the dependencies required for the application to work. Here is the list of dependencies that need to be added.
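As a sketch, the dependencies can be declared in `build.sbt` along these lines. The version numbers here are assumptions; pick versions that match your Spark and Hadoop installation (the Spark and `hadoop-aws` versions should be mutually compatible):

```scala
// build.sbt — a minimal sketch; project name and versions are assumptions
name := "flight-counts"
version := "0.1"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Spark SQL brings in the DataFrame API used in this article
  "org.apache.spark" %% "spark-sql" % "3.3.2",
  // hadoop-aws provides the S3A filesystem connector for reading/writing S3
  "org.apache.hadoop" % "hadoop-aws" % "3.3.2"
)
```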
We will be creating a basic Spark program that reads a JSON file containing flight-schedule data and, using the Spark DataFrame API, calculates the total number of flights starting from a specific city. The result of the program will be saved in CSV format.
Here is a sample record of the dataset, in JSON format, that will be read.
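The exact schema is not reproduced here, so the following record is purely illustrative; every field name and value below is an assumption about what a flight-schedule record might look like:

```json
{
  "FL_DATE": "2023-05-01",
  "ORIGIN_CITY_NAME": "Boston, MA",
  "DEST_CITY_NAME": "Chicago, IL",
  "CARRIER": "AA",
  "FL_NUM": 1234
}
```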
We will start by initializing the Spark session and injecting the AWS credentials.
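A minimal sketch of this step is shown below. The `fs.s3a.*` keys are the standard Hadoop S3A configuration properties; reading the credentials from environment variables is an assumption of this sketch — in practice, prefer the AWS credentials provider chain or IAM roles over passing keys explicitly:

```scala
import org.apache.spark.sql.SparkSession

// Build the Spark session. master("local[*]") is set here for local runs;
// when submitting to a cluster, leave the master to spark-submit.
val spark = SparkSession.builder()
  .appName("FlightCounts")
  .master("local[*]")
  .getOrCreate()

// Inject the AWS credentials into the Hadoop configuration so the
// S3A connector can authenticate. Env-var names are assumptions.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
```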
Next, accept the program's parameters: the input path to read from and the output path where the result will be stored.
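One way to wire this up is to take the two paths as positional command-line arguments (the `s3a://` example paths are placeholders):

```scala
object FlightCounts {
  def main(args: Array[String]): Unit = {
    // Expect exactly two arguments, e.g.:
    //   s3a://my-bucket/flights/  s3a://my-bucket/output/
    require(args.length == 2, "Usage: FlightCounts <inputPath> <outputPath>")
    val inputPath  = args(0)
    val outputPath = args(1)
    // ... session setup and pipeline go here ...
  }
}
```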
Implement the data-processing pipeline using the DataFrame API as shown below.
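The pipeline can be sketched as follows. Counting flights per origin city with `groupBy` also covers the specific-city case, since the city of interest can simply be filtered out of the result; the field name `ORIGIN_CITY_NAME` is an assumption about the dataset's schema:

```scala
import org.apache.spark.sql.functions._

// Read the JSON dataset from S3; Spark infers the schema from the records.
val flights = spark.read.json(inputPath)

// Count flights per origin city, most active cities first.
val flightsPerCity = flights
  .groupBy(col("ORIGIN_CITY_NAME"))
  .count()
  .withColumnRenamed("count", "total_flights")
  .orderBy(desc("total_flights"))

// Write the result back to S3 as CSV with a header row.
flightsPerCity.write
  .mode("overwrite")
  .option("header", "true")
  .csv(outputPath)
```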
You can run the program from IntelliJ using the local executor by configuring the Spark master as `local[*]` in the run configuration.
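Alternatively, the packaged jar can be run with `spark-submit`; the class name, jar path, and S3 paths in this sketch are placeholders:

```shell
# Class, jar, and bucket names are placeholders
spark-submit \
  --class com.example.FlightCounts \
  --master "local[*]" \
  target/scala-2.12/flight-counts_2.12-0.1.jar \
  s3a://my-bucket/flights/ \
  s3a://my-bucket/output/
```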
Below is the entire code that we just developed. To get the complete project, head over to GitHub.