# Read and write data in S3 with Spark

## Objective

We will develop a sample Spark application in Scala that reads a JSON file from S3, performs a basic calculation, and writes the result back to S3 in CSV format.
## About S3

S3 is an AWS-managed distributed object store that can be used for a wide variety of scenarios, such as video storage, static file hosting, data warehouse storage, and many more.
## Configure dependencies

Before we start writing the program, we will declare the dependencies required for the application to work. Here is the list of dependencies that need to be added.
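The list itself isn't reproduced above; a minimal `build.sbt` sketch, assuming Spark 3.x with the `spark-sql` and `hadoop-aws` artifacts (version numbers are illustrative and should match your environment), might look like this:

```scala
// build.sbt -- illustrative versions; keep hadoop-aws in sync with Spark's Hadoop version
name := "spark-s3-example" // hypothetical project name
version := "0.1"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Spark SQL brings in the DataFrame API used below
  "org.apache.spark" %% "spark-sql" % "3.5.0",
  // S3A filesystem connector so Spark can read and write s3a:// paths
  "org.apache.hadoop" % "hadoop-aws" % "3.3.4"
)
```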
## Program description

We will create a basic Spark program that reads a JSON file containing flight-schedule data and, using the Spark DataFrame API, calculates the total number of flights departing from a specific city. The result of the program will be saved in CSV format.
Here is a sample record of the dataset in JSON format, which will be read using the `spark.read.json` API.
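The original record isn't reproduced here; a hypothetical record (the field names `flightId`, `origin`, `destination`, and `departureTime` are assumptions carried through the rest of this walkthrough) might look like:

```json
{"flightId": "AB-123", "origin": "Bengaluru", "destination": "Delhi", "departureTime": "2021-06-01T06:30:00Z"}
```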
We will start by initializing the Spark session and injecting the AWS credentials using system properties.
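A sketch of this step, assuming the credentials arrive through the standard `aws.accessKeyId` and `aws.secretKey` system properties and are handed to the S3A connector:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FlightsFromCity") // hypothetical application name
  .master("local[*]")         // local executor, suitable for running from IntelliJ
  .getOrCreate()

// Pass the AWS credentials from JVM system properties to the S3A filesystem
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", System.getProperty("aws.accessKeyId"))
hadoopConf.set("fs.s3a.secret.key", System.getProperty("aws.secretKey"))
```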
Next, accept the program arguments for the input path and the output path where the result will be stored.
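A simple way to do this, assuming exactly two arguments are always supplied:

```scala
// args(0): S3 path of the input JSON file, args(1): S3 path for the CSV output
val Array(inputPath, outputPath) = args
```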
Then implement the data processing pipeline using the DataFrame API, as shown below.
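The referenced snippet isn't included above; a sketch using the hypothetical `origin` field from the sample record, with the city of interest hard-coded for illustration:

```scala
import org.apache.spark.sql.functions.col

// Read the JSON dataset from S3 into a DataFrame
val flights = spark.read.json(inputPath)

// Total flights departing from a specific city ("Bengaluru" is an example value)
val totalFromCity = flights
  .filter(col("origin") === "Bengaluru")
  .groupBy(col("origin"))
  .count()

// Write the single-row result back to S3 in CSV format
totalFromCity.write
  .option("header", "true")
  .mode("overwrite")
  .csv(outputPath)
```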
## Run the program

You can run the program from IntelliJ with a local executor by configuring the run options.
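One way to wire up the run configuration (an illustration; bucket names and keys are placeholders): pass the credentials as VM options and the S3 paths as program arguments.

```
VM options:        -Daws.accessKeyId=<access-key> -Daws.secretKey=<secret-key>
Program arguments: s3a://<your-bucket>/input/flights.json s3a://<your-bucket>/output/
```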
## Source code

Below is the entire code that we just developed. To get the complete project, head over to GitHub.
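The original listing isn't reproduced here; stitching the snippets above together gives the following sketch (the object name, field names, and example city remain assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FlightsFromCity {

  def main(args: Array[String]): Unit = {
    // args(0): S3 path of the input JSON file, args(1): S3 path for the CSV output
    val Array(inputPath, outputPath) = args

    val spark = SparkSession.builder()
      .appName("FlightsFromCity")
      .master("local[*]") // local executor, suitable for running from IntelliJ
      .getOrCreate()

    // Pass the AWS credentials from JVM system properties to the S3A filesystem
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", System.getProperty("aws.accessKeyId"))
    hadoopConf.set("fs.s3a.secret.key", System.getProperty("aws.secretKey"))

    // Read the JSON dataset from S3 into a DataFrame
    val flights = spark.read.json(inputPath)

    // Total flights departing from a specific city ("Bengaluru" is an example value)
    val totalFromCity = flights
      .filter(col("origin") === "Bengaluru")
      .groupBy(col("origin"))
      .count()

    // Write the result back to S3 in CSV format
    totalFromCity.write
      .option("header", "true")
      .mode("overwrite")
      .csv(outputPath)

    spark.stop()
  }
}
```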