Day 06 - Analyse Website traffic
## The Giggle Analytics

For the next five days, we will be building a mini-solution similar to Google Analytics, called Giggle Analytics.
Today we will focus on understanding the demographics of the website's users.
## Understanding the Website events

For this mini-solution, we will assume that we have received website logs from the client's browser in the form of JSON files, with fields such as `user_id` and `ip_addr`.
The field we are interested in is `ip_addr`, from which we'll find the location details using the IP Location Finder by KeyCDN.
Here's a simple example of how to use this API with the `curl` command, along with the response it returns.
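The following is a hypothetical invocation (the IP `8.8.8.8` and the site in the `User-Agent` are placeholders; KeyCDN expects a `keycdn-tools:<your site>` User-Agent):

```sh
curl "https://tools.keycdn.com/geo.json?host=8.8.8.8" \
  -H "User-Agent: keycdn-tools:https://example.com"
```

The response is a JSON document along these lines (abbreviated):

```json
{
  "status": "success",
  "description": "Data successfully received.",
  "data": {
    "geo": {
      "ip": "8.8.8.8",
      "country_name": "United States",
      "country_code": "US"
    }
  }
}
```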
The complete dataset of website logs will look something like the sample below.
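A made-up sample, just to fix the shape of the data:

```json
{"user_id": "u-1001", "ip_addr": "8.8.8.8"}
{"user_id": "u-1002", "ip_addr": "1.1.1.1"}
{"user_id": "u-1003", "ip_addr": "203.0.113.7"}
```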
## Parse the website logs

Let's parse the website logs and transform them into a dataframe consisting of the following fields:
- country
- user_id
- ip_addr
We'll create a small database storing the location data for each IP and save it as `ip.csv`. We'll then read this IP-location database into a dictionary, which will be referenced to fetch the corresponding country for each IP.
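The exact layout of `ip.csv` isn't shown here; a minimal two-column form (IP, country) would be enough for the lookups below:

```csv
8.8.8.8,United States
1.1.1.1,Australia
```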
Open the Spark shell and import the required classes. Create a function `getCountry(ip: String)` that fetches the corresponding country name using the KeyCDN API.
> **Tip:** Use the `:paste` command in the shell to copy and paste the code below, then press CTRL+D to execute all the pasted lines.
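A minimal sketch of what this function could look like, assuming the `ip.csv` cache above and KeyCDN's `geo.json` endpoint; the regex-based JSON handling is only to keep the shell session dependency-free:

```scala
import scala.io.Source

// Sketch only: look the IP up in the local ip.csv cache first, and fall back
// to the KeyCDN geo API on a miss. Paths, file layout and error handling are
// assumptions, not the author's exact code.
val ipDatabase: Map[String, String] =
  Source.fromFile("/path/to/ip.csv").getLines()
    .map(_.split(","))
    .collect { case Array(ip, country) => ip -> country }
    .toMap

def getCountry(ip: String): String =
  ipDatabase.getOrElse(ip, {
    // KeyCDN requires a descriptive User-Agent (keycdn-tools:<your site>).
    val conn = new java.net.URL(s"https://tools.keycdn.com/geo.json?host=$ip").openConnection()
    conn.setRequestProperty("User-Agent", "keycdn-tools:https://example.com")
    val body = Source.fromInputStream(conn.getInputStream).mkString
    // Naive extraction of "country_name", to avoid a JSON library in the shell.
    "\"country_name\":\"([^\"]+)\"".r
      .findFirstMatchIn(body)
      .map(_.group(1))
      .getOrElse("Unknown")
  })
```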
For Java, define the same `getCountry` function in a new class, `JLogAnalysis`. This function will be called for each row in the Dataset.
Now we'll read the website logs and transform the dataframe to include the country of each client, using the function defined above. For this transformation we'll use the Dataset API in Scala and Java, and the Dataframe API in Python.
What's the difference between a Dataframe and a Dataset? And when should I use one over the other?

Yeah, I know it's a common interview question. The Dataset API is for advanced use cases where the transformation from one type to another is non-trivial, while the Dataframe API covers commonly used transformations expressed through SQL-like operators - COUNT, AGGREGATE, SUM, GROUPBY, DATE_DIFF and so on.

For this example, we don't have a readily available function on the Dataframe to get a country from an IP, which is why we've taken the route of using the Dataset API.
The following table highlights the abstraction each language works with and how the transformation differs:

| Language | API Abstraction |
|---|---|
| Scala | `Dataset[T]` & Dataframe (i.e. `Dataset[Row]`) |
| Java | `Dataset[T]` |
| Python | Dataframe (convert to RDD for non-trivial transformations) |
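To make the distinction concrete, here is a small sketch in Scala (the `Visit` case class is an assumption based on the log fields above):

```scala
import spark.implicits._

// A Dataframe is Dataset[Row]: untyped rows queried with SQL-like operators.
val df = spark.read.json("/path/to/logs.json")
df.groupBy("ip_addr").count()

// A Dataset is typed, so arbitrary Scala can run on each record.
case class Visit(user_id: String, ip_addr: String)
val ds = df.as[Visit]
ds.map(v => (v.user_id, getCountry(v.ip_addr)))
```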
**Python**

- Import `Row`, as we need to extract the fields stored in each row.
- Read the logs stored in JSON format from the path `/path/to/logs.json`.
- For fetching the country, convert the dataframe into an RDD and pass a lambda that takes each row, extracts the `ip_addr` field and fetches the country from the `ip_database` defined earlier.
**Scala**

- Read the logs from the path `/path/to/logs.json` as a Dataframe.
- Add a column, `country`, fetched using the function `getCountry` that we defined above. This gives us a Dataset of String tuples with sequenced column names (`_1`, `_2`, ...), as in the sketch below.
- Convert this Dataset back into a Dataframe using the `toDF(...)` method, specifying the column names. In that sense you can consider a Dataframe to be an alias of `Dataset[Row]`.
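A minimal sketch of these steps, assuming the `user_id` and `ip_addr` fields and the `getCountry` function defined earlier:

```scala
import spark.implicits._

// Read the raw logs as a Dataframe (Dataset[Row]).
val websiteLogs = spark.read.json("/path/to/logs.json")

// map(...) over the rows yields a typed Dataset; here a Dataset of String
// tuples, which Spark names _1, _2, _3.
val withCountry = websiteLogs.map { row =>
  val ip = row.getAs[String]("ip_addr")
  (row.getAs[String]("user_id"), ip, getCountry(ip))
}

// Name the columns again to get back a Dataframe.
val logsDF = withCountry.toDF("user_id", "ip_addr", "country")
```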
**Java**

- Read the JSON file from the path `/path/to/logs.json` and save it in a variable `websiteLogs` as a Dataset.
- Use a `MapFunction<Row, Row>` to transform each row, adding another column that contains the country for the IP address. `MapFunction` requires an encoder, and we will be using a `RowEncoder` that takes the schema `structCountry`.
- A schema is required to define the columns of the table. As a `Dataset` follows the relational data format, it requires a schema, defined using `StructType`.
## Count the users by location

Let's count the total number of users, and then use this count to get the percentage of users that visited our website from each country. For getting the users per country, we will use the `groupBy` function defined on the dataframe.
**Python**

- Import `Row`, as we need to extract the fields stored in each row.
- Calculate the total number of users, assuming each user accessed the website once. This is stored in the `total_users` dataframe.
- Get the number of users for each country, then convert to an RDD and compute each country's fraction of the total users.
- After computing `percentage_users`, convert back into a dataframe. Run the `stats.show()` command to get a view of the end result.
**Scala**

- Calculate the total number of users, assuming each user accessed the website once. This is stored in `total`.
- Get the number of users for each country, then extract the field values using `getAs[String/Long]`; these are used to calculate each country's fraction of the total users.
- Convert back into a dataframe, as in the sketch below. Run the `stats.show()` command to get a view of the end result.
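A sketch of these steps in Scala, reusing the `logsDF` dataframe from the transformation above (variable and column names are assumptions):

```scala
import spark.implicits._

// Total number of users, assuming each user accessed the website once.
val total = logsDF.count()

// Users per country, then each country's share of the total.
val stats = logsDF.groupBy("country").count()
  .map { row =>
    val country = row.getAs[String]("country")
    val users   = row.getAs[Long]("count")
    (country, users, users * 100.0 / total)
  }
  .toDF("country", "total_users", "percentage_users")

stats.show()
```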
**Java**

- Get the total count of the users, assuming each user accessed the website once.
- Define the schema of the new row format, containing the country, the total website users from that country, and the percentage of the total users.
- Use a `MapFunction<Row, Row>` to transform each row, adding another column that contains the percentage, calculated by dividing by the total number of users. When using the `.map(...)` function on the Dataset, we need an encoder that lets the compiler know how to transform to the new data type. This is achieved using `RowEncoder.apply(...)`, which takes the schema of the new data type.
## Summary

Today we've seen how to use the Dataframe and Dataset APIs to aggregate users across different countries, providing interesting insights to website owners that will help them target the right audience. Analytical applications like this make software more intelligent and effective.

Stay connected with us to learn more through our Slack workspace.
## What's next?

Tomorrow we'll identify which devices are used the most for accessing the website.