Day 07 - Find usage by browser
# The Giggle Analytics

Today we will focus on understanding the devices that were used to access the website.
# Understanding the Website events

For this mini-solution, we will assume that we have received website logs from the client's browser in the form of JSON files with the following fields.
The field we are interested in is `user_agent`, from which we'll find the browser name and device details.
The complete dataset will be like below
# Parse the website logs

Let's parse the website logs and transform them into a dataframe that consists of the following fields:
- device_name
- os_name
- browser_name
- user_id
In Python, we'll use the library uap-python to parse the user agent and extract the device information.
Below is an example of how we can parse the user agent string.
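As a minimal sketch of that parsing step, assuming the `ua-parser` package (the PyPI distribution of uap-python) is installed: its `user_agent_parser.Parse` returns a nested dictionary with `user_agent`, `os` and `device` entries, each carrying a `family` name. The helper names `extract_device_info` and `get_device_info` are illustrative and are not taken from the original code.

```python
def extract_device_info(parsed):
    """Pick (device_name, os_name, browser_name) out of a parsed user agent.

    `parsed` is the nested dict returned by ua_parser's user_agent_parser.Parse.
    """
    return (
        parsed["device"]["family"],
        parsed["os"]["family"],
        parsed["user_agent"]["family"],
    )


def get_device_info(user_agent):
    """Parse a raw user agent string and return the device tuple."""
    # Imported lazily so this module still loads where ua-parser is missing.
    from ua_parser import user_agent_parser

    return extract_device_info(user_agent_parser.Parse(user_agent))
```

Calling `get_device_info` on the `user_agent` value of a log record would then yield the three values needed for the `device_name`, `os_name` and `browser_name` fields.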
In Scala, we'll use the library Yet Another UserAgent Analyzer to parse the user agent string and extract the attributes mentioned above.
Let's define a function `getDeviceInfo`, which takes each `Row` and returns a tuple of device info.
In Java, we'll also use Yet Another UserAgent Analyzer to parse the user agent string and extract the attributes mentioned above.
Let's define a function `getDeviceInfo`, which takes each `Row` and returns a tuple of device info. Define it in a new class `JDeviceAnalysis`; this function will be called for each row in the Dataset.
Now we'll read the website logs and transform the dataframe to include the device information of the respective client, using the function defined above. For the transformation we'll use the Dataset APIs in Scala and Java, and the Dataframe APIs in Python.
In Python:
- Import `Row` and `user_agent_parser`, as we need to extract the fields stored in each row.
- Read the logs stored in JSON format from the path `/path/to/logs_devices.json`.
- To fetch the browser, convert the dataframe into an RDD and then pass a lambda that takes each row and extracts the browser name as shown above.
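The Python steps above can be sketched roughly as follows, assuming PySpark and `ua-parser` are installed and that each log record carries `user_id` and `user_agent` fields. The helper names `row_to_device_tuple` and `load_device_info` are hypothetical, and this sketch builds plain tuples named through `toDF` rather than `Row` objects.

```python
COLUMNS = ["user_id", "device_name", "os_name", "browser_name"]


def row_to_device_tuple(row, parse):
    """Map one log row to (user_id, device_name, os_name, browser_name).

    `row` only needs item access by field name (a pyspark Row or a dict);
    `parse` is a callable such as ua_parser's user_agent_parser.Parse.
    """
    parsed = parse(row["user_agent"])
    return (
        row["user_id"],
        parsed["device"]["family"],
        parsed["os"]["family"],
        parsed["user_agent"]["family"],
    )


def load_device_info(spark, path="/path/to/logs_devices.json"):
    """Read the JSON logs and return a dataframe with the device columns."""
    # Lazy import: only needed when the Spark job actually runs.
    from ua_parser import user_agent_parser

    logs = spark.read.json(path)
    # Dataframe -> RDD -> one tuple per row -> dataframe with named columns.
    return logs.rdd.map(
        lambda row: row_to_device_tuple(row, user_agent_parser.Parse)
    ).toDF(COLUMNS)
```

Keeping the per-row mapping in its own function makes it easy to check without a Spark session, since a dictionary and a stand-in parser are enough to exercise it.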
In Scala:
- Read the logs from the path `/path/to/logs_devices.json` as a Dataframe.
- Add the columns browser, os and device, which are fetched using the function `getDeviceInfo` that we defined above. This gives us a Dataset of String tuples with sequenced column names.
- Convert this dataset back into a Dataframe using the `toDF(...)` method, specifying the column names. You can think of a Dataframe as an alias of `Dataset[Row]`.
In Java:
- Read the JSON file from the path `/path/to/logs_devices.json` and save it in a variable `websiteLogs` as a Dataset.
- Use `MapFunction<Row,Row>` to transform each row to add another column that contains the device info. `MapFunction` requires an encoder, and we will use the `RowEncoder`, which takes the schema `structDevice`.
- A schema is required to define the columns of the table. As `Dataset` follows a relational data format, it requires a schema, defined using `StructType`.
# Count the users by browser

Let's count the total number of users, and then use this count to get the percentage of users that visited our website from each browser. To get the users for each browser, we will use the `groupBy` function defined on the dataframe.
In Python:
- Import `Row`, as we need to extract the fields stored in each row.
- Calculate the total number of users, assuming each user accessed the website once. This is stored in the `total_users` dataframe.
- Get the total number of users for each browser, then convert to an RDD and compute the fraction of total users for that browser.
- After computing `percentage_users`, convert back into a dataframe. Run the `stats.show()` command to view the end result.
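A rough sketch of the Python steps above, assuming PySpark is installed and that the dataframe has a `browser_name` column as built earlier. The helper names `with_percentage` and `browser_stats` are hypothetical, and for simplicity this sketch keeps `total_users` as a plain integer from `count()`; the names `total_users`, `percentage_users` and `stats` follow the text.

```python
def with_percentage(row, total_users):
    """Return (browser_name, users, percentage_users) for one grouped row.

    `row` needs item access to "browser_name" and "count", which are the
    columns produced by groupBy("browser_name").count().
    """
    # Use row["count"]: on a pyspark Row, row.count is the tuple method.
    percentage_users = 100.0 * row["count"] / total_users
    return (row["browser_name"], row["count"], percentage_users)


def browser_stats(devices_df):
    """Aggregate users per browser and attach each browser's share of users."""
    # Assumes each user accessed the website once, so rows == users.
    total_users = devices_df.count()
    per_browser = devices_df.groupBy("browser_name").count()
    # Dataframe -> RDD -> add the percentage -> back to a dataframe.
    stats = per_browser.rdd.map(
        lambda row: with_percentage(row, total_users)
    ).toDF(["browser_name", "users", "percentage_users"])
    return stats
```

With a dataframe of device records in hand, `browser_stats(devices_df).show()` would print one row per browser with its user count and share of the total.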
In Scala:
- Get the total count of the users, assuming each user accessed the website once. This is stored in the `total` dataframe.
- Get the total number of users for each browser, then extract the field values using `getAs[String/Long]`; these are used to calculate the fraction of total users for that browser.
- Convert back into a dataframe. Run the `stats.show()` command to view the end result.
In Java:
- Get the total count of the users, assuming each user accessed the website once.
- Define the schema of the new row format, which contains the browser, the total number of website users using that browser, and the percentage of the total users.
- Use `MapFunction<Row,Row>` to transform each row to add another column that contains the percentage, calculated by dividing by the total number of users. When using the `.map(...)` function on the dataset, we need an encoder that tells Spark how to transform to the new data type. This is achieved with `RowEncoder.apply(...)`, which takes the schema of this new data type.
# Summary

Today we've seen how to use the dataframe and dataset APIs to aggregate users across different browsers, giving website owners insights that help them target the right audience. Analytical applications will definitely make software more intelligent and effective.
Browse the complete source code for each programming language on GitHub.
If you get stuck, join our Slack workspace and ask questions.
# What's next?

Tomorrow we'll identify which operating systems were used the most to access the website.