Day 08 - Find the popular operating system
The Giggle Analytics

Today we will focus on finding the popular operating systems used by the devices that accessed our website. Along the way, we will also use Hadoop for storing the logs and reading them back.
About Hadoop

Apache Hadoop is an open-source software library that allows distributed processing of large datasets. It has four main components:
- Hadoop Common: Common utilities used by the other components described below.
- Hadoop Distributed File System (HDFS): Storage layer that enables distributed processing engines like Spark and Flink to read from and write to the file system.
- Hadoop YARN: Framework for job scheduling and cluster resource management.
- Hadoop MapReduce: Parallel processing framework.
For this tutorial, we will be using only HDFS, for storing the logs_devices.json file.
Let's quickly install Hadoop before we go ahead with the program.
Understanding the Website Events

For this mini-solution, we will assume that we have received website logs from the client's browser in the form of JSON files with the fields shown below.
The field we are interested in is user_agent, from which we'll extract the operating system name and other device details.
The complete dataset consists of records like the one shown below.
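For illustration, a single log entry might look roughly like the record below; only the user_id and user_agent fields are used in this exercise, and the exact field names and values here are an assumption about your logging setup.

```json
{
  "user_id": "u-10234",
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
}
```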
Let's now upload the logs_devices.json file to HDFS. For this, you can use the Gigahex file browser or the hdfs command, as shown below.
Using the HDFS Command
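If you prefer the command line, something along these lines should work from a node where the Hadoop client is configured (assuming logs_devices.json is in the current directory):

```bash
# Create the target directory if it does not exist, then upload the file.
hdfs dfs -mkdir -p /user/gigahex
hdfs dfs -put logs_devices.json /user/gigahex/
```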
Using the Gigahex HDFS File Browser

After starting the single-node HDFS cluster, click on the Upload File button. Enter the directory as /user/gigahex and choose the file from the file browser. This will automatically create the parent directories before uploading the file.
Parse the Website Logs

Let's parse the website logs and transform them into a dataframe that consists of the following fields:
- device_name
- os_name
- browser_name
- user_id
The implementation is described below for Python, Scala, and Java.
In Python, we'll use the uap-python library for parsing the user agent and extracting the device information.
Below is an example of how we can parse a user agent string.
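A minimal sketch using uap-python's Parse function (assuming the package is installed, e.g. via pip install ua-parser):

```python
# Parse a user agent string and print the extracted families.
from ua_parser import user_agent_parser

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36")

parsed = user_agent_parser.Parse(ua)
print(parsed["os"]["family"])          # e.g. Windows
print(parsed["user_agent"]["family"])  # e.g. Chrome
print(parsed["device"]["family"])      # e.g. Other
```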
In Scala, we'll use the library Yet Another UserAgent Analyzer (YAUAA) for parsing the user agent string and extracting the above-mentioned attributes. Let's define a function getDeviceInfo, which will take each Row and return a tuple of device info.
Java also uses Yet Another UserAgent Analyzer. Add the following code defining the getDeviceInfo function in a new class, JDeviceAnalysis. This function will be called for each row in the Dataset.
Now we'll read the website logs and transform the dataframe to include the device information for each client, using the function defined above. For the transformation, we'll use the Dataset API for Scala and Java, and the DataFrame API for Python.
In Python:
- Import Row and user_agent_parser, as we need to extract fields stored in each row.
- Read the logs stored in JSON format from the HDFS path. When using Gigahex for running a single-node HDFS, the namenode is running at port 9075, therefore the path to the file is hdfs://0.0.0.0:9075/user/gigahex/logs_devices.json.
- For fetching the browser, we convert the dataframe into an RDD and then pass a lambda that takes each row and extracts the browser name as shown above; the sketch after this list puts these steps together.
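A minimal sketch of these steps, assuming a running SparkSession named spark, the uap-python package available on the workers, and log fields named user_id and user_agent (the helper name get_device_info is illustrative):

```python
# Read the logs from HDFS and enrich each record with device info.
from pyspark.sql import Row
from ua_parser import user_agent_parser

logs = spark.read.json("hdfs://0.0.0.0:9075/user/gigahex/logs_devices.json")

def get_device_info(row):
    # Parse the user agent and keep only the fields we need downstream.
    parsed = user_agent_parser.Parse(row["user_agent"])
    return Row(user_id=row["user_id"],
               browser=parsed["user_agent"]["family"],
               os=parsed["os"]["family"],
               device=parsed["device"]["family"])

device_logs = logs.rdd.map(get_device_info).toDF()
device_logs.show(5, truncate=False)
```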
In Scala:
- Read the logs from the path hdfs://0.0.0.0:9075/user/gigahex/logs_devices.json as a Dataframe. When using Gigahex for running a single-node HDFS, the namenode is running at port 9075.
- Add the columns browser, os and device, which are fetched using the getDeviceInfo function defined above. This gives us a Dataset of String tuples.
- Convert this Dataset back into a Dataframe using the toDF(...) method, specifying the column names. You can think of a Dataframe as an alias for Dataset[Row].
In Java:
- Read the JSON file from the path hdfs://0.0.0.0:9075/user/gigahex/logs_devices.json and save it in a variable websiteLogs as a Dataset.
- Use MapFunction<Row,Row> to transform each row and add the columns that contain the device info. MapFunction requires an encoder, and we will use RowEncoder, which takes the schema structDevice.
- A schema is required to define the columns of the table. As a Dataset follows the relational data format, it requires a schema, defined using StructType.
Count the Users by Operating System

Let's count the total number of users, and then use this count to get the percentage of users that visited our website from each operating system. For getting the users for each operating system, we will use the groupBy function defined on the dataframe.
In Python:
- Import Row, as we need to extract fields stored in each row.
- Calculate the total number of users, assuming each user accessed the website once. This is stored in total_users.
- Get the number of users for each OS, then convert to an RDD and compute each OS's fraction of the total users.
- After computing percentage_users, convert back into a dataframe. Run the stats.show() command to view the end result; a sketch follows this list.
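A minimal sketch of these steps, reusing the device_logs dataframe from the earlier snippet (the column names and the rounding are illustrative):

```python
# Count users per operating system and compute each OS's share of the total.
from pyspark.sql import Row

total_users = device_logs.count()

stats = (device_logs
         .groupBy("os")
         .count()
         .rdd
         .map(lambda r: Row(os=r["os"],
                            users=r["count"],
                            percentage=round(r["count"] * 100.0 / total_users, 2)))
         .toDF())

stats.show()
```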
In Scala:
- Calculate the total number of users, assuming each user accessed the website once. This is stored in total.
- Get the number of users for each OS and then extract the field values using getAs[String] and getAs[Long]; these are used to calculate each OS's fraction of the total users.
- Convert back into a dataframe. Run the stats.show() command to view the end result.
In Java:
- Get the total count of users, assuming each user accessed the website once.
- Define the schema of the new row format, which contains the browser, the total number of website users using that browser, and the percentage of the total users.
- Use MapFunction<Row,Row> to transform each row and add another column that contains the percentage, calculated by dividing by the total number of users. When using the .map(...) function on the Dataset, we need an encoder that lets the compiler know how to transform to the new data type. This is achieved by using RowEncoder.apply(...), which takes the schema of this new data type.
Summary

Today we've seen how to read from HDFS and use the dataframe and dataset APIs to aggregate users across different operating systems, providing interesting insights that will help website owners target the right audience. Analytical applications like this make software more intelligent and effective.
Browse the complete source code for each programming language on GitHub.
If you get stuck, join our Slack workspace and ask questions.
What's Next?

Tomorrow we'll identify at which time of day the website is accessed the most.