Day 08 - Find the popular operating system

The Giggle Analytics#

Today we will focus on finding the most popular operating system among the devices that accessed our website. Along the way, we will also use Hadoop to store the logs and read them back.

[Image: GA Demographic]

About Hadoop#

Apache Hadoop is an open-source software framework that enables distributed processing of large datasets. It has four main components:

  • Hadoop Common: Common utilities used by the other components listed below.
  • Hadoop Distributed File System (HDFS): Storage layer that enables distributed processing engines like Spark and Flink to read from and write to the file system.
  • Hadoop YARN: Framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: Parallel processing framework.

For this tutorial, we will use only HDFS, to store the logs_devices.json file.

Let's quickly install Hadoop before we go ahead with the program.

Understanding the Website events#

For this mini-solution, we will assume that we have received website logs from the client's browser in the form of JSON files with the following fields:

logs_devices.json
{
  "ip_addr": "xx.xx.xx.x",    // IP address of the user
  "user_id": "u09",           // The identity of the client
  "timestamp": 1644218754,    // The time the request was received
  "request": "GET /guides/100-days-of-spark/",    // The request line, including the HTTP method and the requested resource path
  "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",    // Browser user agent
  "status": 200,              // The status code that the server sends back to the client
  "size": 2048                // The size of the requested object
}
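
For reference, here's a minimal sketch of the Spark schema these fields map to. This is purely illustrative; spark.read.json infers the schema automatically (numeric JSON fields become LongType by default), so defining it explicitly is optional.

import org.apache.spark.sql.types._

// Illustrative schema for the log fields above (spark.read.json can infer this on its own)
val logSchema = StructType(Seq(
  StructField("ip_addr", StringType),
  StructField("user_id", StringType),
  StructField("timestamp", LongType),
  StructField("request", StringType),
  StructField("user_agent", StringType),
  StructField("status", LongType),
  StructField("size", LongType)
))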

The field we are interested in is user_agent, from which we'll extract the operating system name and other device details.

The complete dataset will look like the one below.

logs_devices.json
{"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u01", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.50","ip_addr":"122.180.169.236", "user_id": "u02", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u03", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u04", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u05", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:97.0) Gecko/20100101 Firefox/97.0","ip_addr":"122.180.169.236", "user_id": "u06", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:97.0) Gecko/20100101 Firefox/97.0","ip_addr":"122.180.169.236", "user_id": "u07", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u08", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u09", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u10", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}

Let's now upload the logs_devices.json file to HDFS. You can use the Gigahex file browser or the hdfs command, as shown below.

Using the HDFS Command#

hdfs dfs -mkdir -p /user/gigahex
hdfs dfs -put logs_devices.json /user/gigahex/

Using the Gigahex HDFS File browser#

  • After starting the HDFS single-node cluster, click on the Upload File button

  • Enter the directory as /user/gigahex and choose the file from the file browser. This will automatically create the parent directories before uploading the file.

    [Image: HDFS Browser]

Parse the website logs#

Let's parse the website logs and transform them into a DataFrame that consists of the following fields:

  • device_name
  • os_name
  • browser_name
  • user_id

We'll be using the library Yet Another UserAgent Analyzer for parsing the user agent string, and extracting the above mentioned attributes.
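
If you're running the code in a standalone Scala project rather than a pre-configured environment, the yauaa library needs to be on the classpath. A minimal sbt sketch (the version shown is only an example; use the latest yauaa release):

// build.sbt -- the version below is an example, not a pinned requirement
libraryDependencies += "nl.basjes.parse.useragent" % "yauaa" % "7.0.0"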

Let's define a function getDeviceInfo, which will take each Row and return a tuple of device info.

import nl.basjes.parse.useragent.UserAgentAnalyzer
import org.apache.spark.sql._

def getDeviceInfo(row: Row): (String, String, String, String) = {
  // Build the user agent analyzer, with a cache to speed up repeated lookups
  val uaa = UserAgentAnalyzer
    .newBuilder()
    .hideMatcherLoadStats()
    .withCache(10000)
    .build()
  // Parse the user_agent field and extract the device, browser and OS names
  val result = uaa.parse(row.getAs[String]("user_agent"))
  (row.getAs[String]("user_id"),
   result.getValue("DeviceName"),
   result.getValue("AgentName"),
   result.getValue("OperatingSystemNameVersionMajor"))
}
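
As a quick sanity check, here's a minimal sketch of what the analyzer returns for one of the user agents in our dataset. The commented values match the results we'll see further below, though the exact strings can vary slightly between yauaa versions.

import nl.basjes.parse.useragent.UserAgentAnalyzer

val uaa = UserAgentAnalyzer.newBuilder().hideMatcherLoadStats().withCache(10000).build()
val ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
val result = uaa.parse(ua)

result.getValue("DeviceName")                      // "Desktop"
result.getValue("AgentName")                       // "Chrome"
result.getValue("OperatingSystemNameVersionMajor") // "Windows 10"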

Now we'll read the website logs and transform the DataFrame to include the device information of each client, using the function defined above. While transforming, we'll use the Dataset API for Scala and Java, and the DataFrame API for Python.

  • Read the logs from the path hdfs://0.0.0.0:9075/user/gigahex/logs_devices.json as a DataFrame. When using Gigahex to run a single-node HDFS cluster, the namenode listens on port 9075.

  • Add the columns browser, os and device, which are derived using the getDeviceInfo function defined above. This gives us a Dataset of String tuples with default positional column names (_1, _2, ...).

    We will convert this Dataset back into a DataFrame using the toDF(...) method, specifying the column names. You can think of a DataFrame as simply an alias for Dataset[Row].

val websiteLogs = spark.read.json("hdfs://0.0.0.0:9075/user/gigahex/logs_devices.json")
val withDevices = websiteLogs
  .map(row => getDeviceInfo(row))
  .toDF("user_id", "device", "browser", "os")
withDevices.show()
+-------+---------------+-------+-----------+
|user_id| device|browser| os|
+-------+---------------+-------+-----------+
| u01| Desktop| Chrome| Windows 10|
| u02| Desktop| Edge| Windows 10|
| u03|Apple Macintosh| Chrome|Mac OS X 10|
| u04|Apple Macintosh| Chrome|Mac OS X 10|
| u05|Apple Macintosh| Chrome|Mac OS X 10|
| u06|Apple Macintosh|Firefox|Mac OS X 10|
| u07|Apple Macintosh|Firefox|Mac OS X 10|
| u08| Desktop| Chrome| Windows 10|
| u09| Desktop| Chrome| Windows 10|
| u10| Desktop| Chrome| Windows 10|
+-------+---------------+-------+-----------+

Count the users by operating system#

Let's count the total number of users, and then use this count to get the percentage of users that visited our website from each operating system. To get the users per operating system, we will use the groupBy function defined on the DataFrame.

  • Calculate the total number of users, assuming each user accessed the website once. This is stored in the total variable.
  • Get the number of users for each OS, extracting the field values with getAs[String]/getAs[Long]; these are used to calculate each OS's fraction of the total users.
  • Convert the result back into a DataFrame. Run the stats.show() command to view the end result.
import org.apache.spark.sql.functions.desc

val total = websiteLogs.count()
val stats = withDevices.groupBy("os").count()
  .map(row => (row.getAs[String]("os"),
               row.getAs[Long]("count"),
               (row.getAs[Long]("count") / total.toDouble) * 100))
  .toDF("Operating System", "Users", "% Users")
  .orderBy(desc("Users"))
stats.show()
+----------------+-----+-------+
|Operating System|Users|% Users|
+----------------+-----+-------+
| Windows 10| 5| 50.0|
| Mac OS X 10| 5| 50.0|
+----------------+-----+-------+
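
As a side note, the same result can be computed without mapping over rows at all, using only DataFrame aggregation functions. A minimal sketch, assuming withDevices and total are defined as above:

import org.apache.spark.sql.functions._

// Group by OS, count the users, and derive the percentage column directly
val statsAlt = withDevices
  .groupBy("os")
  .count()
  .withColumnRenamed("count", "Users")
  .withColumn("% Users", round(col("Users") / lit(total) * 100, 2))
  .withColumnRenamed("os", "Operating System")
  .orderBy(desc("Users"))
statsAlt.show()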

Summary#

Today we've seen how to read from HDFS and use the DataFrame and Dataset APIs to aggregate users across different operating systems, providing website owners with interesting insights that help them target the right audience. Analytical applications like this make software more intelligent and effective.

Browse the complete source code for each programming language on GitHub.

If you get stuck, join our Slack workspace and ask questions.

What's next?#

Tomorrow we'll identify which time of the day the website is accessed the most.

[Image: Plan post day 8]