Day 07 - Find usage by browser

The Giggle Analytics#

Today we would be focusing on understanding the devices that were used for accessing the website.

GA Demographic!

Understanding the Website events#

For this mini-solution, we will assume that we have received website logs from the client's browser in the form of json files with the following fields.

logs.json
{
"ip_addr":"xx.xx.xx.x", //IP address of the user
"user_id": "u09", //The identity of the client
"timestamp": 1644218754, //The time that the request was received
"request": "GET /guides/100-days-of-spark/", //The request line that includes the HTTP method used, the requested resource path
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36", // Browser user agent
"status": 200, //The status code that the server sends back to the client
"size": 2048 //The size of the object requested
}

The field we are interested in, is the user_agent, from which we'll find the browser name and device details.

The complete dataset will be like below

logs.json
{"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u01", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.50","ip_addr":"122.180.169.236", "user_id": "u02", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u03", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u04", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u05", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:97.0) Gecko/20100101 Firefox/97.0","ip_addr":"122.180.169.236", "user_id": "u06", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:97.0) Gecko/20100101 Firefox/97.0","ip_addr":"122.180.169.236", "user_id": "u07", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u08", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u09", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}
{"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36","ip_addr":"122.180.169.236", "user_id": "u10", "timestamp": 1644218754, "request": "GET /","status": 200, "size": 2048}

Parse the website logs#

Let's parse the website logs and transform it into a dataframe that consists of the following fields :

  • device_name
  • os_name
  • browser_name
  • user_id

We'll be using the library Yet Another UserAgent Analyzer for parsing the user agent string, and extracting the above mentioned attributes.

Let's define a function getDeviceInfo, which will take each Row and return a tuple of device info.

import nl.basjes.parse.useragent.UserAgentAnalyzer
import org.apache.spark.sql._
def getDeviceInfo(row: Row): (String, String, String, String) = {
val uaa = UserAgentAnalyzer
.newBuilder()
.hideMatcherLoadStats()
.withCache(10000)
.build()
val result = uaa.parse(row.getAs[String]("user_agent"))
(row.getAs[String]("user_id"),
result.getValue("DeviceName"),
result.getValue("AgentName"),
result.getValue("OperatingSystemNameVersionMajor"))
}

Now we'll read the website logs and transform the dataframe to include the device information of the respective client, using the function as defined above. While transforming we'll be using the Dataset APIs for Scala and Java, and Dataframe APIs for Python.

  • Read the logs from the path /path/to/logs_devices.json as a Dataframe.

  • Add the columns, - browser, os and device which are fetched using the function getDeviceInfo that we defined above. This gives us a Dataset of String tuples with sequenced column names.

    We will convert this dataset back into Dataframe using toDF(...) method, by specifying the column names. So you can consider that Dataframe is an alias of Dataset[Row]

val websiteLogs = spark.read.json("/path/to/logs_devices.json")
val withDevices = websiteLogs
.map(row => getDeviceInfo(row)).toDF("user_id", "device", "browser", "os")
withDevices.show()
+-------+---------------+-------+-----------+
|user_id| device|browser| os|
+-------+---------------+-------+-----------+
| u01| Desktop| Chrome| Windows 10|
| u02| Desktop| Edge| Windows 10|
| u03|Apple Macintosh| Chrome|Mac OS X 10|
| u04|Apple Macintosh| Chrome|Mac OS X 10|
| u05|Apple Macintosh| Chrome|Mac OS X 10|
| u06|Apple Macintosh|Firefox|Mac OS X 10|
| u07|Apple Macintosh|Firefox|Mac OS X 10|
| u08| Desktop| Chrome| Windows 10|
| u09| Desktop| Chrome| Windows 10|
| u10| Desktop| Chrome| Windows 10|
+-------+---------------+-------+-----------+

Count the users by browser#

Lets count the total number of users, and then we will use this count to get the percentage of users that visited our website from each browser. For getting the users for each browser, we will use the groupBy function defined on the dataframe.

  • Get the total count of the users, assuming each user accessed the website once.
  • Calculate the total number of users, assuming each user accessed the website once. This is stored in total dataframe.
  • We get the total number of users by each browser and then extract the field value using getAs[String/Long], that will be used to calculate the fraction of total users for that browser.
  • Convert back into dataframe. Run the stats.show() command to get a view of the end result.
val total = websiteLogs.count()
val stats = withDevices.groupBy("browser").count()
.map(row => (row.getAs[String]("browser"),
row.getAs[Long]("count"),
(row.getAs[Long]("count") / total.toDouble) * 100))
.toDF("Browser", "Users", "% Users")
.orderBy(desc("Users"))
stats.show()
+-------+-----+-------+
|Browser|Users|% Users|
+-------+-----+-------+
| Chrome| 7| 70.0|
|Firefox| 2| 20.0|
| Edge| 1| 10.0|
+-------+-----+-------+

Summary#

Today we've seen how to use dataframe and dataset APIs to aggregate users across different browser, thereby providing interesting insights to the website owners, that will help them to target the right audience. Analytical applications will definitely make softwares more intelligent and affective.

Browse the complete source code for each programming language in the Github.

If you get stuck, join our Slack workspace and ask questions.

What's next ?#

Tomorrow we'll identify which operating system were used the most, for accessing the website.

Plan post day7!