This post is designed for a joint installation of Ubuntu Server 14.04.3 LTS, a single-node Apache Hadoop 2.6.0 cluster and Apache Spark 1.5.1 (pre-built for Hadoop 2.6 and later). The steps outlined involve setting up a job to summarize the GroupLens Book-Crossing dataset using Hadoop Streaming and Spark word count (in local mode). The Hadoop Streaming part of the job is implemented using mapper-reducer sets written in Python and Ruby. The Spark word count part of the job is implemented with the SparkContext in the Scala Spark shell (in local mode).
The Book-Crossing dataset is a great information source for book ratings. The information is particularly useful
when analyzed in relation to the GroupLens MovieLens datasets and other
GroupLens datasets.
1. Prepare the data
The Book-Crossing data was collected by Cai-Nicolas Ziegler in a 4-week crawl (during the August/September 2004 period) from the
Book-Crossing community with the kind permission of Ron Hornbaker (CTO of
Humankind Systems). The reference to the dataset also has excellent additional resources on methods for book-rating-related analyses.
The specific dataset considered in this illustration
is the BX-Book-Ratings dataset. The dataset version selected for the
illustration is the CSV Dump version of the BX-Book-Ratings dataset. The other
version that is available is the SQL Dump.
The BX-Book-Ratings dataset contains three variables:
- UserID
- ISBN
- Rating (for each ISBN)
The summary of the notes on the BX-Book-Ratings Rating variable is as follows:
- Ratings are either explicit, expressed on a scale from one to ten (with higher values denoting higher appreciation), or implicit, expressed by 0.
In this illustration I will show how to determine
the number of UserIDs that rated, the number of ISBNs that were rated, the
average rating for each ISBN and the number of ratings in each ISBN rating score
category.
In a simple MapReduce approach to the job, this amounts
to four jobs. These are:
- Counting the number of UserIDs in a UserID text file created from the first column of the dataset
- Counting the number of ISBNs in an ISBN text file created from the second column of the dataset
- Calculating the average rating for each ISBN in a text file created from the second and third columns (or all the columns) of the dataset
- Calculating the number of times each score rating value occurs in a text file created from the third column of the dataset
The approach will be to show how to complete the
jobs in the following manner:
- Complete the first job in Spark running on local mode using Resilient Distributed Dataset (RDD) processing methods
- Complete the second and third jobs using Hadoop Streaming with mapper-reducer sets written in Python
- Complete the fourth job using Hadoop Streaming with a mapper-reducer set written in Ruby
The resulting output files yield the required
information for analysis (or further processing).
In this illustration the ISBN text file must be created in such a manner that the ISBNs are more than one character long (i.e. all field lengths must be greater than one).
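As a rough guide, the column text files could be prepared from the CSV dump with a short Python script along the following lines. This is a minimal sketch: the input and output file names are hypothetical, and it assumes the CSV dump is semicolon-delimited with quoted fields and a header row; adjust the details to your copy of the dataset.

```python
import csv

# Hypothetical sketch: split the BX-Book-Ratings CSV dump into the four
# input text files used in the jobs below. File names and the semicolon
# delimiter are assumptions.
with open('BX-Book-Ratings.csv', 'r') as src, \
     open('UserID.txt', 'w') as users, \
     open('ISBN.txt', 'w') as isbns, \
     open('Ratings.txt', 'w') as ratings, \
     open('InputData.txt', 'w') as combined:
    reader = csv.reader(src, delimiter=';', quotechar='"')
    next(reader)  # skip the header row
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed lines
        user_id, isbn, rating = row
        users.write(user_id + '\n')
        if len(isbn) > 1:  # keep only ISBNs longer than one character
            isbns.write(isbn + '\n')
        ratings.write(rating + '\n')
        combined.write('\t'.join(row) + '\n')  # tab-separated three-column file
```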
The average rating for each ISBN is obtained by using
the ISBN column and the Ratings column of the dataset. In this illustration the whole dataset is used for the third job (i.e. the UserID column is also included with the ISBN and Ratings columns). The rationale
for this approach is to take advantage of the linked nature of the dataset.
For example, the same dataset can be used (with some
minor adjustments to the Python code or by using another mapper-reducer set) to
calculate the average of the ratings made by each user.
The average ISBN rating mapper-reducer set Python
code can, however, be easily modified to exclude the UserID column if a more compact
dataset is desired for the job.
The four resulting datasets prepared in this manner
can be processed using Hadoop and Spark. The next step is to prepare the
mapper-reducer sets for the Hadoop Streaming part of the job. The Spark part of
the job can be run using a simple program (within a SparkContext) in the Scala Spark shell run in local mode.
2. Prepare the mapper and reducer sets
Python Mapper and Reducer set for the ISBN counts
The Python mapper-reducer set for the ISBN counts was
prepared using the tutorial in this post.
The mapper.
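A minimal sketch of what this mapper (BCmapper.py) might look like, following the streaming word-count pattern from the tutorial; it assumes one ISBN per input line and emits tab-separated (ISBN, 1) pairs. The exact code in the original post may differ.

```python
#!/usr/bin/env python
# Hypothetical sketch of BCmapper.py: emit a count of 1 for each ISBN read from stdin.
import sys

for line in sys.stdin:
    isbn = line.strip()
    if isbn:  # skip blank lines
        print('%s\t%s' % (isbn, 1))
```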
The reducer.
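And a matching sketch for the reducer (BCreducer.py), which relies on the streaming framework sorting the mapper output by key; again, a sketch rather than the exact code from the original post.

```python
#!/usr/bin/env python
# Hypothetical sketch of BCreducer.py: sum the counts for each ISBN key.
import sys

current_isbn = None
current_count = 0

for line in sys.stdin:
    isbn, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue  # skip malformed counts
    if isbn == current_isbn:
        current_count += count
    else:
        if current_isbn is not None:
            print('%s\t%d' % (current_isbn, current_count))
        current_isbn = isbn
        current_count = count

if current_isbn is not None:
    print('%s\t%d' % (current_isbn, current_count))
```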
Python Mapper and Reducer set for the ISBN rating averages
The Python mapper-reducer set for the ISBN average ratings was prepared using the tutorial in this post.
The mapper.
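A minimal sketch of the average-rating mapper (Bavgmapper.py); the three-column, tab-separated input layout is an assumption (as in the preparation sketch in section 1), and the exact code in the original post may differ.

```python
#!/usr/bin/env python
# Hypothetical sketch of Bavgmapper.py: emit (ISBN, rating) pairs from
# tab-separated UserID, ISBN, Rating lines. The UserID column is read but
# not used, in keeping with the linked-dataset rationale above.
import sys

for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) != 3:
        continue  # skip malformed lines
    user_id, isbn, rating = fields
    try:
        float(rating)
    except ValueError:
        continue  # skip non-numeric ratings
    print('%s\t%s' % (isbn, rating))
```

The reducer.

A matching sketch for Bavgreducer.py, which averages the ratings for each ISBN key; it also assumes the sorted, tab-separated mapper output.

```python
#!/usr/bin/env python
# Hypothetical sketch of Bavgreducer.py: average the ratings for each ISBN key.
import sys

current_isbn = None
total = 0.0
count = 0

def emit(isbn, total, count):
    if isbn is not None and count > 0:
        print('%s\t%.4f' % (isbn, total / count))

for line in sys.stdin:
    isbn, rating = line.strip().split('\t', 1)
    try:
        rating = float(rating)
    except ValueError:
        continue  # skip malformed ratings
    if isbn == current_isbn:
        total += rating
        count += 1
    else:
        emit(current_isbn, total, count)
        current_isbn = isbn
        total = rating
        count = 1

emit(current_isbn, total, count)
```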
Ruby Mapper and Reducer set for the rating category counts
The Ruby mapper-reducer set for the rating category
counts was prepared using the tutorial in this post.
The mapper.
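A minimal sketch of the Ruby mapper (Bmapper.rb); it assumes one rating value per input line and emits tab-separated (rating, 1) pairs. The exact code in the original post may differ.

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch of Bmapper.rb: emit a count of 1 for each rating value read from stdin.
STDIN.each_line do |line|
  rating = line.strip
  next if rating.empty?  # skip blank lines
  puts "#{rating}\t1"
end
```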
The reducer.
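And a matching sketch for the Ruby reducer (Breducer.rb), which sums the counts for each rating-value key in the sorted mapper output.

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch of Breducer.rb: sum the counts for each rating-value key.
current_key = nil
current_count = 0

STDIN.each_line do |line|
  key, count = line.strip.split("\t")
  next if key.nil? || count.nil?  # skip malformed lines
  if key == current_key
    current_count += count.to_i
  else
    puts "#{current_key}\t#{current_count}" unless current_key.nil?
    current_key = key
    current_count = count.to_i
  end
end
puts "#{current_key}\t#{current_count}" unless current_key.nil?
```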
The mapper-reducer sets complete the set up for the
Hadoop Streaming part of the job. The next step is to process the datasets in Hadoop and Spark.
3. Process the data in Hadoop and Spark
The Hadoop Streaming configuration that will be used in the illustration is the standard setup, without any modifications to enhance performance. The Scala Spark shell setup is the standard local mode setup with four cores, without any jar files and/or Apache Maven dependencies (using Maven coordinates).
UserID counts using the Scala Spark shell
The Scala Spark shell running in local mode creates
a special interpreter-aware SparkContext in the variable called sc. The RDD for
the job is created using the SparkContext’s textFile method on the UserID
column text file from the local file system. The results are saved to a text file in the local file system. The Spark word count example can be found on the Apache Spark Examples website.
The assumption in the job is that the UserID column
text file is called InputData.txt, has been saved in the local folder - <InputData
folder>, and the local system output folder has been selected to be -
<Output folder>.
The word count is implemented by running the following simple
program within the Scala Spark shell.
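A minimal sketch of the program, adapted from the Spark Examples word count; the folder placeholders match the assumptions above.

```scala
// Word count on the UserID column file, run inside the Scala Spark shell
// (the interpreter already provides the SparkContext as sc).
val userIds = sc.textFile("<InputData folder>/InputData.txt")
val counts = userIds.flatMap(line => line.split(" "))
                    .map(userId => (userId, 1))
                    .reduceByKey(_ + _)
counts.saveAsTextFile("<Output folder>")
```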
This is an excerpt from my results.
ISBN counts using Hadoop Streaming (Python Mapper-Reducer set)
In this section the assumption is that the ISBN
column text file is called InputData.txt.
In terms of the Hadoop Streaming part of the code, the assumptions are that the InputData.txt file has been successfully loaded into the Hadoop Distributed File System (HDFS) input folder - <HDFS Input folder>; that the Hadoop Streaming jar file, hadoop-streaming-2.6.0.jar, is located in the local system folder - <hadoop-streaming-2.6.0.jar local folder>; that the Python mapper (BCmapper.py) is located in the local system folder - <Python mapper>; that the Python reducer (BCreducer.py) is located in the local system folder - <Python reducer>; and that the output folder in HDFS has been selected to be - <HDFS Output folder>. The next step is to run the following command on Ubuntu Server.
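A sketch of what that command might look like, assuming the mapper and reducer scripts are executable and carry the usual #!/usr/bin/env python shebang lines; the paths are the placeholders defined above.

```bash
hadoop jar <hadoop-streaming-2.6.0.jar local folder>/hadoop-streaming-2.6.0.jar \
  -input <HDFS Input folder>/InputData.txt \
  -output <HDFS Output folder> \
  -mapper BCmapper.py \
  -reducer BCreducer.py \
  -file <Python mapper>/BCmapper.py \
  -file <Python reducer>/BCreducer.py
```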
This is an excerpt from my results.
ISBN average ratings using Hadoop Streaming (Python Mapper-Reducer set)
In this section the assumption is that the UserID, ISBN and Rating column text file is called InputData.txt.
In terms of the Hadoop Streaming part of the code, the assumptions are that the InputData.txt file has been successfully loaded into the HDFS input folder - <HDFS Input folder>; that the Hadoop Streaming jar file, hadoop-streaming-2.6.0.jar, is located in the local system folder - <hadoop-streaming-2.6.0.jar local folder>; that the Python mapper (Bavgmapper.py) is located in the local system folder - <Python mapper>; that the Python reducer (Bavgreducer.py) is located in the local system folder - <Python reducer>; and that the output folder in HDFS has been selected to be - <HDFS Output folder>. The next step is to run the following command on Ubuntu Server.
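A similar sketch of the command for this job, under the same assumptions as before, with the average-rating mapper and reducer substituted.

```bash
hadoop jar <hadoop-streaming-2.6.0.jar local folder>/hadoop-streaming-2.6.0.jar \
  -input <HDFS Input folder>/InputData.txt \
  -output <HDFS Output folder> \
  -mapper Bavgmapper.py \
  -reducer Bavgreducer.py \
  -file <Python mapper>/Bavgmapper.py \
  -file <Python reducer>/Bavgreducer.py
```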
This is an excerpt from my results.
Rating counts using Hadoop Streaming (Ruby Mapper-Reducer set)
In this section the assumption is that the Rating column text file is called InputData.txt.
In terms of the Hadoop Streaming part of the code, the assumptions are that the InputData.txt file has been successfully loaded into the HDFS input folder - <HDFS Input folder>; that the Hadoop Streaming jar file, hadoop-streaming-2.6.0.jar, is located in the local system folder - <hadoop-streaming-2.6.0.jar local folder>; that the Ruby mapper (Bmapper.rb) is located in the local system folder - <Ruby mapper>; that the Ruby reducer (Breducer.rb) is located in the local system folder - <Ruby reducer>; and that the output folder in HDFS has been selected to be - <HDFS Output folder>. The next step is to run the following command on Ubuntu Server.
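And a sketch of the command for the Ruby mapper-reducer set, assuming the scripts are executable with #!/usr/bin/env ruby shebang lines.

```bash
hadoop jar <hadoop-streaming-2.6.0.jar local folder>/hadoop-streaming-2.6.0.jar \
  -input <HDFS Input folder>/InputData.txt \
  -output <HDFS Output folder> \
  -mapper Bmapper.rb \
  -reducer Breducer.rb \
  -file <Ruby mapper>/Bmapper.rb \
  -file <Ruby reducer>/Breducer.rb
```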
This is an excerpt from my results.
4. Results summary and highlights
The results yielded some interesting book rating information. The results can be summarized neatly using
bar charts, summary statistics and histograms. The bar charts and histograms
can be generated in R using the lessR package. The specific functions are BarChart
for the bar chart and Histogram for the histogram. The summary statistics can
be generated in R using the stat.desc function in the pastecs package.
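As a rough guide, the charts and summary statistics for one of the output files could be produced along the following lines; the output file name and column names are hypothetical, and the call details may differ slightly depending on your lessR version.

```r
# Hypothetical sketch: summarize an ISBN rating-count output file in R.
library(lessR)
library(pastecs)

isbn_counts <- read.table("part-00000", sep = "\t",
                          col.names = c("ISBN", "count"),
                          colClasses = c("character", "numeric"))

Histogram(count, data = isbn_counts)  # histogram of rating counts per ISBN
stat.desc(isbn_counts$count)          # nbr.val, sum, mean, and so on
```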
UserID Ratings
The UserID with the most ratings was UserID 11676 with 13602 ratings.
The basic summary statistics provide analysis information for the UserID rating counts and diagnostic checks for the Spark job. The key metric is the rating count total of 1149780. The number of UserIDs (nbr.val of 105283) reflects the anonymized UserIDs.
ISBN Ratings
The ISBN with the most ratings was ISBN 971880107 with 2504 ratings.
The basic summary statistics provide analysis information for the ISBN rating counts and diagnostic checks for the Python Hadoop Streaming job. The key metrics are the rating count total of 1149780 and the number of ISBNs (nbr.val) of 339528. The number of ISBNs (nbr.val) balances with the number of ISBNs from the Python Hadoop Streaming job for the rating averages.
ISBN Rating Averages
The histogram of rating averages shows that the implicit rating average had the highest count.
The basic summary statistics provide analysis information for the ISBN rating averages and diagnostic checks for the Python Hadoop Streaming job. The key metric is the nbr.val of 339528 (the number of ISBNs), which balances with the Python Hadoop Streaming job for the ISBN counts.
ISBN Rating counts
The ISBN rating value counts show that the implicit rating (0) had the most counts.
The basic summary statistics provide analysis information for the rating value counts and diagnostic checks for the Ruby Hadoop Streaming job. The key metric is the sum of 1149780 (the number of ratings), which balances with the Spark job UserID rating counts total and the Python Hadoop Streaming job ISBN rating counts total. The availability of metrics for diagnostic checks across jobs illustrates the advantages of the linked nature of the dataset.
Summary
The illustration shows how to set up a scheme to extract some basic results from the Book-Crossing BX-Book-Ratings dataset. The dataset is very rich in information about book user ratings. The scheme was designed to yield results that can be used with results from the GroupLens MovieLens and other GroupLens datasets. For example, some movies in the MovieLens datasets have book counterparts in the Book-Crossing dataset and vice versa.
Combined, the results will provide for a richer analysis of user ratings across the products in the GroupLens datasets. I hope this post proves useful in your big data analyses.
Interested in other posts about Hadoop Streaming and cloud computing?
Check out my other posts.
Or subscribe to the Stats Cosmos RSS feeds to keep updated.
Alternatively, check out the Stats Cosmos services page.
Stats Cosmos services
Or our blog resources page and training course.
Stats Cosmos blog resources
Stats Cosmos training
Sources:
http://bit.ly/1LSkoOI
http://bit.ly/1Yzm2qe
http://bit.ly/1M7MAYL
http://bit.ly/1R89nos
http://ibm.co/1T3h0ml
http://bit.ly/1Qc7Gc8
http://bit.ly/1oXvms2
http://bit.ly/1SN27EA
http://bit.ly/1p9hgn7
http://bit.ly/1Watnek
http://bit.ly/1QNk68v
http://bit.ly/1nh2Osx