This post is designed for a joint installation of Ubuntu Server 14.04.3 LTS, a single-node Apache Hadoop 2.6.0 cluster and Apache Spark 1.5.1 (pre-built for Hadoop 2.6 and later). The steps outlined involve setting up a job to summarize the GroupLens Book-Crossing dataset using Hadoop Streaming and Spark word count (in local mode). The Hadoop Streaming part of the job is implemented using mapper-reducer sets written in Python and Ruby. The Spark word count part of the job is implemented with the SparkContext in the Scala Spark shell (in local mode).
The Book-Crossing dataset is a great information source for book ratings. The information is particularly useful
when analyzed in relation to the GroupLens MovieLens datasets and other
GroupLens datasets.
1. Prepare the data
The Book-Crossing data was collected by Cai-Nicolas Ziegler in a 4-week crawl (during the August/September 2004 period) from the
Book-Crossing community with the kind permission of Ron Hornbaker (CTO of
Humankind Systems). The reference to the dataset also has excellent additional resources on methods for book-rating-related analyses.
The specific dataset considered in this illustration
is the BX-Book-Ratings dataset. The dataset version selected for the
illustration is the CSV Dump version of the BX-Book-Ratings dataset. The other
version that is available is the SQL Dump.
The BX-Book-Ratings dataset contains three variables:
- UserID
- ISBN
- Rating (for each ISBN)
The summary of the notes on the BX-Book-Ratings Rating variable is as follows:
- Ratings are either explicit, expressed on a scale from one to ten (with higher values denoting higher appreciation), or implicit, expressed by 0.
In this illustration I will show how to determine
the number of UserIDs that rated, the number of ISBNs that were rated, the
average rating for each ISBN and the number of ratings in each ISBN rating score
category.
In a simple MapReduce approach to the job, this amounts
to four jobs. These are:
- Counting the number of UserIDs in a UserID text file created from the first column of the dataset
- Counting the number of ISBNs in an ISBN text file created from the second column of the dataset
- Calculating the average rating for each ISBN in a text file created from the second and third columns (or all the columns) of the dataset
- Calculating the number of times each score rating value occurs in a text file created from the third column of the dataset
The approach will be to show how to complete the
jobs in the following manner:
- Complete the first job in Spark running on local mode using Resilient Distributed Dataset (RDD) processing methods
- Complete the second and third jobs using Hadoop Streaming with mapper-reducer sets written in Python
- Complete the fourth job using Hadoop Streaming with a mapper-reducer set written in Ruby
The resulting output files yield the required
information for analysis (or further processing).
In this illustration the ISBN text file must be created in such a manner that the ISBNs are more than one character long (i.e. all field lengths must be greater than one).
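As a rough guide, the column text files could be prepared from the CSV dump with a short Python script along the following lines. This is a minimal sketch: the input and output file names are hypothetical, and it assumes the CSV dump is semicolon-delimited with quoted fields and a header row; adjust the details to your copy of the dataset.

```python
import csv

# Hypothetical sketch: split the BX-Book-Ratings CSV dump into the four
# input text files used in the jobs below. File names and the semicolon
# delimiter are assumptions.
with open('BX-Book-Ratings.csv', 'r') as src, \
     open('UserID.txt', 'w') as users, \
     open('ISBN.txt', 'w') as isbns, \
     open('Ratings.txt', 'w') as ratings, \
     open('InputData.txt', 'w') as combined:
    reader = csv.reader(src, delimiter=';', quotechar='"')
    next(reader)  # skip the header row
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed lines
        user_id, isbn, rating = row
        users.write(user_id + '\n')
        if len(isbn) > 1:  # keep only ISBNs longer than one character
            isbns.write(isbn + '\n')
        ratings.write(rating + '\n')
        combined.write('\t'.join(row) + '\n')  # tab-separated three-column file
```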
The average rating for each ISBN is obtained by using
the ISBN column and the Ratings column of the dataset. In this illustration the whole dataset is used for the third job (i.e. the UserID column is also included with the ISBN and Ratings columns). The rationale
for this approach is to take advantage of the linked nature of the dataset.
For example, the same dataset can be used (with some
minor adjustments to the Python code or by using another mapper-reducer set) to
calculate the average of the ratings made by each user.
The average ISBN rating mapper-reducer set Python
code can, however, be easily modified to exclude the UserID column if a more compact
dataset is desired for the job.
The four resulting datasets prepared in this manner
can be processed using Hadoop and Spark. The next step is to prepare the
mapper-reducer sets for the Hadoop Streaming part of the job. The Spark part of
the job can be run using a simple program (within a SparkContext) in the Scala Spark shell run in local mode.
2. Prepare the mapper and reducer sets
Python Mapper and Reducer set for the ISBN counts
The Python mapper-reducer set for the ISBN counts was
prepared using the tutorial in this post.
The mapper.
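A minimal sketch of what this mapper (BCmapper.py) might look like, following the streaming word-count pattern from the tutorial; it assumes one ISBN per input line and emits tab-separated (ISBN, 1) pairs. The exact code in the original post may differ.

```python
#!/usr/bin/env python
# Hypothetical sketch of BCmapper.py: emit a count of 1 for each ISBN read from stdin.
import sys

for line in sys.stdin:
    isbn = line.strip()
    if isbn:  # skip blank lines
        print('%s\t%s' % (isbn, 1))
```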
The reducer.
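And a matching sketch for the reducer (BCreducer.py), which relies on the streaming framework sorting the mapper output by key; again, a sketch rather than the exact code from the original post.

```python
#!/usr/bin/env python
# Hypothetical sketch of BCreducer.py: sum the counts for each ISBN key.
import sys

current_isbn = None
current_count = 0

for line in sys.stdin:
    isbn, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue  # skip malformed counts
    if isbn == current_isbn:
        current_count += count
    else:
        if current_isbn is not None:
            print('%s\t%d' % (current_isbn, current_count))
        current_isbn = isbn
        current_count = count

if current_isbn is not None:
    print('%s\t%d' % (current_isbn, current_count))
```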
Python Mapper and Reducer set for the ISBN rating averages
The Python mapper-reducer set for the ISBN average ratings was prepared using the tutorial in this post.
The mapper.
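A minimal sketch of the average-rating mapper (Bavgmapper.py); the three-column, tab-separated input layout is an assumption (as in the preparation sketch in section 1), and the exact code in the original post may differ.

```python
#!/usr/bin/env python
# Hypothetical sketch of Bavgmapper.py: emit (ISBN, rating) pairs from
# tab-separated UserID, ISBN, Rating lines. The UserID column is read but
# not used, in keeping with the linked-dataset rationale above.
import sys

for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) != 3:
        continue  # skip malformed lines
    user_id, isbn, rating = fields
    try:
        float(rating)
    except ValueError:
        continue  # skip non-numeric ratings
    print('%s\t%s' % (isbn, rating))
```

The reducer.

A matching sketch for Bavgreducer.py, which averages the ratings for each ISBN key; it also assumes the sorted, tab-separated mapper output.

```python
#!/usr/bin/env python
# Hypothetical sketch of Bavgreducer.py: average the ratings for each ISBN key.
import sys

current_isbn = None
total = 0.0
count = 0

def emit(isbn, total, count):
    if isbn is not None and count > 0:
        print('%s\t%.4f' % (isbn, total / count))

for line in sys.stdin:
    isbn, rating = line.strip().split('\t', 1)
    try:
        rating = float(rating)
    except ValueError:
        continue  # skip malformed ratings
    if isbn == current_isbn:
        total += rating
        count += 1
    else:
        emit(current_isbn, total, count)
        current_isbn = isbn
        total = rating
        count = 1

emit(current_isbn, total, count)
```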
Ruby Mapper and Reducer set for the rating category counts
The Ruby mapper-reducer set for the rating category
counts was prepared using the tutorial in this post.
The mapper.
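A minimal sketch of the Ruby mapper (Bmapper.rb); it assumes one rating value per input line and emits tab-separated (rating, 1) pairs. The exact code in the original post may differ.

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch of Bmapper.rb: emit a count of 1 for each rating value read from stdin.
STDIN.each_line do |line|
  rating = line.strip
  next if rating.empty?  # skip blank lines
  puts "#{rating}\t1"
end
```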
The reducer.
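And a matching sketch for the Ruby reducer (Breducer.rb), which sums the counts for each rating-value key in the sorted mapper output.

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch of Breducer.rb: sum the counts for each rating-value key.
current_key = nil
current_count = 0

STDIN.each_line do |line|
  key, count = line.strip.split("\t")
  next if key.nil? || count.nil?  # skip malformed lines
  if key == current_key
    current_count += count.to_i
  else
    puts "#{current_key}\t#{current_count}" unless current_key.nil?
    current_key = key
    current_count = count.to_i
  end
end
puts "#{current_key}\t#{current_count}" unless current_key.nil?
```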
The mapper-reducer sets complete the set up for the
Hadoop Streaming part of the job. The next step is to process the datasets in Hadoop and Spark.
3. Process the data in Hadoop and Spark
The Hadoop Streaming configuration that will be used in the illustration is the standard setup, without any modifications to enhance performance. The Scala Spark shell setup is the standard local mode setup with four cores, without any jar files and/or Apache Maven dependencies (using Maven coordinates).
UserID counts using the Scala Spark shell
The Scala Spark shell running in local mode creates
a special interpreter-aware SparkContext in the variable called sc. The RDD for
the job is created using the SparkContext’s textFile method on the UserID
column text file from the local file system. The results are saved to a text file in the local file system. The Spark word count example can be found on the Apache Spark Examples website.
The assumption in the job is that the UserID column
text file is called InputData.txt, has been saved in the local folder - <InputData
folder>, and the local system output folder has been selected to be -
<Output folder>.
The word count is implemented by running the following simple
program within the Scala Spark shell.
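A minimal sketch of the program, adapted from the Spark Examples word count; the folder placeholders match the assumptions above.

```scala
// Word count on the UserID column file, run inside the Scala Spark shell
// (the interpreter already provides the SparkContext as sc).
val userIds = sc.textFile("<InputData folder>/InputData.txt")
val counts = userIds.flatMap(line => line.split(" "))
                    .map(userId => (userId, 1))
                    .reduceByKey(_ + _)
counts.saveAsTextFile("<Output folder>")
```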
This is an excerpt from my results.
ISBN counts using Hadoop Streaming (Python Mapper-Reducer set)
In this section the assumption is that the ISBN
column text file is called InputData.txt.
In terms of the Hadoop Streaming part of the code, the assumptions are that the InputData.txt file has been successfully loaded into the Hadoop Distributed File System (HDFS) input folder - <HDFS Input folder>; that the Hadoop Streaming jar file, hadoop-streaming-2.6.0.jar, is located in the local system folder - <hadoop-streaming-2.6.0.jar local folder>; that the Python mapper (BCmapper.py) is located in the local system folder - <Python mapper>; that the Python reducer (BCreducer.py) is located in the local system folder - <Python reducer>; and that the output folder in HDFS has been selected to be - <HDFS Output folder>. The next step is to run the following command on Ubuntu Server.
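A sketch of what that command might look like, assuming the mapper and reducer scripts are executable and carry the usual #!/usr/bin/env python shebang lines; the paths are the placeholders defined above.

```bash
hadoop jar <hadoop-streaming-2.6.0.jar local folder>/hadoop-streaming-2.6.0.jar \
  -input <HDFS Input folder>/InputData.txt \
  -output <HDFS Output folder> \
  -mapper BCmapper.py \
  -reducer BCreducer.py \
  -file <Python mapper>/BCmapper.py \
  -file <Python reducer>/BCreducer.py
```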
This is an excerpt from my results.
ISBN average ratings using Hadoop Streaming (Python Mapper-Reducer set)
In this section the assumption is that the UserID, ISBN and Rating column text file is called InputData.txt.
In terms of the Hadoop Streaming part of the code, the assumptions are that the InputData.txt file has been successfully loaded into the HDFS input folder - <HDFS Input folder>; that the Hadoop Streaming jar file, hadoop-streaming-2.6.0.jar, is located in the local system folder - <hadoop-streaming-2.6.0.jar local folder>; that the Python mapper (Bavgmapper.py) is located in the local system folder - <Python mapper>; that the Python reducer (Bavgreducer.py) is located in the local system folder - <Python reducer>; and that the output folder in HDFS has been selected to be - <HDFS Output folder>. The next step is to run the following command on Ubuntu Server.
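A similar sketch of the command for this job, under the same assumptions as before, with the average-rating mapper and reducer substituted.

```bash
hadoop jar <hadoop-streaming-2.6.0.jar local folder>/hadoop-streaming-2.6.0.jar \
  -input <HDFS Input folder>/InputData.txt \
  -output <HDFS Output folder> \
  -mapper Bavgmapper.py \
  -reducer Bavgreducer.py \
  -file <Python mapper>/Bavgmapper.py \
  -file <Python reducer>/Bavgreducer.py
```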
This is an excerpt from my results.
Rating counts using Hadoop Streaming (Ruby Mapper-Reducer set)
In this section the assumption is that the Rating column text file is called InputData.txt.
In terms of the Hadoop Streaming part of the code, the assumptions are that the InputData.txt file has been successfully loaded into the HDFS input folder - <HDFS Input folder>; that the Hadoop Streaming jar file, hadoop-streaming-2.6.0.jar, is located in the local system folder - <hadoop-streaming-2.6.0.jar local folder>; that the Ruby mapper (Bmapper.rb) is located in the local system folder - <Ruby mapper>; that the Ruby reducer (Breducer.rb) is located in the local system folder - <Ruby reducer>; and that the output folder in HDFS has been selected to be - <HDFS Output folder>. The next step is to run the following command on Ubuntu Server.
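And a sketch of the command for the Ruby mapper-reducer set, assuming the scripts are executable with #!/usr/bin/env ruby shebang lines.

```bash
hadoop jar <hadoop-streaming-2.6.0.jar local folder>/hadoop-streaming-2.6.0.jar \
  -input <HDFS Input folder>/InputData.txt \
  -output <HDFS Output folder> \
  -mapper Bmapper.rb \
  -reducer Breducer.rb \
  -file <Ruby mapper>/Bmapper.rb \
  -file <Ruby reducer>/Breducer.rb
```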
This is an excerpt from my results.
4. Results summary and highlights
The results yielded some interesting book rating information. The results can be summarized neatly using
bar charts, summary statistics and histograms. The bar charts and histograms
can be generated in R using the lessR package. The specific functions are BarChart
for the bar chart and Histogram for the histogram. The summary statistics can
be generated in R using the stat.desc function in the pastecs package.
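As a rough guide, the charts and summary statistics for one of the output files could be produced along the following lines; the output file name and column names are hypothetical, and the call details may differ slightly depending on your lessR version.

```r
# Hypothetical sketch: summarize an ISBN rating-count output file in R.
library(lessR)
library(pastecs)

isbn_counts <- read.table("part-00000", sep = "\t",
                          col.names = c("ISBN", "count"),
                          colClasses = c("character", "numeric"))

Histogram(count, data = isbn_counts)  # histogram of rating counts per ISBN
stat.desc(isbn_counts$count)          # nbr.val, sum, mean, and so on
```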
UserID Ratings
The UserID with the most ratings was UserID 11676 with 13602 ratings.
The basic summary statistics provide analysis information for the UserID rating counts and diagnostic checks for the Spark job. The key metric is the rating count total of 1149780. The number of UserIDs (nbr.val of 105283) reflects the anonymized UserIDs.
ISBN Ratings
The ISBN with the most ratings was ISBN 971880107 with 2504 ratings.
The basic summary statistics provide analysis information for the ISBN rating counts and diagnostic checks for the Python Hadoop Streaming job. The key metrics are the rating count total of 1149780 and the number of ISBNs (nbr.val) of 339528. The number of ISBNs (nbr.val) balances with the number of ISBNs from the Python Hadoop Streaming job for the rating averages.
ISBN Rating Averages
The histogram of rating averages shows that the implicit rating average had the highest count.
The basic summary statistics provide analysis information for the ISBN rating averages and diagnostic checks for the Python Hadoop Streaming job. The key metric is the nbr.val of 339528 (the number of ISBNs), which balances with the Python Hadoop Streaming job for the ISBN counts.
ISBN Rating counts
The ISBN rating value counts show that the implicit rating (0) had the most counts.
The basic summary statistics provide analysis information for the rating value counts and diagnostic checks for the Ruby Hadoop Streaming job. The key metric is the sum of 1149780 (the number of ratings), which balances with the Spark job UserID rating counts total and the Python Hadoop Streaming job ISBN rating counts total. The availability of metrics for diagnostic checks across jobs illustrates the advantages of the linked nature of the dataset.
Summary
The illustration shows how to set up a scheme to extract some basic results from the Book-Crossing BX-Book-Ratings dataset. The dataset is very rich in information about book user ratings. The scheme was designed to yield results that can be used with results from the GroupLens MovieLens and other GroupLens datasets. For example, some movies in the MovieLens datasets have book counterparts in the Book-Crossing dataset and vice versa.
Combined, the results will provide for a richer analysis of user ratings across the products in the GroupLens datasets. I hope this post proves useful in your big data analyses.
Interested in other posts about Hadoop Streaming and cloud computing?
Check out my other posts.
Or subscribe to the Stats Cosmos RSS feeds to keep updated.
Alternatively, check out the Stats Cosmos services page.
Stats Cosmos services
Or our blog resources page and training course.
Stats Cosmos blog resources
Stats Cosmos training
Sources:
http://bit.ly/1LSkoOI
http://bit.ly/1Yzm2qe
http://bit.ly/1M7MAYL
http://bit.ly/1R89nos
http://ibm.co/1T3h0ml
http://bit.ly/1Qc7Gc8
http://bit.ly/1oXvms2
http://bit.ly/1SN27EA
http://bit.ly/1p9hgn7
http://bit.ly/1Watnek
http://bit.ly/1QNk68v
http://bit.ly/1nh2Osx