Wednesday, March 28, 2018


How to analyze TF Cosine-based Similarity measures for the Last.fm Social Tagging System with Apache Hive and Apache Spark SQL






*Photo by chuttersnap on Unsplash


This post is one in a series of posts designed for a joint installation of Apache Flink, Apache Hadoop, Apache Hive, MongoDB, Apache Pig, Apache Spark (pre-built for Hadoop) and Ubuntu. This post is designed for an installation of Apache Hive 2.1.1, Apache Hadoop 2.6.1, Apache Spark 2.2.0 (pre-built for Hadoop) and Ubuntu 16.04.2. The purpose of the illustrations in the posts is to show how one can construct content-based recommendation measures for the Last.fm social system using the GroupLens HetRec Last.fm dataset. The modelling framework for the similarity measure analysis is that outlined in Cantador, Bellogin and Vallet (2010). The post follows on from my previous post: How to summarize Last.fm Social Tagging System Profiles using Golang, Hadoop, MongoDB and Spark.

The calculation of the similarity measures for the analyses in the posts involves implementing seventeen MapReduces on the user_taggedartists.dat dataset. The first six MapReduces were implemented in the previous post.

The similarity measures considered (in the posts) are as follows:
  • TF-based Similarity
  • TF Cosine-based Similarity
  • TF-IDF Cosine-based Similarity
  • Okapi BM25-based Similarity
  • Okapi BM25 Cosine-based Similarity


In this post only the TF Cosine-based Similarity measure is considered. The methodology for the construction of the measures is considered in How to summarize Last.fm Social Tagging System Profiles using Golang, Hadoop, MongoDB and Spark.

The MapReduces in the post series are illustrated using piping and non-piping methods. The purpose of the approach is to provide a choice of methods according to each setting. The advantage of this approach is that it inherently highlights the features available from each application programming interface (API). This, in turn, provides a ready portfolio of platform-specific methods from which to implement the MapReduces. For example, one can implement the MapReduce using a Java (Apache) Spark pipe application, a Java Spark non-pipe application or a Java (Apache) Flink DataSet application. The three approaches inherently illustrate the kind of programming advantages that can be harnessed from the features of the Spark Java Resilient Distributed Dataset (RDD) and the Flink Java DataSet.


The Flink illustrations use non-piping methods. The Spark illustrations use a mixture of piping and non-piping methods. The (Apache) Hadoop, Hive and Pig parts of the illustrations use piping type methods. The Hadoop MapReduce illustration uses piping type methods through the Hadoop Streaming facility. The Hive illustration uses piping type methods through the Hive map and reduce commands. The Spark SQL part of the illustration uses piping type methods through the Hive2 transform command. The Pig part of the illustration uses piping type methods through the Pig stream facility.


In this post the piping type methods use mapper-reducer sets prepared in Java and Python. In the post series the MapReduces come in two main sizes, namely three set and one set. In the case of the three set MapReduces, the three set mapper-reducer sets can be used with the: Hadoop Streaming facility; Hive map and reduce commands; Pig stream facility; Java Spark pipe facility; Scala Spark pipe facility; SparkR pipe facility; PySpark pipe facility; and Spark SQL (using the Beeline interface to the Spark SQL Thrift Server) transform command.

The three set MapReduce piping type scripts are used to calculate the Cosine-based similarity measures (TF, TF-IDF, and Okapi BM25).


1. Prepare the data



The implementation of similarity measure MapReduces one to four, which obtain the elements used in the proposed profile and recommendation models (Table one) in Cantador, Bellogin and Vallet (2010), was illustrated in the first post in the series.

In this post the calculation involves using the output datasets from the first two MapReduces. The output dataset for the users will have {u_m, t_l} as the index/key and tf_um(t_l) as the value. The output dataset for the items will have {i_n, t_l} as the index/key and tf_in(t_l) as the value.

From the {{user id, tag}, user tag frequency} key-value pairs ({u_m, t_l}, tf_um(t_l)) and the {{item id, tag}, item tag frequency} key-value pairs ({i_n, t_l}, tf_in(t_l)) in the output files from the first two MapReduces, create new combined key-value combinations ({u_m, i_n}, tf_um(t_l), tf_in(t_l)) without the tag part of the uncombined key indices (i.e. keep only the user index u_m and the item index i_n, respectively) for the similarity measure MapReduce.

In the MapReduce mapping phase the numerator entry values can be the cross-products tf_um(t_l)*tf_in(t_l) (i.e. the t_l entry must be the same in the product) and the denominator values can be the squares of the individual values, (tf_um(t_l))^2 and (tf_in(t_l))^2. In the reduce phase the sums can be output by key for the numerator and the square roots of the sums can be output by key for the denominators.
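Putting these terms together, the quantity the MapReduce assembles is the TF cosine similarity between the user and item tag profiles, written here in the notation of the posts:

```latex
\cos_{tf}(u_m, i_n) =
  \frac{\sum_{l} tf_{u_m}(t_l)\, tf_{i_n}(t_l)}
       {\sqrt{\sum_{l} \left(tf_{u_m}(t_l)\right)^2}\;
        \sqrt{\sum_{l} \left(tf_{i_n}(t_l)\right)^2}}
```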

The operational aspects of the calculation in this illustration are as follows:

The combined tuple will initially take the form {(u_m, t_l, i_n, t_l), tf_um(t_l), tf_in(t_l)}. One then changes the tuple to, say, {(u_m;i_n), tf_um(t_l), tf_in(t_l)} for the actual MapReduce. It is important to make sure that the {tf_um(t_l), tf_in(t_l)} part of the tuple always pertains to the same tag t_l during the data preparation, because the tag information is lost in the mapping phase once the keys no longer identify t_l for each key-value pair. Hence, one has to make sure that all the t_l's for each (u_m;i_n) key are captured beforehand. The products for the numerator can be programmed into the mapper. The squares for the denominators can also be programmed into the mapper.

In the reduce phase, the values of the numerator products can be summed and the totals output by key-value combination. In the case of the denominator entries, the square roots of the sums can be output for each key-value combination. This generates the outputs required by the similarity measure formulae in the case of the three set MapReduces, and is how the three set MapReduces can be implemented.

The next step is to construct the mapper-reducer sets in order to implement the MapReduces for the similarity measure in Hive and Spark SQL.


2. Prepare the mapper-reducer sets



Java mapper-reducer set


The Java mapper-reducer set was prepared using the tutorial in this post. The mapper is as follows:
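A minimal streaming-style sketch of such a mapper, assuming the combined input records are tab-separated lines of the form user_id;item_id, tf_um(t_l), tf_in(t_l) (the class name and I/O layout are illustrative):

```java
// Threesetmapper.java - sketch of the three set mapper (streaming style).
// Assumed input (tab separated): user_id;item_id <TAB> tf_um(t_l) <TAB> tf_in(t_l)
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class Threesetmapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.trim().split("\t");
            if (fields.length != 3) {
                continue; // skip malformed records
            }
            double tfUser = Double.parseDouble(fields[1]);
            double tfItem = Double.parseDouble(fields[2]);
            // numerator term: tf_um(t_l) * tf_in(t_l)
            // denominator terms: (tf_um(t_l))^2 and (tf_in(t_l))^2
            System.out.println(fields[0] + "\t" + (tfUser * tfItem) + "\t"
                    + (tfUser * tfUser) + "\t" + (tfItem * tfItem));
        }
    }
}
```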



The reducer is as follows:
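A matching sketch of the reducer, assuming the framework delivers the mapper output sorted by key; per key it sums the numerator products, takes square roots of the summed squares and, for convenience, also emits the resulting costf value:

```java
// Threesetreducer.java - sketch of the three set reducer (streaming style).
// Assumes key-sorted input of the form: key <TAB> product <TAB> sq_user <TAB> sq_item
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class Threesetreducer {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        String currentKey = null;
        double num = 0.0, dUser = 0.0, dItem = 0.0;
        while ((line = in.readLine()) != null) {
            String[] f = line.trim().split("\t");
            if (f.length != 4) {
                continue; // skip malformed records
            }
            if (currentKey != null && !f[0].equals(currentKey)) {
                emit(currentKey, num, dUser, dItem);
                num = dUser = dItem = 0.0;
            }
            currentKey = f[0];
            num += Double.parseDouble(f[1]);
            dUser += Double.parseDouble(f[2]);
            dItem += Double.parseDouble(f[3]);
        }
        if (currentKey != null) {
            emit(currentKey, num, dUser, dItem);
        }
    }

    private static void emit(String key, double num, double dUser, double dItem) {
        // numerator sum, square-rooted denominator sums, and the cosine value
        double su = Math.sqrt(dUser);
        double si = Math.sqrt(dItem);
        double costf = (su * si) > 0.0 ? num / (su * si) : 0.0;
        System.out.println(key + "\t" + num + "\t" + su + "\t" + si + "\t" + costf);
    }
}
```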



The next step is to compile the two files into classes with the javac command:
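For example, assuming the sources sit in the local MapReduce folder:

```bash
# Compile the mapper and reducer sources into class files
cd <Local System MapReduce Folder>
javac Threesetmapper.java
javac Threesetreducer.java
```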


The Java classes can be run using shell scripts. The shell script to run the mapper:
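A minimal wrapper of the kind that might be used, assuming a single-node setup where the compiled classes are reachable at the local MapReduce folder:

```bash
#!/bin/bash
# Threesetmapper.sh - pipes stdin through the compiled mapper class
java -classpath <Local System MapReduce Folder> Threesetmapper
```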


The shell script to run the reducer:
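And a matching wrapper for the reducer:

```bash
#!/bin/bash
# Threesetreducer.sh - pipes stdin through the compiled reducer class
java -classpath <Local System MapReduce Folder> Threesetreducer
```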



The chmod command can be used to give the files (Java, Java classes, and Bash) execution permission:
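For example (file names as above):

```bash
# Give the Java sources, compiled classes and Bash wrappers execution permission
chmod +x Threesetmapper.java Threesetmapper.class Threesetmapper.sh
chmod +x Threesetreducer.java Threesetreducer.class Threesetreducer.sh
```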



Python mapper-reducer set


The Python mapper-reducer set was prepared using a framework outlined in this book and this post. The mapper is as follows:
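A minimal sketch of such a mapper, with the same assumed tab-separated input layout as the Java set (file name and layout are illustrative):

```python
#!/usr/bin/env python
# Threesetmapper.py - sketch of the three set mapper.
# Assumed input (tab separated): user_id;item_id <TAB> tf_um(t_l) <TAB> tf_in(t_l)
import sys

for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) != 3:
        continue  # skip malformed records
    key = fields[0]
    tf_user = float(fields[1])
    tf_item = float(fields[2])
    # numerator: tf_um(t_l) * tf_in(t_l); denominators: the squared values
    print('%s\t%f\t%f\t%f' % (key, tf_user * tf_item,
                              tf_user ** 2, tf_item ** 2))
```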



The reducer is as follows:
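A matching sketch of the reducer, again assuming key-sorted input; it emits the numerator sum, the square-rooted denominator sums and, for convenience, the resulting costf value:

```python
#!/usr/bin/env python
# Threesetreducer.py - sketch of the three set reducer.
# Assumes key-sorted input of the form: key <TAB> product <TAB> sq_user <TAB> sq_item
import sys
from math import sqrt

def emit(key, num, d_user, d_item):
    # numerator sum, square-rooted denominator sums, and the cosine value
    su, si = sqrt(d_user), sqrt(d_item)
    costf = num / (su * si) if su * si > 0 else 0.0
    print('%s\t%f\t%f\t%f\t%f' % (key, num, su, si, costf))

current_key = None
num_sum = d_user_sum = d_item_sum = 0.0

for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) != 4:
        continue  # skip malformed records
    key = fields[0]
    if current_key is not None and key != current_key:
        emit(current_key, num_sum, d_user_sum, d_item_sum)
        num_sum = d_user_sum = d_item_sum = 0.0
    current_key = key
    num_sum += float(fields[1])
    d_user_sum += float(fields[2])
    d_item_sum += float(fields[3])

if current_key is not None:
    emit(current_key, num_sum, d_user_sum, d_item_sum)
```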





The chmod command can be used to give the files execution permission:
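For example:

```bash
# Give the Python mapper and reducer execution permission
chmod +x Threesetmapper.py Threesetreducer.py
```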



The mapper and reducer files can be copied from the <Local System MapReduce Folder> folder to the <SPARK_HOME> folder for the Beeline processing.


3. Process the data in Hive



The three set MapReduces in the post series aim to introduce the different methods for calculating the three set (Cosine-based) Similarity measures using the map and reduce functions. The three set MapReduces in piping type form are implemented within the Hadoop MapReduce framework (using Hadoop Streaming, Pig stream command and Hive map/reduce commands) and the Spark in-memory framework (using the Spark pipe function and Hive2 transform command with SparkSQL).

The three set MapReduce can be implemented in Hive with the Bash-based Java three set mapper-reducer set, using the following script prepared according to the tutorial in this post and this post.
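A sketch of what such a script can look like, using Hive's ADD FILE, MAP and REDUCE commands. The table name threeset_input and the column names are assumptions, and the placeholders follow the arrangements listed below:

```sql
-- HiveThreesetscript.sql - illustrative sketch of the three set MapReduce in Hive
ADD FILE <Local System MapReduce Folder>/Threesetmapper.sh;
ADD FILE <Local System MapReduce Folder>/Threesetreducer.sh;

-- Table holding the combined {user;item} key and the two tag frequencies
CREATE TABLE IF NOT EXISTS threeset_input (
  user_item_key STRING,
  tf_user DOUBLE,
  tf_item DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '<HDFS Input Data Folder>/InputData.txt'
OVERWRITE INTO TABLE threeset_input;

-- Map, shuffle by key, then reduce with the wrapped Java classes
FROM (
  FROM threeset_input
  MAP user_item_key, tf_user, tf_item
  USING 'Threesetmapper.sh'
  AS (user_item_key STRING, num DOUBLE, sq_user DOUBLE, sq_item DOUBLE)
  CLUSTER BY user_item_key) mapped
INSERT OVERWRITE DIRECTORY '<HDFS Output Data Folder>'
REDUCE mapped.user_item_key, mapped.num, mapped.sq_user, mapped.sq_item
USING 'Threesetreducer.sh'
AS (user_item_key STRING, num DOUBLE, denom_user DOUBLE, denom_item DOUBLE, costf DOUBLE);
```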



The three set MapReduce can be implemented after making the following arrangements:

Input data: InputData.txt
Hadoop Distributed File System (HDFS) Input data folder: <HDFS Input Data Folder>
Local system Hive script folder: <Local System Hive script Folder>
Hive script: HiveThreesetscript.sql
Three set mapper: Threesetmapper.sh
Three set reducer: Threesetreducer.sh
Local system MapReduce folder for the mapper-reducer set: <Local System MapReduce Folder>
The HDFS output data folder: <HDFS Output Data Folder>

The script can be submitted to Hive using the following command:
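For example:

```bash
# Submit the script to Hive from the local Hive script folder
hive -f <Local System Hive script Folder>/HiveThreesetscript.sql
```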



This will yield the following output:
4. Check the results in Spark SQL



The results of the three set MapReduce in section three can be replicated with the Python mapper-reducer set using a Hive2 script submitted through the Beeline interface to the Spark SQL Thrift Server. The three set MapReduce can be implemented using the following Hive2 script prepared using the tutorial in this post and this post.
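A sketch of what such a Hive2 script can look like, using SELECT TRANSFORM for both the map and reduce stages. The table and column names are assumptions, and the placeholders follow the arrangements listed below:

```sql
-- BeelineThreesetscript.sql - illustrative sketch of the three set MapReduce
-- in Spark SQL via the Hive2 TRANSFORM command
ADD FILE <SPARK_HOME>/Threesetmapper.py;
ADD FILE <SPARK_HOME>/Threesetreducer.py;

CREATE TABLE IF NOT EXISTS threeset_input (
  user_item_key STRING,
  tf_user DOUBLE,
  tf_item DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '<Local System Input Data Folder>/InputData.txt'
OVERWRITE INTO TABLE threeset_input;

-- Map stage in the subquery, shuffle by key, reduce stage in the outer query
SELECT TRANSFORM (mapped.user_item_key, mapped.num, mapped.sq_user, mapped.sq_item)
  USING 'Threesetreducer.py'
  AS (user_item_key STRING, num DOUBLE, denom_user DOUBLE, denom_item DOUBLE, costf DOUBLE)
FROM (
  SELECT TRANSFORM (user_item_key, tf_user, tf_item)
    USING 'Threesetmapper.py'
    AS (user_item_key STRING, num DOUBLE, sq_user DOUBLE, sq_item DOUBLE)
  FROM threeset_input
  CLUSTER BY user_item_key) mapped;
```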



In order to implement the three set MapReduce in Spark SQL using Beeline the following arrangements can be made:

Input data: InputData.txt
Local system Input data folder: <Local System Input Data Folder>
Local system Beeline script folder: <Local System Beeline script Folder>
Beeline script: BeelineThreesetscript.sql
Three set mapper: Threesetmapper.py
Three set reducer: Threesetreducer.py
Local system folder where the Python mapper-reducer set is saved: <SPARK_HOME>
In the <SPARK_HOME> folder one can run the following commands (to start the Thrift Server and submit the script to Beeline):
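For example (the JDBC URL below is the default local Thrift Server endpoint and is an assumption; adjust as needed):

```bash
# Start the Spark SQL Thrift Server, then submit the script through Beeline
./sbin/start-thriftserver.sh
./bin/beeline -u jdbc:hive2://localhost:10000 \
  -f <Local System Beeline script Folder>/BeelineThreesetscript.sql
```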



This will yield the following output:

The next step is to stop the Thrift Server.
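For example:

```bash
# Stop the Spark SQL Thrift Server
./sbin/stop-thriftserver.sh
```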



The output data from the Hive query and the Spark SQL Thrift Server query through the Beeline interface yield independent results that can be used to check the analysis dataset.
5. Brief analysis





The cosine-based similarity measures the angle between the user profile vector u_m = {u_m,1, ..., u_m,L} and the item profile vector i_n = {i_n,1, ..., i_n,L}, and thus how closely the two tag profiles align. In the context of the modelling framework, items with a large value of this measure for a given user are potential candidates for the set of items that maximize the utility function g() for that user. These items can be recommended to the user.

In the output above, costf(u_m, i_n) for userid 1007 and itemid 913 is 0.887638, which yields an angle of 0.478606 radians (27.42214 degrees). This process can be used to find a bundle of items (whose measures are in the output dataset of the MapReduce) that would be best to recommend to the user with id 1007 in order to maximize the utility function g() over the available items in the system.
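As a quick arithmetic check of those figures:

```python
# Recover the angle implied by the reported costf value for userid 1007, itemid 913
from math import acos, degrees

costf = 0.887638
angle = acos(costf)
print(angle)           # ~0.478606 radians
print(degrees(angle))  # ~27.42214 degrees
```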


Conclusions


To recapitulate: let R be a totally ordered set and g : U × I → R a utility function that measures the usefulness of an item i_n to a user u_m. The aim of the analysis was, for each user u ∈ U, to find items i_max,u ∈ I, unknown to the user, that maximize the utility function g():

∀u ∈ U,  i_max,u = arg max_{i ∈ I} g(u, i).

The identified items can be recommended to the user.

The TF Cosine-based Similarity is easy to interpret and very useful for identifying items to recommend to users in a folksonomy like Last.fm. The similarity measures and elements used in the proposed profile and recommendation models in Cantador, Bellogin and Vallet (2010) provide a way to satisfy the aim of the analysis.





Interested in more Big data materials from the Stats Cosmos blog?


Check out my previous Big data posts
Or check out our statistics and e-learning services




Or check out our blog resources page



Sources

http://bit.ly/2G6CXNP
http://bit.ly/2G8gqjP
https://oreil.ly/2G4SuOg
http://bit.ly/2INOtzw
http://bit.ly/2DQjvD1
http://bit.ly/2FX4THU
http://bit.ly/2DQxJ6N
http://bit.ly/2pGKITQ
http://bit.ly/2ujdcbA
http://bit.ly/1SN27EA
http://bit.ly/2GtWmM9
http://bit.ly/2GcEXbv


Apache®, Apache Hadoop, Apache Hive, Apache Spark and the logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
