Sunday, February 14, 2016


How to set up the Hadoop 2.6.0 MapReduce wordcount example to summarize digital, social and mobile data

This post outlines a simple framework for analyzing digital, social and mobile (technology) data that can be set up using Microsoft Excel (Gnumeric on Ubuntu), Apache Hadoop version 2.6.0, R and SAS software. The setup requires only elementary programming knowledge, although Microsoft Visual Basic, R, RHadoop and Java programming can greatly simplify (and enhance) the process. The approach ultimately involves setting up a data lake which grows according to a Kronecker product multiplication. The mathematical foundation of the framework was introduced and discussed in my previous blog posts (linked at the end of this post).


The prerequisites for implementation are an Apache Hadoop 2.6.0 (Hadoop) installation (single node cluster and examples), Microsoft Excel (or any other Excel-like spreadsheet), R (or SAS software), elementary statistics and the Internet World Stats data sets (the 2014 internet user population, the 2012 Facebook population and the 2008 to 2014 population spatial time series).


1. Obtain the data sets





The first step is to load the population data into Microsoft Excel. The figures must then be converted into ordinal categories whose name lengths indicate the (relative) size of the values.


For example, sm_cat_one (small category one) has a name length that is shorter than sm_cat_two_ (small category two). This is how one can tell Apache Hadoop MapReduce that the word sm_cat_one is attached to a smaller number than sm_cat_two_.
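For illustration, this category assignment can be sketched in Python (the cut-offs, and the category names other than sm_cat_one and sm_cat_two_, are hypothetical; the actual cut-offs are left to the analyst):

```python
# Map a population figure to an ordinal category word whose name
# length grows with the size of the value. Cut-offs are hypothetical.
CATEGORIES = [
    (1_000_000, "sm_cat_one"),      # up to 1 million
    (10_000_000, "sm_cat_two_"),    # up to 10 million (one character longer)
    (100_000_000, "med_cat_one_"),
    (float("inf"), "large_cat_one_"),
]

def categorize(value):
    # Return the name of the first category whose upper cut-off
    # is at least as large as the value.
    for cutoff, name in CATEGORIES:
        if value <= cutoff:
            return name

print(categorize(500_000))      # sm_cat_one
print(categorize(50_000_000))   # med_cat_one_
```

Because the final cut-off is infinity, every value falls into some category, and the name lengths stay strictly increasing with the cut-offs.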
  



This sets up the classification for the 2014 (global) internet user population, 2014 (global) population and 2012 (global) Facebook user population data sets. The cut-offs can be chosen by the programmer/analyst. The next step is to set up the spatial time series variance-covariance data.


The details for calculating the spatial time series variance-covariance matrix for the global population using the Kronecker product is outlined in my previous blog post: 5 matrix decompositions for visualizing the global internet user population spatial time series variance-covariance matrix. The spatial time series variance-covariance matrix I selected for the analysis was the 2008 to 2014 global population spatial time series. 



The next step is to take the absolute value of the matrix entries in order to handle negative variance-covariances. An alternative procedure is to leave the matrix values as they are and adjust the value categories to represent both the sign and the size of the values. The 2008 to 2014 global internet user population spatial time series variance-covariance matrix can be set up similarly with different (or the same) cut-offs. The categorization of the variables needs to be adjusted because separable spatial time series variance-covariances are on a different scale to the original data.
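Following the Kronecker product construction from my earlier post, the absolute value step can be sketched in Python with NumPy (the two small covariance matrices below are hypothetical stand-ins for the actual spatial and temporal estimates):

```python
import numpy as np

# Hypothetical 2x2 spatial and temporal covariance matrices.
spatial_cov = np.array([[2.0, -0.5],
                        [-0.5, 1.0]])
temporal_cov = np.array([[1.5, 0.3],
                         [0.3, 0.8]])

# Separable spatial time series variance-covariance matrix
# (Kronecker product), then the absolute value to remove
# negative covariances before categorization.
vcov = np.kron(spatial_cov, temporal_cov)
abs_vcov = np.abs(vcov)
```

The absolute values can then be run through the same ordinal categorization as the population figures, with cut-offs adjusted for the covariance scale.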


The approach I will illustrate for the analysis uses a double squared scale of the original data: a squared scale for the spatial covariances and a squared scale for the time covariances. In my case the decision also involved an analysis of the magnitude of the variance-covariance values.

The data sets can then be saved to a text file. It is important to make sure that there are leading and trailing spaces around each word so that Hadoop can separate the words in the wordcount MapReduce job.
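A minimal sketch of this save step, assuming a hypothetical file name and word list:

```python
# Write the category words to a text file with a space before and
# after each word, so the wordcount tokenizer splits them cleanly.
# File name and words are hypothetical.
words = ["sm_cat_one", "sm_cat_two_", "sm_cat_one", "med_cat_one_"]

with open("analysis_2014_population.txt", "w") as f:
    for word in words:
        f.write(" " + word + " ")
```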



2. Process the data in Hadoop



The four data sets can then be loaded into your input folder on the Apache Hadoop Distributed File System (HDFS). The next step will be to analyze the data sets using the procedure outlined in the Apache Hadoop project.


For example, from an Ubuntu 14.04.3 terminal, the following command can be run from the directory where the MapReduce example programs are housed:

$ hadoop jar hadoop-mapreduce-examples-2.6.0.jar wordcount HDFS_input_folder/<analysis file>.txt HDFS_output_folder
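The counting that the wordcount example performs can be sketched locally in Python (this is an illustration of the map/reduce result, not the Hadoop implementation itself):

```python
from collections import Counter

# Simulate the wordcount example: split the analysis file's text on
# whitespace (the map step) and count each word (the reduce step).
def word_count(text):
    return Counter(text.split())

counts = word_count(" sm_cat_one  sm_cat_two_  sm_cat_one ")
print(counts["sm_cat_one"])  # 2
```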


2014 Global population



In my case the resulting category wordcounts for the 2014 global population were as follows.

[Chart: wordcounts per category for the 2014 global population]

The data can be extracted and plotted in R, SAS software or Microsoft Excel (or Gnumeric). The procedure can be further fine-tuned according to the requirements of the analyst.
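For example, the tab-separated part file that wordcount writes can be parsed into a small table before plotting (the sample lines below are hypothetical):

```python
# Parse Hadoop wordcount output (one "word<TAB>count" line per word)
# into a dictionary that can be plotted or exported to a spreadsheet.
def parse_wordcount(lines):
    counts = {}
    for line in lines:
        word, count = line.rsplit("\t", 1)
        counts[word] = int(count)
    return counts

# Hypothetical sample of part-r-00000 lines.
sample = ["sm_cat_one\t57", "sm_cat_two_\t112", "med_cat_one_\t34"]
print(parse_wordcount(sample))
```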


2014 Internet user population



In the case of the 2014 internet user population I obtained the following counts for the categories.

[Chart: wordcounts per category for the 2014 internet user population]


2012 Facebook users



In the case of the 2012 Facebook population I obtained the following counts for the categories.

[Chart: wordcounts per category for the 2012 Facebook user population]


2008 to 2014 Global population spatial time series variance-covariance matrix



In the case of the 2008 to 2014 spatial time series variance-covariance matrix, I obtained the following counts for the categories.

[Chart: wordcounts per category for the 2008 to 2014 variance-covariance matrix]


The next step is to head to the R console and find out which category contains the mean of the variance-covariance data (the absolute values of the original variance-covariance data). In my case I obtained the following output.


[R console output: the mean of the absolute variance-covariance values]


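The R console step can be mirrored as a sketch: compute the mean of the absolute variance-covariance values and locate the cut-off interval (category) that contains it (the values and cut-offs below are hypothetical):

```python
# Find which category's cut-off interval contains the mean of the
# absolute variance-covariance values. Cut-offs are hypothetical
# upper bounds, listed in ascending order.
cutoffs = [(1.0, "sm_cat_one"),
           (10.0, "sm_cat_two_"),
           (100.0, "large_cat_two")]

def mean_category(values):
    m = sum(values) / len(values)
    for upper, name in cutoffs:
        if m <= upper:
            return m, name
    return m, cutoffs[-1][1]  # fall back to the top category

m, cat = mean_category([0.2, 0.9, 5.0, 60.0])
print(cat)  # large_cat_two
```

Note how a single large value (60.0) pulls the mean into a high category even though most values are small, which is the behavior discussed below.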
The values can be plotted in R using the gvisBarChart function in the googleVis package.

[gvisBarChart: counts per category]


The mean falls in large_cat_two, which is a little counter-intuitive. It is important, however, to keep in mind that larger values pull the mean toward them (i.e. they carry a larger “implicit weight” in the equal-weight mean calculated in R). The second consideration is that the counts in each category depend on the cut-offs. In my case, the next logical step would be to repeat the procedure with the newly available information from Hadoop.


The exciting feature about the Hadoop output is that this scheme represents a home-made/simple/elementary decomposition of the spatial time series variance-covariance matrix. I have found that the decomposition, because of its simplicity, is useful for constructing (enhancing) the more sophisticated spatial time series variance-covariance decompositions like the singular value decomposition/spectral decomposition, QR decomposition, polar decomposition, and Fourier/spectral density decomposition.



Conclusions


The Hadoop version 2 framework is a great resource for generating efficient solutions to data lake related problems. In this post I was able to achieve this by combining statistical programming with the built-in Hadoop version 2 tools. A similar approach can be taken with other Hadoop version 2 capabilities to solve other data lake problems. In the next Hadoop examples post I will explore this approach with the wordmean, wordstandarddeviation and other Hadoop examples. 


Before then, would you like more information that you can use to customize Hadoop to your own setting? Then check out our other resources.

Check out our other blog posts and screencast series

Subscribe to our RSS feeds for blog material updates 


Blog post RSS feeds










Screencast RSS Feeds






Or get a 50% discount to our exciting training opportunity bundle




