This post outlines a simple framework for analyzing digital, social and mobile (technology) data that can be set up using Microsoft Excel (Gnumeric in Ubuntu), Apache Hadoop version 2.6.0, R and SAS software. The setup requires only elementary programming knowledge, although Microsoft Visual Basic, R, RHadoop and Java programming can greatly simplify (and enhance) the process. The approach ultimately involves setting up a data lake which grows according to a Kronecker product. The mathematical foundation of the framework was introduced and discussed in my previous blog posts (whose links are included at the end of the post).
The prerequisites for implementation are an Apache Hadoop 2.6.0 (Hadoop) installation (single node cluster and examples), Microsoft Excel (or any other Microsoft Excel-like spreadsheet), R (or SAS software), elementary statistics and Internet World Stats data sets (2014 Internet user population, 2012 Facebook population and the 2008 to 2014 population spatial time series).
1. Obtain the data sets
The procedure for setting up the data sets is outlined in my previous blog post: The sixty-five Regional Digital, Social and Mobile in 2015 statistics list that every blogger, writer or internet (content) specialist should take a look at to get a good mathematical basis from which to formulate social media and digital statistical content.
The first step is to load the population data into Microsoft Excel. The figures must be converted into ordinal categories whose names have a length that gives an indication of the size of the values.
For example, sm_cat_one (small category one) has a name length that is shorter than sm_cat_two_ (small category two). This is how one can tell Apache Hadoop MapReduce that the word sm_cat_one is attached to a number that is smaller than that of sm_cat_two_.
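The categorization step can be sketched in a few lines of code. The post does this in Microsoft Excel; the Python version below is only an illustration, and the cut-offs, the sample figures and the two large category names are hypothetical (only sm_cat_one and sm_cat_two_ come from the text above).

```python
from bisect import bisect_right

# Hypothetical cut-offs; the analyst is free to choose different ones.
CUTOFFS = [1_000_000, 10_000_000, 100_000_000]

# Category names ordered so that a longer name signals a larger value.
# The two "large" names are assumptions made for this sketch.
NAMES = ["sm_cat_one", "sm_cat_two_", "large_cat_one", "large_cat_two_"]


def categorize(value):
    """Return the ordinal category name for a population figure."""
    return NAMES[bisect_right(CUTOFFS, value)]


# Made-up population figures, one per category.
populations = [350_000, 4_200_000, 65_000_000, 1_300_000_000]
labels = [categorize(p) for p in populations]
print(labels)
```

Using a sorted list of cut-offs with a binary search keeps the mapping in one place, so changing the cut-offs does not require touching the rest of the code.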
This will set up the classification for the 2014 (global) internet user population, 2014 (global) population and 2012 (global) Facebook user population data sets. The cut-offs can be set up according to the choice of the programmer/analyst.
The next step is to set up the spatial time series variance-covariance data. The details for calculating the spatial time series variance-covariance matrix for the global population using the Kronecker product are outlined in my previous blog post: 5 matrix decompositions for visualizing the global internet user population spatial time series variance-covariance matrix. The spatial time series variance-covariance matrix I selected for the analysis was that of the 2008 to 2014 global population spatial time series.
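As a reminder of what the Kronecker product does here, the sketch below builds a small separable variance-covariance matrix from a temporal block and a spatial block. The two input matrices are made-up numbers, and the ordering (temporal block first, spatial block second) is an assumption for illustration; the earlier post referenced above defines the actual construction.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


# Made-up 2x2 temporal and spatial variance-covariance matrices.
temporal = [[2, 1], [1, 2]]
spatial = [[4, -1], [-1, 3]]

# The separable matrix has (2*2) x (2*2) = 4 x 4 entries.
separable = kron(temporal, spatial)
for row in separable:
    print(row)
```

The size of the result is the product of the input sizes, which is why a data lake built this way grows quickly as time points or regions are added.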
The next step was to calculate the absolute value of the matrix values in order to handle the cases of negative variance-covariances. An alternative procedure is to leave the matrix values as they are and adjust the value categories to represent both the sign and the size of the values. The 2008 to 2014 global internet user population spatial time series variance-covariance matrix can be set up similarly with different (or the same) cut-offs. The categorization of the variables needs to be adjusted because separable spatial time series variance-covariances are on a different scale to the original data.
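The two ways of handling negative variance-covariances can be sketched as follows. The cut-off of 5.0 and the sign-aware category names are hypothetical, chosen only to show the idea.

```python
def abs_matrix(matrix):
    """Option 1: take the absolute value of every matrix entry."""
    return [[abs(x) for x in row] for row in matrix]


def signed_category(value, cutoff=5.0):
    """Option 2: keep the value and encode sign and size in the name.

    The cut-off and the names are assumptions made for this sketch.
    """
    sign = "neg" if value < 0 else "pos"
    size = "sm_cat" if abs(value) < cutoff else "large_cat_"
    return f"{sign}_{size}"


# Made-up variance-covariance entries.
matrix = [[8.0, -2.0], [-2.0, 6.0]]

print(abs_matrix(matrix))
print([[signed_category(x) for x in row] for row in matrix])
```

The second option keeps more information in the data lake, at the cost of doubling the number of category words that Hadoop has to count.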
The data sets can then be saved into a text file. It is important to make sure that each word has a leading and a trailing space so that Hadoop can separate the words in the wordcount MapReduce code.
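A minimal sketch of the export step, assuming the category labels are already in a list. The file name and the labels are hypothetical; the point is only that every word is surrounded by whitespace.

```python
import os
import tempfile


def write_tokens(words, path):
    """Write one category word per line, padded with a space on each side."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in words:
            handle.write(f" {word} \n")


# Hypothetical category labels and output location.
category_words = ["sm_cat_one", "sm_cat_two_", "sm_cat_one"]
out_path = os.path.join(tempfile.mkdtemp(), "analysis_file.txt")
write_tokens(category_words, out_path)

# Reading the file back and splitting on whitespace recovers the words.
with open(out_path, encoding="utf-8") as handle:
    raw_text = handle.read()
recovered = raw_text.split()
print(recovered)
```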
2. Process the data in Hadoop
The four data sets can then be loaded into your input folder on the Hadoop Distributed File System (HDFS). The next step is to analyze the data sets using the procedure outlined in the Apache Hadoop project documentation.
For example, from a terminal on Ubuntu 14.04.3, the following command can be run from the directory where the MapReduce example programs are housed:
$ hadoop jar hadoop-mapreduce-examples-2.6.0.jar wordcount HDFS_input_folder/<analysis file>.txt HDFS_output_folder
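Before submitting the job it can help to check locally what counts to expect. The sketch below is not Hadoop code; it only mimics the result of the wordcount example (split on whitespace, count each word, sort by word) on a small made-up input.

```python
from collections import Counter


def local_wordcount(text):
    """Count whitespace-separated words and return (word, count) pairs sorted by word."""
    counts = Counter(text.split())
    return sorted(counts.items())


# Made-up contents of an analysis file.
sample_text = " sm_cat_one  sm_cat_two_ \n sm_cat_one  large_cat_one \n"

for word, count in local_wordcount(sample_text):
    print(f"{word}\t{count}")
```

If the local counts and the Hadoop counts disagree, the usual suspect is a word that was not separated by whitespace in the text file.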
2014 Global population
In my case the resulting category wordcounts for the 2014 global population were as follows.
The data can be extracted and plotted in R, SAS software or Microsoft Excel (or gnumeric). The procedure can be further fine-tuned according to the requirements of the analyst.
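For readers who prefer to script the extraction, the sketch below parses wordcount output lines, which are written as a word, a tab and a count. The sample lines are made up and stand in for the contents of the job's output file; the post itself uses R, SAS software or a spreadsheet for this step.

```python
def parse_wordcount_output(lines):
    """Turn 'word<TAB>count' lines into a dictionary of counts."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, count = line.rsplit("\t", 1)
        counts[word] = int(count)
    return counts


# Made-up output lines, standing in for the real job output.
sample_lines = ["large_cat_one\t1\n", "sm_cat_one\t2\n", "sm_cat_two_\t1\n"]
category_counts = parse_wordcount_output(sample_lines)

# A rough text bar chart, largest category first.
for word, count in sorted(category_counts.items(), key=lambda kv: -kv[1]):
    print(f"{word:<16}{'#' * count}")
```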
2014 Internet user population
In the case of the 2014 internet population I obtained the following counts for the categories.
2012 Facebook users
In the case of the 2012 Facebook population I obtained the following counts for the categories.
2008 to 2014 Global population spatial time series variance-covariance matrix
In the case of the 2008 to 2014 spatial time series variance-covariance matrix, I obtained the following counts for the categories.
The next step is to head to the R console and find out which category contains the mean of the variance-covariance data (absolute value of the original variance-covariance data). In my case I obtained the following output.
The mean is in large_cat_two, which is a little counter-intuitive. It is important, however, to keep in mind that larger values will pull the mean toward them (i.e. have a larger “implicit weight” in the equally weighted mean value calculated in R). The second consideration is that the numbers in each category depend on the cut-offs. In my case, the next logical step will be to repeat the procedure with the newly available information from Hadoop.
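The pull of large values on the mean is easy to reproduce with a toy example. The numbers, cut-offs and category names below are made up and are not the values from my analysis; the post computes the mean in R, and Python is used here only for illustration.

```python
from bisect import bisect_right
from collections import Counter
from statistics import mean

# Hypothetical cut-offs and names for this toy example.
CUTOFFS = [5.0, 10.0, 15.0]
NAMES = ["sm_cat_one", "sm_cat_two_", "large_cat_one", "large_cat_two_"]


def categorize(value):
    """Return the category name for a non-negative value."""
    return NAMES[bisect_right(CUTOFFS, value)]


# Made-up variance-covariance entries: four small ones and one large one.
entries = [1.0, -2.0, 3.0, -4.0, 90.0]
abs_entries = [abs(x) for x in entries]

counts = Counter(categorize(x) for x in abs_entries)
mean_value = mean(abs_entries)
mean_category = categorize(mean_value)

print(counts)
print(mean_value, mean_category)
```

Four of the five entries fall in the smallest category, yet the single large entry is enough to place the mean in the largest one, which is the same effect described above.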
The exciting feature of the Hadoop output is that this scheme represents a home-made/simple/elementary decomposition of the spatial time series variance-covariance matrix. I have found that the decomposition, because of its simplicity, is useful for constructing (enhancing) the more sophisticated spatial time series variance-covariance decompositions like the singular value decomposition/spectral decomposition, QR decomposition, polar decomposition, and Fourier/spectral density decomposition.
Conclusions
The Hadoop version 2 framework is a great resource for generating efficient solutions to data lake related problems. In this post I was able to achieve this by combining statistical programming with the built-in Hadoop version 2 tools. A similar approach can be taken with other Hadoop version 2 capabilities to solve other data lake problems. In the next Hadoop examples post I will explore this approach with the wordmean, wordstandarddeviation and other Hadoop examples.
Before then, would you like to get more information which you can use to customize Hadoop to your own setting? Then check out our other resources.
Check out our other blog posts and screencast series