This setup guide is designed for an Apache Hadoop 2.6.0 installation. Hadoop Streaming is a utility that allows one to create and run MapReduce jobs with any executable or script as the mapper and/or reducer, and it is part of the Hadoop distribution. A detailed explanation of Hadoop Streaming and Hadoop 2.6.0 can be found on the Apache Hadoop project website. In this post I explain how to execute the Hadoop 2.6.0 MapReduce examples word count, word mean and word standard deviation, which are also part of the Hadoop distribution. In this scheme, Hadoop Streaming is used for the word count MapReduce instead of the word count bundled with the Hadoop distribution (which was used in my previous blog post).
The first part of the post gives the setup and execution of word count using Hadoop Streaming MapReduce, with one mapper/reducer pair written in Python and another written in R. The second part of the post gives the setup and execution of the word mean and word standard deviation jobs using the standard Hadoop MapReduce.
The MapReduce job is designed to analyze four sets
of aggregates. These are the 2014 global population, 2014 global internet user
population, 2012 Facebook population and the spatial time series variance-covariance
matrix (annual steps) for the global internet user population between the years 2008 to 2014. The analysis of
the first three sets of aggregates in Hadoop was for testing purposes and the
last set was the main analysis.
In the scheme the word count MapReduce job was
implemented using Python for all the sets. The word count MapReduce job using R
script was implemented only for the global internet user population spatial time series variance-covariance
matrix aggregates (the fourth set). The word mean and word standard deviation standard
MapReduce jobs were also only implemented for the fourth set of aggregates.
1. Prepare the data
A detailed account of the aggregates can be found in my previous blog post: 5 matrix decompositions for visualizing the global internet user population spatial time series variance-covariance matrix. The data preparation essentially involved categorizing the aggregates into decile categories (classes). The decile classes are then given word values whose length gives an indication of the size of the figure. For example, the first decile class, decile_one, has a word length that is shorter than that of the second decile class, decile_two_. The naming convention is designed to facilitate the word mean and word standard deviation analyses.
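As an illustration only, a minimal sketch of this preparation in R is given below. The decile labels and the input data here are placeholders, not the values used in the original analysis; the labels are simply padded so that word length increases with the decile, mirroring the naming convention described above.

```r
# Minimal sketch of the data preparation (illustrative only): classify the
# aggregates into decile classes and replace each value with a word label
# whose length increases with the decile.
base <- c("decile_one", "decile_two", "decile_three", "decile_four", "decile_five",
          "decile_six", "decile_seven", "decile_eight", "decile_nine", "decile_ten")
# Pad with underscores so that word length strictly increases from decile one to ten
labels <- mapply(function(w, n) paste0(w, strrep("_", max(0, n - nchar(w)))),
                 base, nchar(base[1]) + 0:9)

aggregates <- runif(1000)  # placeholder data standing in for the real aggregates
breaks <- quantile(aggregates, probs = seq(0, 1, 0.1))
decile_words <- cut(aggregates, breaks = breaks, include.lowest = TRUE, labels = labels)
writeLines(as.character(decile_words), "InputData.txt")  # one word per line, ready for HDFS
```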
Decile classes for the 2014 Global population aggregates
The decile classes for the 2014 global population aggregates are shown in the following table.
Decile classes for the 2014 Global internet user population aggregates
The decile classes for the 2014 global internet user
population aggregates are shown in the following table.
Decile classes for the 2012 Facebook user population aggregates
The decile classes for the 2012 Facebook user
population aggregates are shown in the following table.
Decile classes for the Global internet user population (2008 to 2014) spatial time series variance-covariance matrix aggregates
In processing the matrix, the first step is to obtain the absolute value of every entry (in order to handle negative variance-covariances). It is also worth noting that variance-covariance matrices are symmetric, so in this analysis one half of the off-diagonal elements could be omitted from the processing. There are advantages (fewer values to process) and disadvantages (having to transform the final Hadoop results before presenting them, and a greater risk of errors from the additional processing of a matrix of this size and the more complex processing procedures involved) to doing so. In the present analysis, however, all elements were retained in the processing in order to avoid these disadvantages.
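As a minimal sketch of this step (assuming the matrix is held in an R object, here called vcov_matrix, which is not part of the original post):

```r
# Take absolute values of all entries before the decile classification;
# both triangles of the symmetric matrix are retained, as described above.
vcov_abs <- abs(vcov_matrix)        # vcov_matrix: assumed 2008-2014 variance-covariance matrix
vcov_values <- as.vector(vcov_abs)  # flatten to a vector of values for classification
```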
The decile classes for the 2008 to 2014 annual global internet user population aggregates are shown in the following table.
The classified aggregates were then loaded into the Hadoop Distributed File System (HDFS) in preparation for the MapReduce jobs. The procedure for loading data into HDFS can be found on the Apache Hadoop project website.
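For completeness, a typical sequence of HDFS shell commands for this step is sketched below; the folder name is the same placeholder used later in this post.

```bash
# Create the HDFS input folder and load the classified aggregates into it
bin/hdfs dfs -mkdir -p <HDFS input folder>
bin/hdfs dfs -put InputData.txt <HDFS input folder>
bin/hdfs dfs -ls <HDFS input folder>   # verify the file was loaded
```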
2. Prepare the Mappers and Reducers for Hadoop Streaming
The next step is to prepare the mapper and reducer scripts that will be used in the Streaming jobs.
Python mapper and reducer
The Python mapper and reducer for the word count
jobs were obtained and prepared according to the tutorial in this post. The
improved Python mapper and reducer combination was selected. The Python mapper
is as follows:
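(The script from the tutorial is not reproduced here; the block below is a minimal sketch of a Streaming word count mapper of the same form.)

```python
#!/usr/bin/env python
"""Minimal sketch of a Hadoop Streaming word count mapper (illustrative only).

Reads lines from standard input and emits one "word<TAB>1" pair per word.
"""
import sys


def main():
    for line in sys.stdin:
        for word in line.strip().split():
            # Hadoop Streaming expects the key and value separated by a tab
            print('%s\t%s' % (word, 1))


if __name__ == '__main__':
    main()
```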
The Python reducer is as follows:
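(Again a minimal sketch rather than the tutorial script; it relies on Hadoop sorting the mapper output by key, so identical words arrive consecutively.)

```python
#!/usr/bin/env python
"""Minimal sketch of a Hadoop Streaming word count reducer (illustrative only)."""
import sys
from itertools import groupby
from operator import itemgetter


def read_pairs(stream):
    # Each input line has the form "word<TAB>count"
    for line in stream:
        word, _, count = line.strip().partition('\t')
        yield word, count


def main():
    # groupby works because Hadoop delivers the mapper output sorted by key
    for word, group in groupby(read_pairs(sys.stdin), key=itemgetter(0)):
        try:
            total = sum(int(count) for _, count in group)
            print('%s\t%d' % (word, total))
        except ValueError:
            pass  # skip malformed counts


if __name__ == '__main__':
    main()
```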
R script mapper and reducer
The R Script mapper and reducer for the word count
jobs were obtained and prepared according to this post. The R script
mapper is as follows:
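(As with the Python scripts, the block below is a minimal sketch of a Streaming word count mapper in R rather than the script from the referenced post.)

```r
#!/usr/bin/env Rscript
# Minimal sketch of a Hadoop Streaming word count mapper in R (illustrative only).
# Reads lines from standard input and emits one "word<TAB>1" pair per word.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  words <- unlist(strsplit(trimws(line), "[[:space:]]+"))
  for (w in words[words != ""]) {
    cat(w, "\t1\n", sep = "")
  }
}
close(con)
```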
The R script reducer is as follows:
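(Likewise, a minimal sketch of the corresponding reducer; it assumes the input lines arrive sorted by key, so identical words are consecutive.)

```r
#!/usr/bin/env Rscript
# Minimal sketch of a Hadoop Streaming word count reducer in R (illustrative only).
con <- file("stdin", open = "r")
current_word <- NULL
current_count <- 0
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  parts <- unlist(strsplit(line, "\t"))
  if (length(parts) < 2) next
  word <- parts[1]
  count <- suppressWarnings(as.integer(parts[2]))
  if (is.na(count)) next  # skip malformed counts
  if (!is.null(current_word) && word == current_word) {
    current_count <- current_count + count
  } else {
    if (!is.null(current_word)) cat(current_word, "\t", current_count, "\n", sep = "")
    current_word <- word
    current_count <- count
  }
}
if (!is.null(current_word)) cat(current_word, "\t", current_count, "\n", sep = "")
close(con)
```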
3. Analyze the data in Hadoop
The next step is to execute the jobs in Hadoop.
2014 Global population word count job (Python)
In this section the first assumption is that the 2014 global population data is named InputData.txt in HDFS. For the Streaming MapReduce command, the further assumptions are that the data has been successfully loaded into the HDFS input folder <HDFS input folder>, that the hadoop-streaming-2.6.0.jar file is located in the local system folder <hadoop-streaming-2.6.0.jar local folder>, that the Python mapper is located in the local folder <Python mapper folder>, that the Python reducer is located in the local folder <Python reducer folder>, and that the HDFS output folder is named <HDFS output folder>. The next step is to run the following command in Hadoop.
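A sketch of the command is given below; the script file names mapper.py and reducer.py are placeholders for whatever the Python mapper and reducer scripts are actually called.

```bash
bin/hadoop jar <hadoop-streaming-2.6.0.jar local folder>/hadoop-streaming-2.6.0.jar \
  -input <HDFS input folder>/InputData.txt \
  -output <HDFS output folder> \
  -mapper mapper.py \
  -reducer reducer.py \
  -file <Python mapper folder>/mapper.py \
  -file <Python reducer folder>/reducer.py
```

Once the job completes, the category counts can be inspected with, for example, bin/hdfs dfs -cat <HDFS output folder>/part-00000 (the output file name may differ depending on the job configuration).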
In my case I obtained the following category counts.
2014 Global internet user population word count job (Python)
In this section the first assumption is that the 2014 global internet user population data is named InputData.txt in HDFS. The remaining assumptions and the Hadoop Streaming command are the same as in the previous section.
In my case I obtained the following category counts.
2012 Facebook user population word count job (Python)
In this section the first assumption is that the 2012 Facebook population data is named InputData.txt in HDFS. For the Streaming MapReduce command, the further assumptions are that the data has been successfully loaded into the HDFS input folder <HDFS input folder>, that the hadoop-streaming-2.6.0.jar file is located in the local system folder <hadoop-streaming-2.6.0.jar local folder>, that the Python mapper is located in the local folder <Python mapper folder>, that the Python reducer is located in the local folder <Python reducer folder>, and that the HDFS output folder is named <HDFS output folder>. The next step is to run the following command in Hadoop.
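The command has the same form as before (mapper.py and reducer.py again being placeholder script names):

```bash
bin/hadoop jar <hadoop-streaming-2.6.0.jar local folder>/hadoop-streaming-2.6.0.jar \
  -input <HDFS input folder>/InputData.txt \
  -output <HDFS output folder> \
  -mapper mapper.py \
  -reducer reducer.py \
  -file <Python mapper folder>/mapper.py \
  -file <Python reducer folder>/reducer.py
```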
In my case I obtained the following category counts.
2008 to 2014 Global internet user population spatial time series variance-covariance matrix word count job (Python)
In this section the first assumption is that the 2008 to 2014 global internet user population spatial time series variance-covariance matrix data is named InputData.txt in HDFS. For the Streaming MapReduce command, the further assumptions are that the data has been successfully loaded into the HDFS input folder <HDFS input folder>, that the hadoop-streaming-2.6.0.jar file is located in the local system folder <hadoop-streaming-2.6.0.jar local folder>, that the Python mapper is located in the local folder <Python mapper folder>, that the Python reducer is located in the local folder <Python reducer folder>, and that the HDFS output folder is named <HDFS output folder>. The next step is to run the following command in Hadoop.
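Again a sketch of the command, with mapper.py and reducer.py as placeholder script names:

```bash
bin/hadoop jar <hadoop-streaming-2.6.0.jar local folder>/hadoop-streaming-2.6.0.jar \
  -input <HDFS input folder>/InputData.txt \
  -output <HDFS output folder> \
  -mapper mapper.py \
  -reducer reducer.py \
  -file <Python mapper folder>/mapper.py \
  -file <Python reducer folder>/reducer.py
```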
In my case I obtained the following category counts.
2008 to 2014 Global internet user population spatial time series variance-covariance matrix word count job (R script)
In this section the first assumption is that the 2008 to 2014 global internet user population spatial time series variance-covariance matrix data is named InputData.txt in HDFS. For the Streaming MapReduce command, the further assumptions are that the data has been successfully loaded into the HDFS input folder <HDFS input folder>, that the hadoop-streaming-2.6.0.jar file is located in the local system folder <hadoop-streaming-2.6.0.jar local folder>, that the R script mapper is located in the local folder <R script mapper folder>, that the R script reducer is located in the local folder <R script reducer folder>, and that the HDFS output folder is named <HDFS output folder>. The next step is to run the following command in Hadoop.
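A sketch of the command is given below; mapper.R and reducer.R are placeholder names for the R scripts.

```bash
bin/hadoop jar <hadoop-streaming-2.6.0.jar local folder>/hadoop-streaming-2.6.0.jar \
  -input <HDFS input folder>/InputData.txt \
  -output <HDFS output folder> \
  -mapper mapper.R \
  -reducer reducer.R \
  -file <R script mapper folder>/mapper.R \
  -file <R script reducer folder>/reducer.R
```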
In my case I obtained the following category counts.
2008 to 2014 Global internet user population spatial time series variance-covariance matrix word mean and word standard deviation jobs
In this section the first assumption is that the 2008 to 2014 global internet user population spatial time series variance-covariance matrix data is named InputData.txt in HDFS. For the standard Hadoop 2.6.0 MapReduce command, the further assumptions are that the data has been successfully loaded into the HDFS input folder <HDFS input folder>, that the hadoop-mapreduce-examples-2.6.0.jar file is located in the local system folder <hadoop-mapreduce-examples-2.6.0.jar folder>, and that the HDFS output folder is named <HDFS output folder>. The word mean is obtained by running the following command in Hadoop.
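A sketch of the command, invoking the wordmean example program from the examples jar by name:

```bash
bin/hadoop jar <hadoop-mapreduce-examples-2.6.0.jar folder>/hadoop-mapreduce-examples-2.6.0.jar \
  wordmean <HDFS input folder>/InputData.txt <HDFS output folder>
```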
In my case I obtained the following decile class based word mean
value.
The word mean value (i.e. mean of the decile class data) is a function of the mean of the original data (quantitative values).
The word standard deviation is obtained by running the following command in Hadoop.
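A sketch of the command, this time invoking the wordstandarddeviation example program (a fresh output folder should be used, since Hadoop will not overwrite an existing one):

```bash
bin/hadoop jar <hadoop-mapreduce-examples-2.6.0.jar folder>/hadoop-mapreduce-examples-2.6.0.jar \
  wordstandarddeviation <HDFS input folder>/InputData.txt <HDFS output folder>
```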
In my case I obtained the following decile class based word standard
deviation value.
The word standard deviation (i.e. standard deviation of the decile class data) is a function of the standard deviation of the original data (quantitative values). The scheme can be refined to use percentile divisions with a finer granularity, for example 5% percentile interval cut-offs, then 2.5%, and so on.
Summary
In this post I outlined how to set up MapReduce jobs that can be used to generate summaries of a big annual spatial time series variance-covariance matrix of the global internet user population between 2008 and 2014. The summaries can be used to guide more specific/specialized analyses of the spatial time series variance-covariance matrix.
The procedure is simple to execute in the sense that essentially only three sets of MapReduce jobs/procedures were run. The R script procedure replicates the Python MapReduce job, and the same Python word count procedure was run on four different data sets. This essentially reduces the jobs to a Hadoop Streaming Python MapReduce word count job, a standard Hadoop MapReduce word mean job and a standard Hadoop MapReduce word standard deviation job.
I hope this post proves useful for your own analyses. Check
out my other related blog posts for a better context on how you can use the
procedure in your own analyses.
Sources: