Integrating R with Apache Hadoop

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Integrating R with Hadoop addresses the need to scale R programs to work with petabyte-scale data. The primary goal of this post is to describe different techniques for integrating R with Hadoop.

Approach 1: Using R and Streaming APIs in Hadoop

To run an R function on Hadoop in MapReduce mode, you can use the Hadoop Streaming API. Streaming can run any script that reads from standard input and writes to standard output as a mapper or reducer, so in the case of R no explicit client-side integration is needed. The following is an example of R with Streaming:

$ ${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/contrib/streaming/*.jar \
    -inputformat org.apache.hadoop.mapred.TextInputFormat \
    -input input_data.txt \
    -output output \
    -mapper /home/tst/src/map.R \
    -reducer /home/tst/src/reduce.R \
    -file /home/tst/src/map.R \
    -file /home/tst/src/reduce.R
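
The post does not show what map.R and reduce.R contain. As a rough sketch only, assuming a word-count task (chosen here purely for illustration), the logic of the two scripts might look like the following. The function names `map_lines` and `reduce_lines` are hypothetical. The sketch relies on the Streaming convention that records arrive as lines on standard input and leave as tab-separated key/value lines on standard output.

```r
# Hypothetical word-count logic for map.R and reduce.R (illustrative names).

# Mapper: emit one "word<TAB>1" line for every word in the input lines.
map_lines <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z0-9]+"))
  words <- words[nzchar(words)]          # drop empty tokens
  if (length(words) == 0) return(character(0))
  paste(words, 1, sep = "\t")
}

# Reducer: parse "key<TAB>count" lines and sum the counts for each key.
reduce_lines <- function(lines) {
  if (length(lines) == 0) return(character(0))
  parts  <- strsplit(lines, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  counts <- as.numeric(vapply(parts, `[`, character(1), 2))
  totals <- tapply(counts, keys, sum)    # one total per distinct key
  paste(names(totals), totals, sep = "\t")
}

# Local dry run that imitates the map -> sort -> reduce flow.
mapped  <- map_lines(c("Hello world", "hello Hadoop"))
reduced <- reduce_lines(sort(mapped))
```

In an actual streaming script, the body would read from standard input and write to standard output, for example `writeLines(map_lines(readLines(file("stdin"))))`. The script would also typically need an `Rscript` shebang line and execute permission so that Hadoop can launch it.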

Approach 2: Using Rhipe package of R

There is an R package called “Rhipe” that allows running a MapReduce job from within R. This way of running R on Hadoop has some prerequisites: R must be installed on each data node in the Hadoop cluster, Protocol Buffers must be installed and available on each data node (for more on Protocol Buffers, see http://wiki.apache.org/hadoop/ProtocolBuffers), and Rhipe must be available on each data node.

The following is a sample format for using the Rhipe library in R to implement MapReduce:

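library(Rhipe)
rhinit(TRUE, TRUE)

# Map expression: map.values holds the input values passed to the mapper
map <- expression({
  lapply(map.values, function(mapper) …)
})

# Reduce expression with its pre, reduce, and post steps
reduce <- expression(
  pre = {…},
  reduce = {…},
  post = {…}
)

# Define the MapReduce job, then submit it
x <- rhmr(map = map, reduce = reduce,
          ifolder = inputPath,
          ofolder = outputPath,
          inout = c('text', 'text'),
          jobname = 'test name')
rhex(x)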

Approach 3: Using RHadoop

RHadoop, much like Rhipe, facilitates running R functions in MapReduce mode. It is an open source library built by Revolution Analytics. The following are some of the packages that are part of the RHadoop library:

plyrmr: a package that provides functions for common data manipulation requirements for large datasets running on Hadoop.

rmr: a package with a collection of functions that integrate R and Hadoop.

rhdfs: a package with functions that help interface R and HDFS.

rhbase: a package with functions that help interface R and HBase.

The following example uses the rmr package and shows the steps to integrate R and Hadoop using functions from that package.

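library(rmr)

# Map function: receives a key k and a value v
maplogic <- function(k, v) { …}

# Reduce function: receives a key k and the values vv collected for that key
reducelogic <- function(k, vv) { …}

# Run the MapReduce job
mapreduce(input = "data.txt",
          output = "output",
          textinputformat = rawtextinputformat,
          map = maplogic,
          reduce = reducelogic)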
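
To make the roles of the two functions concrete, here is a minimal in-memory sketch of the map, group-by-key, and reduce flow in base R. It does not use rmr or Hadoop at all. The helper `run_local`, the even/odd keys, and the convention that the map function returns `list(key = ..., val = ...)` are assumptions made for this illustration and are not part of the rmr API.

```r
# In-memory imitation of map -> group by key -> reduce.
# run_local() is a hypothetical helper, not an rmr function.
run_local <- function(values, map, reduce) {
  # Map phase: every input value yields one list(key = ..., val = ...) pair.
  pairs <- lapply(values, function(v) map(NULL, v))
  keys  <- vapply(pairs, function(p) as.character(p$key), character(1))
  vals  <- lapply(pairs, function(p) p$val)
  # Shuffle phase: collect all values that share a key.
  grouped <- split(vals, keys)
  # Reduce phase: call reduce once per key with the list of its values.
  mapply(reduce, names(grouped), grouped, SIMPLIFY = FALSE)
}

# Illustrative logic: label each number as even or odd, then sum per label.
maplogic <- function(k, v) {
  list(key = if (v %% 2 == 0) "even" else "odd", val = v)
}
reducelogic <- function(k, vv) sum(unlist(vv))

result <- run_local(1:6, maplogic, reducelogic)
```

Here `result` is a list with one entry per key. On a cluster, the grouping step is what Hadoop performs between the map and reduce phases, which is why the reduce function receives a key together with all the values emitted for it.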

Summary of R / Hadoop integration approaches

In summary, all three approaches above work: each integrates R with Hadoop and helps scale R to operate on large-scale data held in HDFS, and each has its pros and cons.

Below is a summary of conclusions:

The Hadoop Streaming API is the simplest of the three approaches, as it has no complications in terms of installation and set-up requirements.

Both Rhipe and RHadoop require some effort to set up R and the related packages on the Hadoop cluster.

In terms of implementation, the Streaming API is driven from the command line, with the map and reduce scripts passed as inputs to the command, while both Rhipe and RHadoop allow developers to define and call custom MapReduce functions within R.

With the Hadoop Streaming API no client-side integration is required, while both Rhipe and RHadoop require client-side integration.

Other alternatives for scaling machine learning are Apache Mahout, Apache Hive, and some commercial versions of R from Revolution Analytics, the Segue framework, and others.

This post is an extract from my latest publication on machine learning.

Feel free to comment below if you have any questions.