Hi, Gilad,

You can try the dapply() and gapply() functions in SparkR in Spark 2.0. Yes, R 
is required to be installed on each worker node.
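
For a quick picture of both, here is a minimal sketch (the column names, 
schema, and functions are made up for illustration):

library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(a = 1:3, b = c(1.0, 2.0, 3.0)))

# dapply(): apply an R function to each partition of the DataFrame;
# the output schema has to be declared up front
schema <- structType(structField("a", "integer"),
                     structField("b", "double"),
                     structField("c", "double"))
result <- dapply(df, function(x) { x$c <- x$a + x$b; x }, schema)
head(collect(result))

# gapply(): apply an R function to each group of rows keyed by column "a"
agg <- gapply(df, "a",
              function(key, x) { data.frame(a = key[[1]], mean_b = mean(x$b)) },
              structType(structField("a", "integer"),
                         structField("mean_b", "double")))
head(collect(agg))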

However, if your Spark application is Scala/Java based, running R code on 
DataFrames is not supported for now. There is a closed JIRA, 
https://issues.apache.org/jira/browse/SPARK-14746, which remains for 
discussion purposes. You have to convert the DataFrames to RDDs and use 
pipe() on the RDDs to launch external R processes and run the R code.
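
For the pipe() route, the R side is just a script that reads records from 
stdin and writes results to stdout, invoked with something like 
rdd.pipe("Rscript /path/to/worker.R") on the Scala side. A minimal sketch of 
such a worker script (the file name and the one-number-per-line record format 
are assumptions for illustration):

# worker.R -- read one numeric record per line from stdin,
# write one result per line to stdout for pipe() to collect
input <- file("stdin")
open(input)
while (length(line <- readLines(input, n = 1)) > 0) {
  x <- as.numeric(line)
  cat(x * 2, "\n", sep = "")
}
close(input)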

> On Jun 30, 2016, at 07:08, Xinh Huynh <xinh.hu...@gmail.com> wrote:
> 
> It looks like it. "DataFrame UDFs in R" is resolved in Spark 2.0: 
> https://issues.apache.org/jira/browse/SPARK-6817
> 
> Here's some of the code:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/r/MapPartitionsRWrapper.scala
> 
> /**
>  * A function wrapper that applies the given R function to each partition.
>  */
> private[sql] case class MapPartitionsRWrapper(
>     func: Array[Byte],
>     packageNames: Array[Byte],
>     broadcastVars: Array[Broadcast[Object]],
>     inputSchema: StructType,
>     outputSchema: StructType) extends (Iterator[Any] => Iterator[Any]) 
> 
> Xinh
> 
> On Wed, Jun 29, 2016 at 2:59 PM, Sean Owen <so...@cloudera.com> wrote:
> Here we (or certainly I) am not talking about R Server, but plain vanilla R, 
> as used with Spark and SparkR. Currently, SparkR doesn't distribute R code at 
> all (it used to, sort of), so I'm wondering if that is changing back.
> 
> On Wed, Jun 29, 2016 at 10:53 PM, John Aherne <john.ahe...@justenough.com> wrote:
> I don't think R Server requires R on the executor nodes. I originally set up 
> a SparkR cluster for our data scientist on Azure, which required that I 
> install R on each node, but for the R Server setup there is an extra edge 
> node running R Server that they connect to. From what little research I was 
> able to do, it seems that there are some special functions in R Server that 
> can distribute the work to the cluster.
> 
> Documentation is light and hard to find, but I found this helpful:
> https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/05/10/r-server-for-hdinsight-running-on-microsoft-azure-cloud-data-science-challenges/
> 
> 
> 
> On Wed, Jun 29, 2016 at 3:29 PM, Sean Owen <so...@cloudera.com> wrote:
> Oh, interesting: does this really mean the return of distributing R
> code from the driver to executors and running it remotely, or do I
> misunderstand? This would require having R on the executor nodes, like
> it used to?
> 
> On Wed, Jun 29, 2016 at 5:53 PM, Xinh Huynh <xinh.hu...@gmail.com> wrote:
> > There is some new SparkR functionality coming in Spark 2.0, such as
> > "dapply". You could use SparkR to load a Parquet file and then run "dapply"
> > to apply a function to each partition of a DataFrame.
> >
> > Info about loading Parquet file:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
> >
> > API doc for "dapply":
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
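> >
> > For example, a rough sketch putting the two together (the Parquet path,
> > column name, and function are placeholders):
> >
> > library(SparkR)
> > sparkR.session()
> > df <- read.df("/path/to/data.parquet", source = "parquet")
> > # run an R function on each partition of the DataFrame;
> > # assumes a single double column named "value"
> > out <- dapply(df, function(x) { x$doubled <- x$value * 2; x },
> >               structType(structField("value", "double"),
> >                          structField("doubled", "double")))
> > head(collect(out))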
> >
> > Xinh
> >
> > On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog <sujeet....@gmail.com> wrote:
> >>
> >> try Spark's pipe() on RDDs: you can invoke the R script from pipe() and
> >> push the data you want to process to the R script's stdin.
> >>
> >>
> >> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau <gilad.lan...@clicktale.com>
> >> wrote:
> >>>
> >>> Hello,
> >>>
> >>>
> >>>
> >>> I want to use R code as part of a Spark application (the same way I would
> >>> do with Scala/Python). I want to be able to run R code as a map
> >>> function on a big Spark DataFrame loaded from a Parquet file.
> >>>
> >>> Is this even possible, or is the only way to use R as part of RStudio
> >>> orchestration of our Spark cluster?
> >>>
> >>>
> >>>
> >>> Thanks for the help!
> >>>
> >>>
> >>>
> >>> Gilad
> >>>
> >>>
> >>
> >>
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 
> 
> 
> -- 
> John Aherne
> Big Data and SQL Developer
> 
> Cell:  +1 (303) 809-9718
> Email: john.ahe...@justenough.com
> Skype: john.aherne.je
> Web:   www.justenough.com
> 
> Confidentiality Note: The information contained in this email and document(s) 
> attached are for the exclusive use of the addressee and may contain 
> confidential, privileged and non-disclosable information. If the recipient of 
> this email is not the addressee, such recipient is strictly prohibited from 
> reading, photocopying, distributing or otherwise using this email or its 
> contents in any way.
> 
> 
> 
