Hi Gilad,

You can try the dapply() and gapply() functions in SparkR in Spark 2.0. Yes, this requires that R be installed on each worker node.
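For illustration, here is a minimal dapply() sketch, assuming a Spark 2.0 SparkR session. The toy data, column names, and doubling function are invented for the example; read.df() would load your Parquet file instead:

    library(SparkR)
    sparkR.session()

    # Stand-in for a DataFrame loaded with read.df("/path/to/data", "parquet")
    df <- createDataFrame(data.frame(id = 1:6, value = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5)))

    # dapply() runs the R function once per partition; the function receives
    # and returns a local R data.frame, and the output schema must be declared.
    outSchema <- structType(structField("id", "integer"),
                            structField("doubled", "double"))

    result <- dapply(df,
                     function(part) data.frame(id = part$id, doubled = part$value * 2),
                     outSchema)

    head(collect(result))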
However, if your Spark application is Scala/Java based, running R code against DataFrames is not supported for now. There is a closed JIRA, https://issues.apache.org/jira/browse/SPARK-14746, which remains for discussion purposes. You would have to convert the DataFrame to an RDD and use pipe() on the RDD to launch external R processes and run the R code there.
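If you go the Scala route, here is a minimal pipe() sketch. Everything in it is illustrative: double.R is a hypothetical script that reads lines from stdin and writes one result per line to stdout, the Parquet path is a placeholder, and Rscript must be on the PATH of every worker.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("pipe-r-sketch").getOrCreate()
    val df = spark.read.parquet("/path/to/data.parquet")

    // Serialize each Row to a line of text, run each partition's lines
    // through an external R process, and read its stdout back as an
    // RDD[String].
    val piped = df.rdd
      .map(row => row.mkString("\t"))
      .pipe("Rscript double.R")

    piped.collect().foreach(println)

You would ship the script to the executors with spark-submit --files double.R; as noted above, this still requires R on every worker.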
> On Jun 30, 2016, at 07:08, Xinh Huynh <xinh.hu...@gmail.com> wrote:
>
> It looks like it. "DataFrame UDFs in R" is resolved in Spark 2.0:
> https://issues.apache.org/jira/browse/SPARK-6817
>
> Here's some of the code:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/r/MapPartitionsRWrapper.scala
>
> /**
>  * A function wrapper that applies the given R function to each partition.
>  */
> private[sql] case class MapPartitionsRWrapper(
>     func: Array[Byte],
>     packageNames: Array[Byte],
>     broadcastVars: Array[Broadcast[Object]],
>     inputSchema: StructType,
>     outputSchema: StructType) extends (Iterator[Any] => Iterator[Any])
>
> Xinh
>
> On Wed, Jun 29, 2016 at 2:59 PM, Sean Owen <so...@cloudera.com> wrote:
> Here we (or certainly I) am not talking about R Server, but plain vanilla R, as used with Spark and SparkR. Currently, SparkR doesn't distribute R code at all (it used to, sort of), so I'm wondering if that is changing back.
>
> On Wed, Jun 29, 2016 at 10:53 PM, John Aherne <john.ahe...@justenough.com> wrote:
> I don't think R Server requires R on the executor nodes. I originally set up a SparkR cluster for our data scientist on Azure, which required that I install R on each node, but for the R Server setup there is an extra edge node with R Server that they connect to. From what little research I was able to do, it seems that there are some special functions in R Server that can distribute the work to the cluster.
>
> Documentation is light and hard to find, but I found this helpful:
> https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/05/10/r-server-for-hdinsight-running-on-microsoft-azure-cloud-data-science-challenges/
>
> John Aherne
>
> On Wed, Jun 29, 2016 at 3:29 PM, Sean Owen <so...@cloudera.com> wrote:
> Oh, interesting: does this really mean the return of distributing R code from driver to executors and running it remotely, or do I misunderstand? This would require having R on the executor nodes like it used to?
>
> On Wed, Jun 29, 2016 at 5:53 PM, Xinh Huynh <xinh.hu...@gmail.com> wrote:
> > There is some new SparkR functionality coming in Spark 2.0, such as "dapply". You could use SparkR to load a Parquet file and then run "dapply" to apply a function to each partition of a DataFrame.
> >
> > Info about loading Parquet files:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
> >
> > API doc for "dapply":
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
> >
> > Xinh
> >
> > On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog <sujeet....@gmail.com> wrote:
> >> Try Spark pipeRDDs; you can invoke the R script from pipe() and push the stuff you want to do onto the Rscript stdin, p
> >>
> >> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau <gilad.lan...@clicktale.com> wrote:
> >>> Hello,
> >>>
> >>> I want to use R code as part of a Spark application (the same way I would with Scala/Python). I want to be able to run R syntax as a map function on a big Spark DataFrame loaded from a Parquet file.
> >>>
> >>> Is this even possible, or is the only way to use R as part of RStudio orchestration of our Spark cluster?
> >>>
> >>> Thanks for the help!
> >>>
> >>> Gilad