Thanks for the link, Sun. I believe running external scripts such as R code against DataFrames is a much-needed facility, for example for algorithms that are not available in MLlib. Being able to invoke those from an R script would definitely be a powerful feature when your application is Scala/Python based; you shouldn't have to switch to SparkR when much of your application code is already in Scala/Python.
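For a Scala-based application, the pipe() route Sun describes below might look roughly like the following. This is a minimal sketch, not tested code: the script path /opt/scripts/score.R, the comma-separated row format, and the app name are all made up for illustration, and it assumes Rscript plus the script itself are present on every worker node.

import org.apache.spark.sql.SparkSession

object PipeToR {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipe-to-r").getOrCreate()

    // Load the DataFrame from Parquet, as in Gilad's original question.
    val df = spark.read.parquet("/data/input.parquet")

    // Serialize each row to one line of text the R script can parse.
    val lines = df.rdd.map(row => row.mkString(","))

    // pipe() starts one external process per partition, feeds that
    // partition's lines to the process's stdin, and returns its stdout
    // lines as a new RDD of strings.
    val results = lines.pipe("Rscript /opt/scripts/score.R")

    results.take(10).foreach(println)
    spark.stop()
  }
}

Note that pipe() only deals in strings, so turning the script's output back into a typed DataFrame is left to the application.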
On Thu, Jun 30, 2016 at 8:25 AM, Sun Rui <sunrise_...@163.com> wrote:
> Hi, Gilad,
>
> You can try the dapply() and gapply() functions in SparkR in Spark 2.0.
> Yes, it is required that R is installed on each worker node.
>
> However, if your Spark application is Scala/Java based, running R code on
> DataFrames is not supported for now. There is a closed JIRA,
> https://issues.apache.org/jira/browse/SPARK-14746, which remains for
> discussion purposes. You have to convert the DataFrames to RDDs and use
> pipe() on the RDDs to launch external R processes and run the R code.
>
> On Jun 30, 2016, at 07:08, Xinh Huynh <xinh.hu...@gmail.com> wrote:
>
> It looks like it. "DataFrame UDFs in R" is resolved in Spark 2.0:
> https://issues.apache.org/jira/browse/SPARK-6817
>
> Here's some of the code:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/r/MapPartitionsRWrapper.scala
>
> /**
>  * A function wrapper that applies the given R function to each partition.
>  */
> private[sql] case class MapPartitionsRWrapper(
>     func: Array[Byte],
>     packageNames: Array[Byte],
>     broadcastVars: Array[Broadcast[Object]],
>     inputSchema: StructType,
>     outputSchema: StructType) extends (Iterator[Any] => Iterator[Any])
>
> Xinh
>
> On Wed, Jun 29, 2016 at 2:59 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Here we (or certainly I) am not talking about R Server, but plain vanilla
>> R, as used with Spark and SparkR. Currently, SparkR doesn't distribute R
>> code at all (it used to, sort of), so I'm wondering if that is changing
>> back.
>>
>> On Wed, Jun 29, 2016 at 10:53 PM, John Aherne <john.ahe...@justenough.com> wrote:
>>
>>> I don't think R Server requires R on the executor nodes. I originally
>>> set up a SparkR cluster for our data scientist on Azure, which required
>>> that I install R on each node, but for the R Server setup there is an
>>> extra edge node with R Server that they connect to. From what little
>>> research I was able to do, it seems that there are some special
>>> functions in R Server that can distribute the work to the cluster.
>>>
>>> Documentation is light, and hard to find, but I found this helpful:
>>> https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/05/10/r-server-for-hdinsight-running-on-microsoft-azure-cloud-data-science-challenges/
>>>
>>> On Wed, Jun 29, 2016 at 3:29 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Oh, interesting: does this really mean the return of distributing R
>>>> code from the driver to executors and running it remotely, or do I
>>>> misunderstand? Would this require having R on the executor nodes like
>>>> it used to?
>>>>
>>>> On Wed, Jun 29, 2016 at 5:53 PM, Xinh Huynh <xinh.hu...@gmail.com> wrote:
>>>> > There is some new SparkR functionality coming in Spark 2.0, such as
>>>> > "dapply". You could use SparkR to load a Parquet file and then run
>>>> > "dapply" to apply a function to each partition of a DataFrame.
>>>> >
>>>> > Info about loading Parquet files:
>>>> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
>>>> >
>>>> > API doc for "dapply":
>>>> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
>>>> >
>>>> > Xinh
>>>> >
>>>> > On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog <sujeet....@gmail.com> wrote:
>>>> >>
>>>> >> Try Spark pipe RDDs: you can invoke the R script from pipe() and
>>>> >> push the input you want the script to process onto its stdin.
>>>> >>
>>>> >> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau <gilad.lan...@clicktale.com> wrote:
>>>> >>>
>>>> >>> Hello,
>>>> >>>
>>>> >>> I want to use R code as part of a Spark application (the same way
>>>> >>> I would with Scala/Python). I want to be able to run R syntax as a
>>>> >>> map function on a big Spark DataFrame loaded from a Parquet file.
>>>> >>>
>>>> >>> Is this even possible, or is the only way to use R as part of an
>>>> >>> RStudio orchestration of our Spark cluster?
>>>> >>>
>>>> >>> Thanks for the help!
>>>> >>>
>>>> >>> Gilad
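A side note on the MapPartitionsRWrapper snippet Xinh quoted earlier in the thread: at heart it is just a function from one partition's iterator to another, which is exactly the shape mapPartitions expects. The toy Scala analogue below (the names MapPartitionsWrapper and WrapperDemo are made up for illustration) shows only that shape; as I read the quoted code, the real class instead carries a serialized R function and hands each partition's data to an R worker process.

import org.apache.spark.sql.SparkSession

// Toy stand-in for the wrapper shape: a case class that IS a function
// from a partition's iterator to a new iterator. The real
// MapPartitionsRWrapper carries a serialized R function (func:
// Array[Byte]) rather than this tag string.
case class MapPartitionsWrapper(tag: String)
    extends (Iterator[String] => Iterator[String]) {
  def apply(iter: Iterator[String]): Iterator[String] =
    iter.map(line => s"$tag:$line")
}

object WrapperDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wrapper-demo").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"), 2)

    // Each partition's iterator passes through the wrapper exactly once.
    rdd.mapPartitions(MapPartitionsWrapper("part")).collect().foreach(println)

    spark.stop()
  }
}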