Re: Using R code as part of a Spark Application

2016-06-29 Thread John Aherne
I don't think R Server requires R on the executor nodes. I originally set
up a SparkR cluster for our data scientist on Azure, which required
installing R on each node, but the R Server setup adds an extra edge node
running R Server that they connect to. From what little research I was
able to do, it seems R Server provides some special functions that can
distribute the work to the cluster.

Documentation is light and hard to find, but I found this helpful:
https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/05/10/r-server-for-hdinsight-running-on-microsoft-azure-cloud-data-science-challenges/



On Wed, Jun 29, 2016 at 3:29 PM, Sean Owen <so...@cloudera.com> wrote:

> Oh, interesting: does this really mean the return of distributing R
> code from driver to executors and running it remotely, or do I
> misunderstand? this would require having R on the executor nodes like
> it used to?
>
> On Wed, Jun 29, 2016 at 5:53 PM, Xinh Huynh <xinh.hu...@gmail.com> wrote:
> > There is some new SparkR functionality coming in Spark 2.0, such as
> > "dapply". You could use SparkR to load a Parquet file and then run
> "dapply"
> > to apply a function to each partition of a DataFrame.
> >
> > Info about loading Parquet file:
> >
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
> >
> > API doc for "dapply":
> >
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
> >
> > Xinh
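
dapply itself is part of SparkR's R API, so it is only callable from R. For
anyone working from PySpark instead, a rough analogue of the same pattern
(load a Parquet file, then apply a function to each partition of the
DataFrame) is rdd.mapPartitions. The sketch below is hypothetical; the file
path and column names are made up:

    # Hypothetical PySpark (2.0) analogue of the SparkR dapply pattern:
    # read a Parquet file, then apply a Python function to each partition.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dapply-analogue").getOrCreate()
    df = spark.read.parquet("hdfs:///data/people.parquet")  # assumed path

    def per_partition(rows):
        # 'rows' is an iterator of Row objects for one partition
        for row in rows:
            yield (row["name"], row["age"] * 2)  # assumed columns

    result = spark.createDataFrame(df.rdd.mapPartitions(per_partition),
                                   ["name", "age_doubled"])
    result.show()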
> >
> > On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog <sujeet@gmail.com>
> wrote:
> >>
> >> Try Spark's pipeRDDs: you can invoke the R script from pipe() and push the
> >> stuff you want to process to the R script's stdin.
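
A minimal sketch of that approach from the PySpark side, assuming a
hypothetical script process.R that reads one element per line from stdin and
prints one result per line to stdout (the script and Rscript itself must be
available on every executor, e.g. shipped via --files):

    # Hypothetical sketch: each RDD element is written to the R script's stdin
    # as one line; each line the script prints becomes an element of the result.
    from pyspark import SparkContext

    sc = SparkContext(appName="pipe-r-example")
    rdd = sc.parallelize(["1,2,3", "4,5,6"])
    piped = rdd.pipe("Rscript process.R")  # process.R: e.g. sum each CSV line
    print(piped.collect())

Each element of piped is then one line of the script's stdout.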
> >>
> >>
> >> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau <
> gilad.lan...@clicktale.com>
> >> wrote:
> >>>
> >>> Hello,
> >>>
> >>>
> >>>
> >>> I want to use R code as part of a Spark application (the same way I would
> >>> with Scala/Python). I want to be able to run R code as a map function on a
> >>> big Spark DataFrame loaded from a Parquet file.
> >>>
> >>> Is this even possible, or is the only way to use R as part of RStudio
> >>> orchestration of our Spark cluster?
> >>>
> >>>
> >>>
> >>> Thanks for the help!
> >>>
> >>>
> >>>
> >>> Gilad
> >>>
> >>>
> >>
> >>
> >
>
>




Re: Using R code as part of a Spark Application

2016-06-29 Thread John Aherne
Microsoft Azure has an option to create a Spark cluster with R Server.
Microsoft bought Revolution Analytics (the company behind RevoScaleR) and
recently deployed its technology as R Server.

On Wed, Jun 29, 2016 at 10:53 AM, Xinh Huynh <xinh.hu...@gmail.com> wrote:

> There is some new SparkR functionality coming in Spark 2.0, such as
> "dapply". You could use SparkR to load a Parquet file and then run "dapply"
> to apply a function to each partition of a DataFrame.
>
> Info about loading Parquet file:
>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
>
> API doc for "dapply":
>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
>
> Xinh
>
> On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog <sujeet@gmail.com> wrote:
>
>> Try Spark's pipeRDDs: you can invoke the R script from pipe() and push the
>> stuff you want to process to the R script's stdin.
>>
>>
>> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau <gilad.lan...@clicktale.com
>> > wrote:
>>
>>> Hello,
>>>
>>>
>>>
>>> I want to use R code as part of a Spark application (the same way I would
>>> with Scala/Python). I want to be able to run R code as a map function on a
>>> big Spark DataFrame loaded from a Parquet file.
>>>
>>> Is this even possible, or is the only way to use R as part of RStudio
>>> orchestration of our Spark cluster?
>>>
>>>
>>>
>>> Thanks for the help!
>>>
>>>
>>>
>>> Gilad
>>>
>>>
>>>
>>
>>
>




Re: Explode row with start and end dates into row for each date

2016-06-22 Thread John Aherne
Thanks Saurabh!

That explode function looks like it is exactly what I need.

We will be using MLlib quite a lot. Do I have to worry about Python
versions for that?

John

On Wed, Jun 22, 2016 at 4:34 PM, Saurabh Sardeshpande <saurabh...@gmail.com>
wrote:

> Hi John,
>
> If you can do it in Hive, you should be able to do it in Spark. Just make
> sure you import HiveContext instead of SQLContext.
>
> If your intent is to explore rather than get stuff done, I'm not aware of
> any RDD operations that do this for you, but there is a DataFrame operation
> called 'explode' which does this -
> https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.functions.explode.
> You'll just have to generate the array of dates using something like this -
> http://stackoverflow.com/questions/7274267/print-all-day-dates-between-two-dates
> .
>
> It's generally recommended to use Python 3 if you're starting a new
> project and don't have old dependencies. But remember that there is still
> quite a lot of stuff that is not yet ported to Python 3.
>
> Regards,
> Saurabh
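
A minimal PySpark 1.6 sketch combining the two links above: a Python UDF
builds the inclusive list of dates between start and end, and explode turns
that list into one row per date. The column names and sample row are made up:

    # Hypothetical sketch: expand (id, start, end) into one row per day.
    from datetime import date, timedelta

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, DateType

    sc = SparkContext(appName="explode-dates")
    sqlContext = HiveContext(sc)

    df = sqlContext.createDataFrame(
        [("row1", date(2015, 1, 1), date(2015, 1, 3))],
        ["id", "start", "end"])

    def date_range(start, end):
        # inclusive list of dates from start to end
        return [start + timedelta(days=d) for d in range((end - start).days + 1)]

    date_range_udf = F.udf(date_range, ArrayType(DateType()))

    exploded = df.select("id",
                         F.explode(date_range_udf("start", "end")).alias("day"))
    exploded.show()

With the sample row this yields three rows for row1, one per day from
2015-01-01 to 2015-01-03.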
>
> On Wed, Jun 22, 2016 at 3:20 PM, John Aherne <john.ahe...@justenough.com>
> wrote:
>
>> Hi Everyone,
>>
>> I am pretty new to Spark (and the mailing list), so forgive me if the
>> answer is obvious.
>>
>> I have a dataset, and each row contains a start date and end date.
>>
>> I would like to explode each row so that each day between the start and
>> end dates becomes its own row.
>> e.g.
>> row1  2015-01-01  2015-01-03
>> becomes
>> row1   2015-01-01
>> row1   2015-01-02
>> row1   2015-01-03
>>
>> So, my questions are:
>> Is Spark a good place to do that?
>> I can do it in Hive, but it's a bit messy, and this seems like a good
>> problem to use for learning Spark (and Python).
>>
>> If so, any pointers on what methods I should use? Particularly how to
>> split one row into multiples.
>>
>> Lastly, I am a bit hesitant to ask, but is there a recommendation on which
>> version of Python to use? Not interested in which is better, just want to
>> know if they are both supported equally.
>>
>> I am using Spark 1.6.1 (Hortonworks distro).
>>
>> Thanks!
>> John
>>
>>
>>
>




Explode row with start and end dates into row for each date

2016-06-22 Thread John Aherne
Hi Everyone,

I am pretty new to Spark (and the mailing list), so forgive me if the
answer is obvious.

I have a dataset, and each row contains a start date and end date.

I would like to explode each row so that each day between the start and end
dates becomes its own row.
e.g.
row1  2015-01-01  2015-01-03
becomes
row1   2015-01-01
row1   2015-01-02
row1   2015-01-03

So, my questions are:
Is Spark a good place to do that?
I can do it in Hive, but it's a bit messy, and this seems like a good
problem to use for learning Spark (and Python).

If so, any pointers on what methods I should use? Particularly how to split
one row into multiples.

Lastly, I am a bit hesitant to ask, but is there a recommendation on which
version of Python to use? Not interested in which is better, just want to
know if they are both supported equally.

I am using Spark 1.6.1 (Hortonworks distro).

Thanks!
John
