Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-31 Thread Shivaram Venkataraman
+1

Thanks
Shivaram

On Thu, Jul 30, 2020 at 11:56 PM Wenchen Fan  wrote:
>
> +1, thanks for driving it, Holden!
>
> On Fri, Jul 31, 2020 at 10:24 AM Holden Karau  wrote:
>>
>> +1 from myself :)
>>
>> On Thu, Jul 30, 2020 at 2:53 PM Jungtaek Lim  
>> wrote:
>>>
>>> +1 (non-binding, I guess)
>>>
>>> Thanks for raising the issue and sorting it out!
>>>
>>> On Fri, Jul 31, 2020 at 6:47 AM Holden Karau  wrote:

 Hi Spark Developers,

 After the discussion of the proposal to amend Spark committer guidelines, 
 it appears folks are generally in agreement on policy clarifications. (See 
 https://lists.apache.org/thread.html/r6706e977fda2c474a7f24775c933c2f46ea19afbfafb03c90f6972ba%40%3Cdev.spark.apache.org%3E,
  as well as some on the private@ list for PMC.) Therefore, I am calling 
 for a majority VOTE, which will last at least 72 hours. See the ASF voting 
 rules for procedural changes at 
 https://www.apache.org/foundation/voting.html.

 The proposal is to add a new section entitled “When to Commit” to the 
 Spark committer guidelines, currently at 
 https://spark.apache.org/committers.html.

 ** START OF CHANGE **

 PRs shall not be merged during active, on-topic discussion unless they 
 address issues such as critical security fixes of a public vulnerability. 
 Under extenuating circumstances, PRs may be merged during active, 
 off-topic discussion and the discussion directed to a more appropriate 
 venue. Time should be given prior to merging for those involved with the 
 conversation to explain if they believe they are on-topic.

 Lazy consensus requires giving time for discussion to settle while 
 understanding that people may not be working on Spark as their full-time 
 job and may take holidays. It is believed that by doing this, we can limit 
 how often people feel the need to exercise their veto.

 All -1s with justification merit discussion.  A -1 from a non-committer 
 can be overridden only with input from multiple committers, and suitable 
 time must be offered for any committer to raise concerns. A -1 from a 
 committer who cannot be reached requires a consensus vote of the PMC under 
 ASF voting rules to determine the next steps within the ASF guidelines for 
 code vetoes ( https://www.apache.org/foundation/voting.html ).

 These policies serve to reiterate the core principle that code must not be 
 merged with a pending veto or before a consensus has been reached (lazy or 
 otherwise).

 It is the PMC’s hope that vetoes continue to be infrequent, and when they 
 occur, that all parties will take the time to build consensus prior to 
 additional feature work.

 Being a committer means exercising your judgement while working in a 
 community of people with diverse views. There is nothing wrong in getting 
 a second (or third or fourth) opinion when you are uncertain. Thank you 
 for your dedication to the Spark project; it is appreciated by the 
 developers and users of Spark.

 It is hoped that these guidelines do not slow down development; rather, by 
 removing some of the uncertainty, the goal is to make it easier for us to 
 reach consensus. If you have ideas on how to improve these guidelines or 
 other Spark project operating procedures, you should reach out on the dev@ 
 list to start the discussion.

 ** END OF CHANGE TEXT **

 I want to thank everyone who has been involved with the discussion leading 
 to this proposal and those of you who take the time to vote on this. I 
 look forward to our continued collaboration in building Apache Spark.

 I believe we share the goal of creating a welcoming community around the 
 project. On a personal note, it is my belief that consistently applying 
 this policy around commits can help to make a more accessible and 
 welcoming community.

 Kind Regards,

 Holden

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.): 
 https://amzn.to/2MaRAG9
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-14 Thread Shivaram Venkataraman
Hi all

Just wanted to check if there are any blockers that we are still waiting
for to start the new release process.

Thanks
Shivaram

On Sun, Jul 5, 2020, 06:51 wuyi  wrote:

> Ok, after having another look, I think it only affects local cluster deploy
> mode, which is for testing only.
>
>
> wuyi wrote
> > Please also include https://issues.apache.org/jira/browse/SPARK-32120 in
> > Spark 3.0.1. It's a regression compared to Spark 3.0.0-preview2.
> >
> > Thanks,
> > Yi Wu
> >
> >
> > Yuanjian Li wrote
> >> Hi dev-list,
> >>
> >> I’m writing this to raise the discussion about Spark 3.0.1 feasibility
> >> since 4 blocker issues were found after Spark 3.0.0:
> >>
> >>
> >> 1. [SPARK-31990] https://issues.apache.org/jira/browse/SPARK-31990
> >>    The state store compatibility broken will cause a correctness issue when
> >>    Streaming query with `dropDuplicate` uses the checkpoint written by the
> >>    old Spark version.
> >> 2. [SPARK-32038] https://issues.apache.org/jira/browse/SPARK-32038
> >>    The regression bug in handling NaN values in COUNT(DISTINCT)
> >> 3. [SPARK-31918][WIP] https://issues.apache.org/jira/browse/SPARK-31918
> >>    CRAN requires to make it working with the latest R 4.0. It makes the 3.0
> >>    release unavailable on CRAN, and only supports R [3.5, 4.0)
> >> 4. [SPARK-31967] https://issues.apache.org/jira/browse/SPARK-31967
> >>    Downgrade vis.js to fix Jobs UI loading time regression
> >>
> >>
> >> I also noticed branch-3.0 already has 39 commits
> >> https://issues.apache.org/jira/browse/SPARK-32038?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%203.0.1
> >> after Spark 3.0.0. I think it would be great if we have Spark 3.0.1 to
> >> deliver the critical fixes.
> >>
> >> Any comments are appreciated.
> >>
> >> Best,
> >>
> >> Yuanjian
> >
> >
> >
> >
> >
> > --
> > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> >
> > -
> > To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-01 Thread Shivaram Venkataraman
Thanks Holden -- it would be great to also get 2.4.7 started

Thanks
Shivaram

On Tue, Jun 30, 2020 at 10:31 PM Holden Karau  wrote:
>
> I can take care of 2.4.7 unless someone else wants to do it.
>
> On Tue, Jun 30, 2020 at 8:29 PM Jason Moore  
> wrote:
>>
>> Hi all,
>>
>>
>>
>> Could I get some input on the severity of this one that I found yesterday?  
>> If that’s a correctness issue, should it block this patch?  Let me know 
>> under the ticket if there’s more info that I can provide to help.
>>
>>
>>
>> https://issues.apache.org/jira/browse/SPARK-32136
>>
>>
>>
>> Thanks,
>>
>> Jason.
>>
>>
>>
>> From: Jungtaek Lim 
>> Date: Wednesday, 1 July 2020 at 10:20 am
>> To: Shivaram Venkataraman 
>> Cc: Prashant Sharma , 郑瑞峰 , 
>> Gengliang Wang , gurwls223 
>> , Dongjoon Hyun , Jules Damji 
>> , Holden Karau , Reynold Xin 
>> , Yuanjian Li , 
>> "dev@spark.apache.org" , Takeshi Yamamuro 
>> 
>> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>>
>>
>>
>> SPARK-32130 [1] looks to be a performance regression introduced in Spark 
>> 3.0.0, which is ideal to look into before releasing another bugfix version.
>>
>>
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-32130
>>
>>
>>
>> On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman 
>>  wrote:
>>
>> Hi all
>>
>>
>>
>> I just wanted to ping this thread to see if all the outstanding blockers for 
>> 3.0.1 have been fixed. If so, it would be great if we can get the release 
>> going. The CRAN team sent us a note that the version SparkR available on 
>> CRAN for the current R version (4.0.2) is broken and hence we need to update 
>> the package soon --  it will be great to do it with 3.0.1.
>>
>>
>>
>> Thanks
>>
>> Shivaram
>>
>>
>>
>> On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma  wrote:
>>
>> +1 for 3.0.1 release.
>>
>> I too can help out as release manager.
>>
>>
>>
>> On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  wrote:
>>
>> I volunteer to be a release manager of 3.0.1, if nobody is working on this.
>>
>>
>>
>>
>>
>> -- Original Message --
>>
>> From: "Gengliang Wang";
>>
>> Sent: Wednesday, June 24, 2020, 4:15 PM
>>
>> To: "Hyukjin Kwon";
>>
>> Cc: "Dongjoon Hyun";"Jungtaek 
>> Lim";"Jules 
>> Damji";"Holden Karau";"Reynold 
>> Xin";"Shivaram 
>> Venkataraman";"Yuanjian 
>> Li";"Spark dev list";"Takeshi 
>> Yamamuro";
>>
>> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>>
>>
>>
>> +1, the issues mentioned are really serious.
>>
>>
>>
>> On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon  wrote:
>>
>> +1.
>>
>> Just as a note,
>> - SPARK-31918 is fixed now, and there's no blocker.
>> - When we build SparkR, we should use the latest R version at least 4.0.0+.
>>
>>
>>
>> On Wed, Jun 24, 2020 at 11:20 AM, Dongjoon Hyun wrote:
>>
>> +1
>>
>>
>>
>> Bests,
>>
>> Dongjoon.
>>
>>
>>
>> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim  
>> wrote:
>>
>> +1 on a 3.0.1 soon.
>>
>>
>>
>> Probably it would be nice if some Scala experts can take a look at 
>> https://issues.apache.org/jira/browse/SPARK-32051 and include the fix into 
>> 3.0.1 if possible.
>>
>> Looks like APIs designed to work with Scala 2.11 & Java bring ambiguity in 
>> Scala 2.12 & Java.
>>
>>
>>
>> On Wed, Jun 24, 2020 at 4:52 AM Jules Damji  wrote:
>>
>> +1 (non-binding)
>>
>>
>>
>> Sent from my iPhone
>>
>> Pardon the dumb thumb typos :)
>>
>>
>>
>> On Jun 23, 2020, at 11:36 AM, Holden Karau  wrote:
>>
>> +1 on a patch release soon
>>
>>
>>
>> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin  wrote:
>>
>> +1 on doing a new patch release soon. I saw some of these issues when 
>> preparing the 3.0 release, and some of them are very serious.
>>
>>
>>
>>
>>
>> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman 
>>  wrote:
>>
>> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-30 Thread Shivaram Venkataraman
Hi all

I just wanted to ping this thread to see if all the outstanding blockers
for 3.0.1 have been fixed. If so, it would be great if we can get the
release going. The CRAN team sent us a note that the version SparkR
available on CRAN for the current R version (4.0.2) is broken and hence we
need to update the package soon --  it will be great to do it with 3.0.1.

Thanks
Shivaram

On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma 
wrote:

> +1 for 3.0.1 release.
> I too can help out as release manager.
>
> On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  wrote:
>
>> I volunteer to be a release manager of 3.0.1, if nobody is working on
>> this.
>>
>>
>> -- Original Message --
>> *From:* "Gengliang Wang";
>> *Sent:* Wednesday, June 24, 2020, 4:15 PM
>> *To:* "Hyukjin Kwon";
>> *Cc:* "Dongjoon Hyun";"Jungtaek Lim"<
>> kabhwan.opensou...@gmail.com>;"Jules Damji";"Holden
>> Karau";"Reynold Xin";"Shivaram
>> Venkataraman";"Yuanjian Li"<
>> xyliyuanj...@gmail.com>;"Spark dev list";"Takeshi
>> Yamamuro";
>> *Subject:* Re: [DISCUSS] Apache Spark 3.0.1 Release
>>
>> +1, the issues mentioned are really serious.
>>
>> On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon  wrote:
>>
>>> +1.
>>>
>>> Just as a note,
>>> - SPARK-31918 <https://issues.apache.org/jira/browse/SPARK-31918> is
>>>   fixed now, and there's no blocker.
>>> - When we build SparkR, we should use the latest R version at least 4.0.0+.
>>>
>>> On Wed, Jun 24, 2020 at 11:20 AM, Dongjoon Hyun wrote:
>>>
>>>> +1
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> +1 on a 3.0.1 soon.
>>>>>
>>>>> Probably it would be nice if some Scala experts can take a look at
>>>>> https://issues.apache.org/jira/browse/SPARK-32051 and include the fix
>>>>> into 3.0.1 if possible.
>>>>> Looks like APIs designed to work with Scala 2.11 & Java bring
>>>>> ambiguity in Scala 2.12 & Java.
>>>>>
>>>>> On Wed, Jun 24, 2020 at 4:52 AM Jules Damji 
>>>>> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> Sent from my iPhone
>>>>>> Pardon the dumb thumb typos :)
>>>>>>
>>>>>> On Jun 23, 2020, at 11:36 AM, Holden Karau 
>>>>>> wrote:
>>>>>>
>>>>>> 
>>>>>> +1 on a patch release soon
>>>>>>
>>>>>> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 on doing a new patch release soon. I saw some of these issues
>>>>>>> when preparing the 3.0 release, and some of them are very serious.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
>>>>>>> shiva...@eecs.berkeley.edu> wrote:
>>>>>>>
>>>>>>>> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1
>>>>>>>> release soon.
>>>>>>>>
>>>>>>>> Shivaram
>>>>>>>>
>>>>>>>> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro <
>>>>>>>> linguin@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thanks for the heads-up, Yuanjian!
>>>>>>>>
>>>>>>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>>>>>>>
>>>>>>>> wow, the updates are so quick. Anyway, +1 for the release.
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Takeshi
>>>>>>>>
>>>>>>>> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi dev-list,
>>>>>>>>
>>>>>>>> I’m writing this to raise the discussion about Spark 3.0.1
>>>>>>>> feasibility since 4 blocker issues were found after Spark 3.0.0:
>>>>>>>>
>>>>>>>> [SPARK-31990] The state store compatibility broken will cause a
>>>>>>>> correctness issue when Streaming query with `dropDuplicate` uses the
>>>>>>>> checkpoint written by the old Spark version.
>>>>>>>>
>>>>>>>> [SPARK-32038] The regression bug in handling NaN values in
>>>>>>>> COUNT(DISTINCT)
>>>>>>>>
>>>>>>>> [SPARK-31918][WIP] CRAN requires to make it working with the latest
>>>>>>>> R 4.0. It makes the 3.0 release unavailable on CRAN, and only supports 
>>>>>>>> R
>>>>>>>> [3.5, 4.0)
>>>>>>>>
>>>>>>>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time
>>>>>>>> regression
>>>>>>>>
>>>>>>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>>>>>>> I think it would be great if we have Spark 3.0.1 to deliver the 
>>>>>>>> critical
>>>>>>>> fixes.
>>>>>>>>
>>>>>>>> Any comments are appreciated.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Yuanjian
>>>>>>>>
>>>>>>>> --
>>>>>>>> ---
>>>>>>>> Takeshi Yamamuro
>>>>>>>>
>>>>>>>> -
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>>


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Shivaram Venkataraman
+1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release soon.

Shivaram

On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro  wrote:
>
> Thanks for the heads-up, Yuanjian!
>
> > I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
> wow, the updates are so quick. Anyway, +1 for the release.
>
> Bests,
> Takeshi
>
> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li  wrote:
>>
>> Hi dev-list,
>>
>>
>> I’m writing this to raise the discussion about Spark 3.0.1 feasibility since 
>> 4 blocker issues were found after Spark 3.0.0:
>>
>>
>> [SPARK-31990] The state store compatibility broken will cause a correctness 
>> issue when Streaming query with `dropDuplicate` uses the checkpoint written 
>> by the old Spark version.
>>
>> [SPARK-32038] The regression bug in handling NaN values in COUNT(DISTINCT)
>>
>> [SPARK-31918][WIP] CRAN requires to make it working with the latest R 4.0. 
>> It makes the 3.0 release unavailable on CRAN, and only supports R [3.5, 4.0)
>>
>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>>
>>
>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I think 
>> it would be great if we have Spark 3.0.1 to deliver the critical fixes.
>>
>>
>> Any comments are appreciated.
>>
>>
>> Best,
>>
>> Yuanjian
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SparkR latest API docs missing?

2019-05-08 Thread Shivaram Venkataraman
Comparing 
https://github.com/apache/spark-website/tree/asf-site/site/docs/2.4.2/api/R
and https://github.com/apache/spark-website/tree/asf-site/site/docs/2.4.3/api/R,
 it looks like the github commit of the docs is missing this.

cc'ing recent release managers.

Thanks
Shivaram

On Wed, May 8, 2019 at 11:27 AM Shivaram Venkataraman
 wrote:
>
> Actually I found this while I was uploading the latest release to CRAN
> -- these docs should be generated as a part of the release process
> though and shouldn't be related to CRAN.
>
> On Wed, May 8, 2019 at 11:24 AM Sean Owen  wrote:
> >
> > I think the SparkR release always trails a little bit due to the
> > additional CRAN processes.
> >
> > On Wed, May 8, 2019 at 11:23 AM Shivaram Venkataraman
> >  wrote:
> > >
> > > I just noticed that the SparkR API docs are missing at
> > > https://spark.apache.org/docs/latest/api/R/index.html --- It looks
> > > like they were missing from the 2.4.3 release?
> > >
> > > Thanks
> > > Shivaram
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SparkR latest API docs missing?

2019-05-08 Thread Shivaram Venkataraman
Actually I found this while I was uploading the latest release to CRAN
-- these docs should be generated as a part of the release process
though and shouldn't be related to CRAN.

On Wed, May 8, 2019 at 11:24 AM Sean Owen  wrote:
>
> I think the SparkR release always trails a little bit due to the
> additional CRAN processes.
>
> On Wed, May 8, 2019 at 11:23 AM Shivaram Venkataraman
>  wrote:
> >
> > I just noticed that the SparkR API docs are missing at
> > https://spark.apache.org/docs/latest/api/R/index.html --- It looks
> > like they were missing from the 2.4.3 release?
> >
> > Thanks
> > Shivaram
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



SparkR latest API docs missing?

2019-05-08 Thread Shivaram Venkataraman
I just noticed that the SparkR API docs are missing at
https://spark.apache.org/docs/latest/api/R/index.html --- It looks
like they were missing from the 2.4.3 release?

Thanks
Shivaram

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Fwd: CRAN submission SparkR 2.3.3

2019-02-24 Thread Shivaram Venkataraman
FYI here is the note from CRAN from submitting 2.3.3. There were some
minor issues with the package description file in our CRAN submission.
We are discussing this with the CRAN team, and Felix also has a
patch to address it for upcoming releases.

One thing I was wondering is that if there have not been too many
changes since 2.3.3, how much effort would it be to cut a 2.3.4 with
just this change.
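
For reference, the kind of DESCRIPTION edit being requested in the note below
(single-quoting software names and adding a web reference) might look like
this; the exact wording in Felix's patch may differ:

    Title: R Front End for 'Apache Spark'
    Description: Provides an R front end for 'Apache Spark'
        <https://spark.apache.org>.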

Thanks
Shivaram

-- Forwarded message -
From: Uwe Ligges 
Date: Sun, Feb 17, 2019 at 12:28 PM
Subject: Re: CRAN submission SparkR 2.3.3
To: Shivaram Venkataraman , CRAN



Thanks, but see below.


On 17.02.2019 18:46, CRAN submission wrote:
> [This was generated from CRAN.R-project.org/submit.html]
>
> The following package was uploaded to CRAN:
> ===
>
> Package Information:
> Package: SparkR
> Version: 2.3.3
> Title: R Frontend for Apache Spark

Perhaps omit the redundant R?

Please single quote software names.


> Author(s): Shivaram Venkataraman [aut, cre], Xiangrui Meng [aut], Felix
>Cheung [aut], The Apache Software Foundation [aut, cph]
> Maintainer: Shivaram Venkataraman 
> Depends: R (>= 3.0), methods
> Suggests: knitr, rmarkdown, testthat, e1071, survival
> Description: Provides an R Frontend for Apache Spark.

Please single quote software names and give a web reference in the form
 to Apache Spark.

Best,
Uwe Ligges




> License: Apache License (== 2.0)
>
>
> The maintainer confirms that he or she
> has read and agrees to the CRAN policies.
>
> =
>
> Original content of DESCRIPTION file:
>
> Package: SparkR
> Type: Package
> Version: 2.3.3
> Title: R Frontend for Apache Spark
> Description: Provides an R Frontend for Apache Spark.
> Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),
>  email = "shiva...@cs.berkeley.edu"),
>   person("Xiangrui", "Meng", role = "aut",
>  email = "m...@databricks.com"),
>   person("Felix", "Cheung", role = "aut",
>  email = "felixche...@apache.org"),
>   person(family = "The Apache Software Foundation", role = 
> c("aut", "cph")))
> License: Apache License (== 2.0)
> URL: http://www.apache.org/ http://spark.apache.org/
> BugReports: http://spark.apache.org/contributing.html
> SystemRequirements: Java (== 8)
> Depends: R (>= 3.0), methods
> Suggests: knitr, rmarkdown, testthat, e1071, survival
> Collate: 'schema.R' 'generics.R' 'jobj.R' 'column.R' 'group.R' 'RDD.R'
>  'pairRDD.R' 'DataFrame.R' 'SQLContext.R' 'WindowSpec.R'
>  'backend.R' 'broadcast.R' 'catalog.R' 'client.R' 'context.R'
>  'deserialize.R' 'functions.R' 'install.R' 'jvm.R'
>  'mllib_classification.R' 'mllib_clustering.R' 'mllib_fpm.R'
>  'mllib_recommendation.R' 'mllib_regression.R' 'mllib_stat.R'
>  'mllib_tree.R' 'mllib_utils.R' 'serialize.R' 'sparkR.R'
>  'stats.R' 'streaming.R' 'types.R' 'utils.R' 'window.R'
> RoxygenNote: 6.1.1
> VignetteBuilder: knitr
> NeedsCompilation: no
> Packaged: 2019-02-04 15:40:09 UTC; spark-rm
> Author: Shivaram Venkataraman [aut, cre],
>Xiangrui Meng [aut],
>Felix Cheung [aut],
>The Apache Software Foundation [aut, cph]
> Maintainer: Shivaram Venkataraman 
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Vectorized R gapply[Collect]() implementation

2019-02-09 Thread Shivaram Venkataraman
Those speedups look awesome! Great work Hyukjin!

Thanks
Shivaram

On Sat, Feb 9, 2019 at 7:41 AM Hyukjin Kwon  wrote:
>
> Guys, as continuation of Arrow optimization for R DataFrame to Spark 
> DataFrame,
>
> I am trying to make a vectorized gapply[Collect] implementation as an 
> experiment like vectorized Pandas UDFs
>
> It brought 820%+ performance improvement. See 
> https://github.com/apache/spark/pull/23746
>
> Please come and take a look if you're interested in R APIs :D. I have already 
> cc'ed some people I know but please come, review and discuss for both Spark 
> side and Arrow side.
>
> This Arrow optimization job is being done under 
> https://issues.apache.org/jira/browse/SPARK-26759 . Please feel free to take 
> one if anyone of you is interested in it.
>
> Thanks.
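
For readers less familiar with the API being vectorized, a minimal
gapply-style example in SparkR; the data, grouping column and function below
are purely illustrative and not taken from the benchmark in the PR:

    library(SparkR)
    sparkR.session()

    df <- createDataFrame(mtcars)

    # Apply an R function to each group (here: mean mpg per cylinder count)
    # and collect the result back as a local data.frame. gapplyCollect does
    # not need an output schema; gapply does.
    result <- gapplyCollect(
      df,
      "cyl",
      function(key, x) {
        data.frame(cyl = key[[1]], avg_mpg = mean(x$mpg))
      }
    )
    print(result)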

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Arrow optimization in conversion from R DataFrame to Spark DataFrame

2018-11-09 Thread Shivaram Venkataraman
Thanks Hyukjin! Very cool results

Shivaram
On Fri, Nov 9, 2018 at 10:58 AM Felix Cheung  wrote:
>
> Very cool!
>
>
> 
> From: Hyukjin Kwon 
> Sent: Thursday, November 8, 2018 10:29 AM
> To: dev
> Subject: Arrow optimization in conversion from R DataFrame to Spark DataFrame
>
> Hi all,
>
> I am trying to introduce R Arrow optimization by reusing PySpark Arrow 
> optimization.
>
> It boosts R DataFrame > Spark DataFrame up to roughly 900% ~ 1200% faster.
>
> Looks working fine so far; however, I would appreciate if you guys have some 
> time to take a look (https://github.com/apache/spark/pull/22954) so that we 
> can directly go ahead as soon as R API of Arrow is released.
>
> More importantly, I want some more people who're more into Arrow R API side 
> but also interested in Spark side. I have already cc'ed some people I know 
> but please come, review and discuss for both Spark side and Arrow side.
>
> Thanks.
>
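
For context, the conversion path being optimized is createDataFrame() on a
local R data.frame. A rough sketch of trying it out; the Arrow config key
below is an assumption for illustration, not necessarily the flag name used
in the PR:

    library(SparkR)

    # Assumed config key -- the PR gates the optimization behind an
    # Arrow-related spark.sql.execution.* flag.
    sparkR.session(sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))

    r_df <- data.frame(x = rnorm(1e6), y = runif(1e6))

    # Local R data.frame -> Spark DataFrame, the path the optimization targets.
    system.time(sdf <- createDataFrame(r_df))
    head(sdf)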

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-07 Thread Shivaram Venkataraman
Agree with the points Felix made.

One thing is that it looks like the only problem is vignettes and the
tests are being skipped as designed. If you see
https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Windows/00check.log
and 
https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Debian/00check.log,
the tests run in 1s.
On Tue, Nov 6, 2018 at 1:29 PM Felix Cheung  wrote:
>
> I'd rather not mess with 2.4.0 at this point. Being on CRAN is nice, but users can 
> also install from an Apache mirror.
>
> Also I had attempted and failed to get vignettes not to build; it was non-trivial 
> and I couldn't get it to work. But I have an idea.
>
> As for tests, I don't know exactly why it is not skipped. Need to investigate, 
> but worst case test_package can run with 0 tests.
>
>
>
> 
> From: Sean Owen 
> Sent: Tuesday, November 6, 2018 10:51 AM
> To: Shivaram Venkataraman
> Cc: Felix Cheung; Wenchen Fan; Matei Zaharia; dev
> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>
> I think the second option, to skip the tests, is best right now, if
> the alternative is to have no SparkR release at all!
> Can we monkey-patch the 2.4.0 release for SparkR in this way, bless it
> from the PMC, and release that? It's drastic but so is not being able
> to release, I think.
> Right? or is CRAN not actually an important distribution path for
> SparkR in particular?
>
> On Tue, Nov 6, 2018 at 12:49 PM Shivaram Venkataraman
>  wrote:
> >
> > Right - I think we should move on with 2.4.0.
> >
> > In terms of what can be done to avoid this error there are two strategies
> > - Felix had this other thread about JDK 11 that should at least let
> > Spark run on the CRAN instance. In general this strategy isn't
> > foolproof because the JDK version and other dependencies on that
> > machine keep changing over time and we don't have much control over it.
> > Worse, we also don't have much control
> > - The other solution is to not run code to build the vignettes
> > document and just have static code blocks there that have been
> > pre-evaluated / pre-populated. We can open a JIRA to discuss the
> > pros/cons of this?
> >
> > Thanks
> > Shivaram
> >
> > On Tue, Nov 6, 2018 at 10:57 AM Felix Cheung  
> > wrote:
> > >
> > > We have not been able to publish to CRAN for quite some time (since 2.3.0 
> > > was archived - the cause is Java 11)
> > >
> > > I think it’s ok to announce the release of 2.4.0
> > >
> > >
> > > 
> > > From: Wenchen Fan 
> > > Sent: Tuesday, November 6, 2018 8:51 AM
> > > To: Felix Cheung
> > > Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
> > > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> > >
> > > Do you mean we should have a 2.4.0 release without CRAN and then do a 
> > > 2.4.1 immediately?
> > >
> > > On Wed, Nov 7, 2018 at 12:34 AM Felix Cheung  
> > > wrote:
> > >>
> > >> Shivaram and I were discussing.
> > >> Actually we worked with them before. Another possible approach is to 
> > >> remove the vignettes eval and all test from the source package... in the 
> > >> next release.
> > >>
> > >>
> > >> 
> > >> From: Matei Zaharia 
> > >> Sent: Tuesday, November 6, 2018 12:07 AM
> > >> To: Felix Cheung
> > >> Cc: Sean Owen; dev; Shivaram Venkataraman
> > >> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> > >>
> > >> Maybe it's worth contacting the CRAN maintainers to ask for help? 
> > >> Perhaps we aren’t disabling it correctly, or perhaps they can ignore 
> > >> this specific failure. +Shivaram who might have some ideas.
> > >>
> > >> Matei
> > >>
> > >> > On Nov 5, 2018, at 9:09 PM, Felix Cheung  
> > >> > wrote:
> > >> >
> > >> > I don't know what the cause is yet.
> > >> >
> > >> > The test should be skipped because of this check
> > >> > https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L21
> > >> >
> > >> > And this
> > >> > https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L57
> > >> >
> > >> > But it ran:
> > >> > callJStatic(

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-06 Thread Shivaram Venkataraman
Right - I think we should move on with 2.4.0.

In terms of what can be done to avoid this error there are two strategies
- Felix had this other thread about JDK 11 that should at least let
Spark run on the CRAN instance. In general this strategy isn't
foolproof because the JDK version and other dependencies on that
machine keep changing over time and we don't have much control over it.
Worse, we also don't have much control
- The other solution is to not run code to build the vignettes
document and just have static code blocks there that have been
pre-evaluated / pre-populated. We can open a JIRA to discuss the
pros/cons of this?
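
To make the second option concrete, a rough sketch of what a pre-evaluated
chunk in sparkr-vignettes.Rmd could look like; the chunk name and the pasted
output below are made up for illustration:

    ```{r spark-count, eval = FALSE}
    # eval = FALSE: nothing runs at CRAN check time, so no JVM or Spark
    # download is needed. The output below is pasted in by hand.
    df <- createDataFrame(faithful)
    count(df)
    ```

    ```
    ## [1] 272
    ```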

Thanks
Shivaram

On Tue, Nov 6, 2018 at 10:57 AM Felix Cheung  wrote:
>
> We have not been able to publish to CRAN for quite some time (since 2.3.0 was 
> archived - the cause is Java 11)
>
> I think it’s ok to announce the release of 2.4.0
>
>
> 
> From: Wenchen Fan 
> Sent: Tuesday, November 6, 2018 8:51 AM
> To: Felix Cheung
> Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>
> Do you mean we should have a 2.4.0 release without CRAN and then do a 2.4.1 
> immediately?
>
> On Wed, Nov 7, 2018 at 12:34 AM Felix Cheung  
> wrote:
>>
>> Shivaram and I were discussing.
>> Actually we worked with them before. Another possible approach is to remove 
>> the vignettes eval and all test from the source package... in the next 
>> release.
>>
>>
>> 
>> From: Matei Zaharia 
>> Sent: Tuesday, November 6, 2018 12:07 AM
>> To: Felix Cheung
>> Cc: Sean Owen; dev; Shivaram Venkataraman
>> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>>
>> Maybe it's worth contacting the CRAN maintainers to ask for help? Perhaps we 
>> aren’t disabling it correctly, or perhaps they can ignore this specific 
>> failure. +Shivaram who might have some ideas.
>>
>> Matei
>>
>> > On Nov 5, 2018, at 9:09 PM, Felix Cheung  wrote:
>> >
>> > I don't know what the cause is yet.
>> >
>> > The test should be skipped because of this check
>> > https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L21
>> >
>> > And this
>> > https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L57
>> >
>> > But it ran:
>> > callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
>> > "fit", formula,
>> >
>> > The earlier release was archived because of Java 11+ too, so this 
>> > unfortunately isn't new.
>> >
>> >
>> > From: Sean Owen 
>> > Sent: Monday, November 5, 2018 7:22 PM
>> > To: Felix Cheung
>> > Cc: dev
>> > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>> >
>> > What can we do to get the release through? is there any way to
>> > circumvent these tests or otherwise hack it? or does it need a
>> > maintenance release?
>> > On Mon, Nov 5, 2018 at 8:53 PM Felix Cheung  
>> > wrote:
>> > >
>> > > FYI. SparkR submission failed. It seems to detect Java 11 correctly with 
>> > > vignettes but not skipping tests as would be expected.
>> > >
> >> > > Error: processing vignette 'sparkr-vignettes.Rmd' failed with 
>> > > diagnostics:
>> > > Java version 8 is required for this package; found version: 11.0.1
>> > > Execution halted
>> > >
>> > > * checking PDF version of manual ... OK
>> > > * DONE
>> > > Status: 1 WARNING, 1 NOTE
>> > >
>> > > Current CRAN status: ERROR: 1, OK: 1
>> > > See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
>> > >
>> > > Version: 2.3.0
>> > > Check: tests, Result: ERROR
>> > > Running 'run-all.R' [8s/35s]
>> > > Running the tests in 'tests/run-all.R' failed.
>> > > Last 13 lines of output:
>> > > 4: 
>> > > callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
>> > > "fit", formula,
>> > > data@sdf, tolower(family$family), family$link, tol, as.integer(maxIter), 
>> > > weightCol,
>> > > regParam, as.double(var.power), as.double(link.power), 
>> > > stringIndexerOrderType,
>> > > offsetCol)
>> > > 5: invokeJava(isStatic = TRUE, className, methodName, ...)
>> > &

Re: Removing non-deprecated R methods that were deprecated in Python, Scala?

2018-11-06 Thread Shivaram Venkataraman
Yep. That sounds good to me.
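
For reference, the kind of R-side deprecation shim being discussed below
usually has this shape; an illustrative sketch, not the actual SparkR source,
and it assumes the new degrees() replacement already exists:

    # Forward the old name to the new one and emit a deprecation warning.
    toDegrees <- function(x) {
      .Deprecated("degrees", package = "SparkR")
      degrees(x)
    }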
On Tue, Nov 6, 2018 at 11:06 AM Sean Owen  wrote:
>
> Sounds good, remove in 3.1? I can update accordingly.
>
> On Tue, Nov 6, 2018, 10:46 AM Reynold Xin  wrote:
>> Maybe deprecate and remove in next version? It is bad to just remove a 
>> method without deprecation notice.
>>
>> On Tue, Nov 6, 2018 at 5:44 AM Sean Owen  wrote:
>>>
>>> See https://github.com/apache/spark/pull/22921#discussion_r230568058
>>>
>>> Methods like toDegrees, toRadians, approxCountDistinct were 'renamed'
>>> in Spark 2.1: deprecated, and replaced with an identical method with a
>>> different name. However, these weren't actually deprecated in SparkR.
>>>
>>> Is it an oversight that we should just correct anyway by removing, to
>>> stay synced?
>>> Or deprecate and retain these in Spark 3.0.0?
>>>
>>> Sean
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread Shivaram Venkataraman
Sounds good to me as well. Thanks Shane.

Shivaram
On Fri, Aug 10, 2018 at 1:40 PM Reynold Xin  wrote:
>
> SGTM
>
> On Fri, Aug 10, 2018 at 1:39 PM shane knapp  wrote:
>>
>> https://issues.apache.org/jira/browse/SPARK-25089
>>
>> basically since these branches are old, and there will be a greater than 
>> zero amount of work to get lint-r to pass (on the new ubuntu workers), sean 
>> and i are proposing to remove the lint-r checks for the builds.
>>
>> this is super not important for the 2.4 cut/code freeze, but i wanted to get 
>> this done before it gets pushed down my queue and before we revisit the 
>> ubuntu port.
>>
>> thanks in advance,
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2

2018-07-09 Thread Shivaram Venkataraman
I dont think we need to respin 2.2.2 -- Given that 2.3.2 is on the way
we can just submit that.

Shivaram
On Mon, Jul 9, 2018 at 6:19 PM Tom Graves  wrote:
>
> Is there any way to push it to CRAN without this fix? I don't really want to 
> respin 2.2.2 just with the test fix.
>
> Tom
>
> On Monday, July 9, 2018, 4:50:18 PM CDT, Shivaram Venkataraman 
>  wrote:
>
>
> Yes. I think Felix checked in a fix to ignore tests run on java
> versions that are not Java 8 (I think the fix was in
> https://github.com/apache/spark/pull/21666 which is in 2.3.2)
>
> Shivaram
> On Mon, Jul 9, 2018 at 5:39 PM Sean Owen  wrote:
> >
> > Yes, this flavor of error should only come up in Java 9. Spark doesn't 
> > support that. Is there any way to tell CRAN this should not be tested?
> >
> > On Mon, Jul 9, 2018, 4:17 PM Shivaram Venkataraman 
> >  wrote:
> >>
> >> The upcoming 2.2.2 release was submitted to CRAN. I think there are
> >> some knows issues on Windows, but does anybody know what the following
> >> error with Netty is ?
> >>
> >> >WARNING: Illegal reflective access by 
> >> > io.netty.util.internal.PlatformDependent0$1 
> >> > (file:/home/hornik/.cache/spark/spark-2.2.2-bin-hadoop2.7/jars/netty-all-4.0.43.Final.jar)
> >> >  to field java.nio.Buffer.address
> >>
> >> Thanks
> >> Shivaram
> >>
> >>
> >> -- Forwarded message -
> >> From: 
> >> Date: Mon, Jul 9, 2018 at 12:12 PM
> >> Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2
> >> To: 
> >> Cc: 
> >>
> >>
> >> Dear maintainer,
> >>
> >> package SparkR_2.2.2.tar.gz does not pass the incoming checks
> >> automatically, please see the following pre-tests:
> >> Windows: 
> >> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/Windows/00check.log>
> >> Status: 1 ERROR, 1 WARNING
> >> Debian: 
> >> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/Debian/00check.log>
> >> Status: 1 ERROR, 2 WARNINGs
> >>
> >> Last released version's CRAN status: ERROR: 1, OK: 1
> >> See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
> >>
> >> CRAN Web: <https://cran.r-project.org/package=SparkR>
> >>
> >> Please fix all problems and resubmit a fixed version via the webform.
> >> If you are not sure how to fix the problems shown, please ask for help
> >> on the R-package-devel mailing list:
> >> <https://stat.ethz.ch/mailman/listinfo/r-package-devel>
> >> If you are fairly certain the rejection is a false positive, please
> >> reply-all to this message and explain.
> >>
> >> More details are given in the directory:
> >> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/>
> >> The files will be removed after roughly 7 days.
> >>
> >> No strong reverse dependencies to be checked.
> >>
> >> Best regards,
> >> CRAN teams' auto-check service
> >> Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
> >> Check: CRAN incoming feasibility, Result: WARNING
> >>  Maintainer: 'Shivaram Venkataraman '
> >>
> >>  New submission
> >>
> >>  Package was archived on CRAN
> >>
> >>  Insufficient package version (submitted: 2.2.2, existing: 2.3.0)
> >>
> >>  Possibly mis-spelled words in DESCRIPTION:
> >>Frontend (4:10, 5:28)
> >>
> >>  CRAN repository db overrides:
> >>X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
> >>  corrected despite reminders.
> >>
> >>  Found the following (possibly) invalid URLs:
> >>URL: http://spark.apache.org/docs/latest/api/R/mean.html
> >>  From: inst/doc/sparkr-vignettes.html
> >>  Status: 404
> >>  Message: Not Found
> >>
> >> Flavor: r-devel-windows-ix86+x86_64
> >> Check: running tests for arch 'x64', Result: ERROR
> >>Running 'run-all.R' [175s]
> >>  Running the tests in 'tests/run-all.R' failed.
> >>  Complete output:
> >>> #
> >>> # Licensed to the Apache Software Foundation (ASF) under one or more
> >>> # contributor license agreements.  See the NOTICE file distributed 
> >> with
> >>> # this work for additional information regarding copyr

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2

2018-07-09 Thread Shivaram Venkataraman
Yes. I think Felix checked in a fix to ignore tests run on java
versions that are not Java 8 (I think the fix was in
https://github.com/apache/spark/pull/21666 which is in 2.3.2)
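
Roughly, that kind of guard has the following shape; this is an illustrative
sketch (the helper name is made up), not the code from the linked PR:

    # Helper that individual test_that() blocks can call to bail out when
    # the detected JVM is not Java 8 (e.g. on CRAN check machines).
    skip_unless_java8 <- function() {
      out <- tryCatch(
        system2("java", "-version", stdout = TRUE, stderr = TRUE),
        error = function(e) character(0)
      )
      if (length(out) == 0 || !grepl('version "1\\.8', out[1])) {
        testthat::skip("requires Java 8")
      }
    }

    testthat::test_that("basic Spark operations", {
      skip_unless_java8()
      # ... SparkR assertions would go here ...
      testthat::expect_true(TRUE)
    })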

Shivaram
On Mon, Jul 9, 2018 at 5:39 PM Sean Owen  wrote:
>
> Yes, this flavor of error should only come up in Java 9. Spark doesn't 
> support that. Is there any way to tell CRAN this should not be tested?
>
> On Mon, Jul 9, 2018, 4:17 PM Shivaram Venkataraman 
>  wrote:
>>
>> The upcoming 2.2.2 release was submitted to CRAN. I think there are
>> some known issues on Windows, but does anybody know what the following
>> error with Netty is ?
>>
>> > WARNING: Illegal reflective access by 
>> > io.netty.util.internal.PlatformDependent0$1 
>> > (file:/home/hornik/.cache/spark/spark-2.2.2-bin-hadoop2.7/jars/netty-all-4.0.43.Final.jar)
>> >  to field java.nio.Buffer.address
>>
>> Thanks
>> Shivaram
>>
>>
>> -- Forwarded message -
>> From: 
>> Date: Mon, Jul 9, 2018 at 12:12 PM
>> Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2
>> To: 
>> Cc: 
>>
>>
>> Dear maintainer,
>>
>> package SparkR_2.2.2.tar.gz does not pass the incoming checks
>> automatically, please see the following pre-tests:
>> Windows: 
>> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/Windows/00check.log>
>> Status: 1 ERROR, 1 WARNING
>> Debian: 
>> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/Debian/00check.log>
>> Status: 1 ERROR, 2 WARNINGs
>>
>> Last released version's CRAN status: ERROR: 1, OK: 1
>> See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
>>
>> CRAN Web: <https://cran.r-project.org/package=SparkR>
>>
>> Please fix all problems and resubmit a fixed version via the webform.
>> If you are not sure how to fix the problems shown, please ask for help
>> on the R-package-devel mailing list:
>> <https://stat.ethz.ch/mailman/listinfo/r-package-devel>
>> If you are fairly certain the rejection is a false positive, please
>> reply-all to this message and explain.
>>
>> More details are given in the directory:
>> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/>
>> The files will be removed after roughly 7 days.
>>
>> No strong reverse dependencies to be checked.
>>
>> Best regards,
>> CRAN teams' auto-check service
>> Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
>> Check: CRAN incoming feasibility, Result: WARNING
>>   Maintainer: 'Shivaram Venkataraman '
>>
>>   New submission
>>
>>   Package was archived on CRAN
>>
>>   Insufficient package version (submitted: 2.2.2, existing: 2.3.0)
>>
>>   Possibly mis-spelled words in DESCRIPTION:
>> Frontend (4:10, 5:28)
>>
>>   CRAN repository db overrides:
>> X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
>>   corrected despite reminders.
>>
>>   Found the following (possibly) invalid URLs:
>> URL: http://spark.apache.org/docs/latest/api/R/mean.html
>>   From: inst/doc/sparkr-vignettes.html
>>   Status: 404
>>   Message: Not Found
>>
>> Flavor: r-devel-windows-ix86+x86_64
>> Check: running tests for arch 'x64', Result: ERROR
>> Running 'run-all.R' [175s]
>>   Running the tests in 'tests/run-all.R' failed.
>>   Complete output:
>> > #
>> > # Licensed to the Apache Software Foundation (ASF) under one or more
>> > # contributor license agreements.  See the NOTICE file distributed with
>> > # this work for additional information regarding copyright ownership.
>> > # The ASF licenses this file to You under the Apache License, Version 
>> 2.0
>> > # (the "License"); you may not use this file except in compliance with
>> > # the License.  You may obtain a copy of the License at
>> > #
>> > #http://www.apache.org/licenses/LICENSE-2.0
>> > #
>> > # Unless required by applicable law or agreed to in writing, software
>> > # distributed under the License is distributed on an "AS IS" BASIS,
>> > # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>> implied.
>> > # See the License for the specific language governing permissions and
>> > # limitations under the License.
>> > #
>> >
>> 

Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2

2018-07-09 Thread Shivaram Venkataraman
The upcoming 2.2.2 release was submitted to CRAN. I think there are
some known issues on Windows, but does anybody know what the following
error with Netty is ?

> WARNING: Illegal reflective access by 
> io.netty.util.internal.PlatformDependent0$1 
> (file:/home/hornik/.cache/spark/spark-2.2.2-bin-hadoop2.7/jars/netty-all-4.0.43.Final.jar)
>  to field java.nio.Buffer.address

Thanks
Shivaram


-- Forwarded message -
From: 
Date: Mon, Jul 9, 2018 at 12:12 PM
Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2
To: 
Cc: 


Dear maintainer,

package SparkR_2.2.2.tar.gz does not pass the incoming checks
automatically, please see the following pre-tests:
Windows: 
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/Windows/00check.log>
Status: 1 ERROR, 1 WARNING
Debian: 
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/Debian/00check.log>
Status: 1 ERROR, 2 WARNINGs

Last released version's CRAN status: ERROR: 1, OK: 1
See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>

CRAN Web: <https://cran.r-project.org/package=SparkR>

Please fix all problems and resubmit a fixed version via the webform.
If you are not sure how to fix the problems shown, please ask for help
on the R-package-devel mailing list:
<https://stat.ethz.ch/mailman/listinfo/r-package-devel>
If you are fairly certain the rejection is a false positive, please
reply-all to this message and explain.

More details are given in the directory:
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/>
The files will be removed after roughly 7 days.

No strong reverse dependencies to be checked.

Best regards,
CRAN teams' auto-check service
Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
Check: CRAN incoming feasibility, Result: WARNING
  Maintainer: 'Shivaram Venkataraman '

  New submission

  Package was archived on CRAN

  Insufficient package version (submitted: 2.2.2, existing: 2.3.0)

  Possibly mis-spelled words in DESCRIPTION:
Frontend (4:10, 5:28)

  CRAN repository db overrides:
X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
  corrected despite reminders.

  Found the following (possibly) invalid URLs:
URL: http://spark.apache.org/docs/latest/api/R/mean.html
  From: inst/doc/sparkr-vignettes.html
  Status: 404
  Message: Not Found

Flavor: r-devel-windows-ix86+x86_64
Check: running tests for arch 'x64', Result: ERROR
Running 'run-all.R' [175s]
  Running the tests in 'tests/run-all.R' failed.
  Complete output:
> #
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements.  See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License.  You may obtain a copy of the License at
> #
> #http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
> library(testthat)
> library(SparkR)

Attaching package: 'SparkR'

The following object is masked from 'package:testthat':

describe

The following objects are masked from 'package:stats':

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

>
> # Turn all warnings into errors
> options("warn" = 2)
>
> if (.Platform$OS.type == "windows") {
+   Sys.setenv(TZ = "GMT")
+ }
>
> # Setup global test environment
> # Install Spark first to set SPARK_HOME
>
> # NOTE(shivaram): We set overwrite to handle any old tar.gz
files or directories left behind on
> # CRAN machines. For Jenkins we should already have SPARK_HOME set.
> install.spark(overwrite = TRUE)
Overwrite = TRUE: download and overwrite the tar fileand Spark
package directory if they exist.
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://mirror.dkd.de/apache/spark
Downloading 

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1

2018-06-12 Thread Shivaram Venkataraman
#1 - Yes. It doesn't look like that is being honored. This is
something we should follow up with CRAN about

#2 - Looking at it more closely, I'm not sure what the problem is. If
the version string is 1.8.0_144 then our parsing code does work
correctly. We might need to add more debug logging or ask CRAN to
figure out what the output of `java -version` is on that machine. We
can move this discussion to the JIRA.
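
For reference, the kind of parsing involved, as a standalone sketch (the
function name is made up and this is not SparkR's actual implementation):
pull the major version out of a string like 1.8.0_144 and compare it to 8.

    parse_java_major <- function(version_string) {
      # "1.8.0_144" -> 8 ; "10.0.1" -> 10
      parts <- strsplit(version_string, "[.]")[[1]]
      major <- as.integer(parts[1])
      if (!is.na(major) && major == 1) {
        major <- as.integer(parts[2])
      }
      major
    }

    parse_java_major("1.8.0_144")  # 8
    parse_java_major("10.0.1")     # 10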

Shivaram
On Tue, Jun 12, 2018 at 3:21 PM Felix Cheung  wrote:
>
> For #1 is system requirements not honored?
>
> For #2 it looks like Oracle JDK?
>
> ____
> From: Shivaram Venkataraman 
> Sent: Tuesday, June 12, 2018 3:17:52 PM
> To: dev
> Cc: Felix Cheung
> Subject: Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1
>
> Corresponding to the Spark 2.3.1 release, I submitted the SparkR build
> to CRAN yesterday. Unfortunately it looks like there are a couple of
> issues (full message from CRAN is forwarded below)
>
> 1. There are some builds started with Java 10
> (http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Debian/00check.log)
> which are right now counted as test failures. I wonder if we should
> somehow mark them as skipped ? I can ping the CRAN team about this.
>
> 2. There is another issue with Java version parsing which
> unfortunately affects even Java 8 builds. I've created
> https://issues.apache.org/jira/browse/SPARK-24535 to track this.
>
> Thanks
> Shivaram
>
>
> -- Forwarded message -
> From: 
> Date: Mon, Jun 11, 2018 at 11:24 AM
> Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1
> To: 
> Cc: 
>
>
> Dear maintainer,
>
> package SparkR_2.3.1.tar.gz does not pass the incoming checks
> automatically, please see the following pre-tests:
> Windows: 
> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/Windows/00check.log>
> Status: 2 ERRORs, 1 NOTE
> Debian: 
> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/Debian/00check.log>
> Status: 1 ERROR, 1 WARNING, 1 NOTE
>
> Last released version's CRAN status: ERROR: 1, OK: 1
> See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
>
> CRAN Web: <https://cran.r-project.org/package=SparkR>
>
> Please fix all problems and resubmit a fixed version via the webform.
> If you are not sure how to fix the problems shown, please ask for help
> on the R-package-devel mailing list:
> <https://stat.ethz.ch/mailman/listinfo/r-package-devel>
> If you are fairly certain the rejection is a false positive, please
> reply-all to this message and explain.
>
> More details are given in the directory:
> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/>
> The files will be removed after roughly 7 days.
>
> No strong reverse dependencies to be checked.
>
> Best regards,
> CRAN teams' auto-check service
> Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
> Check: CRAN incoming feasibility, Result: NOTE
>   Maintainer: 'Shivaram Venkataraman '
>
>   New submission
>
>   Package was archived on CRAN
>
>   Possibly mis-spelled words in DESCRIPTION:
> Frontend (4:10, 5:28)
>
>   CRAN repository db overrides:
> X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
>   corrected despite reminders.
>
> Flavor: r-devel-windows-ix86+x86_64
> Check: running tests for arch 'i386', Result: ERROR
> Running 'run-all.R' [30s]
>   Running the tests in 'tests/run-all.R' failed.
>   Complete output:
> > #
> > # Licensed to the Apache Software Foundation (ASF) under one or more
> > # contributor license agreements.  See the NOTICE file distributed with
> > # this work for additional information regarding copyright ownership.
> > # The ASF licenses this file to You under the Apache License, Version 
> 2.0
> > # (the "License"); you may not use this file except in compliance with
> > # the License.  You may obtain a copy of the License at
> > #
> > #http://www.apache.org/licenses/LICENSE-2.0
> > #
> > # Unless required by applicable law or agreed to in writing, software
> > # distributed under the License is distributed on an "AS IS" BASIS,
> > # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
> implied.
> > # See the License for the specific language governing permissions and
> > # limitations under the License.
> > #
> >
> > library(testthat)
> > library(SparkR)
>
> Attaching package: 'SparkR'
>
> Th

Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1

2018-06-12 Thread Shivaram Venkataraman
Corresponding to the Spark 2.3.1 release, I submitted the SparkR build
to CRAN yesterday. Unfortunately it looks like there are a couple of
issues (full message from CRAN is forwarded below)

1. There are some builds started with Java 10
(http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Debian/00check.log)
which are right now counted as test failures. I wonder if we should
somehow mark them as skipped ? I can ping the CRAN team about this.

2. There is another issue with Java version parsing which
unfortunately affects even Java 8 builds. I've created
https://issues.apache.org/jira/browse/SPARK-24535 to track this.

Thanks
Shivaram


-- Forwarded message -
From: 
Date: Mon, Jun 11, 2018 at 11:24 AM
Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1
To: 
Cc: 


Dear maintainer,

package SparkR_2.3.1.tar.gz does not pass the incoming checks
automatically, please see the following pre-tests:
Windows: 
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/Windows/00check.log>
Status: 2 ERRORs, 1 NOTE
Debian: 
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/Debian/00check.log>
Status: 1 ERROR, 1 WARNING, 1 NOTE

Last released version's CRAN status: ERROR: 1, OK: 1
See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>

CRAN Web: <https://cran.r-project.org/package=SparkR>

Please fix all problems and resubmit a fixed version via the webform.
If you are not sure how to fix the problems shown, please ask for help
on the R-package-devel mailing list:
<https://stat.ethz.ch/mailman/listinfo/r-package-devel>
If you are fairly certain the rejection is a false positive, please
reply-all to this message and explain.

More details are given in the directory:
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/>
The files will be removed after roughly 7 days.

No strong reverse dependencies to be checked.

Best regards,
CRAN teams' auto-check service
Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
Check: CRAN incoming feasibility, Result: NOTE
  Maintainer: 'Shivaram Venkataraman '

  New submission

  Package was archived on CRAN

  Possibly mis-spelled words in DESCRIPTION:
Frontend (4:10, 5:28)

  CRAN repository db overrides:
X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
  corrected despite reminders.

Flavor: r-devel-windows-ix86+x86_64
Check: running tests for arch 'i386', Result: ERROR
Running 'run-all.R' [30s]
  Running the tests in 'tests/run-all.R' failed.
  Complete output:
> #
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements.  See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License.  You may obtain a copy of the License at
> #
> #http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
> library(testthat)
> library(SparkR)

Attaching package: 'SparkR'

The following objects are masked from 'package:testthat':

describe, not

The following objects are masked from 'package:stats':

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

>
> # Turn all warnings into errors
> options("warn" = 2)
>
> if (.Platform$OS.type == "windows") {
+   Sys.setenv(TZ = "GMT")
+ }
>
> # Setup global test environment
> # Install Spark first to set SPARK_HOME
>
> # NOTE(shivaram): We set overwrite to handle any old tar.gz
files or directories left behind on
> # CRAN machines. For Jenkins we should already have SPARK_HOME set.
> install.spark(overwrite = TRUE)
Overwrite = TRUE: download and overwrite the tar fileand Spark
package directory if they exist.
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://apache.mirror.digionline.de/spark
Downloading spark-2.3.

Re: [VOTE] SPIP ML Pipelines in R

2018-05-31 Thread Shivaram Venkataraman
Hossein -- Can you clarify what the resolution was on the repository /
release issue discussed in the SPIP?

Shivaram

On Thu, May 31, 2018 at 9:06 AM, Felix Cheung  wrote:
> +1
> With my concerns in the SPIP discussion.
>
> 
> From: Hossein 
> Sent: Wednesday, May 30, 2018 2:03:03 PM
> To: dev@spark.apache.org
> Subject: [VOTE] SPIP ML Pipelines in R
>
> Hi,
>
> I started a discussion thread for a new R package to expose MLlib pipelines
> in R.
>
> To summarize, we will work on utilities to generate R wrappers for the MLlib
> pipeline API in a new R package. This will lower the burden of exposing
> new APIs in the future.
>
> Following the SPIP process, I am proposing the SPIP for a vote.
>
> +1: Let's go ahead and implement the SPIP.
> +0: Don't really care.
> -1: I do not think this is a good idea for the following reasons.
>
> Thanks,
> --Hossein

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SparkR was removed from CRAN on 2018-05-01

2018-05-29 Thread Shivaram Venkataraman
Apologies for not forwarding this to the dev list. AFAIK CRAN rules
require that we provide a user's email address as opposed to a list so
the emails come to me. I will forward all CRAN correspondences to the
dev list in the future.

Thanks
Shivaram

On Tue, May 29, 2018 at 12:02 PM, Michael Heuer  wrote:
> A friendly request to please be transparent about the changes being
> requested and how those are addressed.
>
> As a downstream library that would like to get into CRAN, it is hard when
> upstream comes and goes
>
> https://github.com/bigdatagenomics/adam/issues/1851
>
> On Tue, May 29, 2018 at 1:52 PM, Shivaram Venkataraman
>  wrote:
>>
>> Yes.  That is correct
>>
>> Shivaram
>>
>> On Tue, May 29, 2018 at 11:48 AM, Hossein  wrote:
>> > I guess this relates to our conversation on the SPIP. When this happens,
>> > do
>> > we wait for a new minor release to submit it to CRAN again?
>> >
>> > --Hossein
>> >
>> > On Fri, May 25, 2018 at 5:11 PM, Felix Cheung
>> > 
>> > wrote:
>> >>
>> >> This is the fix
>> >>
>> >>
>> >> https://github.com/apache/spark/commit/f27a035daf705766d3445e5c6a99867c11c552b0#diff-e1e1d3d40573127e9ee0480caf1283d6
>> >>
>> >> I don’t have the email though.
>> >>
>> >> 
>> >> From: Hossein 
>> >> Sent: Friday, May 25, 2018 10:58:42 AM
>> >> To: dev@spark.apache.org
>> >> Subject: SparkR was removed from CRAN on 2018-05-01
>> >>
>> >> Would you please forward the email from CRAN? Is there a JIRA?
>> >>
>> >> Thanks,
>> >> --Hossein
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SparkR was removed from CRAN on 2018-05-01

2018-05-29 Thread Shivaram Venkataraman
Yes.  That is correct

Shivaram

On Tue, May 29, 2018 at 11:48 AM, Hossein  wrote:
> I guess this relates to our conversation on the SPIP. When this happens, do
> we wait for a new minor release to submit it to CRAN again?
>
> --Hossein
>
> On Fri, May 25, 2018 at 5:11 PM, Felix Cheung 
> wrote:
>>
>> This is the fix
>>
>> https://github.com/apache/spark/commit/f27a035daf705766d3445e5c6a99867c11c552b0#diff-e1e1d3d40573127e9ee0480caf1283d6
>>
>> I don’t have the email though.
>>
>> 
>> From: Hossein 
>> Sent: Friday, May 25, 2018 10:58:42 AM
>> To: dev@spark.apache.org
>> Subject: SparkR was removed from CRAN on 2018-05-01
>>
>> Would you please forward the email from CRAN? Is there a JIRA?
>>
>> Thanks,
>> --Hossein
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Time for 2.3.1?

2018-05-13 Thread Shivaram Venkataraman
+1 We had a SparkR fix for CRAN SystemRequirements that will also be good
to get out.

Shivaram

On Fri, May 11, 2018 at 12:34 PM, Henry Robinson  wrote:

> https://github.com/apache/spark/pull/21302
>
> On 11 May 2018 at 11:47, Henry Robinson  wrote:
>
>> I was planning to do so shortly.
>>
>> Henry
>>
>> On 11 May 2018 at 11:45, Ryan Blue  wrote:
>>
>>> The Parquet Java 1.8.3 release is out. Has anyone started a PR to
>>> update, or should I?
>>>
>>> On Fri, May 11, 2018 at 7:40 AM, Cody Koeninger 
>>> wrote:
>>>
 Sounds good, I'd like to add SPARK-24067 today assuming there's no
 objections

 On Thu, May 10, 2018 at 1:22 PM, Henry Robinson 
 wrote:
 > +1, I'd like to get a release out with SPARK-23852 fixed. The Parquet
 > community are about to release 1.8.3 - the voting period closes
 tomorrow -
 > and I've tested it with Spark 2.3 and confirmed the bug is fixed.
 Hopefully
 > it is released and I can post the version change to branch-2.3 before
 you
 > start to roll the RC this weekend.
 >
 > Henry
 >
 > On 10 May 2018 at 11:09, Marcelo Vanzin  wrote:
 >>
 >> Hello all,
 >>
 >> It's been a while since we shipped 2.3.0 and lots of important bug
 >> fixes have gone into the branch since then. I took a look at Jira and
 >> it seems there's not a lot of things explicitly targeted at 2.3.1 -
 >> the only potential blocker (a parquet issue) is being worked on since
 >> a new parquet with the fix was just released.
 >>
 >> So I'd like to propose to release 2.3.1 soon. If there are important
 >> fixes that should go into the release, please let those be known (by
 >> replying here or updating the bug in Jira), otherwise I'm
 volunteering
 >> to prepare the first RC soon-ish (around the weekend).
 >>
 >> Thanks!
 >>
 >>
 >> --
 >> Marcelo
 >>
 >> 
 -
 >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >>
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>


Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Shivaram Venkataraman
>
>
>
>>- Fault tolerance and execution model: Spark assumes fine-grained
>>task recovery, i.e. if something fails, only that task is rerun. This
>>doesn’t match the execution model of distributed ML/DL frameworks that are
>>typically MPI-based, and rerunning a single task would lead to the entire
>>system hanging. A whole stage needs to be re-run.
>>
>> This is not only useful for integrating with 3rd-party frameworks, but
> also useful for scaling MLlib algorithms. One of my earliest attempts in
> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
> ). But we ended up with
> some compromised solutions. With the new execution model, we can set up a
> hybrid cluster and do all-reduce properly.
>
>
Is there a particular new execution model you are referring to, or do we
plan to investigate a new execution model? For the MPI-like model, we
also need gang scheduling (i.e. schedule all tasks at once or none of them),
and I don't think we have support for that in the scheduler right now.
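To make the gang-scheduling requirement concrete, here is a tiny Scala
sketch of the all-or-nothing launch decision (hypothetical types, not the
actual Spark scheduler API):

object GangSchedulingSketch {
  final case class Task(id: Int)

  // All-or-nothing launch decision: either every task of the stage gets a
  // slot in this scheduling round, or none of them are launched.
  def tryLaunchGang(tasks: Seq[Task], freeSlots: Int): Option[Seq[Task]] =
    if (tasks.size <= freeSlots) Some(tasks) // launch the whole gang together
    else None                                // launch nothing; retry when slots free up
}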

>
>> --
>
> Xiangrui Meng
>
> Software Engineer
>
> Databricks Inc. [image: http://databricks.com] 
>


Re: [Spark][Scheduler] Spark DAGScheduler scheduling performance hindered on JobSubmitted Event

2018-03-06 Thread Shivaram Venkataraman
The problem with doing work in the call-site thread is that there are a
number of data structures that are updated during job submission, and
these data structures are guarded by the event loop, which ensures only one
thread accesses them. I don't think there is an easy fix for this
given the structure of the DAGScheduler.
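
Concretely, the pattern being proposed looks roughly like the following
minimal Scala sketch (hypothetical names, not the actual DAGScheduler code).
The catch is that in the real DAGScheduler the expensive preparation step
itself reads and updates the data structures that are guarded by the event
loop:

import java.util.concurrent.{Executors, LinkedBlockingQueue}

object EventLoopSketch {
  sealed trait Event
  final case class JobPrepared(jobId: Int, stagePlan: String) extends Event

  private val queue = new LinkedBlockingQueue[Event]()

  // The single consumer thread: the only thread that updates scheduler state.
  Executors.newSingleThreadExecutor().submit(new Runnable {
    override def run(): Unit = while (true) {
      queue.take() match {
        case JobPrepared(id, plan) =>
          // Cheap bookkeeping only; the heavy work already happened on the caller.
          println(s"registering job $id with plan: $plan")
      }
    }
  })

  // Called on the job-submitting (call-site) thread.
  def submitJob(jobId: Int, rddDescription: String): Unit = {
    val plan = expensivePlanning(rddDescription) // heavy work kept off the event loop
    queue.put(JobPrepared(jobId, plan))          // cheap hand-off to the loop thread
  }

  // Stand-in for createResultStage-style work (stage and dependency computation).
  private def expensivePlanning(rdd: String): String =
    s"stages-and-dependencies-for-$rdd"
}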

Thanks
Shivaram

On Tue, Mar 6, 2018 at 8:53 AM, Ryan Blue  wrote:
> I agree with Reynold. We don't need to use a separate pool, which would have
> the problem you raised about FIFO. We just need to do the planning outside
> of the scheduler loop. The call site thread sounds like a reasonable place
> to me.
>
> On Mon, Mar 5, 2018 at 12:56 PM, Reynold Xin  wrote:
>>
>> Rather than using a separate thread pool, perhaps we can just move the
>> prep code to the call site thread?
>>
>>
>> On Sun, Mar 4, 2018 at 11:15 PM, Ajith shetty 
>> wrote:
>>>
>>> DAGScheduler becomes a bottleneck in the cluster when multiple JobSubmitted
>>> events have to be processed, as DAGSchedulerEventProcessLoop is single
>>> threaded and will block other events in the queue, such as TaskCompletion.
>>>
>>> Processing a JobSubmitted event is time consuming depending on the nature of
>>> the job (for example, calculating parent stage dependencies, shuffle
>>> dependencies, and partitions), and thus it blocks all the other events waiting
>>> to be processed.
>>>
>>>
>>>
>>> I see multiple JIRA referring to this behavior
>>>
>>> https://issues.apache.org/jira/browse/SPARK-2647
>>>
>>> https://issues.apache.org/jira/browse/SPARK-4961
>>>
>>>
>>>
>>> Similarly, in my cluster the partition calculation for some jobs is time
>>> consuming (similar to the stack trace in SPARK-2647), which slows down the
>>> DAGSchedulerEventProcessLoop and causes user jobs to slow down, even if
>>> their tasks finish within seconds, as TaskCompletion events are
>>> processed at a slower rate due to the blockage.
>>>
>>>
>>>
>>> I think we can split a JobSubmitted event into 2 events:
>>>
>>> Step 1. JobSubmittedPreparation - runs in a separate thread on job
>>> submission; this will involve the steps in
>>> org.apache.spark.scheduler.DAGScheduler#createResultStage
>>>
>>> Step 2. JobSubmittedExecution - if Step 1 succeeds, fire an event to
>>> DAGSchedulerEventProcessLoop and let it process the output of
>>> org.apache.spark.scheduler.DAGScheduler#createResultStage
>>>
>>>
>>>
>>> One effect of doing this may be that job submissions are no longer
>>> FIFO, depending on how much time Step 1 above consumes.
>>>
>>>
>>>
>>> Does the above solution suffice for the problem described? And are there any
>>> other side effects of this solution?
>>>
>>>
>>>
>>> Regards
>>>
>>> Ajith
>>
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Shivaram Venkataraman
For (1) I think it has something to do with
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/
not automatically going to
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/index.html
-- So if you see the link to approx_percentile the link we generate is
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/#approx_percentile
-- This doesn't work as Felix said but
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/index.html#approx_percentile
works

I'm not sure how this will behave on the main site. FWIW
http://spark.apache.org/docs/latest/api/python/ does redirect to
http://spark.apache.org/docs/latest/api/python/index.html

Thanks
Shivaram

On Mon, Feb 19, 2018 at 6:31 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Ah sorry, I realize my wording was unclear (not enough zzz or coffee)
>
> So to clarify,
> 1) When searching for a word in the SQL function doc, the search result
> page is returned correctly; however, none of the links in the results open
> the actual doc page. To take the search I included as an example, if you
> click on approx_percentile, for instance, it opens the web directory
> instead.
>
> 2) The second is that the dist location we are voting on has a .iml file,
> which is normally not included in a release or release RC, and it is unsigned
> and without a hash (therefore it seems like it should not be in the release).
>
> Thanks!
>
> _
> From: Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
> Sent: Tuesday, February 20, 2018 2:24 AM
> Subject: Re: [VOTE] Spark 2.3.0 (RC4)
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: Sean Owen <sro...@gmail.com>, dev <dev@spark.apache.org>
>
>
>
> FWIW The search result link works for me
>
> Shivaram
>
> On Mon, Feb 19, 2018 at 6:21 PM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> These are two separate things:
>>
>> Does the search result links work for you?
>>
>> The second is the dist location we are voting on has a .iml file.
>>
>> _
>> From: Sean Owen <sro...@gmail.com>
>> Sent: Tuesday, February 20, 2018 2:19 AM
>> Subject: Re: [VOTE] Spark 2.3.0 (RC4)
>> To: Felix Cheung <felixcheun...@hotmail.com>
>> Cc: dev <dev@spark.apache.org>
>>
>>
>>
>> Maybe I misunderstand, but I don't see any .iml file in the 4 results on
>> that page? it looks reasonable.
>>
>> On Mon, Feb 19, 2018 at 8:02 PM Felix Cheung <felixcheun...@hotmail.com>
>> wrote:
>>
>>> Any idea with sql func docs search result returning broken links as
>>> below?
>>>
>>> *From:* Felix Cheung <felixcheun...@hotmail.com>
>>> *Sent:* Sunday, February 18, 2018 10:05:22 AM
>>> *To:* Sameer Agarwal; Sameer Agarwal
>>>
>>> *Cc:* dev
>>> *Subject:* Re: [VOTE] Spark 2.3.0 (RC4)
>>> Quick questions:
>>>
>>> is there search link for sql functions quite right?
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs
>>> /_site/api/sql/search.html?q=app
>>>
>>> this file shouldn't be included? https://dist.apache.org/repos/
>>> dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml
>>>
>>>
>>
>>
>
>
>


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Shivaram Venkataraman
FWIW The search result link works for me

Shivaram

On Mon, Feb 19, 2018 at 6:21 PM, Felix Cheung 
wrote:

> These are two separate things:
>
> Does the search result links work for you?
>
> The second is the dist location we are voting on has a .iml file.
>
> _
> From: Sean Owen 
> Sent: Tuesday, February 20, 2018 2:19 AM
> Subject: Re: [VOTE] Spark 2.3.0 (RC4)
> To: Felix Cheung 
> Cc: dev 
>
>
>
> Maybe I misunderstand, but I don't see any .iml file in the 4 results on
> that page? it looks reasonable.
>
> On Mon, Feb 19, 2018 at 8:02 PM Felix Cheung 
> wrote:
>
>> Any idea with sql func docs search result returning broken links as below?
>>
>> *From:* Felix Cheung 
>> *Sent:* Sunday, February 18, 2018 10:05:22 AM
>> *To:* Sameer Agarwal; Sameer Agarwal
>>
>> *Cc:* dev
>> *Subject:* Re: [VOTE] Spark 2.3.0 (RC4)
>> Quick questions:
>>
>> is there search link for sql functions quite right?
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-
>> docs/_site/api/sql/search.html?q=app
>>
>> this file shouldn't be included? https://dist.apache.org/repos/
>> dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml
>>
>>
>
>


Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-13 Thread Shivaram Venkataraman
The R artifacts have some issue that Felix and I are debugging. Lets not
block the announcement for that.

Thanks
Shivaram

On Wed, Dec 13, 2017 at 5:59 AM, Sean Owen <so...@cloudera.com> wrote:

> Looks like Maven artifacts are up, site's up -- what about the Python and
> R artifacts?
> I can also move the spark.apache/docs/latest link to point to 2.2.1 if
> it's pretty ready.
> We should announce the release officially too then.
>
> On Wed, Dec 6, 2017 at 5:00 PM Felix Cheung <felixche...@apache.org>
> wrote:
>
>> I saw the svn move on Monday so I’m working on the website updates.
>>
>> I will look into maven today. I will ask if I couldn’t do it.
>>
>>
>> On Wed, Dec 6, 2017 at 10:49 AM Sean Owen <so...@cloudera.com> wrote:
>>
>>> Pardon, did this release finish? I don't see it in Maven. I know there
>>> was some question about getting a hand in finishing the release process,
>>> including copying artifacts in svn. Was there anything else you're waiting
>>> on someone to do?
>>>
>>>
>>> On Fri, Dec 1, 2017 at 2:10 AM Felix Cheung <felixche...@apache.org>
>>> wrote:
>>>
>>>> This vote passes. Thanks everyone for testing this release.
>>>>
>>>>
>>>> +1:
>>>>
>>>> Sean Owen (binding)
>>>>
>>>> Herman van Hövell tot Westerflier (binding)
>>>>
>>>> Wenchen Fan (binding)
>>>>
>>>> Shivaram Venkataraman (binding)
>>>>
>>>> Felix Cheung
>>>>
>>>> Henry Robinson
>>>>
>>>> Hyukjin Kwon
>>>>
>>>> Dongjoon Hyun
>>>>
>>>> Kazuaki Ishizaki
>>>>
>>>> Holden Karau
>>>>
>>>> Weichen Xu
>>>>
>>>>
>>>> 0: None
>>>>
>>>> -1: None
>>>>
>>>


Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-29 Thread Shivaram Venkataraman
+1

SHA, MD5 and signatures look fine. Built and ran Maven tests on my Macbook.

Thanks
Shivaram

On Wed, Nov 29, 2017 at 10:43 AM, Holden Karau  wrote:

> +1 (non-binding)
>
> PySpark install into a virtualenv works, PKG-INFO looks correctly
> populated (mostly checking for the pypandoc conversion there).
>
> Thanks for your hard work Felix (and all of the testers :)) :)
>
> On Wed, Nov 29, 2017 at 9:33 AM, Wenchen Fan  wrote:
>
>> +1
>>
>> On Thu, Nov 30, 2017 at 1:28 AM, Kazuaki Ishizaki 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests
>>> for core/sql-core/sql-catalyst/mllib/mllib-local have passed.
>>>
>>> $ java -version
>>> openjdk version "1.8.0_131"
>>> OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.1
>>> 6.04.3-b11)
>>> OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
>>>
>>> % build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7
>>> -T 24 clean package install
>>> % build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
>>> -pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local
>>> ...
>>> Run completed in 13 minutes, 54 seconds.
>>> Total number of tests run: 1118
>>> Suites: completed 170, aborted 0
>>> Tests: succeeded 1118, failed 0, canceled 0, ignored 6, pending 0
>>> All tests passed.
>>> [INFO] 
>>> 
>>> [INFO] Reactor Summary:
>>> [INFO]
>>> [INFO] Spark Project Core . SUCCESS
>>> [17:13 min]
>>> [INFO] Spark Project ML Local Library . SUCCESS [
>>>  6.065 s]
>>> [INFO] Spark Project Catalyst . SUCCESS
>>> [11:51 min]
>>> [INFO] Spark Project SQL .. SUCCESS
>>> [17:55 min]
>>> [INFO] Spark Project ML Library ... SUCCESS
>>> [17:05 min]
>>> [INFO] 
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO] 
>>> 
>>> [INFO] Total time: 01:04 h
>>> [INFO] Finished at: 2017-11-30T01:48:15+09:00
>>> [INFO] Final Memory: 128M/329M
>>> [INFO] 
>>> 
>>> [WARNING] The requested profile "hive" could not be activated because it
>>> does not exist.
>>>
>>> Kazuaki Ishizaki
>>>
>>>
>>>
>>> From:Dongjoon Hyun 
>>> To:Hyukjin Kwon 
>>> Cc:Spark dev list , Felix Cheung <
>>> felixche...@apache.org>, Sean Owen 
>>> Date:2017/11/29 12:56
>>> Subject:Re: [VOTE] Spark 2.2.1 (RC2)
>>> --
>>>
>>>
>>>
>>> +1 (non-binding)
>>>
>>> RC2 is tested on CentOS, too.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Tue, Nov 28, 2017 at 4:35 PM, Hyukjin Kwon <*gurwls...@gmail.com*
>>> > wrote:
>>> +1
>>>
>>> 2017-11-29 8:18 GMT+09:00 Henry Robinson <*he...@apache.org*
>>> >:
>>> (My vote is non-binding, of course).
>>>
>>> On 28 November 2017 at 14:53, Henry Robinson <*he...@apache.org*
>>> > wrote:
>>> +1, tests all pass for me on Ubuntu 16.04.
>>>
>>> On 28 November 2017 at 10:36, Herman van Hövell tot Westerflier <
>>> *hvanhov...@databricks.com* > wrote:
>>> +1
>>>
>>> On Tue, Nov 28, 2017 at 7:35 PM, Felix Cheung <*felixche...@apache.org*
>>> > wrote:
>>> +1
>>>
>>> Thanks Sean. Please vote!
>>>
>>> Tested various scenarios with R package. Ubuntu, Debian, Windows r-devel
>>> and release and on r-hub. Verified CRAN checks are clean (only 1 NOTE!) and
>>> no leaked files (.cache removed, /tmp clean)
>>>
>>>
>>> On Sun, Nov 26, 2017 at 11:55 AM Sean Owen <*so...@cloudera.com*
>>> > wrote:
>>> Yes it downloads recent releases. The test worked for me on a second
>>> try, so I suspect a bad mirror. If this comes up frequently we can just add
>>> retry logic, as the closer.lua script will return different mirrors each
>>> time.
>>>
>>> The tests all pass for me on the latest Debian, so +1 for this release.
>>>
>>> (I committed the change to set -Xss4m for tests consistently, but this
>>> shouldn't block a release.)
>>>
>>>
>>> On Sat, Nov 25, 2017 at 12:47 PM Felix Cheung <*felixche...@apache.org*
>>> > wrote:
>>> Ah sorry digging through the history it looks like this is changed
>>> relatively recently and should only download previous releases.
>>>
>>> Perhaps we are intermittently hitting a mirror that doesn’t have the
>>> files?
>>>
>>>
>>> *https://github.com/apache/spark/commit/daa838b8886496e64700b55d1301d348f1d5c9ae*
>>> 

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-13 Thread Shivaram Venkataraman
Mark, I agree with your point on the risks of using Cloudfront while
building Spark. I was only trying to provide background on when we
started using Cloudfront.

Personally, I don't have enough context about the test case in
question (e.g. why are we downloading Spark in a test case?).

Thanks
Shivaram

On Wed, Sep 13, 2017 at 11:50 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
> Yeah, but that discussion and use case is a bit different -- providing a
> different route to download the final released and approved artifacts that
> were built using only acceptable artifacts and sources vs. building and
> checking prior to release using something that is not from an Apache mirror.
> This new use case puts us in the position of approving spark artifacts that
> weren't built entirely from canonical resources located in presumably secure
> and monitored repositories. Incorporating something that is not completely
> trusted or approved into the process of building something that we are then
> going to approve as trusted is different from the prior use of cloudfront.
>
> On Wed, Sep 13, 2017 at 10:26 AM, Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>>
>> The bucket comes from Cloudfront, a CDN thats part of AWS. There was a
>> bunch of discussion about this back in 2013
>>
>> https://lists.apache.org/thread.html/9a72ff7ce913dd85a6b112b1b2de536dcda74b28b050f70646aba0ac@1380147885@%3Cdev.spark.apache.org%3E
>>
>> Shivaram
>>
>> On Wed, Sep 13, 2017 at 9:30 AM, Sean Owen <so...@cloudera.com> wrote:
>> > Not a big deal, but Mark noticed that this test now downloads Spark
>> > artifacts from the same 'direct download' link available on the
>> > downloads
>> > page:
>> >
>> >
>> > https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L53
>> >
>> > https://d3kbcqa49mib13.cloudfront.net/spark-$version-bin-hadoop2.7.tgz
>> >
>> > I don't know of any particular problem with this, which is a parallel
>> > download option in addition to the Apache mirrors. It's also the
>> > default.
>> >
>> > Does anyone know what this bucket is and if there's a strong reason we
>> > can't
>> > just use mirrors?
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-13 Thread Shivaram Venkataraman
The bucket comes from Cloudfront, a CDN thats part of AWS. There was a
bunch of discussion about this back in 2013
https://lists.apache.org/thread.html/9a72ff7ce913dd85a6b112b1b2de536dcda74b28b050f70646aba0ac@1380147885@%3Cdev.spark.apache.org%3E

Shivaram

On Wed, Sep 13, 2017 at 9:30 AM, Sean Owen  wrote:
> Not a big deal, but Mark noticed that this test now downloads Spark
> artifacts from the same 'direct download' link available on the downloads
> page:
>
> https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L53
>
> https://d3kbcqa49mib13.cloudfront.net/spark-$version-bin-hadoop2.7.tgz
>
> I don't know of any particular problem with this, which is a parallel
> download option in addition to the Apache mirrors. It's also the default.
>
> Does anyone know what this bucket is and if there's a strong reason we can't
> just use mirrors?

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Submitting SparkR to CRAN

2017-05-09 Thread Shivaram Venkataraman
Closely related to the PyPi upload thread (https://s.apache.org/WLtM), I
just wanted to give a heads up that we are working on submitting SparkR
from Spark 2.1.1 as a package to CRAN. The package submission is under
review with CRAN right now and I will post updates to this thread.

The main ticket tracking this effort is SPARK-15799, and I'll also create a new
PR on the website describing how to update the package with a new release.

Many thanks to everybody who helped with this effort !

Thanks
Shivaram


Re: Build completed: spark 866-master

2017-03-04 Thread Shivaram Venkataraman
Thanks for investigating. We should file an INFRA jira about this.

Shivaram

On Mar 4, 2017 16:20, "Reynold Xin" <r...@databricks.com> wrote:

> Most of the previous notifications were caught as spam. We should really
> disable this.
>
>
> On Sat, Mar 4, 2017 at 4:17 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Oh BTW, I was asked about this by Reynold. Few month ago and I said the
>> similar answer.
>>
>> I think I am not supposed to receive the emails (not sure, but I
>> have not received them), so I am not too sure if this has been happening all
>> along or only occasionally.
>>
>>
>>
>> On 5 Mar 2017 9:08 a.m., "Hyukjin Kwon" <gurwls...@gmail.com> wrote:
>>
>> I think we should ask to disable this within Web UI configuration. In
>> this JIRA, https://issues.apache.org/jira/browse/INFRA-12590, Daniel said
>>
>> > ... configured to send build results to dev@spark.apache.org.
>>
>> In the case of my accounts, I manually went to https://ci.appveyor.com/
>> notifications and configured them all as  "Do not send" and it does not
>> send me any email.
>>
>> However, in the case of the ASF account, this is only an assumption, because I
>> don't know how it is configured as I can't access it.
>>
>> This might be defined in account - https://ci.appveyor.com/notifications
>> or in project - https://ci.appveyor.com/project/ApacheSoftwareFoundation/
>> spark/settings
>>
>> I'd like to note that I disabled the notification in the appveyor.yml but
>> it seems the configurations are merged in Web UI,
>> according to the documentation (https://www.appveyor.com/
>> docs/notifications/#global-email-notifications).
>>
>> > Warning: Notifications defined on project settings UI are merged with
>> notifications defined in appveyor.yml.
>>
>> Should we maybe file an INFRA JIRA to check and ask about this?
>>
>>
>>
>> 2017-03-05 8:31 GMT+09:00 Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu>:
>>
>> I'm not sure why the AppVeyor updates are coming to the dev list.
>> Hyukjin -- Do you know if we made any recent changes that might have caused
>> this ?
>>
>> Thanks
>> Shivaram
>>
>> -- Forwarded message --
>> From: *AppVeyor* <no-re...@appveyor.com>
>> Date: Sat, Mar 4, 2017 at 2:46 PM
>> Subject: Build completed: spark 866-master
>> To: dev@spark.apache.org
>>
>>
>> Build spark 866-master completed
>> <https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/866-master>
>>
>> Commit ccf54f64d9 <https://github.com/apache/spark/commit/ccf54f64d9> by Xiao
>> Li <gatorsm...@gmail.com> on 3/4/2017 9:50 PM:
>> fix.
>>
>> Configure your notification preferences
>> <https://ci.appveyor.com/notifications>
>> - To
>> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>


Fwd: Build completed: spark 866-master

2017-03-04 Thread Shivaram Venkataraman
I'm not sure why the AppVeyor updates are coming to the dev list.  Hyukjin
-- Do you know if we made any recent changes that might have caused this ?

Thanks
Shivaram

-- Forwarded message --
From: AppVeyor 
Date: Sat, Mar 4, 2017 at 2:46 PM
Subject: Build completed: spark 866-master
To: dev@spark.apache.org


Build spark 866-master completed


Commit ccf54f64d9  by Xiao
Li  on 3/4/2017 9:50 PM:
fix.

Configure your notification preferences

- To
unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: Can anyone edit JIRAs SPARK-19191 to SPARK-19202?

2017-01-13 Thread Shivaram Venkataraman
FWIW there is an option to Delete the issue (in More -> Delete).

Shivaram

On Fri, Jan 13, 2017 at 8:11 AM, Shivaram Venkataraman
<shiva...@eecs.berkeley.edu> wrote:
> I can't see the resolve button either - Maybe we can forward this to
> Apache Infra and see if they can close these issues ?
>
> Shivaram
>
> On Fri, Jan 13, 2017 at 6:35 AM, Sean Owen <so...@cloudera.com> wrote:
>> Yes, I'm asking about a specific range: 19191 - 19202. These seem to be the
>> ones created during the downtime. Most are duplicates or incomplete.
>>
>> On Fri, Jan 13, 2017 at 2:32 PM Artur Sukhenko <artur.sukhe...@gmail.com>
>> wrote:
>>>
>>> Yes, I can resolve/close SPARK-19214 for example.
>>>
>>> On Fri, Jan 13, 2017 at 4:29 PM Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>> Do you see a button to resolve other issues? you may not be able to
>>>> resolve any of them. I am a JIRA admin though, like most other devs, so
>>>> should be able to resolve anything.
>>>>
>>>> Yes, I certainly know how resolving issues works but it's suddenly today
>>>> only working for a subset of issues, and I bet it's related to the JIRA
>>>> problems this week.
>>>>
>>>> On Fri, Jan 13, 2017 at 1:55 PM Artur Sukhenko <artur.sukhe...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hello Sean,
>>>>>
>>>>> I can't resolve SPARK-19191 to SPARK-19202 too. I believe this is a bug.
>>>>> Here is JIRA Documentation which states this or similar problems - How
>>>>> to Edit the Resolution of an Issue
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jan 13, 2017 at 3:28 PM Sean Owen <so...@cloudera.com> wrote:
>>>>>>
>>>>>> Looks like the JIRA maintenance left a bunch of duplicate JIRAs, from
>>>>>> SPARK-19191 to SPARK-19202. For some reason, I can't resolve these 
>>>>>> issues,
>>>>>> but I can resolve others. Does anyone else see the same?
>>>>>>
>>>>>> I know SPARK-19190 was similarly borked but closed by its owner.
>>>>>
>>>>> --
>>>>> --
>>>>> Artur Sukhenko
>>>
>>> --
>>> --
>>> Artur Sukhenko

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Can anyone edit JIRAs SPARK-19191 to SPARK-19202?

2017-01-13 Thread Shivaram Venkataraman
I can't see the resolve button either - Maybe we can forward this to
Apache Infra and see if they can close these issues ?

Shivaram

On Fri, Jan 13, 2017 at 6:35 AM, Sean Owen  wrote:
> Yes, I'm asking about a specific range: 19191 - 19202. These seem to be the
> ones created during the downtime. Most are duplicates or incomplete.
>
> On Fri, Jan 13, 2017 at 2:32 PM Artur Sukhenko 
> wrote:
>>
>> Yes, I can resolve/close SPARK-19214 for example.
>>
>> On Fri, Jan 13, 2017 at 4:29 PM Sean Owen  wrote:
>>>
>>> Do you see a button to resolve other issues? you may not be able to
>>> resolve any of them. I am a JIRA admin though, like most other devs, so
>>> should be able to resolve anything.
>>>
>>> Yes, I certainly know how resolving issues works but it's suddenly today
>>> only working for a subset of issues, and I bet it's related to the JIRA
>>> problems this week.
>>>
>>> On Fri, Jan 13, 2017 at 1:55 PM Artur Sukhenko 
>>> wrote:

 Hello Sean,

 I can't resolve SPARK-19191 to SPARK-19202 too. I believe this is a bug.
 Here is JIRA Documentation which states this or similar problems - How
 to Edit the Resolution of an Issue



 On Fri, Jan 13, 2017 at 3:28 PM Sean Owen  wrote:
>
> Looks like the JIRA maintenance left a bunch of duplicate JIRAs, from
> SPARK-19191 to SPARK-19202. For some reason, I can't resolve these issues,
> but I can resolve others. Does anyone else see the same?
>
> I know SPARK-19190 was similarly borked but closed by its owner.

 --
 --
 Artur Sukhenko
>>
>> --
>> --
>> Artur Sukhenko

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-15 Thread Shivaram Venkataraman
In addition to usual binary artifacts, this is the first release where
we have installable packages for Python [1] and R [2] that are part of
the release.  I'm including instructions to test the R package below.
Holden / other Python developers can chime in if there are special
instructions to test the pip package.

To test the R source package you can follow the following commands.
1. Download the SparkR source package from
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/SparkR_2.1.0.tar.gz
2. Install the source package with R CMD INSTALL SparkR_2.1.0.tar.gz
3. As the SparkR package doesn't contain Spark JARs (this is due to
package size limits from CRAN), we'll need to run [3]
export SPARKR_RELEASE_DOWNLOAD_URL="http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/spark-2.1.0-bin-hadoop2.6.tgz"
4. Launch R. You can now load SparkR with `library(SparkR)` and
test it with your applications.
5. Note that the first time a SparkSession is created, the binary
artifacts will be downloaded.

Thanks
Shivaram

[1] https://issues.apache.org/jira/browse/SPARK-18267
[2] https://issues.apache.org/jira/browse/SPARK-18590
[3] Note that this isn't required once 2.1.0 has been released as
SparkR can automatically resolve and download releases.

On Thu, Dec 15, 2016 at 9:16 PM, Reynold Xin  wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc5
> (cd0a08361e2526519e7c131c42116bf56fa62c76)
>
> List of JIRA tickets resolved are:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1223/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/
>
>
> FAQ
>
> How can I help test this release?
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> What should happen to JIRA tickets still targeting 2.1.0?
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>
> What happened to RC3/RC5?
>
> They had issues with the release packaging and as a result were skipped.
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-08 Thread Shivaram Venkataraman
+0

I am not sure how much of a problem this is but the pip packaging
seems to have changed the size of the hadoop-2.7 artifact. As you can
see in http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/,
the Hadoop 2.7 build is 359M, almost double the size of the other
Hadoop versions.

This comes from the fact that we build our pip package using the
Hadoop 2.7 profile [1] and the pip package is contained inside this
tarball. The fix for this is to exclude the pip package from the
distribution in [2]

Thanks
Shivaram

[1] 
https://github.com/apache/spark/blob/202fcd21ce01393fa6dfaa1c2126e18e9b85ee96/dev/create-release/release-build.sh#L242
[2] 
https://github.com/apache/spark/blob/202fcd21ce01393fa6dfaa1c2126e18e9b85ee96/dev/make-distribution.sh#L240

On Thu, Dec 8, 2016 at 12:39 AM, Reynold Xin  wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Sun, December 11, 2016 at 1:00 PT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc2
> (080717497365b83bc202ab16812ced93eb1ea7bd)
>
> List of JIRA tickets resolved are:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1217
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/
>
>
> (Note that the docs and staging repo are still being uploaded and will be
> available soon)
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ===
> What should happen to JIRA tickets still targeting 2.1.0?
> ===
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Shivaram Venkataraman
FWIW 2.0.1 is also used in the 'Link With Spark' and 'Spark Source
Code Management' sections in that page.

Shivaram

On Mon, Nov 14, 2016 at 11:11 PM, Reynold Xin  wrote:
> It's there on the page (both the release notes and the download version
> dropdown).
>
> The one-line text is outdated. I'm just going to delete that text as a
> matter of fact so we don't run into this issue in the future.
>
>
> On Mon, Nov 14, 2016 at 11:09 PM, assaf.mendelson 
> wrote:
>>
>> While you can download spark 2.0.2, the description is still spark 2.0.1:
>>
>> Our latest stable version is Apache Spark 2.0.1, released on Oct 3, 2016
>> (release notes) (git tag)
>>
>>
>>
>>
>>
>> From: rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden
>> email]]
>> Sent: Tuesday, November 15, 2016 7:15 AM
>> To: Mendelson, Assaf
>> Subject: [ANNOUNCE] Apache Spark 2.0.2
>>
>>
>>
>> We are happy to announce the availability of Spark 2.0.2!
>>
>>
>>
>> Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along
>> with Kafka 0.10 support and runtime metrics for Structured Streaming. This
>> release is based on the branch-2.0 maintenance branch of Spark. We strongly
>> recommend all 2.0.x users to upgrade to this stable release.
>>
>>
>>
>> To download Apache Spark 2.0.2, visit
>> http://spark.apache.org/downloads.html
>>
>>
>>
>> We would like to acknowledge all community members for contributing
>> patches to this release.
>>
>>
>>
>>
>>
>>
>>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Shivaram Venkataraman
The release is available on http://www.apache.org/dist/spark/ and it's
on Maven Central:
http://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.0.2/

I guess Reynold hasn't yet put together the release notes / updates to
the website.

Thanks
Shivaram

On Mon, Nov 14, 2016 at 12:49 PM, Nicholas Chammas
 wrote:
> Has the release already been made? I didn't see any announcement, but
> Homebrew has already updated to 2.0.2.
> On Fri, Nov 11, 2016 at 2:59 PM, Reynold Xin wrote:
>>
>> The vote has passed with the following +1s and no -1. I will work on
>> packaging the release.
>>
>> +1:
>>
>> Reynold Xin*
>> Herman van Hövell tot Westerflier
>> Ricardo Almeida
>> Shixiong (Ryan) Zhu
>> Sean Owen*
>> Michael Armbrust*
>> Dongjoon Hyun
>> Jagadeesan As
>> Liwei Lin
>> Weiqing Yang
>> Vaquar Khan
>> Denny Lee
>> Yin Huai*
>> Ryan Blue
>> Pratik Sharma
>> Kousuke Saruta
>> Tathagata Das*
>> Mingjie Tang
>> Adam Roberts
>>
>> * = binding
>>
>>
>> On Mon, Nov 7, 2016 at 10:09 PM, Reynold Xin  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if a
>>> majority of at least 3+1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.2-rc3
>>> (584354eaac02531c9584188b143367ba694b0c34)
>>>
>>> This release candidate resolves 84 issues:
>>> https://s.apache.org/spark-2.0.2-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>>
>>>
>>> Q: How can I help test this release?
>>> A: If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions from 2.0.1.
>>>
>>> Q: What justifies a -1 vote for this release?
>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>> present in 2.0.1, missing features, or bugs related to new features will not
>>> necessarily block this release.
>>>
>>> Q: What fix version should I use for patches merging into branch-2.0 from
>>> now on?
>>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Shivaram Venkataraman
Do we have any query workloads for which we can benchmark these
proposals in terms of performance ?

Thanks
Shivaram

On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin  wrote:
> One additional note: in terms of size, the size of a count-min sketch with
> eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
>
> To look up what that means, see
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
>
>
>
>
>
> On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin  wrote:
>>
>> I want to bring this discussion to the dev list to gather broader
>> feedback, as there have been some discussions that happened over multiple
>> JIRA tickets (SPARK-16026, etc) and GitHub pull requests about what
>> statistics to collect and how to use them.
>>
>> There are some basic statistics on columns that are obvious to use and we
>> don't need to debate these: estimated size (in bytes), row count, min, max,
>> number of nulls, number of distinct values, average column length, max
>> column length.
>>
>> In addition, we want to be able to estimate selectivity for equality and
>> range predicates better, especially taking into account skewed values and
>> outliers.
>>
>> Before I dive into the different options, let me first explain count-min
>> sketch: Count-min sketch is a common sketch algorithm that tracks frequency
>> counts. It has the following nice properties:
>> - sublinear space
>> - can be generated in one-pass in a streaming fashion
>> - can be incrementally maintained (i.e. for appending new data)
>> - it's already implemented in Spark
>> - more accurate for frequent values, and less accurate for less-frequent
>> values, i.e. it tracks skewed values well.
>> - easy to compute inner product, i.e. trivial to compute the count-min
>> sketch of a join given two count-min sketches of the join tables
>>
>>
>> Proposal 1 is is to use a combination of count-min sketch and equi-height
>> histograms. In this case, count-min sketch will be used for selectivity
>> estimation on equality predicates, and histogram will be used on range
>> predicates.
>>
>> Proposal 2 is to just use count-min sketch on equality predicates, and
>> then simple selected_range / (max - min) will be used for range predicates.
>> This will be less accurate than using histogram, but simpler because we
>> don't need to collect histograms.
>>
>> Proposal 3 is a variant of proposal 2, and takes into account that skewed
>> values can impact selectivity heavily. In 3, we track the list of heavy
>> hitters (HH, most frequent items) along with count-min sketch on the column.
>> Then:
>> - use count-min sketch on equality predicates
>> - for range predicates, estimatedFreq =  sum(freq(HHInRange)) + range /
>> (max - min)
>>
>> Proposal 4 is to not use any sketch, and use histogram for high
>> cardinality columns, and exact (value, frequency) pairs for low cardinality
>> columns (e.g. num distinct value <= 255).
>>
>> Proposal 5 is a variant of proposal 4, and adapts it to track exact
>> (value, frequency) pairs for the most frequent values only, so we can still
>> have that for high cardinality columns. This is actually very similar to
>> count-min sketch, but might use less space, although requiring two passes to
>> compute the initial value, and more difficult to compute the inner product
>> for joins.
>>
>>
>>
>
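
For concreteness, here is a minimal Scala sketch of building and querying a
count-min sketch with the existing org.apache.spark.util.sketch.CountMinSketch
API (the eps / confidence values mirror the numbers quoted above; this is only
an illustration, not part of any of the proposals):

import org.apache.spark.util.sketch.CountMinSketch

object CountMinSketchSelectivitySketch {
  def main(args: Array[String]): Unit = {
    // eps = 0.1%, confidence = 0.87; the last argument is an arbitrary seed.
    val sketch = CountMinSketch.create(0.001, 0.87, 42)

    // One streaming pass over the column values. On a real table this could
    // be built per column with DataFrameStatFunctions.countMinSketch instead.
    val columnValues = Seq("a", "b", "a", "c", "a", "b", "a")
    columnValues.foreach(v => sketch.add(v))

    // Point estimate used for equality-predicate selectivity: estimateCount
    // never under-counts, and over-counts by at most eps * totalCount with
    // the configured confidence, so frequent (skewed) values are tracked well.
    val freqA = sketch.estimateCount("a")
    println(s"estimated selectivity of (col = 'a'): " +
      s"${freqA.toDouble / sketch.totalCount()}")
  }
}

Two count-min sketches built this way can also be merged with mergeInPlace,
which is what makes incremental maintenance on appended data straightforward.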

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: StructuredStreaming status

2016-10-19 Thread Shivaram Venkataraman
At the AMPLab we've been working on a research project that looks at
just the scheduling latencies and at techniques to get lower
scheduling latency. It moves away from the micro-batch model, but
reuses the fault tolerance etc. in Spark. However, we haven't yet
figured out all the parts of integrating this with the rest of
structured streaming. I'll try to post a design doc / SIP about this
soon.

On a related note - are there other problems users face with
micro-batch other than latency ?

Thanks
Shivaram

On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
 wrote:
> I know people are seriously thinking about latency.  So far that has not
> been the limiting factor in the users I've been working with.
>
> On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger  wrote:
>>
>> Is anyone seriously thinking about alternatives to microbatches?
>>
>> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>>  wrote:
>> > Anything that is actively being designed should be in JIRA, and it seems
>> > like you found most of it.  In general, release windows can be found on
>> > the
>> > wiki.
>> >
>> > 2.1 has a lot of stability fixes as well as the kafka support you
>> > mentioned.
>> > It may also include some of the following.
>> >
>> > The items I'd like to start thinking about next are:
>> >  - Evicting state from the store based on event time watermarks
>> >  - Sessionization (grouping together related events by key / eventTime)
>> >  - Improvements to the query planner (remove some of the restrictions on
>> > what queries can be run).
>> >
>> > This is roughly in order based on what I've been hearing users hit the
>> > most.
>> > Would love more feedback on what is blocking real use cases.
>> >
>> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor 
>> > wrote:
>> >>
>> >> Hi,
>> >> I hope it is the right forum.
>> >> I am looking for some information of what to expect from
>> >> StructuredStreaming in its next releases to help me choose when / where
>> >> to
>> >> start using it more seriously (or where to invest in workarounds and
>> >> where
>> >> to wait). I couldn't find a good place where such planning discussed
>> >> for 2.1
>> >> (like, for example ML and SPARK-15581).
>> >> I'm aware of the 2.0 documented limits
>> >>
>> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>> >> like no support for multiple aggregations levels, joins are strictly to
>> >> a
>> >> static dataset (no SCD or stream-stream) etc, limited sources / sinks
>> >> (like
>> >> no sink for interactive queries) etc etc
>> >> I'm also aware of some changes that have landed in master, like the new
>> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
>> >> metrics in SPARK-17731, and some improvements for the file source.
>> >> If I remember correctly, the discussion on Spark release cadence
>> >> concluded
>> >> with a preference to a four-month cycles, with likely code freeze
>> >> pretty
>> >> soon (end of October). So I believe the scope for 2.1 should likely
>> >> quite
>> >> clear to some, and that 2.2 planning should likely be starting about
>> >> now.
>> >> Any visibility / sharing will be highly appreciated!
>> >> thanks in advance,
>> >>
>> >> Ofir Manor
>> >>
>> >> Co-Founder & CTO | Equalum
>> >>
>> >> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>> >
>> >
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-18 Thread Shivaram Venkataraman
+1 - Given that our website is now on GitHub
(https://github.com/apache/spark-website), I think we can move most of
our wiki into the main website. That way we'll only have two sources
of documentation to maintain: a release-specific one in the main repo
and the website, which is longer lived.

Thanks
Shivaram

On Tue, Oct 18, 2016 at 9:59 AM, Matei Zaharia  wrote:
> Is there any way to tie wiki accounts with JIRA accounts? I found it weird
> that they're not tied at the ASF.
>
> Otherwise, moving this into the docs might make sense.
>
> Matei
>
> On Oct 18, 2016, at 6:19 AM, Cody Koeninger  wrote:
>
> +1 to putting docs in one clear place.
>
>
> On Oct 18, 2016 6:40 AM, "Sean Owen"  wrote:
>>
>> I'm OK with that. The upside to the wiki is that it can be edited directly
>> outside of a release cycle. However, in practice I find that the wiki is
>> rarely changed. To me it also serves as a place for information that isn't
>> exactly project documentation like "powered by" listings.
>>
>> In a way I'd like to get rid of the wiki to have one less place for docs,
>> that doesn't have the same accessibility (I don't know who can give edit
>> access), and doesn't have a review process.
>>
>> For now I'd settle for bringing over a few key docs like the one you
>> mention. I spent a little time a while ago removing some duplication across
>> the wiki and project docs and think there's a bit more than could be done.
>>
>>
>> On Tue, Oct 18, 2016 at 12:25 PM Holden Karau 
>> wrote:
>>>
>>> Right now the wiki isn't particularly accessible to updates by external
>>> contributors. We've already got a contributing to Spark page which just
>>> links to the wiki - how about if we just move the wiki contents over? This
>>> way contributors can contribute to our documentation about how to contribute,
>>> probably helping clear up points of confusion for new contributors that the
>>> rest of us may be blind to.
>>>
>>> If we do this we would probably want to update the wiki page to point to
>>> the documentation generated from markdown. It would also mean that the
>>> results of any update to the contributing guide take a full release cycle to
>>> be visible. Another alternative would be opening up the wiki to a broader
>>> set of people.
>>>
>>> I know a lot of people are probably getting ready for Spark Summit EU
>>> (and I hope to catch up with some of y'all there) but I figured this a
>>> relatively minor proposal.
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-13 Thread Shivaram Venkataraman
Thanks Fred for the detailed reply. The stability points are
especially interesting as a goal for the streaming component in Spark.
In terms of next steps, one approach that might be helpful is to
create benchmarks or scenarios that mimic real-life workloads, and
then work on isolating the specific changes that are required.
It'd also be great to hear other approaches / next steps to concretize
some of these goals.

Thanks
Shivaram

On Thu, Oct 13, 2016 at 8:39 AM, Fred Reiss <freiss@gmail.com> wrote:
> On Tue, Oct 11, 2016 at 11:02 AM, Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>>
>> >
>> Could you expand a little bit more on stability ? Is it just bursty
>> workloads in terms of peak vs. average throughput ? Also what level of
>> latencies do you find users care about ? Is it on the order of 2-3
>> seconds vs. 1 second vs. 100s of milliseconds ?
>> >
>
>
> Regarding stability, I've seen two levels of concrete requirements.
>
> The first is "don't bring down my Spark cluster". That is to say, regardless
> of the input data rate, Spark shouldn't thrash or crash outright. Processing
> may lag behind the data arrival rate, but the cluster should stay up and
> remain fully functional.
>
> The second level is "don't bring down my application". A common use for
> streaming systems is to handle heavyweight computations that are part of a
> larger application, like a web application, a mobile app, or a plant control
> system. For example, an online application for car insurance might need to
> do some pretty involved machine learning to produce an accurate quote and
> suggest good upsells to the customer. If the heavyweight portion times out,
> the whole application times out, and you lose a customer.
>
> In terms of bursty vs. non-bursty, the "don't bring down my Spark cluster
> case" is more about handling bursts, while the "don't bring down my
> application" case is more about delivering acceptable end-to-end response
> times under typical load.
>
> Regarding latency: One group I talked to mentioned requirements in the
> 100-200 msec range, driven by the need to display a web page on a browser or
> mobile device. Another group in the Internet of Things space mentioned times
> ranging from 5 seconds to 30 seconds throughout the conversation. But most
> people I've talked to have been pretty vague about specific numbers.
>
> My impression is that these groups are not motivated by anxiety about
> meeting a particular latency target for a particular application. Rather,
> they want to make low latency the norm so that they can stop having to think
> about latency. Today, low latency is a special requirement of special
> applications. But that policy imposes a lot of hidden costs. IT architects
> have to spend time estimating the latency requirements of every application
> and lobbying for special treatment when those requirements are strict.
> Managers have to spend time engineering business processes around latency.
> Data scientists have to spend time packaging up models and negotiating how
> those models will be shipped over to the low-latency serving tier. And
> customers who are accustomed to Google and smartphones end up with an
> experience that is functional but unsatisfying.
>
> It's best to think of latency as a sliding scale. A given level of latency
> imposes a given level of cost enterprise-wide. Someone who is making a
> decision on middleware policy will balance this cost against other costs:
> How much does it cost to deploy the middleware? How much does it cost to
> train developers to use the system? The winner will be the system that
> minimizes the overall cost.
>
> Fred

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-11 Thread Shivaram Venkataraman
Thanks Fred - that is very helpful.

> Delivering low latency, high throughput, and stability simultaneously: Right
> now, our own tests indicate you can get at most two of these characteristics
> out of Spark Streaming at the same time. I know of two parties that have
> abandoned Spark Streaming because "pick any two" is not an acceptable answer
> to the latency/throughput/stability question for them.
>
Could you expand a little bit more on stability ? Is it just bursty
workloads in terms of peak vs. average throughput ? Also what level of
latencies do you find users care about ? Is it on the order of 2-3
seconds vs. 1 second vs. 100s of milliseconds ?
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-05 Thread Shivaram Venkataraman
Yeah I see the apache maven repos have the 2.0.1 artifacts at
https://repository.apache.org/content/repositories/releases/org/apache/spark/spark-core_2.11/
-- Not sure why they haven't synced to maven central yet

Shivaram

On Wed, Oct 5, 2016 at 8:37 PM, Luciano Resende  wrote:
> It usually don't take that long to be synced, I still don't see any 2.0.1
> related artifacts on maven central
>
> http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22%20AND%20v%3A%222.0.1%22
>
>
> On Tue, Oct 4, 2016 at 1:23 PM, Reynold Xin  wrote:
>>
>> They have been published yesterday, but can take a while to propagate.
>>
>>
>> On Tue, Oct 4, 2016 at 12:58 PM, Prajwal Tuladhar 
>> wrote:
>>>
>>> Hi,
>>>
>>> It seems like the 2.0.1 artifacts haven't been published to Maven Central. Can
>>> anyone confirm?
>>>
>>> On Tue, Oct 4, 2016 at 5:39 PM, Reynold Xin  wrote:

 We are happy to announce the availability of Spark 2.0.1!

 Apache Spark 2.0.1 is a maintenance release containing 300 stability and
 bug fixes. This release is based on the branch-2.0 maintenance branch of
 Spark. We strongly recommend all 2.0.0 users to upgrade to this stable
 release.

 To download Apache Spark 2.0.1, visit
 http://spark.apache.org/downloads.html

 We would like to acknowledge all community members for contributing
 patches to this release.


>>>
>>>
>>>
>>> --
>>> --
>>> Cheers,
>>> Praj
>>
>>
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Shivaram Venkataraman
+1 I think having a 4 month window instead of a 3 month window sounds good.

However I think figuring out a timeline for maintenance releases would
also be good. This is a common concern that comes up in many user
threads, and it would be better to have some structure around this. It
doesn't need to be strict, but something like: the first maintenance
release for the latest 2.x.0 within 2 months, and then a second
maintenance release within 6 months, or something like that.

Thanks
Shivaram

On Tue, Sep 27, 2016 at 12:06 PM, Reynold Xin  wrote:
> We are 2 months past releasing Spark 2.0.0, an important milestone for the
> project. Spark 2.0.0 deviated (took 6 months) from the regular release cadence
> we had for the 1.x line, and we never explicitly discussed what the release
> cadence should look like for 2.x. Thus this email.
>
> During Spark 1.x, roughly every three months we made a new 1.x feature
> release (e.g. 1.5.0 came out three months after 1.4.0). Development
> happened primarily in the first two months, and then a release branch was
> cut at the end of month 2, and the last month was reserved for QA and
> release preparation.
>
> During 2.0.0 development, I really enjoyed the longer release cycle because
> there were a lot of major changes happening and the longer time was critical
> for thinking through architectural changes as well as API design. While I
> don't expect the same degree of drastic changes in a 2.x feature release, I
> do think it'd make sense to increase the length of release cycle so we can
> make better designs.
>
> My strawman proposal is to maintain a regular release cadence, as we did in
> Spark 1.x, and increase the cycle from 3 months to 4 months. This
> effectively gives us ~50% more time to develop (in reality it'd be slightly
> less than 50% since longer dev time also means longer QA time). As for
> maintenance releases, I think those should still be cut on-demand, similar
> to Spark 1.x, but more aggressively.
>
> To put this into perspective, 4-month cycle means we will release Spark
> 2.1.0 at the end of Nov or early Dec (and branch cut / code freeze at the
> end of Oct).
>
> I am curious what others think.
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-09-26 Thread Shivaram Venkataraman
Disclaimer - I am not very closely involved with Structured Streaming
design / development, so this is just my two cents from looking at the
discussion in the linked JIRAs and PRs.

It seems to me there are a couple of issues being conflated here: (a)
the question of how to specify or add more functionality to the Sink
API, such as the ability to get model updates back to the driver (a
design issue IMHO), and (b) the question of how to pass parameters to
DataFrameWriter, especially strings vs. typed objects, and whether the
API is stable vs. experimental.

TLDR is that I think we should first focus on refactoring the Sink and
add new functionality after that. Detailed comments below.

Sink design / functionality: Looking at SPARK-10815, a JIRA linked
from SPARK-16407, it looks like the existing Sink API is limited
because it is tied to the RDD/Dataframe definitions. It also has
surprising limitations like not being able to run operators on `data`
and only using `collect/foreach`.  Given these limitations, I think it
makes sense to redesign the Sink API first *before* adding new
functionality to the existing Sink. I understand that we have not
marked this experimental in 2.0.0, but since Structured Streaming is
new as a whole, we can probably break the Sink API in an upcoming
2.1.0 release.

As a part of the redesign, I think we need to do two things: (i) come
up with a new data handle that separates RDD from what is passed to
the Sink (ii) Have some way to specify code that can run on the
driver. This might not be an issue if the data handle already has
clean abstraction for this.
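
To make (i) and (ii) a bit more concrete, here is a rough sketch in Scala.
The first trait only approximates the shape of today's micro-batch Sink; the
second is purely hypothetical (none of these names exist in Spark) and is
meant to illustrate the direction, not a concrete proposal:

import org.apache.spark.sql.DataFrame

// Roughly the shape of the current micro-batch Sink: the only handle it gets
// is a DataFrame, which is what ties it to the RDD/DataFrame definitions.
trait CurrentSinkShape {
  def addBatch(batchId: Long, data: DataFrame): Unit
}

// Hypothetical sketch of (i) and (ii): a narrower data handle decoupled from
// RDD/DataFrame, plus an explicit hook for code that runs on the driver
// (e.g. updating an ML model with per-batch aggregates).
trait StreamOutput[T] {
  def iterator: Iterator[T]
}
trait RedesignedSink[T] {
  def addBatch(batchId: Long, output: StreamOutput[T]): Unit
  def onDriver(batchId: Long, output: StreamOutput[T]): Unit
}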

Micro-batching: Ideally it would be good to not expose the micro-batch
processing model in the Sink API as this might change going forward.
Given the consistency model we are presenting I think there will be
some notion of batch / time-range identifier in the API. But I think
if we can avoid having hard constraints on where functions will get
run (i.e. on the driver vs. as a part of a job etc.) and when
functions will get run (i.e. strictly after every micro-batch) it
might give us more freedom in improving performance going forward [1].

Parameter passing: I think your point that typed is better than
untyped is pretty good and supporting both APIs isn't necessarily bad
either. My understanding of the discussion around this is that we should
do this after the Sink is refactored, to avoid exposing the old APIs?

Thanks
Shivaram

[1] FWIW this is something I am looking at and
https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/
has some details about this.


On Mon, Sep 26, 2016 at 1:38 PM, Holden Karau  wrote:
> Hi Spark Developers,
>
>
> After some discussion on SPARK-16407 (and on the PR) we’ve decided to jump
> back to the developer list (SPARK-16407 itself comes from our early work on
> SPARK-16424 to enable ML with the new Structured Streaming API). SPARK-16407
> is proposing to extend the current DataStreamWriter API to allow users to
> specify a specific instance of a StreamSinkProvider - this makes it easier
> for users to create sinks that are configured with things besides strings
> (for example things like lambdas). An example of something like this already
> inside Spark is the ForeachSink.
>
>
> We have been working on adding support for online learning in Structured
> Streaming, similar to what Spark Streaming and MLLib provide today. Details
> are available in  SPARK-16424. Along the way, we noticed that there is
> currently no way for code running in the driver to access the streaming
> output of a Structured Streaming query (in our case ideally as an Dataset or
> RDD - but regardless of the underlying data structure). In our specific
> case, we wanted to update a model in the driver using aggregates computed by
> a Structured Streaming query.
>
>
> A lot of other applications are going to have similar requirements. For
> example, there is no way (outside of using private Spark internals)* to
> implement a console sink with a user supplied formatting function, or
> configure a templated or generic sink at runtime, trigger a custom Python
> call-back or even implement the ForeachSink outside of Spark. For work
> inside of Spark to enable Structured Streaming with ML we clearly don’t need
> SPARK-16407 as we can directly access the internals (although it would be
> cleaner to not have to) but if we want to empower people working outside of
> the Spark codebase itself with Structured Streaming I think we need to
> provide some mechanism for this and it would be great to see what
> options/ideas the community can come up with.
>
>
> One of the arguments against SPARK-16407 seems to be mostly that it exposes
> the Sink API which is implemented using micro-batching, but the counter
> argument to this is that the Sink API is already exposed (instead of passing
> in an instance the user needs to pass in a class name which is then created
> through reflection and has configuration parameters 

Re: R docs no longer building for branch-2.0

2016-09-22 Thread Shivaram Venkataraman
I looked into this and found the problem. Will send a PR now to fix this.

If you are curious about what is happening here: When we build the
docs separately we don't have the JAR files from the Spark build in
the same tree. We added a new set of docs recently in SparkR called an
R vignette that runs Spark and generates docs using outputs from the
run.  So this doesn't work when the JARs are not available.

Thanks
Shivaram

On Thu, Sep 22, 2016 at 5:06 AM, Sean Owen  wrote:
> FWIW it worked for me, but I may not be executing the same thing. I
> was running the commands given in R/DOCUMENTATION.md
>
> It succeeded for me in creating the vignette, on branch-2.0.
>
> Maybe it's a version or library issue? what R do you have installed,
> and are you up to date with packages like devtools and roxygen2?
>
> On Thu, Sep 22, 2016 at 7:47 AM, Reynold Xin  wrote:
>> I'm working on packaging 2.0.1 rc but encountered a problem: R doc fails to
>> build. Can somebody take a look at the issue ASAP?
>>
>>
>>
>> ** knitting documentation of write.parquet
>> ** knitting documentation of write.text
>> ** knitting documentation of year
>> ~/workspace/spark-release-docs/spark/R
>> ~/workspace/spark-release-docs/spark/R
>>
>>
>> processing file: sparkr-vignettes.Rmd
>>
>>   |
>>   | |   0%
>>   |
>>   |.|   1%
>>inline R code fragments
>>
>>
>>   |
>>   |.|   2%
>> label: unnamed-chunk-1 (with options)
>> List of 1
>>  $ message: logi FALSE
>>
>> Loading required package: methods
>>
>> Attaching package: 'SparkR'
>>
>> The following objects are masked from 'package:stats':
>>
>> cov, filter, lag, na.omit, predict, sd, var, window
>>
>> The following objects are masked from 'package:base':
>>
>> as.data.frame, colnames, colnames<-, drop, intersect, rank,
>> rbind, sample, subset, summary, transform, union
>>
>>
>>   |
>>   |..   |   3%
>>   ordinary text without R code
>>
>>
>>   |
>>   |..   |   4%
>> label: unnamed-chunk-2 (with options)
>> List of 1
>>  $ message: logi FALSE
>>
>> Spark package found in SPARK_HOME:
>> /home/jenkins/workspace/spark-release-docs/spark
>> Error: Could not find or load main class org.apache.spark.launcher.Main
>> Quitting from lines 30-31 (sparkr-vignettes.Rmd)
>> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  :
>>   JVM is not ready after 10 seconds
>> Calls: render ... eval -> eval -> sparkR.session -> sparkR.sparkContext
>>
>> Execution halted
>> jekyll 2.5.3 | Error:  R doc generation failed
>> Deleting credential directory
>> /home/jenkins/workspace/spark-release-docs/spark-utils/new-release-scripts/jenkins/jenkins-credentials-IXCkuX6w
>> Build step 'Execute shell' marked build as failure
>> [WS-CLEANUP] Deleting project workspace...[WS-CLEANUP] done
>> Finished: FAILURE

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Change the settings in AppVeyor to prevent triggering the tests in other PRs in other branches

2016-09-09 Thread Shivaram Venkataraman
The infra ticket has been updated, so I'd say let's stick to running tests
on the master branch. We can of course create JIRAs for tests that fail in
branch-2.0 and branch-1.6

Shivaram

On Sep 9, 2016 09:33, "Hyukjin Kwon" <gurwls...@gmail.com> wrote:

> FYI, I just ran the SparkR tests on Windows for branch-2.0 and 1.6.
>
> branch-2.0 - https://github.com/spark-test/spark/pull/7
> branch-1.6 - https://github.com/spark-test/spark/pull/8
>
>
>
>
> 2016-09-10 0:59 GMT+09:00 Hyukjin Kwon <gurwls...@gmail.com>:
>
>> Yes, if we don't have any PRs to other branches on branch-1.5 and lower
>> versions, I think it'd be fine.
>>
>> One concern is that I am not sure if the SparkR tests can pass on branch-1.6
>> (I checked that they pass on branch-2.0 before).
>>
>> I can try to check if it passes and identify the related causes if it
>> does not pass.
>>
>> On 10 Sep 2016 12:52 a.m., "Shivaram Venkataraman" <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> One thing we could do is to backport the commit to branch-2.0 and
>>> branch-1.6 -- Do you think that will fix the problem ?
>>>
>>> On Fri, Sep 9, 2016 at 8:50 AM, Hyukjin Kwon <gurwls...@gmail.com>
>>> wrote:
>>> > Ah, thanks! I wasn't too sure on this so I thought asking here somehow
>>> > reaches out to who's in charge of the account :).
>>> >
>>> >
>>> > On 10 Sep 2016 12:41 a.m., "Shivaram Venkataraman"
>>> > <shiva...@eecs.berkeley.edu> wrote:
>>> >>
>>> >> Thanks for debugging - I'll reply on
>>> >> https://issues.apache.org/jira/browse/INFRA-12590 and ask for this
>>> >> change.
>>> >>
>>> >> FYI I don't think any of the committers have access to the AppVeyor account,
>>> >> which is at https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark .
>>> >> To request changes that need to be done in the UI we need to open an
>>> >> INFRA ticket.
>>> >>
>>> >> Thanks
>>> >> Shivaram
>>> >>
>>> >> On Fri, Sep 9, 2016 at 6:55 AM, Hyukjin Kwon <gurwls...@gmail.com>
>>> wrote:
>>> >> > Hi all,
>>> >> >
>>> >> >
>>> >> > Currently, it seems the settings in AppVeyor are the default and run some
>>> >> > tests on different branches. For example,
>>> >> >
>>> >> >
>>> >> > https://github.com/apache/spark/pull/15023
>>> >> >
>>> >> > https://github.com/apache/spark/pull/15022
>>> >> >
>>> >> >
>>> >> > It seems it happens only in other branches, as they don't have appveyor.yml
>>> >> > and try to refer to the configuration on the web (although I have to test
>>> >> > this).
>>> >> >
>>> >> >
>>> >> > It'd be great if one of the authorized people could set the branch to test
>>> >> > to the master branch only, as described in
>>> >> >
>>> >> > https://github.com/apache/spark/blob/master/dev/appveyor-guide.md#specifying-the-branch-for-building-and-setting-the-build-schedule
>>> >> >
>>> >> >
>>> >> > I just manually tested this. With the setting, it would not trigger the test
>>> >> > for another branch, for example,
>>> >> > https://github.com/spark-test/spark/pull/5
>>> >> >
>>> >> > Currently, with the default settings, it will run the tests on another
>>> >> > branch, for example, https://github.com/spark-test/spark/pull/4
>>> >> >
>>> >> >
>>> >> > Thanks.
>>> >> >
>>> >> >
>>> >> >
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>> >
>>>
>>
>


Re: Change the settings in AppVeyor to prevent triggering the tests in other PRs in other branches

2016-09-09 Thread Shivaram Venkataraman
One thing we could do is to backport the commit to branch-2.0 and
branch-1.6 -- Do you think that will fix the problem ?

On Fri, Sep 9, 2016 at 8:50 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> Ah, thanks! I wasn't too sure on this so I thought asking here somehow
> reaches out to who's in charge of the account :).
>
>
> On 10 Sep 2016 12:41 a.m., "Shivaram Venkataraman"
> <shiva...@eecs.berkeley.edu> wrote:
>>
>> Thanks for debugging - I'll reply on
>> https://issues.apache.org/jira/browse/INFRA-12590 and ask for this
>> change.
>>
>> FYI I don't think any of the committers have access to the AppVeyor account,
>> which is at https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark .
>> To request changes that need to be done in the UI we need to open an
>> INFRA ticket.
>>
>> Thanks
>> Shivaram
>>
>> On Fri, Sep 9, 2016 at 6:55 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>> > Hi all,
>> >
>> >
>> > Currently, it seems the settings in AppVeyor are the default and run some
>> > tests on different branches. For example,
>> >
>> >
>> > https://github.com/apache/spark/pull/15023
>> >
>> > https://github.com/apache/spark/pull/15022
>> >
>> >
>> > It seems it happens only in other branches, as they don't have appveyor.yml
>> > and try to refer to the configuration on the web (although I have to test
>> > this).
>> >
>> >
>> > It'd be great if one of the authorized people could set the branch to test
>> > to the master branch only, as described in
>> >
>> >
>> > https://github.com/apache/spark/blob/master/dev/appveyor-guide.md#specifying-the-branch-for-building-and-setting-the-build-schedule
>> >
>> >
>> > I just manually tested this. With the setting, it would not trigger the
>> > test
>> > for another branch, for example,
>> > https://github.com/spark-test/spark/pull/5
>> >
>> > Currently, with the default settings, it will run the tests on another
>> > branch, for example, https://github.com/spark-test/spark/pull/4
>> >
>> >
>> > Thanks.
>> >
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Change the settings in AppVeyor to prevent triggering the tests in other PRs in other branches

2016-09-09 Thread Shivaram Venkataraman
Thanks for debugging - I'll reply on
https://issues.apache.org/jira/browse/INFRA-12590 and ask for this
change.

FYI I don't think any of the committers have access to the AppVeyor account,
which is at https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark .
To request changes that need to be done in the UI we need to open an
INFRA ticket.

Thanks
Shivaram

On Fri, Sep 9, 2016 at 6:55 AM, Hyukjin Kwon  wrote:
> Hi all,
>
>
> Currently, it seems the settings in AppVeyor are the default and run some tests
> on different branches. For example,
>
>
> https://github.com/apache/spark/pull/15023
>
> https://github.com/apache/spark/pull/15022
>
>
> It seems it happens only in other branches, as they don't have appveyor.yml
> and try to refer to the configuration on the web (although I have to test
> this).
>
>
> It'd be great if one of the authorized people could set the branch to test to
> the master branch only, as described in
>
> https://github.com/apache/spark/blob/master/dev/appveyor-guide.md#specifying-the-branch-for-building-and-setting-the-build-schedule
>
>
> I just manually tested this. With the setting, it would not trigger the test
> for another branch, for example, https://github.com/spark-test/spark/pull/5
>
> Currently, with the default settings, it will run the tests on another
> branch, for example, https://github.com/spark-test/spark/pull/4
>
>
> Thanks.
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Discuss SparkR executors/workers support virtualenv

2016-09-07 Thread Shivaram Venkataraman
I think this makes sense -- making it easier to use additional R
packages would be a good feature. I am not sure we need Packrat for
this use case though. Let's continue the discussion on the JIRA at
https://issues.apache.org/jira/browse/SPARK-17428

Thanks
Shivaram

On Tue, Sep 6, 2016 at 11:36 PM, Yanbo Liang  wrote:
> Hi All,
>
>
> Many users have requirements to use third-party R packages in
> executors/workers, but SparkR cannot satisfy these requirements elegantly.
> For example, you have to work with the IT/administrators of the cluster to
> deploy these R packages on each executor/worker node, which is very
> inflexible.
>
> I think we should support third-party R packages for SparkR users, as we do
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on
> each executor.
> 2. Users can load their local R packages and install them on each executor.
>
> To achieve this goal, the first thing is to make SparkR executors support a
> virtualenv-like mechanism, similar to Python's conda. I have investigated and
> found packrat (http://rstudio.github.io/packrat/) is one of the candidates to
> support virtualenv for R. Packrat is a dependency management system for R
> and can isolate the dependent R packages in its own private package space.
> Then SparkR users can install third-party packages at the application
> scope (destroyed after the application exits) and don't need to bother
> IT/administrators to install these packages manually.
>
> I would like to know whether this makes sense.
>
>
> Thanks
>
> Yanbo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: sparkR array type not supported

2016-09-02 Thread Shivaram Venkataraman
I think it needs a type for the elements in the array. For example

f <- structField("x", "array<string>")

Thanks
Shivaram

On Fri, Sep 2, 2016 at 8:26 AM, Paul R  wrote:
> Hi there,
>
> I’ve noticed the following command in sparkR
>
 field = structField(“x”, “array”)
>
> Throws this error
>
 Error in checkType(type) : Unsupported type for SparkDataframe: array
>
> Was wondering if this is a bug as the documentation says “array” should be 
> implemented
>
> Thanks
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: KMeans calls takeSample() twice?

2016-08-30 Thread Shivaram Venkataraman
I think takeSample itself runs multiple jobs if the number of samples
collected in the first pass is not enough. The comment and code path
at 
https://github.com/apache/spark/blob/412b0e8969215411b97efd3d0984dc6cac5d31e0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L508
should explain when this happens. Also you can confirm this by
checking if the logWarning shows up in your logs.
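
To illustrate why more than one stage can show up, here is a simplified
sketch of the pattern (illustrative only, not the actual RDD.takeSample
source):

import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// The first job counts, the second samples with an estimated fraction, and
// further jobs are launched only if the sample comes back short (which is
// when the logWarning mentioned above appears).
def takeSampleSketch[T: ClassTag](rdd: RDD[T], num: Int, seed: Long = 42L): Array[T] = {
  val total = rdd.count()                                        // job 1
  val fraction = math.min(1.0, 1.2 * num / math.max(1L, total))
  var samples = rdd.sample(false, fraction, seed).collect()      // job 2
  var attempt = 0
  while (samples.length < num && attempt < 10) {                 // extra jobs if short
    attempt += 1
    samples = rdd.sample(false, math.min(1.0, fraction * (attempt + 1)), seed + attempt).collect()
  }
  new Random(seed).shuffle(samples.toSeq).take(num).toArray
}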

Thanks
Shivaram

On Tue, Aug 30, 2016 at 9:50 AM, Georgios Samaras
 wrote:
>
> -- Forwarded message --
> From: Georgios Samaras 
> Date: Tue, Aug 30, 2016 at 9:49 AM
> Subject: Re: KMeans calls takeSample() twice?
> To: "Sean Owen [via Apache Spark Developers List]"
> 
>
>
> I am not sure what you want me to check. Note that I see two takeSample()s
> being invoked every single time I execute KMeans(). In a current job I have,
> I did view the details and updated the StackOverflow question.
>
>
>
> On Tue, Aug 30, 2016 at 9:25 AM, Sean Owen [via Apache Spark Developers
> List]  wrote:
>>
>> I'm not sure it's a UI bug; it really does record two different
>> stages, the second of which executes quickly. I am not sure why that
>> would happen off the top of my head. I don't see anything that failed
>> here.
>>
>> Digging into those two stages and what they executed might give a clue
>> to what's really going on there.
>>
>> On Tue, Aug 30, 2016 at 5:18 PM, gsamaras <[hidden email]> wrote:
>> > Yanbo thank you for your reply. So you are saying that this is a bug in
>> > the
>> > Spark UI in general, and not in the local Spark UI of our cluster, where
>> > I
>> > work, right?
>> >
>> > George
>>
>> -
>> To unsubscribe e-mail: [hidden email]
>>
>>
>>
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark R - Loading Third Party R Library in YARN Executors

2016-08-17 Thread Shivaram Venkataraman
I think you can also pass in a zip file using the --files option
(http://spark.apache.org/docs/latest/running-on-yarn.html has some
examples). The files should then be present in the current working
directory of the driver R process.

Thanks
Shivaram

On Wed, Aug 17, 2016 at 4:16 AM, Felix Cheung  wrote:
> When you call library(), that is the library loading function in native R.
> As of now it does not support HDFS but there are several packages out there
> that might help.
>
> Another approach is to have a prefetch/installation mechanism to call HDFS
> command to download the R package from HDFS onto the worker node first.
>
>
> _
> From: Senthil Kumar 
> Sent: Wednesday, August 17, 2016 2:23 AM
> Subject: Spark R - Loading Third Party R Library in YARN Executors
> To: Senthil kumar , ,
> , 
>
>
>
> Hi All, we are using the Spark 1.6 version of the R library. Below is our
> code, which loads the third-party library.
>
>
> library("BreakoutDetection", lib.loc = "hdfs://xx/BreakoutDetection/") :
> library("BreakoutDetection", lib.loc = "//xx/BreakoutDetection/") :
>
>
> When I try to execute the code in local mode, the SparkR code works
> fine without any issue. If I submit the job in cluster mode, we end up
> with an error.
>
> error in evaluating the argument 'X' in selecting a method for function
> 'lapply': Error in library("BreakoutDetection", lib.loc =
> "hdfs://xxx/BreakoutDetection/") :
>   no library trees found in 'lib.loc'
> Calls: f ... lapply -> FUN -> mainProcess -> angleValid -> library
>
>
> Can't we read libraries in R as below ?
> library("BreakoutDetection", lib.loc = "hdfs://xx/BreakoutDetection/") :
>
> If not, what is another way to solve this problem?
>
> Since our cluster has close to 2500 nodes, we can't copy the third-party
> libs to all nodes. Copying to all DNs is not good practice either.
>
> Can someone help me here with how to load R libs from HDFS or any other way?
>
>
> --Senthil
>
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Shivaram Venkataraman
+1

SHA and MD5 sums match for all binaries. Docs look fine this time
around. Built and ran `dev/run-tests` with Java 7 on a linux machine.

No blocker bugs on JIRA and the only critical bug with target as 2.0.0
is SPARK-16633, which doesn't look like a release blocker. I also
checked issues which are marked as Critical affecting version 2.0.0
and the only other ones that seem applicable are SPARK-15703 and
SPARK-16334. Both of them don't look like blockers to me.

Thanks
Shivaram


On Tue, Jul 19, 2016 at 7:35 PM, Reynold Xin  wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.0-rc5
> (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).
>
> This release candidate resolves ~2500 issues:
> https://s.apache.org/spark-2.0.0-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1195/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/
>
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.x.
>
> ==
> What justifies a -1 vote for this release?
> ==
> Critical bugs impacting major functionalities.
>
> Bugs already present in 1.x, missing features, or bugs related to new
> features will not necessarily block this release. Note that historically
> Spark documentation has been published on the website separately from the
> main release so we do not need to block the release due to documentation
> errors either.
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-15 Thread Shivaram Venkataraman
Hashes, sigs match. I built and ran tests with Hadoop 2.3 ("-Pyarn
-Phadoop-2.3 -Phive -Pkinesis-asl -Phive-thriftserver"). I couldn't
get the following tests to pass but I think it might be something
specific to my setup as Jenkins on branch-2.0 seems quite stable.

[error] Failed tests:
[error] org.apache.spark.sql.hive.client.VersionsSuite
[error] org.apache.spark.sql.hive.HiveSparkSubmitSuite
[error] Error during tests:
[error] org.apache.spark.sql.hive.HiveExternalCatalogSuite

Regarding the open issues, I agree with Sean that most of them seem
minor to me and not worth blocking a release for. It would be good to
get more details on SPARK-16011 though

As for the docs, ideally we should have them in place before the RC
but given that this is a recurring issue I'm wondering if having a
separate updatable link (like the 2.0.0-rc4-updated that Reynold
posted yesterday) can be used. The semantics we could then have are
that the docs should be ready when the vote succeeds rather than being
ready when the vote starts.

Thanks
Shivaram

On Fri, Jul 15, 2016 at 6:59 AM, Sean Owen  wrote:
> Signatures and hashes are OK. I built and ran tests successfully on
> Ubuntu 16 + Java 8 with "-Phive -Phadoop-2.7 -Pyarn". Although I
> encountered a few tests failures, none were repeatable.
>
> Regarding other issues brought up so far:
>
> SPARK-16522
> Does not seem quite enough to be a blocker if it's just an error at
> shutdown that does not affect the result. If there's another RC, worth
> fixing.
> SPARK-15899
> Not a blocker. Only affects Windows and possibly even only affects
> tests. Not a regression.
> SPARK-16515
> Not sure but Cheng please mark it a Blocker if you're pretty confident
> it must be fixed.
>
> Davies marked SPARK-16011 a Blocker, though should confirm that it's
> for 2.0.0. That's the only one officially open now.
>
> So I suppose that's provisionally a -1 from me as it's not clear there
> aren't blocking issues. It's close, and this should be tested by
> everyone.
>
>
> Remaining Critical issues are below. I'm still uncomfortable with
> documentation issues for 2.0 not being done before 2.0. If anyone's
> intent is to release and then finish the docs a few days later, I'd
> vote against that. There's just no rush that makes that make sense.
>
> However it's entirely possible that the remaining work is not
> essential for 2.0; I don't know. These should be retitled then. But to
> make this make sense, one or the other needs to happen. "Audit" JIRAs
> are similar, especially before a major release.
>
>
> SPARK-13393 Column mismatch issue in left_outer join using Spark DataFrame
> SPARK-13753 Column nullable is derived incorrectly
> SPARK-13959 Audit MiMa excludes added in SPARK-13948 to make sure none
> are unintended incompatibilities
> SPARK-14808 Spark MLlib, GraphX, SparkR 2.0 QA umbrella
> SPARK-14816 Update MLlib, GraphX, SparkR websites for 2.0
> SPARK-14817 ML, Graph, R 2.0 QA: Programming guide update and migration guide
> SPARK-14823 Fix all references to HiveContext in comments and docs
> SPARK-15340 Limit the size of the map used to cache JobConfs to void OOM
> SPARK-15393 Writing empty Dataframes doesn't save any _metadata files
> SPARK-15703 Spark UI doesn't show all tasks as completed when it should
> SPARK-15944 Make spark.ml package backward compatible with spark.mllib vectors
> SPARK-16032 Audit semantics of various insertion operations related to
> partitioned tables
> SPARK-16090 Improve method grouping in SparkR generated docs
> SPARK-16301 Analyzer rule for resolving using joins should respect
> case sensitivity setting
>
> On Thu, Jul 14, 2016 at 7:59 PM, Reynold Xin  wrote:
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.0. The vote is open until Sunday, July 17, 2016 at 12:00 PDT and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.0-rc4
>> (e5f8c1117e0c48499f54d62b556bc693435afae0).
>>
>> This release candidate resolves ~2500 issues:
>> https://s.apache.org/spark-2.0.0-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1192/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/
>>
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running 

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Shivaram Venkataraman
I think the docs build was broken because of
https://issues.apache.org/jira/browse/SPARK-16553 - A fix has been
merged and we are testing it now

Shivaram

On Thu, Jul 14, 2016 at 1:56 PM, Matthias Niehoff
 wrote:
> Some of the programming guides in the docs only give me a blank page (the Spark
> programming guide works; Streaming, DataFrame/SQL and Structured Streaming
> do not).
>
> 2016-07-14 22:21 GMT+02:00 Nicholas Chammas :
>>
>> Oh nevermind, just noticed your note. Apologies.
>>
>> On Thu, Jul 14, 2016 at 4:20 PM Nicholas Chammas
>>  wrote:
>>>
>>> Just curious: Did we have an RC3? I don't remember seeing one.
>>>
>>>
>>> On Thu, Jul 14, 2016 at 3:00 PM Reynold Xin  wrote:

 Please vote on releasing the following candidate as Apache Spark version
 2.0.0. The vote is open until Sunday, July 17, 2016 at 12:00 PDT and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 2.0.0
 [ ] -1 Do not release this package because ...


 The tag to be voted on is v2.0.0-rc4
 (e5f8c1117e0c48499f54d62b556bc693435afae0).

 This release candidate resolves ~2500 issues:
 https://s.apache.org/spark-2.0.0-jira

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1192/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/


 =
 How can I help test this release?
 =
 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions from 1.x.

 ==
 What justifies a -1 vote for this release?
 ==
 Critical bugs impacting major functionalities.

 Bugs already present in 1.x, missing features, or bugs related to new
 features will not necessarily block this release. Note that historically
 Spark documentation has been published on the website separately from the
 main release so we do not need to block the release due to documentation
 errors either.


 Note: There was a mistake made during "rc3" preparation, and as a result
 there is no "rc3", but only "rc4".

>
>
>
> --
> Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
> codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
> tel: +49 (0) 721.9595-681 | fax: +49 (0) 721.9595-666 | mobil: +49 (0)
> 172.1702676
> www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
> www.more4fi.de
>
> Registered office: Solingen | HRB 25917 | Wuppertal District Court
> Management board: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
> Supervisory board: Patric Fedlmeier (Chairman) . Klaus Jäger . Jürgen Schütz
>
> This e-mail, including any attached files, contains confidential and/or
> legally protected information. If you are not the intended recipient or have
> received this e-mail in error, please inform the sender immediately and
> delete this e-mail and any attached files. The unauthorized copying, use, or
> opening of attached files, as well as the unauthorized forwarding of this
> e-mail, is not permitted

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Call to new JObject sometimes returns an empty R environment

2016-07-05 Thread Shivaram Venkataraman
-sparkr-dev@googlegroups +dev@spark.apache.org

[Please send SparkR development questions to the Spark user / dev
mailing lists. Replies inline]

> From:  
> Date: Tue, Jul 5, 2016 at 3:30 AM
> Subject: Call to new JObject sometimes returns an empty R environment
> To: SparkR Developers 
>
>
>
>  Hi all,
>
>  I have recently moved from SparkR 1.5.2 to 1.6.0. I am doing some
> experiments using SparkR:::newJObject("java.util.HashMap") and I
> notice the behaviour has changed, and it now returns an "environment"
> instead of a "jobj":
>
>> print(class(SparkR:::newJObject("java.util.HashMap")))  # SparkR 1.5.2
> [1] "jobj"
>
>> print(class(SparkR:::newJObject("java.util.HashMap")))  # SparkR 1.6.0
> [1] "environment"
>
> Moreover, the environment returned is apparently empty (when I call
> ls() on the resulting environment, it returns character(0)). This
> problem only happens with some Java classes. I am not able to say
> exactly which classes cause the problem.

The reason this is different in Spark 1.6 is that we added support for
automatically deserializing Maps returned from the JVM as environments
on the R side. The pull request
https://github.com/apache/spark/pull/8711 has some more details. The
reason BitSet / ArrayList "work" is that we don't do any special
serialization / de-serialization for them.

>
> If I try to create an instance of other classes such as
> java.util.BitSet, it works successfully. I thought it might be related
> with parameterized types, but it does work successfully with ArrayList
> and with HashSet, which take a parameter.
>
> Any suggestions on this change of behaviour (apart from "do not use
> private functions" :-)   ) ?

Unfortunately there isn't much more to say than that. The
serialization/de-serialization is an internal API and we don't claim
to maintain backwards compatibility. You might be able to work around
this particular issue by wrapping your Map in a different object.

Thanks
Shivaram

>
> Thank you very much
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: spark-ec2 scripts with spark-2.0.0-preview

2016-06-14 Thread Shivaram Venkataraman
Can you open an issue on https://github.com/amplab/spark-ec2 ?  I
think we should be able to escape the version string and pass the
2.0.0-preview through the scripts

Shivaram

On Tue, Jun 14, 2016 at 12:07 PM, Sunil Kumar
 wrote:
> Hi,
>
> The spark-ec2 scripts are missing from spark-2.0.0-preview. Is there a
> workaround available ? I tried to change the ec2 scripts to accommodate
> spark-2.0.0...If I call the release spark-2.0.0-preview, then it barfs
> because the command line argument : --spark-version=spark-2.0.0-preview
> gets translated to spark-2.0.0-preiew (-v is taken as a switch)...If I call
> the release spark-2.0.0, then it can't find it in AWS, since it looks for
> http://s3.amazonaws.com/spark-related-packages/spark-2.0.0-bin-hadoop2.4.tgz
> instead of
> http://s3.amazonaws.com/spark-related-packages/spark-2.0.0-preview-bin-hadoop2.4.tgz
>
> Any ideas on how to make this work ? How can I tweak/hack the code to look
> for spark-2.0.0-preview in spark-related-packages ?
>
> thanks
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-07 Thread Shivaram Venkataraman
As far as I know the process is just to copy docs/_site from the build
to the appropriate location in the SVN repo (i.e.
site/docs/2.0.0-preview).

Thanks
Shivaram

On Tue, Jun 7, 2016 at 8:14 AM, Sean Owen  wrote:
> As a stop-gap, I can edit that page to have a small section about
> preview releases and point to the nightly docs.
>
> Not sure who has the power to push 2.0.0-preview to site/docs, but, if
> that's done then we can symlink "preview" in that dir to it and be
> done, and update this section about preview docs accordingly.
>
> On Tue, Jun 7, 2016 at 4:10 PM, Tom Graves  wrote:
>> Thanks Sean, you were right, a hard refresh made it show up.
>>
>> Seems like we should at least link to the preview docs from
>> http://spark.apache.org/documentation.html.
>>
>> Tom
>>
>>
>> On Tuesday, June 7, 2016 10:04 AM, Sean Owen  wrote:
>>
>>
>> It's there (refresh maybe?). See the end of the downloads dropdown.
>>
>> For the moment you can see the docs in the nightly docs build:
>> https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/
>>
>> I don't know, what's the best way to put this into the main site?
>> under a /preview root? I am not sure how that process works.
>>
>> On Tue, Jun 7, 2016 at 4:01 PM, Tom Graves  wrote:
>>> I just checked and I don't see the 2.0 preview release at all anymore on
>>> http://spark.apache.org/downloads.html. Is it in transition? The only
>>> place I can see it is at
>>> http://spark.apache.org/news/spark-2.0.0-preview.html
>>>
>>>
>>> I would like to see docs there too.  My opinion is it should be as easy to
>>> use/try out as any other spark release.
>>>
>>> Tom
>>
>>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Shivaram Venkataraman
On Thu, May 12, 2016 at 2:29 PM, Reynold Xin  wrote:
> We currently have three levels of interface annotation:
>
> - unannotated: stable public API
> - DeveloperApi: A lower-level, unstable API intended for developers.
> - Experimental: An experimental user-facing API.
>
>
> After using this annotation for ~ 2 years, I would like to propose the
> following changes:
>
> 1. Require explicit annotation for public APIs. This reduces the chance of
> us accidentally exposing private APIs.
>
+1

> 2. Separate interface annotation into two components: one that describes
> intended audience, and the other that describes stability, similar to what
> Hadoop does. This allows us to define "low level" APIs that are stable, e.g.
> the data source API (I'd argue this is the API that should be more stable
> than end-user-facing APIs).
>
> InterfaceAudience: Public, Developer
>
> InterfaceStability: Stable, Experimental
>
I'm not very sure about this. What advantage do we get from Public vs.
Developer? Also, somebody needs to make a judgment call on that, which
might not always be easy to do.
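
For concreteness, here is a purely hypothetical sketch of what the two-axis
scheme could look like (the names below are modeled on Hadoop's annotations
and are not Spark's actual API):

import scala.annotation.StaticAnnotation

// Hypothetical annotation definitions (illustrative only).
object InterfaceAudience {
  class Public extends StaticAnnotation
  class Developer extends StaticAnnotation
}
object InterfaceStability {
  class Stable extends StaticAnnotation
  class Experimental extends StaticAnnotation
}

// A low-level but stable API, e.g. a data source interface, could then be
// tagged along both axes:
@InterfaceAudience.Developer
@InterfaceStability.Stable
trait ExampleDataSourceApi {
  def shortName(): String
}
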
>
> What do you think?

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkR unit test failures on local master

2016-04-28 Thread Shivaram Venkataraman
I just ran the tests using a recently synced master branch and the
tests seemed to work fine. My guess is some of the Java classes
changed and you need to rebuild Spark ?

Thanks
Shivaram

On Thu, Apr 28, 2016 at 1:19 PM, Gayathri Murali
 wrote:
> Hi All,
>
> I am running the sparkR unit test(./R/run-tests.sh) on a local master branch
> and I am seeing the following issues with SparkR ML wrapper test cases.
>
> Failed
> -
> 1. Error: glm and predict (@test_mllib.R#31)
> ---
>
> 1: glm(Sepal_Width ~ Sepal_Length, training, family = "gaussian") at
> /Users/gayathri/spark/R/lib/SparkR/tests/testthat/test_mllib.R:31
>
> Error: summary coefficients match with native glm (@test_mllib.R#79)
> 
> error in evaluating the argument 'object' in selecting a method for function
> 'summary':
>
> Error: summary coefficients match with native glm of family 'binomial'
> (@test_mllib.R#97)
> error in evaluating the argument 'object' in selecting a method for function
> 'summary':
>
> 10. Failure: SQL error message is returned from JVM (@test_sparkSQL.R#1820)
> 
> grepl("Table not found: blah", retError) not equal to TRUE.
> 1 element mismatch
>
> Any thoughts on what could be causing this?
>
> Thanks
> Gayathri
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Shivaram Venkataraman
Overall this sounds good to me. One question I have is that in
addition to the ML algorithms we have a number of linear algebra
(various distributed matrices) and statistical methods in the
spark.mllib package. Is the plan to port or move these to the spark.ml
namespace in the 2.x series ?

Thanks
Shivaram

On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
> FWIW, all of that sounds like a good plan to me. Developing one API is
> certainly better than two.
>
> On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
>> Hi all,
>>
>> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
>> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
>> been developed under the spark.ml package, while the old RDD-based API has
>> been developed in parallel under the spark.mllib package. While it was
>> easier to implement and experiment with new APIs under a new package, it
>> became harder and harder to maintain as both packages grew bigger and
>> bigger. And new users are often confused by having two sets of APIs with
>> overlapped functions.
>>
>> We started to recommend the DataFrame-based API over the RDD-based API in
>> Spark 1.5 for its versatility and flexibility, and we saw the development
>> and the usage gradually shifting to the DataFrame-based API. Just counting
>> the lines of Scala code, from 1.5 to the current master we added ~1
>> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
>> gather more resources on the development of the DataFrame-based API and to
>> help users migrate over sooner, I want to propose switching RDD-based MLlib
>> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>
>> * We do not accept new features in the RDD-based spark.mllib package, unless
>> they block implementing new features in the DataFrame-based spark.ml
>> package.
>> * We still accept bug fixes in the RDD-based API.
>> * We will add more features to the DataFrame-based API in the 2.x series to
>> reach feature parity with the RDD-based API.
>> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
>> the RDD-based API.
>> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>>
>> Though the RDD-based API is already in de facto maintenance mode, this
>> announcement will make it clear and hence important to both MLlib developers
>> and users. So we’d greatly appreciate your feedback!
>>
>> (As a side note, people sometimes use “Spark ML” to refer to the
>> DataFrame-based API or even the entire MLlib component. This also causes
>> confusion. To be clear, “Spark ML” is not an official name and there are no
>> plans to rename MLlib to “Spark ML” at this time.)
>>
>> Best,
>> Xiangrui
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Are we running SparkR tests in Jenkins?

2016-01-15 Thread Shivaram Venkataraman
Yes - we should be running R tests AFAIK. That error message is a
deprecation warning about the script `bin/sparkR` which needs to be
changed in 
https://github.com/apache/spark/blob/7cd7f2202547224593517b392f56e49e4c94cabc/R/run-tests.sh#L26
to bin/spark-submit.

Thanks
Shivaram

On Fri, Jan 15, 2016 at 3:19 PM, Herman van Hövell tot Westerflier
 wrote:
> Hi all,
>
> I just noticed the following log entry in Jenkins:
>
>> 
>> Running SparkR tests
>> 
>> Running R applications through 'sparkR' is not supported as of Spark 2.0.
>> Use ./bin/spark-submit 
>
>
> Are we still running R tests? Or just saying that this will be deprecated?
>
> Kind regards,
>
> Herman van Hövell tot Westerflier
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Are we running SparkR tests in Jenkins?

2016-01-15 Thread Shivaram Venkataraman
Ah I see. I wasn't aware of that PR. We should do a find and replace
in all the documentation and the rest of the repository as well.

Shivaram

On Fri, Jan 15, 2016 at 3:20 PM, Reynold Xin  wrote:
> +Shivaram
>
> Ah damn - we should fix it.
>
> This was broken by https://github.com/apache/spark/pull/10658 - which
> removed a functionality that has been deprecated since Spark 1.0.
>
>
>
>
>
> On Fri, Jan 15, 2016 at 3:19 PM, Herman van Hövell tot Westerflier
>  wrote:
>>
>> Hi all,
>>
>> I just noticed the following log entry in Jenkins:
>>
>>> 
>>> Running SparkR tests
>>> 
>>> Running R applications through 'sparkR' is not supported as of Spark 2.0.
>>> Use ./bin/spark-submit 
>>
>>
>> Are we still running R tests? Or just saying that this will be deprecated?
>>
>> Kind regards,
>>
>> Herman van Hövell tot Westerflier
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Specifying Scala types when calling methods from SparkR

2015-12-09 Thread Shivaram Venkataraman
The SparkR callJMethod can only invoke methods as they show up in the
Java byte code. So in this case you'll need to check the SparkContext
byte code (with javap or something like that) to see how that method
looks. My guess is the type is passed in as a class tag argument, so
you'll need to do something like create a class tag for the
LinearRegressionModel and pass that in as the first or last argument
etc.
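
For example, at the Scala level the call with the ClassTag spelled out
explicitly looks roughly like this (a sketch that assumes a live
SparkContext `sc`, e.g. in spark-shell; the path is illustrative):

import scala.reflect.ClassTag
import org.apache.spark.ml.regression.LinearRegressionModel

// The implicit evidence that objectFile[T] needs, built explicitly from the class:
val tag: ClassTag[LinearRegressionModel] = ClassTag(classOf[LinearRegressionModel])

// Equivalent to sc.objectFile[LinearRegressionModel]("path/to/model", 2), but with
// the ClassTag passed as an ordinary trailing argument -- which is roughly how the
// method appears to a reflective caller such as callJMethod after compilation.
val modelRDD = sc.objectFile("path/to/model", 2)(tag)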

Thanks
Shivaram

On Wed, Dec 9, 2015 at 10:11 AM, Chris Freeman  wrote:
> Hey everyone,
>
> I’m currently looking at ways to save out SparkML model objects from SparkR
> and I’ve had some luck putting the model into an RDD and then saving the RDD
> as an Object File. Once it’s saved, I’m able to load it back in with
> something like:
>
> sc.objectFile[LinearRegressionModel](“path/to/model”)
>
> I’d like to try and replicate this same process from SparkR using the JVM
> backend APIs (e.g. “callJMethod”), but so far I haven’t been able to
> replicate my success and I’m guessing that it’s (at least in part) due to
> the necessity of specifying the type when calling the objectFile method.
>
> Does anyone know if this is actually possible? For example, here’s what I’ve
> come up with so far:
>
> loadModel <- function(sc, modelPath) {
>   modelRDD <- SparkR:::callJMethod(sc,
>
> "objectFile[PipelineModel]",
> modelPath,
> SparkR:::getMinPartitions(sc, NULL))
>   return(modelRDD)
> }
>
> Any help is appreciated!
>
> --
> Chris Freeman
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to add 1.5.2 support to ec2/spark_ec2.py ?

2015-12-01 Thread Shivaram Venkataraman
Yeah - that needs to be changed as well. Could you send a PR to fix this ?

Shivaram

On Tue, Dec 1, 2015 at 12:32 AM, Alexander Pivovarov
<apivova...@gmail.com> wrote:
> Thank you,
> I looked at master branch. I did not realize that it's behind branch-1.5
> BTW, line 54 still has SPARK_EC2_VERSION = "1.5.1"
>
> On Tue, Dec 1, 2015 at 12:22 AM, Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>>
>> Yeah we just need to add 1.5.2 as in
>>
>> https://github.com/apache/spark/commit/97956669053646f00131073358e53b05d0c3d5d0#diff-ada66bbeb2f1327b508232ef6c3805a5
>> to the master branch as well
>>
>> Thanks
>> Shivaram
>>
>>
>>
>> On Mon, Nov 30, 2015 at 11:38 PM, Alexander Pivovarov
>> <apivova...@gmail.com> wrote:
>> > just want to follow up
>> >
>> > On Nov 25, 2015 9:19 PM, "Alexander Pivovarov" <apivova...@gmail.com>
>> > wrote:
>> >>
>> >> Hi Everyone
>> >>
>> >> I noticed that spark ec2 script is outdated.
>> >> How to add 1.5.2 support to ec2/spark_ec2.py?
>> >> What else (except of updating spark version in the script) should be
>> >> done
>> >> to add 1.5.2 support?
>> >>
>> >> We also need to update scala to 2.10.4 (currently it's 2.10.3)
>> >>
>> >> Alex
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to add 1.5.2 support to ec2/spark_ec2.py ?

2015-12-01 Thread Shivaram Venkataraman
Yeah we just need to add 1.5.2 as in
https://github.com/apache/spark/commit/97956669053646f00131073358e53b05d0c3d5d0#diff-ada66bbeb2f1327b508232ef6c3805a5
to the master branch as well

Thanks
Shivaram



On Mon, Nov 30, 2015 at 11:38 PM, Alexander Pivovarov
 wrote:
> just want to follow up
>
> On Nov 25, 2015 9:19 PM, "Alexander Pivovarov"  wrote:
>>
>> Hi Everyone
>>
>> I noticed that spark ec2 script is outdated.
>> How to add 1.5.2 support to ec2/spark_ec2.py?
>> What else (except of updating spark version in the script) should be done
>> to add 1.5.2 support?
>>
>> We also need to update scala to 2.10.4 (currently it's 2.10.3)
>>
>> Alex

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: A proposal for Spark 2.0

2015-11-10 Thread Shivaram Venkataraman
+1

On a related note I think making it lightweight will ensure that we
stay on the current release schedule and don't unnecessarily delay 2.0
to wait for new features / big architectural changes.

In terms of fixes to 1.x, I think our current policy of back-porting
fixes to older releases would still apply. I don't think developing
new features on both 1.x and 2.x makes a lot of sense as we would like
users to switch to 2.x.

Shivaram

On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis  wrote:
> +1 on a lightweight 2.0
>
> What is the thinking around the 1.x line after Spark 2.0 is released? If not
> terminated, how will we determine what goes into each major version line?
> Will 1.x only be for stability fixes?
>
> Thanks,
> Kostas
>
> On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell  wrote:
>>
>> I also feel the same as Reynold. I agree we should minimize API breaks and
>> focus on fixing things around the edge that were mistakes (e.g. exposing
>> Guava and Akka) rather than any overhaul that could fragment the community.
>> Ideally a major release is a lightweight process we can do every couple of
>> years, with minimal impact for users.
>>
>> - Patrick
>>
>> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>>  wrote:
>>>
>>> > For this reason, I would *not* propose doing major releases to break
>>> > substantial API's or perform large re-architecting that prevent users from
>>> > upgrading. Spark has always had a culture of evolving architecture
>>> > incrementally and making changes - and I don't think we want to change 
>>> > this
>>> > model.
>>>
>>> +1 for this. The Python community went through a lot of turmoil over the
>>> Python 2 -> Python 3 transition because the upgrade process was too painful
>>> for too long. The Spark community will benefit greatly from our explicitly
>>> looking to avoid a similar situation.
>>>
>>> > 3. Assembly-free distribution of Spark: don’t require building an
>>> > enormous assembly jar in order to run Spark.
>>>
>>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>>> distribution means.
>>>
>>> Nick
>>>
>>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin  wrote:

 I’m starting a new thread since the other one got intermixed with
 feature requests. Please refrain from making feature request in this 
 thread.
 Not that we shouldn’t be adding features, but we can always add features in
 1.7, 2.1, 2.2, ...

 First - I want to propose a premise for how to think about Spark 2.0 and
 major releases in Spark, based on discussion with several members of the
 community: a major release should be low overhead and minimally disruptive
 to the Spark community. A major release should not be very different from a
 minor release and should not be gated based on new features. The main
 purpose of a major release is an opportunity to fix things that are broken
 in the current API and remove certain deprecated APIs (examples follow).

 For this reason, I would *not* propose doing major releases to break
 substantial API's or perform large re-architecting that prevent users from
 upgrading. Spark has always had a culture of evolving architecture
 incrementally and making changes - and I don't think we want to change this
 model. In fact, we’ve released many architectural changes on the 1.X line.

 If the community likes the above model, then to me it seems reasonable
 to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or 
 immediately
 after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
 major releases every 2 years seems doable within the above model.

 Under this model, here is a list of example things I would propose doing
 in Spark 2.0, separated into APIs and Operation/Deployment:


 APIs

 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
 Spark 1.x.

 2. Remove Akka from Spark’s API dependency (in streaming), so user
 applications can use Akka (SPARK-5293). We have gotten a lot of complaints
 about user applications being unable to use Akka due to Spark’s dependency
 on Akka.

 3. Remove Guava from Spark’s public API (JavaRDD Optional).

 4. Better class package structure for low level developer API’s. In
 particular, we have some DeveloperApi (mostly various listener-related
 classes) added over the years. Some packages include only one or two public
 classes but a lot of private classes. A better structure is to have public
 classes isolated to a few public packages, and these public packages should
 have minimal private classes for low level developer APIs.

 5. Consolidate task metric and accumulator API. Although having some
 subtle differences, these two are very similar but have completely 
 different

Re: Recommended change to core-site.xml template

2015-11-05 Thread Shivaram Venkataraman
Thanks for investigating this. The right place to add these is the
core-site.xml template we have at
https://github.com/amplab/spark-ec2/blob/branch-1.5/templates/root/spark/conf/core-site.xml
and/or 
https://github.com/amplab/spark-ec2/blob/branch-1.5/templates/root/ephemeral-hdfs/conf/core-site.xml

Feel free to open a PR against the amplab/spark-ec2 repository for this.

Thanks
Shivaram

On Thu, Nov 5, 2015 at 8:25 AM, Christian  wrote:
> We ended up reading and writing to S3 a ton in our Spark jobs.
> For this to work, we ended up having to add s3a, and s3 key/secret pairs. We
> also had to add fs.hdfs.impl to get these things to work.
>
> I thought maybe I'd share what we did and it might be worth adding these to
> the spark conf for out of the box functionality with S3.
>
> We created:
> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>
> We changed the contents from the original, adding in the following:
>
>   <property>
>     <name>fs.file.impl</name>
>     <value>org.apache.hadoop.fs.LocalFileSystem</value>
>   </property>
>
>   <property>
>     <name>fs.hdfs.impl</name>
>     <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
>   </property>
>
>   <property>
>     <name>fs.s3.impl</name>
>     <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
>   </property>
>
>   <property>
>     <name>fs.s3.awsAccessKeyId</name>
>     <value>{{aws_access_key_id}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3.awsSecretAccessKey</name>
>     <value>{{aws_secret_access_key}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3n.awsAccessKeyId</name>
>     <value>{{aws_access_key_id}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3n.awsSecretAccessKey</name>
>     <value>{{aws_secret_access_key}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3a.awsAccessKeyId</name>
>     <value>{{aws_access_key_id}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3a.awsSecretAccessKey</name>
>     <value>{{aws_secret_access_key}}</value>
>   </property>
>
> This change makes spark on ec2 work out of the box for us. It took us
> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
> version 2.
>
> Best Regards,
> Christian
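An alternative to baking keys into the template is to set the same properties per job. A minimal PySpark sketch, relying on SparkConf keys prefixed with spark.hadoop. being copied into the Hadoop Configuration; the bucket, path and credential placeholders are assumptions:

    from pyspark import SparkConf, SparkContext

    # Inject the same core-site.xml properties per job via spark.hadoop.* keys.
    conf = (SparkConf()
            .setAppName("s3-read-example")
            .set("spark.hadoop.fs.s3n.awsAccessKeyId", "YOUR_AWS_ACCESS_KEY_ID")
            .set("spark.hadoop.fs.s3n.awsSecretAccessKey", "YOUR_AWS_SECRET_ACCESS_KEY"))
    sc = SparkContext(conf=conf)

    # Read from a (placeholder) S3 bucket using the s3n filesystem configured above.
    rdd = sc.textFile("s3n://my-bucket/some/path/*.txt")
    print(rdd.count())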

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Downloading Hadoop from s3://spark-related-packages/

2015-11-01 Thread Shivaram Venkataraman
I think that getting them from the ASF mirrors is a better strategy in
general as it'll remove the overhead of keeping the S3 bucket up to
date. It works in the spark-ec2 case because we only support a limited
number of Hadoop versions from the tool. FWIW I don't have write
access to the bucket and also haven't heard of any plans to support
newer versions in spark-ec2.

Thanks
Shivaram

On Sun, Nov 1, 2015 at 2:30 AM, Steve Loughran  wrote:
>
> On 1 Nov 2015, at 03:17, Nicholas Chammas 
> wrote:
>
> https://s3.amazonaws.com/spark-related-packages/
>
> spark-ec2 uses this bucket to download and install HDFS on clusters. Is it
> owned by the Spark project or by the AMPLab?
>
> Anyway, it looks like the latest Hadoop install available on there is Hadoop
> 2.4.0.
>
> Are there plans to add newer versions of Hadoop for use by spark-ec2 and
> similar tools, or should we just be getting that stuff via an Apache mirror?
> The latest version is 2.7.1, by the way.
>
>
> you should be grabbing the artifacts off the ASF and then verifying their
> SHA1 checksums as published on the ASF HTTPS web site
>
>
> The problem with the Apache mirrors, if I am not mistaken, is that you
> cannot use a single URL that automatically redirects you to a working mirror
> to download Hadoop. You have to pick a specific mirror and pray it doesn't
> disappear tomorrow.
>
>
> They don't go away, especially http://mirror.ox.ac.uk , and in the us the
> apache.osuosl.org, osu being a where a lot of the ASF servers are kept.
>
> full list with availability stats
>
> http://www.apache.org/mirrors/
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Downloading Hadoop from s3://spark-related-packages/

2015-11-01 Thread Shivaram Venkataraman
I think the lua one at
https://svn.apache.org/repos/asf/infrastructure/site/trunk/content/dyn/closer.lua
has replaced the cgi one from before. It also looks like the lua one
supports `action=download` with a filename argument. So you could
just do something like

wget 
http://www.apache.org/dyn/closer.lua?filename=hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz&action=download
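For scripted installs, roughly the same lookup can be done in Python. The asjson=1 response layout ('preferred' plus a path field) is taken from the descriptions in this thread, so treat the exact field names as assumptions:

    import json
    import urllib.request

    # Ask the ASF mirror resolver which mirror is closest, then download from it.
    artifact = "hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz"
    meta_url = "https://www.apache.org/dyn/closer.cgi/%s?asjson=1" % artifact

    with urllib.request.urlopen(meta_url) as resp:
        meta = json.loads(resp.read().decode("utf-8"))

    # 'preferred' is the closest mirror; 'path_info' echoes the requested path.
    mirror = meta["preferred"].rstrip("/")
    path = meta.get("path_info", artifact).lstrip("/")
    download_url = "%s/%s" % (mirror, path)

    print("Downloading", download_url)
    urllib.request.urlretrieve(download_url, "hadoop-2.7.1.tar.gz")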

Thanks
Shivaram

On Sun, Nov 1, 2015 at 3:18 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> Oh, sweet! For example:
>
> http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz?asjson=1
>
> Thanks for sharing that tip. Looks like you can also use as_json (vs.
> asjson).
>
> Nick
>
>
> On Sun, Nov 1, 2015 at 5:32 PM Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>>
>> On Sun, Nov 1, 2015 at 2:16 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>> > OK, I’ll focus on the Apache mirrors going forward.
>> >
>> > The problem with the Apache mirrors, if I am not mistaken, is that you
>> > cannot use a single URL that automatically redirects you to a working
>> > mirror
>> > to download Hadoop. You have to pick a specific mirror and pray it
>> > doesn’t
>> > disappear tomorrow.
>> >
>> > They don’t go away, especially http://mirror.ox.ac.uk , and in the us
>> > the
>> > apache.osuosl.org, osu being a where a lot of the ASF servers are kept.
>> >
>> > So does Apache offer no way to query a URL and automatically get the
>> > closest
>> > working mirror? If I’m installing HDFS onto servers in various EC2
>> > regions,
>> > the best mirror will vary depending on my location.
>> >
>> Not sure if this is officially documented somewhere but if you pass
>> '&asjson=1' you will get back a JSON which has a 'preferred' field set
>> to the closest mirror.
>>
>> Shivaram
>> > Nick
>> >
>> >
>> > On Sun, Nov 1, 2015 at 12:25 PM Shivaram Venkataraman
>> > <shiva...@eecs.berkeley.edu> wrote:
>> >>
>> >> I think that getting them from the ASF mirrors is a better strategy in
>> >> general as it'll remove the overhead of keeping the S3 bucket up to
>> >> date. It works in the spark-ec2 case because we only support a limited
>> >> number of Hadoop versions from the tool. FWIW I don't have write
>> >> access to the bucket and also haven't heard of any plans to support
>> >> newer versions in spark-ec2.
>> >>
>> >> Thanks
>> >> Shivaram
>> >>
>> >> On Sun, Nov 1, 2015 at 2:30 AM, Steve Loughran <ste...@hortonworks.com>
>> >> wrote:
>> >> >
>> >> > On 1 Nov 2015, at 03:17, Nicholas Chammas
>> >> > <nicholas.cham...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > https://s3.amazonaws.com/spark-related-packages/
>> >> >
>> >> > spark-ec2 uses this bucket to download and install HDFS on clusters.
>> >> > Is
>> >> > it
>> >> > owned by the Spark project or by the AMPLab?
>> >> >
>> >> > Anyway, it looks like the latest Hadoop install available on there is
>> >> > Hadoop
>> >> > 2.4.0.
>> >> >
>> >> > Are there plans to add newer versions of Hadoop for use by spark-ec2
>> >> > and
>> >> > similar tools, or should we just be getting that stuff via an Apache
>> >> > mirror?
>> >> > The latest version is 2.7.1, by the way.
>> >> >
>> >> >
>> >> > you should be grabbing the artifacts off the ASF and then verifying
>> >> > their
>> >> > SHA1 checksums as published on the ASF HTTPS web site
>> >> >
>> >> >
>> >> > The problem with the Apache mirrors, if I am not mistaken, is that
>> >> > you
>> >> > cannot use a single URL that automatically redirects you to a working
>> >> > mirror
>> >> > to download Hadoop. You have to pick a specific mirror and pray it
>> >> > doesn't
>> >> > disappear tomorrow.
>> >> >
>> >> >
>> >> > They don't go away, especially http://mirror.ox.ac.uk , and in the us
>> >> > the
>> >> > apache.osuosl.org, osu being a where a lot of the ASF servers are
>> >> > kept.
>> >> >
>> >> > full list with availability stats
>> >> >
>> >> > http://www.apache.org/mirrors/
>> >> >
>> >> >

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Downloading Hadoop from s3://spark-related-packages/

2015-11-01 Thread Shivaram Venkataraman
On Sun, Nov 1, 2015 at 2:16 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> OK, I’ll focus on the Apache mirrors going forward.
>
> The problem with the Apache mirrors, if I am not mistaken, is that you
> cannot use a single URL that automatically redirects you to a working mirror
> to download Hadoop. You have to pick a specific mirror and pray it doesn’t
> disappear tomorrow.
>
> They don’t go away, especially http://mirror.ox.ac.uk , and in the us the
> apache.osuosl.org, osu being a where a lot of the ASF servers are kept.
>
> So does Apache offer no way to query a URL and automatically get the closest
> working mirror? If I’m installing HDFS onto servers in various EC2 regions,
> the best mirror will vary depending on my location.
>
Not sure if this is officially documented somewhere but if you pass
'&asjson=1' you will get back a JSON which has a 'preferred' field set
to the closest mirror.

Shivaram
> Nick
>
>
> On Sun, Nov 1, 2015 at 12:25 PM Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>>
>> I think that getting them from the ASF mirrors is a better strategy in
>> general as it'll remove the overhead of keeping the S3 bucket up to
>> date. It works in the spark-ec2 case because we only support a limited
>> number of Hadoop versions from the tool. FWIW I don't have write
>> access to the bucket and also haven't heard of any plans to support
>> newer versions in spark-ec2.
>>
>> Thanks
>> Shivaram
>>
>> On Sun, Nov 1, 2015 at 2:30 AM, Steve Loughran <ste...@hortonworks.com>
>> wrote:
>> >
>> > On 1 Nov 2015, at 03:17, Nicholas Chammas <nicholas.cham...@gmail.com>
>> > wrote:
>> >
>> > https://s3.amazonaws.com/spark-related-packages/
>> >
>> > spark-ec2 uses this bucket to download and install HDFS on clusters. Is
>> > it
>> > owned by the Spark project or by the AMPLab?
>> >
>> > Anyway, it looks like the latest Hadoop install available on there is
>> > Hadoop
>> > 2.4.0.
>> >
>> > Are there plans to add newer versions of Hadoop for use by spark-ec2 and
>> > similar tools, or should we just be getting that stuff via an Apache
>> > mirror?
>> > The latest version is 2.7.1, by the way.
>> >
>> >
>> > you should be grabbing the artifacts off the ASF and then verifying
>> > their
>> > SHA1 checksums as published on the ASF HTTPS web site
>> >
>> >
>> > The problem with the Apache mirrors, if I am not mistaken, is that you
>> > cannot use a single URL that automatically redirects you to a working
>> > mirror
>> > to download Hadoop. You have to pick a specific mirror and pray it
>> > doesn't
>> > disappear tomorrow.
>> >
>> >
>> > They don't go away, especially http://mirror.ox.ac.uk , and in the us
>> > the
>> > apache.osuosl.org, osu being a where a lot of the ASF servers are kept.
>> >
>> > full list with availability stats
>> >
>> > http://www.apache.org/mirrors/
>> >
>> >

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkR package path

2015-09-24 Thread Shivaram Venkataraman
I don't think the crux of the problem is about users who download the
source -- Spark's source distribution is clearly marked as something
that needs to be built and they can run `mvn -DskipTests -Psparkr
package` based on instructions in the Spark docs.

The crux of the problem is that with a source or binary R package, the
client-side SparkR code needs the Spark JARs to be available. So
we can't connect to a remote Spark cluster using just the R
scripts, as we need the Scala classes around to create a Spark context
etc.

But this is a use case that I've heard from a lot of users -- my take
is that this should be a separate package / layer on top of SparkR.
Dan Putler (cc'd) had a proposal for a client package for this and
may be able to add more.

Thanks
Shivaram

On Thu, Sep 24, 2015 at 11:36 AM, Hossein <fal...@gmail.com> wrote:
> Requiring users to download entire Spark distribution to connect to a remote
> cluster (which is already running Spark) seems an over kill. Even for most
> spark users who download Spark source, it is very unintuitive that they need
> to run a script named "install-dev.sh" before they can run SparkR.
>
> --Hossein
>
> On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui <rui@intel.com> wrote:
>>
>> SparkR package is not a standalone R package, as it is actually R API of
>> Spark and needs to co-operate with a matching version of Spark, so exposing
>> it in CRAN does not ease use of R users as they need to download matching
>> Spark distribution, unless we expose a bundled SparkR package to CRAN
>> (packaging with Spark), is this desirable? Actually, for normal users who
>> are not developers, they are not required to download Spark source, build
>> and install SparkR package. They just need to download a Spark distribution,
>> and then use SparkR.
>>
>>
>>
>> For using SparkR in Rstudio, there is a documentation at
>> https://github.com/apache/spark/tree/master/R
>>
>>
>>
>>
>>
>>
>>
>> From: Hossein [mailto:fal...@gmail.com]
>> Sent: Thursday, September 24, 2015 1:42 AM
>> To: shiva...@eecs.berkeley.edu
>> Cc: Sun, Rui; dev@spark.apache.org
>> Subject: Re: SparkR package path
>>
>>
>>
>> Yes, I think exposing SparkR in CRAN can significantly expand the reach of
>> both SparkR and Spark itself to a larger community of data scientists (and
>> statisticians).
>>
>>
>>
>> I have been getting questions on how to use SparkR in RStudio. Most of
>> these folks have a Spark Cluster and wish to talk to it from RStudio. While
>> that is a bigger task, for now, first step could be not requiring them to
>> download Spark source and run a script that is named install-dev.sh. I filed
>> SPARK-10776 to track this.
>>
>>
>>
>>
>> --Hossein
>>
>>
>>
>> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman
>> <shiva...@eecs.berkeley.edu> wrote:
>>
>> As Rui says it would be good to understand the use case we want to
>> support (supporting CRAN installs could be one for example). I don't
>> think it should be very hard to do as the RBackend itself doesn't use
>> the R source files. The RRDD does use it and the value comes from
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
>> AFAIK -- So we could introduce a new config flag that can be used for
>> this new mode.
>>
>> Thanks
>> Shivaram
>>
>>
>> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui <rui@intel.com> wrote:
>> > Hossein,
>> >
>> >
>> >
>> > Any strong reason to download and install SparkR source package
>> > separately
>> > from the Spark distribution?
>> >
>> > An R user can simply download the spark distribution, which contains
>> > SparkR
>> > source and binary package, and directly use sparkR. No need to install
>> > SparkR package at all.
>> >
>> >
>> >
>> > From: Hossein [mailto:fal...@gmail.com]
>> > Sent: Tuesday, September 22, 2015 9:19 AM
>> > To: dev@spark.apache.org
>> > Subject: SparkR package path
>> >
>> >
>> >
>> > Hi dev list,
>> >
>> >
>> >
>> > SparkR backend assumes SparkR source files are located under
>> > "SPARK_HOME/R/lib/." This directory is created by running
>> > R/install-dev.sh.
>> > This setting makes sense for Spark developers, but if an R user
>> > downloads
>> > and installs the SparkR source package, the source files are going to be
>> > placed in different locations.
>> >
>> >
>> >
>> > In the R runtime it is easy to find location of package files using
>> > path.package("SparkR"). But we need to make some changes to R backend
>> > and/or
>> > spark-submit so that, JVM process learns the location of worker.R and
>> > daemon.R and shell.R from the R runtime.
>> >
>> >
>> >
>> > Do you think this change is feasible?
>> >
>> >
>> >
>> > Thanks,
>> >
>> > --Hossein
>>
>>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkR package path

2015-09-22 Thread Shivaram Venkataraman
As Rui says it would be good to understand the use case we want to
support (supporting CRAN installs could be one for example). I don't
think it should be very hard to do as the RBackend itself doesn't use
the R source files. The RRDD does use it and the value comes from
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
AFAIK -- So we could introduce a new config flag that can be used for
this new mode.

Thanks
Shivaram

On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui  wrote:
> Hossein,
>
>
>
> Any strong reason to download and install SparkR source package separately
> from the Spark distribution?
>
> An R user can simply download the spark distribution, which contains SparkR
> source and binary package, and directly use sparkR. No need to install
> SparkR package at all.
>
>
>
> From: Hossein [mailto:fal...@gmail.com]
> Sent: Tuesday, September 22, 2015 9:19 AM
> To: dev@spark.apache.org
> Subject: SparkR package path
>
>
>
> Hi dev list,
>
>
>
> SparkR backend assumes SparkR source files are located under
> "SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh.
> This setting makes sense for Spark developers, but if an R user downloads
> and installs the SparkR source package, the source files are going to be
> placed in different locations.
>
>
>
> In the R runtime it is easy to find location of package files using
> path.package("SparkR"). But we need to make some changes to R backend and/or
> spark-submit so that, JVM process learns the location of worker.R and
> daemon.R and shell.R from the R runtime.
>
>
>
> Do you think this change is feasible?
>
>
>
> Thanks,
>
> --Hossein

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkR streaming source code

2015-09-16 Thread Shivaram Venkataraman
I think Hao posted a link to the source code in the description of
https://issues.apache.org/jira/browse/SPARK-6803

On Wed, Sep 16, 2015 at 10:06 AM, Reynold Xin  wrote:
> You should reach out to the speakers directly.
>
>
> On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong  wrote:
>>
>> SparkR streaming is mentioned at about page 17 in below pdf, can anyone
>> share source code? (could not find it on GitHub)
>>
>>
>>
>> https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-19-Hao-Lin-Haichuan-Wang.pdf
>>
>>
>> Thanks,
>>
>> Renyi.
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkR driver side JNI

2015-09-11 Thread Shivaram Venkataraman
It's possible -- in the sense that a lot of designs are possible. But
AFAIK there are no clean interfaces for getting all the arguments /
SparkConf options from spark-submit, and it's all the trickier to
handle scenarios where the first JVM has already created a
SparkContext that you want to use from R. The inter-process
communication is cleaner, pretty lightweight and handles all the
scenarios.

Thanks
Shivaram

On Fri, Sep 11, 2015 at 10:54 AM, Renyi Xiong <renyixio...@gmail.com> wrote:
> forgot to reply all.
>
> I see. But what prevents e.g. the R driver from getting those command line arguments
> from spark-submit and setting them with SparkConf on the R driver's in-process
> JVM through JNI?
>
> On Thu, Sep 10, 2015 at 9:29 PM, Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>>
>> Yeah in addition to the downside of having 2 JVMs the command line
>> arguments and SparkConf etc. will be set by spark-submit in the first
>> JVM which won't be available in the second JVM.
>>
>> Shivaram
>>
>> On Thu, Sep 10, 2015 at 5:18 PM, Renyi Xiong <renyixio...@gmail.com>
>> wrote:
>> > for 2nd case where JVM comes up first, we also can launch in-process JNI
>> > just like inter-process mode, correct? (difference is that a 2nd JVM
>> > gets
>> > loaded)
>> >
>> > On Thu, Aug 6, 2015 at 9:51 PM, Shivaram Venkataraman
>> > <shiva...@eecs.berkeley.edu> wrote:
>> >>
>> >> The in-process JNI only works out when the R process comes up first
>> >> and we launch a JVM inside it. In many deploy modes like YARN (or
>> >> actually in anything using spark-submit) the JVM comes up first and we
>> >> launch R after that. Using an inter-process solution helps us cover
>> >> both use cases
>> >>
>> >> Thanks
>> >> Shivaram
>> >>
>> >> On Thu, Aug 6, 2015 at 8:33 PM, Renyi Xiong <renyixio...@gmail.com>
>> >> wrote:
>> >> > why SparkR chose to uses inter-process socket solution eventually on
>> >> > driver
>> >> > side instead of in-process JNI showed in one of its doc's below
>> >> > (about
>> >> > page
>> >> > 20)?
>> >> >
>> >> >
>> >> >
>> >> > https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf
>> >> >
>> >> >
>> >
>> >
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-20 Thread Shivaram Venkataraman
FYI

The staging repository published as version 1.5.0 is at
https://repository.apache.org/content/repositories/orgapachespark-1136
while the staging repository published as version 1.5.0-rc1 is at
https://repository.apache.org/content/repositories/orgapachespark-1137

Thanks
Shivaram

On Thu, Aug 20, 2015 at 9:37 PM, Reynold Xin r...@databricks.com wrote:
 Please vote on releasing the following candidate as Apache Spark version
 1.5.0!

 The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.5.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/


 The tag to be voted on is v1.5.0-rc1:
 https://github.com/apache/spark/tree/4c56ad772637615cc1f4f88d619fac6c372c8552

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1137/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-docs/


 ===
 == How can I help test this release? ==
 ===
 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.


 
 == What justifies a -1 vote for this release? ==
 
 This vote is happening towards the end of the 1.5 QA period, so -1 votes
 should only occur for significant regressions from 1.4. Bugs already present
 in 1.4, minor regressions, or bugs related to new features will not block
 this release.


 ===
 == What should happen to JIRA tickets still targeting 1.5.0? ==
 ===
 1. It is OK for documentation patches to target 1.5.0 and still go into
 branch-1.5, since documentations will be packaged separately from the
 release.
 2. New features for non-alpha-modules should target 1.6+.
 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
 version.


 ==
 == Major changes to help you focus your testing ==
 ==
 As of today, Spark 1.5 contains more than 1000 commits from 220+
 contributors. I've curated a list of important changes for 1.5. For the
 complete list, please refer to Apache JIRA changelog.

 RDD/DataFrame/SQL APIs

 - New UDAF interface
 - DataFrame hints for broadcast join
 - expr function for turning a SQL expression into DataFrame column
 - Improved support for NaN values
 - StructType now supports ordering
 - TimestampType precision is reduced to 1us
 - 100 new built-in expressions, including date/time, string, math
 - memory and local disk only checkpointing

 DataFrame/SQL Backend Execution

 - Code generation on by default
 - Improved join, aggregation, shuffle, sorting with cache friendly
 algorithms and external algorithms
 - Improved window function performance
 - Better metrics instrumentation and reporting for DF/SQL execution plans

 Data Sources, Hive, Hadoop, Mesos and Cluster Management

 - Dynamic allocation support in all resource managers (Mesos, YARN,
 Standalone)
 - Improved Mesos support (framework authentication, roles, dynamic
 allocation, constraints)
 - Improved YARN support (dynamic allocation with preferred locations)
 - Improved Hive support (metastore partition pruning, metastore connectivity
 to 0.13 to 1.2, internal Hive upgrade to 1.2)
 - Support persisting data in Hive compatible format in metastore
 - Support data partitioning for JSON data sources
 - Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata
 discovery and schema merging, support reading non-standard legacy Parquet
 files generated by other libraries)
 - Faster and more robust dynamic partition insert
 - DataSourceRegister interface for external data sources to specify short
 names

 SparkR

 - YARN cluster mode in R
 - GLMs with R formula, binomial/Gaussian families, and elastic-net
 regularization
 - Improved error messages
 - Aliases to make DataFrame functions more R-like

 Streaming

 - Backpressure for handling bursty input streams.
 - Improved Python support for streaming sources (Kafka offsets, Kinesis,
 MQTT, Flume)
 - Improved Python streaming machine learning algorithms (K-Means, linear
 regression, logistic regression)
 - Native reliable Kinesis stream support
 - Input metadata like Kafka offsets 

Re: [ANNOUNCE] Nightly maven and package builds for Spark

2015-08-17 Thread Shivaram Venkataraman
This should be fixed now. I just triggered a manual build and the
latest binaries are at
http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-1.5.0-SNAPSHOT-2015_08_17_00_36-3ff81ad-bin/

Thanks
Shivaram

On Mon, Aug 17, 2015 at 12:26 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
 thx for this, let me know if you need help

 2015-08-16 23:38 GMT+02:00 Shivaram Venkataraman
 shiva...@eecs.berkeley.edu:

 I just investigated this and this is happening because of a Maven
 version requirement not being met. I'll look at modifying the build
 scripts to use Maven 3.3.3 (with build/mvn --force ?)

 Shivaram

 On Sun, Aug 16, 2015 at 10:16 AM, Olivier Girardot
 o.girar...@lateral-thoughts.com wrote:
  Hi Patrick,
  is there any way for the nightly build to include common distributions
  like
  : with/without hive/yarn support, hadoop 2.4, 2.6 etc... ?
  For now it seems that the nightly binary package builds actually ships
  only
  the source ?
  I can help on that too if you want,
 
  Regards,
 
  Olivier.
 
  2015-08-02 5:19 GMT+02:00 Bharath Ravi Kumar reachb...@gmail.com:
 
  Thanks for fixing it.
 
  On Sun, Aug 2, 2015 at 3:17 AM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Hey All,
 
  I got it up and running - it was a newly surfaced bug in the build
  scripts.
 
  - Patrick
 
  On Wed, Jul 29, 2015 at 6:05 AM, Bharath Ravi Kumar
  reachb...@gmail.com
  wrote:
   Hey Patrick,
  
   Any update on this front please?
  
   Thanks,
   Bharath
  
   On Fri, Jul 24, 2015 at 8:38 PM, Patrick Wendell
   pwend...@gmail.com
   wrote:
  
   Hey Bharath,
  
   There was actually an incompatible change to the build process that
   broke several of the Jenkins builds. This should be patched up in
   the
   next day or two and nightly builds will resume.
  
   - Patrick
  
   On Fri, Jul 24, 2015 at 12:51 AM, Bharath Ravi Kumar
   reachb...@gmail.com wrote:
I noticed the last (1.5) build has a timestamp of 16th July. Have
nightly
builds been discontinued since then?
   
Thanks,
Bharath
   
On Sun, May 24, 2015 at 1:11 PM, Patrick Wendell
pwend...@gmail.com
wrote:
   
Hi All,
   
This week I got around to setting up nightly builds for Spark on
Jenkins. I'd like feedback on these and if it's going well I can
merge
the relevant automation scripts into Spark mainline and document
it
on
the website. Right now I'm doing:
   
1. SNAPSHOT's of Spark master and release branches published to
ASF
Maven snapshot repo:
   
   
   
   
   
https://repository.apache.org/content/repositories/snapshots/org/apache/spark/
   
These are usable by adding this repository in your build and
using
a
snapshot version (e.g. 1.3.2-SNAPSHOT).
   
2. Nightly binary package builds and doc builds of master and
release
versions.
   
http://people.apache.org/~pwendell/spark-nightly/
   
These build 4 times per day and are tagged based on commits.
   
If anyone has feedback on these please let me know.
   
Thanks!
- Patrick
   
   
   
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
   
   
  
  
 
 
 
 
 
  --
  Olivier Girardot | Associé
  o.girar...@lateral-thoughts.com
  +33 6 24 09 17 94




 --
 Olivier Girardot | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [ANNOUNCE] Nightly maven and package builds for Spark

2015-08-16 Thread Shivaram Venkataraman
I just investigated this and this is happening because of a Maven
version requirement not being met. I'll look at modifying the build
scripts to use Maven 3.3.3 (with build/mvn --force ?)

Shivaram

On Sun, Aug 16, 2015 at 10:16 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
 Hi Patrick,
 is there any way for the nightly build to include common distributions like
 : with/without hive/yarn support, hadoop 2.4, 2.6 etc... ?
 For now it seems that the nightly binary package builds actually ships only
 the source ?
 I can help on that too if you want,

 Regards,

 Olivier.

 2015-08-02 5:19 GMT+02:00 Bharath Ravi Kumar reachb...@gmail.com:

 Thanks for fixing it.

 On Sun, Aug 2, 2015 at 3:17 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey All,

 I got it up and running - it was a newly surfaced bug in the build
 scripts.

 - Patrick

 On Wed, Jul 29, 2015 at 6:05 AM, Bharath Ravi Kumar reachb...@gmail.com
 wrote:
  Hey Patrick,
 
  Any update on this front please?
 
  Thanks,
  Bharath
 
  On Fri, Jul 24, 2015 at 8:38 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Hey Bharath,
 
  There was actually an incompatible change to the build process that
  broke several of the Jenkins builds. This should be patched up in the
  next day or two and nightly builds will resume.
 
  - Patrick
 
  On Fri, Jul 24, 2015 at 12:51 AM, Bharath Ravi Kumar
  reachb...@gmail.com wrote:
   I noticed the last (1.5) build has a timestamp of 16th July. Have
   nightly
   builds been discontinued since then?
  
   Thanks,
   Bharath
  
   On Sun, May 24, 2015 at 1:11 PM, Patrick Wendell
   pwend...@gmail.com
   wrote:
  
   Hi All,
  
   This week I got around to setting up nightly builds for Spark on
   Jenkins. I'd like feedback on these and if it's going well I can
   merge
   the relevant automation scripts into Spark mainline and document it
   on
   the website. Right now I'm doing:
  
   1. SNAPSHOT's of Spark master and release branches published to ASF
   Maven snapshot repo:
  
  
  
  
   https://repository.apache.org/content/repositories/snapshots/org/apache/spark/
  
   These are usable by adding this repository in your build and using
   a
   snapshot version (e.g. 1.3.2-SNAPSHOT).
  
   2. Nightly binary package builds and doc builds of master and
   release
   versions.
  
   http://people.apache.org/~pwendell/spark-nightly/
  
   These build 4 times per day and are tagged based on commits.
  
   If anyone has feedback on these please let me know.
  
   Thanks!
   - Patrick
  
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
 
 





 --
 Olivier Girardot | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkR DataFrame fail to return data of Decimal type

2015-08-14 Thread Shivaram Venkataraman
Thanks for the catch. Could you send a PR with this diff ?

On Fri, Aug 14, 2015 at 10:30 AM, Shkurenko, Alex ashkure...@enova.com wrote:
 Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897,
 but with the Decimal datatype coming from a Postgres DB:

 //Set up SparkR

Sys.setenv(SPARK_HOME="/Users/ashkurenko/work/git_repos/spark")
Sys.setenv(SPARKR_SUBMIT_ARGS="--driver-class-path
 ~/Downloads/postgresql-9.4-1201.jdbc4.jar sparkr-shell")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="local")

 // Connect to a Postgres DB via JDBC
sqlContext <- sparkRSQL.init(sc)
sql(sqlContext, "
 CREATE TEMPORARY TABLE mytable
 USING org.apache.spark.sql.jdbc
 OPTIONS (url 'jdbc:postgresql://servername:5432/dbname'
 ,dbtable 'mydbtable'
 )
 ")

 // Try pulling a Decimal column from a table
myDataFrame <- sql(sqlContext, "select a_decimal_column from mytable ")

 // The schema shows up fine

show(myDataFrame)

 DataFrame[a_decimal_column:decimal(10,0)]

schema(myDataFrame)

 StructType
 |-name = "a_decimal_column", type = "DecimalType(10,0)", nullable = TRUE

 // ... but pulling data fails:

 localDF <- collect(myDataFrame)

 Error in as.data.frame.default(x[[i]], optional = TRUE) :
   cannot coerce class "jobj" to a data.frame


 ---
 Proposed fix:

 diff --git a/core/src/main/scala/org/apache/spark/api/r/SerDe.scala
 b/core/src/main/scala/org/apache/spark/api/r/SerDe.scala
 index d5b4260..b77ae2a 100644
 --- a/core/src/main/scala/org/apache/spark/api/r/SerDe.scala
 +++ b/core/src/main/scala/org/apache/spark/api/r/SerDe.scala
 @@ -219,6 +219,9 @@ private[spark] object SerDe {
     case "float" | "java.lang.Float" =>
       writeType(dos, "double")
       writeDouble(dos, value.asInstanceOf[Float].toDouble)
 +   case "decimal" | "java.math.BigDecimal" =>
 +     writeType(dos, "double")
 +     writeDouble(dos,
 scala.math.BigDecimal(value.asInstanceOf[java.math.BigDecimal]).toDouble)
     case "double" | "java.lang.Double" =>
       writeType(dos, "double")
       writeDouble(dos, value.asInstanceOf[Double])

 Thanks,
 Alex

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkR driver side JNI

2015-08-06 Thread Shivaram Venkataraman
The in-process JNI only works out when the R process comes up first
and we launch a JVM inside it. In many deploy modes like YARN (or
actually in anything using spark-submit) the JVM comes up first and we
launch R after that. Using an inter-process solution helps us cover
both use cases

Thanks
Shivaram

On Thu, Aug 6, 2015 at 8:33 PM, Renyi Xiong renyixio...@gmail.com wrote:
 why did SparkR eventually choose to use an inter-process socket solution on the driver
 side instead of the in-process JNI shown in one of its docs below (about page
 20)?

 https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Why SparkR didn't reuse PythonRDD

2015-08-06 Thread Shivaram Venkataraman
PythonRDD.scala has a number of PySpark specific conventions (for
example worker reuse, exceptions etc.) and PySpark specific protocols
(e.g. for communicating accumulators, broadcasts between the JVM and
Python etc.). While it might be possible to refactor the two classes
to share some more code I don't think its worth making the code more
complex in order to do that.

Thanks
Shivaram

On Thu, Aug 6, 2015 at 1:27 AM, Daniel Li daniell...@gmail.com wrote:
 On behalf of Renyi Xiong -

 When reading the Spark codebase, it looks to me like PythonRDD.scala is reusable, so I
 wonder why SparkR chose to implement its own RRDD.scala?

 thanks
 Daniel

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Should spark-ec2 get its own repo?

2015-08-03 Thread Shivaram Venkataraman
I sent a note to the Mesos developers and created
https://github.com/apache/spark/pull/7899 to change the repository
pointer. There are 3-4 open PRs right now in the mesos/spark-ec2
repository and I'll work on migrating them to amplab/spark-ec2 later
today.

My thought on moving the Python script is that we should have a
wrapper shell script that just fetches the latest version of
spark_ec2.py for the corresponding Spark branch. We already have
separate branches in our spark-ec2 repository for different Spark
versions so it can just be a call to `wget
https://github.com/amplab/spark-ec2/tree/spark-version/driver/spark_ec2.py`.
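A rough sketch of such a wrapper, written in Python for illustration; the raw-URL layout and the driver/spark_ec2.py location are assumptions based on the description above:

    import runpy
    import sys
    import urllib.request

    # Fetch the spark_ec2.py that matches a given Spark branch, then run it,
    # forwarding any remaining command-line arguments to the real script.
    branch = sys.argv[1] if len(sys.argv) > 1 else "branch-1.5"
    raw_url = ("https://raw.githubusercontent.com/amplab/spark-ec2/%s/driver/spark_ec2.py"
               % branch)
    urllib.request.urlretrieve(raw_url, "spark_ec2.py")

    sys.argv = ["spark_ec2.py"] + sys.argv[2:]
    runpy.run_path("spark_ec2.py", run_name="__main__")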

Thanks
Shivaram

On Sun, Aug 2, 2015 at 11:34 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 On Sat, Aug 1, 2015 at 1:09 PM Matt Goodman meawo...@gmail.com wrote:

 I am considering porting some of this to a more general spark-cloud
 launcher, including google/aliyun/rackspace.  It shouldn't be hard at all
 given the current approach for setup/install.


 FWIW, there are already some tools for launching Spark clusters on GCE and
 Azure:

 http://spark-packages.org/?q=tags%3A%22Deployment%22

 Nick


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Moving spark-ec2 to amplab github organization

2015-08-03 Thread Shivaram Venkataraman
Hi Mesos developers

The Apache Spark project has been using
https://github.com/mesos/spark-ec2 as a supporting repository for some
of our EC2 scripts. This is a remnant from the days when the Spark
project itself was hosted at github.com/mesos/spark. Based on
discussions in the Spark Developer mailing list [1], we plan to move
the repository to github.com/amplab/spark-ec2 to enable a better
development workflow. As these scripts are not used by the Apache
Mesos project I don’t think any action is required from the Mesos
developers, but please let me know if you have any thoughts about
this.

Thanks
Shivaram

[1] 
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Should-spark-ec2-get-its-own-repo-td13151.html

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Should spark-ec2 get its own repo?

2015-07-31 Thread Shivaram Venkataraman
Yes - It is still in progress, but I have just not gotten time to get to
this. I think getting the repo moved from mesos to amplab in the codebase
by 1.5 should be possible.

Thanks
Shivaram

On Fri, Jul 31, 2015 at 3:08 AM, Sean Owen so...@cloudera.com wrote:

 PS is this still in progress? it feels like something that would be
 good to do before 1.5.0, if it's going to happen soon.

 On Wed, Jul 22, 2015 at 6:59 AM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  Yeah I'll send a note to the mesos dev list just to make sure they are
  informed.
 
  Shivaram
 
  On Tue, Jul 21, 2015 at 11:47 AM, Sean Owen so...@cloudera.com wrote:
 
  I agree it's worth informing Mesos devs and checking that there are no
  big objections. I presume Shivaram is plugged in enough to Mesos that
  there won't be any surprises there, and that the project would also
  agree with moving this Spark-specific bit out. they may also want to
  leave a pointer to the new location in the mesos repo of course.
 
  I don't think it is something that requires a formal vote. It's not a
  question of ownership -- neither Apache nor the project PMC owns the
  code. I don't think it's different from retiring or removing any other
  code.
 
 
 
 
 
  On Tue, Jul 21, 2015 at 7:03 PM, Mridul Muralidharan mri...@gmail.com
  wrote:
   If I am not wrong, since the code was hosted within mesos project
   repo, I assume (atleast part of it) is owned by mesos project and so
   its PMC ?
  
   - Mridul
  
   On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman
   shiva...@eecs.berkeley.edu wrote:
   There is technically no PMC for the spark-ec2 project (I guess we are
   kind
   of establishing one right now). I haven't heard anything from the
 Spark
   PMC
   on the dev list that might suggest a need for a vote so far. I will
   send
   another round of email notification to the dev list when we have a
 JIRA
   / PR
   that actually moves the scripts (right now the only thing that
 changed
   is
   the location of some scripts in mesos/ to amplab/).
  
   Thanks
   Shivaram
  
 
 



Re: Should spark-ec2 get its own repo?

2015-07-22 Thread Shivaram Venkataraman
Yeah I'll send a note to the mesos dev list just to make sure they are
informed.

Shivaram

On Tue, Jul 21, 2015 at 11:47 AM, Sean Owen so...@cloudera.com wrote:

 I agree it's worth informing Mesos devs and checking that there are no
 big objections. I presume Shivaram is plugged in enough to Mesos that
 there won't be any surprises there, and that the project would also
 agree with moving this Spark-specific bit out. they may also want to
 leave a pointer to the new location in the mesos repo of course.

 I don't think it is something that requires a formal vote. It's not a
 question of ownership -- neither Apache nor the project PMC owns the
 code. I don't think it's different from retiring or removing any other
 code.





 On Tue, Jul 21, 2015 at 7:03 PM, Mridul Muralidharan mri...@gmail.com
 wrote:
  If I am not wrong, since the code was hosted within mesos project
  repo, I assume (atleast part of it) is owned by mesos project and so
  its PMC ?
 
  - Mridul
 
  On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
  There is technically no PMC for the spark-ec2 project (I guess we are
 kind
  of establishing one right now). I haven't heard anything from the Spark
 PMC
  on the dev list that might suggest a need for a vote so far. I will send
  another round of email notification to the dev list when we have a JIRA
 / PR
  that actually moves the scripts (right now the only thing that changed
 is
  the location of some scripts in mesos/ to amplab/).
 
  Thanks
  Shivaram
 



Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Shivaram Venkataraman
There is technically no PMC for the spark-ec2 project (I guess we are kind
of establishing one right now). I haven't heard anything from the Spark PMC
on the dev list that might suggest a need for a vote so far. I will send
another round of email notification to the dev list when we have a JIRA /
PR that actually moves the scripts (right now the only thing that changed
is the location of some scripts in mesos/ to amplab/).

Thanks
Shivaram

On Mon, Jul 20, 2015 at 12:55 PM, Mridul Muralidharan mri...@gmail.com
wrote:

 Might be a good idea to get the PMC's of both projects to sign off to
 prevent future issues with apache.

 Regards,
 Mridul

 On Mon, Jul 20, 2015 at 12:01 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  I've created https://github.com/amplab/spark-ec2 and added an initial
 set of
  committers. Note that this is not a fork of the existing
  github.com/mesos/spark-ec2 and users will need to fork from here. This
 is
  mostly to avoid the base-fork in pull requests being set incorrectly etc.
 
  I'll be migrating some PRs / closing them in the old repo and will also
  update the README in that repo.
 
  Thanks
  Shivaram
 
  On Fri, Jul 17, 2015 at 3:00 PM, Sean Owen so...@cloudera.com wrote:
 
  On Fri, Jul 17, 2015 at 6:58 PM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
   I am not sure why the ASF JIRA can be only used to track one set of
   artifacts that are packaged and released together. I agree that
 marking
   a
   fix version as 1.5 for a change in another repo doesn't make a lot of
   sense,
   but we could just not use fix versions for the EC2 issues ?
 
  *shrug* it just seems harder and less natural to use ASF JIRA. What's
  the benefit? I agree it's not a big deal either way but it's a small
  part of the problem we're solving in the first place. I suspect that
  one way or the other, there would be issues filed both places, so this
  probably isn't worth debating.
 
 
   My concerns are less about it being pushed out etc. For better or
 worse
   we
   have had EC2 scripts be a part of the Spark distribution from a very
   early
   stage (from version 0.5.0 if my git history reading is correct).  So
   users
   will assume that any error with EC2 scripts belong to the Spark
 project.
   In
   addition almost all the contributions to the EC2 scripts come from
 Spark
   developers and so keeping the issues in the same mailing list / JIRA
   seems
   natural. This I guess again relates to the question of managing issues
   for
   code that isn't part of the Spark release artifact.
 
  Yeah good question -- Github doesn't give you a mailing list. I think
  dev@ would still be where it's discussed which is ... again 'part of
  the problem' but as you say, probably beneficial. It's a pretty low
  traffic topic anyway.
 
 
   I'll create the amplab/spark-ec2 repo over the next couple of days
   unless
   there are more comments on this thread. This will at least alleviate
   some of
   the naming confusion over using a repository in mesos and I'll give
   Sean,
   Nick, Matthew commit access to it. I am still not convinced about
 moving
   the
   issues over though.
 
  I won't move the issues. Maybe time tells whether one approach is
  better, or that it just doesn't matter.
 
  However it'd be a great opportunity to review and clear stale EC2
 issues.
 
 



Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Shivaram Venkataraman
That's part of the confusion we are trying to fix here -- the repository
used to live in the mesos github account but was never a part of the Apache
Mesos project. It was a remnant part of Spark from when Spark used to live
at github.com/mesos/spark.

Shivaram

On Tue, Jul 21, 2015 at 11:03 AM, Mridul Muralidharan mri...@gmail.com
wrote:

 If I am not wrong, since the code was hosted within mesos project
 repo, I assume (atleast part of it) is owned by mesos project and so
 its PMC ?

 - Mridul

 On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  There is technically no PMC for the spark-ec2 project (I guess we are
 kind
  of establishing one right now). I haven't heard anything from the Spark
 PMC
  on the dev list that might suggest a need for a vote so far. I will send
  another round of email notification to the dev list when we have a JIRA
 / PR
  that actually moves the scripts (right now the only thing that changed is
  the location of some scripts in mesos/ to amplab/).
 
  Thanks
  Shivaram
 
  On Mon, Jul 20, 2015 at 12:55 PM, Mridul Muralidharan mri...@gmail.com
  wrote:
 
  Might be a good idea to get the PMC's of both projects to sign off to
  prevent future issues with apache.
 
  Regards,
  Mridul
 
  On Mon, Jul 20, 2015 at 12:01 PM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
   I've created https://github.com/amplab/spark-ec2 and added an initial
   set of
   committers. Note that this is not a fork of the existing
   github.com/mesos/spark-ec2 and users will need to fork from here.
 This
   is
   mostly to avoid the base-fork in pull requests being set incorrectly
   etc.
  
   I'll be migrating some PRs / closing them in the old repo and will
 also
   update the README in that repo.
  
   Thanks
   Shivaram
  
   On Fri, Jul 17, 2015 at 3:00 PM, Sean Owen so...@cloudera.com
 wrote:
  
   On Fri, Jul 17, 2015 at 6:58 PM, Shivaram Venkataraman
   shiva...@eecs.berkeley.edu wrote:
I am not sure why the ASF JIRA can be only used to track one set of
artifacts that are packaged and released together. I agree that
marking
a
fix version as 1.5 for a change in another repo doesn't make a lot
 of
sense,
but we could just not use fix versions for the EC2 issues ?
  
   *shrug* it just seems harder and less natural to use ASF JIRA. What's
   the benefit? I agree it's not a big deal either way but it's a small
   part of the problem we're solving in the first place. I suspect that
   one way or the other, there would be issues filed both places, so
 this
   probably isn't worth debating.
  
  
My concerns are less about it being pushed out etc. For better or
worse
we
have had EC2 scripts be a part of the Spark distribution from a
 very
early
stage (from version 0.5.0 if my git history reading is correct).
 So
users
will assume that any error with EC2 scripts belong to the Spark
project.
In
addition almost all the contributions to the EC2 scripts come from
Spark
developers and so keeping the issues in the same mailing list /
 JIRA
seems
natural. This I guess again relates to the question of managing
issues
for
code that isn't part of the Spark release artifact.
  
   Yeah good question -- Github doesn't give you a mailing list. I think
   dev@ would still be where it's discussed which is ... again 'part of
   the problem' but as you say, probably beneficial. It's a pretty low
   traffic topic anyway.
  
  
I'll create the amplab/spark-ec2 repo over the next couple of days
unless
there are more comments on this thread. This will at least
 alleviate
some of
the naming confusion over using a repository in mesos and I'll give
Sean,
Nick, Matthew commit access to it. I am still not convinced about
moving
the
issues over though.
  
   I won't move the issues. Maybe time tells whether one approach is
   better, or that it just doesn't matter.
  
   However it'd be a great opportunity to review and clear stale EC2
   issues.
  
  
 
 



Re: Should spark-ec2 get its own repo?

2015-07-20 Thread Shivaram Venkataraman
I've created https://github.com/amplab/spark-ec2 and added an initial set
of committers. Note that this is not a fork of the existing
github.com/mesos/spark-ec2 and users will need to fork from here. This is
mostly to avoid the base-fork in pull requests being set incorrectly etc.

I'll be migrating some PRs / closing them in the old repo and will also
update the README in that repo.

Thanks
Shivaram

On Fri, Jul 17, 2015 at 3:00 PM, Sean Owen so...@cloudera.com wrote:

 On Fri, Jul 17, 2015 at 6:58 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  I am not sure why the ASF JIRA can be only used to track one set of
  artifacts that are packaged and released together. I agree that marking a
  fix version as 1.5 for a change in another repo doesn't make a lot of
 sense,
  but we could just not use fix versions for the EC2 issues ?

 *shrug* it just seems harder and less natural to use ASF JIRA. What's
 the benefit? I agree it's not a big deal either way but it's a small
 part of the problem we're solving in the first place. I suspect that
 one way or the other, there would be issues filed both places, so this
 probably isn't worth debating.


  My concerns are less about it being pushed out etc. For better or worse
 we
  have had EC2 scripts be a part of the Spark distribution from a very
 early
  stage (from version 0.5.0 if my git history reading is correct).  So
 users
  will assume that any error with EC2 scripts belong to the Spark project.
 In
  addition almost all the contributions to the EC2 scripts come from Spark
  developers and so keeping the issues in the same mailing list / JIRA
 seems
  natural. This I guess again relates to the question of managing issues
 for
  code that isn't part of the Spark release artifact.

 Yeah good question -- Github doesn't give you a mailing list. I think
 dev@ would still be where it's discussed which is ... again 'part of
 the problem' but as you say, probably beneficial. It's a pretty low
 traffic topic anyway.


  I'll create the amplab/spark-ec2 repo over the next couple of days unless
  there are more comments on this thread. This will at least alleviate
 some of
  the naming confusion over using a repository in mesos and I'll give Sean,
  Nick, Matthew commit access to it. I am still not convinced about moving
 the
  issues over though.

 I won't move the issues. Maybe time tells whether one approach is
 better, or that it just doesn't matter.

 However it'd be a great opportunity to review and clear stale EC2 issues.



Re: Model parallelism with RDD

2015-07-17 Thread Shivaram Venkataraman
You can also use checkpoint to truncate the lineage and the data can be
persisted to HDFS. Fundamentally the state of the RDD needs to be saved to
memory or disk if you don't want to repeat the computation.
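A minimal PySpark sketch of the cache-plus-checkpoint pattern described above; the sizes, checkpoint interval and checkpoint directory are placeholders:

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "iterative-model-update")
    sc.setCheckpointDir("/tmp/spark-checkpoints")  # use an HDFS path on a real cluster

    # Initial model: one weight per element, persisted before the loop starts.
    model = sc.parallelize(range(1000000), 4).map(lambda x: 0.1).cache()
    model.count()

    for i in range(10):
        new_model = model.map(lambda w: w * w).cache()
        if i % 3 == 0:
            new_model.checkpoint()   # truncate the lineage every few iterations
        new_model.count()            # materialize (and write the checkpoint)
        model.unpersist()            # safe now that new_model is materialized
        model = new_model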

Thanks
Shivaram

On Thu, Jul 16, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

  Dear Spark developers,



 What happens if the RDD does not fit into memory and cache would not work in
 the code below? Will all previous iterations be repeated on each new iteration
 within the iterative RDD update (as described below)?



 Also, could you clarify regarding DataFrame and GC overhead: does setting
 spark.sql.unsafe.enabled=true
 remove the GC overhead when persisting/unpersisting the DataFrame?
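One way to try the flag from the question above is sketched below in PySpark; the config key is taken verbatim from the mail, and whether it actually reduces GC in a given 1.5 build is exactly the open question:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[2]", "tungsten-persist-test")
    sqlContext = SQLContext(sc)
    sqlContext.setConf("spark.sql.unsafe.enabled", "true")

    # Persist and unpersist a DataFrame while watching GC time in the Spark UI.
    df = sqlContext.range(0, 10000000)
    df.persist()
    print(df.count())
    df.unpersist()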



 Best regards, Alexander



 *From:* Ulanov, Alexander
 *Sent:* Monday, July 13, 2015 11:15 AM
 *To:* shiva...@eecs.berkeley.edu
 *Cc:* dev@spark.apache.org
 *Subject:* RE: Model parallelism with RDD



 Below are the average timings for one iteration of model update with RDD
  (with cache, as Shivaram suggested):

 Model size (RDD[Double].count)   Time per iteration, s
 10M                              0.585336926
 100M                             1.767947506
 1B                               125.6078817



 There is a ~100x increase in time while 10x increase in model size (from
 100 million to 1 billion of Double). More than half of the time is spent in
 GC, and this time varies heavily. Two questions:

 1)Can I use project Tungsten’s unsafe? Actually, can I reduce the GC time
 if I use DataFrame instead of RDD and set the Tungsten key:
 spark.sql.unsafe.enabled=true ?

 2) RDD[Double] of one billion elements is 26.1GB persisted (as Spark UI
 shows). It is around 26 bytes per element. How many bytes is RDD overhead?



 The code:

 val modelSize = 10

 val numIterations = 10

 val parallelism = 5

 var oldRDD = sc.parallelize(1 to modelSize, parallelism).map(x =>
 0.1).cache

 var newRDD = sc.parallelize(1 to 1, parallelism).map(x => 0.1)

 var i = 0

 var avgTime = 0.0

 while (i < numIterations) {

   val t = System.nanoTime()

   val newRDD = oldRDD.map(x => x * x)

   newRDD.cache

   newRDD.count()

   oldRDD.unpersist(true)

   newRDD.mean

   avgTime += (System.nanoTime() - t) / 1e9

   oldRDD = newRDD

   i += 1

 }

 println("Avg iteration time: " + avgTime / numIterations)



 Best regards, Alexander



 *From:* Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu
 shiva...@eecs.berkeley.edu]
 *Sent:* Friday, July 10, 2015 10:04 PM
 *To:* Ulanov, Alexander
 *Cc:* shiva...@eecs.berkeley.edu; dev@spark.apache.org
 *Subject:* Re: Model parallelism with RDD



 Yeah I can see that being the case -- caching implies creating objects
 that will be stored in memory. So there is a trade-off between storing data
 in memory but having to garbage collect it later vs. recomputing the data.



 Shivaram



 On Fri, Jul 10, 2015 at 9:49 PM, Ulanov, Alexander 
 alexander.ula...@hp.com wrote:

 Hi Shivaram,

 Thank you for the suggestion! If I do .cache and .count, each iteration takes
 much more time, most of which is spent in GC. Is that normal?

 On July 10, 2015, at 21:23, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:

 I think you need to do `newRDD.cache()` and `newRDD.count` before you do
 oldRDD.unpersist(true) -- Otherwise it might be recomputing all the
 previous iterations each time.

 Thanks
 Shivaram

 On Fri, Jul 10, 2015 at 7:44 PM, Ulanov, Alexander
 alexander.ula...@hp.com wrote:
 Hi,

 I am interested in how scalable model parallelism can be within Spark.
 Suppose the model contains N weights of type Double and N is so large that it
 does not fit into the memory of a single node. So, we can store the model
 in an RDD[Double] across several nodes. To train the model, one needs to
 perform K iterations that update all the weights and check the convergence.
 Then we also need to exchange some weights between the nodes to synchronize
 the model or update the global state. I’ve sketched the code that does
 iterative updates with an RDD (without the global update yet). Surprisingly, each
 iteration takes more time than the previous one, as shown below (time in seconds).
 Could you suggest what the reason for that is? I’ve checked GC; it only does
 something within a few milliseconds.

 Configuration: Spark 1.4, 1 master and 5 worker nodes, 5 executors, Intel
 Xeon 2.2, 16GB RAM each
 Iteration 0 time:1.127990986
 Iteration 1 time:1.391120414
 Iteration 2 time:1.642969138102
 Iteration 3 time:1.9344402954
 Iteration 4 time:2.207529424697
 Iteration 5 time:2.6328659593
 Iteration 6 time:2.791169049296
 Iteration 7 time:3.0850374104
 Iteration 8 time:3.4031050061
 Iteration 9 time:3.8826580919

 Code:
 val modelSize = 10
 val numIterations = 10
 val parallelizm = 5
 var oldRDD = sc.parallelize(1 to modelSize, parallelizm).map(x => 0.1)
 var newRDD = sc.parallelize(1 to 1, parallelizm).map(x => 0.1)
 var i = 0
 while (i < numIterations) {
   val t = System.nanoTime()
   // updating the weights
   val newRDD = oldRDD.map(x => x * x)

Re: Should spark-ec2 get its own repo?

2015-07-17 Thread Shivaram Venkataraman
Some replies inline

On Wed, Jul 15, 2015 at 1:08 AM, Sean Owen so...@cloudera.com wrote:

 The code can continue to be a good reference implementation, no matter
 where it lives. In fact, it can be a better more complete one, and
 easier to update.

 I agree that ec2/ needs to retain some kind of pointer to the new
 location. Yes, maybe a script as well that does the checkout as you
 say. We have to be careful that the effect here isn't to make people
 think this code is still part of the blessed bits of a Spark release,
 since it isn't. But I suppose the point is that it isn't quite now
 either (isn't tested, isn't fully contained in apache/spark) and
 that's what we're fixing.

 I still don't like the idea of using the ASF JIRA for Spark to track
 issues in a separate project, as these kinds of splits are what we're
 trying to get rid of. I think it's a plus to be able to only bother
 with the Github PR/issue system, and not parallel JIRAs as well. I
 also worry that this blurs the line between code that is formally
 tested and blessed in a Spark release, and that which is not. You fix
 an issue in this separate repo and mark it fixed in Spark 1.5 --
 what does that imply?

 I am not sure why the ASF JIRA can only be used to track one set of
artifacts that are packaged and released together. I agree that marking a
fix version as 1.5 for a change in another repo doesn't make a lot of
sense, but we could just not use fix versions for the EC2 issues?


 I think the issue is people don't like the sense this is getting
 pushed outside the wall, or 'removed' from Spark. On the one hand I
 argue it hasn't really properly been part of Spark -- that's why we
 need this change to happen. But, I also think this is easy to resolve
 other ways: spark-packages.org, the pointer in the repo, prominent
 notes in the wiki, etc.

 My concerns are less about it being pushed out etc. For better or worse we
have had the EC2 scripts be a part of the Spark distribution from a very early
stage (from version 0.5.0 if my git history reading is correct).  So users
will assume that any error with the EC2 scripts belongs to the Spark project. In
addition, almost all the contributions to the EC2 scripts come from Spark
developers and so keeping the issues in the same mailing list / JIRA seems
natural. This I guess again relates to the question of managing issues for
code that isn't part of the Spark release artifact.

I suggest Shivaram owns this, and that amplab/spark-ec2 is used to
 host? I'm not qualified to help make the new copy or administer the repo, but
 I would be happy to help with the rest, like triaging, if you can give
 me rights to open issues.

 I'll create the amplab/spark-ec2 repo over the next couple of days unless
there are more comments on this thread. This will at least alleviate some
of the naming confusion over using a repository in mesos and I'll give
Sean, Nick, Matthew commit access to it. I am still not convinced about
moving the issues over though.

Thanks
Shivaram


Re: Spark Core and ways of talking to it for enhancing application language support

2015-07-14 Thread Shivaram Venkataraman
Both SparkR and the PySpark API call into the JVM Spark API (i.e.
JavaSparkContext, JavaRDD etc.). They use different methods (Py4J vs. the
R-Java bridge) to call into the JVM based on libraries available / features
supported in each language. So for Haskell, one would need to see what is
the best way to call the underlying Java API functions from Haskell and get
results back.

Thanks
Shivaram
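
As an illustration of the point above: regardless of frontend language, a binding ultimately drives the same Java-friendly entry points (JavaSparkContext, JavaRDD) over some bridge. A minimal Scala sketch of what those underlying calls look like, assuming Spark 1.x on the classpath (the bridge transport itself -- Py4J, the R backend, or a hypothetical Haskell FFI layer -- is out of scope here):

  import org.apache.spark.SparkConf
  import org.apache.spark.api.java.JavaSparkContext

  // The Java API is the stable surface that foreign-language frontends call
  // into; the frontend holds opaque handles to JVM objects like JavaRDD.
  val conf = new SparkConf().setAppName("bridge-sketch").setMaster("local[2]")
  val jsc = new JavaSparkContext(conf)

  val rdd = jsc.parallelize(java.util.Arrays.asList[Integer](1, 2, 3, 4))
  // Even a simple action goes through the same JVM-side code paths that a
  // Python or R frontend would invoke remotely.
  println("count = " + rdd.count())
  jsc.stop()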

On Mon, Jul 13, 2015 at 8:51 PM, Vasili I. Galchin vigalc...@gmail.com
wrote:

 Hello,

  So far I think there are at least two ways (maybe more) to interact
 with the Spark Core from various programming languages: the PySpark API
 and the R API. From reading the code it seems that the PySpark approach and the R
 approach are very disparate ... with the latter using the R-Java
 bridge. Regarding Haskell, I am trying to decide which way to
 go. I realize that, like in any open software effort, the approaches have
 varied based on history. Is there an intent to adopt one approach as the
 standard? (Not trying to start a war :-) :-(.)

 Vasili

 BTW I guess Java and Scala APIs are simple given the nature of both
 languages vis-a-vis the JVM??

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Should spark-ec2 get its own repo?

2015-07-13 Thread Shivaram Venkataraman
I think moving the repo-location and re-organizing the python code to
handle dependencies, testing etc. sounds good to me. However, I think there
are a couple of things which I am not sure about

1. I strongly believe that we should preserve the existing command-line interface in
ec2/spark-ec2 (i.e. the shell script, not the python file). This could be a
thin wrapper script that just checks out or downloads something
(similar to, say, build/mvn). Mainly, I see no reason to break the workflow
that users are used to right now.

2. I am also not sure that moving the issue tracker is necessarily a
good idea. I don't think we get a large number of issues due to the EC2
stuff, and if we do have a workflow for launching EC2 clusters, the Spark
JIRA would still be the natural place to report issues related to this.

At a high level I see the spark-ec2 scripts as an effort to provide a
reference implementation for launching EC2 clusters with Apache Spark --
Given this view I am not sure it makes sense to completely decouple this
from the Apache project.

Thanks
Shivaram

On Sun, Jul 12, 2015 at 1:34 AM, Sean Owen so...@cloudera.com wrote:

 I agree with these points. The ec2 support is substantially a separate
 project, and would likely be better managed as one. People can much
 more rapidly iterate on it and release it.

 I suggest:

 1. Pick a new repo location. amplab/spark-ec2 ? spark-ec2/spark-ec2 ?
 2. Add interested parties as owners/contributors
 3. Reassemble a working clone of the current code from spark/ec2 and
 mesos/spark-ec2 and check it in
 4. Announce the new location on user@, dev@
 5. Triage open JIRAs to the new repo's issue tracker and close them
 elsewhere
 6. Remove the old copies of the code and leave a pointer to the new
 location in their place

 I'd also like to hear a few more nods before pulling the trigger though.

 On Sat, Jul 11, 2015 at 7:07 PM, Matt Goodman meawo...@gmail.com wrote:
  I wanted to revive the conversation about the spark-ec2 tools, as it seems
  to have been lost in the 1.4.1 release voting spree.

  I think that splitting it into its own repository is a really good move, and
  I would also be happy to help with this transition, as well as help maintain
  the resulting repository.  Here is my justification for why we ought to do
  this split.

  User Facing:

  - The spark-ec2 launcher doesn't use anything in the parent spark repository.
  - The spark-ec2 version is disjoint from the parent repo.  I consider it
  confusing that the spark-ec2 script doesn't launch the version of Spark it is
  checked out with.
  - Someone interested in setting up spark-ec2 with anything but the default
  configuration will have to clone at least 2 repositories at present, and
  probably fork and push changes to 1.
  - spark-ec2 has mismatched dependencies with respect to Spark itself.  This
  includes a confusing shim in the spark-ec2 script to install boto, which
  frankly should just be a dependency of the script.

  Developer Facing:

  - Support across 2 repos will be worse than across 1.  It's unclear where to
  file issues/PRs, and it requires extra communication for even fairly trivial
  stuff.
  - spark-ec2 also depends on a number of binary blobs being in the right place;
  currently the responsibility for these is decentralized, and likely prone to
  various flavors of dumb.
  - The current flow of booting a spark-ec2 cluster is _complicated_.  I spent
  the better part of a couple of days figuring out how to integrate our custom
  tools into this stack.  This is very hard to fix when commits/PRs need to span
  groups/repositories/buckets-o-binary, and I am sure there are several other
  problems that are languishing under similar roadblocks.
  - It makes testing possible.  The spark-ec2 script is a great case for CI
  given the number of permutations of launch criteria there are.  I suspect
  AWS would be happy to foot the bill on spark-ec2 testing (probably ~20 bucks
  a month based on some envelope sketches), as it is a piece of software that
  directly impacts other people giving them money.  I have some contacts
  there, and I am pretty sure this would be an easy conversation, particularly
  if the repo is directly concerned with EC2.  Think also of being able to
  assemble the binary blobs into an S3 bucket dedicated to spark-ec2.

  Any other thoughts/voices are appreciated here.  spark-ec2 is a super-powerful
  tool and deserves a fair bit of attention!
  --Matthew Goodman
 
  =
  Check Out My Website: http://craneium.net
  Find me on LinkedIn: http://tinyurl.com/d6wlch

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Model parallelism with RDD

2015-07-10 Thread Shivaram Venkataraman
I think you need to do `newRDD.cache()` and `newRDD.count` before you do
oldRDD.unpersist(true) -- Otherwise it might be recomputing all the
previous iterations each time.

Thanks
Shivaram
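
A minimal sketch of the pattern suggested above (materialize and cache the new RDD before unpersisting the old one), assuming a SparkContext `sc`; the sizes and variable names are only illustrative:

  var oldRDD = sc.parallelize(1 to 1000000, 5).map(_ => 0.1).cache()
  oldRDD.count()  // materialize the initial model once

  for (i <- 0 until 10) {
    val newRDD = oldRDD.map(x => x * x).cache()  // mark the updated model for caching
    newRDD.count()                     // action that actually computes and caches newRDD
    oldRDD.unpersist(blocking = true)  // only now drop the previous iteration's blocks
    newRDD.mean()                      // the "convergence check" reuses the cached data
    oldRDD = newRDD
  }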

On Fri, Jul 10, 2015 at 7:44 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

  Hi,



 I am interested in how scalable model parallelism can be within Spark.
 Suppose the model contains N weights of type Double and N is so large that it
 does not fit into the memory of a single node. So, we can store the model
 in an RDD[Double] across several nodes. To train the model, one needs to
 perform K iterations that update all the weights and check the convergence.
 Then we also need to exchange some weights between the nodes to synchronize
 the model or update the global state. I’ve sketched the code that does
 iterative updates with an RDD (without the global update yet). Surprisingly, each
 iteration takes more time than the previous one, as shown below (time in seconds).
 Could you suggest what the reason for that is? I’ve checked GC; it only does
 something within a few milliseconds.



 Configuration: Spark 1.4, 1 master and 5 worker nodes, 5 executors, Intel
 Xeon 2.2, 16GB RAM each

  Iteration 0 time:1.127990986

 Iteration 1 time:1.391120414

 Iteration 2 time:1.642969138102

 Iteration 3 time:1.9344402954

 Iteration 4 time:2.207529424697

 Iteration 5 time:2.6328659593

 Iteration 6 time:2.791169049296

 Iteration 7 time:3.0850374104

 Iteration 8 time:3.4031050061

 Iteration 9 time:3.8826580919



 Code:

 val modelSize = 10

 val numIterations = 10

 val parallelizm = 5

 var oldRDD = sc.parallelize(1 to modelSize, parallelizm).map(x => 0.1)

 var newRDD = sc.parallelize(1 to 1, parallelizm).map(x => 0.1)

 var i = 0

 while (i < numIterations) {

   val t = System.nanoTime()

   // updating the weights

   val newRDD = oldRDD.map(x => x * x)

   oldRDD.unpersist(true)

   // “checking” convergence

   newRDD.mean

   println("Iteration " + i + " time:" + (System.nanoTime() - t) / 1e9 /
 numIterations)

   oldRDD = newRDD

   i += 1

 }





 Best regards, Alexander



Re: Model parallelism with RDD

2015-07-10 Thread Shivaram Venkataraman
Yeah I can see that being the case -- caching implies creating objects that
will be stored in memory. So there is a trade-off between storing data in
memory but having to garbage collect it later vs. recomputing the data.

Shivaram

On Fri, Jul 10, 2015 at 9:49 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

 Hi Shivaram,

 Thank you for the suggestion! If I do .cache and .count, each iteration takes
 much more time, most of which is spent in GC. Is that normal?

 On July 10, 2015, at 21:23, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:

 I think you need to do `newRDD.cache()` and `newRDD.count` before you do
 oldRDD.unpersist(true) -- Otherwise it might be recomputing all the
 previous iterations each time.

 Thanks
 Shivaram

 On Fri, Jul 10, 2015 at 7:44 PM, Ulanov, Alexander
 alexander.ula...@hp.com wrote:
 Hi,

 I am interested in how scalable model parallelism can be within Spark.
 Suppose the model contains N weights of type Double and N is so large that it
 does not fit into the memory of a single node. So, we can store the model
 in an RDD[Double] across several nodes. To train the model, one needs to
 perform K iterations that update all the weights and check the convergence.
 Then we also need to exchange some weights between the nodes to synchronize
 the model or update the global state. I’ve sketched the code that does
 iterative updates with an RDD (without the global update yet). Surprisingly, each
 iteration takes more time than the previous one, as shown below (time in seconds).
 Could you suggest what the reason for that is? I’ve checked GC; it only does
 something within a few milliseconds.

 Configuration: Spark 1.4, 1 master and 5 worker nodes, 5 executors, Intel
 Xeon 2.2, 16GB RAM each
 Iteration 0 time:1.127990986
 Iteration 1 time:1.391120414
 Iteration 2 time:1.642969138102
 Iteration 3 time:1.9344402954
 Iteration 4 time:2.207529424697
 Iteration 5 time:2.6328659593
 Iteration 6 time:2.791169049296
 Iteration 7 time:3.0850374104
 Iteration 8 time:3.4031050061
 Iteration 9 time:3.8826580919

 Code:
 val modelSize = 10
 val numIterations = 10
 val parallelizm = 5
 var oldRDD = sc.parallelize(1 to modelSize, parallelizm).map(x => 0.1)
 var newRDD = sc.parallelize(1 to 1, parallelizm).map(x => 0.1)
 var i = 0
 while (i < numIterations) {
   val t = System.nanoTime()
   // updating the weights
   val newRDD = oldRDD.map(x => x * x)
   oldRDD.unpersist(true)
   // “checking” convergence
   newRDD.mean
   println("Iteration " + i + " time:" + (System.nanoTime() - t) / 1e9 /
 numIterations)
   oldRDD = newRDD
   i += 1
 }


 Best regards, Alexander




Re: PySpark vs R

2015-07-10 Thread Shivaram Venkataraman
The R and Python implementations differ in how they communicate with the
JVM, so there is no invariant there per se.

Thanks
Shivaram

On Thu, Jul 9, 2015 at 10:40 PM, Vasili I. Galchin vigalc...@gmail.com
wrote:

 Hello,

 Just trying to get up to speed (a week... please be patient with me).

 I have been reading several docs, plus

 reading the PySpark and R code. I don't see an invariant between the Python
 and R implementations. ??

Probably I should read the native Scala code, yes?

 Kind thx,

 Vasili

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



