Re: Apache Spark 3.2.3 Release?

2022-10-18 Thread vaquar khan
+1

On Tue, Oct 18, 2022, 8:58 PM 416161...@qq.com  wrote:

> +1
>
> --
> Ruifeng Zheng
> ruife...@foxmail.com
>
> 
>
>
>
> -- Original --
> *From:* "Yuming Wang" ;
> *Date:* Wed, Oct 19, 2022 09:35 AM
> *To:* "kazuyuki tanimura";
> *Cc:* "Gengliang Wang";"huaxin gao"<
> huaxin.ga...@gmail.com>;"Dongjoon Hyun";"Sean
> Owen";"Chao Sun";"dev"<
> dev@spark.apache.org>;
> *Subject:* Re: Apache Spark 3.2.3 Release?
>
> +1
>
> On Wed, Oct 19, 2022 at 4:17 AM kazuyuki tanimura
>  wrote:
>
>> +1 Thanks Chao!
>>
>>
>> Kazu
>>
>> On Oct 18, 2022, at 11:48 AM, Gengliang Wang  wrote:
>>
>> +1. Thanks Chao!
>>
>> On Tue, Oct 18, 2022 at 11:45 AM huaxin gao 
>> wrote:
>>
>>> +1 Thanks Chao!
>>>
>>> Huaxin
>>>
>>> On Tue, Oct 18, 2022 at 11:29 AM Dongjoon Hyun 
>>> wrote:
>>>
 +1

 Thank you for volunteering, Chao!

 Dongjoon.


 On Tue, Oct 18, 2022 at 9:55 AM Sean Owen  wrote:

> OK by me, if someone is willing to drive it.
>
> On Tue, Oct 18, 2022 at 11:47 AM Chao Sun  wrote:
>
>> Hi All,
>>
>> It's been more than 3 months since 3.2.2 (tagged on Jul 11) was
>> released. There are now 66 patches accumulated in branch-3.2, including
>> 2 correctness issues.
>>
>> Is it a good time to start a new release? If there's no objection, I'd
>> like to volunteer as the release manager for the 3.2.3 release, and
>> start preparing the first RC next week.
>>
>> # Correctness issues
>>
>> SPARK-39833: Filtered parquet data frame count() and show() produce
>> inconsistent results when spark.sql.parquet.filterPushdown is true
>> SPARK-40002: Limit improperly pushed down through window using
>> ntile function
>>
>> Best,
>> Chao
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>


Re: Welcome Yikun Jiang as a Spark committer

2022-10-09 Thread vaquar khan
Congratulations.

Regards,
Vaquar khan

On Sun, Oct 9, 2022, 6:46 AM 叶先进  wrote:

> Congrats
>
> On Oct 9, 2022, at 16:44, XiDuo You  wrote:
>
> Congratulations, Yikun !
>
> Maxim Gekk wrote on Sun, Oct 9, 2022 at 15:59:
>
>> Keep up the great work, Yikun!
>>
>> On Sun, Oct 9, 2022 at 10:52 AM Gengliang Wang  wrote:
>>
>>> Congratulations, Yikun!
>>>
>>> On Sun, Oct 9, 2022 at 12:33 AM 416161...@qq.com 
>>> wrote:
>>>
>>>> Congrats, Yikun!
>>>>
>>>> --
>>>> Ruifeng Zheng
>>>> ruife...@foxmail.com
>>>>
>>>>
>>>>
>>>>
>>>> -- Original --
>>>> *From:* "Martin Grigorov" ;
>>>> *Date:* Sun, Oct 9, 2022 05:01 AM
>>>> *To:* "Hyukjin Kwon";
>>>> *Cc:* "dev";"Yikun Jiang";
>>>> *Subject:* Re: Welcome Yikun Jiang as a Spark committer
>>>>
>>>> Congratulations, Yikun!
>>>>
>>>> On Sat, Oct 8, 2022 at 7:41 AM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> The Spark PMC recently added Yikun Jiang as a committer on the
>>>>> project.
>>>>> Yikun is the major contributor to the infrastructure and GitHub
>>>>> Actions in Apache Spark, as well as to Kubernetes and PySpark.
>>>>> He has put a lot of effort into stabilizing and optimizing the builds
>>>>> so we all can work together in Apache Spark more
>>>>> efficiently and effectively. He's also driving the SPIP for an official
>>>>> Docker image for Apache Spark, for both users and developers.
>>>>> Please join me in welcoming Yikun!
>>>>>
>>>>>
>


Re: Welcoming three new PMC members

2022-08-09 Thread vaquar khan
Congratulations

On Tue, Aug 9, 2022, 11:40 AM Xiao Li  wrote:

> Hi all,
>
> The Spark PMC recently voted to add three new PMC members. Join me in
> welcoming them to their new roles!
>
> New PMC members: Huaxin Gao, Gengliang Wang and Maxim Gekk
>
> The Spark PMC
>


Re: [VOTE] Release Apache Spark 2.4.2

2019-04-21 Thread vaquar khan
+1

Regards,
Vaquar khan

On Sun, Apr 21, 2019, 11:19 PM Felix Cheung 
wrote:

> +1
>
> R tests, package tests on r-hub. Manually check commits under R, doc etc
>
>
> --
> *From:* Sean Owen 
> *Sent:* Saturday, April 20, 2019 11:27 AM
> *To:* Wenchen Fan
> *Cc:* Spark dev list
> *Subject:* Re: [VOTE] Release Apache Spark 2.4.2
>
> +1 from me too.
>
> It seems like there is support for merging the Jackson change into
> 2.4.x (and, I think, a few more minor dependency updates) but this
> doesn't have to go into 2.4.2. That said, if there is another RC for
> any reason, I think we could include it. Otherwise can wait for 2.4.3.
>
> On Thu, Apr 18, 2019 at 9:51 PM Wenchen Fan  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.2.
> >
> > The vote is open until April 23 PST and passes if a majority of +1 PMC
> votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.2
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.2-rc1 (commit
> a44880ba74caab7a987128cb09c4bee41617770a):
> > https://github.com/apache/spark/tree/v2.4.2-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1322/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-docs/
> >
> > The list of bug fixes going into 2.4.1 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12344996
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks; in Java/Scala
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.2?
> > ===
> >
> > The current list of open tickets targeted at 2.4.2 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.2
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
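
A minimal sketch of the Java/Scala testing step described above, assuming an sbt build:
add the RC staging repository as a resolver and depend on the RC version. The repository
URL is the one posted in this vote; the module and scope below are illustrative only.

  // build.sbt (sketch only)
  resolvers += "Spark 2.4.2 RC staging" at "https://repository.apache.org/content/repositories/orgapachespark-1322/"

  // Pull the RC artifacts instead of a released 2.4.x; adjust modules to your project.
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.2" % Provided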


Re: time for Apache Spark 3.0?

2018-06-16 Thread vaquar khan
+1  for 2.4 next, followed by 3.0.

Where can we get the Apache Spark roadmap for 2.4, 2.5, and 3.0?
Is it possible to share the proposed specifications for future releases, the same
way as for past releases (https://spark.apache.org/releases/spark-release-2-3-0.html)?
Regards,
Vaquar khan

On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan  wrote:

> Please ignore the YouTube link in my last email, not sure how it got added.
> Apologies, not sure how to delete it.
>
>
> On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan 
> wrote:
>
>> +1
>>
>> https://www.youtube.com/watch?v=-ik7aJ5U6kg
>>
>> Regards,
>> Vaquar khan
>>
>> On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin  wrote:
>>
>>> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>>>
>>>
>>> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
>>> wrote:
>>>
>>>> I agree, I don't see a pressing need for a major version bump either.
>>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
>>>> wrote:
>>>> >
>>>> > Changing major version numbers is not about new features or a vague
>>>> notion that it is time to do something that will be seen to be a
>>>> significant release. It is about breaking stable public APIs.
>>>> >
>>>> > I still remain unconvinced that the next version can't be 2.4.0.
>>>> >
>>>> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
>>>> >>
>>>> >> Dear all:
>>>> >>
>>>> >> It has been 2 months since this topic was proposed. Any progress
>>>> now? 2018 is already about half over.
>>>> >>
>>>> >> I agree that the new version should include some exciting new
>>>> features. How about this one:
>>>> >>
>>>> >> 6. ML/DL framework to be integrated as core component and feature.
>>>> (Such as Angel / BigDL / ……)
>>>> >>
>>>> >> 3.0 is a very important version for a good open source project. It
>>>> would be better to shed the historical burden and focus on new areas.
>>>> Spark has been widely used all over the world as a successful big data
>>>> framework, and it can be even better than that.
>>>> >>
>>>> >> Andy
>>>> >>
>>>> >>
>>>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin 
>>>> wrote:
>>>> >>>
>>>> >>> There was a discussion thread on scala-contributors about Apache
>>>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>>>> about time for Spark to work towards the 3.0 release. By the time it comes
>>>> out, it will be more than 2 years since Spark 2.0.
>>>> >>>
>>>> >>> For contributors less familiar with Spark’s history, I want to give
>>>> more context on Spark releases:
>>>> >>>
>>>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>>>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>>>> Spark 3.0 in 2018.
>>>> >>>
>>>> >>> 2. Spark’s versioning policy promises that Spark does not break
>>>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>>>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>>>> 2.0, 2.x to 3.0).
>>>> >>>
>>>> >>> 3. That said, a major version isn’t necessarily the playground for
>>>> disruptive API changes to make it painful for users to update. The main
>>>> purpose of a major release is an opportunity to fix things that are broken
>>>> in the current API and remove certain deprecated APIs.
>>>> >>>
>>>> >>> 4. Spark as a project has a culture of evolving architecture and
>>>> developing major new features incrementally, so major releases are not the
>>>> only time for exciting new features. For example, the bulk of the work in
>>>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>>>> Processing was introduced in Spark 2.3. Both were feature releases rather
>>>> than major releases.
>>>> >>>
>>>> >>>
>>>> >>> You can find more background in the thread discussing Spark 2.0:
>>&

Re: time for Apache Spark 3.0?

2018-06-16 Thread vaquar khan
Please ignore the YouTube link in my last email, not sure how it got added.
Apologies, not sure how to delete it.


On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan  wrote:

> +1
>
> https://www.youtube.com/watch?v=-ik7aJ5U6kg
>
> Regards,
> Vaquar khan
>
> On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin  wrote:
>
>> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>>
>>
>> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
>> wrote:
>>
>>> I agree, I don't see a pressing need for a major version bump either.
>>>
>>>
>>> Regards,
>>> Mridul
>>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
>>> wrote:
>>> >
>>> > Changing major version numbers is not about new features or a vague
>>> notion that it is time to do something that will be seen to be a
>>> significant release. It is about breaking stable public APIs.
>>> >
>>> > I still remain unconvinced that the next version can't be 2.4.0.
>>> >
>>> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
>>> >>
>>> >> Dear all:
>>> >>
>>> >> It has been 2 months since this topic was proposed. Any progress
>>> now? 2018 is already about half over.
>>> >>
>>> >> I agree that the new version should include some exciting new
>>> features. How about this one:
>>> >>
>>> >> 6. ML/DL framework to be integrated as core component and feature.
>>> (Such as Angel / BigDL / ……)
>>> >>
>>> >> 3.0 is a very important version for a good open source project. It
>>> would be better to shed the historical burden and focus on new areas.
>>> Spark has been widely used all over the world as a successful big data
>>> framework, and it can be even better than that.
>>> >>
>>> >> Andy
>>> >>
>>> >>
>>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin 
>>> wrote:
>>> >>>
>>> >>> There was a discussion thread on scala-contributors about Apache
>>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>>> about time for Spark to work towards the 3.0 release. By the time it comes
>>> out, it will be more than 2 years since Spark 2.0.
>>> >>>
>>> >>> For contributors less familiar with Spark’s history, I want to give
>>> more context on Spark releases:
>>> >>>
>>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>>> Spark 3.0 in 2018.
>>> >>>
>>> >>> 2. Spark’s versioning policy promises that Spark does not break
>>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>>> 2.0, 2.x to 3.0).
>>> >>>
>>> >>> 3. That said, a major version isn’t necessarily the playground for
>>> disruptive API changes to make it painful for users to update. The main
>>> purpose of a major release is an opportunity to fix things that are broken
>>> in the current API and remove certain deprecated APIs.
>>> >>>
>>> >>> 4. Spark as a project has a culture of evolving architecture and
>>> developing major new features incrementally, so major releases are not the
>>> only time for exciting new features. For example, the bulk of the work in
>>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>>> Processing was introduced in Spark 2.3. Both were feature releases rather
>>> than major releases.
>>> >>>
>>> >>>
>>> >>> You can find more background in the thread discussing Spark 2.0:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-
>>> proposal-for-Spark-2-0-td15122.html
>>> >>>
>>> >>>
>>> >>> The primary motivating factor IMO for a major version bump is to
>>> support Scala 2.12, which requires minor API breaking changes to Spark’s
>>> APIs. Similar to Spark 2.0, I think there are also opportunities for other
>>> changes that we know have been biting us for a long time but can’t be
>>> changed in feature releases (to be clear, I’m actually not sure they are
>>> all good ideas, but I’m writing them down as candidates for consideration):
>&

Re: time for Apache Spark 3.0?

2018-06-16 Thread vaquar khan
+1

https://www.youtube.com/watch?v=-ik7aJ5U6kg

Regards,
Vaquar khan

On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin  wrote:

> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>
>
> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
> wrote:
>
>> I agree, I don't see a pressing need for a major version bump either.
>>
>>
>> Regards,
>> Mridul
>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
>> wrote:
>> >
>> > Changing major version numbers is not about new features or a vague
>> notion that it is time to do something that will be seen to be a
>> significant release. It is about breaking stable public APIs.
>> >
>> > I still remain unconvinced that the next version can't be 2.4.0.
>> >
>> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
>> >>
>> >> Dear all:
>> >>
>> >> It has been 2 months since this topic was proposed. Any progress
>> now? 2018 is already about half over.
>> >>
>> >> I agree that the new version should include some exciting new features.
>> How about this one:
>> >>
>> >> 6. ML/DL framework to be integrated as core component and feature.
>> (Such as Angel / BigDL / ……)
>> >>
>> >> 3.0 is a very important version for a good open source project. It
>> would be better to shed the historical burden and focus on new areas.
>> Spark has been widely used all over the world as a successful big data
>> framework, and it can be even better than that.
>> >>
>> >> Andy
>> >>
>> >>
>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin 
>> wrote:
>> >>>
>> >>> There was a discussion thread on scala-contributors about Apache
>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>> about time for Spark to work towards the 3.0 release. By the time it comes
>> out, it will be more than 2 years since Spark 2.0.
>> >>>
>> >>> For contributors less familiar with Spark’s history, I want to give
>> more context on Spark releases:
>> >>>
>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>> Spark 3.0 in 2018.
>> >>>
>> >>> 2. Spark’s versioning policy promises that Spark does not break
>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>> 2.0, 2.x to 3.0).
>> >>>
>> >>> 3. That said, a major version isn’t necessarily the playground for
>> disruptive API changes to make it painful for users to update. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs.
>> >>>
>> >>> 4. Spark as a project has a culture of evolving architecture and
>> developing major new features incrementally, so major releases are not the
>> only time for exciting new features. For example, the bulk of the work in
>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>> Processing was introduced in Spark 2.3. Both were feature releases rather
>> than major releases.
>> >>>
>> >>>
>> >>> You can find more background in the thread discussing Spark 2.0:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-
>> Spark-2-0-td15122.html
>> >>>
>> >>>
>> >>> The primary motivating factor IMO for a major version bump is to
>> support Scala 2.12, which requires minor API breaking changes to Spark’s
>> APIs. Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>> >>>
>> >>> 1. Support Scala 2.12.
>> >>>
>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 2.x.
>> >>>
>> >>> 3. Shade all dependencies.
>> >>>
>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>> compliant, to prevent users from shooting themselves in the foot, e.g.
>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make i

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread vaquar khan
+1

Regards,
Vaquar khan

On Mon, Feb 19, 2018 at 10:29 PM, Xiao Li <gatorsm...@gmail.com> wrote:

> +1.
>
> So far, no function/performance regression in Spark SQL, Core and PySpark.
>
> Thanks!
>
> Xiao
>
> 2018-02-19 19:47 GMT-08:00 Hyukjin Kwon <gurwls...@gmail.com>:
>
>> Ah, I see. For 1), I overlooked Felix's input here. I couldn't foresee
>> this when I added this documentation because it worked in my simple demo:
>>
>> https://spark-test.github.io/sparksqldoc/search.html?q=approx
>> https://spark-test.github.io/sparksqldoc/#approx_percentile
>>
>> Will try to investigate this shortly too.
>>
>>
>>
>> 2018-02-20 11:45 GMT+09:00 Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu>:
>>
>>> For (1) I think it has something to do with https://dist.apache.org/r
>>> epos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/ not automatically
>>> going to https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-d
>>> ocs/_site/api/sql/index.html -- So if you see the link to
>>> approx_percentile the link we generate is https://dist.apache.org/rep
>>> os/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/#approx_percentile --
>>> This doesn't work as Felix said but https://dist.apache.org/re
>>> pos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/index.html#
>>> approx_percentile works
>>>
>>> I'm not sure how this will behave on the main site. FWIW
>>> http://spark.apache.org/docs/latest/api/python/ does redirect to
>>> http://spark.apache.org/docs/latest/api/python/index.html
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Mon, Feb 19, 2018 at 6:31 PM, Felix Cheung <felixcheun...@hotmail.com
>>> > wrote:
>>>
>>>> Ah sorry I realize my wordings were unclear (not enough zzz or coffee)
>>>>
>>>> So to clarify,
>>>> 1) when searching for a word in the SQL function doc, it does return
>>>> the search result page correctly; however, none of the links in the results
>>>> open the actual doc page. To take the search I included as an
>>>> example, if you click on approx_percentile, for instance, it opens
>>>> the web directory instead.
>>>>
>>>> 2) The second is that the dist location we are voting on has a .iml file,
>>>> which is normally not included in a release or release RC, and it is unsigned
>>>> and without a hash (therefore it seems like it should not be in the release).
>>>>
>>>> Thanks!
>>>>
>>>> _
>>>> From: Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
>>>> Sent: Tuesday, February 20, 2018 2:24 AM
>>>> Subject: Re: [VOTE] Spark 2.3.0 (RC4)
>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>> Cc: Sean Owen <sro...@gmail.com>, dev <dev@spark.apache.org>
>>>>
>>>>
>>>>
>>>> FWIW The search result link works for me
>>>>
>>>> Shivaram
>>>>
>>>> On Mon, Feb 19, 2018 at 6:21 PM, Felix Cheung <
>>>> felixcheun...@hotmail.com> wrote:
>>>>
>>>>> These are two separate things:
>>>>>
>>>>> Does the search result links work for you?
>>>>>
>>>>> The second is the dist location we are voting on has a .iml file.
>>>>>
>>>>> _
>>>>> From: Sean Owen <sro...@gmail.com>
>>>>> Sent: Tuesday, February 20, 2018 2:19 AM
>>>>> Subject: Re: [VOTE] Spark 2.3.0 (RC4)
>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>> Cc: dev <dev@spark.apache.org>
>>>>>
>>>>>
>>>>>
>>>>> Maybe I misunderstand, but I don't see any .iml file in the 4 results
>>>>> on that page? it looks reasonable.
>>>>>
>>>>> On Mon, Feb 19, 2018 at 8:02 PM Felix Cheung <
>>>>> felixcheun...@hotmail.com> wrote:
>>>>>
>>>>>> Any idea with sql func docs search result returning broken links as
>>>>>> below?
>>>>>>
>>>>>> *From:* Felix Cheung <felixcheun...@hotmail.com>
>>>>>> *Sent:* Sunday, February 18, 2018 10:05:22 AM
>>>>>> *To:* Sameer Agarwal; Sameer Agarwal
>>>>>>
>>>>>> *Cc:* dev
>>>>>> *Subject:* Re: [VOTE] Spark 2.3.0 (RC4)
>>>>>> Quick questions:
>>>>>>
>>>>>> is there search link for sql functions quite right?
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs
>>>>>> /_site/api/sql/search.html?q=app
>>>>>>
>>>>>> this file shouldn't be included? https://dist.apache.org/repos/
>>>>>> dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-03 Thread vaquar khan
+1

On Fri, Nov 3, 2017 at 8:14 PM, Weichen Xu <weichen...@databricks.com>
wrote:

> +1.
>
> On Sat, Nov 4, 2017 at 8:04 AM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> +1 from me too.
>>
>> Matei
>>
>> > On Nov 3, 2017, at 4:59 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>> >
>> > +1.
>> >
>> > I think this architecture makes a lot of sense to let executors talk to
>> source/sink directly, and bring very low latency.
>> >
>> > On Thu, Nov 2, 2017 at 9:01 AM, Sean Owen <so...@cloudera.com> wrote:
>> > +0 simply because I don't feel I know enough to have an opinion. I have
>> no reason to doubt the change though, from a skim through the doc.
>> >
>> >
>> > On Wed, Nov 1, 2017 at 3:37 PM Reynold Xin <r...@databricks.com> wrote:
>> > Earlier I sent out a discussion thread for CP in Structured Streaming:
>> >
>> > https://issues.apache.org/jira/browse/SPARK-20928
>> >
>> > It is meant to be a very small, surgical change to Structured Streaming
>> to enable ultra-low latency. This is great timing because we are also
>> designing and implementing data source API v2. If designed properly, we can
>> have the same data source API working for both streaming and batch.
>> >
>> >
>> > Following the SPIP process, I'm putting this SPIP up for a vote.
>> >
>> > +1: Let's go ahead and design / implement the SPIP.
>> > +0: Don't really care.
>> > -1: I do not think this is a good idea for the following reasons.
>> >
>> >
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-12 Thread vaquar khan
+1

Regards,
Vaquar khan

On Oct 11, 2017 10:14 PM, "Weichen Xu" <weichen...@databricks.com> wrote:

+1

On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li <gatorsm...@gmail.com> wrote:

> +1
>
> Xiao
>
> On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin <r...@databricks.com> wrote:
>
>> +1
>>
>> One thing with MetadataSupport - It's a bad idea to call it that unless
>> adding new functions in that trait wouldn't break source/binary
>> compatibility in the future.
>>
>>
>> On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> I'm adding my own +1 (binding).
>>>
>>> On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0...@gmail.com>
>>> wrote:
>>>
>>>> I'm going to update the proposal: for the last point, although the
>>>> user-facing API (`df.write.format(...).option(...).mode(...).save()`)
>>>> mixes data and metadata operations, we are still able to separate them in
>>>> the data source write API. We can have a mix-in trait `MetadataSupport`
>>>> which has a method `create(options)`, so that data sources can mix in this
>>>> trait and provide metadata creation support. Spark will call this `create`
>>>> method inside `DataFrameWriter.save` if the specified data source has it.
>>>>
>>>> Note that file format data sources can ignore this new trait and still
>>>> write data without metadata(it doesn't have metadata anyway).
>>>>
>>>> With this updated proposal, I'm calling a new vote for the data source
>>>> v2 write path.
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following
>>>> technical reasons.
>>>>
>>>> Thanks!
>>>>
>>>> On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> After we merge the infrastructure of data source v2 read path, and
>>>>> have some discussion for the write path, now I'm sending this email to 
>>>>> call
>>>>> a vote for Data Source v2 write path.
>>>>>
>>>>> The full document of the Data Source API V2 is:
>>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ
>>>>> -Z8qU5Frf6WMQZ6jJVM/edit
>>>>>
>>>>> The ready-for-review PR that implements the basic infrastructure for
>>>>> the write path:
>>>>> https://github.com/apache/spark/pull/19269
>>>>>
>>>>>
>>>>> The Data Source V1 write path asks implementations to write a
>>>>> DataFrame directly, which is painful:
>>>>> 1. Exposing upper-level API like DataFrame to Data Source API is not
>>>>> good for maintenance.
>>>>> 2. Data sources may need to preprocess the input data before writing,
>>>>> like cluster/sort the input by some columns. It's better to do the
>>>>> preprocessing in Spark instead of in the data source.
>>>>> 3. Data sources need to take care of transaction themselves, which is
>>>>> hard. And different data sources may come up with a very similar approach
>>>>> for the transaction, which leads to many duplicated codes.
>>>>>
>>>>> To solve these pain points, I'm proposing the data source v2 writing
>>>>> framework which is very similar to the reading framework, i.e.,
>>>>> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>>>>>
>>>>> Data Source V2 write path follows the existing FileCommitProtocol, and
>>>>> have task/job level commit/abort, so that data sources can implement
>>>>> transaction easier.
>>>>>
>>>>> We can create a mix-in trait for DataSourceV2Writer to specify the
>>>>> requirement for input data, like clustering and ordering.
>>>>>
>>>>> Spark provides a very simple protocol for uses to connect to data
>>>>> sources. A common way to write a dataframe to data sources:
>>>>> `df.write.format(...).option(...).mode(...).save()`.
>>>>> Spark passes the options and save mode to data sources, and schedules
>>>>> the write job on the input data. And the data source should take care of
>>>>> the metadata, e.g., the JDBC data source can create the table if it 
>>>>> doesn't
>>>>> exist, or fail the job and ask users to create the table in the
>>>>> corresponding database first. Data sources can define some options for
>>>>> users to carry some metadata information like partitioning/bucketing.
>>>>>
>>>>>
>>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>>
>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>> +0: Don't really care.
>>>>> -1: I don't think this is a good idea because of the following
>>>>> technical reasons.
>>>>>
>>>>> Thanks!
>>>>>
>>>>
>>>>
>>>
>>
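
A rough sketch of the shape of the write path described in the proposal above
(WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter, plus the
MetadataSupport mix-in). All trait and method names below are illustrative readings
of the proposal text, not the actual Spark API.

  // Sketch only; names and signatures are assumptions based on the message above.
  trait DataWriter[T] {
    def write(record: T): Unit   // called per input record on an executor
    def commit(): Unit           // task-level commit (FileCommitProtocol-style)
    def abort(): Unit            // task-level rollback
  }

  trait DataWriterFactory[T] extends Serializable {
    def createWriter(partitionId: Int): DataWriter[T]
  }

  trait DataSourceV2Writer {
    def createWriterFactory(): DataWriterFactory[org.apache.spark.sql.Row]
    def commit(): Unit           // job-level commit after all tasks succeed
    def abort(): Unit            // job-level rollback
  }

  trait WriteSupport {
    def createWriter(options: Map[String, String]): DataSourceV2Writer
  }

  // Optional mix-in from the updated proposal: lets a source create metadata
  // (e.g. a JDBC table) before Spark schedules the write inside DataFrameWriter.save.
  trait MetadataSupport {
    def create(options: Map[String, String]): Unit
  }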


Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-05 Thread vaquar khan
+1 (non-binding), tested on Ubuntu; all test cases passed.

Regards,
Vaquar khan

On Thu, Oct 5, 2017 at 10:46 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> +1 too.
>
>
> On 6 Oct 2017 10:49 am, "Reynold Xin" <r...@databricks.com> wrote:
>
> +1
>
>
> On Mon, Oct 2, 2017 at 11:24 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.2. The vote is open until Saturday October 7th at 9:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.1.2-rc4
>> <https://github.com/apache/spark/tree/v2.1.2-rc4> (2abaea9e40fce81
>> cd4626498e0f5c28a70917499)
>>
>> List of JIRA tickets resolved in this release can be found with this
>> filter.
>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>
>> Release artifacts are signed with a key from:
>> https://people.apache.org/~holden/holdens_keys.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>
>> The documentation corresponding to this release can be found at:
>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install the
>> current RC and see if anything important breaks; in Java/Scala you
>> can add the staging repository to your project's resolvers and test with the
>> RC (make sure to clean up the artifact cache before/after so you don't
>> end up building with an out-of-date RC going forward).
>>
>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.3.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1. That being said
>> if there is something which is a regression from 2.1.1 that has not been
>> correctly targeted please ping a committer to help target the issue (you
>> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>> <https://issues.apache.org/jira/browse/SPARK-21985?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.1.2%20OR%20affectedVersion%20%3D%202.1.1)>
>> )
>>
>> *What are the unresolved* issues targeted for 2.1.2
>> <https://issues.apache.org/jira/browse/SPARK-21985?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.2>
>> ?
>>
>> At this time there are no open unresolved issues.
>>
>> *Is there anything different about this release?*
>>
>> This is the first release in a while not built on the AMPLAB Jenkins. This
>> is good because it means future releases can more easily be built and
>> signed securely (and I've been updating the documentation in
>> https://github.com/apache/spark-website/pull/66 as I progress), however
>> the chances of a mistake are higher with any change like this. If there
>> something you normally take for granted as correct when checking a release,
>> please double check this time :)
>>
>> *Should I be committing code to branch-2.1?*
>>
>> Thanks for asking! Please treat this stage in the RC process as "code
>> freeze" so bug fixes only. If you're uncertain if something should be back
>> ported please reach out. If you do commit to branch-2.1 please tag your
>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the 2.1.3
>> fixes into 2.1.2 as appropriate.
>>
>> *What happened to RC3?*
>>
>> Some R+zinc interactions kept it from getting out the door.
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: Interested to Contribute in Spark Development

2017-10-04 Thread vaquar khan
Hi Nishant,

1) Start by helping Spark users on the mailing list and Stack Overflow.

2) Start helping with builds and testing.

3) Once you are comfortable with the code, start working on Spark JIRA issues.


Regards,
Vaquar khan

On Oct 4, 2017 11:29 AM, "Kumar Nishant" <knishan...@gmail.com> wrote:

> Hi Team,
> I am new to the Apache community and I would love to contribute to
> Spark development. Can anyone mentor & guide me on how to proceed and start
> contributing? I am a beginner here, so I am not sure what process is to be
> followed.
>
> Thanks
> Nishant
>
>


Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-29 Thread vaquar khan
+1 (non-binding)

Regards,
Vaquar khan

On Fri, Sep 29, 2017 at 1:52 PM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> +1 (non-binding)
>
> Checked all signatures/checksums for binaries and source, spot-checked
> maven artifacts. Thanks for fixing the signatures, Holden!
>
> On Fri, Sep 29, 2017 at 8:25 AM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> As a follow up the JIRA for this is at https://issues.apache.org/j
>> ira/browse/SPARK-22167
>>
>> On Fri, Sep 29, 2017 at 2:50 AM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> This vote is canceled and will be replaced with an RC3 once Felix and I
>>> figure out the R packaging issue.
>>>
>>> On Fri, Sep 29, 2017 at 1:03 AM Felix Cheung <felixcheun...@hotmail.com>
>>> wrote:
>>>
>>>> -1
>>>>
>>>> (Sorry) spark-2.1.2-bin-hadoop2.7.tgz is missing the R directory, not
>>>> sure why yet.
>>>>
>>>> Tested on multiple platform as source package, (against 2.1.1 jar)
>>>> seemed fine except this WARNING on R-devel
>>>>
>>>> * checking for code/documentation mismatches ... WARNING
>>>> Codoc mismatches from documentation object 'attach':
>>>> attach
>>>>   Code: function(what, pos = 2L, name = deparse(substitute(what),
>>>>  backtick = FALSE), warn.conflicts = TRUE)
>>>>   Docs: function(what, pos = 2L, name = deparse(substitute(what)),
>>>>  warn.conflicts = TRUE)
>>>>   Mismatches in argument default values:
>>>> Name: 'name' Code: deparse(substitute(what), backtick = FALSE)
>>>> Docs: deparse(substitute(what))
>>>>
>>>> Checked the latest release R 3.4.1 and the signature change wasn't
>>>> there. This likely indicated an upcoming change in the next R release that
>>>> could incur this new warning when we attempt to publish the package.
>>>>
>>>> Not sure what we can do now since we work with multiple versions of R
>>>> and they will have different signatures then.
>>>> --
>>>> *From:* Luciano Resende <luckbr1...@gmail.com>
>>>> *Sent:* Thursday, September 28, 2017 10:29:18 PM
>>>> *To:* Holden Karau
>>>> *Cc:* dev@spark.apache.org
>>>>
>>>> *Subject:* Re: [VOTE] Spark 2.1.2 (RC2)
>>>> +1 (non-binding)
>>>>
>>>> Minor comments:
>>>> The apache infra has a staging repository to add release candidates,
>>>> and it might be better/simpler to use that instead of home.a.o. See
>>>> https://dist.apache.org/repos/dist/dev/spark/.
>>>>
>>>>
>>>>
>>>> On Tue, Sep 26, 2017 at 9:47 PM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.1.2. The vote is open until Wednesday October 4th at 23:59
>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.1.2-rc2
>>>>> <https://github.com/apache/spark/tree/v2.1.2-rc2> (fabbb7f59e47590
>>>>> 114366d14e15fbbff8c88593c)
>>>>>
>>>>> List of JIRA tickets resolved in this release can be found with this
>>>>> filter.
>>>>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://home.apache.org/~holden/spark-2.1.2-rc2-bin/
>>>>>
>>>>> Release artifacts are signed with a key from:
>>>>> https://people.apache.org/~holden/holdens_keys.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1251
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://people.apache.org/~holden/spark-2.1.2-rc2-docs/
>>>>>
>>>>>
>>>>

Re: Welcoming Tejas Patil as a Spark committer

2017-09-29 Thread vaquar khan
Congrats Tejas

Regards,
Vaquar khan

On Fri, Sep 29, 2017 at 4:33 PM, Mridul Muralidharan <mri...@gmail.com>
wrote:

> Congratulations Tejas !
>
> Regards,
> Mridul
>
> On Fri, Sep 29, 2017 at 12:58 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
> > Hi all,
> >
> > The Spark PMC recently added Tejas Patil as a committer on the
> > project. Tejas has been contributing across several areas of Spark for
> > a while, focusing especially on scalability issues and SQL. Please
> > join me in welcoming Tejas!
> >
> > Matei
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-23 Thread vaquar khan
+1 looks good,

Regards,
Vaquar khan

On Sat, Sep 23, 2017 at 12:22 PM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> +1; we should consider something similar for multi-dimensional tensors too.
>
> Matei
>
> > On Sep 23, 2017, at 7:27 AM, Yanbo Liang <yblia...@gmail.com> wrote:
> >
> > +1
> >
> > On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan <nomanbp...@live.com> wrote:
> > +1
> >
> > Regards
> > Noman
> > From: Denny Lee <denny.g@gmail.com>
> > Sent: Friday, September 22, 2017 2:59:33 AM
> > To: Apache Spark Dev; Sean Owen; Tim Hunter
> > Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
> > Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
> >
> > +1
> >
> > On Thu, Sep 21, 2017 at 11:15 Sean Owen <so...@cloudera.com> wrote:
> > Am I right that this doesn't mean other packages would use this
> representation, but that they could?
> >
> > The representation looked fine to me w.r.t. what DL frameworks need.
> >
> > My previous comment was that this is actually quite lightweight. It's
> kind of like how I/O support is provided for CSV and JSON, so makes enough
> sense to add to Spark. It doesn't really preclude other solutions.
> >
> > For those reasons I think it's fine. +1
> >
> > On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter <timhun...@databricks.com>
> wrote:
> > Hello community,
> >
> > I would like to call for a vote on SPARK-21866. It is a short proposal
> that has important applications for image processing and deep learning.
> Joseph Bradley has offered to be the shepherd.
> >
> > JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
> > PDF version: https://issues.apache.org/jira/secure/attachment/
> 12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
> >
> > Background and motivation
> > As Apache Spark is being used more and more in the industry, some new
> use cases are emerging for different data formats beyond the traditional
> SQL types or the numerical types (vectors and matrices). Deep Learning
> applications commonly deal with image processing. A number of projects add
> some Deep Learning capabilities to Spark (see list below), but they
> struggle to communicate with each other or with MLlib pipelines because
> there is no standard way to represent an image in Spark DataFrames. We
> propose to federate efforts for representing images in Spark by defining a
> representation that caters to the most common needs of users and library
> developers.
> > This SPIP proposes a specification to represent images in Spark
> DataFrames and Datasets (based on existing industrial standards), and an
> interface for loading sources of images. It is not meant to be a
> full-fledged image processing library, but rather the core description that
> other libraries and users can rely on. Several packages already offer
> various processing facilities for transforming images or doing more complex
> operations, and each has various design tradeoffs that make them better as
> standalone solutions.
> > This project is a joint collaboration between Microsoft and Databricks,
> which have been testing this design in two open source packages: MMLSpark
> and Deep Learning Pipelines.
> > The proposed image format is an in-memory, decompressed representation
> that targets low-level applications. It is significantly more liberal in
> memory usage than compressed image representations such as JPEG, PNG, etc.,
> but it allows easy communication with popular image processing libraries
> and has no decoding overhead.
> > Targets users and personas:
> > Data scientists, data engineers, library developers.
> > The following libraries define primitives for loading and representing
> images, and will gain from a common interchange format (in alphabetical
> order):
> >   • BigDL
> >   • DeepLearning4J
> >   • Deep Learning Pipelines
> >   • MMLSpark
> >   • TensorFlow (Spark connector)
> >   • TensorFlowOnSpark
> >   • TensorFrames
> >   • Thunder
> > Goals:
> >   • Simple representation of images in Spark DataFrames, based on
> pre-existing industrial standards (OpenCV)
> >   • This format should eventually allow the development of
> high-performance integration points with image processing libraries such as
> libOpenCV, Google TensorFlow, CNTK, and other C libraries.
> >   • The reader should be able to read popular formats of images from
> distributed sources.
> > Non-Goals:
> > Images are a versatile medium and en
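
A rough sketch of what the single-image representation described above could look like
as a Spark SQL schema. The field names (origin, height, width, nChannels, mode, data)
follow the OpenCV-style layout the proposal describes, but are illustrative assumptions,
not the final API.

  // Sketch only: one row per image, pixels stored decompressed.
  import org.apache.spark.sql.types._

  val imageSchema = StructType(Seq(
    StructField("origin", StringType),        // source URI of the image
    StructField("height", IntegerType),       // height in pixels
    StructField("width", IntegerType),        // width in pixels
    StructField("nChannels", IntegerType),    // number of color channels
    StructField("mode", IntegerType),         // OpenCV-compatible type code
    StructField("data", BinaryType)           // decompressed pixel bytes
  ))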

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-10 Thread vaquar khan
+1

Regards,
Vaquar khan

On Sep 10, 2017 5:18 AM, "Noman Khan" <nomanbp...@live.com> wrote:

> +1
> --
> *From:* wangzhenhua (G) <wangzhen...@huawei.com>
> *Sent:* Friday, September 8, 2017 2:20:07 AM
> *To:* Dongjoon Hyun; 蒋星博
> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
> Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>
>
> +1 (non-binding)  Great to see data source API is going to be improved!
>
>
>
> best regards,
>
> -Zhenhua(Xander)
>
>
>
> *From:* Dongjoon Hyun [mailto:dongjoon.h...@gmail.com]
> *Sent:* September 8, 2017 4:07
> *To:* 蒋星博
> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
> Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>
>
>
> +1 (non-binding).
>
>
>
> On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 <jiangxb1...@gmail.com> wrote:
>
> +1
>
>
>
>
>
> On Thu, Sep 7, 2017 at 12:04 PM, Reynold Xin <r...@databricks.com> wrote:
>
> +1 as well
>
>
>
> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
> +1
>
>
>
> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> +1 (non-binding)
>
> Thanks for making the updates reflected in the current PR. It would be
> great to see the doc updated before it is finally published though.
>
> Right now it feels like this SPIP is focused more on getting the basics
> right for what many datasources are already doing in API V1 combined with
> other private APIs, vs pushing forward state of the art for performance.
>
> I think that’s the right approach for this SPIP. We can add the support
> you’re talking about later with a more specific plan that doesn’t block
> fixing the problems that this addresses.
>
> ​
>
>
>
> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
> +1 (binding)
>
>
>
> I personally believe that there is quite a big difference between having a
> generic data source interface with a low surface area and pushing down a
> significant part of query processing into a datasource. The later has much
> wider wider surface area and will require us to stabilize most of the
> internal catalyst API's which will be a significant burden on the community
> to maintain and has the potential to slow development velocity
> significantly. If you want to write such integrations then you should be
> prepared to work with catalyst internals and own up to the fact that things
> might change across minor versions (and in some cases even maintenance
> releases). If you are willing to go down that road, then your best bet is
> to use the already existing spark session extensions which will allow you
> to write such integrations and can be used as an `escape hatch`.
>
>
>
>
>
> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com> wrote:
>
> +0 (non-binding)
>
>
>
> I think there are benefits to unifying all the Spark-internal datasources
> into a common public API for sure.  It will serve as a forcing function to
> ensure that those internal datasources aren't advantaged vs datasources
> developed externally as plugins to Spark, and that all Spark features are
> available to all datasources.
>
>
>
> But I also think this read-path proposal avoids the more difficult
> questions around how to continue pushing datasource performance forwards.
> James Baker (my colleague) had a number of questions about advanced
> pushdowns (combined sorting and filtering), and Reynold also noted that
> pushdown of aggregates and joins are desirable on longer timeframes as
> well.  The Spark community saw similar requests, for aggregate pushdown in
> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
> in SPARK-12449.  Clearly a number of people are interested in this kind of
> performance work for datasources.
>
>
>
> To leave enough space for datasource developers to continue experimenting
> with advanced interactions between Spark and their datasources, I'd propose
> we leave some sort of escape valve that enables these datasources to keep
> pushing the boundaries without forking Spark.  Possibly that looks like an
> additional unsupported/unstable interface that pushes down an entire
> (unstable API) logical plan, which is expected to break API on every
> release.   (Spark attempts this full-plan pushdown, and if that fails Spark
> ignores it and continu

Re: SPIP: Spark on Kubernetes

2017-08-30 Thread vaquar khan
+1 (non-binding)

Regards,
Vaquar khan

On Mon, Aug 28, 2017 at 5:09 PM, Erik Erlandson <eerla...@redhat.com> wrote:

>
> In addition to the engineering & software aspects of the native Kubernetes
> community project, we have also worked at building out the community, with
> the goal of providing the foundation for sustaining engineering on the
> Kubernetes scheduler back-end.  That said, I agree 100% with your point
> that adding committers with kube-specific experience is good strategy for
> increasing review bandwidth to help service PRs from this community.
>
> On Mon, Aug 28, 2017 at 2:16 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> In my opinion, the fact that there are nearly no changes to spark-core,
>>> and most of our changes are additive should go to prove that this adds
>>> little complexity to the workflow of the committers.
>>
>>
>> Actually (and somewhat perversely), the otherwise praiseworthy isolation
>> of the Kubernetes code does mean that it adds complexity to the workflow of
>> the existing Spark committers. I'll reiterate Imran's concerns: The
>> existing Spark committers familiar with Spark's scheduler code have
>> adequate knowledge of the Standalone and Yarn implementations, and still
>> not sufficient coverage of Mesos. Adding k8s code to Spark would mean that
>> the progression of that code would start seeing the issues that the Mesos
>> code in Spark currently sees: Reviews and commits tend to languish because
>> we don't have currently active committers with sufficient knowledge and
>> cycles to deal with the Mesos PRs. Some of this is because the PMC needs to
>> get back to addressing the issue of adding new Spark committers who do have
>> the needed Mesos skills, but that isn't as simple as we'd like because
>> ideally a Spark committer has demonstrated skills across a significant
>> portion of the Spark code, not just tightly focused on one area (such as
>> Mesos or k8s integration.) In short, adding Kubernetes support directly
>> into Spark isn't likely (at least in the short-term) to be entirely
>> positive for the spark-on-k8s project, since merging of PRs to the
>> spark-on-k8s is very likely to be quite slow at least until such time as we
>> have k8s-focused Spark committers. If this project does end up getting
>> pulled into the Spark codebase, then the PMC will need to start looking at
>> bringing in one or more new committers who meet our requirements for such a
>> role and responsibility, and who also have k8s skills. The success and pace
>> of development of the spark-on-k8s will depend in large measure on the
>> PMC's ability to find such new committers.
>>
>> All that said, I'm +1 if the those currently responsible for the
>> spark-on-k8s project still want to bring the code into Spark.
>>
>>
>> On Mon, Aug 21, 2017 at 11:48 AM, Anirudh Ramanathan <
>> ramanath...@google.com.invalid> wrote:
>>
>>> Thank you for your comments Imran.
>>>
>>> Regarding integration tests,
>>>
>>> What you inferred from the documentation is correct -
>>> Integration tests do not require any prior setup or a Kubernetes cluster
>>> to run. Minikube is a single binary that brings up a one-node cluster and
>>> exposes the full Kubernetes API. It is actively maintained and kept up to
>>> date with the rest of the project. These local integration tests on Jenkins
>>> (like the ones with spark-on-yarn), should allow for the committers to
>>> merge changes with a high degree of confidence.
>>> I will update the proposal to include more information about the extent
>>> and kinds of testing we do.
>>>
>>> As for (b), people on this thread and the set of contributors on our
>>> fork are a fairly wide community of contributors and committers who would
>>> be involved in the maintenance long-term. It was one of the reasons behind
>>> developing separately as a fork. In my opinion, the fact that there are
>>> nearly no changes to spark-core, and most of our changes are additive
>>> should go to prove that this adds little complexity to the workflow of the
>>> committers.
>>>
>>> Separating out the cluster managers (into an as yet undecided new home)
>>> appears far more disruptive and a high risk change for the short term.
>>> However, when there is enough community support behind that effort, tracked
>>> in 19700 <https://issues.apache.org/jira/browse/SPARK-19700>; and if
>>> that is realized in the future, it wouldn't be difficult to switch over
>>> 

Re: How to tune the performance of Tpch query5 within Spark

2017-07-17 Thread vaquar khan
Verify your configuration; the following link covers all the Spark tuning points.

https://spark.apache.org/docs/latest/tuning.html

Regards,
Vaquar khan

On Jul 17, 2017 6:56 AM, "何文婷" <hewenting_...@163.com> wrote:

2.1.1

Sent from NetEase Mail Master
On 07/17/2017 20:55, vaquar khan <vaquar.k...@gmail.com> wrote:

Could you please let us know your Spark version?


Regards,
vaquar khan

On Jul 17, 2017 12:18 AM, "163" <hewenting_...@163.com> wrote:

> I change the UDF but the performance seems still slow. What can I do else?
>
>
> On Jul 14, 2017, at 8:34 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> Try to replace your UDF with Spark built-in expressions, it should be as
> simple as `$"x" * (lit(1) - $"y")`.
>
> On 14 Jul 2017, at 5:46 PM, 163 <hewenting_...@163.com> wrote:
>
> I modified the TPCH query5 to use the DataFrame API:
>
> val forders = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/orders")
>   .filter("o_orderdate < '1995-01-01' and o_orderdate >= '1994-01-01'")
>   .select("o_custkey", "o_orderkey")
> val flineitem = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/lineitem")
> val fcustomer = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/customer")
> val fsupplier = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/supplier")
> val fregion = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/region")
>   .where("r_name = 'ASIA'").select($"r_regionkey")
> val fnation = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/nation")
>
> val decrease = udf { (x: Double, y: Double) => x * (1 - y) }
>
> val res =   flineitem.join(forders, $"l_orderkey" === forders("o_orderkey"))
>  .join(fcustomer, $"o_custkey" === fcustomer("c_custkey"))
>  .join(fsupplier, $"l_suppkey" === fsupplier("s_suppkey") && 
> $"c_nationkey" === fsupplier("s_nationkey"))
>  .join(fnation, $"s_nationkey" === fnation("n_nationkey"))
>  .join(fregion, $"n_regionkey" === fregion("r_regionkey"))
>  .select($"n_name", decrease($"l_extendedprice", 
> $"l_discount").as("value"))
>  .groupBy($"n_name")
>  .agg(sum($"value").as("revenue"))
>  .sort($"revenue".desc).show()
>
>
> My environment is one master(Hdfs-namenode), four workers(HDFS-datanode), 
> each with 40 cores and 128GB memory.  TPCH 100G stored on HDFS using parquet 
> format.
>
> It executed in about 1.5m. I found that reading these 6 tables using
> spark.read.parquet is sequential; how can I make this run in parallel?
>
>  I’ve already set data locality and spark.default.parallelism, 
> spark.serializer, using G1, But the runtime  is still not reduced.
>
> And is there any advice for tuning this performance?
>
> Thank you.
>
> Wenting He
>
>
>
>
>

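A minimal sketch of the change Wenchen suggests above: compute the discounted price with
a built-in column expression instead of a Scala UDF, so Catalyst can optimize and
code-generate it. Column names and the parquet path are taken from the query in this
thread; the joins are omitted for brevity.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{lit, sum}

  val spark = SparkSession.builder().appName("tpch-q5-sketch").getOrCreate()
  import spark.implicits._

  val flineitem = spark.read.parquet(
    "hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/lineitem")

  // Instead of: val decrease = udf { (x: Double, y: Double) => x * (1 - y) }
  val value = $"l_extendedprice" * (lit(1) - $"l_discount")

  // Joins with orders/customer/supplier/nation/region stay as in the original query.
  val revenue = flineitem
    .select($"l_orderkey", value.as("value"))
    .groupBy($"l_orderkey")
    .agg(sum($"value").as("revenue"))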

Re: How to tune the performance of Tpch query5 within Spark

2017-07-17 Thread vaquar khan
Could you please let us know your Spark version?


Regards,
vaquar khan

On Jul 17, 2017 12:18 AM, "163" <hewenting_...@163.com> wrote:

> I change the UDF but the performance seems still slow. What can I do else?
>
>
> On Jul 14, 2017, at 8:34 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> Try to replace your UDF with Spark built-in expressions, it should be as
> simple as `$"x" * (lit(1) - $"y")`.
>
> On 14 Jul 2017, at 5:46 PM, 163 <hewenting_...@163.com> wrote:
>
> I modified the TPCH query5 to use the DataFrame API:
>
> val forders = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/orders")
>   .filter("o_orderdate < '1995-01-01' and o_orderdate >= '1994-01-01'")
>   .select("o_custkey", "o_orderkey")
> val flineitem = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/lineitem")
> val fcustomer = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/customer")
> val fsupplier = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/supplier")
> val fregion = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/region")
>   .where("r_name = 'ASIA'").select($"r_regionkey")
> val fnation = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/nation")
>
> val decrease = udf { (x: Double, y: Double) => x * (1 - y) }
>
> val res =   flineitem.join(forders, $"l_orderkey" === forders("o_orderkey"))
>  .join(fcustomer, $"o_custkey" === fcustomer("c_custkey"))
>  .join(fsupplier, $"l_suppkey" === fsupplier("s_suppkey") && 
> $"c_nationkey" === fsupplier("s_nationkey"))
>  .join(fnation, $"s_nationkey" === fnation("n_nationkey"))
>  .join(fregion, $"n_regionkey" === fregion("r_regionkey"))
>  .select($"n_name", decrease($"l_extendedprice", 
> $"l_discount").as("value"))
>  .groupBy($"n_name")
>  .agg(sum($"value").as("revenue"))
>  .sort($"revenue".desc).show()
>
>
> My environment is one master (HDFS namenode) and four workers (HDFS
> datanodes), each with 40 cores and 128 GB of memory. TPC-H 100 GB is stored
> on HDFS in Parquet format.
>
> It executed in about 1.5 minutes. I found that reading these 6 tables with
> spark.read.parquet is sequential; how can I make them run in parallel?
>
> I've already set data locality, spark.default.parallelism, and
> spark.serializer, and I'm using G1, but the runtime is still not reduced.
>
> Is there any other advice for tuning this?
>
> Thank you.
>
> Wenting He
>
>
>
>
>
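
As a concrete illustration of the settings mentioned in the quoted message
(spark.serializer, spark.default.parallelism, G1) and of the knobs covered by
the tuning guide linked in the other reply, a hedged sketch of how they might
be set when building the session. The numeric values are placeholders for a
4-node, 40-core, 128 GB cluster, not recommendations:

import org.apache.spark.sql.SparkSession

// Illustrative values only; the right numbers depend on the cluster and data.
val spark = SparkSession.builder()
  .appName("tpch-q5")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.default.parallelism", "320")     // roughly 2x the 4 x 40 cores, for RDD operations
  .config("spark.sql.shuffle.partitions", "320")  // partitions used by the DataFrame joins/aggregation
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
  .getOrCreate()

Note that the six spark.read.parquet calls only read footers and schema
eagerly; the column scans themselves run inside the triggered job, so their
cost usually shows up in the join stages rather than in the read calls.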


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-07 Thread vaquar khan
+1 non-binding

Regards,
vaquar khan

On Jun 7, 2017 4:32 PM, "Ricardo Almeida" <ricardo.alme...@actnowib.com>
wrote:

+1 (non-binding)

Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive
-Phive-thriftserver -Pscala-2.11 on

   - Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
   - macOS 10.12.5 Java 8 (build 1.8.0_131)


On 5 June 2017 at 21:14, Michael Armbrust <mich...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc4
> <https://github.com/apache/spark/tree/v2.2.0-rc4> (377cfa8ac7ff7a8
> a6a6d273182e18ea7dc25ce7e)
>
> List of JIRA tickets resolved can be found with this filter
> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
> .
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1241/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>


Re: Spark Improvement Proposals

2017-03-09 Thread vaquar khan
Many of us have an issue with the "shepherd" role; I think we should put it to
a vote.

Regards,
Vaquar khan

On Thu, Mar 9, 2017 at 11:00 AM, Reynold Xin <r...@databricks.com> wrote:

> I'm fine without a vote. (are we voting on whether we need a vote?)
>
>
> On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
>> Nah, anyone can call a vote. This really isn't that formal. We just want to
>> declare and document consensus.
>>
>> I think SPIP is just a remix of existing process anyway, and don't think
>> it will actually do much anyway, which is why I am sanguine about the whole
>> thing.
>>
>> To bring this to a conclusion, I will just put the contents of the doc in
>> an email tomorrow for a VOTE. Raise any objections now.
>>
>> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger <c...@koeninger.org> wrote:
>>
>>> I started this idea as a fork with a merge-able change to docs.
>>> Reynold moved it to his google doc, and has suggested during this
>>> email thread that a vote should occur.
>>> If a vote needs to occur, I can't see anything on
>>> http://apache.org/foundation/voting.html suggesting that I can call
>>> for a vote, which is why I'm asking PMC members to do it since they're
>>> the ones who would vote anyway.
>>> Now Sean is saying this is a code/doc change that can just be reviewed
>>> and merged as usual...which is what I tried to do to begin with.
>>>
>>> The fact that you haven't agreed on a process to agree on your process
>>> is, I think, an indication that the process really does need
>>> improvement ;)
>>>
>>>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Spark Improvement Proposals

2017-02-17 Thread vaquar khan
I like the document and am happy to see the SPIP draft version. However, I
feel the shepherd role is again a hurdle to improving the process; it seems
like everything depends only on the shepherd.

I also want to add that an SPIP should be time-bound with a defined SLA,
otherwise it will defeat the purpose.


Regards,
Vaquar khan

On Thu, Feb 16, 2017 at 3:26 PM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> > [The shepherd] can advise on technical and procedural considerations for
> people outside the community
>
> The sentiment is good, but this doesn't justify requiring a shepherd for a
> proposal. There are plenty of people that wouldn't need this, would get
> feedback during discussion, or would ask a committer or PMC member if it
> weren't a formal requirement.
>
> > if no one is willing to be a shepherd, the proposed idea is probably not
> going to receive much traction in the first place.
>
> This also doesn't sound like a reason for needing a shepherd. Saying that
> a shepherd probably won't hurt the process doesn't give me an idea of why a
> shepherd should be required in the first place.
>
> What was the motivation for adding a shepherd originally? It may not be
> bad and it could be helpful, but neither of those makes me think that they
> should be required or else the proposal fails.
>
> rb
>
> On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <timhun...@databricks.com>
> wrote:
>
>> The doc looks good to me.
>>
>> Ryan, the role of the shepherd is to make sure that someone
>> knowledgeable with Spark processes is involved: this person can advise
>> on technical and procedural considerations for people outside the
>> community. Also, if no one is willing to be a shepherd, the proposed
>> idea is probably not going to receive much traction in the first
>> place.
>>
>> Tim
>>
>> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>> > Reynold, thanks, LGTM.
>> >
>> > Sean, great concerns.  I agree that behavior is largely cultural and
>> > writing down a process won't necessarily solve any problems one way or
>> > the other.  But one outwardly visible change I'm hoping for out of
>> > this a way for people who have a stake in Spark, but can't follow
>> > jiras closely, to go to the Spark website, see the list of proposed
>> > major changes, contribute discussion on issues that are relevant to
>> > their needs, and see a clear direction once a vote has passed.  We
>> > don't have that now.
>> >
>> > Ryan, realistically speaking any PMC member can and will stop any
>> > changes they don't like anyway, so might as well be up front about the
>> > reality of the situation.
>> >
>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com> wrote:
>> >> The text seems fine to me. Really, this is not describing a
>> fundamentally
>> >> new process, which is good. We've always had JIRAs, we've always been
>> able
>> >> to call a VOTE for a big question. This just writes down a sensible
>> set of
>> >> guidelines for putting those two together when a major change is
>> proposed. I
>> >> look forward to turning some big JIRAs into a request for a SPIP.
>> >>
>> >> My only hesitation is that this seems to be perceived by some as a new
>> or
>> >> different thing, that is supposed to solve some problems that aren't
>> >> otherwise solvable. I see mentioned problems like: clear process for
>> >> managing work, public communication, more committers, some sort of
>> binding
>> >> outcome and deadline.
>> >>
>> >> If SPIP is supposed to be a way to make people design in public and a
>> way to
>> >> force attention to a particular change, then, this doesn't do that by
>> >> itself. Therefore I don't want to let a detailed discussion of SPIP
>> detract
>> >> from the discussion about doing what SPIP implies. It's just a process
>> >> document.
>> >>
>> >> Still, a fine step IMHO.
>> >>
>> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <r...@databricks.com>
>> wrote:
>> >>>
>> >>> Updated. Any feedback from other community members?
>> >>>
>> >>>
>> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <c...@koeninger.org>
>> >>> wrote:
>> >>>>
>> >>>> Thanks for doing that.
>> >>>>
>> >>>> Given that there are at lea

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-18 Thread vaquar khan
+1 (non-binding)

Regards,
vaquar khan

On Sun, Dec 18, 2016 at 2:33 PM, Adam Roberts <arobe...@uk.ibm.com> wrote:

> +1 (non-binding)
>
> *Functional*: looks good, tested with OpenJDK 8 (1.8.0_111) and IBM's
> latest SDK for Java (8 SR3 FP21).
>
> Tests run clean on Ubuntu 16 04, 14 04, SUSE 12, CentOS 7.2 on x86 and IBM
> specific platforms including big-endian. On slower machines I see these
> failing but nothing to be concerned over (timeouts):
>
> *org.apache.spark.DistributedSuite.caching on disk*
> *org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails
> with informative message*
> *org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by
> current_time, complete mode*
> *org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by
> current_date, complete mode*
> *org.apache.spark.sql.hive.HiveSparkSubmitSuite.set
> hive.metastore.warehouse.dir*
>
> *Performance vs 2.0.2:* lots of improvements seen using the HiBench and
> SparkSqlPerf benchmarks, tested with a 48 core Intel machine using the Kryo
> serializer, controlled test environment. These are all open source
> benchmarks anyone can use and experiment with. Elapsed times measured, *+
> scores* are an improvement (so it's that much percent faster) and *-
> scores* are used for regressions I'm seeing.
>
>- K-means: Java API *+22%* (100 sec to 78 sec), Scala API *+30%* (34
>seconds to 24 seconds), Python API unchanged
>- PageRank: minor improvement from 40 seconds to 38 seconds, *+5%*
>- Sort: minor improvement, 10.8 seconds to 9.8 seconds, *+10%*
>- WordCount: unchanged
>- Bayes: mixed bag, sometimes much slower (95 sec to 140 sec) which is
>*-47%*, other times marginally faster by *15%*, something to keep an
>eye on
>- Terasort: *+18%* (39 seconds to 32 seconds) with the Java/Scala APIs
>
>
> For TPC-DS SQL queries the results are a mixed bag again, I see > 10%
> boosts for q9,  q68, q75, q96 and > 10% slowdowns for q7, q39a, q43, q52,
> q57, q89. Five iterations, average times compared, only changing which
> version of Spark we're using
>
>
>
> From: Holden Karau <hol...@pigscanfly.ca>
> To: Denny Lee <denny.g@gmail.com>, Liwei Lin <lwl...@gmail.com>,
> "dev@spark.apache.org" <dev@spark.apache.org>
> Date: 18/12/2016 20:05
> Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)
> --
>
> +1 (non-binding) - checked Python artifacts with virtual env.
>
> On Sun, Dec 18, 2016 at 11:42 AM Denny Lee <denny.g@gmail.com> wrote:
> +1 (non-binding)
>
> On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin <lwl...@gmail.com> wrote:
> +1
>
> Cheers,
> Liwei
>
> On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang <wgy...@gmail.com> wrote:
> I hope https://github.com/apache/spark/pull/16252 can be fixed before the
> 2.1.0 release. It's a fix for a broadcast that cannot fit in memory.
>
> On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley <jos...@databricks.com> wrote:
> +1
>
> On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier
> <hvanhov...@databricks.com> wrote:
> +1
>
> On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li <gatorsm...@gmail.com> wrote:
> +1
>
> Xiao Li
>
> 2016-12-16 12:19 GMT-08:00 Felix Cheung <felixcheun...@hotmail.com>:
>
> For R we have a license field in the DESCRIPTION, and this is standard
> practice (and requirement) for R packages.
>
> https://cran.r-project.org/doc/manuals/R-exts.html#Licensing
>
> --
> *From:* Sean Owen <so...@cloudera.com>
> *Sent:* Friday, December 16, 2016 9:57:15 AM
> *To:* Reynold Xin; dev@spark.apache.org
> *Subject:* Re: [VOTE] Apache Spark 2.1.0 (RC5)
>
> (If you have a template for these emails, maybe update it to use https
> links. They work for apache.org domains. After all we are asking people
> to verify the integrity of release artifacts, so it might as well be
> secure.)
>
>

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-08 Thread vaquar khan
+1 (non-binding)

On Tue, Nov 8, 2016 at 10:21 PM, Weiqing Yang <yangweiqing...@gmail.com>
wrote:

>  +1 (non binding)
>
>
> Environment: CentOS Linux release 7.0.1406 (Core) / openjdk version
> "1.8.0_111"
>
>
>
> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
> -Dpyspark -Dsparkr -DskipTests clean package
>
> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
> -Dpyspark -Dsparkr test
>
>
>
> On Tue, Nov 8, 2016 at 7:38 PM, Liwei Lin <lwl...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>> Cheers,
>> Liwei
>>
>> On Tue, Nov 8, 2016 at 9:50 PM, Ricardo Almeida <
>> ricardo.alme...@actnowib.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> over Ubuntu 16.10, Java 8 (OpenJDK 1.8.0_111) built with Hadoop 2.7.3,
>>> YARN, Hive
>>>
>>>
>>> On 8 November 2016 at 12:38, Herman van Hövell tot Westerflier <
>>> hvanhov...@databricks.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Nov 8, 2016 at 7:09 AM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and
>>>>> passes if a majority of at least 3+1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.0.2
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b14336
>>>>> 7ba694b0c34)
>>>>>
>>>>> This release candidate resolves 84 issues:
>>>>> https://s.apache.org/spark-2.0.2-jira
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapache
>>>>> spark-1214/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.
>>>>> 2-rc3-docs/
>>>>>
>>>>>
>>>>> Q: How can I help test this release?
>>>>> A: If you are a Spark user, you can help us test this release by
>>>>> taking an existing Spark workload and running on this release candidate,
>>>>> then reporting any regressions from 2.0.1.
>>>>>
>>>>> Q: What justifies a -1 vote for this release?
>>>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>>>> present in 2.0.1, missing features, or bugs related to new features will
>>>>> not necessarily block this release.
>>>>>
>>>>> Q: What fix version should I use for patches merging into branch-2.0
>>>>> from now on?
>>>>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new
>>>>> RC (i.e. RC4) is cut, I will change the fix version of those patches to
>>>>> 2.0.2.
>>>>>
>>>>
>>>>
>>>
>>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread vaquar khan
+1



On Thu, Oct 27, 2016 at 11:56 AM, Davies Liu <dav...@databricks.com> wrote:

> +1
>
> On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin <r...@databricks.com> wrote:
> > Greetings from Spark Summit Europe at Brussels.
> >
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes
> if a
> > majority of at least 3+1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.0.2
> > [ ] -1 Do not release this package because ...
> >
> >
> > The tag to be voted on is v2.0.2-rc1
> > (1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)
> >
> > This release candidate resolves 75 issues:
> > https://s.apache.org/spark-2.0.2-jira
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1208/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
> >
> >
> > Q: How can I help test this release?
> > A: If you are a Spark user, you can help us test this release by taking
> an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions from 2.0.1.
> >
> > Q: What justifies a -1 vote for this release?
> > A: This is a maintenance release in the 2.0.x series. Bugs already
> present
> > in 2.0.1, missing features, or bugs related to new features will not
> > necessarily block this release.
> >
> > Q: What fix version should I use for patches merging into branch-2.0 from
> > now on?
> > A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> > (i.e. RC2) is cut, I will change the fix version of those patches to
> 2.0.2.
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Spark Improvement Proposals

2016-10-08 Thread vaquar khan
+1 for SIP labels; waiting for Reynold's detailed proposal.

Regards,
Vaquar khan

On 8 Oct 2016 16:22, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:

> Sounds good. Just to comment on the compatibility part:
>
> > I meant changing public user interfaces.  I think the first design is
> > unlikely to be right, because it's done at a time when you have the
> > least information.  As a user, I find it considerably more frustrating
> > to be unable to use a tool to get my job done, than I do having to
> > make minor changes to my code in order to take advantage of features.
> > I've seen committers be seriously reluctant to allow changes to
> > @experimental code that are needed in order for it to really work
> > right.  You need to be able to iterate, and if people on both sides of
> > the fence aren't going to respect that some newer apis are subject to
> > change, then why even mark them as such?
> >
> > Ideally a finished SIP should give me a checklist of things that an
> > implementation must do, and things that it doesn't need to do.
> > Contributors/committers should be seriously discouraged from putting
> > out a version 0.1 that doesn't have at least a prototype
> > implementation of all those things, especially if they're then going
> > to argue against interface changes necessary to get the the rest of
> > the things done in the 0.2 version.
>
> Experimental APIs and alpha components are indeed supposed to be
> changeable (https://cwiki.apache.org/confluence/display/SPARK/
> Spark+Versioning+Policy). Maybe people are being too conservative in some
> cases, but I do want to note that regardless of what precise policy we try
> to write down, this type of issue will ultimately be a judgment call. Is it
> worth making a small cosmetic change in an API that's marked experimental,
> but has been used widely for a year? Perhaps not. Is it worth making it in
> something one month old, or even in an older API as we move to 2.0? Maybe
> yes. I think we should just discuss each one (start an email thread if
> resolving it on JIRA is too complex) and perhaps be more religious about
> making things non-experimental when we think they're done.
>
> Matei
>
>
> >
> >
> > On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
> >> I like the lightweight proposal to add a SIP label.
> >>
> >> During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
> >> track the list of major changes, but that never really materialized due
> to
> >> the overhead. Adding a SIP label on major JIRAs and then link to them
> >> prominently on the Spark website makes a lot of sense.
> >>
> >>
> >> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <matei.zaha...@gmail.com
> >
> >> wrote:
> >>>
> >>> For the improvement proposals, I think one major point was to make them
> >>> really visible to users who are not contributors, so we should do more
> than
> >>> sending stuff to dev@. One very lightweight idea is to have a new
> type of
> >>> JIRA called a SIP and have a link to a filter that shows all such
> JIRAs from
> >>> http://spark.apache.org. I also like the idea of SIP and design doc
> >>> templates (in fact many projects have them).
> >>>
> >>> Matei
> >>>
> >>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <r...@databricks.com> wrote:
> >>>
> >>> I called Cody last night and talked about some of the topics in his
> email.
> >>> It became clear to me Cody genuinely cares about the project.
> >>>
> >>> Some of the frustrations come from the success of the project itself
> >>> becoming very "hot", and it is difficult to get clarity from people who
> >>> don't dedicate all their time to Spark. In fact, it is in some ways
> similar
> >>> to scaling an engineering team in a successful startup: old processes
> that
> >>> worked well might not work so well when it gets to a certain size,
> cultures
> >>> can get diluted, building culture vs building process, etc.
> >>>
> >>> I also really like to have a more visible process for larger changes,
> >>> especially major user facing API changes. Historically we upload
> design docs
> >>> for major changes, but it is not always consistent and difficult to
> quality
> >>> of the docs, due to the volunteering nature of the organization.
> >>>
> >>> Some of the more concrete ideas we disc

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread vaquar khan
+1 (non-binding)
Regards,
Vaquar  khan

On 29 Sep 2016 23:00, "Denny Lee" <denny.g@gmail.com> wrote:

> +1 (non-binding)
>
> On Thu, Sep 29, 2016 at 9:43 PM Jeff Zhang <zjf...@gmail.com> wrote:
>
>> +1
>>
>> On Fri, Sep 30, 2016 at 9:27 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>
>>> +1
>>>
>>> On Sep 29, 2016 4:33 PM, "Kyle Kelley" <rgb...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Sep 29, 2016 at 4:27 PM, Yin Huai <yh...@databricks.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Thu, Sep 29, 2016 at 4:07 PM, Luciano Resende <luckbr1...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and
>>>>>>> passes if a majority of at least 3+1 PMC votes are cast.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 2.0.1
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>>
>>>>>>> The tag to be voted on is v2.0.1-rc4 (933d2c1ea4e5f5c4ec8d375b5ccaa4
>>>>>>> 577ba4be38)
>>>>>>>
>>>>>>> This release candidate resolves 301 issues:
>>>>>>> https://s.apache.org/spark-2.0.1-jira
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>>> at:
>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-
>>>>>>> 2.0.1-rc4-bin/
>>>>>>>
>>>>>>> Release artifacts are signed with the following key:
>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>
>>>>>>> The staging repository for this release can be found at:
>>>>>>> https://repository.apache.org/content/repositories/
>>>>>>> orgapachespark-1203/
>>>>>>>
>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-
>>>>>>> 2.0.1-rc4-docs/
>>>>>>>
>>>>>>>
>>>>>>> Q: How can I help test this release?
>>>>>>> A: If you are a Spark user, you can help us test this release by
>>>>>>> taking an existing Spark workload and running on this release candidate,
>>>>>>> then reporting any regressions from 2.0.0.
>>>>>>>
>>>>>>> Q: What justifies a -1 vote for this release?
>>>>>>> A: This is a maintenance release in the 2.0.x series.  Bugs already
>>>>>>> present in 2.0.0, missing features, or bugs related to new features will
>>>>>>> not necessarily block this release.
>>>>>>>
>>>>>>> Q: What fix version should I use for patches merging into branch-2.0
>>>>>>> from now on?
>>>>>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new
>>>>>>> RC (i.e. RC5) is cut, I will change the fix version of those patches to
>>>>>>> 2.0.1.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Luciano Resende
>>>>>> http://twitter.com/lresende1975
>>>>>> http://lresende.blogspot.com/
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Kyle Kelley (@rgbkrk <https://twitter.com/rgbkrk>; lambdaops.com)
>>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-25 Thread vaquar khan
+1 (non-binding)

Regards,
Vaquar khan

On 25 Sep 2016 20:41, "Kousuke Saruta" <saru...@oss.nttdata.co.jp> wrote:

> +1 (non-binding)
>
> On 2016年09月26日 07:26, Herman van Hövell tot Westerflier wrote:
>
> +1 (non-binding)
>
> On Sun, Sep 25, 2016 at 2:05 PM, Ricardo Almeida <
> ricardo.alme...@actnowib.com> wrote:
>
>> +1 (non-binding)
>>
>> Built and tested on
>> - Ubuntu 16.04 / OpenJDK 1.8.0_91
>> - CentOS / Oracle Java 1.7.0_55
>> (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -Pyarn)
>>
>>
>> On 25 September 2016 at 22:35, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> +1
>>>
>>> Matei
>>>
>>> On Sep 25, 2016, at 1:25 PM, Josh Rosen <joshro...@databricks.com>
>>> wrote:
>>>
>>> +1
>>>
>>> On Sun, Sep 25, 2016 at 1:16 PM Yin Huai <yh...@databricks.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun <dongj...@apache.org>
>>>> wrote:
>>>>
>>>>> +1 (non binding)
>>>>>
>>>>> RC3 is compiled and tested on the following two systems, too. All
>>>>> tests passed.
>>>>>
>>>>> * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
>>>>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>>>>> -Dsparkr
>>>>> * CentOS 7.2 / Open JDK 1.8.0_102
>>>>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>>>>>
>>>>> Cheers,
>>>>> Dongjoon
>>>>>
>>>>>
>>>>>
>>>>> On Saturday, September 24, 2016, Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>> version 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and
>>>>>> passes if a majority of at least 3+1 PMC votes are cast.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 2.0.1
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>>
>>>>>> The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5
>>>>>> c2cc2730d17)
>>>>>>
>>>>>> This release candidate resolves 290 issues:
>>>>>> https://s.apache.org/spark-2.0.1-jira
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.
>>>>>> 1-rc3-bin/
>>>>>>
>>>>>> Release artifacts are signed with the following key:
>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapache
>>>>>> spark-1201/
>>>>>>
>>>>>> The documentation corresponding to this release can be found at:
>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.
>>>>>> 1-rc3-docs/
>>>>>>
>>>>>>
>>>>>> Q: How can I help test this release?
>>>>>> A: If you are a Spark user, you can help us test this release by
>>>>>> taking an existing Spark workload and running on this release candidate,
>>>>>> then reporting any regressions from 2.0.0.
>>>>>>
>>>>>> Q: What justifies a -1 vote for this release?
>>>>>> A: This is a maintenance release in the 2.0.x series.  Bugs already
>>>>>> present in 2.0.0, missing features, or bugs related to new features will
>>>>>> not necessarily block this release.
>>>>>>
>>>>>> Q: What fix version should I use for patches merging into branch-2.0
>>>>>> from now on?
>>>>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new
>>>>>> RC (i.e. RC4) is cut, I will change the fix version of those patches to
>>>>>> 2.0.1.
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-23 Thread vaquar khan
+1 non binding
No issue found.
Regards,
Vaquar khan

On 23 Sep 2016 17:25, "Mark Hamstra" <m...@clearstorydata.com> wrote:

Similar but not identical configuration (Java 8/macOs 10.12 with build/mvn
-Phive -Phive-thriftserver -Phadoop-2.7 -Pyarn clean install);
Similar but not identical failure:

...

- line wrapper only initialized once when used as encoder outer scope

Spark context available as 'sc' (master = local-cluster[1,1,1024], app id =
app-20160923150640-).

Spark session available as 'spark'.

Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError:
GC overhead limit exceeded

Exception in thread "dispatcher-event-loop-7" java.lang.OutOfMemoryError:
GC overhead limit exceeded

- define case class and create Dataset together with paste mode

java.lang.OutOfMemoryError: GC overhead limit exceeded

- should clone and clean line object in ClosureCleaner *** FAILED ***

  java.util.concurrent.TimeoutException: Futures timed out after [10
minutes]

...


On Fri, Sep 23, 2016 at 3:08 PM, Sean Owen <so...@cloudera.com> wrote:

> +1 Signatures and hashes check out. I checked that the Kinesis
> assembly artifacts are not present.
>
> I compiled and tested on Java 8 / Ubuntu 16 with -Pyarn -Phive
> -Phive-thriftserver -Phadoop-2.7 -Psparkr and only saw one test
> problem. This test never completed. If nobody else sees it, +1,
> assuming it's a bad test or env issue.
>
> - should clone and clean line object in ClosureCleaner *** FAILED ***
>   isContain was true Interpreter output contained 'Exception':
>   Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
> /_/
>
>   Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
>   Type in expressions to have them evaluated.
>   Type :help for more information.
>
>   scala> // Entering paste mode (ctrl-D to finish)
>
>
>   // Exiting paste mode, now interpreting.
>
>   org.apache.spark.SparkException: Job 0 cancelled because
> SparkContext was shut down
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfte
> rSchedulerStop$1.apply(DAGScheduler.scala:818)
> ...
>
>
> On Fri, Sep 23, 2016 at 7:01 AM, Reynold Xin <r...@databricks.com> wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.0.1. The vote is open until Sunday, Sep 25, 2016 at 23:59 PDT and
> passes
> > if a majority of at least 3+1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.0.1
> > [ ] -1 Do not release this package because ...
> >
> >
> > The tag to be voted on is v2.0.1-rc2
> > (04141ad49806a48afccc236b699827997142bd57)
> >
> > This release candidate resolves 284 issues:
> > https://s.apache.org/spark-2.0.1-jira
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc2-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1199
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc2-docs/
> >
> >
> > Q: How can I help test this release?
> > A: If you are a Spark user, you can help us test this release by taking
> an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions from 2.0.0.
> >
> > Q: What justifies a -1 vote for this release?
> > A: This is a maintenance release in the 2.0.x series.  Bugs already
> present
> > in 2.0.0, missing features, or bugs related to new features will not
> > necessarily block this release.
> >
> > Q: What happened to 2.0.1 RC1?
> > A: There was an issue with RC1 R documentation during release candidate
> > preparation. As a result, rc1 was canceled before a vote was called.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: renaming "minor release" to "feature release"

2016-07-28 Thread vaquar khan
+1
Though the following is the commonly used standard for releases
(http://semver.org/), "feature" also looks good, since a minor release
indicates that significant features have been added:

   1. MAJOR version when you make incompatible API changes,
   2. MINOR version when you add functionality in a backwards-compatible
   manner, and
   3. PATCH version when you make backwards-compatible bug fixes.


Apart from replacing the word "minor" with "feature", there are no other
changes to the versioning policy.

regards,
Vaquar khan

On Thu, Jul 28, 2016 at 6:20 PM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> I also agree with this given the way we develop stuff. We don't really
> want to move to possibly-API-breaking major releases super often, but we do
> have lots of large features that come out all the time, and our current
> name doesn't convey that.
>
> Matei
>
> On Jul 28, 2016, at 4:15 PM, Reynold Xin <r...@databricks.com> wrote:
>
> Yea definitely. Those are consistent with what is defined here:
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Versioning+Policy
>
> The only change I'm proposing is replacing "minor" with "feature".
>
>
> On Thu, Jul 28, 2016 at 4:10 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Although 'minor' is the standard term, the important thing is making
>> the nature of the release understood. 'feature release' seems OK to me
>> as an additional description.
>>
>> Is it worth agreeing on or stating a little more about the theory?
>>
>> patch release: backwards/forwards compatible within a minor release,
>> generally fixes only
>> minor/feature release: backwards compatible within a major release,
>> not forward; generally also includes new features
>> major release: not backwards compatible and may remove or change
>> existing features
>>
>> On Thu, Jul 28, 2016 at 3:46 PM, Reynold Xin <r...@databricks.com> wrote:
>> > tl;dr
>> >
>> > I would like to propose renaming “minor release” to “feature release” in
>> > Apache Spark.
>> >
>> >
>> > details
>> >
>> > Apache Spark’s official versioning policy follows roughly semantic
>> > versioning. Each Spark release is versioned as
>> > [major].[minor].[maintenance]. That is to say, 1.0.0 and 2.0.0 are both
>> > “major releases”, whereas “1.1.0” and “1.3.0” would be minor releases.
>> >
>> > I have gotten a lot of feedback from users that the word “minor” is
>> > confusing and does not accurately describes those releases. When users
>> hear
>> > the word “minor”, they think it is a small update that introduces couple
>> > minor features and some bug fixes. But if you look at the history of
>> Spark
>> > 1.x, here are just a subset of large features added:
>> >
>> > Spark 1.1: sort-based shuffle, JDBC/ODBC server, new stats library, 2-5X
>> > perf improvement for machine learning.
>> >
>> > Spark 1.2: HA for streaming, new network module, Python API for
>> streaming,
>> > ML pipelines, data source API.
>> >
>> > Spark 1.3: DataFrame API, Spark SQL graduate out of alpha, tons of new
>> > algorithms in machine learning.
>> >
>> > Spark 1.4: SparkR, Python 3 support, DAG viz, robust joins in SQL, math
>> > functions, window functions, SQL analytic functions, Python API for
>> > pipelines.
>> >
>> > Spark 1.5: code generation, Project Tungsten
>> >
>> > Spark 1.6: automatic memory management, Dataset API, ML pipeline
>> persistence
>> >
>> >
>> > So while “minor” is an accurate depiction of the releases from an API
>> > compatibility point of view, we are miscommunicating and doing Spark a
>> > disservice by calling these releases “minor”. I would actually call
>> these
>> > releases “major”, but then it would be a larger deviation from semantic
>> > versioning. I think calling these “feature releases” would be a smaller
>> > change and a more accurate depiction of what they are.
>> >
>> > That said, I’m not attached to the name “feature” and am open to
>> > suggestions, as long as they don’t convey the notion of “minor”.
>> >
>> >
>>
>
>
>


-- 
Regards,
Vaquar Khan
+91 830-851-1500
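
As a worked illustration of the MAJOR/MINOR/PATCH rules quoted at the top of
this thread (a sketch only; the helper below is hypothetical and not part of
any Spark tooling):

// Classify a version bump per the semver-style rules above.
def releaseType(from: String, to: String): String = {
  val Array(fromMajor, fromMinor, _) = from.split("\\.").map(_.toInt)
  val Array(toMajor, toMinor, _) = to.split("\\.").map(_.toInt)
  if (toMajor > fromMajor) "major release"         // may break APIs
  else if (toMinor > fromMinor) "feature release"  // backwards-compatible new functionality
  else "patch release"                             // backwards-compatible bug fixes only
}

// releaseType("1.5.2", "1.6.0") => "feature release"
// releaseType("1.6.3", "2.0.0") => "major release"
// releaseType("2.0.0", "2.0.1") => "patch release"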


Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-23 Thread vaquar khan
+1 (non-binding)

Regards,
Vaquar khan
On 23 Jun 2016 07:50, "Sean Owen" <so...@cloudera.com> wrote:

> I don't think that qualifies as a blocker; not even clear it's a
> regression. Even non-binding votes here should focus on whether this
> is OK to release as a maintenance update to 1.6.1.
>
> On Thu, Jun 23, 2016 at 1:45 PM, Maciej Bryński <mac...@brynski.pl> wrote:
> > -1
> >
> > I need SPARK-13283 to be solved.
> >
> > Regards,
> > Maciek Bryński
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-25 Thread vaquar khan
+1
On 24 Dec 2015 22:01, "Vinay Shukla"  wrote:

> +1
> Tested on HDP 2.3, YARN cluster mode, spark-shell
>
> On Wed, Dec 23, 2015 at 6:14 AM, Allen Zhang 
> wrote:
>
>>
>> +1 (non-binding)
>>
>> I have just tarball a new binary and tested am.nodelabelexpression and
>> executor.nodelabelexpression manully, result is expected.
>>
>>
>>
>>
>> At 2015-12-23 21:44:08, "Iulian Dragoș" 
>> wrote:
>>
>> +1 (non-binding)
>>
>> Tested Mesos deployments (client and cluster-mode, fine-grained and
>> coarse-grained). Things look good
>> .
>>
>> iulian
>>
>> On Wed, Dec 23, 2015 at 2:35 PM, Sean Owen  wrote:
>>
>>> Docker integration tests still fail for Mark and I, and should
>>> probably be disabled:
>>> https://issues.apache.org/jira/browse/SPARK-12426
>>>
>>> ... but if anyone else successfully runs these (and I assume Jenkins
>>> does) then not a blocker.
>>>
>>> I'm having intermittent trouble with other tests passing, but nothing
>>> unusual.
>>> Sigs and hashes are OK.
>>>
>>> We have 30 issues fixed for 1.6.1. All but those resolved in the last
>>> 24 hours or so should be fixed for 1.6.0 right? I can touch that up.
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Dec 22, 2015 at 8:10 PM, Michael Armbrust
>>>  wrote:
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version
>>> > 1.6.0!
>>> >
>>> > The vote is open until Friday, December 25, 2015 at 18:00 UTC and
>>> passes if
>>> > a majority of at least 3 +1 PMC votes are cast.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 1.6.0
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v1.6.0-rc4
>>> > (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>>> >
>>> > Release artifacts are signed with the following key:
>>> > https://people.apache.org/keys/committer/pwendell.asc
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1176/
>>> >
>>> > The test repository (versioned as v1.6.0-rc4) for this release can be
>>> found
>>> > at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1175/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> >
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>>> >
>>> > ===
>>> > == How can I help test this release? ==
>>> > ===
>>> > If you are a Spark user, you can help us test this release by taking an
>>> > existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > 
>>> > == What justifies a -1 vote for this release? ==
>>> > 
>>> > This vote is happening towards the end of the 1.6 QA period, so -1
>>> votes
>>> > should only occur for significant regressions from 1.5. Bugs already
>>> present
>>> > in 1.5, minor regressions, or bugs related to new features will not
>>> block
>>> > this release.
>>> >
>>> > ===
>>> > == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> > ===
>>> > 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> > branch-1.6, since documentations will be published separately from the
>>> > release.
>>> > 2. New features for non-alpha-modules should target 1.7+.
>>> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target
>>> > version.
>>> >
>>> >
>>> > ==
>>> > == Major changes to help you focus your testing ==
>>> > ==
>>> >
>>> > Notable changes since 1.6 RC3
>>> >
>>> >
>>> >   - SPARK-12404 - Fix serialization error for Datasets with
>>> > Timestamps/Arrays/Decimal
>>> >   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
>>> >   - SPARK-12395 - Fix join columns of outer join for DataFrame using
>>> >   - SPARK-12413 - Fix mesos HA
>>> >
>>> >
>>> > Notable changes since 1.6 RC2
>>> >
>>> >
>>> > - SPARK_VERSION has been set correctly
>>> > - SPARK-12199 ML Docs are publishing correctly
>>> > - SPARK-12345 Mesos cluster mode has been fixed
>>> >
>>> > Notable changes since 1.6 RC1
>>> >
>>> > Spark Streaming
>>> >
>>> > SPARK-2629  trackStateByKey has been renamed to mapWithState
>>> >
>>> > Spark SQL
>>> >
>>> > SPARK-12165 SPARK-12189 Fix bugs in eviction of 

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-25 Thread vaquar khan
+1 (non-binding)

Regards,
Vaquar khan
On 25 Sep 2015 18:28, "Eugene Zhulenev" <eugene.zhule...@gmail.com> wrote:

> +1
>
> Running latest build from 1.5 branch, SO much more stable than 1.5.0
> release.
>
> On Fri, Sep 25, 2015 at 8:55 AM, Doug Balog <doug.spark...@dugos.com>
> wrote:
>
>> +1 (non-binding)
>>
>> Tested on secure YARN cluster with HIVE.
>>
>> Notes:  SPARK-10422, SPARK-10737 were causing us problems with 1.5.0. We
>> see 1.5.1 as a big improvement.
>>
>> Cheers,
>>
>> Doug
>>
>>
>> > On Sep 24, 2015, at 3:27 AM, Reynold Xin <r...@databricks.com> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.5.1
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > The release fixes 81 known issues in Spark 1.5.0, listed here:
>> > http://s.apache.org/spark-1.5.1
>> >
>> > The tag to be voted on is v1.5.1-rc1:
>> >
>> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release (1.5.1) can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1148/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>> >
>> >
>> > ===
>> > How can I help test this release?
>> > ===
>> > If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>> >
>> > 
>> > What justifies a -1 vote for this release?
>> > 
>> > -1 vote should occur for regressions from Spark 1.5.0. Bugs already
>> present in 1.5.0 will not block this release.
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 1.5.1?
>> > ===
>> > Please target 1.5.2 or 1.6.0.
>> >
>> >
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-29 Thread vaquar khan
+1 (1.5.0 RC2)
Compiled on Windows with YARN.

Regards,
Vaquar khan
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Laso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql(SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID) OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql(SELECT * from people WHERE State = 'WA') OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
(--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
com.databricks:spark-csv_2.11:1.2.0 worked)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. joins,sql,set operations,udf OK

Cheers
k/

On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin r...@databricks.com wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.5.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/


 The tag to be voted on is v1.5.0-rc2:

 https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release (published as 1.5.0-rc2) can be
 found at:
 https://repository.apache.org/content/repositories/orgapachespark-1141/

 The staging repository for this release (published as 1.5.0) can be found
 at:
 https://repository.apache.org/content/repositories/orgapachespark-1140/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/


 ===
 How can I help test this release?
 ===
 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.


 
 What justifies a -1 vote for this release?
 
 This vote is happening towards the end of the 1.5 QA period, so -1 votes
 should only occur for significant regressions from 1.4. Bugs already
 present in 1.4, minor regressions, or bugs related to new features will not
 block this release.


 ===
 What should happen to JIRA tickets still targeting 1.5.0?
 ===
 1. It is OK for documentation patches to target 1.5.0 and still go into
 branch-1.5, since documentations will be packaged separately from the
 release.
 2. New features for non-alpha-modules should target 1.6+.
 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
 version.


 ==
 Major changes to help you focus your testing
 ==

 As of today, Spark 1.5 contains more than 1000 commits from 220+
 contributors. I've curated a list of important changes for 1.5. For the
 complete list, please refer to Apache JIRA changelog.

 RDD/DataFrame/SQL APIs

 - New UDAF interface
 - DataFrame hints for broadcast join
 - expr function for turning a SQL expression into DataFrame column
 - Improved support for NaN values
 - StructType now supports ordering
 - TimestampType precision is reduced to 1us
 - 100 new built-in expressions, including date/time, string, math
 - memory and local disk only checkpointing

 DataFrame/SQL Backend Execution

 - Code generation on by default
 - Improved join, aggregation, shuffle, sorting with cache friendly
 algorithms and external algorithms
 - Improved window function performance
 - Better metrics instrumentation and reporting for DF/SQL execution plans

Re: [VOTE] Release Apache Spark 1.4.1

2015-07-02 Thread vaquar khan
+1
On 2 Jul 2015 18:03, shenyan zhen shenya...@gmail.com wrote:

 +1
 On Jun 30, 2015 8:28 PM, Reynold Xin r...@databricks.com wrote:

 +1

 On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.4.1!

 This release fixes a handful of known issues in Spark 1.4.0, listed here:
 http://s.apache.org/spark-1.4.1

 The tag to be voted on is v1.4.1-rc1 (commit 60e08e5):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 60e08e50751fe3929156de956d62faea79f5b801

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.1]
 https://repository.apache.org/content/repositories/orgapachespark-1118/
 [published as version: 1.4.1-rc1]
 https://repository.apache.org/content/repositories/orgapachespark-1119/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.4.1!

 The vote is open until Saturday, June 27, at 06:32 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: Contribution in java

2014-12-20 Thread vaquar khan
Hi Sreenivas,

Please read the Spark documentation first; everything is mentioned in the
docs. Without reading the docs, how can you contribute?

regards,
vaquar khan

On Sat, Dec 20, 2014 at 6:00 PM, sreenivas putta putta.sreeni...@gmail.com
wrote:

 Hi,

 I want to contribute to Spark in Java. Does it support Java? Please let me
 know.

 Thanks,
 Sreenivas




-- 
Regards,
Vaquar Khan
+91 830-851-1500


Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-29 Thread vaquar khan
+1
1 Compiled binaries
2 All Tests Pass

Regards,
Vaquar khan
On 30 Nov 2014 04:21, Krishna Sankar ksanka...@gmail.com wrote:

 +1
 1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4
 -Dhadoop.version=2.4.0 -DskipTests clean package 16:46 min (slightly slower
 connection)
 2. Tested pyspark, mllib - running as well as comparing results with 1.1.x
 2.1. statistics OK
 2.2. Linear/Ridge/Laso Regression OK
Slight difference in the print method (vs. 1.1.x) of the model
 object - with a label  more details. This is good.
 2.3. Decision Tree, Naive Bayes OK
Changes in print(model) - now print (model.ToDebugString()) - OK
Some changes in NaiveBayes. Different from my 1.1.x code - had to
 flatten list structures, zip required same number in partitions
After code changes ran fine.
 2.4. KMeans OK
zip occasionally fails with error localhost):
 org.apache.spark.SparkException: Can only zip RDDs with same number of
 elements in each partition
 Has https://issues.apache.org/jira/browse/SPARK-2251 reappeared ?
 Made it work by doing a different transformation ie reusing an original
 rdd.
 2.5. rdd operations OK
State of the Union Texts - MapReduce, Filter,sortByKey (word count)
 2.6. recommendation OK
 2.7. Good work ! In 1.x.x, had a map distinct over the movielens medium
 dataset which never worked. Works fine in 1.2.0 !
 3. Scala Mlib - subset of examples as in #2 above, with Scala
 3.1. statistics OK
 3.2. Linear Regression OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 Cheers
 k/
 P.S: Plan to add RF and .ml mechanics to this bank

 On Fri, Nov 28, 2014 at 9:16 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Please vote on releasing the following candidate as Apache Spark version
  1.2.0!
 
  The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.2.0-rc1/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1048/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
 
  Please vote on releasing this package as Apache Spark 1.2.0!
 
  The vote is open until Tuesday, December 02, at 05:15 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.1.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == What justifies a -1 vote for this release? ==
  This vote is happening very late into the QA period compared with
  previous votes, so -1 votes should only occur for significant
  regressions from 1.0.2. Bugs already present in 1.1.X, minor
  regressions, or bugs related to new features will not block this
  release.
 
  == What default changes should I be aware of? ==
  1. The default value of spark.shuffle.blockTransferService has been
  changed to netty
  -- Old behavior can be restored by switching to nio
 
  2. The default value of spark.shuffle.manager has been changed to
 sort.
  -- Old behavior can be restored by setting spark.shuffle.manager to
  hash.
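  
   For illustration only, one way to pin the old values from application code,
   using the two configuration keys named above (class name, app name and
   master are placeholders; the same keys can also be set in
   spark-defaults.conf or via --conf on spark-submit):
  
      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaSparkContext;
  
      public class LegacyShuffleDefaults {
        public static void main(String[] args) {
          SparkConf conf = new SparkConf()
              .setAppName("LegacyShuffleDefaults")
              .setMaster("local[2]")
              // Restore the pre-1.2 defaults described above.
              .set("spark.shuffle.blockTransferService", "nio")  // new default: netty
              .set("spark.shuffle.manager", "hash");             // new default: sort
  
          JavaSparkContext sc = new JavaSparkContext(conf);
          // ... job code ...
          sc.stop();
        }
      }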
 
  == Other notes ==
  Because this vote is occurring over a weekend, I will likely extend
  the vote if this RC survives until the end of the vote period.
 
  - Patrick
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-24 Thread vaquar khan
+1 Release this package as Apache Spark 1.1.1
On 20 Nov 2014 04:22, Andrew Or and...@databricks.com wrote:

 I will start with a +1

 2014-11-19 14:51 GMT-08:00 Andrew Or and...@databricks.com:

  Please vote on releasing the following candidate as Apache Spark version
  1.1.1.
 
  This release fixes a number of bugs in Spark 1.1.0. Some of the notable
  ones are
  - [SPARK-3426] Sort-based shuffle compression settings are incompatible
  - [SPARK-3948] Stream corruption issues in sort-based shuffle
  - [SPARK-4107] Incorrect handling of Channel.read() led to data
 truncation
  The full list is at http://s.apache.org/z9h and in the CHANGES.txt
  attached.
 
  Additionally, this candidate fixes two blockers from the previous RC:
  - [SPARK-4434] Cluster mode jar URLs are broken
  - [SPARK-4480][SPARK-4467] Too many open files exception from shuffle
  spills
 
  The tag to be voted on is v1.1.1-rc2 (commit 3693ae5d):
  http://s.apache.org/p8
 
  The release files, including signatures, digests, etc can be found at:
  http://people.apache.org/~andrewor14/spark-1.1.1-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/andrewor14.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1043/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~andrewor14/spark-1.1.1-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.1.1!
 
  The vote is open until Saturday, November 22, at 23:00 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
  [ ] +1 Release this package as Apache Spark 1.1.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  Cheers,
  Andrew
 



Re: [VOTE] Designating maintainers for some Spark components

2014-11-07 Thread vaquar khan
+1 (binding)
On 8 Nov 2014 07:26, Davies Liu dav...@databricks.com wrote:

 Sorry for my last email; I misunderstood the proposal here. All the
 committers still have an equal -1 on all code changes.

 Also, as mentioned in the proposal, the sign-off only applies to
 public APIs and architecture; things like code-style discussions
 stay the same as before.

 So, I'd revert my vote to +1. Sorry for this.

 Davies


 On Fri, Nov 7, 2014 at 3:18 PM, Davies Liu dav...@databricks.com wrote:
  -1 (not binding, +1 for maintainer, -1 for sign off)
 
  Agree with Greg and Vinod. In the beginning, everything is better
  (more efficient, more focused), but after some time the fighting begins.
 
  Code style is the hottest topic to fight over (we have already seen it in
  some PRs). If two committers (one of them a maintainer) cannot agree on
  code style, then before this process they would ask other committers for
  comments; after this process, the maintainer holds the higher-priority -1,
  so the maintainer will keep his/her personal preference and it is hard to
  reach an agreement. Eventually, different components will end up with
  different code styles (among other things).
 
  Right now, maintainers act as the first or best contacts, the people best
  placed to review a PR in that component. We could announce that, so new
  contributors can easily find the right reviewer.
 
  My 2 cents.
 
  Davies
 
 
  On Thu, Nov 6, 2014 at 11:43 PM, Vinod Kumar Vavilapalli
  vino...@apache.org wrote:
  With the maintainer model, the process is as follows:
 
  - Any committer could review the patch and merge it, but they would
 need to forward it to me (or another core API maintainer) to make sure we
 also approve
  - At any point during this process, I could come in and -1 it, or give
 feedback
  - In addition, any other committer beyond me is still allowed to -1
 this patch
 
  The only change in this model is that committers are responsible to
 forward patches in these areas to certain other committers. If every
 committer had perfect oversight of the project, they could have also seen
 every patch to their component on their own, but this list ensures that
 they see it even if they somehow overlooked it.
 
 
  Having done the job of playing an informal 'maintainer' of a project
 myself, this is what I think you really need:
 
  The so called 'maintainers' do one of the below
   - Actively poll the lists and watch over contributions. And follow
 what is repeated often around here: Trust but verify.
   - Setup automated mechanisms to send all bug-tracker updates of a
 specific component to a list that people can subscribe to
 
  And/or
   - Individual contributors send review requests to unofficial
 'maintainers' over dev-lists or through tools. Like many projects do with
 review boards and other tools.
 
  Note that none of the above is a required step. It must not be, that's
 the point. But once set as a convention, they will all help you address
 your concerns with project scalability.
 
  Anything else that you add is bestowing privileges on a select few and
  forming dictatorships. And contrary to what the proposal claims, this is
  neither scalable nor conforming to Apache governance rules.
 
  +Vinod

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org