[VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-07 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if a majority of at least 3+1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.2 [ ] -1 Do not release this package because ... The

Re: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now. On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote: > Oops. Let me try figure that out. > > > On Monday, Nove

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-04 Thread Reynold Xin
I will cut a new one once https://github.com/apache/spark/pull/15774 gets in. On Fri, Nov 4, 2016 at 11:44 AM, Sean Owen wrote: > I guess it's worth explicitly stating that I think we need another RC one > way or the other because this test seems to consistently fail. It

[VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-05 Thread Reynold Xin
The vote has passed with the following +1 votes and no -1 votes. +1 Reynold Xin* Herman van Hövell tot Westerflier Yin Huai* Davies Liu Dongjoon Hyun Jeff Zhang Liwei Lin Kousuke Saruta Joseph Bradley* Sean Owen* Ricardo Almeida Weiqing Yang * = binding I will work on packaging the release

Re: Odp.: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
ust "involve". SIPs should also have at least two emails that go to dev@. While I was editing this, I thought we really needed a suggested template for design doc too. I will get to that too ... On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote: > Most

Re: Handling questions in the mailing lists

2016-11-06 Thread Reynold Xin
nly Massg and me being able to directly close duplicates). > > Believe me, I've seen this before. > On 11/07/2016 05:08 AM, Reynold Xin wrote: > > You have substantially underestimated how opinionated people can be on > mailing lists too :) > > On Sunday, November 6, 2016, Maciej Sz

Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-19 Thread Reynold Xin
For the contributing guide I think it makes more sense to put it in apache/spark github, since that's where contributors start. I'd also link to it from the website ... On Tue, Oct 18, 2016 at 10:03 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > +1 - Given that our website is

Re: `Project` not preserving child partitioning ?

2016-10-12 Thread Reynold Xin
It actually does -- but does it in a really weird way. UnaryExecNode actually defines: trait UnaryExecNode extends SparkPlan { def child: SparkPlan override final def children: Seq[SparkPlan] = child :: Nil override def outputPartitioning: Partitioning = child.outputPartitioning } I
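For readability, here is the trait quoted above laid out as multi-line Scala -- a sketch that mirrors the snippet in the message, not necessarily the exact Spark source:

```scala
import org.apache.spark.sql.catalyst.plans.physical.Partitioning
import org.apache.spark.sql.execution.SparkPlan

// Sketch of UnaryExecNode as quoted in the message above.
trait UnaryExecNode extends SparkPlan {
  def child: SparkPlan

  // A unary node has exactly one child.
  override final def children: Seq[SparkPlan] = child :: Nil

  // The child's partitioning is passed straight through, which is how
  // Project-like operators end up preserving partitioning.
  override def outputPartitioning: Partitioning = child.outputPartitioning
}
```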

cutting 1.6.3 release candidate

2016-10-14 Thread Reynold Xin
It's been a while and we have fixed a few bugs in branch-1.6. I plan to cut rc1 for 1.6.3 next week (just in time for Spark Summit Europe). Let me know if there are specific issues that should be addressed before that. Thanks.

Re: cutting 1.6.3 release candidate

2016-10-14 Thread Reynold Xin
nd spark-1.6.x BUT spark-1.6 fix was NOT > merged - https://github.com/apache/spark/pull/13027 > > Is it possible to include the fix to spark-1.6.3? > > > Thank you > Alex > > > On Fri, Oct 14, 2016 at 1:39 PM, Reynold Xin <r...@databricks.com> wrote: > >

Re: On convenience methods

2016-10-14 Thread Reynold Xin
It is very difficult to give a general answer. We would need to discuss each case. In general, for things that are trivially doable using existing APIs, it is not a good idea to provide them, unless it is for compatibility with other frameworks (e.g. Pandas). On Fri, Oct 14, 2016 at 5:38 PM, roehst

[VOTE] Release Apache Spark 1.6.3 (RC1)

2016-10-17 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.6.3. The vote is open until Thursday, Oct 20, 2016 at 18:00 PDT and passes if a majority of at least 3+1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.6.3 [ ] -1 Do not release this package because ...

Re: collect_list alternative for SQLContext?

2016-10-25 Thread Reynold Xin
This shouldn't be required anymore since Spark 2.0. On Tue, Oct 25, 2016 at 6:16 AM, Matt Smith wrote: > Is there an alternative function or design pattern for the collect_list > UDAF that can used without taking a dependency on HiveContext? How does > one typically
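A minimal spark-shell style illustration (column names and data are made up) of collect_list as a built-in aggregate, with no HiveContext involved:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().master("local[*]").appName("collect-list").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

// In Spark 2.0+ collect_list is a native aggregate function.
df.groupBy("key").agg(collect_list("value").as("values")).show()
```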

[PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Reynold Xin
FYI - Xiangrui submitted an amazing pull request to fix a long standing issue with a lot of the nondeterministic expressions (rand, randn, monotonically_increasing_id): https://github.com/apache/spark/pull/15567 Prior to this PR, we were using TaskContext.partitionId as the partition index in
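A small sketch (spark-shell style, with an existing SparkContext `sc`) of obtaining the logical partition index through mapPartitionsWithIndex, which is unaffected by the TaskContext.partitionId caveat described above:

```scala
val rdd = sc.parallelize(1 to 8, numSlices = 4)

// `index` is the logical partition index of this RDD; TaskContext.partitionId
// is the task's partition id within its stage and, per the PSA above, may differ.
val tagged = rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => (index, x))
}
tagged.collect().foreach(println)
```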

Re: [PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Reynold Xin
of work around it with empty foreach after the map, but > it's really awkward to explain to people. > > On Thu, Oct 20, 2016 at 12:52 PM, Reynold Xin <r...@databricks.com> wrote: > > FYI - Xiangrui submitted an amazing pull request to fix a long standing > > issue with a lot of the

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-11 Thread Reynold Xin
The vote has passed with the following +1s and no -1. I will work on packaging the release. +1: Reynold Xin* Herman van Hövell tot Westerflier Ricardo Almeida Shixiong (Ryan) Zhu Sean Owen* Michael Armbrust* Dongjoon Hyun Jagadeesan As Liwei Lin Weiqing Yang Vaquar Khan Denny Lee Yin Huai* Ryan

Re: separate spark and hive

2016-11-14 Thread Reynold Xin
that working without hive should be either a simple > configuration or even the default and that if there is any missing > functionality it should be documented. > > Assaf. > > > > > > *From:* Reynold Xin [mailto:r...@databricks.com] > *Sent:* Tuesday, November 15, 2016 9:31 AM &

Re: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
Good catch. Updated! On Mon, Nov 14, 2016 at 11:13 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > FWIW 2.0.1 is also used in the 'Link With Spark' and 'Spark Source > Code Management' sections in that page. > > Shivaram > > On Mon, Nov 14, 2016 at

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Reynold Xin
t. > > On Mon, Nov 14, 2016 at 9:49 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Has the release already been made? I didn't see any announcement, but >> Homebrew has already updated to 2.0.2. >> On Nov 11, 2016 (Fri) 2:59 PM, Reynold Xin

[ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
We are happy to announce the availability of Spark 2.0.2! Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along with Kafka 0.10 support and runtime metrics for Structured Streaming. This release is based on the branch-2.0 maintenance branch of Spark. We strongly recommend all

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Reynold Xin
LE, analyzeColumns etc ... all look good. >> >> 9. From the release point of view, how this is planned ? Will all this be >> implemented in one go or in phases? >> >> Thanks, >> Yogesh Mahajan >> http://www.snappydata.io/blog <http://snappydata.io

Re: separate spark and hive

2016-11-14 Thread Reynold Xin
I agree with the high level idea, and thus SPARK-15691. In reality, it's a huge amount of work to create & maintain a custom catalog. It might actually make sense to do, but it just seems a lot of work to do right now and it'd take a toll on

Re: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
It's there on the page (both the release notes and the download version dropdown). The one-line text is outdated. In fact, I'm just going to delete that text so we don't run into this issue in the future. On Mon, Nov 14, 2016 at 11:09 PM, assaf.mendelson

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Reynold Xin
iva...@eecs.berkeley.edu> wrote: > Do we have any query workloads for which we can benchmark these > proposals in terms of performance ? > > Thanks > Shivaram > > On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <r...@databricks.com> wrote: > >

Re: Memory leak warnings in Spark 2.0.1

2016-11-22 Thread Reynold Xin
See https://issues.apache.org/jira/browse/SPARK-18557 On Mon, Nov 21, 2016 at 1:16 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > I'm also curious about this. Is there something we can do to help > troubleshoot these leaks and file

Re: Third party library

2016-11-25 Thread Reynold Xin
bcc dev@ and add user@ This is more a user@ list question rather than a dev@ list question. You can do something like this: object MySimpleApp { def loadResources(): Unit = // define some idempotent way to load resources, e.g. with a flag or lazy val def main() = { ...
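A fleshed-out sketch of the pattern outlined above; the lazy-val approach and all names beyond those quoted are illustrative:

```scala
object MySimpleApp {
  // A lazy val is one idempotent way to load resources: the initializer runs at
  // most once per JVM (i.e. once per executor), however many tasks touch it.
  lazy val resources: Map[String, String] = loadResources()

  def loadResources(): Map[String, String] = {
    // Hypothetical loader -- replace with the third-party library's setup call.
    Map("modelPath" -> "/tmp/model.bin")
  }

  def main(args: Array[String]): Unit = {
    // Spark job code; tasks that reference `resources` on the executors
    // trigger the one-time load there.
  }
}
```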

Re: Please limit commits for branch-2.1

2016-11-22 Thread Reynold Xin
I did send an email out with that information on Nov 1st. It is not meant to be in new feature development mode anymore. FWIW, I will cut an RC today to remind people of that. The RC will fail, but it can serve as a good reminder. On Tue, Nov 22, 2016 at 1:53 AM Sean Owen

Re: Parquet-like partitioning support in spark SQL's in-memory columnar cache

2016-11-24 Thread Reynold Xin
It's already there, isn't it? The in-memory columnar cache format. On Thu, Nov 24, 2016 at 9:06 PM, Nitin Goyal wrote: > Hi, > > Do we have any plan of supporting parquet-like partitioning support in > Spark SQL in-memory cache? Something like one RDD[CachedBatch] per >

Re: Two major versions?

2016-11-27 Thread Reynold Xin
I think this highly depends on what issues are found, e.g. critical bugs that impact wide use cases, or security bugs. On Sun, Nov 27, 2016 at 12:49 PM, Dongjoon Hyun wrote: > Hi, All. > > Do we have a release plan of Apache Spark 1.6.4? > > Up to my knowledge, Apache

[VOTE] Apache Spark 2.1.0 (RC1)

2016-11-28 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.1.0. The vote is open until Thursday, December 1, 2016 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.0 [ ] -1 Do not release this package because

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-28 Thread Reynold Xin
This one: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0 On Mon, Nov 28, 2016 at 9:00 PM, Prasanna Santhanam <t...@apache.org> wrote: > > > On Tue, Nov 29, 2016 at 6:55 AM, Reynold Xin <r...@databricks.com> wrote: > >

Re: Bit-wise AND operation between integers

2016-11-28 Thread Reynold Xin
Bcc dev@ and add user@ The dev list is not meant for users to ask questions on how to use Spark. For that you should use StackOverflow or the user@ list.
scala> sql("select 1 & 2").show()
+-------+
|(1 & 2)|
+-------+
|      0|
+-------+
scala> sql("select 1 & 3").show()
+-------+
|(1 & 3)|

Re: [SPARK-16654][CORE][WIP] Add UI coverage for Application Level Blacklisting

2016-11-21 Thread Reynold Xin
You can submit a pull request against Imran's branch. On Mon, Nov 21, 2016 at 7:33 PM Jose Soltren wrote: > Hi - I'm proposing a patch set for UI coverage of Application Level > Blacklisting: > > https://github.com/jsoltren/spark/pull/1 > > This patch set

issues with github pull request notification emails missing

2016-11-16 Thread Reynold Xin
I've noticed that a lot of github pull request notifications no longer come to my inbox. In the past I'd get an email for every reply to a pull request that I subscribed to (i.e. commented on). Lately I noticed for a lot of them I didn't get any emails, but if I opened the pull requests directly

Re: SQL Syntax for pivots

2016-11-16 Thread Reynold Xin
Not right now. On Wed, Nov 16, 2016 at 10:44 PM, Niranda Perera wrote: > Hi all, > > I see that the pivot functionality is being added to spark DFs from 1.6 > onward. > > I am interested to see if there is a Spark SQL syntax available for > pivoting? example: Slide 11
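For reference, the DataFrame-side form that was available at the time (column names and data are illustrative):

```scala
import spark.implicits._   // assumes a spark-shell style SparkSession named `spark`

val df = Seq((2015, "math", 100), (2015, "cs", 200), (2016, "math", 150))
  .toDF("year", "course", "earnings")

// groupBy(...).pivot(...) has been part of the DataFrame API since Spark 1.6.
df.groupBy("year").pivot("course").sum("earnings").show()
```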

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-17 Thread Reynold Xin
Adding a new data type is an enormous undertaking and very invasive. I don't think it is worth it in this case given there are clear, simple workarounds. On Thu, Nov 17, 2016 at 12:24 PM, kant kodali wrote: > Can we have a JSONType for Spark SQL? > > On Wed, Nov 16, 2016 at

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Reynold Xin
Ha funny. Never noticed that. On Thursday, November 17, 2016, Nicholas Chammas wrote: > Hmm... somehow the image didn't show up. > > How about now? > > [image: Screen Shot 2016-11-17 at 11.57.14 AM.png] > > On Thu, Nov 17, 2016 at 12:14 PM Herman van Hövell tot

Re: [build system] massive jenkins infrastructure changes forthcoming

2016-11-17 Thread Reynold Xin
Thanks for the heads-up, Shane. On Thu, Nov 17, 2016 at 2:33 PM, shane knapp wrote: > TL;DR: amplab is becoming riselab, and is much more C++ oriented. > centos 6 is so far behind, and i'm already having to roll C++ > compilers and various libraries by hand. centos 7 is

statistics collection and propagation for cost-based optimizer

2016-11-13 Thread Reynold Xin
I want to bring this discussion to the dev list to gather broader feedback, as there have been some discussions that happened over multiple JIRA tickets (SPARK-16026, etc) and GitHub pull requests about what statistics to collect and how to use

Re: withExpr private method duplication in Column and functions objects?

2016-11-11 Thread Reynold Xin
private[sql] has no impact in Java, and these functions are literally one line of code. It's overkill to think about code duplication for functions that simple. On Fri, Nov 11, 2016 at 1:12 PM, Jacek Laskowski wrote: > Hi, > > Any reason for withExpr duplication in Column [1]

Re: does The Design of spark consider the scala parallelize collections?

2016-11-13 Thread Reynold Xin
Some places in Spark do use it: > git grep "\\.par\\." mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala: val models = Range(0, numClasses).par.map { index => sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala: (0 until
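A tiny self-contained illustration of the Scala parallel-collections pattern those grep hits use; the work inside the map is a placeholder:

```scala
// .par turns the Range into a parallel collection; the map bodies run on a
// driver-side thread pool, independent of Spark's own task parallelism.
val results = (0 until 4).par.map { i =>
  // placeholder for independent per-index work, e.g. training one model per class
  i * i
}.seq
println(results)
```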

Re: statistics collection and propagation for cost-based optimizer

2016-11-13 Thread Reynold Xin
, Reynold Xin <r...@databricks.com> wrote: > I want to bring this discussion to the dev list to gather broader > feedback, as there have been some discussions that happened over multiple > JIRA tickets (SPARK-16026 > <https://issues.apache.org/jira/browse/SPARK-16026&g

github mirroring is broken

2016-11-20 Thread Reynold Xin
FYI: GitHub mirroring from Apache's official git repo to GitHub has been broken since Sat, Nov 19, and as a result GitHub is now stale. Merged pull requests won't show up on GitHub until ASF infra fixes the issue.

Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-31 Thread Reynold Xin
;> that .gitignore is missing. It would be nice if the tarball were created >>> using git archive so that the commit ref is present, but otherwise >>> everything looks fine. >>> ​ >>> >>> On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin <r...@databricks

Re: Odp.: Spark Improvement Proposals

2016-11-01 Thread Reynold Xin
Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1. On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin wrote: > The proposal looks OK to me. I assume, even though it's not explicitly > called, that

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Reynold Xin
:185) > > at > > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:347) > > at > > org.codehaus.janino.Java$PackageMemberClassDeclaration.accept(Java.java:1139) > > at org.codehaus.janino.UnitCompiler.compile(UnitC

Re: Handling questions in the mailing lists

2016-11-02 Thread Reynold Xin
Actually after talking with more ASF members, I believe the only policy is that development decisions have to be made and announced on ASF properties (dev list or jira), but user questions don't have to. I'm going to double check this. If it is true, I would actually recommend us moving entirely

Re: Updating Parquet dep to 1.9

2016-11-01 Thread Reynold Xin
Ryan, want to submit a pull request? On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue wrote: > 1.9.0 includes some fixes intended specifically for Spark: > > * PARQUET-389: Evaluates push-down predicates for missing columns as > though they are null. This is to address

Re: Interesting in contributing to spark

2016-10-31 Thread Reynold Xin
Welcome! This is the best guide to get started: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Mon, Oct 31, 2016 at 5:09 AM, Zak H wrote: > Hi, > > I'd like to introduce myself. My name is Zak and I'm a software engineer. > I'm interested

Re: JIRA Components for Streaming

2016-10-31 Thread Reynold Xin
Maybe just streaming or SS in GitHub? On Monday, October 31, 2016, Cody Koeninger wrote: > Makes sense to me. > > I do wonder if e.g. > > [SPARK-12345][STRUCTUREDSTREAMING][KAFKA] > > is going to leave any room in the Github PR form for actual title content? > > On Mon, Oct

Re: [VOTE] Release Apache Spark 1.6.3 (RC1)

2016-11-02 Thread Reynold Xin
This vote is cancelled and I'm sending out a new vote for rc2 now. On Mon, Oct 17, 2016 at 5:18 PM, Reynold Xin <r...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.3. The vote is open until Thursday, Oct 20, 2016

[VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-02 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a majority of at least 3+1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.6.3 [ ] -1 Do not release this package because ... The

view canonicalization - looking for database gurus to chime in

2016-11-01 Thread Reynold Xin
I know there are a lot of people with experience on developing database internals on this list. Please take a look at this proposal for a new, simpler way to handle view canonicalization in Spark SQL: https://issues.apache.org/jira/browse/SPARK-18209 It sounds much simpler than what we currently

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-01 Thread Reynold Xin
Vinayak, Thanks for the email. This is really not the thread meant for reporting existing regressions. It's best to just comment on the JIRA ticket, or even better, submit a fix for it. On Tuesday, November 1, 2016, vijoshi wrote: > > Hi, > > Have encountered an issue with

[VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-01 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.2. The vote is open until Fri, Nov 4, 2016 at 22:00 PDT and passes if a majority of at least 3+1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.2 [ ] -1 Do not release this package because ... The

[ANNOUNCE] Apache Spark branch-2.1

2016-11-01 Thread Reynold Xin
Hi all, Following the release schedule as outlined in the wiki, I just created branch-2.1 to form the basis of the 2.1 release. As of today we have less than 50 open issues for 2.1.0. The next couple of weeks we as a community should focus on testing and bug fixes and burn down the number of

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-11 Thread Reynold Xin
On Tue, Oct 11, 2016 at 10:55 AM, Michael Armbrust wrote: > *Complex event processing and state management:* Several groups I've >> talked to want to run a large number (tens or hundreds of thousands now, >> millions in the near future) of state machines over low-rate

Re: Monitoring system extensibility

2016-10-10 Thread Reynold Xin
be reviewed has not got any committer's attention. Without that, > it is going nowhere. The historic Jiras requesting other sinks such as > Kafka, OpenTSBD etc have also been ignored. > > So for now we continue creating classes in o.a.s package. > > On Fri, 7 Oct 2016 at 09:50 Reynol

FYI - marking data type APIs stable

2016-10-10 Thread Reynold Xin
I noticed today that our data types APIs (org.apache.spark.sql.types) are actually DeveloperApis, which means they can be changed from one feature release to another. In reality these APIs have been there since the original introduction of the DataFrame API in Spark 1.3, and have not seen any

cutting 2.0.2?

2016-10-16 Thread Reynold Xin
Since 2.0.1, there have been a number of correctness fixes as well as some nice improvements to the experimental structured streaming (notably basic Kafka support). I'm thinking about cutting 2.0.2 later this week, before Spark Summit Europe. Let me know if there are specific things (bug fixes)

Mark DataFrame/Dataset APIs stable

2016-10-12 Thread Reynold Xin
I took a look at all the public APIs we expose in o.a.spark.sql tonight, and realized we still have a large number of APIs that are marked experimental. Most of these haven't really changed, except in 2.0 we merged DataFrame and Dataset. I think it's long overdue to mark them stable. I'm tracking

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-12-08 Thread Reynold Xin
This vote is closed in favor of rc2. On Mon, Nov 28, 2016 at 5:25 PM, Reynold Xin <r...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.1.0. The vote is open until Thursday, December 1, 2016 at 18:00 UTC and > passe

2.1.0-rc2 cut; committers please set fix version for branch-2.1 to 2.1.1 instead

2016-12-07 Thread Reynold Xin
Thanks.

Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-09 Thread Reynold Xin
d in 2.1. Unfortunately, it's been manually verified only. There's >> no unit test that covers it, and building one is far from trivial. >> >> Michael >> >> >> >> >> On Dec 8, 2016, at 12:39 AM, Reynold Xin <r...@databricks.com> wrote: >

Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-13 Thread Reynold Xin
I'm going to -1 this myself: https://issues.apache.org/jira/browse/SPARK-18856 On Thu, Dec 8, 2016 at 12:39 AM, Reynold Xin <r...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark

Re: Output Side Effects for different chain of operations

2016-12-15 Thread Reynold Xin
You can just write some files out directly (and idempotently) in your map/mapPartitions functions. It is just a function, so you can run arbitrary code in it, after all. On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit wrote: > Any suggestions on this one? > > Regards > Sumit
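A minimal sketch of that suggestion (paths are illustrative, local filesystem and a SparkContext `sc` assumed): write the side-effect output directly and idempotently inside mapPartitionsWithIndex, so a retried partition simply overwrites its own file:

```scala
import java.nio.file.{Files, Paths}

val data = sc.parallelize(1 to 100, 4)

val processed = data.mapPartitionsWithIndex { (idx, iter) =>
  val rows = iter.map(_ * 2).toVector
  // A deterministic per-partition path keeps the side effect idempotent on retry.
  Files.write(Paths.get(s"/tmp/side-output-part-$idx"),
    rows.mkString("\n").getBytes("UTF-8"))
  rows.iterator
}
processed.count()   // an action is still needed to trigger execution
```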

Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

2016-12-15 Thread Reynold Xin
In general this falls directly into the domain of external cluster managers (YARN, Mesos, Kub). The standalone thing was meant as a simple way to deploy Spark, and we gotta be careful with introducing a lot more features to it because then it becomes just a full fledged cluster manager and is

Spark 2.1.0-rc3 cut

2016-12-15 Thread Reynold Xin
Committers please use 2.1.1 as the fix version for patches merged into the branch. I will post a voting email once the packaging is done.

[VOTE] Apache Spark 2.1.0 (RC5)

2016-12-15 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.0 [ ] -1 Do not release this package because ...

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-15 Thread Reynold Xin
been released as > SparkR can automatically resolve and download releases. > > On Thu, Dec 15, 2016 at 9:16 PM, Reynold Xin <r...@databricks.com> wrote: > > Please vote on releasing the following candidate as Apache Spark version > > 2.1.0. The vote is open until Sun,

Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-06 Thread Reynold Xin
This is not supposed to happen. Do you have a repro? On Tue, Dec 6, 2016 at 6:11 PM, Nicholas Chammas wrote: > [Re-titling thread.] > > OK, I see that the exception from my original email is being triggered > from this part of UnsafeInMemorySorter: > >

Re: [PYSPARK] Python tests organization

2017-01-11 Thread Reynold Xin
It would be good to break them down a bit more, provided that we don't, for example, increase total runtime due to extra setup. On Wed, Jan 11, 2017 at 9:45 AM Saikat Kanjilal wrote: > > > > > > > > > > > > > > > Hello Maciej, > > > If there's a jira available for this I'd

Re: [PYSPARK] Python tests organization

2017-01-11 Thread Reynold Xin
Yes absolutely. On Wed, Jan 11, 2017 at 9:54 AM Saikat Kanjilal <sxk1...@hotmail.com> wrote: > > > > > > > > > > > > > > > Is it worth to come up with a proposal for this and float to dev? > > > > > > > > > >

Re: Spark Improvement Proposals

2017-01-11 Thread Reynold Xin
gt; > getting to a point where there is agreement. Isn't that agreement >>>>> what we >>>>> > want to achieve with these proposals? >>>>> > >>>>> > Second, lazy consensus only removes the requirement for three +1 >>>>> votes. Why >>>>> &

Re: [SQL][CodeGen] Is there a way to set break point and debug the generated code?

2017-01-10 Thread Reynold Xin
It's unfortunately difficult to debug -- that's one downside of codegen. You can dump all the code via "explain codegen" though. That's typically enough for me to debug. On Tue, Jan 10, 2017 at 3:21 AM, dragonly wrote: > I am recently hacking into the SparkSQL and trying
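Two hedged spark-shell style ways to dump the generated code in Spark 2.x (the second relies on a developer/debug helper, so treat this as a sketch, not a full recipe):

```scala
// 1. The SQL command form mentioned above.
spark.sql("EXPLAIN CODEGEN SELECT id * 2 FROM range(10)").show(false)

// 2. The debug helper on an existing Dataset/DataFrame.
import org.apache.spark.sql.execution.debug._
val df = spark.range(10).selectExpr("id * 2 AS doubled")
df.debugCodegen()
```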

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Reynold Xin
Can you give a repro? Anything less than -(1 << 63) is considered negative infinity (i.e. unbounded preceding). On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz wrote: > Hi, > > I've been looking at the SPARK-17845 and I am curious if there is any > reason to make it
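On the Scala side the same convention appears as Long.MinValue meaning unbounded preceding; a sketch with illustrative data and column names:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._   // assumes a spark-shell style SparkSession named `spark`

val df = Seq(("a", 1L, 10), ("a", 2L, 20), ("b", 1L, 5)).toDF("group", "ts", "value")

// Long.MinValue (== -(1L << 63)) as the frame start means UNBOUNDED PRECEDING;
// 0 means CURRENT ROW.
val w = Window.partitionBy("group").orderBy("ts").rowsBetween(Long.MinValue, 0)
df.withColumn("running_total", sum("value").over(w)).show()
```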

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Reynold Xin
de which > used to work before is off by one. > On 11/30/2016 06:43 PM, Reynold Xin wrote: > > Can you give a repro? Anything less than -(1 << 63) is considered negative > infinity (i.e. unbounded preceding). > > On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Reynold Xin
ceding = -sys.maxsize > > unboundedFollowing = sys.maxsize > > to keep backwards compatibility. > On 11/30/2016 06:52 PM, Reynold Xin wrote: > > Ah ok for some reason when I did the pull request sys.maxsize was much > larger than 2^63. Do you want to submit a patch to fix this? > > &g

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-01 Thread Reynold Xin
den email]> wrote: > > It is platform specific so theoretically can be larger, but 2**63 - 1 is a > > standard on 64 bit platform and 2**31 - 1 on 32bit platform. I can submit a > > patch but I am not sure how to proceed. Personally I would set > > > > unboundedPr

Re: Future of the Python 2 support.

2016-12-04 Thread Reynold Xin
Echoing Nick. I don't see any strong reason to drop Python 2 support. We typically drop support for X when it is rarely used and support for X is long past EOL. Python 2 is still very popular, and depending on the statistics it might be more popular than Python 3. On Sun, Dec 4, 2016 at 9:29 AM

Re: Please limit commits for branch-2.1

2016-12-05 Thread Reynold Xin
; Looks great -- cutting branch = in RC period. > > On Tue, Nov 22, 2016 at 5:31 PM Reynold Xin <r...@databricks.com> wrote: > >> I did send an email out with those information on Nov 1st. It is not >> meant to be in new feature development mode anymore. >> >>

Re: Spark-9487, Need some insight

2016-12-05 Thread Reynold Xin
Honestly it is pretty difficult. Given the difficulty, would it still make sense to do that change? (the one that sets the same number of workers/parallelism across different languages in testing) On Mon, Dec 5, 2016 at 3:33 PM, Saikat Kanjilal wrote: > Hello again dev

Re: Parquet patch release

2017-01-06 Thread Reynold Xin
Thanks for the heads up, Ryan! On Fri, Jan 6, 2017 at 3:46 PM, Ryan Blue wrote: > Last month, there was interest in a Parquet patch release on PR #16281 > . I went ahead and reviewed > commits that should go into a Parquet

Re: Skip Corrupted Parquet blocks / footer.

2017-01-01 Thread Reynold Xin
In Spark 2.1, set spark.sql.files.ignoreCorruptFiles to true. On Sun, Jan 1, 2017 at 1:11 PM, khyati wrote: > Hi, > > I am trying to read the multiple parquet files in sparksql. In one dir > there > are two files, of which one is corrupted. While trying to read these
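A short sketch of using that setting (assumes a SparkSession named `spark`; the path is illustrative):

```scala
// Available in Spark 2.1+: skip files whose Parquet footer/blocks cannot be
// read instead of failing the whole query.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val df = spark.read.parquet("/data/events/")
df.count()
```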

Re: Clarification about typesafe aggregations

2017-01-04 Thread Reynold Xin
Your understanding is correct - it is indeed slower due to extra serialization. In some cases we can get rid of the serialization if the value is already deserialized. On Wed, Jan 4, 2017 at 7:19 AM, geoHeil wrote: > Hi I would like to know more about typesafe

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-19 Thread Reynold Xin
The vote passed with the following +1 and -1: +1 Reynold Xin* Sean Owen* Dongjoon Hyun Xiao Li Herman van Hövell tot Westerflier Joseph Bradley* Liwei Lin Denny Lee Holden Karau Adam Roberts vaquar khan 0/+1 (not sure what this means but putting it here just in case) Felix Cheung -1 Franklyn

Re: planning & discussion for larger scheduler changes

2017-03-24 Thread Reynold Xin
On Fri, Mar 24, 2017 at 4:41 PM, Imran Rashid wrote: > Kay and I were discussing some of the bigger scheduler changes getting > proposed lately, and realized there is a broader discussion to have with > the community, outside of any single jira. I'll start by sharing my >

Re: spark-without-hive assembly for hive build/development purposes

2017-03-16 Thread Reynold Xin
Why do you need an assembly? Is there something preventing Hive from depending on normal jars like all other applications? On Thu, Mar 16, 2017 at 3:42 PM, Zoltan Haindrich wrote: > Hello, > > Hive needs a spark assembly to execute the HoS tests. > Until now…this assembly have been

Re: Lineage between Datasets

2017-04-12 Thread Reynold Xin
The physical plans are not subtrees, but the analyzed plan (before the optimizer runs) is actually similar to "lineage". You can get that by calling explain(true) and look at the analyzed plan. On Wed, Apr 12, 2017 at 3:03 AM Chang Chen wrote: > Hi All > > I believe that
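A quick illustration (assumes a SparkSession named `spark`) of pulling up the analyzed plan with explain(true):

```scala
val df = spark.range(100)
  .selectExpr("id % 10 AS bucket", "id")
  .groupBy("bucket")
  .count()

// Prints the parsed, analyzed, and optimized logical plans plus the physical
// plan; the analyzed plan is the lineage-like tree referred to above.
df.explain(true)
```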

Re: distributed computation of median

2017-04-17 Thread Reynold Xin
The DataFrame API includes an approximate quantile implementation. If you ask for quantile 0.5, you will get an approximate median. On Sun, Apr 16, 2017 at 9:24 PM svjk24 wrote: > Hello, > Is there any interest in an efficient distributed computation of the > median algorithm?
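A minimal sketch (assumes a spark-shell style SparkSession `spark`; column name and error bound are illustrative) of getting an approximate median through approxQuantile:

```scala
import spark.implicits._

val df = (1 to 1000).toDF("value")

// approxQuantile(column, probabilities, relativeError); probability 0.5 yields
// an approximate median.
val Array(median) = df.stat.approxQuantile("value", Array(0.5), 0.01)
println(s"approximate median = $median")
```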

Re: New Optimizer Hint

2017-04-20 Thread Reynold Xin
Doesn't common subexpression elimination address this issue as well? On Thu, Apr 20, 2017 at 6:40 AM Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: > Hi Michael, > > This sounds like a good idea. Can you open a JIRA to track this? > > My initial feedback on your proposal

Re: RDD functions using GUI

2017-04-18 Thread Reynold Xin
This is not really a dev list question ... I'm sure some tools exist out there, e.g. Talend, Alteryx. On Tue, Apr 18, 2017 at 10:35 AM, Ke Yang (Conan) wrote: > Ping… wonder why there aren’t any such drag-n-drop GUI tool for creating > batch query scripts? > > Thanks > > >

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-19 Thread Reynold Xin
+1 On Wed, Apr 19, 2017 at 3:31 PM, Marcelo Vanzin wrote: > +1 (non-binding). > > Ran the hadoop-2.6 binary against our internal tests and things look good. > > On Tue, Apr 18, 2017 at 11:59 AM, Michael Armbrust > wrote: > > Please vote on releasing

Re: Spark Improvement Proposals

2017-03-09 Thread Reynold Xin
I'm fine without a vote. (are we voting on whether we need a vote?) On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen wrote: > I think a VOTE is over-thinking it, and is rarely used, but, can't hurt. > Nah, anyone can call a vote. This really isn't that formal. We just want to >

Re: Spark Improvement Proposals

2017-03-10 Thread Reynold Xin
We can just start using the spip label and link to it. On Fri, Mar 10, 2017 at 9:18 AM, Cody Koeninger wrote: > So to be clear, if I translate that google doc to markup and submit a > PR, you will merge it? > > If we're just using "spip" label, that's probably fine, but we

Re: Build completed: spark 866-master

2017-03-04 Thread Reynold Xin
Most of the previous notifications were caught as spam. We should really disable this. On Sat, Mar 4, 2017 at 4:17 PM Hyukjin Kwon wrote: > Oh BTW, I was asked about this by Reynold. Few month ago and I said the > similar answer. > > I think I am not supposed to don't

Re: Thoughts on release cadence?

2017-07-31 Thread Reynold Xin
On Mon, Jul 31, 2017, 18:06 Michael Armbrust <mich...@databricks.com> > wrote: > >> +1, should we update https://spark.apache.org/versioning-policy.html ? >> >> On Sun, Jul 30, 2017 at 3:34 PM, Reynold Xin <r...@databricks.com> wrote: >> >>> Thi

Re: [VOTE] [SPIP] SPARK-18085: Better History Server scalability

2017-08-03 Thread Reynold Xin
A late +1 too. On Thu, Aug 3, 2017 at 1:37 PM Marcelo Vanzin wrote: > This vote passes with 3 binding +1 votes, 5 non-binding votes, and no -1 > votes. > > Thanks all! > > +1 votes (binding): > Tom Graves > Sean Owen > Marcelo Vanzin > > +1 votes (non-binding): > Ryan Blue

Re: Use Apache ORC in Apache Spark 2.3

2017-08-10 Thread Reynold Xin
Do you not use the catalog? On Thu, Aug 10, 2017 at 3:22 PM, Andrew Ash wrote: > I would support moving ORC from sql/hive -> sql/core because it brings me > one step closer to eliminating Hive from my Spark distribution by removing > -Phive at build time. > > On Thu, Aug

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-06 Thread Reynold Xin
+1 On Fri, Jun 30, 2017 at 6:44 PM, Michael Armbrust wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and > passes if a majority of at least 3 +1 PMC votes are cast. > >
