Adaptive Query Execution performance results in 3TB TPC-DS

2020-02-11 Thread Jia, Ke A
Hi all, We have completed the Spark 3.0 Adaptive Query Execution (AQE) performance tests on 3TB TPC-DS on a 5-node Cascade Lake cluster. With AQE, 2 queries show more than a 1.5x speedup and 37 queries show more than a 1.1x speedup. No query shows a significant performance degradation
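For context, a minimal sketch of the switch that turns AQE on for such a run (the session setup and app name here are assumptions, not details from the benchmark):

import org.apache.spark.sql.SparkSession

// enable Adaptive Query Execution before running the TPC-DS queries
val spark = SparkSession.builder()
  .appName("tpcds-aqe-3tb")
  .config("spark.sql.adaptive.enabled", "true") // AQE is off by default in Spark 3.0
  .getOrCreate()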

Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Hyukjin Kwon
To do that, we should explicitly document such structured configurations and their implicit effects, which is currently missing. I would be more than happy if we document such implied relationships, *and* if we are very sure all configurations are structured correctly and coherently. Until that point, I think i

Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Jungtaek Lim
I'm looking into the case of `spark.dynamicAllocation` and it seems to support my point. https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation I don't disagree with adding "This requires spark.shuffle.service.enabled to be set." to the description
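As a concrete illustration of the relationship being discussed, a minimal sketch with the two standard keys (the values are just an example, not a recommendation):

import org.apache.spark.SparkConf

// dynamic allocation depends on the external shuffle service, which is exactly
// the kind of cross-configuration relationship this thread is about
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // "This requires spark.shuffle.service.enabled to be set."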

Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Hyukjin Kwon
Sure, adding "[DISCUSS]" to label it is a good practice. I had to do it although it might be "redundant" :-) since anyone can give feedback on any thread on the Spark dev mailing list, and discuss. This practice is actually more prevalent given my rough reading of the configuration files. I would like to see this

Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Jungtaek Lim
I'm sorry if I'm missing something, but this would ideally have been started as [DISCUSS], as I haven't seen any reference showing consensus on this practice. For me, it's just that there are two different practices co-existing in the codebase, meaning it's closer to the preference of the individual (with implicit

Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Hyukjin Kwon
> I don't plan to document this officially yet Just to prevent confusion, I meant I don't yet plan to document the fact that we should write the relationships between configurations as a code/review guideline in https://spark.apache.org/contributing.html On Wed, Feb 12, 2020 at 9:57 AM, Hyukjin Kwon wrote:

Request to document the direct relationship between other configurations

2020-02-11 Thread Hyukjin Kwon
Hi all, I happened to review some PRs and noticed that some configurations are missing necessary information. To be explicit, I would like to make sure we document the direct relationships between related configurations in the documentation. For example, `spark.sql.adaptive.shuffle.reducePostS

Re: [build system] enabled the ubuntu staging node to help w/build queue

2020-02-11 Thread Takeshi Yamamuro
Thanks always..! Bests, Takeshi On Wed, Feb 12, 2020 at 3:28 AM shane knapp ☠ wrote: > the build queue has been increasing and to help throughput i enabled the > 'ubuntu-testing' node. i spot-checked a bunch of the spark maven builds, > and they passed. > > i'll keep an eye out for any failure

SMJ operator spilling perf improvements PR 27246

2020-02-11 Thread sinisa knezevic
Hello All, Could you please let me know what the next step would be for PR https://github.com/apache/spark/pull/27246? I would like to know if there is any action item on my side. Thank you, Sinisa

Re: Apache Spark Docker image repository

2020-02-11 Thread Dongjoon Hyun
Hi, Sean. Yes, we should keep this minimal. BTW, for the following question, > But how much value does that add? How much value do you think we get from our binary distribution at the following link? - https://www.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz Docker

Re: comparable and orderable CalendarInterval

2020-02-11 Thread Enrico Minack
I compute the difference of two timestamps and compare them with a constant interval:
Seq(("2019-01-02 12:00:00", "2019-01-02 13:30:00"))
  .toDF("start", "end")
  .select($"start".cast(TimestampType), $"end".cast(TimestampType))
  .select($"start", $"end", ($"end" - $"start").as("diff"))
  .whe
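For reference, a self-contained sketch of one way to express the same filter while the interval diff itself cannot be ordered. This is not the author's code (the where clause above is cut off); the session setup and the one-hour bound are assumptions, and the comparison is done on epoch seconds instead of on the CalendarInterval:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.TimestampType

val spark = SparkSession.builder().appName("interval-compare").master("local[*]").getOrCreate()
import spark.implicits._

// same input data as in the snippet above; the filter compares epoch seconds,
// which sidesteps the missing ordering on CalendarInterval
Seq(("2019-01-02 12:00:00", "2019-01-02 13:30:00"))
  .toDF("start", "end")
  .select($"start".cast(TimestampType), $"end".cast(TimestampType))
  .where($"end".cast("long") - $"start".cast("long") > 60 * 60) // keep rows spanning more than one hour
  .show()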

[build system] enabled the ubuntu staging node to help w/build queue

2020-02-11 Thread shane knapp ☠
the build queue has been increasing and to help throughput i enabled the 'ubuntu-testing' node. i spot-checked a bunch of the spark maven builds, and they passed. i'll keep an eye out for any failures caused by the system and either remove it from the worker pool or fix what i need to. shane --

Re: Apache Spark Docker image repository

2020-02-11 Thread Sean Owen
To be clear, this is a convenience 'binary' for end users, not just an internal packaging to aid the testing framework? There's nothing wrong with providing an additional official packaging if we vote on it and it follows all the rules. There is an open question about how much value it adds vs that

Re: Apache Spark Docker image repository

2020-02-11 Thread Erik Erlandson
My takeaway from the last time we discussed this was: 1) To be ASF compliant, we needed to only publish images at official releases 2) There was some ambiguity about whether or not a container image that included GPL'ed packages (spark images do) might trip over the GPL "viral propagation" due to i

Re: comparable and orderable CalendarInterval

2020-02-11 Thread Joseph Torres
The problem is that there isn't a consistent number of seconds an interval represents - as Wenchen mentioned, a month interval isn't a fixed number of days. If your use case can account for that, maybe you could add the interval to a fixed reference date and then compare the result. On Tue, Feb 11
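A hedged sketch of that suggestion, assuming a DataFrame `df` with the interval column `diff` from the earlier snippet; the anchor date is arbitrary and the interval literal is just an example:

import org.apache.spark.sql.functions.expr

// add both the computed diff and the constant interval to the same fixed reference
// timestamp, turning the interval comparison into a timestamp comparison
val anchored = df
  .withColumn("refPlusDiff", expr("timestamp'2000-01-01 00:00:00' + diff"))
  .where(expr("refPlusDiff > timestamp'2000-01-01 00:00:00' + interval 1 hour"))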

Re: comparable and orderable CalendarInterval

2020-02-11 Thread Wenchen Fan
What's your use case to compare intervals? It's tricky in Spark as there is only one interval type and you can't really compare one month with 30 days. On Wed, Feb 12, 2020 at 12:01 AM Enrico Minack wrote: > Hi Devs, > > I would like to know what is the current roadmap of making > CalendarInterv

comparable and orderable CalendarInterval

2020-02-11 Thread Enrico Minack
Hi Devs, I would like to know what the current roadmap is for making CalendarInterval comparable and orderable again (SPARK-29679, SPARK-29385, #26337). With #27262 this got reverted, but SPARK-30551 does not mention how to move forward on this matter. I have found SPARK-28494, but this seems t