Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-02 Thread Jungtaek Lim
I'm not 100% sure I understand the question. Assuming you're referring to "both" as SPARK-26283 [1] and SPARK-29322 [2]: if you ask about the fix, then yes, only the master branch, as the fix for SPARK-26283 is not ported back to branch-2.4. If you ask about the issue (problem), then maybe no, according to the

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Reynold Xin
I just looked at the PR. I think there is some follow-up work that needs to be done, e.g. we shouldn't create a top-level package org.apache.spark.sql.dynamicpruning. On Wed, Oct 02, 2019 at 1:52 PM, Maryann Xue < maryann@databricks.com > wrote: > > There is no internal write up, but I

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Maryann Xue
There is no internal write up, but I think we should at least give some up-to-date description on that JIRA entry. On Wed, Oct 2, 2019 at 3:13 PM Reynold Xin wrote: > No there is no separate write up internally. > > On Wed, Oct 2, 2019 at 12:29 PM Ryan Blue wrote: > >> Thanks for the pointers,

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Reynold Xin
No there is no separate write up internally. On Wed, Oct 2, 2019 at 12:29 PM Ryan Blue wrote: > Thanks for the pointers, but what I'm looking for is information about the > design of this implementation, like what requires this to be in spark-sql > instead of spark-catalyst. > > Even a

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Maryann Xue
The reason why it's in spark-sql is simply because HadoopFsRelation which the rule tries to match is in spark-sql. We should probably update the high-level description in the JIRA. I'll work on that shortly. On Wed, Oct 2, 2019 at 2:29 PM Ryan Blue wrote: > Thanks for the pointers, but what

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Ryan Blue
Thanks for the pointers, but what I'm looking for is information about the design of this implementation, like what requires this to be in spark-sql instead of spark-catalyst. Even a high-level description, like what the optimizer rules are and what they do would be great. Was there one written

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Maryann Xue
> It lists 3 cases for how a filter is built, but nothing about the overall approach or design that helps when trying to find out where it should be placed in the optimizer rules. The overall idea/design of DPP can be simply put as using the result of one side of the join to prune partitions of a
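The pruning idea described above can be sketched outside Spark entirely. In this toy Python sketch (all names and data are hypothetical, not Spark's actual rule), the filtered dimension side of the join is evaluated first, and its surviving join keys decide which partitions of the fact table are read at all:

```python
# Fact table laid out by partition value; the dimension table is small.
fact_partitions = {
    "2019-10-01": [("2019-10-01", 10), ("2019-10-01", 20)],
    "2019-10-02": [("2019-10-02", 30)],
    "2019-10-03": [("2019-10-03", 40)],
}
dim_rows = [("2019-10-02", "promo"), ("2019-10-03", "promo"),
            ("2019-10-01", "regular")]

# Step 1: apply the dimension-side filter and collect its join keys.
pruning_keys = {day for day, kind in dim_rows if kind == "promo"}

# Step 2: scan only the fact partitions whose partition value survives;
# the remaining partitions are never read from storage.
scanned = {p: rows for p, rows in fact_partitions.items() if p in pruning_keys}
```

The payoff is that the pruned partitions are skipped at scan time rather than filtered row-by-row after a full read.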

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Reynold Xin
Whoever created the JIRA years ago didn't describe dpp correctly, but the linked jira in Hive was correct (which unfortunately is much more terse than any of the patches we have in Spark https://issues.apache.org/jira/browse/HIVE-9152 ). Henry R's description was also correct. On Wed, Oct 02,

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Ryan Blue
Where can I find a design doc for dynamic partition pruning that explains how it works? The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition pruning (as pointed out by Henry R.) and doesn't have any comments about the implementation's approach. And the PR description also

Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-02 Thread Dongjoon Hyun
Thank you for the investigation and making a fix. So, are both issues only on the master (3.0.0) branch? Bests, Dongjoon. On Wed, Oct 2, 2019 at 00:06 Jungtaek Lim wrote: > FYI: patch submitted - https://github.com/apache/spark/pull/25996 > > On Wed, Oct 2, 2019 at 3:25 PM Jungtaek Lim > wrote:

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Wenchen Fan
dynamic partition pruning rule generates "hidden" filters that will be converted to real predicates at runtime, so it doesn't matter where we run the rule. For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's better to run it before join reorder. On Sun, Sep 29, 2019 at 5:51 AM
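The "hidden" filter mentioned above can be illustrated with a toy placeholder predicate (purely illustrative Python, not Spark's implementation): the optimizer plants a predicate whose key set is unknown at planning time, and the set is only filled in once the other side of the join has actually run:

```python
# A placeholder predicate inserted at optimization time: its key set is
# unresolved until the other side of the join has executed.
class RuntimePredicate:
    def __init__(self):
        self.keys = None  # unknown at planning time

    def __call__(self, row):
        # Until resolved, the predicate is a no-op (keeps every row).
        return self.keys is None or row["pid"] in self.keys

pred = RuntimePredicate()                 # planted by the optimizer
fact_rows = [{"pid": 1}, {"pid": 2}, {"pid": 3}]

pred.keys = {1, 3}                        # materialized at runtime
survivors = [r for r in fact_rows if pred(r)]
```

Because the predicate stays inert until runtime, it does not constrain where in the rule pipeline it is generated, which is the point being made above.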

Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-02 Thread Jungtaek Lim
FYI: patch submitted - https://github.com/apache/spark/pull/25996 On Wed, Oct 2, 2019 at 3:25 PM Jungtaek Lim wrote: > I need to do full manual test to make sure, but according to experiment > (small UT) "closeFrameOnFlush" seems to work. > > There was relevant change on master branch

Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-02 Thread Jacek Laskowski
Hi Jungtaek, Thanks a lot for your very prompt response! > Looks like it's missing, or intended to force custom streaming source implemented as DSv2. That's exactly my understanding = no more DSv1 data sources. That however is not consistent with the official message, is it? Spark 2.4.4 does

Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-02 Thread Jungtaek Lim
I need to do a full manual test to make sure, but according to an experiment (a small UT), "closeFrameOnFlush" seems to work. There was a relevant change on the master branch, SPARK-26283 [1], which changed the way the zstd event log file is read to "continuous", which seems to read an open frame. With
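The effect of "closeFrameOnFlush" can be illustrated with an analogous mechanism in Python's standard-library zlib (zstd is not in the stdlib, so zlib's sync flush stands in here as an analogy, not the actual zstd API): unless the compressor is explicitly flushed at a frame/byte boundary, bytes written so far may not decompress cleanly for a reader that opens the stream mid-write, which is the "open frame" situation described above:

```python
import zlib

# Simulated event-log writer: compress some lines, then flush so that
# a concurrent reader can decompress everything emitted so far.
data = b"spark event log line\n" * 100
writer = zlib.compressobj()
on_disk = writer.compress(data)
on_disk += writer.flush(zlib.Z_SYNC_FLUSH)  # align output on a byte boundary

# Simulated reader opening the file while the writer is still running:
# thanks to the sync flush, the bytes emitted so far decompress cleanly.
reader = zlib.decompressobj()
recovered = reader.decompress(on_disk)
```

In the zstd case, closing the frame on flush plays the same role: a reader following the still-growing event log sees complete frames instead of a truncated one.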