Re: PHJ node assignment

2018-02-12 Thread Quanlong Huang
IMU, the left side is always located with the hash join node. If the stats are correct, the left side will always be the larger table/input. There are two terms in the hash join algorithm: build and probe. The smaller table that can be built into an in-memory hash table is called the
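
As a rough illustration of the build and probe phases named in the snippet above, here is a minimal Python sketch. It is not Impala's implementation; `hash_join` and the sample rows are hypothetical, and it assumes the build side fits in memory.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner-join two lists of dicts with a simple in-memory hash join."""
    # Build phase: hash the smaller input on its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: stream the larger input and look up matches.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical sample data: "small" is the build side, "large" the probe side.
small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
large = [{"fk": 1, "v": 10}, {"fk": 1, "v": 11}, {"fk": 3, "v": 12}]
result = hash_join(small, large, "id", "fk")
```

Only the probe rows whose key appears in the hash table produce output, so the row with `fk` 3 is dropped.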

PHJ node assignment

2018-02-12 Thread Jeszy
IIUC, every row scanned in a partitioned hash join (both sides) is sent across the network (an exchange on HASH(key)). The targets of this exchange are nodes that have data locality with the left side of the join. Why does Impala do it that way? Since all rows are sent across the network anyway,
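
The exchange on HASH(key) described in the snippet above can be sketched in Python as follows. This is an assumption-laden toy, not Impala's exchange code: `target_node` and `exchange` are hypothetical names, nodes are modelled as list indices, and CRC32 stands in for whatever hash function the engine actually uses.

```python
import zlib

def target_node(key, num_nodes):
    """Pick a destination node for a join key (CRC32 is a stand-in hash)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_nodes

def exchange(rows, key, num_nodes):
    """Route every row to the node chosen by hashing its join key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[target_node(row[key], num_nodes)].append(row)
    return partitions

# Hypothetical inputs: both sides are shuffled with the same hash function.
left = [{"k": i, "side": "L"} for i in range(6)]
right = [{"k": i, "side": "R"} for i in range(0, 6, 2)]
left_parts = exchange(left, "k", 3)
right_parts = exchange(right, "k", 3)
```

Because both inputs use the same function, rows with equal keys land on the same node regardless of where they were scanned, which is why the choice of target nodes is independent of the original data locality in this sketch.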

Re: PHJ node assignment

2018-02-12 Thread Jeszy
Thanks for the response, Quanlong. The behaviour you describe is broadcast join (versus partitioned / shuffle) - sorry for the confusing usage of terms! Take a look at the differences in the cost model for the two (in lieu of a better description):

Help with task: Fix python test script to not return error on trivial success

2018-02-12 Thread Shayak Sadhu
I would like to help out with the task listed at https://helpwanted.apache.org/task.html?3cb146d3

Re: Re: ORC scanner - points for discussion

2018-02-12 Thread Jim Apple
I agree with the previous comments on this thread. Thank you for contributing, Quanlong!

Re: Re: ORC scanner - points for discussion

2018-02-12 Thread Tim Armstrong
Putting it behind a flag sounds good to me too. Hopefully we can get feedback from Hulu and other users of Impala that will try out the experimental version. On Mon, Feb 12, 2018 at 10:26 AM, Dimitris Tsirogiannis < dtsirogian...@cloudera.com> wrote: > Does the patch also implement an ORC

Re: Re: ORC scanner - points for discussion

2018-02-12 Thread Dimitris Tsirogiannis
Does the patch also implement an ORC writer? Dimitris On Mon, Feb 12, 2018 at 8:48 AM, Jim Apple wrote: > I agree with the previous comments on this thread. Thank you for > contributing, Quanlong! >

Re: PHJ node assignment

2018-02-12 Thread Alexander Behm
Jeszy, the way I read your question is: How much inter-node parallelism is good? As usual with perf questions, the answer is "it depends". Involving all nodes in the cluster for a PHJ may not work well. Intuitively, each node should have a minimum amount of work for the cost of shipping fragments

Re: Help with task: Fix python test script to not return error on trivial success

2018-02-12 Thread Jim Apple
Thank you for volunteering! It looks like that ticket, https://issues.apache.org/jira/browse/IMPALA-5886, already has an assignee. Are there any other newbie tickets you would be interested in working on? https://issues.apache.org/jira/issues/?filter=12341668 On Mon, Feb 12, 2018 at 3:31 AM,

Re: Re: Re: ORC scanner - points for discussion

2018-02-12 Thread Quanlong Huang
Dimitris, as the first step, this patch only supports reading primitive types from ORC files. I just created two follow-up JIRAs for reading complex types (IMPALA-6503) and writing to ORC tables (IMPALA-6504). Will work on them later. Tim, I also created some follow-on JIRAs as you suggest in

Re: Re: Re: ORC scanner - points for discussion

2018-02-12 Thread Tim Armstrong
Maybe it would make sense to create an Epic in JIRA for ORC scanner enhancements, following on from the initial implementation. I don't really feel strongly as long as the related JIRAs are linked together somehow. On Mon, Feb 12, 2018 at 1:42 PM, Quanlong Huang wrote: >