Re: Local join instead of data exchange - co-located blocks

2018-06-25 Thread Lars Volker
Hi Philipp, In https://github.com/apache/impala/blob/master/be/src/scheduling/scheduler.cc#L749 you see that we call ComputeFragmentExecParams() and ComputeBackendExecParams(). You should check both to see that they're doing the right thing in your case. The former also modifies the

Re: Local join instead of data exchange - co-located blocks

2018-06-18 Thread Lars Volker
Hi Philipp, Apologies for the delay. Since you're currently looking for a correct implementation more than a fast one, I would highly recommend using debug builds instead of release builds. The latter won't have DCHECKs enabled and you might find it much harder to debug any mistakes you make

Re: Local join instead of data exchange - co-located blocks

2018-05-30 Thread Lars Volker
Hi Philipp, The ScanRangeAssignment logfile entry gets printed by the scheduler in L918 in PrintAssignment(). For each host and each plan node it shows the scan ranges assigned. per_node_scan_ranges is a per-host structure in that assignment. When inspecting the full logs you should be able to

Re: Local join instead of data exchange - co-located blocks

2018-05-14 Thread Lars Volker
Hi Philipp, Looking at the profile, one of your scan nodes doesn't seem to receive any scan ranges ("Hdfs split stats" is empty). The other one receives one split, but it gets filtered out by the runtime filter coming from that first node ("Files rejected: 1"). You might want to disable runtime

Re: Local join instead of data exchange - co-located blocks

2018-05-14 Thread Philipp Krause
Hello Alex, I suppose you're very busy, so I apologize for the interruption. If you have any idea of what I could try to solve this problem, please let me know. Currently I don't know how to progress and I'd appreciate any help you can give me. Best regards Philipp Philipp Krause

Re: Local join instead of data exchange - co-located blocks

2018-04-15 Thread Philipp Krause
Hi Alex! Thank you for the list! The build of the modified cdh5-trunk branch (debug mode) was successful. After replacing "impala-frontend-0.1-SNAPSHOT.jar" in /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/jars/ I got the following error in my existing cluster: F0416 01:16:45.402997 17897

Re: Local join instead of data exchange - co-located blocks

2018-04-13 Thread Alexander Behm
Here's the full list. It might not be minimal, but copying/overwriting these should work. debug/service/impalad debug/service/libfesupport.so debug/service/libService.a release/service/impalad release/service/libfesupport.so release/service/libService.a yarn-extras-0.1-SNAPSHOT.jar

Re: Local join instead of data exchange - co-located blocks

2018-04-13 Thread Philipp Krause
Yes, I have a running (virtual) cluster. I would try to follow your approach with the custom Impala build (DistributedPlanner.java is the only modified file at the moment). Thank you in advance for the file list! Best regards Philipp Alexander Behm wrote on Fri, 13 Apr.

Re: Local join instead of data exchange - co-located blocks

2018-04-05 Thread Alexander Behm
Apologies for the late response. Btw, your previous post was clear enough to me, so no worries :) On Wed, Apr 4, 2018 at 7:46 AM, Philipp Krause < philippkrause.m...@googlemail.com> wrote: > Hello Alex, > > I think my previous post has been too long and confusing. I apologize for > that! > > If

Re: Local join instead of data exchange - co-located blocks

2018-04-05 Thread Alexander Behm
On Wed, Mar 28, 2018 at 12:04 PM, Philipp Krause wrote: > Thank you for your answer and sorry for my delay! > > If my understanding is correct, the list of scan nodes consists of all > nodes which contain a *local* block from a table that is needed for the >

Re: Local join instead of data exchange - co-located blocks

2018-03-28 Thread Philipp Krause
Thank you for your answer and sorry for my delay! If my understanding is correct, the list of scan nodes consists of all nodes which contain a *local* block from a table that is needed for the query (Assumption: I have no replicas in my first tests). If TableA-Block0 is on Node_0, isn't

Re: Local join instead of data exchange - co-located blocks

2018-03-20 Thread Alexander Behm
Thanks for following up. I think I understand your setup. If you want to not think about scan ranges, then you can modify HdfsScanNode.computeScanRangeLocations(). For example, you could change it to produce one scan range per file or per HDFS block. That way you'd know exactly what a scan range

Re: Local join instead of data exchange - co-located blocks

2018-03-19 Thread Philipp Krause
I'd like to provide a small example for our purpose. The last post may be a bit confusing, so here's a very simple example in the attached pdf file. I hope it's understandable. Otherwise, please let me know. Basically, I only want each data node to join all its local blocks. Is

Re: Local join instead of data exchange - co-located blocks

2018-03-18 Thread Philipp Krause
Hi! At the moment the data to parquet (block) mapping is based on a simple modulo function: Id % #data_nodes. So with 5 data nodes all rows with IDs 0,5,10,... are written to Parquet_0, IDs 1,6,11,... are written to Parquet_1 etc. That's what I did manually. Since the parquet file size and the
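
For readers following along, the modulo mapping described in this message can be sketched as below. This is only an illustration, not code from the thread: the function names partition_for and assign_rows are hypothetical, and the "Parquet_<n>" naming simply mirrors the wording of the message.

```python
from collections import defaultdict


def partition_for(row_id: int, num_data_nodes: int) -> int:
    """Return the partition index for a row: Id % #data_nodes."""
    return row_id % num_data_nodes


def assign_rows(row_ids, num_data_nodes):
    """Group row IDs by the (hypothetical) Parquet file they would be written to."""
    files = defaultdict(list)
    for row_id in row_ids:
        name = f"Parquet_{partition_for(row_id, num_data_nodes)}"
        files[name].append(row_id)
    return dict(files)


# Twelve example IDs spread over 5 data nodes.
layout = assign_rows(range(12), 5)
for name in sorted(layout):
    print(name, layout[name])
```

Rows whose IDs leave the same remainder end up in the same file, which is what makes co-locating the matching files of two tables on one node possible in the first place.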

Re: Local join instead of data exchange - co-located blocks

2018-03-14 Thread Philipp Krause
Thank you very much for this information! I'll try to implement these two steps and post some updates within the next few days! Best regards Philipp 2018-03-13 5:38 GMT+01:00 Alexander Behm : > Cool that you are working on a research project with Impala! > > Properly adding

Re: Local join instead of data exchange - co-located blocks

2018-03-12 Thread Alexander Behm
Cool that you are working on a research project with Impala! Properly adding such a feature to Impala is a substantial effort, but hacking the code for an experiment or two seems doable. I think you will need to modify two things: (1) the planner to not add exchange nodes, and (2) the scheduler to

Re: Local join instead of data exchange - co-located blocks

2018-03-12 Thread Philipp Krause
Thank you very much for your quick answers! The intention behind this is to improve the execution time and (primarily) to examine the impact of block-co-location (research project) for this particular query (simplified): select A.x, B.y, A.z from tableA as A inner join tableB as B on

Re: Local join instead of data exchange - co-located blocks

2018-03-12 Thread Alexander Behm
Such a specific block arrangement is very uncommon for typical Impala setups, so we don't attempt to recognize and optimize this narrow case. In particular, such an arrangement tends to be short lived if you have the HDFS balancer turned on. Without making code changes, there is no way today to

Local join instead of data exchange - co-located blocks

2018-03-12 Thread Philipp Krause
Hello everyone! In order to prevent network traffic, I'd like to perform local joins on each node instead of exchanging the data and performing a join over the complete data afterwards. My query is basically a join over three tables on an ID attribute. The blocks are perfectly distributed,
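
The idea behind the question can be illustrated with a small sketch: when both sides of an equi-join are partitioned by the same function of the join key, joining each pair of matching partitions locally yields the same rows as one join over the complete data. The code below is a toy model in plain Python, not Impala code. It uses two tables instead of the three mentioned in the message, and the helpers hash_join and partition as well as the sample data are made up for illustration.

```python
def hash_join(left, right):
    """Inner join of two lists of (id, value) rows on id."""
    index = {}
    for rid, y in right:
        index.setdefault(rid, []).append(y)
    return [(lid, x, y) for lid, x in left for y in index.get(lid, [])]


def partition(rows, num_nodes):
    """Split rows into num_nodes partitions by id % num_nodes."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[row[0] % num_nodes].append(row)
    return parts


# Made-up sample data: table_b only contains the even IDs.
table_a = [(i, f"a{i}") for i in range(10)]
table_b = [(i, f"b{i}") for i in range(0, 10, 2)]
num_nodes = 5

# One join over the complete data (what an exchange-based plan computes).
global_result = sorted(hash_join(table_a, table_b))

# One join per node over its co-located partitions, then a plain union.
local_result = sorted(
    row
    for part_a, part_b in zip(partition(table_a, num_nodes),
                              partition(table_b, num_nodes))
    for row in hash_join(part_a, part_b)
)

print(local_result == global_result)
```

The equivalence only holds because both tables use the same partitioning function on the join key; if one table were laid out differently, matching rows could sit on different nodes and a purely local join would miss them.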