Hi Philip,
In
https://github.com/apache/impala/blob/master/be/src/scheduling/scheduler.cc#L749
you see that we call ComputeFragmentExecParams() and
ComputeBackendExecParams(). You should check both to see that they're doing
the right thing in your case. The former also modifies the
Hi Philip,
Apologies for the delay. Since you're currently looking for a correct
implementation more than a fast one, I would highly recommend using debug
builds instead of release builds. The latter won't have DCHECKs enabled, and
you might find it much harder to debug any mistakes you make
Hi Philipp,
The ScanRangeAssignment logfile entry gets printed by the scheduler in L918
in PrintAssignment(). For each host and each plan node it shows the scan
ranges assigned. per_node_scan_ranges is a per-host structure in that
assignment. When inspecting the full logs you should be able to
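The per-host assignment described above can be pictured as a nested map. This is an illustrative sketch only, not Impala code: the real structures are C++ in scheduler.cc, and the names below are hypothetical stand-ins.

```python
# Conceptual sketch: the scan-range assignment printed by PrintAssignment()
# is, per the log output, a nested map of host -> plan node id -> scan ranges.
from collections import defaultdict

def build_assignment(entries):
    """entries: iterable of (host, plan_node_id, scan_range) tuples."""
    assignment = defaultdict(lambda: defaultdict(list))
    for host, node_id, scan_range in entries:
        assignment[host][node_id].append(scan_range)
    return assignment

def print_assignment(assignment):
    """Mimics the shape of the ScanRangeAssignment log entry."""
    for host, per_node in sorted(assignment.items()):
        for node_id, ranges in sorted(per_node.items()):
            print(f"host={host} node={node_id} ranges={ranges}")
```

Comparing such a dump across hosts makes it easy to spot a node that received no scan ranges at all.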
Hi Philipp,
Looking at the profile, one of your scan nodes doesn't seem to receive any
scan ranges ("Hdfs split stats" is empty). The other one receives one
split, but it gets filtered out by the runtime filter coming from that
first node ("Files rejected: 1"). You might want to disable runtime
Hello Alex,
I suppose you're very busy, so I apologize for the interruption. If you
have any idea of what I could try to solve this problem, please let me
know. Currently I don't know how to progress and I'd appreciate any help
you can give me.
Best regards
Philipp
Philipp Krause
Hi Alex! Thank you for the list! The build of the modified cdh5-trunk
branch (debug mode) was successful. After replacing
"impala-frontend-0.1-SNAPSHOT.jar" in
/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/jars/ I got the
following error in my existing cluster:
F0416 01:16:45.402997 17897
Here's the full list. It might not be minimal, but copying/overwriting
these should work.
debug/service/impalad
debug/service/libfesupport.so
debug/service/libService.a
release/service/impalad
release/service/libfesupport.so
release/service/libService.a
yarn-extras-0.1-SNAPSHOT.jar
Yes, I have a running (virtual) cluster. I'll try to follow your approach
with the custom Impala build (DistributedPlanner.java is the only modified
file at the moment). Thank you in advance for the file list!
Best regards
Philipp
Alexander Behm wrote on Fri., Apr 13
Apologies for the late response. Btw, your previous post was clear enough
to me, so no worries :)
On Wed, Apr 4, 2018 at 7:46 AM, Philipp Krause <
philippkrause.m...@googlemail.com> wrote:
> Hello Alex,
>
> I think my previous post has been too long and confusing. I apologize for
> that!
>
> If
On Wed, Mar 28, 2018 at 12:04 PM, Philipp Krause wrote:
> Thank you for your answer and sorry for my delay!
>
> If my understanding is correct, the list of scan nodes consists of all
> nodes which contain a *local* block from a table that is needed for the
>
Thank you for your answer and sorry for my delay!
If my understanding is correct, the list of scan nodes consists of all
nodes which contain a *local* block from a table that is needed for the
query (Assumption: I have no replicas in my first tests). If
TableA-Block0 is on Node_0, isn't
Thanks for following up. I think I understand your setup.
If you want to not think about scan ranges, then you can modify
HdfsScanNode.computeScanRangeLocations(). For example, you could change it
to produce one scan range per file or per HDFS block. That way you'd know
exactly what a scan range
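The "one scan range per HDFS block" idea can be sketched as follows. The real logic lives in HdfsScanNode.computeScanRangeLocations() in the Java frontend; the function below is a hypothetical stand-in for illustration.

```python
# Sketch (assumption: a scan range is just an (offset, length) pair within a
# file): split a file into one range per HDFS block, so each scan range maps
# 1:1 to a block and its location is unambiguous.
def scan_ranges_per_block(file_length, block_size):
    """Return (offset, length) ranges covering the file, one per block."""
    ranges = []
    offset = 0
    while offset < file_length:
        length = min(block_size, file_length - offset)
        ranges.append((offset, length))
        offset += length
    return ranges
```

With one range per block, whatever node hosts a block locally is the natural place to schedule that range.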
I'd like to provide a small example for our purpose. The last post may
be a bit confusing, so here's a very simple example in the attached pdf
file. I hope it's understandable. Otherwise, please give me a short
feedback.
Basically, I only want each data node to join all its local blocks. Is
Hi! At the moment the data to parquet (block) mapping is based on a
simple modulo function: Id % #data_nodes. So with 5 data nodes, all rows
with Id's 0, 5, 10, ... are written to Parquet_0, Id's 1, 6, 11, ... are
written to Parquet_1, etc. That's what I did manually. Since the parquet
file size and the
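The modulo mapping described in that snippet can be sketched in a couple of lines (the file-naming convention below simply follows the Parquet_N names used above):

```python
# Row-to-file mapping from the thread: Id % #data_nodes decides which
# Parquet file (and hence, after manual placement, which node) a row lands on.
def target_parquet_file(row_id, num_data_nodes):
    return f"Parquet_{row_id % num_data_nodes}"
```

Because every table uses the same mapping, rows with equal join keys end up in files on the same node.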
Thank you very much for this information! I'll try to implement these two
steps and post some updates within the next days!
Best regards
Philipp
2018-03-13 5:38 GMT+01:00 Alexander Behm :
> Cool that you're working on a research project with Impala!
>
> Properly adding
Cool that you're working on a research project with Impala!
Properly adding such a feature to Impala is a substantial effort, but
hacking the code for an experiment or two seems doable.
I think you will need to modify two things: (1) the planner to not add
exchange nodes, and (2) the scheduler to
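Modification (1), skipping the exchange nodes, can be illustrated with a toy plan tree. This is a conceptual sketch only; the real decision is made in the Java planner (DistributedPlanner), and the classes below are hypothetical stand-ins.

```python
# Sketch of "don't add exchange nodes when join inputs are co-located".
class PlanNode:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def create_join_plan(left, right, inputs_are_colocated):
    if inputs_are_colocated:
        # No exchanges: each node joins only its local blocks.
        return PlanNode("HASH JOIN", [left, right])
    # Default distributed plan: shuffle both sides on the join key.
    return PlanNode("HASH JOIN",
                    [PlanNode("EXCHANGE", [left]),
                     PlanNode("EXCHANGE", [right])])
```

Modification (2) would then have to make the scheduler assign matching scan ranges of both inputs to the same host, which is what the manual block placement in this thread is meant to guarantee.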
Thank you very much for your quick answers!
The intention behind this is to improve the execution time and
(primarily) to examine the impact of block-co-location (research
project) for this particular query (simplified):
select A.x, B.y, A.z from tableA as A inner join tableB as B on
Such a specific block arrangement is very uncommon for typical Impala
setups, so we don't attempt to recognize and optimize this narrow case. In
particular, such an arrangement tends to be short lived if you have the
HDFS balancer turned on.
Without making code changes, there is no way today to
Hello everyone!
In order to avoid network traffic, I'd like to perform local joins on each
node instead of exchanging the data and performing a join over the complete
data afterwards. My query is basically a join over three tables on an ID
attribute. The blocks are perfectly distributed,
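The correctness argument behind this local-join idea can be demonstrated on toy data: if every table is partitioned on the join key with the same function (here id % num_nodes, matching the mapping used later in the thread, and assuming no replicas), the union of the per-node local joins equals one global join. A small sketch:

```python
# Toy demonstration that per-node local joins equal a global join when all
# tables are partitioned identically on the join key. Helper names are
# made up for illustration.
def partition(rows, num_nodes):
    """Distribute rows to nodes by id % num_nodes."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[row["id"] % num_nodes].append(row)
    return parts

def join_on_id(a_rows, b_rows):
    """Simple hash join on the 'id' column."""
    b_by_id = {}
    for b in b_rows:
        b_by_id.setdefault(b["id"], []).append(b)
    return [(a, b) for a in a_rows for b in b_by_id.get(a["id"], [])]

def local_joins(table_a, table_b, num_nodes):
    """Join each node's partitions independently; no data exchange."""
    out = []
    for a_part, b_part in zip(partition(table_a, num_nodes),
                              partition(table_b, num_nodes)):
        out.extend(join_on_id(a_part, b_part))
    return out
```

Rows with equal ids always land in the same partition, so no join match can span two nodes and the exchange step contributes nothing.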