Adding the user list back to the thread.

Satish, a couple of things.

First off, the text reader doesn't currently implement what we call
skip-all semantics.  You can think of this as a way to avoid reading the
data if you're only asking for a count.  As such, you'll actually see
faster performance if you do a select count(columns[0]) instead of a select
count(*).

Second, upon reviewing your profile, it looks like there is some skew in
execution completion.  The completion time for major fragment 1 (the
reading fragment) has a mode of 10-11 minutes.  In these cases each thread
is averaging 2MM records/minute and each is reading 20-25MM records.
However, you have some instances where you're averaging only 700k
records/minute.  In the worst case, these slower threads are also the ones
reading ~32MM records.  So it looks like your performance is being impacted
by a combination of at least two things:

1. overcommitted nodes on EC2
2. uneven work distribution.
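
To make the skew concrete, here is a rough back-of-the-envelope sketch using
the figures quoted above.  The helper name is made up for illustration, and
pairing the slowest rate with the largest read is simply the worst case
described, not a number read directly from the profile.

```python
def minutes_to_finish(records: int, records_per_minute: int) -> float:
    """Estimated wall-clock minutes for one reading thread (hypothetical helper)."""
    return records / records_per_minute


# Typical threads: 20-25MM records at roughly 2MM records/minute.
typical_low = minutes_to_finish(20_000_000, 2_000_000)   # 10.0 minutes
typical_high = minutes_to_finish(25_000_000, 2_000_000)  # 12.5 minutes

# Worst case: ~32MM records at roughly 700k records/minute.
worst_case = minutes_to_finish(32_000_000, 700_000)      # about 45.7 minutes

# How much longer the slowest thread runs than a typical slow-end thread.
slowdown = worst_case / typical_high
```

Since the fragment can't complete until its slowest thread does, a handful
of threads like the worst case above can dominate the overall runtime.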

There isn't much you can do about (1) except kill off poor nodes,
recommission new nodes, and hope for better.  For (2), Drill does its best
here but you're also influenced by the nature of division.  Drill currently
splits on block boundaries.  You're running with effectively 150 slices in
this query.  If you happen to have 200 work units, then 50 of the slices
will have a larger amount of work to do.  In these cases, increasing
parallelization should level things out further (more nodes or greater
width per node).
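
The division effect can be sketched with a simplified model.  This assumes
equal-sized work units handed out as evenly as possible; it is not Drill's
actual assignment logic, and the function names are made up for
illustration.

```python
def units_per_slice(work_units: int, slices: int) -> list:
    """Spread work units across slices as evenly as whole units allow."""
    base, extra = divmod(work_units, slices)
    # The first `extra` slices each take one additional unit.
    return [base + 1 if i < extra else base for i in range(slices)]


def imbalance(work_units: int, slices: int) -> float:
    """Ratio of the busiest slice's load to the average load (1.0 is even)."""
    loads = units_per_slice(work_units, slices)
    return max(loads) / (sum(loads) / len(loads))


# 200 units over 150 slices: 50 slices get 2 units, 100 slices get 1.
loads_150 = units_per_slice(200, 150)
heavy_slices = loads_150.count(2)    # 50
skew_150 = imbalance(200, 150)       # 1.5x the average load

# With 200 slices, every slice gets exactly one unit.
skew_200 = imbalance(200, 200)       # 1.0
```

In this model the 50 heavier slices do twice the work of the rest, so the
fragment takes about as long as if every slice had two units.  Note that
the benefit depends on how the slice count divides the work: under the same
assumptions, 300 slices would leave 100 slices idle rather than even things
out.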

If you review the text subscan operator information, you'll see that you're
averaging 1 minute of wait time per thread on S3 returning data.  However,
you're also seeing upwards of 4-5 minutes on some nodes.  So while changing
to another
disk type may improve performance, that isn't where the bulk of the time is
being spent.

Another thing: I have seen some situations where gzip encoding does not use
the native gzip libraries for Hadoop.  I would review the logs to check
whether the native libraries are being loaded, as the general read speed
(2MM records/minute, whatever the core speed) is far below what I would
expect.

Finally, if you want to do performance runs, you may want to add
-Ddrill.unsafe_memory_access=true to the drill-env.conf as a command line
parameter.  This risks a JVM crash in the case of a memory bug but should
improve your memory access performance.  This will be the default shortly
but we still have one or two bugs that we can hit under high concurrency or
strange query patterns that we need to fix before enabling this for
everyone.

thanks,
Jacques

On Mon, Jun 8, 2015 at 6:04 AM, Satish Cattamanchi <[email protected]>
wrote:

>  Hi Jacques,
>
>  Thanks for your response. I have attached a copy of the JSON profile. I
> would love to see Apache Drill perform well, as we are looking to use right
> tool for Real Time Analytics.
>
>
>  Thanks,
> Satish
>
>
