Adding the user list back to the thread. Satish, a couple of things.
First off, the text reader doesn't currently implement what we call skip-all semantics. You can think of this as a way to avoid reading the data if you're only asking for a count. As such, you'll actually get faster performance if you do a select count(columns[0]) instead of a select count(*).

Second, upon reviewing your profile, it looks like there is some skew in execution completion. The completion time for major fragment 1 (the reading fragment) has a mode of 10-11 minutes. In these cases each thread is averaging 2MM records/minute and each is reading 20-25MM records. However, you have some instances where you're averaging only 700k records/minute. In the worst case, these slow threads are also the ones reading ~32MM records.

So it looks like your performance is being impacted by a combination of at least two things:

1. overcommitted nodes on EC2
2. uneven work distribution

There isn't much you can do about (1) except kill off poor nodes, recommission new nodes and hope for better. For (2), Drill does its best here but you're also influenced by the nature of division. Drill currently splits on block boundaries. You're running with effectively 150 slices in this query. If it happens you have 200 work units, then 50 of the slices will have a larger amount of work to do. In these cases, increasing parallelization should level things out further (more nodes or greater width per node).

If you review the text subscan operator information, you'll see that you're averaging about 1 minute of wait time per thread on S3 returning data. However, you're also seeing upwards of 4-5 minutes on some nodes. So while changing to another disk type may improve performance, that isn't where the bulk of the time is being spent.

Another thing: I have seen some situations where gzip encoding does not use the native gzip libraries for Hadoop. I would review the logs, as the general read speed (2MM records/minute, whatever the core speed) is far below what I would expect.
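
To put rough numbers on the skew and work-distribution points above, here is a small Python sketch. It is only a back-of-the-envelope model: the function names are made up for illustration, the round-robin split is an assumption (it is not how Drill actually assigns work, which also takes things like block locality into account), and the 200 work units are the hypothetical figure from above.

```python
from collections import Counter


def units_per_slice(work_units: int, slices: int) -> list[int]:
    """Hypothetical model: deal whole work units out to slices round-robin."""
    base, extra = divmod(work_units, slices)
    # The first `extra` slices each end up with one more unit than the rest.
    return [base + 1 if i < extra else base for i in range(slices)]


def minutes_to_read(records: int, records_per_minute: int) -> float:
    """Estimated read time for one thread at a steady rate."""
    return records / records_per_minute


if __name__ == "__main__":
    # 200 work units over 150 slices versus over 200 slices.
    for slices in (150, 200):
        load = units_per_slice(200, slices)
        print(slices, "slices:", dict(Counter(load)), "busiest slice:", max(load))

    # Typical thread: 20-25MM records at 2MM records/minute.
    print(minutes_to_read(20_000_000, 2_000_000), minutes_to_read(25_000_000, 2_000_000))
    # Slow thread: ~32MM records at 700k records/minute.
    print(round(minutes_to_read(32_000_000, 700_000), 1))
```

With 150 slices, 50 of them carry two units while the other 100 carry one, so the fragment finishes only when the doubly loaded slices do. With 200 slices every slice carries one unit. The read-time estimates show the same effect from the other side: 10 to 12.5 minutes for a typical thread, against roughly 45 minutes for a thread reading ~32MM records at 700k records/minute.
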

Finally, if you want to do performance runs, you may want to add -Ddrill.unsafe_memory_access=true to the drill-env.conf as a command line parameter. This risks a JVM crash in the case of a memory bug but should improve your memory access performance. This will be the default shortly, but we still have one or two bugs that we can hit under high concurrency or strange query patterns that we need to fix before enabling this for everyone.

thanks,
Jacques

On Mon, Jun 8, 2015 at 6:04 AM, Satish Cattamanchi <[email protected]> wrote:

> Hi Jacques,
>
> Thanks for your response. I have attached a copy of the JSON profile. I
> would love to see Apache Drill perform well, as we are looking to use right
> tool for Real Time Analytics.
>
> Thanks,
> Satish
