Apache Drill and S3 performance

Satish Cattamanchi Sun, 07 Jun 2015 16:20:31 -0700

We are evaluating Apache Drill performance and have set up Apache Drill on 
Amazon EC2.


All EC2 machines are of the r3.2xlarge instance type.

Model        vCPU   Mem (GiB)   SSD Storage (GB)
r3.2xlarge   8      61          1 x 160

ZooKeeper - 1 EC2 machine
Drillbits - 25 EC2 machines
Data location - Amazon S3
Data format - flat files, PSV (pipe-separated), gzip-compressed
Storage hierarchy - /logs/requests/y=2015/m=01/d=01/hh=-01/
Daily data size - approx. 2 TB
Daily rows - 3.5 billion

We are using Apache Drill with the default configuration.

I was able to configure Apache Drill, connect it to S3, and query the data 
stored there.

However, a count(*) on a single day folder takes around 45-50 minutes with the 
above setup. Other queries with a WHERE condition take a similar amount of 
time. I am wondering whether the slowness is due to copying data back and 
forth from S3.
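
For a rough sense of scale, the figures above can be turned into an implied 
scan rate. The sketch below is only a back-of-the-envelope estimate, not a 
measurement: it assumes that 2 TB means 2 x 10^12 bytes as stored on S3, takes 
47.5 minutes as the midpoint of the observed range, and assumes the work is 
spread evenly over the 25 drillbits. The function name is made up for 
illustration.

```python
def implied_scan_rate(total_bytes, total_rows, minutes, drillbits):
    """Return the aggregate and per-drillbit rates implied by one full scan."""
    seconds = minutes * 60
    mb_per_s = total_bytes / seconds / 1e6   # decimal megabytes per second
    rows_per_s = total_rows / seconds
    return {
        "aggregate_mb_per_s": mb_per_s,
        "per_drillbit_mb_per_s": mb_per_s / drillbits,
        "aggregate_rows_per_s": rows_per_s,
        "per_drillbit_rows_per_s": rows_per_s / drillbits,
    }


# Assumed inputs: 2 TB/day, 3.5 billion rows/day, 47.5 min scan, 25 drillbits.
rates = implied_scan_rate(2e12, 3.5e9, 47.5, 25)
for name, value in rates.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the scan works out to roughly 700 MB/s across the 
cluster, or about 28 MB/s per drillbit.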

Could anyone give some suggestions on setup/configuration to achieve better 
performance with Apache Drill?

Thanks,
Satish
