We are evaluating Apache Drill performance and have set up Apache Drill on Amazon EC2.
All EC2 machines are of the r3.2xlarge instance type.

Model        vCPU   Mem (GiB)   SSD Storage (GB)
r3.2xlarge   8      61          1 x 160

Zookeeper - 1 EC2 machine
Drillbits - 25 EC2 machines
Data on - Amazon S3
Data Format - Flat files, PSV (pipe-separated), gzipped
Storage Hierarchy - /logs/requests/y=2015/m=01/d=01/hh=-01/
Daily Data Size - approx. 2 TB
Daily Rows - 3.5B
Drill Configuration - default

I was able to configure Apache Drill, connect it to S3, and query the data there. However, a count(*) on a day folder takes around 45-50 minutes with the above setup. Queries with a WHERE condition take a similar amount of time.

I was wondering whether the slowness is due to copying data back and forth from S3. Could anyone suggest setup or configuration changes to achieve better performance with Apache Drill?

Thanks,
Satish
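
For reference, the figures above can be turned into an implied read rate, which may help judge whether S3 transfer alone could account for the time. This is only a back-of-envelope sketch in Python. It assumes the 2 TB figure is the number of bytes actually read from S3 (the post does not say whether that is the compressed or uncompressed size), that 1 TB = 10^12 bytes, and that the scan is spread evenly across all 25 Drillbits. The function name is illustrative, not part of any Drill tooling.

```python
def implied_throughput_mb_s(total_bytes, minutes, nodes):
    """Return (aggregate, per-node) read rate in MB/s for one full scan."""
    seconds = minutes * 60
    aggregate = total_bytes / seconds / 1e6  # bytes/s -> MB/s
    return aggregate, aggregate / nodes


DAILY_BYTES = 2e12  # ~2 TB per day (assumed to be the bytes read from S3)
DRILLBITS = 25      # number of Drillbit machines in the cluster

# Evaluate both ends of the observed 45-50 minute range.
for minutes in (45, 50):
    agg, per_node = implied_throughput_mb_s(DAILY_BYTES, minutes, DRILLBITS)
    print(f"{minutes} min: {agg:.0f} MB/s aggregate, "
          f"{per_node:.1f} MB/s per Drillbit")
```

Under these assumptions the cluster is reading roughly 27-30 MB/s per Drillbit. If 2 TB is the uncompressed size, the rate actually pulled from S3 would be lower still.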
