How much data is it in total? With JSON the nodes will have to read all of the data (unless you do some directory pruning), so check in the query profile how much time is spent scanning the data. It would also be good to see how fast the nodes can actually read from S3, as you are likely spending most of the time just reading the data from S3.
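For example, filtering on Drill's implicit dir0/dir1/dir2 columns (which correspond to the YEAR/DAY/HOUR subfolder levels) lets Drill prune directories instead of scanning every file. A rough sketch, where the storage plugin name s3, the workspace json, and the table path are placeholders:

    SELECT COUNT(*)
    FROM s3.json.`/mytable`
    WHERE dir0 = '2015' AND dir1 = '320'   -- scan only YEAR=2015, DAY=320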
To get a count of records with Drill it is normally better to use a field that you know is present in all records rather than *. That is less likely to run into issues with schema changes, and I also notice it avoids one project operator in the plan (which saves a little time); see the example query after the quoted message below.

--Andries

> On Nov 16, 2015, at 8:48 PM, Mikhailau, Alex <[email protected]> wrote:
>
> Guys,
>
> I am trying to evaluate performance of a basic query – select count(*) from
> MY_TABLE
>
> I have 800 million records partitioned in S3 in subfolders by YEAR/DAY/HOUR
> in 14MB GZ JSON files
>
> I have a 2+1 node cluster of m3.xlarge instances set up in EMR. It is taking
> over 54 minutes to return the total count. JSON documents are flat and
> contain only a few properties.
>
> Is there a reason why the query would take so long to execute? If yes, what
> is the faster option?
>
> Thank you.
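A minimal sketch of the COUNT change suggested above, assuming a field such as eventId (a placeholder name) that is present in every JSON record, and the same placeholder plugin/workspace/path as before:

    -- instead of: SELECT COUNT(*) FROM s3.json.`/mytable`
    SELECT COUNT(`eventId`)
    FROM s3.json.`/mytable`

Counting a concrete field keeps the schema expectation explicit and, as noted above, can save a project operator in the plan.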
