How much data is it in total? With JSON the nodes will have to read all of the data (unless you do some directory pruning), so check in the query profile how much time is spent scanning the data. It would also be good to see how fast the nodes can actually read from S3, as you are likely spending most of the time just reading the data from S3.
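For example, filtering on Drill's implicit dir0/dir1/dir2 columns (which correspond to the YEAR/DAY/HOUR subfolder levels) lets Drill prune directories instead of scanning every file. A rough sketch, where the storage plugin name s3, the workspace json, and the table path are placeholders:

    SELECT COUNT(*)
    FROM s3.json.`/mytable`
    WHERE dir0 = '2015' AND dir1 = '320'   -- scan only YEAR=2015, DAY=320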
To get a count of records with Drill it is normally better to use a field that you know is present in all records rather than *. That is less likely to run into issues with schema changes, and I also notice it avoids one project operator in the plan (which saves a little time); see the example query after the quoted message below.

--Andries

> On Nov 16, 2015, at 8:48 PM, Mikhailau, Alex <[email protected]> wrote:
>
> Guys,
>
> I am trying to evaluate performance of a basic query – select count(*) from
> MY_TABLE
>
> I have 800 million records partitioned in S3 in subfolders by YEAR/DAY/HOUR
> in 14MB GZ JSON files
>
> I have a 2+1 node cluster of m3.xlarge instances set up in EMR. It is taking
> over 54 minutes to return the total count. JSON documents are flat and
> contain only a few properties.
>
> Is there a reason why the query would take so long to execute? If yes, what
> is the faster option?
>
> Thank you.
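A minimal sketch of the COUNT change suggested above, assuming a field such as eventId (a placeholder name) that is present in every JSON record, and the same placeholder plugin/workspace/path as before:

    -- instead of: SELECT COUNT(*) FROM s3.json.`/mytable`
    SELECT COUNT(`eventId`)
    FROM s3.json.`/mytable`

Counting a concrete field keeps the schema expectation explicit and, as noted above, can save a project operator in the plan.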
