An extremely long time to run a query in EMR against S3 bucket of JSON files in GZ

Mikhailau, Alex Mon, 16 Nov 2015 20:49:00 -0800

Guys,

I am trying to evaluate the performance of a basic query: select count(*) from 
MY_TABLE


I have 800 million records in S3, partitioned into subfolders by YEAR/DAY/HOUR 
and stored as 14 MB gzipped JSON files.

I have a 2+1 node cluster of m3.xlarge instances set up in EMR. The query takes 
over 54 minutes to return the total count. The JSON documents are flat and 
contain only a few properties.
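
For scale, here is a rough back-of-the-envelope calculation of the throughput these numbers imply. It is only a sketch: it treats 54 minutes as the exact runtime (the query actually took a bit longer) and assumes the "2" in "2+1" is the number of worker nodes.

```python
# Rough throughput implied by the figures above.
records = 800_000_000        # total records in the table
elapsed_seconds = 54 * 60    # assumption: runtime taken as exactly 54 minutes
worker_nodes = 2             # assumption: "2+1" read as 2 workers + 1 master

records_per_second = records / elapsed_seconds
records_per_second_per_node = records_per_second / worker_nodes

print(f"{records_per_second:,.0f} records/s overall")
print(f"{records_per_second_per_node:,.0f} records/s per worker node")
```

Since the real runtime was above 54 minutes, the actual rates are somewhat lower than what this prints.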

Is there a reason why the query would take so long to execute? If so, what 
would be a faster option?

Thank you.



**********************************************************

MLB.com: Where Baseball is Always On
