How many GB are the JSON files?

Not sure how big the nodes are, but you may want to add more nodes to see if that
improves S3 read performance. It seems you may be reading a large volume of data
from S3 and it is simply taking a long time to pull it down. More nodes may give
you more aggregate bandwidth to S3. Might be a good experiment to run as well.
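
If you do experiment, a related knob worth checking is Drill's per-node parallelism,
since more minor fragments per node generally means more concurrent S3 reads. A minimal
sketch, assuming a stock Drill setup (the value below is illustrative only):

-- see the current per-node parallelism setting
SELECT * FROM sys.options WHERE name = 'planner.width.max_per_node';

-- raise it for the session, then re-run the query and compare the scan times
ALTER SESSION SET `planner.width.max_per_node` = 8;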

--Andries 

> On Nov 17, 2015, at 10:14 AM, Mikhailau, Alex <[email protected]> wrote:
> 
> There are 800 million JSON documents in S3, partitioned into subfolders by
> YEAR/DAY/HOUR and stored in GZ-compressed format. The data set contains only a
> few days' worth of records. I was trying to run a query similar to:
> 
> SELECT t.contentId, count(*) `count`
> FROM s3hbo.root.`/2015` t
> GROUP BY t.contentId
> ORDER BY `count` DESC
> LIMIT 20
> 
> 
> Going to try adding count(t.contentId) to see if there is an improvement.
> Looking at the profile, JSON_SUB_SCAN does indeed account for a significant
> portion of the job.
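> 
> For reference, the variant I have in mind (same query, just swapping
> count(*) for count(t.contentId)):
> 
> SELECT t.contentId, count(t.contentId) `count`
> FROM s3hbo.root.`/2015` t
> GROUP BY t.contentId
> ORDER BY `count` DESC
> LIMIT 20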
> 
> -Alex
> 
