Re: An extremely long time to run a query in EMR against S3 bucket of JSON files in GZ

Mikhailau, Alex Tue, 17 Nov 2015 10:15:02 -0800

There are 800 million JSON documents in S3, partitioned into subfolders by
YEAR/DAY/HOUR, in GZ-compressed format. The data set contains only a few
days' worth of records. I was trying to run a query similar to this:


SELECT t.contentId, count(*) `count`
FROM s3hbo.root.`/2015` t
GROUP BY t.contentId
ORDER BY `count` DESC
LIMIT 20


I am going to try adding count(t.contentId) to see if there is an improvement.
JSON_SUB_SCAN does indeed account for a significant portion of the job.
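
For reference, the count(t.contentId) variant mentioned above would be the
same query with only the aggregate changed. This is a sketch and has not been
run. Note that count(t.contentId) skips rows where contentId is NULL or
missing, so the totals can differ from count(*) if some documents lack the
field.

```sql
-- Same query as before; only the aggregate changes from count(*)
-- to count(t.contentId). Rows with a NULL/missing contentId are not counted.
SELECT t.contentId, count(t.contentId) `count`
FROM s3hbo.root.`/2015` t
GROUP BY t.contentId
ORDER BY `count` DESC
LIMIT 20
```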

-Alex

**********************************************************

MLB.com: Where Baseball is Always On
