Hi Alex,

You mentioned these JSON files are gzipped. Is the Hadoop native library set on Drill's java.library.path? I usually put -Djava.library.path=<hadoop native library> inside conf/drill-env.sh under DRILL_JAVA_OPTS.
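As a rough sketch of what that looks like (the native-library path below is a placeholder; substitute the location of your own Hadoop install):

```shell
# conf/drill-env.sh
# Point the Drillbit JVM at the Hadoop native libraries so the gzip codec
# can use the native zlib implementation instead of the pure-Java one.
# /opt/hadoop/lib/native is an example path -- adjust for your install.
export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Djava.library.path=/opt/hadoop/lib/native"
```

Restart the Drillbits after changing drill-env.sh so the new JVM options take effect.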
~ Amit.

On Tue, Nov 17, 2015 at 10:20 AM, Andries Engelbrecht <[email protected]> wrote:

> How many GB are the JSON files?
>
> Not sure how big the nodes are, but you may want to add more nodes to see
> if it can potentially improve the S3 read performance. It seems you may be
> reading a large volume of data from S3 and it is simply taking a long time
> to read. More nodes may get you more bandwidth to read from S3. Might be a
> good experiment to test as well.
>
> --Andries
>
> > On Nov 17, 2015, at 10:14 AM, Mikhailau, Alex <[email protected]> wrote:
> >
> > There are 800 million JSON documents in an S3 partition, in subfolders by
> > YEAR/DAY/HOUR, in GZ-compressed format. The data set contains only a few
> > days' worth of records. I was trying to run a query similar to:
> >
> > SELECT t.contentId, count(*) `count`
> > FROM s3hbo.root.`/2015` t
> > GROUP BY t.contentId
> > ORDER BY `count` DESC
> > LIMIT 20
> >
> > Going to try to add count(t.contentId) to see if there is an improvement.
> > Looking at JSON_SUB_SCAN, it is indeed a significant portion of the job.
> >
> > -Alex
> >
> > **********************************************************
> >
> > MLB.com: Where Baseball is Always On
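For reference, the revised query Alex describes (swapping count(*) for count(t.contentId)) would presumably look something like this sketch; the schema and path are taken from the original query, everything else is unchanged:

```
SELECT t.contentId, COUNT(t.contentId) AS `count`
FROM s3hbo.root.`/2015` t
GROUP BY t.contentId
ORDER BY `count` DESC
LIMIT 20
```

Since contentId is also the GROUP BY key, COUNT(t.contentId) and COUNT(*) differ only in that the former skips rows where contentId is NULL, so the results should match unless some documents lack that field.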
