Hi Alex,

You mentioned these JSON files are gzipped. Is the Hadoop native library set on Drill's java.library.path? I usually put -Djava.library.path=<hadoop native library> inside conf/drill-env.sh under DRILL_JAVA_OPTS.
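As a rough sketch of what that looks like (the native-library path below is a placeholder; substitute the location of your own Hadoop install):

```shell
# conf/drill-env.sh
# Point the Drillbit JVM at the Hadoop native libraries so the gzip codec
# can use the native zlib implementation instead of the pure-Java one.
# /opt/hadoop/lib/native is an example path -- adjust for your install.
export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Djava.library.path=/opt/hadoop/lib/native"
```

Restart the Drillbits after changing drill-env.sh so the new JVM options take effect.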
~ Amit.

On Tue, Nov 17, 2015 at 10:20 AM, Andries Engelbrecht <[email protected]> wrote:

> How many GB are the JSON files?
>
> Not sure how big the nodes are, but you may want to add more nodes to see
> if it can potentially improve the S3 read performance. It seems you may be
> reading a large volume of data from S3 and it is simply taking a long time
> to read. More nodes may get you more bandwidth to read from S3. Might be a
> good experiment to test as well.
>
> --Andries
>
> > On Nov 17, 2015, at 10:14 AM, Mikhailau, Alex <[email protected]> wrote:
> >
> > There are 800 million JSON documents in an S3 partition, in subfolders by
> > YEAR/DAY/HOUR, in GZ-compressed format. The data set contains only a few
> > days' worth of records. I was trying to run a query similar to:
> >
> > SELECT t.contentId, count(*) `count`
> > FROM s3hbo.root.`/2015` t
> > GROUP BY t.contentId
> > ORDER BY `count` DESC
> > LIMIT 20
> >
> > Going to try to add count(t.contentId) to see if there is an improvement.
> > Looking at JSON_SUB_SCAN, it is indeed a significant portion of the job.
> >
> > -Alex
> >
> > **********************************************************
> >
> > MLB.com: Where Baseball is Always On
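For reference, the revised query Alex describes (swapping count(*) for count(t.contentId)) would presumably look something like this sketch; the schema and path are taken from the original query, everything else is unchanged:

```
SELECT t.contentId, COUNT(t.contentId) AS `count`
FROM s3hbo.root.`/2015` t
GROUP BY t.contentId
ORDER BY `count` DESC
LIMIT 20
```

Since contentId is also the GROUP BY key, COUNT(t.contentId) and COUNT(*) differ only in that the former skips rows where contentId is NULL, so the results should match unless some documents lack that field.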
