My files are organized like this, which amplifies the issue:
/category/random-hash/year/month/day/hour/data-chunk-000.json.gz

The random hash is there to trick S3 into using a different partition/shard for each put [1], but it looks like this structure clashes with the way Drill/hadoop.fs.s3a lists files.
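In case it helps, here's a minimal sketch of the idea, assuming Python/boto3; make_key and the bucket name are made up, not the actual code:

    import hashlib
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")

    def make_key(category, seq, payload):
        # Hypothetical helper: derive a short pseudo-random component
        # from the payload so consecutive puts hash to different S3
        # partitions/shards.
        h = hashlib.md5(payload).hexdigest()[:8]
        now = datetime.now(timezone.utc)
        return "{}/{}/{:%Y/%m/%d/%H}/data-chunk-{:03d}.json.gz".format(
            category, h, now, seq)

    s3.put_object(Bucket="my-bucket",                 # made-up bucket
                  Key=make_key("category", 0, b"{}"),
                  Body=b"{}")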
I think it should be possible to get the complete list of files under a given "directory" (e.g. `/category`) with a single flat listing (one HTTP request per 1,000 keys, since S3 paginates list results), but I don't know how hard it would be to incorporate that behavior.
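Something like this boto3 sketch is what I have in mind (bucket name made up); since S3 keys are really flat, listing ~5600 files is only a handful of round trips:

    import boto3

    s3 = boto3.client("s3")

    # One flat listing under the prefix instead of a recursive
    # directory walk; with no Delimiter set, S3 ignores the "/"
    # structure entirely and returns up to 1,000 keys per request.
    paginator = s3.get_paginator("list_objects_v2")
    keys = [obj["Key"]
            for page in paginator.paginate(Bucket="my-bucket",
                                           Prefix="category/")
            for obj in page.get("Contents", [])]
    print("found", len(keys), "files")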
Any ideas? How are you organizing your S3 files to get good performance?
Thanks!

[1]: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

On Thu, Mar 10, 2016 at 12:27:42PM +0200, Oscar Morante wrote:
> I'm querying 20 GB of gzipped JSON split into ~5600 small files with
> sizes ranging from 1 MB to 30 MB. Drill is running in AWS on 4
> m4.xlarge nodes, and it's taking around 50 minutes before the query
> starts executing. Any idea what could be causing this delay? What's
> the best way to debug this?
>
> Thanks,
--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
-- Isaac Asimov.
