You were right: the time is actually spent decompressing the files. I must have messed up my initial profiling. I'm already using the native libraries, so I guess that's as good as it gets.
After checking that an uncompressed file downloads at full speed, I did some more tests and it turns out that the query on the gzipped file is faster even though the download rate is lower! So at least I learned a good lesson about making wrong assumptions, hehe.
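A rough way to see why a lower download rate can still mean a faster query is the back-of-the-envelope sketch below. It assumes the rates reported earlier in this thread (about 5 while Drill reads the gzipped file, about 60 for a plain download) still apply, which the later tests may not match exactly. The function names and the sample compression ratios are hypothetical and were not measured on this file.

```python
def break_even_ratio(plain_rate, gzip_rate):
    """Compression ratio above which reading the gzipped file wins.

    plain_rate: wire rate when reading the uncompressed file.
    gzip_rate:  wire rate observed while reading the gzipped file.
    Both must use the same unit.
    """
    return plain_rate / gzip_rate


def gzip_is_faster(plain_rate, gzip_rate, compression_ratio):
    """True if gzipped input delivers uncompressed data faster than plain input."""
    # Each compressed byte expands to `compression_ratio` uncompressed bytes.
    return gzip_rate * compression_ratio > plain_rate


# Rates taken from earlier messages in this thread (assumed to still hold).
PLAIN_RATE = 60.0
GZIP_RATE = 5.0

threshold = break_even_ratio(PLAIN_RATE, GZIP_RATE)
print(f"gzip wins above a compression ratio of {threshold:.1f}")

# Hypothetical ratios, for illustration only.
for ratio in (8.0, 15.0):
    print(ratio, gzip_is_faster(PLAIN_RATE, GZIP_RATE, ratio))
```

If the real file compresses better than the break-even ratio, the gzipped query finishing first is what you would expect.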
Anyways, I'm still wondering if there are any other places I can squeeze some extra performance from, and I can only think of:
- Maybe using snappy to decompress? I tried to find where the decompression takes place but I couldn't find it.
- Use a different compression algorithm, any recommendations based on experience?
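One way to compare candidate algorithms before re-encoding real data is a small timing sketch like the one below. It is only an illustration: it uses the codecs in the Python standard library (gzip, bz2, lzma) as stand-ins because snappy needs a third-party package, the sample data is synthetic JSON lines, and the timings say nothing about the Java codecs Drill actually uses.

```python
import bz2
import gzip
import json
import lzma
import time

# Standard-library codecs only; snappy is deliberately left out.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def make_sample(rows=20000):
    """Build synthetic JSON-lines data (a hypothetical stand-in for real input)."""
    lines = (
        json.dumps({"id": i, "somefield": f"value-{i % 100}"})
        for i in range(rows)
    )
    return "\n".join(lines).encode("utf-8")


def benchmark(data):
    """Return {codec: (compression ratio, decompression seconds)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        start = time.perf_counter()
        unpacked = decompress(packed)
        elapsed = time.perf_counter() - start
        if unpacked != data:
            raise ValueError(f"{name} round trip was not lossless")
        results[name] = (len(data) / len(packed), elapsed)
    return results


if __name__ == "__main__":
    for name, (ratio, seconds) in benchmark(make_sample()).items():
        print(f"{name}: ratio {ratio:.1f}x, decompress {seconds * 1000:.1f} ms")
```

The interesting number is the trade-off between ratio and decompression time: a codec that compresses less but decompresses much faster may suit a CPU-bound query better.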
Thanks!

On Sun, Feb 21, 2016 at 04:58:17PM -0800, Jacques Nadeau wrote:
> The zipped question is a good one. I believe you need to add extra native libraries to get reasonable performance when using gzip files. See if you are seeing this in your logs or out:
>
>   Unable to load native-hadoop library for your platform
>
> If so, you probably need to get them set up per here:
> https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/NativeLibraries.html
>
> On Sun, Feb 21, 2016 at 11:53 AM, Oscar Morante <[email protected]> wrote:
>> I'm still fighting with this, so far I've tried:
>>
>> - Adding the native hadoop libraries.
>> - `s3n://` and `s3a://`.
>> - SSL on/off.
>> - Different S3 endpoints.
>> - Different values in `drill.exec.buffer.size`.
>>
>> None of these seem to make a difference, and `s3cmd` is always ten times faster than Drill to download the same file.
>>
>> Netstat shows about 800k piling up in Recv-Q during the query, while `s3cmd` is pretty much clean the whole time. I've also noticed that if I cancel the query in the middle, the download speed suddenly goes up and matches `s3cmd` for a while before it stops.
>>
>> Is there anything else that I can try to improve the situation? At the beginning I thought that S3 was the bottleneck, but everything is pointing to some kind of lock in Drill. Or maybe I'm just being unrealistic and asking too much :?
>>
>> Cheers,
>>
>> On Fri, Feb 19, 2016 at 02:27:56PM +0200, Oscar Morante wrote:
>>> Hi there,
>>>
>>> I'm experiencing very slow download rates from S3, but only when using Drill. This is testing with only one drillbit and querying a 250Mb gzipped JSON:
>>>
>>>   select count(somefield) from s3.`test/big.json.gz`;
>>>
>>> The download speed while Drill is executing the query is about 5Mb/s. Then if I try downloading the same file from the same environment using `s3cmd`, the average speed is about 60Mb/s.
>>>
>>> Any idea what could be causing such a big difference? I'm not sure what's the best way to debug this, or what the relevant configuration parameters are that I should be tweaking.
>>>
>>> Thanks!

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
-- Isaac Asimov.
