You were right: the time is actually spent decompressing the files. I must have messed up my initial profiling. I'm already using the native libraries, so I guess that's as good as it gets.

After checking that an uncompressed file downloads at full speed, I did some more tests and it turns out that the query on the gzipped file is faster even though the download rate is lower! So at least I learned a good lesson about making wrong assumptions, hehe.
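A rough back-of-the-envelope sketch of why that can happen, using the rates from my first message (about 5 per second for the gzipped file under Drill, about 60 per second at full speed, both in the same unit). The compression ratio of the real file isn't known here, so the ratios in the loop are hypothetical, and the model only counts transfer time, ignoring JSON parsing and everything else Drill does:

```python
def transfer_seconds(size, rate):
    """Time to move `size` units of data at `rate` units per second."""
    return size / rate


def break_even_ratio(full_rate, throttled_rate):
    """Compression ratio above which the compressed file finishes first.

    Compressed:   size / throttled_rate
    Uncompressed: size * ratio / full_rate
    The two are equal when ratio == full_rate / throttled_rate.
    """
    return full_rate / throttled_rate


if __name__ == "__main__":
    gz_size, gz_rate, full_rate = 250, 5, 60  # figures quoted in the thread
    print("gzipped transfer:", transfer_seconds(gz_size, gz_rate), "s")
    print("break-even compression ratio:", break_even_ratio(full_rate, gz_rate))
    for ratio in (5, 12, 20):  # hypothetical ratios, not measured
        seconds = transfer_seconds(gz_size * ratio, full_rate)
        print(f"{ratio}x -> {seconds:.1f} s uncompressed")
```

Under these assumptions, the gzipped file wins on transfer time alone whenever it compresses better than the ratio between the two download rates.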

Anyway, I'm still wondering whether there is anywhere else I can squeeze out some extra performance. So far I can only think of:

- Maybe using snappy to decompress? I tried to find where the decompression takes place but I couldn't find it.
- Use a different compression algorithm. Any recommendations based on experience?
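One way to get a first feel for the second option is a small local benchmark. The sketch below is only an approximation: it uses the codecs from Python's standard library (gzip, bz2, lzma), not the Hadoop codecs that Drill actually calls, it leaves out snappy because that needs a third-party package, and the sample data is synthetic rather than the real file. Names such as `make_sample` and `benchmark` are made up for the example.

```python
import bz2
import gzip
import json
import lzma
import time


def make_sample(n_records=20000):
    """Build a synthetic JSON-lines payload (hypothetical data, not the real file)."""
    rows = ({"id": i, "somefield": i % 97, "name": f"user-{i}"} for i in range(n_records))
    return "\n".join(json.dumps(row) for row in rows).encode("utf-8")


# Each entry maps a codec name to its (compress, decompress) functions.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data):
    """Return {codec: (compression_ratio, decompress_seconds)} for each codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        start = time.perf_counter()
        unpacked = decompress(packed)
        elapsed = time.perf_counter() - start
        assert unpacked == data  # sanity check: the round trip must be lossless
        results[name] = (len(data) / len(packed), elapsed)
    return results


if __name__ == "__main__":
    for name, (ratio, seconds) in benchmark(make_sample()).items():
        print(f"{name}: ratio {ratio:.1f}x, decompress {seconds * 1000:.1f} ms")
```

The numbers will differ from what Drill sees, but the trade-off between compression ratio and decompression time should at least show up.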

Thanks!


On Sun, Feb 21, 2016 at 04:58:17PM -0800, Jacques Nadeau wrote:
The zipped question is a good one. I believe you need to add extra native
libraries to get reasonable performance when using gzip files.

See if you are seeing this in your logs or out:

Unable to load native-hadoop library for your platform

If so, you probably need to get them set up as described here:

https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/NativeLibraries.html
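A minimal sketch for checking a log file for that message. The only thing taken from above is the warning text itself; the function names are made up for the example, and which file to scan depends on the installation:

```python
WARNING_TEXT = "Unable to load native-hadoop library for your platform"


def native_library_warnings(lines):
    """Return (line_number, line) pairs for lines that contain the warning text."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if WARNING_TEXT in line
    ]


def scan_log(path):
    """Scan the file at `path` (the path is an assumption; adjust to your setup)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return native_library_warnings(handle)
```

An empty result from `scan_log` only means the message was not found in that particular file, not that the native libraries are definitely loaded.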


On Sun, Feb 21, 2016 at 11:53 AM, Oscar Morante <[email protected]> wrote:

I'm still fighting with this, so far I've tried:

 - Adding the native hadoop libraries.
 - `s3n://` and `s3a://`.
 - SSL on/off.
 - different S3 endpoints.
 - different values in `drill.exec.buffer.size`.

None of these seems to make a difference, and `s3cmd` is always ten times
faster than Drill at downloading the same file.  Netstat shows about 800k
piling up in Recv-Q during the query, while with `s3cmd` it stays pretty
much clean the whole time.

I've also noticed that if I cancel the query in the middle, the download
speed suddenly goes up and matches `s3cmd` for a while before it stops.

Is there anything else that I can try to improve the situation?  At the
beginning I thought that S3 was the bottleneck, but everything is pointing
to some kind of lock in Drill.

Or maybe I'm just being unrealistic and asking too much :?
Cheers,



On Fri, Feb 19, 2016 at 02:27:56PM +0200, Oscar Morante wrote:

Hi there,

I'm experiencing very slow download rates from S3 but only when using
Drill.  This is testing with only one drillbit and querying a 250Mb gzipped
JSON:

  select count(somefield) from s3.`test/big.json.gz`;

The download speed while drill is executing the query is about 5Mb/s.
Then if I try downloading the same file from the same environment using
`s3cmd` the average speed is about 60Mb/s.

Any idea what could be causing such a big difference?  I'm not sure what
the best way to debug this is, or which configuration parameters are
relevant and worth tweaking.

Thanks!

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
                                                         -- Isaac Asimov.
