I'm still fighting with this. So far I've tried:

 - Adding the native hadoop libraries.
 - `s3n://` and `s3a://`.
 - SSL on/off.
 - Different S3 endpoints.
 - Different values for `drill.exec.buffer.size`.

None of these seem to make a difference, and `s3cmd` is always ten times faster than Drill at downloading the same file. Netstat shows about 800k piling up in Recv-Q during the query, whereas with `s3cmd` Recv-Q stays pretty much clean the whole time.
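
Data sitting in Recv-Q means the reading process is not draining the socket as fast as the data arrives, so one thing worth ruling out is whether plain single-threaded gzip decompression is already slower than the network. A rough sketch for timing that on a local copy of the file follows. The function name `gunzip_throughput` and the local path are my own assumptions for illustration, and this says nothing about what Drill does internally:

```python
import gzip
import os
import time


def gunzip_throughput(path, chunk_size=1 << 20):
    """Decompress `path` in a single thread and report how fast
    the compressed bytes are consumed (hypothetical helper)."""
    compressed = os.path.getsize(path)
    decompressed = 0
    start = time.perf_counter()
    with gzip.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:  # end of the gzip stream
                break
            decompressed += len(chunk)
    elapsed = time.perf_counter() - start
    return {
        "compressed_bytes": compressed,
        "decompressed_bytes": decompressed,
        "seconds": elapsed,
        # Rate comparable to a download speed, in 10^6 bytes per second.
        "compressed_mb_per_s": compressed / elapsed / 1e6 if elapsed > 0 else float("inf"),
    }
```

Calling `gunzip_throughput("big.json.gz")` on a copy fetched with `s3cmd` gives a rate to compare against the download speed seen during the query. If it comes out far above that, decompression alone is probably not the limit.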

I've also noticed that if I cancel the query in the middle, the download speed suddenly goes up and matches `s3cmd` for a while before it stops.

Is there anything else that I can try to improve the situation? At the beginning I thought that S3 was the bottleneck, but everything is pointing to some kind of lock in Drill.

Or maybe I'm just being unrealistic and asking too much :?
Cheers,


On Fri, Feb 19, 2016 at 02:27:56PM +0200, Oscar Morante wrote:
Hi there,

I'm experiencing very slow download rates from S3, but only when using Drill. This is when testing with only one drillbit and querying a 250Mb gzipped JSON file:

  select count(somefield) from s3.`test/big.json.gz`;

The download speed while Drill is executing the query is about 5Mb/s. If I then download the same file from the same environment using `s3cmd`, the average speed is about 60Mb/s.

Any idea what could be causing such a big difference? I'm not sure what the best way to debug this is, or which configuration parameters I should be tweaking.

Thanks!



--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
                                                         -- Isaac Asimov.
