I'm still fighting with this. So far I've tried:

- Adding the native Hadoop libraries.
- `s3n://` and `s3a://`.
- Turning SSL on and off.
- Different S3 endpoints.
- Different values for `drill.exec.buffer.size`.

None of these seems to make a difference, and `s3cmd` is always ten times faster than Drill at downloading the same file. Netstat shows about 800k piling up in Recv-Q during the query, whereas with `s3cmd` Recv-Q stays pretty much clean the whole time.
I've also noticed that if I cancel the query in the middle, the download speed suddenly goes up and matches `s3cmd` for a while before it stops.
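
In case anyone wants to reproduce the Recv-Q observation, here is a minimal sketch of how the numbers could be pulled out of `netstat -tn` output. It assumes the usual Linux net-tools column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name `recv_q_by_remote` is made up for illustration and is not part of Drill or any library.

```python
def recv_q_by_remote(netstat_output):
    """Sum the Recv-Q column per remote address from `netstat -tn` text.

    Assumes the Linux net-tools layout:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    totals = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the banner, the column header, and anything that is not TCP.
        if len(fields) < 5 or not fields[0].startswith("tcp"):
            continue
        try:
            recv_q = int(fields[1])
        except ValueError:
            continue
        remote = fields[4]
        totals[remote] = totals.get(remote, 0) + recv_q
    return totals
```

Feeding it the text of `netstat -tn` captured every second or so while the query runs, and again while `s3cmd` downloads the same file, should show whether bytes are sitting unread in the socket buffer on the connection to S3.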
Is there anything else that I can try to improve the situation? At the beginning I thought that S3 was the bottleneck, but everything is pointing to some kind of lock in Drill.
Or maybe I'm just being unrealistic and asking too much :?

Cheers,

On Fri, Feb 19, 2016 at 02:27:56PM +0200, Oscar Morante wrote:
Hi there,

I'm experiencing very slow download rates from S3 but only when using Drill. This is testing with only one drillbit and querying a 250Mb gzipped JSON:

select count(somefield) from s3.`test/big.json.gz`;

The download speed while Drill is executing the query is about 5Mb/s. Then if I try downloading the same file from the same environment using `s3cmd` the average speed is about 60Mb/s.

Any idea what could be causing such a big difference? I'm not sure what's the best way to debug this, or what are the relevant configuration parameters that I should be tweaking.

Thanks!

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
-- Isaac Asimov.
