Did you try with the data unzipped?

On Sun, Feb 21, 2016 at 2:53 PM, Oscar Morante <[email protected]> wrote:
> I'm still fighting with this. So far I've tried:
>
> - Adding the native Hadoop libraries.
> - `s3n://` and `s3a://`.
> - SSL on/off.
> - Different S3 endpoints.
> - Different values in `drill.exec.buffer.size`.
>
> None of these seem to make a difference, and `s3cmd` is always ten times
> faster than Drill at downloading the same file. Netstat shows about 800k
> piling up in Recv-Q during the query, while `s3cmd` is pretty much clean the
> whole time.
>
> I've also noticed that if I cancel the query in the middle, the download
> speed suddenly goes up and matches s3cmd for a while before it stops.
>
> Is there anything else that I can try to improve the situation? At the
> beginning I thought that S3 was the bottleneck, but everything is pointing to
> some kind of lock in Drill.
>
> Or maybe I'm just being unrealistic and asking too much :?
>
> Cheers,
>
> On Fri, Feb 19, 2016 at 02:27:56PM +0200, Oscar Morante wrote:
>
>> Hi there,
>>
>> I'm experiencing very slow download rates from S3, but only when using
>> Drill. This is testing with only one drillbit and querying a 250Mb gzipped
>> JSON:
>>
>> select count(somefield) from s3.`test/big.json.gz`;
>>
>> The download speed while Drill is executing the query is about 5Mb/s.
>> Then if I try downloading the same file from the same environment using
>> `s3cmd` the average speed is about 60Mb/s.
>>
>> Any idea what could be causing such a big difference? I'm not sure
>> what's the best way to debug this, or what are the relevant configuration
>> parameters that I should be tweaking.
>>
>> Thanks!
>
> --
> Oscar Morante
> "Self-education is, I firmly believe, the only kind of education there is."
> -- Isaac Asimov.

--
----------------------------------
Paul Ilechko
Senior Systems Engineer
MapR Technologies
908 331 2207
