Hi Andy, I think that in your specific setup the CPU becomes the bottleneck, which leads to the slower query time. You could try the query on a system with a faster CPU, and/or try a lower compression ratio on the data.
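
To illustrate the point, here is a minimal Python sketch of why the gzipped scan ends up CPU-bound: the same "count the rows" work is done on plain bytes and on gzip-decompressed bytes. The data, sizes, and compression level below are illustrative assumptions, not your actual files.

```python
import gzip
import time

# Synthetic CSV-ish data (assumption: contents and size chosen for illustration).
raw = b"1,foo,bar\n" * 1_000_000          # ~10 MB of uncompressed rows
blob = gzip.compress(raw, compresslevel=6)

t0 = time.process_time()
rows_plain = raw.count(b"\n")             # "scan" the plain bytes
t_plain = time.process_time() - t0

t0 = time.process_time()
rows_gz = gzip.decompress(blob).count(b"\n")  # same scan, plus decompression
t_gz = time.process_time() - t0

print(rows_plain, rows_gz, f"plain={t_plain:.4f}s gz={t_gz:.4f}s")
```

On most machines the gzip path is dominated by the single-threaded decompression CPU time, which is roughly the extra cost each scan fragment pays per gzipped file.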
On Wed, Oct 7, 2015 at 9:15 PM, Andy Pernsteiner <[email protected]> wrote:
> In thinking this through, it probably is somewhat expected to see a slowdown
> when having to decompress data (esp gzip) as part of running a Drill query.
>
> Andy Pernsteiner
> Manager, Field Enablement
> ph: 206.228.0737
>
> www.mapr.com
> Now Available - Free Hadoop On-Demand Training
>
> From: Andy Pernsteiner <[email protected]>
> Reply: Andy Pernsteiner <[email protected]>
> Date: October 7, 2015 at 11:27:47 AM
> To: [email protected] <[email protected]>
> Subject: Drill + gzipped-CSV performance
>
> I'm running some experimental queries, both against CSV and against
> gzipped CSV (same data, same file count, etc).
>
> I'm doing a simple:
>
>> select count(columns[0]) from dfs.workspace.`/csv`
>
> and
>
>> select count(columns[0]) from dfs.workspace.`/gz`
>
> Here are my results:
>
> 70 files, plain CSV, 5GB on disk: 4.8s
> 70 files, gzipped CSV, 1.7GB on disk (5GB uncompressed): 30.4s
>
> When looking at profiles, it would appear that most of the time is spent on
> the TEXT_SUB_SCAN operation. Both queries spawn the same # of
> minor fragments for this phase (68), but the process_time for those minor
> fragments is an average of 24s for the GZ data (most of the fragments are
> pretty close to each other in terms of deviation), and 700ms average for the
> plain CSV data.
>
> Is this expected?
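
A quick back-of-envelope check on the numbers reported above (assuming the 5GB uncompressed size is split evenly across the 68 minor fragments):

```python
# Arithmetic from the figures in the thread; the even split is an assumption.
uncompressed_gb = 5
fragments = 68
gz_avg_s = 24.0        # average process_time per fragment, gzipped data
csv_avg_ms = 700       # average process_time per fragment, plain CSV

mb_per_fragment = uncompressed_gb * 1024 / fragments
gz_mbps = mb_per_fragment / gz_avg_s
csv_mbps = mb_per_fragment / (csv_avg_ms / 1000)
print(round(mb_per_fragment, 1), round(gz_mbps, 1), round(csv_mbps, 1))
# → 75.3 3.1 107.6
```

So each fragment processes roughly 75MB of uncompressed data, at about 3MB/s for the gzipped files versus about 108MB/s for plain CSV. That ~34x gap per fragment is consistent with decompression being the bottleneck: gzip streams are not splittable, so each file is decompressed by a single thread regardless of how many fragments are available.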
