In thinking this through, it is probably expected to see some slowdown 
when having to decompress data (especially gzip) as part of running a Drill query.
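As a rough sanity check that decompression alone can account for a lot of time, here is a small standalone Python sketch (independent of Drill; the row contents and the ~50MB size are made up for illustration) that times single-threaded gzip decompression:

```python
import gzip
import time

# Synthetic CSV-like payload, ~50 MB; sizes and contents are illustrative only.
row = b"1234567890,abcdefghij,9876543210\n"
data = row * (50_000_000 // len(row))

compressed = gzip.compress(data, compresslevel=6)

# Time a single-threaded decompression pass, which is roughly what each
# scan thread has to do before it can parse any CSV bytes.
start = time.perf_counter()
decompressed = gzip.decompress(compressed)
elapsed = time.perf_counter() - start

assert decompressed == data
print(f"compression ratio: {len(data) / len(compressed):.1f}x")
print(f"decompress throughput: {len(data) / elapsed / 1e6:.0f} MB/s")
```

Whatever throughput this prints on a given box is an upper bound on how fast any single scan fragment can consume a gzipped file, since the bytes must be inflated serially before parsing.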



 Andy Pernsteiner
 Manager, Field Enablement
ph: 206.228.0737

www.mapr.com
Now Available - Free Hadoop On-Demand Training



From: Andy Pernsteiner <[email protected]>
Reply: Andy Pernsteiner <[email protected]>
Date: October 7, 2015 at 11:27:47 AM
To: [email protected] <[email protected]>
Subject:  Drill + gzipped-CSV performance  

I'm running some experimental queries against both plain CSV and 
gzipped CSV (same data, same file count, etc.).

I'm doing a simple:

> select count(columns[0]) from dfs.workspace.`/csv`

and

> select count(columns[0]) from dfs.workspace.`/gz`

Here are my results:

70 files, plain CSV, 5GB on disk: 4.8s

70 files, gzipped CSV, 1.7GB on disk (5GB uncompressed): 30.4s


Looking at the profiles, it appears that most of the time is spent in the 
TEXT_SUB_SCAN operator. Both queries spawn the same number of minor fragments 
for this phase (68), but process_time for those minor fragments averages 
24s for the gzipped data (most of the fragments deviate little from each 
other), versus an average of 700ms for the plain CSV data.
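To see how much of that per-fragment gap pure decompression explains outside of Drill, one could mimic the count(columns[0]) scan with a line-count over the same payload in plain and gzipped form. This is just a sketch, not Drill's scan path; the file names and the ~66MB size below are throwaway values:

```python
import gzip
import os
import tempfile
import time

# Synthetic payload; size is illustrative, not the 5 GB from the test.
row = b"1234567890,abcdefghij,9876543210\n"
payload = row * 2_000_000  # ~66 MB

tmp = tempfile.mkdtemp()
plain = os.path.join(tmp, "sample.csv")
packed = os.path.join(tmp, "sample.csv.gz")
with open(plain, "wb") as f:
    f.write(payload)
with gzip.open(packed, "wb") as f:
    f.write(payload)

def count_rows(opener, path):
    """Time a line-by-line scan, loosely analogous to a text sub-scan."""
    start = time.perf_counter()
    with opener(path, "rb") as f:
        rows = sum(1 for _ in f)
    return rows, time.perf_counter() - start

n_plain, t_plain = count_rows(open, plain)
n_gz, t_gz = count_rows(gzip.open, packed)
assert n_plain == n_gz
print(f"plain: {t_plain:.2f}s  gzip: {t_gz:.2f}s  slowdown: {t_gz / t_plain:.1f}x")
```

If the slowdown factor here is in the same ballpark as the 24s-vs-700ms fragment times, decompression cost alone would largely explain the profile.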

Is this expected?  


