Hi Andy, I think that in your specific setup the CPU becomes the bottleneck, which leads to the slower query time. You could try the query on a system with a faster CPU, and/or try a lower compression ratio on the data.
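
To illustrate the point, here is a minimal Python sketch of why the gzipped scan ends up CPU-bound: the same "count the rows" work is done on plain bytes and on gzip-decompressed bytes. The data, sizes, and compression level below are illustrative assumptions, not your actual files.

```python
import gzip
import time

# Synthetic CSV-ish data (assumption: contents and size chosen for illustration).
raw = b"1,foo,bar\n" * 1_000_000          # ~10 MB of uncompressed rows
blob = gzip.compress(raw, compresslevel=6)

t0 = time.process_time()
rows_plain = raw.count(b"\n")             # "scan" the plain bytes
t_plain = time.process_time() - t0

t0 = time.process_time()
rows_gz = gzip.decompress(blob).count(b"\n")  # same scan, plus decompression
t_gz = time.process_time() - t0

print(rows_plain, rows_gz, f"plain={t_plain:.4f}s gz={t_gz:.4f}s")
```

On most machines the gzip path is dominated by the single-threaded decompression CPU time, which is roughly the extra cost each scan fragment pays per gzipped file.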
On Wed, Oct 7, 2015 at 9:15 PM, Andy Pernsteiner <[email protected]> wrote:
> In thinking this through, it probably is somewhat expected to see a slowdown
> when having to decompress data (esp gzip) as part of running a Drill query.
>
> Andy Pernsteiner
> Manager, Field Enablement
> ph: 206.228.0737
>
> www.mapr.com
> Now Available - Free Hadoop On-Demand Training
>
> From: Andy Pernsteiner <[email protected]>
> Reply: Andy Pernsteiner <[email protected]>
> Date: October 7, 2015 at 11:27:47 AM
> To: [email protected] <[email protected]>
> Subject: Drill + gzipped-CSV performance
>
> I'm running some experimental queries, both against CSV and against
> gzipped CSV (same data, same file count, etc).
>
> I'm doing a simple:
>
>> select count(columns[0]) from dfs.workspace.`/csv`
>
> and
>
>> select count(columns[0]) from dfs.workspace.`/gz`
>
> Here are my results:
>
> 70 files, plain CSV, 5GB on disk: 4.8s
> 70 files, gzipped CSV, 1.7GB on disk (5GB uncompressed): 30.4s
>
> When looking at profiles, it would appear that most of the time is spent on
> the TEXT_SUB_SCAN operation. Both queries spawn the same # of
> minor fragments for this phase (68), but the process_time for those minor
> fragments is an average of 24s for the GZ data (most of the fragments are
> pretty close to each other in terms of deviation), and 700ms average for the
> plain CSV data.
>
> Is this expected?
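
A quick back-of-envelope check on the numbers reported above (assuming the 5GB uncompressed size is split evenly across the 68 minor fragments):

```python
# Arithmetic from the figures in the thread; the even split is an assumption.
uncompressed_gb = 5
fragments = 68
gz_avg_s = 24.0        # average process_time per fragment, gzipped data
csv_avg_ms = 700       # average process_time per fragment, plain CSV

mb_per_fragment = uncompressed_gb * 1024 / fragments
gz_mbps = mb_per_fragment / gz_avg_s
csv_mbps = mb_per_fragment / (csv_avg_ms / 1000)
print(round(mb_per_fragment, 1), round(gz_mbps, 1), round(csv_mbps, 1))
# → 75.3 3.1 107.6
```

So each fragment processes roughly 75MB of uncompressed data, at about 3MB/s for the gzipped files versus about 108MB/s for plain CSV. That ~34x gap per fragment is consistent with decompression being the bottleneck: gzip streams are not splittable, so each file is decompressed by a single thread regardless of how many fragments are available.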
