The other issue you might be running into: I have seen situations where gzip decompression is not using the native library and falls back to the pure-Java implementation instead. You should take a look at whether the native library is actually being loaded.
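If you have a Hadoop install handy, a quick check is "hadoop checknative -a", which reports whether libhadoop and native zlib were found. The same information is exposed programmatically through Hadoop's NativeCodeLoader and ZlibFactory. A minimal sketch (untested, and it assumes the standard Hadoop client jars are on the classpath):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.compress.zlib.ZlibFactory;
  import org.apache.hadoop.util.NativeCodeLoader;

  public class CheckNativeZlib {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // true only if libhadoop was found and loaded into this JVM
      System.out.println("native hadoop: " + NativeCodeLoader.isNativeCodeLoaded());
      // true only if native zlib, which backs the gzip codec, is usable
      System.out.println("native zlib:   " + ZlibFactory.isNativeZlibLoaded(conf));
    }
  }

If native zlib comes back false, each gzip stream is decompressed by the pure-Java BuiltInGzipDecompressor, which is considerably slower and could account for a good chunk of the process_time gap you're seeing below.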
--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Oct 7, 2015 at 8:27 AM, Andy Pernsteiner <[email protected]> wrote:

> I'm running some experimental queries, both against CSV and against
> gzipped CSV (same data, same file count, etc.).
>
> I'm doing a simple:
>
>   select count(columns[0]) from dfs.workspace.`/csv`
>
> and
>
>   select count(columns[0]) from dfs.workspace.`/gz`
>
> Here are my results:
>
> 70 files, plain CSV, 5GB on disk: *4.8s*
> 70 files, gzipped CSV, 1.7GB on disk (5GB uncompressed): *30.4s*
>
> When looking at the profiles, it would appear that most of the time is
> spent in the TEXT_SUB_SCAN operation. Both queries spawn the same number
> of minor fragments for this phase (68), but the process_time for those
> minor fragments averages 24s for the gzipped data (most of the fragments
> are pretty close to each other in terms of deviation) versus 700ms on
> average for the plain CSV data.
>
> Is this expected?
>
> --
> Andy Pernsteiner
> Manager, Field Enablement
> ph: 206.228.0737
>
> www.mapr.com
