The other issue you might be running into: I have seen situations where
gzip decompression is not using the native library. You should take a look
at whether the native codec is actually being used.
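
As a quick check, a small Java sketch along these lines (the class name is
illustrative, and it assumes the Hadoop client jars are on the classpath)
will report whether the native Hadoop library, and the native zlib that
backs the gzip codec, actually loaded:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.zlib.ZlibFactory;
    import org.apache.hadoop.util.NativeCodeLoader;

    public class NativeGzipCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // True when libhadoop was found and loaded into this JVM.
            System.out.println("native hadoop loaded: "
                    + NativeCodeLoader.isNativeCodeLoaded());
            // True when codecs can use the native zlib implementation;
            // otherwise the gzip codec falls back to java.util.zip.
            System.out.println("native zlib loaded:   "
                    + ZlibFactory.isNativeZlibLoaded(conf));
        }
    }

If Hadoop is installed on the node, `hadoop checknative -a` should print the
same information from the shell.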

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Oct 7, 2015 at 8:27 AM, Andy Pernsteiner <[email protected]>
wrote:

> I'm running some experimental queries, both against plain CSV and against
> gzipped CSV (same data, same file count, etc.).
>
> I'm doing a simple:
>
> > select count(columns[0]) from dfs.workspace.`/csv`
>
> and
>
> > select count(columns[0]) from dfs.workspace.`/gz`
>
> Here are my results:
>
> 70 files, plain CSV, 5 GB on disk: *4.8s*
>
> 70 files, gzipped CSV, 1.7 GB on disk (5 GB uncompressed): *30.4s*
>
>
> Looking at the profiles, most of the time appears to be spent in the
> TEXT_SUB_SCAN operation. Both queries spawn the same number of
> minor fragments for this phase (68), but the process_time for those minor
> fragments averages 24s for the gzipped data (most fragments are close to
> that average), versus an average of 700ms for the plain CSV data.
>
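> (A rough back-of-envelope check on those numbers: 5 GB of uncompressed
> data over 68 fragments is about 75 MB per fragment, so the plain CSV scan
> runs at roughly 75 MB / 0.7 s, on the order of 100 MB/s per fragment,
> while the gzipped scan works out to about 75 MB / 24 s, or roughly 3 MB/s.)
>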
> Is this expected?
>
> --
>  Andy Pernsteiner
>  Manager, Field Enablement
> ph: 206.228.0737
>
> www.mapr.com
>
> Now Available - Free Hadoop On-Demand Training
> <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
>
