Thinking this through, some slowdown is probably expected when Drill has to decompress data (especially gzip) as part of running a query.
Andy Pernsteiner
Manager, Field Enablement
ph: 206.228.0737
www.mapr.com
Now Available - Free Hadoop On-Demand Training

From: Andy Pernsteiner <[email protected]>
Reply: Andy Pernsteiner <[email protected]>
Date: October 7, 2015 at 11:27:47 AM
To: [email protected]
Subject: Drill + gzipped-CSV performance

I'm running some experimental queries, both against plain CSV and against gzipped CSV (same data, same file count, etc.). I'm doing a simple:

> select count(columns[0]) from dfs.workspace.`/csv`

and

> select count(columns[0]) from dfs.workspace.`/gz`

Here are my results:

70 files, plain CSV, 5GB on disk: 4.8s
70 files, gzipped CSV, 1.7GB on disk (5GB uncompressed): 30.4s

Looking at the query profiles, it appears that most of the time is spent in the TEXT_SUB_SCAN operation. Both queries spawn the same number of minor fragments for this phase (68), but the process_time for those minor fragments averages 24s for the gzipped data (most fragments are close to each other, with little deviation) versus about 700ms for the plain CSV data.

Is this expected?

--
Andy Pernsteiner
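As a rough illustration of the effect being discussed (not Drill itself), the sketch below counts rows in the same CSV data twice: once from plain bytes and once through a gzip stream. The data and timings are synthetic, assumed only for demonstration; the point is that the gzip path pays a CPU-bound decompression cost on top of the row scan, which is the same kind of overhead showing up in TEXT_SUB_SCAN.

```python
import csv
import gzip
import io
import time

def make_sample_csv(rows: int = 100_000) -> bytes:
    """Build an in-memory CSV payload (hypothetical sample data)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i in range(rows):
        writer.writerow([i, f"value-{i}"])
    return buf.getvalue().encode("utf-8")

def count_rows_plain(raw: bytes) -> tuple[int, float]:
    """Count lines in uncompressed bytes; return (rows, seconds)."""
    t0 = time.perf_counter()
    n = sum(1 for _ in io.BytesIO(raw))
    return n, time.perf_counter() - t0

def count_rows_gzip(compressed: bytes) -> tuple[int, float]:
    """Decompress a gzip stream and count lines; return (rows, seconds)."""
    t0 = time.perf_counter()
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
        n = sum(1 for _ in f)
    return n, time.perf_counter() - t0

plain_data = make_sample_csv()
gz_data = gzip.compress(plain_data)

rows_plain, secs_plain = count_rows_plain(plain_data)
rows_gz, secs_gz = count_rows_gzip(gz_data)

# Same logical data in both cases; only the gzip path decompresses.
assert rows_plain == rows_gz
print(f"plain: {secs_plain * 1e3:.1f} ms, gzip: {secs_gz * 1e3:.1f} ms")
```

The relative gap here will be much smaller than the 4.8s-vs-30.4s numbers above, since Drill's scan also involves I/O scheduling and vectorized parsing per fragment, but it shows the same shape: identical row counts, extra wall time on the compressed side.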
