Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Reynold Xin
Thanks for the experiments and analysis! I think Michael already submitted a patch that avoids scanning all columns for count(*) or count(1). On Mon, May 12, 2014 at 9:46 PM, Andrew Ash and...@andrewash.com wrote: Hi Spark devs, First of all, huge congrats on the Parquet integration with

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
Thanks for filing -- I'm keeping my eye out for updates on that ticket. Cheers! Andrew On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust mich...@databricks.com wrote: It looks like currently the .count() on Parquet is handled incredibly inefficiently and all the columns are materialized.

Preliminary Parquet numbers and including .count() in Catalyst

2014-05-12 Thread Andrew Ash
Hi Spark devs, First of all, huge congrats on the Parquet integration with SparkSQL! This is an incredible direction forward and something I can see being very broadly useful. I was doing some preliminary tests to see how it works with one of my workflows, and wanted to share some numbers that