Re: Spark SQL with a sorted file

2014-12-22 Thread Jerry Raj

Michael,
Thanks. Is this still turned off in the released 1.2? Is it possible to 
turn it on just to get an idea of how much of a difference it makes?
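
If it's just a SQLContext setting, would something like the following be
the way to flip it on? A sketch against 1.2, guessing at the
spark.sql.parquet.filterPushdown key from the 1.2 configuration docs
(listed there as off by default):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    // or, equivalently, through SQL:
    sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")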


-Jerry

On 05/12/14 12:40 am, Michael Armbrust wrote:

I'll add that some of our data formats will actually infer this sort of
useful information automatically.  Both Parquet and cached in-memory
tables keep statistics on the min/max value for each column.  When you
have predicates over these sorted columns, partitions will be eliminated
if they can't possibly match the predicate given the statistics.

For Parquet this is new in Spark 1.2 and it is turned off by default
(due to bugs we are working with the Parquet library team to fix).
Hopefully soon it will be on by default.

On Wed, Dec 3, 2014 at 8:44 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

You can try to write your own Relation with filter push-down, or use
ParquetRelation2 as a workaround.

(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)

Cheng Hao

-Original Message-
From: Jerry Raj [mailto:jerry@gmail.com]
Sent: Thursday, December 4, 2014 11:34 AM
To: user@spark.apache.org
Subject: Spark SQL with a sorted file

Hi,
If I create a SchemaRDD from a file that I know is sorted on a
certain field, is it possible to somehow pass that information on to
Spark SQL so that SQL queries referencing that field are optimized?

Thanks
-Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
mailto:user-unsubscr...@spark.apache.org For additional commands,
e-mail: user-h...@spark.apache.org mailto:user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
mailto:user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
mailto:user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL with a sorted file

2014-12-04 Thread Michael Armbrust
I'll add that some of our data formats will actually infer this sort of
useful information automatically.  Both Parquet and cached in-memory tables
keep statistics on the min/max value for each column.  When you have
predicates over these sorted columns, partitions will be eliminated if they
can't possibly match the predicate given the statistics.

For Parquet this is new in Spark 1.2 and it is turned off by default (due
to bugs we are working with the Parquet library team to fix).  Hopefully
soon it will be on by default.
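
To sketch what the statistics buy you (1.2-era APIs; the Event/ts names
are made up for illustration): write the data sorted on the field so that
each Parquet row group's min/max range for that column is tight, then
filter on it.

    import org.apache.spark.sql.SQLContext

    case class Event(ts: Long, msg: String)  // hypothetical schema

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Event] -> SchemaRDD

    // Sorting on the predicate column before writing keeps row-group
    // min/max ranges from overlapping.
    val events = sc.parallelize(1L to 1000L).map(i => Event(i, "m" + i))
    events.sortBy(_.ts).saveAsParquetFile("/tmp/events.parquet")

    sqlContext.parquetFile("/tmp/events.parquet").registerTempTable("events")
    // With pushdown enabled, row groups whose [min, max] cannot satisfy
    // ts >= 900 are skipped instead of read.
    sqlContext.sql("SELECT count(*) FROM events WHERE ts >= 900").collect()

    // Cached in-memory tables keep per-partition column stats as well,
    // so the same predicate prunes cached partitions.
    sqlContext.cacheTable("events")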

On Wed, Dec 3, 2014 at 8:44 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

 You can try to write your own Relation with filter push-down, or use
 ParquetRelation2 as a workaround. (
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
 )

 Cheng Hao

 -Original Message-
 From: Jerry Raj [mailto:jerry@gmail.com]
 Sent: Thursday, December 4, 2014 11:34 AM
 To: user@spark.apache.org
 Subject: Spark SQL with a sorted file

 Hi,
 If I create a SchemaRDD from a file that I know is sorted on a certain
 field, is it possible to somehow pass that information on to Spark SQL so
 that SQL queries referencing that field are optimized?

 Thanks
 -Jerry

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Spark SQL with a sorted file

2014-12-03 Thread Jerry Raj

Hi,
If I create a SchemaRDD from a file that I know is sorted on a certain 
field, is it possible to somehow pass that information on to Spark SQL 
so that SQL queries referencing that field are optimized?


Thanks
-Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark SQL with a sorted file

2014-12-03 Thread Cheng, Hao
You can try to write your own Relation with filter push-down, or use
ParquetRelation2 as a workaround.
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)
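
And a minimal sketch of the first option against the 1.2 external data
sources API (where PrunedFilteredScan is still an abstract class); the
SortedLongFileRelation name and file layout are made up, and a real
implementation would also ship a RelationProvider so it can be wired up
with CREATE TEMPORARY TABLE ... USING:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql._
    import org.apache.spark.sql.sources._

    // A relation over a text file with one long per line, known to be
    // sorted ascending on that value.
    case class SortedLongFileRelation(path: String)
        (@transient val sqlContext: SQLContext)
      extends PrunedFilteredScan {

      override val schema =
        StructType(StructField("key", LongType, nullable = false) :: Nil)

      override def buildScan(
          requiredColumns: Array[String],
          filters: Array[Filter]): RDD[Row] = {
        // Collapse any pushed-down lower bounds on the sort key into one.
        val lower = filters.collect {
          case GreaterThan("key", v: Long)        => v + 1
          case GreaterThanOrEqual("key", v: Long) => v
        }.reduceOption(_ max _).getOrElse(Long.MinValue)

        // Because the file is sorted, rows below the bound can be dropped
        // wholesale; a real implementation would compute input splits from
        // the bound rather than filter after reading.
        sqlContext.sparkContext.textFile(path)
          .map(_.trim.toLong)
          .filter(_ >= lower)
          .map(Row(_))
      }
    }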

Cheng Hao

-Original Message-
From: Jerry Raj [mailto:jerry@gmail.com] 
Sent: Thursday, December 4, 2014 11:34 AM
To: user@spark.apache.org
Subject: Spark SQL with a sorted file

Hi,
If I create a SchemaRDD from a file that I know is sorted on a certain field, 
is it possible to somehow pass that information on to Spark SQL so that SQL 
queries referencing that field are optimized?

Thanks
-Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org