RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
Michael,
How would the Catalyst optimizer optimize this version?
df.filter(df(filter_field) === value).select(field1).show()
Would it still read all the columns in df or would it read only “filter_field” 
and “field1” since only two columns are used (assuming other columns from df 
are not used anywhere else)?

Mohammed

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, July 17, 2015 1:39 PM
To: Mike Trienis
Cc: user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Each operation on a dataframe is completely independent and doesn't know what 
operations happened before it.  When you do a selection, you are removing other 
columns from the dataframe and so the filter has nothing to operate on.

On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis 
mike.trie...@orcsol.commailto:mike.trie...@orcsol.com wrote:
I'd like to understand why the where field must exist in the select clause.

For example, the following select statement works fine

  *   df.select(field1, filter_field).filter(df(filter_field) === 
value).show()
However, the next one fails with the error in operator !Filter 
(filter_field#60 = value);

  *   df.select(field1).filter(df(filter_field) === value).show()
As a work-around, it seems that I can do the following

  *   df.select(field1, filter_field).filter(df(filter_field) === 
value).drop(filter_field).show()

Thanks, Mike.



Re: Data frames select and where clause dependency

2015-07-20 Thread Harish Butani
Yes via:  org.apache.spark.sql.catalyst.optimizer.ColumnPruning
See DefaultOptimizer.batches for list of logical rewrites.

You can see the optimized plan by printing: df.queryExecution.optimizedPlan

On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller moham...@glassbeam.com
wrote:

  Michael,

 How would the Catalyst optimizer optimize this version?

 df.filter(df(filter_field) === value).select(field1).show()

 Would it still read all the columns in df or would it read only
 “filter_field” and “field1” since only two columns are used (assuming other
 columns from df are not used anywhere else)?



 Mohammed



 *From:* Michael Armbrust [mailto:mich...@databricks.com]
 *Sent:* Friday, July 17, 2015 1:39 PM
 *To:* Mike Trienis
 *Cc:* user@spark.apache.org
 *Subject:* Re: Data frames select and where clause dependency



 Each operation on a dataframe is completely independent and doesn't know
 what operations happened before it.  When you do a selection, you are
 removing other columns from the dataframe and so the filter has nothing to
 operate on.



 On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis mike.trie...@orcsol.com
 wrote:

 I'd like to understand why the where field must exist in the select
 clause.



 For example, the following select statement works fine

- df.select(field1, filter_field).filter(df(filter_field) ===
value).show()

  However, the next one fails with the error in operator !Filter
 (filter_field#60 = value);

- df.select(field1).filter(df(filter_field) === value).show()

  As a work-around, it seems that I can do the following

- df.select(field1, filter_field).filter(df(filter_field) ===
value).drop(filter_field).show()



 Thanks, Mike.





RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
Thanks, Harish.

Mike – this would be a cleaner version for your use case:
df.filter(df(filter_field) === value).select(field1).show()

Mohammed

From: Harish Butani [mailto:rhbutani.sp...@gmail.com]
Sent: Monday, July 20, 2015 5:37 PM
To: Mohammed Guller
Cc: Michael Armbrust; Mike Trienis; user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Yes via:  org.apache.spark.sql.catalyst.optimizer.ColumnPruning
See DefaultOptimizer.batches for list of logical rewrites.

You can see the optimized plan by printing: df.queryExecution.optimizedPlan

On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller 
moham...@glassbeam.commailto:moham...@glassbeam.com wrote:
Michael,
How would the Catalyst optimizer optimize this version?
df.filter(df(filter_field) === value).select(field1).show()
Would it still read all the columns in df or would it read only “filter_field” 
and “field1” since only two columns are used (assuming other columns from df 
are not used anywhere else)?

Mohammed

From: Michael Armbrust 
[mailto:mich...@databricks.commailto:mich...@databricks.com]
Sent: Friday, July 17, 2015 1:39 PM
To: Mike Trienis
Cc: user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Each operation on a dataframe is completely independent and doesn't know what 
operations happened before it.  When you do a selection, you are removing other 
columns from the dataframe and so the filter has nothing to operate on.

On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis 
mike.trie...@orcsol.commailto:mike.trie...@orcsol.com wrote:
I'd like to understand why the where field must exist in the select clause.

For example, the following select statement works fine

  *   df.select(field1, filter_field).filter(df(filter_field) === 
value).show()
However, the next one fails with the error in operator !Filter 
(filter_field#60 = value);

  *   df.select(field1).filter(df(filter_field) === value).show()
As a work-around, it seems that I can do the following

  *   df.select(field1, filter_field).filter(df(filter_field) === 
value).drop(filter_field).show()

Thanks, Mike.




Re: Data frames select and where clause dependency

2015-07-20 Thread Mike Trienis
Definitely, thanks Mohammed.

On Mon, Jul 20, 2015 at 5:47 PM, Mohammed Guller moham...@glassbeam.com
wrote:

  Thanks, Harish.



 Mike – this would be a cleaner version for your use case:

 df.filter(df(filter_field) === value).select(field1).show()



 Mohammed



 *From:* Harish Butani [mailto:rhbutani.sp...@gmail.com]
 *Sent:* Monday, July 20, 2015 5:37 PM
 *To:* Mohammed Guller
 *Cc:* Michael Armbrust; Mike Trienis; user@spark.apache.org

 *Subject:* Re: Data frames select and where clause dependency



 Yes via:  org.apache.spark.sql.catalyst.optimizer.ColumnPruning

 See DefaultOptimizer.batches for list of logical rewrites.



 You can see the optimized plan by printing: df.queryExecution.optimizedPlan



 On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Michael,

 How would the Catalyst optimizer optimize this version?

 df.filter(df(filter_field) === value).select(field1).show()

 Would it still read all the columns in df or would it read only
 “filter_field” and “field1” since only two columns are used (assuming other
 columns from df are not used anywhere else)?



 Mohammed



 *From:* Michael Armbrust [mailto:mich...@databricks.com]
 *Sent:* Friday, July 17, 2015 1:39 PM
 *To:* Mike Trienis
 *Cc:* user@spark.apache.org
 *Subject:* Re: Data frames select and where clause dependency



 Each operation on a dataframe is completely independent and doesn't know
 what operations happened before it.  When you do a selection, you are
 removing other columns from the dataframe and so the filter has nothing to
 operate on.



 On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis mike.trie...@orcsol.com
 wrote:

 I'd like to understand why the where field must exist in the select
 clause.



 For example, the following select statement works fine

- df.select(field1, filter_field).filter(df(filter_field) ===
value).show()

  However, the next one fails with the error in operator !Filter
 (filter_field#60 = value);

- df.select(field1).filter(df(filter_field) === value).show()

  As a work-around, it seems that I can do the following

- df.select(field1, filter_field).filter(df(filter_field) ===
value).drop(filter_field).show()



 Thanks, Mike.







Data frames select and where clause dependency

2015-07-17 Thread Mike Trienis
I'd like to understand why the where field must exist in the select clause.

For example, the following select statement works fine

   - df.select(field1, filter_field).filter(df(filter_field) ===
   value).show()

However, the next one fails with the error in operator !Filter
(filter_field#60 = value);

   - df.select(field1).filter(df(filter_field) === value).show()

As a work-around, it seems that I can do the following

   - df.select(field1, filter_field).filter(df(filter_field) ===
   value).drop(filter_field).show()


Thanks, Mike.


Re: Data frames select and where clause dependency

2015-07-17 Thread Michael Armbrust
Each operation on a dataframe is completely independent and doesn't know
what operations happened before it.  When you do a selection, you are
removing other columns from the dataframe and so the filter has nothing to
operate on.

On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis mike.trie...@orcsol.com
wrote:

 I'd like to understand why the where field must exist in the select
 clause.

 For example, the following select statement works fine

- df.select(field1, filter_field).filter(df(filter_field) ===
value).show()

 However, the next one fails with the error in operator !Filter
 (filter_field#60 = value);

- df.select(field1).filter(df(filter_field) === value).show()

 As a work-around, it seems that I can do the following

- df.select(field1, filter_field).filter(df(filter_field) ===
value).drop(filter_field).show()


 Thanks, Mike.