RE: Data frames select and where clause dependency
Michael,

How would the Catalyst optimizer optimize this version?

    df.filter(df(filter_field) === value).select(field1).show()

Would it still read all the columns in df or would it read only “filter_field” and “field1” since only two columns are used (assuming other columns from df are not used anywhere else)?

Mohammed

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, July 17, 2015 1:39 PM
To: Mike Trienis
Cc: user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Each operation on a dataframe is completely independent and doesn't know what operations happened before it. When you do a selection, you are removing other columns from the dataframe and so the filter has nothing to operate on.
Re: Data frames select and where clause dependency
Yes, via:

    org.apache.spark.sql.catalyst.optimizer.ColumnPruning

See DefaultOptimizer.batches for the list of logical rewrites. You can see the optimized plan by printing:

    df.queryExecution.optimizedPlan

On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller moham...@glassbeam.com wrote:

Michael,

How would the Catalyst optimizer optimize this version?

    df.filter(df(filter_field) === value).select(field1).show()

Would it still read all the columns in df or would it read only “filter_field” and “field1” since only two columns are used (assuming other columns from df are not used anywhere else)?

Mohammed
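As a rough illustration of the plan inspection suggested in this reply, the sketch below writes a small three-column table to Parquet, reads it back, and prints the plans for the filter-then-select query. It assumes Spark 2.x or later with `SparkSession` (the thread dates from the 1.4 era, where `SQLContext` would be used instead); the object name, the sample data, and the temporary directory are hypothetical. No expected output is shown because the printed plan text varies between Spark versions.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object InspectPrunedPlan {
  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.x+ session; not the API version used in the thread.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inspect-pruned-plan")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data; column names mirror the placeholders in the thread.
    val dir = Files.createTempDirectory("pruning-demo").toString
    Seq(("a", 1, "x"), ("b", 2, "y"))
      .toDF("field1", "filter_field", "other")
      .write.mode("overwrite").parquet(dir)

    // Read from a columnar source so that column pruning can reach the scan.
    val df = spark.read.parquet(dir)
    val query = df.filter(df("filter_field") === 1).select("field1")

    // The optimized logical plan, as suggested in the reply above.
    println(query.queryExecution.optimizedPlan)

    // Parsed, analyzed, optimized, and physical plans in one printout.
    query.explain(true)

    spark.stop()
  }
}
```

If pruning applies, the relation in the optimized plan should list only `field1` and `filter_field`, and the unused `other` column should be absent from the scan.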
RE: Data frames select and where clause dependency
Thanks, Harish.

Mike, this would be a cleaner version for your use case:

    df.filter(df(filter_field) === value).select(field1).show()

Mohammed

From: Harish Butani [mailto:rhbutani.sp...@gmail.com]
Sent: Monday, July 20, 2015 5:37 PM
To: Mohammed Guller
Cc: Michael Armbrust; Mike Trienis; user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Yes, via:

    org.apache.spark.sql.catalyst.optimizer.ColumnPruning

See DefaultOptimizer.batches for the list of logical rewrites. You can see the optimized plan by printing:

    df.queryExecution.optimizedPlan
Re: Data frames select and where clause dependency
Definitely, thanks Mohammed.

On Mon, Jul 20, 2015 at 5:47 PM, Mohammed Guller moham...@glassbeam.com wrote:

Thanks, Harish.

Mike, this would be a cleaner version for your use case:

    df.filter(df(filter_field) === value).select(field1).show()

Mohammed
Data frames select and where clause dependency
I'd like to understand why the where field must exist in the select clause.

For example, the following select statement works fine:

- df.select(field1, filter_field).filter(df(filter_field) === value).show()

However, the next one fails with the error

    in operator !Filter (filter_field#60 = value);

- df.select(field1).filter(df(filter_field) === value).show()

As a work-around, it seems that I can do the following:

- df.select(field1, filter_field).filter(df(filter_field) === value).drop(filter_field).show()

Thanks, Mike.
Re: Data frames select and where clause dependency
Each operation on a dataframe is completely independent and doesn't know what operations happened before it. When you do a selection, you are removing other columns from the dataframe and so the filter has nothing to operate on.

On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis mike.trie...@orcsol.com wrote:

I'd like to understand why the where field must exist in the select clause.

For example, the following select statement works fine:

- df.select(field1, filter_field).filter(df(filter_field) === value).show()

However, the next one fails with the error

    in operator !Filter (filter_field#60 = value);

- df.select(field1).filter(df(filter_field) === value).show()

As a work-around, it seems that I can do the following:

- df.select(field1, filter_field).filter(df(filter_field) === value).drop(filter_field).show()

Thanks, Mike.
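To make the ordering issue discussed in this thread concrete, here is a minimal sketch that runs both orderings against a small in-memory frame. It assumes Spark 2.x or later with `SparkSession` rather than the 1.4-era API the thread was written against; the object name, the sample rows, and the value `1` are hypothetical. The select-then-filter attempt is wrapped in `Try` because its outcome is version-dependent: the thread reports an analysis error, while later Spark releases may resolve the missing column on their own.

```scala
import scala.util.Try

import org.apache.spark.sql.SparkSession

object SelectFilterOrder {
  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.x+ session; not the API version used in the thread.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("select-filter-order")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data; column names mirror the placeholders in the thread.
    val df = Seq(("a", 1, "x"), ("b", 2, "y"), ("c", 1, "z"))
      .toDF("field1", "filter_field", "other")
    val value = 1

    // Filter first, while filter_field is still part of the frame, then project.
    df.filter(df("filter_field") === value).select("field1").show()

    // Project first, then filter on a column the projection removed.
    // Try captures an analysis failure instead of aborting the program.
    val attempt = Try(
      df.select("field1").filter(df("filter_field") === value).count()
    )
    println(s"select-then-filter: $attempt")

    spark.stop()
  }
}
```

Filtering before selecting avoids the extra `drop` step of the work-around, since the filter column never needs to appear in the projected output.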