[jira] [Updated] (SPARK-19213) FileSourceScanExec uses SparkSession from HadoopFsRelation creation time instead of the one active at time of execution

2017-01-13 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-19213:
--
Description: 
If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
 you'll notice that the SparkSession used for execution is the one captured 
from the logical plan, whereas other places, such as 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
 rely on SparkPlan, which captures the active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code, we should be using the SparkSession that 
is currently active, hence take the one from the SparkPlan. However, if you 
want to share Datasets across SparkSessions, that is not enough: as soon as 
a Dataset is executed, its QueryExecution will have captured the SparkSession 
active at that point. If we want to share Datasets across users, we need to 
make configurations not fixed upon first execution. I consider the first part 
(using the SparkSession from the logical plan) a bug, while the second (using 
the SparkSession active at runtime) is an enhancement that makes sharing 
across sessions easier.

For example:
{code}
val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
// change a config on newSession here (the simplest one to try is disabling vectorized reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan) // <- the logical plan still holds a
// reference to the original SparkSession, so the changes don't take effect
{code}
I suggest that it shouldn't be necessary to create a new Dataset for changes 
to take effect. For most plans, Dataset.ofRows works, but this is not the 
case for HadoopFsRelation.
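The distinction argued above (capturing a session when the plan is built vs. resolving the active session at execution time) can be sketched as a toy model in plain Scala. These classes are illustrative stand-ins, not Spark's actual API:

```scala
// Toy model of the two session-capture strategies; no Spark dependency.
// `Session`, `ActiveSession`, `EagerScan`, and `LazyScan` are illustrative names only.
object SessionCaptureDemo {
  final case class Session(vectorizedReads: Boolean)

  object ActiveSession {
    // Mirrors the idea of SparkSession.setActiveSession / getActiveSession.
    private var current: Session = Session(vectorizedReads = true)
    def set(s: Session): Unit = current = s
    def get: Session = current
  }

  // FileSourceScanExec-style: the session is captured when the plan is created,
  // so later changes to the active session are never seen.
  final class EagerScan(capturedSession: Session) {
    def vectorized: Boolean = capturedSession.vectorizedReads
  }

  // SparkPlan-style: the active session is looked up at execution time.
  final class LazyScan {
    def vectorized: Boolean = ActiveSession.get.vectorizedReads
  }

  def demo(): (Boolean, Boolean) = {
    val eager = new EagerScan(ActiveSession.get)
    val lazyScan = new LazyScan
    // Switch to a new "session" with vectorized reads disabled.
    ActiveSession.set(Session(vectorizedReads = false))
    (eager.vectorized, lazyScan.vectorized) // (true, false): the eager plan is stale
  }
}
```

Running `SessionCaptureDemo.demo()` returns `(true, false)`: the eager plan keeps the stale configuration (the behavior reported here), while the lazy one picks up the new session.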



> FileSourceScanExec uses SparkSession from HadoopFsRelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the SparkSession used for execution is the one captured 
> from the logical plan, whereas other places, such as 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  rely on SparkPlan, which captures the active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code, we should be using the SparkSession that 
> is currently active, hence take the one from the SparkPlan.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


