[jira] [Updated] (SPARK-19213) FileSourceScanExec uses the SparkSession from HadoopFsRelation creation time instead of the one active at time of execution
[ https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kruszewski updated SPARK-19213:
--
    Description:

If you look at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260 you'll notice that the SparkSession used for execution is the one captured from the logical plan. Elsewhere, however, such as at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154, the active session is used, and SparkPlan captures the active session upon execution in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the code, we should be using the SparkSession that is currently active, hence the one from the SparkPlan. However, if you want to share Datasets across SparkSessions, that is not enough, since as soon as a Dataset is executed its QueryExecution will have captured the session active at that point. If we want to share Datasets across users, we need to make configurations not fixed upon first execution. I consider the first part (using the SparkSession from the logical plan) a bug, while the second (using the SparkSession active at runtime) is an enhancement that makes sharing across sessions easier.

For example:

{code}
val df = spark.read.parquet(...)
df.count()
val newSession = spark.newSession()
SparkSession.setActiveSession(newSession)
// (the simplest change to try is disabling vectorized reads)
val df2 = Dataset.ofRows(newSession, df.logicalPlan)
// the logical plan still holds a reference to the original SparkSession,
// so the changes don't take effect
{code}

I suggest that it shouldn't be necessary to create a new Dataset for changes to take effect. For most plans, Dataset.ofRows works, but this is not the case for HadoopFsRelation.
> FileSourceScanExec uses the SparkSession from HadoopFsRelation creation time
> instead of the one active at time of execution
>
>                 Key: SPARK-19213
>                 URL: https://issues.apache.org/jira/browse/SPARK-19213
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Robert Kruszewski
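The capture-at-creation versus resolve-at-execution distinction described above can be sketched without Spark. `Session`, `ActiveSession`, `CapturingScan`, and `ResolvingScan` below are hypothetical stand-ins for SparkSession, its active-session registry, FileSourceScanExec, and a scan that reads the active session at execution time; this is an illustration of the reported behaviour, not Spark's actual classes.

```scala
object SessionCaptureDemo {
  final case class Session(name: String, vectorizedReads: Boolean)

  object ActiveSession {
    // Mirrors SparkSession.setActiveSession / getActiveSession.
    private var current: Session = Session("original", vectorizedReads = true)
    def set(s: Session): Unit = { current = s }
    def get: Session = current
  }

  // The reported bug: the session (and thus its config) is frozen
  // when the scan node is created.
  final class CapturingScan(captured: Session) {
    def vectorized: Boolean = captured.vectorizedReads
  }

  // The suggested fix: resolve the session when the scan actually runs.
  final class ResolvingScan {
    def vectorized: Boolean = ActiveSession.get.vectorizedReads
  }
}
```

After switching the active session to one with vectorized reads disabled, `CapturingScan` still reports the stale value from creation time, while `ResolvingScan` sees the change, which is the difference the issue describes.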
[jira] [Updated] (SPARK-19213) FileSourceScanExec uses the SparkSession from HadoopFsRelation creation time instead of the one active at time of execution
[ https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kruszewski updated SPARK-19213:
--
    Description:

If you look at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260 you'll notice that the SparkSession used for execution is the one captured from the logical plan. Elsewhere, however, such as at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154, the active session is used, and SparkPlan captures the active session upon execution in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the I/O code, it would be beneficial to use the active session, so that the Hadoop configuration can be modified without recreating the Dataset. What would be interesting is to not lock the SparkSession into the physical plan for I/O and let you share Datasets across SparkSessions. Is that supposed to work? Otherwise you'd have to obtain a new QueryExecution bound to the new SparkSession, which would only let you share logical plans. I am sending a PR along with the latter.

> FileSourceScanExec uses the SparkSession from HadoopFsRelation creation time
> instead of the one active at time of execution
>
>                 Key: SPARK-19213
>                 URL: https://issues.apache.org/jira/browse/SPARK-19213
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Robert Kruszewski

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
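The sharing scenario discussed above can also be sketched without Spark: if the logical plan holds no reference to any session, the same plan can be wrapped by Datasets bound to different sessions, each reading its own configuration at execution. `Plan`, `Dataset`, and `Session` below are hypothetical stand-ins chosen for illustration; the wrapping step is only analogous to `Dataset.ofRows(session, plan)`, not Spark's real API.

```scala
object SharedPlanDemo {
  final case class Session(name: String, hadoopConf: Map[String, String])

  // A logical plan that deliberately holds no reference to any session.
  final case class Plan(path: String)

  // Analogous to Dataset.ofRows(session, plan): the session is supplied
  // at wrap time, so one plan can serve several sessions, each with its
  // own Hadoop configuration.
  final class Dataset(session: Session, plan: Plan) {
    def effectiveConf(key: String): Option[String] =
      session.hadoopConf.get(key)
  }
}
```

Because the plan carries no session, two users can wrap the same `Plan` under different `Session`s and each execution sees its own configuration, which is the behaviour the issue asks for.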