[jira] [Updated] (SPARK-19798) Query returns stale results when tables are modified on other sessions
[ https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giambattista updated SPARK-19798:
---------------------------------
    Affects Version/s: 3.0.0

> Query returns stale results when tables are modified on other sessions
> -----------------------------------------------------------------------
>
>                 Key: SPARK-19798
>                 URL: https://issues.apache.org/jira/browse/SPARK-19798
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0, 3.0.0
>            Reporter: Giambattista
>            Priority: Major
>
> I observed the problem on the master branch with the thrift server in
> multisession mode (the default), but I was able to replicate it with
> spark-shell as well (see the reproduction sequence below).
> I observed cases where changes made in one session (table inserts, table
> renames) are not visible to other derived sessions (created with
> session.newSession).
> The problem seems to be that each session has its own tableRelationCache,
> which does not get refreshed.
> IMO tableRelationCache should be shared in sharedState, maybe in the
> cacheManager, so that refreshing caches for data that is not
> session-specific (such as temporary tables) is centralized.
>
> --- Spark shell script
> val spark2 = spark.newSession
> spark.sql("CREATE TABLE test (a int) using parquet")
> spark2.sql("select * from test").show  // OK, returns empty
> spark.sql("select * from test").show   // OK, returns empty
> spark.sql("insert into TABLE test values 1,2,3")
> spark2.sql("select * from test").show  // ERROR, returns empty
> spark.sql("select * from test").show   // OK, returns 3,2,1
> spark.sql("create table test2 (a int) using parquet")
> spark.sql("insert into TABLE test2 values 4,5,6")
> spark2.sql("select * from test2").show // OK, returns 6,4,5
> spark.sql("select * from test2").show  // OK, returns 6,4,5
> spark.sql("alter table test rename to test3")
> spark.sql("alter table test2 rename to test")
> spark.sql("alter table test3 rename to test2")
> spark2.sql("select * from test").show  // ERROR, returns empty
> spark.sql("select * from test").show   // OK, returns 6,4,5
> spark2.sql("select * from test2").show // ERROR, throws java.io.FileNotFoundException
> spark.sql("select * from test2").show  // OK, returns 3,1,2
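A possible session-local workaround, offered here only as an untested sketch
using the public Catalog API (it is not part of the report above and does not
fix the underlying cache-sharing problem):

{noformat}
// Hedged workaround sketch: spark.catalog.refreshTable invalidates this
// session's cached relation and file listing for the named table, forcing
// the next query to re-resolve it against the metastore.
spark2.catalog.refreshTable("test")
spark2.sql("select * from test").show // now expected to return 1,2,3
{noformat}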
[jira] [Created] (SPARK-19799) Support WITH clause in subqueries
Giambattista created SPARK-19799:
------------------------------------

             Summary: Support WITH clause in subqueries
                 Key: SPARK-19799
                 URL: https://issues.apache.org/jira/browse/SPARK-19799
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Giambattista


Because of SPARK-17590 it should be relatively easy to support the WITH
clause in subqueries, besides nested CTE definitions.
Here is an example of a query that does not run on Spark:

create table test (seqno int, k string, v int) using parquet;

insert into TABLE test values (1,'a', 99),(2, 'b', 88),(3, 'a', 77),(4, 'b', 66),(5, 'c', 55),(6, 'a', 44),(7, 'b', 33);

SELECT percentile(b, 0.5) FROM
  (WITH mavg AS (SELECT k, AVG(v) OVER (PARTITION BY k ORDER BY seqno
                 ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as b
                 FROM test ORDER BY seqno)
   SELECT k, MAX(b) as b FROM mavg GROUP BY k);
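Until WITH is supported inside subqueries, the CTE can be inlined as a nested
subquery. A hedged rewrite sketch of the query above (the ORDER BY inside the
original CTE is dropped, since it does not affect the grouped aggregate):

{noformat}
// Workaround sketch: same result, expressed without a WITH clause in the
// subquery, runnable from spark-shell against the `test` table above.
spark.sql("""
  SELECT percentile(b, 0.5) FROM
    (SELECT k, MAX(b) AS b FROM
       (SELECT k, AVG(v) OVER (PARTITION BY k ORDER BY seqno
               ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS b
        FROM test) mavg
     GROUP BY k)
""").show
{noformat}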
[jira] [Created] (SPARK-19798) Query returns stale results when tables are modified on other sessions
Giambattista created SPARK-19798:
------------------------------------

             Summary: Query returns stale results when tables are modified on other sessions
                 Key: SPARK-19798
                 URL: https://issues.apache.org/jira/browse/SPARK-19798
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Giambattista


I observed the problem on the master branch with the thrift server in
multisession mode (the default), but I was able to replicate it with
spark-shell as well (see the reproduction sequence below).
I observed cases where changes made in one session (table inserts, table
renames) are not visible to other derived sessions (created with
session.newSession).
The problem seems to be that each session has its own tableRelationCache,
which does not get refreshed.
IMO tableRelationCache should be shared in sharedState, maybe in the
cacheManager, so that refreshing caches for data that is not
session-specific (such as temporary tables) is centralized.

--- Spark shell script
val spark2 = spark.newSession
spark.sql("CREATE TABLE test (a int) using parquet")
spark2.sql("select * from test").show  // OK, returns empty
spark.sql("select * from test").show   // OK, returns empty
spark.sql("insert into TABLE test values 1,2,3")
spark2.sql("select * from test").show  // ERROR, returns empty
spark.sql("select * from test").show   // OK, returns 3,2,1
spark.sql("create table test2 (a int) using parquet")
spark.sql("insert into TABLE test2 values 4,5,6")
spark2.sql("select * from test2").show // OK, returns 6,4,5
spark.sql("select * from test2").show  // OK, returns 6,4,5
spark.sql("alter table test rename to test3")
spark.sql("alter table test2 rename to test")
spark.sql("alter table test3 rename to test2")
spark2.sql("select * from test").show  // ERROR, returns empty
spark.sql("select * from test").show   // OK, returns 6,4,5
spark2.sql("select * from test2").show // ERROR, throws java.io.FileNotFoundException
spark.sql("select * from test2").show  // OK, returns 3,1,2
[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892016#comment-15892016 ]

Giambattista commented on SPARK-17931:
--------------------------------------

Thanks, I just opened SPARK-19796 and added the required details.

> taskScheduler has some unneeded serialization
> ---------------------------------------------
>
>                 Key: SPARK-17931
>                 URL: https://issues.apache.org/jira/browse/SPARK-17931
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: Guoqiang Li
>            Assignee: Kay Ousterhout
>             Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization involved in
> sending a task from the scheduler to an executor:
> - A Task object is serialized.
> - The Task object is copied to a byte buffer that also contains serialized
>   information about any additional JARs, files, and Properties needed for
>   the task to execute. This byte buffer is stored as the member variable
>   serializedTask in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized task +
>   JARs, the TaskDescription class contains the task ID and other metadata)
>   and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that the JAR,
> file, and Property info can be deserialized prior to deserializing the Task
> object, the third layer of serialization is unnecessary (this is a result
> of SPARK-2521). We should eliminate a layer of serialization by moving the
> JARs, files, and Properties into the TaskDescription class.
[jira] [Created] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
Giambattista created SPARK-19796:
------------------------------------

             Summary: taskScheduler fails serializing long statements received by thrift server
                 Key: SPARK-19796
                 URL: https://issues.apache.org/jira/browse/SPARK-19796
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Giambattista


This problem was observed after the changes made for SPARK-17931.

In my use case I am sending very long insert statements to the Spark thrift
server, and they fail at TaskDescription.scala:89 because writeUTF fails
when asked to write strings longer than 64 KB (see
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/
for a description of the issue).

As suggested by Imran Rashid, I tracked down the offending key: it is
"spark.job.description" and it contains the complete SQL statement.

The problem can be reproduced by creating a table like:

create table test (a int) using parquet

and by sending an insert statement like:

scala> val r = 1 to 128000
scala> println("insert into table test values (" + r.mkString("),(") + ")")
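The 64 KB ceiling can be demonstrated outside Spark. A minimal standalone
sketch (plain JDK APIs, not Spark code): java.io.DataOutputStream.writeUTF
length-prefixes the string with an unsigned 16-bit byte count, so any string
whose modified-UTF-8 encoding exceeds 65535 bytes fails with
java.io.UTFDataFormatException, which is the same limit TaskDescription hits.

{noformat}
import java.io.{ByteArrayOutputStream, DataOutputStream, UTFDataFormatException}

val out = new DataOutputStream(new ByteArrayOutputStream())
try {
  out.writeUTF("x" * 70000) // encodes to more than 65535 bytes
} catch {
  case e: UTFDataFormatException => println(s"writeUTF failed: $e")
}
{noformat}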
[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890226#comment-15890226 ]

Giambattista commented on SPARK-17931:
--------------------------------------

I just wanted to report that after this change Spark fails to execute long
SQL statements (in my case they were long "insert into table" statements).
The problem I was facing is very well described in this article:
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/

Eventually, I was able to get them working again with the change below.

--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
     dataOut.writeInt(taskDescription.properties.size())
     taskDescription.properties.asScala.foreach { case (key, value) =>
       dataOut.writeUTF(key)
-      dataOut.writeUTF(value)
+      dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
     }

> taskScheduler has some unneeded serialization
> ---------------------------------------------
>
>                 Key: SPARK-17931
>                 URL: https://issues.apache.org/jira/browse/SPARK-17931
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: Guoqiang Li
>            Assignee: Kay Ousterhout
>             Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization involved in
> sending a task from the scheduler to an executor:
> - A Task object is serialized.
> - The Task object is copied to a byte buffer that also contains serialized
>   information about any additional JARs, files, and Properties needed for
>   the task to execute. This byte buffer is stored as the member variable
>   serializedTask in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized task +
>   JARs, the TaskDescription class contains the task ID and other metadata)
>   and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that the JAR,
> file, and Property info can be deserialized prior to deserializing the Task
> object, the third layer of serialization is unnecessary (this is a result
> of SPARK-2521). We should eliminate a layer of serialization by moving the
> JARs, files, and Properties into the TaskDescription class.
[jira] [Comment Edited] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890226#comment-15890226 ]

Giambattista edited comment on SPARK-17931 at 3/1/17 2:06 PM:
--------------------------------------------------------------

I just wanted to report that after this change Spark fails to execute long
SQL statements (in my case they were long "insert into table" statements).
The problem I was facing is very well described in this article:
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/

Eventually, I was able to get them working again with the change below.

{noformat}
--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
     dataOut.writeInt(taskDescription.properties.size())
     taskDescription.properties.asScala.foreach { case (key, value) =>
       dataOut.writeUTF(key)
-      dataOut.writeUTF(value)
+      dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
     }
{noformat}

was (Author: gbloisi):
I just wanted to report that after this change Spark fails to execute long
SQL statements (in my case they were long "insert into table" statements).
The problem I was facing is very well described in this article:
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/

Eventually, I was able to get them working again with the change below.

--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
     dataOut.writeInt(taskDescription.properties.size())
     taskDescription.properties.asScala.foreach { case (key, value) =>
       dataOut.writeUTF(key)
-      dataOut.writeUTF(value)
+      dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
     }

> taskScheduler has some unneeded serialization
> ---------------------------------------------
>
>                 Key: SPARK-17931
>                 URL: https://issues.apache.org/jira/browse/SPARK-17931
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: Guoqiang Li
>            Assignee: Kay Ousterhout
>             Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization involved in
> sending a task from the scheduler to an executor:
> - A Task object is serialized.
> - The Task object is copied to a byte buffer that also contains serialized
>   information about any additional JARs, files, and Properties needed for
>   the task to execute. This byte buffer is stored as the member variable
>   serializedTask in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized task +
>   JARs, the TaskDescription class contains the task ID and other metadata)
>   and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that the JAR,
> file, and Property info can be deserialized prior to deserializing the Task
> object, the third layer of serialization is unnecessary (this is a result
> of SPARK-2521). We should eliminate a layer of serialization by moving the
> JARs, files, and Properties into the TaskDescription class.
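Note that the patch above works around the crash but silently truncates the
property value. A hedged alternative sketch that avoids the loss (illustrative
only, and not the fix that was eventually merged into Spark): length-prefix
the raw UTF-8 bytes explicitly instead of relying on writeUTF's 16-bit length
field.

{noformat}
import java.io.{DataInputStream, DataOutputStream}
import java.nio.charset.StandardCharsets

// Write the string as a 32-bit length prefix followed by its UTF-8 bytes,
// so values larger than 64 KB round-trip without truncation.
def writeLongString(out: DataOutputStream, s: String): Unit = {
  val bytes = s.getBytes(StandardCharsets.UTF_8)
  out.writeInt(bytes.length)
  out.write(bytes)
}

def readLongString(in: DataInputStream): String = {
  val bytes = new Array[Byte](in.readInt())
  in.readFully(bytes)
  new String(bytes, StandardCharsets.UTF_8)
}
{noformat}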
[jira] [Commented] (SPARK-19059) Unable to retrieve data from a parquet table whose name starts with underscore
[ https://issues.apache.org/jira/browse/SPARK-19059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15796253#comment-15796253 ]

Giambattista commented on SPARK-19059:
--------------------------------------

Please note that your environment is printing version 2.0.0, even though the
command-line prompt seems located in a 2.1.0 folder.

> Unable to retrieve data from a parquet table whose name starts with underscore
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-19059
>                 URL: https://issues.apache.org/jira/browse/SPARK-19059
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Giambattista
>             Fix For: 2.1.1
>
>
> It looks like there is some bug introduced in Spark 2.1.0 that prevents
> reading data from a parquet table (with Hive support enabled) whose name
> starts with an underscore. CREATE and INSERT statements on the same table
> instead seem to work as expected.
> The problem can be reproduced from spark-shell through the following steps:
> 1) Create a table with some values
> scala> spark.sql("CREATE TABLE `_a`(i INT) USING parquet").show
> scala> spark.sql("INSERT INTO `_a` VALUES (1), (2), (3)").show
> 2) Select data from the just-created and filled table --> no results
> scala> spark.sql("SELECT * FROM `_a`").show
> +---+
> |  i|
> +---+
> +---+
> 3) Rename the table so that the prefixed underscore disappears
> scala> spark.sql("ALTER TABLE `_a` RENAME TO `a`").show
> 4) Select data from the just-renamed table --> results are shown
> scala> spark.sql("SELECT * FROM `a`").show
> +---+
> |  i|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
[jira] [Commented] (SPARK-19059) Unable to retrieve data from a parquet table whose name starts with underscore
[ https://issues.apache.org/jira/browse/SPARK-19059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15796248#comment-15796248 ]

Giambattista commented on SPARK-19059:
--------------------------------------

I'm using version 2.1.0 and I suspect it is a regression introduced with the
changes in metadata caching.

> Unable to retrieve data from a parquet table whose name starts with underscore
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-19059
>                 URL: https://issues.apache.org/jira/browse/SPARK-19059
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Giambattista
>             Fix For: 2.1.1
>
>
> It looks like there is some bug introduced in Spark 2.1.0 that prevents
> reading data from a parquet table (with Hive support enabled) whose name
> starts with an underscore. CREATE and INSERT statements on the same table
> instead seem to work as expected.
> The problem can be reproduced from spark-shell through the following steps:
> 1) Create a table with some values
> scala> spark.sql("CREATE TABLE `_a`(i INT) USING parquet").show
> scala> spark.sql("INSERT INTO `_a` VALUES (1), (2), (3)").show
> 2) Select data from the just-created and filled table --> no results
> scala> spark.sql("SELECT * FROM `_a`").show
> +---+
> |  i|
> +---+
> +---+
> 3) Rename the table so that the prefixed underscore disappears
> scala> spark.sql("ALTER TABLE `_a` RENAME TO `a`").show
> 4) Select data from the just-renamed table --> results are shown
> scala> spark.sql("SELECT * FROM `a`").show
> +---+
> |  i|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
[jira] [Created] (SPARK-19059) Unable to retrieve data from a parquet table whose name starts with underscore
Giambattista created SPARK-19059:
------------------------------------

             Summary: Unable to retrieve data from a parquet table whose name starts with underscore
                 Key: SPARK-19059
                 URL: https://issues.apache.org/jira/browse/SPARK-19059
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0
            Reporter: Giambattista
             Fix For: 2.1.1


It looks like there is some bug introduced in Spark 2.1.0 that prevents
reading data from a parquet table (with Hive support enabled) whose name
starts with an underscore. CREATE and INSERT statements on the same table
instead seem to work as expected.

The problem can be reproduced from spark-shell through the following steps:

1) Create a table with some values

scala> spark.sql("CREATE TABLE `_a`(i INT) USING parquet").show
scala> spark.sql("INSERT INTO `_a` VALUES (1), (2), (3)").show

2) Select data from the just-created and filled table --> no results

scala> spark.sql("SELECT * FROM `_a`").show
+---+
|  i|
+---+
+---+

3) Rename the table so that the prefixed underscore disappears

scala> spark.sql("ALTER TABLE `_a` RENAME TO `a`").show

4) Select data from the just-renamed table --> results are shown

scala> spark.sql("SELECT * FROM `a`").show
+---+
|  i|
+---+
|  1|
|  2|
|  3|
+---+
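One plausible mechanism, offered purely as a hypothesis (the report above does
not confirm it): Hadoop-style data sources conventionally treat paths whose
names begin with "_" or "." as hidden metadata (e.g. _SUCCESS, _metadata), so
a file-listing layer that applies this convention to the table's own directory
would explain why `_a` reads back empty while CREATE and INSERT succeed. An
illustrative filter only, not Spark's actual listing code:

{noformat}
// Sketch of the hidden-path convention: names starting with "_" or "."
// are commonly skipped as metadata during file listing, which would hide
// a table directory named `_a` from the reader.
def looksHidden(name: String): Boolean =
  name.startsWith("_") || name.startsWith(".")

Seq("_a", "a", "_SUCCESS", "part-00000.parquet").foreach { n =>
  println(s"$n -> hidden: ${looksHidden(n)}")
}
{noformat}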