[jira] [Updated] (SPARK-19798) Query returns stale results when tables are modified on other sessions

2020-01-21 Thread Giambattista (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giambattista updated SPARK-19798:
-
Affects Version/s: 3.0.0

> Query returns stale results when tables are modified on other sessions
> --
>
> Key: SPARK-19798
> URL: https://issues.apache.org/jira/browse/SPARK-19798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 3.0.0
>Reporter: Giambattista
>Priority: Major
>
> I observed the problem on the master branch with the thrift server in 
> multisession mode (the default), but I was able to replicate it with 
> spark-shell as well (see the reproduction sequence below).
> I observed cases where changes made in one session (table inserts, table 
> renames) are not visible to other derived sessions (created with 
> session.newSession).
> The problem seems to be that each session has its own tableRelationCache, 
> which does not get refreshed.
> IMO tableRelationCache should be shared in sharedState, maybe in the 
> cacheManager, so that cache refreshes for data that is not session-specific 
> (unlike temporary tables) are centralized.
> --- Spark shell script
> val spark2 = spark.newSession
> spark.sql("CREATE TABLE test (a int) using parquet")
> spark2.sql("select * from test").show // OK returns empty
> spark.sql("select * from test").show // OK returns empty
> spark.sql("insert into TABLE test values 1,2,3")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 3,2,1
> spark.sql("create table test2 (a int) using parquet")
> spark.sql("insert into TABLE test2 values 4,5,6")
> spark2.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("alter table test rename to test3")
> spark.sql("alter table test2 rename to test")
> spark.sql("alter table test3 rename to test2")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 6,4,5
> spark2.sql("select * from test2").show // ERROR throws 
> java.io.FileNotFoundException
> spark.sql("select * from test2").show // OK returns 3,1,2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19799) Support WITH clause in subqueries

2017-03-02 Thread Giambattista (JIRA)
Giambattista created SPARK-19799:


 Summary: Support WITH clause in subqueries
 Key: SPARK-19799
 URL: https://issues.apache.org/jira/browse/SPARK-19799
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Giambattista


Because of SPARK-17590 it should be relatively easy to support the WITH clause 
in subqueries, besides nested CTE definitions.

Here is an example of a query that does not run on Spark:
create table test (seqno int, k string, v int) using parquet;

insert into TABLE test values
  (1, 'a', 99), (2, 'b', 88), (3, 'a', 77), (4, 'b', 66),
  (5, 'c', 55), (6, 'a', 44), (7, 'b', 33);

SELECT percentile(b, 0.5)
FROM (WITH mavg AS (SELECT k, AVG(v) OVER (PARTITION BY k ORDER BY seqno
                                           ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as b
                    FROM test ORDER BY seqno)
      SELECT k, MAX(b) as b FROM mavg GROUP BY k);
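
For comparison, hoisting the WITH clause to the top level yields a query that 
current Spark does accept. Here is a sketch in spark-shell form (my own rewrite 
of the query above, assuming the percentile function is available in the build):

// Equivalent rewrite with the CTE hoisted out of the subquery (sketch, not
// from the original report); Spark accepts WITH at the top level.
spark.sql("""
  WITH mavg AS (
    SELECT k, AVG(v) OVER (PARTITION BY k ORDER BY seqno
                           ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS b
    FROM test ORDER BY seqno
  )
  SELECT percentile(b, 0.5)
  FROM (SELECT k, MAX(b) AS b FROM mavg GROUP BY k)
""").show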



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19798) Query returns stale results when tables are modified on other sessions

2017-03-02 Thread Giambattista (JIRA)
Giambattista created SPARK-19798:


 Summary: Query returns stale results when tables are modified on 
other sessions
 Key: SPARK-19798
 URL: https://issues.apache.org/jira/browse/SPARK-19798
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Giambattista


I observed the problem on the master branch with the thrift server in 
multisession mode (the default), but I was able to replicate it with 
spark-shell as well (see the reproduction sequence below).
I observed cases where changes made in one session (table inserts, table 
renames) are not visible to other derived sessions (created with 
session.newSession).

The problem seems to be that each session has its own tableRelationCache, 
which does not get refreshed.
IMO tableRelationCache should be shared in sharedState, maybe in the 
cacheManager, so that cache refreshes for data that is not session-specific 
(unlike temporary tables) are centralized.

--- Spark shell script

val spark2 = spark.newSession
spark.sql("CREATE TABLE test (a int) using parquet")
spark2.sql("select * from test").show // OK returns empty
spark.sql("select * from test").show // OK returns empty
spark.sql("insert into TABLE test values 1,2,3")
spark2.sql("select * from test").show // ERROR returns empty
spark.sql("select * from test").show // OK returns 3,2,1
spark.sql("create table test2 (a int) using parquet")
spark.sql("insert into TABLE test2 values 4,5,6")
spark2.sql("select * from test2").show // OK returns 6,4,5
spark.sql("select * from test2").show // OK returns 6,4,5
spark.sql("alter table test rename to test3")
spark.sql("alter table test2 rename to test")
spark.sql("alter table test3 rename to test2")
spark2.sql("select * from test").show // ERROR returns empty
spark.sql("select * from test").show // OK returns 6,4,5
spark2.sql("select * from test2").show // ERROR throws 
java.io.FileNotFoundException
spark.sql("select * from test2").show // OK returns 3,1,2





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization

2017-03-02 Thread Giambattista (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892016#comment-15892016
 ] 

Giambattista commented on SPARK-17931:
--

Thanks, I just opened SPARK-19796 and added the required details.

> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Guoqiang Li
>Assignee: Kay Ousterhout
> Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of deserialization is
> unnecessary (this is as a result of SPARK-2521). We should
> eliminate a layer of serialization by moving the JARs, files, and Properties
> into the TaskDescription class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Giambattista (JIRA)
Giambattista created SPARK-19796:


 Summary: taskScheduler fails serializing long statements received 
by thrift server
 Key: SPARK-19796
 URL: https://issues.apache.org/jira/browse/SPARK-19796
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Giambattista


This problem was observed after the changes made for SPARK-17931.

In my use case I'm sending very long insert statements to the Spark thrift 
server, and they fail at TaskDescription.scala:89 because writeUTF fails when 
asked to write strings longer than 64 KB (see 
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for a 
description of the issue).
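
For reference, the underlying writeUTF limitation can be reproduced in 
isolation from spark-shell with a minimal sketch like this (my own 
illustration, not taken from the failing job):

import java.io.{ByteArrayOutputStream, DataOutputStream}

val out = new DataOutputStream(new ByteArrayOutputStream())
val longValue = "x" * 70000 // modified-UTF-8 length exceeds writeUTF's 64 KB field
out.writeUTF(longValue)     // throws java.io.UTFDataFormatException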

As suggested by Imran Rashid, I tracked down the offending key: it is 
"spark.job.description", and it contains the complete SQL statement.

The problem can be reproduced by creating a table like:
create table test (a int) using parquet

and by sending an insert statement like the one generated by:
scala> val r = 1 to 128000
scala> println("insert into table test values (" + r.mkString("),(") + ")")





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization

2017-03-01 Thread Giambattista (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890226#comment-15890226
 ] 

Giambattista commented on SPARK-17931:
--

I just wanted to report that after this change Spark fails to execute long SQL 
statements (in my case, long insert into table statements).
The problem I was facing is very well described in this article: 
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/
Eventually, I was able to get them working again with the change below.

--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
     dataOut.writeInt(taskDescription.properties.size())
     taskDescription.properties.asScala.foreach { case (key, value) =>
       dataOut.writeUTF(key)
-      dataOut.writeUTF(value)
+      dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
     }
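
An alternative, purely hypothetical sketch that avoids truncating the value 
would be to replace the writeUTF(value) call above with a length-prefixed 
UTF-8 byte array, which has no 64 KB cap (the reader side would need a 
matching readInt/readFully change):

// Hypothetical alternative to truncation (not the change above): write the
// value as length-prefixed UTF-8 bytes so arbitrarily long strings survive.
val bytes = value.getBytes(java.nio.charset.StandardCharsets.UTF_8)
dataOut.writeInt(bytes.length)
dataOut.write(bytes)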



> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Guoqiang Li
>Assignee: Kay Ousterhout
> Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of deserialization is
> unnecessary (this is as a result of SPARK-2521). We should
> eliminate a layer of serialization by moving the JARs, files, and Properties
> into the TaskDescription class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17931) taskScheduler has some unneeded serialization

2017-03-01 Thread Giambattista (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890226#comment-15890226
 ] 

Giambattista edited comment on SPARK-17931 at 3/1/17 2:06 PM:
--

I just wanted to report that after this change Spark fails to execute long SQL 
statements (in my case, long insert into table statements).
The problem I was facing is very well described in this article: 
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/
Eventually, I was able to get them working again with the change below.

{noformat}
--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
     dataOut.writeInt(taskDescription.properties.size())
     taskDescription.properties.asScala.foreach { case (key, value) =>
       dataOut.writeUTF(key)
-      dataOut.writeUTF(value)
+      dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
     }
{noformat}



was (Author: gbloisi):
I just wanted to report that after this change Spark fails to execute long SQL 
statements (in my case, long insert into table statements).
The problem I was facing is very well described in this article: 
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/
Eventually, I was able to get them working again with the change below.

--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
     dataOut.writeInt(taskDescription.properties.size())
     taskDescription.properties.asScala.foreach { case (key, value) =>
       dataOut.writeUTF(key)
-      dataOut.writeUTF(value)
+      dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
     }



> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Guoqiang Li
>Assignee: Kay Ousterhout
> Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of deserialization is
> unnecessary (this is as a result of SPARK-2521). We should
> eliminate a layer of serialization by moving the JARs, files, and Properties
> into the TaskDescription class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19059) Unable to retrieve data from a parquet table whose name starts with underscore

2017-01-03 Thread Giambattista (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15796253#comment-15796253
 ] 

Giambattista commented on SPARK-19059:
--

Please note that your environment is printing version 2.0.0, even though the 
command-line prompt seems to be located in a 2.1.0 folder.

> Unable to retrieve data from a parquet table whose name starts with underscore
> --
>
> Key: SPARK-19059
> URL: https://issues.apache.org/jira/browse/SPARK-19059
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Giambattista
> Fix For: 2.1.1
>
>
> It looks like a bug introduced in Spark 2.1.0 prevents reading data from a 
> parquet table (with hive support enabled) whose name starts with an 
> underscore. CREATE and INSERT statements on the same table, however, seem to 
> work as expected.
> The problem can be reproduced from spark-shell through the following steps:
> 1) Create a table with some values
> scala> spark.sql("CREATE TABLE `_a`(i INT) USING parquet").show
> scala> spark.sql("INSERT INTO `_a` VALUES (1), (2), (3)").show
> 2) Select data from the just created and filled table --> no results
> scala> spark.sql("SELECT * FROM `_a`").show
> +---+
> |  i|
> +---+
> +---+
> 3) rename the table so that the leading underscore disappears
> scala> spark.sql("ALTER TABLE `_a` RENAME TO `a`").show
> 4) select data from the just renamed table --> results are shown
> scala> spark.sql("SELECT * FROM `a`").show
> +---+
> |  i|
> +---+
> |  1|
> |  2|
> |  3|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19059) Unable to retrieve data from a parquet table whose name starts with underscore

2017-01-03 Thread Giambattista (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15796248#comment-15796248
 ] 

Giambattista commented on SPARK-19059:
--

I'm using version 2.1.0 and I suspect it is a regression introduced with the 
changes in metadata caching.

> Unable to retrieve data from a parquet table whose name starts with underscore
> --
>
> Key: SPARK-19059
> URL: https://issues.apache.org/jira/browse/SPARK-19059
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Giambattista
> Fix For: 2.1.1
>
>
> It looks like a bug introduced in Spark 2.1.0 prevents reading data from a 
> parquet table (with hive support enabled) whose name starts with an 
> underscore. CREATE and INSERT statements on the same table, however, seem to 
> work as expected.
> The problem can be reproduced from spark-shell through the following steps:
> 1) Create a table with some values
> scala> spark.sql("CREATE TABLE `_a`(i INT) USING parquet").show
> scala> spark.sql("INSERT INTO `_a` VALUES (1), (2), (3)").show
> 2) Select data from the just created and filled table --> no results
> scala> spark.sql("SELECT * FROM `_a`").show
> +---+
> |  i|
> +---+
> +---+
> 3) rename the table so that the leading underscore disappears
> scala> spark.sql("ALTER TABLE `_a` RENAME TO `a`").show
> 4) select data from the just renamed table --> results are shown
> scala> spark.sql("SELECT * FROM `a`").show
> +---+
> |  i|
> +---+
> |  1|
> |  2|
> |  3|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19059) Unable to retrieve data from a parquet table whose name starts with underscore

2017-01-03 Thread Giambattista (JIRA)
Giambattista created SPARK-19059:


 Summary: Unable to retrieve data from a parquet table whose name 
starts with underscore
 Key: SPARK-19059
 URL: https://issues.apache.org/jira/browse/SPARK-19059
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Giambattista
 Fix For: 2.1.1


It looks like a bug introduced in Spark 2.1.0 prevents reading data from a 
parquet table (with hive support enabled) whose name starts with an 
underscore. CREATE and INSERT statements on the same table, however, seem to 
work as expected.

The problem can be reproduced from spark-shell through the following steps:
1) Create a table with some values
scala> spark.sql("CREATE TABLE `_a`(i INT) USING parquet").show
scala> spark.sql("INSERT INTO `_a` VALUES (1), (2), (3)").show

2) Select data from the just created and filled table --> no results
scala> spark.sql("SELECT * FROM `_a`").show
+---+
|  i|
+---+
+---+

3) rename the table so that the leading underscore disappears
scala> spark.sql("ALTER TABLE `_a` RENAME TO `a`").show

4) select data from the just renamed table --> results are shown
scala> spark.sql("SELECT * FROM `a`").show
+---+
|  i|
+---+
|  1|
|  2|
|  3|
+---+







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org