[jira] [Created] (SPARK-20142) Move RewriteDistinctAggregates later into query execution

2017-03-29 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-20142:
-

 Summary: Move RewriteDistinctAggregates later into query execution
 Key: SPARK-20142
 URL: https://issues.apache.org/jira/browse/SPARK-20142
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 2.1.0
Reporter: Juliusz Sompolski
Priority: Minor


The rewrite of distinct aggregates complicates their analysis by subsequent optimizer rules.
Move the rewrite to a later stage of query execution.






[jira] [Created] (SPARK-20145) "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't

2017-03-29 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-20145:
-

 Summary: "SELECT * FROM range(1)" works, but "SELECT * FROM 
RANGE(1)" doesn't
 Key: SPARK-20145
 URL: https://issues.apache.org/jira/browse/SPARK-20145
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Juliusz Sompolski


Executed at clean tip of the master branch, with all default settings:

scala> spark.sql("SELECT * FROM range(1)")
res1: org.apache.spark.sql.DataFrame = [id: bigint]

scala> spark.sql("SELECT * FROM RANGE(1)")
org.apache.spark.sql.AnalysisException: could not resolve `RANGE` to a 
table-valued function; line 1 pos 14
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:106)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
...

I believe it should be case insensitive?
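For illustration, one way to make the lookup case-insensitive is to normalize the function name before matching. This is a minimal, hypothetical sketch; the map and helper below are made up and are not the actual ResolveTableValuedFunctions code:
{code}
// Hypothetical sketch: resolve a table-valued function name case-insensitively
// by lower-casing it before the lookup. Names and types are illustrative only.
val builtinTableFunctions: Map[String, Long => Seq[Long]] = Map(
  "range" -> ((n: Long) => 0L until n)
)

def resolveTableFunction(name: String): Option[Long => Seq[Long]] =
  builtinTableFunctions.get(name.toLowerCase(java.util.Locale.ROOT))

resolveTableFunction("RANGE")  // Some(...) - matches "range" regardless of case
{code}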






[jira] [Commented] (SPARK-20145) "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't

2017-03-30 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949429#comment-15949429
 ] 

Juliusz Sompolski commented on SPARK-20145:
---

[~samelamin] sure, go ahead :-).

> "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't
> 
>
> Key: SPARK-20145
> URL: https://issues.apache.org/jira/browse/SPARK-20145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> Executed at clean tip of the master branch, with all default settings:
> scala> spark.sql("SELECT * FROM range(1)")
> res1: org.apache.spark.sql.DataFrame = [id: bigint]
> scala> spark.sql("SELECT * FROM RANGE(1)")
> org.apache.spark.sql.AnalysisException: could not resolve `RANGE` to a 
> table-valued function; line 1 pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:126)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
> ...
> I believe it should be case insensitive?






[jira] [Created] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-04-12 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-20311:
-

 Summary: SQL "range(N) as alias" or "range(N) alias" doesn't work
 Key: SPARK-20311
 URL: https://issues.apache.org/jira/browse/SPARK-20311
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Juliusz Sompolski
Priority: Minor


Neither `select * from range(10) as A;` nor `select * from range(10) A;` works.
As a workaround, a subquery has to be used:
`select * from (select * from range(10)) as A;`






[jira] [Created] (SPARK-20412) NullPointerException in places expecting non-optional partitionSpec.

2017-04-20 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-20412:
-

 Summary: NullPointerException in places expecting non-optional 
partitionSpec.
 Key: SPARK-20412
 URL: https://issues.apache.org/jira/browse/SPARK-20412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.2.0
Reporter: Juliusz Sompolski


A number of commands expect a partition specification without empty values, e.g. {{SHOW PARTITIONS}}.
But running {{SHOW PARTITIONS tbl (colStr='foo', colInt)}} fails with an unfriendly NullPointerException:
{{
java.lang.NullPointerException
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1$$anonfun$apply$1.apply(SessionCatalog.scala:927)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1$$anonfun$apply$1.apply(SessionCatalog.scala:927)
  at scala.collection.Iterator$class.exists(Iterator.scala:919)
  at scala.collection.AbstractIterator.exists(Iterator.scala:1336)
  at scala.collection.IterableLike$class.exists(IterableLike.scala:77)
  at scala.collection.AbstractIterable.exists(Iterable.scala:54)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1.apply(SessionCatalog.scala:927)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1.apply(SessionCatalog.scala:926)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec(SessionCatalog.scala:926)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$listPartitionNames$1.apply(SessionCatalog.scala:882)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$listPartitionNames$1.apply(SessionCatalog.scala:880)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionNames(SessionCatalog.scala:880)
  at 
org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(tables.scala:817)
}}

because {{requireNonEmptyValueInPartitionSpec}} does not expect a null value there.

It seems that {{visitNonOptionalPartitionSpec}} could throw a {{ParseException}} instead of inserting {{null}}, but I'm not sure whether there are any implications for other commands using a non-optional partitionSpec.
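To illustrate the idea, the non-optional partition spec could be validated up front. This is only a sketch under the assumption that failing at parse time is acceptable; it is not the actual AstBuilder code, and a plain exception stands in for {{ParseException}} to keep it self-contained:
{code}
// Illustrative sketch: reject partition columns without a value instead of
// storing null and hitting a NullPointerException later.
def toNonOptionalPartitionSpec(
    spec: Map[String, Option[String]],
    command: String): Map[String, String] =
  spec.map {
    case (col, Some(value)) => col -> value
    case (col, None) =>
      // In Spark this would be a ParseException raised while visiting the spec.
      throw new IllegalArgumentException(
        s"Partition column '$col' must be assigned a value in $command")
  }

toNonOptionalPartitionSpec(Map("colStr" -> Some("foo"), "colInt" -> None), "SHOW PARTITIONS")
// fails immediately with a clear message
{code}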






[jira] [Updated] (SPARK-20412) NullPointerException in places expecting non-optional partitionSpec.

2017-04-20 Thread Juliusz Sompolski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-20412:
--
Description: 
A number of commands expect a partition specification without empty values, e.g. {{SHOW PARTITIONS}}.
But running {{SHOW PARTITIONS tbl (colStr='foo', colInt)}} fails with an unfriendly NullPointerException:
{code}
java.lang.NullPointerException
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1$$anonfun$apply$1.apply(SessionCatalog.scala:927)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1$$anonfun$apply$1.apply(SessionCatalog.scala:927)
  at scala.collection.Iterator$class.exists(Iterator.scala:919)
  at scala.collection.AbstractIterator.exists(Iterator.scala:1336)
  at scala.collection.IterableLike$class.exists(IterableLike.scala:77)
  at scala.collection.AbstractIterable.exists(Iterable.scala:54)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1.apply(SessionCatalog.scala:927)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1.apply(SessionCatalog.scala:926)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec(SessionCatalog.scala:926)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$listPartitionNames$1.apply(SessionCatalog.scala:882)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$listPartitionNames$1.apply(SessionCatalog.scala:880)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionNames(SessionCatalog.scala:880)
  at 
org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(tables.scala:817)
{code}

because {{requireNonEmptyValueInPartitionSpec}} does not expect a null value there.

It seems that {{visitNonOptionalPartitionSpec}} could throw a {{ParseException}} instead of inserting {{null}}, but I'm not sure whether there are any implications for other commands using a non-optional partitionSpec.

  was:
A number of commands expect a partition specification without empty values, 
e.g. {{SHOW PARTITIONS}}.
But then running {{SHOW PARTITIONS tbl (colStr='foo', colInt)}} throws it in an 
unfriendly way:
{{
java.lang.NullPointerException
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1$$anonfun$apply$1.apply(SessionCatalog.scala:927)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1$$anonfun$apply$1.apply(SessionCatalog.scala:927)
  at scala.collection.Iterator$class.exists(Iterator.scala:919)
  at scala.collection.AbstractIterator.exists(Iterator.scala:1336)
  at scala.collection.IterableLike$class.exists(IterableLike.scala:77)
  at scala.collection.AbstractIterable.exists(Iterable.scala:54)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1.apply(SessionCatalog.scala:927)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec$1.apply(SessionCatalog.scala:926)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireNonEmptyValueInPartitionSpec(SessionCatalog.scala:926)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$listPartitionNames$1.apply(SessionCatalog.scala:882)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$listPartitionNames$1.apply(SessionCatalog.scala:880)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionNames(SessionCatalog.scala:880)
  at 
org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(tables.scala:817)
}}

where {{requireNonEmptyValueInPartitionSpec}} does not expect a NULL there.

It seems that {{visitNonOptionalPartitionSpec}} could throw {{ParseException}} 
instead of putting in {{null}}, but I'm not sure if there are any implications 
for other commands using non-optional partitionSpec.


> NullPointerException in places expecting non-optional partitionSpec.
> 

[jira] [Commented] (SPARK-20367) Spark silently escapes partition column names

2017-04-18 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972867#comment-15972867
 ] 

Juliusz Sompolski commented on SPARK-20367:
---

Hi [~hyukjin.kwon]. I also tested with Parquet, and it happens there as well.
{quote}
The same happens for other formats, but I encountered it working with CSV, 
since these more often contain ugly schemas...
{quote}

> Spark silently escapes partition column names
> -
>
> Key: SPARK-20367
> URL: https://issues.apache.org/jira/browse/SPARK-20367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> CSV files can have arbitrary column names:
> {code}
> scala> spark.range(1).select(col("id").as("Column?"), 
> col("id")).write.option("header", true).csv("/tmp/foo")
> scala> spark.read.option("header", true).csv("/tmp/foo").schema
> res1: org.apache.spark.sql.types.StructType = 
> StructType(StructField(Column?,StringType,true), 
> StructField(id,StringType,true))
> {code}
> However, once a column with characters like "?" in the name gets used in a 
> partitioning column, the column name gets silently escaped, and reading the 
> schema information back renders the column name with "?" turned into "%3F":
> {code}
> scala> spark.range(1).select(col("id").as("Column?"), 
> col("id")).write.partitionBy("Column?").option("header", true).csv("/tmp/bar")
> scala> spark.read.option("header", true).csv("/tmp/bar").schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(id,StringType,true), 
> StructField(Column%3F,IntegerType,true))
> {code}
> The same happens for other formats, but I encountered it working with CSV, 
> since these more often contain ugly schemas... 
> Not sure if it's a bug or a feature, but it might be more intuitive to fail 
> queries with invalid characters in the partitioning column name, rather than 
> silently escaping the name?






[jira] [Created] (SPARK-19948) Document that saveAsTable uses catalog as source of truth for table existence.

2017-03-14 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-19948:
-

 Summary: Document that saveAsTable uses catalog as source of truth 
for table existence.
 Key: SPARK-19948
 URL: https://issues.apache.org/jira/browse/SPARK-19948
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.1.0
Reporter: Juliusz Sompolski


It is quirky behaviour that saveAsTable to e.g. a JDBC source with a SaveMode other than Overwrite will nevertheless overwrite the table in the external source, if that table is not a catalog table.
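For illustration, the scenario looks roughly like the following; this is only a sketch of the behaviour described above, and the JDBC URL, options and table name are made-up placeholders (assuming a spark-shell session where {{spark}} is the SparkSession):
{code}
// Illustrative only: "ext_table" exists in the external database but is NOT in
// Spark's catalog, so Spark treats the save target as non-existing even though
// the mode is ErrorIfExists, and the external table may end up being overwritten.
import org.apache.spark.sql.SaveMode

val df = spark.range(10).toDF("id")
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost/dbname")   // placeholder URL
  .option("dbtable", "ext_table")                     // placeholder table name
  .mode(SaveMode.ErrorIfExists)
  .saveAsTable("ext_table")
{code}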






[jira] [Created] (SPARK-20367) Spark silently escapes partition column names

2017-04-18 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-20367:
-

 Summary: Spark silently escapes partition column names
 Key: SPARK-20367
 URL: https://issues.apache.org/jira/browse/SPARK-20367
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 2.2.0
Reporter: Juliusz Sompolski
Priority: Minor


CSV files can have arbitrary column names:
{code}
scala> spark.range(1).select(col("id").as("Column?"), 
col("id")).write.option("header", true).csv("/tmp/foo")
scala> spark.read.option("header", true).csv("/tmp/foo").schema
res1: org.apache.spark.sql.types.StructType = 
StructType(StructField(Column?,StringType,true), 
StructField(id,StringType,true))
{code}
However, once a column with characters like "?" in the name gets used in a 
partitioning column, the column name gets silently escaped, and reading the 
schema information back renders the column name with "?" turned into "%3F":
{code}
scala> spark.range(1).select(col("id").as("Column?"), 
col("id")).write.partitionBy("Column?").option("header", true).csv("/tmp/bar")
scala> spark.read.option("header", true).csv("/tmp/bar").schema
res3: org.apache.spark.sql.types.StructType = 
StructType(StructField(id,StringType,true), 
StructField(Column%3F,IntegerType,true))
{code}
The same happens for other formats; I encountered it with CSV since CSV files more often have unusual column names.

Not sure if it's a bug or a feature, but it might be more intuitive to fail queries whose partitioning column names contain invalid characters, rather than silently escaping the names.
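As a sketch of the stricter alternative (illustrative only; the character set and helper below are assumptions, not existing Spark code):
{code}
// Reject partition column names containing characters that would be URL-escaped
// in partition directory names, instead of escaping them silently on write.
val illustrativeEscapedChars = Set('?', '%', '#', ':', '=', '/', '\\')

def requireUnescapedPartitionColumn(name: String): Unit = {
  val bad = name.filter(illustrativeEscapedChars.contains)
  if (bad.nonEmpty) {
    throw new IllegalArgumentException(
      s"Partition column name '$name' contains characters that would be escaped: '$bad'")
  }
}

requireUnescapedPartitionColumn("Column?")  // fails fast instead of writing "Column%3F"
{code}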






[jira] [Created] (SPARK-21272) SortMergeJoin LeftAnti does not update numOutputRows

2017-06-30 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-21272:
-

 Summary: SortMergeJoin LeftAnti does not update numOutputRows
 Key: SPARK-21272
 URL: https://issues.apache.org/jira/browse/SPARK-21272
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Juliusz Sompolski
Priority: Trivial


The output rows metric is not updated in one of the code branches.
A PR is pending.
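The fix pattern, in a generic form (this is not the actual SortMergeJoinExec code; the iterator and the metric below are stand-ins):
{code}
// In a left-anti join, a streamed row is emitted only when no buffered match
// exists, and the numOutputRows metric must be bumped on that branch too.
import java.util.concurrent.atomic.LongAdder

val numOutputRows = new LongAdder   // stands in for the SQLMetric

def leftAnti[T](streamed: Iterator[T], hasMatch: T => Boolean): Iterator[T] =
  streamed.filter { row =>
    val keep = !hasMatch(row)
    if (keep) numOutputRows.increment()   // the emitting branch must update the metric
    keep
  }

leftAnti(Iterator(1, 2, 3), (x: Int) => x == 2).toList  // List(1, 3); numOutputRows.sum == 2
{code}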






[jira] [Created] (SPARK-20616) RuleExecutor logDebug of batch results should show diff to start of batch

2017-05-05 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-20616:
-

 Summary: RuleExecutor logDebug of batch results should show diff 
to start of batch
 Key: SPARK-20616
 URL: https://issues.apache.org/jira/browse/SPARK-20616
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Juliusz Sompolski


Due to a likely typo, the logDebug message printing the diff of query plans shows a diff against the initial plan, not against the plan at the start of the batch.
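Schematically, the intent is to diff against the plan captured when the batch started, as in the following sketch (not the actual RuleExecutor code; plans are represented as plain strings and the diff helper is a placeholder):
{code}
// Placeholder standing in for Spark's sideBySide plan-diff utility.
def sideBySideDiff(before: String, after: String): String =
  before.split("\n").toSeq.zipAll(after.split("\n").toSeq, "", "")
    .map { case (l, r) => f"$l%-50s $r" }.mkString("\n")

def logBatchResult(batchName: String, batchStartPlan: String, currentPlan: String): Unit =
  if (currentPlan != batchStartPlan) {
    // Diff against the plan at the start of the batch, not the initial plan.
    println(s"=== Result of Batch $batchName ===\n" + sideBySideDiff(batchStartPlan, currentPlan))
  } else {
    println(s"Batch $batchName has no effect.")
  }
{code}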






[jira] [Created] (SPARK-22103) Move HashAggregateExec parent consume to a separate function in codegen

2017-09-22 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-22103:
-

 Summary: Move HashAggregateExec parent consume to a separate 
function in codegen
 Key: SPARK-22103
 URL: https://issues.apache.org/jira/browse/SPARK-22103
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Juliusz Sompolski


HashAggregateExec codegen uses two hash-table paths: a fast one and a generic one.
It generates code for iterating over both, and both code paths expand the consume code of the parent operator, so that code is generated twice.
This leads to a long generated function that can be a problem for the compiler (see e.g. SPARK-21603).
I propose to remove the double expansion by generating the consume code in a helper function that can be called from both iteration loops.
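A schematic sketch of the proposal; the generated-code strings and names like agg_doConsume, fastHashMapIter and regularHashMapIter are illustrative assumptions, not the actual HashAggregateExec output:
{code}
// Emit the parent's consume code once, inside a private helper in the generated
// class, and call that helper from both hash-map iteration loops.
def generateConsumeFunction(parentConsumeCode: String): String =
  s"""private void agg_doConsume(InternalRow aggRow) throws java.io.IOException {
     |  $parentConsumeCode  // expanded once instead of twice
     |}""".stripMargin

def generateOutputLoops(parentConsumeCode: String): String =
  s"""${generateConsumeFunction(parentConsumeCode)}
     |while (fastHashMapIter.next())    { agg_doConsume(fastHashMapIter.getRow()); }
     |while (regularHashMapIter.next()) { agg_doConsume(regularHashMapIter.getRow()); }""".stripMargin
{code}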






[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-09-06 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156036#comment-16156036
 ] 

Juliusz Sompolski commented on SPARK-21907:
---

[~kiszk] unfortunately I don't have a small-scale repro. I hit it several times when running SQL queries operating on several TB of data on a cluster with ~20 nodes / ~300 cores.
Looking at the code, I'm quite sure it's caused by UnsafeInMemorySorter.reset:
{code}
  consumer.freeArray(array);
  array = consumer.allocateArray(initialSize);
{code}
where allocating the array just after it was freed fails with another OOM, causing a nested spill.
I think nested spilling is invalid, so acquiring memory in a way that can trigger a spill is invalid on code paths that are already spilling.
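One generic way to express that constraint is a guard that refuses re-entrant spills; this is only an illustration of the reasoning above, not the fix that went into Spark:
{code}
// Minimal sketch: ignore spill requests that arrive while this consumer is
// already in the middle of spilling, so reset() cannot trigger a nested spill.
class SpillGuard {
  private var spilling = false

  def runSpill(doSpill: () => Long): Long = {
    if (spilling) return 0L            // already spilling: do not recurse
    spilling = true
    try doSpill()
    finally spilling = false
  }
}
{code}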

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> 

[jira] [Created] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-09-04 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-21907:
-

 Summary: NullPointerException in UnsafeExternalSorter.spill()
 Key: SPARK-21907
 URL: https://issues.apache.org/jira/browse/SPARK-21907
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Juliusz Sompolski


I see NPE during sorting with the following stacktrace:
{code}
java.lang.NullPointerException
at 
org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
at 
org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at 
org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
at 
org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}





[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-09-04 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152408#comment-16152408
 ] 

Juliusz Sompolski commented on SPARK-21907:
---

Note that UnsafeExternalSorter.spill appears twice on the stack trace, so this is a nested spill: the first spill triggers another spill through UnsafeInMemorySorter.reset.

Possibly the sorter corrupts some of its own state by nested-spilling twice, or the nesting interacts badly with
{code:java}
if (trigger != this) {
  if (readingIterator != null) {
    return readingIterator.spill();
  }
  return 0L; // this should throw exception
}
{code}
in spill().

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> 

[jira] [Updated] (SPARK-22462) SQL metrics missing after foreach operation on dataframe

2017-11-06 Thread Juliusz Sompolski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-22462:
--
Attachment: collect.png
foreach.png

> SQL metrics missing after foreach operation on dataframe
> 
>
> Key: SPARK-22462
> URL: https://issues.apache.org/jira/browse/SPARK-22462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
> Attachments: collect.png, foreach.png
>
>
> No SQL metrics are visible in the SQL tab of SparkUI when foreach is executed 
> on the DataFrame.
> e.g.
> {code}
> sql("select * from range(10)").collect()
> sql("select * from range(10)").foreach(a => Unit)
> sql("select * from range(10)").foreach(a => println(a))
> {code}
> See collect.png vs. foreach.png






[jira] [Created] (SPARK-22462) SQL metrics missing after foreach operation on dataframe

2017-11-06 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-22462:
-

 Summary: SQL metrics missing after foreach operation on dataframe
 Key: SPARK-22462
 URL: https://issues.apache.org/jira/browse/SPARK-22462
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Juliusz Sompolski


No SQL metrics are visible in the SQL tab of SparkUI when foreach is executed 
on the DataFrame.
e.g.
{code}
sql("select * from range(10)").collect()
sql("select * from range(10)").foreach(a => Unit)
sql("select * from range(10)").foreach(a => println(a))
{code}
See collect.png vs. foreach.png






[jira] [Created] (SPARK-22721) BytesToBytesMap peak memory usage not accurate after reset()

2017-12-06 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-22721:
-

 Summary: BytesToBytesMap peak memory usage not accurate after 
reset()
 Key: SPARK-22721
 URL: https://issues.apache.org/jira/browse/SPARK-22721
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Juliusz Sompolski


BytesToBytesMap doesn't update its peak memory usage before shrinking back to the initial capacity in reset(), so after a disk spill one can no longer tell how big the hash table was before spilling.
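The reasoning in a self-contained sketch; this toy class only mirrors the pattern and is not the BytesToBytesMap code:
{code}
// Record the peak before releasing memory in reset(); otherwise the pre-spill
// size of the table is lost.
class TrackedTable(initialCapacity: Long) {
  private var capacity = initialCapacity
  private var peak = initialCapacity

  def grow(newCapacity: Long): Unit = {
    capacity = newCapacity
    peak = math.max(peak, capacity)
  }

  def reset(): Unit = {
    peak = math.max(peak, capacity)  // the missing step: capture the peak first
    capacity = initialCapacity
  }

  def peakCapacity: Long = peak
}
{code}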






[jira] [Updated] (SPARK-24341) Codegen compile error from predicate subquery

2018-05-22 Thread Juliusz Sompolski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-24341:
--
Description: 
Ran on master:
{code}
drop table if exists juleka;
drop table if exists julekb;
create table juleka (a integer, b integer);
create table julekb (na integer, nb integer);
insert into juleka values (1,1);
insert into julekb values (1,1);
select * from juleka where (a, b) not in (select (na, nb) from julekb);
{code}

Results in:
{code}
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 27, 
Column 29: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 27, Column 29: Cannot compare types "int" and 
"org.apache.spark.sql.catalyst.InternalRow"
at 
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at 
com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46)
at 
org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
at 
scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at 
org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
at 
org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
at 
org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:111)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 27, Column 29: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 27, 
Column 29: Cannot compare types "int" and 
"org.apache.spark.sql.catalyst.InternalRow"
at 

[jira] [Created] (SPARK-24341) Codegen compile error from predicate subquery

2018-05-22 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-24341:
-

 Summary: Codegen compile error from predicate subquery
 Key: SPARK-24341
 URL: https://issues.apache.org/jira/browse/SPARK-24341
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Juliusz Sompolski


Ran on Shared Autoscaling on dogfood:
{code}
drop table if exists juleka;
drop table if exists julekb;
create table juleka (a integer, b integer);
create table julekb (na integer, nb integer);
insert into juleka values (1,1);
insert into julekb values (1,1);
select * from juleka where (a, b) not in (select (na, nb) from julekb);
{code}

Results in:
{code}
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 27, 
Column 29: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 27, Column 29: Cannot compare types "int" and 
"org.apache.spark.sql.catalyst.InternalRow"
at 
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at 
com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46)
at 
org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
at 
scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at 
org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
at 
org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
at 
org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:111)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 27, Column 29: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 

[jira] [Commented] (SPARK-24395) Fix Behavior of NOT IN with Literals Containing NULL

2018-05-29 Thread Juliusz Sompolski (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494181#comment-16494181
 ] 

Juliusz Sompolski commented on SPARK-24395:
---

The question is whether the literals should be treated as structs, or unpacked into separate columns.

If as structs, then the current behavior is correct, I think.

But when a similar query uses an IN / NOT IN subquery, the left-hand side is currently treated as if it were unpacked into independent columns.

cc [~mgaido] [~hvanhovell]

> Fix Behavior of NOT IN with Literals Containing NULL
> 
>
> Key: SPARK-24395
> URL: https://issues.apache.org/jira/browse/SPARK-24395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Miles Yucht
>Priority: Major
>
> Spark does not return the correct answer when evaluating NOT IN in some 
> cases. For example:
> {code:java}
> CREATE TEMPORARY VIEW m AS SELECT * FROM VALUES
>   (null, null)
>   AS m(a, b);
> SELECT *
> FROM   m
> WHERE  a IS NULL AND b IS NULL
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1;{code}
> According to the semantics of null-aware anti-join, this should return no 
> rows. However, it actually returns the row {{NULL NULL}}. This was found by 
> inspecting the unit tests added for SPARK-24381 
> ([https://github.com/apache/spark/pull/21425#pullrequestreview-123421822).]
> *Acceptance Criteria*:
>  * We should be able to add the following test cases back to 
> {{subquery/in-subquery/not-in-unit-test-multi-column-literal.sql}}:
> {code:java}
>   -- Case 2
>   -- (subquery contains a row with null in all columns -> row not returned)
> SELECT *
> FROM   m
> WHERE  (a, b) NOT IN ((CAST (null AS INT), CAST (null AS DECIMAL(2, 1;
>   -- Case 3
>   -- (probe-side columns are all null -> row not returned)
> SELECT *
> FROM   m
> WHERE  a IS NULL AND b IS NULL -- Matches only (null, null)
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1;
>   -- Case 4
>   -- (one column null, other column matches a row in the subquery result -> 
> row not returned)
> SELECT *
> FROM   m
> WHERE  b = 1.0 -- Matches (null, 1.0)
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1; 
> {code}
>  
> cc [~smilegator] [~juliuszsompolski]






[jira] [Created] (SPARK-24104) SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them

2018-04-26 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-24104:
-

 Summary: SQLAppStatusListener overwrites metrics 
onDriverAccumUpdates instead of updating them
 Key: SPARK-24104
 URL: https://issues.apache.org/jira/browse/SPARK-24104
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Juliusz Sompolski


SQLAppStatusListener does
{code}
exec.driverAccumUpdates = accumUpdates.toMap
update(exec)
{code}
in onDriverAccumUpdates.
But postDriverMetricUpdates is called multiple times per query, e.g. from each FileSourceScanExec and BroadcastExchangeExec.

If update() does not actually write to the KV store (depending on liveUpdatePeriodNs), the previously posted metrics are lost.
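The intended behaviour, sketched outside of the listener (illustrative only; accumulator updates are modelled as a plain map here, not as the real data structures):
{code}
// Merge newly posted driver accumulator updates into what was recorded before,
// instead of replacing it, since updates are posted more than once per query.
var driverAccumUpdates: Map[Long, Long] = Map.empty

def onDriverAccumUpdates(accumUpdates: Seq[(Long, Long)]): Unit = {
  driverAccumUpdates = driverAccumUpdates ++ accumUpdates  // was: = accumUpdates.toMap
}

onDriverAccumUpdates(Seq(1L -> 10L))
onDriverAccumUpdates(Seq(2L -> 20L))
// driverAccumUpdates == Map(1L -> 10L, 2L -> 20L): the first update is no longer lost
{code}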






[jira] [Created] (SPARK-23087) CheckCartesianProduct too restrictive when condition is constant folded to false/null

2018-01-16 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-23087:
-

 Summary: CheckCartesianProduct too restrictive when condition is 
constant folded to false/null
 Key: SPARK-23087
 URL: https://issues.apache.org/jira/browse/SPARK-23087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1, 2.3.0
Reporter: Juliusz Sompolski


Running
{code}
sql("SELECT id as a FROM RANGE(10)").createOrReplaceTempView("A")
sql("SELECT NULL as a FROM RANGE(10)").createOrReplaceTempView("NULLTAB")
sql("SELECT 1 as goo FROM A LEFT OUTER JOIN NULLTAB ON A.a = 
NULLTAB.a").collect()
{code}
results in:
{code}
org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT 
OUTER join between logical plans
Project
+- Range (0, 10, step=1, splits=None)
and
Project
+- Range (0, 10, step=1, splits=None)
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
  at 
 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
{code}

This is because NULLTAB.a is constant folded to null, and then the condition is 
constant folded altogether:
{code}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.NullPropagation ===
GlobalLimit 21  
 +- LocalLimit 21
+- Project [1 AS goo#28] 
!  +- Join LeftOuter, (a#0L = null)  
  :- Project [id#1L AS a#0L] 
  :  +- Range (0, 10, step=1, splits=None)   
  +- Project  
 +- Range (0, 10, step=1, splits=None) 

GlobalLimit 21
+- LocalLimit 21
   +- Project [1 AS goo#28]
  +- Join LeftOuter, null
 :- Project [id#1L AS a#0L]
 :  +- Range (0, 10, step=1, splits=None)
 +- Project
+- Range (0, 10, step=1, splits=None)
{code}

CheckCartesianProducts then rejects the plan, even though the join does not actually 
produce a cartesian product: the condition evaluates to null, so no rows match.
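A hedged sketch of the kind of exemption that would avoid this (not the actual Spark rule; just the shape of the check):
{code}
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.types.BooleanType

// A join condition that constant-folds to false or null can never match rows,
// so such joins could be exempted from the cartesian product check.
def conditionIsFalseOrNull(condition: Option[Expression]): Boolean = condition match {
  case Some(Literal(null, BooleanType))  => true
  case Some(Literal(false, BooleanType)) => true
  case _                                 => false
}
{code}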



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23366) Improve hot reading path in ReadAheadInputStream

2018-02-08 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-23366:
-

 Summary: Improve hot reading path in ReadAheadInputStream
 Key: SPARK-23366
 URL: https://issues.apache.org/jira/browse/SPARK-23366
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Juliusz Sompolski


ReadAheadInputStream was introduced in 
[apache/spark#18317|https://github.com/apache/spark/pull/18317] to optimize 
reading spill files from disk.
However, flamegraphs from profiling some workloads that regressed after the switch 
to Spark 2.3 suggest that the hot path of reading small amounts of data (like 
readInt) is inefficient - it involves taking locks and multiple checks.
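A minimal Scala sketch of the fast-path pattern (ReadAheadInputStream itself is Java; the names here are illustrative, not the actual fields):
{code}
import java.nio.ByteBuffer

// Serve single-byte reads straight from the buffer that is already filled, and only
// fall through to the locking slow path when it runs out.
class FastPathReader(activeBuffer: ByteBuffer, slowRead: () => Int) {
  def read(): Int = {
    if (activeBuffer.hasRemaining) activeBuffer.get() & 0xFF // hot path: no lock, no extra checks
    else slowRead()                                          // swap/refill buffers under the lock
  }
}
{code}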



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23310) Perf regression introduced by SPARK-21113

2018-02-08 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357890#comment-16357890
 ] 

Juliusz Sompolski commented on SPARK-23310:
---

[~kiszk] I raised SPARK-23366 and submitted 
[https://github.com/apache/spark/pull/20555] against it.

> Perf regression introduced by SPARK-21113
> -
>
> Key: SPARK-23310
> URL: https://issues.apache.org/jira/browse/SPARK-23310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yin Huai
>Assignee: Sital Kedia
>Priority: Blocker
> Fix For: 2.3.0
>
>
> While running all TPC-DS queries with SF set to 1000, we noticed that Q95 
> (https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q95.sql)
>  has noticeable regression (11%). After looking into it, we found that the 
> regression was introduced by SPARK-21113. Specially, ReadAheadInputStream 
> gets lock congestion. After setting 
> spark.unsafe.sorter.spill.read.ahead.enabled set to false, the regression 
> disappear and the overall performance of all TPC-DS queries has improved.
>  
> I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to 
> false by default for Spark 2.3 and re-enable it after addressing the lock 
> congestion issue. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23445) ColumnStat refactoring

2018-02-15 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-23445:
-

 Summary: ColumnStat refactoring
 Key: SPARK-23445
 URL: https://issues.apache.org/jira/browse/SPARK-23445
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Juliusz Sompolski


Refactor ColumnStat to be more flexible.
 * Split {{ColumnStat}} and {{CatalogColumnStat}} just like 
{{CatalogStatistics}} is split from {{Statistics}}. This detaches how the 
statistics are stored from how they are processed in the query plan. 
{{CatalogColumnStat}} keeps {{min}} and {{max}} as {{String}}, making it not 
depend on dataType information.
 * For {{CatalogColumnStat}}, parse column names from property names in the 
metastore ({{KEY_VERSION}} property), not from the metastore schema. This allows 
the catalog to read stats into {{CatalogColumnStat}}s even if the schema itself 
is not in the metastore.
 * Make all fields optional. {{min}}, {{max}} and {{histogram}} for columns 
were optional already. Having them all optional is more consistent, and gives 
flexibility to e.g. drop some of the fields through transformations if they are 
difficult / impossible to calculate.

The added flexibility will make it possible to have alternative implementations 
for stats, and separates stats collection from stats and estimation processing 
in plans.
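A rough sketch of the intended shape (field names are illustrative, not necessarily the final ones):
{code}
import org.apache.spark.sql.catalyst.plans.logical.Histogram

// All fields optional; min/max kept as String so the class does not depend on dataType.
case class CatalogColumnStat(
    distinctCount: Option[BigInt] = None,
    min: Option[String] = None,
    max: Option[String] = None,
    nullCount: Option[BigInt] = None,
    avgLen: Option[Long] = None,
    maxLen: Option[Long] = None,
    histogram: Option[Histogram] = None)
{code}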



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22938) Assert that SQLConf.get is accessed only on the driver.

2018-01-02 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-22938:
-

 Summary: Assert that SQLConf.get is accessed only on the driver.
 Key: SPARK-22938
 URL: https://issues.apache.org/jira/browse/SPARK-22938
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.2.1
Reporter: Juliusz Sompolski


Assert if code tries to access SQLConf.get on executor.
This can lead to hard-to-detect bugs, where the executor will read 
fallbackConf, falling back to default config values and ignoring potentially 
changed non-default configs.
If a config is to be passed to executor code, it needs to be read on the 
driver, and passed explicitly.
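A hedged sketch of the assertion (not necessarily the actual patch): an active TaskContext only exists inside a running task, i.e. on an executor.
{code}
import org.apache.spark.TaskContext

def assertOnDriver(): Unit = {
  if (TaskContext.get() != null) {
    throw new IllegalStateException(
      "SQLConf.get should only be called on the driver; read the config there and pass it to executors explicitly")
  }
}
{code}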



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22957) ApproxQuantile breaks if the number of rows exceeds MaxInt

2018-01-04 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-22957:
-

 Summary: ApproxQuantile breaks if the number of rows exceeds MaxInt
 Key: SPARK-22957
 URL: https://issues.apache.org/jira/browse/SPARK-22957
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Juliusz Sompolski


ApproxQuantile overflows when number of rows exceeds 2.147B (max int32).

If you run ApproxQuantile on a dataframe with 3B rows of 1 to 3B and ask it for 
1/6 quantiles, it should return [0.5B, 1B, 1.5B, 2B, 2.5B, 3B]. However, in the 
[implementation of 
ApproxQuantile|https://github.com/apache/spark/blob/v2.2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L195],
 it calls .toInt on the target rank, which overflows at 2.147B.
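The overflow is easy to see in isolation (illustration only, using the 5/6 quantile rank of a 3B-row input as an example):
{code}
// A rank above Int.MaxValue wraps around when .toInt is applied.
val targetRank: Long = 2500000000L   // e.g. the 5/6 quantile rank of a 3B-row dataset
targetRank.toInt                     // -1794967296: a nonsense rank after the overflow
// Keeping the rank as a Long (and comparing against Long row counts) avoids the wrap-around.
{code}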



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25284) Spark UI: make sure skipped stages are updated onJobEnd

2018-08-30 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-25284:
-

 Summary: Spark UI: make sure skipped stages are updated onJobEnd
 Key: SPARK-25284
 URL: https://issues.apache.org/jira/browse/SPARK-25284
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.0
Reporter: Juliusz Sompolski


Tiny bug in onJobEnd not forcing update of skipped stages in KVstore.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25284) Spark UI: make sure skipped stages are updated onJobEnd

2018-08-31 Thread Juliusz Sompolski (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598592#comment-16598592
 ] 

Juliusz Sompolski commented on SPARK-25284:
---

Contained by SPARK-24415

> Spark UI: make sure skipped stages are updated onJobEnd
> ---
>
> Key: SPARK-25284
> URL: https://issues.apache.org/jira/browse/SPARK-25284
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Tiny bug in onJobEnd not forcing update of skipped stages in KVstore.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25284) Spark UI: make sure skipped stages are updated onJobEnd

2018-08-31 Thread Juliusz Sompolski (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski resolved SPARK-25284.
---
Resolution: Duplicate

> Spark UI: make sure skipped stages are updated onJobEnd
> ---
>
> Key: SPARK-25284
> URL: https://issues.apache.org/jira/browse/SPARK-25284
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Tiny bug in onJobEnd not forcing update of skipped stages in KVstore.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-18 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-24013:
-

 Summary: ApproximatePercentile grinds to a halt on sorted input.
 Key: SPARK-24013
 URL: https://issues.apache.org/jira/browse/SPARK-24013
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Juliusz Sompolski


Running
{code}
sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid from 
range(1000))").collect()
{code}
takes 7 seconds, while
{code}
sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
{code}
grinds to a halt - processes the first million rows quickly, and then slows 
down to a few thousand rows per second (4m rows processed after 20 minutes).

Thread dumps show that it spends time in QuantileSummary.compress.
Seems it hits some edge case inefficiency when dealing with sorted data?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-18 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442477#comment-16442477
 ] 

Juliusz Sompolski commented on SPARK-24013:
---

This hits when trying to create histogram statistics (SPARK-21975) on columns 
like monotonically increasing id - histograms cannot be created in reasonable 
time.

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - processes the first million rows quickly, and then slows 
> down to a few thousands rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> Seems it hits some edge case inefficiency when dealing with sorted data?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448093#comment-16448093
 ] 

Juliusz Sompolski commented on SPARK-24013:
---

Hi [~mgaido]
I tested again on current master (afbdf427302aba858f95205ecef7667f412b2a6a) and 
I reproduce it:
 !screenshot-1.png! 

Maybe you need to bump up 100 to something higher when running on a bigger 
cluster that splits the range into more tasks?
For me it grinds to a halt after about 250 per task.

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - processes the first million rows quickly, and then slows 
> down to a few thousands rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> Seems it hits some edge case inefficiency when dealing with sorted data?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448093#comment-16448093
 ] 

Juliusz Sompolski edited comment on SPARK-24013 at 4/23/18 1:11 PM:


Hi [~mgaido]
I tested again on current master (afbdf427302aba858f95205ecef7667f412b2a6a) and 
I reproduce it:
 !screenshot-1.png! 

Maybe you need to bump up 1000 to something higher when running on a bigger 
cluster that splits the range into more tasks?
For me it grinds to a halt after about 250 per task.


was (Author: juliuszsompolski):
Hi [~mgaido]
I tested again on current master (afbdf427302aba858f95205ecef7667f412b2a6a) and 
I reproduce it:
 !screenshot-1.png! 

Maybe you need to bump up 100 to something higher when running on a bigger 
cluster that splits the range into more tasks?
For me it grinds to a halt after about 250 per task.

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - processes the first million rows quickly, and then slows 
> down to a few thousands rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> Seems it hits some edge case inefficiency when dealing with sorted data?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Juliusz Sompolski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-24013:
--
Attachment: screenshot-1.png

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - processes the first million rows quickly, and then slows 
> down to a few thousands rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> Seems it hits some edge case inefficiency when dealing with sorted data?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25968) Non-codegen Floor and Ceil fail for FloatType

2018-11-08 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-25968:
-

 Summary: Non-codegen Floor and Ceil fail for FloatType
 Key: SPARK-25968
 URL: https://issues.apache.org/jira/browse/SPARK-25968
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2, 2.2.2, 2.4.0
Reporter: Juliusz Sompolski


nullSafeEval of Floor and Ceil does not handle FloatType argument.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25968) Non-codegen Floor and Ceil fail for FloatType

2018-11-08 Thread Juliusz Sompolski (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski resolved SPARK-25968.
---
Resolution: Won't Fix

Ok, I see it's not supposed to handle it - the type gets promoted in the 
analyzer instead.

> Non-codegen Floor and Ceil fail for FloatType
> -
>
> Key: SPARK-25968
> URL: https://issues.apache.org/jira/browse/SPARK-25968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> nullSafeEval of Floor and Ceil does not handle FloatType argument.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26038) Decimal toScalaBigInt/toJavaBigInteger not work for decimals not fitting in long

2018-11-13 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-26038:
-

 Summary: Decimal toScalaBigInt/toJavaBigInteger not work for 
decimals not fitting in long
 Key: SPARK-26038
 URL: https://issues.apache.org/jira/browse/SPARK-26038
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.0, 2.2.0
Reporter: Juliusz Sompolski


Decimal toScalaBigInt/toJavaBigInteger just call toLong, which cannot represent 
decimals that do not fit in a long.
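A hedged sketch of the direction of a fix (not necessarily the exact patch):
{code}
import org.apache.spark.sql.types.Decimal

// Go through the underlying BigDecimal representation instead of truncating through toLong.
def toScalaBigInt(d: Decimal): BigInt = d.toBigDecimal.toBigInt
def toJavaBigInteger(d: Decimal): java.math.BigInteger = d.toJavaBigDecimal.toBigInteger
{code}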



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26622) Improve wording in SQLMetrics labels

2019-01-15 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-26622:
-

 Summary: Improve wording in SQLMetrics labels
 Key: SPARK-26622
 URL: https://issues.apache.org/jira/browse/SPARK-26622
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


Revise sql metrics labels to be more understandable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26159) Codegen for LocalTableScanExec

2018-11-23 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-26159:
-

 Summary: Codegen for LocalTableScanExec
 Key: SPARK-26159
 URL: https://issues.apache.org/jira/browse/SPARK-26159
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


Do codegen for LocalTableScanExec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API

2019-05-31 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-27899:
-

 Summary: Make HiveMetastoreClient.getTableObjectsByName available 
in ExternalCatalog/SessionCatalog API
 Key: SPARK-27899
 URL: https://issues.apache.org/jira/browse/SPARK-27899
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Juliusz Sompolski


The new Spark ThriftServer SparkGetTablesOperation implemented in 
https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata 
request for every table. This can get very slow for large schemas (~50ms per 
table with an external Hive metastore).
Hive ThriftServer GetTablesOperation uses 
HiveMetastoreClient.getTableObjectsByName to get table information in bulk, but 
we don't expose that through our APIs that go through Hive -> HiveClientImpl 
(HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> SessionCatalog.

If we added and exposed getTableObjectsByName through our catalog APIs, we 
could resolve that performance problem in SparkGetTablesOperation.
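A hedged sketch of the shape of the addition (trait and method names here are illustrative, not the final API):
{code}
import org.apache.spark.sql.catalyst.catalog.CatalogTable

trait BulkTableLookup {
  // existing style of lookup: one metastore round trip per table
  def getTable(db: String, table: String): CatalogTable
  // proposed bulk variant, backed by HiveMetastoreClient.getTableObjectsByName in the Hive implementation
  def getTablesByName(db: String, tables: Seq[String]): Seq[CatalogTable]
}
{code}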



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API

2019-05-31 Thread Juliusz Sompolski (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853083#comment-16853083
 ] 

Juliusz Sompolski commented on SPARK-27899:
---

cc [~LI,Xiao], [~yumwang]

> Make HiveMetastoreClient.getTableObjectsByName available in 
> ExternalCatalog/SessionCatalog API
> --
>
> Key: SPARK-27899
> URL: https://issues.apache.org/jira/browse/SPARK-27899
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> The new Spark ThriftServer SparkGetTablesOperation implemented in 
> https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata 
> request for every table. This can get very slow for large schemas (~50ms per 
> table with an external Hive metastore).
> Hive ThriftServer GetTablesOperation uses 
> HiveMetastoreClient.getTableObjectsByName to get table information in bulk, 
> but we don't expose that through our APIs that go through Hive -> 
> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> 
> SessionCatalog.
> If we added and exposed getTableObjectsByName through our catalog APIs, we 
> could resolve that performance problem in SparkGetTablesOperation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28910) Prevent schema verification when connecting to in memory derby

2019-08-29 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-28910:
-

 Summary: Prevent schema verification when connecting to in memory 
derby
 Key: SPARK-28910
 URL: https://issues.apache.org/jira/browse/SPARK-28910
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Juliusz Sompolski


When hive.metastore.schema.verification=true, HiveUtils.newClientForExecution 
fails with
{code}
19/08/14 13:26:55 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:186)
at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:143)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:290)
at org.apache.spark.sql.hive.HiveUtils$.newClientForExecution(HiveUtils.scala:275)
at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:58)
...
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
{code}

This prevents the Thriftserver from starting.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29056) ThriftServerSessionPage displays 1970/01/01 for queries that are not finished and not closed

2019-09-11 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-29056:
-

 Summary: ThriftServerSessionPage displays 1970/01/01 for queries 
that are not finished and not closed
 Key: SPARK-29056
 URL: https://issues.apache.org/jira/browse/SPARK-29056
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


Spark UI ODBC/JDBC tab session page displays 1970/01/01 (timestamp 0) as 
finish/close time for queries that haven't finished yet.

!image-2019-09-11-17-21-52-771.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29056) ThriftServerSessionPage displays 1970/01/01 for queries that are not finished and not closed

2019-09-11 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-29056:
--
Issue Type: Bug  (was: Improvement)

> ThriftServerSessionPage displays 1970/01/01 for queries that are not finished 
> and not closed
> 
>
> Key: SPARK-29056
> URL: https://issues.apache.org/jira/browse/SPARK-29056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Spark UI ODBC/JDBC tab session page displays 1970/01/01 (timestamp 0) as 
> finish/close time for queries that haven't finished yet.
> !image-2019-09-11-17-21-52-771.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29263) availableSlots in scheduler can change before being checked by barrier taskset

2019-09-26 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-29263:
-

 Summary: availableSlots in scheduler can change before being 
checked by barrier taskset
 Key: SPARK-29263
 URL: https://issues.apache.org/jira/browse/SPARK-29263
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


availableSlots is computed once before the loop in resourceOffers, but it changes 
in every iteration, so the value checked for a barrier taskset can be stale.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29349) Support FETCH_PRIOR in Thriftserver query results fetching

2019-11-12 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972369#comment-16972369
 ] 

Juliusz Sompolski commented on SPARK-29349:
---

[~runzhiwang] it is used by the Simba Spark ODBC driver to recover after a 
connection failure during fetching when AutoReconnect=1. It is available in 
Simba Spark ODBC driver v2.6.10 - I believe it is used only to recover the 
cursor position after reconnecting, not as a user-facing feature to allow 
fetching backwards.

> Support FETCH_PRIOR in Thriftserver query results fetching
> --
>
> Key: SPARK-29349
> URL: https://issues.apache.org/jira/browse/SPARK-29349
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> Support FETCH_PRIOR fetching in Thriftserver



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29349) Support FETCH_PRIOR in Thriftserver query results fetching

2019-10-03 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-29349:
-

 Summary: Support FETCH_PRIOR in Thriftserver query results fetching
 Key: SPARK-29349
 URL: https://issues.apache.org/jira/browse/SPARK-29349
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


Support FETCH_PRIOR fetching in Thriftserver



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31051) Thriftserver operations other than SparkExecuteStatementOperation do not call onOperationClosed

2020-03-05 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski resolved SPARK-31051.
---
Resolution: Not A Problem

I'm just blind. They are there.

> Thriftserver operations other than SparkExecuteStatementOperation do not call 
> onOperationClosed
> ---
>
> Key: SPARK-31051
> URL: https://issues.apache.org/jira/browse/SPARK-31051
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> In Spark 3.0 onOperationClosed was implemented in HiveThriftServer2Listener 
> to track closing the operation in the thriftserver (after the client finishes 
> fetching).
> However, it seems that only SparkExecuteStatementOperation calls it in its 
> close() function. Other operations need to do this as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31051) Thriftserver operations other than SparkExecuteStatementOperation do not call onOperationClosed

2020-03-05 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-31051:
-

 Summary: Thriftserver operations other than 
SparkExecuteStatementOperation do not call onOperationClosed
 Key: SPARK-31051
 URL: https://issues.apache.org/jira/browse/SPARK-31051
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


In Spark 3.0 onOperationClosed was implemented in HiveThriftServer2Listener to 
track closing the operation in the thriftserver (after the client finishes 
fetching).
However, it seems that only SparkExecuteStatementOperation calls it in its 
close() function. Other operations need to do this as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31388) org.apache.spark.sql.hive.thriftserver.CliSuite result matching is flaky

2020-04-08 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-31388:
-

 Summary: org.apache.spark.sql.hive.thriftserver.CliSuite result 
matching is flaky
 Key: SPARK-31388
 URL: https://issues.apache.org/jira/browse/SPARK-31388
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


CliSuite.runCliWithin result matching has issues. Will describe in PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31204) HiveResult compatibility for DatasourceV2 command

2020-03-20 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-31204:
-

 Summary: HiveResult compatibility for DatasourceV2 command
 Key: SPARK-31204
 URL: https://issues.apache.org/jira/browse/SPARK-31204
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


HiveResult performs some compatibility matches and conversions for commands to 
be compatible with Hive output, e.g.:

{code}
case ExecutedCommandExec(_: DescribeCommandBase) =>
  // If it is a describe command for a Hive table, we want to have the 
output format
  // be similar with Hive.
...
// SHOW TABLES in Hive only output table names, while ours output database, 
table name, isTemp.
case command @ ExecutedCommandExec(s: ShowTablesCommand) if !s.isExtended =>
{code}

It is needed for DatasourceV2 commands as well (e.g. ShowTablesExec...).




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31859) Thriftserver with spark.sql.datetime.java8API.enabled=true

2020-05-28 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-31859:
--
Issue Type: Bug  (was: Improvement)

> Thriftserver with spark.sql.datetime.java8API.enabled=true
> --
>
> Key: SPARK-31859
> URL: https://issues.apache.org/jira/browse/SPARK-31859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> {code}
>   test("spark.sql.datetime.java8API.enabled=true") {
> withJdbcStatement() { st =>
>   st.execute("set spark.sql.datetime.java8API.enabled=true")
>   val rs = st.executeQuery("select timestamp '2020-05-28 00:00:00'")
>   rs.next()
>   // scalastyle:off
>   println(rs.getObject(1))
> }
>   }
> {code}
> fails with 
> {code}
> HiveThriftBinaryServerSuite:
> java.lang.IllegalArgumentException: Timestamp format must be -mm-dd 
> hh:mm:ss[.f]
> at java.sql.Timestamp.valueOf(Timestamp.java:204)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:444)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:424)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.getObject(HiveBaseResultSet.java:464
> {code}
> It seems it might be needed in HiveResult.toHiveString?
> cc [~maxgekk]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31861) Thriftserver collecting timestamp not using spark.sql.session.timeZone

2020-05-28 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-31861:
-

 Summary: Thriftserver collecting timestamp not using 
spark.sql.session.timeZone
 Key: SPARK-31861
 URL: https://issues.apache.org/jira/browse/SPARK-31861
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


If a JDBC client is in the PST time zone, sets spark.sql.session.timeZone to PST, 
and sends the query "SELECT timestamp '2020-05-20 12:00:00'", while the JVM 
timezone of the Spark cluster is e.g. CET, then
- the timestamp literal in the query is interpreted as 12:00:00 PST, i.e. 
21:00:00 CET
- but currently, when it is returned, the timestamps are collected from the query 
with a collect() in 
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L299,
and in the end the Timestamps are turned into strings using t.toString() in 
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/ColumnValue.java#L138
Timestamp.toString uses the Spark cluster's JVM time zone, which results in 
"21:00:00" being returned to the JDBC application.
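A hedged sketch of the direction of a fix (not the actual patch): format timestamps on the server side with the session time zone instead of relying on Timestamp.toString, which uses the cluster JVM time zone.
{code}
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

def formatForClient(ts: Instant, sessionTimeZone: String): String =
  DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
    .withZone(ZoneId.of(sessionTimeZone))  // session time zone, e.g. "America/Los_Angeles"
    .format(ts)
{code}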



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31859) Thriftserver with spark.sql.datetime.java8API.enabled=true

2020-05-28 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119002#comment-17119002
 ] 

Juliusz Sompolski commented on SPARK-31859:
---

Actually, it's already in HiveResult.toHiveString...
What happens is:
- it's taken from the SparkRow in SparkExecuteStatementOperation.addNonNullColumnValue using 
"to += from.getAs[Timestamp](ordinal)". Because of type erasure this does not complain 
that the value is in fact an Instant, not a Timestamp.
- in ColumnValue.timestampValue it gets turned into a String as value.toString(), 
which works on any object and so also doesn't complain that it is an Instant, not a Timestamp.
- this gets returned to the client as a String, and the client complains that it 
cannot read it back into a Timestamp.

I will fix it together with https://issues.apache.org/jira/browse/SPARK-31861

> Thriftserver with spark.sql.datetime.java8API.enabled=true
> --
>
> Key: SPARK-31859
> URL: https://issues.apache.org/jira/browse/SPARK-31859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> {code}
>   test("spark.sql.datetime.java8API.enabled=true") {
> withJdbcStatement() { st =>
>   st.execute("set spark.sql.datetime.java8API.enabled=true")
>   val rs = st.executeQuery("select timestamp '2020-05-28 00:00:00'")
>   rs.next()
>   // scalastyle:off
>   println(rs.getObject(1))
> }
>   }
> {code}
> fails with 
> {code}
> HiveThriftBinaryServerSuite:
> java.lang.IllegalArgumentException: Timestamp format must be -mm-dd 
> hh:mm:ss[.f]
> at java.sql.Timestamp.valueOf(Timestamp.java:204)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:444)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:424)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.getObject(HiveBaseResultSet.java:464
> {code}
> It seems it might be needed in HiveResult.toHiveString?
> cc [~maxgekk]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31863) Thriftserver not setting active SparkSession, SQLConf.get not getting session configs correctly

2020-05-28 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-31863:
-

 Summary: Thriftserver not setting active SparkSession, SQLConf.get 
not getting session configs correctly
 Key: SPARK-31863
 URL: https://issues.apache.org/jira/browse/SPARK-31863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


Thriftserver is not setting the active SparkSession.
Because of that, configuration obtained with SQLConf.get is not the session 
configuration.
This makes many configs set by "set" in the session not work correctly.
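A hedged sketch of the fix idea (not the actual patch): make the session active for the duration of statement execution so that SQLConf.get resolves session-level configs.
{code}
import org.apache.spark.sql.SparkSession

def withActiveSession[T](session: SparkSession)(body: => T): T = {
  SparkSession.setActiveSession(session)
  try body finally SparkSession.clearActiveSession()
}
{code}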



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31859) Thriftserver with spark.sql.datetime.java8API.enabled=true

2020-05-28 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-31859:
-

 Summary: Thriftserver with spark.sql.datetime.java8API.enabled=true
 Key: SPARK-31859
 URL: https://issues.apache.org/jira/browse/SPARK-31859
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


{code}
  test("spark.sql.datetime.java8API.enabled=true") {
withJdbcStatement() { st =>
  st.execute("set spark.sql.datetime.java8API.enabled=true")
  val rs = st.executeQuery("select timestamp '2020-05-28 00:00:00'")
  rs.next()
  // scalastyle:off
  println(rs.getObject(1))
}
  }
{code}
fails with 
{code}
HiveThriftBinaryServerSuite:
java.lang.IllegalArgumentException: Timestamp format must be -mm-dd 
hh:mm:ss[.f]
at java.sql.Timestamp.valueOf(Timestamp.java:204)
at 
org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:444)
at 
org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:424)
at 
org.apache.hive.jdbc.HiveBaseResultSet.getObject(HiveBaseResultSet.java:464
{code}

It seems it might be needed in HiveResult.toHiveString?
cc [~maxgekk]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32132) Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0

2020-06-30 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148848#comment-17148848
 ] 

Juliusz Sompolski commented on SPARK-32132:
---

Also 2.4 adds "interval" at the start, while 3.0 does not. E.g. "interval 3 
days" in 2.4 and "3 days" in 3.0.
I actually think that the new 3.0 results are better / more standard, and I 
haven't heard about anyone complaining that it broke the way they parse it.

Edit: [~cloud_fan] posting now the above comment that I thought I had posted 
yesterday, but it stayed open and unsent in a browser tab. It causes some 
issues with unit tests, but I think it shouldn't cause real-world problems, and 
in any case the new format is likely better for the future. Thanks for 
explaining.

> Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0
> --
>
> Key: SPARK-32132
> URL: https://issues.apache.org/jira/browse/SPARK-32132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> In https://github.com/apache/spark/pull/26418, a setting 
> spark.sql.dialect.intervalOutputStyle was implemented, to control interval 
> output style. This PR also removed "toString" from CalendarInterval. This 
> change got reverted in https://github.com/apache/spark/pull/27304, and the 
> CalendarInterval.toString got implemented back in 
> https://github.com/apache/spark/pull/26572.
> But it behaves differently now: In 2.4 "4 weeks 2 days" are returned, and 3.0 
> returns "30 days".
> Thriftserver uses HiveResults.toHiveString, which uses 
> CalendarInterval.toString to return interval results as string.  The results 
> are now different in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32021) make_interval does not accept seconds >100

2020-06-18 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-32021:
--
Description: 
In make_interval(years, months, weeks, days, hours, mins, secs), secs are 
defined as Decimal(8, 6), which turns into null if the value of the expression 
overflows 100 seconds.
Larger seconds values should be allowed.

This has been reported by Simba, who wants to use make_interval to implement 
translation for TIMESTAMP_ADD ODBC function in Spark 3.0.
ODBC {fn TIMESTAMPADD(SECOND, integer_exp, timestamp)} fails when integer_exp 
returns seconds values >= 100.

  was:
In make_interval(years, months, weeks, days, hours, mins, secs), secs are 
defined as Decimal(8, 6), which turns into null if the value of the expression 
overflows 100 seconds.
Larger seconds values should be allowed.

This has been reported by Simba, who wants to use make_interval to implement 
translation for TIMESTAMP_ADD ODBC function in Spark 3.0.


> make_interval does not accept seconds >100
> --
>
> Key: SPARK-32021
> URL: https://issues.apache.org/jira/browse/SPARK-32021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> In make_interval(years, months, weeks, days, hours, mins, secs), secs are 
> defined as Decimal(8, 6), which turns into null if the value of the 
> expression overflows 100 seconds.
> Larger seconds values should be allowed.
> This has been reported by Simba, who wants to use make_interval to implement 
> translation for TIMESTAMP_ADD ODBC function in Spark 3.0.
> ODBC {fn TIMESTAMPADD(SECOND, integer_exp, timestamp)} fails when integer_exp 
> returns seconds values >= 100.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32021) make_interval does not accept seconds >100

2020-06-18 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-32021:
-

 Summary: make_interval does not accept seconds >100
 Key: SPARK-32021
 URL: https://issues.apache.org/jira/browse/SPARK-32021
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


In make_interval(years, months, weeks, days, hours, mins, secs), secs are 
defined as Decimal(8, 6), which turns into null if the value of the expression 
overflows 100 seconds.
Larger seconds values should be allowed.

This has been reported by Simba, who wants to use make_interval to implement 
translation for TIMESTAMP_ADD ODBC function in Spark 3.0.
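For illustration (values chosen per the description above, not from an actual test):
{code}
// secs is a Decimal(8, 6), so its maximum representable value is 99.999999:
spark.sql("SELECT make_interval(0, 0, 0, 0, 0, 0, 99.999999)").show()  // returns an interval
spark.sql("SELECT make_interval(0, 0, 0, 0, 0, 0, 100.0)").show()      // overflows -> null
{code}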



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32132) Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0

2020-06-29 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148086#comment-17148086
 ] 

Juliusz Sompolski commented on SPARK-32132:
---

cc [~hyukjin.kwon] [~cloud_fan] [~Qin Yao]

> Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0
> --
>
> Key: SPARK-32132
> URL: https://issues.apache.org/jira/browse/SPARK-32132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> In https://github.com/apache/spark/pull/26418, a setting 
> spark.sql.dialect.intervalOutputStyle was implemented, to control interval 
> output style. This PR also removed "toString" from CalendarInterval. This 
> change got reverted in https://github.com/apache/spark/pull/27304, and the 
> CalendarInterval.toString got implemented back in 
> https://github.com/apache/spark/pull/26572.
> But it behaves differently now: In 2.4 "4 weeks 2 days" are returned, and 3.0 
> returns "30 days".
> Thriftserver uses HiveResults.toHiveString, which uses 
> CalendarInterval.toString to return interval results as string.  The results 
> are now different in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32132) Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0

2020-06-29 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-32132:
-

 Summary: Thriftserver interval returns "4 weeks 2 days" in 2.4 and 
"30 days" in 3.0
 Key: SPARK-32132
 URL: https://issues.apache.org/jira/browse/SPARK-32132
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


In https://github.com/apache/spark/pull/26418, a setting 
spark.sql.dialect.intervalOutputStyle was implemented, to control interval 
output style. This PR also removed "toString" from CalendarInterval. This 
change got reverted in https://github.com/apache/spark/pull/27304, and the 
CalendarInterval.toString got implemented back in 
https://github.com/apache/spark/pull/26572.

But it behaves differently now: In 2.4 "4 weeks 2 days" are returned, and 3.0 
returns "30 days".
Thriftserver uses HiveResult.toHiveString, which uses 
CalendarInterval.toString to return interval results as strings. The results 
are now different in 3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34152) CreateViewStatement.child should be a real child

2021-01-18 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267166#comment-17267166
 ] 

Juliusz Sompolski commented on SPARK-34152:
---

Same applies to AlterViewStatement.

> CreateViewStatement.child should be a real child
> 
>
> Key: SPARK-34152
> URL: https://issues.apache.org/jira/browse/SPARK-34152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Similar to `CreateTableAsSelectStatement`, the input query of 
> `CreateViewStatement` should be a child and get analyzed during the analysis 
> phase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37090) Upgrade libthrift to resolve security vulnerabilities

2021-10-21 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-37090:
-

 Summary: Upgrade libthrift to resolve security vulnerabilities
 Key: SPARK-37090
 URL: https://issues.apache.org/jira/browse/SPARK-37090
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0, 3.1.0, 3.0.0, 3.3.0
Reporter: Juliusz Sompolski


Currently, Spark uses libthrift 0.12, which has reported high severity security 
vulnerabilities https://snyk.io/vuln/maven:org.apache.thrift%3Alibthrift
Upgrade to 0.14 to get rid of vulnerabilities.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37090) Upgrade libthrift to resolve security vulnerabilities

2021-10-21 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski resolved SPARK-37090.
---
Resolution: Duplicate

> Upgrade libthrift to resolve security vulnerabilities
> -
>
> Key: SPARK-37090
> URL: https://issues.apache.org/jira/browse/SPARK-37090
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Currently, Spark uses libthrift 0.12, which has reported high severity 
> security vulnerabilities 
> https://snyk.io/vuln/maven:org.apache.thrift%3Alibthrift
> Upgrade to 0.14 to get rid of vulnerabilities.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37090) Upgrade libthrift to resolve security vulnerabilities

2021-10-21 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432731#comment-17432731
 ] 

Juliusz Sompolski commented on SPARK-37090:
---

Duplicate of https://issues.apache.org/jira/browse/SPARK-36994

> Upgrade libthrift to resolve security vulnerabilities
> -
>
> Key: SPARK-37090
> URL: https://issues.apache.org/jira/browse/SPARK-37090
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Currently, Spark uses libthrift 0.12, which has reported high severity 
> security vulnerabilities 
> https://snyk.io/vuln/maven:org.apache.thrift%3Alibthrift
> Upgrade to 0.14 to get rid of vulnerabilities.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45133) Mark Spark Connect queries as finished when all result tasks are finished, not sent

2023-09-12 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-45133:
-

 Summary: Mark Spark Connect queries as finished when all result 
tasks are finished, not sent
 Key: SPARK-45133
 URL: https://issues.apache.org/jira/browse/SPARK-45133
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45167) Python Spark Connect client does not call `releaseAll`

2023-09-14 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-45167:
--
Epic Link: SPARK-43754  (was: SPARK-39375)

> Python Spark Connect client does not call `releaseAll`
> --
>
> Key: SPARK-45167
> URL: https://issues.apache.org/jira/browse/SPARK-45167
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Priority: Major
>
> The Python client does not call release all previous responses on the server 
> and thus does not properly close the queries.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45680) ReleaseSession to close Spark Connect session

2023-10-26 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-45680:
-

 Summary: ReleaseSession to close Spark Connect session
 Key: SPARK-45680
 URL: https://issues.apache.org/jira/browse/SPARK-45680
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Juliusz Sompolski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43754) Spark Connect Session & Query lifecycle

2023-10-26 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-43754:
--
Affects Version/s: 4.0.0

> Spark Connect Session & Query lifecycle
> ---
>
> Key: SPARK-43754
> URL: https://issues.apache.org/jira/browse/SPARK-43754
> Project: Spark
>  Issue Type: Epic
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Currently, queries in Spark Connect are executed within the RPC handler.
> We want to detach the RPC interface from actual sessions and execution, so 
> that we can make the interface more flexible
>  * maintain long running sessions, independent of unbroken GRPC channel
>  * be able to cancel queries
>  * have different interfaces to query results than push from server



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45780) Propagate all Spark Connect client threadlocal in InheritableThread

2023-11-03 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-45780:
-

 Summary: Propagate all Spark Connect client threadlocal in 
InheritableThread
 Key: SPARK-45780
 URL: https://issues.apache.org/jira/browse/SPARK-45780
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Juliusz Sompolski


Propagate all thread locals that can be set in SparkConnectClient, not only 
'tags'



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45206) Shutting down SparkConnectClient can have outstanding ReleaseExecute

2023-09-18 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-45206:
-

 Summary: Shutting down SparkConnectClient can have outstanding 
ReleaseExecute
 Key: SPARK-45206
 URL: https://issues.apache.org/jira/browse/SPARK-45206
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Juliusz Sompolski


In the Spark Connect Scala client, there can be outstanding asynchronous 
ReleaseExecute calls when the client is shut down.

In ExecutePlanResponseReattachableIterator we (ab)use a gRPC thread in 
createRetryingReleaseExecuteResponseObserver to run them.

When we do {{SparkConnectClient.shutdown}}, which calls {{channel.shutdownNow()}}, 
the channel is torn down before the ReleaseExecute finishes. Maybe a more graceful 
shutdown would work (see the sketch below), or we should move to an explicitly 
managed thread pool, like in Python.

See discussion in https://github.com/apache/spark/pull/42929/files#r1329071826
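
A graceful shutdown could look roughly like the sketch below, written against the 
plain io.grpc ManagedChannel API. The {{gracefulShutdown}} helper and the timeouts 
are illustrative assumptions, not the actual SparkConnectClient code.
{code:scala}
import java.util.concurrent.TimeUnit

import io.grpc.ManagedChannel

// Sketch: drain in-flight RPCs (e.g. the async ReleaseExecute) before forcing
// the channel down. `channel` stands in for the client's gRPC channel.
def gracefulShutdown(channel: ManagedChannel, timeoutMs: Long = 5000L): Unit = {
  channel.shutdown()   // stop accepting new RPCs, let existing ones finish
  if (!channel.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
    channel.shutdownNow()   // force-cancel whatever is still outstanding
    channel.awaitTermination(1000L, TimeUnit.MILLISECONDS)
  }
}
{code}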
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45416) Sanity check that Spark Connect returns arrow batches in order

2023-10-04 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-45416:
-

 Summary: Sanity check that Spark Connect returns arrow batches in 
order
 Key: SPARK-45416
 URL: https://issues.apache.org/jira/browse/SPARK-45416
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Juliusz Sompolski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45435) Document that lazy checkpoint may cause non-determinism

2023-10-06 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-45435:
-

 Summary: Document that lazy checkpoint may cause non-determinism
 Key: SPARK-45435
 URL: https://issues.apache.org/jira/browse/SPARK-45435
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core, SQL
Affects Versions: 4.0.0
Reporter: Juliusz Sompolski


Some people may want to use checkpoint to get a consistent snapshot of a 
Dataset / RDD. Warn that this is not the case with a lazy checkpoint, because the 
checkpoint is only materialized at the end of the first action, and the data used 
during that first action may differ due to non-determinism and retries (see the 
sketch below).
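
A minimal sketch of the difference (the checkpoint directory, the dataset, and the 
non-deterministic column are illustrative):
{code:scala}
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

// A Dataset with a non-deterministic column.
val df = spark.range(0, 1000).selectExpr("id", "rand() AS r")

// Eager checkpoint: materialized right here, so later actions all see one snapshot.
val eagerSnapshot = df.checkpoint()

// Lazy checkpoint: only materialized by the first action that touches it. If tasks
// of that first action are retried, rand() may be recomputed, so the checkpointed
// data may differ from what that first action itself observed.
val lazySnapshot = df.checkpoint(eager = false)
lazySnapshot.count()   // the checkpoint is computed as a side effect of this action
{code}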



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45435) Document that lazy checkpoint may not be a consistent snapshot

2023-10-06 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-45435:
--
Summary: Document that lazy checkpoint may not be a consistent snapshot  (was: 
Document that lazy checkpoint may cause non-determinism)

> Document that lazy checkpoint may not be a consistent snapshot
> -
>
> Key: SPARK-45435
> URL: https://issues.apache.org/jira/browse/SPARK-45435
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>  Labels: pull-request-available
>
> Some people may want to use checkpoint to get a consistent snapshot of a 
> Dataset / RDD. Warn that this is not the case with a lazy checkpoint, because 
> the checkpoint is only materialized at the end of the first action, and the 
> data used during that first action may differ due to non-determinism and 
> retries.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45647) Spark Connect API to propagate per request context

2023-10-24 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-45647:
-

 Summary: Spark Connect API to propagate per request context
 Key: SPARK-45647
 URL: https://issues.apache.org/jira/browse/SPARK-45647
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Juliusz Sompolski


There is an extension point to pass arbitrary proto extensions in the Spark Connect 
UserContext, but there is no API to do this in the client. Add a SparkSession 
API to attach extra protos that will be sent with all requests (see the 
hypothetical sketch below).
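
Purely as an illustration, the client-side API could look something like the sketch 
below. The method name {{addClientRequestExtension}} and its placement on the 
SparkSession are assumptions made up for this sketch; only the UserContext 
extension field itself exists today.
{code:scala}
import com.google.protobuf.{Any => ProtoAny, StringValue}

// Hypothetical usage: pack an arbitrary proto and ask the session to attach it
// to the UserContext of every request it sends.
val ext: ProtoAny = ProtoAny.pack(StringValue.of("trace-id=abc-123"))

// spark.addClientRequestExtension(ext)   // assumed API, does not exist yet
// spark.sql("SELECT 1").collect()        // every RPC would then carry the extension
{code}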



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44730) Spark Connect: Cleaner thread not stopped when SparkSession stops

2023-08-17 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski resolved SPARK-44730.
---
Resolution: Not A Problem

> Spark Connect: Cleaner thread not stopped when SparkSession stops
> -
>
> Key: SPARK-44730
> URL: https://issues.apache.org/jira/browse/SPARK-44730
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> Spark Connect scala client SparkSession has a cleaner, which starts a daemon 
> thread to clean up Closeable objects after GC. This daemon thread is never 
> stopped, and every SparkSession creates a new one.
> Cleaner implements a stop() function, but no-one ever calls it. Possibly 
> because even after SparkSession.stop(), the cleaner may still be needed when 
> remaining references are GCed... For this reason it seems that the Cleaner 
> should rather be a global singleton than within a session.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44730) Spark Connect: Cleaner thread not stopped when SparkSession stops

2023-08-17 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755613#comment-17755613
 ] 

Juliusz Sompolski commented on SPARK-44730:
---

Silly me, the Cleaner is already global; I was looking at the wrong place.

> Spark Connect: Cleaner thread not stopped when SparkSession stops
> -
>
> Key: SPARK-44730
> URL: https://issues.apache.org/jira/browse/SPARK-44730
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> Spark Connect scala client SparkSession has a cleaner, which starts a daemon 
> thread to clean up Closeable objects after GC. This daemon thread is never 
> stopped, and every SparkSession creates a new one.
> Cleaner implements a stop() function, but no-one ever calls it. Possibly 
> because even after SparkSession.stop(), the cleaner may still be needed when 
> remaining references are GCed... For this reason it seems that the Cleaner 
> should rather be a global singleton than within a session.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44855) Small tweaks to attaching ExecuteGrpcResponseSender to ExecuteResponseObserver

2023-08-17 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44855:
-

 Summary: Small tweaks to attaching ExecuteGrpcResponseSender to 
ExecuteResponseObserver
 Key: SPARK-44855
 URL: https://issues.apache.org/jira/browse/SPARK-44855
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Juliusz Sompolski


Small improvements can be made to the way a new ExecuteGrpcResponseSender is 
attached to the observer.
 * Now that we have addGrpcResponseSender in ExecuteHolder, it should be 
ExecuteHolder's responsibility to interrupt the old sender and to ensure there is 
only one at a time, rather than ExecuteResponseObserver's responsibility.
 * executeObserver is used as the lock for synchronization. An explicit lock object 
could be better (see the sketch below).
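
A generic sketch of the second point, i.e. synchronizing on a dedicated lock object 
instead of on the observer instance itself (class and member names are 
illustrative, not the actual ExecuteResponseObserver code):
{code:scala}
class ResponseBufferLike[T] {
  // Explicit, private lock: callers cannot accidentally synchronize on this object
  // from the outside, unlike when the instance itself is used as the monitor.
  private val lock = new Object

  private var responses: List[T] = Nil

  def onNext(response: T): Unit = lock.synchronized {
    responses = response :: responses
  }

  def consumeAll(consume: T => Unit): Unit = lock.synchronized {
    responses.reverse.foreach(consume)
    responses = Nil
  }
}
{code}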



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44765) Make ReleaseExecute retry in ExecutePlanResponseReattachableIterator reuse common mechanism

2023-08-16 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski resolved SPARK-44765.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Make ReleaseExecute retry in ExecutePlanResponseReattachableIterator reuse 
> common mechanism
> ---
>
> Key: SPARK-44765
> URL: https://issues.apache.org/jira/browse/SPARK-44765
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44833) Spark Connect reattach when initial ExecutePlan didn't reach server doing too eager Reattach

2023-08-16 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44833:
-

 Summary: Spark Connect reattach when initial ExecutePlan didn't 
reach server doing too eager Reattach
 Key: SPARK-44833
 URL: https://issues.apache.org/jira/browse/SPARK-44833
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski


In
{code:scala}
case ex: StatusRuntimeException
if Option(StatusProto.fromThrowable(ex))
  .exists(_.getMessage.contains("INVALID_HANDLE.OPERATION_NOT_FOUND")) =>
  if (lastReturnedResponseId.isDefined) {
throw new IllegalStateException(
  "OPERATION_NOT_FOUND on the server but responses were already received 
from it.",
  ex)
  }
  // Try a new ExecutePlan, and throw upstream for retry.
->  iter = rawBlockingStub.executePlan(initialRequest)
->  throw new GrpcRetryHandler.RetryException {code}
we call executePlan and throw a RetryException to have the exception handled 
upstream.

Then it goes to
{code:scala}
retry {
  if (firstTry) {
// on first try, we use the existing iter.
firstTry = false
  } else {
// on retry, the iter is borked, so we need a new one
->iter = rawBlockingStub.reattachExecute(createReattachExecuteRequest())
  } {code}
and because it is not the first try, it immediately does a reattach.

This causes no failure: the reattach will work and attach to the query, and the 
original executePlan will get detached. But it could be improved.

The same issue is also present in the Python client's reattach.py.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44835) SparkConnect ReattachExecute could raise before ExecutePlan even attaches.

2023-08-16 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44835:
-

 Summary: SparkConnect ReattachExecute could raise before 
ExecutePlan even attaches.
 Key: SPARK-44835
 URL: https://issues.apache.org/jira/browse/SPARK-44835
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski


If a ReattachExecute is sent very quickly after ExecutePlan, the following 
could happen:
 * ExecutePlan didn't reach 
*executeHolder.runGrpcResponseSender(responseSender)* in 
SparkConnectExecutePlanHandler yet.
 * ReattachExecute races around and reaches 
*executeHolder.runGrpcResponseSender(responseSender)* in 
SparkConnectReattachExecuteHandler first.
 * When ExecutePlan reaches 
{*}executeHolder.runGrpcResponseSender(responseSender){*}, and 
executionObserver.attachConsumer(this) is called in the ExecuteGrpcResponseSender 
of ExecutePlan, it will kick out the ExecuteGrpcResponseSender of 
ReattachExecute.

So even though ReattachExecute came later, it will get interrupted by the 
earlier ExecutePlan and finish with a *SparkSQLException(errorClass = 
"INVALID_CURSOR.DISCONNECTED", Map.empty)* (a situation that was assumed to mean a 
stale hanging RPC being replaced by a reconnection).

 

That would be very unlikely to happen in practice, because ExecutePlan 
shouldn't be abandoned so fast, but because of 
https://issues.apache.org/jira/browse/SPARK-44833 it is slightly more likely 
(though there is also a 50 ms sleep before the retry, which again makes it 
unlikely).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44872) Testing reattachable execute

2023-08-18 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44872:
-

 Summary: Testing reattachable execute
 Key: SPARK-44872
 URL: https://issues.apache.org/jira/browse/SPARK-44872
 Project: Spark
  Issue Type: Task
  Components: Connect
Affects Versions: 3.5.0, 4.0.0
Reporter: Juliusz Sompolski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44849) Expose SparkConnectExecutionManager.listActiveExecutions

2023-08-17 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44849:
-

 Summary: Expose SparkConnectExecutionManager.listActiveExecutions
 Key: SPARK-44849
 URL: https://issues.apache.org/jira/browse/SPARK-44849
 Project: Spark
  Issue Type: Task
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44784) Failure in testing `SparkSessionE2ESuite` using Maven

2023-08-14 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753974#comment-17753974
 ] 

Juliusz Sompolski commented on SPARK-44784:
---

This is a classloader issue: the serialization of a UDF pulls in classes from the 
client's class context that it doesn't necessarily need, resulting in a 
ClassNotFoundException on the server:
org.apache.spark.SparkException: org/apache/spark/sql/connect/client/SparkResult
It's not specifically related to the tests in that suite, but can happen anywhere 
UDFs are used.

> Failure in testing `SparkSessionE2ESuite` using Maven
> -
>
> Key: SPARK-44784
> URL: https://issues.apache.org/jira/browse/SPARK-44784
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Tests
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Blocker
>
> [https://github.com/apache/spark/actions/runs/5832898984/job/15819181762]
>  
> The following failures exist in the daily Maven tests, we should fix them 
> before Apache Spark 3.5.0 release:
> {code:java}
> SparkSessionE2ESuite:
> 4638- interrupt all - background queries, foreground interrupt *** FAILED ***
> 4639  The code passed to eventually never returned normally. Attempted 30 
> times over 20.092924822 seconds. Last failure message: Some("unexpected 
> failure in q1: org.apache.spark.SparkException: 
> org/apache/spark/sql/connect/client/SparkResult") was not empty Error not 
> empty: Some(unexpected failure in q1: org.apache.spark.SparkException: 
> org/apache/spark/sql/connect/client/SparkResult). 
> (SparkSessionE2ESuite.scala:71)
> 4640- interrupt all - foreground queries, background interrupt *** FAILED ***
> 4641  "org/apache/spark/sql/connect/client/SparkResult" did not contain 
> "OPERATION_CANCELED" Unexpected exception: org.apache.spark.SparkException: 
> org/apache/spark/sql/connect/client/SparkResult 
> (SparkSessionE2ESuite.scala:105)
> 4642- interrupt tag *** FAILED ***
> 4643  The code passed to eventually never returned normally. Attempted 30 
> times over 20.069445587 seconds. Last failure message: ListBuffer() had 
> length 0 instead of expected length 2 Interrupted operations: ListBuffer().. 
> (SparkSessionE2ESuite.scala:199)
> 4644- interrupt operation *** FAILED ***
> 4645  org.apache.spark.SparkException: 
> org/apache/spark/sql/connect/client/SparkResult
> 4646  at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:89)
> 4647  at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:38)
> 4648  at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:46)
> 4649  at 
> org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:83)
> 4650  at 
> org.apache.spark.sql.connect.client.SparkResult.operationId(SparkResult.scala:174)
> 4651  at 
> org.apache.spark.sql.SparkSessionE2ESuite.$anonfun$new$31(SparkSessionE2ESuite.scala:243)
> 4652  at 
> org.apache.spark.sql.connect.client.util.RemoteSparkSession.$anonfun$test$1(RemoteSparkSession.scala:243)
> 4653  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> 4654  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> 4655  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> 4656  ... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40918) Mismatch between ParquetFileFormat and FileSourceScanExec in # columns for WSCG.isTooManyFields when using _metadata

2022-10-26 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-40918:
-

 Summary: Mismatch between ParquetFileFormat and FileSourceScanExec 
in # columns for WSCG.isTooManyFields when using _metadata
 Key: SPARK-40918
 URL: https://issues.apache.org/jira/browse/SPARK-40918
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Juliusz Sompolski


The _metadata columns are taken into account in FileSourceScanExec.supportColumnar, 
but not when the Parquet reader is created. This can result in the Parquet reader 
outputting columnar batches (because it sees fewer columns than the 
WSCG.isTooManyFields limit), whereas FileSourceScanExec wants row output (because 
with the extra metadata columns it hits the isTooManyFields limit).

I have a fix forthcoming. A repro sketch of the issue follows below.
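
A hedged repro sketch of the kind of query that could hit the mismatch. The path, 
the table width and the column names are illustrative; the idea is simply that the 
scan is just under the WholeStageCodegen field limit (spark.sql.codegen.maxFields, 
100 by default) until the hidden _metadata struct is added.
{code:scala}
import org.apache.spark.sql.functions.col

// 99 columns: narrow enough for the Parquet reader on its own, but adding
// _metadata pushes FileSourceScanExec over the isTooManyFields threshold.
val wide = spark.range(10).select((0 until 99).map(i => col("id").as(s"c$i")): _*)
wide.write.mode("overwrite").parquet("/tmp/wide_parquet")

spark.read.parquet("/tmp/wide_parquet")
  .select(col("*"), col("_metadata"))
  .collect()
{code}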



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20624) SPIP: Add better handling for node shutdown

2022-09-13 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603577#comment-17603577
 ] 

Juliusz Sompolski commented on SPARK-20624:
---

[~holden] Are these new APIs documented? I can't seem to find them in the 
official Spark documentation.
Should they be mentioned e.g. in 
https://spark.apache.org/docs/latest/job-scheduling.html#graceful-decommission-of-executors
 ?

> SPIP: Add better handling for node shutdown
> ---
>
> Key: SPARK-20624
> URL: https://issues.apache.org/jira/browse/SPARK-20624
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>
> While we've done some good work with better handling when Spark is choosing 
> to decommission nodes (SPARK-7955), it might make sense in environments where 
> we get preempted without our own choice (e.g. YARN over-commit, EC2 spot 
> instances, GCE Preemptiable instances, etc.) to do something for the data on 
> the node (or at least not schedule any new tasks).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41497) Accumulator undercounting in the case of retry task with rdd cache

2023-01-02 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653648#comment-17653648
 ] 

Juliusz Sompolski commented on SPARK-41497:
---

Note that this issue leads to a correctness issue in Delta Merge, because it 
depends on a SetAccumulator as a side output channel for gathering files that 
need to be rewritten by the Merge: 
https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala#L445-L449

Missing some files there can result in duplicate records being inserted instead 
of existing records being updated.

> Accumulator undercounting in the case of retry task with rdd cache
> --
>
> Key: SPARK-41497
> URL: https://issues.apache.org/jira/browse/SPARK-41497
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.2, 3.3.1
>Reporter: wuyi
>Priority: Major
>
> An accumulator could be undercounted when a retried task has an rdd cache. See 
> the example below; a complete and reproducible example can be found at 
> [https://github.com/apache/spark/compare/master...Ngone51:spark:fix-acc]
>   
> {code:scala}
> test("SPARK-XXX") {
>   // Set up a cluster with 2 executors
>   val conf = new SparkConf()
> .setMaster("local-cluster[2, 1, 
> 1024]").setAppName("TaskSchedulerImplSuite")
>   sc = new SparkContext(conf)
>   // Set up a custom task scheduler. The scheduler will fail the first task 
> attempt of the job
>   // submitted below. In particular, the failed first attempt task would 
> succeed in its computation
>   // (accumulator accounting, result caching) but only fail to report its 
> success status due
>   // to the concurrent executor being lost. The second task attempt would succeed.
>   taskScheduler = setupSchedulerWithCustomStatusUpdate(sc)
>   val myAcc = sc.longAccumulator("myAcc")
>   // Initiate a rdd with only one partition so there's only one task and 
> specify the storage level
>   // with MEMORY_ONLY_2 so that the rdd result will be cached on both two 
> executors.
>   val rdd = sc.parallelize(0 until 10, 1).mapPartitions { iter =>
> myAcc.add(100)
> iter.map(x => x + 1)
>   }.persist(StorageLevel.MEMORY_ONLY_2)
>   // This will pass since the second task attempt will succeed
>   assert(rdd.count() === 10)
>   // This will fail due to `myAcc.add(100)` won't be executed during the 
> second task attempt's
>   // execution. Because the second task attempt will load the rdd cache 
> directly instead of
>   // executing the task function so `myAcc.add(100)` is skipped.
>   assert(myAcc.value === 100)
> } {code}
>  
> We could also hit this issue with decommission even if the rdd only has one 
> copy. For example, decommission could migrate the rdd cache block to another 
> executor (the result is actually the same with 2 copies) and the 
> decommissioned executor lost before the task reports its success status to 
> the driver. 
>  
> And the issue is a bit more complicated to fix than expected. I have tried 
> several fixes, but none of them is ideal:
> Option 1: Clean up any rdd cache related to the failed task: in practice, 
> this option can already fix the issue in most cases. However, theoretically, 
> rdd cache could be reported to the driver right after the driver cleans up 
> the failed task's caches due to asynchronous communication. So this option 
> can’t resolve the issue thoroughly;
> Option 2: Disallow rdd cache reuse across the task attempts for the same 
> task: this option can 100% fix the issue. The problem is this way can also 
> affect the case where rdd cache can be reused across the attempts (e.g., when 
> there is no accumulator operation in the task), which can have perf 
> regression;
> Option 3: Introduce accumulator cache: first, this requires a new framework 
> for supporting accumulator cache; second, the driver should improve its logic 
> to distinguish whether the accumulator cache value should be reported to the 
> user to avoid overcounting. For example, in the case of task retry, the value 
> should be reported. However, in the case of rdd cache reuse, the value 
> shouldn’t be reported (should it?);
> Option 4: Do task success validation when a task trying to load the rdd 
> cache: this way defines a rdd cache is only valid/accessible if the task has 
> succeeded. This way could be either overkill or a bit complex (because 
> currently Spark would clean up the task state once it’s finished. So we need 
> to maintain a structure to know if task once succeeded or not. )



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Comment Edited] (SPARK-41497) Accumulator undercounting in the case of retry task with rdd cache

2023-01-02 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653648#comment-17653648
 ] 

Juliusz Sompolski edited comment on SPARK-41497 at 1/2/23 4:22 PM:
---

Note that this issue leads to a correctness issue in Delta Merge, because it 
depends on a SetAccumulator as a side output channel for gathering files that 
need to be rewritten by the Merge: 
https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala#L445-L449
Delta assumes that Spark accumulators can overcount (in some cases where task 
retries update them in duplicate), but it was assumed that they would never 
undercount and lose output like that...

Missing some files there can result in duplicate records being inserted instead 
of existing records being updated.


was (Author: juliuszsompolski):
Note that this issue leads to a correctness issue in Delta Merge, because it 
depends on a SetAccumulator as a side output channel for gathering files that 
need to be rewritten by the Merge: 
https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala#L445-L449

Missing some files there can result in duplicate records being inserted instead 
of existing records being updated.

> Accumulator undercounting in the case of retry task with rdd cache
> --
>
> Key: SPARK-41497
> URL: https://issues.apache.org/jira/browse/SPARK-41497
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.2, 3.3.1
>Reporter: wuyi
>Priority: Major
>
> An accumulator could be undercounted when a retried task has an rdd cache. See 
> the example below; a complete and reproducible example can be found at 
> [https://github.com/apache/spark/compare/master...Ngone51:spark:fix-acc]
>   
> {code:scala}
> test("SPARK-XXX") {
>   // Set up a cluster with 2 executors
>   val conf = new SparkConf()
> .setMaster("local-cluster[2, 1, 
> 1024]").setAppName("TaskSchedulerImplSuite")
>   sc = new SparkContext(conf)
>   // Set up a custom task scheduler. The scheduler will fail the first task 
> attempt of the job
>   // submitted below. In particular, the failed first attempt task would 
> succeed in its computation
>   // (accumulator accounting, result caching) but only fail to report its 
> success status due
>   // to the concurrent executor being lost. The second task attempt would succeed.
>   taskScheduler = setupSchedulerWithCustomStatusUpdate(sc)
>   val myAcc = sc.longAccumulator("myAcc")
>   // Initiate a rdd with only one partition so there's only one task and 
> specify the storage level
>   // with MEMORY_ONLY_2 so that the rdd result will be cached on both two 
> executors.
>   val rdd = sc.parallelize(0 until 10, 1).mapPartitions { iter =>
> myAcc.add(100)
> iter.map(x => x + 1)
>   }.persist(StorageLevel.MEMORY_ONLY_2)
>   // This will pass since the second task attempt will succeed
>   assert(rdd.count() === 10)
>   // This will fail due to `myAcc.add(100)` won't be executed during the 
> second task attempt's
>   // execution. Because the second task attempt will load the rdd cache 
> directly instead of
>   // executing the task function so `myAcc.add(100)` is skipped.
>   assert(myAcc.value === 100)
> } {code}
>  
> We could also hit this issue with decommission even if the rdd only has one 
> copy. For example, decommission could migrate the rdd cache block to another 
> executor (the result is actually the same with 2 copies) and the 
> decommissioned executor lost before the task reports its success status to 
> the driver. 
>  
> And the issue is a bit more complicated to fix than expected. I have tried 
> several fixes, but none of them is ideal:
> Option 1: Clean up any rdd cache related to the failed task: in practice, 
> this option can already fix the issue in most cases. However, theoretically, 
> rdd cache could be reported to the driver right after the driver cleans up 
> the failed task's caches due to asynchronous communication. So this option 
> can’t resolve the issue thoroughly;
> Option 2: Disallow rdd cache reuse across the task attempts for the same 
> task: this option can 100% fix the issue. The problem is this way can also 
> affect the case where rdd cache can be reused across the attempts (e.g., when 
> there is no accumulator operation in the task), which can have perf 
> regression;
> Option 3: Introduce accumulator cache: first, this requires a new framework 
> for supporting accumulator cache; second, the driver should improve its logic 
> to distinguish whether the accumulator cache value should be reported to the 
> user to avoid overcounting. For example, in the case of task retry, the value 
> should be reported. However, in the case of rdd cache reuse, the value 
> shouldn't be reported (should it?);
> Option 4: Do task success validation when a task trying to load the rdd 
> cache: this way defines a rdd cache is only valid/accessible if the task has 
> succeeded. This way could be either overkill or a bit complex (because 
> currently Spark would clean up the task state once it's finished. So we need 
> to maintain a structure to know if task once succeeded or not. )

[jira] [Created] (SPARK-43331) Spark Connect - SparkSession interruptAll

2023-05-01 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-43331:
-

 Summary: Spark Connect - SparkSession interruptAll
 Key: SPARK-43331
 URL: https://issues.apache.org/jira/browse/SPARK-43331
 Project: Spark
  Issue Type: Story
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski


Add an "interruptAll" Api to client SparkSession, to allow interrupting all 
running executions in Spark Connect.
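
A hedged usage sketch (the query is illustrative, and the assumption is that 
interruptAll returns the ids of the operations it interrupted):
{code:scala}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Start a long-running query in the background...
val longQuery = Future {
  spark.sql("SELECT count(*) FROM range(10000000000) AS a CROSS JOIN range(1000) AS b").collect()
}

// ...and later interrupt everything running in this Spark Connect session.
val interruptedIds = spark.interruptAll()
{code}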



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44424) Reattach to existing execute in Spark Connect (python client)

2023-07-14 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44424:
-

 Summary: Reattach to existing execute in Spark Connect (python 
client)
 Key: SPARK-44424
 URL: https://issues.apache.org/jira/browse/SPARK-44424
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski


Python client for https://issues.apache.org/jira/browse/SPARK-44421



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44423) Reattach to existing execute in Spark Connect (scala client)

2023-07-14 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44423:
-

 Summary: Reattach to existing execute in Spark Connect (scala 
client)
 Key: SPARK-44423
 URL: https://issues.apache.org/jira/browse/SPARK-44423
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski


Scala client for https://issues.apache.org/jira/browse/SPARK-44421



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44422) Fine grained interrupt in Spark Connect

2023-07-14 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44422:
-

 Summary: Fine grained interrupt in Spark Connect
 Key: SPARK-44422
 URL: https://issues.apache.org/jira/browse/SPARK-44422
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski


In addition to SparkSession.interruptAll, provide a mechanism for interrupting:
 * individual queries
 * user-defined groups of queries in a session (by a tag)
A usage sketch follows after this list.
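
A hedged usage sketch of the proposed API surface. The method names are assumptions 
following this proposal; the tag value, tables and the operation id are 
illustrative.
{code:scala}
// Tag executions started from this thread.
spark.addTag("nightly-etl")
try {
  spark.sql("INSERT INTO target SELECT * FROM huge_source")   // long-running work
} finally {
  spark.removeTag("nightly-etl")
}

// From another thread:
// spark.interruptTag("nightly-etl")        // interrupt only the queries carrying this tag
// spark.interruptOperation(operationId)    // or interrupt one specific execution by id
{code}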



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44421) Reattach to existing execute in Spark Connect (server mechanism)

2023-07-14 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-44421:
--
Summary: Reattach to existing execute in Spark Connect (server mechanism)  
(was: Reattach to existing execute in Spark Connect)

> Reattach to existing execute in Spark Connect (server mechanism)
> 
>
> Key: SPARK-44421
> URL: https://issues.apache.org/jira/browse/SPARK-44421
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> If the ExecutePlan response stream gets broken, provide a mechanism to 
> reattach to the execution with a new RPC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44425) Validate that session_id is a UUID

2023-07-14 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44425:
-

 Summary: Validate that session_id is a UUID
 Key: SPARK-44425
 URL: https://issues.apache.org/jira/browse/SPARK-44425
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski


Add validation that session_id is a UUID. This is currently the case in the 
clients, so we could make it a requirement (a minimal sketch follows below).
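
A minimal sketch of such a validation (the error message is illustrative):
{code:scala}
import java.util.UUID

def validateSessionId(sessionId: String): Unit = {
  try {
    UUID.fromString(sessionId)   // throws IllegalArgumentException if not a valid UUID
  } catch {
    case _: IllegalArgumentException =>
      throw new IllegalArgumentException(s"session_id must be a UUID, got: $sessionId")
  }
}
{code}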



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44421) Reattach to existing execute in Spark Connect

2023-07-14 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44421:
-

 Summary: Reattach to existing execute in Spark Connect
 Key: SPARK-44421
 URL: https://issues.apache.org/jira/browse/SPARK-44421
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski


If the ExecutePlan response stream gets broken, provide a mechanism to reattach 
to the execution with a new RPC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43923) [CONNECT] Post listenerBus events during ExecutePlanRequest

2023-07-14 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-43923:
--
Issue Type: New Feature  (was: Bug)

> [CONNECT] Post listenerBus events during ExecutePlanRequest
> ---
>
> Key: SPARK-43923
> URL: https://issues.apache.org/jira/browse/SPARK-43923
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Jean-Francois Desjeans Gauthier
>Priority: Major
>
> Post events SparkListenerConnectOperationStarted, 
> SparkListenerConnectOperationParsed, SparkListenerConnectOperationCanceled,  
> SparkListenerConnectOperationFailed, SparkListenerConnectOperationFinished, 
> SparkListenerConnectOperationClosed & SparkListenerConnectSessionClosed.
> Mirror what is currently available in HiveThriftServer2EventManager.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43952) Cancel Spark jobs not only by a single "jobgroup", but allow multiple "job tags"

2023-06-02 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728791#comment-17728791
 ] 

Juliusz Sompolski commented on SPARK-43952:
---

Indirectly related to https://issues.apache.org/jira/browse/SPARK-43754, to allow 
Spark Connect cancellation of queries to not conflict with other places that set 
job groups.

> Cancel Spark jobs not only by a single "jobgroup", but allow multiple "job 
> tags"
> 
>
> Key: SPARK-43952
> URL: https://issues.apache.org/jira/browse/SPARK-43952
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Currently, the only way to cancel running Spark Jobs is by using 
> SparkContext.cancelJobGroup, using a job group name that was previously set 
> using SparkContext.setJobGroup. This is problematic if multiple different 
> parts of the system want to do cancellation, and set their own ids.
> For example, 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L133]
>  sets its own job group, which may override a job group set by the user. This 
> way, if the user cancels the job group they set, it will not cancel the 
> broadcast jobs launched from within their jobs...
> As a solution, consider adding SparkContext.addJobTag / 
> SparkContext.removeJobTag, which would allow having multiple "tags" on the 
> jobs, and introduce SparkContext.cancelJobsByTag to allow more flexible 
> cancelling of jobs (see the sketch below).
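
A hedged sketch of how the proposed API could be used, assuming {{sc}} is the 
SparkContext (the method names follow the proposal above; the tag and the job are 
illustrative):
{code:scala}
// Tag jobs submitted from this thread.
sc.addJobTag("my-report")
val result = sc.parallelize(1 to 1000000).map(_ * 2).count()
sc.removeJobTag("my-report")

// From another thread, cancel only the jobs carrying the tag, leaving
// everything else (e.g. broadcast jobs with their own job group) alone:
// sc.cancelJobsByTag("my-report")
{code}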



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


