[jira] [Updated] (SPARK-27221) Improve the assert error message in TreeNode

2019-03-20 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27221:
-
Description: 
TreeNode.parseToJson may throw an assert error without any error message when a 
TreeNode is not implemented properly, which makes it hard to find the bad 
TreeNode implementation.

It's better to improve the error message to indicate the type of TreeNode.
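As a rough sketch of the kind of message that would help (this is not the actual Spark patch; the object and helper names below are made up for illustration), including the node's class in the assertion message is enough to identify the offending implementation:

{code:scala}
object AssertMessageSketch {
  // Hypothetical helper standing in for the failing check inside TreeNode.parseToJson.
  def checkConvertible(node: AnyRef, handled: Boolean): Unit = {
    // Before: assert(handled) fails with no hint about which TreeNode broke.
    // After: the message names the offending class, so the bad implementation is obvious.
    assert(handled, s"Failed to convert a node to JSON: unexpected ${node.getClass.getName}")
  }

  def main(args: Array[String]): Unit = {
    checkConvertible("a literal", handled = true)   // passes silently
    // checkConvertible(new Object, handled = false) // would fail, naming java.lang.Object
  }
}
{code}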

> Improve the assert error message in TreeNode
> 
>
> Key: SPARK-27221
> URL: https://issues.apache.org/jira/browse/SPARK-27221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> TreeNode.parseToJson may throw an assert error without any error message when 
> a TreeNode is not implemented properly, which makes it hard to find the bad 
> TreeNode implementation.
> It's better to improve the error message to indicate the type of TreeNode.






[jira] [Created] (SPARK-27221) Improve the assert error message in TreeNode

2019-03-20 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27221:


 Summary: Improve the assert error message in TreeNode
 Key: SPARK-27221
 URL: https://issues.apache.org/jira/browse/SPARK-27221
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu









[jira] [Commented] (SPARK-25449) Don't send zero accumulators in heartbeats

2019-03-13 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791960#comment-16791960
 ] 

Shixiong Zhu commented on SPARK-25449:
--

I think this patch actually fixed a bug introduced by 
https://github.com/apache/spark/commit/0514e8d4b69615ba8918649e7e3c46b5713b6540 
That commit didn't use the correct default timeout. Before this patch, setting 
`spark.executor.heartbeatInterval 30` would send a heartbeat every 30 ms, but 
each heartbeat RPC message timeout was 30 seconds.

This patch just unifies the default time unit in all usages of 
"spark.executor.heartbeatInterval".

> Don't send zero accumulators in heartbeats
> --
>
> Key: SPARK-25449
> URL: https://issues.apache.org/jira/browse/SPARK-25449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics 
> and are generally on the order of a few KBs. However, for large jobs with 
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks 
> to die with heartbeat failures. We can mitigate this by not sending zero 
> metrics to the driver.






[jira] [Updated] (SPARK-27111) A continuous query may fail with InterruptedException when the Kafka consumer temporarily gets 0 partitions

2019-03-09 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27111:
-
Fix Version/s: 2.3.4

> A continuous query may fail with InterruptedException when the Kafka consumer 
> temporarily gets 0 partitions
> 
>
> Key: SPARK-27111
> URL: https://issues.apache.org/jira/browse/SPARK-27111
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.4, 2.4.2, 3.0.0
>
>
> Before a Kafka consumer is assigned partitions, its offset will contain 0 
> partitions. However, runContinuous will still run and launch a Spark job 
> having 0 partitions. In this case, there is a race where the epoch thread may 
> interrupt the query execution thread after `lastExecution.toRdd`, and either 
> `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next 
> `runContinuous` will get interrupted unintentionally.






[jira] [Resolved] (SPARK-27111) A continuous query may fail with InterruptedException when the Kafka consumer temporarily gets 0 partitions

2019-03-09 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-27111.
--
Resolution: Fixed

> A continuous query may fail with InterruptedException when the Kafka consumer 
> temporarily gets 0 partitions
> 
>
> Key: SPARK-27111
> URL: https://issues.apache.org/jira/browse/SPARK-27111
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.4, 2.4.2, 3.0.0
>
>
> Before a Kafka consumer is assigned partitions, its offset will contain 0 
> partitions. However, runContinuous will still run and launch a Spark job 
> having 0 partitions. In this case, there is a race where the epoch thread may 
> interrupt the query execution thread after `lastExecution.toRdd`, and either 
> `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next 
> `runContinuous` will get interrupted unintentionally.






[jira] [Updated] (SPARK-27111) A continuous query may fail with InterruptedException when the Kafka consumer temporarily gets 0 partitions

2019-03-09 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27111:
-
Fix Version/s: 2.4.2

> A continuous query may fail with InterruptedException when the Kafka consumer 
> temporarily gets 0 partitions
> 
>
> Key: SPARK-27111
> URL: https://issues.apache.org/jira/browse/SPARK-27111
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.2, 3.0.0
>
>
> Before a Kafka consumer is assigned partitions, its offset will contain 0 
> partitions. However, runContinuous will still run and launch a Spark job 
> having 0 partitions. In this case, there is a race where the epoch thread may 
> interrupt the query execution thread after `lastExecution.toRdd`, and either 
> `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next 
> `runContinuous` will get interrupted unintentionally.






[jira] [Updated] (SPARK-27111) A continuous query may fail with InterruptedException when the Kafka consumer temporarily gets 0 partitions

2019-03-09 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27111:
-
Fix Version/s: 3.0.0

> A continuous query may fail with InterruptedException when the Kafka consumer 
> temporarily gets 0 partitions
> 
>
> Key: SPARK-27111
> URL: https://issues.apache.org/jira/browse/SPARK-27111
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 3.0.0
>
>
> Before a Kafka consumer is assigned partitions, its offset will contain 0 
> partitions. However, runContinuous will still run and launch a Spark job 
> having 0 partitions. In this case, there is a race where the epoch thread may 
> interrupt the query execution thread after `lastExecution.toRdd`, and either 
> `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next 
> `runContinuous` will get interrupted unintentionally.






[jira] [Updated] (SPARK-27111) A continuous query may fail with InterruptedException when the Kafka consumer temporarily gets 0 partitions

2019-03-09 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27111:
-
Affects Version/s: 2.4.1
   2.4.0

> A continuous query may fail with InterruptedException when the Kafka consumer 
> temporarily gets 0 partitions
> 
>
> Key: SPARK-27111
> URL: https://issues.apache.org/jira/browse/SPARK-27111
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Before a Kafka consumer is assigned partitions, its offset will contain 0 
> partitions. However, runContinuous will still run and launch a Spark job 
> having 0 partitions. In this case, there is a race where the epoch thread may 
> interrupt the query execution thread after `lastExecution.toRdd`, and either 
> `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next 
> `runContinuous` will get interrupted unintentionally.






[jira] [Created] (SPARK-27111) A continuous query may fail with InterruptedException when the Kafka consumer temporarily gets 0 partitions

2019-03-08 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27111:


 Summary: A continuous query may fail with InterruptedException 
when the Kafka consumer temporarily gets 0 partitions
 Key: SPARK-27111
 URL: https://issues.apache.org/jira/browse/SPARK-27111
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.3, 2.3.2, 2.3.1, 2.3.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Before a Kafka consumer is assigned partitions, its offset will contain 0 
partitions. However, runContinuous will still run and launch a Spark job having 
0 partitions. In this case, there is a race where the epoch thread may interrupt 
the query execution thread after `lastExecution.toRdd`, and either 
`epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next 
`runContinuous` will get interrupted unintentionally.
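A toy sketch of that race using plain JVM threads (not Spark code; the two phases only stand in for the steps named above):

{code:scala}
object InterruptRaceSketch {
  def main(args: Array[String]): Unit = {
    val worker = new Thread(() => {
      // Phase 1: stands in for `lastExecution.toRdd` on an empty (0-partition) job;
      // it finishes almost immediately.
      val ignored = (1 to 10).sum
      try {
        // Phase 2: the next blocking call (stand-in for the askSync or the next
        // runContinuous). A late interrupt aimed at phase 1 lands here instead.
        Thread.sleep(1000)
        println("phase 2 finished normally")
      } catch {
        case _: InterruptedException =>
          println("phase 2 was interrupted unintentionally")
      }
    })
    worker.start()
    Thread.sleep(100)  // epoch side: by the time it decides to interrupt, phase 1 is done
    worker.interrupt() // the interrupt intended for phase 1 hits phase 2
    worker.join()
  }
}
{code}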






[jira] [Updated] (SPARK-26824) Streaming queries may store checkpoint data in a wrong directory

2019-02-25 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26824:
-
Docs Text: Earlier versions of Spark incorrectly escaped paths when writing 
out checkpoints and "_spark_metadata" for Structured Streaming. Queries 
affected by this issue will fail when running in Spark 3.0, and Spark will 
report instructions on how to migrate your queries.

> Streaming queries may store checkpoint data in a wrong directory
> 
>
> Key: SPARK-26824
> URL: https://issues.apache.org/jira/browse/SPARK-26824
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> When a user specifies a checkpoint location containing special chars that 
> need to be escaped in a path, the streaming query will store its checkpoint in 
> the wrong place. For example, if you use "/chk chk", the metadata will be 
> stored in "/chk%20chk". File sink's "_spark_metadata" directory has the same 
> issue.






[jira] [Resolved] (SPARK-26824) Streaming queries may store checkpoint data in a wrong directory

2019-02-20 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26824.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

> Streaming queries may store checkpoint data in a wrong directory
> 
>
> Key: SPARK-26824
> URL: https://issues.apache.org/jira/browse/SPARK-26824
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> When a user specifies a checkpoint location containing special chars that 
> need to be escaped in a path, the streaming query will store its checkpoint in 
> the wrong place. For example, if you use "/chk chk", the metadata will be 
> stored in "/chk%20chk". File sink's "_spark_metadata" directory has the same 
> issue.






[jira] [Reopened] (SPARK-20977) NPE in CollectionAccumulator

2019-02-19 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-20977:
--

Reopening this issue as I believe I understand the cause. An accumulator escapes 
(becomes visible to other threads) before it is fully constructed in "readObject": 
https://github.com/apache/spark/blob/b19a28dea098c7d6188f8540429c50f42952d678/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L195

"The object should not be made visible to other threads, nor should the final 
fields be read, until all updates to the final fields of the object are 
complete." 
(https://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.5.3)
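A toy model of the hazard (not Spark code; every name below is invented): readObject publishes `this` into a shared registry before the transient field is restored, so a concurrent reader can observe a half-built object and hit an NPE like the one quoted below.

{code:scala}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.concurrent.ConcurrentLinkedQueue

// Shared registry that the half-built object "escapes" into during deserialization.
object Registry {
  val seen = new ConcurrentLinkedQueue[Leaky]()
}

class Leaky extends Serializable {
  @transient var payload: java.util.List[String] = new java.util.ArrayList[String]()

  private def readObject(in: ObjectInputStream): Unit = {
    Registry.seen.add(this)        // `this` escapes here, while payload is still null
    in.defaultReadObject()
    payload = new java.util.ArrayList[String]()
  }
}

object EscapeSketch {
  def main(args: Array[String]): Unit = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(new Leaky)
    out.close()
    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray)).readObject()
    // A thread polling Registry.seen during readObject could call a method on the
    // half-built object and get a NullPointerException from the null payload.
  }
}
{code}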

> NPE in CollectionAccumulator
> 
>
> Key: SPARK-20977
> URL: https://issues.apache.org/jira/browse/SPARK-20977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sharkd tu
>Priority: Major
>
> {code:java}
> 17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread 
> heartbeat-receiver-event-loop-thread
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:407)
>   at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
>   at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Is this a bug in Spark? Has anybody else hit this problem?






[jira] [Updated] (SPARK-26824) Streaming queries may store checkpoint data in a wrong directory

2019-02-04 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26824:
-
Affects Version/s: 2.0.0
   2.1.0
   2.2.0
   2.3.0

> Streaming queries may store checkpoint data in a wrong directory
> 
>
> Key: SPARK-26824
> URL: https://issues.apache.org/jira/browse/SPARK-26824
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
>
> When a user specifies a checkpoint location containing special chars that 
> need to be escaped in a path, the streaming query will store its checkpoint in 
> the wrong place. For example, if you use "/chk chk", the metadata will be 
> stored in "/chk%20chk". File sink's "_spark_metadata" directory has the same 
> issue.






[jira] [Commented] (SPARK-26824) Streaming queries may store checkpoint data in a wrong directory

2019-02-04 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760071#comment-16760071
 ] 

Shixiong Zhu commented on SPARK-26824:
--

This will need a release note. After the fix, the paths used to store metadata 
will change. Users need to manually move the metadata files from the wrong 
location to the right one; otherwise, their queries will not pick up the old 
metadata.

> Streaming queries may store checkpoint data in a wrong directory
> 
>
> Key: SPARK-26824
> URL: https://issues.apache.org/jira/browse/SPARK-26824
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
>
> When a user specifies a checkpoint location containing special chars that 
> need to be escaped in a path, the streaming query will store its checkpoint in 
> the wrong place. For example, if you use "/chk chk", the metadata will be 
> stored in "/chk%20chk". File sink's "_spark_metadata" directory has the same 
> issue.






[jira] [Updated] (SPARK-26824) Streaming queries may store checkpoint data in a wrong directory

2019-02-04 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26824:
-
Description: When a user specifies a checkpoint location containing special 
chars that need to be escaped in a path, the streaming query will store its 
checkpoint in the wrong place. For example, if you use "/chk chk", the metadata 
will be stored in "/chk%20chk". File sink's "_spark_metadata" directory has the 
same issue.   (was: When a user specifies a checkpoint location containing 
special chars that need to be escaped in a path, the streaming query will store 
its checkpoint in the wrong place. File sink's "_spark_metadata" directory has 
the same issue.)

> Streaming queries may store checkpoint data in a wrong directory
> 
>
> Key: SPARK-26824
> URL: https://issues.apache.org/jira/browse/SPARK-26824
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
>
> When a user specifies a checkpoint location containing special chars that 
> need to be escaped in a path, the streaming query will store its checkpoint in 
> the wrong place. For example, if you use "/chk chk", the metadata will be 
> stored in "/chk%20chk". File sink's "_spark_metadata" directory has the same 
> issue.






[jira] [Created] (SPARK-26824) Streaming queries may store checkpoint data in a wrong directory

2019-02-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26824:


 Summary: Streaming queries may store checkpoint data in a wrong 
directory
 Key: SPARK-26824
 URL: https://issues.apache.org/jira/browse/SPARK-26824
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


When a user specifies a checkpoint location containing special chars that need 
to be escaped in a path, the streaming query will store its checkpoint in the 
wrong place. File sink's "_spark_metadata" directory has the same issue.
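A small sketch of the underlying hazard, assuming standard Hadoop Path/URI round-trip behavior (this is not the actual Spark code path): converting a path that contains a space to a URI string and back yields a directory whose name literally contains "%20".

{code:scala}
import org.apache.hadoop.fs.Path

object CheckpointPathSketch {
  def main(args: Array[String]): Unit = {
    val original = new Path("/chk chk")
    println(original)                                   // /chk chk
    // Round-tripping through the URI's string form percent-encodes the space,
    // and the result names a *different* directory.
    val roundTripped = new Path(original.toUri.toString)
    println(roundTripped)                               // /chk%20chk
  }
}
{code}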






[jira] [Updated] (SPARK-26824) Streaming queries may store checkpoint data in a wrong directory

2019-02-04 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26824:
-
Labels: release-notes  (was: )

> Streaming queries may store checkpoint data in a wrong directory
> 
>
> Key: SPARK-26824
> URL: https://issues.apache.org/jira/browse/SPARK-26824
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
>
> When a user specifies a checkpoint location containing special chars that 
> need to be escaped in a path, the streaming query will store its checkpoint in 
> the wrong place. File sink's "_spark_metadata" directory has the same issue.






[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-02-01 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26806:
-
Affects Version/s: 2.3.3

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: liancheng
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0, 2.2.4
>
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
> This issue was reported by [~liancheng]






[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-02-01 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26806:
-
Fix Version/s: (was: 2.3.3)
   2.3.4

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: liancheng
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0, 2.2.4
>
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
> This issue was reported by [~liancheng]






[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-02-01 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26806:
-
Affects Version/s: 2.2.2
   2.2.3

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: liancheng
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0, 2.2.4
>
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
> This issue was reported by [~liancheng]






[jira] [Resolved] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-02-01 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26806.
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1
   2.3.3
   2.2.4

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: liancheng
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.2.4, 2.3.3, 2.4.1, 3.0.0
>
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
> This issue was reported by [~liancheng]






[jira] [Commented] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)

2019-01-31 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757824#comment-16757824
 ] 

Shixiong Zhu commented on SPARK-26783:
--

[~gsomogyi] This seems to be just an API documentation issue, right? If the user 
passes "failOnDataLoss", the Kafka source will pick it up correctly.

> Kafka parameter documentation doesn't match with the reality (upper/lowercase)
> --
>
> Key: SPARK-26783
> URL: https://issues.apache.org/jira/browse/SPARK-26783
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> A good example for this is "failOnDataLoss" which is reported in SPARK-23685. 
> I've just checked and there are several other parameters which suffer from 
> the same issue.






[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-01-31 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26806:
-
Description: 
Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:

{code}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}

This issue was reported by [~liancheng]

  was:
Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:

{code}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}


> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
> This issue was reported by [~liancheng]






[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-01-31 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26806:
-
Reporter: liancheng  (was: Shixiong Zhu)

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: liancheng
>Assignee: Shixiong Zhu
>Priority: Major
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
> This issue was reported by [~liancheng]






[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-01-31 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26806:
-
Affects Version/s: 2.2.1
   2.3.0
   2.3.1
   2.3.2

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}






[jira] [Created] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-01-31 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26806:


 Summary: EventTimeStats.merge doesn't handle "zero.merge(zero)" 
correctly
 Key: SPARK-26806
 URL: https://issues.apache.org/jira/browse/SPARK-26806
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:

{code}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}
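A simplified model of the problem (not the actual EventTimeStats class; the field and object names are mine): an unguarded weighted mean divides by a zero total count, produces NaN, and the NaN survives every later merge. The mergeSafe variant treats an empty side as neutral, which is one possible shape of a fix.

{code:scala}
// Simplified stand-in for the real stats class; only avg and count are modeled.
final case class Stats(avg: Double, count: Long) {
  def merge(that: Stats): Stats = {
    val total = count + that.count
    // Unguarded weighted mean: when both sides are empty this is 0.0 / 0, i.e. NaN.
    val mergedAvg = (avg * count + that.avg * that.count) / total
    Stats(mergedAvg, total)
  }

  // A guarded variant that keeps the empty element truly neutral.
  def mergeSafe(that: Stats): Stats =
    if (count == 0L) that
    else if (that.count == 0L) this
    else merge(that)
}

object EventTimeStatsSketch {
  private val zero = Stats(avg = 0.0, count = 0L)

  def main(args: Array[String]): Unit = {
    val poisoned = zero.merge(zero).merge(Stats(avg = 100.0, count = 10L))
    println(poisoned.avg) // NaN, and Double.NaN.toLong is 0 -> the 1970-01-01 values above
    val healthy = zero.mergeSafe(zero).mergeSafe(Stats(avg = 100.0, count = 10L))
    println(healthy.avg)  // 100.0
  }
}
{code}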






[jira] [Commented] (SPARK-26682) Task attempt ID collision causes lost data

2019-01-24 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751651#comment-16751651
 ] 

Shixiong Zhu commented on SPARK-26682:
--

For future reference, data loss could happen when one task modifies another 
task's temp output file.
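A toy sketch of the collision (not the Spark patch itself): keying temp locations by (partition, attemptNumber) collides across stage attempts, while the globally unique, counter-based task attempt ID described in the quoted issue below does not.

{code:scala}
object AttemptIdSketch {
  final case class RunningTask(stageAttempt: Int, partition: Int, attemptNumber: Int, taskAttemptId: Long)

  def main(args: Array[String]): Unit = {
    // The same partition re-run in two different stage attempts: attemptNumber is 0
    // both times, but the counter-based taskAttemptId is globally unique.
    val a = RunningTask(stageAttempt = 0, partition = 7, attemptNumber = 0, taskAttemptId = 41L)
    val b = RunningTask(stageAttempt = 1, partition = 7, attemptNumber = 0, taskAttemptId = 42L)

    val tempKeyBefore = (t: RunningTask) => (t.partition, t.attemptNumber) // collides across stage attempts
    val tempKeyAfter  = (t: RunningTask) => t.taskAttemptId                // what the fix switches to

    println(tempKeyBefore(a) == tempKeyBefore(b)) // true  -> both attempts share a temp location
    println(tempKeyAfter(a) == tempKeyAfter(b))   // false -> distinct temp locations
  }
}
{code}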

> Task attempt ID collision causes lost data
> --
>
> Key: SPARK-26682
> URL: https://issues.apache.org/jira/browse/SPARK-26682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.3.2, 2.4.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Blocker
>  Labels: data-loss
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> We recently tracked missing data to a collision in the fake Hadoop task 
> attempt ID created when using Hadoop OutputCommitters. This is similar to 
> SPARK-24589.
> A stage had one task fail to get one shard from a shuffle, causing a 
> FetchFailedException and Spark resubmitted the stage. Because only one task 
> was affected, the original stage attempt continued running tasks that had 
> been resubmitted. Another task ran two attempts concurrently on the same 
> executor, but had the same attempt number because they were from different 
> stage attempts. Because the attempt number was the same, the task used the 
> same temp locations. That caused one attempt to fail because a file path 
> already existed, and that attempt then removed the shared temp location and 
> deleted the other task's data. When the second attempt succeeded, it 
> committed partial data.
> The problem was that both attempts had the same partition and attempt 
> numbers, despite being run in different stages, and that was used to create a 
> Hadoop task attempt ID on which the temp location was based. The fix is to 
> use Spark's global task attempt ID, which is a counter, instead of attempt 
> number because attempt number is reused in stage attempts.






[jira] [Commented] (SPARK-26682) Task attempt ID collision causes lost data

2019-01-23 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750486#comment-16750486
 ] 

Shixiong Zhu commented on SPARK-26682:
--

IIUC, this issue will cause a file deletion (deleting the temp file) and a file 
rename (moving the temp file to the target file) to happen at the same time. 
Could you clarify why this causes a task to commit partial data? I think the 
file rename should either move the whole file to the target file or just fail, 
right?

> Task attempt ID collision causes lost data
> --
>
> Key: SPARK-26682
> URL: https://issues.apache.org/jira/browse/SPARK-26682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.3.2, 2.4.0
>Reporter: Ryan Blue
>Priority: Blocker
>  Labels: data-loss
>
> We recently tracked missing data to a collision in the fake Hadoop task 
> attempt ID created when using Hadoop OutputCommitters. This is similar to 
> SPARK-24589.
> A stage had one task fail to get one shard from a shuffle, causing a 
> FetchFailedException and Spark resubmitted the stage. Because only one task 
> was affected, the original stage attempt continued running tasks that had 
> been resubmitted. Another task ran two attempts concurrently on the same 
> executor, but had the same attempt number because they were from different 
> stage attempts. Because the attempt number was the same, the task used the 
> same temp locations. That caused one attempt to fail because a file path 
> already existed, and that attempt then removed the shared temp location and 
> deleted the other task's data. When the second attempt succeeded, it 
> committed partial data.
> The problem was that both attempts had the same partition and attempt 
> numbers, despite being run in different stages, and that was used to create a 
> Hadoop task attempt ID on which the temp location was based. The fix is to 
> use Spark's global task attempt ID, which is a counter, instead of attempt 
> number because attempt number is reused in stage attempts.






[jira] [Updated] (SPARK-26665) BlockTransferService.fetchBlockSync may hang forever

2019-01-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26665:
-
Fix Version/s: 2.3.4

> BlockTransferService.fetchBlockSync may hang forever
> 
>
> Key: SPARK-26665
> URL: https://issues.apache.org/jira/browse/SPARK-26665
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> `ByteBuffer.allocate` may throw OutOfMemoryError when the block is large but 
> not enough memory is available. However, when this happens, right now 
> BlockTransferService.fetchBlockSync will just hang forever.






[jira] [Updated] (SPARK-26665) BlockTransferService.fetchBlockSync may hang forever

2019-01-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26665:
-
Affects Version/s: 2.3.0
   2.3.1

> BlockTransferService.fetchBlockSync may hang forever
> 
>
> Key: SPARK-26665
> URL: https://issues.apache.org/jira/browse/SPARK-26665
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> `ByteBuffer.allocate` may throw OutOfMemoryError when the block is large but 
> not enough memory is available. However, when this happens, right now 
> BlockTransferService.fetchBlockSync will just hang forever.






[jira] [Updated] (SPARK-26665) BlockTransferService.fetchBlockSync may hang forever

2019-01-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26665:
-
Affects Version/s: 2.3.2

> BlockTransferService.fetchBlockSync may hang forever
> 
>
> Key: SPARK-26665
> URL: https://issues.apache.org/jira/browse/SPARK-26665
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> `ByteBuffer.allocate` may throw OutOfMemoryError when the block is large but 
> not enough memory is available. However, when this happens, right now 
> BlockTransferService.fetchBlockSync will just hang forever.






[jira] [Resolved] (SPARK-26665) BlockTransferService.fetchBlockSync may hang forever

2019-01-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26665.
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

> BlockTransferService.fetchBlockSync may hang forever
> 
>
> Key: SPARK-26665
> URL: https://issues.apache.org/jira/browse/SPARK-26665
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> `ByteBuffer.allocate` may throw OutOfMemoryError when the block is large but 
> not enough memory is available. However, when this happens, right now 
> BlockTransferService.fetchBlockSync will just hang forever.






[jira] [Created] (SPARK-26665) BlockTransferService.fetchBlockSync may hang forever

2019-01-18 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26665:


 Summary: BlockTransferService.fetchBlockSync may hang forever
 Key: SPARK-26665
 URL: https://issues.apache.org/jira/browse/SPARK-26665
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


`ByteBuffer.allocate` may throw OutOfMemoryError when the block is large but not 
enough memory is available. However, when this happens, right now 
BlockTransferService.fetchBlockSync will just hang forever.
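A toy sketch of the hang and one way out, using plain Scala futures rather than Spark's BlockTransferService: if the callback can throw before completing the promise, the synchronous caller waits forever; routing every outcome, including the simulated OutOfMemoryError, into the promise makes it fail fast instead.

{code:scala}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object FetchBlockSketch {
  // copyBlock stands in for the ByteBuffer.allocate + copy step that may throw OOM.
  def fetchSync(copyBlock: () => Array[Byte]): Array[Byte] = {
    val result = Promise[Array[Byte]]()
    new Thread(() => {
      try {
        result.success(copyBlock())
      } catch {
        // Without this, a throwable here means the Await below never returns.
        case t: Throwable => result.failure(t)
      }
    }).start()
    Await.result(result.future, 30.seconds)
  }

  def main(args: Array[String]): Unit = {
    println(fetchSync(() => Array[Byte](1, 2, 3)).length)               // 3
    try fetchSync(() => throw new OutOfMemoryError("simulated: block too large"))
    catch { case e: Throwable => println(s"failed fast instead of hanging: $e") }
  }
}
{code}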






[jira] [Updated] (SPARK-26629) Error with multiple file stream in a query + restart on a batch that has no data for one file stream

2019-01-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26629:
-
Fix Version/s: (was: 2.3.4)

> Error with multiple file stream in a query + restart on a batch that has no 
> data for one file stream
> 
>
> Key: SPARK-26629
> URL: https://issues.apache.org/jira/browse/SPARK-26629
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> When a streaming query has multiple file streams, and there is a batch where 
> one of the file streams doesn't have data in that batch, then if the query has 
> to restart from that, it will throw the following error.
> {code}
> java.lang.IllegalStateException: batch 1 doesn't exist
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$.verifyBatchIds(HDFSMetadataLog.scala:300)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSourceLog.get(FileStreamSourceLog.scala:120)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:181)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:294)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:291)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:291)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:178)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:205)
> {code}
> **Reason**
> Existing {{HDFSMetadata.verifyBatchIds}} throws an error whenever the batchIds 
> list is empty. In the context of {{FileStreamSource.getBatch}} (where verify 
> is called) and FileStreamSourceLog (subclass of HDFSMetadata), this is 
> usually okay because, in a streaming query with one file stream, the batchIds 
> can never be empty:
> A batch is planned only when the FileStreamSourceLog has seen new offset 
> (that is, there are new data files).
> So FileStreamSource.getBatch will be called on X to Y where X will always be 
> > Y. This calls internally {{HDFSMetadata.verifyBatchIds (X+1, Y)}} with 
> X+1-Y ids.
> For example, {{FileStreamSource.getBatch(4, 5)}} will call {{verify(batchIds 
> = Seq(5), start = 5, end = 5)}}. However, the invariant of X > Y is not true 
> when there are two file stream sources, as a batch may be planned even when 
> only one of the file streams has data. So one of the file stream may not have 
> data, which can call {{FileStreamSource.getBatch(X, X) -> verify(batchIds = 
> Seq.empty, start = X+1, end = X) -> failure}}.
> Note that FileStreamSource.getBatch(X, X) gets called only when restarting a 
> query in a batch 

[jira] [Updated] (SPARK-26629) Error with multiple file stream in a query + restart on a batch that has no data for one file stream

2019-01-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26629:
-
Fix Version/s: 3.0.0
   2.4.1
   2.3.4

> Error with multiple file stream in a query + restart on a batch that has no 
> data for one file stream
> 
>
> Key: SPARK-26629
> URL: https://issues.apache.org/jira/browse/SPARK-26629
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0, 2.4.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.1, 3.0.0, 2.3.4
>
>
> When a streaming query has multiple file streams, and there is a batch where 
> one of the file streams doesn't have data in that batch, then if the query has 
> to restart from that, it will throw the following error.
> {code}
> java.lang.IllegalStateException: batch 1 doesn't exist
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$.verifyBatchIds(HDFSMetadataLog.scala:300)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSourceLog.get(FileStreamSourceLog.scala:120)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:181)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:294)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:291)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:291)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:178)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:205)
> {code}
> **Reason**
> Existing {{HDFSMetadata.verifyBatchIds}} throws an error whenever the batchIds 
> list is empty. In the context of {{FileStreamSource.getBatch}} (where verify 
> is called) and FileStreamSourceLog (subclass of HDFSMetadata), this is 
> usually okay because, in a streaming query with one file stream, the batchIds 
> can never be empty:
> A batch is planned only when the FileStreamSourceLog has seen new offset 
> (that is, there are new data files).
> So FileStreamSource.getBatch will be called on X to Y where X will always be 
> > Y. This calls internally {{HDFSMetadata.verifyBatchIds (X+1, Y)}} with 
> X+1-Y ids.
> For example, {{FileStreamSource.getBatch(4, 5)}} will call {{verify(batchIds 
> = Seq(5), start = 5, end = 5)}}. However, the invariant of X > Y is not true 
> when there are two file stream sources, as a batch may be planned even when 
> only one of the file streams has data. So one of the file stream may not have 
> data, which can call {{FileStreamSource.getBatch(X, X) -> verify(batchIds = 
> Seq.empty, start = X+1, end = X) -> failure}}.
> Note that FileStreamSource.getBatch(X, X) gets called only 
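
For illustration, here is a minimal Scala sketch of the contract described
above. The method body is a hypothetical simplification written for this note,
not Spark's actual HDFSMetadataLog code; only the call shapes (getBatch(4, 5)
vs. getBatch(X, X)) mirror the report.
{code}
// Hypothetical simplification of the failing check; not Spark's actual code.
def verifyBatchIds(batchIds: Seq[Long], start: Long, end: Long): Unit = {
  if (batchIds.isEmpty) {
    throw new IllegalStateException(s"batch $start doesn't exist")
  }
  require(batchIds.toList == (start to end).toList,
    s"batch ids $batchIds do not cover the requested range $start to $end")
}

// One file stream: getBatch(4, 5) verifies ids 5..5 and passes.
verifyBatchIds(Seq(5L), start = 5L, end = 5L)

// Two file streams, one without data in the restarted batch: getBatch(X, X)
// produces an empty id list with start = X + 1 and end = X, so the check fails.
try verifyBatchIds(Seq.empty, start = 6L, end = 5L)
catch { case e: IllegalStateException => println(e.getMessage) } // batch 6 doesn't exist
{code}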

[jira] [Resolved] (SPARK-26629) Error with multiple file stream in a query + restart on a batch that has no data for one file stream

2019-01-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26629.
--
Resolution: Fixed

> Error with multiple file stream in a query + restart on a batch that has no 
> data for one file stream
> 
>
> Key: SPARK-26629
> URL: https://issues.apache.org/jira/browse/SPARK-26629
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.1, 3.0.0, 2.3.4
>
>
> When a streaming query has multiple file streams, and there is a batch in 
> which one of the file streams does not have data, then if the query has to 
> restart from that batch, it will throw the following error.
> {code}
> java.lang.IllegalStateException: batch 1 doesn't exist
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$.verifyBatchIds(HDFSMetadataLog.scala:300)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSourceLog.get(FileStreamSourceLog.scala:120)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:181)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:294)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:291)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:291)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:178)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:205)
> {code}
> **Reason**
> The existing {{HDFSMetadataLog.verifyBatchIds}} throws an error whenever the 
> batchIds list is empty. In the context of {{FileStreamSource.getBatch}} (where 
> verify is called) and FileStreamSourceLog (a subclass of HDFSMetadataLog), this 
> is usually okay because, in a streaming query with one file stream, the 
> batchIds can never be empty:
> A batch is planned only when the FileStreamSourceLog has seen a new offset 
> (that is, there are new data files).
> So FileStreamSource.getBatch will be called on X to Y, where Y will always be 
> greater than X. This internally calls {{HDFSMetadataLog.verifyBatchIds(X+1, Y)}} 
> with the ids X+1 to Y.
> For example, {{FileStreamSource.getBatch(4, 5)}} will call {{verify(batchIds 
> = Seq(5), start = 5, end = 5)}}. However, the invariant Y > X does not hold 
> when there are two file stream sources, as a batch may be planned even when 
> only one of the file streams has new data. The other file stream may then have 
> no data for that batch, which leads to {{FileStreamSource.getBatch(X, X) -> 
> verify(batchIds = Seq.empty, start = X+1, end = X) -> failure}}.
> Note that FileStreamSource.getBatch(X, X) gets called only when restarting a 
> query in a batch where 

[jira] [Updated] (SPARK-26629) Error with multiple file stream in a query + restart on a batch that has no data for one file stream

2019-01-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26629:
-
Affects Version/s: 2.3.3

> Error with multiple file stream in a query + restart on a batch that has no 
> data for one file stream
> 
>
> Key: SPARK-26629
> URL: https://issues.apache.org/jira/browse/SPARK-26629
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.1, 3.0.0, 2.3.4
>
>
> When a streaming query has multiple file streams, and there is a batch in 
> which one of the file streams does not have data, then if the query has to 
> restart from that batch, it will throw the following error.
> {code}
> java.lang.IllegalStateException: batch 1 doesn't exist
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$.verifyBatchIds(HDFSMetadataLog.scala:300)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSourceLog.get(FileStreamSourceLog.scala:120)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:181)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:294)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:291)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:291)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:178)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175)
>   at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:205)
> {code}
> **Reason**
> The existing {{HDFSMetadataLog.verifyBatchIds}} throws an error whenever the 
> batchIds list is empty. In the context of {{FileStreamSource.getBatch}} (where 
> verify is called) and FileStreamSourceLog (a subclass of HDFSMetadataLog), this 
> is usually okay because, in a streaming query with one file stream, the 
> batchIds can never be empty:
> A batch is planned only when the FileStreamSourceLog has seen a new offset 
> (that is, there are new data files).
> So FileStreamSource.getBatch will be called on X to Y, where Y will always be 
> greater than X. This internally calls {{HDFSMetadataLog.verifyBatchIds(X+1, Y)}} 
> with the ids X+1 to Y.
> For example, {{FileStreamSource.getBatch(4, 5)}} will call {{verify(batchIds 
> = Seq(5), start = 5, end = 5)}}. However, the invariant Y > X does not hold 
> when there are two file stream sources, as a batch may be planned even when 
> only one of the file streams has new data. The other file stream may then have 
> no data for that batch, which leads to {{FileStreamSource.getBatch(X, X) -> 
> verify(batchIds = Seq.empty, start = X+1, end = X) -> failure}}.
> Note that FileStreamSource.getBatch(X, X) gets called only when restarting a 
> query in a batch 

[jira] [Resolved] (SPARK-26350) Allow the user to override the group id of the Kafka's consumer

2019-01-14 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26350.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 3.0.0

> Allow the user to override the group id of the Kafka's consumer
> ---
>
> Key: SPARK-26350
> URL: https://issues.apache.org/jira/browse/SPARK-26350
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Permalink.url
>
>
> Sometimes the group id is used to identify the stream for "security" purposes. 
> We should provide a flag that lets the user override it.
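
A hedged sketch of how the override is meant to be used, assuming the
kafka.group.id source option that the fix introduces in Spark 3.0.0 (the broker
address, topic, and group id below are placeholders; on earlier versions the
source always generates its own group id):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-group-id-sketch").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("kafka.group.id", "team-a-approved-group") // override the auto-generated group id
  .load()
{code}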



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26586) Streaming queries should have isolated SparkSessions and confs

2019-01-11 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26586:
-
Affects Version/s: 2.2.1
   2.2.2
   2.3.1
   2.3.2

> Streaming queries should have isolated SparkSessions and confs
> --
>
> Key: SPARK-26586
> URL: https://issues.apache.org/jira/browse/SPARK-26586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> When a stream is started, the stream's config is supposed to be frozen and 
> all batches run with the config at start time. However, due to a race 
> condition in creating streams, updating a conf value in the active spark 
> session immediately after starting a stream can lead to the stream getting 
> that updated value.
>  
> The problem is that when StreamingQueryManager creates a MicroBatchExecution 
> (or ContinuousExecution), it passes in the shared SparkSession, and that 
> session isn't cloned until StreamExecution.start() is called. 
> DataStreamWriter.start() should not return until the SparkSession is cloned.
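
To make the race concrete, here is a minimal Scala sketch (illustrative only,
using the built-in rate source and console sink): once sessions are isolated,
the conf change made right after start() must not leak into the running query.
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-conf-isolation-sketch").getOrCreate()

// Start a trivial streaming query.
val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("console")
  .start()

// Before the fix, this update could race with query startup and be picked up
// by the stream; with isolated sessions the query keeps its start-time value.
spark.conf.set("spark.sql.shuffle.partitions", "1")

query.stop()
{code}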



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26586) Streaming queries should have isolated SparkSessions and confs

2019-01-11 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26586:
-
Affects Version/s: 2.2.0

> Streaming queries should have isolated SparkSessions and confs
> --
>
> Key: SPARK-26586
> URL: https://issues.apache.org/jira/browse/SPARK-26586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> When a stream is started, the stream's config is supposed to be frozen and 
> all batches run with the config at start time. However, due to a race 
> condition in creating streams, updating a conf value in the active spark 
> session immediately after starting a stream can lead to the stream getting 
> that updated value.
>  
> The problem is that when StreamingQueryManager creates a MicroBatchExecution 
> (or ContinuousExecution), it passes in the shared SparkSession, and that 
> session isn't cloned until StreamExecution.start() is called. 
> DataStreamWriter.start() should not return until the SparkSession is cloned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26586) Streaming queries should have isolated SparkSessions and confs

2019-01-11 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26586:
-
Target Version/s: 3.0.0  (was: 2.5.0, 3.0.0)

> Streaming queries should have isolated SparkSessions and confs
> --
>
> Key: SPARK-26586
> URL: https://issues.apache.org/jira/browse/SPARK-26586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> When a stream is started, the stream's config is supposed to be frozen and 
> all batches run with the config at start time. However, due to a race 
> condition in creating streams, updating a conf value in the active spark 
> session immediately after starting a stream can lead to the stream getting 
> that updated value.
>  
> The problem is that when StreamingQueryManager creates a MicroBatchExecution 
> (or ContinuousExecution), it passes in the shared SparkSession, and that 
> session isn't cloned until StreamExecution.start() is called. 
> DataStreamWriter.start() should not return until the SparkSession is cloned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26586) Streaming queries should have isolated SparkSessions and confs

2019-01-11 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26586.
--
   Resolution: Fixed
 Assignee: Mukul Murthy
Fix Version/s: 3.0.0
   2.4.1

> Streaming queries should have isolated SparkSessions and confs
> --
>
> Key: SPARK-26586
> URL: https://issues.apache.org/jira/browse/SPARK-26586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> When a stream is started, the stream's config is supposed to be frozen and 
> all batches run with the config at start time. However, due to a race 
> condition in creating streams, updating a conf value in the active spark 
> session immediately after starting a stream can lead to the stream getting 
> that updated value.
>  
> The problem is that when StreamingQueryManager creates a MicroBatchExecution 
> (or ContinuousExecution), it passes in the shared SparkSession, and that 
> session isn't cloned until StreamExecution.start() is called. 
> DataStreamWriter.start() should not return until the SparkSession is cloned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26594) DataSourceOptions.asMap should return CaseInsensitiveMap

2019-01-10 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26594:


 Summary: DataSourceOptions.asMap should return CaseInsensitiveMap
 Key: SPARK-26594
 URL: https://issues.apache.org/jira/browse/SPARK-26594
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


I'm pretty surprised that the following code fails.
{code}
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.DataSourceOptions

// DataSourceOptions lower-cases its keys internally, so asMap returns a map
// keyed by "foobar"; the case-sensitive get("fooBar") below returns null and
// the assert fails.
val map = new DataSourceOptions(Map("fooBar" -> "x").asJava).asMap
assert(map.get("fooBar") == "x")
{code}

It's better to make DataSourceOptions.asMap return CaseInsensitiveMap.
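
For reference, a short sketch of the behaviour the proposal asks for, using
catalyst's CaseInsensitiveMap (an internal class, shown here only to illustrate
case-insensitive lookups):
{code}
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// Lookups ignore the case of the key, so callers don't need to know how the
// options were normalized internally.
val m = CaseInsensitiveMap(Map("fooBar" -> "x"))
assert(m("foobar") == "x")
assert(m("FOOBAR") == "x")
{code}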



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26267) Kafka source may reprocess data

2019-01-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26267.
--
   Resolution: Fixed
Fix Version/s: 2.4.1

> Kafka source may reprocess data
> ---
>
> Key: SPARK-26267
> URL: https://issues.apache.org/jira/browse/SPARK-26267
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
>
> Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it 
> may instead get the earliest offset, and it will then reprocess messages that 
> have already been processed once it gets the correct latest offset in the next 
> batch.
> This usually happens when restarting a streaming query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26267) Kafka source may reprocess data

2018-12-21 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26267:
-
Fix Version/s: 3.0.0

> Kafka source may reprocess data
> ---
>
> Key: SPARK-26267
> URL: https://issues.apache.org/jira/browse/SPARK-26267
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it 
> may instead get the earliest offset, and it will then reprocess messages that 
> have already been processed once it gets the correct latest offset in the next 
> batch.
> This usually happens when restarting a streaming query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26267) Kafka source may reprocess data

2018-12-14 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-26267:


Assignee: Shixiong Zhu

> Kafka source may reprocess data
> ---
>
> Key: SPARK-26267
> URL: https://issues.apache.org/jira/browse/SPARK-26267
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>  Labels: correctness
>
> Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it 
> may instead get the earliest offset, and it will then reprocess messages that 
> have already been processed once it gets the correct latest offset in the next 
> batch.
> This usually happens when restarting a streaming query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26350) Allow the user to override the group id of the Kafka's consumer

2018-12-12 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26350:


 Summary: Allow the user to override the group id of the Kafka's 
consumer
 Key: SPARK-26350
 URL: https://issues.apache.org/jira/browse/SPARK-26350
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


Sometimes the group id is used to identify the stream for "security" purposes. 
We should provide a flag that lets the user override it.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26267) Kafka source may reprocess data

2018-12-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26267:
-
Priority: Blocker  (was: Major)

> Kafka source may reprocess data
> ---
>
> Key: SPARK-26267
> URL: https://issues.apache.org/jira/browse/SPARK-26267
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>  Labels: correctness
>
> Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it 
> may instead get the earliest offset, and it will then reprocess messages that 
> have already been processed once it gets the correct latest offset in the next 
> batch.
> This usually happens when restarting a streaming query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26267) Kafka source may reprocess data

2018-12-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26267:
-
Labels: correctness  (was: )

> Kafka source may reprocess data
> ---
>
> Key: SPARK-26267
> URL: https://issues.apache.org/jira/browse/SPARK-26267
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>  Labels: correctness
>
> Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it 
> may instead get the earliest offset, and it will then reprocess messages that 
> have already been processed once it gets the correct latest offset in the next 
> batch.
> This usually happens when restarting a streaming query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26267) Kafka source may reprocess data

2018-12-04 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709169#comment-16709169
 ] 

Shixiong Zhu commented on SPARK-26267:
--

KAFKA-7703 only exists in Kafka 1.1.0 and above, so a possible workaround is 
using an older client version that doesn't have this issue. This doesn't impact 
Spark 2.3.x and below, as they use the Kafka 0.10.0.1 client by default.
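
For example, in an sbt-based build (an assumption for this sketch), pinning the
Kafka client below 1.1.0 would look roughly like this; the exact version to pin
depends on your environment:
{code}
// build.sbt (sketch): force a kafka-clients version that predates KAFKA-7703.
dependencyOverrides += "org.apache.kafka" % "kafka-clients" % "1.0.2"
{code}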

> Kafka source may reprocess data
> ---
>
> Key: SPARK-26267
> URL: https://issues.apache.org/jira/browse/SPARK-26267
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it 
> may instead get the earliest offset, and it will then reprocess messages that 
> have already been processed once it gets the correct latest offset in the next 
> batch.
> This usually happens when restarting a streaming query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26267) Kafka source may reprocess data

2018-12-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26267:


 Summary: Kafka source may reprocess data
 Key: SPARK-26267
 URL: https://issues.apache.org/jira/browse/SPARK-26267
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it may 
instead get the earliest offset, and it will then reprocess messages that have 
already been processed once it gets the correct latest offset in the next batch.

This usually happens when restarting a streaming query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26120:


 Summary: Fix a streaming query leak in Structured Streaming R tests
 Key: SPARK-26120
 URL: https://issues.apache.org/jira/browse/SPARK-26120
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


"Specify a schema by using a DDL-formatted string when reading" doesn't stop 
the streaming query before stopping Spark. It causes the following annoying 
logs.

{code}
Exception in thread "stream execution thread for [id = 
186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
at 
org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
... 7 more
org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
at 
org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
... 7 more
{code}
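
The fix is simply to stop the query before stopping Spark. The actual change is
in the SparkR test; in Scala terms the pattern looks like this sketch:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stop-order-sketch").getOrCreate()

val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()
try {
  query.processAllAvailable() // run the test's assertions while the query is active
} finally {
  query.stop() // stop the query first...
  spark.stop() // ...then the session, so the RpcEnv is still alive during query shutdown
}
{code}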



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26120:
-
Priority: Minor  (was: Major)

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26120:
-
Component/s: Structured Streaming
 SparkR

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26092) Use CheckpointFileManager to write the streaming metadata file

2018-11-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26092.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 3.0.0
   2.4.1

> Use CheckpointFileManager to write the streaming metadata file
> --
>
> Key: SPARK-26092
> URL: https://issues.apache.org/jira/browse/SPARK-26092
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> We should use CheckpointFileManager to write the streaming metadata file to 
> avoid potential partial-file issues.
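
A hedged sketch of the intended write path, using Spark's internal
CheckpointFileManager API (method names as of the 2.4-era internals; this is
not a public API, so treat the exact signatures as an assumption):
{code}
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

import org.apache.spark.sql.execution.streaming.CheckpointFileManager

val checkpointDir = new Path("/checkpoints/query1") // placeholder path
val fm = CheckpointFileManager.create(checkpointDir, new Configuration())

// createAtomic writes to a temporary file and renames it on close, so readers
// never observe a partially written metadata file.
val out = fm.createAtomic(new Path(checkpointDir, "metadata"), overwriteIfPossible = false)
try {
  out.write("""{"id":"placeholder-query-id"}""".getBytes(StandardCharsets.UTF_8))
  out.close()
} catch {
  case t: Throwable =>
    out.cancel() // abort so no partial file is left behind
    throw t
}
{code}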



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26092) Use CheckpointFileManager to write the streaming metadata file

2018-11-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26092:
-
Issue Type: Bug  (was: Test)

> Use CheckpointFileManager to write the streaming metadata file
> --
>
> Key: SPARK-26092
> URL: https://issues.apache.org/jira/browse/SPARK-26092
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> We should use CheckpointFileManager to write the streaming metadata file to 
> avoid potential partial-file issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26092) Use CheckpointFileManager to write the streaming metadata file

2018-11-16 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26092:


 Summary: Use CheckpointFileManager to write the streaming metadata 
file
 Key: SPARK-26092
 URL: https://issues.apache.org/jira/browse/SPARK-26092
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


We should use CheckpointFileManager to write the streaming metadata file to 
avoid potential partial-file issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26069) Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures

2018-11-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26069.
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

> Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures
> -
>
> Key: SPARK-26069
> URL: https://issues.apache.org/jira/browse/SPARK-26069
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> {code}
> sbt.ForkMain$ForkError: java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.assertErrorAndClosed(RpcIntegrationSuite.java:386)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.sendRpcWithStreamFailures(RpcIntegrationSuite.java:347)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
>   at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26069) Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures

2018-11-14 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26069:


 Summary: Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures
 Key: SPARK-26069
 URL: https://issues.apache.org/jira/browse/SPARK-26069
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


{code}
sbt.ForkMain$ForkError: java.lang.AssertionError: expected:<1> but was:<2>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:645)
at org.junit.Assert.assertEquals(Assert.java:631)
at 
org.apache.spark.network.RpcIntegrationSuite.assertErrorAndClosed(RpcIntegrationSuite.java:386)
at 
org.apache.spark.network.RpcIntegrationSuite.sendRpcWithStreamFailures(RpcIntegrationSuite.java:347)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26042) KafkaContinuousSourceTopicDeletionSuite may hang forever

2018-11-14 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26042.
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

> KafkaContinuousSourceTopicDeletionSuite may hang forever
> 
>
> Key: SPARK-26042
> URL: https://issues.apache.org/jira/browse/SPARK-26042
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> Saw the following thread dump in some build:
> {code}
> "stream execution thread for [id = 1c13482e-1edf-4b5c-b63a-d652738c8a48, 
> runId = 10667ce9-7eac-4cef-a525-f1bd08eb50f1]" #4406 daemon prio=5 os_prio=0 
> tid=0x7fab1d3c5000 nid=0x7f4b waiting on condition [0x7fa96efcb000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070a904cf8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> ...
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:180)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:109)
>   - locked <0x00070a913ee8> (a 
> org.apache.spark.sql.execution.streaming.IncrementalExecution)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:109)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:270)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:270)
> ,,,
> "pool-1-thread-1-ScalaTest-running-KafkaContinuousSourceTopicDeletionSuite" 
> #20 prio=5 os_prio=0 tid=0x7fabc4e78800 nid=0x23be waiting for monitor 
> entry [0x7fab3dbff000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100)
>   - waiting to lock <0x00070a913ee8> (a 
> org.apache.spark.sql.execution.streaming.IncrementalExecution)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.kafka010.KafkaContinuousSourceTopicDeletionSuite$$anonfun$3$$anonfun$apply$mcV$sp$12$$anonfun$apply$15.apply(KafkaContinuousSourceSuite.scala:210)
>   at 
> org.apache.spark.sql.kafka010.KafkaContinuousSourceTopicDeletionSuite$$anonfun$3$$anonfun$apply$mcV$sp$12$$anonfun$apply$15.apply(KafkaContinuousSourceSuite.scala:209)
> ...
> {code}
> It hung forever because the test's main thread was trying to access 
> `executedPlan` while the lock was held by the streaming thread.
> This is a pretty common issue when using lazy vals, as all lazy vals of an 
> instance share the same lock.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26042) KafkaContinuousSourceTopicDeletionSuite may hang forever

2018-11-13 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26042:


 Summary: KafkaContinuousSourceTopicDeletionSuite may hang forever
 Key: SPARK-26042
 URL: https://issues.apache.org/jira/browse/SPARK-26042
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming, Tests
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Saw the following thread dump in some build:
{code}
"stream execution thread for [id = 1c13482e-1edf-4b5c-b63a-d652738c8a48, runId 
= 10667ce9-7eac-4cef-a525-f1bd08eb50f1]" #4406 daemon prio=5 os_prio=0 
tid=0x7fab1d3c5000 nid=0x7f4b waiting on condition [0x7fa96efcb000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00070a904cf8> (a 
scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
...
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:109)
- locked <0x00070a913ee8> (a 
org.apache.spark.sql.execution.streaming.IncrementalExecution)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:109)
at 
org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:270)
at 
org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:270)
,,,


"pool-1-thread-1-ScalaTest-running-KafkaContinuousSourceTopicDeletionSuite" #20 
prio=5 os_prio=0 tid=0x7fabc4e78800 nid=0x23be waiting for monitor entry 
[0x7fab3dbff000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100)
- waiting to lock <0x00070a913ee8> (a 
org.apache.spark.sql.execution.streaming.IncrementalExecution)
at 
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:100)
at 
org.apache.spark.sql.kafka010.KafkaContinuousSourceTopicDeletionSuite$$anonfun$3$$anonfun$apply$mcV$sp$12$$anonfun$apply$15.apply(KafkaContinuousSourceSuite.scala:210)
at 
org.apache.spark.sql.kafka010.KafkaContinuousSourceTopicDeletionSuite$$anonfun$3$$anonfun$apply$mcV$sp$12$$anonfun$apply$15.apply(KafkaContinuousSourceSuite.scala:209)
...
{code}
It hung forever because the test main thread was trying to access 
`executedPlan` while the lock was held by the streaming thread.

This is a pretty common issue when using lazy vals, as all lazy vals of an 
object share the same lock (the object's monitor).
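
For illustration, here is a minimal Scala sketch of that shared-lock behavior (not the actual QueryExecution code, just an assumed pair of lazy vals): both lazy vals are initialized under the monitor of the same instance, so a slow initializer of one blocks any reader of the other.

{code:scala}
class QueryExecutionLike {
  // both lazy vals are guarded by the monitor of this instance
  lazy val toRdd: Int = { Thread.sleep(60000); 1 }  // the "streaming thread" initializes this
  lazy val executedPlan: String = "plan"            // the "test thread" only wants this
}

object LazyValLockDemo {
  def main(args: Array[String]): Unit = {
    val qe = new QueryExecutionLike
    new Thread(new Runnable { def run(): Unit = qe.toRdd }).start() // holds the lock for 60s
    Thread.sleep(100)
    println(qe.executedPlan) // blocks here until toRdd finishes initializing
  }
}
{code}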






[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-11-01 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671900#comment-16671900
 ] 

Shixiong Zhu commented on SPARK-25692:
--

[~sanket991] You can download the unit test logs from 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/5554/artifact/common/network-common/target/
 (they will be kept on Jenkins for several days)

> Flaky test: ChunkFetchIntegrationSuite
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-11-01 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671895#comment-16671895
 ] 

Shixiong Zhu commented on SPARK-25692:
--

It's still flaky on Jenkins: 
[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/5554/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/fetchBothChunks/history/]

 !Screen Shot 2018-11-01 at 10.17.16 AM.png! 

You may need to run all of the tests together. 
[https://github.com/apache/spark/pull/22173] added a global thread pool, so 
other tests may also affect this test suite.

> Flaky test: ChunkFetchIntegrationSuite
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Updated] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-11-01 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25692:
-
Attachment: Screen Shot 2018-11-01 at 10.17.16 AM.png

> Flaky test: ChunkFetchIntegrationSuite
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Commented] (SPARK-20568) Delete files after processing in structured streaming

2018-10-31 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670897#comment-16670897
 ] 

Shixiong Zhu commented on SPARK-20568:
--

[~kabhwan] I think this is pretty useful. Do you have time to work on this?

> Delete files after processing in structured streaming
> -
>
> Key: SPARK-20568
> URL: https://issues.apache.org/jira/browse/SPARK-20568
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Saul Shanabrook
>Priority: Major
>
> It would be great to be able to delete files after processing them with 
> structured streaming.
> For example, I am reading in a bunch of JSON files and converting them into 
> Parquet. If the JSON files are not deleted after they are processed, it 
> quickly fills up my hard drive. I originally [posted this on Stack 
> Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to 
> make a feature request for it. 






[jira] [Updated] (SPARK-25899) Flaky test: CoarseGrainedSchedulerBackendSuite.compute max number of concurrent tasks can be launched

2018-10-31 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25899:
-
Issue Type: Test  (was: Documentation)

> Flaky test: CoarseGrainedSchedulerBackendSuite.compute max number of 
> concurrent tasks can be launched
> -
>
> Key: SPARK-25899
> URL: https://issues.apache.org/jira/browse/SPARK-25899
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> {code}
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 400 times over 
> 10.00982864399 seconds. Last failure message: ArrayBuffer("2", "0", "3") 
> had length 3 instead of expected length 4.
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.eventually(CoarseGrainedSchedulerBackendSuite.scala:30)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.eventually(CoarseGrainedSchedulerBackendSuite.scala:30)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite$$anonfun$3.apply(CoarseGrainedSchedulerBackendSuite.scala:54)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite$$anonfun$3.apply(CoarseGrainedSchedulerBackendSuite.scala:49)
>   at 
> org.apache.spark.SparkFunSuite$$anonfun$test$1.apply(SparkFunSuite.scala:266)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:168)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)

[jira] [Created] (SPARK-25899) Flaky test: CoarseGrainedSchedulerBackendSuite.compute max number of concurrent tasks can be launched

2018-10-31 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25899:


 Summary: Flaky test: CoarseGrainedSchedulerBackendSuite.compute 
max number of concurrent tasks can be launched
 Key: SPARK-25899
 URL: https://issues.apache.org/jira/browse/SPARK-25899
 Project: Spark
  Issue Type: Documentation
  Components: Tests
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


{code}
sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 400 times over 10.00982864399 
seconds. Last failure message: ArrayBuffer("2", "0", "3") had length 3 instead 
of expected length 4.
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
at 
org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.eventually(CoarseGrainedSchedulerBackendSuite.scala:30)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337)
at 
org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.eventually(CoarseGrainedSchedulerBackendSuite.scala:30)
at 
org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite$$anonfun$3.apply(CoarseGrainedSchedulerBackendSuite.scala:54)
at 
org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite$$anonfun$3.apply(CoarseGrainedSchedulerBackendSuite.scala:49)
at 
org.apache.spark.SparkFunSuite$$anonfun$test$1.apply(SparkFunSuite.scala:266)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:168)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at 

[jira] [Resolved] (SPARK-25773) Cancel zombie tasks in a result stage when the job finishes

2018-10-30 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-25773.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

> Cancel zombie tasks in a result stage when the job finishes
> ---
>
> Key: SPARK-25773
> URL: https://issues.apache.org/jira/browse/SPARK-25773
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 3.0.0
>
>
> When a job finishes, there may be some zombie tasks still running due to 
> stage retry. Since a result stage will never be used by other jobs, running 
> these tasks just wastes cluster resources. This PR just asks 
> TaskScheduler to cancel the running tasks of a result stage when it's already 
> finished. Credits go to @srinathshankar who suggested this idea to me.






[jira] [Created] (SPARK-25849) Improve document for task cancellation.

2018-10-25 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25849:


 Summary: Improve document for task cancellation.
 Key: SPARK-25849
 URL: https://issues.apache.org/jira/browse/SPARK-25849
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


As suggested by [~markhamstra] in 
https://github.com/apache/spark/pull/22771#discussion_r228371144 , we should 
update the document to clarify how task cancellation works in Spark.






[jira] [Created] (SPARK-25822) Fix a race condition when releasing a Python worker

2018-10-24 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25822:


 Summary: Fix a race condition when releasing a Python worker
 Key: SPARK-25822
 URL: https://issues.apache.org/jira/browse/SPARK-25822
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.2
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


There is a race condition when releasing a Python worker. If 
"ReaderIterator.handleEndOfDataSection" is not running in the task thread and a 
task is terminated early (such as by "take(N)"), the task completion listener 
may close the worker while "handleEndOfDataSection" can still put the worker 
back into the worker pool for reuse.

https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7
 is a patch to reproduce this issue.

I also found a user reported this in the mail list: 
http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=h+yluepd23nwvq13ms5hostkhx3ao4f4zqv6sgo5zm...@mail.gmail.com%3E
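
A self-contained sketch of the race and one possible guard (hypothetical worker/pool types, not the actual PythonRunner code): both the completion listener and handleEndOfDataSection have to agree on a single released/closed state, otherwise a closed worker can end up back in the pool.

{code:scala}
import java.net.Socket
import scala.collection.mutable

class WorkerHandle(socket: Socket, pool: mutable.Queue[Socket]) {
  private var released = false
  private var closed = false

  // called from handleEndOfDataSection: reuse the worker unless it was already closed
  def release(): Unit = synchronized {
    if (!closed && !released) {
      released = true
      pool.enqueue(socket)
    }
  }

  // called from the task completion listener on early termination (e.g. take(N))
  def close(): Unit = synchronized {
    if (!released && !closed) {
      closed = true
      socket.close()
    }
  }
}
{code}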






[jira] [Resolved] (SPARK-25771) Fix improper synchronization in PythonWorkerFactory

2018-10-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-25771.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

> Fix improper synchronization in PythonWorkerFactory
> ---
>
> Key: SPARK-25771
> URL: https://issues.apache.org/jira/browse/SPARK-25771
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-25773) Cancel zombie tasks in a result stage when the job finishes

2018-10-18 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25773:
-
Description: When a job finishes, there may be some zombie tasks still 
running due to stage retry. Since a result stage will never be used by other 
jobs, running these tasks just wastes cluster resources. This PR just 
asks TaskScheduler to cancel the running tasks of a result stage when it's 
already finished. Credits go to @srinathshankar who suggested this idea to me.

> Cancel zombie tasks in a result stage when the job finishes
> ---
>
> Key: SPARK-25773
> URL: https://issues.apache.org/jira/browse/SPARK-25773
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> When a job finishes, there may be some zombie tasks still running due to 
> stage retry. Since a result stage will never be used by other jobs, running 
> these tasks just wastes cluster resources. This PR just asks 
> TaskScheduler to cancel the running tasks of a result stage when it's already 
> finished. Credits go to @srinathshankar who suggested this idea to me.






[jira] [Created] (SPARK-25773) Cancel zombie tasks in a result stage when the job finishes

2018-10-18 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25773:


 Summary: Cancel zombie tasks in a result stage when the job 
finishes
 Key: SPARK-25773
 URL: https://issues.apache.org/jira/browse/SPARK-25773
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu









[jira] [Created] (SPARK-25771) Fix improper synchronization in PythonWorkerFactory

2018-10-18 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25771:


 Summary: Fix improper synchronization in PythonWorkerFactory
 Key: SPARK-25771
 URL: https://issues.apache.org/jira/browse/SPARK-25771
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.2
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu









[jira] [Commented] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port

2018-10-15 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650796#comment-16650796
 ] 

Shixiong Zhu commented on SPARK-25738:
--

Marked as a blocker since this is a regression

> LOAD DATA INPATH doesn't work if hdfs conf includes port
> 
>
> Key: SPARK-25738
> URL: https://issues.apache.org/jira/browse/SPARK-25738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Blocker
>
> LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address 
> at index 8}} if your hdfs conf includes a port for the namenode.
> This is because the code passes the value of the hdfs conf 
> "fs.defaultFS" in as the host. Note that the variable is called {{authority}}, 
> but the 4-arg URI constructor actually expects a host: 
> https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
> {code}
> val defaultFSConf = 
> sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
> ...
> val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment)
> {code}
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386
> This was introduced by SPARK-23425.
> *Workaround*: specify a fully qualified path, e.g. instead of 
> {noformat}
> LOAD DATA INPATH '/some/path/on/hdfs'
> {noformat}
> use
> {noformat}
> LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs'
> {noformat}
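
To illustrate the constructor mismatch described above, a short Scala snippet (with a hypothetical namenode address): the 4-arg java.net.URI constructor treats its second argument as a host, so a value containing a port is wrapped as an IPv6 literal and fails to parse, while the 5-arg constructor that takes an authority handles host:port fine.

{code:scala}
import java.net.URI

// 4-arg form: the second argument is a *host*; "nn1.example.com:8020" gets bracketed as an
// IPv6 literal and parsing fails with "Malformed IPv6 address at index 8"
// new URI("hdfs", "nn1.example.com:8020", "/some/path/on/hdfs", null)

// 5-arg form: the second argument is an *authority*, so host:port works
val ok = new URI("hdfs", "nn1.example.com:8020", "/some/path/on/hdfs", null, null)
println(ok) // hdfs://nn1.example.com:8020/some/path/on/hdfs
{code}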






[jira] [Updated] (SPARK-25738) LOAD DATA INPATH doesn't work if hdfs conf includes port

2018-10-15 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25738:
-
Priority: Blocker  (was: Critical)

> LOAD DATA INPATH doesn't work if hdfs conf includes port
> 
>
> Key: SPARK-25738
> URL: https://issues.apache.org/jira/browse/SPARK-25738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Blocker
>
> LOAD DATA INPATH throws {{java.net.URISyntaxException: Malformed IPv6 address 
> at index 8}} if your hdfs conf includes a port for the namenode.
> This is because the code passes the value of the hdfs conf 
> "fs.defaultFS" in as the host. Note that the variable is called {{authority}}, 
> but the 4-arg URI constructor actually expects a host: 
> https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
> {code}
> val defaultFSConf = 
> sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
> ...
> val newUri = new URI(scheme, authority, pathUri.getPath, pathUri.getFragment)
> {code}
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L386
> This was introduced by SPARK-23425.
> *Workaround*: specify a fully qualified path, e.g. instead of 
> {noformat}
> LOAD DATA INPATH '/some/path/on/hdfs'
> {noformat}
> use
> {noformat}
> LOAD DATA INPATH 'hdfs://fizz.buzz.com:8020/some/path/on/hdfs'
> {noformat}






[jira] [Commented] (SPARK-23390) Flaky test: FileBasedDataSourceSuite

2018-10-10 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645399#comment-16645399
 ] 

Shixiong Zhu commented on SPARK-23390:
--

[~dongjoon] When Spark cancels a task, the task thread gets interrupted. I 
think this is what we need to test in Spark. I don't think Spark needs to have 
great test coverage for code inside third-party libraries. It's unlikely we 
can reproduce this issue consistently without changing the third-party 
libraries, since that would require cancelling a task while it's running some 
specific code in a third-party library.

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Critical
>
> *RECENT HISTORY*
> [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29]
>  
> 
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code:java}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/]
> From a very quick look, these failures seem to be correlated with 
> [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
> {code:java}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might be just a false correlation but the frequency of these 
> test failures have increased considerably in 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/]
>  after [https://github.com/apache/spark/pull/20562] (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code:java}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> 

[jira] [Commented] (SPARK-10816) EventTime based sessionization

2018-10-10 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645370#comment-16645370
 ] 

Shixiong Zhu commented on SPARK-10816:
--

Thanks a lot for the design docs and prototypes. I had a long discussion with 
[~tdas] and we think we should discuss other alternative approaches in the 
design doc.

We came up with 3 possible implementations:

[1] Put <key, sorted event list> into a state store. In each batch, for each 
key, scan the sorted event list and use the watermark to find out the 
finalized sessions and output them. (A small sketch of this per-key logic is 
shown below.)

[2] Put <(key, timestamp), event> into a state store. Here the key used in 
the state store is a tuple of the user key and the event timestamp. In each 
batch, sort each partition using (key, timestamp) and scan the whole sorted 
partition to find out the finalized sessions and output them.

[3] Use two state stores like what stream-stream join does. The first state 
store will store <key, number of events>, the second one will store 
<(key, index), event>. When we insert an event into the second state store, we 
should use insertion sort to make sure events are stored ordered by timestamp: 
find the proper index for the new event and update the indices of the events 
after it. Then we can just scan all keys and their values in the state store 
to find out the finalized sessions and output them.

[1] is easy to implement and can be done directly using 
`flatMapGroupsWithState`, but it may fail when a key has too many events. [2] 
and [3] will scale well, but the performance may be worse.

If I read the code correctly, [https://github.com/apache/spark/pull/22583] is 
[1]. [https://github.com/apache/spark/pull/22482] is a combination of [2] and 
[3], but it still needs to load all values of a key into memory at the same time.

[~kabhwan] [~XuanYuan] could you work together to update your design docs to 
add these alternative approaches and discuss their pros and cons? It would be 
great if you could put the design docs into a Google doc so that it's easy to 
leave comments.

In addition, it's better to also discuss compatibility: for example, if we 
decide to use a new approach to implement session windows but need to change 
the state format in the state store, do we have enough version information to 
identify the old and new formats?
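
To make approach [1] above more concrete, here is a small self-contained Scala sketch of the per-key logic (hypothetical Event/Session types; in a real implementation this would run inside `flatMapGroupsWithState` with the event list kept in the state store):

{code:scala}
case class Event(ts: Long)
case class Session(start: Long, end: Long, count: Int)

/** Splits one key's events into gap-based sessions and finalizes those behind the watermark. */
def finalizeSessions(
    stored: List[Event],
    incoming: Seq[Event],
    watermarkMs: Long,
    gapMs: Long): (Seq[Session], List[Event]) = {
  val all = (stored ++ incoming).sortBy(_.ts)
  // group consecutive events whose gap is <= gapMs into one session (head = latest event)
  val grouped = all.foldLeft(List.empty[List[Event]]) {
    case (Nil, e) => List(List(e))
    case ((cur @ (last :: _)) :: done, e) if e.ts - last.ts <= gapMs => (e :: cur) :: done
    case (done, e) => List(e) :: done
  }.map(_.reverse).reverse
  // a session is final once no event at or after the watermark can still extend it
  val (closed, open) = grouped.partition(s => s.last.ts + gapMs < watermarkMs)
  (closed.map(s => Session(s.head.ts, s.last.ts, s.size)), open.flatten)
}
{code}

Approaches [2] and [3] avoid materializing and re-sorting the whole per-key list like this sketch does, by keeping the events pre-sorted in the state store.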

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf, Session 
> Window Support For Structure Streaming.pdf
>
>







[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-10-09 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644162#comment-16644162
 ] 

Shixiong Zhu commented on SPARK-25692:
--

It may be caused by https://github.com/apache/spark/pull/22173

> Flaky test: ChunkFetchIntegrationSuite
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Updated] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-10-09 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25692:
-
Description: 
Looks like the whole test suite is pretty flaky. See: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/

This may be a regression in 3.0 as this didn't happen in 2.4 branch.

  was:
Looks like the whole test suite is pretty flaky. See: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/

This may be a regression in 2.4 as this didn't happen before.


> Flaky test: ChunkFetchIntegrationSuite
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Created] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-10-09 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25692:


 Summary: Flaky test: ChunkFetchIntegrationSuite
 Key: SPARK-25692
 URL: https://issues.apache.org/jira/browse/SPARK-25692
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


Looks like the whole test suite is pretty flaky. See: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/

This may be a regression in 2.4 as this didn't happen before.






[jira] [Commented] (SPARK-23390) Flaky test: FileBasedDataSourceSuite

2018-10-09 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644036#comment-16644036
 ] 

Shixiong Zhu commented on SPARK-23390:
--

I didn't look at parquet. It may have a similar issue.

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Critical
>
> *RECENT HISTORY*
> [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29]
>  
> 
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code:java}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/]
> From a very quick look, these failures seem to be correlated with 
> [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
> {code:java}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might be just a false correlation but the frequency of these 
> test failures have increased considerably in 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/]
>  after [https://github.com/apache/spark/pull/20562] (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code:java}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/]
>  (May 3rd)
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/331/]
>  (May 9th)
>  - 

[jira] [Commented] (SPARK-23390) Flaky test: FileBasedDataSourceSuite

2018-10-09 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644034#comment-16644034
 ] 

Shixiong Zhu commented on SPARK-23390:
--

I think the issue is probably in ORC. Any exception thrown between 
https://github.com/apache/orc/blob/b21b5ffcc1efcbd4aef337fa6faae4d25262f8f1/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L252
 and 
https://github.com/apache/orc/blob/b21b5ffcc1efcbd4aef337fa6faae4d25262f8f1/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L273
 will leak `dataReader`. For example, cancelling a Spark task may cause 
https://github.com/apache/orc/blob/b21b5ffcc1efcbd4aef337fa6faae4d25262f8f1/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L273
 to throw an exception.
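
A simplified Scala sketch of the leak pattern and the usual fix (hypothetical names, not the actual ORC code): whatever is opened first has to be closed if the rest of the constructor throws, e.g. because the task thread was interrupted.

{code:scala}
import java.io.Closeable

class RecordReaderSketch(openReader: () => Closeable, finishInit: () => Unit) {
  private val dataReader: Closeable = openReader() // opens the underlying file stream
  try {
    finishInit() // an interrupt-triggered exception here would otherwise leak dataReader
  } catch {
    case t: Throwable =>
      dataReader.close() // close on any constructor failure so the stream is not leaked
      throw t
  }
}
{code}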

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Critical
>
> *RECENT HISTORY*
> [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29]
>  
> 
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code:java}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/]
> From a very quick look, these failures seem to be correlated with 
> [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
> {code:java}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might be just a false correlation but the frequency of these 
> test failures have increased considerably in 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/]
>  after [https://github.com/apache/spark/pull/20562] (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code:java}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> 

[jira] [Resolved] (SPARK-25644) Fix java foreachBatch API

2018-10-05 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-25644.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

> Fix java foreachBatch API
> -
>
> Key: SPARK-25644
> URL: https://issues.apache.org/jira/browse/SPARK-25644
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Fix For: 2.4.0
>
>
> The java foreachBatch API in DataStreamWriter should accept java.lang.Long 
> rather than scala.Long. It's better to fix the new API before the release gets 
> out, so I marked this ticket as a blocker.






[jira] [Updated] (SPARK-25644) Fix java foreachBatch API

2018-10-04 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25644:
-
Target Version/s: 2.4.0

> Fix java foreachBatch API
> -
>
> Key: SPARK-25644
> URL: https://issues.apache.org/jira/browse/SPARK-25644
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> The java foreachBatch API in DataStreamWriter should accept java.lang.Long 
> rather than scala.Long. It's better to fix the new API before the release gets 
> out, so I marked this ticket as a blocker.






[jira] [Created] (SPARK-25644) Fix java foreachBatch API

2018-10-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25644:


 Summary: Fix java foreachBatch API
 Key: SPARK-25644
 URL: https://issues.apache.org/jira/browse/SPARK-25644
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


The java foreachBatch API in DataStreamWriter should accept java.lang.Long 
rather than scala.Long.
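
The shape of the intended fix, as a sketch (signatures are illustrative, not the exact Spark declarations): a Java-friendly overload takes org.apache.spark.api.java.function.VoidFunction2 with a boxed java.lang.Long, alongside the Scala overload.

{code:scala}
import org.apache.spark.api.java.function.VoidFunction2
import org.apache.spark.sql.Dataset

trait BatchWriterSketch[T] {
  // Scala users: (batch, batchId) => Unit
  def foreachBatch(function: (Dataset[T], Long) => Unit): this.type

  // Java users: boxed java.lang.Long instead of scala.Long
  def foreachBatch(function: VoidFunction2[Dataset[T], java.lang.Long]): this.type
}
{code}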






[jira] [Updated] (SPARK-25644) Fix java foreachBatch API

2018-10-04 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25644:
-
Description: The java foreachBatch API in DataStreamWriter should accept 
java.lang.Long rather than scala.Long. It's better to fix the new API before the 
release gets out, so I marked this ticket as a blocker.  (was: The java 
foreachBatch API in DataStreamWriter should accept java.lang.Long rather 
scala.Long.)

> Fix java foreachBatch API
> -
>
> Key: SPARK-25644
> URL: https://issues.apache.org/jira/browse/SPARK-25644
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> The java foreachBatch API in DataStreamWriter should accept java.lang.Long 
> rather than scala.Long. It's better to fix the new API before the release gets 
> out, so I marked this ticket as a blocker.






[jira] [Commented] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)

2018-10-03 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637564#comment-16637564
 ] 

Shixiong Zhu commented on SPARK-25005:
--

[~qambard] I'm not sure about your question. If the Kafka consumer fetches 
nothing, it will not update the position.

And yes, if a partition is full of invisible messages, we have to wait for the 
timeout. I don't see any API to avoid this.

> Structured streaming doesn't support kafka transaction (creating empty offset 
> with abort & markers)
> ---
>
> Key: SPARK-25005
> URL: https://issues.apache.org/jira/browse/SPARK-25005
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.0
>
>
> Structured streaming can't consume kafka transaction. 
> We could try to apply SPARK-24720 (DStream) logic to Structured Streaming 
> source






[jira] [Commented] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)

2018-10-03 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637541#comment-16637541
 ] 

Shixiong Zhu commented on SPARK-25005:
--

[~qambard] If `poll` returns and the offset has changed, it means the Kafka 
consumer fetched something but all of the messages are invisible, so the 
consumer returns nothing.

If `poll` returns but the offset doesn't change, it means Kafka fetched nothing 
before the timeout. In this case, we just throw "TimeoutException". Spark will 
retry the task or just fail the job. A large GC pause can cause a timeout, and 
the user should tune the configs to avoid this. We cannot do much in Spark.
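
A sketch of that distinction using the Kafka consumer API directly (it assumes a consumer that is already assigned and positioned on `tp`; this is not the actual Spark KafkaDataConsumer code):

{code:scala}
import java.time.Duration
import java.util.concurrent.TimeoutException
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

def pollOnce(
    consumer: KafkaConsumer[Array[Byte], Array[Byte]],
    tp: TopicPartition,
    pollTimeoutMs: Long): Unit = {
  val before = consumer.position(tp)
  val records = consumer.poll(Duration.ofMillis(pollTimeoutMs)).records(tp)
  val after = consumer.position(tp)
  if (!records.isEmpty) {
    // visible records were fetched: process them
  } else if (after > before) {
    // something was fetched but all of it is invisible (e.g. aborted transactional
    // messages): it is safe to advance to `after` and continue
  } else {
    // nothing was fetched before the timeout: fail (or retry) the task
    throw new TimeoutException(s"Cannot fetch records for $tp within $pollTimeoutMs ms")
  }
}
{code}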

> Structured streaming doesn't support kafka transaction (creating empty offset 
> with abort & markers)
> ---
>
> Key: SPARK-25005
> URL: https://issues.apache.org/jira/browse/SPARK-25005
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.0
>
>
> Structured streaming can't consume kafka transaction. 
> We could try to apply SPARK-24720 (DStream) logic to Structured Streaming 
> source






[jira] [Resolved] (SPARK-25315) setting "auto.offset.reset" to "earliest" has no effect in Structured Streaming with Spark 2.3.1 and Kafka 1.0

2018-10-01 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-25315.
--
Resolution: Not A Bug

> setting "auto.offset.reset" to "earliest" has no effect in Structured 
> Streaming with Spark 2.3.1 and Kafka 1.0
> --
>
> Key: SPARK-25315
> URL: https://issues.apache.org/jira/browse/SPARK-25315
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
> Environment: Standalone; running in IDEA
>Reporter: Zhenhao Li
>Priority: Major
>
> The following code won't read from the beginning of the topic
> ```
> {code:java}
> val kafkaOptions = Map[String, String](
>  "kafka.bootstrap.servers" -> KAFKA_BOOTSTRAP_SERVERS,
>  "subscribe" -> TOPIC,
>  "group.id" -> GROUP_ID,
>  "auto.offset.reset" -> "earliest"
> )
> val myStream = sparkSession
> .readStream
> .format("kafka")
> .options(kafkaOptions)
> .load()
> .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
>   myStream
> .writeStream
> .format("console")
> .start()
> .awaitTermination()
> {code}
> ```






[jira] [Commented] (SPARK-25315) setting "auto.offset.reset" to "earliest" has no effect in Structured Streaming with Spark 2.3.1 and Kafka 1.0

2018-10-01 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634423#comment-16634423
 ] 

Shixiong Zhu commented on SPARK-25315:
--

Kafka’s own configurations should be set with the "kafka." prefix; "group.id" and 
"auto.offset.reset" without that prefix are simply ignored.

In addition, after you add the "kafka." prefix, you will see error messages, because 
"group.id" and "auto.offset.reset" are not supported by this source. They are documented here: 
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations
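
For reference, a sketch of the reader options from the snippet quoted below, rewritten accordingly: the source option "startingOffsets" replaces "auto.offset.reset", and the unsupported "group.id" is dropped.

{code:scala}
val kafkaOptions = Map[String, String](
  "kafka.bootstrap.servers" -> KAFKA_BOOTSTRAP_SERVERS, // Kafka config: needs the "kafka." prefix
  "subscribe" -> TOPIC,                                 // source option, no prefix
  "startingOffsets" -> "earliest"                       // source option replacing "auto.offset.reset"
)
{code}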

> setting "auto.offset.reset" to "earliest" has no effect in Structured 
> Streaming with Spark 2.3.1 and Kafka 1.0
> --
>
> Key: SPARK-25315
> URL: https://issues.apache.org/jira/browse/SPARK-25315
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
> Environment: Standalone; running in IDEA
>Reporter: Zhenhao Li
>Priority: Major
>
> The following code won't read from the beginning of the topic
> ```
> {code:java}
> val kafkaOptions = Map[String, String](
>  "kafka.bootstrap.servers" -> KAFKA_BOOTSTRAP_SERVERS,
>  "subscribe" -> TOPIC,
>  "group.id" -> GROUP_ID,
>  "auto.offset.reset" -> "earliest"
> )
> val myStream = sparkSession
> .readStream
> .format("kafka")
> .options(kafkaOptions)
> .load()
> .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
>   myStream
> .writeStream
> .format("console")
> .start()
> .awaitTermination()
> {code}
> ```






[jira] [Resolved] (SPARK-25449) Don't send zero accumulators in heartbeats

2018-09-28 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-25449.
--
   Resolution: Fixed
 Assignee: Mukul Murthy
Fix Version/s: 2.5.0

> Don't send zero accumulators in heartbeats
> --
>
> Key: SPARK-25449
> URL: https://issues.apache.org/jira/browse/SPARK-25449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
> Fix For: 2.5.0
>
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics 
> and are generally on the order of a few KBs. However, for large jobs with 
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks 
> to die with heartbeat failures. We can mitigate this by not sending zero 
> metrics to the driver.
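
A minimal sketch of the mitigation (illustration only, not the actual 
Heartbeater code; the surrounding executor plumbing is omitted):

{code:java}
import org.apache.spark.util.AccumulatorV2

// Drop zero-valued accumulators before packing updates into a heartbeat.
// AccumulatorV2.isZero is true when nothing has been added to the accumulator.
def nonZeroUpdates(updates: Seq[AccumulatorV2[_, _]]): Seq[AccumulatorV2[_, _]] =
  updates.filterNot(_.isZero)
{code}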



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25569) Failing a Spark job when an accumulator cannot be updated

2018-09-28 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25569:


 Summary: Failing a Spark job when an accumulator cannot be updated
 Key: SPARK-25569
 URL: https://issues.apache.org/jira/browse/SPARK-25569
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


Currently, when Spark fails to merge accumulator updates from a task, it does 
not fail the task (see 
https://github.com/apache/spark/blob/b7d80349b0e367d78cab238e62c2ec353f0f12b3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1266).
As a result, an accumulator update failure may be ignored silently. Some users 
may want to use accumulators for business-critical logic and would like to fail 
the job when an accumulator is broken.

We can add a flag to always fail a Spark job when hitting an accumulator 
failure, or we can add a new property to an accumulator and only fail the job 
when such an accumulator fails.
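
To illustrate how an accumulator update can fail, here is a sketch of a 
user-defined accumulator whose merge can throw (the validation rule is made up 
for the example; today such a failure would be swallowed by the scheduler):

{code:java}
import org.apache.spark.util.AccumulatorV2

// A sum accumulator that rejects negative partial sums when merging.
class PositiveSumAccumulator extends AccumulatorV2[Long, Long] {
  private var sum = 0L
  override def isZero: Boolean = sum == 0L
  override def copy(): PositiveSumAccumulator = {
    val acc = new PositiveSumAccumulator
    acc.sum = sum
    acc
  }
  override def reset(): Unit = sum = 0L
  override def add(v: Long): Unit = sum += v
  override def merge(other: AccumulatorV2[Long, Long]): Unit = other match {
    case o: PositiveSumAccumulator =>
      // This check can fail on the driver while merging task updates.
      require(o.sum >= 0, "negative partial sum")
      sum += o.sum
    case _ =>
      throw new UnsupportedOperationException(s"Cannot merge ${other.getClass}")
  }
  override def value: Long = sum
}
{code}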



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator

2018-09-28 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25568:


 Summary: Continue to update the remaining accumulators when 
failing to update one accumulator
 Key: SPARK-25568
 URL: https://issues.apache.org/jira/browse/SPARK-25568
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.2, 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Currently, when DAGScheduler.updateAccumulators fails to update one 
accumulator, it skips the remaining accumulators. We should try to update the 
remaining accumulators if possible.
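
A sketch of the intended behavior (not the actual DAGScheduler code; names and 
logging are placeholders): handle each update independently so one bad 
accumulator does not prevent the others from being updated.

{code:java}
import scala.util.control.NonFatal
import org.apache.spark.util.AccumulatorV2

// Merge each task-side update into its driver-side accumulator, continuing
// past individual failures instead of aborting the whole loop.
def mergeAll(pairs: Seq[(AccumulatorV2[Any, Any], AccumulatorV2[Any, Any])]): Unit = {
  pairs.foreach { case (driverAcc, taskUpdate) =>
    try {
      driverAcc.merge(taskUpdate)
    } catch {
      case NonFatal(e) =>
        // Log and keep going; the remaining accumulators still get their updates.
        println(s"Failed to merge an accumulator update: $e")
    }
  }
}
{code}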



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator

2018-09-28 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25568:
-
Description: 
Currently when failing to update an accumulator, 
DAGScheduler.updateAccumulators will skip the remaining accumulators. We should 
try to update the remaining accumulators if possible so that they can still 
report correct values.


  was:Currently when failing to update an accumulator, 
DAGScheduler.updateAccumulators will skip the remaining accumulators. We should 
try to update the remaining accumulators if possible.


> Continue to update the remaining accumulators when failing to update one 
> accumulator
> 
>
> Key: SPARK-25568
> URL: https://issues.apache.org/jira/browse/SPARK-25568
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Currently when failing to update an accumulator, 
> DAGScheduler.updateAccumulators will skip the remaining accumulators. We 
> should try to update the remaining accumulators if possible so that they can 
> still report correct values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25495) FetchedData.reset doesn't reset _nextOffsetInFetchedData and _offsetAfterPoll

2018-09-25 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-25495.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

> FetchedData.reset doesn't reset _nextOffsetInFetchedData and _offsetAfterPoll
> -
>
> Key: SPARK-25495
> URL: https://issues.apache.org/jira/browse/SPARK-25495
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.0
>
>
> FetchedData.reset doesn't reset _nextOffsetInFetchedData and _offsetAfterPoll 
> and causes inconsistent cached data and may make Kafka connector return wrong 
> results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25495) FetchedData.reset doesn't reset _nextOffsetInFetchedData and _offsetAfterPoll

2018-09-20 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25495:


 Summary: FetchedData.reset doesn't reset _nextOffsetInFetchedData 
and _offsetAfterPoll
 Key: SPARK-25495
 URL: https://issues.apache.org/jira/browse/SPARK-25495
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


FetchedData.reset doesn't reset _nextOffsetInFetchedData and _offsetAfterPoll. 
This leaves the cached data inconsistent and may make the Kafka connector 
return wrong results.
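
A simplified sketch of the invariant (illustration only; the field names follow 
the report, but the class shape is an assumption, not the actual kafka010 
internals): resetting the cached records must also reset the offset bookkeeping 
that describes them.

{code:java}
// Simplified stand-in for the cached fetch state of a Kafka consumer.
class FetchedDataSketch {
  private val UnknownOffset = -2L
  private var records: Iterator[Array[Byte]] = Iterator.empty
  private var nextOffsetInFetchedData: Long = UnknownOffset
  private var offsetAfterPoll: Long = UnknownOffset

  def reset(): Unit = {
    records = Iterator.empty
    // The bug: forgetting the next two lines leaves stale offsets describing
    // an empty cache, so later fetches can serve inconsistent data.
    nextOffsetInFetchedData = UnknownOffset
    offsetAfterPoll = UnknownOffset
  }
}
{code}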



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25449) Don't send zero accumulators in heartbeats

2018-09-19 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25449:
-
Issue Type: Improvement  (was: Task)

> Don't send zero accumulators in heartbeats
> --
>
> Key: SPARK-25449
> URL: https://issues.apache.org/jira/browse/SPARK-25449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics 
> and are generally on the order of a few KBs. However, for large jobs with 
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks 
> to die with heartbeat failures. We can mitigate this by not sending zero 
> metrics to the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25449) Don't send zero accumulators in heartbeats

2018-09-19 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25449:
-
Target Version/s:   (was: 2.5.0)

> Don't send zero accumulators in heartbeats
> --
>
> Key: SPARK-25449
> URL: https://issues.apache.org/jira/browse/SPARK-25449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics 
> and are generally on the order of a few KBs. However, for large jobs with 
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks 
> to die with heartbeat failures. We can mitigate this by not sending zero 
> metrics to the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19903) Watermark metadata is lost when using resolved attributes

2018-09-11 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19903:
-
Target Version/s:   (was: 2.4.0)

> Watermark metadata is lost when using resolved attributes
> -
>
> Key: SPARK-19903
> URL: https://issues.apache.org/jira/browse/SPARK-19903
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Ubuntu Linux
>Reporter: Piotr Nestorow
>Priority: Major
>
> The PySpark example below reads a Kafka stream, applies a watermark to a 
> windowed aggregation, and writes with the Append output mode. The query fails 
> with the error:
> 'Append output mode not supported when there are streaming aggregations on 
> streaming DataFrames/DataSets'
> The Python example:
> ---
> {code}
> import sys
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode, split, window
> if __name__ == "__main__":
> if len(sys.argv) != 4:
> print("""
> Usage: structured_kafka_wordcount.py  
>  
> """, file=sys.stderr)
> exit(-1)
> bootstrapServers = sys.argv[1]
> subscribeType = sys.argv[2]
> topics = sys.argv[3]
> spark = SparkSession\
> .builder\
> .appName("StructuredKafkaWordCount")\
> .getOrCreate()
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", bootstrapServers)\
> .option(subscribeType, topics)\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array 
> into multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> {code}
> The corresponding example in Zeppelin notebook:
> {code}
> %spark.pyspark
> from pyspark.sql.functions import explode, split, window
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "words")\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array into 
> multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> {code}
> --
> Note that the Scala version of the same example in Zeppelin notebook works 
> fine:
> 
> import java.sql.Timestamp
> import org.apache.spark.sql.streaming.ProcessingTime
> import org.apache.spark.sql.functions._
> // Create DataSet representing the stream of input lines from kafka
> val lines = spark
> .readStream
> .format("kafka")
> .option("kafka.bootstrap.servers", "localhost:9092")
> .option("subscribe", "words")
> .load()
> // Split the lines into words, retaining timestamps
> val words = lines
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS 
> TIMESTAMP)")
> .as[(String, Timestamp)]
> .flatMap(line => line._1.split(" ").map(word => (word, line._2)))
> .toDF("word", "timestamp")
> // Group the data by 
