[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop when 
calling {{transferTo}}. What we saw is that, while merging shuffle temp files, 
the task hung for several hours until it was killed manually. As the log below 
shows, nothing is logged after the shuffle data is spilled to disk, but the 
executor is still alive.

 !95330.png! 

Here is the thread dump. It shows that the thread is always calling the native 
method {{size0}}.

 !91ADA.png! 

We used strace to trace the system calls and found that this thread is always 
calling {{fstat}}, and the system usage is pretty high. Here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause. I guess it might be related to a file system or 
disk issue. In any case, we should figure out a way to fail fast in such a scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until it is killed manually. Here's 
the log you can see, there's no any log after spilling the shuffle data to 
disk, but the executor is still alive.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -----------
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.1
> Reporter: Saisai Shao
> Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
> when calling {{transferTo}}. What we saw is that, while merging shuffle temp 
> files, the task hung for several hours until it was killed manually. As the 
> log below shows, nothing is logged after the shuffle data is spilled to disk, 
> but the executor is still alive.
>  !95330.png! 
> Here is the thread dump. It shows that the thread is always calling the native 
> method {{size0}}.
>  !91ADA.png! 
> We used strace to trace the system calls and found that this thread is always 
> calling {{fstat}}, and the system usage is pretty high. Here is the 
> screenshot. 
>  !D18F4.png! 
> We didn't find the root cause. I guess it might be related to a file system or 
> disk issue. In any case, we should figure out a way to fail fast in such a 
> scenario.
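
One way to fail fast can be sketched as follows. This is only an illustration under stated assumptions, not Spark's actual code or fix: it assumes the hang comes from a copy loop that keeps calling {{FileChannel.transferTo}} while the call transfers zero bytes, which would be consistent with the repeated {{size0}} / {{fstat}} calls in the thread dump, since {{transferTo}} checks the file size on every call. The class name {{FailFastTransfer}}, the method {{copyOrFail}}, and the threshold {{MAX_STALLED_CALLS}} are hypothetical names introduced for this example.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FailFastTransfer {
    // Hypothetical threshold: how many consecutive zero-byte transferTo
    // calls are tolerated before the copy is aborted.
    static final int MAX_STALLED_CALLS = 16;

    /**
     * Copies bytesToCopy bytes from in, starting at startPosition, to out.
     * Throws an IOException instead of looping forever when transferTo
     * repeatedly makes no progress.
     */
    static long copyOrFail(FileChannel in, long startPosition, long bytesToCopy,
                           WritableByteChannel out) throws IOException {
        long copied = 0L;
        int stalledCalls = 0;
        while (copied < bytesToCopy) {
            long n = in.transferTo(startPosition + copied, bytesToCopy - copied, out);
            if (n > 0) {
                copied += n;
                stalledCalls = 0; // progress was made, reset the stall counter
            } else if (++stalledCalls >= MAX_STALLED_CALLS) {
                throw new IOException("transferTo made no progress after " + stalledCalls
                    + " calls: copied " + copied + " of " + bytesToCopy
                    + " bytes, source size is " + in.size());
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("spill", ".bin");
        Path dst = Files.createTempFile("merged", ".bin");
        Files.write(src, new byte[10]); // a 10-byte stand-in for a spill file
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            // The requested length matches the file, so the copy completes.
            System.out.println("copied " + copyOrFail(in, 0, 10, out) + " bytes");
            try {
                // More bytes are requested than the file holds, so transferTo
                // eventually returns 0 on every call and the guard trips.
                copyOrFail(in, 0, 20, out);
            } catch (IOException e) {
                System.out.println("failed fast: " + e.getMessage());
            }
        } finally {
            Files.deleteIfExists(src);
            Files.deleteIfExists(dst);
        }
    }
}
```

The guard counts consecutive stalled calls rather than failing on the first zero, because a single zero-byte return from {{transferTo}} is not necessarily an error. The threshold is a judgment call and would need tuning for a real workload.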



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until it is killed manually. Here's 
the log you can see, there's no any log after spilling the shuffle data to 
disk, but the executor is still alive.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until it is killed manually. Here's 
the log you can see, there's no any log after spill the shuffle data to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.








[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until it is killed manually. Here's 
the log you can see, there's no any log after spill the shuffle data to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.








[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.








[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.








[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk for 
several hours.

 !95330.png! 

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.








[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Attachment: D18F4.png
95330.png
91ADA.png







[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk for 
several hours.

 !95330.png! 

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk for 
several hours.

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.




