[jira] [Comment Edited] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-28 Thread Nathan Kleyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417519#comment-16417519
 ] 

Nathan Kleyn edited comment on SPARK-23801 at 3/28/18 3:14 PM:
---

[~kiszk] Sure, we'll have our best go at trying to narrow down the job to 
something that fails - many thanks for the speedy reply!

Since upgrading to Spark v2.3.0 a lot of the stages are listed as 
"ThreadPoolExecutor" now, meaning it's really difficult to actually see what 
they correspond to in the source. As this job is failing around the 60th stage, 
that's a lot of things it could be! Is there any advice on the best way to 
correlate these "ThreadPoolExecutor" stages with the source code now?


was (Author: nathankleyn):
[~kiszk] We'll have our best go at trying to narrow down the job to something 
that fails. Since upgrading to Spark v2.3.0 a lot of the stages are listed as 
"ThreadPoolExecutor" now, meaning it's really difficult to actually see what 
they correspond to in the source. As this job is failing around the 60th stage, 
that's a lot of things it could be! Is there any advice on the best way to 
correlate these "ThreadPoolExecutor" stages with the source code now?

> Consistent SIGSEGV after upgrading to Spark v2.3.0
> --
>
> Key: SPARK-23801
> URL: https://issues.apache.org/jira/browse/SPARK-23801
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.3.0
> Environment: Mesos coarse grained executor
> 18 * r3.4xlarge (16 core boxes) with 105G of executor memory
> Reporter: Nathan Kleyn
> Priority: Major
> Attachments: spark-executor-failure.coredump.log
>
>
> After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
> segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of 
> executor memory). I've attached the full coredump but here is an excerpt:
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 1.8.0_161-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x995fdc]  oopDesc* PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
> #
> # Core dump written. Default location: /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core or core.1315
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #{code}
> {code:java}
> ---  T H R E A D  ---
> Current thread (0x7f146005b000):  GCTaskThread [stack: 0x7f1464e2d000,0x7f1464f2e000] [id=1363]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x
> Registers:
> RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, RDX=0x
> RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, RDI=0x7ef7bc30bda8
> R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, R11=0x7f14671240e0
> R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, R15=0x000d
> RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, ERR=0x
>   TRAPNO=0x000d
> Top of Stack: (sp=0x7f1464f2c1a0)
> 0x7f1464f2c1a0:   7f146005b000 0001
> 0x7f1464f2c1b0:   0004 7f14600bb640
> 0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
> 0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
> 0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
> 0x7f1464f2c1f0:   7ef8a80a7060 1741
> 0x7f1464f2c200:   0002 
> 0x7f1464f2c210:   7f1464f2c230 7f146742b005
> 0x7f1464f2c220:   7ef8a80a7050 1741
> 0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
> 0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
> 0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
> 0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
> 0x7f1464f2c270:   7ef8b843d7c8 00020006
> 0x7f1464f2c280:   7f1464f2c340 7f14600bb640
> 0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
> 0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
> 0x7f1464f2c2b0:   0001 
> 0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
> 0x7f1464f2c2d0:   7f1464f2c340 7f1467025

[jira] [Commented] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-28 Thread Nathan Kleyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417519#comment-16417519
 ] 

Nathan Kleyn commented on SPARK-23801:
--

[~kiszk] We'll have our best go at trying to narrow down the job to something 
that fails. Since upgrading to Spark v2.3.0 a lot of the stages are listed as 
"ThreadPoolExecutor" now, meaning it's really difficult to actually see what 
they correspond to in the source. As this job is failing around the 60th stage, 
that's a lot of things it could be! Is there any advice on the best way to 
correlate these "ThreadPoolExecutor" stages with the source code now?


[jira] [Updated] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Nathan Kleyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Kleyn updated SPARK-23801:
-
Description: 
After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of 
executor memory). I've attached the full coredump but here is an excerpt:
{code:java}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 1.8.0_161-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0x995fdc]  oopDesc* PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
#
# Core dump written. Default location: /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core or core.1315
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#{code}
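When comparing crashes across executors, it can help to pull the key fields out of the header above mechanically. The following is a minimal sketch only, assuming the header has the layout shown in this excerpt with the mail line-wrapping undone; the function name `summarize_crash` and the regular expressions are illustrative and not part of any JVM or Spark tooling.

```python
import re

# Sample lines copied from the excerpt above, with the mail wrapping undone.
HS_ERR_HEADER = """\
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
#
# Problematic frame:
# V  [libjvm.so+0x995fdc]  oopDesc* PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
#
"""

# Matches the "SIGxxx (0x..) at pc=..., pid=..., tid=..." line.
SIGNAL_RE = re.compile(
    r"^#[ \t]+(?P<signal>SIG[A-Z]+) \((?P<signo>0x[0-9a-f]+)\) "
    r"at pc=(?P<pc>0x[0-9a-f]+), pid=(?P<pid>\d+), tid=(?P<tid>0x[0-9a-f]+)",
    re.MULTILINE,
)

# Matches the line after "Problematic frame:", e.g. "# V  [lib+0xoff]  symbol".
FRAME_RE = re.compile(
    r"^# (?P<kind>[A-Za-z])[ \t]+\[(?P<lib>[^+\]]+)\+(?P<offset>0x[0-9a-f]+)\]"
    r"[ \t]+(?P<symbol>.+)$",
    re.MULTILINE,
)


def summarize_crash(header: str) -> dict:
    """Return the signal, pc, pid, tid and problematic frame found in a header.

    Fields that cannot be found are simply left out of the result.
    """
    summary = {}
    signal_match = SIGNAL_RE.search(header)
    if signal_match:
        summary.update(signal_match.groupdict())
        summary["pid"] = int(summary["pid"])
    frame_match = FRAME_RE.search(header)
    if frame_match:
        # In HotSpot crash logs, frame kind "V" denotes VM code.
        summary["frame_kind"] = frame_match.group("kind")
        summary["lib"] = frame_match.group("lib")
        summary["offset"] = frame_match.group("offset")
        summary["symbol"] = frame_match.group("symbol").strip()
    return summary


if __name__ == "__main__":
    for key, value in summarize_crash(HS_ERR_HEADER).items():
        print(f"{key}: {value}")
```

Running the same function over the crash headers from several executors would show whether they all fail in the same `libjvm.so` frame.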
{code:java}
---  T H R E A D  ---

Current thread (0x7f146005b000):  GCTaskThread [stack: 0x7f1464e2d000,0x7f1464f2e000] [id=1363]

siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x

Registers:
RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, RDX=0x
RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, RDI=0x7ef7bc30bda8
R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, R11=0x7f14671240e0
R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, R15=0x000d
RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, ERR=0x
  TRAPNO=0x000d

Top of Stack: (sp=0x7f1464f2c1a0)
0x7f1464f2c1a0:   7f146005b000 0001
0x7f1464f2c1b0:   0004 7f14600bb640
0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
0x7f1464f2c1f0:   7ef8a80a7060 1741
0x7f1464f2c200:   0002 
0x7f1464f2c210:   7f1464f2c230 7f146742b005
0x7f1464f2c220:   7ef8a80a7050 1741
0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
0x7f1464f2c270:   7ef8b843d7c8 00020006
0x7f1464f2c280:   7f1464f2c340 7f14600bb640
0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
0x7f1464f2c2b0:   0001 
0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
0x7f1464f2c2d0:   7f1464f2c340 7f1467025f22
0x7f1464f2c2e0:   7f145427cb5c 7f1464f2c3a0
0x7f1464f2c2f0:   7f1464f2c370 7f146005b000
0x7f1464f2c300:   7f1464f2c9f0 7ef850009800
0x7f1464f2c310:   7f1464f2c9f0 7f1464f2c3a0
0x7f1464f2c320:   7f1464f2c3d0 7f146005b000
0x7f1464f2c330:   7f1464f2c9f0 7ef850009800
0x7f1464f2c340:   7f1464f2c9c0 7f1467508191
0x7f1464f2c350:   7ef9c16f7890 7f1464f2c370
0x7f1464f2c360:   7f1464f2c9d0 
0x7f1464f2c370:   7ef9c035f8c0 7f145427cb5c
0x7f1464f2c380:   7f145427ba90 7ef9
0x7f1464f2c390:   0078 7ef9c035f8c0 

Instructions: (pc=0x7f1467427fdc)
0x7f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
0x7f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
0x7f1467427fdc:   48 8b 00 48 c1 e8 03 89 c2 48 8b 05 04 74 5e 00
0x7f1467427fec:   83 e2 0f 3b 10 0f 82 fd 00 00 00 48 8b 45 a8 4e 

Register to memory mapping:

RAX=0x17e907feccbc6d20 is an unknown value
RBX=0x7ef9c035f8c8 is pointing into the stack for thread: 0x7ef850009800
RCX=0x7f1464f2c9f0 is an unknown value
RDX=0x is an unknown value
RSP=0x7f1464f2c1a0 is an unknown value
RBP=0x7f1464f2c210 is an unknown value
RSI=0x0068 is an unknown value
RDI=0x7ef7bc30bda8 is pointing into metadata
R8 =0x7f1464f2c3d0 is an unknown value
R9 =0x1741 is an unknown value
R10=0x7f1467a52819:  in /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so at 0x7f1466a92000
R11=0x7f14671240e0:  in /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so at 0x7f1466a92000
R12=0x7f130912c998 is an oop
org.apache.spark.sql.catalyst.expressions
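
One possible reading of the register dump above: the bytes at the faulting pc (`48 8b 00`) decode on x86-64 as a load through RAX, and RAX (equal to R13 here) holds 0x17e907feccbc6d20. On hardware with 48-bit virtual addressing that value is not a canonical address, which would be consistent with the reported TRAPNO of 0xd (general protection fault) and si_code SI_KERNEL. The sketch below only checks canonicality; the 48-bit assumption, the helper name `is_canonical`, and the choice of registers are illustrative and were not part of the original report.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if a 64-bit address is canonical for `va_bits` of addressing.

    An address is canonical when bits (va_bits - 1) through 63 are all equal,
    that is, all zeros (user half) or all ones (kernel half).
    """
    top = addr >> (va_bits - 1)               # bits 47..63 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits == 48
    return top == 0 or top == all_ones


# Register values copied from the dump above (only those printed in full).
REGISTERS = {
    "RAX": 0x17E907FECCBC6D20,
    "R13": 0x17E907FECCBC6D20,
    "RCX": 0x7F1464F2C9F0,
    "R12": 0x7F130912C998,
}

if __name__ == "__main__":
    for name, value in REGISTERS.items():
        kind = "canonical" if is_canonical(value) else "NOT canonical"
        print(f"{name}={value:#018x} is {kind}")
```

If this reading is right, the collector was following a pointer that had already been corrupted, so the frame in `copy_to_survivor_space` is where the damage was noticed rather than necessarily where it was caused.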

[jira] [Updated] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Nathan Kleyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Kleyn updated SPARK-23801:
-
Environment: 
Mesos coarse grained executor
18 * r3.4xlarge (16 core boxes) with 105G of executor memory

  was:18 * r3.4xlarge (16 core boxes) with 105G of executor memory



[jira] [Updated] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Nathan Kleyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Kleyn updated SPARK-23801:
-
Environment: 18 * r3.4xlarge (16 core boxes) with 105G of executor memory


[jira] [Created] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Nathan Kleyn (JIRA)
Nathan Kleyn created SPARK-23801:


Summary: Consistent SIGSEGV after upgrading to Spark v2.3.0
Key: SPARK-23801
URL: https://issues.apache.org/jira/browse/SPARK-23801
Project: Spark
Issue Type: Bug
Components: Spark Core, SQL
Affects Versions: 2.3.0
Reporter: Nathan Kleyn
Attachments: spark-executor-failure.coredump.log

After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of 
executor memory). I've attached the full coredump but here is an excerpt:


{code:java}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 1.8.0_161-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0x995fdc]  oopDesc* PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
#
# Core dump written. Default location: /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core or core.1315
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#{code}
{code:java}
---  T H R E A D  ---

Current thread (0x7f146005b000):  GCTaskThread [stack: 0x7f1464e2d000,0x7f1464f2e000] [id=1363]

siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x

Registers:
RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, RDX=0x
RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, RDI=0x7ef7bc30bda8
R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, R11=0x7f14671240e0
R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, R15=0x000d
RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, ERR=0x
  TRAPNO=0x000d

Top of Stack: (sp=0x7f1464f2c1a0)
0x7f1464f2c1a0:   7f146005b000 0001
0x7f1464f2c1b0:   0004 7f14600bb640
0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
0x7f1464f2c1f0:   7ef8a80a7060 1741
0x7f1464f2c200:   0002 
0x7f1464f2c210:   7f1464f2c230 7f146742b005
0x7f1464f2c220:   7ef8a80a7050 1741
0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
0x7f1464f2c270:   7ef8b843d7c8 00020006
0x7f1464f2c280:   7f1464f2c340 7f14600bb640
0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
0x7f1464f2c2b0:   0001 
0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
0x7f1464f2c2d0:   7f1464f2c340 7f1467025f22
0x7f1464f2c2e0:   7f145427cb5c 7f1464f2c3a0
0x7f1464f2c2f0:   7f1464f2c370 7f146005b000
0x7f1464f2c300:   7f1464f2c9f0 7ef850009800
0x7f1464f2c310:   7f1464f2c9f0 7f1464f2c3a0
0x7f1464f2c320:   7f1464f2c3d0 7f146005b000
0x7f1464f2c330:   7f1464f2c9f0 7ef850009800
0x7f1464f2c340:   7f1464f2c9c0 7f1467508191
0x7f1464f2c350:   7ef9c16f7890 7f1464f2c370
0x7f1464f2c360:   7f1464f2c9d0 
0x7f1464f2c370:   7ef9c035f8c0 7f145427cb5c
0x7f1464f2c380:   7f145427ba90 7ef9
0x7f1464f2c390:   0078 7ef9c035f8c0 

Instructions: (pc=0x7f1467427fdc)
0x7f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
0x7f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
0x7f1467427fdc:   48 8b 00 48 c1 e8 03 89 c2 48 8b 05 04 74 5e 00
0x7f1467427fec:   83 e2 0f 3b 10 0f 82 fd 00 00 00 48 8b 45 a8 4e 

Register to memory mapping:

RAX=0x17e907feccbc6d20 is an unknown value
RBX=0x7ef9c035f8c8 is pointing into the stack for thread: 0x7ef850009800
RCX=0x7f1464f2c9f0 is an unknown value
RDX=0x is an unknown value
RSP=0x7f1464f2c1a0 is an unknown value
RBP=0x7f1464f2c210 is an unknown value
RSI=0x0068 is an unknown value
RDI=0x7ef7bc30bda8 is pointing into metadata
R8 =0x7f1464f2c3d0 is an unknown value
R9 =0x1741 is an unknown value
R10=0x7f1467a52819:  in 
/usr/lib/jv

[jira] [Updated] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Nathan Kleyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Kleyn updated SPARK-23801:
-
Attachment: spark-executor-failure.coredump.log
