[ https://issues.apache.org/jira/browse/YARN-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346985#comment-16346985 ]
Miklos Szegedi commented on YARN-7857:
--------------------------------------
Thank you, [~Jim_Brennan], for raising this. You are right that it is not a
simple stack overflow that causes YARN-7796. However, I looked into it, and
the YARN-7796 patch might still be the right fix.
The article you mentioned above is fairly high-level and does not go into
much detail. I reproduced your issue with the exact RHEL versions you
mentioned. At the time of the crash we have the following values:
{code:java}
RBX=RSP=0x7fffffffe320=BOTTOM-0x1CE0
SIZE=128K=0x20000
RDX=(SIZE + 2*15)/16*16=0x20010
RAX=RSP-(4K-8)-n*4K=0x7ffffffdd328=RSP-0x20FF8 << crashing writing 0 here
RCX=RSP-((SIZE + 2*15)/16*16+3*4K-8)=0x7ffffffdb318=RSP-0x23008
BUFFER=(RSP - RDX + 15)/16*16=0x7FFFFFFDE310
{code}
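The offsets quoted above can be cross-checked with a short Python sketch. The
variable names simply mirror the registers in the listing, the hex addresses
are copied from it, and the 4K page size is an assumption; nothing here is
taken from the container-executor source.

```python
# Register values quoted in the comment (x86-64 addresses, sizes in bytes).
RSP = 0x7FFFFFFFE320
RAX = 0x7FFFFFFDD328      # probe address that faults
RCX = 0x7FFFFFFDB318      # lower bound of the probe loop
BUFFER = 0x7FFFFFFDE310   # eventual buffer address

SIZE = 128 * 1024         # requested buffer size
PAGE = 4 * 1024           # assumption: 4K pages

# Allocation size rounded as in the RDX formula.
RDX = (SIZE + 2 * 15) // 16 * 16

print(hex(RSP + 0x1CE0))                 # 0x800000000000  (BOTTOM)
print(hex(RDX))                          # 0x20010
print(hex(RSP - RAX))                    # 0x20ff8
print((RSP - RAX - (PAGE - 8)) // PAGE)  # 32  (n in the RAX formula)
print(hex(RSP - RCX))                    # 0x23008
print(hex(RSP - RCX - RDX))              # 0x2ff8  (probed below the allocation)
print(hex(BUFFER - RAX))                 # 0xfe8   (buffer is just above the fault)
```

In other words, the faulting probe is the one with n=32, and the probe range
extends 0x2FF8 bytes (three pages minus 8) below the padded allocation.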
The stack check code writes a 0 to every page from RSP-(4K-8) down to RCX,
which is RSP-0x23008, using RAX as the iterator. At the time of the crash RAX
is RSP-0x20FF8. The eventual location of the buffer is slightly above the
crash address, by less than a page. However, RSP is within 2 pages of the
bottom of the stack (BOTTOM-0x1CE0), and we only check a few pages below the
eventual buffer location, so the write should succeed. In fact, when I try to
reproduce the same issue (rh68 built binary on rh74) with a 110K buffer
instead of 128K, it works.
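To make the probe pattern concrete, here is a rough Python model of the loop
as described above. It is only a sketch, not the code GCC emits: the function
names are hypothetical, the exact loop bound is an assumption, the 0x2FF8
margin is inferred from the quoted RSP-RCX and RDX values, and the 110K run is
assumed to start from the same RSP.

```python
PAGE = 4 * 1024          # assumption: 4K pages
MARGIN = 0x2FF8          # inferred: (RSP - RCX) - RDX from the quoted values


def padded(size):
    """Round the allocation as in the RDX formula."""
    return (size + 2 * 15) // 16 * 16


def probe_addresses(rsp, size):
    """Addresses the modelled stack check touches, highest first."""
    lower = rsp - (padded(size) + MARGIN)   # plays the role of RCX
    addr = rsp - (PAGE - 8)                 # plays the role of RAX
    touched = []
    while addr > lower:                     # assumed loop condition
        touched.append(addr)
        addr -= PAGE
    return touched


RSP = 0x7FFFFFFFE320
CRASH = 0x7FFFFFFDD328

big = probe_addresses(RSP, 128 * 1024)
small = probe_addresses(RSP, 110 * 1024)

print(big.index(CRASH))        # 32
print(CRASH in small)          # False
print(hex(RSP - small[-1]))    # 0x1dff8
```

Under these assumptions the 110K case never probes as deep as the faulting
address (its deepest probe is RSP-0x1DFF8, versus RSP-0x20FF8 for the crash),
which is consistent with the 110K run succeeding.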
In conclusion, the stack check code seems to be legitimate. However, the
code might access the same memory later and end up with the same crash even
without stack checking. The RHEL 7.4 code ORs each location with 0, which
leaves the stored value unchanged. Since the stack check code is similar to
what Meltdown does, I am wondering whether we ran into some kernel
protection. Moving the buffer to the heap removes any risk of running into
this protection.
> -fstack-check compilation flag causes binary incompatibility for
> container-executor between RHEL 6 and RHEL 7
> -------------------------------------------------------------------------------------------------------------
>
> Key: YARN-7857
> URL: https://issues.apache.org/jira/browse/YARN-7857
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.0.0
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
>
> The segmentation fault in container-executor reported in [YARN-7796] appears
> to be due to a binary compatibility issue with the {{-fstack-check}} flag
> that was added in [YARN-6721].
> Based on my testing, a container-executor (without the patch from
> [YARN-7796]) compiled on RHEL 6 with the -fstack-check flag always hits this
> segmentation fault when run on RHEL 7. But if you compile without this flag,
> the container-executor runs on RHEL 7 with no problems. I also verified this
> with a simple program that just does the copy_file.
> I think we need to either remove this flag, or find a suitable alternative.