[ 
https://issues.apache.org/jira/browse/YARN-8310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479863#comment-16479863
 ] 

Robert Kanter commented on YARN-8310:
-------------------------------------

Thanks for the review [~miklos.szeg...@cloudera.com].

I've uploaded 003 patches:
- {{readFields}} now first fully reads the data into a byte array, and then 
tries to parse it.  We no longer do a reset or a cast, and the protobuf parsing 
method that takes a byte array only declares {{InvalidProtocolBufferException}} 
so we don't have to worry about the {{IOException}} either.  So this solves 
many of your comments, though it does make the code a bit messier because of 
all of the different streams I had to use.
- I didn't change how {{logAggregationSize}} because there's really 3 cases 
(see below).  The results of the first two cases are the same: there's no log 
aggregation context data to read.  So I think it makes sense to keep the 
{{logAggregationSize = -1}} logic as is, instead of adding in a boolean to 
complicate things.  And there's no need to provide a specific error code here 
because none of these scenarios should result in an error.
-# We read an EOF, which indicates that there was no log aggregation context 
data or size info
-# We read an int that's -1, which indicates that there is no long aggregation 
context data (it's like a flag)
-# We read an int that's >= 0, which indicates the size of the log aggregation 
context data to read
- I didn't change the split behavior you asked about.  This is how the original 
code worked (I mostly copy-pasted from the diff where this was removed), and I 
don't want to deviate too much from it in case we accidently make an 
incompatible change here.  This code should only be reading data that was 
written from the old {{write}} method, so it should only have correct data - 
it's not like the user is providing this.  As-is, this code isn't very robust 
because it's relying on the ordering of the fields being correct and has no 
versioning, which I assume is why we switched to protobuf in the first place - 
I don't think can or should fix that.

> Handle old NMTokenIdentifier, AMRMTokenIdentifier, and 
> ContainerTokenIdentifier formats
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-8310
>                 URL: https://issues.apache.org/jira/browse/YARN-8310
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>            Priority: Major
>         Attachments: YARN-8310.001.patch, YARN-8310.002.patch, 
> YARN-8310.003.patch, YARN-8310.branch-2.001.patch, 
> YARN-8310.branch-2.002.patch, YARN-8310.branch-2.003.patch
>
>
> In some recent upgrade testing, we saw this error causing the NodeManager to 
> fail to startup afterwards:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> com.google.protobuf.InvalidProtocolBufferException: Protocol message 
> contained an invalid tag (zero).
>       at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>       at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:441)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:834)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:895)
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol 
> message contained an invalid tag (zero).
>       at 
> com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
>       at 
> com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
>       at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.<init>(YarnSecurityTokenProtos.java:1860)
>       at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.<init>(YarnSecurityTokenProtos.java:1824)
>       at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2016)
>       at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2011)
>       at 
> com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
>       at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
>       at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
>       at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>       at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.parseFrom(YarnSecurityTokenProtos.java:2686)
>       at 
> org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:254)
>       at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:177)
>       at 
> org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:322)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:455)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:373)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>       ... 5 more
> {noformat}
> The NodeManager fails because it's trying to read a 
> {{ContainerTokenIdentifier}} in the "old" format before we changed them to 
> protobufs (YARN-668).  This is very similar to YARN-5594 where we ran into a 
> similar problem with the ResourceManager and RM Delegation Tokens.
> To provide a better experience, we should make the code able to read the old 
> format if it's unable to read it using the new format.  We didn't run into 
> any errors with the other two types of tokens that YARN-668 incompatibly 
> changed (NMTokenIdentifier and AMRMTokenIdentifier), but we may as well fix 
> those while we're at it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to