[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303471#comment-14303471
 ] 

Jason Lowe commented on YARN-1778:
----------------------------------

Thanks for the analysis and patch, [~zxu]!  I'm wondering if the test is trying 
to tell us there really is a problem with FSRMStateStore retries, and therefore 
fixing the test is actually masking a real problem that needs to be fixed in 
the main code.  If I understand the intent of the test correctly, it's trying 
to verify that FSRMStateStore will not throw an exception while namenodes are 
down or coming back up.  However, if we make the test wait until the namenodes 
are back up before trying to connect, then that defeats most of the point of 
the test.

I think the critical question is: should the "Namenode still not started" 
exception be retried by either the DFSClient layer or by FSRMStateStore?  I 
think it should; otherwise a client of FSRMStateStore is going to see this 
exception in a similar, real-world scenario where the Namenode was restarted 
and wonder why the framework didn't auto-retry.
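
As a rough illustration of what retrying at the caller's layer could look 
like, here is a minimal sketch in plain Java.  It is not the FSRMStateStore or 
DFSClient code: the class name RetrySketch, the helper withRetries, and its 
parameters are hypothetical, and it assumes that every IOException is worth 
retrying, which a real fix would likely need to narrow down.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {

    /**
     * Hypothetical helper (not a Hadoop API): runs op, retrying on
     * IOException up to maxAttempts times with a fixed sleep in between.
     * Assumption: all IOExceptions are treated as transient here.
     */
    public static <T> T withRetries(Callable<T> op, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                // Remember the failure; sleep only if another attempt remains.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        // Every attempt failed: surface the most recent exception.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated store operation that fails twice, as if the namenode
        // were still starting, and then succeeds.
        int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Namenode still not started");
            }
            return "stored";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "stored after 3 attempts"
    }
}
```

With a wrapper along these lines, a caller would only see the exception once 
the retry budget is exhausted, rather than on the first attempt made while the 
namenode is still starting.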

> TestFSRMStateStore fails on trunk
> ---------------------------------
>
>                 Key: YARN-1778
>                 URL: https://issues.apache.org/jira/browse/YARN-1778
>             Project: Hadoop YARN
>          Issue Type: Test
>            Reporter: Xuan Gong
>            Assignee: zhihai xu
>         Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
