[
https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated YARN-1185:
-----------------------------
Summary: FileSystemRMStateStore can leave partial files that prevent
subsequent recovery (was: FileSystemRMStateStore doesn't use temporary files
when writing data)
bq. The RM will not start if there is anything wrong with the stored state. So
if some write is partial/empty it will not start.
The concern I have about that approach is it requires manual intervention from
ops when there is a problem, and the current scheme can lead to that situation
occurring because the RM can crash at arbitrary points. I think the RM should
try to prevent that situation from occurring and/or have the ability to
automatically recover from that situation if it does occur. The RM could skip
the corrupted info and continue if the info is deemed not critical to the
overall recovery process. Then we're only involving ops if the corruption is
very serious.
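To illustrate the skip-and-continue idea, here is a minimal sketch in plain Java. It is not the FileSystemRMStateStore code: the class name TolerantRecovery, the CRC32-prefixed record layout, and the "critical" flag are all assumptions made for the example, and it reads from a local directory rather than HDFS. It only shows how recovery might drop a partially written non-critical file while still failing on corrupt critical state.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical helper, not part of YARN.
final class TolerantRecovery {

    // Assumed record layout: 8-byte CRC32 of the payload, then the payload.
    static byte[] encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return ByteBuffer.allocate(Long.BYTES + payload.length)
                .putLong(crc.getValue())
                .put(payload)
                .array();
    }

    // Returns the payload, or null if the record is truncated or corrupt.
    static byte[] decode(byte[] raw) {
        if (raw.length < Long.BYTES) {
            return null; // too short to even hold the checksum
        }
        ByteBuffer buf = ByteBuffer.wrap(raw);
        long expected = buf.getLong();
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue() == expected ? payload : null;
    }

    // Loads every record in dir. A corrupt record is skipped with a warning
    // when critical is false, and aborts recovery when critical is true.
    static List<byte[]> loadAll(Path dir, boolean critical) throws IOException {
        List<byte[]> loaded = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                byte[] payload = decode(Files.readAllBytes(f));
                if (payload != null) {
                    loaded.add(payload);
                } else if (critical) {
                    throw new IOException("Corrupt critical state: " + f);
                } else {
                    System.err.println("Skipping corrupt state file " + f);
                }
            }
        }
        return loaded;
    }
}
```

With something along these lines, per-application state could be loaded with critical set to false, while global state such as the security data would be loaded with critical set to true so that ops are only involved in the serious cases.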
{quote}
So we could do the following.
Storing app data may continue to be optimistic, and since that's the main
workload we continue to do what we do today.
Storing global data (mainly the security stuff) can change to be more atomic.
{quote}
That sounds reasonable, especially if the RM is more robust during recovery. I
understand it's a tradeoff between reliability and performance, especially with
the RPC overhead when talking to HDFS and the potentially high rate of state
churn.
Thanks for the informative discussion, [~bikassaha]! Updating the summary to
better reflect the problem and not a particular solution.
> FileSystemRMStateStore can leave partial files that prevent subsequent
> recovery
> -------------------------------------------------------------------------------
>
> Key: YARN-1185
> URL: https://issues.apache.org/jira/browse/YARN-1185
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.1.0-beta
> Reporter: Jason Lowe
>
> FileSystemRMStateStore writes directly to the destination file when storing
> state. However if the RM were to crash in the middle of the write, the
> recovery method could encounter a partially-written file and either outright
> crash during recovery or silently load incomplete state.
> To avoid this, the data should be written to a temporary file and renamed to
> the destination file afterwards.
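
The write-then-rename approach described above can be sketched as follows. This is a minimal illustration using java.nio.file on a local filesystem, not the actual FileSystemRMStateStore change: the class name AtomicStateWriter, the ".tmp" suffix, and the method name are hypothetical, and a real fix would go through the Hadoop filesystem API, whose rename semantics on HDFS may differ from a local atomic move.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of YARN.
final class AtomicStateWriter {

    // Assumed suffix marking a file that is still being written.
    static final String TMP_SUFFIX = ".tmp";

    // Writes data to "<dest>.tmp" first, then renames it to dest, so a crash
    // mid-write leaves only a stray temp file and never a partial dest.
    static void writeAtomically(Path dest, byte[] data) throws IOException {
        Path tmp = dest.resolveSibling(dest.getFileName() + TMP_SUFFIX);
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // flush content and metadata before the rename
        }
        // Whether an existing dest is replaced by an atomic move is
        // implementation specific, so this sketch targets new files only.
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Recovery code paired with this scheme would ignore or delete any leftover file ending in the temp suffix, since such a file can only come from a write that never completed.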