[
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163205#comment-15163205
]
Steve Loughran commented on YARN-4705:
--------------------------------------
YARN-4696 contains my current logic for handling parse failures:
if the JSON parser fails, an info message is printed, but only when we know the
file is non-empty (i.e. either length > 0 or offset > 0).
I think there are some possible race conditions in the code as-is. Certainly
FNFEs ought to downgrade to info.
Other IOEs should be caught and logged per file rather than stopping the entire
scan loop; otherwise bad permissions on a single file would be enough to break
the scanning.
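As a sketch of that per-file handling (class and method names here are hypothetical, not the actual ATS scan code, and plain stdout stands in for the real logger):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class ScanLoop {

    // Stand-in for handing the file to the JSON parser; the real code
    // would stream it into the timeline entity parser.
    static void parseFile(Path file) throws IOException {
        byte[] data = Files.readAllBytes(file);
        if (data.length > 0 && data[0] != '{') {
            throw new IOException("not JSON: " + file);
        }
    }

    /**
     * Scan every file in the directory, demoting per-file failures to
     * log entries instead of aborting the whole loop.
     * @return number of files parsed successfully
     */
    static int scan(Path dir) throws IOException {
        int ok = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                try {
                    parseFile(file);
                    ok++;
                } catch (NoSuchFileException e) {
                    // FNFE: file vanished between listing and open -> info only
                    System.out.println("INFO: skipped missing " + file);
                } catch (IOException e) {
                    // Other IOEs (bad permissions, truncation, ...):
                    // log per file, keep scanning the rest
                    System.out.println("WARN: failed to read " + file + ": " + e);
                }
            }
        }
        return ok;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("scan");
        Files.write(dir.resolve("good.json"), "{}".getBytes());
        Files.write(dir.resolve("bad.json"), "oops".getBytes());
        System.out.println("parsed=" + scan(dir));
    }
}
```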
Regarding trying to work with raw vs. HDFS: I've not been able to get at the
raw local FS, and am trying to disable caching in file://, but am close to
accepting defeat and spinning up a single mini YARN cluster across all my test
cases. That, or add a config option to turn off checksumming in the local FS.
The logic is there, but it can only be set on an FS instance, which must then
be used directly or propagated to the code-under-test via the FS cache.
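For reference, the FS cache for a single scheme can be bypassed via the standard per-scheme Hadoop configuration key, so each {{FileSystem.get()}} for file:// returns a fresh instance whose checksum setting can't leak between tests (this is a generic Hadoop config point, not something specific to this patch):

```xml
<property>
  <name>fs.file.impl.disable.cache</name>
  <value>true</value>
</property>
```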
The local FS does work for picking up completed work; the problem is that,
because flush() doesn't work there, it doesn't reliably pick up the updates of
incomplete jobs. And when it does, unless the JSON is aligned on a buffer
boundary, the parser is going to fail, which will lead to lots and lots of info
messages unless the logging is tuned further to only log when the last
operation was not a failure.
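That logging tuning could be as simple as edge-triggered reporting per file: only log on the success-to-failure transition, so a file that keeps failing on every scan pass is logged once. A minimal sketch (hypothetical class, not the ATS code):

```java
import java.util.HashMap;
import java.util.Map;

public class EdgeTriggeredLog {
    // Tracks whether the last parse attempt on each file failed.
    private final Map<String, Boolean> lastFailed = new HashMap<>();

    /** Log only when the previous attempt on this file succeeded. */
    void parseFailed(String file) {
        if (!lastFailed.getOrDefault(file, false)) {
            System.out.println("INFO: parse failed for " + file);
        }
        lastFailed.put(file, true);
    }

    void parseSucceeded(String file) {
        lastFailed.put(file, false);
    }

    public static void main(String[] args) {
        EdgeTriggeredLog log = new EdgeTriggeredLog();
        log.parseFailed("app_1.json");    // logged
        log.parseFailed("app_1.json");    // suppressed: still failing
        log.parseSucceeded("app_1.json");
        log.parseFailed("app_1.json");    // logged again after a success
    }
}
```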
For production use, we only really need to worry about other cross-cluster
filesystems here. Single node with the local FS? Use the 1.0 APIs. Production:
a distributed FS which is required to implement flush() (even a delayed/async
flush) if you want to see incomplete applications. I believe GlusterFS supports
that, as does any POSIX FS provided the checksum FS doesn't get in the way.
What does [~jayunit100] have to say about his filesystem's consistency model?
It will mean that the object stores, S3 and Swift, can't work as destinations
for logs. They are dangerous anyway: if the app crashes before
{{out.close()}} is called, *all* data is lost. If we care about that, you'd
really want to write to an FS (local or HDFS) and then copy to the blobstore
for long-term histories.
> ATS 1.5 parse pipeline to consider handling open() events recoverably
> ---------------------------------------------------------------------
>
> Key: YARN-4705
> URL: https://issues.apache.org/jira/browse/YARN-4705
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
> Priority: Minor
>
> During one of my own timeline test runs, I've been seeing a stack trace
> warning that the CRC check failed on the file in {{FileSystem.open()}};
> something the FS was ignoring.
> Even though it's swallowed (and probably not the cause of my test failure),
> looking at the code in {{LogInfo.parsePath()}} shows that it considers a
> failure to open a file as unrecoverable.
> On some filesystems this may not be the case, i.e. if a file is open for
> writing it may not be available for reading; checksums may be a similar issue.
> Perhaps a failure at open() should be viewed as recoverable while the app is
> still running?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)