[
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159012#comment-15159012
]
Steve Loughran commented on YARN-4705:
--------------------------------------
OK, so HDFS has guaranteed flush but no guarantees on modtime or size
propagation. In contrast, the local file:// FS keeps FileStatus.length
consistent with the actual length, but doesn't flush when told to, so it can
delay its writes until a CRC-worth of data has been written, and there is no
obvious way to turn this off for testing via config files.
On HDFS, then: a zero file length can't be interpreted as a reason to skip the
file, so failures to read are an error. An attempt must be made to read it, but
any EOFException or similar is not a failure. That is: you can't skip on empty,
just swallow the failure. Maybe log the exception and the current file status
value at DEBUG. Or just attempt to read() byte 0 after opening the file; an
EOFException means "still empty".
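A minimal sketch of that read-byte-0 probe, assuming the opened stream is a
java.io.DataInputStream (Hadoop's FSDataInputStream extends it). The names
EmptyFileProbe and isStillEmpty are hypothetical, and the in-memory streams
only stand in for real files. It uses readByte() because a plain read()
returns -1 at end of stream rather than throwing EOFException.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

/** Hypothetical helper for illustration; not part of ATS or Hadoop. */
class EmptyFileProbe {

    /**
     * Try to read byte 0 of an already-opened stream.
     * Returns true if no data is visible yet (EOFException),
     * false if at least one byte could be read.
     * Any other IOException propagates to the caller as a real failure.
     */
    static boolean isStillEmpty(DataInputStream in) throws IOException {
        try {
            in.readByte(); // throws EOFException when no byte is available
            return false;
        } catch (EOFException e) {
            // Not a failure: the writer has not made any data visible yet.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        DataInputStream empty =
            new DataInputStream(new ByteArrayInputStream(new byte[0]));
        DataInputStream nonEmpty =
            new DataInputStream(new ByteArrayInputStream(new byte[] {42}));
        System.out.println("empty stream still empty? " + isStillEmpty(empty));
        System.out.println("non-empty stream still empty? " + isStillEmpty(nonEmpty));
    }
}
```

Note that the probe consumes byte 0, so a real caller would have to seek back
to the start or reopen the file before parsing it.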
That essentially means that until a switch to turn off the delayed writes is
provided, you cannot use the local FS as a back end for ATS1.5, even for
testing. Or at least, you can write with it, but the data won't be guaranteed
to be visible until close() is called. You may not get any view of incomplete
apps, which is precisely what I'm seeing.
If this is the case, then that's something ATS1.5 can't fix: it will have to be
in the documentation.
> ATS 1.5 parse pipeline to consider handling open() events recoverably
> ---------------------------------------------------------------------
>
> Key: YARN-4705
> URL: https://issues.apache.org/jira/browse/YARN-4705
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
> Priority: Minor
>
> During one of my own timeline test runs, I've been seeing a stack trace
> warning that the CRC check failed in FileSystem.open(); something the FS
> was ignoring.
> Even though it's swallowed (and probably not the cause of my test failure),
> looking at the code in {{LogInfo.parsePath()}} shows that it considers a
> failure to open a file as unrecoverable.
> On some filesystems, this may not be the case, i.e. if it's open for writing
> it may not be available for reading; checksums may be a similar issue.
> Perhaps a failure at open() should be viewed as recoverable while the app is
> still running?