[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159012#comment-15159012
 ] 

Steve Loughran commented on YARN-4705:
--------------------------------------

OK, so HDFS has a guaranteed flush but no guarantees on modtime or size 
propagation. In contrast, the local file:// FS keeps FileStatus.length 
consistent with the actual length, but doesn't flush when told to, so it can 
delay its writes until a CRC-worth of data has been written, and there is no 
obvious way to turn this off for testing via config files.

On HDFS, then: an empty file length can't be interpreted as a reason to skip 
the file, so failures to read are an error. An attempt must be made to read 
it, but any EOFException or similar is not a failure. That is: you can't skip 
on empty, just swallow the failure. Maybe list the exception and the current 
file status value at DEBUG. Or just attempt to read() byte 0 after opening the 
file; an EOFException means "still empty".
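
A rough sketch of that read-byte-0 probe, to make the idea concrete. This is 
not the ATS code: it uses a plain java.io.InputStream as a stand-in for the 
stream returned by FileSystem.open(), and the class, enum and method names 
(EmptyFileProbe, Probe, probeFirstByte) are made up for illustration. It 
treats both a -1 return and an EOFException as "still empty" and lets every 
other IOException propagate as a real failure.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class EmptyFileProbe {

    /** Outcome of probing a freshly opened stream (hypothetical type). */
    public enum Probe { HAS_DATA, STILL_EMPTY }

    /**
     * Tries to read the first byte of the stream. End-of-stream, whether
     * signalled by a -1 return or by an EOFException, means "still empty"
     * rather than a failure. Any other IOException is left to propagate.
     */
    public static Probe probeFirstByte(InputStream in) throws IOException {
        try {
            int b = in.read();
            return b < 0 ? Probe.STILL_EMPTY : Probe.HAS_DATA;
        } catch (EOFException e) {
            // A DEBUG-level log of the exception and file status could go here.
            return Probe.STILL_EMPTY;
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for an empty and a non-empty file.
        System.out.println(probeFirstByte(new ByteArrayInputStream(new byte[0])));
        System.out.println(probeFirstByte(new ByteArrayInputStream(new byte[] {42})));
    }
}
```

The probe consumes the first byte, so a real caller would have to seek back 
to 0 or reopen the file before parsing.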

For the local FS, this essentially means that until such a switch is provided, 
you cannot use it as a back end for ATS1.5, even for testing. Or at least, you 
can write with it, but the data won't be guaranteed to be visible until 
close() is called. You may not get any view of incomplete apps, which is 
precisely what I'm seeing.

If this is the case, then that's something ATS1.5 can't fix: it will have to be 
in the documentation.



> ATS 1.5 parse pipeline to consider handling open() events recoverably
> ---------------------------------------------------------------------
>
>                 Key: YARN-4705
>                 URL: https://issues.apache.org/jira/browse/YARN-4705
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Priority: Minor
>
> During one of my own timeline test runs, I've been seeing a stack trace 
> warning that the CRC check failed in FileSystem.open() of a file; something 
> the FS was ignoring.
> Even though it's swallowed (and probably not the cause of my test failure), 
> looking at the code in {{LogInfo.parsePath()}} shows that it considers a 
> failure to open a file as unrecoverable. 
> On some filesystems, this may not be the case, i.e. if it's open for writing 
> it may not be available for reading; checksums may be a similar issue. 
> Perhaps a failure at open() should be viewed as recoverable while the app is 
> still running?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
