[ https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163205#comment-15163205 ]

Steve Loughran commented on YARN-4705:
--------------------------------------

YARN-4696 contains my current logic for handling parse failures:

If the JSON parser fails, an info message is printed, but only if we know the 
file is non-empty (i.e. either length > 0 or offset > 0).
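
Roughly, the condition looks like this (a sketch only, not the actual YARN-4696 
patch; {{parseJson()}} and {{LOG}} are hypothetical stand-ins):

{code}
// Sketch: only log a parse failure if the file is known to be non-empty.
// parseJson() and LOG are hypothetical stand-ins, not the real members.
void parseFile(Path path, long fileLength, long offset) {
  try {
    parseJson(path, offset);
  } catch (IOException e) {
    // Jackson's parse exceptions extend IOException.
    if (fileLength > 0 || offset > 0) {
      LOG.info("Failed to parse " + path + " at offset " + offset, e);
    }
  }
}
{code}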

I think there are some possible race conditions in the code as it stands; 
certainly FNFEs (FileNotFoundExceptions) ought to be downgraded to info.

For other IOEs, I think they should be caught & logged per file, rather than 
stopping the entire scan loop. Otherwise bad permissions on one file would be 
enough to break the scanning.
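
That is, something like this around the per-file parse (a sketch; 
{{scanLogFile()}}, {{fs}} and {{LOG}} stand in for the real members):

{code}
// Sketch of per-file error handling in a scan loop; scanLogFile(), fs
// and LOG are hypothetical stand-ins for the real members.
void scanLogsOnce(Path dirPath) throws IOException {
  for (FileStatus stat : fs.listStatus(dirPath)) {
    try {
      scanLogFile(stat);
    } catch (FileNotFoundException e) {
      // File vanished between listing and open: benign race, info only.
      LOG.info("File " + stat.getPath() + " disappeared during scan");
    } catch (IOException e) {
      // e.g. bad permissions: log and move on to the next file rather
      // than aborting the whole scan.
      LOG.warn("Failed to scan " + stat.getPath(), e);
    }
  }
}
{code}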


Regarding trying to work with raw vs. HDFS: I've not been able to get at the 
raw filesystem, and am trying to disable caching in file://, but am close to 
accepting defeat and spinning up a single mini YARN cluster across all my test 
cases. That, or adding a config option to turn off checksumming in the local 
FS. The logic is there, but you can only set it on an FS instance, which must 
then be used directly or propagated to the code-under-test via the FS cache.
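
For reference, this is the kind of thing I've been attempting against the 
stock {{FileSystem}} APIs (a sketch; whether the code-under-test actually picks 
up the configured instance is exactly the problem):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;

public class RawLocalFS {
  public static FileSystem rawLocal() throws java.io.IOException {
    Configuration conf = new Configuration();
    // Bypass the FS cache so a freshly configured instance is created.
    conf.setBoolean("fs.file.impl.disable.cache", true);

    LocalFileSystem local = FileSystem.getLocal(conf);
    // Turn off checksum creation/verification on the checksummed local FS...
    local.setWriteChecksum(false);
    local.setVerifyChecksum(false);

    // ...or drop below ChecksumFileSystem entirely and use the raw FS.
    return local.getRawFileSystem();
  }
}
{code}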

The local FS does work for picking up completed work; the problem is that, as 
flush() doesn't work, it doesn't reliably pick up the updates of incomplete 
jobs. And when it does, unless the JSON is aligned on a buffer boundary, the 
parser is going to fail, which will lead to lots and lots of info messages, 
unless the logging is tuned further to only log if the last operation was not 
a failure.
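
A per-file "last attempt failed" flag would do it (a sketch; 
{{lastScanFailed}} is a hypothetical field, not anything in the current code):

{code}
// Sketch: suppress repeated info messages for a file whose tail keeps
// failing to parse. lastScanFailed is a hypothetical per-file field;
// parseJson() and LOG are stand-ins as before.
try {
  parseJson(path, offset);
  lastScanFailed = false;
} catch (IOException e) {
  if (!lastScanFailed && (fileLength > 0 || offset > 0)) {
    LOG.info("Failed to parse " + path + " at offset " + offset, e);
  }
  lastScanFailed = true;
}
{code}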

We only really need to worry about other cross-cluster filesystems for 
production use here. Single node with the local FS? Use the 1.0 APIs. 
Production? A distributed FS which is required to implement flush() (even a 
delayed/async flush) if you want to see incomplete applications. I believe 
GlusterFS supports that, as does any POSIX FS, provided the checksum FS 
doesn't get in the way. What does [~jayunit100] have to say about his 
filesystem's consistency model?

It will mean that the object stores, S3 and Swift, can't work as destinations 
for logs. They are dangerous anyway: if the app crashes before 
{{out.close()}} is called, *all* data is lost. If we care about that, then 
you'd really want to write to a real FS (local or HDFS), then copy to the 
blobstore for long-term histories.
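
i.e. something like this once the app is finished and the file is closed (a 
sketch; the paths and the s3a URI are purely illustrative):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ArchiveHistory {
  public static void archive() throws java.io.IOException {
    Configuration conf = new Configuration();
    // Illustrative paths: a completed summary file and its archive copy.
    Path done = new Path("hdfs:///ats/done/app_0001/summary.json");
    Path archive = new Path("s3a://history-bucket/ats/app_0001/summary.json");

    FileSystem srcFs = done.getFileSystem(conf);
    FileSystem dstFs = archive.getFileSystem(conf);
    // Safe only after close(): the blobstore makes the whole object
    // visible in one go, so nothing partial is ever published.
    FileUtil.copy(srcFs, done, dstFs, archive, false, conf);
  }
}
{code}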
 

> ATS 1.5 parse pipeline to consider handling open() events recoverably
> ---------------------------------------------------------------------
>
>                 Key: YARN-4705
>                 URL: https://issues.apache.org/jira/browse/YARN-4705
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Priority: Minor
>
> During one of my own timeline test runs, I've been seeing a stack trace 
> warning that the CRC check failed in the {{FileSystem.open()}} call; 
> something the FS was ignoring.
> Even though it's swallowed (and probably not the cause of my test failure), 
> looking at the code in {{LogInfo.parsePath()}}, it considers a failure to 
> open a file as unrecoverable. 
> On some filesystems this may not be the case, e.g. if a file is open for 
> writing it may not be available for reading; checksums may be a similar 
> issue. 
> Perhaps a failure at open() should be viewed as recoverable while the app is 
> still running?


