[ https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158913#comment-15158913 ]

Jason Lowe commented on YARN-4705:
----------------------------------

bq. One RPC call to check the file size shouldn't be a big problem in general.

As I mentioned above, we _cannot_ rely on the file size to be accurate.  The 
file is being actively written, and there's no guarantee the file size will be 
updated in a timely manner after data is written.  There can be data in the 
file for hours and the file size could still be zero.  In HDFS it will only be 
updated when the next block is allocated, so the file size could sit at 0 for a 
very long time (depending upon how fast the writer is going) until it suddenly 
jumps to the block size when the writer crosses the first block boundary.  The 
only real way to know how much data is in the file is to read it -- we cannot 
rely on what the namenode reports.
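To make that concrete, here's a rough illustration (names here are made up, not 
the actual reader code) of how the namenode-reported length and the bytes that 
are actually readable can disagree for a file that is still open for write:

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LengthVsReadable {
  public static void main(String[] args) throws IOException {
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(new Configuration());

    // What the namenode currently thinks the file length is.
    long reportedLen = fs.getFileStatus(path).getLen();

    // How much we can actually pull back from the datanodes.
    long readable = 0;
    byte[] buf = new byte[64 * 1024];
    try (FSDataInputStream in = fs.open(path)) {
      int n;
      while ((n = in.read(buf)) > 0) {
        readable += n;
      }
    }

    // For a file under construction, readable can exceed reportedLen by up to
    // a full block until the next block is allocated.
    System.out.println("namenode length=" + reportedLen
        + ", bytes readable=" + readable);
  }
}
{code}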

bq. After a scan of an empty file/failed parse, it gets loaded again, next scan 
round? Or is it removed from the scan list?

The file is always revisited, errors or not, on the next scan round as long as 
the application is active.  It opens the file, seeks to the last successfully 
read byte offset, and tries to read more.  If data is successfully read then it 
updates the byte offset for the next round; rinse, repeat.
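A minimal sketch of that reopen/seek/read pattern (class, field, and method 
names are hypothetical, not the actual ATS reader code):

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical sketch of one scan round: reopen, seek, read whatever is there. */
class ScanState {
  private long offset; // last successfully parsed byte offset, kept between rounds

  long scanOnce(FileSystem fs, Path logPath) throws IOException {
    byte[] buf = new byte[64 * 1024];
    try (FSDataInputStream in = fs.open(logPath)) {
      in.seek(offset);
      int n;
      while ((n = in.read(buf)) > 0) {
        // Hand buf[0..n) to the parser; advance only past what parsed cleanly.
        int parsed = parse(buf, n);
        offset += parsed;
        if (parsed < n) {
          break;  // partial record at the tail -- retry from here next round
        }
      }
    }
    // Nothing new (or a truncated record) is not an error; the next round
    // retries from the same offset while the application is still running.
    return offset;
  }

  /** Stand-in for the JSON parse step: returns how many bytes parsed cleanly. */
  private int parse(byte[] buf, int len) {
    return len;  // placeholder for illustration only
  }
}
{code}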

bq. Really a failure to parse the JSON or an empty file should be treated the 
same: try later if the file size increases

Again, we cannot rely on the file size to be updated.  To reduce load on the 
namenode, the writer is simply pushing the data out to the datanode -- it's not 
also making an RPC call to the namenode to update the filesize.  The only 
actors involved are the writer, the datanode, and the reader.  The namenode is 
oblivious to what's happening until the next block is allocated, which could 
take a really long time if the writer is writing slowly.  Note that for these 
files a slow writer is not a rare case, as it only writes when tasks change 
state.
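For illustration, the writer-side trade-off looks roughly like this (a hedged 
sketch, not the actual timeline client code): only an hsync with UPDATE_LENGTH 
would tell the namenode about the new length, and that is exactly the extra 
namenode RPC the writer is avoiding.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;

/** Hypothetical sketch of the writer-side trade-off described above. */
class WriterFlush {
  static void push(FSDataOutputStream out, String record) throws IOException {
    out.write(record.getBytes(StandardCharsets.UTF_8));
    // hflush: bytes reach the datanode pipeline and become visible to readers,
    // but no namenode RPC is made, so the reported file length lags behind.
    out.hflush();
  }

  static void pushAndUpdateLength(FSDataOutputStream out, String record)
      throws IOException {
    out.write(record.getBytes(StandardCharsets.UTF_8));
    // hsync(UPDATE_LENGTH) also updates the length on the namenode, at the
    // cost of the extra namenode RPC the writer is deliberately not making.
    if (out instanceof HdfsDataOutputStream) {
      ((HdfsDataOutputStream) out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
    } else {
      out.hsync();  // non-HDFS streams: plain hsync, semantics vary by filesystem
    }
  }
}
{code}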

I agree we need to handle this better, probably by making the error a bit less 
scary in the log.

> ATS 1.5 parse pipeline to consider handling open() events recoverably
> ---------------------------------------------------------------------
>
>                 Key: YARN-4705
>                 URL: https://issues.apache.org/jira/browse/YARN-4705
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Priority: Minor
>
> During one of my own timeline test runs, I've been seeing a stack trace 
> warning that the CRC check failed in the FileSystem.open() call; something the 
> FS was ignoring.
> Even though it's swallowed (and probably not the cause of my test failure), 
> looking at the code in {{LogInfo.parsePath()}} shows that it considers a 
> failure to open a file as unrecoverable. 
> On some filesystems this may not be the case, i.e. if a file is open for 
> writing it may not be available for reading; checksums may be a similar issue. 
> Perhaps a failure at open() should be viewed as recoverable while the app is 
> still running?



