Thomas Graves created YARN-1670:
-----------------------------------

             Summary: aggregated log writer can write more log data than it 
says is the log length
                 Key: YARN-1670
                 URL: https://issues.apache.org/jira/browse/YARN-1670
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.2.0, 0.23.10
            Reporter: Thomas Graves


We have seen exceptions when using 'yarn logs' to read log files. 
       at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
       at java.lang.Long.parseLong(Long.java:441)
       at java.lang.Long.parseLong(Long.java:483)
       at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:518)
       at org.apache.hadoop.yarn.logaggregation.LogDumper.dumpAContainerLogs(LogDumper.java:178)
       at org.apache.hadoop.yarn.logaggregation.LogDumper.run(LogDumper.java:130)
       at org.apache.hadoop.yarn.logaggregation.LogDumper.main(LogDumper.java:246)


We traced it down to the reader trying to read the file type of the next file 
while its read position was still inside log data from the previous file.  The 
Log Length had been written as a certain size, but the log data actually 
written was longer than that.
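
To illustrate the misalignment, here is a minimal sketch. It assumes a 
simplified record layout (type and length written with writeUTF, followed by 
the raw bytes); the class and method names are hypothetical and this is not 
the actual AggregatedLogFormat code. The extra bytes after "hello" are chosen 
so that the failure is deterministic.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class MisalignedLogRead {
    /** Writes one record: type, declared length as a decimal string, then data. */
    static void writeRecord(DataOutputStream out, String type,
                            long declaredLength, byte[] data) throws IOException {
        out.writeUTF(type);
        out.writeUTF(String.valueOf(declaredLength));
        out.write(data); // may be longer than declaredLength (the bug)
    }

    /** Reads one record, trusting the declared length, and returns its type. */
    static String readRecord(DataInputStream in) throws IOException {
        String type = in.readUTF();
        long length = Long.parseLong(in.readUTF()); // throws on non-numeric text
        in.readFully(new byte[(int) length]);       // sketch: small lengths only
        return type;
    }

    /** Builds a stream whose first record declares 5 bytes but contains 11. */
    static byte[] buildStream() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        byte[] data = {'h', 'e', 'l', 'l', 'o', 0, 1, 'x', 0, 1, 'y'};
        writeRecord(out, "stdout", 5, data);
        writeRecord(out, "stderr", 3, "abc".getBytes(StandardCharsets.US_ASCII));
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buildStream()));
        System.out.println("first record type: " + readRecord(in));
        try {
            // The reader is now positioned inside leftover log data.
            readRecord(in);
        } catch (NumberFormatException e) {
            System.out.println("second record failed: " + e);
        }
    }
}
```

In this sketch the second read interprets leftover log bytes as the next 
record's type and length, so Long.parseLong is handed text that is not a 
number.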

Inside the write() routine in LogValue, it first writes the logfile length, 
but when it then writes the log itself it simply reads to the end of the 
file.  There is a race condition here: if someone is still writing to the file 
when it is aggregated, the length written could be too small.

We should have the write() routine stop once it has written the number of 
bytes it declared as the length.  It would be nice if we could somehow tell 
the user the log might be truncated, but I'm not sure of a good way to do this.
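
A minimal sketch of a bounded copy, assuming plain java.io streams; 
BoundedLogCopy and copyBounded are hypothetical names for illustration, not 
the actual LogValue.write() code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class BoundedLogCopy {
    /**
     * Copies at most declaredLength bytes from in to out and returns the
     * number of bytes copied. Bytes appended to the source after the length
     * was recorded are left unread.
     */
    static long copyBounded(InputStream in, OutputStream out, long declaredLength)
            throws IOException {
        byte[] buf = new byte[64 * 1024];
        long written = 0; // long, so lengths above 2 GB are counted correctly
        while (written < declaredLength) {
            int toRead = (int) Math.min(buf.length, declaredLength - written);
            int len = in.read(buf, 0, toRead);
            if (len == -1) {
                break; // source ended before the declared length was reached
            }
            out.write(buf, 0, len);
            written += len;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        byte[] source =
            "hello, plus bytes appended later".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long n = copyBounded(new ByteArrayInputStream(source), out, 5);
        System.out.println(n + " bytes copied: "
            + new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The returned count lets a caller compare what was copied against the declared 
length; how to report possible truncation to the user is left open here.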

We also noticed a bug in readAContainerLogsForALogType: it uses an int for 
curRead where it should use a long.

      while (len != -1 && curRead < fileLength) {

This isn't actually a problem right now as it looks like the underlying decoder 
is doing the right thing and the len condition exits.
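
To show why the counter type matters, here is a small simulation. It assumes 
a loop that adds a fixed chunk size to the counter on each pass; the class and 
method names are hypothetical and no real I/O is performed. The iteration cap 
exists only so the demonstration terminates.

```java
class ReadCounterOverflow {
    /** Returns true if an int counter ever reaches fileLength. */
    static boolean intCounterReaches(long fileLength, int chunk) {
        int curRead = 0;
        long maxIterations = fileLength / chunk + 2; // cap for the demonstration
        for (long i = 0; i < maxIterations; i++) {
            if (curRead >= fileLength) {
                return true;
            }
            curRead += chunk; // wraps to a negative value past Integer.MAX_VALUE
        }
        return false;
    }

    /** Returns true if a long counter ever reaches fileLength. */
    static boolean longCounterReaches(long fileLength, int chunk) {
        long curRead = 0;
        long maxIterations = fileLength / chunk + 2; // same cap as above
        for (long i = 0; i < maxIterations; i++) {
            if (curRead >= fileLength) {
                return true;
            }
            curRead += chunk;
        }
        return false;
    }

    public static void main(String[] args) {
        long threeGiB = 3L << 30;
        int oneGiB = 1 << 30;
        System.out.println("int counter reaches length:  "
            + intCounterReaches(threeGiB, oneGiB));
        System.out.println("long counter reaches length: "
            + longCounterReaches(threeGiB, oneGiB));
    }
}
```

For a 3 GiB length, the int counter wraps around before it reaches the length, 
so a condition like curRead < fileLength never becomes false by itself; the 
loop then depends entirely on the len != -1 check.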



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
