[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163232#comment-15163232
 ] 

Steve Loughran commented on YARN-4705:
--

All we need to know is "does flush() write data back so that other code can 
eventually see it?"
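
As a minimal probe of that question (a sketch only; the path and class name are hypothetical, not project code):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushProbe {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path log = new Path("/tmp/flush-probe");
    try (FSDataOutputStream out = fs.create(log, true)) {
      out.writeBytes("{\"entities\":[]}\n");
      out.hflush();  // HDFS: bytes become visible to new readers, though
                     // FileStatus.getLen() may still lag; ChecksumFileSystem
                     // (file://) may keep the data in a CRC-sized buffer
      System.out.println("length after hflush: "
          + fs.getFileStatus(log).getLen());
    }
  }
}
{code}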

> ATS 1.5 parse pipeline to consider handling open() events recoverably
> ----------------------------------------------------------------------
>
> Key: YARN-4705
> URL: https://issues.apache.org/jira/browse/YARN-4705
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Priority: Minor
>
> During one of my own timeline test runs, I've been seeing a stack trace 
> warning that the CRC check failed in FileSystem.open(); something the FS 
> was ignoring.
> Even though it's swallowed (and probably not the cause of my test failure), 
> looking at the code in {{LogInfo.parsePath()}}, it considers a failure to 
> open a file unrecoverable. 
> On some filesystems, this may not be the case, i.e. if a file is open for 
> writing it may not be available for reading; checksums may be a similar issue. 
> Perhaps a failure at open() should be viewed as recoverable while the app is 
> still running?





[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-24 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163222#comment-15163222
 ] 

jay vyas commented on YARN-4705:


Ah, the GlusterFS consistency model? From my experience it's not strongly 
consistent all the time in all cases. I'd cc [~chenh] and @childsb as well on 
this... they are currently working on these filesystems.



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163205#comment-15163205
 ] 

Steve Loughran commented on YARN-4705:
--

YARN-4696 contains my current logic to handle failures to parse things:

If the JSON parser fails, then an info message is printed if we know the file 
is non-empty (i.e. either length > 0 or offset > 0).

I think there are some possible race conditions in the code as is; certainly 
FNFEs ought to downgrade to info.

For other IOEs, I think they should be caught & logged per file, rather than 
stopping the entire scan loop. Otherwise bad permissions on one file would be 
enough to break the scanning.
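
Roughly this shape (a sketch with hypothetical names, not the actual 
{{EntityGroupFSTimelineStore}} code):

{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class ScanSketch {
  private static final Log LOG = LogFactory.getLog(ScanSketch.class);

  /** Demote FNFE to info, log other IOEs per file, keep scanning. */
  void scanAttemptDir(FileSystem fs, Path attemptDir) throws IOException {
    for (FileStatus status : fs.listStatus(attemptDir)) {
      try {
        parse(fs, status.getPath());  // stand-in for LogInfo.parseForStore()
      } catch (FileNotFoundException fnfe) {
        LOG.info("File vanished before parse: " + status.getPath());
      } catch (IOException ioe) {
        LOG.warn("Could not parse " + status.getPath()
            + "; will retry next scan", ioe);
      }
    }
  }

  void parse(FileSystem fs, Path path) throws IOException {
    // JSON parse of the log file goes here
  }
}
{code}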


Regarding trying to work with raw vs. HDFS... I've not been able to get at raw, 
am trying to disable caching in file://, but am close to accepting defeat and 
spinning up a single mini YARN cluster across all my test cases. That, or add a 
config option to turn off checksumming in the local FS. The logic is there, but 
you can only set it on an FS instance, which must then be used directly or 
propagated to the code-under-test via the FS cache.
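
For reference, the per-instance switches in question, as a sketch (assuming 
the instance can be handed to the code under test):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;

public class ChecksumSwitches {
  public static void main(String[] args) throws Exception {
    LocalFileSystem local = FileSystem.getLocal(new Configuration());
    local.setWriteChecksum(false);   // stop creating .crc side files
    local.setVerifyChecksum(false);  // stop validating them on read
    // or sidestep checksumming entirely:
    FileSystem raw = local.getRawFileSystem();
    System.out.println(raw.getUri());
  }
}
{code}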

The local FS does work for picking up completed work; the problem is that as 
flush() doesn't actually flush, it doesn't reliably read the updates of 
incomplete jobs. And when it does, unless the JSON is aligned on a buffer 
boundary, the parser is going to fail, which is going to lead to lots and lots 
of info messages, unless the logging is tuned further to only log if the last 
operation was not a failure.

We only really need to worry about other cross-cluster filesystems for 
production use here. Single node with local FS? Use the 1.0 APIs. Production: a 
distributed FS which is required to implement flush() (even a delayed/async 
flush) if you want to see incomplete applications. I believe GlusterFS supports 
that, as does any POSIX FS if the checksum FS doesn't get in the way. What does 
[~jayunit100] have to say about his filesystem's consistency model? 

It will mean that the object stores, S3 and Swift, can't work as destinations 
for logs. They are dangerous anyway, because if the app crashes before 
{{out.close()}} is called, *all* data is lost. If we care about that, then you'd 
really want to write to an FS (local or HDFS) and then copy to the blobstore for 
long-term histories.
 



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-23 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159913#comment-15159913
 ] 

Li Lu commented on YARN-4705:
-

Once the WARN log is gone I'm fine with it. One quick question: are we 
planning to use rawfile to test ATS v1.5? Will there be any differences on the 
semantic side between the raw file system and HDFS? I think it's fine if the 
test only relates to the upper-level logic of ATS v1.5, but we may need to be 
careful about testing the storage layer with a different FS. 



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159130#comment-15159130
 ] 

Jason Lowe commented on YARN-4705:
--

Agreed that rawfile:// could prove useful outside of this scenario and probably 
makes sense to have in common.  Seems like it should be very straightforward to 
add the new scheme and direct it to RawLocalFileSystem, but I'm not familiar 
with how Windows could mess with that.
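
A possible shape for that wiring (hypothetical and untested; the 
{{fs.<scheme>.impl}} lookup is the standard mechanism, but {{RawFileFileSystem}} 
and the {{fs.rawfile.impl}} key are illustrative, not existing code):

{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;

// A scheme that resolves to the raw (no CRC) local filesystem.
public class RawFileFileSystem extends RawLocalFileSystem {
  @Override
  public URI getUri() { return URI.create("rawfile:///"); }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setClass("fs.rawfile.impl", RawFileFileSystem.class, FileSystem.class);
    FileSystem fs = new Path("rawfile:///tmp").getFileSystem(conf);
    System.out.println(fs.getClass());  // flush() here really hits the disk
  }
}
{code}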



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159115#comment-15159115
 ] 

Steve Loughran commented on YARN-4705:
--

Yeah, thought of that... tried patching file:// but it didn't take. I'll delve 
some more. IMO we should consider something like rawfile:// everywhere, but the 
Windows URI handling stuff may turn out to be extra fun there.



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159102#comment-15159102
 ] 

Jason Lowe commented on YARN-4705:
--

Another workaround could be a scheme to get a raw local filesystem instead of 
one with a CRC and use that for the unit tests, e.g. something like rawfile:// 
instead of file://.  Then flushes would work as ATS 1.5 expects.

Agree that we could move the errors encountered during "live" file reads to 
debug to avoid scaring users, and document that ATS 1.5 relies on flush actually 
flushing, which means it can't be used with the CRC'd local filesystem.



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159012#comment-15159012
 ] 

Steve Loughran commented on YARN-4705:
--

OK, so HDFS has a guaranteed flush but no guarantees on modtime or size 
propagation; in contrast, the local file:// FS keeps FileStatus.length 
consistent with the actual length, but doesn't flush when told to, and can 
delay its writes until a CRC-worth of data has been written; there is no 
obvious way to turn this off for testing via config files.

On HDFS, then: an empty file's length can't be interpreted as a reason to skip, 
so reads must be attempted; but any EOFException or similar is not a failure. 
That is: you can't skip on empty, just swallow the failure. Maybe log the 
exception and the current file status value at DEBUG, or just attempt to read() 
byte 0 after opening the file; an EOFException means "still empty".
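
i.e. roughly (a sketch, with hypothetical names):

{code}
import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class EmptyProbe {
  /** Probe byte 0 rather than trusting FileStatus.getLen(). */
  static boolean hasData(FileSystem fs, Path path) throws IOException {
    try (FSDataInputStream in = fs.open(path)) {
      return in.read() != -1;  // -1: opened fine, nothing readable yet
    } catch (EOFException e) {
      return false;            // e.g. ChecksumFSInputChecker on a 0-byte file:
                               // recoverable, "still empty", retry next scan
    }
  }
}
{code}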

That essentially means that until such a switch is provided, you cannot use the 
local FS as a back end for ATS 1.5, even for testing. Or at least, you can 
write with it, but the data isn't guaranteed to be visible until close() is 
called, so you may not get any view of incomplete apps, which is precisely what 
I'm seeing.

If this is the case, then that's something ATS1.5 can't fix: it will have to be 
in the documentation.





[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158913#comment-15158913
 ] 

Jason Lowe commented on YARN-4705:
--

bq. One RPC call to check the file size shouldn't be a big problem in general.

As I mentioned above, we _cannot_ rely on the file size to be accurate.  The 
file is being actively written, and there's no guarantee the file size will be 
updated in a timely manner after data is written.  There can be data in the 
file for hours and the file size could still be zero.  In HDFS it will only be 
updated when the next block is allocated, so it could sit at filesize of 0 for 
a very long time (depending upon how fast the writer is going) until the 
filesize suddenly jumps to the blocksize when the writer passes the first block 
boundary.  The only real way to know how much data is in the file is to read it 
-- we cannot rely on what the namenode reports.

bq. After a scan of an empty file/failed parse, it gets loaded again, next scan 
round? Or is it removed from the scan list?

The file is always revisited, errors or not, on the next scan round as long as 
the application is active.  It opens the file, then seeks to the last 
successfully read byte offset and tries to read more.  If data is successfully 
read, then it updates the byte offset for the next round; rinse, repeat.
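
In rough pseudo-Java (hypothetical names, not the real {{LogInfo}} code), each 
round does something like:

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class TailReader {
  private long offset;  // persisted between scan rounds

  void scanOnce(FileSystem fs, Path logPath) throws IOException {
    try (FSDataInputStream in = fs.open(logPath)) {
      in.seek(offset);                // skip what was already consumed
      long consumed = parseEntities(in);
      offset += consumed;             // advance only past complete entities
    }
  }

  long parseEntities(FSDataInputStream in) throws IOException {
    return 0;  // JSON parse of whatever bytes are visible
  }
}
{code}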

bq. Really a failure to parse the JSON or an empty file should be treated the 
same: try later if the file size increases

Again, we cannot rely on the file size to be updated.  To reduce load on the 
namenode, the writer is simply pushing the data out to the datanode -- it's not 
also making an RPC call to the namenode to update the filesize.  The only 
actors involved are the writer, the datanode, and the reader.  The namenode is 
oblivious to what's happening until the next block is allocated, which could 
take a really long time if the writer is writing slowly.  Note that for these 
files a slow writer is not a rare case, as it only writes when tasks change 
state.

I agree we need to handle this better, probably by making the error a bit less 
scary in the log.



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158680#comment-15158680
 ] 

Steve Loughran commented on YARN-4705:
--

That's what confuses me. After a scan of an empty file or a failed parse, does 
it get loaded again next scan round, or is it removed from the scan list? 
Really, a failure to parse the JSON and an empty file should be treated the 
same: try later if the file size increases (I found that to be a better metric 
when dealing with cached filesystem data in the Spark history server). If the 
FS is buffering after a flush, then it may only save the last block, so the 
JSON won't parse fully at first, which is why that needs to retry too.



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-22 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158146#comment-15158146
 ] 

Li Lu commented on YARN-4705:
-

It seems we need to handle this somehow? This will happen when a scan occurs 
after an empty file has been created but before anything is written to it. Once 
we hit this problem we will throw this exception. To me, preventing this from 
happening looks more appealing than logging a WARN for it. One RPC call to 
check the file size shouldn't be a big problem in general. 
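
The guard being proposed would be roughly (a sketch only, hypothetical names):

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class SizeCheck {
  /** The proposed guard: one getFileStatus() RPC before opening. */
  static boolean worthOpening(FileSystem fs, Path logPath) throws IOException {
    // Caveat from the comments above: HDFS can report a stale length for a
    // file still open for writing, so this may skip files that hold data.
    return fs.getFileStatus(logPath).getLen() > 0;
  }
}
{code}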



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157273#comment-15157273
 ] 

Jason Lowe commented on YARN-4705:
--

bq. adding a check to skip 0-byte files stops the stack I've been seeing from 
coming back...

The issue with checking the file size first is that the ATS could refuse to 
open a file that has data in it, adding extra delay between the application 
writing the data and the data appearing in the ATS.  Since we're trying to read 
a file that is still being written the file sizes may not reflect reality, 
especially in file systems like HDFS where the file size is only updated when 
new blocks are allocated.



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157262#comment-15157262
 ] 

Jason Lowe commented on YARN-4705:
--

ATS 1.5 relies on reading files that are actively being written, and there are 
inherent errors that can occur during those scenarios.  I believe the error 
above is aborting the current attempt to read the summary file, but it should 
be trying again shortly later.  For active applications it essentially is 
polling for more events and will keep trying to read both the summary file and 
detailed file(s), seeking past the amount of data successfully read so far.



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154688#comment-15154688
 ] 

Steve Loughran commented on YARN-4705:
--

... adding a check to skip 0-byte files stops the stack trace I've been seeing 
from coming back...



[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably

2016-02-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154667#comment-15154667
 ] 

Steve Loughran commented on YARN-4705:
--

Stack attached.

Looking at the code, I'm not sure what the retry policy here should be. FNFE? 
Auth failures? Recoverable or not? It's hard to predict.

{code}
2016-02-19 18:46:32,641 [EntityLogPluginWorker #1] WARN  org.apache.hadoop.fs.FSInputChecker (ChecksumFileSystem.java:<init>(157)) - Problem opening checksum file: file:/Users/stevel/spark-timeline-integration/target/tmp/ats/active/application__/appattempt___00/summarylog-appattempt___00.  Ignoring exception: 
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:149)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:775)
    at org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:130)
    at org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:110)
    at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:584)
    at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:560)
    at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:747)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2016-02-19 18:46:32,688 [EntityLogPluginWorker #1] INFO  org.apache.hadoop.yarn.server.timeline.LogInfo (LogInfo.java:parseForStore(112)) - Parsed 0 entities from file:/Users/stevel/Projects/spark-timeline-integration/target/tmp/ats/active/application__/appattempt___00/summarylog-appattempt___00 in 50 msec
2016-02-19 18:46:32,689 [EntityLogPluginWorker #1] DEBUG org.apache.hadoop.yarn.server.timel
{code}
