[ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863169#comment-16863169
 ] 

Adam Antal commented on YARN-9525:
----------------------------------

Sorry for the delayed answer, let me recap my current progress.

I ran the integration tests multiple times in every scenario to get a decent 
picture of what we're dealing with. The tests were passing against a remote 
folder in S3, so I thought the patch was OK, but I also checked the existing 
behaviour (the HDFS remote app dir case), as suggested in [~wangda]'s last 
comment.

Though IFile reportedly succeeds in aggregating logs in those scenarios, 
during rolling log aggregation I have problems accessing the logs through the 
logs CLI (which reads through the associated file controller). It does not 
display any error; it just returns the wrong parts of the log. In my case I 
ran a sleep job in the child container, and its logs are mixed up with the 
AM's logs when I try to read them.

I compiled some debug messages into the hadoop-yarn-common jar and ran the 
tests again. It turns out the offset was miscalculated (due to the patch, 
obviously): in the case of the regular HDFS remote dir, when we read the logs 
back, we read from a wrong offset in the aggregated file, so the logs get 
messed up. The length, however, was OK (it read the correct number of bytes, 
just starting from a bad position).
 The funny thing is that the patch works fine against s3a, so I had to dig a 
bit deeper, and found the following:

Pre-patch, when:
 - an HDFS path is set as the remote app folder,
 - we're in a rolling log aggregation situation, and
 - there was already a rolling session,

then during the next rolling session there is no rollover (if the file is not 
big enough), so no new file is generated. Meanwhile a new OutputStream is 
created targeting the existing file in append mode, but this time the "cursor" 
points to the end of the file. After detecting this (by writing the 
dummyBytes, flushing, and checking the just-written bytes), the currentOffset 
is set to 0.

After applying the patch: 
 Again, there is no rollover, hence the local boolean variable createdNew is 
set to false, and the currentOffset is set by the following piece of code:
{noformat}
currentOffSet = fc.getFileStatus(aggregatedLogFile).getLen();
{noformat}
which is wrong - it has to be zero, as before. The "cursor" still points to 
the end of the file, yet the code assumes the write position also has to be 
pushed forward by the current length of the file. That offset is written to 
the index part, so when we read the file back, we display the wrong bytes, 
shifted by exactly that many bytes.
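To make the symptom concrete, here is a simplified model of the read-back 
(hypothetical names, a byte array standing in for the aggregated file; this is 
not the actual IFile code). With a base offset of 0 the child's logs come back 
intact; with a base offset equal to the file length at open time, the read 
start is shifted by exactly that many bytes:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class OffsetModel {
    // Read `len` bytes of the aggregated file, starting at baseOffset + start
    // (baseOffset models the currentOffSet recorded in the index part).
    static byte[] readBack(byte[] file, long baseOffset, long start, int len) {
        int from = (int) (baseOffset + start);
        return Arrays.copyOfRange(file, from, from + len);
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream aggregated = new ByteArrayOutputStream();
        aggregated.write("AM-LOGS!".getBytes(StandardCharsets.UTF_8));   // first rolling session (8 bytes)
        long cursor = aggregated.size();          // append cursor already sits at EOF (8)
        aggregated.write("CHILD-LOGS".getBytes(StandardCharsets.UTF_8)); // next session, no rollover
        byte[] file = aggregated.toByteArray();

        // Base offset 0 (pre-patch behaviour): the absolute write position is enough.
        String ok = new String(readBack(file, 0L, cursor, 10), StandardCharsets.UTF_8);
        System.out.println(ok);                   // CHILD-LOGS

        // Base offset = file length at open time (= cursor): the read start
        // lands at cursor + cursor, past where the child's logs actually begin.
        long shifted = cursor + cursor;
        System.out.println(shifted);              // 16
    }
}
```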

The solution is simple: for cloud remote app folders the rollover size will be 
set to 0 (see the related jira: YARN-9607), so a new file will always be 
created. (This is unavoidable, as append is not available there.)
 So we should first check whether createdNew is true, and only call 
getFileStatus if it is false:
 - if there is no append, we're fine: a new file is always created, so the 
boolean is always true and the offset is always zero (every rollover session 
starts writing from the beginning of a new, empty file);
 - if there is append, we fall back to the currently existing behaviour: if 
createdNew is true, we're good; if it's false, we default to the pre-patch 
offset handling.
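A sketch of the extra check described above (hypothetical method shape 
mirroring the variable names in this comment, not a verbatim excerpt of the 
patch): createdNew short-circuits the offset to zero, and only the append path 
consults the pre-patch detection.

```java
import java.util.function.LongSupplier;

public class CreatedNewSketch {
    // createdNew: rollover produced a fresh file this session.
    // prePatchDetection: the existing dummyBytes-based offset detection,
    // passed in as a supplier so this sketch stays self-contained.
    static long currentOffset(boolean createdNew, LongSupplier prePatchDetection) {
        if (createdNew) {
            // New, empty file: writing starts at position 0. On filesystems
            // without append (s3a with rollover size 0), rollover always
            // creates a new file, so this branch is always taken.
            return 0L;
        }
        // Appended existing file (HDFS): fall back to the existing behaviour.
        return prePatchDetection.getAsLong();
    }

    public static void main(String[] args) {
        System.out.println(currentOffset(true, () -> 1024L));  // 0
        System.out.println(currentOffset(false, () -> 0L));    // 0 (existing HDFS behaviour)
    }
}
```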

I uploaded a new patch which addresses the comment above (it's actually just 
an extra if), and I hope this investigation is clear and makes sense.
 Setting rollover to zero for non-appendable filesystems will be addressed in 
YARN-9607, but this patch makes sense without that, so the two issues do not 
depend on each other.

Reacting to [~ste...@apache.org]'s and [~tmarquardt]'s comments:
{quote}Good point. Would it actually be possible to pull this out into 
something you could actually make a standalone test against a filesystem?{quote}
Well, it seems it can hardly be modularised that way, so simply extracting a 
few lines of code into a test is not really applicable.
I can see a possible approach though: re-reading the code, collecting all the 
prerequisites and implicit assumptions IFile makes about the filesystem, and 
putting them into an FSContract-based test. Is that what you originally had in 
mind?

{quote}getPos does seem a better strategy here. Adam: what do you think?{quote}
It makes sense to change this (use getPos), but I don't know how the existing 
(HDFS) behaviour would change. I will test that as well, but I was pretty 
occupied figuring out the above.
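For reference, the getPos idea can be illustrated with a local file standing 
in for the remote dir (FileChannel.position() playing the role of 
FSDataOutputStream.getPos(); hypothetical names, just a sketch): the writer's 
own cursor already knows the post-write position, with no metadata round trip 
to a possibly eventually consistent store.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class GetPosSketch {
    // Append `data` and return the stream's own post-write position -- the
    // analogue of calling getPos() instead of getFileStatus().getLen().
    static long appendAndGetPos(Path file, byte[] data) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.APPEND)) {
            ch.write(ByteBuffer.wrap(data));
            // position() is answered by the writer itself; a fresh
            // getFileStatus() against s3a might not see the new bytes yet.
            return ch.position();
        }
    }

    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("ifile", ".log");
        Files.write(f, "first-session".getBytes());                // earlier rolling session (13 bytes)
        System.out.println(appendAndGetPos(f, "uuid".getBytes())); // 17
        Files.delete(f);
    }
}
```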

It seems HDFS is a bit hardwired into this, but at this point my integration 
tests are passing, which is a good sign.

Please review if you can spare some time, and ask any questions you may have - 
I will do my best to clarify.

> IFile format is not working against s3a remote folder
> -----------------------------------------------------
>
>                 Key: YARN-9525
>                 URL: https://issues.apache.org/jira/browse/YARN-9525
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: log-aggregation
>    Affects Versions: 3.1.2
>            Reporter: Adam Antal
>            Assignee: Peter Bacsko
>            Priority: Major
>         Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch, 
> YARN-9525.002.patch, YARN-9525.003.patch
>
>
> Using the indexed file format with {{yarn.nodemanager.remote-app-log-dir}} 
> configured to an s3a URI, log aggregation throws the following exception:
> {noformat}
> Cannot create writer for app application_1556199768861_0001. Skip log upload 
> this time. 
> java.io.IOException: java.io.FileNotFoundException: No such file or 
> directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>       at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>       at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488)
>       at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382)
>       at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321)
>       at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
>       at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244)
>       at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240)
>       at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>       at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246)
>       at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>       at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195)
>       ... 7 more
> {noformat}
> This stack trace points to 
> {{LogAggregationIndexedFileController$initializeWriter}}, where we do the 
> following steps (in a non-rolling log aggregation setup):
> - create an FSDataOutputStream
> - write out a UUID
> - flush
> - immediately after that, call getFileStatus to get the length of the log 
> file (the bytes we just wrote out), and that's where the failure happens: 
> the file is not there yet due to eventual consistency.
> Maybe we can get rid of that, so we can use the IFile format against an s3a 
> target.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
