[ 
https://issues.apache.org/jira/browse/YARN-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540288#comment-14540288
 ] 

Robert Kanter commented on YARN-2942:
-------------------------------------

Thanks [~jlowe] for your feedback.  It's good to get more views on this.

{quote} If I understand them correctly they both propose that the NMs upload 
the original per-node aggregated log to HDFS and then something (either the NMs 
or the RM) later comes along and creates the aggregate-of-aggregates log{quote}
Yes.  That's correct.  

{quote}However I didn't see details on solving the race condition where a log 
reader comes along, sees from the index file that the desired log isn't in the 
aggregate-of-aggregates, then opens the log and reads from it just as the log 
is deleted by the entity appending to the aggregate-of-aggregates.{quote}
That's a good point.  I hadn't thought of that issue.  Thinking about it now, I 
think there are a few options here:
- We could simply have the reader try again if it runs into a problem
- We could have the last NM delete the aggregated log files, so that it's less 
likely that this situation can occur
- Each NM could wait some amount of time (e.g. a few mins) after appending its 
log file before deleting the original file, so that it's less likely that this 
situation can occur
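
As a rough illustration of the first option (the reader simply trying again), 
here is a minimal sketch.  It uses the local filesystem as a stand-in for HDFS, 
and everything in it is an assumption made for illustration: the `read_log` 
function, the `index.json` layout (node name mapped to an offset and length in 
`combined.log`), and the `<node>.log` naming are hypothetical and do not come 
from any of the attached proposals or patches.

```python
import json
import os


def read_log(app_dir, node, max_attempts=3):
    """Return the log bytes for `node`, retrying if the per-node file vanishes.

    Hypothetical layout (assumed for this sketch only):
      app_dir/index.json   -> {"<node>": [offset, length], ...}
      app_dir/combined.log -> the aggregate-of-aggregates
      app_dir/<node>.log   -> per-node aggregated log not yet combined
    """
    index_path = os.path.join(app_dir, "index.json")
    combined_path = os.path.join(app_dir, "combined.log")
    for _ in range(max_attempts):
        # Always re-read the index: it may have changed since the last attempt.
        with open(index_path) as f:
            index = json.load(f)
        if node in index:
            offset, length = index[node]
            with open(combined_path, "rb") as f:
                f.seek(offset)
                return f.read(length)
        try:
            with open(os.path.join(app_dir, node + ".log"), "rb") as f:
                return f.read()
        except FileNotFoundError:
            # The per-node file was deleted after we read the index.
            # Loop again; the index should now list this node.
            continue
    raise FileNotFoundError(
        "log for %s not found after %d attempts" % (node, max_attempts)
    )
```

This only covers the case where the open fails.  A real HDFS reader would 
probably also need to retry if the file disappears partway through a read, and 
the retry is only safe if the index is updated before the per-node file is 
deleted.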

{quote}We have an internal solution where we create per-application har files 
of the logs{quote}
Can you give some more details on this?  Is it something you can share?  If 
you've already solved this issue, then perhaps we can just use that.  Though 
doesn't creating har files require running an MR job?  

{quote}Another issue from log aggregation we've seen in practice is that the 
proposals don't address the significant write load the per-node aggregate files 
place on the namenode.{quote}
That's a good point.  Shortly after a job finishes, all of the involved NMs 
would upload their log files around the same time, which puts stress on the NN. 
 Having the NMs report their current aggregation progress to the RM was recently 
added by YARN-1376 and related JIRAs.  Having the RM coordinate the aggregation is 
similar to my design with ZK, but instead of a ZK lock, the RM orchestrates 
things.  I like the idea of getting rid of the original aggregation and having 
the NMs all write to HDFS once, in the combined file directly.  We'd have to 
implement your last bullet point to have the NMs serve the logs in the 
meantime, as I don't think that's there today.  
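
The RM-orchestrated variant could look roughly like the sketch below, where 
the RM hands out turns to append to the combined file instead of the NMs 
contending for a ZK lock.  This is only a sketch under assumptions: the 
`AppendCoordinator` class and its `report_ready` / `report_done` methods are 
hypothetical names, and the real heartbeat and status-reporting mechanics are 
not modeled.

```python
from collections import deque


class AppendCoordinator:
    """Stand-in for the RM: lets one NM at a time append to the combined file."""

    def __init__(self):
        self._waiting = deque()  # NMs that reported logs ready, in arrival order
        self._current = None     # NM currently allowed to append, if any

    def report_ready(self, node):
        """An NM reports that it has logs to append; returns who holds the turn."""
        if node != self._current and node not in self._waiting:
            self._waiting.append(node)
        return self._grant_next()

    def report_done(self, node):
        """The NM holding the turn reports that it finished; returns the next holder."""
        if node != self._current:
            raise ValueError("%s does not hold the append turn" % node)
        self._current = None
        return self._grant_next()

    def _grant_next(self):
        # Hand the turn to the longest-waiting NM if nobody is appending.
        if self._current is None and self._waiting:
            self._current = self._waiting.popleft()
        return self._current


coordinator = AppendCoordinator()
coordinator.report_ready("nm1")  # nm1 gets the turn immediately
coordinator.report_ready("nm2")  # nm2 queues behind nm1
coordinator.report_done("nm1")   # the turn passes to nm2
```

Serializing the appends this way would also spread out the writes that hit the 
NN right after a job finishes, at the cost of the NMs having to serve their 
logs locally until their turn comes up.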

I'll try to flesh this design out a bit more and see where it goes, unless we 
should use har files instead, though that would add an MR dependency.

> Aggregated Log Files should be combined
> ---------------------------------------
>
>                 Key: YARN-2942
>                 URL: https://issues.apache.org/jira/browse/YARN-2942
>             Project: Hadoop YARN
>          Issue Type: New Feature
>    Affects Versions: 2.6.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>         Attachments: CombinedAggregatedLogsProposal_v3.pdf, 
> CombinedAggregatedLogsProposal_v6.pdf, CombinedAggregatedLogsProposal_v7.pdf, 
> CompactedAggregatedLogsProposal_v1.pdf, 
> CompactedAggregatedLogsProposal_v2.pdf, 
> ConcatableAggregatedLogsProposal_v4.pdf, 
> ConcatableAggregatedLogsProposal_v5.pdf, YARN-2942-preliminary.001.patch, 
> YARN-2942-preliminary.002.patch, YARN-2942.001.patch, YARN-2942.002.patch, 
> YARN-2942.003.patch
>
>
> Turning on log aggregation allows users to easily store container logs in 
> HDFS and subsequently view them in the YARN web UIs from a central place.  
> Currently, there is a separate log file for each Node Manager.  This can be a 
> problem for HDFS if you have a cluster with many nodes as you’ll slowly start 
> accumulating many (possibly small) files per YARN application.  The current 
> “solution” for this problem is to configure YARN (actually the JHS) to 
> automatically delete these files after some amount of time.  
> We should improve this by compacting the per-node aggregated log files into 
> one log file per application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
