[ 
https://issues.apache.org/jira/browse/YARN-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107821#comment-16107821
 ] 

Wangda Tan commented on YARN-6875:
----------------------------------

Thanks [~jlowe], 

bq. Quite a few important points to note here:
#1/#2 are true; however, the original goal of this JIRA is not just to be 
slightly better than the old format.

For #3, it is not true when append fails.

For example, suppose a file has been appended to 3 times (partial log 
aggregation ran 3 times). The file looks like:
{code}
|Data-1|Index-1|Data-2|Index-2|Data-3|Index-3|
{code} 

On the 4th append, the write fails partway through (for example, because of an NM failure):
{code}
|Data-1|Index-1|Data-2|Index-2|Data-3|Index-3|Data-4...(corrupted)|
{code} 

To read the logs, we have to scan all the way back to Index-3. Depending on 
how much of Data-4 was written, this could be costly.
Worse, if Data-4 is never repaired for some reason, every later read of the 
app log has to repeat the reverse scan to find Index-3.
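
To make the cost concrete, here is a minimal sketch of that reverse scan. It assumes every index ends with a recognizable footer; INDEX_MAGIC and find_last_index_end are hypothetical names for illustration, not part of any proposed format, and a real reader would seek within the file rather than hold it all in memory.

```python
# Hypothetical footer assumed to terminate every complete index.
INDEX_MAGIC = b"\x00IDX_END\x00"


def find_last_index_end(data: bytes):
    """Scan backward from EOF for the last complete index.

    Returns (offset just past that index, number of positions examined),
    or (None, positions examined) if no index footer is found.
    """
    pos = len(data) - len(INDEX_MAGIC)
    examined = 0
    while pos >= 0:
        examined += 1
        if data[pos:pos + len(INDEX_MAGIC)] == INDEX_MAGIC:
            return pos + len(INDEX_MAGIC), examined
        pos -= 1  # step back one byte and try again
    return None, examined
```

The number of positions examined grows linearly with the size of the corrupted Data-4 tail, and the scan is repeated on every read until the tail is repaired.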

I have another solution in mind, in addition to Jason's earlier suggestion:

Within each partial log aggregation append, we write a UUID + block_id marker 
after every N bytes of data (N could be 64MB, for example). The data looks like:
{code}
|Data-block-1-0|UUID_1_0|Data-block-1-1|UUID_1_1|Index-1|Data-block-2-0|UUID_2_0|Index-2|
{code} 
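
A rough sketch of how a writer could produce this layout, under several assumptions that are mine and not settled here: the marker is a 16-byte UUID followed by a 4-byte big-endian block id, the same UUID is reused for all blocks of one append, and a trailing chunk shorter than N gets no marker. append_round is a hypothetical helper name.

```python
import struct


def append_round(payload: bytes, index: bytes, round_uuid: bytes,
                 block_size: int) -> bytes:
    """Lay out one append: data blocks, each full block followed by a marker,
    then the index for this append."""
    out = bytearray()
    block_id = 0
    for start in range(0, len(payload), block_size):
        chunk = payload[start:start + block_size]
        out += chunk
        if len(chunk) == block_size:
            # Marker (assumed layout): 16-byte UUID + 4-byte big-endian block id.
            out += round_uuid + struct.pack(">I", block_id)
            block_id += 1
    out += index
    return bytes(out)
```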

If an append fails for some reason, we search backward for the last 
UUID + block_id. For example:
{code}
|.good-data|.bad-data.|UUID_x_y|.bad-data.|
{code}

The last UUID + block_id is UUID_x_y. From it we know that the corrupted 
append has y more blocks in front of that position, so the reader can skip 
back y * (BLOCK_SIZE + UUID_SIZE) bytes in one step, which is better than 
scanning the blocks one by one.
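
The skip computation could look roughly like the sketch below. The assumptions are mine: fixed-size blocks, a marker made of a 16-byte UUID plus a 4-byte big-endian block id written right after its block, block ids starting at 0, and a reader that already knows the UUID of the failed append. Because the marker follows its block, the sketch also steps back over block y itself (one extra BLOCK_SIZE). start_of_failed_append is a hypothetical name.

```python
import struct

BLOCK_SIZE = 8                # tiny for the demo; the proposal suggests e.g. 64MB
UUID_SIZE = 16
MARKER_SIZE = UUID_SIZE + 4   # assumed marker: UUID + 4-byte big-endian block id


def start_of_failed_append(data: bytes, round_uuid: bytes):
    """Return the offset where the failed append began, or None if no
    complete marker of that append is present."""
    pos = data.rfind(round_uuid)
    if pos != -1 and pos + MARKER_SIZE > len(data):
        # The last marker was itself cut off; fall back to the previous one.
        pos = data.rfind(round_uuid, 0, pos)
    if pos == -1:
        return None
    (block_id,) = struct.unpack(">I", data[pos + UUID_SIZE:pos + MARKER_SIZE])
    # Step back over block y, then over the y (block + marker) pairs before it.
    return pos - BLOCK_SIZE - block_id * (BLOCK_SIZE + MARKER_SIZE)
```

One backward search for the marker plus one subtraction replaces the block-by-block scan; the returned offset is where the last good index ends.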

Thoughts? [~xgong].

> New aggregated log file format for YARN log aggregation.
> --------------------------------------------------------
>
>                 Key: YARN-6875
>                 URL: https://issues.apache.org/jira/browse/YARN-6875
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Xuan Gong
>            Assignee: Xuan Gong
>         Attachments: YARN-6875-NewLogAggregationFormat-design-doc.pdf
>
>
> T-file is the underlying log format for the aggregated logs in YARN. We have 
> seen several performance issues, especially for very large log files.
> We will introduce a new log format which has better performance for large 
> log files.


