[ 
https://issues.apache.org/jira/browse/YARN-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289895#comment-15289895
 ] 

Sangjin Lee commented on YARN-5109:
-----------------------------------

I spoke with [~jrottinghuis] offline about this. Initially we thought we 
should encode those characters even in the case of raw bytes (essentially 
creating {{Separator.joinEncoded(byte[]...)}} and removing the raw 
{{Separator.join()}} method), but we have realized that won't work.

The key here is that we not only need to handle those separator characters 
("=", "!", etc.) but also *preserve the ordering*. For example, suppose we have 
two timestamps ({{ts1}} and {{ts2}}) where {{ts1 < ts2}}. And assume {{ts2}} 
has a separator character in it. If we blindly encoded the separator character, 
we could easily violate {{ts1 < ts2}} once they are written. This would break 
all sorts of things, including range scans.
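
To make the ordering concern concrete, here is a minimal sketch. It assumes 
timestamps are written as 8-byte big-endian longs and compared as unsigned 
bytes, and it uses a made-up escape sequence ("%1$") for "="; the class and 
method names are hypothetical and not part of the existing code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

final class OrderingDemo {
  // Hypothetical escape sequence for the "=" separator (assumption).
  static final byte[] ESCAPED_EQ = {'%', '1', '$'};

  // 8-byte big-endian representation of a long.
  static byte[] toBytes(long v) {
    return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
  }

  // Blindly replaces every "=" byte with the escape sequence.
  static byte[] encode(byte[] raw) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte b : raw) {
      if (b == '=') {
        out.write(ESCAPED_EQ, 0, ESCAPED_EQ.length);
      } else {
        out.write(b);
      }
    }
    return out.toByteArray();
  }

  // Lexicographic comparison of unsigned bytes.
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (d != 0) {
        return d;
      }
    }
    return a.length - b.length;
  }

  public static void main(String[] args) {
    byte[] ts1 = toBytes(0x30L); // last byte is '0'
    byte[] ts2 = toBytes(0x3DL); // last byte is '=', the separator
    System.out.println("raw:     " + Integer.signum(compareUnsigned(ts1, ts2)));
    System.out.println("encoded: "
        + Integer.signum(compareUnsigned(encode(ts1), encode(ts2))));
  }
}
```

With these two values the raw bytes sort as {{ts1 < ts2}}, but after encoding 
the escaped {{ts2}} starts its last position with "%" (0x25), which sorts 
below "0" (0x30), so the order flips.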

My proposal is this. In almost all of these cases, the structure of the data 
we're storing and parsing is already well known, whether it is the row key or 
the column qualifier. The problem with the current parsing is that it relies 
solely on splitting by separator. The parser should instead use the full 
structure it already knows to parse correctly.

For example, if we know that the structure is (string)=(timestamp)=(string), we 
can parse the first string, then take the next 8 bytes *without splitting 
again* (we know they hold a timestamp) and convert them into a long, and 
finally take the last token after that. We should be able to follow the same 
idea in all cases.
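
A rough sketch of that idea for the (string)=(timestamp)=(string) layout 
follows. The class, field, and method names are hypothetical, and it assumes 
the first string never contains a raw "=" (for instance because string parts 
are still encoded) and that the timestamp is an 8-byte big-endian long.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class EventColumnParser {
  static final byte SEP = '=';

  final String eventId;
  final long timestamp;
  final String infoKey;

  private EventColumnParser(String eventId, long timestamp, String infoKey) {
    this.eventId = eventId;
    this.timestamp = timestamp;
    this.infoKey = infoKey;
  }

  // Writes (event id)=(8-byte timestamp)=(info key); the timestamp is raw.
  static byte[] build(String eventId, long timestamp, String infoKey) {
    byte[] id = eventId.getBytes(StandardCharsets.UTF_8);
    byte[] key = infoKey.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(id.length + 1 + Long.BYTES + 1 + key.length)
        .put(id).put(SEP).putLong(timestamp).put(SEP).put(key)
        .array();
  }

  // Parses by structure: only the first separator is searched for; the
  // timestamp is read as a fixed-width field, never split on "=".
  static EventColumnParser parse(byte[] col) {
    int first = 0;
    while (first < col.length && col[first] != SEP) {
      first++;
    }
    int tsStart = first + 1;
    int tsEnd = tsStart + Long.BYTES;
    if (tsEnd >= col.length || col[tsEnd] != SEP) {
      throw new IllegalArgumentException("incorrectly formatted column name");
    }
    String id = new String(col, 0, first, StandardCharsets.UTF_8);
    long ts = ByteBuffer.wrap(col, tsStart, Long.BYTES).getLong();
    String key = new String(col, tsEnd + 1, col.length - tsEnd - 1,
        StandardCharsets.UTF_8);
    return new EventColumnParser(id, ts, key);
  }
}
```

Because the timestamp bytes are stored untouched, their sort order is 
preserved, and a value such as {{\x7F\xFF\xFE\xABDY=\x99}} (which contains 
"=") no longer confuses the parser.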

Thoughts? [~varun_saxena], let me know if you'd like to take a stab at that 
idea, or if I should.

> timestamps are stored unencoded causing parse errors
> ----------------------------------------------------
>
>                 Key: YARN-5109
>                 URL: https://issues.apache.org/jira/browse/YARN-5109
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Varun Saxena
>            Priority: Blocker
>              Labels: yarn-2928-1st-milestone
>
> When we store timestamps (for example as part of the row key or part of the 
> column name for an event), the bytes are used as is without any encoding. If 
> the byte value happens to contain a separator character we use (e.g. "!" or 
> "="), it causes a parse failure when we read it.
> I came across this while looking into this error in the timeline reader:
> {noformat}
> 2016-05-17 21:28:38,643 WARN 
> org.apache.hadoop.yarn.server.timelineservice.storage.common.TimelineStorageUtils:
>  incorrectly formatted column name: it will be discarded
> {noformat}
> I traced the data that was causing this, and the column name (for the event) 
> was the following:
> {noformat}
> i:e!YARN_RM_CONTAINER_CREATED=\x7F\xFF\xFE\xABDY=\x99=YARN_CONTAINER_ALLOCATED_HOST
> {noformat}
> Note that the column name is supposed to be of the format (event 
> id)=(timestamp)=(event info key). However, observe the timestamp portion:
> {noformat}
> \x7F\xFF\xFE\xABDY=\x99
> {noformat}
> The presence of the separator ("=") causes the parse error.


