[
https://issues.apache.org/jira/browse/YARN-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289895#comment-15289895
]
Sangjin Lee commented on YARN-5109:
-----------------------------------
I spoke with [~jrottinghuis] offline about this. Initially we were thinking
that we should encode those characters even in the case of bytes (essentially
creating {{Separator.joinEncoded(byte[]...)}} and removing the raw
{{Separator.join()}} method), but we have realized that this won't work.
The key here is that we need not only to handle those separator characters
("=", "!", etc.) but also to *preserve the ordering*. For example, suppose we
have two timestamps ({{ts1}} and {{ts2}}) where {{ts1 < ts2}}, and assume
{{ts2}} has a separator character in it. If we blindly encoded the separator
character, the encoded bytes could easily sort in the opposite order,
violating {{ts1 < ts2}} once they are written. This would break all sorts of
things, including range scans.
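To make the ordering problem concrete, here is a minimal sketch. It assumes a
hypothetical escaping scheme that rewrites the byte "=" (0x3D) as the three
ASCII bytes "%3D"; that scheme, the class name and the helper names are
illustrative only and are not the actual {{Separator}} encoding. It compares
two big-endian timestamps byte by byte, the way a sorted store would.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

final class EncodingOrderDemo {
  // Big-endian 8-byte representation of a long.
  static byte[] toBytes(long value) {
    return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
  }

  // Hypothetical escaping (an assumption for this sketch): every '=' byte
  // is replaced by the three ASCII bytes '%', '3', 'D'.
  static byte[] naiveEncode(byte[] raw) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte b : raw) {
      if (b == '=') {
        out.write('%');
        out.write('3');
        out.write('D');
      } else {
        out.write(b);
      }
    }
    return out.toByteArray();
  }

  // Unsigned lexicographic comparison of two byte arrays.
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  public static void main(String[] args) {
    long ts1 = 0x3CL; // last byte is '<'
    long ts2 = 0x3DL; // last byte is '=', a separator character
    int raw = compareUnsigned(toBytes(ts1), toBytes(ts2));
    int encoded = compareUnsigned(naiveEncode(toBytes(ts1)),
        naiveEncode(toBytes(ts2)));
    System.out.println("raw:     " + Integer.signum(raw));     // prints -1
    System.out.println("encoded: " + Integer.signum(encoded)); // prints 1
  }
}
```

With this particular escaping, {{ts2}} sorts before {{ts1}} after encoding
because "%" (0x25) is smaller than "<" (0x3C). Other escaping schemes would
fail on different inputs, but the same kind of inversion is possible.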
My proposal is this: in almost all of these cases, the structure of the data
we're storing and parsing is well known, whether it is the row key or the
column qualifier. The problem with the current parsing is that it relies
solely on splitting by separator. We should instead use the full structure we
already know to parse correctly.
For example, if we know that the structure is (string)=(timestamp)=(string), we
can parse the first string, and then take the next 8 bytes *without splitting
again* as we know it's a timestamp anyway and convert it into the long number,
and take the last token after that. We should be able to follow the same idea
in all cases.
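As a rough sketch of that idea for the (string)=(timestamp)=(string) case:
the class and method names below are hypothetical, and the code assumes that
the first string contains no raw "=" (for example because strings are still
encoded), that the timestamp is a big-endian 8-byte long, and that any column
prefix has already been stripped. It returns the stored long as is and does
not interpret it further.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class EventColumnNameParser {
  static final byte SEP = '=';

  static final class Parsed {
    final String eventId;
    final long timestamp;
    final String infoKey;

    Parsed(String eventId, long timestamp, String infoKey) {
      this.eventId = eventId;
      this.timestamp = timestamp;
      this.infoKey = infoKey;
    }
  }

  // Helper for the example: joins the three parts with raw separators.
  static byte[] build(String eventId, long timestamp, String infoKey) {
    byte[] id = eventId.getBytes(StandardCharsets.UTF_8);
    byte[] key = infoKey.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(id.length + 1 + Long.BYTES + 1 + key.length)
        .put(id).put(SEP).putLong(timestamp).put(SEP).put(key).array();
  }

  static Parsed parse(byte[] name) {
    // Only the first separator is searched for; it ends the event id.
    int firstSep = -1;
    for (int i = 0; i < name.length; i++) {
      if (name[i] == SEP) {
        firstSep = i;
        break;
      }
    }
    int tsStart = firstSep + 1;
    int tsEnd = tsStart + Long.BYTES;
    if (firstSep < 0 || tsEnd >= name.length || name[tsEnd] != SEP) {
      throw new IllegalArgumentException("incorrectly formatted column name");
    }
    String eventId = new String(name, 0, firstSep, StandardCharsets.UTF_8);
    // Take exactly 8 bytes without splitting, whatever values they hold.
    long timestamp = ByteBuffer.wrap(name, tsStart, Long.BYTES).getLong();
    // Everything after the second separator is the info key.
    String infoKey = new String(name, tsEnd + 1, name.length - tsEnd - 1,
        StandardCharsets.UTF_8);
    return new Parsed(eventId, timestamp, infoKey);
  }

  public static void main(String[] args) {
    // Timestamp bytes from the issue description; they contain 0x3D ('=').
    byte[] name = build("YARN_RM_CONTAINER_CREATED", 0x7FFFFEAB44593D99L,
        "YARN_CONTAINER_ALLOCATED_HOST");
    Parsed p = parse(name);
    System.out.println(p.eventId);
    System.out.println(Long.toHexString(p.timestamp)); // prints 7ffffeab44593d99
    System.out.println(p.infoKey);
  }
}
```

Because the timestamp is consumed by length rather than by delimiter, the raw
bytes never need to be escaped, so their sort order is left untouched.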
Thoughts? [~varun_saxena], let me know if you'd like to take a stab at that
idea, or whether I should.
> timestamps are stored unencoded causing parse errors
> ----------------------------------------------------
>
> Key: YARN-5109
> URL: https://issues.apache.org/jira/browse/YARN-5109
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Sangjin Lee
> Assignee: Varun Saxena
> Priority: Blocker
> Labels: yarn-2928-1st-milestone
>
> When we store timestamps (for example as part of the row key or part of the
> column name for an event), the bytes are used as is without any encoding. If
> the byte value happens to contain a separator character we use (e.g. "!" or
> "="), it causes a parse failure when we read it.
> I came across this while looking into this error in the timeline reader:
> {noformat}
> 2016-05-17 21:28:38,643 WARN
> org.apache.hadoop.yarn.server.timelineservice.storage.common.TimelineStorageUtils:
> incorrectly formatted column name: it will be discarded
> {noformat}
> I traced the data that was causing this, and the column name (for the event)
> was the following:
> {noformat}
> i:e!YARN_RM_CONTAINER_CREATED=\x7F\xFF\xFE\xABDY=\x99=YARN_CONTAINER_ALLOCATED_HOST
> {noformat}
> Note that the column name is supposed to be of the format (event
> id)=(timestamp)=(event info key). However, observe the timestamp portion:
> {noformat}
> \x7F\xFF\xFE\xABDY=\x99
> {noformat}
> The presence of the separator ("=") causes the parse error.