[jira] [Commented] (YARN-5109) timestamps are stored unencoded causing parse errors

Joep Rottinghuis (JIRA) Tue, 24 May 2016 23:51:40 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299567#comment-15299567
 ]


Joep Rottinghuis commented on YARN-5109:
----------------------------------------

Overall, the approach and implementation looks really clean and solid.
I'll have to look at little further tomorrow, but I have three comments:
1) While reviewing patch for all places where we're converting strings into 
qualifiers for rows and columns it struck me that in 

HBaseTimelineWriterImpl.storeInFlowActivityTable
We still use GenericObjectMapper 
to go from a String to a column qualifier.
See line 190 (after applying patch):
{code}
    byte[] qualifier = GenericObjectMapper.write(flowRunId);
{code}

In TimelineEntityReader.parseEntity
line 140
we use the regular StringKeyConverter:
{code}
        FlowActivityColumnPrefix.RUN_ID.readResults(result,
            StringKeyConverter.getInstance());
    for (Map.Entry<String, Object> e : runIdsMap.entrySet()) {
{code}

so presumably the above code should read:
{code}
 byte[] qualifier = StringKeyConverter.getInstance().encode(flowRunId);
{code}

2) Somehow encode() and decode() apis sound like they should be symmetric. For 
StringKeyConverter they are.
StringKeyConverter.getInstance.decode(encode(s)).equals.s
for any String s (haven't checked null, or "").
Similarly
{code}
Bytes.equals(b, StringKeyConverter.getInstance.encode(decode(b)) == true
{code}
for any byte[] b.
This isn't true for ApplicationRowKey etc.
I understand this is because we use the same encode to create a prefix and I 
understand that we normally expect all fields to be present when we create an 
ApplicationRowKey.

Somehow it would be nicer to create an applicationRowKey object and then 
perhaps have a validate method on it, or be able to create one as a prefix 
(perhaps with a separate constructor).
Right now I don't really have a consistent suggestion for this, other than the 
comment that this somehow seems "unpleasant".

If we do decide to keep this current approach (perhaps because alternatives are 
just as ugly or unpleasant) then we should at least document in the javadoc for 
the encode and decode methods that the invariant that one might expect from 
naming doesn't have to hold true for all implementations. We should also update 
the javadoc in the implementations from a simple @override to a generated 
javadoc referring to the interface method.
I'm not sure if this is Hadoop coding standards, but I've always found it a 
little silly to methods that implement an interface with @override. For example
{code}
  /*
   * (non-Javadoc)
   * 
   * @see
   * org.apache.hadoop.yarn.server.timelineservice.storage.common.KeyConverter
   * #encode(java.lang.Object)
   */
  public byte[] encode(EntityRowKey rowKey) {
{code}
makes more sense to me then
{code}
  @Override  
  public byte[] encode(EntityRowKey rowKey) {
{code}
but that is probably more a matter of taste...

3) Why don't we do to ColumnHelper.readResultsWithTimestamps what we did to 
readResults?
that would make the method signature:
{code}
 public <K, V> NavigableMap<K, NavigableMap<Long, V>> readResultsWithTimestamps(
      Result result, byte[] columnPrefixBytes, KeyConverter<K> keyConverter)
      throws IOException {
{code}
With equivalent changes up the call stack in 
ApplicationColumnPrefix.readResultsWithTimestamps, 
EntiyColumnPrefix,readResultsWithTimestamps, 
FlowActivityColumnPrefix.readResultsWithTimestamps, and 
FlowRunColumnPrefix.readResultsWithTimestamps and all their respective uses.

> timestamps are stored unencoded causing parse errors
> ----------------------------------------------------
>
>                 Key: YARN-5109
>                 URL: https://issues.apache.org/jira/browse/YARN-5109
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Varun Saxena
>            Priority: Blocker
>              Labels: yarn-2928-1st-milestone
>         Attachments: YARN-5109-YARN-2928.003.patch, 
> YARN-5109-YARN-2928.01.patch, YARN-5109-YARN-2928.02.patch, 
> YARN-5109-YARN-2928.03.patch, YARN-5109-YARN-2928.04.patch, 
> YARN-5109-YARN-2928.05.patch, YARN-5109-YARN-2928.06.patch
>
>
> When we store timestamps (for example as part of the row key or part of the 
> column name for an event), the bytes are used as is without any encoding. If 
> the byte value happens to contain a separator character we use (e.g. "!" or 
> "="), it causes a parse failure when we read it.
> I came across this while looking into this error in the timeline reader:
> {noformat}
> 2016-05-17 21:28:38,643 WARN 
> org.apache.hadoop.yarn.server.timelineservice.storage.common.TimelineStorageUtils:
>  incorrectly formatted column name: it will be discarded
> {noformat}
> I traced the data that was causing this, and the column name (for the event) 
> was the following:
> {noformat}
> i:e!YARN_RM_CONTAINER_CREATED=\x7F\xFF\xFE\xABDY=\x99=YARN_CONTAINER_ALLOCATED_HOST
> {noformat}
> Note that the column name is supposed to be of the format (event 
> id)=(timestamp)=(event info key). However, observe the timestamp portion:
> {noformat}
> \x7F\xFF\xFE\xABDY=\x99
> {noformat}
> The presence of the separator ("=") causes the parse error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (YARN-5109) timestamps are stored unencoded causing parse errors

Reply via email to