[ 
https://issues.apache.org/jira/browse/YARN-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315214#comment-15315214
 ] 

Joep Rottinghuis commented on YARN-5167:
----------------------------------------

Ok, discussed this a bit more with [~sjlee0] to see whether we can avoid 
escaping too many times (and the sequence getting large) when we encode one 
separator at a time. 
Assuming we add the following to Separator: {code}PERCENT("%", "%4$");{code}

Escaping % for each separator would blow up our strings:
{code}Separator.encode(SPACE.encode("a!some name"), TAB, VALUES, 
QUALIFIERS){code} would become
(space){code}a!some%2$name{code}
(tab){code}a!some%4$2$name{code}
(values:=){code}a!some%4$4$2$name{code}
(qualifiers:!){code}a%0$some%4$4$4$2$name{code}
The 11-character string becomes a 21-character string.
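
The growth can be reproduced with a small sketch. It assumes a hypothetical helper, encodeOne, that escapes every % as %4$ before encoding a single separator; it is not the real Separator API. The encodings %3$ for tab and %1$ for = are also assumptions, since this discussion only ties %2$ to space, %0$ to ! and %4$ to %.

```java
// A sketch only: encodeOne is a hypothetical helper, not the real Separator API.
class PerSeparatorEscapeSketch {

    // Escape every '%' as "%4$" first, then encode one separator.
    static String encodeOne(String s, String raw, String encoded) {
        return s.replace("%", "%4$").replace(raw, encoded);
    }

    // Encode space, tab, values and qualifiers one separator at a time.
    static String encodeAll(String s) {
        s = encodeOne(s, " ", "%2$");   // SPACE
        s = encodeOne(s, "\t", "%3$");  // TAB (encoding assumed)
        s = encodeOne(s, "=", "%1$");   // VALUES (encoding assumed)
        s = encodeOne(s, "!", "%0$");   // QUALIFIERS
        return s;
    }

    public static void main(String[] args) {
        String in = "a!some name";
        String out = encodeAll(in);
        // Print the lengths before and after, followed by the encoded string.
        System.out.println(in.length() + " -> " + out.length() + ": " + out);
    }
}
```

Every pass after the first adds two characters to each % already in the string, which is where the bloat comes from.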

If we encode % only once per call to encode(...) then we'd get
(space){code}a!some%2$name{code}
(tab, values, qualifiers){code}a%0$some%2$name{code}
11 character string becomes a 15 character string. 

The bloat gets worse with more occurrences of encoded characters.
The downside as [~sjlee0] pointed out is that we have to make it explicit that

{code}Separator.encode(SPACE.encode("a!some name"), TAB, VALUES, 
QUALIFIERS){code} must be followed by
{code}SPACE.decode(Separator.decode(result, QUALIFIERS, VALUES, TAB)){code}
So the question is whether we want shorter stored strings at the cost of a 
more complex API.
I think I'd be ok with that because end-users won't be using this API directly 
much. They make calls to the timeline readers and writers, so it would be our 
own code doing the encoding and decoding.
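
To make the ordering constraint concrete, here is a minimal sketch under one possible reading of "escape % once per call": each encode call escapes existing % characters once and then encodes all of its separators. The static helpers and the {raw, encoded} pairs are hypothetical and not the real Separator API, the %1$ and %3$ encodings are assumed, and the intermediate strings under this reading are not claimed to match the ones listed in this comment.

```java
// A sketch only: these static helpers are hypothetical, not the real Separator API.
class ReverseOrderDecodeSketch {

    // Each separator is a {raw, encoded} pair; TAB and VALUES encodings are assumed.
    static final String[] QUALIFIERS = {"!", "%0$"};
    static final String[] VALUES = {"=", "%1$"};
    static final String[] SPACE = {" ", "%2$"};
    static final String[] TAB = {"\t", "%3$"};

    // Escape '%' once for the whole call, then encode every separator given.
    static String encode(String s, String[]... separators) {
        String result = s.replace("%", "%4$");
        for (String[] sep : separators) {
            result = result.replace(sep[0], sep[1]);
        }
        return result;
    }

    // Undo one encode call: restore the separators, then unescape '%' once.
    static String decode(String s, String[]... separators) {
        String result = s;
        for (String[] sep : separators) {
            result = result.replace(sep[1], sep[0]);
        }
        return result.replace("%4$", "%");
    }

    // Inner call encodes SPACE, outer call encodes the other three.
    static String encodeNested(String s) {
        return encode(encode(s, SPACE), TAB, VALUES, QUALIFIERS);
    }

    // Correct: undo the outer call first, then the inner one.
    static String decodeReverse(String encoded) {
        return decode(decode(encoded, QUALIFIERS, VALUES, TAB), SPACE);
    }

    // Incorrect: undoing the inner call first leaves an encoded space behind.
    static String decodeWrongOrder(String encoded) {
        return decode(decode(encoded, SPACE), QUALIFIERS, VALUES, TAB);
    }

    public static void main(String[] args) {
        String encoded = encodeNested("a!some name");
        System.out.println(decodeReverse(encoded));
        System.out.println(decodeWrongOrder(encoded));
    }
}
```

In this sketch only the reverse-order decode returns the original string, which is the restriction callers would have to respect.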

If we choose to keep the API simpler and escape the % on every separator 
encoding, and so have fewer restrictions on the order and manner of decoding, 
then in places where we encode multiple characters in a single call to encode 
we should put the characters that are likely to appear most often last in the 
argument list to reduce bloat.
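
A rough illustration of why the argument order matters, assuming % is escaped before every separator in the list. The encodeEach helper and the {raw, encoded} pairs are hypothetical and not the real Separator API, and the %1$ and %3$ encodings are assumed.

```java
// A sketch only: encodeEach is a hypothetical helper, not the real Separator API.
class ArgumentOrderSketch {

    // Each separator is a {raw, encoded} pair; TAB and VALUES encodings are assumed.
    static final String[] QUALIFIERS = {"!", "%0$"};
    static final String[] VALUES = {"=", "%1$"};
    static final String[] SPACE = {" ", "%2$"};
    static final String[] TAB = {"\t", "%3$"};

    // The frequent separator (space, in this input) placed first versus last.
    static final String[][] SPACE_FIRST = {SPACE, TAB, VALUES, QUALIFIERS};
    static final String[][] SPACE_LAST = {TAB, VALUES, QUALIFIERS, SPACE};

    // Escape '%' before every separator, in the order the separators are given.
    static String encodeEach(String s, String[][] separators) {
        String result = s;
        for (String[] sep : separators) {
            result = result.replace("%", "%4$").replace(sep[0], sep[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        String in = "a b c";
        // Compare the encoded lengths for the two argument orders.
        System.out.println(encodeEach(in, SPACE_FIRST).length());
        System.out.println(encodeEach(in, SPACE_LAST).length());
    }
}
```

A separator encoded early has its % re-escaped by every later pass, while one encoded last is never re-escaped within that call.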



> Escaping occurences of encodedValues
> ------------------------------------
>
>                 Key: YARN-5167
>                 URL: https://issues.apache.org/jira/browse/YARN-5167
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Joep Rottinghuis
>            Assignee: Sangjin Lee
>            Priority: Critical
>              Labels: yarn-2928-1st-milestone
>
> We had earlier decided to punt on this, but in discussing YARN-5109 we 
> thought it would be best to just be safe rather than sorry later on.
> Encoded sequences can occur in the original string, especially in case of 
> "foreign key" if we decide to have lookups.
> For example, space is encoded as %2$.
> Encoding "String with %2$ in it" would decode to "String with   in it".
> We thought we should first escape existing occurrences of encoded strings by 
> prefixing a backslash (even if there is already a backslash that should be 
> ok). Then we should replace all unencoded strings.
> On the way out, we should replace all occurrences of our encoded string to 
> the original except when it is prefixed by an escape character. Lastly we 
> should strip off the one additional backslash in front of each remaining 
> (escaped) sequence.
> Adding the following entry to TestSeparator#testEncodeDecode() 
> demonstrates what this jira should accomplish:
> {code}
>     testEncodeDecode(
>         "Double-escape %2$ and %3$ or \\%2$ or \\%3$, nor  \\\\%2$ = no problem!",
>         Separator.QUALIFIERS, Separator.VALUES, Separator.SPACE, Separator.TAB);
> {code}


