Sangjin Lee commented on YARN-3914:

[~zjshen], we have been discussing this. While adding entity creation time to 
the row key may solve this problem, the concern is that it may introduce others.

If the row key is 
(user/cluster/flow/run/app_id/entity_type/created_time/entity_id), then even 
the most basic query for (entity_type + entity_id) will get much more 
complicated, right? We cannot expect readers to provide the creation time every 
time they query for an entity by id.
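
To make the concern concrete, here is a rough sketch that models the entity table as a sorted list of tuples. It is not the timeline service schema or the HBase API. The rows, ids, timestamps, and function names are all made up for illustration, and the shared user/cluster/flow/run/app_id prefix is left out.

```python
from bisect import bisect_left

# Hypothetical rows: (entity_type, created_time, entity_id), kept sorted the
# way a row-keyed store would order them. The shared key prefix is omitted
# because it is identical for every row in this sketch.
rows_with_time = sorted([
    ("CONTAINER", 1000, "container_01"),
    ("CONTAINER", 1005, "container_02"),
    ("CONTAINER", 1010, "container_03"),
    ("TASK", 1002, "task_01"),
])

# The same entities keyed without the creation time.
rows_without_time = sorted((t, i) for t, _, i in rows_with_time)


def get_without_time(entity_type, entity_id):
    """Point lookup: the full key is known, so one binary search suffices."""
    key = (entity_type, entity_id)
    pos = bisect_left(rows_without_time, key)
    found = pos < len(rows_without_time) and rows_without_time[pos] == key
    return key if found else None


def get_with_time(entity_type, entity_id):
    """The reader has no created_time, so the key is incomplete.

    Scan every row under the entity_type prefix and filter on entity_id.
    Returns the matching row (or None) and the number of rows scanned.
    """
    scanned = 0
    start = bisect_left(rows_with_time, (entity_type,))
    for row in rows_with_time[start:]:
        if row[0] != entity_type:
            break
        scanned += 1
        if row[2] == entity_id:
            return row, scanned
    return None, scanned
```

In this toy model, `get_without_time` goes straight to the row, while `get_with_time` has to walk the entities of that type until it finds a matching id, unless the caller can also supply the creation time.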

Also, as you said, we cannot always accommodate different query vectors by 
adding more to the row key, or we would be risking blowing up the row key size 
or breaking other queries. We should be really judicious about what goes into
the row key.

I think it's reasonable to expect that the entity id order would be either 
completely or nearly identical to the chronological order (e.g. app id, or 
container id). So perhaps we could rely on the entity id order to help mitigate 
this problem.
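
A small sketch of that idea, under the assumption that ids carry a zero-padded sequence number so that their sort order tracks creation order. The ids, timestamps, and function names below are hypothetical, and real ids may not sort this cleanly.

```python
# Hypothetical entities: (entity_id, created_time). The ids end in a
# zero-padded sequence number, so lexicographic order matches creation order.
entities = [
    ("container_000003", 1010),
    ("container_000001", 1000),
    ("container_000004", 1020),
    ("container_000002", 1005),
]


def latest_by_id(rows, limit):
    """Take the `limit` highest ids, as a reverse scan in id order would."""
    return sorted(rows, key=lambda row: row[0], reverse=True)[:limit]


def latest_by_time(rows, limit):
    """Reference answer: the `limit` most recently created entities."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]
```

When the assumption holds, both functions return the same entities, so a "latest N" query would only need to read the tail of the id-ordered rows. When ids are only nearly chronological, the id-based result is an approximation.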


> Entity created time should be part of the row key of entity table
> -----------------------------------------------------------------
>                 Key: YARN-3914
>                 URL: https://issues.apache.org/jira/browse/YARN-3914
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
> Entity created time should be part of the row key of entity table, between 
> entity type and entity Id. The reason to have it is to index the entities. 
> Though we cannot index the entities for all kinds of information, indexing 
> them according to the created time is very necessary. Without it, every query 
> for the latest entities that belong to an application and a type will scan 
> through all the entities that belong to them. One example is listing the 100 
> most recently started containers in a YARN app.

