[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-10-06 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552593#comment-15552593
 ] 

Varun Saxena edited comment on YARN-5585 at 10/6/16 5:30 PM:
-

bq. Given entities are sorted in ascending order, to some extent latest-first 
order can be achieved by doing a reverse scan. I had tried this for 
yarn-containers and it works fine.
Reverse scan would work fine, but how do we decide which entity type would need 
it and which won't? By the way, do we need container IDs in reverse order too? 
IIRC, in one of the calls Li mentioned that lexicographic order should be fine for 
the new Web UI. If required, we can have special handling for YARN-specific entities 
like app attempts and containers, just like we have for apps.
No matter what we do, it should be consistent across all entities. We can also 
have another query param to indicate that reverse lexicographic order is required.

bq. IIUC, the AM can delegate the collector address to any of its running containers 
to publish its own data. TimelineClient cannot be restricted to only the AM.
True. In a secure setup, the AM can even pass on the token. The point is that we 
support talking to the AM only; the AM can then delegate its work to anyone. But the 
concern here was that the prefix would have to be passed around by the AM via a new 
protocol. So if an application wants to support delegating work to other processes, 
it needs to open a new protocol anyway, which means this concern is not specific to 
the prefix. Correct? However, it would be useful if you could describe the use case 
of multiple JVMs, where the same DAG can be executed by different processes. This 
would help us understand the use case and decide how best to support it.



> [Atsv2] Add a new filter fromId in REST endpoints
> -
>
> Key: YARN-5585
> URL: https://issues.apache.org/jira/browse/YARN-5585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelinereader
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: 0001-YARN-5585.patch, YARN-5585-workaround.patch, 
> YARN-5585.v0.patch
>
>
> TimelineReader REST APIs provide a lot of filters to retrieve the 
> applications. Along with those, it would be good to add a new filter, i.e. fromId, 
> so that entities can be retrieved after the fromId. 
> Current Behavior : The default limit is set to 100. If there are 1000 entities, 
> then the REST call gives the first/last 100 entities. How do we retrieve the next 
> set of 100 entities, i.e. 101 to 200 OR 900 to 801?
> Example : If applications are stored in the database as app-1, app-2 ... app-10,
> *getApps?limit=5* gives app-1 to app-5. But there is no way to retrieve the next 
> 5 apps. 
> So the proposal is to have fromId in the filter, like 
> *getApps?limit=5&fromId=app-5*, which gives the list of apps from app-6 to 
> app-10. 
> Since ATS targets storing a large number of entities, it is a very common 
> use case to get the next set of entities using fromId rather than querying all 
> the entities. This is very useful for pagination in the web UI.






[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-10-06 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552144#comment-15552144
 ] 

Varun Saxena edited comment on YARN-5585 at 10/6/16 3:01 PM:
-

bq. I was thinking to use the same REST API for both by using SingleColumnValueFilter. 
One con I see is a table scan across all entities of the entityType, which reflects in 
read performance.
We should not use SingleColumnValueFilter if we know the prefix because, as you 
said, the former will lead to relatively slower read performance. Basically, we 
need to differentiate between the entity type not having a prefix at all and the 
user being unable to supply it.
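To make the trade-off concrete, here is a minimal sketch (not the actual reader code; 
the column family/qualifier and the row key layout passed in are assumptions of this 
example): the prefix-aware path is a point Get, while the prefix-less path is a 
filtered Scan over the whole entity type range.

{code:title=EntityLookupSketch.java}
import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class EntityLookupSketch {

  // Prefix known: the full row key can be constructed, so a point Get suffices.
  static Result readWithKnownPrefix(Table entityTable, byte[] fullRowKey)
      throws IOException {
    return entityTable.get(new Get(fullRowKey));
  }

  // Prefix unknown: scan all rows of the entity type and match on the entity id
  // column. Every row in the range is examined, hence the slower read.
  static ResultScanner readWithoutPrefix(Table entityTable, byte[] rowKeyUpToEntityType,
      byte[] family, byte[] idQualifier, String entityId) throws IOException {
    Scan scan = new Scan();
    scan.setRowPrefixFilter(rowKeyUpToEntityType);
    scan.setFilter(new SingleColumnValueFilter(
        family, idQualifier, CompareOp.EQUAL, Bytes.toBytes(entityId)));
    return entityTable.getScanner(scan);
  }
}
{code}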

bq. I would have thought that we store the entities in the reverse entity id 
order, but it appears that the entity id is encoded into the row key as is 
(EntityRowKey). Am I reading that right? If so, this is a bug to fix.
Entity IDs can be anything; even a completely alphabetical sequence can be a valid 
entity ID. So it will not be possible to define a reverse order for every 
generic entity ID. Is this your question?

bq. Firstly about multi JVM, which makes the application programmer define a new 
protocol for transferring prefixId. 
Trying to understand this more. Can the same DAG be executed by multiple Tez AMs?

bq. Secondly, what if a user misses providing a prefixId in subsequent 
updates?
This should be caught during the integration phase, right?





> [Atsv2] Add a new filter fromId in REST endpoints
> -
>
> Key: YARN-5585
> URL: https://issues.apache.org/jira/browse/YARN-5585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelinereader
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: 0001-YARN-5585.patch, YARN-5585-workaround.patch, 
> YARN-5585.v0.patch
>
>
> TimelineReader REST APIs provide a lot of filters to retrieve the 
> applications. Along with those, it would be good to add a new filter, i.e. fromId, 
> so that entities can be retrieved after the fromId. 
> Current Behavior : The default limit is set to 100. If there are 1000 entities, 
> then the REST call gives the first/last 100 entities. How do we retrieve the next 
> set of 100 entities, i.e. 101 to 200 OR 900 to 801?
> Example : If applications are stored in the database as app-1, app-2 ... app-10,
> *getApps?limit=5* gives app-1 to app-5. But there is no way to retrieve the next 
> 5 apps. 
> So the proposal is to have fromId in the filter, like 
> *getApps?limit=5&fromId=app-5*, which gives the list of apps from app-6 to 
> app-10. 
> Since ATS targets storing a large number of entities, it is a very common 
> use case to get the next set of entities using fromId rather than querying all 
> the entities. This is very useful for pagination in the web UI.






[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-10-04 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545818#comment-15545818
 ] 

Varun Saxena edited comment on YARN-5585 at 10/4/16 4:06 PM:
-

Thanks [~rohithsharma] for the patch. A few comments.

# What is the intention behind having ID_PREFIX in EntityColumn? In my view, we need 
not store the prefix in the column. Is it because we want to read it back and send 
it to the client?
# There is no need for GenericEntityReader#calculateTheClosestNextRowKeyForPrefix; 
Scan#setRowPrefixFilter will do it for you. We should call it the same way as 
was done previously.
# As the entity ID prefix is a long, the new segment in 
EntityRowKeyConverter#SEGMENT_SIZES should be Bytes.SIZEOF_LONG. It is currently 
given as VARIABLE_SIZE. Same change in TestRowKeys.
# In EntityRowKeyConverter#encode, there is no need to invert the entity id prefix; 
we will take the prefix as-is. The sender can publish the entity with an inverted 
prefix if they want contents in descending order (say). We can probably add something 
to TimelineUtils to invert it, if required, which clients can then use (a rough 
sketch follows this list).
# In GenericEntityReader#parseEntity we should fetch the id prefix from the result 
and call setIdPrefix on the TimelineEntity returned to the client. This will be 
useful for clients when they want to set fromPrefix (e.g. in the Tez UI use case).
# The Javadoc in TimelineReader should be changed. It currently says entities will 
be sorted by created time, which is no longer true.
{code}
   * @return A set of TimelineEntity instances of the given entity
   *type in the given context scope which matches the given predicates
   *ordered by created time, descending. Each entity will only contain the
   *metadata(id, type and created time) plus the given fields to retrieve.
{code}
# We should also update the documentation to reflect the id prefix.
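A minimal sketch of such an inversion helper, assuming the eventual utility lives 
somewhere like TimelineUtils (the class name and placement below are illustrative, 
not a finalized API):

{code:title=IdPrefixInversionSketch.java}
public final class IdPrefixInversionSketch {

  private IdPrefixInversionSketch() {
  }

  // Larger values map to smaller inverted values, so using the inverted created
  // time as the entity id prefix makes rows sort newest-first in HBase's
  // ascending byte order.
  public static long invertLong(long value) {
    return Long.MAX_VALUE - value;
  }
}
{code}

A client that wants newest-first ordering could then publish with something like 
{{entity.setIdPrefix(IdPrefixInversionSketch.invertLong(createdTime))}}, with 
setIdPrefix as discussed in point 5 above.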









> [Atsv2] Add a new filter fromId in REST endpoints
> -
>
> Key: YARN-5585
> URL: https://issues.apache.org/jira/browse/YARN-5585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelinereader
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: 0001-YARN-5585.patch, YARN-5585-workaround.patch, 
> YARN-5585.v0.patch
>
>
> TimelineReader REST APIs provide a lot of filters to retrieve the 
> applications. Along with those, it would be good to add a new filter, i.e. fromId, 
> so that entities can be retrieved after the fromId. 
> Current Behavior : The default limit is set to 100. If there are 1000 entities, 
> then the REST call gives the first/last 100 entities. How do we retrieve the next 
> set of 100 entities, i.e. 101 to 200 OR 900 to 801?
> Example : If applications are stored in the database as app-1, app-2 ... app-10,
> *getApps?limit=5* gives app-1 to app-5. But there is no way to retrieve the next 
> 5 apps. 
> So the proposal is to have fromId in the filter, like 
> *getApps?limit=5&fromId=app-5*, which gives the list of apps from app-6 to 
> app-10. 
> Since ATS targets storing a large number of entities, it is a very common 
> use case to get the next set of entities using fromId rather than querying all 
> the entities. This is very useful for pagination in the web UI.

[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-09-29 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533830#comment-15533830
 ] 

Vrushali C edited comment on YARN-5585 at 9/29/16 7:40 PM:
---

Thanks [~rohithsharma] for the summary.

bq. 2. By default, use createdTime as entityPrefixId.

Also, that means frameworks which don't want to use the entity id prefix have 
to explicitly specify a null prefix (or a special value that means null).
All the same, it will be really good to mention in the docs that clients 
should do the following:

{code:title=TimelineWriterClient.java}
entity.setEntityPrefix(createdTime);
client.writeEntity(entity); // pseudo-code
{code}


bq. For the REST end point, we can support fromEntityPrefixId, which will become a 
combination of entityPrefixId+entityId and can be used for pagination.

I think pagination handling should be more generic than depending on something 
like "fromEntityPrefixId". REST queries should simply ask for the top N records 
with the understanding that the records are returned in sorted order of entity 
prefixes. For the next page of results, the client sends back the key/entity prefix 
of the last row returned. For a REST query, if the "startFrom" query param is 
present, the scan starts from the "startFrom" prefix value and returns the next 
N such records.
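Sketched as a request/response flow (illustrative only; the endpoint path and the 
"startFrom" parameter name are placeholders taken from this discussion, not a 
committed API):

{noformat}
GET .../entities/{entitytype}?limit=100
  -> returns 100 entities in sorted order of (entity id prefix, entity id);
     the client remembers the prefix + id of the last entity returned
GET .../entities/{entitytype}?limit=100&startFrom=<lastPrefix>!<lastEntityId>
  -> returns the next 100 entities, starting after that row
{noformat}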









> [Atsv2] Add a new filter fromId in REST endpoints
> -
>
> Key: YARN-5585
> URL: https://issues.apache.org/jira/browse/YARN-5585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelinereader
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-5585-workaround.patch, YARN-5585.v0.patch
>
>
> TimelineReader REST APIs provide a lot of filters to retrieve the 
> applications. Along with those, it would be good to add a new filter, i.e. fromId, 
> so that entities can be retrieved after the fromId. 
> Current Behavior : The default limit is set to 100. If there are 1000 entities, 
> then the REST call gives the first/last 100 entities. How do we retrieve the next 
> set of 100 entities, i.e. 101 to 200 OR 900 to 801?
> Example : If applications are stored in the database as app-1, app-2 ... app-10,
> *getApps?limit=5* gives app-1 to app-5. But there is no way to retrieve the next 
> 5 apps. 
> So the proposal is to have fromId in the filter, like 
> *getApps?limit=5&fromId=app-5*, which gives the list of apps from app-6 to 
> app-10. 
> Since ATS targets storing a large number of entities, it is a very common 
> use case to get the next set of entities using fromId rather than querying all 
> the entities. This is very useful for pagination in the web UI.






[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-09-26 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523177#comment-15523177
 ] 

Varun Saxena edited comment on YARN-5585 at 9/26/16 2:13 PM:
-

bq. When a scan is performed on rows, the ResultScanner returns them in 
lexicographical order only. I could not get where this entityIdPrefix will 
be used. Is it at the storage or at the reader server?
The entity ID prefix will be supplied by Tez in your case, while publishing 
entities, and can be the inverse of the created time if you want rows to be sorted 
in descending order by created time. The TimelineEntity class will now carry a 
prefix too.

bq. Will the new tables be separate or the same existing ones?
Same tables. Just the row key changes if you are not happy with the lexicographic 
order.











> [Atsv2] Add a new filter fromId in REST endpoints
> -
>
> Key: YARN-5585
> URL: https://issues.apache.org/jira/browse/YARN-5585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelinereader
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-5585-workaround.patch, YARN-5585.v0.patch
>
>
> TimelineReader REST APIs provide a lot of filters to retrieve the 
> applications. Along with those, it would be good to add a new filter, i.e. fromId, 
> so that entities can be retrieved after the fromId. 
> Current Behavior : The default limit is set to 100. If there are 1000 entities, 
> then the REST call gives the first/last 100 entities. How do we retrieve the next 
> set of 100 entities, i.e. 101 to 200 OR 900 to 801?
> Example : If applications are stored in the database as app-1, app-2 ... app-10,
> *getApps?limit=5* gives app-1 to app-5. But there is no way to retrieve the next 
> 5 apps. 
> So the proposal is to have fromId in the filter, like 
> *getApps?limit=5&fromId=app-5*, which gives the list of apps from app-6 to 
> app-10. 
> Since ATS targets storing a large number of entities, it is a very common 
> use case to get the next set of entities using fromId rather than querying all 
> the entities. This is very useful for pagination in the web UI.






[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-09-22 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514243#comment-15514243
 ] 

Varun Saxena edited comment on YARN-5585 at 9/22/16 7:25 PM:
-

Summarizing the solution we decided upon in the call.

* We will now return entities from the entity table in lexicographic order of 
entity IDs.
* To achieve a different sort order, we will provide a mechanism for 
applications to provide an entity ID prefix which can be set in the 
TimelineEntity object while publishing the entity.
* This entity ID prefix will be part of the row key in the entity table. As the name 
suggests, it will be present just before the entity ID. Applications can choose 
to provide no entity ID prefix if they are happy with the lexicographic sort 
order. So the row key now will be 
{{cluster!user!flow!flowrun!app!entitytype!\{entityidprefix\}!\{entityid\}}} 
(a rough sketch of this layout follows the list).
* The entity ID will also be stored under a column qualifier (this is being done 
already).
* The entity ID prefix can be a number (say a long), as numbers generally provide a 
natural sort ordering. However, this needs to be finalized. Should we keep it as a 
string?
* When querying multiple entities, we will return the top N entities, decided by 
limit, in lexicographic order of entity ID prefix + entity ID (i.e. if an entity 
ID prefix is supplied). The fromID filter can now be something like fromIDPrefix 
(say) or a similar filter which provides prefix + ID to support pagination.
* While querying a single entity, the prefix can be supplied as a query param. If 
supplied, it will be a Get; otherwise we need a Scan with a 
SingleColumnValueFilter on the entity ID (this will be comparatively slower). We 
can have a separate REST endpoint to distinguish between prefix-based queries 
and non-prefix-based queries. We also need to distinguish between the case where a 
prefix was not specified for the entity on the write path and the case where the 
prefix was simply not supplied at the read path (even though it was supplied at the 
write path). This needs to be finalized.
* The prefix will also be returned as part of the TimelineEntity object in the 
response.
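A rough sketch of the proposed row key layout (assumptions of this example: fields 
are joined with '!' separators as plain strings and the prefix is written as an 
8-byte long; the real storage code uses its own key converters and encodings):

{code:title=EntityRowKeySketch.java}
import org.apache.hadoop.hbase.util.Bytes;

public final class EntityRowKeySketch {

  private EntityRowKeySketch() {
  }

  public static byte[] rowKey(String cluster, String user, String flow, long flowRunId,
      String appId, String entityType, long entityIdPrefix, String entityId) {
    byte[] context = Bytes.toBytes(cluster + "!" + user + "!" + flow + "!" + flowRunId
        + "!" + appId + "!" + entityType + "!");
    // The id prefix sits immediately before the entity id, so rows of one entity
    // type sort first by prefix and then lexicographically by entity id.
    return Bytes.add(context, Bytes.toBytes(entityIdPrefix), Bytes.toBytes("!" + entityId));
  }
}
{code}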

cc [~jrottinghuis], [~sjlee0], [~vrushalic], [~gtCarrera9]. Hope this covers 
everything.

The reason this solution was chosen is that we thought, in UI use cases, a 
single-entity read would typically follow a listing of multiple entities, and hence 
the prefix would be known. This does not mean, however, that we will not 
provide a mechanism to fetch an entity if the prefix wasn't given; we can use a 
SingleColumnValueFilter then.
Moreover, this solution overall had a lower write and read penalty compared to the 
solutions listed above.




[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-09-21 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512200#comment-15512200
 ] 

Sangjin Lee edited comment on YARN-5585 at 9/22/16 5:26 AM:


I am also catching up on this discussion (sorry it got delayed).

Generally I am in agreement with Varun and Vrushali on possible approaches. I'd 
like to add a few more thoughts to refine the idea.

(1) supporting chronological order sorting
I think that even for framework-specific entities (e.g. tez vertices, MR task 
entities, etc.), the "sorting" order cannot be completely arbitrary. Because we 
have a strong design decision on reflecting recency in the row keys, the 
natural sorting order should be the *chronological order*, or strange things 
would result.

For YARN entities, the id order would satisfy this for the most part (and ditto 
for MR entities). If tez can craft the id's such that the lexicographical order 
is also the chronological order, that would be by far the simplest solution to 
the problem. I'm not sure how feasible it is for tez to add padding etc. to 
preserve the chronological order in the entity id's. [~rohithsharma], can we 
change the id's to order them properly?

If the framework cannot make the id lexicographical order the same as the 
chronological order, then we might have to introduce the notion of bytes 
provided by the framework (and an auxiliary table) to support this, as suggested 
by Vrushali and Varun. But that would come at some cost. All things being 
equal, I would love not to populate another table on the write path.

Also note that we still need to support single-entity queries in this case 
(i.e. queries by entity id). How would we be able to support queries by id in 
this case?

(2) setting the created time field
In timeline service v.2, the strong assumption/requirement is that the created 
time is set by the client. It sounds like the current tez code does not set the 
created time. It should be set. That's the contract we're using. We're not 
really expecting an empty created time when we write them.

(3) TimelineEntity.compareTo()
It is a good catch by Rohith. It escaped the review, but it does appear that 
the id sorting when the created time is empty is the opposite of what it should be. 
The string should be sorted in descending order, but the current code is 
doing the opposite. This should be fixed. We can either fix it here or open 
a separate subtask to fix it. Either way, we should fix it.



> [Atsv2] Add a new filter fromId in REST endpoints
> 

[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-09-21 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512079#comment-15512079
 ] 

Vrushali C edited comment on YARN-5585 at 9/22/16 4:23 AM:
---

I have been thinking more on this. If there is a concern about having 
the same entity data in two tables, what we could do is set a TTL (time to 
live) on the cells in the auxiliary table. That way, for some period of time we 
store data in two places, but then it gets cleaned up. 

For example, say the Tez UI queries the auxiliary table for a job that ran a 
year back and the data does not exist there anymore because it was cleaned up by 
HBase. The Tez UI can then try querying the regular table. Or the auxiliary REST 
API call can take a parameter that says: if data is not found in the auxiliary 
table, query the regular entity table, and the call would perhaps then take a 
little longer to return. Since we are querying for something that ran a year back, 
I believe we can wait an extra moment for the call to return.

This way, we store data in two tables for a brief time period, rely on HBase to 
clean up cells as per their TTL, and provide a way for frameworks to store/query 
their data in harmony with timeline service storage.
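For reference, a cell TTL of this kind is just a column-family setting in HBase; a 
generic sketch follows (the table name, family name, and 30-day TTL below are 
placeholders, not existing ATSv2 configuration):

{code:title=AuxTableCreatorSketch.java}
import java.io.IOException;

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;

public final class AuxTableCreatorSketch {

  private AuxTableCreatorSketch() {
  }

  // Creates an auxiliary entity table whose cells expire after 30 days, so the
  // duplicated copy of the data is cleaned up automatically by HBase.
  public static void createAuxTable(Admin admin) throws IOException {
    HTableDescriptor table =
        new HTableDescriptor(TableName.valueOf("timelineservice.entity.aux"));
    HColumnDescriptor info = new HColumnDescriptor("i");
    info.setTimeToLive(30 * 24 * 60 * 60);  // TTL is specified in seconds
    table.addFamily(info);
    admin.createTable(table);
  }
}
{code}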



> [Atsv2] Add a new filter fromId in REST endpoints
> -
>
> Key: YARN-5585
> URL: https://issues.apache.org/jira/browse/YARN-5585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelinereader
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-5585.v0.patch
>
>
> TimelineReader REST APIs provide a lot of filters to retrieve the 
> applications. Along with those, it would be good to add a new filter, i.e. fromId, 
> so that entities can be retrieved after the fromId. 
> Current Behavior : The default limit is set to 100. If there are 1000 entities, 
> then the REST call gives the first/last 100 entities. How do we retrieve the next 
> set of 100 entities, i.e. 101 to 200 OR 900 to 801?
> Example : If applications are stored in the database as app-1, app-2 ... app-10,
> *getApps?limit=5* gives app-1 to app-5. But there is no way to retrieve the next 
> 5 apps. 
> So the proposal is to have fromId in the filter, like 
> *getApps?limit=5&fromId=app-5*, which gives the list of apps from app-6 to 
> app-10. 
> Since ATS targets storing a large number of entities, it is a very common 
> use case to get the next set of entities using fromId rather than querying all 
> the entities. This is very useful for pagination in the web UI.









[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-09-21 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511129#comment-15511129
 ] 

Vrushali C edited comment on YARN-5585 at 9/21/16 8:51 PM:
---

Catching up on this thread. I have tried to read through all the comments and 
discussions on this JIRA, but please correct me if I am mistaken.

Two objectives here:
1) We are looking for a way to paginate results/response. 
2) The ability to return sorted results in the REST response (sorted on something 
other than the row key)

Thoughts on these:
- We are looking for a way to paginate results/response. 
This pagination requirement is independent of any particular framework like 
Tez. With hRaven, our experience has been that more often than not, we end up 
enabling pagination support for most APIs. So, in general, our REST API calls 
should support pagination. 

Pagination via the REST query:
This involves, in a generic fashion, being able to send in a "startFromRowKey" 
in the REST query. Say we extend our REST APIs to accept such a parameter; it 
then becomes generic enough to fetch N rows after this particular "startFromRowKey" 
value. The first REST API call will not send anything, but each REST 
response will return a "lastRowKey" to the client so that the client can use it 
in the next REST call. I have found this to be useful for debugging the 
REST output in the browser as well.

- For Tez in particular, we need the ability to return sorted results in the 
REST response; in this case, results sorted based on "creation_time". The 
currently existing row key in the entity table does not allow for retrieval in 
sorted order of creation time very easily. 

So here is a proposal which incorporates some aspects of both of your proposals, 
Varun. 

I think we should expose a way for frameworks like Tez to store data sorted as 
per their criteria, and also allow them to specify when they want to query this 
specially sorted data. 

Today, Tez wants entities sorted by creation time. Tomorrow, that could 
change. Also, today some other framework like Spark might want entities sorted 
based on something else. So putting it in the entity table's row key becomes a 
tough decision.

I propose we allow auxiliary tables to be created for entities via cluster 
configuration settings. The auxiliary table name etc. will be set in config just 
like the timeline entity table name is set. This auxiliary table is 
specifically for entities, so it has the same structure. 

Now, when Tez's timeline client creates a timeline entity, it will create it as 
it does right now, but in addition it will populate two new members of the 
TimelineEntity object:
- auxiliaryTableName, which contains the desired table name
- auxiliaryEncodedKey, which contains a byte array value of {noformat} 
"Inv(creation_time)!entity_id" {noformat} This is to be used as part of the row 
key suffix in the auxiliary table. The timeline service does not know what this 
byte value is and does not care. It only adds it after the regular row key 
prefix of 
{code} "user!cluster!flow!Inv(flow run id)! 
application!entitytype!"
{code}

Now it sends this write to the timeline service. On the HBase writer side, we 
notice that the auxiliary table and auxiliary key are populated in the timeline 
entity object, so we do two writes. One write goes to our regular entity table 
with the existing row key structure, and the other write goes to the auxiliary 
table with the row key of {code} "user!cluster!flow!Inv(flow run id)! 
application!entitytype!<auxiliaryEncodedKey>"{code}.

On the reader side, we allow the REST API to specify explicitly whether the 
client wants reads from the auxiliary table; otherwise reads go to the regular 
entity table. Frameworks like Tez know when they need data sorted by 
creation time, perhaps in their UI, so they can specify as part of the query 
params in their REST query that the read is for the auxiliary table.

This way, we provide frameworks a way to store data in whichever sorted order 
they want and to determine which queries need that sorted data. 
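A rough sketch of how a framework might build the auxiliaryEncodedKey suffix 
described above (illustrative only; auxiliaryTableName/auxiliaryEncodedKey are 
proposed fields, not an existing TimelineEntity API, and the '!' separator and 
8-byte long encoding are assumptions of this example):

{code:title=AuxKeySketch.java}
import org.apache.hadoop.hbase.util.Bytes;

public final class AuxKeySketch {

  private AuxKeySketch() {
  }

  // Builds "Inv(creation_time)!entity_id" as bytes. Inverting the timestamp makes
  // newer entities sort first in HBase's ascending byte order.
  public static byte[] auxiliaryEncodedKey(long creationTime, String entityId) {
    long inverted = Long.MAX_VALUE - creationTime;
    return Bytes.add(Bytes.toBytes(inverted), Bytes.toBytes("!" + entityId));
  }
}
{code}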







[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-09-15 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494251#comment-15494251
 ] 

Varun Saxena edited comment on YARN-5585 at 9/15/16 7:08 PM:
-

Just to summarise the suggestions given for folks to refer to.

* Applications (like Tez) would know best how to interpret their entity IDs 
and how they can be sorted in descending order. Most entity IDs seem to have some 
sort of monotonically increasing sequence, like the app ID. We can hence open up a 
PUBLIC interface which ATSv2 users like Tez can implement to decide how to 
encode and decode a particular entity type so that it is stored in descending 
sorted fashion (based on creation time) in ATSv2, with encoding and decoding 
similar to the AppIDConverter written in our code. If row keys themselves can be 
sorted, this will be performance-wise the best possible solution. Refer to 
[comment | 
https://issues.apache.org/jira/browse/YARN-5585?focusedCommentId=15470803=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15470803]
** _Pros of the approach:_ 
**# Lookup will be fast.
** _Cons of the approach:_ 
**# We are depending on the application to provide some code for this to work. The 
corresponding JAR will have to be placed on the classpath. Folks in other projects 
may not be pleased to not have inbuilt support for this in ATS.
**# Entity IDs may not always have a monotonically increasing sequence like 
app IDs.

* We can keep another table, say EntityCreationTable or EntityIndexTable, with 
row key {{cluster!user!flow!flowrun!app!entitytype!reverse entity creation 
time!entityid}}. We will make an entry into this table whenever a created time is 
reported for the entity. The real data would still reside in the main entity 
table. Entities in this table will be sorted in descending order. On the read side, 
we can first peek into this table to get the relevant records in descending order 
(based on limit and/or fromId) and then use this info to query the entity table. We 
can do this in two ways: we can get created times by querying this index table and 
apply a created-time range filter, or alternatively we can try out 
MultiRowRangeFilter, which, going by the HBase javadoc, seems to be efficient (a 
rough sketch follows this comment). We will have to do some processing to determine 
these multiple row key ranges. Refer to [comment | 
https://issues.apache.org/jira/browse/YARN-5585?focusedCommentId=15472669=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15472669]
** _Note:_ The client should not send different created times for the same entity, 
otherwise that will lead to an additional row. If a different created time is 
reported more than once, we will have to consider the latest one.
** _Pros of the approach:_ 
**# The solution is provided within ATS.
**# There is an extra write only when the created time is reported.
** _Cons of the approach:_ 
**# There is an extra peek into the index table on the read side. A single-entity 
read can still be served directly from the entity table, though.

* Another option would be to change the row key of the entity table to 
{{cluster!user!flow!flowrun!app!entitytype!reverse entity creation 
time!entityid}} and have another table to map 
{{cluster!user!flow!flowrun!app!entitytype!entityid}} to the entity created time.
So for a single-entity call (an HBase Get) we will have to first peek into the new 
table and then get records from the entity table.
** _Cons of the approach:_ 
**# On the write side, we will have to first look up the index table, which has 
the entity created time, or the client should supply the entity created time on 
every write. The first would impact write performance and the latter may not be 
feasible for the client.
**# What should the row key be if the client does not supply the created time on 
the first write but supplies it on a subsequent write?

cc [~sjlee0], [~vrushalic], [~rohithsharma], [~gtCarrera9]
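A rough sketch of the MultiRowRangeFilter idea from the second option above 
(generic HBase usage only; the index table, and the way its entries are mapped back 
to entity table row keys, are assumptions of this example):

{code:title=IndexedEntityScanSketch.java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowRange;
import org.apache.hadoop.hbase.util.Bytes;

public final class IndexedEntityScanSketch {

  private IndexedEntityScanSketch() {
  }

  // Given entity table row keys derived from the index table, builds a single scan
  // that touches only those rows instead of the whole entity type range.
  public static Scan scanFor(List<byte[]> entityRowKeysFromIndex) throws IOException {
    List<RowRange> ranges = new ArrayList<>();
    for (byte[] rowKey : entityRowKeysFromIndex) {
      // One narrow range per entity row: start at the key (inclusive) and stop at
      // its immediate successor (exclusive).
      ranges.add(new RowRange(rowKey, true, Bytes.add(rowKey, new byte[] {0}), false));
    }
    Scan scan = new Scan();
    scan.setFilter(new MultiRowRangeFilter(ranges));
    return scan;
  }
}
{code}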



[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-09-07 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472837#comment-15472837
 ] 

Varun Saxena edited comment on YARN-5585 at 9/8/16 5:29 AM:


IIUC, a Tez DAG ID is a combination of the YARN app ID and a DAG sequence ID.
Isn't this DAG sequence ID monotonically increasing, assigned to DAGs in sequence as 
they are run?
I was assuming it is. That is why I suggested storing the DAG ID as 16 bytes (8 
bytes of inverted cluster timestamp from the app ID + 4 bytes of inverted sequence 
id from the app ID + 4 bytes of inverted DAG sequence number). Padding in this case 
won't be required.

Anyway, other solutions have been proposed and we can come back to this only if 
necessary.
Or maybe we can have both the above solution and the one below as well.
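A rough sketch of that 16-byte encoding (hypothetical, not existing Tez or ATSv2 
code; the field widths and the inversion are taken directly from the comment above):

{code:title=DagIdEncoderSketch.java}
import org.apache.hadoop.hbase.util.Bytes;

public final class DagIdEncoderSketch {

  private DagIdEncoderSketch() {
  }

  // Encodes (cluster timestamp, app sequence id, DAG sequence number) into 16 bytes.
  // Each component is inverted so that newer DAGs sort first in byte order.
  public static byte[] encode(long clusterTimestamp, int appSequenceId, int dagSequenceId) {
    return Bytes.add(
        Bytes.toBytes(Long.MAX_VALUE - clusterTimestamp),    // 8 bytes
        Bytes.toBytes(Integer.MAX_VALUE - appSequenceId),    // 4 bytes
        Bytes.toBytes(Integer.MAX_VALUE - dagSequenceId));   // 4 bytes
  }
}
{code}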



> [Atsv2] Add a new filter fromId in REST endpoints
> -
>
> Key: YARN-5585
> URL: https://issues.apache.org/jira/browse/YARN-5585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelinereader
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Attachments: YARN-5585.v0.patch
>
>
> TimelineReader REST APIs provide a lot of filters to retrieve the 
> applications. Along with those, it would be good to add a new filter, i.e. fromId, 
> so that entities can be retrieved after the fromId. 
> Example : If applications are stored in the database as app-1, app-2 ... app-10,
> *getApps?limit=5* gives app-1 to app-5. But it is difficult to retrieve the next 
> 5 apps.
> So the proposal is to have fromId in the filter, like 
> *getApps?limit=5&fromId=app-5*, which gives the list of apps from app-6 to 
> app-10. 
> This is very useful for pagination in the web UI.






[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-09-07 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470538#comment-15470538
 ] 

Rohith Sharma K S edited comment on YARN-5585 at 9/7/16 12:56 PM:
--

bq. Can you tell me the use case? Listing all DAGs or listing DAGs within an 
app? Or something else? Typically how many DAGs can there be per app?
The basic use case is to achieve pagination regardless of the entity type: apps, 
flows, or entities such as DAGs, containers, or any other user or system entities. 
Currently the limit is 100 for any entities retrieved. Say the number of entities is 200. 
Then the REST call retrieves 100 entities. How do we then retrieve entities 100 to 200? 
(A pagination sketch follows below.)
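
For example, a minimal sketch of the intended pagination pattern, assuming the proposed fromId semantics (return entities after the given ID). The reader endpoint path, port and IDs here are purely illustrative, not a finalised API:

{code:java}
// Illustrative only: endpoint path and IDs are assumptions, and fromId
// semantics (entities after the given ID) follow the proposal in this JIRA.
public final class PagedQueries {

  public static void main(String[] args) {
    String base = "http://reader-host:8188/ws/v2/timeline/clusters/cluster1/apps";

    // First page: entities 1 to 100.
    String firstPage = base + "?limit=100";

    // Remember the ID of the last entity returned by the first page...
    String lastIdOnFirstPage = "application_1473236400000_0100";

    // ...and pass it as fromId to fetch entities 101 to 200.
    String secondPage = base + "?limit=100&fromId=" + lastIdOnFirstPage;

    System.out.println(firstPage);
    System.out.println(secondPage);
  }
}
{code}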






[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-09-01 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15455257#comment-15455257
 ] 

Varun Saxena edited comment on YARN-5585 at 9/1/16 12:40 PM:
-

So I thought a little bit over it, and I think there is a solution possible for 
fetching apps within a cluster without much of a performance impact, because this 
seems to be your use case.

What we can do is get the required app IDs from the app-to-flow table first, as 
app IDs in this table are sorted, and extract the applicable flows from there. We 
can then get data from the application table using these unique flows to get more 
specific information about the apps. We have something called MultiRowRangeFilter 
in HBase which lets us specify multiple row key ranges.
We would only return those apps which we found in the app-to-flow table. 
And from a performance viewpoint, we can assume a reasonable limit will always be 
specified.
 
_Example:_
Assume, in a cluster, we have applications application_111_0001 to 
application_111_0034 (running or completed).
These apps will be stored in descending order in the app-to-flow table. 
Let us say you want to get the latest 10 apps (i.e. the limit in your query is 10).
What we can do is get the first 10 apps from the app-to-flow table, i.e. 
application_111_0034 to application_111_0025. We can use PageFilter to 
return only the first 10 records. This is the result set we can return.
Assume the application IDs ending with _0034, _0031 and _0027 belong to flow1 and 
the rest to flow2. We can then use this info to query the app table.

So to get detailed info for these 10 apps in a single shot from the application 
table, what we can do is as follows (a code sketch appears after this list):
* Create a MultiRowRangeFilter.
* For flow1, add the start row {{cluster!user!flow1!application_111_0034}} 
and the stop row {{cluster!user!flow1!application_111_0027}}. We can make the 
stop row inclusive. We then add this start/stop row pair to the multi row 
range filter created.
* For flow2, the start row can be 
{{cluster!user!flow2!application_111_0033}} and the stop row 
{{cluster!user!flow2!application_111_0024}}. We then add this 
start/stop row pair to the multi row range filter created.
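
A minimal sketch of this construction with the plain HBase client API is below. The string row keys are a simplification of the real ATSv2 application-table encoding, which inverts the app ID so that newer apps sort first; with plain strings the byte order is the reverse, so the start/stop rows are flipped here to keep the ranges valid:

{code:java}
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowRange;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: real ATSv2 row keys are not plain strings.
public final class AppTableRangeScan {

  public static Scan buildScan() throws IOException {
    List<RowRange> ranges = Arrays.asList(
        // flow1: apps _0027 to _0034 (both rows inclusive)
        new RowRange(Bytes.toBytes("cluster!user!flow1!application_111_0027"), true,
                     Bytes.toBytes("cluster!user!flow1!application_111_0034"), true),
        // flow2: apps _0024 to _0033
        new RowRange(Bytes.toBytes("cluster!user!flow2!application_111_0024"), true,
                     Bytes.toBytes("cluster!user!flow2!application_111_0033"), true));

    Scan scan = new Scan();
    scan.setFilter(new MultiRowRangeFilter(ranges));
    return scan;
  }
}
{code}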

This would be slower than getting apps when the flow or flow run is specified, 
but would be faster than doing a full table scan of the application table, 
especially when it grows large.

Maybe I can raise a separate JIRA for this and handle it there if this is a 
real use case.



[jira] [Comment Edited] (YARN-5585) [Atsv2] Add a new filter fromId in REST endpoints

2016-08-31 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15452583#comment-15452583
 ] 

Varun Saxena edited comment on YARN-5585 at 8/31/16 3:51 PM:
-

If the use case is only for apps, then the row keys in the application table are 
stored in sorted manner (in descending order) within the scope of a flow / flow 
run.
And we can easily support fromId along with limit to achieve some sort of 
pagination here without any performance penalty (a scan sketch follows below).
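
For illustration, a minimal sketch of how fromId plus limit could map onto a scan of the application table within one flow run. The plain-string row keys and the {{cluster!user!flow!flowrun}} prefix layout are simplifications (the real key encodes the app ID inverted so apps sort latest first):

{code:java}
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: assumes row keys of the form <flowRunPrefix>!<appId> stored
// as plain strings; the real ATSv2 encoding differs.
public final class FromIdScan {

  public static Scan scanAfter(String flowRunPrefix, String fromAppId, long limit) {
    // Start just past the row for fromId so the page begins at the next app.
    byte[] startRow = Bytes.add(
        Bytes.toBytes(flowRunPrefix + "!" + fromAppId), new byte[] {0});

    Scan scan = new Scan();
    scan.setStartRow(startRow);
    scan.setFilter(new PageFilter(limit)); // stop after 'limit' rows
    return scan;
  }
}
{code}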

However, the problem with this kind of an approach is that new apps keep on 
getting added, so the result may not be the latest. For instance, if there are 100 
apps (app100-app1) in ATS and we show 10 apps on each page, then if we move to 
page 3 we will show apps app80-app71. But it is possible that, say, 5 more apps 
get added in the meantime, i.e. we now have app105 to app1 in ATS.
Ideally page 3 should then show app85-app76. But I guess this would have 
already been considered.

Entities in the entity table, though, are not sorted, because an entity could be anything.
If we have a similar use case for containers, we can consider separating them out 
into a different table and having special handling for them. But there should be a 
use case for it.


