[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593826#comment-14593826 ]
Joep Rottinghuis commented on YARN-3051:
----------------------------------------

Not all arguments are equally selective. For example, relatesTo (entities) is not stored in individual cells that can be used as push-down predicates against the HBase tables. We'd have to select all entities that match the other criteria, read the relatesTo string, parse it into individual fields, and do set operations on them.

{code}
Set<TimelineEntity> getEntities(String userId, String clusterId,
    String flowId, String flowRunId, String appId, String entityType,
    Long limit, Long createdTimeBegin, Long createdTimeEnd,
    Long modifiedTimeBegin, Long modifiedTimeEnd,
    Set<TimelineEntity.Identifier> relatesTo,
    Set<TimelineEntity.Identifier> isRelatedTo,
    Set<KeyValuePair> info, Set<KeyValuePair> configs,
    Set<String> events, Set<String> metrics,
    EnumSet<Field> fieldsToRetrieve) throws IOException;
{code}

If we defer being able to effectively select a subset of columns, what does it actually mean to specify a Set<KeyValuePair>? Can the value be null to indicate that we don't care what the value is, and that we simply want the column back in the result? I think we should separate out predicates (give me all X where Y=Z) from selectors (give me all X...). It is not clear in the latest patch whether fully populated entities will be returned.

Wrt.
{quote}
Makes sense. We could use a regex or club different configs into different groups and let the user query that group. But then the problem will be how we specify those groups. So, as you say, let's defer it and discuss it at length when we take it up.
{quote}
and
{quote}
One thing though: along the lines of the patch submitted earlier, I can include something like Map<String, NameValueRelations> for metrics in the interface for specifying relational operations. It will support things like metricA>val1 and metricA<val2 as well (i.e., two conditions on the same metric to specify a range). Thoughts?
{quote}

Before we invent our own way of specifying which columns (metrics, configs, etc.) we'll retrieve, let's make sure that what we come up with can efficiently be mapped to our backing store. Since we've selected HBase as the main implementation for handling queries at scale, we need to think about how to make effective use of filters (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterBase.html) to aggressively reduce what we pull back from HBase.

ColumnPrefixFilter, for example, will be a good way to express which config columns to retrieve. A regex will be a poor way, as it would force us to pull back every column and then drop values from the retrieved result. Similarly, if our row keys are prefixed by user, then creating an API that doesn't include the user (only the cluster) means we're doing a full table scan, albeit with skip filters that let us skip over users we're not interested in.

In an earlier patch I saw a NameValueRelation that was able to perform such operations. That again assumes that all values will be retrieved from the backing store and then filtered in the reader before being returned to the user. It will be more effective to make sure we can easily map this to operations we can push into HBase itself (through a SingleColumnValueFilter) via the available comparison operations (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/CompareFilter.CompareOp.html). I'm certainly not arguing for exposing these HBase-specific classes in our API, but our methods should closely match what can be done, which I don't think will be overly restrictive or unreasonable.
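To make the predicate-vs-selector distinction concrete, here is a minimal sketch in plain Java (no HBase dependency; all class and method names are invented for illustration). It models a NameValueRelation-style metric predicate using the same six comparison operators as HBase's CompareFilter.CompareOp, so that each predicate could in principle be translated into a pushed-down SingleColumnValueFilter, and shows two predicates on one metric expressing a range (metricA>val1 and metricA<val2). It also shows the reader-side parse-and-intersect step we'd be forced into for relatesTo, since those identifiers are not stored in individually filterable cells; the comma-separated storage format here is purely an assumption for the sketch.

```java
import java.util.*;

public class ReaderFilterSketch {
  // Mirrors HBase's CompareFilter.CompareOp values, so a predicate built from
  // these could later be mapped to a SingleColumnValueFilter and pushed down.
  enum CompareOp { LESS, LESS_OR_EQUAL, EQUAL, NOT_EQUAL, GREATER_OR_EQUAL, GREATER }

  // A predicate: "all entities where <metric> <op> <value>" (hypothetical class).
  static final class MetricPredicate {
    final String metric; final CompareOp op; final long value;
    MetricPredicate(String metric, CompareOp op, long value) {
      this.metric = metric; this.op = op; this.value = value;
    }
    boolean test(Map<String, Long> metrics) {
      Long v = metrics.get(metric);
      if (v == null) return false;
      int c = Long.compare(v, value);
      switch (op) {
        case LESS:             return c < 0;
        case LESS_OR_EQUAL:    return c <= 0;
        case EQUAL:            return c == 0;
        case NOT_EQUAL:        return c != 0;
        case GREATER_OR_EQUAL: return c >= 0;
        default:               return c > 0;  // GREATER
      }
    }
  }

  // Two predicates on the same metric express a range, e.g. 10 < metricA < 100.
  static boolean matchesAll(Map<String, Long> metrics, List<MetricPredicate> preds) {
    for (MetricPredicate p : preds) {
      if (!p.test(metrics)) return false;
    }
    return true;
  }

  // relatesTo cannot be pushed down: the stored string must be parsed in the
  // reader and intersected with the requested identifiers (a set operation
  // performed client-side, after the row has already been pulled back).
  static boolean relatesToMatches(String storedRelatesTo, Set<String> requested) {
    Set<String> stored = new HashSet<>(Arrays.asList(storedRelatesTo.split(",")));
    stored.retainAll(requested);  // set intersection in the reader, not the store
    return !stored.isEmpty();
  }

  public static void main(String[] args) {
    Map<String, Long> metrics = Collections.singletonMap("metricA", 50L);
    List<MetricPredicate> range = Arrays.asList(
        new MetricPredicate("metricA", CompareOp.GREATER, 10L),
        new MetricPredicate("metricA", CompareOp.LESS, 100L));
    System.out.println(matchesAll(metrics, range));                                     // true
    System.out.println(relatesToMatches("app_1,app_2", Collections.singleton("app_2"))); // true
  }
}
```

The point of the sketch: MetricPredicate maps one-to-one onto an operation the store can evaluate, while relatesToMatches can only ever run in the reader — which is exactly why the former is cheap and the latter is not.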
If we're going to have two types of tables in the backing store:
a) HBase native tables, specifically structured for efficient storage and retrieval, and
b) Phoenix tables (mainly time-based aggregates and aggregates over non-primary-key prefixes), specifically structured for flexible querying,
would it make sense to break these out into two separate query families? Or are we thinking that, based on which arguments are passed in, we decide which tables to query with which mechanism?

> [Storage abstraction] Create backing storage read interface for ATS readers
> ---------------------------------------------------------------------------
>
>                 Key: YARN-3051
>                 URL: https://issues.apache.org/jira/browse/YARN-3051
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Varun Saxena
>         Attachments: YARN-3051-YARN-2928.003.patch,
> YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch,
> YARN-3051.Reader_API.patch, YARN-3051.Reader_API_1.patch,
> YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch
>
> Per design in YARN-2928, create backing storage read interface that can be
> implemented by multiple backing storage implementations.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)