[
https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593826#comment-14593826
]
Joep Rottinghuis commented on YARN-3051:
----------------------------------------
Not all arguments are equally selective. For example, relatesTo entities are
not stored in individual cells that can be used as a push-down predicate for
the HBase tables. We'd have to select all entities that match the other
criteria, retrieve the relatesTo string, parse it into individual fields, and
do set operations on them.
{code}
Set<TimelineEntity> getEntities(String userId, String clusterId, String flowId,
    String flowRunId, String appId, String entityType, Long limit,
    Long createdTimeBegin, Long createdTimeEnd, Long modifiedTimeBegin,
    Long modifiedTimeEnd, Set<TimelineEntity.Identifier> relatesTo,
    Set<TimelineEntity.Identifier> isRelatedTo, Set<KeyValuePair> info,
    Set<KeyValuePair> configs, Set<String> events, Set<String> metrics,
    EnumSet<Field> fieldsToRetrieve) throws IOException;
{code}
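To make the cost of the relatesTo fallback concrete, here is a rough sketch (class name, method names, and the "type!id" serialization format are all hypothetical, not from any patch) of the client-side set operations the reader would be forced to do, after the full rows have already been fetched:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Rough sketch of the client-side fallback described above: because
 * relatesTo is stored as a single serialized string rather than as
 * individual cells, every candidate entity must be pulled back, its
 * relatesTo string parsed, and set operations applied in the reader.
 */
public class RelatesToPostFilter {

  /** Parse a serialized relatesTo string, e.g. "type1!id1,type2!id2"
      (hypothetical format). */
  public static Set<String> parseRelatesTo(String serialized) {
    return new HashSet<>(Arrays.asList(serialized.split(",")));
  }

  /** True if the entity's relatesTo set contains all requested identifiers.
      This intersection happens in the reader, after the row has already
      been fetched from HBase -- no push-down is possible. */
  public static boolean matches(String serializedRelatesTo, Set<String> wanted) {
    return parseRelatesTo(serializedRelatesTo).containsAll(wanted);
  }
}
```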
If we defer being able to effectively select a subset of columns, what does it
actually mean to specify a Set<KeyValuePair>?
Can the value be null to indicate that we don't care what the value is and
that we want the column back in the result?
I think we should separate out predicates (give me all X where Y=Z) from
selectors (give me all X...).
It is not clear from the latest patch whether fully populated entities will be
returned.
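The predicate/selector distinction could look roughly like this (a minimal plain-Java sketch; all names here are illustrative, not proposed API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of the predicate-vs-selector split suggested above.
 * A predicate restricts WHICH entities come back ("all X where Y=Z");
 * a selector restricts WHAT of each entity comes back ("give me columns X").
 */
public class QuerySpec {

  /** Predicate: a key and an expected value -> filters entities. */
  public static boolean predicateMatches(Map<String, String> entityConfigs,
                                         String key, String expected) {
    return expected.equals(entityConfigs.get(key));
  }

  /** Selector: a set of keys -> projects columns, never filters entities. */
  public static Map<String, String> select(Map<String, String> entityConfigs,
                                           Set<String> wantedKeys) {
    Map<String, String> out = new HashMap<>();
    for (String k : wantedKeys) {
      if (entityConfigs.containsKey(k)) {
        out.put(k, entityConfigs.get(k));
      }
    }
    return out;
  }
}
```

Keeping the two as separate arguments also removes the ambiguity of a null value in Set<KeyValuePair>: a selector is just a set of names, and a predicate always carries a value.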
With regard to
{quote}
Makes sense. We could use a regex or club different configs into different
groups and let user query that group. But then the problem will be how do we
specify those groups. So as you say lets defer it and discuss it at length when
we take it up.
{quote}
and
{quote}
One thing though, along the lines of patch submitted earlier, I can include
something like Map<String, NameValueRelations> for metrics in the interface for
specifying relational operations . It will support things like metricA>val1 and
metricA<val2 as well(means 2 conditions on the same metric to specify a range).
Thoughts ?
{quote}
Before we invent our own way to specify which columns (metrics, configs,
etc.) to retrieve, let's make sure that what we come up with can be mapped
efficiently to our backing store.
As we've selected HBase as the major implementation to handle queries at scale,
we need to think about how to make effective use of filters
(https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterBase.html)
to aggressively reduce what we pull back from HBase. ColumnPrefixFilter, for
example, would be a good way to express which config columns to retrieve. A
regex would be a poor way, as it would result in having to pull back every
column and then drop values from the retrieved result.
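The reason a prefix maps well to ColumnPrefixFilter while a regex doesn't can be shown in plain Java (this models the scan behavior, it is not the HBase API): a prefix is a contiguous range of sorted column qualifiers, so the server can seek straight to it and stop after it; a regex forces inspection of every column.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustration of why a column-name prefix is cheap: within an HBase row,
 * column qualifiers are sorted, so all qualifiers sharing a prefix form
 * one contiguous range.
 */
public class PrefixVsRegex {

  /** sortedQualifiers models the sorted columns of one HBase row. */
  public static List<String> selectByPrefix(List<String> sortedQualifiers,
                                            String prefix) {
    List<String> out = new ArrayList<>();
    for (String q : sortedQualifiers) {
      if (q.startsWith(prefix)) {
        out.add(q);                     // inside the contiguous prefix range
      } else if (q.compareTo(prefix) > 0) {
        break;                          // past the range: stop scanning early
      }
      // otherwise: before the range; a real filter seeks forward in one hop
    }
    return out;
  }
}
```

A regex filter gets no such early exit: every qualifier in the row must be matched before it can be dropped.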
Similarly, if our rowkeys are prefixed by user, then creating an API that
doesn't include the user (only the cluster) means we're doing a full table
scan, albeit with skip filters that let us skip over users we're not
interested in.
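The difference is between a bounded range scan and a full scan. A minimal sketch (the "user!cluster!flow" key layout and the separator character are assumptions for illustration, not the actual schema):

```java
/**
 * Sketch of the rowkey point above: when the user is part of the key
 * prefix, a query that supplies the user becomes a bounded range scan
 * (start row inclusive, stop row exclusive); without the user, every row
 * must be visited, even if skip filters discard most of them server-side.
 */
public class RowKeyRange {

  /** Hypothetical rowkey layout: user!cluster!flow... */
  public static String startRow(String user) {
    return user + "!";
  }

  /** '"' is the character immediately after '!', so this bounds the range. */
  public static String stopRow(String user) {
    return user + "\"";
  }

  public static boolean inRange(String rowKey, String user) {
    return rowKey.compareTo(startRow(user)) >= 0
        && rowKey.compareTo(stopRow(user)) < 0;
  }
}
```

In HBase terms, startRow/stopRow would be handed to the Scan so the region server never touches other users' rows at all.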
In an earlier patch I saw a NameValueRelation that was able to perform the
operations. That again assumes that all values will be retrieved from the
backing store and then filtered in the reader before being returned to the
user. It would be more effective to make sure we can easily map this to
operations we can push into HBase itself (through a ColumnValueFilter) via
the available compare operations
(https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/CompareFilter.CompareOp.html).
I'm certainly not arguing to have these HBase-specific classes exposed in our
API, but our methods should closely match what can be done, which I don't
think will be overly restrictive or unreasonable.
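The metricA>val1 AND metricA<val2 example from the quote above maps cleanly onto this model. A sketch (the enum and classes are illustrative, not the HBase API; in HBase each condition would become one SingleColumnValueFilter with the matching CompareOp, combined in a FilterList with MUST_PASS_ALL):

```java
import java.util.List;

/**
 * Sketch of expressing a range condition on a metric as a conjunction of
 * simple comparisons -- exactly the shape that can be pushed down to HBase
 * server-side instead of being evaluated in the reader.
 */
public class MetricRange {

  /** Illustrative subset of HBase's CompareFilter.CompareOp. */
  public enum Op { GREATER, LESS }

  public static class Condition {
    final Op op;
    final long value;

    public Condition(Op op, long value) {
      this.op = op;
      this.value = value;
    }

    boolean test(long metric) {
      return op == Op.GREATER ? metric > value : metric < value;
    }
  }

  /** All conditions must pass: the analogue of FilterList.MUST_PASS_ALL. */
  public static boolean passesAll(long metric, List<Condition> conditions) {
    for (Condition c : conditions) {
      if (!c.test(metric)) {
        return false;
      }
    }
    return true;
  }
}
```

An API shaped like this (column, compare op, value, AND-combined) stays storage-neutral while remaining a one-to-one translation to the HBase filters.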
If we're going to have two types of tables in the backing store:
a) HBase native tables, specifically structured for efficient storage and
retrieval
and
b) Phoenix tables (mainly time based aggregates and aggregates over non-primary
key prefixes), specifically structured for flexible querying
would it make sense to break these two kinds of queries into separate method
families?
Or are we thinking that, based on which arguments are passed in, we decide
which tables to query with which mechanism?
> [Storage abstraction] Create backing storage read interface for ATS readers
> ---------------------------------------------------------------------------
>
> Key: YARN-3051
> URL: https://issues.apache.org/jira/browse/YARN-3051
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Sangjin Lee
> Assignee: Varun Saxena
> Attachments: YARN-3051-YARN-2928.003.patch,
> YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch,
> YARN-3051.Reader_API.patch, YARN-3051.Reader_API_1.patch,
> YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch
>
>
> Per design in YARN-2928, create backing storage read interface that can be
> implemented by multiple backing storage implementations.