[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544692#comment-14544692 ]
Joep Rottinghuis commented on YARN-3411:
----------------------------------------

Ok, admitted that I have not yet read Sangjin's comments from just minutes ago. I also need to spend some more time to fully read and comprehend the key structure. Here is an initial set of comments; please take them for what they are worth. I'll have to spend more time to dig into the details.

Overall, we should fill in the javadoc, describe inputs, explain what we expect, etc.

EntityColumnDetails probably needs a public static method public static EntityColumnDetails createEntityColumnDetails(byte[] bytes) that takes bytes, iterates over the enum values, and returns the one matching those bytes. You'll need to decide whether you want a default, or whether to throw an exception if the bytes do not map to any enum value.

In EntityTableDetails.java, getRowKeyPrefix should cleanse input arguments to strip EntityTableDetails.ROW_KEY_SEPARATOR_BYTES (at least the string representation of it; it may be good to split that definition into two). The reverse method that takes a row key and splits it into components should probably not reverse this. For example, if we use a ! as separator and translate it to an underscore, do we translate all underscores back to bangs? In hRaven we did not, but stripping the separators is essential to avoid parsing errors during read.

Instead of just using DEFAULT_ENTITY_TABLE_NAME or overriding the table name itself, it would be super-useful to have one single config key to prefix all tables. This way you can easily run a staging, a test, a production, and whatever other setup in the same HBase instance without having to override every single table in the config. One could simply overwrite the default prefix and you're off and running. For the prefix I'm thinking "tst", "prod", "exp", etc. One can then still override a single table name if needed, but managing one whole setup will be easier.

For column prefixes, are you simply using a one-character prefix, and is the first character always a prefix?
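The suggested bytes-to-enum factory could be sketched roughly as follows. The enum constants and qualifier strings here are placeholders, not the actual YARN-3411 schema; this variant throws on unknown input rather than returning a default:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of EntityColumnDetails with the proposed static
// factory; constant names and qualifier bytes are illustrative only.
enum EntityColumnDetails {
  ID("i"),
  TYPE("t"),
  CREATED_TIME("c");

  private final byte[] bytes;

  EntityColumnDetails(String qualifier) {
    this.bytes = qualifier.getBytes(StandardCharsets.UTF_8);
  }

  byte[] getBytes() {
    return bytes.clone();
  }

  // Iterate over the enum values and return the one whose qualifier bytes
  // match; throw (rather than pick a default) when nothing matches.
  static EntityColumnDetails createEntityColumnDetails(byte[] input) {
    for (EntityColumnDetails column : values()) {
      if (Arrays.equals(column.bytes, input)) {
        return column;
      }
    }
    throw new IllegalArgumentException(
        "No column matches bytes: " + Arrays.toString(input));
  }
}
```

Whether to throw or fall back to a default depends on whether unknown qualifiers in a scanned row should be treated as corruption or as forward-compatible data.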
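The one-way separator cleansing described above (strip on write, don't translate back on read, as in hRaven) might look like this sketch; the "!" separator and "_" replacement are assumed values for illustration:

```java
// Sketch of one-way row-key separator cleansing. SEPARATOR and REPLACEMENT
// are assumptions for illustration, not necessarily what the patch uses.
final class RowKeys {
  private static final String SEPARATOR = "!";
  private static final String REPLACEMENT = "_";

  private RowKeys() {
  }

  // Replace any occurrence of the separator inside a component so a later
  // split on the separator cannot break on user data. Deliberately one-way:
  // the read path splits on "!" but never maps "_" back to "!".
  static String cleanse(String component) {
    return component.replace(SEPARATOR, REPLACEMENT);
  }

  // Join cleansed components into a row-key prefix string.
  static String joinRowKey(String... components) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < components.length; i++) {
      if (i > 0) {
        sb.append(SEPARATOR);
      }
      sb.append(cleanse(components[i]));
    }
    return sb.toString();
  }

  // The reverse operation splits on the separator but does not undo the
  // cleansing, so round-tripping is lossy by design.
  static String[] splitRowKey(String rowKey) {
    return rowKey.split(SEPARATOR);
  }
}
```

The lossiness is the trade-off: a user named "a!b" reads back as "a_b", but the row key always parses into the right number of components.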
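The single table-prefix config key could be as simple as the following sketch; the key name, default prefix, and the use of java.util.Properties in place of Hadoop's Configuration are all assumptions for illustration:

```java
import java.util.Properties;

// Sketch of a single prefix key applied to every table name, so "tst",
// "prod", "exp", etc. setups can share one HBase instance. The key name and
// defaults are hypothetical.
final class TableNames {
  static final String PREFIX_KEY = "yarn.timeline-service.hbase.table-prefix";
  static final String DEFAULT_PREFIX = "prod";
  static final String DEFAULT_ENTITY_TABLE_NAME = "timelineservice.entity";

  private TableNames() {
  }

  // Resolve the entity table name by prepending the configured prefix;
  // a per-table override could still take precedence over this.
  static String entityTableName(Properties conf) {
    String prefix = conf.getProperty(PREFIX_KEY, DEFAULT_PREFIX);
    return prefix + "." + DEFAULT_ENTITY_TABLE_NAME;
  }
}
```

Overriding one key then retargets every table of a deployment at once, while a per-table name override remains available for the odd exception.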
Or do we need to do cleansing on the columns written to ensure that nobody else will write with this column prefix? Also, if there is a prefix, why not make it an enum?

More general question: for TimelineWriter and its implementations, is there an expectation set around concurrency? Is any synchronization expected/needed to ensure visibility when calls happen from different threads? How about entities: are they expected to be immutable once passed to the write method? Similarly for the constructor, are we assuming that the configuration object will not be modified while we're constructing a TimelineWriter?

For the write method in HBaseTimelineWriterImpl, we're collecting all the puts first, then creating a big list of entity puts, and then writing them out. For large entities (with large configurations) we end up creating two (if not three; I'd have to read more carefully) copies of the entire structure. Wouldn't it be better to do this in somewhat more of a streaming manner? For example, first create the info puts, stream those out to the entityTable, then drop the info Put, then move on to the event Put, config Put, etc. Even a method like getConfigPut could conceivably be constructed in such a way that we don't have to create one massive copy of the config in a single Put. Perhaps we can pass in the table and create smaller puts, perhaps even one per config key, and stream those out. Ditto when iterating over timeseries.keySet() in getMetricsPut.

For serviceStop we should probably handle things a little bit better than just calling close. Are we expecting to just close the HBase connection, keep writing, and let the write exceptions shoot up to the top of the writing process? Should we occasionally (as we're iterating) check the stop method and bail out in a more elegant way?

On the Range class, the start and end are final, right?
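The streaming alternative sketched above (one small put per config key instead of one massive Put per entity) could look like this; the BiConsumer sink stands in for an HBase BufferedMutator/Table, and the method name is hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch of streaming config entries out one at a time instead of
// accumulating a single massive Put. The sink abstracts the HBase table;
// class and method names are illustrative, not the patch's API.
final class StreamingConfigWriter {
  private StreamingConfigWriter() {
  }

  // Emit each config key/value to the sink as soon as it is visited, so the
  // writer never holds a second full copy of the entity's configuration.
  static int writeConfig(Map<String, String> config,
      BiConsumer<String, String> sink) {
    int written = 0;
    for (Map.Entry<String, String> entry : config.entrySet()) {
      sink.accept(entry.getKey(), entry.getValue());
      written++;
    }
    return written;
  }
}
```

With a BufferedMutator as the real sink, batching and flushing stay the table's concern, and the same shape applies to iterating timeseries.keySet() for metrics.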
> [Storage implementation] explore the native HBase write schema for storage
> --------------------------------------------------------------------------
>
>                 Key: YARN-3411
>                 URL: https://issues.apache.org/jira/browse/YARN-3411
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Vrushali C
>            Priority: Critical
>         Attachments: ATSv2BackendHBaseSchemaproposal.pdf,
> YARN-3411-YARN-2928.001.patch, YARN-3411-YARN-2928.002.patch,
> YARN-3411.poc.2.txt, YARN-3411.poc.3.txt, YARN-3411.poc.4.txt,
> YARN-3411.poc.5.txt, YARN-3411.poc.6.txt, YARN-3411.poc.7.txt,
> YARN-3411.poc.txt
>
> There is work in progress to implement the storage based on a Phoenix schema (YARN-3134).
> In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries.
> Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)