[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347858#comment-14347858 ]

Vrushali C commented on YARN-3134:
----------------------------------


A draft of some flow-based (and user- and queue-based) queries to be 
supported has been put up on jira YARN-3050; it could help us with the schema design. 
  
https://issues.apache.org/jira/secure/attachment/12695071/Flow%20based%20queries.docx

Sharing the schema of some of the HBase tables in hRaven (detailed schema at 
https://github.com/twitter/hraven/blob/master/bin/create_schema.rb):

{code}
create 'job_history', {NAME => 'i', COMPRESSION => 'LZO'}
create 'job_history_task', {NAME => 'i', COMPRESSION => 'LZO'}
# job_history-by_jobId - indexes job_history by jobId; contains 1 column family:
# i: job-level information, specifically the rowkey into the job_history table
create 'job_history-by_jobId', {NAME => 'i', COMPRESSION => 'LZO'}
# job_history_app_version - stores all version numbers seen for a single app ID
# i: "info" -- version information
create 'job_history_app_version', {NAME => 'i', COMPRESSION => 'LZO'}
# job_history_agg_daily - stores daily aggregated job info
# the s column family has a TTL of 30 days; it's used as a scratch column family
# that stores the run ids seen for that day
# we assume that a flow will not run for more than 30 days, hence it's fine to
# "expire" that data
create 'job_history_agg_daily',
  {NAME => 'i', COMPRESSION => 'LZO', BLOOMFILTER => 'ROWCOL'},
  {NAME => 's', VERSIONS => 1, COMPRESSION => 'LZO', BLOCKCACHE => false, TTL => '2592000'}
# job_history_agg_weekly - stores weekly aggregated job info
# the s column family has a TTL of 30 days
# it stores the run ids that are seen for that week
# we assume that a flow will not run for more than 30 days, hence it's fine to
# "expire" that data
create 'job_history_agg_weekly',
  {NAME => 'i', COMPRESSION => 'LZO', BLOOMFILTER => 'ROWCOL'},
  {NAME => 's', VERSIONS => 1, COMPRESSION => 'LZO', BLOCKCACHE => false, TTL => '2592000'}

{code}

job_history is the main table. 
Its row key: cluster!user!application!timestamp!jobID 
cluster, user, and application are stored as Strings; timestamp and jobID are 
stored as longs. 
cluster - unique cluster name (e.g. “cluster1@dc1”) 
user - user running the application (e.g. “edgar”) 
application - application ID (aka flow name), derived from the job configuration: 
uses the “batch.desc” property if set, otherwise parses a consistent ID from 
“mapreduce.job.name” 
timestamp - inverted (Long.MAX_VALUE - value) submission time. Storing the 
inverted timestamp ensures the latest jobs sort first within a given 
cluster!user!application, which enables faster retrieval of the most recent 
jobs for a flow.
jobID - stored as the Job Tracker/Resource Manager start time (a long), 
concatenated with the job sequence number, e.g. job_201306271100_0001 -> 
[1372352073732L][1L] 
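The rowkey scheme above can be sketched in a few lines of Java. This is an illustrative sketch, not hRaven's actual code; the class and method names are made up, and it assumes the jobID's date portion has already been resolved to an epoch-millis start time:

```java
import java.nio.ByteBuffer;

public class RowKeySketch {
    // Invert the submission timestamp so that lexicographic byte order in
    // HBase yields newest-first within a cluster!user!application prefix.
    static long invert(long submitTimeMillis) {
        return Long.MAX_VALUE - submitTimeMillis;
    }

    // Encode the jobID as two big-endian longs: [RM/JT start time][sequence],
    // e.g. job_201306271100_0001 -> [1372352073732L][1L].
    static byte[] encodeJobId(long rmStartTime, long seqNum) {
        ByteBuffer buf = ByteBuffer.allocate(16); // big-endian by default
        buf.putLong(rmStartTime);
        buf.putLong(seqNum);
        return buf.array();
    }

    public static void main(String[] args) {
        // A later submission inverts to a smaller value, so it sorts first.
        assert invert(2000L) < invert(1000L);
        assert encodeJobId(1372352073732L, 1L).length == 16;
    }
}
```

Fixed-width big-endian encoding of the numeric components keeps rowkeys byte-comparable, which is what makes the inverted-timestamp trick work with HBase's sorted scans.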

How the columns are named in hRaven:
- each key in the job history file becomes the column name. For example, for 
finishedMaps, it would be stored as

{code}
column=i:finished_maps,
timestamp= 1425515902000, 
value=\x00\x00\x00\x00\x00\x00\x00\x05
{code}

In the output above, timestamp is the HBase cell timestamp. 
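The cell value shown above (\x00\x00\x00\x00\x00\x00\x00\x05) is just an 8-byte big-endian long, here the value 5 for finished_maps. A small sketch of that encoding, using plain java.nio for illustration rather than HBase's Bytes utility:

```java
import java.nio.ByteBuffer;

public class LongCellValue {
    // Serialize a long to the 8-byte big-endian form stored in the cell.
    static byte[] toBytes(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }

    // Deserialize an 8-byte big-endian cell value back to a long.
    static long fromBytes(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    public static void main(String[] args) {
        byte[] b = toBytes(5L);
        // Seven leading zero bytes, then 0x05 - matching the dump above.
        assert b[0] == 0 && b[7] == 5;
        assert fromBytes(b) == 5L;
    }
}
```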

- we store the configuration information with a column name prefix of "c!"
{code}
column=i:c!yarn.sharedcache.manager.client.thread-count, 
timestamp= 1425515902000,
value=50
{code}

- each counter is stored with a prefix of "g!" or "gr!" or "gm!" 
{code}
For reducer counters, there is a prefix of gr! 
 column=i:gr!org.apache.hadoop.mapreduce.TaskCounter!SPILLED_RECORDS, 
timestamp= 1425515902000
value=\x00\x00\x00\x00\x00\x00\x00\x02

For mapper counters, there is a prefix of gm! 
column=i:gm!org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter!BYTES_READ,
timestamp= 1425515902000, 
value=\x00\x00\x00\x00\x00\x00\x00\x02
{code} 
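The counter column-naming convention above (a prefix, then the counter group, then the counter name, separated by "!") could be captured by a small helper like the following. The enum and method names here are hypothetical, chosen only to mirror the description; this is not hRaven's API:

```java
public class CounterQualifier {
    // Prefixes as described above: g! for job-level counters,
    // gm! for mapper counters, gr! for reducer counters.
    enum Scope {
        JOB("g"), MAP("gm"), REDUCE("gr");
        final String prefix;
        Scope(String prefix) { this.prefix = prefix; }
    }

    // Build the column qualifier under the "i" family:
    // <prefix>!<counterGroup>!<counterName>
    static String qualifier(Scope scope, String group, String counter) {
        return scope.prefix + "!" + group + "!" + counter;
    }

    public static void main(String[] args) {
        String q = qualifier(Scope.REDUCE,
            "org.apache.hadoop.mapreduce.TaskCounter", "SPILLED_RECORDS");
        assert q.equals(
            "gr!org.apache.hadoop.mapreduce.TaskCounter!SPILLED_RECORDS");
    }
}
```

Packing the prefix, group, and name into one qualifier keeps all of a job's counters in a single column family while still allowing prefix-scoped column filters (e.g. only reducer counters).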


> [Storage implementation] Exploiting the option of using Phoenix to access 
> HBase backend
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-3134
>                 URL: https://issues.apache.org/jira/browse/YARN-3134
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>
> Quote the introduction on Phoenix web page:
> {code}
> Apache Phoenix is a relational database layer over HBase delivered as a 
> client-embedded JDBC driver targeting low latency queries over HBase data. 
> Apache Phoenix takes your SQL query, compiles it into a series of HBase 
> scans, and orchestrates the running of those scans to produce regular JDBC 
> result sets. The table metadata is stored in an HBase table and versioned, 
> such that snapshot queries over prior versions will automatically use the 
> correct schema. Direct use of the HBase API, along with coprocessors and 
> custom filters, results in performance on the order of milliseconds for small 
> queries, or seconds for tens of millions of rows.
> {code}
> It may simplify our implementation of reading/writing data from/to HBase, 
> and make it easy to build indexes and compose complex queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
