Prabhu Joseph created YARN-9395:
-----------------------------------

             Summary: Short Names for repeated Hbase Column names
                 Key: YARN-9395
                 URL: https://issues.apache.org/jira/browse/YARN-9395
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: ATSv2
    Affects Versions: 3.2.0
            Reporter: Prabhu Joseph
            Assignee: Prabhu Joseph


Currently the ATS HBase tables store config names and metric names as column 
names, which are long. These names repeat in every row and consume a lot of 
storage space. We have seen customers' HBase tables already consume more than 
1.5 TB within a few days.

{code}
Example Configs:
c:yarn.timeline-service.webapp.rest-csrf.methods-to-ignore
c:yarn.timeline-service.entity-group-fs-store.active-dir
c:yarn.scheduler.configuration.zk-store.parent-path

Example Metrics:
m:REDUCE:org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_READ_OPS
m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:COMBINE_INPUT_RECORDS
m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:PHYSICAL_MEMORY_BYTES
{code}
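
To give a rough sense of the overhead: HBase stores the column qualifier with every cell, so the qualifier bytes are paid once per row for each column. The sketch below counts only the qualifier bytes for the example names above and compares them with a short name of fixed length. It assumes that the text before the first ':' is the column family; the function names, the 2-byte short name and any row count passed in are illustrative assumptions, not measurements from a real cluster.

```python
# Rough, illustrative estimate of bytes spent on column qualifiers.
# Assumption: text before the first ':' is the column family, the rest
# is the qualifier that HBase stores with every cell.

EXAMPLE_COLUMNS = [
    "c:yarn.timeline-service.webapp.rest-csrf.methods-to-ignore",
    "c:yarn.timeline-service.entity-group-fs-store.active-dir",
    "c:yarn.scheduler.configuration.zk-store.parent-path",
    "m:REDUCE:org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_READ_OPS",
    "m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:COMBINE_INPUT_RECORDS",
    "m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:PHYSICAL_MEMORY_BYTES",
]


def qualifier_bytes(column: str) -> int:
    """Return the UTF-8 length of the qualifier part of 'family:qualifier'."""
    _family, qualifier = column.split(":", 1)
    return len(qualifier.encode("utf-8"))


def estimate_savings(columns, rows: int, short_len: int = 2) -> int:
    """Bytes saved on qualifiers if each one shrinks to short_len bytes.

    Ignores compression, block encoding and every other part of the cell,
    so it is only an upper-level illustration of the effect.
    """
    long_total = sum(qualifier_bytes(c) for c in columns) * rows
    short_total = short_len * len(columns) * rows
    return long_total - short_total


if __name__ == "__main__":
    # The row count here is a made-up number for illustration only.
    print(estimate_savings(EXAMPLE_COLUMNS, rows=1_000_000))
```

The real on-disk saving depends on compression and data block encoding, which this sketch deliberately leaves out.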

We need to use short column names as per the HBase best practice - 
http://moi.vonos.net/bigdata/avro-hbase-colnames/ The challenge is that ATS 
does not know the column names until the rows are inserted. We can provide a 
mapping file that maps the repeated configs / metrics / info from different 
applications to unique numbers, which customers can configure upfront to save 
storage space. This is similar to what Phoenix does:

https://blogs.apache.org/phoenix/entry/column-mapping-and-immutable-data
https://phoenix.apache.org/columnencoding.html
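
The mapping-file idea could look roughly like the sketch below. Everything in it is hypothetical: the `long name = short id` file format, the `ColumnNameMapper` class and its method names are assumptions made for illustration and are not part of ATSv2 or Phoenix. Names that are missing from the mapping are stored unchanged, which reflects the fact that ATS does not know all column names upfront.

```python
# Hypothetical sketch of a configurable long-name -> short-name mapping.
# The file format ("long name = short id", '#' for comments) is an
# assumption made for this example only.


class ColumnNameMapper:
    def __init__(self, mapping_text: str):
        self._to_short = {}
        self._to_long = {}
        for raw in mapping_text.splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            # Split on the last '=' so long names may contain ':' or '.'.
            long_name, short_name = (p.strip() for p in line.rsplit("=", 1))
            if long_name in self._to_short or short_name in self._to_long:
                raise ValueError("duplicate mapping entry: " + line)
            self._to_short[long_name] = short_name
            self._to_long[short_name] = long_name

    def encode(self, long_name: str) -> str:
        """Name to write to HBase; unmapped names pass through unchanged."""
        return self._to_short.get(long_name, long_name)

    def decode(self, stored_name: str) -> str:
        """Name to return to readers; unknown stored names pass through."""
        return self._to_long.get(stored_name, stored_name)


SAMPLE_MAPPING = """
# configs
yarn.scheduler.configuration.zk-store.parent-path = 1
# metrics
REDUCE:org.apache.hadoop.mapreduce.TaskCounter:PHYSICAL_MEMORY_BYTES = 2
"""

if __name__ == "__main__":
    mapper = ColumnNameMapper(SAMPLE_MAPPING)
    stored = mapper.encode("yarn.scheduler.configuration.zk-store.parent-path")
    print(stored, "->", mapper.decode(stored))
```

A real design would also need to keep a short id from colliding with an unmapped long name, and to keep the mapping stable once rows have been written with it; this sketch does not address either point.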




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
