Prabhu Joseph created YARN-9395:
-----------------------------------
Summary: Short Names for repeated Hbase Column names
Key: YARN-9395
URL: https://issues.apache.org/jira/browse/YARN-9395
Project: Hadoop YARN
Issue Type: New Feature
Components: ATSv2
Affects Versions: 3.2.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph
Currently, ATS HBase tables store the config name / metric name as column
names, which are long. These names repeat in every row and consume a lot of
storage space. We have seen customers' HBase tables already consume more than
1.5 TB within a few days.
{code}
Example Configs:
c:yarn.timeline-service.webapp.rest-csrf.methods-to-ignore
c:yarn.timeline-service.entity-group-fs-store.active-dir
c:yarn.scheduler.configuration.zk-store.parent-path
Example Metrics:
m:REDUCE:org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_READ_OPS
m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:COMBINE_INPUT_RECORDS
m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:PHYSICAL_MEMORY_BYTES
{code}
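To make the overhead concrete, here is a rough sketch that measures the byte
length of the example column names above and the bytes saved per row if each
were replaced by a short id. HBase stores the column qualifier with every
cell, so this cost repeats per row. The 2-byte id width and the helper names
are assumptions for illustration only, and the estimate ignores compression
and data block encoding, which can reduce the on-disk cost.

```python
# Column names copied from the examples above. The text before the first ":"
# is treated as the column family ("c" or "m"), the rest as the qualifier.
EXAMPLE_COLUMNS = [
    "c:yarn.timeline-service.webapp.rest-csrf.methods-to-ignore",
    "c:yarn.timeline-service.entity-group-fs-store.active-dir",
    "c:yarn.scheduler.configuration.zk-store.parent-path",
    "m:REDUCE:org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_READ_OPS",
    "m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:COMBINE_INPUT_RECORDS",
    "m:REDUCE:org.apache.hadoop.mapreduce.TaskCounter:PHYSICAL_MEMORY_BYTES",
]

# Assumption: a hypothetical fixed-width short id replaces each qualifier.
SHORT_ID_BYTES = 2


def qualifier_bytes(column):
    """Return the UTF-8 length of the qualifier (the part after 'family:')."""
    _family, qualifier = column.split(":", 1)
    return len(qualifier.encode("utf-8"))


def bytes_saved_per_row(columns, short_id_bytes=SHORT_ID_BYTES):
    """Sum the qualifier bytes saved if every column used a short id."""
    return sum(qualifier_bytes(c) - short_id_bytes for c in columns)


if __name__ == "__main__":
    for column in EXAMPLE_COLUMNS:
        print(qualifier_bytes(column), column)
    print("bytes saved per row:", bytes_saved_per_row(EXAMPLE_COLUMNS))
```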
We need to use short column names, as recommended by this HBase best practice:
http://moi.vonos.net/bigdata/avro-hbase-colnames/
The challenge is that ATS does not know the column names until the rows are
inserted. We can provide a mapping file that maps the repeated configs /
metrics / info from different applications to unique numbers, which customers
can configure upfront to save storage space. This is similar to what Phoenix
does:
https://blogs.apache.org/phoenix/entry/column-mapping-and-immutable-data
https://phoenix.apache.org/columnencoding.html
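A minimal sketch of the mapping-file idea, not the actual ATS implementation.
It assumes a hypothetical file format of one "name=number" entry per line;
the names parse_mapping and ColumnNameCodec are illustrative only. Names
missing from the mapping are stored unchanged, since ATS cannot know every
column name upfront.

```python
from typing import Dict

# Hypothetical mapping file content: one "<long name>=<unique number>" per line.
SAMPLE_MAPPING = """
# configs
yarn.timeline-service.webapp.rest-csrf.methods-to-ignore=1
yarn.timeline-service.entity-group-fs-store.active-dir=2
# metrics
REDUCE:org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_READ_OPS=3
"""


def parse_mapping(text: str) -> Dict[str, int]:
    """Parse 'name=number' lines, skipping blank lines and '#' comments."""
    mapping: Dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, _, number = line.rpartition("=")
        if not name:
            raise ValueError("expected 'name=number', got: " + line)
        mapping[name] = int(number)
    return mapping


class ColumnNameCodec:
    """Translate long column names to short numeric ids and back."""

    def __init__(self, mapping: Dict[str, int]) -> None:
        self._to_id = dict(mapping)
        self._to_name = {num: name for name, num in mapping.items()}
        if len(self._to_name) != len(self._to_id):
            raise ValueError("each number must map to exactly one name")

    def encode(self, name: str) -> str:
        """Return the short id as a string, or the name itself if unmapped."""
        short_id = self._to_id.get(name)
        return name if short_id is None else str(short_id)

    def decode(self, stored: str) -> str:
        """Return the long name for a known id, else the stored value."""
        if stored.isascii() and stored.isdigit():
            return self._to_name.get(int(stored), stored)
        return stored


if __name__ == "__main__":
    codec = ColumnNameCodec(parse_mapping(SAMPLE_MAPPING))
    print(codec.encode("yarn.timeline-service.entity-group-fs-store.active-dir"))
    print(codec.decode("3"))
```

One limitation of this sketch: a real column name made up only of digits
would be ambiguous on read, so an actual design would need a reserved prefix
or a separate column family for encoded names.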