[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997320#comment-13997320 ]

Andrew Purtell commented on HBASE-7958:
---

To make the statistics useful, we could introduce CF-level APIs for coprocessors, export to metrics2 handled by the regionserver, and a cluster summary API handled by the master. The infrastructure for that API could also be used to generate output or graphs on the master UI.

Statistics per-column family per-region
---

Key: HBASE-7958
URL: https://issues.apache.org/jira/browse/HBASE-7958
Project: HBase
Issue Type: New Feature
Affects Versions: 0.95.2
Reporter: Jesse Yates
Assignee: Jesse Yates
Attachments: hbase-7958-v0-parent.patch, hbase-7958-v0.patch, hbase-7958_rough-cut-v0.patch

Originating from this discussion on the dev list: http://search-hadoop.com/m/coDKU1urovS/Simple+stastics+per+region/v=plain

Essentially, we should have built-in statistics gathering for HBase tables. This allows clients to have a better understanding of the distribution of keys within a table and a given region. We could also surface this information via the UI.

There are a couple of different proposals from the email; the overview is this: we add in something on compactions that gathers stats about the keys that are written, and then we surface them to a table.

The possible proposals include:

*How to implement it?*
# Coprocessors
** advantage - it plugs in easily, and people could pretty easily add their own statistics.
** disadvantage - UI elements would also require this, so we get into dependent loading, which leads down the OSGi path. Also, these CPs need to be installed _after_ all the other CPs on compaction to ensure they see exactly what gets written (doable, but a pain).
# Built into HBase as a custom scanner
** advantage - always goes in the right place, with no need to muck about with loading CPs etc.
** disadvantage - less pluggable, at least for the initial cut

*Where do we store data?*
# .META.
** advantage - it's an existing table, so we can jam it into another CF there
** disadvantage - this would make META much larger, possibly leading to splits, AND will make it much harder for other processes to read the info
# A new stats table
** advantage - cleanly separates out the information from META
** disadvantage - should use a 'system table' idea to prevent accidental deletion and manipulation by arbitrary clients, but still allow clients to read it.

Once we have this framework, we can then move to an actual implementation of various statistics.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997314#comment-13997314 ]

Andrew Purtell commented on HBASE-7958:
---

Thanks for the summary [~jesse_yates].

The description for this issue is 'statistics per-column family per-region'. In that scope, maintaining a system table for statistics gathering is unnecessary; we can use region-local storage. Perhaps during compactions we could calculate the basic things people seem to want: row count, row key cardinality, min/max/avg size per value, and total value size, all per CF, per region. Column qualifier cardinality also seems like it might be useful.

Perhaps we could maintain a tree of statistics files: at the HFile level, at the CF level, at the table level, and at the namespace level. Compactions would record into the resulting HFiles the statistics metadata calculated during processing. A background process running in the master could aggregate while following the tree, swapping updated results for older results at every level when ready. We should be able to handle point-in-time counts and simple statistical properties in this way?

It could be possible to use a system statistics table instead of files, but why have regionservers exchange RPCs if not necessary? (And updating a table inline with compaction or split handling brings back unfond memories of something we once called the 'region historian'.)
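
The compaction-time statistics and the file-to-CF roll-up suggested in the comment above can be sketched roughly as follows. This is an illustration in Python only, not HBase code: `FamilyStats`, `collect`, and `merge` are hypothetical names, cells are simplified to (row key, value) pairs arriving in sorted row order, and cardinality estimates are left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class FamilyStats:
    """Summary for one column family in one store file (hypothetical layout)."""
    row_count: int = 0
    value_count: int = 0
    total_value_size: int = 0
    min_value_size: Optional[int] = None
    max_value_size: Optional[int] = None

    @property
    def avg_value_size(self) -> float:
        return self.total_value_size / self.value_count if self.value_count else 0.0

    def merge(self, other: "FamilyStats") -> "FamilyStats":
        """Combine two summaries, e.g. when rolling HFile stats up to the CF level.

        Row counts are simply added, so a row whose cells live in both files
        is counted twice; a real implementation would have to account for that.
        """
        mins = [m for m in (self.min_value_size, other.min_value_size) if m is not None]
        maxs = [m for m in (self.max_value_size, other.max_value_size) if m is not None]
        return FamilyStats(
            row_count=self.row_count + other.row_count,
            value_count=self.value_count + other.value_count,
            total_value_size=self.total_value_size + other.total_value_size,
            min_value_size=min(mins) if mins else None,
            max_value_size=max(maxs) if maxs else None,
        )


def collect(cells: Iterable[Tuple[bytes, bytes]]) -> FamilyStats:
    """Accumulate stats over cells in the order a compaction would write them."""
    stats = FamilyStats()
    previous_row = None
    for row_key, value in cells:
        if row_key != previous_row:  # sorted input: a new key means a new row
            stats.row_count += 1
            previous_row = row_key
        size = len(value)
        stats.value_count += 1
        stats.total_value_size += size
        if stats.min_value_size is None or size < stats.min_value_size:
            stats.min_value_size = size
        if stats.max_value_size is None or size > stats.max_value_size:
            stats.max_value_size = size
    return stats
```

Because `merge` only needs the two summaries, a background aggregator could walk the tree bottom-up without rereading any data, which is the property the comment relies on.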
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995304#comment-13995304 ]

Jesse Yates commented on HBASE-7958:

[~apurtell] Originally this was going to be done to support Phoenix, so they could get a better idea of the distribution of keys to do better region chunking. In Phoenix we decided even-sized chunks would work well enough, so the impetus to get this done fizzled a bit.

Other reasons for the fizzle:
* do we just expose the metrics for things like otsdb? or just store them in HBase in our own format? both?
* do we need UI components?
* everyone wants all the metrics
* should this even be part of core, or instead just be done via a scan-time coprocessor?

Those considerations aside, I think it would still be great to do for core :) We can always add more stats once we have a way to handle all the rest. How to expose them is an open question (especially considering Phoenix will want to read these stats too) - maybe a pluggable sink, or a custom tag for metrics2 so people can use their own sinks?
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994463#comment-13994463 ]

Mikhail Antonov commented on HBASE-7958:

One note about stats w.r.t. the zk-less assignment work in HBASE-11059: perhaps stats on the number of rows and the size in bytes per region could be used for master-initiated splits/merges.
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994412#comment-13994412 ]

Andrew Purtell commented on HBASE-7958:
---

Thinking about reviving this issue. [~jesse_yates], could you comment on why this fizzled?
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648842#comment-13648842 ]

Jonathan Hsieh commented on HBASE-7958:
---

bumping from 0.95.1, read it if makes it in.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624358#comment-13624358 ]

Elliott Clark commented on HBASE-7958:
---

bq. Yeah, they are certainly pretty, but IMO pretty useless for anyone but the most novice user.

I think we've been pretty remiss to ignore the novice user. If those graphs could be the thing that gets users to use hbase, then they did their job.

Statistics per-column family per-region
---

Key: HBASE-7958
URL: https://issues.apache.org/jira/browse/HBASE-7958
Project: HBase
Issue Type: New Feature
Affects Versions: 0.95.2
Reporter: Jesse Yates
Assignee: Jesse Yates
Fix For: 0.95.1
Attachments: hbase-7958_rough-cut-v0.patch, hbase-7958-v0-parent.patch, hbase-7958-v0.patch
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603908#comment-13603908 ]

Jeff Whiting commented on HBASE-7958:
---

[~jesse_yates] I understand the concern and agree that we don't want to reinvent the wheel. Still, it seems like some basic stats would be extremely useful. For example, the region balancer could find the hottest regions (the ones with the most requests per second) and automatically balance them across different region servers. A region that is too hot could be split to reduce the number of requests it serves, rather than only splitting on size.

Systems like ganglia / opentsdb typically do really well at giving high-level stats at a server level. However, they would do poorly if they tried to have stats on every region (we have over 1000 regions and it would be a mess).

Finally, we could have some pretty graphs on the HMaster similar to Accumulo (see: http://i1-scripts.softpedia-static.com/screenshots/Apache-Accumulo_1.png?1341920105)

Statistics per-column family per-region
---

Key: HBASE-7958
URL: https://issues.apache.org/jira/browse/HBASE-7958
Project: HBase
Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
Assignee: Jesse Yates
Fix For: 0.96.0
Attachments: hbase-7958_rough-cut-v0.patch, hbase-7958-v0-parent.patch, hbase-7958-v0.patch
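
The hot-region idea raised in Jeff Whiting's comment can be illustrated with a small sketch. It is written in Python for brevity and does not reflect how the HBase balancer works; the function names, the requests-per-second threshold, and the naive "move the hottest region from the busiest server to the least busy one" policy are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def hottest_regions(rps_by_region: Dict[str, float], top_n: int) -> List[str]:
    """Region names ordered by requests per second, highest first."""
    return sorted(rps_by_region, key=lambda r: (-rps_by_region[r], r))[:top_n]


def regions_to_split(rps_by_region: Dict[str, float], max_rps: float) -> List[str]:
    """Regions whose request rate alone would justify a split."""
    return sorted(r for r, rps in rps_by_region.items() if rps > max_rps)


def server_load(assignment: Dict[str, str],
                rps_by_region: Dict[str, float]) -> Dict[str, float]:
    """Total requests per second per region server.

    Only servers that currently host at least one region appear in the result.
    """
    load: Dict[str, float] = defaultdict(float)
    for region, server in assignment.items():
        load[server] += rps_by_region.get(region, 0.0)
    return dict(load)


def suggest_move(assignment: Dict[str, str],
                 rps_by_region: Dict[str, float]) -> Optional[Tuple[str, str, str]]:
    """Return (region, from_server, to_server), or None if nothing to move."""
    load = server_load(assignment, rps_by_region)
    if len(load) < 2:
        return None
    servers = sorted(load)  # sorted for deterministic tie-breaking
    busiest = max(servers, key=lambda s: load[s])
    idlest = min(servers, key=lambda s: load[s])
    if busiest == idlest:
        return None
    candidates = sorted(r for r, s in assignment.items() if s == busiest)
    region = max(candidates, key=lambda r: rps_by_region.get(r, 0.0))
    return region, busiest, idlest
```

Moving the single hottest region can simply relocate the hotspot, so a real policy would likely weigh the move against the resulting load on the target server.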
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603933#comment-13603933 ]

Jesse Yates commented on HBASE-7958:

Here's the beauty of the recently posted approach - you could definitely write your own stats and then tag each table with them, and you're off to the races (rather than relying on things like JMX and tcollectors, in the case of otsdb).

{quote}
For example the region balancer could find the hottest regions (ones with the more requests per second) and automatically balance them across different region servers. A region could be split because it is too hot to reduce the number of requests rather than only splitting on size.
{quote}

That would be extremely cool (only a little pun intended).

I'd question how OTSDB performs at your scale though - it can collect a whole heck of a lot of stats, and since it stores them in a very cleanly distributed way in HBase, I would be surprised if it wasn't scaling. My concern is that we don't fill up HDFS with logging stats that are 2-3x the size of the actual data, something that wouldn't be too far-fetched. We just need to be careful to make sure we don't keep too much history.

bq. Finally we could have some pretty graphs on the HMaster similar to Accumulo

Yeah, they are certainly pretty, but IMO pretty useless for anyone but the most novice user.

Statistics per-column family per-region
---

Key: HBASE-7958
URL: https://issues.apache.org/jira/browse/HBASE-7958
Project: HBase
Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
Assignee: Jesse Yates
Fix For: 0.96.0
Attachments: hbase-7958_rough-cut-v0.patch, hbase-7958-v0-parent.patch, hbase-7958-v0.patch
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595132#comment-13595132 ]

Jesse Yates commented on HBASE-7958:

So it looks like there is a desire for a pretty large range of possible statistics. I'd rather we don't get bogged down in what specific statistics we want, but push more towards a design discussion around enabling people to capture these statistics. We know we want them; the question is how :)

Once we have the mechanisms in place to read/write a stats table for an individual stat, we can much more easily expand that support to stats at different tie-in places. The 'at compaction time histogram' seemed like an easy enough starting place for _one type of stat_, but that should not necessarily limit the possible stats that can be collected; it's an immediate use-case for a general statistics table.

Stepping back, it seems to me that we can have a basic set of statistics that you can enable for a table at creation time (or even turn on later too). We then also need a mechanism to let people add their own statistics easily (thinking a CP hook here). From there, we just need a mechanism to make it easy to access each statistic.

I don't think any of the above proposals really changes my proposed outline-patch, besides making it easy (easier?) to hook in custom stat implementations, adding a clean dynamic loading mechanism (from the various //TODOs for CP hooks), and adding a little more utility in the StatisticsTable class to make it easy to read a stat.

Sound reasonable?

Statistics per-column family per-region
---

Key: HBASE-7958
URL: https://issues.apache.org/jira/browse/HBASE-7958
Project: HBase
Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
Assignee: Jesse Yates
Fix For: 0.96.0
Attachments: hbase-7958_rough-cut-v0.patch
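
The three pieces outlined in Jesse Yates's comment (a basic set of statistics, a hook for adding custom ones, and an easy way to read each statistic back) can be sketched as below. This is a Python illustration under stated assumptions, not the attached patch: `Statistic`, `StatsStore`, and `compact_with_stats` are hypothetical names, and the store is an in-memory dict standing in for whatever table or file layout is eventually chosen.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Cell = Tuple[bytes, bytes]  # (row key, value); a deliberate simplification


class Statistic(ABC):
    """Hypothetical plug-in point: one instance per compaction run."""
    name: str

    @abstractmethod
    def update(self, cell: Cell) -> None: ...

    @abstractmethod
    def result(self) -> bytes: ...


class CellCount(Statistic):
    name = "cell_count"

    def __init__(self) -> None:
        self.count = 0

    def update(self, cell: Cell) -> None:
        self.count += 1

    def result(self) -> bytes:
        return str(self.count).encode()


class MaxValueSize(Statistic):
    name = "max_value_size"

    def __init__(self) -> None:
        self.max_size = 0

    def update(self, cell: Cell) -> None:
        self.max_size = max(self.max_size, len(cell[1]))

    def result(self) -> bytes:
        return str(self.max_size).encode()


class StatsStore:
    """In-memory stand-in for a stats table, keyed by table/region/CF/stat."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str, str, str], bytes] = {}

    def write(self, table: str, region: str, family: str, stat: str, value: bytes) -> None:
        self._rows[(table, region, family, stat)] = value

    def read(self, table: str, region: str, family: str, stat: str) -> Optional[bytes]:
        return self._rows.get((table, region, family, stat))


def compact_with_stats(cells: Iterable[Cell],
                       stat_factories: List[Callable[[], Statistic]],
                       store: StatsStore,
                       table: str, region: str, family: str) -> List[Cell]:
    """Pass cells through unchanged while every enabled statistic observes them."""
    stats = [factory() for factory in stat_factories]
    written: List[Cell] = []
    for cell in cells:
        for stat in stats:
            stat.update(cell)
        written.append(cell)
    for stat in stats:
        store.write(table, region, family, stat.name, stat.result())
    return written
```

Enabling a statistic for a table then amounts to adding its factory to that table's list, and a custom statistic is just another `Statistic` subclass.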
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593791#comment-13593791 ]

Jeff Whiting commented on HBASE-7958:
-------------------------------------

Some stats I would like to see are historic requests per region / CF and requests per second. In newer versions of HBase the number of ops done on a region is exposed in the web interface and JMX; however, it gives you very little context as to what it has been historically or what the current requests per second are. IMHO I would find those stats very useful.

Statistics per-column family per-region
---------------------------------------

Key: HBASE-7958
URL: https://issues.apache.org/jira/browse/HBASE-7958
Project: HBase
Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
Assignee: Jesse Yates
Fix For: 0.96.0
Attachments: hbase-7958_rough-cut-v0.patch

Originating from this discussion on the dev list: http://search-hadoop.com/m/coDKU1urovS/Simple+stastics+per+region/v=plain

Essentially, we should have built-in statistics gathering for HBase tables. This allows clients to have a better understanding of the distribution of keys within a table and a given region. We could also surface this information via the UI.

There are a couple of different proposals from the email; the overview is this: we add in something on compactions that gathers stats about the keys that are written and then we surface them to a table.

The possible proposals include:

*How to implement it?*
# Coprocessors -
** advantage - it easily plugs in and people could pretty easily add their own statistics.
** disadvantage - UI elements would also require this, we get into dependent loading, which leads down the OSGi path. Also, these CPs need to be installed _after_ all the other CPs on compaction to ensure they see exactly what gets written (doable, but a pain)
# Built into HBase as a custom scanner
** advantage - always goes in the right place and no need to muck about with loading CPs etc.
** disadvantage - less pluggable, at least for the initial cut

*Where do we store data?*
# .META.
** advantage - it's an existing table, so we can jam it into another CF there
** disadvantage - this would make META much larger, possibly leading to splits AND will make it much harder for other processes to read the info
# A new stats table
** advantage - cleanly separates out the information from META
** disadvantage - should use a 'system table' idea to prevent accidental deletion, manipulation by arbitrary clients, but still allow clients to read it.

Once we have this framework, we can then move to an actual implementation of various statistics.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
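For illustration only, the "requests per second" figure asked for in this comment can be derived from periodic snapshots of a cumulative per-region request counter. The sketch below is not HBase code; the snapshot format, the function name `request_rates`, and the handling of counter resets are all assumptions.

```python
from typing import List, Tuple


def request_rates(samples: List[Tuple[float, int]]) -> List[Tuple[float, float]]:
    """Turn (unix_seconds, cumulative_request_count) snapshots, ordered by
    time, into (interval_end_time, requests_per_second) pairs."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # ignore duplicate or out-of-order snapshots
        delta = c1 - c0
        if delta < 0:
            # Assumption: a smaller count means the counter restarted from
            # zero (for example after the region was reopened).
            delta = c1
        rates.append((t1, delta / (t1 - t0)))
    return rates


if __name__ == "__main__":
    snapshots = [(0.0, 100), (10.0, 200), (20.0, 50)]
    for end_time, rate in request_rates(snapshots):
        print(end_time, rate)
```

Keeping the resulting pairs, rather than only the latest counter value, is what would give the historical context the comment asks for.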
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594307#comment-13594307 ]

Andrew Purtell commented on HBASE-7958:
---------------------------------------

I'd like to see a histogram of operations taken on the region, for subsequent autotuning for read-mostly, mixed, or write-mostly workloads.
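As a rough sketch of how a per-region histogram of operations could feed the read-mostly / mixed / write-mostly decision mentioned in this comment: the operation names, the grouping of increment and append with writes, and the 80% / 20% thresholds below are illustrative assumptions, not values proposed in the issue.

```python
from collections import Counter

# Assumption: how operation types are grouped into reads and writes.
READ_OPS = {"get", "scan"}
WRITE_OPS = {"put", "delete", "increment", "append"}


def classify_workload(op_counts, read_mostly_at=0.8, write_mostly_at=0.2):
    """Classify a region from a histogram mapping operation name -> count."""
    reads = sum(n for op, n in op_counts.items() if op in READ_OPS)
    writes = sum(n for op, n in op_counts.items() if op in WRITE_OPS)
    total = reads + writes
    if total == 0:
        return "idle"
    read_fraction = reads / total
    if read_fraction >= read_mostly_at:
        return "read-mostly"
    if read_fraction <= write_mostly_at:
        return "write-mostly"
    return "mixed"


if __name__ == "__main__":
    print(classify_workload(Counter(get=90, put=10)))
    print(classify_workload(Counter(get=50, put=50)))
    print(classify_workload(Counter(scan=5, put=95)))
```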
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589869#comment-13589869 ]

Jesse Yates commented on HBASE-7958:
------------------------------------

Good call Todd. The original intention was to enable histograms over the keyvalues in a region. They are pretty simple to implement and get people really far, for many cases. The histograms support things like determining the parallelization of a scan within a region (should I use 1, 5, or 100 threads to scan this region?) as well as key/value cardinality (helpful for using non-covered indexes).

Hopefully not getting too far into the implementation details, we could easily use a compound key structure in the stats table to support a large variety of stats going forward, which adds almost no complication to the initial histogram case.
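To make the scan-parallelization use case concrete, here is a minimal sketch assuming an equi-depth histogram: while keys stream past in sorted order (as they would during a compaction), the first key of every bucket of `keys_per_bucket` keys is recorded, and a client later groups those buckets into at most `max_threads` scan ranges of similar size. The function names and the string keys are hypothetical; nothing here is taken from the attached patches.

```python
def bucket_start_keys(sorted_keys, keys_per_bucket):
    """One pass over keys in sorted order; returns the first key of every
    bucket after the first, i.e. the equi-depth bucket boundaries."""
    starts = []
    for i, key in enumerate(sorted_keys):
        if i > 0 and i % keys_per_bucket == 0:
            starts.append(key)
    return starts


def scan_ranges(region_start, region_stop, starts, max_threads):
    """Split [region_start, region_stop) into at most max_threads ranges,
    each covering a similar number of histogram buckets."""
    n_buckets = len(starts) + 1
    n_ranges = max(1, min(max_threads, n_buckets))
    # Pick every (n_buckets / n_ranges)-th boundary as a split point.
    splits = [starts[(i * n_buckets) // n_ranges - 1] for i in range(1, n_ranges)]
    edges = [region_start] + splits + [region_stop]
    return list(zip(edges, edges[1:]))


if __name__ == "__main__":
    keys = ["k%02d" % i for i in range(10)]
    starts = bucket_start_keys(keys, 3)
    for start, stop in scan_ranges("k00", "k10", starts, 2):
        print(start, stop)
```

A region with few buckets naturally yields few ranges, so the same call covers the "1, 5, or 100 threads" question by capping the thread count at what the histogram can justify.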
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589061#comment-13589061 ]

Jesse Yates commented on HBASE-7958:
------------------------------------

The notable piece here is that we can put this 'stats scanner' in as we write the compaction. It sees every key that gets written, so it can build perfect stats as of the compaction. The only limitation then is that it's not 'per region', but 'per-CF per region'; it would be a much larger overhaul to combine all the information across all the compactions in the region - doable, but probably not worth it, as you actually lose information and only gain a little savings in terms of math effort when calculating per-region, cross-CF key distributions.

(For calculating key distributions, you basically sum the area under the histogram across the columns to determine the 'volume' of keys in the given range.)
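The "sum the area under the histogram across the columns" step can be sketched as follows. This assumes each column family's histogram is a list of `(low, high, count)` buckets over a numeric key space and that keys are spread uniformly inside a bucket; both are simplifications chosen for the example, and the function names are hypothetical.

```python
def keys_in_range(buckets, lo, hi):
    """Estimate how many keys of one CF fall in [lo, hi), given buckets of
    (bucket_low, bucket_high, key_count). Partially covered buckets are
    prorated, assuming keys are uniform within a bucket."""
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo)
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo)
    return total


def region_keys_in_range(per_cf_histograms, lo, hi):
    """Sum the per-CF estimates to get the region-wide key volume."""
    return sum(keys_in_range(h, lo, hi) for h in per_cf_histograms.values())


if __name__ == "__main__":
    histograms = {
        "cf1": [(0, 10, 100), (10, 20, 50)],
        "cf2": [(0, 20, 40)],
    }
    print(region_keys_in_range(histograms, 5, 15))
```

Because the per-CF histograms are kept separate and only combined at query time, no information is lost, which matches the trade-off described in the comment.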
[jira] [Commented] (HBASE-7958) Statistics per-column family per-region
[ https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589117#comment-13589117 ]

Todd Lipcon commented on HBASE-7958:
------------------------------------

Before we get too much into the detail, can we clarify what kind of statistics we're interested in collecting in the first place? There are a bunch of different things we could collect; maybe it's good to enumerate some of them and list some of their potential applications before we get into the details of how they're implemented.

Here are a few of the places where I've considered adding statistics in the past -- though they fall into different buckets which not everyone might consider statistics :) :

- *Block heat* -- keep a reservoir sample of which rows in memstore have been read recently. When we flush the file, create a bitmap based on the sample, mapping each flushed HFile block to its heat. These heat maps could be re-generated periodically based on block cache contents after the file is flushed. (Something like 2 bits per HFile block would mean that the heat map for even a very large region could be re-written to disk in only a few MB.) *Use case*: when we move a region to another server, it can more effectively pre-warm its cache.
- *Row key distribution* -- this seems to be the thing that people are talking about here mostly. Useful for calculating better split points for MR or region splits.
- *Row key cardinality* - useful for join ordering in SQL engines with optimizers
- *Column qualifier and cell value cardinality* - useful for join ordering as well as potentially automatic dictionary-coding?

There are bunches of others that could be brainstormed up... so my main point is: what do we mean by stats? How should we build this so that it's extensible and usable for future stats as well as whatever first one we want to implement?
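The two building blocks named in the *Block heat* item can be sketched in a few lines: a reservoir sample (here the classic Algorithm R) over a stream of recently read rows, and a heat map that stores one 2-bit heat level per HFile block, four blocks per byte. How read rows would be mapped to blocks and how heat levels 0..3 would be chosen is not specified in the comment, so the sketch leaves that out; all function names are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Algorithm R: keep a uniform random sample of k items from a stream
    whose length is not known in advance."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def pack_heat(levels):
    """Pack per-block heat levels (each 0..3) into 2 bits per block."""
    packed = bytearray((len(levels) + 3) // 4)
    for block, level in enumerate(levels):
        if not 0 <= level <= 3:
            raise ValueError("heat level must fit in 2 bits")
        packed[block // 4] |= level << (2 * (block % 4))
    return bytes(packed)


def heat_of(packed, block):
    """Read back the 2-bit heat level of one block."""
    return (packed[block // 4] >> (2 * (block % 4))) & 0b11


if __name__ == "__main__":
    recent_rows = reservoir_sample(("row-%d" % i for i in range(1000)), 10)
    print(len(recent_rows))
    heat_map = pack_heat([0, 3, 1, 2, 3])
    print([heat_of(heat_map, b) for b in range(5)])
```

At 2 bits per block, one million blocks pack into 250,000 bytes, which is consistent with the "only a few MB" estimate for a very large region.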