[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2014-05-14 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997320#comment-13997320
 ] 

Andrew Purtell commented on HBASE-7958:
---

To make the statistics useful, we could introduce CF-level APIs for 
coprocessors, an export to metrics2 handled by the regionserver, and a cluster 
summary API handled by the master. The infrastructure for that API could also 
be used to generate output or graphs on the master UI. 

 Statistics per-column family per-region
 ---

 Key: HBASE-7958
 URL: https://issues.apache.org/jira/browse/HBASE-7958
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.95.2
Reporter: Jesse Yates
Assignee: Jesse Yates
 Attachments: hbase-7958-v0-parent.patch, hbase-7958-v0.patch, 
 hbase-7958_rough-cut-v0.patch


 Originating from this discussion on the dev list: 
 http://search-hadoop.com/m/coDKU1urovS/Simple+stastics+per+region/v=plain
 Essentially, we should have built-in statistics gathering for HBase tables. 
 This allows clients to have a better understanding of the distribution of 
 keys within a table and a given region. We could also surface this 
 information via the UI.
 There are a couple of different proposals from the email. The overview is this:
 we add something to compactions that gathers stats about the keys that are 
 written, and then we surface them in a table.
 The possible proposals include:
 *How to implement it?*
 # Coprocessors - 
 ** advantage - it easily plugs in and people could pretty easily add their 
 own statistics. 
 ** disadvantage - UI elements would also require this, we get into dependent 
 loading, which leads down the OSGi path. Also, these CPs need to be installed 
 _after_ all the other CPs on compaction to ensure they see exactly what gets 
 written (doable, but a pain)
 # Built into HBase as a custom scanner
 ** advantage - always goes in the right place and no need to muck about with 
 loading CPs etc.
 ** disadvantage - less pluggable, at least for the initial cut
 *Where do we store data?*
 # .META.
 ** advantage - it's an existing table, so we can jam it into another CF there
 ** disadvantage - this would make META much larger, possibly leading to 
 splits AND will make it much harder for other processes to read the info
 # A new stats table
 ** advantage - cleanly separates out the information from META
 ** disadvantage - should use a 'system table' idea to prevent accidental 
 deletion, manipulation by arbitrary clients, but still allow clients to read 
 it.
 Once we have this framework, we can then move to an actual implementation of 
 various statistics.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2014-05-14 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997314#comment-13997314
 ] 

Andrew Purtell commented on HBASE-7958:
---

Thanks for the summary [~jesse_yates]. 

The description for this issue is 'statistics per-column family per-region'. In 
that scope, maintaining a system table for statistics gathering is unnecessary; 
we can use region-local storage. Perhaps during compactions we could calculate 
the basic things people seem to want: row count, row key cardinality, 
min/max/avg size per value, and total value size, per CF and per region. Column 
qualifier cardinality also seems like it might be useful. Perhaps we could 
maintain a tree of statistics files: at the HFile level, at the CF level, at 
the table level, and at the namespace level. Compactions would record into the 
resulting HFiles the statistics metadata calculated during processing. A 
background process running in the master could aggregate while following the 
tree, swapping updated results for older results at every level when ready. We 
should be able to handle point-in-time counts and simple statistical properties 
this way. It could be possible to use a system statistics table instead of 
files, but why have regionservers exchange RPCs if not necessary? (Updating a 
table inline with compaction or split handling also brings back unfond memories 
of something we once called the 'region historian'.)
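
A minimal sketch of the idea above, in Python rather than HBase's Java, to show 
why these particular statistics roll up a tree cheaply: each one can be merged 
from child summaries without rereading the data. Everything here is an 
assumption for illustration: `ValueStats`, `collect`, and `merge` are 
hypothetical names, not HBase APIs, and cells are modeled as `(row_key, value)` 
pairs arriving in row-key order, as a compaction would write them.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class ValueStats:
    """Mergeable summary for one node of the statistics tree (hypothetical)."""
    row_count: int = 0
    value_count: int = 0
    total_value_size: int = 0
    min_value_size: Optional[int] = None
    max_value_size: Optional[int] = None

    @property
    def avg_value_size(self) -> float:
        # Derived from the two sums, so it never needs to be stored or merged.
        return self.total_value_size / self.value_count if self.value_count else 0.0

    def merge(self, other: "ValueStats") -> "ValueStats":
        """Combine two child summaries into their parent's summary.

        Row counts are only additive when the children cover disjoint row
        ranges (for example, different regions of the same table).
        """
        mins = [s for s in (self.min_value_size, other.min_value_size) if s is not None]
        maxs = [s for s in (self.max_value_size, other.max_value_size) if s is not None]
        return ValueStats(
            row_count=self.row_count + other.row_count,
            value_count=self.value_count + other.value_count,
            total_value_size=self.total_value_size + other.total_value_size,
            min_value_size=min(mins) if mins else None,
            max_value_size=max(maxs) if maxs else None,
        )


def collect(cells: Iterable[Tuple[bytes, bytes]]) -> ValueStats:
    """Summarize (row_key, value) cells seen in sorted row-key order."""
    stats = ValueStats()
    previous_row = None
    for row_key, value in cells:
        if row_key != previous_row:
            # Sorted input means a new key is a new row.
            stats.row_count += 1
            previous_row = row_key
        size = len(value)
        stats.value_count += 1
        stats.total_value_size += size
        if stats.min_value_size is None or size < stats.min_value_size:
            stats.min_value_size = size
        if stats.max_value_size is None or size > stats.max_value_size:
            stats.max_value_size = size
    return stats
```

Under these assumptions a table-level summary is just the fold of its 
region-level summaries, e.g. `region_a.merge(region_b)`. Cardinality of column 
qualifiers is not covered; it would need a mergeable distinct-count structure.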



[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2014-05-12 Thread Jesse Yates (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995304#comment-13995304
 ] 

Jesse Yates commented on HBASE-7958:


[~apurtell] originally this was going to be done to support Phoenix, so they 
could get a better idea of the distribution of keys to do better region 
chunking. In Phoenix we decided even-sized chunks would work well enough, so 
the impetus to get this done fizzled a bit.

Other reasons for fizzle:
 * do we just expose the metric for things like otsdb? or just store it in 
HBase in our own format? both?
 * do we need UI components?
 * everyone wants all the metrics
 * should this even be part of core, or instead just be done via a scan-time 
coprocessor?

Those considerations aside, I think it would still be great to do for 
core :)  We can always add more stats once we have a way to handle all the rest.

How to expose them is an open question (especially considering phoenix will 
want to read these stats too) - maybe a pluggable sink/use a custom tag for 
metrics2 so people can use their own sinks?



[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2014-05-11 Thread Mikhail Antonov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994463#comment-13994463
 ] 

Mikhail Antonov commented on HBASE-7958:


One note about stats w.r.t. the zk-less assignment work in HBASE-11059: 
perhaps stats on the number of rows and the size in bytes per region could be 
used for master-initiated splits/merges.



[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2014-05-10 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994412#comment-13994412
 ] 

Andrew Purtell commented on HBASE-7958:
---

Thinking about reviving this issue. [~jesse_yates], could you comment on why 
this fizzled?



[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2013-05-03 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648842#comment-13648842
 ] 

Jonathan Hsieh commented on HBASE-7958:
---

Bumping from 0.95.1; re-add it if it makes it in.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2013-04-06 Thread Elliott Clark (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624358#comment-13624358
 ] 

Elliott Clark commented on HBASE-7958:
--

bq. Yeah, they are certainly pretty, but IMO pretty useless for anyone but the 
most novice user.
I think we've been pretty remiss to ignore the novice user.  If those graphs 
could be the thing that gets users to use HBase, then they did their job.

 Statistics per-column family per-region
 ---

 Key: HBASE-7958
 URL: https://issues.apache.org/jira/browse/HBASE-7958
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.95.2
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.95.1

 Attachments: hbase-7958_rough-cut-v0.patch, 
 hbase-7958-v0-parent.patch, hbase-7958-v0.patch




[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2013-03-15 Thread Jeff Whiting (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603908#comment-13603908
 ] 

Jeff Whiting commented on HBASE-7958:
-

[~jesse_yates] I understand the concern and agree that we don't want to 
reinvent the wheel.

Still, it seems like some basic stats would be extremely useful.  For example, 
the region balancer could find the hottest regions (the ones with the most 
requests per second) and automatically balance them across different region 
servers.  A region could also be split because it is too hot, to reduce the 
number of requests it serves, rather than only splitting on size.
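
As a rough illustration of the balancing idea, here is a sketch in Python. It 
is not HBase's balancer: the function names, the greedy placement strategy, and 
the thresholds are all assumptions made for the example. It places regions from 
hottest to coolest on whichever server currently carries the least request 
load, and treats "too hot" as a second split trigger alongside size.

```python
import heapq
from typing import Dict, List


def spread_hot_regions(requests_per_second: Dict[str, float],
                       servers: List[str]) -> Dict[str, List[str]]:
    """Greedily assign regions, hottest first, to the least-loaded server."""
    # Min-heap of (current request load, server name).
    heap = [(0.0, server) for server in servers]
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {server: [] for server in servers}
    hottest_first = sorted(requests_per_second,
                           key=requests_per_second.get, reverse=True)
    for region in hottest_first:
        load, server = heapq.heappop(heap)
        assignment[server].append(region)
        heapq.heappush(heap, (load + requests_per_second[region], server))
    return assignment


def should_split(size_bytes: int, requests_per_second: float,
                 max_size_bytes: int, max_requests_per_second: float) -> bool:
    """Split when a region is too large or too hot (hypothetical thresholds)."""
    return (size_bytes > max_size_bytes
            or requests_per_second > max_requests_per_second)
```

With rates `{"a": 100, "b": 60, "c": 50, "d": 10}` and two servers, the sketch 
pairs the hottest region with the coolest one, so each server ends up serving 
110 requests per second.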

Systems like Ganglia / OpenTSDB typically do really well at giving high-level 
stats at a server level.  However, they would do poorly if they tried to keep 
stats on every region (we have over 1000 regions and it would be a mess).

Finally we could have some pretty graphs on the HMaster similar to Accumulo 
(see: 
http://i1-scripts.softpedia-static.com/screenshots/Apache-Accumulo_1.png?1341920105)


 Statistics per-column family per-region
 ---

 Key: HBASE-7958
 URL: https://issues.apache.org/jira/browse/HBASE-7958
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: hbase-7958_rough-cut-v0.patch, 
 hbase-7958-v0-parent.patch, hbase-7958-v0.patch




[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2013-03-15 Thread Jesse Yates (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603933#comment-13603933
 ] 

Jesse Yates commented on HBASE-7958:


Here's the beauty of the recently posted approach - you could definitely write 
your own stats and then tag each table with it and you're off to the races 
(rather than relying on things like JMX and tcollectors, in the case of otsdb).

{quote}
For example the region balancer could find the hottest regions (ones with the 
more requests per second) and automatically balance them across different 
region servers. A region could be split because it is too hot to reduce the 
number of requests rather than only splitting on size.
{quote}

that would be extremely cool (only a little pun intended).

I'd question how OTSDB performs on your scale though - it can collect a whole 
heck of a lot of stats and since it stores them in a very cleanly distributed 
way in HBase, I would be surprised if it wasn't scaling.

My concern is that we don't fill up HDFS with logging stats that are 2-3x what 
the actual data sizes are, something that wouldn't be too far-fetched. We just 
need to be careful to make sure we don't keep too much history.
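
One simple way to bound history is to cap the number of samples kept per 
region, so storage grows with the number of regions rather than with time. The 
sketch below is in Python and purely illustrative; `BoundedStatsHistory` and 
its methods are hypothetical names, and an in-memory deque stands in for 
whatever storage would really be used.

```python
from collections import defaultdict, deque
from typing import Dict, Deque, List, Tuple


class BoundedStatsHistory:
    """Keeps only the most recent max_samples observations per region."""

    def __init__(self, max_samples: int) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._history: Dict[str, Deque[Tuple[float, float]]] = defaultdict(
            lambda: deque(maxlen=max_samples))

    def record(self, region: str, timestamp: float, value: float) -> None:
        self._history[region].append((timestamp, value))

    def history(self, region: str) -> List[Tuple[float, float]]:
        # .get avoids creating an empty entry for unknown regions.
        return list(self._history.get(region, ()))
```

A time-based cutoff (for example, a TTL on the stored cells) would serve the 
same goal; the count-based cap is just the shortest thing to show.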

bq. Finally we could have some pretty graphs on the HMaster similar to Accumulo

Yeah, they are certainly pretty, but IMO pretty useless for anyone but the most 
novice user.

 Statistics per-column family per-region
 ---

 Key: HBASE-7958
 URL: https://issues.apache.org/jira/browse/HBASE-7958
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: hbase-7958_rough-cut-v0.patch, 
 hbase-7958-v0-parent.patch, hbase-7958-v0.patch




[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2013-03-06 Thread Jesse Yates (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595132#comment-13595132
 ] 

Jesse Yates commented on HBASE-7958:


So it looks like there is a desire for a pretty large range of possible 
statistics. I'd rather we don't get bogged down in what specific statistics we 
want, but push more towards a design discussion around enabling people to 
capture these statistics. We know we want them, the question is how :)

Once we have the mechanisms in place to read/write a stats table for an 
individual stat, we can much more easily expand that to support stats at 
different tie-in places. The 'at compaction time histogram' seemed like an easy 
enough starting place for _one type of stat_, but that should not necessarily 
limit the possible stats that can be collected; it's an immediate use-case for 
a general statistics table.

Stepping back, it seems to me that we can have a basic set of statistics that 
you can enable for a table at creation time (or even turn on later). We 
then also need a mechanism to let people add their own statistics easily 
(thinking a CP hook here). From there, we just need to have a mechanism to 
make it easy to access each statistic.
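
To make the three pieces concrete (built-in stats, a hook for custom stats, and 
easy access), here is a small Python sketch. It is not the design in the 
attached patch: `StatisticsRegistry` and its methods are hypothetical, cells 
are modeled as `(row_key, value)` pairs, and a plain callable stands in for 
what would be a coprocessor hook in HBase.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Cell = Tuple[bytes, bytes]                  # (row_key, value), an assumed model
Statistic = Callable[[List[Cell]], object]  # computes one named statistic


class StatisticsRegistry:
    """Named statistics that can be enabled per table (illustrative only)."""

    def __init__(self) -> None:
        self._stats: Dict[str, Statistic] = {}
        self._enabled: Dict[str, List[str]] = {}

    def register(self, name: str, statistic: Statistic) -> None:
        """Add a built-in or user-supplied statistic under a name."""
        self._stats[name] = statistic

    def enable(self, table: str, names: Iterable[str]) -> None:
        """Turn statistics on for a table, at creation time or later."""
        for name in names:
            if name not in self._stats:
                raise KeyError(f"unknown statistic: {name}")
            enabled = self._enabled.setdefault(table, [])
            if name not in enabled:
                enabled.append(name)

    def compute(self, table: str, cells: Iterable[Cell]) -> Dict[str, object]:
        """Run every statistic enabled for the table over the given cells."""
        materialized = list(cells)
        return {name: self._stats[name](materialized)
                for name in self._enabled.get(table, [])}
```

A custom statistic is then one `register` call away, for example 
`registry.register("max_value_size", lambda cells: max(len(v) for _, v in cells))`.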

I don't think any of the above proposals really changes my proposed 
outline-patch besides making it easy (easier?) to hook in custom stat 
implementations, adding a clean dynamic loading mechanism (from the various 
//TODOs for CP hooks), and adding a little more utility in the StatisticsTable 
class to make it easy to read a stat.

Sound reasonable?

 Statistics per-column family per-region
 ---

 Key: HBASE-7958
 URL: https://issues.apache.org/jira/browse/HBASE-7958
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: hbase-7958_rough-cut-v0.patch




[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2013-03-05 Thread Jeff Whiting (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593791#comment-13593791
 ] 

Jeff Whiting commented on HBASE-7958:
-

Some stats I would like to see are historic requests per region / CF and 
requests per second. 

In newer versions of HBase the number of ops done on a region is exposed in 
the web interface and JMX; however, it gives you very little context as to 
what it has been historically or what the current requests per second are.

IMHO those stats would be very useful.
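
To illustrate the request-rate stat described above, here is a minimal sketch. It assumes the region only exposes a cumulative operation count and that something samples it periodically. The names RequestRate, Sample, record, currentPerSecond and historicPerSecond are hypothetical and are not HBase APIs.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: derives requests/sec from samples of a cumulative op counter. */
final class RequestRate {
    /** One observation of the cumulative counter at a point in time. */
    static final class Sample {
        final long timestampMillis;
        final long totalOps;

        Sample(long timestampMillis, long totalOps) {
            this.timestampMillis = timestampMillis;
            this.totalOps = totalOps;
        }
    }

    private final Deque<Sample> history = new ArrayDeque<>();
    private final int maxSamples;

    RequestRate(int maxSamples) {
        this.maxSamples = maxSamples;
    }

    /** Records a sample, dropping the oldest one once the history is full. */
    void record(long timestampMillis, long totalOps) {
        if (history.size() == maxSamples) {
            history.removeFirst();
        }
        history.addLast(new Sample(timestampMillis, totalOps));
    }

    /** Rate between the two most recent samples, or 0 if fewer than two exist. */
    double currentPerSecond() {
        if (history.size() < 2) {
            return 0.0;
        }
        Sample[] s = history.toArray(new Sample[0]);
        return perSecond(s[s.length - 2], s[s.length - 1]);
    }

    /** Average rate over the whole retained history. */
    double historicPerSecond() {
        if (history.size() < 2) {
            return 0.0;
        }
        return perSecond(history.peekFirst(), history.peekLast());
    }

    static double perSecond(Sample older, Sample newer) {
        long dtMillis = newer.timestampMillis - older.timestampMillis;
        long dOps = newer.totalOps - older.totalOps;
        if (dtMillis <= 0 || dOps < 0) {
            return 0.0; // clock skew or counter reset: no meaningful rate
        }
        return dOps * 1000.0 / dtMillis;
    }
}
```

Keeping a bounded history is what supplies the "historic" context: the same samples answer both "what is the rate now" and "what has it been over the retained window".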

 Statistics per-column family per-region
 ---

 Key: HBASE-7958
 URL: https://issues.apache.org/jira/browse/HBASE-7958
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: hbase-7958_rough-cut-v0.patch





[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2013-03-05 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13594307#comment-13594307
 ] 

Andrew Purtell commented on HBASE-7958:
---

I'd like to see a histogram of operations taken on the region, for subsequent 
autotuning for read-mostly, mixed, or write-mostly workloads. 
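
As a rough sketch of how such a histogram might feed autotuning: the example below collapses the operation histogram into two buckets (reads and writes) and labels the workload. The class name WorkloadClassifier and the skew threshold are assumptions made for illustration; nothing here is an existing HBase API.

```java
/** Hypothetical classifier over per-region read/write operation counts. */
final class WorkloadClassifier {
    enum Workload { READ_MOSTLY, MIXED, WRITE_MOSTLY, UNKNOWN }

    /**
     * @param reads  read operations observed (for example gets and scans)
     * @param writes write operations observed (for example puts and deletes)
     * @param skew   fraction of all ops, between 0.5 and 1.0, that one kind
     *               must reach before the workload counts as "mostly" that kind
     */
    static Workload classify(long reads, long writes, double skew) {
        long total = reads + writes;
        if (total == 0) {
            return Workload.UNKNOWN; // nothing observed yet
        }
        double readFraction = (double) reads / total;
        if (readFraction >= skew) {
            return Workload.READ_MOSTLY;
        }
        if (1.0 - readFraction >= skew) {
            return Workload.WRITE_MOSTLY;
        }
        return Workload.MIXED;
    }
}
```

For example, with a skew of 0.8, 900 reads against 100 writes would be labelled READ_MOSTLY, and an even split would be MIXED.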

 Statistics per-column family per-region
 ---

 Key: HBASE-7958
 URL: https://issues.apache.org/jira/browse/HBASE-7958
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
Assignee: Jesse Yates
 Fix For: 0.96.0

 Attachments: hbase-7958_rough-cut-v0.patch





[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2013-02-28 Thread Jesse Yates (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589869#comment-13589869
 ] 

Jesse Yates commented on HBASE-7958:


Good call Todd.

The original intention was to enable histograms over the keyvalues in a region. 
They are pretty simple to implement and get people really far in many cases.

The histograms support things like determining parallelization of scan within a 
region (should I use 1, 5, or 100 threads to scan this region) as well as 
key/value cardinality (helpful for using non-covered indexes). 

Hopefully without getting too far into the implementation details: we could 
easily use a compound key structure in the stats table to support a large 
variety of stats going forward, which adds almost no complication to the 
initial histogram case. 
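
One possible shape for such a compound key, as a minimal sketch only: the layout (table, region, column family, stat type, joined by a NUL byte) and the class name StatKey are assumptions for illustration, not the layout used by the patch. It also assumes no component contains a NUL byte, which real region names may not guarantee.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical row-key builder for a stats table. */
final class StatKey {
    private static final String SEP = "\0"; // assumed never to appear in a component

    /** Full row key: table, region, column family, stat type. */
    static byte[] of(String table, String region, String family, String statType) {
        return String.join(SEP, table, region, family, statType)
                .getBytes(StandardCharsets.UTF_8);
    }

    /** Prefix covering every stat stored for one column family of one region. */
    static byte[] familyPrefix(String table, String region, String family) {
        return (String.join(SEP, table, region, family) + SEP)
                .getBytes(StandardCharsets.UTF_8);
    }

    /** True if the key starts with the given prefix. */
    static boolean hasPrefix(byte[] key, byte[] prefix) {
        return key.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(key, prefix.length), prefix);
    }
}
```

Putting the stat type last means a new kind of stat is just a new key suffix, and a prefix scan over table, region and CF returns the histogram alongside whatever else gets added later.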

 Statistics per-column family per-region
 ---

 Key: HBASE-7958
 URL: https://issues.apache.org/jira/browse/HBASE-7958
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
 Fix For: 0.96.0





[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2013-02-27 Thread Jesse Yates (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589061#comment-13589061
 ] 

Jesse Yates commented on HBASE-7958:


The notable piece here is that we can put this 'stats scanner' in as we write 
the compaction. It sees every key that gets written, so it can build perfect 
stats as of the compaction. The only limitation then is that it's not 'per 
region', but 'per-CF per region'; it would be a much larger overhaul to combine 
all the information across all the compactions in the region. That is doable, 
but probably not worth it, as you actually lose information and only save a 
little math effort when calculating per-region, cross-CF key distributions.

(For calculating key distributions, you basically sum the area under the 
histogram across the columns to determine the 'volume' of keys in the given 
range)
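
The summation in the parenthetical above can be sketched as follows. This is an illustration under assumptions: every column family's histogram is taken to share the same bucket boundaries, each bucket holds a key count, and the class name KeyVolume is hypothetical.

```java
import java.util.Map;

/** Hypothetical helper: sums per-CF histogram counts to get a per-region key volume. */
final class KeyVolume {
    /**
     * @param perFamilyCounts column family name mapped to its bucket counts;
     *                        all families are assumed to share bucket boundaries
     * @param fromBucket      first bucket index, inclusive
     * @param toBucket        last bucket index, exclusive
     * @return total number of keys in the bucket range across all families
     */
    static long inRange(Map<String, long[]> perFamilyCounts, int fromBucket, int toBucket) {
        long total = 0;
        for (long[] counts : perFamilyCounts.values()) {
            int end = Math.min(toBucket, counts.length);
            for (int i = Math.max(0, fromBucket); i < end; i++) {
                total += counts[i]; // area under this family's histogram
            }
        }
        return total;
    }
}
```

Because the per-CF histograms are kept separately, the cross-CF total is only a sum at read time, while the per-CF detail stays available.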

 Statistics per-column family per-region
 ---

 Key: HBASE-7958
 URL: https://issues.apache.org/jira/browse/HBASE-7958
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
 Fix For: 0.96.0





[jira] [Commented] (HBASE-7958) Statistics per-column family per-region

2013-02-27 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589117#comment-13589117
 ] 

Todd Lipcon commented on HBASE-7958:


Before we get too much into the detail, can we clarify what kind of statistics 
we're interested in collecting in the first place? There are a bunch of 
different things we could collect, maybe it's good to enumerate some of them 
and list some of the potential applications of them before we get into the 
details of how they're implemented.

Here are a few of the places where I've considered adding statistics in the 
past -- though they fall into different buckets which not everyone might 
consider statistics :) :

- *Block heat* -- keep a reservoir sample of which rows in memstore have been 
read recently. When we flush the file, create a bitmap based on the sample 
mapping each flushed HFile block to its heat. These heat maps could be 
re-generated periodically based on block cache contents after the file is 
flushed. (something like 2 bits per HFile block would mean that the heat map 
for even a very large region could be re-written to disk in only a few MB). 
*Use case*: when we move a region to another server, it can more effectively 
pre-warm its cache. 
- *Row key distribution* -- this seems to be the thing that people are talking 
about here mostly. Useful for calculating better split points for MR or region 
splits.
- *Row key cardinality* - useful for join ordering in SQL engines with 
optimizers
- *Column qualifier and cell value cardinality* - useful for join ordering as 
well as potentially automatic dictionary-coding?

There are bunches of others that could be brainstormed up... so my main point 
is: what do we mean by stats? How should we build this so that it's extensible 
and usable for future stats as well as whatever first one we want to implement?
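
To make the "2 bits per HFile block" idea from the block heat item concrete, here is a minimal sketch of a packed heat map. The class name BlockHeatMap and the four heat levels (0 to 3) are assumptions; how heat would actually be derived from the reservoir sample or block cache is not shown.

```java
/** Hypothetical packed heat map: 2 bits of heat (0..3) per HFile block. */
final class BlockHeatMap {
    private final byte[] bits;
    private final int blockCount;

    BlockHeatMap(int blockCount) {
        this.blockCount = blockCount;
        this.bits = new byte[(blockCount + 3) / 4]; // 4 blocks per byte
    }

    /** Stores a heat level between 0 (cold) and 3 (hot) for the given block. */
    void set(int block, int heat) {
        if (heat < 0 || heat > 3) {
            throw new IllegalArgumentException("heat must be in 0..3");
        }
        checkIndex(block);
        int idx = block / 4;
        int shift = (block % 4) * 2;
        // Clear the block's two bits, then write the new value.
        bits[idx] = (byte) ((bits[idx] & ~(0b11 << shift)) | (heat << shift));
    }

    /** Returns the heat level stored for the given block. */
    int get(int block) {
        checkIndex(block);
        return (bits[block / 4] >> ((block % 4) * 2)) & 0b11;
    }

    /** Size of the serialized map in bytes. */
    int sizeInBytes() {
        return bits.length;
    }

    private void checkIndex(int block) {
        if (block < 0 || block >= blockCount) {
            throw new IndexOutOfBoundsException("block " + block);
        }
    }
}
```

At this density a map covering 1,000,000 blocks occupies 250,000 bytes, which is consistent with the size estimate in the block heat item.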

 Statistics per-column family per-region
 ---

 Key: HBASE-7958
 URL: https://issues.apache.org/jira/browse/HBASE-7958
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Jesse Yates
 Fix For: 0.96.0


