[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231835#comment-14231835 ] Jonathan Hsieh commented on HBASE-3149: --- The jira reflects the latest status -- i believe this issue is waiting for someone to pick up and complete for about a year. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Gaurav Menghani Priority: Critical Fix For: 0.89-fb Attachments: 3149-trunk-v1.txt, Per-CF-Memstore-Flush.diff Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231843#comment-14231843 ] Ted Yu commented on HBASE-3149: --- Please see HBASE-10201 Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Gaurav Menghani Priority: Critical Fix For: 0.89-fb Attachments: 3149-trunk-v1.txt, Per-CF-Memstore-Flush.diff Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230448#comment-14230448 ] Brian Johnson commented on HBASE-3149: -- What is the status of this in trunk? Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Gaurav Menghani Priority: Critical Fix For: 0.89-fb Attachments: 3149-trunk-v1.txt, Per-CF-Memstore-Flush.diff Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13854725#comment-13854725 ] Gaurav Menghani commented on HBASE-3149: [~saint@gmail.com] Yes, we have deployed this, with selective flushing disabled for now, since we didn't see any aggregate benefits yet. The heuristics that I was thinking about were around, which column families to flush when there are no column families above the threshold for flushing families. Eg. if the memstore limit is 128 MB, and the flushing threshold for a CF is 32 MB, there might be a case, where there are like 7-8 CFs, and none of them are above 32 MB. In that case, there are a couple of heuristics you can choose. Like: flush the top N column families, flush only as few column families to free up 1/4 th of the memstore, etc. The main benefit I see is the time spent while compacting the smaller CFs will be much lesser, since the number of files created would be much lesser. This is compensated against bigger column families being flushed earlier than before, and having smaller files than without this change, but with the right heuristics we can find a good balance. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Gaurav Menghani Priority: Critical Fix For: 0.89-fb Attachments: 3149-trunk-v1.txt, Per-CF-Memstore-Flush.diff Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852025#comment-13852025 ] Gaurav Menghani commented on HBASE-3149: Ted has volunteered to port this to trunk in a separate JIRA. I will be working on different heuristics to see the benefits that we get. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Gaurav Menghani Priority: Critical Fix For: 0.89-fb Attachments: 3149-trunk-v1.txt, Per-CF-Memstore-Flush.diff Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852042#comment-13852042 ] stack commented on HBASE-3149: -- [~gaurav.menghani] Gaurav, have you deployed this change? If so, what do you see in operation? When you talk about different heuristics, what you thinking? Thanks boss. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Gaurav Menghani Priority: Critical Fix For: 0.89-fb Attachments: 3149-trunk-v1.txt, Per-CF-Memstore-Flush.diff Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804528#comment-13804528 ] Hadoop QA commented on HBASE-3149: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12610129/Per-CF-Memstore-Flush.diff against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 12 new or modified tests. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/7618//console This message is automatically generated. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Gaurav Menghani Priority: Critical Fix For: 0.89-fb Attachments: Per-CF-Memstore-Flush.diff Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804537#comment-13804537 ] Gaurav Menghani commented on HBASE-3149: The basic idea is to be able to maintain the smallest LSN amongst the edits present in a particular memstore for a column family. When we decide to flush a set of memstores, we find the smallest LSN id amongst the memstores that we are not flushing, say X, and say that we can remove the logs for any edits with LSN less than X. We choose a particular memstore to be flushed, if it occupies more than 't' bytes, when the global memstore size threshold is 'T' (and t/T = 1/4 for our configuration). If there is no memstore with = t bytes but the total size of all the memstores is above T, we flush all the memstores. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Gaurav Menghani Priority: Critical Fix For: 0.89-fb Attachments: Per-CF-Memstore-Flush.diff Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794254#comment-13794254 ] stack commented on HBASE-3149: -- Thanks [~gaurav.menghani] Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Gaurav Menghani Priority: Critical Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13680110#comment-13680110 ] Sergey Shelukhin commented on HBASE-3149: - Any update since last comment? Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Himanshu Vashishtha Priority: Critical Fix For: 0.92.3 Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13577618#comment-13577618 ] Himanshu Vashishtha commented on HBASE-3149: This is a useful feature; I'm working on it. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Himanshu Vashishtha Priority: Critical Fix For: 0.92.3 Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13213858#comment-13213858 ] Nicolas Spiegelberg commented on HBASE-3149: @Lars/Stack: note that the number of StoreFiles necessary to store N amount of data is order O(log N) with the existing compaction algorithm. This means that setting the compaction min size to a low value will not result in significantly more files. Furthermore, what's hurting performance is not the amount of files but the size of each file. The extra files will be very small and take up only a minority of the space in the LRU cache. Every time you unnecessarily compact files, you have to repopulate that StoreFile in the LRU cache and get a lot of disk reads in addition to the obvious write increase. This is all to say that I would recommend defaulting it to that low because the downsides are very minimal and the benefit can be substantial IO gains. bq. At the same time, I'd think this issue still worth some time; if lots of cfs and only one is filling, its silly to flush the others as we do now because one is over the threshold. Why is this silly? With cache-on-write, the data is still cached in memory. It's just migrated from the MemCache to the BlockCache, which has comparable performance. Furthermore, BlockCache data is compressed, so it then takes up less space. Flushing also minimizes the amount of HLogs and decreases recovery time. Flushing would be bad if it meant we weren't optimally using the global MemStore size, but we currently are. bq. This surely seems a specific setting for this use-case, and there are others that need a slightly different setting. If you mix those two on the same cluster, then having only one global setting to adjust this seems restrictive? Should this be a setting per table, like the flush size? I think this is a better default, not that it's a one-size setting. I agree that this should toggleable on a per-CF basis, hence HBASE-5335. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Priority: Critical Fix For: 0.92.1 Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214372#comment-13214372 ] stack commented on HBASE-3149: -- @Nicolas I think I follow. I opened HBASE-5461. Let me try it. bq. Why is this silly? Because I was seeing a plethora of small files a problem but given your explaination above, I think I grok that its not many small files thats the prob; its that w/ the way high min size, our selection was to inclusionary and so we end up doing loads of rewriting. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Priority: Critical Fix For: 0.92.1 Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214382#comment-13214382 ] Lars Hofhansl commented on HBASE-3149: -- Thanks for explaining Nicolas. I wonder if a good default would be some fraction of the flushsize. Maybe 1/4*flushsize, or something. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Priority: Critical Fix For: 0.92.1 Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212938#comment-13212938 ] Lars George commented on HBASE-3149: bq. At the same time, I'd think this issue still worth some time; if lots of cfs and only one is filling, its silly to flush the others as we do now because one is over the threshold. I thought so too. Setting the hbase.hstore.compaction.size to 4MB, and having the flush size at 256MB, it means you will never compact flush files larger than 4MB. So, in other words, only if you are flushing small files (say from a small, dependent column family) you are running a minor compaction on them. For the larger family you typically do not run those at all, right? This surely seems a specific setting for this use-case, and there are others that need a slightly different setting. If you mix those two on the same cluster, then having only one global setting to adjust this seems restrictive? Should this be a setting per table, like the flush size? It still seems to me that decoupling is what we should have available as well. But I thought about it for a while as well as discussed this various people: it seems that decoupling brings its own set of issues, for example, you might end up with too many HLog files because the small family is flushed only rarely. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Priority: Critical Fix For: 0.92.1 Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212166#comment-13212166 ] stack commented on HBASE-3149: -- @Nicolas I wonder about this... hbase.hstore.compaction.min.size. When we compact, don't we have to take adjacent files as part of our ACID guarantees? This would frustrate that? (I'll take a look... tomorrow). I'm wondering because i want to figure how to make it so we favor reference files... so they are always included in a compaction. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Fix For: 0.92.1 Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13211614#comment-13211614 ] Lars Hofhansl commented on HBASE-3149: -- @Nicolas: Interesting bit about hstore.compaction.min.size. I'm curious, is 4MB something that works specifically for your setup, or would you generally recommend it setting it that low? It probably has to do with whether compression is enabled, how many CFs and relative sizes, etc. Maybe instead of defaulting it to flushsize, we could default it to flushsize/2 or flushsize/4...? Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13211204#comment-13211204 ] stack commented on HBASE-3149: -- Thanks @Nicolas (and thanks @Mubarak -- sounds like something to indeed get into 0.92). At the same time, I'd think this issue still worth some time; if lots of cfs and only one is filling, its silly to flush the others as we do now because one is over the threshold. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210671#comment-13210671 ] Mubarak Seyed commented on HBASE-3149: -- @Nicolas, Is there any update on this issue? We have a production use-case wherein 80% of data goes to one CF and remaining 20% goes to two other CFs. I can collaborate with you if you are interested to pursue with patch. Thanks. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210687#comment-13210687 ] Nicolas Spiegelberg commented on HBASE-3149: @Mubarak: I think you probably are more interested in tuning the compaction settings. The initial reason for this JIRA was higher network IO. The actual problem was that the min unconditional compact size was too high caused bad compaction decision. We fixed this by lowering the min size from the default of the flush size (256MB, for us) to 4MB. {code} property namehbase.hstore.compaction.min.size/name value4194304/value description The minimum compaction size. All files below this size are always included into a compaction, even if outside compaction ratio times the total size of all files added to compaction so far. /description /property {code} We identified this a while ago and I thought we were going to change the default for 0.92, but it looks like it's still in the Store.java code :( A better use of your time would be to verify that this reduces your IO and write up a JIRA to change the default. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210698#comment-13210698 ] Mubarak Seyed commented on HBASE-3149: -- Thanks Nicolas. Will try with 4 MB and create a JIRA. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12988784#comment-12988784 ] Schubert Zhang commented on HBASE-3149: --- This jira is very useful in practice. In HBase, the horizontal partitions by rowkey-ranges make regions, and the vertical partitions by column-family make stores. These horizontal and vertical partitoning schema make a data tetragonum --- the store in hbase. The memstore is base on the store, so the flush and compaction need also be based on store. The memstoreSize in HRegion should be in HStore. For flexible configuration, I think we shlould be able to configure memstoresize (i.e. hbase.hregion.memstore.flush.size) in Column-Family level (when create table). And if possaible, I want the maxStoreSize also be configurable for different Column-Family. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983972#action_12983972 ] ryan rawson commented on HBASE-3149: ok got it. as long as we dont generate a seqid per family we are all good. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983312#action_12983312 ] Nicolas Spiegelberg commented on HBASE-3149: Some interesting stats. We did some rough calculations internally to see what effect an uneven distribution of data into column families was having on our network IO. Our data distribution for 3 column families was 1:1:20. When we looked at the flush:minor-compaction ratio for each of the store files, the large column family had a 1:2 ratio but the small CFs both had a 1:20 ratio! We are looking at roughly a 10% network IO decrease if we can bring those other 2 CFs down to a 1:2 ratio as well. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983510#action_12983510 ] ryan rawson commented on HBASE-3149: if you are going to generate a sequence id for every CF, then we will need to create and use a new synthetic ID for atomic views. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983523#action_12983523 ] Nicolas Spiegelberg commented on HBASE-3149: @ryan: the main work in step #3 isn't HBASE-2856 work. It's roughly modifying HLog.lastSeqWritten from Mapregion, long = Mapstore, long and all the refactoring to the HLog code that it entails. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Assignee: Nicolas Spiegelberg Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12925090#action_12925090 ] Karthik Ranganathan commented on HBASE-3149: Yes, agreed that the memory implication is different. Eventually, is it not better to enforce the memory limit by using a combination of flush sizes and restricting the number of regions we create? Because ideally we should allow different flush sizes for the different CF's as the KV sizes could be way different... Shall I just make this an option in the conf for now with the default the way it is? Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3149) Make flush decisions per column family
[ https://issues.apache.org/jira/browse/HBASE-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12924705#action_12924705 ] Jean-Daniel Cryans commented on HBASE-3149: --- I have been thinking about this one for some time... I think it makes sense in loads of ways since a common problem of multi-CF is that during the initial import the user ends up with thousands of small store files because some family grows faster and triggered the flushes, which in turn generates incredible compaction churn. On the other hand, it means that we almost consider a family as a region e.g. one region with 3 CF can have up to 3x64MB in the memstores. Make flush decisions per column family -- Key: HBASE-3149 URL: https://issues.apache.org/jira/browse/HBASE-3149 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Karthik Ranganathan Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.