(Branching this discussion since it's not directly relevant to the other thread)
I think if we ever come up with a formula, it needs to come with a big "your mileage may vary" sign. The reasons being:

 - If only a subset of the regions are getting written to, then only those regions need to be accounted for (I think this is what you referred to by Active Regions).
 - If the load is read heavy then you'd want to flush as little as possible, meaning very few regions (possibly forcing them to be fewer than the theoretical maximum).
 - Not all tables may have the same flush size.
 - Some regions might be more active than others and may flush a lot more, and since we keep both active and inactive data in the HLogs, you might be churning more than you need to.
 - Same for families.

Now on the formula:

> If( (Hlognumber*hdfsblock) > (HBASE_HEAPSIZE *memstore.lowerLimit) )

That's ok.

> Active Regions = (HBASE_HEAPSIZE *memstore.lowerLimit )/( flush.size /
> (2~3))

Could you explain the division by 2 or 3? I'm not sure I'm following that. Also I don't remember if the flush size by region was fixed (it should be by family), but this would have an effect too.

> Else
> Active Regions = (Hlognumber*hdfsblock)/ (flush.size / (2~3))

Same comments.

J-D

2011/9/6 Gaojinchao <[email protected]>:
> Hi J-D
> Should we can give a formula about active regions per node and up to book ?
> I think many people encounter the same problem.
>
> I think the formula is:
> If( (Hlognumber*hdfsblock) > (HBASE_HEAPSIZE *memstore.lowerLimit) )
> Active Regions = (HBASE_HEAPSIZE *memstore.lowerLimit )/( flush.size /
> (2~3))
> Else
> Active Regions = (Hlognumber*hdfsblock)/ (flush.size / (2~3))
>
>
> If I am wrong, please correct. Thanks.
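
For anyone who wants to plug numbers into the formula quoted in this thread, here is a rough Python sketch of it. It is only an illustration: the function name, the parameter names, and the sample values are made up for the example, all sizes are assumed to be in bytes, and the divisor (the unexplained "2~3" in the proposal) is left as a parameter. It inherits every "your mileage may vary" caveat listed in the reply, since it assumes one flush size for all regions and evenly active regions.

```python
MB = 1024 * 1024


def estimate_active_regions(hlog_count, hdfs_block_bytes, heap_bytes,
                            memstore_lower_limit, flush_size_bytes,
                            divisor=2):
    """Rough estimate of active regions per node, per the proposed formula.

    All names here are hypothetical; they mirror the terms used in the
    thread (Hlognumber, hdfsblock, HBASE_HEAPSIZE, memstore.lowerLimit,
    flush.size). `divisor` stands in for the "2~3" factor.
    """
    # Total data the HLogs can hold before old logs force flushes.
    hlog_capacity = hlog_count * hdfs_block_bytes
    # Heap share at which the global memstore limit starts forcing flushes.
    memstore_capacity = heap_bytes * memstore_lower_limit
    # The If/Else in the proposal simply picks the smaller of the two limits.
    limiting_capacity = min(hlog_capacity, memstore_capacity)
    # Assumed average memstore size per active region.
    per_region = flush_size_bytes / divisor
    return int(limiting_capacity // per_region)


if __name__ == "__main__":
    # Illustrative values only, not recommendations.
    regions = estimate_active_regions(
        hlog_count=32,
        hdfs_block_bytes=64 * MB,
        heap_bytes=8 * 1024 * MB,
        memstore_lower_limit=0.35,
        flush_size_bytes=64 * MB,
        divisor=2,
    )
    print(regions)
```

With these sample values the HLog side is the smaller limit (2048 MB against roughly 2867 MB of memstore), so the estimate comes from the Else branch of the proposal.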
