Sorry for omitting the background. We assume a user is more interested in his latest bills than in his old ones. Thus, the query generator works as follows (a rough sketch in Java is below):
1. randomly generate a number and reverse its digits to use as the user id.
2. randomly generate a month, weighted toward recent months per the above assumption.
3. ask HBase to query this user + month.
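A minimal sketch of such a generator, assuming a 0.90-era HBase client API; the row-key layout (userId + "-" + month), the table name "bills", and the squared-uniform weighting toward recent months are illustrative assumptions rather than our exact implementation:

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BillQueryGenerator {
        private static final int NUM_USERS = 1000000;  // hypothetical user population
        private static final int NUM_MONTHS = 24;      // hypothetical billing history depth
        private final Random rand = new Random();

        // Step 1: random number, digits reversed, used as the user id.
        String randomUserId() {
            int n = rand.nextInt(NUM_USERS);
            return new StringBuilder(String.format("%07d", n)).reverse().toString();
        }

        // Step 2: month index skewed toward 0 (the latest month).
        int weightedMonth() {
            double u = rand.nextDouble();
            return (int) (u * u * NUM_MONTHS);  // squaring biases the draw toward recent months
        }

        // Step 3: ask HBase for that user + month (row-key layout is an assumption).
        Result queryBill(HTable table) throws Exception {
            String rowKey = randomUserId() + "-" + weightedMonth();
            Get get = new Get(Bytes.toBytes(rowKey));
            return table.get(get);
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "bills");  // table name is hypothetical
            System.out.println(new BillQueryGenerator().queryBill(table));
            table.close();
        }
    }

Squaring a uniform draw is just one way to bias the month toward the latest bill; any skewed distribution produces the same hot-spot effect on the keys for recent months.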
Thanks
Weihua

2011/5/20 Matt Corgan <[email protected]>:
> I think i traced this to a bug in my compaction scheduler that would have missed scheduling about half the regions, hence the 240gb vs 480gb. To confirm: major compaction will always run when asked, even if the region is already major compacted, the table settings haven't changed, and it was last major compacted on that same server. [potential hbase optimization here for clusters with many cold regions]. So my theory about not localizing blocks is false.
>
> Weihua - why do you think your throughput doubled when you went from user+month to month+user keys? Are your queries using an even distribution of months? I'm not exactly clear on your schema or query pattern.
>
>
> On Thu, May 19, 2011 at 8:39 AM, Joey Echeverria <[email protected]> wrote:
>
>> I'm surprised the major compactions didn't balance the cluster better. I wonder if you've stumbled upon a bug in HBase that's causing it to leak old HFiles.
>>
>> Is the total amount of data in HDFS what you expect?
>>
>> -Joey
>>
>> On Thu, May 19, 2011 at 8:35 AM, Matt Corgan <[email protected]> wrote:
>> > that's right
>> >
>> >
>> > On Thu, May 19, 2011 at 8:23 AM, Joey Echeverria <[email protected]> wrote:
>> >
>> >> Am I right to assume that all of your data is in HBase, ie you don't keep anything in just HDFS files?
>> >>
>> >> -Joey
>> >>
>> >> On Thu, May 19, 2011 at 8:15 AM, Matt Corgan <[email protected]> wrote:
>> >> > I wanted to do some more investigation before posting to the list, but it seems relevant to this conversation...
>> >> >
>> >> > Is it possible that major compactions don't always localize the data blocks? Our cluster had a bunch of regions full of historical analytics data that were already major compacted, then we added a new datanode/regionserver. We have a job that triggers major compactions at a minimum of once per week by hashing the region name and giving it a time slot. It's been several weeks and the original nodes each have ~480gb used in hdfs, while the new node has only 240gb. Regions are scattered pretty randomly and evenly among the regionservers.
>> >> >
>> >> > The job calls hBaseAdmin.majorCompact(hRegionInfo.getRegionName());
>> >> >
>> >> > My guess is that if a region is already major compacted and no new data has been added to it, then compaction is skipped. That's definitely an essential feature during typical operation, but it's a problem if you're relying on major compaction to balance the cluster.
>> >> >
>> >> > Matt
>> >> >
>> >> >
>> >> > On Thu, May 19, 2011 at 4:42 AM, Michel Segel <[email protected]> wrote:
>> >> >
>> >> >> I had asked the question about how he created random keys... Hadn't seen a response.
>> >> >>
>> >> >> Sent from a remote device. Please excuse any typos...
>> >> >>
>> >> >> Mike Segel
>> >> >>
>> >> >> On May 18, 2011, at 11:27 PM, Stack <[email protected]> wrote:
>> >> >>
>> >> >> > On Wed, May 18, 2011 at 5:11 PM, Weihua JIANG <[email protected]> wrote:
>> >> >> >> All the DNs almost have the same number of blocks. Major compaction makes no difference.
>> >> >> >>
>> >> >> >
>> >> >> > I would expect major compaction to even the number of blocks across the cluster and it'd move the data for each region local to the regionserver.
>> >> >> >
>> >> >> > The only explanation that I can see is that the hot DNs must be carrying the hot blocks (The client querys are not random). I do not know what else it could be.
>> >> >> >
>> >> >> > St.Ack
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >> --
>> >> Joseph Echeverria
>> >> Cloudera, Inc.
>> >> 443.305.9434
>> >>
>> >
>>
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
>>
>
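For reference, a minimal sketch of the kind of weekly compaction job Matt describes above (hash each region name into a time slot, then call hBaseAdmin.majorCompact()), assuming the 0.90-era HBaseAdmin/HTable client API; the hourly slot granularity and the hashing scheme are illustrative assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WeeklyMajorCompactionJob {
        // 168 hourly slots per week; a region is compacted only during its own slot.
        private static final int SLOTS_PER_WEEK = 7 * 24;

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTable table = new HTable(conf, args[0]);  // table name passed on the command line

            long hoursSinceEpoch = System.currentTimeMillis() / (60L * 60 * 1000);
            int currentSlot = (int) (hoursSinceEpoch % SLOTS_PER_WEEK);

            for (HRegionInfo hRegionInfo : table.getRegionsInfo().keySet()) {
                // Hash the region name into a stable weekly time slot.
                int slot = (Bytes.toString(hRegionInfo.getRegionName()).hashCode()
                            & Integer.MAX_VALUE) % SLOTS_PER_WEEK;
                if (slot == currentSlot) {
                    admin.majorCompact(hRegionInfo.getRegionName());
                }
            }
            table.close();
        }
    }

Run hourly (e.g. from cron), each region lands in exactly one slot per week, so every region gets major compacted at least weekly without compacting the whole table at once.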
