See http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
For illustration of why ts alone is a bad key for sorted hbase, see http://hbase.apache.org/schema.html#d0e2139

St.Ack

On Wed, Feb 16, 2011 at 10:01 AM, Peter Haidinyak <[email protected]> wrote:
> Thanks, I'm storing log files and need to scan the tables by date and vendor.
> Since the vendor is limited to at most 16 characters I can put a padded
> version in the front followed by the date (vendor**********|DD-MM-YYYY|other
> date) and I can still scan by setting the start row to
> (vendor**********|DD-MM-YYYY|other date) and the end row to
> (w***************|DD-MM-YYYY|other date).
>
> Can anyone point me to information about 'pre-creating regions'? That sounds
> like an interesting solution.
>
> Thanks again
>
> -Pete
>
>
>
> -----Original Message-----
> From: Doug Meil [mailto:[email protected]]
> Sent: Wednesday, February 16, 2011 9:41 AM
> To: [email protected]
> Subject: RE: Row Key Question
>
> Hi there-
>
> As was described in the HBase chapter in the Hadoop book by Tom White, you
> don't want to insert a lot of data at one time with incrementing keys.
>
> YYYY-MM-DD would seem to me to be a reasonable lead-portion of a key - as
> long as you aren't trying to insert everything in time-order (and all at one
> time). There are other posts about randomizing the input records. That
> would provide scan-ability, assuming that is important to you. There are
> also tricks where you can reverse the date (e.g., dd-mm-yyyy, or hash the
> date, etc.) for better spread if randomizing the input records isn't possible.
>
> Another big performance benefit we've seen is pre-creating regions for
> tables. One of our guys posted something about that this week. You'll have
> more servers participating in the load right off the bat.
>
> Doug
>
>
> -----Original Message-----
> From: Peter Haidinyak [mailto:[email protected]]
> Sent: Tuesday, February 15, 2011 7:38 PM
> To: [email protected]
> Subject: Row Key Question
>
> Hi All,
> A couple of weeks ago I asked about how to distribute my rows across the
> servers if the key always starts with the date in the format...
>
> YYYY-MM-DD
>
> I believe Stack, although I could be wrong, suggested pre-pending a 'X-' where
> 'X' is a number from 1 to the number of servers I have. This way a scan can
> be threaded out where there is one thread per server and each thread 'owns'
> one 'X-' range of the keys.
> My question is on the import side, should I have one thread per server and
> round-robin each line of our log files to the threads for the 'put' to the
> server? Does this buy me any more throughput?
>
> Thanks again.
>
> -Pete
>
>
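For anyone reading this thread later, here is a rough sketch of how the 'X-' prefix and pre-created regions discussed above might fit together. It is only an illustration under some assumptions: the class name SaltedKeys, the zero-padded two-digit prefix numbered from 00, and the use of String.hashCode() to pick a bucket are all made up for this example and are not anything HBase prescribes. The sketch uses only the JDK; the byte[][] it returns is the kind of split-key array you would hand to the createTable(HTableDescriptor, byte[][]) call linked above, which creates one more region than there are split keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of HBase: builds salted row keys and matching split keys.
final class SaltedKeys {

    // Prefix a row key with a zero-padded bucket number, e.g. "NN-2011-02-16|vendorA".
    // The bucket comes from the key's hash, so the same key always gets the same prefix.
    static String salt(String rowKey, int buckets) {
        checkBuckets(buckets);
        int bucket = Math.floorMod(rowKey.hashCode(), buckets);
        return String.format(Locale.ROOT, "%02d-%s", bucket, rowKey);
    }

    // One split key per boundary between buckets: "01-", "02-", ...
    // Keys in bucket 00 sort before "01-", so N buckets need N - 1 split keys.
    static byte[][] splitKeys(int buckets) {
        checkBuckets(buckets);
        byte[][] splits = new byte[buckets - 1][];
        for (int i = 1; i < buckets; i++) {
            splits[i - 1] = String.format(Locale.ROOT, "%02d-", i)
                    .getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }

    // Two digits only cover buckets 00..99; zero-padding keeps byte order equal to numeric order.
    private static void checkBuckets(int buckets) {
        if (buckets < 1 || buckets > 100) {
            throw new IllegalArgumentException("buckets must be between 1 and 100");
        }
    }

    public static void main(String[] args) {
        int buckets = 4; // assumed: one bucket per region server
        System.out.println(salt("2011-02-16|vendorA", buckets));
        for (byte[] split : splitKeys(buckets)) {
            System.out.println(new String(split, StandardCharsets.UTF_8));
        }
    }
}
```

With a layout like this, a scan for one date has to be issued once per prefix (start row "NN-" plus the date), which is the one-thread-per-range idea from the original question. Whether hashing the whole key or round-robining the prefix works better on the import side likely depends on the workload.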
