Re: Row Key Question

Stack Wed, 16 Feb 2011 10:15:35 -0800

See 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor,
byte[][])
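
The second argument of that createTable signature is the list of split keys, one per region boundary. As a rough sketch only, here is one way such a list might be computed. It assumes the row keys begin with evenly distributed bytes (for example a hash prefix), which may not match your schema; the function name split_keys and the key_width parameter are made up for illustration and are not part of HBase.

```python
def split_keys(num_regions, key_width=4):
    """Return num_regions - 1 split points that evenly divide a
    fixed-width, unsigned, big-endian key space.

    Hypothetical helper: assumes the leading key_width bytes of each
    row key are uniformly distributed.
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    space = 256 ** key_width
    return [
        (space * i // num_regions).to_bytes(key_width, "big")
        for i in range(1, num_regions)
    ]


if __name__ == "__main__":
    for key in split_keys(4, key_width=1):
        print(key.hex())
```

Each returned value would become one byte[] entry in the split-key array, so N regions need N - 1 entries.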


For an illustration of why a timestamp alone is a bad row key in HBase, where rows are stored sorted by key, see
http://hbase.apache.org/schema.html#d0e2139

St.Ack

On Wed, Feb 16, 2011 at 10:01 AM, Peter Haidinyak <[email protected]> wrote:
> Thanks, I'm storing log files and need to scan the tables by date and vendor. 
> Since the vendor is limited to at most 16 characters I can put a padded 
> version in the front followed by the date (vendor**********|DD-MM-YYYY|other 
> date) and I can still scan by setting the start row to 
> (vendor**********|DD-MM-YYYY|other date) and the end row to 
> (w***************|DD-MM-YYYY|other date).
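
A minimal sketch of the padded-vendor key described above, with start and stop rows for a scan over one vendor and a date range. The helper names are hypothetical. It assumes vendor names and the rest of the key use only ASCII characters that sort below '}', and it writes dates as YYYY-MM-DD rather than DD-MM-YYYY, since only the former sorts lexicographically in chronological order. The stop row is treated as exclusive, as in an HBase scan.

```python
PAD = "*"
VENDOR_WIDTH = 16


def padded_vendor(vendor):
    """Right-pad the vendor name to a fixed width with '*'."""
    if len(vendor) > VENDOR_WIDTH:
        raise ValueError("vendor longer than %d characters" % VENDOR_WIDTH)
    return vendor.ljust(VENDOR_WIDTH, PAD)


def row_key(vendor, date, rest):
    """Build vendor|date|rest with a fixed-width vendor field."""
    return "%s|%s|%s" % (padded_vendor(vendor), date, rest)


def scan_range(vendor, start_date, end_date):
    """Return (start_row, stop_row) covering start_date..end_date inclusive.

    '}' is the character after '|', so every key dated end_date sorts
    below the stop row.
    """
    prefix = padded_vendor(vendor) + "|"
    return prefix + start_date, prefix + end_date + "}"


if __name__ == "__main__":
    print(row_key("acme", "2011-02-16", "line-0001"))
    print(scan_range("acme", "2011-02-01", "2011-02-16"))
```

Because the vendor field has a fixed width, the stop row can be derived from the vendor itself rather than from the next letter of the alphabet.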
>
> Can anyone point me to information about 'pre-creating regions'? That sounds 
> like an interesting solution.
>
> Thanks again
>
> -Pete
>
>
>
> -----Original Message-----
> From: Doug Meil [mailto:[email protected]]
> Sent: Wednesday, February 16, 2011 9:41 AM
> To: [email protected]
> Subject: RE: Row Key Question
>
> Hi there-
>
> As was described in the HBase chapter in the Hadoop book by Tom White, you 
> don't want to insert a lot of data at one time with incrementing keys.
>
> YYYY-MM-DD would seem to me to be a reasonable lead-portion of a key - as 
> long as you aren't trying to insert everything in time-order (and all at one 
> time).  There are other posts about randomizing the input records.  That 
> would provide scan-ability, assuming that is important to you.   There are 
> also tricks where you can reverse the date (e.g., dd-mm-yyyy) or hash the 
> date for better spread if randomizing the input records isn't possible.
>
> Another big performance benefit we've seen is pre-creating regions for 
> tables.  One of our guys posted something about that this week.  You'll have 
> more servers participating in the load right off the bat.
>
> Doug
>
>
> -----Original Message-----
> From: Peter Haidinyak [mailto:[email protected]]
> Sent: Tuesday, February 15, 2011 7:38 PM
> To: [email protected]
> Subject: Row Key Question
>
> Hi All,
>  A couple of weeks ago I asked about how to distribute my rows across the 
> servers if the key always starts with the date in the format...
>
> YYYY-MM-DD
>
> I believe Stack, although I could be wrong, suggested prepending an 'X-' where 
> 'X' is a number from 1 to the number of servers I have. This way a scan can 
> be threaded out where there is one thread per server and each thread 'owns' 
> one 'X-' range of the keys.
> My question is on the import side, should I have one thread per server and 
> round-robin each line of our log files to the threads for the 'put' to the 
> server? Does this buy me any more throughput?
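
A rough sketch of the 'X-' prefix idea, for illustration only. Note that it swaps round-robin for a CRC32 hash of the key, so the same line always lands in the same bucket; that substitution is an assumption here, not something proposed in this thread. The function names are hypothetical, and the stop rows assume the rest of the key contains only characters that sort below '~'.

```python
import zlib


def bucket_prefix(bucket, num_buckets):
    """Format the bucket number, zero-padded so prefixes share one width."""
    width = len(str(num_buckets))
    return "%0*d-" % (width, bucket)


def salted_key(key, num_buckets):
    """Prefix key with 'X-', where X in 1..num_buckets is derived from a hash."""
    bucket = zlib.crc32(key.encode("utf-8")) % num_buckets + 1
    return bucket_prefix(bucket, num_buckets) + key


def bucket_scan_ranges(num_buckets, start_date, end_date):
    """One (start_row, stop_row) pair per bucket, e.g. one per scan thread."""
    ranges = []
    for bucket in range(1, num_buckets + 1):
        prefix = bucket_prefix(bucket, num_buckets)
        ranges.append((prefix + start_date, prefix + end_date + "~"))
    return ranges


if __name__ == "__main__":
    print(salted_key("2011-02-16|vendor|line-0001", 4))
    for start, stop in bucket_scan_ranges(4, "2011-02-16", "2011-02-16"):
        print(start, stop)
```

A date-range read then takes one scan per bucket, which is what lets each thread own a single 'X-' range.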
>
> Thanks again.
>
> -Pete
>
>
