Thanks, I'm storing log files and need to scan the tables by date and vendor. 
Since the vendor is limited to at most 16 characters I can put a padded version 
in the front followed by the date (vendor**********|DD-MM-YYYY|other date) and 
I can still scan by setting the start row to (vendor**********|DD-MM-YYYY|other 
date) and the end row to (w***************|DD-MM-YYYY|other date).
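A minimal sketch of that key layout, for illustration only. The helper names (`pad_vendor`, `row_key`, `vendor_scan_range`) are made up here, and it assumes ASCII vendor names, '*' as the pad character, and a scan whose stop row is exclusive. It bounds the scan by bumping the last byte of the padded vendor prefix rather than the first letter.

```python
PAD_CHAR = "*"        # pad character taken from the example key above
VENDOR_WIDTH = 16     # vendor names are at most 16 characters


def pad_vendor(vendor: str) -> str:
    """Right-pad the vendor name to a fixed width so keys line up."""
    if len(vendor) > VENDOR_WIDTH:
        raise ValueError("vendor name longer than 16 characters")
    return vendor.ljust(VENDOR_WIDTH, PAD_CHAR)


def row_key(vendor: str, date: str, other: str) -> bytes:
    """Build vendor|date|other as bytes (hypothetical helper)."""
    return "|".join([pad_vendor(vendor), date, other]).encode("ascii")


def vendor_scan_range(vendor: str) -> tuple:
    """Return (start_row, stop_row) covering every key for one vendor.

    The stop row is the start prefix with its last byte incremented,
    which is the smallest key that sorts after all keys with that prefix.
    """
    start = pad_vendor(vendor).encode("ascii") + b"|"
    stop = start[:-1] + bytes([start[-1] + 1])
    return start, stop


start, stop = vendor_scan_range("acme")
key = row_key("acme", "16-02-2011", "17-02-2011")
print(start <= key < stop)
```

One thing to keep in mind: with DD-MM-YYYY the rows inside a vendor sort by day of month first, not chronologically, so a date-range scan within a vendor would need YYYY-MM-DD instead.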

Can anyone point me to information about 'pre-creating regions'? That sounds 
like an interesting solution.

Thanks again

-Pete



-----Original Message-----
From: Doug Meil [mailto:[email protected]] 
Sent: Wednesday, February 16, 2011 9:41 AM
To: [email protected]
Subject: RE: Row Key Question

Hi there-

As was described in the HBase chapter in the Hadoop book by Tom White, you 
don't want to insert a lot of data at one time with incrementing keys.

YYYY-MM-DD would seem to me to be a reasonable lead-portion of a key - as long 
as you aren't trying to insert everything in time-order (and all at one time).  
There are other posts about randomizing the input records.  That would provide 
scan-ability, assuming that is important to you.   There are also tricks where 
you can reverse the date (e.g., dd-mm-yyyy, or hash the date, etc.) for better 
spread if randomizing the input records isn't possible.  
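As a rough illustration of the two tricks, here is a sketch with made-up helper names (`reversed_date_prefix`, `hashed_date_prefix`). The choice of MD5 and a 4-character hash prefix are assumptions for the example, not a recommendation.

```python
import hashlib
from datetime import date


def reversed_date_prefix(d: date) -> str:
    """Day-first date, so consecutive days start with different characters."""
    return d.strftime("%d-%m-%Y")


def hashed_date_prefix(d: date, hex_chars: int = 4) -> str:
    """Prefix the ISO date with a few hex characters of its hash.

    The same date always maps to the same prefix, but neighbouring dates
    land far apart in the key space.
    """
    iso = d.isoformat()
    digest = hashlib.md5(iso.encode("ascii")).hexdigest()
    return f"{digest[:hex_chars]}-{iso}"


print(reversed_date_prefix(date(2011, 2, 16)))
print(hashed_date_prefix(date(2011, 2, 16)))
```

The trade-off is that a hashed prefix gives up range scans across dates; you can still look up a single date by recomputing its prefix.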

Another big performance benefit we've seen is pre-creating regions for tables.  
One of our guys posted something about that this week.  You'll have more 
servers participating in the load right off the bat.
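One way to come up with split points for pre-created regions is to take a sample of the keys you expect to load and pick evenly spaced boundaries from it. The sketch below only computes the boundaries; `split_points` is a hypothetical helper, and the resulting keys would then be handed to whatever table-creation call your HBase client offers for split keys.

```python
def split_points(sample_keys, num_regions):
    """Pick num_regions - 1 boundary keys from a sample of row keys.

    Assumes the sample is representative of the real key distribution.
    """
    if num_regions < 2:
        return []
    keys = sorted(set(sample_keys))
    if len(keys) < num_regions:
        raise ValueError("need at least as many distinct keys as regions")
    step = len(keys) / num_regions
    # Boundary i sits i/num_regions of the way through the sorted sample.
    return [keys[int(i * step)] for i in range(1, num_regions)]


# Example: one key per day in February 2011, split into 4 regions.
sample = [f"2011-02-{day:02d}".encode("ascii") for day in range(1, 29)]
for boundary in split_points(sample, 4):
    print(boundary)
```

If the keys carry a salt or hash prefix, the boundaries can simply be the prefix values themselves, since the prefix already spreads the load.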

Doug


-----Original Message-----
From: Peter Haidinyak [mailto:[email protected]] 
Sent: Tuesday, February 15, 2011 7:38 PM
To: [email protected]
Subject: Row Key Question

Hi All,
  A couple of weeks ago I asked about how to distribute my rows across the 
servers if the key always starts with the date in the format...

YYYY-MM-DD

I believe Stack, although I could be wrong, suggested prepending an 'X-' where 
'X' is a number from 1 to the number of servers I have. This way a scan can be 
threaded out where there is one thread per server and each thread 'owns' one 
'X-' range of the keys. 
My question is on the import side, should I have one thread per server and 
round-robin each line of our log files to the threads for the 'put' to the 
server? Does this buy me any more throughput?
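To make the 'X-' scheme concrete, here is a minimal sketch. It assumes 'X' is derived from a CRC32 of the original key (that choice is mine, not something stated in the thread; round-robin assignment would also work for writes), and the helper names are hypothetical. It shows how a writer picks the prefix and how a reader derives one scan range per prefix for a given date.

```python
import zlib


def salted_key(key: str, num_buckets: int) -> str:
    """Prepend 'X-' where X is in 1..num_buckets, chosen by hashing the key."""
    bucket = zlib.crc32(key.encode("utf-8")) % num_buckets + 1
    return f"{bucket}-{key}"


def next_prefix(prefix: str) -> str:
    """Smallest string that sorts after every string starting with prefix."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def bucket_scan_ranges(date_prefix: str, num_buckets: int):
    """One (start_row, stop_row) pair per bucket; one scan thread could own each."""
    ranges = []
    for bucket in range(1, num_buckets + 1):
        start = f"{bucket}-{date_prefix}"
        ranges.append((start, next_prefix(start)))
    return ranges


key = salted_key("2011-02-15|some log line id", 4)
for start, stop in bucket_scan_ranges("2011-02-15", 4):
    print(start, stop, start <= key < stop)
```

Because the prefix is computed from the key itself, a single row can still be fetched directly by recomputing its prefix; with round-robin assignment a point lookup would have to check every bucket.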

Thanks again.

-Pete
