Thanks, I'm storing log files and need to scan the tables by date and vendor. Since the vendor is limited to at most 16 characters, I can put a padded version at the front of the key, followed by the date (vendor**********|DD-MM-YYYY|other date). I can still scan by setting the start row to (vendor**********|DD-MM-YYYY|other date) and the end row to (w***************|DD-MM-YYYY|other date).
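For anyone following along, here is a rough sketch of the padded-vendor key and a scan range for one vendor. It is plain Python with no HBase client involved; the function names are made up for illustration. It assumes the stop row is exclusive (as it is for HBase scans) and, instead of using the next vendor name as the end row, it bumps the last byte of the prefix, which is a variation on the scheme described above. Keep in mind that keys compare byte-wise, so a DD-MM-YYYY date will not sort chronologically within a vendor; the sketch leaves the date format to the caller.

```python
PAD_CHAR = "*"      # pad character taken from the example key in the message
VENDOR_WIDTH = 16   # vendor names are limited to at most 16 characters


def pad_vendor(vendor: str) -> str:
    """Right-pad the vendor name to a fixed width so every key has the same layout."""
    if len(vendor) > VENDOR_WIDTH:
        raise ValueError(f"vendor longer than {VENDOR_WIDTH} characters: {vendor!r}")
    return vendor.ljust(VENDOR_WIDTH, PAD_CHAR)


def row_key(vendor: str, date: str, other_date: str) -> bytes:
    """Build a key of the form vendor**********|date|other date."""
    return "|".join([pad_vendor(vendor), date, other_date]).encode("ascii")


def vendor_scan_range(vendor: str) -> tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering every key for one vendor.

    The stop row is treated as exclusive. It is the vendor prefix with its
    last byte ('|') incremented, so it sorts after every key for this vendor.
    """
    start = (pad_vendor(vendor) + "|").encode("ascii")
    stop = start[:-1] + bytes([start[-1] + 1])
    return start, stop


if __name__ == "__main__":
    start, stop = vendor_scan_range("vendor")
    key = row_key("vendor", "16-02-2011", "15-02-2011")
    print(start, stop, start <= key < stop)
```

The two byte strings would be handed to the scan as its start and stop rows; how that is done depends on the client you use.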
Can anyone point me to information about 'pre-creating regions'? That sounds like an interesting solution.

Thanks again
-Pete

-----Original Message-----
From: Doug Meil [mailto:[email protected]]
Sent: Wednesday, February 16, 2011 9:41 AM
To: [email protected]
Subject: RE: Row Key Question

Hi there-

As was described in the HBase chapter in the Hadoop book by Tom White, you don't want to insert a lot of data at one time with incrementing keys.

YYYY-MM-DD would seem to me to be a reasonable lead-portion of a key - as long as you aren't trying to insert everything in time-order (and all at one time). There are other posts about randomizing the input records. That would provide scan-ability, assuming that is important to you.

There are also tricks where you can reverse the date (e.g., dd-mm-yyyy), hash the date, etc., for better spread if randomizing the input records isn't possible.

Another big performance benefit we've seen is pre-creating regions for tables. One of our guys posted something about that this week. You'll have more servers participating in the load right off the bat.

Doug

-----Original Message-----
From: Peter Haidinyak [mailto:[email protected]]
Sent: Tuesday, February 15, 2011 7:38 PM
To: [email protected]
Subject: Row Key Question

Hi All,

A couple of weeks ago I asked about how to distribute my rows across the servers if the key always starts with the date in the format YYYY-MM-DD.

I believe Stack, although I could be wrong, suggested prepending an 'X-', where 'X' is a number from 1 to the number of servers I have. This way a scan can be threaded out so that there is one thread per server and each thread 'owns' one 'X-' range of the keys.

My question is on the import side: should I have one thread per server and round-robin each line of our log files to the threads for the 'put' to the server? Does this buy me any more throughput?

Thanks again.
-Pete
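The 'X-' prefix idea and the pre-created regions mentioned in this thread fit together, and a small sketch may make that concrete. It is plain Python that only computes byte strings; nothing here talks to HBase, and all function names are hypothetical. Two assumptions that are not from the thread: the bucket number is derived from a CRC32 of the key (the thread does not say how 'X' is picked, and round-robin would also work), and the number is zero-padded so that bucket 10 sorts after bucket 9.

```python
import zlib


def salt_width(num_buckets: int) -> int:
    """Digits needed so every bucket prefix has the same length."""
    return len(str(num_buckets))


def salted_key(key: str, num_buckets: int) -> bytes:
    """Prefix the key with 'X-', where X is a bucket number from 1 to num_buckets.

    Assumption: X comes from a stable hash of the key, so the same key
    always lands in the same bucket.
    """
    bucket = zlib.crc32(key.encode("utf-8")) % num_buckets + 1
    return f"{bucket:0{salt_width(num_buckets)}d}-{key}".encode("utf-8")


def split_points(num_buckets: int) -> list[bytes]:
    """Split keys that would give one pre-created region per bucket."""
    width = salt_width(num_buckets)
    return [f"{b:0{width}d}-".encode("ascii") for b in range(2, num_buckets + 1)]


def bucket_scan_ranges(num_buckets: int) -> list[tuple[bytes, bytes]]:
    """One (start_row, stop_row) pair per bucket, with the stop row exclusive.

    '.' is the byte right after '-', so 'X.' sorts after every 'X-...' key.
    """
    width = salt_width(num_buckets)
    ranges = []
    for b in range(1, num_buckets + 1):
        prefix = f"{b:0{width}d}"
        ranges.append(((prefix + "-").encode("ascii"), (prefix + ".").encode("ascii")))
    return ranges


if __name__ == "__main__":
    print(salted_key("2011-02-15|some log line id", 4))
    print(split_points(4))
    print(bucket_scan_ranges(4))
```

The split points would be supplied when the table is created (the HBase admin API has a createTable variant that accepts split keys), and each scan thread would take one of the ranges. Whether this improves import throughput still depends on how the puts are batched and spread across servers, which the sketch does not address.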
