Dave,
It's a bit more complicated than that. What I can say is that I have a billion rows of data, and I want to pull a specific 100K rows from the table. The row keys are not contiguous, and you could say they are 'random', such that if I were to do a table scan, I'd have to scan the entire table (all regions).

Now, if I had a list of the 100K row keys, then from a single client I could just create 100 threads and grab rows from HBase one at a time in each thread. But in a m/r job I can't really do that. (I want to do processing on the data I get returned.) So given a List object with the row keys, how do I run a map/reduce job with this list as the starting point?

Sure, I could write the keys to HDFS and then do a m/r that reads from that file, setting my own splits to control parallelism (there's a rough sketch of that approach at the bottom of this mail). But I'm hoping for a more elegant solution. I know it's possible, but I haven't thought it through... I was hoping someone else had this solved already.

thx

> From: [email protected]
> To: [email protected]
> Date: Tue, 12 Oct 2010 08:35:25 -0700
> Subject: RE: Using external indexes in an HBase Map/Reduce job...
>
> Sorry, I am not clear on exactly what you are trying to accomplish here. I have a table roughly of that size, and it doesn't seem to cause me any trouble. I also have a few separate solr indexes for data in the table for query -- the solr query syntax is sufficient for my current needs. This setup allows me to do a few things efficiently:
> 1) batch processing of all records (e.g. tagging records that match a particular criterion)
> 2) search/lookup from a UI in an online manner
> 3) it is also fairly easy to insert a bunch of records (keeping track of their keys), and then run various batch processes only over those new records -- essentially doing what you suggest: create a file of keys and split the map task over that file.
>
> Dave
>
> -----Original Message-----
> From: Michael Segel [mailto:[email protected]]
> Sent: Tuesday, October 12, 2010 5:36 AM
> To: [email protected]
> Subject: Using external indexes in an HBase Map/Reduce job...
>
> Hi,
>
> Now I realize that most everyone is sitting in NY, while some of us can't leave our respective cities....
>
> I came across this problem and was wondering how others have solved it.
>
> Suppose you have a really large table with 1 billion rows of data. Since HBase doesn't really have any indexes built in (don't get me started about the contrib/transactional stuff...), you're forced to use some sort of external index, or roll your own index table.
>
> The net result is that you end up with a list object that contains your result set.
>
> So the question is... what's the best way to feed that list object in?
>
> One option I thought about is writing the object to a file, using that as the input file, and then controlling the splits. Not the most efficient, but it would work.
>
> I was trying to find a more 'elegant' solution, and I'm sure that anyone using SOLR or LUCENE or whatever... has come across this problem too.
>
> Any suggestions?
>
> Thx
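
P.S. In case it helps make the question concrete, here's a rough sketch of the "write the keys to HDFS" fallback I mentioned above. It assumes the 100K row keys have been written one per line to an HDFS file, and it uses NLineInputFormat as one way to control how many keys each mapper handles. The table name ("mytable"), the class names, and the 1,000-keys-per-split figure are just placeholders for illustration, not anything from my actual setup.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class KeyListJob {

  // Each map() call receives one line of the keys file (one row key),
  // does a point Get against HBase, and processes the Result in place.
  public static class KeyFetchMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
      table = new HTable(HBaseConfiguration.create(context.getConfiguration()),
          "mytable"); // placeholder table name
    }

    @Override
    protected void map(LongWritable offset, Text rowKey, Context context)
        throws IOException, InterruptedException {
      Result result = table.get(new Get(Bytes.toBytes(rowKey.toString().trim())));
      if (!result.isEmpty()) {
        // ... per-row processing goes here ...
        context.write(rowKey, new Text(Integer.toString(result.size())));
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      table.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "fetch-rows-from-key-list");
    job.setJarByClass(KeyListJob.class);
    job.setMapperClass(KeyFetchMapper.class);
    job.setNumReduceTasks(0);                               // map-only job
    job.setInputFormatClass(NLineInputFormat.class);        // split = N keys per mapper
    NLineInputFormat.setNumLinesPerSplit(job, 1000);        // ~100 mappers for 100K keys
    NLineInputFormat.addInputPath(job, new Path(args[0]));  // HDFS file of row keys
    job.setOutputFormatClass(NullOutputFormat.class);       // swap in a real output if needed
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With 1,000 keys per split, 100K keys comes out to roughly 100 map tasks, which is about the same fan-out as the 100-thread client approach, except the processing runs in the cluster instead of on one client. It works, but it still means materializing the List to HDFS first, which is the part I was hoping to avoid.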
