Dave,
It's a bit more complicated than that. What I can say is that I have a billion rows of data, and I want to pull a specific 100K rows from the table. The row keys are not contiguous, and you could say they are 'random', such that if I were to do a table scan, I'd have to scan the entire table (all regions).

Now, if I had a list of the 100K row keys, then from a single client I could just create 100 threads and grab rows from HBase one at a time in each thread. But in a m/r job I can't really do that. (I want to do processing on the data I get returned.) So given a List object with the row keys, how do I run a map/reduce job with this list as the starting point?

Sure, I could write the keys to HDFS and then do a m/r that reads from that file, setting my own splits to control parallelism (there's a rough sketch of that approach at the bottom of this mail). But I'm hoping for a more elegant solution. I know it's possible, but I haven't thought it through... I was hoping someone else had this solved already.

thx

> From: [email protected]
> To: [email protected]
> Date: Tue, 12 Oct 2010 08:35:25 -0700
> Subject: RE: Using external indexes in an HBase Map/Reduce job...
>
> Sorry, I am not clear on exactly what you are trying to accomplish here. I have a table roughly of that size, and it doesn't seem to cause me any trouble. I also have a few separate solr indexes for data in the table for query -- the solr query syntax is sufficient for my current needs. This setup allows me to do a few things efficiently:
> 1) batch processing of all records (e.g. tagging records that match a particular criterion)
> 2) search/lookup from a UI in an online manner
> 3) it is also fairly easy to insert a bunch of records (keeping track of their keys), and then run various batch processes only over those new records -- essentially doing what you suggest: create a file of keys and split the map task over that file.
>
> Dave
>
> -----Original Message-----
> From: Michael Segel [mailto:[email protected]]
> Sent: Tuesday, October 12, 2010 5:36 AM
> To: [email protected]
> Subject: Using external indexes in an HBase Map/Reduce job...
>
> Hi,
>
> Now I realize that most everyone is sitting in NY, while some of us can't leave our respective cities....
>
> I came across this problem and was wondering how others have solved it.
>
> Suppose you have a really large table with 1 billion rows of data. Since HBase doesn't really have any indexes built in (don't get me started about the contrib/transactional stuff...), you're forced to use some sort of external index, or roll your own index table.
>
> The net result is that you end up with a list object that contains your result set.
>
> So the question is... what's the best way to feed that list object in?
>
> One option I thought about is writing the object to a file, using that as the input file, and then controlling the splits. Not the most efficient, but it would work.
>
> I was trying to find a more 'elegant' solution, and I'm sure that anyone using SOLR or LUCENE or whatever... has come across this problem too.
>
> Any suggestions?
>
> Thx
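
P.S. In case it helps make the question concrete, here's a rough sketch of the "write the keys to HDFS" fallback I mentioned above. It assumes the 100K row keys have been written one per line to an HDFS file, and it uses NLineInputFormat as one way to control how many keys each mapper handles. The table name ("mytable"), the class names, and the 1,000-keys-per-split figure are just placeholders for illustration, not anything from my actual setup.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class KeyListJob {

  // Each map() call receives one line of the keys file (one row key),
  // does a point Get against HBase, and processes the Result in place.
  public static class KeyFetchMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
      table = new HTable(HBaseConfiguration.create(context.getConfiguration()),
          "mytable"); // placeholder table name
    }

    @Override
    protected void map(LongWritable offset, Text rowKey, Context context)
        throws IOException, InterruptedException {
      Result result = table.get(new Get(Bytes.toBytes(rowKey.toString().trim())));
      if (!result.isEmpty()) {
        // ... per-row processing goes here ...
        context.write(rowKey, new Text(Integer.toString(result.size())));
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      table.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "fetch-rows-from-key-list");
    job.setJarByClass(KeyListJob.class);
    job.setMapperClass(KeyFetchMapper.class);
    job.setNumReduceTasks(0);                               // map-only job
    job.setInputFormatClass(NLineInputFormat.class);        // split = N keys per mapper
    NLineInputFormat.setNumLinesPerSplit(job, 1000);        // ~100 mappers for 100K keys
    NLineInputFormat.addInputPath(job, new Path(args[0]));  // HDFS file of row keys
    job.setOutputFormatClass(NullOutputFormat.class);       // swap in a real output if needed
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With 1,000 keys per split, 100K keys comes out to roughly 100 map tasks, which is about the same fan-out as the 100-thread client approach, except the processing runs in the cluster instead of on one client. It works, but it still means materializing the List to HDFS first, which is the part I was hoping to avoid.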
