All,

Let me clarify ...

The data we ultimately want to process is in HBase.

The qualifiers I need to filter on are not part of the row key, so you would 
have to do a full table scan to get the data.
(A full table scan of 1 billion rows just to find a subset of 100K rows?)

So the idea is: what if I get the set of row keys that I want to process from 
an external source?
I'm not naming the source, because it's not important.

What I am looking at is that at the start of my program, I have a Java List 
object that contains the 100K row keys for the records I want to fetch.

So how can I write an m/r job that lets me split and fetch based on an object, 
and not a file or an HFile, as its input?


Let me give you a concrete but imaginary example...

I have combined all of the DMV vehicle registrations for all of the US.

I want to find all of the cars that are registered to somebody with the last 
name Smith.

Since the owner's last name isn't part of the row key, I have to do a full 
table scan.  (Not really efficient.)

Suppose I have an external index, and I get the matching row keys back in a List object.

Now I want to process the list in a m/r job.

So what's the best way to do it?

Can you use an object to feed input into an m/r job?  (And that's the key 
point I'm trying to solve.)
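
To make it concrete, here is roughly the shape I'm picturing.  This is a 
hypothetical sketch: RowKeyListInputFormat and FetchAndProcessMapper do not 
exist as far as I know (that's exactly the piece I'm asking about), and the 
key list is whatever my external index hands me.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

void runJob(Configuration conf, List<byte[]> rowKeys) throws Exception {
    Job job = new Job(conf, "fetch-by-key-list");

    // What I'd like: an InputFormat that takes the in-memory list, chops
    // it into N splits, and hands each mapper its chunk of keys so the
    // mapper can Get and process just those rows.
    job.setInputFormatClass(RowKeyListInputFormat.class);   // hypothetical
    RowKeyListInputFormat.setRowKeys(job, rowKeys);          // hypothetical

    job.setMapperClass(FetchAndProcessMapper.class);         // hypothetical
    job.waitForCompletion(true);
}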

Does that make sense?

-Mike

> Subject: Re: Using external indexes in an HBase Map/Reduce job...
> From: [email protected]
> Date: Tue, 12 Oct 2010 11:53:11 -0700
> To: [email protected]
> 
> I've been reading this thread, and I'm still not clear on what the problem 
> is.  I saw your original post, but was unclear then as well. 
> 
> Please correct me if I'm wrong.  It sounds like you want to run a M/R job on 
> some data that resides in a table in HBase.  But, since the table is so large 
> the M/R job would take a long time to process the entire table, so you want 
> to only process the relevant subset.  It also sounds like since you need M/R, 
> the relevant subset is too large to fit in memory and needs a distributed 
> solution.   Is this correct so far?
> 
> A solution exists: scan filters.  The individual region servers filter the 
> data. 
> 
> When setting up the M/R job, I use TableMapReduceUtil.initTableMapperJob.  
> That method takes a Scan object as an input.  The Scan object can have a 
> filter which is run on the individual region server to limit the data that 
> gets sent to the job.  I've written my own filters as well, which are quite 
> simple.  But, it is a bit of a pain because you have to make sure the custom 
> filter is in the classpath of the servers.  I've used it to randomly select a 
> subset of data from HBase for quick test runs of new M/R jobs.  
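> 
> Roughly like this (written from memory, so treat it as a sketch rather than 
> working code -- the filter, the mapper, and the table name are stand-ins for 
> my own):
> 
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> import org.apache.hadoop.mapreduce.Job;
> 
> void submit(Configuration conf) throws Exception {
>     Scan scan = new Scan();
>     // The filter runs on the region servers, so only matching rows ever
>     // reach the mappers.  RandomSubsetFilter is my own custom filter (it
>     // keeps ~1% of rows for test runs); you'd plug in one that knows
>     // about your 100K keys instead.
>     scan.setFilter(new RandomSubsetFilter(0.01f));
> 
>     Job job = new Job(conf, "scan-subset");
>     TableMapReduceUtil.initTableMapperJob(
>         "vehicle_registrations",        // source table (made up)
>         scan,                           // scan with the filter attached
>         MyMapper.class,                 // my TableMapper subclass
>         ImmutableBytesWritable.class,   // mapper output key class
>         Result.class,                   // mapper output value class
>         job);
>     job.waitForCompletion(true);
> }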
> 
> You might be able to use existing filters.  I recommend taking a look at the 
> RowFilter as a starting point.  I haven't used it, but it takes a 
> WritableByteArrayComparable which could possibly be extended to be based on a 
> bloom filter or a list.  
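> 
> Something along these lines, maybe.  I haven't actually done this, so it's 
> only a sketch: check WritableByteArrayComparable against your HBase version, 
> and a real version also needs write()/readFields() so the key set can ship 
> to the region servers.
> 
> import java.nio.ByteBuffer;
> import java.util.Collection;
> import java.util.HashSet;
> import java.util.Set;
> import org.apache.hadoop.hbase.filter.CompareFilter;
> import org.apache.hadoop.hbase.filter.RowFilter;
> import org.apache.hadoop.hbase.filter.WritableByteArrayComparable;
> 
> // Comparator that "matches" (returns 0) when the row key is in a known set.
> public class KeySetComparator extends WritableByteArrayComparable {
>     private Set<ByteBuffer> keys = new HashSet<ByteBuffer>();
> 
>     public KeySetComparator() {}   // no-arg constructor required for Writable
> 
>     public KeySetComparator(Collection<byte[]> wanted) {
>         for (byte[] k : wanted) {
>             keys.add(ByteBuffer.wrap(k));
>         }
>     }
> 
>     @Override
>     public int compareTo(byte[] rowKey) {
>         return keys.contains(ByteBuffer.wrap(rowKey)) ? 0 : 1;
>     }
> }
> 
> // With CompareOp.EQUAL, RowFilter keeps the rows where compareTo() returns 0:
> //   scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
> //                                new KeySetComparator(rowKeyList)));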
> 
> -Matthew
> 
> On Oct 12, 2010, at 10:55 AM, jason wrote:
> 
> >> What I can say is that I have a billion rows of data.
> >> I want to pull a specific 100K rows from the table.
> > 
> > Michael, I think I have exactly the same use case. Even the numbers are the 
> > same.
> > 
> > I posted a similar question a couple of weeks ago, but unfortunately
> > did not get a definite answer:
> > 
> > http://mail-archives.apache.org/mod_mbox/hbase-user/201009.mbox/%[email protected]%3e
> > 
> > So far, I decided to put HBase aside and experiment with Hadoop
> > directly using its BloomMapFile and its ability to quickly discard
> > files that do not contain requested keys.
> > This implies that I have to have a custom InputFormat for that, many
> > input map files, and a sorted list of the input keys.
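> > 
> > In case it is useful, the heart of what I am testing is just the bloom
> > check before the real lookup -- roughly like this (method names from
> > memory; I am assuming Text keys and BytesWritable values, and emit() is
> > my own hand-off to the mapper):
> > 
> > import java.io.IOException;
> > import java.util.List;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.BloomMapFile;
> > import org.apache.hadoop.io.BytesWritable;
> > import org.apache.hadoop.io.Text;
> > 
> > // Called per map file from my custom InputFormat/RecordReader.
> > void scanOneMapFile(FileSystem fs, Path mapFileDir, Configuration conf,
> >                     List<Text> sortedRequestedKeys) throws IOException {
> >     BloomMapFile.Reader reader =
> >         new BloomMapFile.Reader(fs, mapFileDir.toString(), conf);
> >     try {
> >         for (Text key : sortedRequestedKeys) {
> >             // The bloom check is in memory; false means the file definitely
> >             // does not contain the key, so we skip the disk seek entirely.
> >             if (reader.probablyHasKey(key)) {
> >                 BytesWritable value = new BytesWritable();
> >                 if (reader.get(key, value) != null) {
> >                     emit(key, value);   // hand off to the mapper (my own method)
> >                 }
> >             }
> >         }
> >     } finally {
> >         reader.close();
> >     }
> > }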
> > 
> > I do not have any performance numbers yet to compare this approach to
> > the full scan but I am writing tests as we speak.
> > 
> > Please keep me posted if you find a good solution for this problem in
> > general (M/R scanning through a random key subset either based on
> > HBase or Hadoop)
> > 
> > 
> > 
> > On 10/12/10, Michael Segel <[email protected]> wrote:
> >> 
> >> 
> >> Dave,
> >> 
> >> It's a bit more complicated than that.
> >> 
> >> What I can say is that I have a billion rows of data.
> >> I want to pull a specific 100K rows from the table.
> >> 
> >> The row keys are not contiguous, and you could say they are 'random', such
> >> that if I were to do a table scan, I'd have to scan the entire table (all
> >> regions).
> >> 
> >> Now, if I had a list of the 100K rows, then from a single client I could
> >> just create 100 threads and grab rows from HBase one at a time in each
> >> thread.
> >> 
> >> But in an m/r job, I can't really do that.  (I want to do processing on the
> >> data that gets returned.)
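> >> 
> >> (The single-client version is easy enough -- something like the sketch
> >> below, with a thread pool issuing Gets.  The chunking, the process() call
> >> and the table name are made up, and you may need to adjust the HTable
> >> constructor to your HBase version.  It's the m/r side I'm stuck on.)
> >> 
> >> import java.io.IOException;
> >> import java.util.List;
> >> import java.util.concurrent.ExecutorService;
> >> import java.util.concurrent.Executors;
> >> import org.apache.hadoop.conf.Configuration;
> >> import org.apache.hadoop.hbase.client.Get;
> >> import org.apache.hadoop.hbase.client.HTable;
> >> import org.apache.hadoop.hbase.client.Result;
> >> 
> >> void fetchFromClient(final Configuration conf, List<List<byte[]>> keyChunks)
> >>         throws InterruptedException {
> >>     // One chunk of row keys per worker thread; HTable isn't thread-safe,
> >>     // so each worker opens its own table handle.
> >>     ExecutorService pool = Executors.newFixedThreadPool(keyChunks.size());
> >>     for (final List<byte[]> chunk : keyChunks) {
> >>         pool.submit(new Runnable() {
> >>             public void run() {
> >>                 try {
> >>                     HTable table = new HTable(conf, "dmv_registrations");
> >>                     for (byte[] key : chunk) {
> >>                         Result row = table.get(new Get(key));  // point read by row key
> >>                         process(row);   // whatever per-row work I need (made up)
> >>                     }
> >>                 } catch (IOException e) {
> >>                     // log and move on
> >>                 }
> >>             }
> >>         });
> >>     }
> >>     pool.shutdown();
> >> }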
> >> 
> >> So, given a List object with the row keys, how do I do a map/reduce with
> >> this list as the starting point?
> >> 
> >> Sure, I could write the list to HDFS and then do an m/r job reading from the
> >> file, setting my own splits to control parallelism.
> >> But I'm hoping for a more elegant solution.
> >> 
> >> I know that it's possible, but I haven't thought it out... I was hoping
> >> someone else had already solved this.
> >> 
> >> thx
> >> 
> >>> From: [email protected]
> >>> To: [email protected]
> >>> Date: Tue, 12 Oct 2010 08:35:25 -0700
> >>> Subject: RE: Using external indexes in an HBase Map/Reduce job...
> >>> 
> >>> Sorry, I am not clear on exactly what you are trying to accomplish here.
> >>> I have a table roughly of that size, and it doesn't seem to cause me any
> >>> trouble.  I also have a few separate Solr indexes for data in the table
> >>> for query -- the Solr query syntax is sufficient for my current needs.
> >>> This setup allows me to do a few things efficiently:
> >>> 1) batch processing of all records (e.g. tagging records that match a
> >>> particular criteria)
> >>> 2) search/lookup from a UI in an online manner
> >>> 3) it is also fairly easy to insert a bunch of records (keeping track of
> >>> their keys), and then run various batch processes only over those new
> >>> records -- essentially doing what you suggest: create a file of keys and
> >>> split the map task over that file, roughly as sketched below.
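> >>> 
> >>> A minimal sketch of that last piece, assuming the old mapred API, one row
> >>> key per line in the HDFS file, and made-up driver/mapper classes (the
> >>> mapper just does a Get per key and processes the row):
> >>> 
> >>> import org.apache.hadoop.conf.Configuration;
> >>> import org.apache.hadoop.fs.Path;
> >>> import org.apache.hadoop.mapred.FileInputFormat;
> >>> import org.apache.hadoop.mapred.FileOutputFormat;
> >>> import org.apache.hadoop.mapred.JobClient;
> >>> import org.apache.hadoop.mapred.JobConf;
> >>> import org.apache.hadoop.mapred.lib.NLineInputFormat;
> >>> 
> >>> void runBatchOverNewKeys(Configuration conf) throws Exception {
> >>>     JobConf job = new JobConf(conf, NewRecordBatch.class);   // made-up driver class
> >>> 
> >>>     // The keys of the newly inserted records, one row key per line.
> >>>     FileInputFormat.setInputPaths(job, new Path("/tmp/new-record-keys"));
> >>>     // NLineInputFormat makes one split per N lines, which is how I control
> >>>     // how many map tasks share the key file.
> >>>     job.setInputFormat(NLineInputFormat.class);
> >>>     job.setInt("mapred.line.input.format.linespermap", 1000);
> >>> 
> >>>     job.setMapperClass(KeyLookupMapper.class);   // made up: Gets each row, processes it
> >>>     job.setNumReduceTasks(0);                    // map-only batch
> >>>     FileOutputFormat.setOutputPath(job, new Path("/tmp/new-record-keys-out"));
> >>> 
> >>>     JobClient.runJob(job);
> >>> }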
> >>> 
> >>> Dave
> >>> 
> >>> 
> >>> -----Original Message-----
> >>> From: Michael Segel [mailto:[email protected]]
> >>> Sent: Tuesday, October 12, 2010 5:36 AM
> >>> To: [email protected]
> >>> Subject: Using external indexes in an HBase Map/Reduce job...
> >>> 
> >>> 
> >>> Hi,
> >>> 
> >>> Now I realize that most everyone is sitting in NY, while some of us can't
> >>> leave our respective cities....
> >>> 
> >>> Came across this problem and I was wondering how others solved it.
> >>> 
> >>> Suppose you have a really large table with 1 billion rows of data.
> >>> Since HBase really doesn't have any indexes built in (Don't get me started
> >>> about the contrib/transactional stuff...), you're forced to use some sort
> >>> of external index, or roll your own index table.
> >>> 
> >>> The net result is that you end up with a list object that contains your
> >>> result set.
> >>> 
> >>> So the question is... what's the best way to feed the list object in?
> >>> 
> >>> One option I thought about is writing the object to a file, using that
> >>> file as the input, and then controlling the splits. Not the most efficient
> >>> approach, but it would work.
> >>> 
> >>> I was trying to find a more 'elegant' solution, and I'm sure that anyone
> >>> using SOLR or LUCENE or whatever... has come across this problem too.
> >>> 
> >>> Any suggestions?
> >>> 
> >>> Thx
> >>> 
> >> 
> 