Matthew,

You've finally figured out the problem.

And since the data I ultimately want to get resides in HBase... it's an 
HBase problem.

Were the list of keys in a file sitting on HDFS, it'd be a simple m/r problem. You 
have a file reader and you set the number of splits.
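
For example, something like this is what I have in mind -- a rough sketch only,
written against the newer Hadoop/HBase client APIs, with a made-up table name
and key-file path:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class KeyFileFetchJob {

  // Each mapper gets a slice of the key file (one row key per line),
  // does a Get against the data table, and processes the Result.
  public static class FetchMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Text> {

    private HTable table;

    @Override
    protected void setup(Context ctx) throws IOException {
      table = new HTable(HBaseConfiguration.create(ctx.getConfiguration()),
          "vehicle_registrations");           // made-up table name
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      byte[] rowKey = Bytes.toBytes(line.toString().trim());
      Result r = table.get(new Get(rowKey));
      if (!r.isEmpty()) {
        // real processing would go here; just emit the key for now
        ctx.write(new ImmutableBytesWritable(rowKey), new Text("found"));
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
      table.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "fetch-by-key-file");
    job.setJarByClass(KeyFileFetchJob.class);
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Text.class);

    // NLineInputFormat is what gives you control over the splits:
    // here, 1000 keys per mapper.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000);
    FileInputFormat.addInputPath(job, new Path("/tmp/smith_row_keys.txt"));

    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}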

If the index were an HBase table, you'd just scan the index and use the HTable 
input format to drive the map/reduce.
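
That case would look something like this (again just a sketch -- the index
table name, the column family/qualifier, and the key range are all invented):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class IndexDrivenFetchJob {

  // Scan the index table; each index row carries the data-table row key,
  // which the mapper uses to Get the real row.
  public static class IndexMapper
      extends TableMapper<ImmutableBytesWritable, Text> {

    private HTable dataTable;

    @Override
    protected void setup(Context ctx) throws IOException {
      dataTable = new HTable(HBaseConfiguration.create(ctx.getConfiguration()),
          "vehicle_registrations");             // made up
    }

    @Override
    protected void map(ImmutableBytesWritable indexRow, Result indexResult,
        Context ctx) throws IOException, InterruptedException {
      byte[] dataKey = indexResult.getValue(Bytes.toBytes("ref"),     // made-up family
                                            Bytes.toBytes("rowkey")); // made-up qualifier
      if (dataKey == null) {
        return;
      }
      Result dataRow = dataTable.get(new Get(dataKey));
      if (!dataRow.isEmpty()) {
        ctx.write(new ImmutableBytesWritable(dataKey), new Text("found"));
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
      dataTable.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "index-driven-fetch");
    job.setJarByClass(IndexDrivenFetchJob.class);

    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("smith"));  // index rows keyed by last name, say
    scan.setStopRow(Bytes.toBytes("smith~"));
    scan.setCaching(500);

    TableMapReduceUtil.initTableMapperJob("name_index", scan, IndexMapper.class,
        ImmutableBytesWritable.class, Text.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}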

My point was that there isn't a way to take an in-memory object and use it to drive 
the m/r.

And yes, you've come to the same conclusion I came to before writing the 
question.

As to whether it's worth it? Yes, because right now there isn't a good indexing 
solution for HBase when it comes to map/reduce.

I don't think I'm the first one to think about it....

Thx
-Mike

> Subject: Re: Using external indexes in an HBase Map/Reduce job...
> From: [email protected]
> Date: Tue, 12 Oct 2010 13:57:54 -0700
> To: [email protected]
> 
> Michael, 
> 
>        This is really more of an M/R question than an HBase question...
> 
>        The problem is that the other nodes in the cluster don't have access 
> to the memory of the node that has the Java object.  You'll need to copy it 
> to some other store that other nodes can read (or create your own 
> infrastructure that lets other nodes get the data from the object's node - not 
> recommended).  If you are running HBase, then you have at least three shared 
> stores available to you: DFS, HBase, and ZooKeeper.  In order for M/R to use 
> the data, there needs to be an InputFormat that knows how to read it.  I know 
> of existing input formats that support two of the three: DFS and HBase.  You 
> could write your own, but it would be more trouble than it is worth.  It is 
> probably best to write the data to one of those two, and have the M/R job read 
> that.  
> 
>        You've probably seen examples that let you pass objects to mappers and 
> reducers using the job configuration (org.apache.hadoop.conf.Configuration).  
> This is meant for configuration items (hence the name) and not large data 
> objects.  You could pass the object this way, but there still needs to be 
> some input data for mappers to be started up.  So, it is possible to have a 
> dummy file that sends data to the mappers.  Once the mapper is started, it 
> can disregard the input data, read the object from the configuration, and 
> then self-select which items in the list to process based on its own 
> identity, or perhaps even the input data.  While it is possible, I don't 
> recommend it. 
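> 
>        In code, the shape of it would be roughly this (an untested sketch -- 
> the property names are made up, and as I said, I don't recommend it for 100K 
> keys):
> 
> import java.io.IOException;
> import java.util.List;
> 
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.Mapper;
> 
> public class ConfListExample {
> 
>   // Driver side: serialize the key list into the job configuration.
>   // ("job.rowkeys" and "job.num.slices" are property names I just made up.)
>   public static void stuffKeysIntoConf(Job job, List<String> rowKeys,
>       int numSlices) {
>     StringBuilder sb = new StringBuilder();
>     for (String k : rowKeys) {
>       if (sb.length() > 0) sb.append(',');
>       sb.append(k);
>     }
>     job.getConfiguration().set("job.rowkeys", sb.toString());
>     job.getConfiguration().setInt("job.num.slices", numSlices);
>   }
> 
>   // Mapper side: ignore the dummy input line, re-read the list from the
>   // configuration, and self-select a slice based on this task's id.
>   // Assumes the dummy input file produces exactly numSlices map tasks
>   // (e.g. numSlices lines fed through NLineInputFormat at one line per split).
>   public static class SelfSelectMapper
>       extends Mapper<LongWritable, Text, Text, Text> {
>     @Override
>     protected void map(LongWritable offset, Text ignored, Context ctx)
>         throws IOException, InterruptedException {
>       String[] keys = ctx.getConfiguration().get("job.rowkeys").split(",");
>       int slice = ctx.getTaskAttemptID().getTaskID().getId();
>       int numSlices = ctx.getConfiguration().getInt("job.num.slices", 1);
>       for (int i = slice; i < keys.length; i += numSlices) {
>         // fetch keys[i] from HBase and process it here
>       }
>     }
>   }
> }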
> 
> Good luck,
> 
> Matthew
> 
> 
> On Oct 12, 2010, at 12:53 PM, Michael Segel wrote:
> 
> > 
> > 
> > All,
> > 
> > Let me clarify ...
> > 
> > The ultimate data we want to process is in HBase.
> > 
> > The qualifiers I need to select on are not part of the row key, so you would 
> > have to do a full table scan to get the data.
> > (A full table scan of 1 billion rows just to find a subset of 100K rows?)
> > 
> > So the idea is: what if I got the set of row keys that I want to process 
> > from an external source?
> > I don't mention the source because it's not important.
> > 
> > What I am looking at is that at the start of my program, I have a Java 
> > List object that contains the 100K row keys for the records I want to 
> > fetch.
> > 
> > So how can I write an m/r that allows me to split and fetch based on an 
> > object, and not a file or an HFile, as input?
> > 
> > 
> > Let me give you a concrete but imaginary example...
> > 
> > I have combined all of the DMV vehicle registrations for all of the US.
> > 
> > I want to find all of the cars that are registered to somebody with the 
> > last name Smith.
> > 
> > Since the owner's last name isn't part of the row key, I have to do a full 
> > table scan.  (Not really efficient.)
> > 
> > Suppose I have an external index. I get the list of row keys in a List 
> > object.
> > 
> > Now I want to process the list in a m/r job.
> > 
> > So what's the best way to do it?
> > 
> > Can you use an object to feed into an m/r job?  (And that's the key point 
> > I'm trying to solve.)
> > 
> > Does that make sense?
> > 
> > -Mike
> > 
> >> Subject: Re: Using external indexes in an HBase Map/Reduce job...
> >> From: [email protected]
> >> Date: Tue, 12 Oct 2010 11:53:11 -0700
> >> To: [email protected]
> >> 
> >> I've been reading this thread, and I'm still not clear on what the problem 
> >> is.  I saw your original post, but was unclear then as well. 
> >> 
> >> Please correct me if I'm wrong.  It sounds like you want to run a M/R job 
> >> on some data that resides in a table in HBase.  But, since the table is so 
> >> large the M/R job would take a long time to process the entire table, so 
> >> you want to only process the relevant subset.  It also sounds like since 
> >> you need M/R, the relevant subset is too large to fit in memory and needs 
> >> a distributed solution.   Is this correct so far?
> >> 
> >> A solution exists: scan filters.  The individual region servers filter the 
> >> data. 
> >> 
> >> When setting up the M/R job, I use TableMapReduceUtil.initTableMapperJob.  
> >> That method takes a Scan object as an input.  The Scan object can have a 
> >> filter which is run on the individual region server to limit the data that 
> >> gets sent to the job.  I've written my own filters as well, which are 
> >> quite simple.  But, it is a bit of a pain because you have to make sure 
> >> the custom filter is in the classpath of the servers.  I've used it to 
> >> randomly select a subset of data from HBase for quick test runs of new M/R 
> >> jobs.  
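> >> 
> >> The setup looks roughly like this (a sketch; PrefixFilter is just a 
> >> stand-in to show where your own filter would plug in, and remember a 
> >> custom filter's jar has to be on the region servers' classpath):
> >> 
> >> import org.apache.hadoop.conf.Configuration;
> >> import org.apache.hadoop.hbase.HBaseConfiguration;
> >> import org.apache.hadoop.hbase.client.Result;
> >> import org.apache.hadoop.hbase.client.Scan;
> >> import org.apache.hadoop.hbase.filter.PrefixFilter;
> >> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> >> import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
> >> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> >> import org.apache.hadoop.hbase.util.Bytes;
> >> import org.apache.hadoop.mapreduce.Job;
> >> import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
> >> 
> >> public class FilteredScanJob {
> >>   public static void main(String[] args) throws Exception {
> >>     Configuration conf = HBaseConfiguration.create();
> >>     Job job = new Job(conf, "filtered-scan");
> >>     job.setJarByClass(FilteredScanJob.class);
> >> 
> >>     Scan scan = new Scan();
> >>     scan.setCaching(500);
> >>     // The filter runs on the region servers, so only matching rows ever
> >>     // come back to the mappers.  Swap in your own Filter implementation here.
> >>     scan.setFilter(new PrefixFilter(Bytes.toBytes("smith")));
> >> 
> >>     TableMapReduceUtil.initTableMapperJob("vehicle_registrations", scan,
> >>         IdentityTableMapper.class, ImmutableBytesWritable.class,
> >>         Result.class, job);
> >>     job.setNumReduceTasks(0);
> >>     job.setOutputFormatClass(NullOutputFormat.class);
> >>     System.exit(job.waitForCompletion(true) ? 0 : 1);
> >>   }
> >> }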
> >> 
> >> You might be able to use existing filters.  I recommend taking a look at 
> >> the RowFilter as a starting point.  I haven't used it, but it takes a 
> >> WritableByteArrayComparable which could possibly be extended to be based 
> >> on a bloom filter or a list.  
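> >> 
> >> Something like this, as an untested sketch against the 0.90-era filter API 
> >> (the comparator class, like a custom filter, has to be on the region 
> >> servers' classpath, and shipping 100K keys to every region server this way 
> >> may or may not be a good idea):
> >> 
> >> import java.io.DataInput;
> >> import java.io.DataOutput;
> >> import java.io.IOException;
> >> import java.util.HashSet;
> >> import java.util.Set;
> >> 
> >> import org.apache.hadoop.hbase.filter.WritableByteArrayComparable;
> >> import org.apache.hadoop.hbase.util.Bytes;
> >> 
> >> // A comparator that "equals" any row key in a given set.  Pair it with:
> >> //   new RowFilter(CompareFilter.CompareOp.EQUAL, new SetComparator(keys))
> >> public class SetComparator extends WritableByteArrayComparable {
> >> 
> >>   private Set<String> keys = new HashSet<String>();
> >> 
> >>   public SetComparator() {}              // required for Writable
> >> 
> >>   public SetComparator(Set<String> keys) {
> >>     this.keys = keys;
> >>   }
> >> 
> >>   @Override
> >>   public int compareTo(byte[] rowKey) {
> >>     // 0 means "equal", so CompareOp.EQUAL keeps rows whose key is in the set.
> >>     return keys.contains(Bytes.toString(rowKey)) ? 0 : 1;
> >>   }
> >> 
> >>   @Override
> >>   public void write(DataOutput out) throws IOException {
> >>     out.writeInt(keys.size());
> >>     for (String k : keys) {
> >>       out.writeUTF(k);
> >>     }
> >>   }
> >> 
> >>   @Override
> >>   public void readFields(DataInput in) throws IOException {
> >>     int n = in.readInt();
> >>     keys = new HashSet<String>(n);
> >>     for (int i = 0; i < n; i++) {
> >>       keys.add(in.readUTF());
> >>     }
> >>   }
> >> }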
> >> 
> >> -Matthew
> >> 
> >> On Oct 12, 2010, at 10:55 AM, jason wrote:
> >> 
> >>>> What I can say is that I have a billion rows of data.
> >>>> I want to pull a specific 100K rows from the table.
> >>> 
> >>> Michael, I think I have exactly the same use case. Even the numbers are 
> >>> the same.
> >>> 
> >>> I posted a similar question a couple of weeks ago, but unfortunately
> >>> did not get a definite answer:
> >>> 
> >>> http://mail-archives.apache.org/mod_mbox/hbase-user/201009.mbox/%[email protected]%3e
> >>> 
> >>> So far, I decided to put HBase aside and experiment with Hadoop
> >>> directly using its BloomMapFile and its ability to quickly discard
> >>> files that do not contain requested keys.
> >>> This implies that I have to have a custom InputFormat for that, many
> >>> input map files, and a sorted list of input keys.
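> >>>
> >>> The reader side of that looks roughly like this (a sketch; Text keys and
> >>> the path handling are just placeholders):
> >>>
> >>> import org.apache.hadoop.conf.Configuration;
> >>> import org.apache.hadoop.fs.FileSystem;
> >>> import org.apache.hadoop.io.BloomMapFile;
> >>> import org.apache.hadoop.io.Text;
> >>>
> >>> public class BloomLookup {
> >>>   // The bloom filter lets us skip a map file cheaply when it cannot
> >>>   // contain the key; only then do we pay for the real lookup.
> >>>   public static Text lookup(Configuration conf, String mapFileDir, Text key)
> >>>       throws Exception {
> >>>     FileSystem fs = FileSystem.get(conf);
> >>>     BloomMapFile.Reader reader = new BloomMapFile.Reader(fs, mapFileDir, conf);
> >>>     try {
> >>>       if (!reader.probablyHasKey(key)) {
> >>>         return null;                         // definitely not in this file
> >>>       }
> >>>       Text value = new Text();
> >>>       return (Text) reader.get(key, value);  // may still be null (false positive)
> >>>     } finally {
> >>>       reader.close();
> >>>     }
> >>>   }
> >>> }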
> >>> 
> >>> I do not have any performance numbers yet to compare this approach to
> >>> the full scan but I am writing tests as we speak.
> >>> 
> >>> Please keep me posted if you find a good solution for this problem in
> >>> general (M/R scanning through a random key subset, based on either
> >>> HBase or Hadoop).
> >>> 
> >>> 
> >>> 
> >>> On 10/12/10, Michael Segel <[email protected]> wrote:
> >>>> 
> >>>> 
> >>>> Dave,
> >>>> 
> >>>> It's a bit more complicated than that.
> >>>> 
> >>>> What I can say is that I have a billion rows of data.
> >>>> I want to pull a specific 100K rows from the table.
> >>>> 
> >>>> The row keys are not contiguous and you could say they are 'random', such
> >>>> that if I were to do a table scan, I'd have to scan the entire table (all
> >>>> regions).
> >>>> 
> >>>> Now, if I had a list of the 100K rows, from a single client I could just
> >>>> create 100 threads and grab rows from HBase one at a time in each thread.
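> >>>>
> >>>> (Roughly this, just to be clear about what I mean on the client side -- a
> >>>> sketch with a made-up table name; each worker gets its own HTable since
> >>>> HTable isn't thread safe:)
> >>>>
> >>>> import java.util.List;
> >>>> import java.util.concurrent.ExecutorService;
> >>>> import java.util.concurrent.Executors;
> >>>> import java.util.concurrent.TimeUnit;
> >>>>
> >>>> import org.apache.hadoop.conf.Configuration;
> >>>> import org.apache.hadoop.hbase.HBaseConfiguration;
> >>>> import org.apache.hadoop.hbase.client.Get;
> >>>> import org.apache.hadoop.hbase.client.HTable;
> >>>> import org.apache.hadoop.hbase.client.Result;
> >>>> import org.apache.hadoop.hbase.util.Bytes;
> >>>>
> >>>> public class ThreadedFetch {
> >>>>   public static void fetchAll(List<String> rowKeys) throws Exception {
> >>>>     final Configuration conf = HBaseConfiguration.create();
> >>>>     int numThreads = 100;
> >>>>     ExecutorService pool = Executors.newFixedThreadPool(numThreads);
> >>>>     int chunk = (rowKeys.size() + numThreads - 1) / numThreads;
> >>>>     for (int i = 0; i < rowKeys.size(); i += chunk) {
> >>>>       final List<String> slice =
> >>>>           rowKeys.subList(i, Math.min(i + chunk, rowKeys.size()));
> >>>>       pool.submit(new Runnable() {
> >>>>         public void run() {
> >>>>           try {
> >>>>             // one HTable per worker thread
> >>>>             HTable table = new HTable(conf, "my_table");  // made up
> >>>>             try {
> >>>>               for (String key : slice) {
> >>>>                 Result r = table.get(new Get(Bytes.toBytes(key)));
> >>>>                 // ... process r ...
> >>>>               }
> >>>>             } finally {
> >>>>               table.close();
> >>>>             }
> >>>>           } catch (Exception e) {
> >>>>             e.printStackTrace();
> >>>>           }
> >>>>         }
> >>>>       });
> >>>>     }
> >>>>     pool.shutdown();
> >>>>     pool.awaitTermination(1, TimeUnit.HOURS);
> >>>>   }
> >>>> }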
> >>>> 
> >>>> But in an m/r, I can't really do that.  (I want to do processing on the
> >>>> data that gets returned.)
> >>>> 
> >>>> So given a List object with the row keys, how do I do a map/reduce with
> >>>> this list as the starting point?
> >>>> 
> >>>> Sure, I could write it to HDFS and then do an m/r reading from the file,
> >>>> setting my own splits to control parallelism.
> >>>> But I'm hoping for a more elegant solution.
> >>>> 
> >>>> I know that it's possible, but I haven't thought it out... I was hoping
> >>>> someone else had this solved.
> >>>> 
> >>>> thx
> >>>> 
> >>>>> From: [email protected]
> >>>>> To: [email protected]
> >>>>> Date: Tue, 12 Oct 2010 08:35:25 -0700
> >>>>> Subject: RE: Using external indexes in an HBase Map/Reduce job...
> >>>>> 
> >>>>> Sorry, I am not clear on exactly what you are trying to accomplish here.
> >>>>> I have a table roughly of that size, and it doesn't seem to cause me any
> >>>>> trouble.  I also have a few separate solr indexes for data in the table
> >>>>> for query -- the solr query syntax is sufficient for my current needs.
> >>>>> This setup allows me to do three things efficiently:
> >>>>> 1) batch processing of all records (e.g. tagging records that match a
> >>>>> particular criteria)
> >>>>> 2) search/lookup from a UI in an online manner
> >>>>> 3) it is also fairly easy to insert a bunch of records (keeping track of
> >>>>> their keys), and then run various batch processes only over those new
> >>>>> records -- essentially doing what you suggest: create a file of keys and
> >>>>> split the map task over that file.
> >>>>> 
> >>>>> Dave
> >>>>> 
> >>>>> 
> >>>>> -----Original Message-----
> >>>>> From: Michael Segel [mailto:[email protected]]
> >>>>> Sent: Tuesday, October 12, 2010 5:36 AM
> >>>>> To: [email protected]
> >>>>> Subject: Using external indexes in an HBase Map/Reduce job...
> >>>>> 
> >>>>> 
> >>>>> Hi,
> >>>>> 
> >>>>> Now I realize that most everyone is sitting in NY, while some of us 
> >>>>> can't
> >>>>> leave our respective cities....
> >>>>> 
> >>>>> Came across this problem and I was wondering how others solved it.
> >>>>> 
> >>>>> Suppose you have a really large table with 1 billion rows of data.
> >>>>> Since HBase really doesn't have any indexes built in (Don't get me 
> >>>>> started
> >>>>> about the contrib/transactional stuff...), you're forced to use some 
> >>>>> sort
> >>>>> of external index, or roll your own index table.
> >>>>> 
> >>>>> The net result is that you end up with a list object that contains your
> >>>>> result set.
> >>>>> 
> >>>>> So the question is... what's the best way to feed the list object in?
> >>>>> 
> >>>>> One option I thought about is writing the object to a file, using that
> >>>>> as the input file, and then controlling the splits. Not the most
> >>>>> efficient, but it would work.
> >>>>> 
> >>>>> I was trying to find a more 'elegant' solution, and I'm sure that anyone
> >>>>> using Solr or Lucene or whatever... has come across this problem too.
> >>>>> 
> >>>>> Any suggestions?
> >>>>> 
> >>>>> Thx
> >>>>> 
> >>>> 
> >> 
> 