All,
Let me clarify... The ultimate data we want to process is in HBase. The data qualifiers are not part of the row key, so you would have to do a full table scan to get the data. (A full table scan of 1 billion rows just to find a subset of 100K rows?)

So the idea is: what if I get the set of row keys that I want to process from an external source? I don't mention the source, because it's not important. What I am looking at is that at the start of my program I have a Java List object that contains the 100K keys for the records I want to fetch. So how can I write a M/R job that splits and fetches based on an object, and not a file or an HFile, as its input?

Let me give you a concrete but imaginary example... I have combined all of the DMV vehicle registrations for all of the US. I want to find all of the cars that are registered to somebody with the last name Smith. Since the owner's last name isn't part of the row key, I have to do a full table scan. (Not really efficient.) Suppose I have an external index, and I get the list of row keys back in a List object. Now I want to process that list in a M/R job.

So what's the best way to do it? Can you use an object to feed into a M/R job? (And that's the key point I'm trying to solve.)

Does that make sense?

-Mike


> Subject: Re: Using external indexes in an HBase Map/Reduce job...
> From: [email protected]
> Date: Tue, 12 Oct 2010 11:53:11 -0700
> To: [email protected]
>
> I've been reading this thread, and I'm still not clear on what the problem is. I saw your original post, but was unclear then as well.
>
> Please correct me if I'm wrong. It sounds like you want to run a M/R job on some data that resides in a table in HBase. But, since the table is so large, the M/R job would take a long time to process the entire table, so you want to only process the relevant subset. It also sounds like since you need M/R, the relevant subset is too large to fit in memory and needs a distributed solution. Is this correct so far?
>
> A solution exists: scan filters. The individual region servers filter the data.
>
> When setting up the M/R job, I use TableMapReduceUtil.initTableMapperJob. That method takes a Scan object as an input. The Scan object can have a filter which is run on the individual region server to limit the data that gets sent to the job. I've written my own filters as well, which are quite simple. But, it is a bit of a pain because you have to make sure the custom filter is in the classpath of the servers. I've used it to randomly select a subset of data from HBase for quick test runs of new M/R jobs.
>
> You might be able to use existing filters. I recommend taking a look at the RowFilter as a starting point. I haven't used it, but it takes a WritableByteArrayComparable which could possibly be extended to be based on a bloom filter or a list.
>
> -Matthew
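
For anyone hitting this thread in the archives, here is roughly how I read Matthew's suggestion. This is an untested sketch against a recent client API: the "vehicles" table, the SmithMapper class and the regex are made up for my DMV example, and the RegexStringComparator is only a stand-in -- for a 100K-key list you'd have to write a custom list- or bloom-backed comparator and get that jar onto every region server's classpath, exactly the pain Matthew mentions.

// Untested sketch of the filter route: the Scan carries a RowFilter, so the
// row selection happens on the region servers before anything reaches the job.
// "vehicles", SmithMapper and the regex are invented for the example; the
// RegexStringComparator is only a stand-in for a custom comparator.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FilteredScanJob {

  /** Sees only the rows that survived the server-side filter. */
  static class SmithMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(row, value);   // placeholder: real processing goes here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "filtered-scan");
    job.setJarByClass(FilteredScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);   // fetch rows in bigger batches per scanner RPC

    // Stand-in filter: keep only rows whose key starts with "IL".
    // A real solution would plug a custom list- or bloom-backed comparator in here.
    scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
        new RegexStringComparator("^IL.*")));

    TableMapReduceUtil.initTableMapperJob("vehicles", scan, SmithMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class); // sketch discards output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}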
> On Oct 12, 2010, at 10:55 AM, jason wrote:
>
> >> What I can say is that I have a billion rows of data.
> >> I want to pull a specific 100K rows from the table.
> >
> > Michael, I think I have exactly the same use case. Even the numbers are the same.
> >
> > I posted a similar question a couple of weeks ago, but unfortunately did not get a definite answer:
> >
> > http://mail-archives.apache.org/mod_mbox/hbase-user/201009.mbox/%[email protected]%3e
> >
> > So far, I decided to put HBase aside and experiment with Hadoop directly, using its BloomMapFile and its ability to quickly discard files that do not contain requested keys. This implies that I have to have a custom InputFormat for that, many input map files, and a sorted list of input keys.
> >
> > I do not have any performance numbers yet to compare this approach to the full scan, but I am writing tests as we speak.
> >
> > Please keep me posted if you find a good solution for this problem in general (M/R scanning through a random key subset, based either on HBase or on Hadoop).
> >
> > On 10/12/10, Michael Segel <[email protected]> wrote:
> >>
> >> Dave,
> >>
> >> It's a bit more complicated than that.
> >>
> >> What I can say is that I have a billion rows of data. I want to pull a specific 100K rows from the table.
> >>
> >> The row keys are not contiguous and you could say they are 'random', such that if I were to do a table scan, I'd have to scan the entire table (all regions).
> >>
> >> Now, if I had a list of the 100K rows, from a single client I could just create 100 threads and grab rows from HBase one at a time in each thread.
> >>
> >> But in a M/R job I can't really do that. (I want to do processing on the data I get returned.)
> >>
> >> So given a List object with the row keys, how do I do a map/reduce with this list as the starting point?
> >>
> >> Sure, I could write it to HDFS and then do a M/R job reading from the file, setting my own splits to control parallelism. But I'm hoping for a more elegant solution.
> >>
> >> I know that it's possible, but I haven't thought it out... Was hoping someone else had this solved.
> >>
> >> thx
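
Since I bring up the 100-thread client above: that baseline looks something like the untested sketch below -- a thread pool doing plain Gets, one HTable per worker because HTable isn't thread-safe, table name made up. It works from a single box, but it doesn't give me processing co-located with the data, which is why I keep coming back to M/R.

// Untested sketch of the single-client baseline: a fixed thread pool doing
// straight Gets. The "vehicles" table name is invented; each thread opens its
// own HTable since HTable instances are not thread-safe.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelGetClient {

  public static void fetch(final List<String> rowKeys, int threads) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    final int chunk = (rowKeys.size() + threads - 1) / threads;

    for (int t = 0; t < threads; t++) {
      final int start = t * chunk;
      final int end = Math.min(start + chunk, rowKeys.size());
      pool.submit(new Runnable() {
        public void run() {
          try {
            HTable table = new HTable(conf, "vehicles"); // one HTable per thread
            for (int i = start; i < end; i++) {
              Result r = table.get(new Get(Bytes.toBytes(rowKeys.get(i))));
              // process r here
            }
            table.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}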
> >>> From: [email protected]
> >>> To: [email protected]
> >>> Date: Tue, 12 Oct 2010 08:35:25 -0700
> >>> Subject: RE: Using external indexes in an HBase Map/Reduce job...
> >>>
> >>> Sorry, I am not clear on exactly what you are trying to accomplish here. I have a table roughly of that size, and it doesn't seem to cause me any trouble. I also have a few separate solr indexes for data in the table for query -- the solr query syntax is sufficient for my current needs. This setup allows me to do a few things efficiently:
> >>> 1) batch processing of all records (e.g. tagging records that match a particular criteria)
> >>> 2) search/lookup from a UI in an online manner
> >>> 3) it is also fairly easy to insert a bunch of records (keeping track of their keys), and then run various batch processes only over those new records -- essentially doing what you suggest: create a file of keys and split the map task over that file.
> >>>
> >>> Dave
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Michael Segel [mailto:[email protected]]
> >>> Sent: Tuesday, October 12, 2010 5:36 AM
> >>> To: [email protected]
> >>> Subject: Using external indexes in an HBase Map/Reduce job...
> >>>
> >>> Hi,
> >>>
> >>> Now I realize that most everyone is sitting in NY, while some of us can't leave our respective cities....
> >>>
> >>> Came across this problem and I was wondering how others have solved it.
> >>>
> >>> Suppose you have a really large table with 1 billion rows of data. Since HBase really doesn't have any indexes built in (don't get me started about the contrib/transactional stuff...), you're forced to use some sort of external index, or roll your own index table.
> >>>
> >>> The net result is that you end up with a list object that contains your result set.
> >>>
> >>> So the question is... what's the best way to feed that list object in?
> >>>
> >>> One option I thought about is writing the object to a file, using that as the input file, and then controlling the splits myself. Not the most efficient, but it would work.
> >>>
> >>> Was trying to find a more 'elegant' solution, and I'm sure that anyone using SOLR or LUCENE or whatever... has come across this problem too.
> >>>
> >>> Any suggestions?
> >>>
> >>> Thx
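
And for completeness, here is what the file-of-keys straw man from my original post (and Dave's point 3) could look like end to end: spill the List to HDFS, let NLineInputFormat carve the key file into splits, and do the Gets in the mapper. Untested sketch -- the paths, the "vehicles" table and the "info:owner" column are invented for the DMV example, and it assumes a Hadoop build that ships NLineInputFormat for the new mapreduce API.

// Untested sketch of the "dump the keys to HDFS, map over the key file, Get
// each row in the mapper" approach. Paths, table and column names are made up.
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyListFetchJob {

  /** Mapper input: one row key per line. Each map task does point Gets. */
  static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private HTable table;

    @Override
    protected void setup(Context ctx) throws IOException {
      table = new HTable(ctx.getConfiguration(), "vehicles");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      Result r = table.get(new Get(Bytes.toBytes(line.toString().trim())));
      if (!r.isEmpty()) {
        byte[] owner = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("owner"));
        // Real processing goes here; this just echoes key -> owner.
        ctx.write(line, new Text(owner == null ? "" : Bytes.toString(owner)));
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
      table.close();
    }
  }

  /** Spill the in-memory key list to HDFS so the M/R job has something to split. */
  static Path writeKeys(Configuration conf, List<String> rowKeys) throws IOException {
    Path keyFile = new Path("/tmp/rowkeys.txt");        // made-up location
    FSDataOutputStream out = FileSystem.get(conf).create(keyFile, true);
    for (String k : rowKeys) {
      out.write(Bytes.toBytes(k + "\n"));
    }
    out.close();
    return keyFile;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // rowKeys would come from the external index (Lucene/Solr/whatever).
    List<String> rowKeys = Arrays.asList("IL-123", "WI-456");
    Path keyFile = writeKeys(conf, rowKeys);

    Job job = new Job(conf, "fetch-by-key-list");
    job.setJarByClass(KeyListFetchJob.class);
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(NLineInputFormat.class);
    FileInputFormat.addInputPath(job, keyFile);
    NLineInputFormat.setNumLinesPerSplit(job, 1000);    // ~1000 Gets per map task
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/fetch-output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}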
