I've been reading this thread, and I'm still not clear on what the problem is. I saw your original post, but it wasn't clear to me then either.
Please correct me if I'm wrong. It sounds like you want to run an M/R job over data that resides in an HBase table, but the table is so large that a job over the entire table would take a long time, so you want to process only the relevant subset. It also sounds like, since you need M/R, the relevant subset is itself too large to fit in memory and needs a distributed solution. Is this correct so far?

A solution exists: scan filters, which let the individual region servers filter the data. When setting up the M/R job, I use TableMapReduceUtil.initTableMapperJob. That method takes a Scan object as an input, and the Scan object can carry a filter that is run on each region server to limit the data that gets sent to the job.

I've written my own filters as well, and they are quite simple. It is a bit of a pain, though, because you have to make sure the custom filter is on the classpath of the region servers. I've used this to randomly select a subset of data from HBase for quick test runs of new M/R jobs.

You might be able to use the existing filters. I recommend taking a look at RowFilter as a starting point. I haven't used it, but it takes a WritableByteArrayComparable, which could be extended to be backed by a Bloom filter or a key list. A rough sketch of the Scan-plus-filter setup is appended at the very end of this message, below the quoted thread.

-Matthew

On Oct 12, 2010, at 10:55 AM, jason wrote:

>> What I can say is that I have a billion rows of data.
>> I want to pull a specific 100K rows from the table.
>
> Michael, I think I have exactly the same use case. Even the numbers are the same.
>
> I posted a similar question a couple of weeks ago, but unfortunately
> did not get a definite answer:
>
> http://mail-archives.apache.org/mod_mbox/hbase-user/201009.mbox/%[email protected]%3e
>
> So far, I have decided to put HBase aside and experiment with Hadoop
> directly, using its BloomMapFile and its ability to quickly discard
> files that do not contain the requested keys.
> This implies that I have to have a custom InputFormat, many
> input map files, and a sorted list of input keys.
>
> I do not have any performance numbers yet to compare this approach to
> the full scan, but I am writing tests as we speak.
>
> Please keep me posted if you find a good solution to this problem in
> general (M/R scanning over a random key subset, based either on
> HBase or Hadoop).
>
>
>
> On 10/12/10, Michael Segel <[email protected]> wrote:
>>
>>
>> Dave,
>>
>> It's a bit more complicated than that.
>>
>> What I can say is that I have a billion rows of data.
>> I want to pull a specific 100K rows from the table.
>>
>> The row keys are not contiguous, and you could say they are 'random', such
>> that if I were to do a table scan, I'd have to scan the entire table (all
>> regions).
>>
>> Now, if I had a list of the 100K rows, then from a single client I could just
>> create 100 threads and grab rows from HBase one at a time in each thread.
>>
>> But in an M/R job I can't really do that. (I want to do processing on the data
>> I get returned.)
>>
>> So, given a List object with the row keys, how do I do a map/reduce with this
>> list as the starting point?
>>
>> Sure, I could write it to HDFS and then do an M/R job reading from that file,
>> setting my own splits to control parallelism.
>> But I'm hoping for a more elegant solution.
>>
>> I know that it's possible, but I haven't thought it out... Was hoping someone
>> else had this solved.
>>
>> thx
>>
>>> From: [email protected]
>>> To: [email protected]
>>> Date: Tue, 12 Oct 2010 08:35:25 -0700
>>> Subject: RE: Using external indexes in an HBase Map/Reduce job...
>>>
>>> Sorry, I am not clear on exactly what you are trying to accomplish here.
>>> I have a table of roughly that size, and it doesn't seem to cause me any
>>> trouble. I also have a few separate Solr indexes for data in the table,
>>> for querying -- the Solr query syntax is sufficient for my current needs.
>>> This setup allows me to do three things efficiently:
>>> 1) batch processing of all records (e.g. tagging records that match a
>>> particular criterion)
>>> 2) search/lookup from a UI in an online manner
>>> 3) it is also fairly easy to insert a bunch of records (keeping track of
>>> their keys), and then run various batch processes only over those new
>>> records -- essentially doing what you suggest: create a file of keys and
>>> split the map task over that file.
>>>
>>> Dave
>>>
>>>
>>> -----Original Message-----
>>> From: Michael Segel [mailto:[email protected]]
>>> Sent: Tuesday, October 12, 2010 5:36 AM
>>> To: [email protected]
>>> Subject: Using external indexes in an HBase Map/Reduce job...
>>>
>>>
>>> Hi,
>>>
>>> Now I realize that most everyone is sitting in NY, while some of us can't
>>> leave our respective cities....
>>>
>>> Came across this problem and I was wondering how others have solved it.
>>>
>>> Suppose you have a really large table with 1 billion rows of data.
>>> Since HBase really doesn't have any indexes built in (don't get me started
>>> about the contrib/transactional stuff...), you're forced to use some sort
>>> of external index, or roll your own index table.
>>>
>>> The net result is that you end up with a list object that contains your
>>> result set.
>>>
>>> So the question is... what's the best way to feed that list object in?
>>>
>>> One option I thought about is writing the object to a file, using that file
>>> as the job input, and then controlling the splits myself. Not the most
>>> efficient, but it would work.
>>>
>>> I was trying to find a more 'elegant' solution, and I'm sure that anyone
>>> using Solr or Lucene or whatever... has come across this problem too.
>>>
>>> Any suggestions?
>>>
>>> Thx
>>>
>>>
>>
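
P.S. Here is a rough, untested sketch of the Scan-plus-filter job setup I described above. It assumes the 0.90-era HBase API; the table name, column family, regex, and mapper are all placeholders, and a custom WritableByteArrayComparable-backed filter would plug in the same way as the stock RowFilter shown here.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FilteredScanJob {

  // The mapper only sees rows that survived the region-server-side filter.
  static class MyMapper extends TableMapper<ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
        throws IOException, InterruptedException {
      // ... process the row here ...
      context.write(rowKey, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "filtered-scan");
    job.setJarByClass(FilteredScanJob.class);

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));   // placeholder column family
    scan.setCaching(500);                  // fewer RPC round trips per mapper
    scan.setCacheBlocks(false);            // don't pollute the block cache from M/R

    // The filter runs on each region server, so only matching rows get
    // shipped to the mappers. RowFilter + RegexStringComparator is just an
    // example; a custom comparator backed by a key list or a Bloom filter
    // would be set the same way (and must be on the region servers' classpath).
    scan.setFilter(new RowFilter(CompareOp.EQUAL,
        new RegexStringComparator("^user123.*")));

    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, MyMapper.class,
        ImmutableBytesWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}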

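Jason, regarding the BloomMapFile route you mention above: below is a minimal, untested sketch of the per-file membership check (it assumes Hadoop's org.apache.hadoop.io.BloomMapFile; the directory layout and the Text key/value types are made up). The same probablyHasKey() test is what would let a custom InputFormat or mapper skip whole map files before doing any real lookup.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.Text;

public class BloomLookup {

  // Hypothetical usage: hadoop jar lookup.jar BloomLookup /data/bloommapfiles someRowKey
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Text key = new Text(args[1]);   // the row key we are looking for
    Text value = new Text();

    // Each subdirectory under args[0] is assumed to be one BloomMapFile.
    for (FileStatus part : fs.listStatus(new Path(args[0]))) {
      BloomMapFile.Reader reader =
          new BloomMapFile.Reader(fs, part.getPath().toString(), conf);
      try {
        // Cheap in-memory membership test: if the Bloom filter says "no",
        // the key is definitely not in this file, so skip the disk seek.
        if (!reader.probablyHasKey(key)) {
          continue;
        }
        // Possible hit (Bloom filters can give false positives), so do the
        // real indexed lookup against the sorted map file.
        if (reader.get(key, value) != null) {
          System.out.println(part.getPath() + "\t" + value);
        }
      } finally {
        reader.close();
      }
    }
  }
}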