Matthew,

You've finally figured out the problem.
And since the data I ultimately want to get at resides in HBase... it's an HBase problem.

Were the list of keys in a file sitting on HDFS, it would be a simple m/r problem: you have a file reader and you set the number of splits. If the index were an HBase table, you would just scan the index and use the HTable input format to drive the map/reduce. My point was that there isn't a way to take in an object and use that to drive the m/r.

And yes, you've come to the same conclusion I came to before writing the question.

As to whether it's worth it? Yes, because right now there is not a good indexing solution for HBase when it comes to a map/reduce. I don't think I'm the first one to think about it....

Thx

-Mike

> Subject: Re: Using external indexes in an HBase Map/Reduce job...
> From: [email protected]
> Date: Tue, 12 Oct 2010 13:57:54 -0700
> To: [email protected]
>
> Michael,
>
> This is really more of an M/R question than an HBase question...
>
> The problem is that the other nodes in the cluster don't have access to the memory of the node that has the Java object. You'll need to copy it to some other store that other nodes can read (or create your own infrastructure that lets other nodes get the data from the object's node - not recommended). If you are running HBase, then you have at least three stores available to you: DFS, HBase, and ZooKeeper. In order for M/R to use it, there needs to be an InputFormat that knows how to read the data. I know of existing input formats that support two of the three: DFS and HBase. You could write your own, but it will be more trouble than it is worth. It is probably best to write the data to one of the two, and have the M/R job read that.
>
> You've probably seen examples that let you pass objects to mappers and reducers using the job configuration (org.apache.hadoop.conf.Configuration). This is meant for configuration items (hence the name), not large data objects. You could pass the object this way, but there still needs to be some input data for the mappers to be started up. So, it is possible to have a dummy file that sends data to the mappers. Once the mapper is started, it can disregard the input data, read the object from the configuration, and then self-select which items in the list to process based on its own identity, or perhaps even the input data. While it is possible, I don't recommend it.
>
> Good luck,
>
> Matthew
>
>
> On Oct 12, 2010, at 12:53 PM, Michael Segel wrote:
>
> > All,
> >
> > Let me clarify ...
> >
> > The ultimate data we want to process is in HBase.
> >
> > The data qualifiers are not part of the row key, so you would have to do a full table scan to get the data. (A full table scan of 1 billion rows just to find a subset of 100K rows?)
> >
> > So the idea is: what if I got the set of row keys that I want to process from an external source? I don't mention the source, because it's not important.
> >
> > What I am looking at is that at the start of my program, I have this Java List object that contains my 100K record keys for the records I want to fetch.
> >
> > So how can I write a m/r that allows me to split and fetch based on an object, and not a file or an HFile, for input?
> >
> > Let me give you a concrete but imaginary example...
> >
> > I have combined all of the DMV vehicle registrations for all of the US.
> >
> > I want to find all of the cars that are registered to somebody with the last name Smith.
> >
> > Since the owner's last name isn't part of the row key,
> > I have to do a full table scan. (Not really efficient.)
> >
> > Suppose I have an external index. I get the list of row keys in a List object.
> >
> > Now I want to process the list in a m/r job.
> >
> > So what's the best way to do it?
> >
> > Can you use an object to feed into a m/r job? (And that's the key point I'm trying to solve.)
> >
> > Does that make sense?
> >
> > -Mike
> >
> >> Subject: Re: Using external indexes in an HBase Map/Reduce job...
> >> From: [email protected]
> >> Date: Tue, 12 Oct 2010 11:53:11 -0700
> >> To: [email protected]
> >>
> >> I've been reading this thread, and I'm still not clear on what the problem is. I saw your original post, but was unclear then as well.
> >>
> >> Please correct me if I'm wrong. It sounds like you want to run a M/R job on some data that resides in a table in HBase. But, since the table is so large, the M/R job would take a long time to process the entire table, so you want to process only the relevant subset. It also sounds like, since you need M/R, the relevant subset is too large to fit in memory and needs a distributed solution. Is this correct so far?
> >>
> >> A solution exists: scan filters. The individual region servers filter the data.
> >>
> >> When setting up the M/R job, I use TableMapReduceUtil.initTableMapperJob. That method takes a Scan object as an input. The Scan object can have a filter which is run on the individual region servers to limit the data that gets sent to the job. I've written my own filters as well, which are quite simple. But, it is a bit of a pain because you have to make sure the custom filter is in the classpath of the servers. I've used it to randomly select a subset of data from HBase for quick test runs of new M/R jobs.
> >>
> >> You might be able to use existing filters. I recommend taking a look at RowFilter as a starting point. I haven't used it, but it takes a WritableByteArrayComparable, which could possibly be extended to be based on a bloom filter or a list.
> >>
> >> -Matthew
> >>
> >> On Oct 12, 2010, at 10:55 AM, jason wrote:
> >>
> >>>> What I can say is that I have a billion rows of data.
> >>>> I want to pull a specific 100K rows from the table.
> >>>
> >>> Michael, I think I have exactly the same use case. Even the numbers are the same.
> >>>
> >>> I posted a similar question a couple of weeks ago, but unfortunately did not get a definite answer:
> >>>
> >>> http://mail-archives.apache.org/mod_mbox/hbase-user/201009.mbox/%[email protected]%3e
> >>>
> >>> So far, I decided to put HBase aside and experiment with Hadoop directly, using its BloomMapFile and its ability to quickly discard files that do not contain requested keys. This implies that I have to have a custom InputFormat for that, many input map files, and a sorted list of input keys.
> >>>
> >>> I do not have any performance numbers yet to compare this approach to the full scan, but I am writing tests as we speak.
> >>>
> >>> Please keep me posted if you find a good solution for this problem in general (M/R scanning through a random key subset, based on either HBase or Hadoop).
> >>>
> >>> On 10/12/10, Michael Segel <[email protected]> wrote:
> >>>>
> >>>> Dave,
> >>>>
> >>>> It's a bit more complicated than that.
> >>>>
> >>>> What I can say is that I have a billion rows of data.
> >>>> I want to pull a specific 100K rows from the table.
> >>>>
> >>>> The row keys are not contiguous, and you could say they are 'random', such that if I were to do a table scan, I'd have to scan the entire table (all regions).
> >>>>
> >>>> Now, if I had a list of the 100K rows, from a single client I could just create 100 threads and grab rows from HBase one at a time in each thread.
> >>>>
> >>>> But in a m/r, I can't really do that. (I want to do processing on the data I get returned.)
> >>>>
> >>>> So given a List object with the row keys, how do I do a map/reduce with this list as the starting point?
> >>>>
> >>>> Sure, I could write it to HDFS and then do a m/r reading from the file, setting my own splits to control parallelism. But I'm hoping for a more elegant solution.
> >>>>
> >>>> I know that it's possible, but I haven't thought it out... Was hoping someone else had this solved.
> >>>>
> >>>> thx
> >>>>
> >>>>> From: [email protected]
> >>>>> To: [email protected]
> >>>>> Date: Tue, 12 Oct 2010 08:35:25 -0700
> >>>>> Subject: RE: Using external indexes in an HBase Map/Reduce job...
> >>>>>
> >>>>> Sorry, I am not clear on exactly what you are trying to accomplish here. I have a table of roughly that size, and it doesn't seem to cause me any trouble. I also have a few separate Solr indexes for data in the table for query -- the Solr query syntax is sufficient for my current needs. This setup allows me to do a few things efficiently:
> >>>>> 1) batch processing of all records (e.g. tagging records that match a particular criterion)
> >>>>> 2) search/lookup from a UI in an online manner
> >>>>> 3) it is also fairly easy to insert a bunch of records (keeping track of their keys), and then run various batch processes only over those new records -- essentially doing what you suggest: create a file of keys and split the map task over that file.
> >>>>>
> >>>>> Dave
> >>>>>
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Michael Segel [mailto:[email protected]]
> >>>>> Sent: Tuesday, October 12, 2010 5:36 AM
> >>>>> To: [email protected]
> >>>>> Subject: Using external indexes in an HBase Map/Reduce job...
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Now I realize that most everyone is sitting in NY, while some of us can't leave our respective cities....
> >>>>>
> >>>>> I came across this problem and was wondering how others have solved it.
> >>>>>
> >>>>> Suppose you have a really large table with 1 billion rows of data. Since HBase really doesn't have any indexes built in (don't get me started about the contrib/transactional stuff...), you're forced to use some sort of external index, or roll your own index table.
> >>>>>
> >>>>> The net result is that you end up with a list object that contains your result set.
> >>>>>
> >>>>> So the question is... what's the best way to feed the list object in?
> >>>>>
> >>>>> One option I thought about is writing the object to a file, using that as the input file, and then controlling the splits. Not the most efficient, but it would work.
> >>>>>
> >>>>> I was trying to find a more 'elegant' solution, and I'm sure that anyone using Solr or Lucene or whatever... has come across this problem too.
> >>>>>
> >>>>> Any suggestions?
> >>>>>
> >>>>> Thx
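
For reference, below is a minimal sketch of the "write the key list to HDFS and drive the mappers from that file" approach that Matthew and Dave describe above. It is only a sketch, not code from the thread: the table name (dmv_registrations), the column family/qualifier (info:owner_last_name), the output path, and the lines-per-split value are made-up placeholders, and it assumes a Hadoop release that ships the new-API NLineInputFormat plus the HBase 0.90-era HTable client. Each mapper receives a slice of the key file and issues point Gets against HBase, so parallelism is set by the lines-per-split value rather than by region boundaries.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class KeyListFetchJob {

  // One input line = one HBase row key; each mapper fetches and processes its slice of keys.
  public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
      // "dmv_registrations" is a placeholder table name.
      table = new HTable(context.getConfiguration(), "dmv_registrations");
    }

    @Override
    protected void map(LongWritable offset, Text rowKey, Context context)
        throws IOException, InterruptedException {
      Get get = new Get(Bytes.toBytes(rowKey.toString().trim()));
      Result result = table.get(get);
      if (!result.isEmpty()) {
        // Placeholder "processing": emit the owner's last name for the fetched row.
        byte[] owner = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("owner_last_name"));
        context.write(rowKey, new Text(owner == null ? "" : Bytes.toString(owner)));
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      table.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "fetch-rows-by-external-key-list");
    job.setJarByClass(KeyListFetchJob.class);
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0); // map-only; add a reducer if the processing needs one

    // args[0]: the key list written to HDFS beforehand, one row key per line.
    job.setInputFormatClass(NLineInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    NLineInputFormat.setNumLinesPerSplit(job, 1000); // ~1000 Gets per mapper; tune for parallelism

    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Ship the HBase client jars with the job so the mapper tasks can reach the cluster.
    TableMapReduceUtil.addDependencyJars(job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A custom InputFormat over an index table, or a Scan with a RowFilter as Matthew suggests, would avoid the intermediate file; the key-file route is less elegant but works with the stock input formats and keeps the splits under your control.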
