Hi Doug,

Thanks for the suggestion. I like the idea of simply deleting the table; however, I'm not sure I can implement it.

Basically, I have one process which is constantly feeding the table, and once a day I want to run an MR job to process this table (which will empty it). While I'm processing it, I still want the other process to be able to store data. Since I can't rename the table, because this functionality doesn't exist, I need to have the two working on the same table.

Maybe what I can do is work on the column name: store to a different column every day, based on the day number, and run the MR job on all the columns except today's. After that, I can delete all the columns except the one for the current day. The issue is if the MR job takes more than 24 hours... Also, is it fast to delete a column?

JM

2012/10/12 Doug Meil <[email protected]>:
>
> I'm not entirely sure of the use-case, but here are some thoughts on this...
>
> re: "should I take the table from the pool, and simply call the delete
> method?"
>
> Yep, you can construct an HTable instance within a MR job. But use the
> delete that takes a list because the single-delete will invoke an RPC for
> each one (not great over an MR job).
>
> Construct the HTable instance at the Mapper level (not map-method level)
> and keep a buffer of deletes in a List. At the end of the job, send any
> un-processed deletes in the cleanup method.
>
>
> I'm not entirely sure why you'd want to delete every row in a table (as
> opposed to processing all the rows in Table1 and generating an entirely
> new Table2). And then drop Table1 when you're done with it. That seems
> like it would be less hassle than deleting every row (since the table is
> empty anyway).
>
>
> On 10/12/12 1:20 PM, "Jean-Marc Spaggiari" <[email protected]> wrote:
>
>>Hi,
>>
>>I have a table which I want to parse over a MR job.
>>
>>Today, I'm using a scan to parse all the rows. Each row is retrieved,
>>removed, and then parsed (feeding 2 other tables)
>>
>>The goal is to parse all the content while some process might still be
>>adding some more.
>>
>>On the map method from the MR job, can I delete the row I'm working
>>with? If so, how should I do it? Should I take the table from the pool,
>>and simply call the delete method? The issue is, doing a delete for
>>each line will take a while. I would prefer to batch them, but I don't
>>know when the last line will be, so it's difficult to know when to
>>send the batch. Is there a way to tell the MR job to delete this
>>line? Also, what's the impact on the MR job if I delete the row it's
>>working on?
>>
>>Or is the MR job not the best way to do that?
>>
>>Thanks,
>>
>>JM
>>
>
>
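
For reference, the batching advice quoted above (keep a buffer of deletes at the Mapper level, send leftovers in cleanup) could look roughly like the sketch below. This is only an illustration under assumptions: BatchBuffer, its batchSize parameter and the sink callback are hypothetical names, not part of HBase. In a real job the sink would presumably wrap the list-taking delete on HTable, and since that call throws IOException it would need to be caught or rethrown inside the callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical helper (not an HBase class): buffers items and hands them
 * to a sink in batches, so that one call is made per batch instead of per row.
 */
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Intended to be called from map(): queue one item, flush when the buffer is full. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Intended to be called from cleanup(): send whatever is still buffered. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Pass a copy so the sink is free to reorder or modify its list.
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }

    /** Number of items waiting for the next flush. */
    int pendingCount() {
        return pending.size();
    }
}
```

The idea would be to create one BatchBuffer in the Mapper's setup method, call add for each row handled in map, and call flush in cleanup. That way the job never needs to know which row is the last one: a full buffer triggers a send, and cleanup takes care of the remainder.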
