I think I get the idea. I can't do it exactly like that, because Job1 might try to access the table at the same time as Job2 is trying to rename it, or I might hit other issues of the same kind, but I will work on something similar.
Thanks, I will start another thread for another delete question I have ;)

JM

2012/10/12 Doug Meil <[email protected]>:
>
> Just throwing an idea out there, but if you rotate tables you could
> probably do what you want.
>
> 1) Table1 is being written throughout the day
> 2) It's time to kick off the MR job, but before the job is submitted
> Table2 is now configured to be the 'write' table
> 3) MR job processes all the data in Table1. Table1 is dropped/truncated
> when finished.
> 4) Table2 continues to get writes
> 5) Now it's time to run the MR job again, Table1 is now configured to be
> the 'write' table and Table2 is processed by the MR job.
> 6) Continue rotating between the tables
>
> Something like this is probably going to be a lot easier to manage than
> doing deletes of what you've read.
>
> On 10/12/12 3:47 PM, "Jean-Marc Spaggiari" <[email protected]> wrote:
>
>>Hi Doug,
>>
>>Thanks for the suggestion. I like the idea of simply deleting the
>>table, however, I'm not sure if I can implement it.
>>
>>Basically, I have one process which is constantly feeding the table,
>>and, once a day, I want to run a MR job to process this table (which
>>will empty it).
>>
>>While I'm processing it, I still want the other process to have the
>>ability to store data.
>>
>>Since I can't rename the table because this functionality doesn't
>>exist, I need to have the 2 working on the same table.
>>
>>Maybe what I can do is work on the column name... Like I store on a
>>different column every day based on the day number and I just run MR
>>on all the columns except today's. After that, I can delete all the
>>columns except the one for the current day. The issue is if the MR is
>>taking more than 24h...
>>
>>Also, is it fast to delete a column?
>>
>>JM
>>
>>2012/10/12 Doug Meil <[email protected]>:
>>>
>>> I'm not entirely sure of the use-case, but here are some thoughts on
>>> this...
>>>
>>> re: "should I take the table from the pool, and simply call the delete
>>> method?"
>>>
>>> Yep, you can construct an HTable instance within a MR job. But use the
>>> delete that takes a list because the single-delete will invoke an RPC
>>> for each one (not great over an MR job).
>>>
>>> Construct the HTable instance at the Mapper level (not map-method level)
>>> and keep a buffer of deletes in a List. At the end of the job, send any
>>> un-processed deletes in the cleanup method.
>>>
>>> I'm not entirely sure why you'd want to delete every row in a table (as
>>> opposed to processing all the rows in Table1 and generating an entirely
>>> new Table2). And then drop Table1 when you're done with it. That seems
>>> like it would be less hassle than deleting every row (since the table is
>>> empty anyway).
>>>
>>> On 10/12/12 1:20 PM, "Jean-Marc Spaggiari" <[email protected]>
>>> wrote:
>>>
>>>>Hi,
>>>>
>>>>I have a table which I want to parse over a MR job.
>>>>
>>>>Today, I'm using a scan to parse all the rows. Each row is retrieved,
>>>>removed, and then parsed (feeding 2 other tables).
>>>>
>>>>The goal is to parse all the content while some process might still be
>>>>adding some more.
>>>>
>>>>On the map method from the MR job, can I delete the row I'm working
>>>>with? If so, how should I do it? Should I take the table from the pool,
>>>>and simply call the delete method? The issue is, doing a delete for
>>>>each line will take a while. I would prefer to batch them, but I don't
>>>>know when the last line will be, so it's difficult to know when to
>>>>send the batch. Is there a way to tell the MR job to delete this
>>>>line? Also, what's the impact on the MR job if I delete the row it's
>>>>working on?
>>>>
>>>>Or is the MR job not the best way to do that?
>>>>
>>>>Thanks,
>>>>
>>>>JM
>>>>
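
For anyone finding this thread later, the buffering pattern Doug describes (keep a List of pending deletes at the Mapper level, send it in batches, and send the remainder from the cleanup method) might look roughly like the sketch below. It is plain Java with no HBase dependency so it can run on its own: the class name DeleteBuffer, the batch size and the sink callback are all hypothetical, and the sink only stands in for whatever batched delete call the real job would make (for example the HTable delete that takes a list).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers row keys and hands them to a sink in batches (hypothetical helper). */
public class DeleteBuffer {
    private final int batchSize;
    // Stand-in for a batched delete call; an assumption, not an HBase API.
    private final Consumer<List<String>> sink;
    private final List<String> pending = new ArrayList<>();

    public DeleteBuffer(int batchSize, Consumer<List<String>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Call once per processed row (the map-method step). */
    public void add(String rowKey) {
        pending.add(rowKey);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Call from the cleanup step to send whatever is still buffered. */
    public void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy so the sink owns its batch
        pending.clear();
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        DeleteBuffer buffer = new DeleteBuffer(3, batch -> batchSizes.add(batch.size()));
        for (int i = 0; i < 7; i++) {
            buffer.add("row-" + i);
        }
        buffer.flush(); // what the cleanup method would do
        System.out.println(batchSizes); // prints [3, 3, 1]
    }
}
```

In a real Mapper the buffer would be a field created once per task, add() would be called from map(), and flush() from cleanup(), so the "when is the last line?" problem goes away: the final partial batch is simply sent at cleanup time.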

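
The table rotation from Doug's numbered steps can be sketched in the same spirit. This is only an in-memory illustration: the class TableRotation and its methods are hypothetical, and how the "current write table" setting is shared between the feeding process and the MR driver (a config file, a small metadata table, something else) is left open here.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Tracks which of two tables currently receives writes (hypothetical helper). */
public class TableRotation {
    private final String first;
    private final String second;
    private final AtomicReference<String> writeTable;

    public TableRotation(String first, String second) {
        this.first = first;
        this.second = second;
        this.writeTable = new AtomicReference<>(first);
    }

    /** Table the feeding process should write to right now. */
    public String currentWriteTable() {
        return writeTable.get();
    }

    /**
     * Atomically switches writes to the other table and returns the table
     * that was being written before the switch, which the MR job can process.
     */
    public String rotate() {
        return writeTable.getAndUpdate(t -> t.equals(first) ? second : first);
    }

    public static void main(String[] args) {
        TableRotation rotation = new TableRotation("Table1", "Table2");
        String toProcess = rotation.rotate(); // step 2: switch before submitting the job
        System.out.println(toProcess + " " + rotation.currentWriteTable()); // prints Table1 Table2
    }
}
```

The swap itself is atomic, but a writer that read the table name just before the switch can still be writing to the old table afterwards. That is the kind of race JM mentions, so a real setup would probably need to wait for in-flight writes, or re-check the old table, before dropping or truncating it.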