Hi Rohit,

It will always be consistent. I don't see why there would be any inconsistency in the scenario you described below.

JM

2013/6/22 Rohit Kelkar <[email protected]>:
> Thanks JM, I am not so concerned about holding those rows in memory because
> they are mostly ordered integers and I would be using a bitset. So I have
> some leeway in that sense. My dilemma was
> 1. updating instantly within the map
> 2. bulk updating at the end of the map
> Yes, I do understand the drawback with 2 if the map crashes. I am ready to incur
> that penalty if that avoids any inconsistent behaviour on hbase.
>
> - R
>
>
> On Sat, Jun 22, 2013 at 12:16 PM, Jean-Marc Spaggiari <[email protected]> wrote:
>
>> Hi Rohit,
>>
>> The list is a bad idea. When you have millions of lines per
>> region, are you going to put millions of them in memory in your list?
>>
>> Your MR will scan the entire table, row by row. If you modify the
>> current row, when the scanner searches for the next one, it will
>> not look at the current one. So there is no real issue with that.
>>
>> Also, instead of doing puts one by one I would recommend that you buffer
>> them (let's say, 100 by 100) and put them as a batch. Don't forget to
>> push the remaining ones at the end of the job. The drawback is that if the
>> MR crashes you will have some rows already processed and not marked as
>> processed...
>>
>> JM
>>
>> 2013/6/22 Rohit Kelkar <[email protected]>:
>> > I have a use case where I push data into my HTable in waves followed by
>> > Mapper-only processing. Currently once a row is processed in map I
>> > immediately mark it as processed=true. For this inside the map I execute a
>> > table.put(isprocessed=true). I am not sure if modifying the table like this
>> > is a good idea. I am also concerned that I am modifying the same table that
>> > I am running the MR job on.
>> > So I am thinking of another approach where I accumulate the processed rows
>> > in a list (or a better compact data structure) and use the cleanup method
>> > of the MR job to execute all the table.put(isprocessed=true) at once.
>> > What is the suggested best practice?
>> >
>> > - R
>>
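
For illustration, here is a minimal sketch of the buffering idea from the thread: collect updates, send them in batches of a fixed size, and push the remainder at the end. It is plain Java with no HBase dependency. The class name BatchingBuffer, its methods, and the sink callback are hypothetical names made up for this example and are not part of HBase or Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Collects items and hands them to a sink in fixed-size batches.
 * Hypothetical helper for illustration; not part of HBase or Hadoop.
 */
class BatchingBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending;

    BatchingBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.pending = new ArrayList<>(batchSize);
    }

    /** Buffers one item; sends a full batch to the sink once batchSize is reached. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is still buffered; call it once more when the task ends. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Hand the sink its own copy so clearing the buffer does not affect it.
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }

    /** Number of items buffered but not yet sent. */
    int pendingCount() {
        return pending.size();
    }
}
```

In a mapper, one possible wiring (an assumption, not something confirmed in the thread) would be a BatchingBuffer of Put objects with batch size 100 whose sink calls table.put(List<Put>), with add() called from map() and flush() called from the mapper's cleanup() method. As noted in the thread, rows still sitting in the buffer when a task crashes will have been processed but not marked as processed.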
