Hi Rohit,

The list is a bad idea. When you have millions of rows per region, are you going to keep millions of them in memory in your list?
Your MR job will scan the entire table, row by row. If you modify the current row, the scanner will not look at it again when it moves on to the next one, so there is no real issue with writing to the table you are scanning.

Also, instead of doing the puts one by one, I would recommend buffering them (let's say 100 at a time) and sending them as a batch. Don't forget to push the remaining ones at the end of the job. The drawback is that if the MR job crashes, you will have some rows already processed but not yet marked as processed...

JM

2013/6/22 Rohit Kelkar <[email protected]>:
> I have a use case where I push data in my HTable in waves followed by
> Mapper-only processing. Currently once a row is processed in map I
> immediately mark it as processed=true. For this inside the map I execute a
> table.put(isprocessed=true). I am not sure if modifying the table like this
> is a good idea. I am also concerned that I am modifying the same table that
> I am running the MR job on.
> So I am thinking of another approach where I accumulate the processed rows
> in a list (or a better compact data structure) and use the cleanup method
> of the MR job to execute all the table.put(isprocessed=true) at once.
> What is the suggested best practice?
>
> - R
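
PS: the buffering suggested above (collect about 100 puts, send them as one batch, push the remainder at the end of the job) could look roughly like the sketch below. It is only an illustration and is kept free of HBase dependencies so it runs on its own: BatchBuffer and BatchBufferDemo are hypothetical names, and the Consumer passed in stands in for whatever actually writes the batch. In a real mapper that would presumably be a call such as table.put(List<Put>), with add() called from map() and flush() called from cleanup().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: collects items and hands them to a sink in fixed-size batches. */
class BatchBuffer<T> {
    private final int batchSize;
    // Assumption: in a real job the sink would wrap a batch write such as table.put(List<Put>).
    private final Consumer<List<T>> sink;
    private final List<T> pending;

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.pending = new ArrayList<>(batchSize);
    }

    /** Call once per processed row, e.g. from map(). Flushes when the buffer is full. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Call at the end of the job, e.g. from cleanup(), to push whatever is left. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy, so the sink owns its batch
        pending.clear();
    }
}

class BatchBufferDemo {
    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BatchBuffer<String> buffer =
                new BatchBuffer<>(100, batch -> batchSizes.add(batch.size()));

        for (int i = 0; i < 250; i++) {
            buffer.add("row-" + i); // stands in for one "processed=true" put
        }
        buffer.flush(); // push the remaining 50

        System.out.println(batchSizes); // prints [100, 100, 50]
    }
}
```

Memory use stays bounded by the batch size instead of growing with the number of rows, which is the point of not using one big list. The crash caveat still applies: anything sitting in the buffer when a task dies has been processed but not marked.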
