I had a problem of lost rows when the flush was right before the close statement.

---- Sean Sechrist <[email protected]> wrote:
=============
Accidentally dropped the user list of this email exchange. Anyone have any
other ideas here?

But using scanner caching of 1 fixes the problem, as suspected. So now I'll
investigate why the scanner cache is being lost.

Thanks,
Sean

On Mon, Mar 21, 2011 at 11:06 AM, Sean Sechrist <[email protected]> wrote:

> Hey Mike, thanks for the response.
>
> > This would mean that you have 184 mappers, right?
>
> We actually had 43 mappers (43 regions in the source table).
>
> > If this is correct, then it appears that you are losing only the records
> > cached once per mapper task.
> > It would be interesting to see if this happened in the first set of
> > cached rows, or if it happens in the last set of cached rows.
>
> So it actually happens (possibly) more than once per task. For example, for
> the first 10 tasks, here are the numbers of missed records:
>
> 0, 0, 3996, 4995, 0, 0, 999, 1998, 3996, 999
>
> > My next suggestion is to turn off the scan caching.
>
> Good idea, I'll see if that works.
>
> Thanks,
> Sean
>
> On Mon, Mar 21, 2011 at 10:39 AM, Michael Segel <[email protected]> wrote:
>
>> For some reason my e-mail to the hbase list failed....
>>
>> ------------------------------
>> From: [email protected]
>> To: [email protected]
>> Subject: RE: Scan isn't processing all rows
>> Date: Mon, 21 Mar 2011 09:37:06 -0500
>>
>> Sean,
>> Ok...
>>
>> Let's think about this...
>>
>> You're saying that without the actual put, your application is reading all
>> of the rows and they are being processed correctly.
>> You said that when you add the put() to the second table, it appears that
>> rows that were scanned and are in the cache are lost, so that you are
>> missing multiples of 999 rows.
>> Based on your example...
>>
>> > To get a sense of how many we are missing, the latest run missed 183,816 out
>> > of 29,572,075 rows in the source table.
>>
>> This would mean that you have 184 mappers, right?
>>
>> If this is correct, then it appears that you are losing only the records
>> cached once per mapper task.
>> It would be interesting to see if this happened in the first set of cached
>> rows, or if it happens in the last set of cached rows.
>> (You can see this by seeing which rows are missing and where they are in
>> the HTable region based on their row key.)
>>
>> My next suggestion is to turn off the scan caching.
>> You will obviously take a little performance hit, but that should clean up
>> the problem.
>>
>> If that works, then you should be able to start to look at your code to
>> see what's causing the failure.
>>
>> HTH
>>
>> -Mike
>>
>> > From: [email protected]
>> > Date: Mon, 21 Mar 2011 09:01:32 -0400
>> > Subject: Re: Scan isn't processing all rows
>> > To: [email protected]
>> >
>> > Okay, I've tried that test, as well as making sure speculative execution is
>> > turned off. Neither made a difference. It's not only a problem with writing
>> > to the target table - The number of map input records for the job is wrong,
>> > as well. But it's correct when we run jobs that do not write to HBase, such
>> > as a row count.
>> >
>> > I ran another job to calculate the number of missed rows per region of the
>> > source table (which is not consistent between runs), by comparing the source
>> > table with the target table.
>> >
>> > An interesting thing I found is that the number of skipped rows is always a
>> > multiple of 999. This is especially interesting because our scanner caching
>> > is 1000. So I think we're skipping over the scanner cache sometimes.
>> >
>> > To get a sense of how many we are missing, the latest run missed 183,816 out
>> > of 29,572,075 rows in the source table.
>> >
>> > Any ideas?
>> >
>> > Thanks,
>> > Sean
>> >
>> > On Fri, Mar 18, 2011 at 9:58 AM, Michael Segel <[email protected]> wrote:
>> >
>> > >
>> > > Sean,
>> > >
>> > > Here's a simple test.
>> > >
>> > > Modify your code so that you aren't using the TableOutputFormat class, but
>> > > a null writable and inside the map() method you actually do the write
>> > > yourself.
>> > >
>> > > Also make sure to explicitly flush and close your HTable connection when
>> > > your mapper ends.
>> > >
>> > > > From: [email protected]
>> > > > Date: Fri, 18 Mar 2011 09:50:47 -0400
>> > > > Subject: Scan isn't processing all rows
>> > > > To: [email protected]
>> > > >
>> > > > Hi all,
>> > > >
>> > > > We're experiencing a problem where a map-only job using TableInputFormat and
>> > > > TableOutputFormat to export data from one table into another is not reading
>> > > > all of the rows in the source table. That is, # map input records != #
>> > > > records in the table. Anyone have any clue how that could happen?
>> > > >
>> > > > Some more detail:
>> > > >
>> > > > It appears to only happen when we are writing results to the destination
>> > > > table. If I comment out the lines where data is written from the
>> > > > mapper (context.write), then the number of input records is correct.
>> > > >
>> > > > I verified that the rows did not get written to the output table, so
>> > > > it's not just a counter problem. We aren't using any filter or anything,
>> > > > just a straight-up scan to try to read everything in the table.
>> > > >
>> > > > We're on hbase-0.89.20100924.
>> > > >
>> > > > Thanks,
>> > > > Sean
>> > >
>> > >

--
1. If a man is standing in the middle of the forest talking, and there is no woman around to hear him, is he still wrong?
2. Behind every great woman... Is a man checking out her ass
3. I am not a member of any organized political party. I am a Democrat.*
4. Diplomacy is the art of saying "Nice doggie" until you can find a rock.*
5. A process is what you need when all your good people have left.

*Will Rogers
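
For anyone following the numbers in this thread, here is a rough sanity check of the "multiple of 999" observation. It is only a sketch: it assumes, as Sean suspects but the thread does not confirm, that each loss event drops one scanner-cache batch of 999 rows (caching of 1000, minus one). The names `ROWS_PER_LOST_BATCH` and `lost_batches` are made up for illustration; the figures are the ones quoted above.

```python
# Figures quoted in the thread.
SCANNER_CACHING = 1000
TOTAL_MISSED = 183_816
SOURCE_ROWS = 29_572_075
MAP_TASKS = 43  # one per region of the source table
MISSED_FIRST_10_TASKS = [0, 0, 3996, 4995, 0, 0, 999, 1998, 3996, 999]

# Assumption (hypothesis from the thread, not a confirmed cause):
# every loss event drops caching - 1 = 999 rows.
ROWS_PER_LOST_BATCH = SCANNER_CACHING - 1


def lost_batches(missed_rows, rows_per_batch=ROWS_PER_LOST_BATCH):
    """Return how many whole batches `missed_rows` corresponds to.

    Raises ValueError if the count is not an exact multiple, which
    would contradict the hypothesis.
    """
    batches, remainder = divmod(missed_rows, rows_per_batch)
    if remainder != 0:
        raise ValueError(f"{missed_rows} is not a multiple of {rows_per_batch}")
    return batches


per_task = [lost_batches(m) for m in MISSED_FIRST_10_TASKS]
total_batches = lost_batches(TOTAL_MISSED)
avg_per_task = total_batches / MAP_TASKS
missed_percent = 100 * TOTAL_MISSED / SOURCE_ROWS

print("lost batches, first 10 tasks:", per_task)
print("lost batches, whole job:", total_batches)
print("average lost batches per map task:", round(avg_per_task, 2))
print("share of source rows missed (%):", round(missed_percent, 2))
```

The total works out to 184 lost batches, which is where the "184 mappers" guess came from. With only 43 map tasks that is roughly four lost batches per task on average, consistent with Sean's per-task counts showing the loss can happen more than once in a single task. Overall about 0.62% of the source rows were missed.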
