Okay, see the ticket here: https://issues.apache.org/jira/browse/HBASE-3686.
Thanks,
Sean

On Tue, Mar 22, 2011 at 2:40 PM, Stack <[email protected]> wrote:
> Good man Sean for figuring it. Please make an issue. Let's try and
> fix it so no one else has the pain you just did.
> St.Ack
>
> On Tue, Mar 22, 2011 at 11:27 AM, Sean Sechrist <[email protected]> wrote:
> > Okay, so I figured out what was going wrong:
> >
> > The property hbase.regionserver.lease.period was 120s on the machine I
> > submitted the job from, but was only 60s on the RegionServer.
> >
> > This caused the scanner to time out on the region server. But when the next
> > HTable.ClientScanner.next() call got the UnknownScannerException sent back
> > from the region server, it didn't think it was actually timed out, since the
> > time elapsed was <120s. So instead of a ScannerTimeoutException being
> > thrown, it was treated as if it was an NSRE.
> >
> > I haven't been able to trace through the code to see *exactly* why that
> > causes rows to be skipped, but I have a very simple job that will
> > consistently reproduce the problem, see here: http://pastebin.com/H5Ymq9UJ
> >
> > -Sean
> >
> > On Mon, Mar 21, 2011 at 2:28 PM, Sean Sechrist <[email protected]> wrote:
> >
> >> The missed rows are not found near the beginning or end of a task (or
> >> region). It's also a *read* problem - all of the writes that it tries to do
> >> are fine. There is just not the correct number of input records from the
> >> scan.
> >>
> >> Thanks,
> >> Sean
> >>
> >>
> >> On Mon, Mar 21, 2011 at 2:13 PM, Michael Segel <[email protected]> wrote:
> >>
> >>> Which release?
> >>>
> >>> I thought that when you issued the flush() command that it was a
> >>> 'blocking' command in that it didn't return control back until after the
> >>> flush() completes?
> >>>
> >>> In Sean's response, the losses were in some of the threads and not all of
> >>> them. He didn't indicate where the records were in each region.
> >>> If as you say, the flush() command is too close to the close() statement,
> >>> then these records would be at the end of the region.
> >>>
> >>> Another area to look in to is what is happening when the table's regions
> >>> split during the write.
> >>>
> >>>
> >>> HTH
> >>>
> >>> -Mike
> >>>
> >>> PS. When I tried writing to [email protected], the e-mails bounced.
> >>> Not sure why...
> >>>
> >>> > Date: Mon, 21 Mar 2011 13:55:02 -0400
> >>> > From: [email protected]
> >>> > To: [email protected]; [email protected]
> >>> > Subject: Re: FW: Scan isn't processing all rows
> >>> >
> >>> > I had a problem of lost rows when the flush was right before the close
> >>> > statement.
> >>> > ---- Sean Sechrist <[email protected]> wrote:
> >>> >
> >>> > =============
> >>> > Accidentally dropped the user list of this email exchange. Anyone have any
> >>> > other ideas here?
> >>> >
> >>> > But using scanner caching of 1 fixes the problem, as suspected. So now I'll
> >>> > investigate why the scanner cache is being lost.
> >>> >
> >>> > Thanks,
> >>> > Sean
> >>> >
> >>> > On Mon, Mar 21, 2011 at 11:06 AM, Sean Sechrist <[email protected]> wrote:
> >>> >
> >>> > > Hey Mike, thanks for the response.
> >>> > >
> >>> > > > This would mean that you have 184 mappers, right?
> >>> > >
> >>> > > We actually had 43 mappers (43 regions in the source table).
> >>> > >
> >>> > > > If this is correct, then it appears that you are losing only the records
> >>> > > > cached once per mapper task.
> >>> > > > It would be interesting to see if this happened in the first set of
> >>> > > > cached rows, or if it happens in the last
> >>> > > > set of cached rows.
> >>> > >
> >>> > > So it actually happens (possibly) more than once per task.
> >>> > > For example, for the first 10 tasks, here are the numbers of missed records:
> >>> > >
> >>> > > 0, 0, 3996, 4995, 0, 0, 999, 1998, 3996, 999
> >>> > >
> >>> > > > My next suggestion is to turn off the scan caching.
> >>> > >
> >>> > > Good idea, I'll see if that works.
> >>> > >
> >>> > > Thanks,
> >>> > > Sean
> >>> > >
> >>> > > On Mon, Mar 21, 2011 at 10:39 AM, Michael Segel <[email protected]> wrote:
> >>> > >
> >>> > >> For some reason my e-mail to the hbase list failed....
> >>> > >>
> >>> > >>
> >>> > >> ------------------------------
> >>> > >> From: [email protected]
> >>> > >>
> >>> > >> To: [email protected]
> >>> > >> Subject: RE: Scan isn't processing all rows
> >>> > >> Date: Mon, 21 Mar 2011 09:37:06 -0500
> >>> > >>
> >>> > >> Sean,
> >>> > >> Ok...
> >>> > >>
> >>> > >> Let's think about this...
> >>> > >>
> >>> > >> You're saying that without the actual put, your application is reading all
> >>> > >> of the rows and they are being processed correctly.
> >>> > >> You said that when you add the put() to the second table, it appears that
> >>> > >> rows that were scanned and are in the cache are lost. So that you are missing
> >>> > >> multiples of 999 rows.
> >>> > >> Based on your example...
> >>> > >>
> >>> > >> > To get a sense of how many we are missing, the latest run missed 183,816
> >>> > >> > out of 29,572,075 rows in the source table.
> >>> > >>
> >>> > >> This would mean that you have 184 mappers, right?
> >>> > >>
> >>> > >> If this is correct, then it appears that you are losing only the records
> >>> > >> cached once per mapper task.
> >>> > >> It would be interesting to see if this happened in the first set of cached
> >>> > >> rows, or if it happens in the last set of cached rows.
> >>> > >> (You can see this by seeing which rows are missing and where they are in
> >>> > >> the HTable region based on their row key.)
> >>> > >>
> >>> > >> My next suggestion is to turn off the scan caching.
> >>> > >> You will obviously take a little performance hit, but that should clean up
> >>> > >> the problem.
> >>> > >>
> >>> > >> If that works, then you should be able to start to look at your code to
> >>> > >> see what's causing the failure.
> >>> > >>
> >>> > >> HTH
> >>> > >>
> >>> > >> -Mike
> >>> > >>
> >>> > >> > From: [email protected]
> >>> > >> > Date: Mon, 21 Mar 2011 09:01:32 -0400
> >>> > >> > Subject: Re: Scan isn't processing all rows
> >>> > >> > To: [email protected]
> >>> > >> >
> >>> > >> > Okay, I've tried that test, as well as making sure speculative execution is
> >>> > >> > turned off. Neither made a difference. It's not only a problem with writing
> >>> > >> > to the target table - The number of map input records for the job is wrong,
> >>> > >> > as well. But it's correct when we run jobs that do not write to HBase, such
> >>> > >> > as a row count.
> >>> > >> >
> >>> > >> > I ran another job to calculate the number of missed rows per region of the
> >>> > >> > source table (which is not consistent between runs), by comparing the source
> >>> > >> > table with the target table.
> >>> > >> >
> >>> > >> > An interesting thing I found is that the number of skipped rows is always a
> >>> > >> > multiple of 999. This is especially interesting because our scanner caching
> >>> > >> > is 1000. So I think we're skipping over the scanner cache sometimes.
> >>> > >> >
> >>> > >> > To get a sense of how many we are missing, the latest run missed 183,816 out
> >>> > >> > of 29,572,075 rows in the source table.
> >>> > >> >
> >>> > >> > Any ideas?
> >>> > >> >
> >>> > >> > Thanks,
> >>> > >> > Sean
> >>> > >> >
> >>> > >> > On Fri, Mar 18, 2011 at 9:58 AM, Michael Segel <[email protected]> wrote:
> >>> > >> >
> >>> > >> > >
> >>> > >> > > Sean,
> >>> > >> > >
> >>> > >> > > Here's a simple test.
> >>> > >> > >
> >>> > >> > > Modify your code so that you aren't using the TableOutputFormat class, but
> >>> > >> > > a null writable and inside the map() method you actually do the write
> >>> > >> > > yourself.
> >>> > >> > >
> >>> > >> > > Also make sure to explicitly flush and close your HTable connection when
> >>> > >> > > your mapper ends.
> >>> > >> > >
> >>> > >> > >
> >>> > >> > >
> >>> > >> > > > From: [email protected]
> >>> > >> > > > Date: Fri, 18 Mar 2011 09:50:47 -0400
> >>> > >> > > > Subject: Scan isn't processing all rows
> >>> > >> > > > To: [email protected]
> >>> > >> > > >
> >>> > >> > > > Hi all,
> >>> > >> > > >
> >>> > >> > > > We're experiencing a problem where a map-only job using TableInputFormat and
> >>> > >> > > > TableOutputFormat to export data from one table into another is not reading
> >>> > >> > > > all of the rows in the source table. That is, # map input records != #
> >>> > >> > > > records in the table. Anyone have any clue how that could happen?
> >>> > >> > > >
> >>> > >> > > > Some more detail:
> >>> > >> > > >
> >>> > >> > > > It appears to only happen when we are writing results to the destination
> >>> > >> > > > table. If I comment out the lines where data is written from the
> >>> > >> > > > mapper (context.write), then the number of input records is correct.
> >>> > >> > > >
> >>> > >> > > > I verified that the rows did not get written to the output table, so
> >>> > >> > > > it's not just a counter problem.
> >>> > >> > > > We aren't using any filter or anything,
> >>> > >> > > > just a straight-up scan to try to read everything in the table.
> >>> > >> > > >
> >>> > >> > > > We're on hbase-0.89.20100924.
> >>> > >> > > >
> >>> > >> > > > Thanks,
> >>> > >> > > > Sean
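
For anyone reading this thread later, the lease-period mismatch described at the top can be sketched as a toy model. This is not HBase's client code: the function names and return strings are made up for illustration, and only the 120s/60s values come from the thread. The point is that each side compares the scanner's idle time against its own configured lease period, so an idle gap between 60s and 120s is an expiry to the server but not a timeout to the client.

```python
# Toy model of the lease-period mismatch; this is NOT HBase's client code.
# All function names and return strings here are hypothetical.
CLIENT_LEASE_S = 120  # lease period configured where the job was submitted
SERVER_LEASE_S = 60   # lease period configured on the region server


def server_drops_scanner(idle_s, server_lease_s=SERVER_LEASE_S):
    """Server side: the scanner is gone once its idle time exceeds the lease."""
    return idle_s > server_lease_s


def client_interpretation(idle_s, client_lease_s=CLIENT_LEASE_S):
    """Client side: how an 'unknown scanner' error is read, in this model.

    The client can only compare the idle time against its own setting.
    """
    if idle_s > client_lease_s:
        return "scanner timeout"
    return "region moved (retry)"


if __name__ == "__main__":
    idle_s = 90  # seconds between two next() calls, between the two settings
    print(server_drops_scanner(idle_s))
    print(client_interpretation(idle_s))
    # With matching settings the client would have reported a timeout instead.
    print(client_interpretation(idle_s, client_lease_s=SERVER_LEASE_S))
```

In this model, keeping the lease period identical on the submitting machine and the region servers removes the window in which the two sides disagree.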
