Okay, see the ticket here: https://issues.apache.org/jira/browse/HBASE-3686.

Thanks,
Sean

On Tue, Mar 22, 2011 at 2:40 PM, Stack <[email protected]> wrote:

> Good man Sean for figuring it.  Please make an issue.   Lets try and
> fix it so no one else has the pain you just did.
> St.Ack
>
> On Tue, Mar 22, 2011 at 11:27 AM, Sean Sechrist <[email protected]> wrote:
> > Okay, so I figured out what was going wrong:
> >
> > The property hbase.regionserver.lease.period was 120s on the machine I
> > submitted the job from, but was only 60s on the RegionServer.
> >
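A mismatch like this can be caught by comparing the two configurations before a job runs. The sketch below is only an illustration: it assumes the usual Hadoop-style hbase-site.xml layout, uses inline XML strings standing in for the client and RegionServer files, and the helper name read_property is made up for this example. The values mirror the 120s and 60s from the message, written in milliseconds as the property expects.

```python
import xml.etree.ElementTree as ET

LEASE_PROPERTY = "hbase.regionserver.lease.period"

# Hypothetical stand-ins for the hbase-site.xml on the submitting machine
# and on the RegionServer; real files would be read from disk instead.
CLIENT_SITE_XML = """<configuration>
  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>120000</value>
  </property>
</configuration>"""

SERVER_SITE_XML = """<configuration>
  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>60000</value>
  </property>
</configuration>"""


def read_property(site_xml, name):
    """Return the value of a named property as a string, or None if absent."""
    root = ET.fromstring(site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


client_lease_ms = int(read_property(CLIENT_SITE_XML, LEASE_PROPERTY))
server_lease_ms = int(read_property(SERVER_SITE_XML, LEASE_PROPERTY))

if client_lease_ms != server_lease_ms:
    print("lease period mismatch: client=%d ms, server=%d ms"
          % (client_lease_ms, server_lease_ms))
```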
> > This caused the scanner to time out on the region server. But when the
> > next HTable.ClientScanner.next() call got the UnknownScannerException
> > sent back from the region server, it didn't think the scanner had
> > actually timed out, since the time elapsed was <120s. So instead of a
> > ScannerTimeoutException being thrown, it was treated as if it were an
> > NSRE (NotServingRegionException).
> >
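A rough model of the decision described above, assuming the client compares the time since its last next() call against its own copy of the lease period. This is not the HBase client source; the function names and the returned strings are hypothetical and exist only to show where the two views of the lease disagree.

```python
CLIENT_LEASE_MS = 120_000  # lease period configured on the submitting machine
SERVER_LEASE_MS = 60_000   # lease period configured on the RegionServer


def server_scanner_expired(elapsed_ms, server_lease_ms=SERVER_LEASE_MS):
    """The server drops the scanner once its own lease has run out."""
    return elapsed_ms > server_lease_ms


def client_interpretation(elapsed_ms, client_lease_ms=CLIENT_LEASE_MS):
    """How the client reads an UnknownScannerException, per the description."""
    if elapsed_ms > client_lease_ms:
        return "ScannerTimeoutException"
    # The client believes the lease is still valid, so it assumes the
    # region moved and resets the scanner instead of failing the task.
    return "handled like NSRE"


for elapsed_ms in (30_000, 90_000, 150_000):
    if server_scanner_expired(elapsed_ms):
        print(elapsed_ms, "server expired scanner ->",
              client_interpretation(elapsed_ms))
    else:
        print(elapsed_ms, "scanner still alive on server")
```

The interesting window is between the two lease values: at 90s the server has already dropped the scanner, but the client still takes the reset path rather than reporting a timeout.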
> > I haven't been able to trace through the code to see *exactly* why that
> > causes rows to be skipped, but I have a very simple job that
> > consistently reproduces the problem; see here:
> > http://pastebin.com/H5Ymq9UJ
> >
> > -Sean
> >
> > On Mon, Mar 21, 2011 at 2:28 PM, Sean Sechrist <[email protected]> wrote:
> >
> >> The missed rows are not found near the beginning or end of a task (or
> >> region). It's also a *read* problem - all of the writes that it tries to
> >> do are fine. There is just not the correct number of input records from
> >> the scan.
> >>
> >> Thanks,
> >> Sean
> >>
> >>
> >> On Mon, Mar 21, 2011 at 2:13 PM, Michael Segel <[email protected]> wrote:
> >>
> >>>  Which release?
> >>>
> >>> I thought that when you issued the flush() command that it was a
> >>> 'blocking' command in that it didn't return control back until after
> >>> the flush() completes?
> >>>
> >>> In Sean's response, the losses were in some of the threads and not all
> >>> of them. He didn't indicate where the records were in each region.
> >>> If, as you say, the flush() command is too close to the close()
> >>> statement, then these records would be at the end of the region.
> >>>
> >>> Another area to look into is what is happening when the table's
> >>> regions split during the write.
> >>>
> >>>
> >>> HTH
> >>>
> >>> -Mike
> >>>
> >>> PS. When I tried writing to [email protected], the e-mails
> >>> bounced. Not sure why...
> >>>
> >>> > Date: Mon, 21 Mar 2011 13:55:02 -0400
> >>> > From: [email protected]
> >>> > To: [email protected]; [email protected]
> >>> > Subject: Re: FW: Scan isn't processing all rows
> >>>
> >>> >
> >>> > I had a problem of lost rows when the flush was right before the
> >>> > close statement.
> >>> > ---- Sean Sechrist <[email protected]> wrote:
> >>> >
> >>> > =============
> >>> > Accidentally dropped the user list of this email exchange. Anyone
> >>> > have any other ideas here?
> >>> >
> >>> > But using scanner caching of 1 fixes the problem, as suspected. So
> >>> > now I'll investigate why the scanner cache is being lost.
> >>> >
> >>> > Thanks,
> >>> > Sean
> >>> >
> >>> > On Mon, Mar 21, 2011 at 11:06 AM, Sean Sechrist <[email protected]> wrote:
> >>> >
> >>> > > Hey Mike, thanks for the response.
> >>> > >
> >>> > > > This would mean that you have 184 mappers, right?
> >>> > >
> >>> > > We actually had 43 mappers (43 regions in the source table).
> >>> > >
> >>> > > > If this is correct, then it appears that you are losing only the
> >>> > > > records cached once per mapper task.
> >>> > > > It would be interesting to see if this happened in the first set
> >>> > > > of cached rows, or if it happens in the last set of cached rows.
> >>> > >
> >>> > > So it actually happens (possibly) more than once per task. For
> >>> > > example, for the first 10 tasks, here are the numbers of missed
> >>> > > records:
> >>> > >
> >>> > > 0, 0, 3996, 4995, 0, 0, 999, 1998, 3996, 999
> >>> > >
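The per-task numbers above line up with the observation elsewhere in this thread that skips come in multiples of 999 when scanner caching is 1000. A quick check, using only figures quoted in the thread (the variable names are illustrative, and treating 999 as "caching minus one" is an observation about the numbers, not a confirmed explanation of the bug):

```python
SCANNER_CACHING = 1000
SKIP_UNIT = SCANNER_CACHING - 1  # 999, the unit the missed counts share

total_missed = 183_816           # missed rows in the latest run
mappers = 43                     # one mapper per region of the source table
first_ten_tasks = [0, 0, 3996, 4995, 0, 0, 999, 1998, 3996, 999]

# Every reported count should divide evenly by 999.
assert total_missed % SKIP_UNIT == 0
assert all(missed % SKIP_UNIT == 0 for missed in first_ten_tasks)

skip_events_total = total_missed // SKIP_UNIT
skip_events_per_task = [missed // SKIP_UNIT for missed in first_ten_tasks]

print(skip_events_total)                      # 184
print(skip_events_per_task)                   # [0, 0, 4, 5, 0, 0, 1, 2, 4, 1]
print(round(skip_events_total / mappers, 1))  # 4.3
```

With 43 mappers, 184 skip events average out to a bit over four per task, which fits the point that the skip happens more than once per task rather than once per mapper.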
> >>> > > > My next suggestion is to turn off the scan caching.
> >>> > >
> >>> > > Good idea, I'll see if that works.
> >>> > >
> >>> > > Thanks,
> >>> > > Sean
> >>> > >
> >>> > > On Mon, Mar 21, 2011 at 10:39 AM, Michael Segel <[email protected]> wrote:
> >>> > >
> >>> > >> For some reason my e-mail to the hbase list failed....
> >>> > >>
> >>> > >>
> >>> > >> ------------------------------
> >>> > >> From: [email protected]
> >>> > >>
> >>> > >> To: [email protected]
> >>> > >> Subject: RE: Scan isn't processing all rows
> >>> > >> Date: Mon, 21 Mar 2011 09:37:06 -0500
> >>> > >>
> >>> > >> Sean,
> >>> > >> Ok...
> >>> > >>
> >>> > >> Lets think about this...
> >>> > >>
> >>> > >> You're saying that without the actual put, your application is
> >>> > >> reading all of the rows and they are being processed correctly.
> >>> > >> You said that when you add the put() to the second table, it
> >>> > >> appears that rows that were scanned and are in the cache are lost,
> >>> > >> so that you are missing multiples of 999 rows.
> >>> > >> Based on your example...
> >>> > >>
> >>> > >> > To get a sense of how many we are missing, the latest run missed
> >>> > >> > 183,816 out of 29,572,075 rows in the source table.
> >>> > >>
> >>> > >> This would mean that you have 184 mappers, right?
> >>> > >>
> >>> > >> If this is correct, then it appears that you are losing only the
> >>> > >> records cached once per mapper task.
> >>> > >> It would be interesting to see if this happened in the first set
> >>> > >> of cached rows, or if it happens in the last set of cached rows.
> >>> > >> (You can see this by seeing which rows are missing and where they
> >>> > >> are in the HTable region based on their row key.)
> >>> > >>
> >>> > >> My next suggestion is to turn off the scan caching.
> >>> > >> You will obviously take a little performance hit, but that should
> >>> > >> clean up the problem.
> >>> > >>
> >>> > >> If that works, then you should be able to start to look at your
> >>> > >> code to see what's causing the failure.
> >>> > >>
> >>> > >> HTH
> >>> > >>
> >>> > >> -Mike
> >>> > >>
> >>> > >> > From: [email protected]
> >>> > >> > Date: Mon, 21 Mar 2011 09:01:32 -0400
> >>> > >> > Subject: Re: Scan isn't processing all rows
> >>> > >>
> >>> > >> > To: [email protected]
> >>> > >> >
> >>> > >> > Okay, I've tried that test, as well as making sure speculative
> >>> > >> > execution is turned off. Neither made a difference. It's not only
> >>> > >> > a problem with writing to the target table - the number of map
> >>> > >> > input records for the job is wrong as well. But it's correct when
> >>> > >> > we run jobs that do not write to HBase, such as a row count.
> >>> > >> >
> >>> > >> > I ran another job to calculate the number of missed rows per
> >>> > >> > region of the source table (which is not consistent between
> >>> > >> > runs), by comparing the source table with the target table.
> >>> > >> >
> >>> > >> > An interesting thing I found is that the number of skipped rows
> >>> > >> > is always a multiple of 999. This is especially interesting
> >>> > >> > because our scanner caching is 1000. So I think we're skipping
> >>> > >> > over the scanner cache sometimes.
> >>> > >> >
> >>> > >> > To get a sense of how many we are missing, the latest run missed
> >>> > >> > 183,816 out of 29,572,075 rows in the source table.
> >>> > >> >
> >>> > >> > Any ideas?
> >>> > >> >
> >>> > >> > Thanks,
> >>> > >> > Sean
> >>> > >> >
> >>> > >> > On Fri, Mar 18, 2011 at 9:58 AM, Michael Segel <[email protected]> wrote:
> >>> > >> >
> >>> > >> > >
> >>> > >> > > Sean,
> >>> > >> > >
> >>> > >> > > Here's a simple test.
> >>> > >> > >
> >>> > >> > > Modify your code so that you aren't using the TableOutputFormat
> >>> > >> > > class, but a null writable, and inside the map() method you
> >>> > >> > > actually do the write yourself.
> >>> > >> > >
> >>> > >> > > Also make sure to explicitly flush and close your HTable
> >>> > >> > > connection when your mapper ends.
> >>> > >> > >
> >>> > >> > >
> >>> > >> > >
> >>> > >> > > > From: [email protected]
> >>> > >> > > > Date: Fri, 18 Mar 2011 09:50:47 -0400
> >>> > >> > > > Subject: Scan isn't processing all rows
> >>> > >> > > > To: [email protected]
> >>> > >> > > >
> >>> > >> > > > Hi all,
> >>> > >> > > >
> >>> > >> > > > We're experiencing a problem where a map-only job using
> >>> > >> > > > TableInputFormat and TableOutputFormat to export data from one
> >>> > >> > > > table into another is not reading all of the rows in the
> >>> > >> > > > source table. That is, # map input records != # records in
> >>> > >> > > > the table. Anyone have any clue how that could happen?
> >>> > >> > > >
> >>> > >> > > > Some more detail:
> >>> > >> > > >
> >>> > >> > > > It appears to only happen when we are writing results to the
> >>> > >> > > > destination table. If I comment out the lines where data is
> >>> > >> > > > written from the mapper (context.write), then the number of
> >>> > >> > > > input records is correct.
> >>> > >> > > >
> >>> > >> > > > I verified that the rows did not get written to the output
> >>> > >> > > > table, so it's not just a counter problem. We aren't using any
> >>> > >> > > > filter or anything, just a straight-up scan to try to read
> >>> > >> > > > everything in the table.
> >>> > >> > > >
> >>> > >> > > > We're on hbase-0.89.20100924.
> >>> > >> > > >
> >>> > >> > > > Thanks,
> >>> > >> > > > Sean
> >>> > >> > >
> >>> > >>
> >>> > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>
