We ignore false positives for now. A common solution is to maintain a set of known false positives and check that set for membership first, before looking at the bloom filter.

-----Original message-----
> From:Vijith <[email protected]>
> Sent: Mon 03-Sep-2012 13:01
> To: Markus Jelsma <[email protected]>
> Subject: Re: Need some directions
>
> I tried with bloom filters. It's working fine for my sample site. So how did
> you handle false positives then?
> I am working on it as part of a training assignment. I thought this would be
> a good starting point to learn the Nutch code base.
>
> On Fri, Aug 31, 2012 at 7:20 PM, Markus Jelsma <[email protected]> wrote:
>
> -----Original message-----
> > From:Vijith <[email protected]>
> > Sent: Fri 31-Aug-2012 15:44
> > To: [email protected]
> > Subject: Re: Need some directions
> >
> > I have tried running nutch with a sample site with two different urls
> > redirecting to a common resource.
> > I could not find any clues, from hadoop.log, where the common resource is
> > parsed multiple times.
> > Could someone please explain the exact scenario that creates this bug?
> > In the Jira comment you said it fetched page4 twice now.
> >
> > And how does this bug relate to NUTCH-1184?
>
> It relates to 1184 because if URLs in the same fetch list link to a common
> page, it can be followed as well.
>
> We solved this issue by keeping a list of crawled URLs in an external bloom
> filter.
>
> >
> > On Thu, Aug 30, 2012 at 11:44 AM, Vijith <[email protected]> wrote:
> > Hi all,
> >
> > I am new to dev... I am working on NUTCH-1150...
> > I would like to get some directions before I can start... Right now I am
> > going through the Fetcher.java code...
> >
> > --
> > . . . . . thanks & regards
> >
> > Vijith V.
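A minimal sketch of the approach described at the top of this message: a bloom filter of crawled URLs, with a set of known false positives that is checked first. This is only an illustration and is not taken from Nutch. The class name `CrawledUrlFilter`, its methods, the filter sizes and the hashing scheme (String.hashCode combined with FNV-1a) are all assumptions made up for the example. How the false positives are detected in the first place (for example by verifying against an authoritative store) is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Nutch: a bloom filter of crawled URLs
// plus an exact set of URLs known to be false positives.
class CrawledUrlFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    // URLs the bloom filter wrongly reports as crawled (verified elsewhere).
    private final Set<String> knownFalsePositives = new HashSet<>();

    public CrawledUrlFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // 32-bit FNV-1a, used here only as a second hash (an assumption).
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th bit position is derived from two base hashes.
    private int index(String url, int i) {
        return Math.floorMod(url.hashCode() + i * fnv1a(url), numBits);
    }

    // Raw bloom filter lookup: may return true for a URL never added.
    public boolean mightContain(String url) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(url, i))) {
                return false;
            }
        }
        return true;
    }

    // Remember that the bloom filter gives a wrong "crawled" answer for url.
    public void recordFalsePositive(String url) {
        knownFalsePositives.add(url);
    }

    // Once a URL is really crawled, it is no longer a false positive.
    public void markCrawled(String url) {
        knownFalsePositives.remove(url);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(url, i));
        }
    }

    // Check the known false positives first, then the bloom filter.
    public boolean isCrawled(String url) {
        if (knownFalsePositives.contains(url)) {
            return false;
        }
        return mightContain(url);
    }

    public static void main(String[] args) {
        CrawledUrlFilter filter = new CrawledUrlFilter(1 << 20, 4);
        filter.markCrawled("http://example.com/page4");
        // Prints true: a bloom filter never forgets a URL that was added.
        System.out.println(filter.isCrawled("http://example.com/page4"));
    }
}
```

The false-positive set only pays off if it stays small compared to the list of crawled URLs. If it grows large, a bigger bit array or more hash functions is probably the better fix.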

