We ignore false positives for now. A common solution is to maintain a set 
of known false positives and check that set for membership before consulting 
the Bloom filter.
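
A minimal sketch of that check, assuming a hand-rolled Bloom filter over a 
java.util.BitSet. The class and method names (CrawledUrlFilter, markCrawled, 
markFalsePositive, probablyCrawled) are hypothetical, and this is neither Nutch 
code nor the filter described in this thread:

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.CRC32;

// Hypothetical illustration: a Bloom filter of crawled URLs, guarded by an
// exact set of URLs already identified as false positives.
final class CrawledUrlFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    private final Set<String> knownFalsePositives = new HashSet<>();

    CrawledUrlFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int index(String url, int i) {
        int h1 = url.hashCode();
        CRC32 crc = new CRC32();
        crc.update(url.getBytes(StandardCharsets.UTF_8));
        int h2 = (int) crc.getValue();
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Record a URL as crawled; it can no longer be a false positive.
    void markCrawled(String url) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(url, i));
        }
        knownFalsePositives.remove(url);
    }

    // Record a URL that the filter wrongly reports as crawled.
    void markFalsePositive(String url) {
        knownFalsePositives.add(url);
    }

    // Check the false-positive set first, then fall back to the filter.
    boolean probablyCrawled(String url) {
        if (knownFalsePositives.contains(url)) {
            return false;
        }
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(url, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A one-bit filter forces every URL to collide, for demonstration only.
        CrawledUrlFilter filter = new CrawledUrlFilter(1, 1);
        filter.markCrawled("http://example.com/a");
        System.out.println(filter.probablyCrawled("http://example.com/b"));
        filter.markFalsePositive("http://example.com/b");
        System.out.println(filter.probablyCrawled("http://example.com/b"));
    }
}
```

The false-positive set only helps for URLs that have already been identified 
as wrongly reported, for example by comparing the filter against an exact 
record of crawled URLs. How that identification is done is left open here.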
 
-----Original message-----
> From:Vijith <[email protected]>
> Sent: Mon 03-Sep-2012 13:01
> To: Markus Jelsma <[email protected]>
> Subject: Re: Need some directions
> 
> I tried with Bloom filters. It's working fine for my sample site. So how did 
> you handle false positives then?
> I am working on it as part of a training assignment. I thought this would be 
> a good starting point to learn the Nutch code base.
> 
> On Fri, Aug 31, 2012 at 7:20 PM, Markus Jelsma <[email protected]> wrote:
> 
> -----Original message-----
> > From:Vijith <[email protected]>
> > Sent: Fri 31-Aug-2012 15:44
> > To: [email protected]
> > Subject: Re: Need some directions
> >
> > I have tried running Nutch with a sample site with two different URLs 
> > redirecting to a common resource.
> > I could not find any clues in hadoop.log as to where the common resource is 
> > parsed multiple times.
> > Could someone please explain the exact scenario that creates this bug?
> 
> In the Jira comment you said it fetched page4 twice now.
> 
> >
> > And how does this bug relate to NUTCH-1184?
> 
> It relates to 1184 because if URLs in the same fetch list link to a common 
> page, it can be followed as well.
> 
> We solved this issue by keeping a list of crawled URLs in an external Bloom 
> filter.
> 
> >
> > On Thu, Aug 30, 2012 at 11:44 AM, Vijith <[email protected]> wrote:
> > Hi all, 
> >
> > I am new to dev... I am working on NUTCH-1150...
> > I would like to get some directions before I can start... Right now I am 
> > going through the Fetcher.java code...
> >
> > --
> > . . . . . thanks & regards
> >
> > Vijith V.
> 
