RE: 90% of URL rejected by filtering (Nutch 2.3.1)

Markus Jelsma Wed, 05 Oct 2016 11:55:18 -0700

Hello - you can try debug logging. Or get the whole list of URLs in a flat
file and pipe it to bin/nutch org.apache.nutch.net.URLFilterChecker
-allCombined. URLs prefixed with a plus (+) passed the filters; URLs prefixed
with a minus (-) were filtered out.
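
For a long URL list it can help to separate the two groups after the checker
has run. Below is a minimal Python sketch, not part of Nutch. It assumes each
result line of the checker output is simply the URL with a leading + or -, and
it ignores every other line. The function name and the sample lines are
hypothetical.

```python
def split_checker_output(lines):
    """Split checker-style output into (passed, filtered) URL lists.

    Assumption: a result line is the URL prefixed by '+' (passed) or
    '-' (filtered). Lines without such a prefix are skipped.
    """
    passed, filtered = [], []
    for raw in lines:
        line = raw.strip()
        if line.startswith("+"):
            passed.append(line[1:])
        elif line.startswith("-"):
            filtered.append(line[1:])
    return passed, filtered


# Hypothetical sample input, for illustration only.
sample = [
    "some line that is not a result",
    "+http://example.com/",
    "-http://up.anv.bz/page",
]
passed, filtered = split_checker_output(sample)
print("passed:", passed)
print("filtered:", filtered)
```

To use it on real data, save the checker output to a file and pass the open
file object to split_checker_output in place of the sample list.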

-----Original message-----
> From:shubham.gupta <[email protected]>
> Sent: Wednesday 5th October 2016 14:10
> To: [email protected]
> Subject: Re: 90% of URL rejected by filtering (Nutch 2.3.1)
> 
> Is there any way to find out which URLs were filtered?
> 
> Also, the line -^.{513,}$ was inserted because the update job was failing
> consistently with a MongoDB exception: key too large to index.
> 
> Thanks and Regards,
> Shubham Gupta
> 
> On Wednesday 05 October 2016 01:50 PM, Sachin Shaju wrote:
> > For the time being you can comment out this line -^.{513,}$ and check.
> >
> > Regards,
> > Sachin Shaju
> >
> > [email protected]
> > +919539887554
> >
> > On Wed, Oct 5, 2016 at 11:41 AM, shubham.gupta <[email protected]>
> > wrote:
> >
> >> my current regex-urlfilter properties are as follows:
> >>
> >> # skip file: ftp: and mailto: urls
> >> #-^(file|ftp|mailto):
> >>
> >> # skip image and other suffixes we can't yet parse
> >> # for a more extensive coverage use the urlfilter-suffix plugin
> >> #-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|pdf|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >>
> >> # skip URLs containing certain characters as probable queries, etc.
> >> #-[?*!@=]
> >>
> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >>
> >> # accept anything else
> >> -^(http://up.anv.bz)
> >> +.
> >>
> >> # skip URLs longer than 512 characters
> >> -^.{513,}$
> >>
> >> Thanks and Regards,
> >> Shubham Gupta
> >>
> >> On Wednesday 05 October 2016 11:29 AM, Sachin Shaju wrote:
> >>
> >>> my regex-urlfilter properties are as follows:
> >>>>>>> # skip file: ftp: and mailto: urls
> >>>>>>> -^(file|ftp|mailto):
> >>>>>>>
> >>>>>>> # skip image and other suffixes we can't yet parse
> >>>>>>> # for a more extensive coverage use the urlfilter-suffix plugin
> >>>>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|pdf|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >>>>>>>
> >>>>>>> # skip URLs containing certain characters as probable queries, etc.
> >>>>>>> #-[?*!@=]
> >>>>>>>
> >>>>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >>>>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >>>>>>>
> >>>>>>> # accept anything else
> >>>>>>> -^(http://up.anv.bz)
> >>>>>>> +.
> >>>>>>>
> >>>>>>> # skip URLs longer than 512 characters
> >>>>>>> -^.{513,}$
> >>
> 
>
