Hello, you can try debug logging. Alternatively, put the whole list of URLs in a flat file and pipe it to bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined. URLs prefixed with a plus (+) passed the filters; URLs prefixed with a minus (-) were filtered out.
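
To see why the order of the rules matters, here is a rough Python sketch of how a regex-urlfilter style file is evaluated: rules are tried top to bottom, the first rule whose pattern is found in the URL decides, and a URL matching no rule is rejected. This is only an illustration, not Nutch code. The names parse_rules and check and the example.com URLs are made up; the three rules are the uncommented ones from the quoted configuration. The output format imitates the +/- prefixes described above.

```python
import re

# Uncommented rules from the quoted regex-urlfilter configuration, in file order.
RULES = r"""
-^(http://up.anv.bz)
+.
-^.{513,}$
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            continue  # not a filter rule
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def check(url, rules):
    """Return the URL prefixed with '+' (passed) or '-' (filtered)."""
    for accept, regex in rules:
        if regex.search(url):  # first matching rule decides
            return ("+" if accept else "-") + url
    return "-" + url  # no rule matched: rejected


if __name__ == "__main__":
    rules = parse_rules(RULES)
    samples = [
        "http://up.anv.bz/page",                # hypothetical URLs
        "http://example.com/",
        "http://example.com/" + "a" * 600,      # longer than 512 characters
    ]
    for url in samples:
        print(check(url, rules)[:60])
```

Under these assumptions the 600+ character URL is still accepted, because "+." matches before the length rule is reached. If that rule is meant to take effect, it probably needs to sit above "+." in the file.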

-----Original message-----
> From:shubham.gupta <[email protected]>
> Sent: Wednesday 5th October 2016 14:10
> To: [email protected]
> Subject: Re: 90% of URL rejected by filtering (Nutch 2.3.1)
>
> is there any way to find out the url filtered?
>
> Also the line -^.{513,}$ was inserted as the update job was failing
> consistently due to MongoDb exception : key too large to index.
>
> Thanks and Regards,
> Shubham Gupta
>
> On Wednesday 05 October 2016 01:50 PM, Sachin Shaju wrote:
> > For the time being you can comment out this line -^.{513,}$ and check.
> >
> > Regards,
> > Sachin Shaju
> >
> > [email protected]
> > +919539887554
> >
> > On Wed, Oct 5, 2016 at 11:41 AM, shubham.gupta <[email protected]>
> > wrote:
> >
> >> my current regex-urlfilter properties are as follows:
> >>
> >> # skip file: ftp: and mailto: urls
> >> #-^(file|ftp|mailto):
> >>
> >> # skip image and other suffixes we can't yet parse
> >> # for a more extensive coverage use the urlfilter-suffix plugin
> >> #-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|
> >> wmf|WMF|zip|ZIP|ppt|pdf|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|
> >> tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >>
> >> # skip URLs containing certain characters as probable queries, etc.
> >> #-[?*!@=]
> >>
> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> >> loops
> >> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >>
> >> # accept anything else
> >> -^(http://up.anv.bz)
> >> +.
> >>
> >> # skip URLs longer than 512 characters
> >> -^.{513,}$
> >>
> >> Thanks and Regards,
> >> Shubham Gupta
> >>
> >> On Wednesday 05 October 2016 11:29 AM, Sachin Shaju wrote:
> >>
> >>> my regex-urlfilter properties are as follows:
> >>>>>>> # skip file: ftp: and mailto: urls
> >>>>>>> -^(file|ftp|mailto):
> >>>>>>>
> >>>>>>> # skip image and other suffixes we can't yet parse
> >>>>>>> # for a more extensive coverage use the urlfilter-suffix plugin
> >>>>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|
> >>>>>>> wmf|WMF|zip|ZIP|ppt|pdf|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|
> >>>>>>> tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >>>>>>>
> >>>>>>> # skip URLs containing certain characters as probable queries, etc.
> >>>>>>> #-[?*!@=]
> >>>>>>>
> >>>>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to
> >>> break
> >>>>>>> loops
> >>>>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >>>>>>>
> >>>>>>> # accept anything else
> >>>>>>> -^(http://up.anv.bz)
> >>>>>>> +.
> >>>>>>>
> >>>>>>> # skip URLs longer than 512 characters
> >>>>>>> -^.{513,}$
> >>
> >

