Hi,

all 8 available urlfilter-* plugins are linked from the API doc page
https://builds.apache.org/job/nutch-trunk/javadoc/

Activate those you need in the property plugin.includes. Most of the
urlfilter plugins have a specific configuration file which must be adapted
to your needs.

For this specific problem, it should be enough to activate urlfilter-validator.

Best,
Sebastian

On 11/29/2017 09:21 AM, Semyon Semyonov wrote:
> Hi Sebastian,
>
> We didn't set up the URL filters.
> Could you let me know how to specify them (it is a file with urlfilters +
> plugin, right?) and maybe advise me on a default filter that filters such
> problematic URLs?
>
> Thanks.
>
> Semyon.
>
>
> Sent: Tuesday, November 28, 2017 at 4:17 PM
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Subject: Re: Not valid URLs in Crawldb through crawlcomplete
> Hi Semyon,
>
>> It seems like Nutch takes the anchor name as a URL for crawling and stores
>> it in the database with the key equal to the name.
>
> if you look into the page HTML you can see that it's the href attribute:
>
> <p><a href="http://#Are there any places to eat onsite during the show?"
> target="_self"
> title="http://#Are there any places to eat onsite during the show?">Are there
> any places to eat
> onsite during the show?</a></p>
>
>
> How are URL filters configured? Normally, a URL
> "http://#Are there any places to eat onsite during the show?"
> should not make it into the CrawlDb.
>
> Best,
> Sebastian
>
> On 11/28/2017 02:17 PM, Semyon Semyonov wrote:
>> Hello all,
>>
>> I have launched a crawling process for 100 websites with external links
>> set to true.
>> After several hours, I ran the crawlcomplete command with mode set to host.
>>
>> The crawlcomplete output file contains (apart from the proper host names) the
>> following lines.
>>
>> 1 #Are there any places to eat onsite during the show#Are there any places to eat onsite during the show UNFETCHED
>> 1 #Are there any points where I can access the internet at the show#Are there any points where I can access the internet at the show UNFETCHED
>> 1 #Can I register onsite#Can I register onsite UNFETCHED
>> 1 #Can children attend the show#Can children attend the show UNFETCHED
>> 1 #Can you recommend any site-seeing attractions in Amsterdam#Can you recommend any site-seeing attractions in Amsterdam UNFETCHED
>> 1 #Do I need a visa#Do I need a visa UNFETCHED
>> 1 #How do I get to IBC2018 at the Amsterdam RAI#How do I get to IBC2018 at the Amsterdam RAI UNFETCHED
>> 1 #Is there anywhere for me to practice my religion#Is there anywhere for me to practice my religion UNFETCHED
>> 1 #Is there parking#Is there parking UNFETCHED
>> 1 #Want to exhibit at IBC2018#Want to exhibit at IBC2018 UNFETCHED
>> 1 #What do I have access to at IBC#What do I have access to at IBC UNFETCHED
>> 1 #What do I need to bring to IBC#What do I need to bring to IBC UNFETCHED
>> 1 #What is the IBC Big Screen Experience#What is the IBC Big Screen Experience UNFETCHED
>> 1 #When and where is IBC#When and where is IBC UNFETCHED
>> 1 #Who attends IBC#Who attends IBC UNFETCHED
>>
>> After googling I found the webpage where it came from:
>> https://show.ibc.org/about-ibc/faqs
>>
>> It seems like Nutch takes the anchor name as a URL for crawling and stores
>> it in the database with the key equal to the name.
>>
>> For example:
>> <a class="anchor" name="Are there any places to eat onsite during the
>> show?"></a>
>>
>> Any suggestions on what it is and how to fix it?
>> Thanks.
>>
>> Semyon.
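
Note on the fix suggested at the top of this thread: activating urlfilter-validator means adding it to the plugin.includes property, which is normally overridden in conf/nutch-site.xml. The fragment below is only a sketch. The property name and urlfilter-validator come from the thread; every other plugin name in the value is an assumption modeled loosely on a typical Nutch 1.x setup and should be replaced by whatever your installation already lists. Since the value is a regular expression over plugin ids, the validator can be added next to an existing urlfilter entry.

```xml
<!-- conf/nutch-site.xml (sketch, not a recommended plugin set).
     Only urlfilter-validator is taken from the thread; the remaining
     plugin ids are illustrative assumptions. Keep your own list and
     just add "validator" to the urlfilter group. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

With a filter like this active, a link such as "http://#Are there any places to eat onsite during the show?" should be rejected before it reaches the CrawlDb. Entries that are already stored are probably not removed by the configuration change alone.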

