> Why isn't essential filtering ON by default?

Good question. By default only urlfilter-regex is active. That has been the case for a long time. I think it's better not to burden first-time users with the need to configure multiple filters.
But adding urlfilter-validator might be a good idea. Feel free to open a Jira issue for this.

> And nowhere, in the tutorial, is it mentioned that you need to specify
> "-filter" to updatedb to make it work.

No, you don't have to. By default, filters are only applied to:
- injected URLs
- outlinks during parsing
- redirects (if the fetcher follows redirects)

It's most efficient not to filter the CrawlDb. It's costly to apply the filters again and again: the CrawlDb might be huge (up to billions of URLs), and/or filter rules can be complex. The default does what is necessary but avoids unnecessary work.

Best,
Sebastian

On 11/29/2017 05:07 PM, Michael Coffey wrote:
> I bet that problem affects a lot of people. It certainly has affected me.
>
> Why isn't essential filtering ON by default?
>
> The bin/crawl script doesn't even have a way for the operator to specify any
> filtering. And nowhere, in the tutorial, is it mentioned that you need to
> specify "-filter" to updatedb to make it work.
>
>
> From: Sebastian Nagel <[email protected]>
> To: [email protected]
> Sent: Wednesday, November 29, 2017 2:40 AM
> Subject: Re: Not valid URLs in Crawldb through crawlcomplete
>
> Hi,
>
> all 8 available urlfilter-* plugins are linked from the API doc page
> https://builds.apache.org/job/nutch-trunk/javadoc/
>
> Activate those you need in the property plugin.includes.
>
> Most of the urlfilter plugins have a specific configuration file which
> must be adapted to your needs.
>
> For the specific problem it's enough to just activate urlfilter-validator.
>
> Best,
> Sebastian
>
> On 11/29/2017 09:21 AM, Semyon Semyonov wrote:
>> Hi Sebastian,
>>
>> We didn't set up the URL filters.
>> Could you let me know the way to specify them (it's a file with urlfilters +
>> plugin, right?) and maybe advise me on a default filter that filters such
>> problematic URLs?
>>
>> Thanks.
>>
>> Semyon.
>>
>>
>> Sent: Tuesday, November 28, 2017 at 4:17 PM
>> From: "Sebastian Nagel" <[email protected]>
>> To: [email protected]
>> Subject: Re: Not valid URLs in Crawldb through crawlcomplete
>> Hi Semyon,
>>
>>> It seems like Nutch takes the anchor name as a URL for crawling and
>>> stores it in the database with the key equal to the name.
>>
>> If you look into the page HTML you can see that it's the href attribute:
>>
>> <p><a href="http://#Are there any places to eat onsite during the show?"
>> target="_self"
>> title="http://#Are there any places to eat onsite during the show?">Are
>> there any places to eat
>> onsite during the show?</a></p>
>>
>>
>> How are URL filters configured? Normally, a URL
>> "http://#Are there any places to eat onsite during the show?"
>> should not make it into the CrawlDb.
>>
>> Best,
>> Sebastian
>>
>> On 11/28/2017 02:17 PM, Semyon Semyonov wrote:
>>> Hello all,
>>>
>>> I have launched a crawling process for 100 websites with external links
>>> set to true.
>>> After several hours, I ran the crawlcomplete command with mode set to host.
>>>
>>> The crawlcomplete output file contains (apart from the proper host names)
>>> the following lines.
>>>
>>> 1 #Are there any places to eat onsite during the show#Are there any places to eat onsite during the show UNFETCHED
>>> 1 #Are there any points where I can access the internet at the show#Are there any points where I can access the internet at the show UNFETCHED
>>> 1 #Can I register onsite#Can I register onsite UNFETCHED
>>> 1 #Can children attend the show#Can children attend the show UNFETCHED
>>> 1 #Can you recommend any site-seeing attractions in Amsterdam#Can you recommend any site-seeing attractions in Amsterdam UNFETCHED
>>> 1 #Do I need a visa#Do I need a visa UNFETCHED
>>> 1 #How do I get to IBC2018 at the Amsterdam RAI#How do I get to IBC2018 at the Amsterdam RAI UNFETCHED
>>> 1 #Is there anywhere for me to practice my religion#Is there anywhere for me to practice my religion UNFETCHED
>>> 1 #Is there parking#Is there parking UNFETCHED
>>> 1 #Want to exhibit at IBC2018#Want to exhibit at IBC2018 UNFETCHED
>>> 1 #What do I have access to at IBC#What do I have access to at IBC UNFETCHED
>>> 1 #What do I need to bring to IBC#What do I need to bring to IBC UNFETCHED
>>> 1 #What is the IBC Big Screen Experience#What is the IBC Big Screen Experience UNFETCHED
>>> 1 #When and where is IBC#When and where is IBC UNFETCHED
>>> 1 #Who attends IBC#Who attends IBC UNFETCHED
>>>
>>> After googling I found the webpage where it came from:
>>> https://show.ibc.org/about-ibc/faqs
>>>
>>> It seems like Nutch takes the anchor name as a URL for crawling and
>>> stores it in the database with the key equal to the name.
>>>
>>> For example:
>>> <a class="anchor" name="Are there any places to eat onsite during the show?"></a>
>>>
>>> Any suggestion what it is and how to fix it?
>>> Thanks.
>>>
>>> Semyon.
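
Note on the thread above: to illustrate why a URL such as
"http://#Are there any places to eat onsite during the show?" should normally
not make it into the CrawlDb, here is a minimal Python sketch. It is not the
urlfilter-validator implementation and does not reproduce Nutch's rules. The
function name looks_fetchable and its two checks (scheme is http or https,
host is non-empty) are assumptions made purely for illustration.

```python
from urllib.parse import urlsplit


def looks_fetchable(url: str) -> bool:
    """Hypothetical check: accept only http(s) URLs that have a host part."""
    parts = urlsplit(url)
    # For "http://#...", everything after '#' is parsed as the fragment,
    # so the network location is empty and parts.hostname is None.
    return parts.scheme in ("http", "https") and bool(parts.hostname)


if __name__ == "__main__":
    bad = "http://#Are there any places to eat onsite during the show?"
    good = "https://show.ibc.org/about-ibc/faqs"
    for candidate in (bad, good):
        print(candidate, "->", looks_fetchable(candidate))
```

The href from the FAQ page has a scheme but no host, which is the kind of
structural defect a validating URL filter can catch at parse time, when
outlinks are filtered, without re-filtering the whole CrawlDb later.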

