Hi Sebastian,

We didn't set up the URL filters. Could you let me know how to specify them (is it a file with URL filters plus a plugin?) and maybe advise me on a default filter that rejects such problematic URLs?
Thanks.
Semyon.

Sent: Tuesday, November 28, 2017 at 4:17 PM
From: "Sebastian Nagel" <[email protected]>
To: [email protected]
Subject: Re: Not valid URLs in Crawldb through crawlcomplete

Hi Semyon,

> It seems like Nutch takes the anchor name as an URL for the crawling a store
> it in database with the key equals to name.

if you look into the page HTML you can see that it's the href attribute:

<p><a href="http://#Are there any places to eat onsite during the show?" target="_self" title="http://#Are there any places to eat onsite during the show?">Are there any places to eat onsite during the show?</a></p>

How are URL filters configured? Normally, a URL
"http://#Are there any places to eat onsite during the show?"
should not make it into the CrawlDb.

Best,
Sebastian

On 11/28/2017 02:17 PM, Semyon Semyonov wrote:
> Hello all,
>
> I have launched a crawling process for 100 websites with external links
> equals to true.
> After several hours, I run the crawlcomplete command with mode equals host.
>
> The crawlcomplete output file contains(apart from the proper host names) the
> following lines.
>
> 1 #Are there any places to eat onsite during the show#Are there any places to eat onsite during the show UNFETCHED
> 1 #Are there any points where I can access the internet at the show#Are there any points where I can access the internet at the show UNFETCHED
> 1 #Can I register onsite#Can I register onsite UNFETCHED
> 1 #Can children attend the show#Can children attend the show UNFETCHED
> 1 #Can you recommend any site-seeing attractions in Amsterdam#Can you recommend any site-seeing attractions in Amsterdam UNFETCHED
> 1 #Do I need a visa#Do I need a visa UNFETCHED
> 1 #How do I get to IBC2018 at the Amsterdam RAI#How do I get to IBC2018 at the Amsterdam RAI UNFETCHED
> 1 #Is there anywhere for me to practice my religion#Is there anywhere for me to practice my religion UNFETCHED
> 1 #Is there parking#Is there parking UNFETCHED
> 1 #Want to exhibit at IBC2018#Want to exhibit at IBC2018 UNFETCHED
> 1 #What do I have access to at IBC#What do I have access to at IBC UNFETCHED
> 1 #What do I need to bring to IBC#What do I need to bring to IBC UNFETCHED
> 1 #What is the IBC Big Screen Experience#What is the IBC Big Screen Experience UNFETCHED
> 1 #When and where is IBC#When and where is IBC UNFETCHED
> 1 #Who attends IBC#Who attends IBC UNFETCHED
>
> After googling I found the webpage where it came from:
> https://show.ibc.org/about-ibc/faqs
>
> It seems like Nutch takes the anchor name as an URL for the crawling a store
> it in database with the key equals to name.
>
> For example.
> <a class="anchor" name="Are there any places to eat onsite during the
> show?"></a>
>
> Any suggestion what is it and how to fix it?
> Thanks.
>
> Semyon.
>
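
On the URL filter question at the top of the thread: in a stock Nutch setup, regex URL filters usually live in conf/regex-urlfilter.txt and are applied by the urlfilter-regex plugin, which has to be listed in the plugin.includes property of nutch-site.xml. Each rule is a line starting with "+" (accept) or "-" (reject), followed by a regular expression. The Python sketch below is not Nutch code. It only mimics that rule format under the assumption that the first matching rule decides and that a URL matching no rule is rejected. The two reject rules are hypothetical examples written for the URLs in this thread, not Nutch's shipped defaults, and the names RULES_TEXT, parse_rules and accept are made up for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical rules in the "+regex / -regex" style of a regex URL filter file.
# These are illustrative only, not Nutch's default rule set.
RULES_TEXT = r"""
# reject URLs that contain whitespace
-\s
# reject URLs with an empty host directly followed by a fragment, e.g. http://#foo
-^https?://#
# accept everything else
+.
"""


def parse_rules(text: str) -> List[Tuple[bool, re.Pattern]]:
    """Turn rule lines into (accept?, compiled regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accept(url: str, rules: List[Tuple[bool, re.Pattern]]) -> bool:
    """Assumed semantics: first matching rule wins; no match means reject."""
    for allowed, pattern in rules:
        if pattern.search(url):
            return allowed
    return False


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in (
        "http://#Are there any places to eat onsite during the show?",
        "https://show.ibc.org/about-ibc/faqs",
    ):
        print("accept" if accept(url, RULES) else "reject", url)
```

If rules along these lines were added above the final "+." line of the real filter file, the malformed hrefs from the FAQ page should be dropped before they reach the CrawlDb, while regular page URLs still pass. Entries already stored in the CrawlDb would not disappear just because the rules changed.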

