Hi Sebastian,

We didn't set up any URL filters.
Could you tell me how to specify them (a file with URL filters plus a
plugin, right?) and perhaps suggest a default filter that rejects such
problematic URLs?

Thanks.

Semyon.


Sent: Tuesday, November 28, 2017 at 4:17 PM
From: "Sebastian Nagel" <[email protected]>
To: [email protected]
Subject: Re: Not valid URLs in Crawldb through crawlcomplete
Hi Semyon,

> It seems Nutch takes the anchor name as a URL for crawling and stores
> it in the database with the key equal to the name.

If you look at the page HTML, you can see that it comes from the href attribute:

<p><a href="http://#Are there any places to eat onsite during the show?"
      target="_self"
      title="http://#Are there any places to eat onsite during the show?">Are there any places to eat onsite during the show?</a></p>


How are URL filters configured? Normally, a URL
"http://#Are there any places to eat onsite during the show?"
should not make it into the CrawlDb.
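
To illustrate how regex-based URL filtering could reject such a URL, here
is a minimal sketch. It assumes rules in the style of Nutch's
conf/regex-urlfilter.txt (a "+" or "-" sign followed by a regular
expression, where the first matching rule decides and a URL matching no
rule is dropped). The three rules and the names compile_rules and
url_filter are hypothetical, chosen only for this example; they are not
Nutch's default rule set or its API.

```python
import re

# Hypothetical rules, written in the style of regex-urlfilter.txt:
# '+' accepts, '-' rejects, and the first matching rule wins.
RULE_LINES = [
    r"-\s",               # reject URLs that contain whitespace
    r"-^https?://[#/?]",  # reject URLs with an empty host, e.g. http://#...
    r"+^https?://",       # accept all remaining http(s) URLs
]


def compile_rules(lines):
    """Parse sign-prefixed regex lines into (accept, pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def url_filter(url, rules):
    """Return the URL if accepted, or None if rejected or unmatched."""
    for accept, pattern in rules:
        if pattern.search(url):
            return url if accept else None
    return None  # no rule matched: drop the URL


rules = compile_rules(RULE_LINES)
bad = "http://#Are there any places to eat onsite during the show?"
good = "https://show.ibc.org/about-ibc/faqs"
print(url_filter(bad, rules))
print(url_filter(good, rules))
```

With these assumed rules the malformed href is rejected by the whitespace
rule, while the FAQ page URL itself passes the final accept rule.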

Best,
Sebastian

On 11/28/2017 02:17 PM, Semyon Semyonov wrote:
> Hello all,
>
> I launched a crawl of 100 websites with external links set to true.
> After several hours, I ran the crawlcomplete command with mode set to host.
>
> The crawlcomplete output file contains (apart from the proper host names)
> the following lines:
>
> 1 #Are there any places to eat onsite during the show#Are there any places to eat onsite during the show UNFETCHED
> 1 #Are there any points where I can access the internet at the show#Are there any points where I can access the internet at the show UNFETCHED
> 1 #Can I register onsite#Can I register onsite UNFETCHED
> 1 #Can children attend the show#Can children attend the show UNFETCHED
> 1 #Can you recommend any site-seeing attractions in Amsterdam#Can you recommend any site-seeing attractions in Amsterdam UNFETCHED
> 1 #Do I need a visa#Do I need a visa UNFETCHED
> 1 #How do I get to IBC2018 at the Amsterdam RAI#How do I get to IBC2018 at the Amsterdam RAI UNFETCHED
> 1 #Is there anywhere for me to practice my religion#Is there anywhere for me to practice my religion UNFETCHED
> 1 #Is there parking#Is there parking UNFETCHED
> 1 #Want to exhibit at IBC2018#Want to exhibit at IBC2018 UNFETCHED
> 1 #What do I have access to at IBC#What do I have access to at IBC UNFETCHED
> 1 #What do I need to bring to IBC#What do I need to bring to IBC UNFETCHED
> 1 #What is the IBC Big Screen Experience#What is the IBC Big Screen Experience UNFETCHED
> 1 #When and where is IBC#When and where is IBC UNFETCHED
> 1 #Who attends IBC#Who attends IBC UNFETCHED
>
> After googling I found the webpage where it came from:
> https://show.ibc.org/about-ibc/faqs
>
> It seems Nutch takes the anchor name as a URL for crawling and stores
> it in the database with the key equal to the name.
>
> For example:
> <a class="anchor" name="Are there any places to eat onsite during the show?"></a>
>
> Any suggestions as to what this is and how to fix it?
> Thanks.
>
> Semyon.
>
 
