Re: Not valid URLs in Crawldb through crawlcomplete

Michael Coffey Wed, 29 Nov 2017 08:08:48 -0800

I bet that problem affects a lot of people. It certainly has affected me. 

Why isn't essential filtering ON by default?


The bin/crawl script doesn't even have a way for the operator to specify any 
filtering. And nowhere in the tutorial is it mentioned that you need to 
pass "-filter" to updatedb to make it work.


From: Sebastian Nagel <[email protected]>
To: [email protected]
Sent: Wednesday, November 29, 2017 2:40 AM
Subject: Re: Not valid URLs in Crawldb through crawlcomplete

Hi,

all 8 available urlfilter-* plugins are linked from the API doc page
  https://builds.apache.org/job/nutch-trunk/javadoc/

Activate those you need in the property plugin.includes.

Most of the urlfilter plugins have a specific configuration file which
must be adapted to your needs.

For this specific problem, it should be enough to activate urlfilter-validator.
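
As a rough sketch of what activating the plugin could look like, the override
below would go into conf/nutch-site.xml. The value shown is only an assumed
baseline, not necessarily the default shipped with your Nutch version: copy the
plugin.includes value from your own conf/nutch-default.xml and add "validator"
to the urlfilter group.

```xml
<!-- conf/nutch-site.xml: overrides plugin.includes from nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed baseline value; start from the one in your own
       nutch-default.xml. The only intended change is that the urlfilter
       group now reads (regex|validator), so urlfilter-validator is loaded. -->
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

The value is a regular expression matched against plugin IDs, so a plugin is
only loaded if its ID matches one of the alternatives.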

Best,
Sebastian

On 11/29/2017 09:21 AM, Semyon Semyonov wrote:
> Hi Sebastian,
> 
> We didn't set up the URL filters. 
> Could you let me know how to specify them (is it a file with URL filters plus a 
> plugin?) and maybe advise me on a default filter that filters out such 
> problematic URLs?
> 
> Thanks.
> 
> Semyon.
> 
> 
> Sent: Tuesday, November 28, 2017 at 4:17 PM
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Subject: Re: Not valid URLs in Crawldb through crawlcomplete
> Hi Semyon,
> 
>> It seems like Nutch takes the anchor name as a URL for crawling and stores 
>> it in the database with the key equal to the name.
> 
> if you look into the page HTML you can see that it's the href attribute:
> 
> <p><a href="http://#Are there any places to eat onsite during the show?"
>    target="_self"
>    title="http://#Are there any places to eat onsite during the show?">Are there any places to eat onsite during the show?</a></p>
> 
> 
> How are URL filters configured? Normally, a URL
> "http://#Are there any places to eat onsite during the show?"
> should not make it into the CrawlDb.
> 
> Best,
> Sebastian
> 
> On 11/28/2017 02:17 PM, Semyon Semyonov wrote:
>> Hello all,
>>
>> I launched a crawl of 100 websites with external links set to true.
>> After several hours, I ran the crawlcomplete command with mode set to host.
>>
>> The crawlcomplete output file contains (apart from the proper host names) the 
>> following lines:
>>
>> 1 #Are there any places to eat onsite during the show#Are there any places 
>> to eat onsite during the show UNFETCHED
>> 1 #Are there any points where I can access the internet at the show#Are 
>> there any points where I can access the internet at the show UNFETCHED
>> 1 #Can I register onsite#Can I register onsite UNFETCHED
>> 1 #Can children attend the show#Can children attend the show UNFETCHED
>> 1 #Can you recommend any site-seeing attractions in Amsterdam#Can you 
>> recommend any site-seeing attractions in Amsterdam UNFETCHED
>> 1 #Do I need a visa#Do I need a visa UNFETCHED
>> 1 #How do I get to IBC2018 at the Amsterdam RAI#How do I get to IBC2018 at 
>> the Amsterdam RAI UNFETCHED
>> 1 #Is there anywhere for me to practice my religion#Is there anywhere for me 
>> to practice my religion UNFETCHED
>> 1 #Is there parking#Is there parking UNFETCHED
>> 1 #Want to exhibit at IBC2018#Want to exhibit at IBC2018 UNFETCHED
>> 1 #What do I have access to at IBC#What do I have access to at IBC UNFETCHED
>> 1 #What do I need to bring to IBC#What do I need to bring to IBC UNFETCHED
>> 1 #What is the IBC Big Screen Experience#What is the IBC Big Screen 
>> Experience UNFETCHED
>> 1 #When and where is IBC#When and where is IBC UNFETCHED
>> 1 #Who attends IBC#Who attends IBC UNFETCHED
>>
>> After googling I found the webpage where it came from:
>> https://show.ibc.org/about-ibc/faqs
>>
>> It seems like Nutch takes the anchor name as a URL for crawling and stores 
>> it in the database with the key equal to the name.
>>
>> For example.
>> <a class="anchor" name="Are there any places to eat onsite during the 
>> show?"></a>
>>
>> Any suggestions as to what this is and how to fix it?
>> Thanks.
>>
>> Semyon.
>>
>  
>
