RE: Period-terminated hostnames

Markus Jelsma Thu, 18 Apr 2013 14:27:03 -0700

Rodney,

Those are valid URLs (a trailing dot just marks a fully qualified domain 
name), but you clearly don't need them. You can either use URL filters to 
drop them or a normalizer rule to rewrite them. Use the 
org.apache.nutch.net.URLNormalizerChecker or URLFilterChecker tools to test 
your config.
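
A minimal sketch of the normalizer approach, for trying out a pattern before 
it goes into regex-normalize.xml. The class name, method name and the regex 
itself are illustrative assumptions, not taken from Nutch or from the config 
described in the original message; the sketch only relies on java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks a trailing-dot pattern
// in isolation before it is copied into regex-normalize.xml.
class TrailingDotDemo {
    // Matches the scheme and host, then one dot that is directly followed
    // by ':', '/', '?', '#', or the end of the URL. Dots in the path are
    // out of reach because the host part cannot cross a '/'.
    static final Pattern HOST_TRAILING_DOT =
        Pattern.compile("^(https?://[^/:?#]+?)\\.(?=[:/?#]|$)");

    // Returns the URL with a single trailing dot removed from the hostname.
    static String stripTrailingHostDot(String url) {
        return HOST_TRAILING_DOT.matcher(url).replaceFirst("$1");
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com./",
            "http://example.com.",
            "http://example.com.:8080/x",
            "http://example.com/a.b./"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + stripTrailingHostDot(s));
        }
    }
}
```

If the pattern behaves as intended, the same expression (with single 
backslashes) and the substitution $1 can be placed in a <regex> entry with 
<pattern> and <substitution> elements in regex-normalize.xml. As far as I 
know, URLNormalizerChecker reads URLs from standard input, so piping 
http://example.com./ into 
bin/nutch org.apache.nutch.net.URLNormalizerChecker should show whether the 
rule is picked up.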


Markus

 
 
-----Original message-----
> From:Rodney Barnett <[email protected]>
> Sent: Thu 18-Apr-2013 22:31
> To: [email protected]
> Subject: Period-terminated hostnames
> 
> I'm using nutch 1.6 to crawl a variety of web pages/sites and I'm finding
> that my solr database contains pairs of near-duplicate entries where the
> main difference is that one contains a period after the hostname in the id.
> For example:
> 
> entry 1: id: http://example.com/
> 
> entry 2: id: http://example.com./
> 
>  
> 
> I can't find any references to this issue.  Has anyone else noticed this?
> Is there a good way to correct this?
> 
>  
> 
> I've added an entry to regex-normalize.xml to remove the period, but I'm not
> sure yet whether it works.  Is there a good way to test the url normalizer
> configuration?
> 
>  
> 
> I tracked the source of some of these urls back to hyperlinks extracted from
> PDF files where the hyperlink doesn't seem to have the period but the linked
> text is followed by a period.  For example:
> "{link}http://example.com{/link}." where the curly braces indicate the
> hyperlink boundaries.  The command "nutch parsechecker" reports that the
> outlink is http://example.com. for this case.
> 
>  
> 
> Thanks for any assistance.
> 
>  
> 
> Rodney
> 
>
