I'm using Nutch 1.6 to crawl a variety of web pages/sites, and I'm finding
that my Solr index contains pairs of near-duplicate entries where the
main difference is that one has a period after the hostname in the id.
For example:

entry 1: id: http://example.com/

entry 2: id: http://example.com./

I can't find any references to this issue.  Has anyone else noticed this?
Is there a good way to correct this?

I've added an entry to regex-normalize.xml to remove the period, but I'm not
sure yet whether it works.  Is there a good way to test the URL normalizer
configuration?
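
One way to sanity-check a rule like this outside of a crawl is to run the
pattern through java.util.regex directly, since the regex normalizer applies
Java regular expressions. The following is only a minimal sketch: the class
name, the method name and the pattern itself are assumptions made for
illustration, not the rule actually added to regex-normalize.xml.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; this class is not part of Nutch.
class TrailingDotCheck {

    // Assumed rule: drop a single period that ends the host part of an
    // http/https URL, i.e. one followed by ':', '/', '?', '#' or end of string.
    static final Pattern TRAILING_HOST_DOT =
        Pattern.compile("^(https?://[^/:?#]+)\\.(?=[:/?#]|$)");

    // Applies the assumed pattern with the substitution "$1".
    static String stripTrailingHostDot(String url) {
        return TRAILING_HOST_DOT.matcher(url).replaceFirst("$1");
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com./",      // period after the hostname
            "http://example.com.",       // form reported by parsechecker
            "http://example.com/",       // already clean, should not change
            "http://example.com/a.b./"   // periods in the path are left alone
        };
        for (String s : samples) {
            System.out.println(s + " -> " + stripTrailingHostDot(s));
        }
    }
}
```

If the output looks right, the same pattern and the substitution $1 should in
principle carry over to a regex entry in regex-normalize.xml, written with a
single backslash since the doubling above is only Java string escaping.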

 

I tracked the source of some of these URLs back to hyperlinks extracted from
PDF files, where the hyperlink itself doesn't seem to include the period but
the linked text is followed by a period.  For example:
"{link}http://example.com{/link}.", where the curly braces indicate the
hyperlink boundaries.  The command "nutch parsechecker" reports that the
outlink is http://example.com. for this case.

Thanks for any assistance.

Rodney
