I'm using Nutch 1.6 to crawl a variety of web pages/sites, and I'm finding that my Solr index contains pairs of near-duplicate entries where the main difference is that one has a period after the hostname in the id. For example:
entry 1: id: http://example.com/
entry 2: id: http://example.com./

I can't find any references to this issue. Has anyone else noticed this? Is there a good way to correct it? I've added an entry to regex-normalize.xml to remove the period, but I'm not sure yet whether it works. Is there a good way to test the URL normalizer configuration?

I tracked the source of some of these URLs back to hyperlinks extracted from PDF files, where the hyperlink itself doesn't seem to include the period but the linked text is followed by one. For example:

"{link}http://example.com{/link}."

where the curly braces indicate the hyperlink boundaries. For this case, the command "nutch parsechecker" reports the outlink as http://example.com. (with the trailing period).

Thanks for any assistance.

Rodney
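
One rough way to sanity-check a candidate pattern before trusting it in regex-normalize.xml is to run it against a few sample URLs in plain Java, on the assumption that Nutch's regex normalizer applies java.util.regex patterns with "$1"-style substitutions. The sketch below is only an illustration: the class name, method name, and the pattern itself are hypothetical, not the actual entry added to the config, and it does not exercise Nutch's own normalizer chain.

```java
import java.util.regex.Pattern;

public class HostDotCheck {
    // Hypothetical pattern: a single "." directly after the host part,
    // followed by ":", "/", "?", "#", or the end of the URL.
    static final Pattern TRAILING_HOST_DOT =
        Pattern.compile("^(https?://[^/?#:]+)\\.(?=[:/?#]|$)");

    // Returns the URL with the trailing host dot removed, or unchanged
    // if the pattern does not match.
    static String stripTrailingHostDot(String url) {
        return TRAILING_HOST_DOT.matcher(url).replaceFirst("$1");
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com./",
            "http://example.com.",
            "http://example.com/",
            "http://example.com.:8080/path",
            "http://www.example.com/a.b/"
        };
        // Print each sample next to its normalized form for manual inspection.
        for (String s : samples) {
            System.out.println(s + " -> " + stripTrailingHostDot(s));
        }
    }
}
```

If the output looks right, the same pattern and substitution strings could be tried in regex-normalize.xml (without the Java string escaping, i.e. a single backslash before the period). URLs that embed a password before the host are not handled by this simplified pattern.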

