Odd results from nutch-crawl (1.4), and request for inlink command

Joshua J Pavel Fri, 22 Jun 2012 06:21:40 -0700


During my crawl, parsing produces entries like these in the crawl log:


http://www.domain.comhttp/www.domain.com/news/articles/2012-03-11/201205101336665761902.html
http://www.domain.comhttp/www.domain.com/news/articles/2012-04-24/201205101336663435768.html

The fetches fail, obviously, with:
fetch of
http://www.domain.comhttp/www.domain.com/news/articles/2012-04-24/201205101336663435768.html
 failed with: java.net.UnknownHostException: www.domain.comhttp
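
To make the failure concrete, here is a minimal sketch that splits one of the logged URLs into its components. It uses only the Python standard library and is not part of Nutch; the variable names are illustrative. It shows that everything up to the first slash after "http://" is treated as the host name, which is why the resolver is asked for "www.domain.comhttp", while the real host ends up as the first path segment.

```python
from urllib.parse import urlsplit

# One of the failing URLs copied from the crawl log above.
bad_url = ("http://www.domain.comhttp/www.domain.com/news/articles/"
           "2012-04-24/201205101336663435768.html")

parts = urlsplit(bad_url)
host = parts.hostname  # the name a fetcher would try to resolve
path = parts.path      # everything after the host

print("host:", host)
print("path:", path)

# The intended host shows up as the first path segment instead.
first_segment = path.lstrip("/").split("/", 1)[0]
print("first path segment:", first_segment)
```

The host printed here matches the name in the UnknownHostException, so the bad URL is most likely created when the link is built or resolved, before the fetch itself.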

I'm not sure whether the prepended domain comes from http:// being parsed
incorrectly, but the site's code seems to be sound.

Has anyone else seen this behavior?

To help troubleshoot it, I'm trying to dump the inlinks to these pages,
but I can't find the command to do that.  Any help would be
appreciated.  Thanks everyone!
