Hi,

If Nutch finds a relative URL, it converts it to an absolute one. This means that any URL that does not explicitly start with http:// gets the host prefixed. Your domain.com pages produce bad URLs such as http/www. Since that does not start with http://, it is treated as relative and ends up as http://domain.com/http/www.domain.com/.../....
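
To illustrate, here is a minimal sketch of standard relative URL resolution. It uses Python's urllib.parse.urljoin rather than Nutch's own Java code, so it only approximates what Nutch does. The base page URL and the article path are made-up placeholders, and the resolve() helper is hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page on which the links were found (placeholder, not from the crawl log).
BASE = "http://www.domain.com/index.html"


def resolve(base: str, href: str) -> str:
    """Resolve href against base the way a generic URL resolver would."""
    return urljoin(base, href)


# A well-formed absolute link is left alone.
good = resolve(BASE, "http://www.domain.com/news/article.html")

# A link missing the "://" has no scheme, so it is treated as a relative
# path and the host of the base page is put in front of it.
bad = resolve(BASE, "http/www.domain.com/news/article.html")

print(good)  # http://www.domain.com/news/article.html
print(bad)   # http://www.domain.com/http/www.domain.com/news/article.html
```

If the page source really contains links of the second form, the fix belongs in the site's HTML rather than in the crawler.
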
-----Original message-----
> From: Joshua J Pavel <[email protected]>
> Sent: Fri 22-Jun-2012 15:21
> To: [email protected]
> Subject: Odd results from nutch-crawl (1.4), and request for inlink command
>
> So, during my crawl I get entries like this in the crawl log as a result of
> the parsing:
>
> http://www.domain.comhttp/www.domain.com/news/articles/2012-03-11/201205101336665761902.html
> http://www.domain.comhttp/www.domain.com/news/articles/2012-04-24/201205101336663435768.html
>
> The fetches fail, obviously, with:
> fetch of
> http://www.domain.comhttp/www.domain.com/news/articles/2012-04-24/201205101336663435768.html
> failed with: java.net.UnknownHostException: www.domain.comhttp
>
> I'm not sure if the prepension of the domain is related to incorrectly
> parsing http://, but the site's code seems to be sound.
>
> Has anyone else seen this behavior?
>
> To help troubleshoot it, I'm trying to dump the inlinks to the these pages,
> but I'm struggling for the command to do that. Any help would be
> appreciated. Thanks everyone!

