Hi,

If Nutch finds a relative URL, it converts it to an absolute one. This means that
any URL that does not explicitly start with a scheme such as http:// gets the
host prefixed. Your domain.com pages produce bad URLs such as http/www. Since
that does not start with http://, it is treated as relative and ends up as
http://domain.com/http/www.domain.com/.../....
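
To illustrate the effect, here is a minimal sketch using the standard
java.net.URI class. It is not Nutch's own code; it only shows how generic
relative-URL resolution treats a link that lacks "://". The base page
(index.html) and the article path (page.html) are made up for the example.

```java
import java.net.URI;

public class RelativeUrlDemo {

    // Resolve a link found on a page against that page's URL,
    // following the generic relative-reference rules of java.net.URI.
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page on which the links were found.
        String base = "http://www.domain.com/index.html";

        // Malformed link: "http/" is not a scheme, so the whole string
        // is treated as a relative path and the host gets prefixed.
        System.out.println(resolve(base, "http/www.domain.com/news/articles/page.html"));
        // prints http://www.domain.com/http/www.domain.com/news/articles/page.html

        // Well-formed absolute link: returned unchanged.
        System.out.println(resolve(base, "http://www.domain.com/news/articles/page.html"));
        // prints http://www.domain.com/news/articles/page.html
    }
}
```

If the links in your page source look like the first case, the fix belongs in
the site's HTML rather than in Nutch.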

-----Original message-----
> From:Joshua J Pavel <[email protected]>
> Sent: Fri 22-Jun-2012 15:21
> To: [email protected]
> Subject: Odd results from nutch-crawl (1.4), and request for inlink command
> 
> 
> 
> So, during my crawl I get entries like this in the crawl log as a result of
> the parsing:
> 
> http://www.domain.comhttp/www.domain.com/news/articles/2012-03-11/201205101336665761902.html
> http://www.domain.comhttp/www.domain.com/news/articles/2012-04-24/201205101336663435768.html
> 
> The fetches fail, obviously, with:
> fetch of
> http://www.domain.comhttp/www.domain.com/news/articles/2012-04-24/201205101336663435768.html
>  failed with: java.net.UnknownHostException: www.domain.comhttp
> 
> I'm not sure if the prepension of the domain is related to incorrectly
> parsing http://, but the site's code seems to be sound.
> 
> Has anyone else seen this behavior?
> 
> To help troubleshoot it, I'm trying to dump the inlinks to these pages,
> but I'm struggling for the command to do that.  Any help would be
> appreciated.  Thanks everyone!
