It's running on one machine. There could be server-side redirects from these root URLs that have parameters in them, which I'm blocking. I'll dig around. Thanks, Markus.
-Chethan

On Thu, Jun 7, 2012 at 8:27 PM, Markus Jelsma <[email protected]> wrote:

> If Nutch runs on a different machine the DNS may not be resolving the host
> after all. To solve the issue you will have to find a way to resolve the
> host. Take a look in the Nutch logs.
>
> -----Original message-----
> > From: Chethan Prasad <[email protected]>
> > Sent: Thu 07-Jun-2012 16:49
> > To: Markus Jelsma <[email protected]>; [email protected]
> > Subject: RE: robots.txt UnknownHostException
> >
> > Well I can reach it from the browser. So the DNS should be good there.
> >
> > Thanks,
> > Chethan
> >
> > From: Markus Jelsma
> > Sent: 6/7/2012 8:07 PM
> > To: [email protected]
> > Subject: RE: robots.txt UnknownHostException
> >
> > Hi
> >
> > It cannot resolve the host and therefore crawl none of the pages on
> > that host. Make sure your DNS settings are correct, the host actually
> > exists or add it manually to your hosts file.
> >
> > Cheers
> >
> > -----Original message-----
> > > From: chethan <[email protected]>
> > > Sent: Thu 07-Jun-2012 16:29
> > > To: [email protected]
> > > Subject: Re: robots.txt UnknownHostException
> > >
> > > But that should not stop it from crawling the rest of the site right? What
> > > I'm seeing here is when the UnknownHostException is thrown from the robots
> > > url, the rest of the site is never crawled. Shouldn't it find more links on
> > > the root page and follow them?
> > >
> > > Thanks,
> > > Chethan
> > >
> > > On Thu, Jun 7, 2012 at 7:49 PM, Markus Jelsma <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Nutch will fetch URL's without robots.txt, but if robots.txt throws an
> > > > UnknownHostException, the URL will throw it as well and fail.
> > > >
> > > > Cheers
> > > >
> > > > -----Original message-----
> > > > > From: chethan <[email protected]>
> > > > > Sent: Thu 07-Jun-2012 16:16
> > > > > To: [email protected]
> > > > > Subject: robots.txt UnknownHostException
> > > > >
> > > > > Hi,
> > > > >
> > > > > When Nutch doesn't find the robots.txt for a given URL, why does it not
> > > > > fetch that URL at all? I mean, if the robots is not found, doesn't it mean
> > > > > that the owner of that website doesn't really care about crawlers? So,
> > > > > it's ok for Nutch to fetch from it right?
> > > > >
> > > > > Thanks,
> > > > > Chethan
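
For anyone who lands on this thread with the same error: reaching a site from the browser does not by itself prove that the crawler's process can resolve the host, since a browser may go through a proxy or a different resolver. A minimal sketch of a lookup check to run on the machine where Nutch runs is below. The helper name can_resolve is hypothetical and not part of Nutch, and Python's socket.gaierror is only a rough analogue of the Java UnknownHostException that Nutch reports.

```python
import socket
import sys


def can_resolve(host: str) -> bool:
    """Return True if the system resolver can map `host` to an address.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    try:
        # getaddrinfo goes through the system resolver (hosts file, then DNS,
        # depending on local configuration).
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        # Lookup failed: roughly what Java reports as UnknownHostException.
        return False


if __name__ == "__main__":
    # Usage: python check_dns.py host1 host2 ...
    for host in sys.argv[1:]:
        status = "resolves" if can_resolve(host) else "does NOT resolve"
        print(f"{host}: {status}")
```

Passing it the host names from the seed list, plus the host of any redirect target, shows which ones fail to resolve. If a host fails here but loads in the browser, the browser's proxy settings and the hosts file are the first things to compare.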

