RE: robots.txt UnknownHostException

Markus Jelsma Thu, 07 Jun 2012 07:37:07 -0700

Hi

Nutch cannot resolve the host and therefore cannot crawl any of the pages on
that host. Make sure your DNS settings are correct and that the host actually
exists, or add it manually to your hosts file.
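
To illustrate why robots.txt and the pages fail together, here is a minimal
sketch using only plain JDK classes. It is not Nutch's fetcher code; the class
HostCheck and its two methods are made up for this example. The point is that
the robots.txt URL lives on the same host as the page, so one failed DNS
lookup affects both.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not part of Nutch.
class HostCheck {

    // Builds the robots.txt URL for the host that serves the given page.
    static String robotsUrlFor(String pageUrl) {
        URI uri = URI.create(pageUrl);
        int port = uri.getPort(); // -1 when the URL has no explicit port
        String authority = uri.getHost() + (port == -1 ? "" : ":" + port);
        return uri.getScheme() + "://" + authority + "/robots.txt";
    }

    // Returns true if the host name resolves to an address. A failed lookup
    // surfaces as UnknownHostException, the exception seen in this thread.
    static boolean canResolve(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // example.com is a placeholder; pass the failing URL as an argument.
        String page = args.length > 0 ? args[0] : "http://example.com/index.html";
        String host = URI.create(page).getHost();
        System.out.println("robots.txt URL: " + robotsUrlFor(page));
        System.out.println(host + " resolves: " + canResolve(host));
    }
}
```

If this reports that the host does not resolve on the machine running the
crawl, the problem is in DNS or the hosts file rather than in robots.txt
handling.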


Cheers
 
 
-----Original message-----
> From:chethan <[email protected]>
> Sent: Thu 07-Jun-2012 16:29
> To: [email protected]
> Subject: Re: robots.txt UnknownHostException
> 
> But that should not stop it from crawling the rest of the site, right? What
> I'm seeing here is that when the UnknownHostException is thrown for the
> robots.txt URL, the rest of the site is never crawled. Shouldn't it find
> more links on the root page and follow them?
> 
> Thanks,
> Chethan
> 
> On Thu, Jun 7, 2012 at 7:49 PM, Markus Jelsma
> <[email protected]> wrote:
> 
> > Hi,
> >
> > Nutch will fetch URLs from hosts that have no robots.txt, but if the
> > robots.txt request throws an UnknownHostException, the request for the
> > URL itself will throw it as well and fail.
> >
> > Cheers
> >
> >
> > -----Original message-----
> > > From:chethan <[email protected]>
> > > Sent: Thu 07-Jun-2012 16:16
> > > To: [email protected]
> > > Subject: robots.txt UnknownHostException
> > >
> > > Hi,
> > >
> > > When Nutch doesn't find the robots.txt for a given URL, why does it not
> > > fetch that URL at all? I mean, if the robots.txt is not found, doesn't
> > > it mean that the owner of that website doesn't really care about
> > > crawlers? So it's OK for Nutch to fetch from it, right?
> > >
> > > Thanks,
> > > Chethan
> > >
> >
>
