RE: robots.txt UnknownHostException

Chethan Prasad Thu, 07 Jun 2012 07:49:11 -0700

Well, I can reach it from the browser, so DNS should be fine there.

Thanks,
Chethan
From: Markus Jelsma
Sent: 6/7/2012 8:07 PM
To: [email protected]
Subject: RE: robots.txt UnknownHostException
Hi


It cannot resolve the host and therefore cannot crawl any of the pages on
that host. Make sure your DNS settings are correct and that the host
actually exists, or add it manually to your hosts file.
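
One way to test this is to run a small lookup on the same machine and JVM that runs Nutch. The following is a minimal sketch, not Nutch code: the class name `HostCheck` is made up for illustration. A browser may reach a site through a proxy that resolves names on its behalf, so a working browser does not by itself show that Java can resolve the host.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration; not part of Nutch.
class HostCheck {

    // Returns true if this JVM can resolve the host name to an IP address.
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            // The same exception type discussed in this thread.
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the host from the failing robots.txt URL as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " resolves: " + resolves(host));
    }
}
```

If this reports that the host does not resolve while the browser still loads the site, the difference most likely lies in proxy or resolver settings rather than in Nutch.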

Cheers


-----Original message-----
> From:chethan <[email protected]>
> Sent: Thu 07-Jun-2012 16:29
> To: [email protected]
> Subject: Re: robots.txt UnknownHostException
>
> But that should not stop it from crawling the rest of the site, right? What
> I'm seeing here is that when the UnknownHostException is thrown for the
> robots.txt URL, the rest of the site is never crawled. Shouldn't it find more
> links on the root page and follow them?
>
> Thanks,
> Chethan
>
> On Thu, Jun 7, 2012 at 7:49 PM, Markus Jelsma
> <[email protected]> wrote:
>
> > Hi,
> >
> > Nutch will fetch URLs from hosts that have no robots.txt, but if the
> > robots.txt request throws an UnknownHostException, fetching the URL
> > itself will throw it as well and fail.
> >
> > Cheers
> >
> >
> > -----Original message-----
> > > From:chethan <[email protected]>
> > > Sent: Thu 07-Jun-2012 16:16
> > > To: [email protected]
> > > Subject: robots.txt UnknownHostException
> > >
> > > Hi,
> > >
> > > When Nutch doesn't find the robots.txt for a given URL, why does it not
> > > fetch that URL at all? I mean, if robots.txt is not found, doesn't that
> > > mean that the owner of the website doesn't really care about crawlers?
> > > So it's OK for Nutch to fetch from it, right?
> > >
> > > Thanks,
> > > Chethan
> > >
> >
>
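
The distinction made in the quoted thread (a missing robots.txt does not block fetching, but an unresolvable host does) can be sketched roughly as follows. This is an illustrative sketch only, not Nutch's actual implementation: `RobotsSketch`, `RobotsClient`, and `Decision` are hypothetical names, and treating HTTP 404 as "no robots.txt" is an assumption of the sketch.

```java
import java.net.UnknownHostException;

// Hypothetical sketch; not Nutch's real code.
class RobotsSketch {

    enum Decision { FETCH_ALLOWED, CHECK_RULES, HOST_FAILED }

    // Hypothetical abstraction: request robots.txt from the host and
    // return the HTTP status code, or fail if the host cannot be resolved.
    interface RobotsClient {
        int status(String host) throws UnknownHostException;
    }

    static Decision decide(String host, RobotsClient client) {
        try {
            int status = client.status(host);
            if (status == 404) {
                // No robots.txt on the host: nothing restricts the crawler.
                return Decision.FETCH_ALLOWED;
            }
            // A robots.txt exists: its rules decide which URLs may be fetched.
            return Decision.CHECK_RULES;
        } catch (UnknownHostException e) {
            // The host name does not resolve, so page URLs on the same
            // host would fail with the same exception.
            return Decision.HOST_FAILED;
        }
    }
}
```

The point of the sketch is that the failure is a property of the host, not of robots.txt: any other URL on that host goes through the same name lookup and fails the same way.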
