But that should not stop it from crawling the rest of the site, right? What
I'm seeing here is that when the UnknownHostException is thrown for the
robots.txt URL, the rest of the site is never crawled. Shouldn't it find more
links on the root page and follow them?
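
If the exception comes from DNS, the robots.txt URL and the root page share
the same host, so both lookups would be expected to fail together. A minimal
sketch for checking that outside of Nutch follows; it is plain JDK code, not
Nutch's fetcher, and the class name HostCheck, its helper methods, and the
placeholder URL are assumptions made only for illustration.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Nutch: checks whether a page URL and its
// robots.txt URL point at the same host and whether that host resolves.
class HostCheck {

    // Extract the host component of a URL string.
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    // Build the robots.txt URL that belongs to the given page URL.
    static String robotsUrlFor(String pageUrl) {
        return URI.create(pageUrl).resolve("/robots.txt").toString();
    }

    // True if the JVM can resolve the host name, false on UnknownHostException.
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; pass the real seed URL as the first argument.
        String page = args.length > 0 ? args[0] : "http://example.invalid/";
        String robots = robotsUrlFor(page);

        System.out.println("page host:   " + hostOf(page));
        System.out.println("robots host: " + hostOf(robots));
        // One lookup covers both URLs because they share the host.
        System.out.println("resolves:    " + resolves(hostOf(page)));
    }
}
```

If the lookup reports false for the seed URL's host, the failure is in name
resolution rather than in anything specific to robots.txt.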

Thanks,
Chethan

On Thu, Jun 7, 2012 at 7:49 PM, Markus Jelsma <[email protected]> wrote:

> Hi,
>
> Nutch will fetch URLs without robots.txt, but if robots.txt throws an
> UnknownHostException, the URL will throw it as well and fail.
>
> Cheers
>
>
> -----Original message-----
> > From: chethan <[email protected]>
> > Sent: Thu 07-Jun-2012 16:16
> > To: [email protected]
> > Subject: robots.txt UnknownHostException
> >
> > Hi,
> >
> > When Nutch doesn't find the robots.txt for a given URL, why does it not
> > fetch that URL at all? I mean, if the robots.txt is not found, doesn't it
> > mean that the owner of that website doesn't really care about crawlers?
> > So, it's ok for Nutch to fetch from it, right?
> >
> > Thanks,
> > Chethan
> >
>
