Re: robots.txt UnknownHostException

chethan Thu, 07 Jun 2012 09:54:35 -0700

It's running on one machine. There could be server side redirects from
these root URLs that have parameters in them which I'm blocking. I'll dig
around. Thanks Markus.


-Chethan


On Thu, Jun 7, 2012 at 8:27 PM, Markus Jelsma <[email protected]> wrote:

> If Nutch runs on a different machine the DNS may not be resolving the host
> after all. To solve the issue you will have to find a way to resolve the
> host. Take a look in the Nutch logs.
>
>
> -----Original message-----
> > From:Chethan Prasad <[email protected]>
> > Sent: Thu 07-Jun-2012 16:49
> > To: Markus Jelsma <[email protected]>; [email protected]
> > Subject: RE: robots.txt UnknownHostException
> >
> > Well I can reach it from the browser. So the DNS should be good there.
> >
> > Thanks,
> > Chethan
> > From: Markus Jelsma
> > Sent: 6/7/2012 8:07 PM
> > To: [email protected]
> > Subject: RE: robots.txt UnknownHostException
> > Hi
> >
> > Nutch cannot resolve the host and therefore crawls none of the pages on
> > that host. Make sure your DNS settings are correct and that the host
> > actually exists, or add it manually to your hosts file.
> >
> > Cheers
> >
> >
> > -----Original message-----
> > > From:chethan <[email protected]>
> > > Sent: Thu 07-Jun-2012 16:29
> > > To: [email protected]
> > > Subject: Re: robots.txt UnknownHostException
> > >
> > > But that should not stop it from crawling the rest of the site right?
> > > What I'm seeing here is when the UnknownHostException is thrown from
> > > the robots url, the rest of the site is never crawled. Shouldn't it
> > > find more links on the root page and follow them?
> > >
> > > Thanks,
> > > Chethan
> > >
> > > On Thu, Jun 7, 2012 at 7:49 PM, Markus Jelsma <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Nutch will fetch URLs without robots.txt, but if robots.txt throws
> > > > an UnknownHostException, the URL will throw it as well and fail.
> > > >
> > > > Cheers
> > > >
> > > >
> > > > -----Original message-----
> > > > > From:chethan <[email protected]>
> > > > > Sent: Thu 07-Jun-2012 16:16
> > > > > To: [email protected]
> > > > > Subject: robots.txt UnknownHostException
> > > > >
> > > > > Hi,
> > > > >
> > > > > When Nutch doesn't find the robots.txt for a given URL, why does
> > > > > it not fetch that URL at all? I mean, if the robots.txt is not
> > > > > found, doesn't it mean that the owner of that website doesn't
> > > > > really care about crawlers? So, it's OK for Nutch to fetch from
> > > > > it, right?
> > > > >
> > > > > Thanks,
> > > > > Chethan
> > > > >
> > > >
> > >
> >
>
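
A browser reaching the site does not by itself show that the machine running
the crawl can resolve the name, so one way to follow up on the DNS advice in
this thread is to test resolution directly on that machine. The snippet below
is a minimal sketch in Python and is not part of Nutch; the function name
resolve_host and the example hostnames are assumptions made for illustration.
It asks the system resolver for a host's addresses and returns an empty list
instead of raising when the lookup fails.

```python
import socket


def resolve_host(hostname):
    """Return the sorted IP addresses for hostname, or [] if it does not resolve.

    Hypothetical helper for illustration; it uses the operating system's
    resolver, so it honours the local hosts file and DNS settings.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Roughly the Python counterpart of Java's UnknownHostException.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Replace these placeholder names with the host from the failing seed URL.
    for host in ("localhost", "example.invalid"):
        addresses = resolve_host(host)
        if addresses:
            print(host, "resolves to", ", ".join(addresses))
        else:
            print(host, "does not resolve from this machine")
```

If the host from the seed URL comes back empty here while the browser still
loads the page, the difference is likely in how the two resolve names (for
example a proxy or a different hosts file), which matches the suggestion to
check the DNS settings or add the host to the hosts file.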
