I've copied the same page to a non-HTTPS location and switched the plugin
from protocol-httpclient to protocol-http. With that, the parser found 18 outlinks.
So it seems that the problem is with protocol-httpclient...
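
For anyone following along, a minimal sketch of what that switch might look
like, assuming the plugin list is overridden through the plugin.includes
property in conf/nutch-site.xml. Only the protocol-* and parse-html entries
come from this thread; the remaining plugin names are illustrative and are
not taken from the actual config.

```xml
<!-- conf/nutch-site.xml (assumed location of the override) -->
<property>
  <name>plugin.includes</name>
  <!-- HTTPS test used: protocol-httpclient|... (rest of the list unchanged) -->
  <!-- Plain-HTTP test: only the protocol plugin is swapped. -->
  <!-- Every entry after parse-html is an illustrative assumption. -->
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Keeping everything else identical between the two runs means any difference
in the outlink count can be attributed to the protocol plugin alone.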

Thanks Markus,

A

On Fri, Oct 7, 2011 at 11:36 AM, Markus Jelsma
<[email protected]> wrote:

> You're using parse-html, which should extract those relative outlinks just
> fine.
> Using protocol-httpclient should not make a difference. But to rule it
> out, can you parse the page from some other location using protocol-http
> instead?
>
> Do you have any relevant non-default settings on your config?
>
> > Dear all,
> >
> > I've been trying to crawl and index an HTTPS intranet, but the generator
> > keeps saying that there are 0 links to be fetched after authenticating and
> > parsing the first page. It seems that there's something wrong with the
> > parser when used with https (httpclient).
> >
> > here's the command that I'm using to reproduce the error:
> >
> > bin/nutch org.apache.nutch.parse.ParserChecker http://server/user/library
> >
> > cmd output:  http://pastebin.com/h5e7wAZ5
> >
> > hadoop.log: http://pastebin.com/S7ieS2TT (you can see the page is fetched
> > and the contents around line 300)
> >
> > Any ideas/help will be appreciated,
> >
> > Alfredas
>
