I copied the same page to a non-HTTPS location and changed protocol-httpclient to protocol-http. The parser then found 18 outlinks, so it seems that the problem is with httpclient...
Thanks Markus,
A

On Fri, Oct 7, 2011 at 11:36 AM, Markus Jelsma <[email protected]> wrote:

> You're using parse-html, it should extract those relative outlinks just fine.
> Using protocol-httpclient should not make things different. But to rule it
> out, can you parse the page from some other location using protocol-http
> instead?
>
> Do you have any relevant non-default settings on your config?
>
> > Dear all,
> >
> > I've been trying to crawl and index a https intranet, but the generator
> > keeps saying that there are 0 links to be fetched after authenticating and
> > parsing the first page. It seems that there's something wrong with the
> > parser when used with https (httpclient).
> >
> > here's the command that I'm using to reproduce the error:
> >
> > bin/nutch org.apache.nutch.parse.ParserChecker http://server/user/library
> >
> > cmd output: http://pastebin.com/h5e7wAZ5
> >
> > hadoop.log: http://pastebin.com/S7ieS2TT (you can see the page is fetched
> > and the contents around line 300)
> >
> > Any ideas/help will be appreciated,
> >
> > Alfredas

