It seems that the problem is solved in version 1.4 (current trunk). Great. A small side note: if you use Maven to build the project, it builds version 1.3 without the plugins. The Ant/Ivy build produces 1.4 together with the plugins.

Alfredas

On Fri, Oct 7, 2011 at 4:52 PM, Alfredas Chmieliauskas <[email protected]> wrote:

> Bug: the protocol.getProtocolOutput() for httpclient "protocol" returns
> empty content....
>
> Alfredas
>
> On Fri, Oct 7, 2011 at 11:58 AM, Alfredas Chmieliauskas <[email protected]> wrote:
>
>> I've copied the same page on non-https location and changed
>> the protocol-httpclient to protocol-http. And the parser found 18 outlinks.
>> So it seems that the problem is with the httpclient...
>>
>> Thanks Markus,
>>
>> A
>>
>> On Fri, Oct 7, 2011 at 11:36 AM, Markus Jelsma <[email protected]> wrote:
>>
>>> You're using parse-html, it should extract those relative outlinks just
>>> fine. Using protocol-httpclient should not make things different. But to
>>> rule it out, can you parse the page from some other location using
>>> protocol-http instead?
>>>
>>> Do you have any relevant non-default settings on your config?
>>>
>>> > Dear all,
>>> >
>>> > I've been trying to crawl and index a https intranet, but the generator
>>> > keeps saying that there are 0 links to be fetched after authenticating
>>> > and parsing the first page. It seems that there's something wrong with
>>> > the parser when used with https (httpclient).
>>> >
>>> > here's the command that I'm using to reproduce the error:
>>> >
>>> > bin/nutch org.apache.nutch.parse.ParserChecker http://server/user/library
>>> >
>>> > cmd output: http://pastebin.com/h5e7wAZ5
>>> >
>>> > hadoop.log: http://pastebin.com/S7ieS2TT (you can see the page is
>>> > fetched and the contents around line 300)
>>> >
>>> > Any ideas/help will be appreciated,
>>> >
>>> > Alfredas

