Re: Not finding links when using HTTPS (httpclient)

Alfredas Chmieliauskas Fri, 07 Oct 2011 11:42:32 -0700

It seems that the problem is solved in the 1.4 version (current trunk).
Great.
Small side note: if you use Maven to build the project, it will build the 1.3
version without the plugins. Ant/Ivy builds 1.4 together with the plugins.


Alfredas

On Fri, Oct 7, 2011 at 4:52 PM, Alfredas Chmieliauskas <
[email protected]> wrote:

> Bug: protocol.getProtocolOutput() for the httpclient "protocol" returns
> empty content...
>
> Alfredas
>
>
>
>
> On Fri, Oct 7, 2011 at 11:58 AM, Alfredas Chmieliauskas <
> [email protected]> wrote:
>
>> I've copied the same page to a non-HTTPS location and changed
>> protocol-httpclient to protocol-http, and the parser found 18 outlinks.
>> So it seems that the problem is with httpclient...
>>
>> Thanks Markus,
>>
>> A
>>
>>
>> On Fri, Oct 7, 2011 at 11:36 AM, Markus Jelsma <
>> [email protected]> wrote:
>>
>>> You're using parse-html, it should extract those relative outlinks just
>>> fine.
>>> Using protocol-httpclient should not make things different. But to rule it
>>> out, can you parse the page from some other location using protocol-http
>>> instead?
>>>
>>> Do you have any relevant non-default settings in your config?
>>>
>>> > Dear all,
>>> >
>>> > I've been trying to crawl and index an HTTPS intranet, but the generator
>>> > keeps saying that there are 0 links to be fetched after authenticating and
>>> > parsing the first page. It seems that there's something wrong with the
>>> > parser when used with HTTPS (httpclient).
>>> >
>>> > here's the command that I'm using to reproduce the error:
>>> >
>>> > bin/nutch org.apache.nutch.parse.ParserChecker http://server/user/library
>>> >
>>> > cmd output:  http://pastebin.com/h5e7wAZ5
>>> >
>>> > hadoop.log: http://pastebin.com/S7ieS2TT (you can see the page is fetched
>>> > and the contents around line 300)
>>> >
>>> > Any ideas/help will be appreciated,
>>> >
>>> > Alfredas
>>>
>>
>>
>
