Here's the ParseChecker output. The page does have text (and 3
links), but none of it shows up with the -dumpText option. I'd expect
some kind of error here, since a fetch fails on this page when I run
one.

ParseChecker output:

./bin/nutch parsechecker -dumpText https://localhost/crawldocs/index.html

fetching: https://localhost/crawldocs/index.html
parsing: https://localhost/crawldocs/index.html
contentType: text/html
---------
Url
---------------
https://localhost/crawldocs/index.html---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: Connection=close Content-Type=text/html
Parse Metadata: CharEncodingForConversion=windows-1252
OriginalCharEncoding=windows-1252
---------
ParseText
---------
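
As a sanity check independent of Nutch, here is a minimal Java sketch
that reports which of the standard JSSE system properties the JVM
actually received. It assumes plain JSSE conventions: a client
certificate is read from javax.net.ssl.keyStore, while
javax.net.ssl.trustStore only holds the CAs used to verify the server.
Whether protocol-httpclient honors any of these properties is not
confirmed here, and the class name JsseProps is made up for
illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JsseProps {
    // Standard JSSE system properties. The keyStore pair supplies the
    // client certificate; the trustStore pair supplies the trusted CAs.
    static final String[] NAMES = {
        "javax.net.ssl.keyStore",
        "javax.net.ssl.keyStorePassword",
        "javax.net.ssl.trustStore",
        "javax.net.ssl.trustStorePassword"
    };

    // Returns each property name mapped to its value, "(not set)" when
    // absent, or "(set)" for passwords so they are never printed.
    static Map<String, String> report() {
        Map<String, String> out = new LinkedHashMap<String, String>();
        for (String name : NAMES) {
            String value = System.getProperty(name);
            if (value == null) {
                out.put(name, "(not set)");
            } else if (name.endsWith("Password")) {
                out.put(name, "(set)");
            } else {
                out.put(name, value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : report().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Running it with the same -D arguments that are passed to the nutch
script shows whether only the trustStore pair was supplied. If so, the
JVM has no client certificate to present, which would be consistent
with a 403.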

-- Chris



On Wed, Mar 7, 2012 at 4:22 PM, Christopher Gross <[email protected]> wrote:
> Well, NTLM is a windows thing with a username and password.
>
> I have a certificate.  No username/password.  The debug stuff would be
> helpful once I can get a bit farther...I don't know how to tell Nutch
> to crawl with the cert.  I'm getting a 403 error -- it is not (using?
> finding?) the certs that I have passed in via -D arguments.
>
> I appreciate you trying to help -- but I need knowledge on getting
> Nutch to use a cert.
>
> -- Chris
>
>
>
> On Wed, Mar 7, 2012 at 4:14 PM, remi tassing <[email protected]> wrote:
>> There are many debugging tips at the bottom of that page; did you try them?
>>
>> E.g. ParseChecker, debug-level log info, ...
>>
>> BTW, which authentication scheme is required by your site? NTLMv2 is
>> poorly supported.
>>
>> Remi
>>
>> On Wednesday, March 7, 2012, Christopher Gross <[email protected]> wrote:
>>> I have protocol-httpclient set.
>>>
>>> I can't see how I'm supposed to do the certs.  I can't seem to get
>>> them to work by passing them in via -D args when I call the nutch
>>> script (-Djavax.net.ssl.trustStore=xxxx
>>> -Djavax.net.ssl.trustStorePassword=xxxxx ...etc).  Is there something
>>> for them in the AuthenticationSchemes
>>> (http://wiki.apache.org/nutch/HttpAuthenticationSchemes) that is not
>>> shown on the page?
>>>
>>> If you have a specific page that could help please send that.
>>>
>>> -- Chris
>>>
>>>
>>>
>>> On Wed, Mar 7, 2012 at 3:40 PM, remi tassing <[email protected]> wrote:
>>>> Try googling for Nutch+httpclient
>>>>
>>>> Remi
>>>>
>>>> On Wednesday, March 7, 2012, Christopher Gross <[email protected]> wrote:
>>>>> Is there any good documentation for setting up Nutch to crawl HTTPS
>>>>> sites using a certificate?  I've poked around on the wiki and tried
>>>>> some google searches without much luck.
>>>>>
>>>>> I'm using Nutch 1.4.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -- Chris
>>>>>
>>>
