Here's the parse checker output. The page does have text (and 3 links), but none of it shows up with the -dumpText option. I'd expect some kind of error here, since a fetch fails on that page when I run it.
ParseChecker output:

./bin/nutch parsechecker -dumpText https://localhost/crawldocs/index.html
fetching: https://localhost/crawldocs/index.html
parsing: https://localhost/crawldocs/index.html
contentType: text/html
--------- Url ---------------
https://localhost/crawldocs/index.html
--------- ParseData ---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: Connection=close Content-Type=text/html
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
--------- ParseText ---------

-- Chris

On Wed, Mar 7, 2012 at 4:22 PM, Christopher Gross <[email protected]> wrote:
> Well, NTLM is a windows thing with a username and password.
>
> I have a certificate. No username/password. The debug stuff would be
> helpful once I can get a bit farther...I don't know how to tell Nutch
> to crawl with the cert. I'm getting a 403 error -- it is not (using?
> finding?) the certs that I have passed in via -D arguments.
>
> I appreciate you trying to help -- but I need knowledge on getting
> Nutch to use a cert.
>
> -- Chris
>
> On Wed, Mar 7, 2012 at 4:14 PM, remi tassing <[email protected]> wrote:
>> There are many debugging tips on the bottom of that page, did you try them?
>>
>> E.g. ParserChecker, debug-level log info, ...
>>
>> BTW, which authentication scheme is required by your site? For NTLMv2 is
>> poorly supported
>>
>> Remi
>>
>> On Wednesday, March 7, 2012, Christopher Gross <[email protected]> wrote:
>>> I have protocol-httpclient set.
>>>
>>> I can't see how I'm supposed to do the certs. I can't seem to get
>>> them to work by passing them in via -D args when I call the nutch
>>> script (-Djavax.net.ssl.trustStore=xxxx
>>> -Djavax.net.ssl.trustStorePassword=xxxxx ...etc). Is there something
>>> for them in the AuthenticationSchemes
>>> (http://wiki.apache.org/nutch/HttpAuthenticationSchemes) that is not
>>> shown on the page?
>>>
>>> If you have a specific page that could help please send that.
>>>
>>> -- Chris
>>>
>>> On Wed, Mar 7, 2012 at 3:40 PM, remi tassing <[email protected]> wrote:
>>>> Try googling for Nutch+httpclient
>>>>
>>>> Remi
>>>>
>>>> On Wednesday, March 7, 2012, Christopher Gross <[email protected]> wrote:
>>>>> Is there any good documentation for setting up Nutch to crawl HTTPS
>>>>> sites using a certificate? I've poked around on the wiki and tried
>>>>> some google searches without much luck.
>>>>>
>>>>> I'm using Nutch 1.4.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -- Chris

