Re: Crawling with Certs

Lewis John Mcgibbney Thu, 08 Mar 2012 02:59:50 -0800

Hi Christopher,

It appears that the page is being fetched successfully. What is not
successful is the parser obtaining the page content... these fields appear
to be returning empty values when, as you have stated, this is not the case.


How large is the page content? Does your http.content.limit accommodate
this?
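
A rough way to check the size question is sketched below in shell. It assumes
the usual Nutch 1.x behaviour for http.content.limit: content longer than a
non-negative limit is truncated, and a negative limit disables truncation. The
shipped default in conf/nutch-default.xml should be 65536 bytes. The helper
name fits_limit and the file name page.html are hypothetical, not part of
Nutch.

```shell
#!/bin/sh
# Sketch only: report whether a page of SIZE bytes fits within LIMIT,
# where LIMIT is the value of http.content.limit.
# A negative LIMIT is treated as "no truncation".
# fits_limit is a hypothetical helper name, not a Nutch command.
fits_limit() {
    size=$1
    limit=$2
    if [ "$limit" -lt 0 ] || [ "$size" -le "$limit" ]; then
        echo "ok"
    else
        echo "truncated"
    fi
}

# To see which limit is configured (nutch-site.xml overrides nutch-default.xml):
#   grep -A 1 '<name>http.content.limit</name>' conf/nutch-site.xml conf/nutch-default.xml
#
# To compare a saved copy of the page (placeholder file name) with that limit:
#   fits_limit "$(wc -c < page.html)" 65536
```

If the page turns out to be larger than the limit, raising the value (or setting
it to -1) in conf/nutch-site.xml would be the thing to try next. If it fits,
the empty ParseText has some other cause.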

Also, your content metadata DOES show Connection=close, so the connection
appears to be closed! Maybe there are some other credentials to be supplied
for crawling certificate-authenticated sites... I really don't know.

On Wed, Mar 7, 2012 at 9:28 PM, Christopher Gross <[email protected]> wrote:

> Here's the parse checker output -- the page does have text (and 3
> links) but it's not showing it with the dumpText option.  I'd expect
> there to be some kind of error, since a fetch fails on it when I run
> that....
>
> ParseChecker output:
>
> ./bin/nutch parsechecker -dumpText https://localhost/crawldocs/index.html
>
> fetching: https://localhost/crawldocs/index.html
> parsing: https://localhost/crawldocs/index.html
> contentType: text/html
> ---------
> Url
> ---------------
> https://localhost/crawldocs/index.html---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 0
> Content Metadata: Connection=close Content-Type=text/html
> Parse Metadata: CharEncodingForConversion=windows-1252
> OriginalCharEncoding=windows-1252
> ---------
> ParseText
> ---------
>
> -- Chris
>
>
>
> On Wed, Mar 7, 2012 at 4:22 PM, Christopher Gross <[email protected]>
> wrote:
> > Well, NTLM is a windows thing with a username and password.
> >
> > I have a certificate.  No username/password.  The debug stuff would be
> > helpful once I can get a bit farther...I don't know how to tell Nutch
> > to crawl with the cert.  I'm getting a 403 error -- it is not (using?
> > finding?) the certs that I have passed in via -D arguments.
> >
> > I appreciate you trying to help -- but I need knowledge on getting
> > Nutch to use a cert.
> >
> > -- Chris
> >
> >
> >
> > On Wed, Mar 7, 2012 at 4:14 PM, remi tassing <[email protected]> wrote:
> >> There are many debugging tips on the bottom of that page, did you try
> >> them?
> >>
> >> E.g. ParseChecker, debug-level log info, ...
> >>
> >> BTW, which authentication scheme is required by your site? NTLMv2 is
> >> poorly supported.
> >>
> >> Remi
> >>
> >> On Wednesday, March 7, 2012, Christopher Gross <[email protected]> wrote:
> >>> I have protocol-httpclient set.
> >>>
> >>> I can't see how I'm supposed to do the certs.  I can't seem to get
> >>> them to work by passing them in via -D args when I call the nutch
> >>> script (-Djavax.net.ssl.trustStore=xxxx
> >>> -Djavax.net.ssl.trustStorePassword=xxxxx ...etc).  Is there something
> >>> for them in the AuthenticationSchemes
> >>> (http://wiki.apache.org/nutch/HttpAuthenticationSchemes) that is not
> >>> shown on the page?
> >>>
> >>> If you have a specific page that could help please send that.
> >>>
> >>> -- Chris
> >>>
> >>>
> >>>
> >>> On Wed, Mar 7, 2012 at 3:40 PM, remi tassing <[email protected]> wrote:
> >>>> Try googling for Nutch+httpclient
> >>>>
> >>>> Remi
> >>>>
> >>>> On Wednesday, March 7, 2012, Christopher Gross <[email protected]> wrote:
> >>>>> Is there any good documentation for setting up Nutch to crawl HTTPS
> >>>>> sites using a certificate?  I've poked around on the wiki and tried
> >>>>> some google searches without much luck.
> >>>>>
> >>>>> I'm using Nutch 1.4.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> -- Chris
> >>>>>
> >>>
>



-- 
*Lewis*
