Hi Abhay,

This is a problem space we looked at a while ago and made quite a bit of 
progress on.

Firstly, the protocol-httpclient plugin has been considered deprecated for a 
while.
https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient
I'm pretty sure that it will NOT cater for your use case. More information on 
the functionality and limits of this plugin can be found at 
https://cwiki.apache.org/confluence/display/nutch/HttpAuthenticationSchemes
Some more recent initiatives can be found at 
https://cwiki.apache.org/confluence/display/nutch/HttpPostAuthentication

Now, some of the plugins which may be used/adapted for your use case include 

1. https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit - 
customizable through 
https://github.com/apache/nutch/blob/master/src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java
 

2. both
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
and
https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
Some documentation exists at 
https://cwiki.apache.org/confluence/display/NUTCH/AdvancedAjaxInteraction

Admittedly, I've not tried to run these plugins against a modern SSO site 
recently. I suspect that some dependency updates would not go amiss, so please 
take that into consideration.

Your note regarding the time it takes for the 'chaining' of systems together to 
achieve the login is well made. The slowdown was easy to observe when we looked 
at this, and it needs a more consolidated/calculated approach IMHO.

I would be interested to discuss this further with you...

hth
lewismc

On 2021/06/07 02:45:54, Abhay Ratnaparkhi <abhay.ratnapar...@gmail.com> wrote: 
> Hello,
> 
> We are using Nutch to crawl intranet pages behind SSO authentication.
> 
> I would like to know if anyone has used/updated httpclient protocol plugin
> for crawling pages behind SSO authentication.
> 
> The SSO auth redirects pages to the SSO server for login and optionally
> asks for second factor authentication like TOTP.
> 
> We have been using a custom plugin (which calls a Node.js service) which
> uses Google's Puppeteer to drive a Chromium browser to do this login and OTP
> handling. This is much slower and might not be required, as many of these
> pages are rendered on the server side (so dynamic rendering isn't required).
> 
> Thank you
> Abhay Ratnaparkhi
> 
