Hi Abhay,

This is a problem space we looked at a while ago and made quite a bit of progress on.
Firstly, the protocol-httpclient plugin has been considered deprecated for a while:
https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient

I'm pretty sure that it will NOT cater for your use case. More information on the functionality and limits of this plugin can be found at
https://cwiki.apache.org/confluence/display/nutch/HttpAuthenticationSchemes

Some more recent initiatives can be found at
https://cwiki.apache.org/confluence/display/nutch/HttpPostAuthentication

Now, some of the plugins which may be used/adapted for your use case include

1. https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit
   - customizable through https://github.com/apache/nutch/blob/master/src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java
2. both
   https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
   https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
   some documentation exists at https://cwiki.apache.org/confluence/display/NUTCH/AdvancedAjaxInteraction

Admittedly, I've not tried to run these plugins against a modern SSO site recently. I suspect that some dependency updates would not go amiss, so please take that into consideration.

Your note regarding the time it takes for the 'chaining' of systems together to achieve the login is well made. This was easily observed and needs a more consolidated/calculated approach IMHO. I would be interested to discuss this further with you...

hth
lewismc

On 2021/06/07 02:45:54, Abhay Ratnaparkhi <abhay.ratnapar...@gmail.com> wrote:
> Hello,
>
> We are using Nutch to crawl intranet pages behind SSO authentication.
>
> I would like to know if anyone has used/updated httpclient protocol plugin
> for crawling pages behind SSO authentication.
>
> The SSO auth redirects pages to the SSO server for login and optionally
> asks for second factor authentication like TOTP.
>
> We have been using a custom plugin (which calls a nodejs service) which
> uses a google puppeteer to drive chromium browser to do this login and OTP
> handling. This is much slower and might not require as many of these pages
> are rendered on server sides (so dynamic rendering isn't required)
>
> Thank you
> Abhay Ratnaparkhi
>
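
On the TOTP second factor mentioned in the quoted message: since the pages are rendered server side, the one-time code itself could in principle be computed without a browser. Below is a minimal sketch of standard TOTP generation (RFC 6238, HMAC-SHA1) using only the Python standard library. It is an illustration, not part of any Nutch plugin; the function name `totp` and its parameters are hypothetical, and it assumes the SSO server uses a base32-encoded shared secret with the common 30-second step.

```python
import base64
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret_b32: str, at: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code (HMAC-SHA1 variant).

    secret_b32, step and digits are assumptions about how the SSO
    server was provisioned; adjust them to match the real setup.
    """
    # Base32 secrets are often stored without padding; restore it.
    padded = secret_b32.upper() + "=" * (-len(secret_b32) % 8)
    key = base64.b32decode(padded)

    # The moving factor is the number of whole time steps since the epoch.
    now = time.time() if at is None else at
    counter = int(now // step)

    # HOTP (RFC 4226): HMAC over the 8-byte big-endian counter.
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()

    # Dynamic truncation: the low nibble of the last byte selects 4 bytes.
    offset = digest[-1] & 0x0F
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF

    # Keep the requested number of decimal digits, left-padded with zeros.
    return str(value % 10 ** digits).zfill(digits)
```

Calling `totp("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ", at=59, digits=8)` uses the RFC 6238 SHA-1 test secret and should reproduce the RFC's published value for that timestamp. A code generated this way would still need to be submitted through whatever login flow the SSO server expects.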