Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-07-08 Thread Lewis John McGibbney
OK, I'm going to try out Selenium Grid 4 and record my experience in a wiki page. I'll write back here in due course. Thanks

On 2021/07/08 17:11:56, Abhay Ratnaparkhi wrote:
> Hello Lewis,
>
> Sorry for the late reply, I missed your email.
> The version we used is 3.141.59. As I mentioned

Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-07-08 Thread Abhay Ratnaparkhi
Hello Lewis,

Sorry for the late reply, I missed your email. The version we used is 3.141.59. As I mentioned earlier, we moved to using puppeteer instead of selenium.

Thank you
~Abhay

Below was the hub configuration.

```
hub:
  image: "selenium/hub"
  tag: "3.141.59"
  port:
  servicePort:
```

Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-07-01 Thread Lewis John McGibbney
Hi Abhay,

On 2021/06/10 22:27:42, Abhay Ratnaparkhi wrote:
>
> Based on selenium I created a microservice (which handles all required SSO
> redirections/ OTP handlings etc) and hosted that with a selenium grid in
> the kubernetes cluster for scaling.
> I found that we couldn't scale this

Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-12 Thread lewis john mcgibbney
, 2021 at 17:36 wrote:
>
> user Digest 13 Jun 2021 00:36:36 - Issue 3108
>
> Topics (messages 34633 through 34634)
>
> Re: Apache Nutch help request for a school project :)
> 34633 by: lewis john mcgibbney
>
> Re: Crawling pages behind SSO authentication

Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-10 Thread Abhay Ratnaparkhi
Thank you Lewis for your reply. I initially looked into the above protocol-htmlunit and protocol-interactiveselenium plugins you

Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-09 Thread Lewis John McGibbney
Hi Abhay, This is a problem space we looked at a while ago and made quite a bit of progress on. Firstly, the protocol-httpclient plugin has been considered in a deprecated state for a while. https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient I'm pretty sure that it will

Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-06 Thread Abhay Ratnaparkhi
Hello, We are using Nutch to crawl intranet pages behind SSO authentication. I would like to know if anyone has used or updated the httpclient protocol plugin for crawling pages behind SSO authentication. The SSO auth redirects pages to the SSO server for login and optionally asks for a second factor
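
For readers following the thread, the redirect behaviour described above can be detected with a small check like the one below. This is only a rough sketch in Python, not anything taken from Nutch or from the setup discussed here: the host name `sso.example.com` and the helper `is_sso_redirect` are hypothetical, and it assumes the fetcher exposes both the requested URL and the final URL after redirects.

```python
from urllib.parse import urlparse

# Hypothetical SSO host, used purely for illustration.
SSO_HOSTS = {"sso.example.com"}


def is_sso_redirect(requested_url, final_url, sso_hosts=SSO_HOSTS):
    """Return True if a fetch for requested_url ended up on a known SSO host.

    A crawler could use this signal to decide that the page needs an
    authenticated session before it can be fetched and parsed.
    """
    requested_host = urlparse(requested_url).hostname
    final_host = urlparse(final_url).hostname
    return final_host in sso_hosts and final_host != requested_host


if __name__ == "__main__":
    print(is_sso_redirect(
        "https://intranet.example.com/wiki/Home",
        "https://sso.example.com/login?next=%2Fwiki%2FHome",
    ))
```

A check like this only flags that a login is required; handling the login itself (including any second factor) is the part the rest of the thread discusses.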