Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-07-08 Thread Lewis John McGibbney
OK, I'm going to try out Selenium Grid 4 and record my experience in a wiki 
page.
I'll write back here in due course.
Thanks

On 2021/07/08 17:11:56, Abhay Ratnaparkhi  wrote: 
> Hello Lewis,
> 
> Sorry for the late reply, I missed your email.
> The version we used is 3.141.59. As I mentioned earlier, we moved to using
> puppeteer instead of selenium.
> 
> 
> Thank you
> ~Abhay
> 
> 
> Below was the hub configuration.
> 
> 
> ```
> hub:
>   image: "selenium/hub"
>   tag: "3.141.59"
>   port:
>   servicePort:
>   readinessTimeout: 40
>   readinessDelay: 40
>   livenessTimeout: 160
>   javaOpts: "-Xmx8192m"
>   resources:
>     limits:
>       cpu: "7"
>       memory: "9Gi"
>   gridNewSessionWaitTimeout: -1
>   gridJettyMaxThreads: 750
>   gridNodePolling: 1
>   gridCleanUpCycle: 5000
>   gridTimeout: 360
>   gridBrowserTimeout: 120
>   gridMaxSession: 5
>   gridUnregisterIfStillDownAfter: 60
> chrome:
>   enabled: true
>   image: "selenium/node-chrome"
>   tag: "3.141.59"
>   replicas: 60
>   nodeMaxSession: 5
>   nodeRegistryCycle: 5000
>   javaOpts: "-Xmx2048m"
>   resources:
>     limits:
>       cpu: "1200m"
>       memory: "3000Mi"
> ```
> 
> On Thu, Jul 1, 2021 at 3:06 PM Lewis John McGibbney 
> wrote:
> 
> > Hi Abhay,
> >
> > On 2021/06/10 22:27:42, Abhay Ratnaparkhi 
> > wrote:
> >
> > >
> > > Based on selenium I created a microservice (which handles all required
> > > SSO redirections/ OTP handlings etc) and hosted that with a selenium
> > > grid in the kubernetes cluster for scaling.
> > > I found that we couldn't scale this approach beyond a certain point and
> > > the selenium hub in the selenium grid can not be scaled horizontally.
> >
> > Which version of Selenium Grid and Hub did you use?
> > I haven't used either for a while... I did see that Grid 4 is available
> > https://www.selenium.dev/documentation/en/grid/grid_4/
> >
> > lewismc
> >
> 


Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-07-08 Thread Abhay Ratnaparkhi
Hello Lewis,

Sorry for the late reply, I missed your email.
The version we used is 3.141.59. As I mentioned earlier, we moved to using
puppeteer instead of selenium.


Thank you
~Abhay


Below was the hub configuration.


```
hub:
  image: "selenium/hub"
  tag: "3.141.59"
  port:
  servicePort:
  readinessTimeout: 40
  readinessDelay: 40
  livenessTimeout: 160
  javaOpts: "-Xmx8192m"
  resources:
    limits:
      cpu: "7"
      memory: "9Gi"
  gridNewSessionWaitTimeout: -1
  gridJettyMaxThreads: 750
  gridNodePolling: 1
  gridCleanUpCycle: 5000
  gridTimeout: 360
  gridBrowserTimeout: 120
  gridMaxSession: 5
  gridUnregisterIfStillDownAfter: 60
chrome:
  enabled: true
  image: "selenium/node-chrome"
  tag: "3.141.59"
  replicas: 60
  nodeMaxSession: 5
  nodeRegistryCycle: 5000
  javaOpts: "-Xmx2048m"
  resources:
    limits:
      cpu: "1200m"
      memory: "3000Mi"
```

On Thu, Jul 1, 2021 at 3:06 PM Lewis John McGibbney 
wrote:

> Hi Abhay,
>
> On 2021/06/10 22:27:42, Abhay Ratnaparkhi 
> wrote:
>
> >
> > Based on selenium I created a microservice (which handles all required
> > SSO redirections/ OTP handlings etc) and hosted that with a selenium
> > grid in the kubernetes cluster for scaling.
> > I found that we couldn't scale this approach beyond a certain point and
> > the selenium hub in the selenium grid can not be scaled horizontally.
>
> Which version of Selenium Grid and Hub did you use?
> I haven't used either for a while... I did see that Grid 4 is available
> https://www.selenium.dev/documentation/en/grid/grid_4/
>
> lewismc
>


Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-07-01 Thread Lewis John McGibbney
Hi Abhay,

On 2021/06/10 22:27:42, Abhay Ratnaparkhi  wrote: 

> 
> Based on selenium I created a microservice (which handles all required SSO
> redirections/ OTP handlings etc) and hosted that with a selenium grid in
> the kubernetes cluster for scaling.
> I found that we couldn't scale this approach beyond a certain point and the
> selenium hub in the selenium grid can not be scaled horizontally.

Which version of Selenium Grid and Hub did you use?
I haven't used either for a while... I did see that Grid 4 is available
https://www.selenium.dev/documentation/en/grid/grid_4/

lewismc


Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-12 Thread lewis john mcgibbney
Yes you are hitting the exact same problems that we did. This presents a
major persistent challenge for using Nutch across the enterprise as it
quite frankly doesn’t scale.
I’m going to take next week to have a look into this specific issue and see
what I can come up with.
By any chance are you able to share your K8s configuration management here?
Are you using Helm?
Are you running Nutch in K8s or via some other deployment?
Next week I’m also looking into building our CloudFormation template for
Nutch on EMR with Ranger included and will donate this to the Nutch
project.


Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-10 Thread Abhay Ratnaparkhi
Thank you Lewis for your reply.

I initially looked into the protocol-htmlunit and
protocol-interactiveselenium plugins you mentioned above.

Based on Selenium, I created a microservice (which handles all required SSO
redirections, OTP handling, etc.) and hosted it with a Selenium Grid in
the Kubernetes cluster for scaling.
I found that we couldn't scale this approach beyond a certain point, and the
Selenium hub in the Selenium Grid cannot be scaled horizontally.

Later we switched to using Puppeteer to drive headless Chrome and scaled
this in Kubernetes using browserless.

A Nutch plugin was developed to call these hosted APIs. This helps, but it
is still very slow compared to the traditional httpclient approach.
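
To make the "plugin calls a hosted API" arrangement concrete, here is a minimal sketch of the HTTP call such a client might make. It is not the plugin from this thread. It assumes the browserless deployment exposes a `/content` endpoint that accepts a JSON body with a `url` field and returns the rendered HTML; the service address and the function names are hypothetical. It also does not perform the SSO login or OTP step, which would still need to be handled on the browser side.

```python
import json
import urllib.request


def build_content_request(service_url: str, page_url: str) -> urllib.request.Request:
    """Build a POST request asking the rendering service for a page's HTML.

    Assumption: the service offers POST <service_url>/content taking
    {"url": ...} as JSON. Adjust to whatever your deployment exposes.
    """
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        service_url.rstrip("/") + "/content",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered_html(service_url: str, page_url: str, timeout: float = 60.0) -> str:
    """Send the request and return the response body decoded as text."""
    request = build_content_request(service_url, page_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Hypothetical in-cluster service name and intranet URL.
    req = build_content_request("http://browserless:3000", "https://intranet.example/page")
    print(req.get_method(), req.full_url)
```

Every fetch in this design pays for a browser page load plus an extra network hop, which is consistent with it being slower than a plain HTTP client.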

As this is a common problem in the intranet environment, I was wondering
how people are handling this. I would be happy to discuss this further.

Thank you
Abhay





On Wed, Jun 9, 2021 at 6:41 PM Lewis John McGibbney 
wrote:

> Hi Abhay,
>
> This is a problem space we looked at a while ago and made quite a bit of
> progress on.
>
> Firstly, the protocol-httpclient plugin has been considered in a
> deprecated state for a while.
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient
> I'm pretty sure that it will NOT cater for your use case. More information
> on the functionality and limits of this plugin can be found at
> https://cwiki.apache.org/confluence/display/nutch/HttpAuthenticationSchemes
> some more recent initiatives can be found at
> https://cwiki.apache.org/confluence/display/nutch/HttpPostAuthentication
>
> Now, some of the plugins which may be used/adapted for your use case
> include
>
> 1.
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit
> - customizable through
> https://github.com/apache/nutch/blob/master/src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java
>
> 2. both
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
>
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
> some documentation exists at
> https://cwiki.apache.org/confluence/display/NUTCH/AdvancedAjaxInteraction
>
> Admittedly, I've not tried to run these plugins against a modern SSO site
> recently. I suspect that some dependency updates would not go amiss, so
> please take that into consideration.
>
> Your note regarding the time it takes for the 'chaining' of systems
> together to achieve the login is well made. This was easily observed and
> needs a more consolidated/calculated approach IMHO.
>
> I would be interested to discuss this further with you...
>
> hth
> lewismc
>
> On 2021/06/07 02:45:54, Abhay Ratnaparkhi 
> wrote:
> > Hello,
> >
> > We are using Nutch to crawl intranet pages behind SSO authentication.
> >
> > I would like to know if anyone has used/updated httpclient protocol
> > plugin for crawling pages behind SSO authentication.
> >
> > The SSO auth redirects pages to the SSO server for login and optionally
> > asks for second factor authentication like TOTP.
> >
> > We have been using a custom plugin (which calls a Node.js service) that
> > uses Google's Puppeteer to drive a Chromium browser to do this login and
> > OTP handling. This is much slower and might not be required, as many of
> > these pages are rendered on the server side (so dynamic rendering isn't
> > required).
> >
> > Thank you
> > Abhay Ratnaparkhi
> >
>


Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-09 Thread Lewis John McGibbney
Hi Abhay,

This is a problem space we looked at a while ago and made quite a bit of 
progress on.

Firstly, the protocol-httpclient plugin has been considered in a deprecated 
state for a while.
https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient
I'm pretty sure that it will NOT cater for your use case. More information on 
the functionality and limits of this plugin can be found at 
https://cwiki.apache.org/confluence/display/nutch/HttpAuthenticationSchemes 
some more recent initiatives can be found at 
https://cwiki.apache.org/confluence/display/nutch/HttpPostAuthentication

Now, some of the plugins which may be used/adapted for your use case include 

1. https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit - 
customizable through 
https://github.com/apache/nutch/blob/master/src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java
 

2. both
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
some documentation exists at 
https://cwiki.apache.org/confluence/display/NUTCH/AdvancedAjaxInteraction

Admittedly, I've not tried to run these plugins against a modern SSO site 
recently. I suspect that some dependency updates would not go amiss, so please 
take that into consideration.

Your note regarding the time it takes for the 'chaining' of systems together to 
achieve the login is well made. This was easily observed and needs a more 
consolidated/calculated approach IMHO.

I would be interested to discuss this further with you...

hth
lewismc

On 2021/06/07 02:45:54, Abhay Ratnaparkhi  wrote: 
> Hello,
> 
> We are using Nutch to crawl intranet pages behind SSO authentication.
> 
> I would like to know if anyone has used/updated httpclient protocol plugin
> for crawling pages behind SSO authentication.
> 
> The SSO auth redirects pages to the SSO server for login and optionally
> asks for second factor authentication like TOTP.
> 
> We have been using a custom plugin (which calls a Node.js service) that
> uses Google's Puppeteer to drive a Chromium browser to do this login and
> OTP handling. This is much slower and might not be required, as many of
> these pages are rendered on the server side (so dynamic rendering isn't
> required).
> 
> Thank you
> Abhay Ratnaparkhi
> 


Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-06 Thread Abhay Ratnaparkhi
Hello,

We are using Nutch to crawl intranet pages behind SSO authentication.

I would like to know if anyone has used/updated httpclient protocol plugin
for crawling pages behind SSO authentication.

The SSO auth redirects pages to the SSO server for login and optionally
asks for second factor authentication like TOTP.
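
For readers unfamiliar with the second factor mentioned here, the following is a minimal sketch of how a TOTP code can be computed with only the Python standard library, following RFC 6238 (HMAC-SHA1, 30-second steps). The function names are illustrative, and the secret is the RFC's published test key, not anything from this deployment.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP value (RFC 4226) for a raw secret and a moving counter."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now=None, step: int = 30, digits: int = 6) -> str:
    """TOTP value (RFC 6238): HOTP over the number of elapsed time steps."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now) // step, digits)


if __name__ == "__main__":
    # RFC 6238 test vector for SHA-1 at T = 59 seconds, 8 digits.
    print(totp(b"12345678901234567890", now=59, digits=8))  # 94287082
```

In practice the shared secret is usually provisioned as a base32 string, so it would first be decoded with `base64.b32decode` before being passed in. An automated login would compute the code immediately before submitting the OTP form, since each code is only valid for a short window.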

We have been using a custom plugin (which calls a Node.js service) that
uses Google's Puppeteer to drive a Chromium browser to do this login and OTP
handling. This is much slower and might not be required, as many of these
pages are rendered on the server side (so dynamic rendering isn't required).
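
One possible way to act on the observation that most pages are server-rendered, offered only as a sketch and not as something done in this setup: let the browser perform the login once, export its session cookies, and attach them to ordinary HTTP requests for the remaining pages. The code below assumes the cookies arrive as a list of dictionaries with `name`, `value`, `domain`, `path`, and `secure` keys (the shape Puppeteer's `page.cookies()` returns); the function names are hypothetical and the domain and path matching is deliberately simplified.

```python
import urllib.request
from urllib.parse import urlsplit


def cookie_header(cookies, url: str) -> str:
    """Build a Cookie header value from exported cookies that apply to url."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path or "/"
    pairs = []
    for cookie in cookies:
        domain = cookie.get("domain", "").lstrip(".")
        if not domain:
            continue
        # Simplified domain match: exact host or a subdomain of the cookie domain.
        if host != domain and not host.endswith("." + domain):
            continue
        # Simplified path match: plain prefix comparison.
        if not path.startswith(cookie.get("path", "/")):
            continue
        # Secure cookies are only sent over HTTPS.
        if cookie.get("secure") and parts.scheme != "https":
            continue
        pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)


def authenticated_request(url: str, cookies) -> urllib.request.Request:
    """Create a plain HTTP request carrying the browser session's cookies."""
    request = urllib.request.Request(url)
    header = cookie_header(cookies, url)
    if header:
        request.add_header("Cookie", header)
    return request
```

Whether this works depends on the SSO product: sessions may expire quickly, be bound to a client fingerprint, or require periodic re-authentication, in which case the browser login would have to be repeated.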

Thank you
Abhay Ratnaparkhi