Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-10 Thread Abhay Ratnaparkhi
Thank you Lewis for your reply.

I initially looked into the protocol-htmlunit and
protocol-interactiveselenium plugins you mentioned.

Based on Selenium, I created a microservice (which handles all required SSO
redirections, OTP handling, etc.) and hosted it with a Selenium Grid in
a Kubernetes cluster for scaling.
I found that we couldn't scale this approach beyond a certain point, and the
Selenium hub in the Selenium Grid cannot be scaled horizontally.
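
As an illustration only (not the actual service code), the TOTP part of
such a login can be sketched in Java roughly as below. It assumes the
standard RFC 6238 algorithm with HMAC-SHA1 and a 30-second time step; the
class and method names are made up for this example, and real secrets are
usually Base32-encoded and would need decoding first.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: computes a time-based one-time password (RFC 6238).
public class TotpSketch {

    // secret: raw key bytes (assumed already decoded from Base32)
    // unixSeconds: current time in seconds since the epoch
    // digits: length of the code (typically 6 or 8)
    // stepSeconds: time step (typically 30)
    public static String totp(byte[] secret, long unixSeconds, int digits, int stepSeconds) {
        try {
            // The moving factor is the number of time steps since the epoch.
            long counter = unixSeconds / stepSeconds;
            byte[] message = ByteBuffer.allocate(8).putLong(counter).array();

            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secret, "HmacSHA1"));
            byte[] hash = mac.doFinal(message);

            // Dynamic truncation as defined in RFC 4226.
            int offset = hash[hash.length - 1] & 0x0f;
            int binary = ((hash[offset] & 0x7f) << 24)
                    | ((hash[offset + 1] & 0xff) << 16)
                    | ((hash[offset + 2] & 0xff) << 8)
                    | (hash[offset + 3] & 0xff);

            int code = binary % (int) Math.pow(10, digits);
            return String.format("%0" + digits + "d", code);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA1 not available", e);
        }
    }
}
```

With the RFC 6238 test secret (the ASCII bytes of "12345678901234567890")
and a time of 59 seconds, this gives 94287082 for 8 digits.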

Later we switched to using Puppeteer to drive headless Chrome and scaled
this in Kubernetes using browserless. We developed a Nutch plugin to call
these hosted APIs. This helps, but it is still very slow compared to the
traditional httpclient approach.
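
As a rough sketch of the plugin side (illustrative, not the actual plugin),
the call to the hosted rendering service could look like the Java below.
The endpoint URL, the JSON body with a single "url" field, and the class
name RenderServiceClient are all assumptions for this example; the real
API depends on how the service is deployed.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical client for a hosted headless-browser rendering service.
public class RenderServiceClient {
    private final URI endpoint;
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    public RenderServiceClient(String endpoint) {
        this.endpoint = URI.create(endpoint);
    }

    // Builds the (assumed) JSON payload, escaping backslashes and quotes.
    public static String jsonBody(String pageUrl) {
        String escaped = pageUrl.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"url\":\"" + escaped + "\"}";
    }

    // Builds the POST request; a long timeout allows for login redirects.
    public HttpRequest buildRequest(String pageUrl) {
        return HttpRequest.newBuilder(endpoint)
                .timeout(Duration.ofSeconds(120))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody(pageUrl)))
                .build();
    }

    // Sends the request and returns the rendered HTML on HTTP 200.
    public String fetch(String pageUrl) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(pageUrl), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Render service returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

Every fetch in this design pays for a full browser round trip, which is
where the slowdown relative to a plain HTTP client comes from.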

As this is a common problem in the intranet environment, I was wondering
how people are handling this. I would be happy to discuss this further.

Thank you
Abhay





On Wed, Jun 9, 2021 at 6:41 PM Lewis John McGibbney 
wrote:

> Hi Abhay,
>
> This is a problem space we looked at a while ago and made quite a bit of
> progress on.
>
> Firstly, the protocol-httpclient plugin has been considered in a
> deprecated state for a while.
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient
> I'm pretty sure that it will NOT cater for your use case. More information
> on the functionality and limits of this plugin can be found at
> https://cwiki.apache.org/confluence/display/nutch/HttpAuthenticationSchemes
> Some more recent initiatives can be found at
> https://cwiki.apache.org/confluence/display/nutch/HttpPostAuthentication
>
> Now, some of the plugins which may be used/adapted for your use case
> include
>
> 1.
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit
> - customizable through
> https://github.com/apache/nutch/blob/master/src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java
>
> 2. both
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
> and
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
> Some documentation exists at
> https://cwiki.apache.org/confluence/display/NUTCH/AdvancedAjaxInteraction
>
> Admittedly, I've not tried to run these plugins against a modern SSO site
> recently. I suspect that some dependency updates would not go amiss, so
> please take that into consideration.
>
> Your note regarding the time it takes for the 'chaining' of systems
> together to achieve the login is well made. This was easily observed and
> needs a more consolidated/calculated approach IMHO.
>
> I would be interested to discuss this further with you...
>
> hth
> lewismc
>
> On 2021/06/07 02:45:54, Abhay Ratnaparkhi 
> wrote:
> > Hello,
> >
> > We are using Nutch to crawl intranet pages behind SSO authentication.
> >
> > I would like to know if anyone has used/updated the httpclient protocol
> > plugin for crawling pages behind SSO authentication.
> >
> > The SSO auth redirects pages to the SSO server for login and optionally
> > asks for second factor authentication like TOTP.
> >
> > We have been using a custom plugin (which calls a Node.js service) that
> > uses Google's Puppeteer to drive a Chromium browser to do this login and
> > OTP handling. This is much slower and might not be required, as many of
> > these pages are rendered on the server side (so dynamic rendering isn't
> > required).
> >
> > Thank you
> > Abhay Ratnaparkhi
> >
>


Re: Apache Nutch help request for a school project :)

2021-06-10 Thread lewis john mcgibbney
:)

On Thu, Jun 10, 2021 at 7:18 AM gokmen.yontem 
wrote:

> Lewis, Sebastian
> I can’t thank you enough! Your help is much appreciated.
>
> Next time I'll follow your advice and use the mailing list, which I
> wasn't aware of.
>
> Best wishes,
> Gorkem
>
>
> On 2021-06-07 20:08, lewis john mcgibbney wrote:
> > Yep Sebastian is absolutely correct. I sent you a pull request.
> >
> > https://github.com/gorkemyontem/nutch/pull/1
> > HTH
> > lewismc
> >
> > On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney
> >  wrote:
> >
> >> I’ll have a look today. You can always use the mailing list as
> >> well. Feel free to post your questions there and we will help you
> >> out :)
> >>
> >> On Sun, Jun 6, 2021 at 12:43 gokmen.yontem
> >>  wrote:
> >>
> >>> Hi Lewis,
> >>> Sorry to bother you. I've been trying to configure Apache Nutch for
> >>> almost 10 days now and I'm about to give up. I saw that you are
> >>> contributing to this project and I thought maybe you can help me.
> >>> This is how desperate I am :)
> >>>
> >>> Here's my repo if you have time:
> >>> https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
> >>> I'm trying to use Docker images, so there isn't much in the repo.
> >>>
> >>> This is my current error:
> >>>
> >>> nutch| Indexer: java.lang.RuntimeException: Indexing job did not succeed, job status:FAILED, reason: NA
> >>> nutch|  at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
> >>> nutch|  at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
> >>> nutch|  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> >>> nutch|  at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)
> >>>
> >>> People say that schema.xml could be wrong, but I'm using the most
> >>> up-to-date one from here:
> >>> https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml
> >>>
> >>> Many many thanks!
> >>> Best wishes,
> >>> Gorkem
> >> --
> >>
> >> http://home.apache.org/~lewismc/
> >> http://people.apache.org/keys/committer/lewismc
> >
> > --
> >
> > http://home.apache.org/~lewismc/
> > http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc