Yes, you are hitting the exact same problems that we did. This presents a
major, persistent challenge for using Nutch across the enterprise as it,
quite frankly, doesn’t scale.
I’m going to take some time next week to look into this specific issue and see
what I can come up with.
By any chance are you able to share your K8s configuration management here?
Are you using Helm?
Are you running Nutch in K8s or via some other deployment?
Next week I’m also looking into building our CloudFormation template for
Nutch on EMR with Ranger included and will donate this to the Nutch
project.

On Sat, Jun 12, 2021 at 17:36 <user-digest-h...@nutch.apache.org> wrote:

>
> user Digest 13 Jun 2021 00:36:36 -0000 Issue 3108
>
> Topics (messages 34633 through 34634)
>
> Re: Apache Nutch help request for a school project :)
>         34633 by: lewis john mcgibbney
>
> Re: Crawling pages behind SSO authentication (SAML/OIDC)
>         34634 by: Abhay Ratnaparkhi
>
> Administrivia:
>
> ---------------------------------------------------------------------
> To post to the list, e-mail: user@nutch.apache.org
> To unsubscribe, e-mail: user-digest-unsubscr...@nutch.apache.org
> For additional commands, e-mail: user-digest-h...@nutch.apache.org
>
> ----------------------------------------------------------------------
>
>
>
>
> ---------- Forwarded message ----------
> From: lewis john mcgibbney <lewi...@apache.org>
> To: "gokmen.yontem" <gokmen.yon...@boun.edu.tr>
> Cc: Sebastian Nagel <wastl.na...@googlemail.com>, user@nutch.apache.org
> Bcc:
> Date: Thu, 10 Jun 2021 09:53:31 -0700
> Subject: Re: Apache Nutch help request for a school project :)
> :)
>
> On Thu, Jun 10, 2021 at 7:18 AM gokmen.yontem <gokmen.yon...@boun.edu.tr>
> wrote:
>
> > Lewis, Sebastian
> > I can’t thank you enough! Your help is much appreciated.
> >
> > Next time I'll follow your advice and use the mailing list, which I
> > wasn't aware of.
> >
> > Best wishes,
> > Gorkem
> >
> >
> > On 2021-06-07 20:08, lewis john mcgibbney wrote:
> > > Yep Sebastian is absolutely correct. I sent you a pull request.
> > >
> > > https://github.com/gorkemyontem/nutch/pull/1
> > > HTH
> > > lewismc
> > >
> > > On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney
> > > <lewi...@apache.org> wrote:
> > >
> > >> I’ll have a look today. You can always use the mailing list as
> > >> well. Feel free to post your questions there and we will help you
> > >> out :)
> > >>
> > >> On Sun, Jun 6, 2021 at 12:43 gokmen.yontem
> > >> <gokmen.yon...@boun.edu.tr> wrote:
> > >>
> > >>> Hi Lewis,
> > > >>> Sorry to bother you. I've been trying to configure Apache Nutch for
> > > >>> almost 10 days now and I'm about to give up. I saw that you are
> > >>> contributing to this project and I thought maybe you can help me.
> > >>> This is how desperate I am :)
> > >>>
> > >>> Here's my repo if you have time:
> > >>> https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
> > > >>> I'm trying to use Docker images so there isn't much in the repo.
> > >>>
> > >>> This is my current error:
> > >>>
> > > >>> nutch    | Indexer: java.lang.RuntimeException: Indexing job did not succeed, job status:FAILED, reason: NA
> > > >>> nutch    |      at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
> > > >>> nutch    |      at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
> > > >>> nutch    |      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> > > >>> nutch    |      at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)
> > >>>
> > > >>> People say that schema.xml could be wrong, but I'm using the most
> > > >>> up-to-date one from here:
> > > >>> https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml
> > >>>
> > >>> Many many thanks!
> > >>> Best wishes,
> > >>> Gorkem
> > >> --
> > >>
> > >> http://home.apache.org/~lewismc/
> > >> http://people.apache.org/keys/committer/lewismc
> > >
> > > --
> > >
> > > http://home.apache.org/~lewismc/
> > > http://people.apache.org/keys/committer/lewismc
> >
>
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
>
>
>
> ---------- Forwarded message ----------
> From: Abhay Ratnaparkhi <abhay.ratnapar...@gmail.com>
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Thu, 10 Jun 2021 17:27:42 -0500
> Subject: Re: Crawling pages behind SSO authentication (SAML/OIDC)
> Thank you Lewis for your reply.
>
> I initially looked into the protocol-htmlunit
> <https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit>
> and protocol-interactiveselenium
> <https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium>
> plugins you mentioned.
>
> Based on Selenium, I created a microservice (which handles all required SSO
> redirections, OTP handling, etc.) and hosted it with a Selenium Grid in
> the Kubernetes cluster for scaling.
> I found that we couldn't scale this approach beyond a certain point, and the
> Selenium Hub in the Selenium Grid cannot be scaled horizontally.
>
> Later we switched to using Puppeteer <https://github.com/puppeteer/puppeteer>
> to drive headless Chrome and scaled this in Kubernetes using browserless
> <https://github.com/browserless/chrome>.
> We developed a Nutch plugin to call these hosted APIs. This helps, but it is
> still very slow compared to the traditional httpclient approach.
>
> As this is a common problem in the intranet environment, I was wondering
> how people are handling this. I would be happy to discuss this further.
>
> Thank you
> Abhay
>
>
>
>
>
> On Wed, Jun 9, 2021 at 6:41 PM Lewis John McGibbney <lewi...@apache.org>
> wrote:
>
> > Hi Abhay,
> >
> > This is a problem space we looked at a while ago and made quite a bit of
> > progress on.
> >
> > Firstly, the protocol-httpclient plugin has been considered in a
> > deprecated state for a while.
> > https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient
> > I'm pretty sure that it will NOT cater for your use case. More information
> > on the functionality and limits of this plugin can be found at
> > https://cwiki.apache.org/confluence/display/nutch/HttpAuthenticationSchemes
> > Some more recent initiatives can be found at
> > https://cwiki.apache.org/confluence/display/nutch/HttpPostAuthentication
> >
> > Now, some of the plugins which may be used/adapted for your use case
> > include
> >
> > 1.
> > https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit
> > - customizable through
> > https://github.com/apache/nutch/blob/master/src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java
> >
> > 2. both
> > https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
> > https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
> > Some documentation exists at
> > https://cwiki.apache.org/confluence/display/NUTCH/AdvancedAjaxInteraction
> >
> > Admittedly, I've not tried to run these plugins against a modern SSO site
> > recently. I suspect that some dependency updates would not go amiss, so
> > please take that into consideration.
> >
> > Your note regarding the time it takes for the 'chaining' of systems
> > together to achieve the login is well made. This was easily observed and
> > needs a more consolidated/calculated approach IMHO.
> >
> > I would be interested to discuss this further with you...
> >
> > hth
> > lewismc
> >
> > On 2021/06/07 02:45:54, Abhay Ratnaparkhi <abhay.ratnapar...@gmail.com>
> > wrote:
> > > Hello,
> > >
> > > We are using Nutch to crawl intranet pages behind SSO authentication.
> > >
> > > I would like to know if anyone has used/updated the httpclient protocol
> > > plugin for crawling pages behind SSO authentication.
> > >
> > > The SSO auth redirects pages to the SSO server for login and optionally
> > > asks for second factor authentication like TOTP.
> > >
> > > We have been using a custom plugin (which calls a Node.js service) which
> > > uses Google Puppeteer to drive a Chromium browser to do this login and OTP
> > > handling. This is much slower and might not be required, as many of these
> > > pages are rendered on the server side (so dynamic rendering isn't required).
> > >
> > > Thank you
> > > Abhay Ratnaparkhi
> > >
> >
>
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
