Re: [New Nutch Plugin] Delegate fetching to Selenium/Firefox for those jobs where you neeeeed javascript parsing

Julien Nioche Thu, 31 Jul 2014 00:57:08 -0700

Hi,

Just to add to what Seb said below:









> > (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
> > C) Not have to wait another 2 years for Nutch to patch in either the Ajax
> > crawler hashbang workaround and then, not having to patch it to get the
> > use case of ammending the original url with the hashbang-workaround's
> > content.
> Your are right: it's a shame for many issues and patches lying around
> for years until they get integrated. On the other hand: everyone is
> welcome to participate, provide and review patches, improve code and
> documentation, etc.  There is lot of work to do...

Open source projects like Nutch rely on the participation of the community.
Everyone is welcome to contribute in any way possible.
If you wanted NUTCH-1323 to be committed quicker you could have helped
review the patch, voted for it, expressed yourself on the mailing list,
etc... Nutch is not a top-down organisation where things are decided
entirely by PMC members but an evolutionary process where things get done
because they are needed, get improved because they are used and so on...
Your contribution with this plugin is a good example of this: you needed
it, shared it and it might get improved as more people start using it.

> Glad to see interest, and more importantly, people still interested in
> nutch on the mailing list!


Crawling is a bit of a niche activity and the traffic on the lists is never
huge but Nutch is a very healthy project, and keeps getting better and
better (even if some JIRA issues do not get committed very quickly). Having
to maintain 2 versions definitely doesn't help focus the effort.

BTW what about porting your plugin to Nutch 1.x?

Thanks again for sharing your work

Julien






On 31 July 2014 06:25, Mo Omer <[email protected]> wrote:

> Sorry for the multiple emails, I didn't see the rest of your email
> Sebastian.
>
> Re httpclient - I had a total of just a few hours to hack together my
> previous selenium stand alone plugin, and even less time to put together
> this solution so there is looooots of stuff that can be pulled out that's
> leftover from httpclient!
>
> Unfortunately lately my work queue is heavy; and, I've already moved on
> from the project using this plugin. I'll happily look at and merge PRs, but
> can't promise any additional refactoring or curation on my end.
>
> I will put together a tutorial, as I mentioned in the previous email,
> showing
>
> A) What selenium is
> B) Why it's a good compromise
> C) Setting up Selenium Hub on Ubuntu 14.04
> D) Setting up Selenium Node on Ubuntu 14.04
> E) Some issues I've encountered with selenium node
>
> Glad to see interest, and more importantly, people still interested in
> nutch on the mailing list!
>
> Thank you,
>
> Mo
>
> This message was drafted on a tiny touch screen; please forgive brevity &
> tpyos
>
> > On Jul 30, 2014, at 5:22 PM, Sebastian Nagel <[email protected]>
> wrote:
> >
> > Hi Mohammed,
> >
> > sounds interesting. I'll give it a try soon.
> >
> >> I've been using it in production for a month now; and, there are some
> >> obvious things that need patching like
> >> - Enabling for https pages
> >> - It would probably be best for the overall use case to retrieve all of
> the
> >> document's html, rather than just a <body> tag (if exists).
> > At a first glance, looks like long passages of code are from
> protocol-http.
> > Would be good to pull-out the parts specific to selenium and integrate
> > them with the existing code base. This might require some refactoring.
> >
> >> (from
> https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
> >> C) Not have to wait another 2 years for Nutch to patch in either the
> Ajax crawler
> >> hashbang workaround and then, not having to patch it to get the use
> case of ammending the
> >> original url with the hashbang-workaround's content.
> > Your are right: it's a shame for many issues and patches lying around
> > for years until they get integrated. On the other hand: everyone
> > is welcome to participate, provide and review patches, improve code
> > and documentation, etc.  There is lot of work to do...
> >
> > Thanks for sharing the plugin,
> > would be great to here more from you!
> >
> > Sebastian
> >
> >
> >
> >> On 07/30/2014 09:26 PM, Lewis John Mcgibbney wrote:
> >> This looks fantastic. Are you interested in bringing in into the
> codebase?I
> >> think that this would be very useful to many users of Nutch and would be
> >> extremely interested in hashing out a patch with you in order to do so.
> >> Thanks
> >> Lewis
> >
> >
> >> On 07/29/2014 04:26 PM, Mohammed Omer wrote:
> >> Morning everyone,
> >>
> >> Figured I'd share out a little plugin that delegates fetching and
> crawling
> >> to a Selenium Hub/Node system, so that you can rely on Firefox to
> correctly
> >> render and parse javascript as it would, and Selenium to pull out the
> >> content you care about.
> >>
> >> At the moment, the plugin is set to pull just the innerHTML of the
> page's
> >> <body>; as I just needed a quick and dirty fix. It's forked from my
> >> patching of another user's previous attempt at getting Selenium
> standalone
> >> working with Nutch; that was in turn a fork of httpclient. That worked
> >> fine, but it was vulnerable to leaving lots of zombie processes hanging
> >> around when errors occurred. I pretty much just patched it enough to
> get it
> >> working - so if you end up using it and patching things / removing
> >> unnecessaries, send them up on a PR!
> >>
> >> Here, we rely on Selenium Hub/Node's self-healing set-up, and just pass
> >> requests for pages to that system, and receive html content as the
> response.
> >>
> >> I've been using it in production for a month now; and, there are some
> >> obvious things that need patching like
> >>
> >> - Enabling for https pages
> >> - It would probably be best for the overall use case to retrieve all of
> the
> >> document's html, rather than just a <body> tag (if exists).
> >>
> >> Available at: https://github.com/momer/nutch-selenium-grid-plugin
> >
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
