Re: [New Nutch Plugin] Delegate fetching to Selenium/Firefox for those jobs where you neeeeed javascript parsing

Mo Omer Wed, 30 Jul 2014 22:26:28 -0700

Sorry for the multiple emails; I didn't see the rest of your email, Sebastian.


Re httpclient - I had a total of just a few hours to hack together my previous 
Selenium standalone plugin, and even less time to put together this solution, 
so there is looooots of stuff left over from httpclient that can be pulled 
out!

Unfortunately, my work queue has been heavy lately, and I've already moved on 
from the project using this plugin. I'll happily look at and merge PRs, but 
can't promise any additional refactoring or curation on my end.

I will put together a tutorial, as I mentioned in the previous email, covering:

A) What Selenium is
B) Why it's a good compromise
C) Setting up Selenium Hub on Ubuntu 14.04
D) Setting up Selenium Node on Ubuntu 14.04
E) Some issues I've encountered with Selenium Node

Glad to see the interest and, more importantly, people still interested in 
Nutch on the mailing list!

Thank you,

Mo

This message was drafted on a tiny touch screen; please forgive brevity & tpyos

> On Jul 30, 2014, at 5:22 PM, Sebastian Nagel <[email protected]> wrote:
> 
> Hi Mohammed,
> 
> sounds interesting. I'll give it a try soon.
> 
>> I've been using it in production for a month now; and, there are some
>> obvious things that need patching like
>> - Enabling for https pages
>> - It would probably be best for the overall use case to retrieve all of the
>> document's html, rather than just a <body> tag (if exists).
> At first glance, it looks like long passages of code are from protocol-http.
> It would be good to pull out the parts specific to Selenium and integrate
> them with the existing code base. This might require some refactoring.
> 
>> (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
>> C) Not have to wait another 2 years for Nutch to patch in either the Ajax
>> crawler hashbang workaround and then, not having to patch it to get the use
>> case of amending the original url with the hashbang-workaround's content.
> You are right: it's a shame that many issues and patches lie around
> for years until they get integrated. On the other hand, everyone
> is welcome to participate, provide and review patches, improve code
> and documentation, etc. There is a lot of work to do...
> 
> Thanks for sharing the plugin,
> would be great to hear more from you!
> 
> Sebastian
> 
> 
> 
>> On 07/30/2014 09:26 PM, Lewis John Mcgibbney wrote:
>> This looks fantastic. Are you interested in bringing it into the codebase? I
>> think that this would be very useful to many users of Nutch and would be
>> extremely interested in hashing out a patch with you in order to do so.
>> Thanks
>> Lewis
> 
> 
>> On 07/29/2014 04:26 PM, Mohammed Omer wrote:
>> Morning everyone,
>> 
>> Figured I'd share out a little plugin that delegates fetching and crawling
>> to a Selenium Hub/Node system, so that you can rely on Firefox to correctly
>> render and parse javascript as it would, and Selenium to pull out the
>> content you care about.
>> 
>> At the moment, the plugin is set to pull just the innerHTML of the page's
>> <body>; as I just needed a quick and dirty fix. It's forked from my
>> patching of another user's previous attempt at getting Selenium standalone
>> working with Nutch; that was in turn a fork of httpclient. That worked
>> fine, but it was vulnerable to leaving lots of zombie processes hanging
>> around when errors occurred. I pretty much just patched it enough to get it
>> working - so if you end up using it and patching things / removing
>> unnecessaries, send them up on a PR!
>> 
>> Here, we rely on Selenium Hub/Node's self-healing set-up, and just pass
>> requests for pages to that system, and receive html content as the response.
>> 
>> I've been using it in production for a month now; and, there are some
>> obvious things that need patching like
>> 
>> - Enabling for https pages
>> - It would probably be best for the overall use case to retrieve all of the
>> document's html, rather than just a <body> tag (if exists).
>> 
>> Available at: https://github.com/momer/nutch-selenium-grid-plugin
>
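
For readers following the thread, the hand-off described in the original
announcement (pass a page request to the Selenium Hub, get rendered HTML back)
can be sketched roughly as below. This is not the plugin's code, which is Java
and lives in the linked repository; it is a minimal Python illustration that
talks to a hub over the WebDriver HTTP endpoints. The function names
(http_send, fetch_rendered_html) and the injectable send parameter are
hypothetical and exist only for this example. It asks for the whole page source
rather than only the <body> innerHTML, which is the direction suggested in the
thread, and it assumes the hub accepts either the legacy or the W3C
new-session payload.

```python
import json
import urllib.request


def http_send(method, url, payload=None):
    """Send one WebDriver HTTP command and return the decoded JSON reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("Content-Type", "application/json; charset=utf-8")
    with urllib.request.urlopen(request, timeout=60) as response:
        body = response.read()
    return json.loads(body.decode("utf-8")) if body else {}


def fetch_rendered_html(hub_url, page_url, browser="firefox", send=http_send):
    """Open a browser session on the hub, load page_url, return its HTML.

    hub_url is the hub's WebDriver base URL (an assumption for this sketch,
    e.g. http://localhost:4444/wd/hub). send is injectable so the flow can
    be exercised without a running hub.
    """
    caps = {"browserName": browser}
    # Send both the legacy and the W3C capability shapes; a hub is expected
    # to use the one it understands and ignore the other.
    reply = send("POST", hub_url + "/session", {
        "desiredCapabilities": caps,
        "capabilities": {"alwaysMatch": caps},
    })
    # Legacy replies carry sessionId at the top level, W3C ones under "value".
    value = reply.get("value")
    session_id = reply.get("sessionId")
    if not session_id and isinstance(value, dict):
        session_id = value.get("sessionId")
    if not session_id:
        raise RuntimeError("hub did not return a session id")

    session_url = hub_url + "/session/" + session_id
    try:
        # Navigate, then ask for the serialized source of the whole page.
        send("POST", session_url + "/url", {"url": page_url})
        return send("GET", session_url + "/source")["value"]
    finally:
        # Always release the browser so the node does not accumulate
        # abandoned sessions when a fetch fails.
        send("DELETE", session_url)
```

With a hub running, a call would look like
fetch_rendered_html("http://localhost:4444/wd/hub", "http://example.com/"),
where the hub address is again an assumption. Closing the session in a finally
block is the part that matters most here: it is one simple way to avoid the
leftover browser processes mentioned earlier in the thread.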
