Re: Nutch use a Browser or phantomjs as fetcher

Patrick Kirsch Tue, 10 Jun 2014 02:23:32 -0700

Hey,
On 06/10/2014 10:52 AM, Julien Nioche wrote:

Hi


You can do that as a custom protocol implementation. The fetcher code would
stay the same, but the byte content returned for a given URL would be
produced by phantomjs or whichever Selenium backend you'd like to use.
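
As a rough illustration of the idea above (not code from this thread), the byte content for a URL could be obtained by running phantomjs as an external process and capturing its standard output. This is a minimal sketch under stated assumptions: the class name RenderedPageFetcher, its method names, and the render script path are hypothetical, and the script is assumed to print the rendered HTML to stdout. The Nutch-specific wiring of a protocol plugin is deliberately left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: obtains page bytes from an external renderer.
final class RenderedPageFetcher {

    // Builds the command line "phantomjs <script> <url>".
    // The script is an assumption: it must load the URL and print the rendered HTML to stdout.
    static List<String> buildCommand(String phantomBinary, String renderScript, String url) {
        return Arrays.asList(phantomBinary, renderScript, url);
    }

    // Runs the command and returns everything it wrote to stdout as bytes.
    static byte[] fetchRendered(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Pass stderr through so a chatty renderer cannot block on a full pipe.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }

        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("renderer exited with status " + exit);
        }
        return out.toByteArray();
    }
}
```

A custom protocol implementation would then hand these bytes back in place of the raw HTTP response body, for example via fetchRendered(buildCommand("phantomjs", "render.js", url)), where "render.js" is a placeholder name.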

Do you have a documentation or wiki link, or an example to start from?

Currently I have implemented it in
src/java/org/apache/nutch/fetcher/Fetcher.java
as a hook that is triggered if the content contains "html" and "head" in the
first 500 characters.
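
For illustration, the check described above could look roughly like the following. This is a minimal sketch, not the actual change made to Fetcher.java: the class name HtmlSniffer and the method looksLikeHtml are hypothetical, the case-insensitive match is an assumption, and the first 500 bytes are treated as the first 500 characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: decides whether fetched content looks like HTML.
final class HtmlSniffer {

    // Number of leading bytes to inspect, taken from the description above.
    static final int PREFIX_LENGTH = 500;

    static boolean looksLikeHtml(byte[] content) {
        if (content == null) {
            return false;
        }
        int length = Math.min(content.length, PREFIX_LENGTH);
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail
        // on arbitrary binary input and 500 bytes yield 500 characters.
        String prefix = new String(content, 0, length, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT); // assumption: match case-insensitively
        return prefix.contains("html") && prefix.contains("head");
    }
}
```

Only content that passes this check would be handed to the browser-based renderer; everything else would take the normal fetch path.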

Regards,
 Patrick


HTH

Julien


On 7 June 2014 11:35, remi tassing <[email protected]> wrote:

I'm currently looking at those separately but an integrated option would be
more efficient.

Looking forward to any shared experience.


On Sat, Jun 7, 2014 at 6:25 PM, Patrick Kirsch <[email protected]> wrote:

Hey list,
  I'm sure this question has been asked several times, but a quick look in the
Nutch user archive did not help, so:

Does anyone have documentation on, or experience with, using a browser (like
Chromium) or phantomjs etc. for fetching web pages?

Because the site relies heavily on JavaScript, Nutch needs to see the fully
rendered page.

Second question: would it be better to implement this as a plugin or natively
in the fetcher class?

Regards,
  Patrick
