Re: Nutch use a Browser or phantomjs as fetcher

Julien Nioche Tue, 10 Jun 2014 14:25:43 -0700

Hi Patrick

You could look at the protocol-http plugin as an example.
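
As a rough illustration of the idea discussed further down in this thread (the byte content for a URL comes from an external renderer instead of a plain HTTP client), here is a minimal sketch. It is not taken from Nutch: the class name RenderedPageFetcher, the render.js script that would print the rendered DOM to stdout, and the timeout value are all hypothetical. The Nutch-specific wiring (the protocol plugin class and its plugin descriptor) is deliberately left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.TimeUnit;

/** Sketch only: obtains the rendered bytes for a URL from an external renderer. */
final class RenderedPageFetcher {

    // Hypothetical command prefix, e.g. ["phantomjs", "render.js"]; the URL is appended.
    private final List<String> commandPrefix;
    private final long timeoutSeconds;

    RenderedPageFetcher(List<String> commandPrefix, long timeoutSeconds) {
        this.commandPrefix = new ArrayList<>(commandPrefix);
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Full argument list: the configured prefix followed by the URL. */
    List<String> buildCommand(String url) {
        List<String> command = new ArrayList<>(commandPrefix);
        command.add(url);
        return command;
    }

    /** Runs the renderer and returns whatever it wrote to stdout. */
    byte[] fetchRendered(String url) throws IOException, InterruptedException {
        // Writing stdout to a temp file keeps the timeout effective even if the renderer hangs.
        Path out = Files.createTempFile("rendered-", ".html");
        try {
            Process process = new ProcessBuilder(buildCommand(url))
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("Renderer timed out for " + url);
            }
            if (process.exitValue() != 0) {
                throw new IOException("Renderer exited with code " + process.exitValue());
            }
            return Files.readAllBytes(out);
        } finally {
            Files.deleteIfExists(out);
        }
    }

    /** Heuristic mentioned in the thread: "html" and "head" near the start of the content. */
    static boolean looksLikeHtml(byte[] content) {
        // ISO-8859-1 maps one byte to one char, so this inspects the first 500 bytes.
        int length = Math.min(content.length, 500);
        String start = new String(content, 0, length, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        return start.contains("html") && start.contains("head");
    }
}
```

A protocol plugin could call fetchRendered(url) where protocol-http performs its HTTP request and hand the returned bytes back as the page content, so the fetcher itself would not need to change. The looksLikeHtml check is only one possible way to decide which responses are worth rendering.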


Julien


On 10 June 2014 10:22, Patrick Kirsch <[email protected]> wrote:

> Hey,
>
> On 06/10/2014 10:52 AM, Julien Nioche wrote:
>
>> Hi
>>
>> You can do that as a custom protocol implementation. The fetcher code
>> would stay the same, but the byte content returned for a given URL would
>> be produced by phantomjs or whichever Selenium backend you'd like to use.
>>
> Do you have a documentation or wiki link, or an example to start from?
>
> Currently I have implemented it in
> src/java/org/apache/nutch/fetcher/Fetcher.java
> as a hook that runs if the content contains "html" and "head" in the
> first 500 characters.
>
> Regards,
>  Patrick
>
>
>> HTH
>>
>> Julien
>>
>>
>> On 7 June 2014 11:35, remi tassing <[email protected]> wrote:
>>
>>> I'm currently looking at those separately, but an integrated option would
>>> be more efficient.
>>>
>>> Looking forward to hearing about any experience you can share.
>>>
>>>
>>> On Sat, Jun 7, 2014 at 6:25 PM, Patrick Kirsch <[email protected]> wrote:
>>>
>>>> Hey list,
>>>>   I'm sure this question has been asked several times, but a quick look
>>>> in the nutch user archive did not help, so:
>>>>
>>>> Does anyone have documentation on, or has anyone tried, using a browser
>>>> (like chromium) or phantomjs etc. for fetching web pages?
>>>>
>>>> Because the site relies heavily on JavaScript, nutch needs to see the
>>>> fully rendered page.
>>>>
>>>> Second question: would it be better to implement it as a plugin or
>>>> natively in the fetcher class?
>>>>
>>>> Regards,
>>>>   Patrick
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
