Have you tried using the protocol-selenium plugin? I've had luck using it to
fetch pages with dynamically loaded content.

https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-selenium
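
In case it helps, here is a rough sketch of how the plugin might be enabled
in conf/nutch-site.xml. It assumes your Nutch build actually ships
protocol-selenium (worth checking for 1.9). The value is only the usual
default plugin list with protocol-http swapped for protocol-selenium, so
treat it as an assumption and adjust it to match your own setup:

    <!-- Assumed plugin list: default set, with protocol-http replaced -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>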


-- Jimmy

On Fri, Jun 5, 2015 at 4:16 AM, Imtiaz Shakil Siddique <
[email protected]> wrote:

> Hi,
>
> I am using apache-nutch-1.9. My configuration ignores external links.
>
> I have some URLs in my seed file. The problem is that the Nutch crawler doesn't
> find the links in those pages because the site populates its content using AJAX
> calls. I've removed all possible regex filters inside the conf folder of Nutch.
>
> How can I collect those links? Any advice?
> Thanks in advance.
>
