I am crawling a number of sites that load part of their page content
dynamically using Ajax and JSON.  Nutch does not find the links that are
embedded in the JSON response, because it does not run the page's
JavaScript and so the Ajax request is never made.

 

I have been thinking about retrieving the JSON by specifying the URI in the
Nutch seed list and then building a configurable custom parser to extract
the links from the JSON that is returned.
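
To make the idea concrete, here is a rough sketch of the extraction step I
have in mind.  It is plain Python rather than a Nutch plugin, purely to
illustrate the logic; the function name extract_links, the keys parameter
and the sample JSON are all made up for this example and do not come from
Nutch or Tika.

```python
import json
from urllib.parse import urljoin


def extract_links(json_text, base_url, keys=None):
    """Collect string values in a JSON document that look like links.

    keys is the (hypothetical) configurable part: a set of field names to
    inspect. None means every string value is considered.
    """
    links = []

    def walk(node, key=None):
        if isinstance(node, dict):
            for child_key, value in node.items():
                walk(value, child_key)
        elif isinstance(node, list):
            # List items inherit the field name of the enclosing list
            for item in node:
                walk(item, key)
        elif isinstance(node, str):
            if keys is not None and key not in keys:
                return
            # Assumption: a link is an absolute http(s) URL or a path
            # starting with "/"; anything else is ignored
            if node.startswith(("http://", "https://", "/")):
                absolute = urljoin(base_url, node)
                if absolute not in links:
                    links.append(absolute)

    walk(json.loads(json_text))
    return links


sample = (
    '{"items": [{"title": "A", "url": "/page/1"},'
    ' {"title": "B", "url": "http://example.com/page/2"}],'
    ' "next": "/api/list?page=2"}'
)
print(extract_links(sample, "http://example.com/api/list", keys={"url", "next"}))
```

The field names to inspect would come from configuration, so the same
parser could be reused across sites whose JSON is laid out differently.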

 

A few questions:

 

1. I have looked at the Nutch mailing lists and Tika and do not see any
plugins to do this. This surprised me, as I would have thought it would
be a common requirement.  Am I missing it somewhere?

 

2. Has anyone built anything similar that they would be willing to share?

 

3. If I need to build it, do you have any advice or tips on an effective
approach, or on potential issues, before I go ahead?  Would you add the
capability to Tika or to Nutch, for example?

 

Thanks
