I am crawling a number of sites that load part of their page content dynamically using Ajax and JSON. Nutch does not find the links embedded in the JSON responses, because it never executes the Ajax requests that fetch them.
I have been thinking about retrieving the JSON by specifying its URI in the Nutch seed list and then building a configurable custom parser to extract the links from the JSON that is returned.

A few questions:
1. I have looked through the Nutch mailing lists and Tika and do not see any plugins that do this. This surprised me, as I would have thought it would be a common requirement. Am I missing something?
2. Has anyone built anything similar that they would be willing to share?
3. If I need to build it, do you have any advice or tips on an effective approach, or on potential issues, before I go ahead? For example, would you add the capability to Tika or to Nutch?

Thanks
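
P.S. To make the idea concrete, here is a rough Python sketch of the link-extraction step I have in mind. It is not Nutch or Tika code (a real Nutch plugin would be written in Java); it only illustrates the logic. The names `extract_links` and `link_keys`, the sample JSON, and the example.com URLs are all made up for illustration.

```python
import json
from urllib.parse import urljoin


def extract_links(json_text, base_url, link_keys=None):
    """Collect candidate links from the string values of a JSON document.

    link_keys is a hypothetical configuration option: when given, only
    values stored under those keys are treated as links (and resolved
    against base_url); when None, every string value that starts with
    http:// or https:// is kept as-is.
    """
    links = []

    def visit(node, key=None):
        if isinstance(node, dict):
            for child_key, child in node.items():
                visit(child, child_key)
        elif isinstance(node, list):
            for child in node:
                visit(child, key)  # list items inherit the key of the list
        elif isinstance(node, str):
            if link_keys is not None:
                if key in link_keys:
                    links.append(urljoin(base_url, node))
            elif node.startswith(("http://", "https://")):
                links.append(node)

    visit(json.loads(json_text))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(links))


if __name__ == "__main__":
    sample = (
        '{"items": [{"title": "A", "url": "/item/1"},'
        ' {"title": "B", "url": "http://example.com/item/2"}],'
        ' "next": "http://example.com/api/items?page=2"}'
    )
    for link in extract_links(sample, "http://example.com/api/items",
                              link_keys={"url", "next"}):
        print(link)
```

The "configurable" part would be the set of keys: each site returns differently shaped JSON, so the keys to treat as links would have to come from per-site configuration, with the catch-all mode (no keys given) as a fallback that only picks up absolute URLs.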

