You would have to create a custom parser plugin (not a parse filter plugin).
Nutch will then invoke your parser for the JSON MIME type. See the current
(broken) JavaScript parser for an example of how to build and configure such a
parser. You must also enable the parser for the MIME type in parse-plugins.xml.
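
As a rough illustration of the link extraction such a parser would need, here
is a minimal sketch that pulls absolute http(s) URLs out of a raw JSON string
with a regular expression. The class and method names (JsonLinkSketch,
extractLinks) are hypothetical and not part of Nutch. The sketch assumes the
links are absolute and stored as plain string values, and it uses only the JDK.
In a real plugin you would call something like this from your parser's
getParse method and turn the results into outlinks.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects absolute http(s) URLs
// that appear anywhere in a JSON document.
class JsonLinkSketch {

    // Matches http:// or https:// up to the next double quote, whitespace, or backslash.
    private static final Pattern URL =
        Pattern.compile("https?://[^\"\\s\\\\]+");

    static List<String> extractLinks(String json) {
        // JSON may escape "/" as "\/", so undo that before matching.
        String text = json.replace("\\/", "/");
        // LinkedHashSet keeps first-seen order and drops duplicates.
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            seen.add(m.group());
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String json = "{\"items\":[{\"href\":\"http:\\/\\/example.com\\/a\"},"
                    + "{\"href\":\"https://example.com/b\"}]}";
        System.out.println(extractLinks(json));
    }
}
```

This deliberately does not handle relative URLs or pick out specific JSON
fields. If you need either, a proper JSON library plus a configurable list of
field names would probably be the better route.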
-----Original message-----
> From: Iain Lopata <[email protected]>
> Sent: Monday 25th November 2013 15:08
> To: [email protected]
> Subject: Parsing JSON response
>
> I am crawling a number of sites that load part of their page content
> dynamically using Ajax and JSON. Obviously, Nutch does not find the links
> that are embedded in the JSON response, since the request never executes.
>
>
>
> I have been thinking about retrieving the JSON by specifying the URI in the
> Nutch seed list and then building a configurable custom parser to extract
> the links from the JSON that is returned.
>
>
>
> A few questions:
>
>
>
> 1. I have looked at the Nutch mailing lists and Tika and do not see
> any plugins to do this. This surprised me, as I would have thought it would
> be a common requirement. Am I missing it somewhere?
>
>
>
> 2. Has anyone built anything similar that they would be willing to
> share?
>
>
>
> 3. If I need to build it, do you have any advice/tips on an effective
> approach or potential issues before I go ahead? Would you add the
> capability to Tika or to Nutch, for example?
>
>
>
> Thanks
>
>