You would have to create a custom parser plugin (not a parse filter plugin). 
Nutch will then invoke your parser for the JSON MIME type. See the current 
(broken) JavaScript parser for an example of how to build and configure such a 
parser. You must also enable the parser for the MIME type in parse-plugins.xml. 
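
As a rough illustration of the link-extraction step such a parser would need, here is a minimal sketch in plain Java. It is not Nutch code: the class name JsonLinkExtractor and the method extractLinks are hypothetical, and it assumes the links appear in the JSON as absolute http(s) URLs held in string values. It scans string literals with a regular expression instead of using a real JSON library, and it only undoes the "\/" escape, so treat it as a starting point.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects absolute URLs found in
// the string literals of a JSON document.
final class JsonLinkExtractor {

    // Matches one JSON string literal and captures its raw (still escaped)
    // content. Keys are matched too; they are filtered out below unless
    // they happen to look like URLs.
    private static final Pattern JSON_STRING =
        Pattern.compile("\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"");

    private JsonLinkExtractor() {
    }

    // Returns the distinct http/https URLs in order of first appearance.
    static List<String> extractLinks(String json) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = JSON_STRING.matcher(json);
        while (m.find()) {
            // Assumption: only the "\/" escape is undone here; other JSON
            // escapes (for example \u0026) are left as they are.
            String value = m.group(1).replace("\\/", "/");
            if (value.startsWith("http://") || value.startsWith("https://")) {
                links.add(value);
            }
        }
        return new ArrayList<>(links);
    }
}
```

In an actual plugin, logic like this would presumably live inside the parser class, with each extracted URL reported as an outlink of the parsed document. Relative URLs would additionally have to be resolved against the URL of the fetched JSON, which this sketch does not attempt.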
 
-----Original message-----
> From:Iain Lopata <[email protected]>
> Sent: Monday 25th November 2013 15:08
> To: [email protected]
> Subject: Parsing JSON response
> 
> I am crawling a number of sites that load part of their page content
> dynamically using Ajax and JSON.  Obviously, Nutch does not find the links
> that are embedded in the JSON response, since the request never executes.  
> 
>  
> 
> I have been thinking about retrieving the JSON by specifying the URI in the
> Nutch seed list and then building a configurable custom parser to extract
> the links from the JSON that is returned.
> 
>  
> 
> A few questions:
> 
>  
> 
> 1.       I have looked at the Nutch mailing lists and Tika and do not see
> any plugins to do this. This surprised me, as I would have thought it would
> be a common requirement.  Am I missing it somewhere?
> 
>  
> 
> 2.       Has anyone built anything similar that they would be willing to
> share?
> 
>  
> 
> 3.       If I need to build it, do you have any advice/tips on an effective
> approach or potential issues before I go ahead?  Would you add the
> capability to Tika or in Nutch, for example?
> 
>  
> 
> Thanks
> 
> 
