Hello - you can write an HTML parse filter plugin to extract the things you need 
from the source HTML, and add everything you extract to the parse metadata. You 
must then build an indexing filter that selects the key/value pairs you 
added earlier to the parse metadata; each selected key/value pair must be added 
as a field to the NutchDocument that is emitted by the indexing filter.
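As a rough sketch of that selection step, here plain maps stand in for the parse metadata and the NutchDocument (the real plugin implements org.apache.nutch.indexer.IndexingFilter and calls NutchDocument.add(); the class and key names below are purely illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexingFilterSketch {

    // Select the keys your parse filter added to the parse metadata and copy
    // them into document fields, mirroring doc.add(key, value) on NutchDocument.
    static Map<String, String> toDocumentFields(Map<String, String> parseMeta,
                                                List<String> wantedKeys) {
        Map<String, String> doc = new HashMap<>();
        for (String key : wantedKeys) {
            String value = parseMeta.get(key);
            if (value != null) {
                doc.put(key, value); // in Nutch: doc.add(key, value)
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // Hypothetical metadata written earlier by the HTML parse filter.
        Map<String, String> parseMeta = new HashMap<>();
        parseMeta.put("article.title", "Example title");
        parseMeta.put("article.author", "Jane Doe");
        parseMeta.put("internal.debug", "not for indexing");

        Map<String, String> doc =
            toDocumentFields(parseMeta, List.of("article.title", "article.author"));
        System.out.println(doc.size()); // only the selected keys become fields
    }
}
```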

Finally, you must build an indexing backend plugin; this is where you write the 
JSON. The indexing backend plugin receives the NutchDocument you built in your 
indexing filter, and you use its fields to build your JSON document.
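A minimal sketch of that last step, again with a plain map standing in for the NutchDocument's fields (the real backend implements Nutch's IndexWriter interface; in practice you would use a JSON library such as Jackson, and upload the result to S3 with the AWS SDK rather than print it):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JsonWriterSketch {

    // Serialize the document's fields as a flat JSON object.
    // A real indexing backend would then upload this string to S3.
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(",");
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append("}").toString();
    }

    // Minimal escaping (backslashes and quotes only); a JSON library
    // handles the full spec, including control characters.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Hypothetical fields added by the indexing filter.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", "https://example.com/page");
        fields.put("title", "Example title");
        System.out.println(toJson(fields));
        // {"url":"https://example.com/page","title":"Example title"}
    }
}
```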

Regards,
Markus
 
 
-----Original message-----
> From:Srinivasan Ramaswamy <[email protected]>
> Sent: Monday 13th March 2017 20:26
> To: [email protected]
> Subject: extract elements from each url as json and write it to s3
> 
> Hi nutch-users,
> 
> I would like to write a nutch plugin to parse each url and extract
> different elements from the page (using something like jsoup parser) and
> construct a json and write it to s3 (I am running my nutch cluster in AWS).
> I am curious to know whether there is any existing plugin that can do some
> of the work for me.
> 
> I do see an example of how to write a parser plugin over at
> https://wiki.apache.org/nutch/WritingPluginExample-1.2
> I am curious to hear from people who have tried a similar use case, to
> learn from others experience.
> 
> Thanks
> Srini
> 
