Hello - you can write an HTML parse filter plugin to extract the things you need from the source HTML, and add everything you extract to the parse metadata. You must then build an indexing filter that selects the key/value pairs you added earlier to the parse metadata; each key/value pair must be added as a field to the NutchDocument that is emitted by the indexing filter.
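To make the data flow between the two plugins concrete, here is a rough, simplified sketch. The Map-based "metadata" and "document" below are stand-ins for Nutch's Metadata and NutchDocument classes, and the regex-based title extraction is purely illustrative (in a real HtmlParseFilter you would use a proper HTML parser such as jsoup); only the plumbing between the parse filter and the indexing filter is the point here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-ins for Nutch's Metadata and NutchDocument,
// used only to illustrate the parse-filter -> indexing-filter flow.
public class ParseToIndexFlow {

    // Step 1: the HTML parse filter extracts values from the raw HTML
    // and stores them in the parse metadata. In a real plugin this
    // happens inside HtmlParseFilter.filter(...).
    static void parseFilter(String html, Map<String, String> parseMeta) {
        // Illustrative extraction only: pull the <title> with a regex.
        // A real plugin would use a real HTML parser instead.
        Matcher m = Pattern.compile("<title>(.*?)</title>").matcher(html);
        if (m.find()) {
            parseMeta.put("page.title", m.group(1));
        }
    }

    // Step 2: the indexing filter selects parse-metadata entries and
    // adds them as fields on the emitted document. In a real plugin
    // this happens inside IndexingFilter.filter(...), which returns
    // the NutchDocument.
    static Map<String, String> indexingFilter(Map<String, String> parseMeta) {
        Map<String, String> doc = new HashMap<>();
        if (parseMeta.containsKey("page.title")) {
            doc.put("title", parseMeta.get("page.title"));
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        parseFilter("<html><title>Hello</title></html>", parseMeta);
        Map<String, String> doc = indexingFilter(parseMeta);
        System.out.println(doc.get("title")); // prints "Hello"
    }
}
```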
Finally, you must build an indexing backend plugin; this is where you write the JSON. The indexing backend plugin receives the NutchDocument you built in your indexing filter. Use its fields to build your JSON document.

Regards,
Markus

-----Original message-----
> From: Srinivasan Ramaswamy <[email protected]>
> Sent: Monday 13th March 2017 20:26
> To: [email protected]
> Subject: extract elements from each url as json and write it to s3
>
> Hi nutch-users,
>
> I would like to write a nutch plugin to parse each url and extract
> different elements from the page (using something like jsoup parser) and
> construct a json and write it to s3 (I am running my nutch cluster in AWS).
> I am curious to know whether there is any existing plugin that can do some
> of the work for me.
>
> I do see an example of how to write a parser plugin over at
> https://wiki.apache.org/nutch/WritingPluginExample-1.2
> I am curious to hear from people who have tried a similar use case, to
> learn from others' experience.
>
> Thanks
> Srini
>
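P.S. For the last step above (the indexing backend that writes JSON), here is a minimal JDK-only sketch of turning the document's fields into a JSON string. The flat Map is again a stand-in for NutchDocument's fields; a real IndexWriter plugin would more likely use a JSON library such as Jackson, and would then upload the result to S3 (e.g. via the AWS SDK) inside its write(...) method.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the indexing-backend step: serialize document fields to JSON.
// Uses only the JDK; escaping here covers just backslashes and quotes,
// which is enough for the illustration but not a complete JSON encoder.
public class JsonBackendSketch {

    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Turn a flat field map (stand-in for NutchDocument fields) into JSON.
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append("\"").append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://example.com/");
        doc.put("title", "Hello");
        System.out.println(toJson(doc));
        // prints {"url":"http://example.com/","title":"Hello"}
    }
}
```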

