Hi Yossi and Jorge,

Thanks for your detailed answers and guidance! I will look into the materials immediately. We have started to use Nutch 1.X extensively, and we would definitely contribute improvements back to the main code base if possible.

Zoltán

On 2017-07-27 13:15:24, Jorge Betancourt <[email protected]> wrote:

Hi Zoltán,

You can take a look at [1], where you can find some documentation. Although it says it was last updated for version 1.8, we do not change the extension points that often. You can also take a look at the code [2] of the plugin subsystem. It is true that the documentation is not ideal, but looking at the code and the tests can provide a really good overview.

You didn't mention which version of Nutch you are using. Depending on what you're trying to do, you'll need an HtmlParseFilter (which lets you extract information from the parsed HTML) and/or an IndexingFilter, which lets you customize the information before the document is sent to Solr/ES (this is probably what you want).

I wrote a post about the IndexingFilter using a practical case some time ago [3]. It doesn't go too deep, but it could help. If you want to look at the code, check [4], which is the version that was merged into Nutch master.

We always welcome new contributions: you could help improve the existing documentation or add new documentation for the parts that are less documented.

[1] https://wiki.apache.org/nutch/AboutPlugins
[2] https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin
[3] https://jorgelbg.wordpress.com/2014/08/30/indexing-inlinks-and-outlinks-with-nutch-1-x/
[4] https://github.com/apache/nutch/tree/master/src/plugin/mimetype-filter

On Thu, Jul 27, 2017 at 11:30 AM Yossi Tamari wrote:

> Hi Zoltan,
>
> I think what you want is an HtmlParseFilter:
> https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/HtmlParseFilter.html
>
> I recommend you read https://florianhartl.com/nutch-plugin-tutorial.html,
> and take a look at one of the included HtmlParseFilters, e.g.
> parsefilter-regex.
>
> If you have more specific questions, I may be able to help.
>
> Yossi.
>
> -----Original Message-----
> From: Zoltán Zvara [mailto:[email protected]]
> Sent: 26 July 2017 20:18
> To: [email protected]
> Subject: After Parse extension point
>
> Dear Community,
>
> I am looking for the extension point that executes after parse and
> before update.
> Moreover, I would be happy to read further on how extension points are
> built up (and in which order). My first impression of Nutch is that it
> is highly under-documented, or that the existing documentation is
> outdated. I would be pleased to look into the details of how the plugin
> system works, and how extension points are controlled and run by Nutch.
>
> Best,
> Zoltán
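
For readers of this thread who want a feel for how a chain of filters such as the IndexingFilter plugins mentioned above behaves, here is a minimal, self-contained sketch. It deliberately does not use the Nutch classes: DocFilter, runChain and exampleChain are hypothetical stand-ins, the document is modelled as a plain map, and the convention that returning null drops the document is an assumption made for illustration. Check the real HtmlParseFilter and IndexingFilter interfaces in the Javadoc before writing a plugin.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilterChainSketch {

    /** Hypothetical stand-in for a filter extension; NOT the Nutch interface. */
    interface DocFilter {
        /** Returns the (possibly modified) document, or null to drop it (assumed convention). */
        Map<String, String> filter(Map<String, String> doc);
    }

    /** Runs the filters in list order and stops as soon as one drops the document. */
    static Map<String, String> runChain(List<DocFilter> filters, Map<String, String> doc) {
        Map<String, String> current = doc;
        for (DocFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null; // a filter rejected the document
            }
        }
        return current;
    }

    /** Builds an example chain: keep only HTML documents, then add a derived field. */
    static List<DocFilter> exampleChain() {
        List<DocFilter> filters = new ArrayList<>();
        filters.add(doc -> "text/html".equals(doc.get("type")) ? doc : null);
        filters.add(doc -> {
            String title = doc.getOrDefault("title", "");
            doc.put("titleLength", String.valueOf(title.length()));
            return doc;
        });
        return filters;
    }

    public static void main(String[] args) {
        Map<String, String> page = new LinkedHashMap<>();
        page.put("type", "text/html");
        page.put("title", "Apache Nutch");
        System.out.println(runChain(exampleChain(), page));

        Map<String, String> pdf = new LinkedHashMap<>();
        pdf.put("type", "application/pdf");
        System.out.println(runChain(exampleChain(), pdf));
    }
}
```

The point of the sketch is only the shape of the mechanism: each filter sees the output of the previous one, so the order of the list matters, and a filter placed early can prevent later ones from running at all.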

