Hi Yossi and Jorge,

Thanks for your detailed answers and guidance! I will look into the materials 
immediately.
We have started to use Nutch 1.X extensively, and we would definitely contribute 
improvements back to the main code base if possible.

Zoltán
On 2017-07-27 13:15:24, Jorge Betancourt <[email protected]> wrote:
Hi Zoltán,

You can take a look at [1], where you will find some documentation.
Although it says it was last updated for version 1.8, we do not change the
extension points that often. You can also take a look at the code [2]
related to the plugin subsystem. It is true that the documentation is not
ideal, but looking at the code and at the tests can provide a really good
overview.

You didn't mention which version of Nutch you are using. Depending on what
you're trying to do, you'll need an HtmlParseFilter (which lets you
extract information from the parsed HTML) and/or an IndexingFilter, which
lets you customize the information before the document is sent to
Solr/ES (this is probably what you want).
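
To make the difference between the two filter types concrete, here is a minimal, self-contained sketch of where each one sits in the pipeline. Note that `ParseFilter`, `IndexFilter`, `process` and `demo` are hypothetical, simplified stand-ins invented for this illustration. They are not the real Nutch interfaces, whose `filter` methods take Nutch-specific types, so check the Javadoc for the actual signatures.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model only: these interfaces are hypothetical stand-ins,
// NOT the real org.apache.nutch HtmlParseFilter / IndexingFilter APIs.
class FilterPipelineSketch {

    /** Parse-time hook: sees the raw HTML and may add metadata to the parse. */
    interface ParseFilter {
        void filter(String html, Map<String, String> parseMeta);
    }

    /** Index-time hook: shapes the outgoing document; in this sketch, null drops it. */
    interface IndexFilter {
        Map<String, String> filter(Map<String, String> doc, Map<String, String> parseMeta);
    }

    /** Naive title extraction, good enough for the illustration. */
    static String extractTitle(String html) {
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start < 0 || end < start) {
            return "";
        }
        return html.substring(start + "<title>".length(), end).trim();
    }

    /** Runs every parse filter first, then every index filter, in list order. */
    static Map<String, String> process(String html,
                                       List<ParseFilter> parseFilters,
                                       List<IndexFilter> indexFilters) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        for (ParseFilter f : parseFilters) {
            f.filter(html, parseMeta);
        }
        Map<String, String> doc = new LinkedHashMap<>();
        for (IndexFilter f : indexFilters) {
            doc = f.filter(doc, parseMeta);
            if (doc == null) {
                return null; // a filter rejected the document
            }
        }
        return doc;
    }

    /** Wires one filter of each kind together for a single page. */
    static Map<String, String> demo(String html) {
        ParseFilter titleExtractor = (h, meta) -> meta.put("title", extractTitle(h));
        IndexFilter titleField = (doc, meta) -> {
            String title = meta.getOrDefault("title", "");
            if (title.isEmpty()) {
                return null; // skip pages without a title
            }
            doc.put("title", title);
            return doc;
        };
        return process(html, List.of(titleExtractor), List.of(titleField));
    }

    public static void main(String[] args) {
        System.out.println(demo("<html><title> After Parse </title></html>"));
        System.out.println(demo("<html><body>no title</body></html>"));
    }
}
```

The point of the split is that the parse-time filter works on the page content, while the index-time filter only decides what ends up in the document sent to the search backend.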

I wrote a post about the IndexingFilter using a practical case some time
ago [3]. It doesn't go too deep, but it could help. If you want to look
at the code, check [4], which is the version that was merged into Nutch
master.

We always welcome new contributions: you could help improve the existing
documentation or add new documentation for the parts that are less
documented.

[1] https://wiki.apache.org/nutch/AboutPlugins
[2] https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin
[3] https://jorgelbg.wordpress.com/2014/08/30/indexing-inlinks-and-outlinks-with-nutch-1-x/
[4] https://github.com/apache/nutch/tree/master/src/plugin/mimetype-filter


On Thu, Jul 27, 2017 at 11:30 AM Yossi Tamari wrote:

> Hi Zoltan,
>
> I think what you want is an HtmlParseFilter:
> https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/HtmlParseFilter.html
> I recommend you read https://florianhartl.com/nutch-plugin-tutorial.html,
> and take a look at one of the included HtmlParseFilters, e.g.
> parsefilter-regex.
>
> If you have more specific questions, I may be able to help.
>
> Yossi.
>
> -----Original Message-----
> From: Zoltán Zvara [mailto:[email protected]]
> Sent: 26 July 2017 20:18
> To: [email protected]
> Subject: After Parse extension point
>
> Dear Community,
>
> I am looking for the extension point which executes after parse and before
> update.
> Moreover, I would be happy to read further on how extension points are
> built up (in which order). My first impression of Nutch is that it is
> highly under-documented, or that the existing documentation is outdated.
> I would be pleased to look into the details of how the plugin system works,
> and how extension points are controlled and run by Nutch.
>
> Best,
> Zoltán
>
>
