Dear list,
I'm new to Nutch development and have a few questions.

1 - Are there any general guidelines for plugin development (in addition to
the ones specified in the wiki guide)?
I looked around GitHub and it seems that many plugins are developed as a
monolithic piece of code that is attached to / forked from the main Nutch
repo. I take it that, ideally, plugins should be developed in their own
separate repositories, so they can be versioned and tested against
different versions of Nutch. Is there a recommended way to do this? I'm
considering using git submodules to add plugin repos as Nutch dependencies,
or else creating symlinks from the plugins folder to the right plugin
repositories.

2 - As a specific use case for point (1), I have developed a plugin that
reads some Machine Learning models from a directory. Ideally, I'd like to
keep the model files in the same repository as the plugin, so that the
plugin can be tested, versioned and developed as an independent repo.
At the moment, I can only make it work by specifying the path to these
models in nutch-site.xml, but I wonder whether that directory could be
made accessible to the plugin in some other way (either by some classes in
the plugin system or by ivy/ant). Any thoughts?
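
To make the question concrete, here is a minimal sketch of the kind of
lookup I have in mind: use the configured path if one is set, otherwise
fall back to a "models" directory next to the plugin. Everything named here
is an assumption for illustration: the property name "myplugin.model.dir",
the class name, and the "models" folder are hypothetical, and
java.util.Properties only stands in for the Hadoop Configuration object
that Nutch passes to plugins.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class ModelDirResolver {

    // Hypothetical property name; not an existing Nutch setting.
    static final String MODEL_DIR_KEY = "myplugin.model.dir";

    /**
     * Returns the configured model directory if the property is set and
     * non-blank; otherwise falls back to "models" under the plugin's own
     * folder. The result is always absolute and normalized.
     */
    static Path resolve(Properties conf, Path pluginHome) {
        String configured = conf.getProperty(MODEL_DIR_KEY);
        if (configured != null && !configured.trim().isEmpty()) {
            return Paths.get(configured.trim()).toAbsolutePath().normalize();
        }
        return pluginHome.resolve("models").toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        Path pluginHome = Paths.get("plugins", "myplugin");

        // No property set: falls back to <pluginHome>/models.
        System.out.println(resolve(conf, pluginHome));

        // Property set: the explicit path wins.
        conf.setProperty(MODEL_DIR_KEY, "/opt/models");
        System.out.println(resolve(conf, pluginHome));
    }
}
```

What I don't know is how the plugin would learn its own folder (pluginHome
above) at runtime, which is really what I'm asking.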

3 - Is there any tooling developed by the community to deploy and monitor
Nutch applications? At the moment, we have a script that deploys Nutch, but
it is not robust enough. I see that there's a Dockerfile. I'm just wondering
if it could be used (possibly together with some other tooling) to provision
a Hadoop cluster on top of which the app runs. Another tool to run the
crawling steps (fetch, parse, index) and provide some form of monitoring
would be great. I hear that this is somehow present in Nutch 2, but I am
more interested in Nutch 1 (since v2 is not production ready yet, is it?).
I was wondering if there are any community recipes for
Chef/Puppet/Ansible/Salt or some work using Kubernetes or Mesos. If anyone
has experience with this and could give me some pointers, I would greatly
appreciate it.

4 - At the moment we crawl some websites and extract some metadata from
them, but we don't need to make the results available in a search server
like Solr or Elasticsearch. Is there any queue- or streaming-based plugin
for Nutch, so that 'indexing' can be regarded as sending to a queue? I know
that Nutch 2 has Gora as an abstraction layer, so maybe this could be a
Gora plugin, but I'm mainly interested in something for Nutch 1 (or else
good reasons for moving to Nutch 2).

All the best,
Thiago Galery
