Dear list, I'm a new Nutch Developer and I have a few questions to ask you.

1 - Are there any general guidelines for plugin development (in addition to the ones specified in the wiki guide)? I looked around GitHub and it seems that many plugins are developed as a monolithic piece of code that is attached to, or forked from, the main Nutch repo. I take it that, ideally, plugins should be developed in their own separate repositories, so they can be versioned and tested against different versions of Nutch. Is there a recommended way to do this? I'm considering using git submodules to add plugin repos as Nutch dependencies, or else creating symlinks from the plugins folder to the right plugin repositories.

2 - As a specific use case for point (1), I have developed a plugin that reads some Machine Learning models from a directory. Ideally, I'd like to keep the model files in the same repository as the plugin, so that the plugin can be tested, versioned and developed as an independent repo. At the moment, the only way I can make it work is by specifying the path to these models in nutch-site.xml, but I wonder whether that directory could be made accessible to the plugin in some other way (either through some classes in the plugin system or through ivy/ant). Any thoughts?

3 - Is there any tooling developed by the community to deploy and monitor Nutch applications? At the moment, we have a script that deploys Nutch, but it is not robust enough. I see that there's a Dockerfile. I'm wondering if it could be used (possibly together with some other tooling) to provision a Hadoop cluster for the app to run on. Another tool to run the crawling steps (fetch, parse, index) and provide some form of monitoring would be great. I hear that something like this is present in Nutch 2, but I'm more interested in Nutch 1 (since v2 is not production ready yet, is it?). I was also wondering if there are any community recipes for Chef/Puppet/Ansible/Salt, or any work using Kubernetes or Mesos. If anyone has experience with this and could give me some pointers, I would greatly appreciate it.
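
To make point 2 concrete, the current workaround amounts to an entry along these lines in nutch-site.xml. This is only a sketch: the property name `myplugin.models.dir` and the path are placeholders made up for illustration, not the real values.

```xml
<configuration>
  <!-- Hypothetical property name and path, shown for illustration only -->
  <property>
    <name>myplugin.models.dir</name>
    <value>/path/to/plugin-repo/models</value>
    <description>Directory the plugin loads its Machine Learning models from.</description>
  </property>
</configuration>
```

The drawback is that the path lives in the Nutch configuration rather than with the plugin, which is what I would like to avoid.
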

4 - At the moment we crawl some websites and extract metadata from them, but we don't need to make the results available in a search server like Solr or ElasticSearch. Is there any queue-based or streaming-based plugin for Nutch, so that 'indexing' can be regarded as sending to a queue? I know that Nutch 2 has Gora as an abstraction layer, so maybe this could be a Gora plugin, but I'm mainly interested in something for Nutch 1 (or else good reasons for moving to Nutch 2).

All the best,
Thiago Galery

