Hi Chris, thanks for the response. Below are some follow-ups to my initial questions, based on your reply.
On Wed, Apr 6, 2016 at 2:12 PM, Mattmann, Chris A (3980) < [email protected]> wrote:

> Hi Thiago,
>
> Welcome!
>
> First thing to check out:
>
> http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
>
> I would follow that by checking out info on how to use our
> Source Code repo:
>
> http://wiki.apache.org/nutch/UsingGit
>
> OK now on to your specific questions:
>
> On 4/6/16, 8:48 AM, "Thiago Galery" <[email protected]> wrote:
>
> >Dear list,
> >I'm a new Nutch Developer and I have a few questions to ask you.
> >
> >1 - Are there any general guidelines for plugin development (in addition to
> >the ones specified in the wiki guide).
> >I looked around github and it seems that many plugins are developed as a
> >monolithic piece of code that is attached to / forked from the main Nutch
> >repo. I take it that, ideally, plugins should be developed as their own
> >separate repositories, so they can be versioned and tested against
> >different versions of Nutch. Is there a recommended way to do this ? I'm
> >considering using git submodules to add plugin repos as Nutch dependencies
> >or else crating symlinks from the plugins folder to the right plugin
> >repositories.
>
> I would recommend plugin develop to be done against the master branch of
> nutch, which you can find a cloned copy of here:
>
> http://github.com/apache/nutch/tree/master
>
> You can follow this process to submit pull requests to add plugins:
>
> http://github.com/apache/nutch/#contributing
>
> >2 - As a specific use case for point (1), I have developed a plugin that
> >reads some Machine Learning models from a directory. Ideally, I'd like to
> >leave the files in the same repository as the plugin, and leave it in a way
> >so that it can be tested, versioned and developed as an independent repo.
>
> Use a nutch property defined in either $NUTCH/conf/nutch-{default|site}.xml
> Then read the property in your plugin via
> NutchConfiguration.create().get(“name”)
>
> If the property references a model file, add a property that lists
> (relatively) the file path, and then read the property assuming that your
> Nutch *.job or jar code depending on whether you are running on Hadoop or
> locally has access to $NUTCH/conf

Could you elaborate on this a bit more? At the moment I'm specifying the full
path of the models. This works well in local mode, but might cause problems
when running on a Hadoop cluster. I understand that the path should be
specified relatively, but I'm not sure relative to what. That is, if the job
file has access to the conf folder, should I put the models inside conf and
just add the property models.folder = conf/models? I imagine another option is
to use an HDFS URL for the models location. Would that work?

> >At the moment, I can just make it work by specifying the path to these
> >models in nutch-site.xml, but I wonder whether that directory could be
> >accessible by the plugin in some other way (either by some classes in the
> >Plugin system or by ivy/ant). Any thoughts ?
>
> See above.
>
> >3 - Is there any tooling developed by the community to deploy and monitor
> >Nutch applications ? At the moment, we have a scrip that deploys Nutch but
> >is not robust enough. I see that there's a dockefile. I'm just wondering if
> >it could be used (possibly together with some other tooling) to provision a
> >hadoop cluster which the app runs on top. Another tool to run the crawling
> >steps (fetch, parse, index) and provide some form of monitoring would be
> >great.
>
> We have been working on a project called Memex Explorer:
> http://github.com/memex-explorer/memex-explorer

Memex Explorer seems really interesting! However, I had some issues (tests
not passing, redis not running, some screens unavailable).
On the GitHub page, it says that the project is not maintained. I'd be happy
to fix bugs and contribute, but if the project is going to be dropped, I'd be
less inclined to do so. Does anyone know what the plans for Memex Explorer are?

> that provides these types of capabilities. Have a look.
>
> >I hear that this is somehow present in Nutch 2, but I was more
> >interested in Nutch 1 (since v2 is not production ready yet, is it?). I was
> >wondering if there are any community recipes for Chef/Puppet/Ansible/Salt
> >or some work using Kubernates or Mesos. If anyone has experience with this
> >and could give me some pointers, I would greatly appreciate it.
>
> FYI above.
>
> >4 - At the moment we collect some websites which we extract some metadata
> >from, but we don't need to make the results available in a search server
> >like Solr or ElasticSearch. Is there any queue or streaming based plugin
> >for Nutch, so 'indexing' can be regarding as sending to a queue ? I know
> >that Nutch 2 has Gora as an abstraction layer, so maybe this could be a
> >gora plug-in, but I'm mainly interested in something for Nutch 1 (or else
> >good reasons for moving to Nutch 2).
>
> Lots of people are interested in this and there is Storm Crawler
> that sort of does this, which involves some of the Nutch PMC and
> committers.
>
> Within Nutch there is also work done by my USC masters student and
> Nutch PMC member and committer Sujen Shah where he added a publisher
> using ActiveMQ Artemis that publishes Nutch events so we can display
> what’s up in D3 and JSON. You can see the work here, I intend to commit
> it soon:
>
> https://issues.apache.org/jira/browse/NUTCH-2132
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
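
To make the question on point 2 concrete, here is a minimal sketch of the path handling being asked about. It uses no Nutch or Hadoop API; it only shows one possible rule for interpreting the value of the models.folder property: a value with a scheme (such as an HDFS URL) is left untouched, an absolute path is kept, and a relative path is resolved against a base directory. The class name ModelLocation, the method resolve, the base directory /opt/nutch/conf and the namenode host are all hypothetical and chosen for illustration. Whether the job file really exposes conf/ this way on a cluster is exactly the open question.

```java
import java.net.URI;
import java.nio.file.Paths;

/**
 * Illustrative only: one possible way to interpret a configured models
 * location. Nothing here is part of Nutch or Hadoop.
 */
public final class ModelLocation {

    private ModelLocation() {}

    /**
     * @param configured value of the property, e.g. "models" or an hdfs:// URL
     * @param baseDir    assumed stand-in for the directory relative paths
     *                   are resolved against (for example $NUTCH/conf)
     * @return the location to load the models from, as a string
     */
    public static String resolve(String configured, String baseDir) {
        URI uri = URI.create(configured);
        if (uri.getScheme() != null) {
            // A scheme such as hdfs:// or file:// is present: keep as-is.
            return configured;
        }
        // Path.resolve returns an absolute argument unchanged and appends
        // a relative one to baseDir; normalize() removes "." and "..".
        return Paths.get(baseDir).resolve(configured).normalize().toString();
    }

    public static void main(String[] args) {
        String base = "/opt/nutch/conf"; // hypothetical install location
        System.out.println(resolve("models", base));
        System.out.println(resolve("/data/models", base));
        System.out.println(resolve("hdfs://namenode:8020/nutch/models", base));
    }
}
```

In a real plugin the configured string would come from the configuration object mentioned above rather than from a literal, and an HDFS location would still need to be opened through Hadoop's file system classes instead of java.nio.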

