On Tue, Aug 10, 2010 at 7:44 AM, Alex McLintock <[email protected]> wrote:
> On 10 August 2010 12:32, Arthur Pemberton <[email protected]> wrote:
>> I'm trying to use Nutch to build a niche search engine, and I would
>> like to have full control over URLs. I would like to precisely control
>> which URL get crawled, followed, stored and indexed. Is it possible to
>> do this as a plug-in? What and where should I be reading to do this?
>
> Hello Arthur,
>
> Yes you can do this, but it would require you to learn about the
> plugin system - remove the filter plugins you don't want, and add in
> one that you write which implements the algorithm you want.
>
> Plugins are simple Java classes which implement one of several
> abstract classes - ie comply to the Nutch Plugin API. The best way of
> understanding them is to look at the existing plugin code. There is a
> little in the wiki - but could be more.

I assume the plugin API is properly documented, though I haven't looked
yet. I was waiting for some direction before heading off in any one
direction. Are there any recommended instructions, at least for setting
up a dev environment with one of the popular free Java IDEs?

> You need to specify which plugins are used in config files, and if
> using Hadoop, you may need to do some fancy stuff to make sure they
> are deployed properly. (Sometimes you need to rebuild Nutch in order
> to get it to use plugins. Or so I am told).
>
> I've been slowly learning about plugins and can maybe help you
> off-list if you like. I too have been interested in niche search
> engines. I'm also investigating OpenBixo which is a web mining toolkit
> inspired by Nutch. Your desire for total control may steer you that
> way.

I'll take a look at OpenBixo. I may take you up on that offer of
assistance if my own efforts fail. Thank you.

-- 
Fedora 13 (www.pembo13.com)
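
For reference, the kind of URL filter discussed above might look roughly like the sketch below. It is not taken from the Nutch sources. The `UrlFilter` interface is a local stand-in that mirrors the contract of Nutch's `org.apache.nutch.net.URLFilter` as I understand it (return the URL to keep it, `null` to drop it), and the class name `HostWhitelistFilter` and the allowed-host list are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

/**
 * Local stand-in (assumption) for the URL filter contract:
 * return the URL to keep it, or null to reject it.
 */
interface UrlFilter {
    String filter(String urlString);
}

/**
 * Hypothetical filter for a niche crawl: keeps only http(s) URLs
 * whose host appears in an explicit set of allowed hosts.
 */
class HostWhitelistFilter implements UrlFilter {
    private final Set<String> allowedHosts;

    HostWhitelistFilter(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    @Override
    public String filter(String urlString) {
        try {
            URI uri = new URI(urlString);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return null; // relative or malformed reference: reject
            }
            boolean isWeb = scheme.equalsIgnoreCase("http")
                    || scheme.equalsIgnoreCase("https");
            boolean allowed = allowedHosts.contains(host.toLowerCase(Locale.ROOT));
            return (isWeb && allowed) ? urlString : null;
        } catch (URISyntaxException e) {
            return null; // unparseable input: reject
        }
    }
}
```

In an actual Nutch setup the class would implement the real `URLFilter` interface (which also carries the Hadoop `Configurable` methods), be packaged as a plugin with its own descriptor, and be switched on through the `plugin.includes` property in the Nutch configuration. The details of that packaging are best checked against the existing `urlfilter-*` plugins.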

