On Tue, Aug 10, 2010 at 7:44 AM, Alex McLintock
<[email protected]> wrote:
> On 10 August 2010 12:32, Arthur Pemberton <[email protected]> wrote:
>> I'm trying to use Nutch to build a niche search engine, and I would
>> like to have full control over URLs. I would like to precisely control
>> which URLs get crawled, followed, stored and indexed. Is it possible to
>> do this as a plug-in? What and where should I be reading to do this?
>
> Hello Arthur,
>
> Yes you can do this, but it would require you to learn about the
> plugin system - remove the filter plugins you don't want, and add in
> one that you write which implements the algorithm you want.
>
> Plugins are simple Java classes which implement one of several
> abstract classes - i.e. comply with the Nutch Plugin API. The best way of
> understanding them is to look at the existing plugin code. There is a
> little in the wiki - but could be more.
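
For illustration, here is a minimal sketch of the kind of URL filter described above. Nutch's URL filter extension point (org.apache.nutch.net.URLFilter) exposes a filter(String) method that returns the URL to keep it or null to reject it. Everything else here is an assumption for the example: the UrlFilter interface is a local stand-in so the code compiles without Nutch on the classpath, and NicheUrlFilter and its host whitelist are hypothetical names. A real plugin would also need the Nutch interfaces, a plugin.xml descriptor, and an entry in the plugin configuration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Locale;

// Local stand-in (assumption) mirroring the filter(String) contract of
// Nutch's org.apache.nutch.net.URLFilter: return the URL to accept it,
// or null to reject it.
interface UrlFilter {
    String filter(String urlString);
}

// Hypothetical filter that only accepts URLs on a whitelist of hosts
// (including their subdomains).
class NicheUrlFilter implements UrlFilter {
    private final List<String> allowedHosts;

    NicheUrlFilter(List<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    @Override
    public String filter(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            if (host == null) {
                return null; // no host component: reject
            }
            host = host.toLowerCase(Locale.ROOT);
            for (String allowed : allowedHosts) {
                // Accept an exact match or any subdomain of an allowed host.
                if (host.equals(allowed) || host.endsWith("." + allowed)) {
                    return urlString;
                }
            }
            return null; // host is not on the whitelist: reject
        } catch (URISyntaxException e) {
            return null; // malformed URL: reject
        }
    }
}
```

Returning null for anything unexpected keeps the crawl conservative: a URL is only followed when it positively matches the whitelist.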

I assume the plugin API is properly documented, though I haven't looked yet.
I was waiting for some direction before I went in any one way first.

Are there any recommended instructions at least for setting up a dev
environment with one of the popular free Java IDEs?

> You need to specify which plugins are used in config files, and if
> using Hadoop, you may need to do some fancy stuff to make sure they
> are deployed properly. (Sometimes you need to rebuild Nutch in order
> to get it to use plugins. Or so I am told).
>
> I've been slowly learning about plugins and can maybe help you
> off-list if you like. I too have been interested in niche search
> engines. I'm also investigating OpenBixo which is a web mining toolkit
> inspired by Nutch. Your desire for total control may steer you that
> way.

I'll take a look at OpenBixo. I may take you up on that offer of
assistance if my own efforts fail.

Thank you.

-- 
Fedora 13
(www.pembo13.com)
