Hi Johannes,

On Tue, Nov 6, 2012 at 8:42 AM, Johannes Dorn <[email protected]> wrote:

> I want to implement an alternative crawler that doesn't crawl over html files 
> but uses a proprietary index to identify targets to download.

Could you expand a bit on what you mean by a "proprietary index" and
what exactly you plan to do with it? Does it contain URLs that you
wish to bootstrap a crawler with? Please tell us.
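
If the index does hold URLs, one common way to bootstrap is a seed
list: a plain-text file with one URL per line, which the Nutch inject
step reads from a seed directory. A minimal sketch of producing such a
file, assuming (purely for illustration) that the index can be exported
as CSV with a column named `url`; the CSV layout, the column name and
the function names below are all hypothetical, not anything from your
setup or from Nutch itself:

```python
import csv
import io
from urllib.parse import urlparse


def urls_from_index(index_file, column="url"):
    """Yield unique http(s) URLs from a CSV index, in first-seen order.

    The CSV layout and the column name are assumptions for this sketch.
    """
    seen = set()
    for row in csv.DictReader(index_file):
        url = (row.get(column) or "").strip()
        parsed = urlparse(url)
        # Keep only absolute http/https URLs that have not been seen yet.
        if parsed.scheme in ("http", "https") and parsed.netloc and url not in seen:
            seen.add(url)
            yield url


def write_seed_file(urls, seed_file):
    """Write one URL per line and return how many were written."""
    count = 0
    for url in urls:
        seed_file.write(url + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical index export; a real one would be opened from disk.
    index = io.StringIO(
        "id,url\n"
        "1,http://example.com/a.pdf\n"
        "2,not-a-url\n"
        "3,http://example.com/a.pdf\n"
    )
    seeds = io.StringIO()
    write_seed_file(urls_from_index(index), seeds)
    print(seeds.getvalue(), end="")
```

Whether this fits depends on what the index really contains, hence the
questions above.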

> The apparent reliance of Nutch 2.1 on hbase is a bit daunting as i don't want 
> to use too many new technologies.

Nutch 2.x DOES NOT rely on HBase. You have a number of options for
storing your host/webpage crawl database. For a more in-depth overview,
either see the ivy/ivy.xml file in the 2.x branch [0] (take a look at
the commented-out gora dependencies) or head over to the current gora
site documentation [1], which gives a good account of the possible
host/webpage storage options.

> Is Nutch 1.5.1 a better fit for me or should i look for something entirely 
> different?

1.5.1 (and 1.x trunk) relies on HDFS for storing and accessing the
crawl material mentioned above. However, without knowing more about
your requirements, it would be foolish to prescribe one or the
other...

hth

Lewis

[0] http://svn.apache.org/repos/asf/nutch/branches/2.x/ivy/ivy.xml
[1] http://gora.apache.org/docs/current/index.html

-- 
Lewis
