Hi Johannes,

On Tue, Nov 6, 2012 at 8:42 AM, Johannes Dorn <[email protected]> wrote:
> I want to implement an alternative crawler that doesn't crawl over html files
> but uses a proprietary index to identify targets to download.

I think you would need to expand a bit on what you mean by a "proprietary
index" and what exactly you are going to do with it. Does it contain URLs
that you wish to bootstrap a crawler with? Please tell us.

> The apparent reliance of Nutch 2.1 on hbase is a bit daunting as i don't want
> to use too many new technologies.

Nutch 2.x DOES NOT rely on HBase. You have a number of options for storing
your host/webpage crawl database. For a more in-depth overview, either see
the ivy/ivy.xml file in the 2.x branch [0] (take a look at the commented-out
Gora dependencies) or head over to the current Gora site documentation [1],
which gives a good account of the possible host/webpage storage options.

> Is Nutch 1.5.1 a better fit for me or should i look for something entirely
> different?

1.5.1 (and 1.x trunk) relies on HDFS for storing and accessing the crawl
material. However, without knowing anything more about your requirements, it
would be foolish to prescribe you one or the other...

hth

Lewis

[0] http://svn.apache.org/repos/asf/nutch/branches/2.x/ivy/ivy.xml
[1] http://gora.apache.org/docs/current/index.html

--
Lewis

