Hi Lewis, thank you for your reply.
I can't go into too much detail about the index, but I will try to be more specific. Basically, I have a text file with a bunch of names. These names represent download targets, but they are not URLs. I can already handle the download part for a single name. I want to use Nutch to handle the queuing of the downloads across multiple machines, including backing off after being blocked for too much access.

Thanks,
Johannes

On 06.11.2012 at 10:05, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Johannes,
>
> On Tue, Nov 6, 2012 at 8:42 AM, Johannes Dorn <[email protected]> wrote:
>
>> I want to implement an alternative crawler that doesn't crawl over HTML
>> files but uses a proprietary index to identify targets to download.
>
> I think you would need to expand a bit on your use of a "proprietary
> index", and what exactly you are going to do with this index. Do you
> have URLs in there that you wish to bootstrap a crawler with? Please
> tell us.
>
>> The apparent reliance of Nutch 2.1 on HBase is a bit daunting, as I don't
>> want to use too many new technologies.
>
> Nutch 2.x DOES NOT rely on HBase. You have a number of options for
> storing your host/webpage crawl database. For a more in-depth overview,
> please either see the ivy/ivy.xml file in the 2.x branch [0] (take a
> look at the commented-out Gora dependencies), or alternatively head
> over to the current Gora site documentation [1]; this will give you an
> ideal account of the possible host/webpage storage options.
>
>> Is Nutch 1.5.1 a better fit for me, or should I look for something entirely
>> different?
>
> 1.5.1 (and 1.x trunk) relies on HDFS for storing and accessing the
> crawl material, as mentioned above. However, without knowing anything
> more about your requirements, it would be foolish to prescribe you one
> or the other...
>
> HTH,
>
> Lewis
>
> [0] http://svn.apache.org/repos/asf/nutch/branches/2.x/ivy/ivy.xml
> [1] http://gora.apache.org/docs/current/index.html
>
> --
> Lewis
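P.S. In case it helps to make the idea concrete: one way to feed a plain list of names into Nutch's injector would be to map each name onto a pseudo-URL under a custom scheme. The sketch below does only that file conversion; the `download://` scheme and the `names.txt`/`seeds.txt` file names are assumptions for illustration, and Nutch would additionally need a protocol plugin registered for whatever scheme is chosen.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class SeedListBuilder {

    // Map one target name onto a pseudo-URL under a hypothetical
    // "download://" scheme so that Nutch's injector can accept it.
    static String toSeedUrl(String name) {
        return "download://" + name.trim();
    }

    public static void main(String[] args) throws IOException {
        // Read the plain-text name list (one target name per line).
        Path names = Path.of(args.length > 0 ? args[0] : "names.txt");
        List<String> seeds = Files.readAllLines(names, StandardCharsets.UTF_8)
                .stream()
                .filter(line -> !line.isBlank())
                .map(SeedListBuilder::toSeedUrl)
                .collect(Collectors.toList());
        // Write the seed list; "bin/nutch inject" consumes a directory of
        // such URL files to bootstrap the crawl database.
        Files.write(Path.of("seeds.txt"), seeds, StandardCharsets.UTF_8);
    }
}
```

The resulting seeds.txt is the kind of input `bin/nutch inject` expects, after which Nutch's generate/fetch cycle would handle the per-host queuing and politeness delays.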

