Hi Lewis,

Thank you for your reply.

I can't go into too much detail about the index, though I will try to be more 
specific.

Basically, I have a text file with a list of names. These names represent 
download targets, but they are not URLs. I can already handle the download 
part for a single name. 
I want to use Nutch to queue the downloads across multiple machines, 
including backing off after being blocked for too many requests.
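
To make the back-off part concrete, here is a minimal sketch of the behaviour I 
have in mind for a single name. It is plain Python and not Nutch code; the 
Blocked exception, the fetch callable, and the delay values are all assumptions 
made up for illustration.

```python
import time


class Blocked(Exception):
    """Hypothetical signal that the server refused us for too much access."""


def backoff_delay(attempt, base=1.0, cap=300.0):
    """Seconds to wait before the retry that follows attempt `attempt` (0-based).

    The delay doubles with each attempt and never exceeds `cap`.
    """
    return min(cap, base * (2 ** attempt))


def download_with_backoff(name, fetch, max_attempts=5, sleep=time.sleep):
    """Call fetch(name); when it raises Blocked, wait and try again.

    `fetch` stands in for my existing single-name download routine.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(name)
        except Blocked:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(backoff_delay(attempt))
```

What I am missing is the part around this: distributing the names over several 
machines and keeping track of which ones are done.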

Thanks,
Johannes




On 06.11.2012 at 10:05, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Johannes,
> 
> On Tue, Nov 6, 2012 at 8:42 AM, Johannes Dorn <[email protected]> wrote:
> 
>> I want to implement an alternative crawler that doesn't crawl over HTML 
>> files but uses a proprietary index to identify targets to download.
> 
> I think you would need to expand a bit on your use of a "proprietary
> index", and what exactly you are going to do with this index? Do you
> have URLs in there that you wish to bootstrap a crawler with? Please
> tell us.
> 
>> The apparent reliance of Nutch 2.1 on HBase is a bit daunting as I don't 
>> want to use too many new technologies.
> 
> Nutch 2.x DOES NOT rely on HBase. You have a number of options for
> storing your host/webpage crawl database. For a more in-depth overview,
> please either see the ivy/ivy.xml file in the 2.x branch [0] (take a
> look at the commented-out gora dependencies), or alternatively head
> over to the current gora site documentation [1], which gives a good
> account of the possible host/webpage storage options.
> 
>> Is Nutch 1.5.1 a better fit for me or should I look for something entirely 
>> different?
> 
> 1.5.1 (and 1.x trunk) relies on HDFS for storing and
> accessing the crawl material, as mentioned above. However, without
> knowing anything more about your requirements, it would be foolish to
> prescribe one or the other...
> 
> hth
> 
> Lewis
> 
> [0] http://svn.apache.org/repos/asf/nutch/branches/2.x/ivy/ivy.xml
> [1] http://gora.apache.org/docs/current/index.html
> 
> -- 
> Lewis
> 
