Hello Oli - please see inline. Markus
-----Original message-----
> From: Oli Lalonde <[email protected]>
> Sent: Saturday 21st January 2017 2:53
> To: [email protected]
> Subject: Not a distributed crawler?
>
> Hi everyone,
>
> Nutch/Hadoop newbie here.
>
> I'm looking for a general web crawler to essentially compile a list of
> image URLs (the src attribute in <img> tags) found on the web. The found
> image URLs should be fed to a queue or stored in a database for further
> processing.
>
> Initially, I thought I'd build a crawler from scratch, which seemed
> relatively easy, but then you have to deal with spider traps, crawl
> frequency, politeness, scaling out, etc., which Nutch seems to largely
> solve already.
>
> 1) Am I right to think that Nutch could be extended to achieve this?

Yes, Nutch is a very mature project and quite stable to use. We have used
it for many large-scale projects and are still using it to provide
services. Collecting image URLs is a good fit for a custom parse filter;
see the sketch at the bottom of this mail.

> 2) I was reading the FAQ and was a bit confused by this answer:
> https://wiki.apache.org/nutch/FAQ#Will_Nutch_use_a_distributed_crawler.2C_like_Grub.3F.
> Isn't crawling distributed when Nutch runs on Hadoop?

Yes! That FAQ entry is about Grub-style crawling, distributed over the
machines of volunteers; when Nutch runs on Hadoop, the crawl jobs are
distributed over the nodes of your own cluster.

> 3) As a follow-up question, how hard is the dependency on Hadoop? For
> example, would it be possible to set up Nutch on Kubernetes? I'm asking
> because I've heard maintaining a Hadoop cluster can be difficult and I
> already have access and experience with a Kubernetes cluster.

This is most likely not going to work, unless Kubernetes implements the
org.apache.hadoop.* APIs. Operating a Hadoop cluster can be tedious
indeed, but in our experience it rarely misbehaves.

> 4) Are some statistics available regarding URLs crawled per day, given X
> hardware? How about bandwidth use? I'm trying to get a general idea of
> cost and speed. Ideally, I'd like to be able to crawl around ~10 billion
> unique images.

Nutch has some database overhead to take into account, but it can average
thousands of URLs per second on a high-octane machine, and more with
additional machines. The cluster receiving the data usually has a tougher
job to do. If images are what you need, you will have to crawl a huge
number of URLs; Nutch can do this. The readdb tool (also shown at the
bottom of this mail) tells you how many URLs you have fetched so far.

> Thanks in advance!
>
> --
> - Oli
>
> Oli Lalonde
> http://www.syskall.com <-- connect with me!
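
Regarding 1), here is a rough sketch of what such a parse filter could
look like. It walks the DOM tree that Nutch's HTML parser has already
built and stores every <img> src it finds in the parse metadata. The
class name, the package and the "image.url" metadata key are my own
inventions, and the plugin descriptor (plugin.xml) wiring is omitted:

package org.example.nutch; // hypothetical package

import java.net.MalformedURLException;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ImageUrlParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    if (parse == null) {
      return parseResult;
    }
    collectImages(doc, content.getUrl(), parse);
    return parseResult;
  }

  // Recursively visit the DOM and record resolved <img> src URLs.
  private void collectImages(Node node, String base, Parse parse) {
    if ("img".equalsIgnoreCase(node.getNodeName())) {
      Node src = node.getAttributes().getNamedItem("src");
      if (src != null) {
        try {
          // Resolve relative src values against the page URL.
          String imageUrl =
              new URL(new URL(base), src.getNodeValue()).toString();
          parse.getData().getParseMeta().add("image.url", imageUrl);
        } catch (MalformedURLException e) {
          // Skip values that do not form a valid URL.
        }
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectImages(children.item(i), base, parse);
    }
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

From there, an indexing filter or a small job over the segments can push
the collected URLs to your queue or database.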
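
Regarding 2) and 4): if you run the crawl cycle from runtime/deploy
instead of runtime/local, each step below is submitted as a MapReduce job
to your Hadoop cluster, which is the distributed crawling you are asking
about. The paths and the <segment> name are examples:

bin/nutch inject crawl/crawldb urls/
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/<segment>
bin/nutch parse crawl/segments/<segment>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>

And to see how many URLs you have fetched so far:

bin/nutch readdb crawl/crawldb -stats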

