Hi everyone,

Nutch/Hadoop newbie here.

I'm looking for a general web crawler to essentially compile a list of
image URLs (the src attribute of <img> tags) found on the web. The
discovered image URLs would then be fed to a queue or stored in a database
for further processing.
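
To make the extraction step concrete, here's roughly what I have in mind
for each fetched page. This is just a standalone sketch using jsoup, not
tied to Nutch's plugin API, and the "push to a queue" part is only a
placeholder:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageUrlExtractor {

    /** Parse a fetched page and emit the absolute URL of every <img> src. */
    public static void extract(String html, String pageUrl) {
        // Pass the page URL so relative src values resolve to absolute URLs.
        Document doc = Jsoup.parse(html, pageUrl);
        for (Element img : doc.select("img[src]")) {
            String imageUrl = img.attr("abs:src");
            if (!imageUrl.isEmpty()) {
                // Placeholder: here I'd push to a queue or write to a database.
                System.out.println(imageUrl);
            }
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><img src=\"/logo.png\"></body></html>";
        extract(html, "http://example.com/");
    }
}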

Initially, I thought I'd build a crawler from scratch, which seemed
relatively easy, but then you have to deal with spider traps, crawl
frequency, politeness, scaling out, etc., all of which Nutch seems to
largely solve already.

1) Am I right to think that Nutch could be extended to achieve this?

2) I was reading the FAQ and was a bit confused by this answer:
https://wiki.apache.org/nutch/FAQ#Will_Nutch_use_a_distributed_crawler.2C_like_Grub.3F.
Isn't crawling distributed when Nutch runs on Hadoop?

3) As a follow-up question, how hard is the dependency on Hadoop? For
example, would it be possible to set up Nutch on Kubernetes? I'm asking
because I've heard maintaining a Hadoop cluster can be difficult, and I
already have access to, and experience with, a Kubernetes cluster.

4) Are any statistics available regarding URLs crawled per day, given X
hardware? How about bandwidth use? I'm trying to get a general idea of cost
and speed. Ideally, I'd like to be able to crawl around 10 billion unique
images.

Thanks in advance!

-- 
- Oli

Oli Lalonde
http://www.syskall.com <-- connect with me!
