Hello Oli - please see inline. Markus
-----Original message-----
> From: Oli Lalonde <[email protected]>
> Sent: Saturday 21st January 2017 2:53
> To: [email protected]
> Subject: Not a distributed crawler?
>
> Hi everyone,
>
> Nutch/Hadoop newbie here.
>
> I'm looking for a general web crawler to essentially compile a list of
> image URLs (the src attribute in <img> tags) found on the web. The found
> image URLs should be fed to a queue or stored in a database for further
> processing.
>
> Initially, I thought I'd build a crawler from scratch, which seemed
> relatively easy, but then you have to deal with spider traps, crawl
> frequency, politeness, scaling out, etc., which Nutch seems to largely
> solve already.
>
> 1) Am I right to think that Nutch could be extended to achieve this?

Yes, Nutch is a very mature project and quite stable to use. We have used
it for many large-scale projects and are still using it to provide
services. Collecting image URLs is a good fit for a custom parse filter;
see the sketch at the bottom of this mail.

> 2) I was reading the FAQ and was a bit confused by this answer:
> https://wiki.apache.org/nutch/FAQ#Will_Nutch_use_a_distributed_crawler.2C_like_Grub.3F.
> Isn't crawling distributed when Nutch runs on Hadoop?

Yes! That FAQ entry is about Grub-style crawling, distributed over the
machines of volunteers; when Nutch runs on Hadoop, the crawl jobs are
distributed over the nodes of your own cluster.

> 3) As a follow-up question, how hard is the dependency on Hadoop? For
> example, would it be possible to set up Nutch on Kubernetes? I'm asking
> because I've heard maintaining a Hadoop cluster can be difficult and I
> already have access and experience with a Kubernetes cluster.

This is most likely not going to work, unless Kubernetes implements the
org.apache.hadoop.* APIs. Operating a Hadoop cluster can be tedious
indeed, but in our experience it rarely misbehaves.

> 4) Are some statistics available regarding URLs crawled per day, given X
> hardware? How about bandwidth use? I'm trying to get a general idea of
> cost and speed. Ideally, I'd like to be able to crawl around ~10 billion
> unique images.

Nutch has some database overhead to take into account, but it can average
thousands of URLs per second on a high-octane machine, and more with
additional machines. The cluster receiving the data usually has a tougher
job to do. If images are what you need, you will have to crawl a huge
number of URLs; Nutch can do this. The readdb tool (also shown at the
bottom of this mail) tells you how many URLs you have fetched so far.

> Thanks in advance!
>
> --
> - Oli
>
> Oli Lalonde
> http://www.syskall.com <-- connect with me!
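
Regarding 1), here is a rough sketch of what such a parse filter could
look like. It walks the DOM tree that Nutch's HTML parser has already
built and stores every <img> src it finds in the parse metadata. The
class name, the package and the "image.url" metadata key are my own
inventions, and the plugin descriptor (plugin.xml) wiring is omitted:

package org.example.nutch; // hypothetical package

import java.net.MalformedURLException;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ImageUrlParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    if (parse == null) {
      return parseResult;
    }
    collectImages(doc, content.getUrl(), parse);
    return parseResult;
  }

  // Recursively visit the DOM and record resolved <img> src URLs.
  private void collectImages(Node node, String base, Parse parse) {
    if ("img".equalsIgnoreCase(node.getNodeName())) {
      Node src = node.getAttributes().getNamedItem("src");
      if (src != null) {
        try {
          // Resolve relative src values against the page URL.
          String imageUrl =
              new URL(new URL(base), src.getNodeValue()).toString();
          parse.getData().getParseMeta().add("image.url", imageUrl);
        } catch (MalformedURLException e) {
          // Skip values that do not form a valid URL.
        }
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectImages(children.item(i), base, parse);
    }
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

From there, an indexing filter or a small job over the segments can push
the collected URLs to your queue or database.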
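
Regarding 2) and 4): if you run the crawl cycle from runtime/deploy
instead of runtime/local, each step below is submitted as a MapReduce job
to your Hadoop cluster, which is the distributed crawling you are asking
about. The paths and the <segment> name are examples:

bin/nutch inject crawl/crawldb urls/
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/<segment>
bin/nutch parse crawl/segments/<segment>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>

And to see how many URLs you have fetched so far:

bin/nutch readdb crawl/crawldb -stats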

