Configuring Nutch in cluster mode

2023-01-14 Thread Mike
I will now try to configure the bot URL etc. before building,
but how and where do I change settings between crawls, e.g. the number
of pages per host?

Where do I configure Nutch in cluster mode?

Thanks, Mike


Re: Nutch/Hadoop Cluster

2023-01-14 Thread Markus Jelsma
Hello Mike,

> Would it pay off for me to put a Hadoop cluster on top of the 3 servers?

Yes, for all the reasons Hadoop exists. It can be tedious to set up
for the first time, and there are many components. But at least you have
three servers, which ZooKeeper more or less requires, and you will need
ZooKeeper as well.

Ideally you would have some additional VMs to run the controlling Hadoop
programs and perhaps the Hadoop client nodes on. The workers can run on
bare metal.

> 1.) a server would not be integrated directly into the crawl process as a
> master.

What do you mean? Can you elaborate?

> 2.) can I run multiple crawl jobs on one server?

Yes! Just keep a separate Nutch home dir per crawl on your Hadoop client
nodes, each with its own configuration.
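
A minimal sketch of that setup, in Python, under a few assumptions: the
directory names and the helper functions make_instance and write_nutch_site
are hypothetical, each Nutch home keeps its overrides in conf/nutch-site.xml,
and the two properties shown (http.agent.name for the bot name,
generate.max.count for the per-host page limit) are the ones you want to vary
per crawl. In cluster mode the configuration is, as far as I understand it,
packed into the job file at build time, so each copy would need its own build
after a change.

```python
import shutil
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def write_nutch_site(conf_dir: Path, properties: dict[str, str]) -> Path:
    """Write a Hadoop-style nutch-site.xml holding the given overrides."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    conf_dir.mkdir(parents=True, exist_ok=True)
    target = conf_dir / "nutch-site.xml"
    ET.ElementTree(root).write(str(target), encoding="utf-8", xml_declaration=True)
    return target


def make_instance(template_home: Path, instance_home: Path,
                  overrides: dict[str, str]) -> Path:
    """Copy a Nutch home dir and give the copy its own nutch-site.xml."""
    shutil.copytree(template_home, instance_home)
    return write_nutch_site(instance_home / "conf", overrides)


if __name__ == "__main__":
    # Demo in a scratch directory; a real template would be a Nutch checkout.
    with tempfile.TemporaryDirectory() as tmp:
        template = Path(tmp) / "nutch-template"  # hypothetical path
        (template / "conf").mkdir(parents=True)
        for crawl, max_per_host in [("crawl-a", "100"), ("crawl-b", "1000")]:
            site = make_instance(
                template,
                Path(tmp) / crawl,
                {
                    "http.agent.name": f"{crawl}-bot",   # assumed bot name
                    "generate.max.count": max_per_host,  # pages per host
                },
            )
            print(site.read_text(encoding="utf-8"))
```

Each copy can then be built and submitted on its own, so two crawls never
share a nutch-site.xml.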

Regards,
Markus

On Sat, 14 Jan 2023 at 18:42, Mike wrote:

> Hi!
>
> I am now crawling the internet in local mode in parallel with up to 10
> instances on 3 computers. Would it pay off for me to put a Hadoop cluster
> on top of the 3 servers?
>
> 1.) a server would not be integrated directly into the crawl process as a
> master.
> 2.) can I run multiple crawl jobs on one server?
>
> Thanks
>


Nutch/Hadoop Cluster

2023-01-14 Thread Mike
Hi!

I am now crawling the internet in local mode in parallel with up to 10
instances on 3 computers. Would it pay off for me to put a Hadoop cluster
on top of the 3 servers?

1.) a server would not be integrated directly into the crawl process as a
master.
2.) can I run multiple crawl jobs on one server?

Thanks