Hi Lewis,

As mentioned earlier, it does not matter how many mappers I assign to the fetch tasks. Since all the URLs belong to the same domain, everything is assigned to a single mapper and all the other mappers sit idle with nothing to execute. So I am looking for ways to crawl URLs from the same domain more quickly.
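To make the behaviour concrete: as far as I can tell from nutch-default.xml, fetch lists are partitioned and queued by host (or domain/IP), so with a single domain every grouping mode collapses to one partition and therefore one busy mapper. Roughly this, in nutch-site.xml terms (property names as I understand them from nutch-default.xml; treat the exact semantics as my assumption):

  <!-- Controls how URLs are split across fetch mappers. With every URL
       on one domain, byHost and byDomain both yield a single partition. -->
  <property>
    <name>partition.url.mode</name>
    <value>byHost</value> <!-- also accepts byDomain or byIP -->
  </property>
  <!-- Fetch queues inside a mapper are keyed the same way, so one domain
       also means one queue unless fetcher.threads.per.queue is raised. -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
  </property>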
Regards
Prateek

On Mon, May 10, 2021 at 1:02 AM Lewis John McGibbney <lewi...@apache.org> wrote:

> Hi Prateek,
> mapred.map.tasks --> mapreduce.job.maps
> mapred.reduce.tasks --> mapreduce.job.reduces
> You should be able to override these in nutch-site.xml and then publish to
> your Hadoop cluster.
> lewismc
>
> On 2021/05/07 15:18:38, prateek <prats86....@gmail.com> wrote:
> > Hi,
> >
> > I am trying to crawl URLs belonging to the same domain (around 140k), and
> > because all same-domain URLs go to the same mapper, only one mapper is
> > used for fetching. All the others are just a waste of resources. These
> > are the configurations I have tried so far, but fetching is still very
> > slow.
> >
> > Attempt 1 -
> > fetcher.threads.fetch : 10
> > fetcher.server.delay : 1
> > fetcher.threads.per.queue : 1
> > fetcher.server.min.delay : 0.0
> >
> > Attempt 2 -
> > fetcher.threads.fetch : 10
> > fetcher.server.delay : 1
> > fetcher.threads.per.queue : 3
> > fetcher.server.min.delay : 0.5
> >
> > Is there a way to distribute the same-domain URLs across all the
> > fetcher.threads.fetch? I understand that in this case the crawl delay
> > cannot be enforced across different mappers, but for my use case it is
> > okay to crawl aggressively. So any suggestions?
> >
> > Regards
> > Prateek
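P.S. For reference, here is Attempt 2 written out as nutch-site.xml properties (just a sketch of my own settings; values exactly as listed above):

  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value> <!-- total fetcher threads per mapper -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>1</value> <!-- seconds between requests to the same server -->
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>3</value> <!-- threads allowed on one host queue at a time -->
  </property>
  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.5</value> <!-- only applies when threads.per.queue > 1 -->
  </property>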