Thank you, Markus, for the prompt response. How can I run fetcher jobs simultaneously on Hadoop? What database do you recommend for Nutch 2.3.1?

Regards,
Vladimir.
-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: October-05-16 2:53 PM
To: [email protected]
Subject: RE: Nutch scalability

Where i wrote YJK, i of course meant CJK instead.

-----Original message-----
> From: Markus Jelsma <[email protected]>
> Sent: Wednesday 5th October 2016 20:34
> To: [email protected]
> Subject: RE: Nutch scalability
>
> Hello Vladimir - answers inline.
> Markus
>
> -----Original message-----
> > From: Vladimir Loubenski <[email protected]>
> > Sent: Wednesday 5th October 2016 20:09
> > To: [email protected]
> > Subject: Nutch scalability
> >
> > Hi,
> > I have a Nutch 2.3.1 installation with MongoDB.
> >
> > I want to understand what scalability options I have.
> >
> > 1. The number of threads during one job can be defined in nutch-site.xml:
> >    a. fetcher.threads.per.queue - the maximum number of threads that should be allowed to access a queue at one time.
> >    b. fetcher.threads.fetch - the number of FetcherThreads the fetcher should use.
> >    Do we have other scalability configuration parameters?
>
> Yes, run more fetcher jobs simultaneously on Hadoop.
>
> > 2. Ability to run the same job on different hosts.
> >    Is this supported by Nutch?
>
> Different hosts? You mean run the fetch job on different machines, as opposed to having Nutch crawl different hosts in one job? In the case of the former, see the previous answer: Hadoop.
>
> > 3. Ability to run jobs in parallel.
> >    Example: I run a "fetch" job. It produces new, not yet crawled URLs.
> >    Can I run another job to process these uncrawled URLs before the first job is done?
>
> You can run jobs in parallel, but i advise against running fetcher jobs in parallel. During generation, Nutch generates a list of URLs to fetch, then it runs the fetcher. If you start a new job right away, the generator will generate the same list of URLs to fetch unless some parameter is set.
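The two thread settings discussed above live in conf/nutch-site.xml. A minimal sketch with purely illustrative values (the property names are from the thread; sensible values depend on your crawl politeness requirements and hardware):

```xml
<!-- conf/nutch-site.xml - illustrative values only -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
    <description>Total number of FetcherThreads the fetcher job uses.</description>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
    <description>Maximum number of threads allowed to access a single
    queue (one host or domain) at one time.</description>
  </property>
</configuration>
```

Note that raising fetcher.threads.per.queue above 1 increases the request rate against individual hosts, so it trades crawl speed against politeness.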
> If that is supported in 2.x - 1.x supports it - i'd still not recommend it, because why would you? You can just generate a much larger list for the fetcher.
>
> > 4. Database scalability.
> >    Can I use multiple MongoDB instances for crawling?
>
> Probably not a good idea. I don't know about Mongo, but can it run as a cluster? If it can, then use a larger cluster, not multiple databases. By the way, Mongo seems to be a bad idea anyway, as i read today on this mailing list, because it does not support id's larger than 512 bytes, while URLs can easily consist of more than that. For example, YJK URLs in UTF-8 take up to 4 bytes per character, meaning a YJK URL of 128 characters can already hit the limit.
>
> > Thank you in advance,
> > Vladimir.
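The byte arithmetic behind the 512-byte id limit is easy to check. A small sketch (the 512-byte figure is the one cited above; note that the common CJK ideographs encode to 3 bytes each in UTF-8, with 4 bytes only for supplementary-plane characters, so the practical character limit is a little higher than 128):

```python
# Check whether a URL's UTF-8 encoding fits within the 512-byte
# id limit mentioned in the thread. URLs below are illustrative.
MAX_ID_BYTES = 512

def fits_as_id(url: str) -> bool:
    """True if the UTF-8 encoding of `url` is at most MAX_ID_BYTES bytes."""
    return len(url.encode("utf-8")) <= MAX_ID_BYTES

ascii_url = "http://example.com/" + "a" * 400      # ASCII: 1 byte per char
cjk_url = "http://example.com/" + "\u4e2d" * 170   # common CJK: 3 bytes per char

print(len(ascii_url.encode("utf-8")), fits_as_id(ascii_url))  # 419 True
print(len(cjk_url.encode("utf-8")), fits_as_id(cjk_url))      # 529 False
```

So a mostly-CJK URL overruns the limit after roughly 170 characters even in the 3-byte case, which supports Markus's point that such a database is a poor fit for URL keys.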

