Hello Vladimir,

We don't use Nutch 2.x, but if I did, I'd probably choose Apache HBase, or maybe even Apache Solr in our case. Here are some resources for Nutch 2.x on Hadoop:
http://wiki.apache.org/nutch/#Nutch_2.X_tutorial.28s.29
Markus

-----Original message-----
> From: Vladimir Loubenski <[email protected]>
> Sent: Wednesday 5th October 2016 21:37
> To: [email protected]
> Subject: RE: Nutch scalability
>
> Thank you, Markus, for the prompt response.
> How can I run fetcher jobs simultaneously on Hadoop?
> What database do you recommend for Nutch 2.3.1?
> Regards,
> Vladimir.
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: October-05-16 2:53 PM
> To: [email protected]
> Subject: RE: Nutch scalability
>
> Where I wrote YJK, I of course meant CJK instead.
>
> > -----Original message-----
> > From: Markus Jelsma <[email protected]>
> > Sent: Wednesday 5th October 2016 20:34
> > To: [email protected]
> > Subject: RE: Nutch scalability
> >
> > Hello Vladimir - answers inline.
> > Markus
> >
> > > -----Original message-----
> > > From: Vladimir Loubenski <[email protected]>
> > > Sent: Wednesday 5th October 2016 20:09
> > > To: [email protected]
> > > Subject: Nutch scalability
> > >
> > > Hi,
> > > I have a Nutch 2.3.1 installation with MongoDB.
> > >
> > > I want to understand what scalability options I have.
> > >
> > > 1. The number of threads during one job can be defined in nutch-site.xml:
> > >    a. fetcher.threads.per.queue - the maximum number of threads allowed to access a single queue at one time.
> > >    b. fetcher.threads.fetch - the number of FetcherThreads the fetcher should use.
> > >    Do we have other scalability configuration parameters?
> >
> > Yes, run more fetcher jobs simultaneously on Hadoop.
> >
> > > 2. Ability to run the same job on different hosts.
> > > Is that supported by Nutch?
> >
> > Different hosts? You mean running the fetch job on different machines, as opposed to having Nutch crawl different hosts in one job? In the case of the former, see the previous answer: Hadoop.
> >
> > > 3. Ability to run jobs in parallel.
> > > Example: I run a "fetch" job. It produces new, not yet crawled URLs.
> > > Can I run another job to process these uncrawled URLs before the first job is done?
> >
> > You can run jobs in parallel, but I advise against running fetcher jobs in parallel. During generation, Nutch generates a list of URLs to fetch, then runs the fetcher. If you start a new job right away, the generator will generate the same list of URLs unless a certain parameter is set. 1.x supports that; whether 2.x does, I don't know. But I would not recommend it anyway - why bother? You can just generate a much larger list for the fetcher.
> >
> > > 4. Database scalability.
> > > Can I use multiple MongoDB instances for crawling?
> >
> > Probably not a good idea. I don't know much about Mongo, but can it run as a cluster? If so, use a larger cluster, not multiple databases. By the way, Mongo seems to be a bad idea anyway: as I read today on this mailing list, it does not support ids larger than 512 bytes, while URLs can easily consist of more than that. For example, YJK URLs in UTF-8 take up to 4 bytes per character, meaning a YJK URL of 128 characters is the limit.
> >
> > > Thank you in advance,
> > > Vladimir.
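PS - for reference, the two fetcher settings from point 1 go into nutch-site.xml like this. The values are only illustrative; tune them for your hardware and for how polite you need to be per host:

```xml
<!-- nutch-site.xml: example values only, not recommendations -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
  <description>Total number of FetcherThreads a fetcher task should use.</description>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value>
  <description>Maximum number of threads fetching from one queue (host) at a time.</description>
</property>
```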

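On point 3: the generator parameter I alluded to is, in Nutch 1.x at least, generate.update.crawldb. When enabled, the generator marks generated URLs in the CrawlDb so that overlapping generate/fetch/update cycles produce disjoint fetchlists. I have not verified this exists in 2.x, so check the nutch-default.xml of your version first:

```xml
<!-- Nutch 1.x: let overlapping generate/fetch cycles produce different
     fetchlists. Not verified for 2.x. -->
<property>
  <name>generate.update.crawldb</name>
  <value>true</value>
</property>
```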

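And on the 512-byte id limit in point 4: common CJK characters occupy 3 bytes each in UTF-8 (rarer supplementary-plane ones take 4), so the limit is reached quickly. A quick sanity check in Python; the URL below is invented purely for illustration:

```python
# How fast does a CJK URL approach a 512-byte id limit?
# The 512-byte figure is the limit discussed in this thread; the URL is made up.
MAX_ID_BYTES = 512

ascii_url = "http://example.com/" + "a" * 200            # 219 ASCII characters
cjk_url = "http://example.com/" + "\u6f22\u5b57" * 100   # 219 chars, 200 of them CJK

print(len(ascii_url), len(ascii_url.encode("utf-8")))    # 219 chars -> 219 bytes, fits
print(len(cjk_url), len(cjk_url.encode("utf-8")))        # 219 chars -> 619 bytes
print(len(cjk_url.encode("utf-8")) > MAX_ID_BYTES)       # True: same length, over the limit
```

Same character count, roughly three times the bytes - which is why a URL that looks short can still blow past the id limit.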