Thank you, Markus, for the prompt response. How can I run fetcher jobs simultaneously on Hadoop? What database do you recommend for Nutch 2.3.1?

Regards,
Vladimir.
-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: October-05-16 2:53 PM
To: [email protected]
Subject: RE: Nutch scalability

Where i wrote YJK, i of course meant CJK instead.

-----Original message-----
> From: Markus Jelsma <[email protected]>
> Sent: Wednesday 5th October 2016 20:34
> To: [email protected]
> Subject: RE: Nutch scalability
>
> Hello Vladimir - answers inline.
> Markus
>
> -----Original message-----
> > From: Vladimir Loubenski <[email protected]>
> > Sent: Wednesday 5th October 2016 20:09
> > To: [email protected]
> > Subject: Nutch scalability
> >
> > Hi,
> > I have a Nutch 2.3.1 installation with MongoDB.
> >
> > I want to understand what scalability options I have.
> >
> > 1. The number of threads during one job can be defined in nutch-site.xml:
> >    a. fetcher.threads.per.queue - the maximum number of threads that should be allowed to access a queue at one time.
> >    b. fetcher.threads.fetch - the number of FetcherThreads the fetcher should use.
> >    Do we have other scalability configuration parameters?
>
> Yes, run more fetcher jobs simultaneously on Hadoop.
>
> > 2. Ability to run the same job on different hosts.
> >    Is this supported by Nutch?
>
> Different hosts? You mean run the fetch job on different machines, as opposed to having Nutch crawl different hosts in one job? In the case of the former, see the previous answer: Hadoop.
>
> > 3. Ability to run jobs in parallel.
> >    Example: I run a "fetch" job. It produces new, not yet crawled URLs.
> >    Can I run another job to process these uncrawled URLs before the first job is done?
>
> You can run jobs in parallel, but i advise against running fetcher jobs in parallel. During generation, Nutch generates a list of URLs to fetch, then it runs the fetcher. If you start a new job right away, the generator will generate the same list of URLs to fetch unless some parameter is set.
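The two thread settings discussed above live in conf/nutch-site.xml. A minimal sketch with purely illustrative values (the property names are from the thread; sensible values depend on your crawl politeness requirements and hardware):

```xml
<!-- conf/nutch-site.xml - illustrative values only -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
    <description>Total number of FetcherThreads the fetcher job uses.</description>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
    <description>Maximum number of threads allowed to access a single
    queue (one host or domain) at one time.</description>
  </property>
</configuration>
```

Note that raising fetcher.threads.per.queue above 1 increases the request rate against individual hosts, so it trades crawl speed against politeness.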
> If that is supported in 2.x - 1.x supports it - i'd still not recommend it, because why would you? You can just generate a much larger list for the fetcher.
>
> > 4. Database scalability.
> >    Can I use multiple MongoDB instances for crawling?
>
> Probably not a good idea. I don't know about Mongo, but can it run as a cluster? If it can, then use a larger cluster, not multiple databases. By the way, Mongo seems to be a bad idea anyway, as i read today on this mailing list, because it does not support id's larger than 512 bytes, while URLs can easily consist of more than that. For example, YJK URLs in UTF-8 take up to 4 bytes per character, meaning a YJK URL of 128 characters can already hit the limit.
>
> > Thank you in advance,
> > Vladimir.
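The byte arithmetic behind the 512-byte id limit is easy to check. A small sketch (the 512-byte figure is the one cited above; note that the common CJK ideographs encode to 3 bytes each in UTF-8, with 4 bytes only for supplementary-plane characters, so the practical character limit is a little higher than 128):

```python
# Check whether a URL's UTF-8 encoding fits within the 512-byte
# id limit mentioned in the thread. URLs below are illustrative.
MAX_ID_BYTES = 512

def fits_as_id(url: str) -> bool:
    """True if the UTF-8 encoding of `url` is at most MAX_ID_BYTES bytes."""
    return len(url.encode("utf-8")) <= MAX_ID_BYTES

ascii_url = "http://example.com/" + "a" * 400      # ASCII: 1 byte per char
cjk_url = "http://example.com/" + "\u4e2d" * 170   # common CJK: 3 bytes per char

print(len(ascii_url.encode("utf-8")), fits_as_id(ascii_url))  # 419 True
print(len(cjk_url.encode("utf-8")), fits_as_id(cjk_url))      # 529 False
```

So a mostly-CJK URL overruns the limit after roughly 170 characters even in the 3-byte case, which supports Markus's point that such a database is a poor fit for URL keys.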

