Hello Vladimir - answers inline.

Markus

-----Original message-----
> From:Vladimir Loubenski <[email protected]>
> Sent: Wednesday 5th October 2016 20:09
> To: [email protected]
> Subject: Nutch scalability
>
> Hi,
> I have Nutch 2.3.1 installation with MongoDB.
>
> I want to understand what scalability options I have.
>
> 1. Number threads during one Job can be defined by nutch-site.xml
> a. fetcher.threads.per.queue - This number is the maximum number of
> threads that should be allowed to access a queue at one time.
> b. fetcher.threads.fetch - The number of FetcherThreads the fetcher
> should use
> Do we have other scalability configuration parameters?

Yes, run more fetcher jobs simultaneously on Hadoop.

>
> 2. Ability to run the same Job on different hosts.
> Does it supported by Nutch?

Different hosts? I assume you mean running the fetch job on different machines, as opposed to having Nutch crawl different hosts in one job. If so, see the previous answer: Hadoop.

> 3. Ability to run Jobs in parallel.
> Example: I run “fetch” job. It produces new not Crawled URLS.
> Can I run another job to process these uncrawled URLS before the first
> Job is done?

You can run jobs in parallel, but I advise against running fetcher jobs in parallel. During generation, Nutch selects a list of URLs to fetch, then it runs the fetcher. If you start a new job right away, the generator will generate the same list of URLs to fetch unless some parameter is set. I am not sure whether that is supported in 2.x (1.x supports it), but I would not recommend it anyway: there is no need, because you can just generate a much larger list for the fetcher.

> 4. Database scalability.
> Can I use multiple instances Mongo DB for crawling?

Probably not a good idea. I don't know Mongo well, but can it run as a cluster? If so, use a larger cluster, not multiple databases.

By the way, Mongo seems to be a bad idea anyway, as I read today on this mailing list, because it does not support ids larger than 512 bytes, while URLs can easily be longer than that. For example, CJK characters in UTF-8 can take up to 4 bytes each, meaning a CJK URL of as few as 128 characters can hit the limit.

>
> Thank you in advance,
> Vladimir.
>
>
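
To make the byte-length point concrete, here is a minimal Python sketch that measures how many bytes a URL takes in UTF-8 and compares it with a limit. The 512-byte figure is simply the number quoted in this thread and is not verified here; the function names and the example URL are hypothetical, and this is not part of Nutch or MongoDB.

```python
# Assumption: 512 bytes is the id limit mentioned in this thread (not verified).
MAX_ID_BYTES = 512


def utf8_length(url: str) -> int:
    """Return the number of bytes the URL occupies when encoded as UTF-8."""
    return len(url.encode("utf-8"))


def fits_id_limit(url: str, limit: int = MAX_ID_BYTES) -> bool:
    """Return True if the UTF-8 encoded URL is no longer than `limit` bytes."""
    return utf8_length(url) <= limit


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    ascii_url = "http://example.com/" + "a" * 100
    cjk_url = "http://example.com/" + "\u4e2d" * 200  # 200 CJK characters

    for url in (ascii_url, cjk_url):
        print(len(url), "chars,", utf8_length(url), "bytes, fits:", fits_id_limit(url))
```

Character count and byte count diverge as soon as the URL contains non-ASCII characters, so any check against a byte limit has to be done on the encoded form rather than on the string length.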

