Hello Vladimir,

We don't use Nutch 2.x, but if I did, I'd probably choose Apache HBase, or maybe even Apache Solr in our case. Here are some resources for Nutch 2.x on Hadoop:
http://wiki.apache.org/nutch/#Nutch_2.X_tutorial.28s.29
Markus

-----Original message-----
> From: Vladimir Loubenski <[email protected]>
> Sent: Wednesday 5th October 2016 21:37
> To: [email protected]
> Subject: RE: Nutch scalability
>
> Thank you, Markus, for the prompt response.
> How can I run fetcher jobs simultaneously on Hadoop?
> What database do you recommend for Nutch 2.3.1?
> Regards,
> Vladimir.
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: October-05-16 2:53 PM
> To: [email protected]
> Subject: RE: Nutch scalability
>
> Where I wrote YJK, I of course meant CJK instead.
>
> > -----Original message-----
> > From: Markus Jelsma <[email protected]>
> > Sent: Wednesday 5th October 2016 20:34
> > To: [email protected]
> > Subject: RE: Nutch scalability
> >
> > Hello Vladimir - answers inline.
> > Markus
> >
> > > -----Original message-----
> > > From: Vladimir Loubenski <[email protected]>
> > > Sent: Wednesday 5th October 2016 20:09
> > > To: [email protected]
> > > Subject: Nutch scalability
> > >
> > > Hi,
> > > I have a Nutch 2.3.1 installation with MongoDB.
> > >
> > > I want to understand what scalability options I have.
> > >
> > > 1. The number of threads during one job can be defined in nutch-site.xml:
> > >    a. fetcher.threads.per.queue - the maximum number of threads allowed to access a single queue at one time.
> > >    b. fetcher.threads.fetch - the number of FetcherThreads the fetcher should use.
> > >    Do we have other scalability configuration parameters?
> >
> > Yes, run more fetcher jobs simultaneously on Hadoop.
> >
> > > 2. Ability to run the same job on different hosts.
> > > Is that supported by Nutch?
> >
> > Different hosts? You mean running the fetch job on different machines, as opposed to having Nutch crawl different hosts in one job? In the case of the former, see the previous answer: Hadoop.
> >
> > > 3. Ability to run jobs in parallel.
> > > Example: I run a "fetch" job. It produces new, not yet crawled URLs.
> > > Can I run another job to process these uncrawled URLs before the first job is done?
> >
> > You can run jobs in parallel, but I advise against running fetcher jobs in parallel. During generation, Nutch generates a list of URLs to fetch, then runs the fetcher. If you start a new job right away, the generator will generate the same list of URLs unless a certain parameter is set. 1.x supports that; whether 2.x does, I don't know. But I would not recommend it anyway - why bother? You can just generate a much larger list for the fetcher.
> >
> > > 4. Database scalability.
> > > Can I use multiple MongoDB instances for crawling?
> >
> > Probably not a good idea. I don't know much about Mongo, but can it run as a cluster? If so, use a larger cluster, not multiple databases. By the way, Mongo seems to be a bad idea anyway: as I read today on this mailing list, it does not support ids larger than 512 bytes, while URLs can easily consist of more than that. For example, YJK URLs in UTF-8 take up to 4 bytes per character, meaning a YJK URL of 128 characters is the limit.
> >
> > > Thank you in advance,
> > > Vladimir.
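PS - for reference, the two fetcher settings from point 1 go into nutch-site.xml like this. The values are only illustrative; tune them for your hardware and for how polite you need to be per host:

```xml
<!-- nutch-site.xml: example values only, not recommendations -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
  <description>Total number of FetcherThreads a fetcher task should use.</description>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value>
  <description>Maximum number of threads fetching from one queue (host) at a time.</description>
</property>
```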

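On point 3: the generator parameter I alluded to is, in Nutch 1.x at least, generate.update.crawldb. When enabled, the generator marks generated URLs in the CrawlDb so that overlapping generate/fetch/update cycles produce disjoint fetchlists. I have not verified this exists in 2.x, so check the nutch-default.xml of your version first:

```xml
<!-- Nutch 1.x: let overlapping generate/fetch cycles produce different
     fetchlists. Not verified for 2.x. -->
<property>
  <name>generate.update.crawldb</name>
  <value>true</value>
</property>
```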

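And on the 512-byte id limit in point 4: common CJK characters occupy 3 bytes each in UTF-8 (rarer supplementary-plane ones take 4), so the limit is reached quickly. A quick sanity check in Python; the URL below is invented purely for illustration:

```python
# How fast does a CJK URL approach a 512-byte id limit?
# The 512-byte figure is the limit discussed in this thread; the URL is made up.
MAX_ID_BYTES = 512

ascii_url = "http://example.com/" + "a" * 200            # 219 ASCII characters
cjk_url = "http://example.com/" + "\u6f22\u5b57" * 100   # 219 chars, 200 of them CJK

print(len(ascii_url), len(ascii_url.encode("utf-8")))    # 219 chars -> 219 bytes, fits
print(len(cjk_url), len(cjk_url.encode("utf-8")))        # 219 chars -> 619 bytes
print(len(cjk_url.encode("utf-8")) > MAX_ID_BYTES)       # True: same length, over the limit
```

Same character count, roughly three times the bytes - which is why a URL that looks short can still blow past the id limit.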