Where I wrote YJK, I of course meant CJK.
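
For what it's worth, here is a minimal sketch of the byte arithmetic behind the id-size remark quoted below. It makes some assumptions: the 512-byte figure is simply the number mentioned on this list (I have not verified it against MongoDB), the URL is stored as raw UTF-8 rather than percent-encoded, and the example.org URL and the helper names are made up for illustration. Most common CJK ideographs take 3 bytes in UTF-8; only those outside the Basic Multilingual Plane take 4.

```python
# Assumption: 512 bytes is the id limit quoted in this thread, not a value
# verified against MongoDB itself.
ID_LIMIT_BYTES = 512


def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def max_chars(bytes_per_char: int, limit: int = ID_LIMIT_BYTES) -> int:
    """Largest count of fixed-width characters that fits in `limit` bytes."""
    return limit // bytes_per_char


# U+4E2D is a common CJK ideograph inside the Basic Multilingual Plane.
bmp_char = "\u4e2d"
# U+20000 is a CJK Extension B ideograph in a supplementary plane.
supplementary_char = "\U00020000"

# Hypothetical URL: an ASCII prefix followed by 170 CJK characters.
url = "http://example.org/" + bmp_char * 170

print("bytes per BMP CJK character:", utf8_len(bmp_char))
print("bytes per supplementary CJK character:", utf8_len(supplementary_char))
print("characters that fit at 3 bytes each:", max_chars(3))
print("characters that fit at 4 bytes each:", max_chars(4))
print("example URL bytes:", utf8_len(url))
print("example URL exceeds limit:", utf8_len(url) > ID_LIMIT_BYTES)
```

So 128 characters is the worst case; a URL made of 3-byte characters gets somewhat further before it runs into the same limit.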
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Wednesday 5th October 2016 20:34
> To: [email protected]
> Subject: RE: Nutch scalability
>
> Hello Vladimir - answers inline.
> Markus
>
> -----Original message-----
> > From:Vladimir Loubenski <[email protected]>
> > Sent: Wednesday 5th October 2016 20:09
> > To: [email protected]
> > Subject: Nutch scalability
> >
> > Hi,
> > I have Nutch 2.3.1 installation with MongoDB.
> >
> > I want to understand what scalability options I have.
> >
> > 1. The number of threads used during one job can be defined in nutch-site.xml:
> > a. fetcher.threads.per.queue - the maximum number of threads that
> > should be allowed to access a queue at one time.
> > b. fetcher.threads.fetch - the number of FetcherThreads the fetcher
> > should use.
> > Are there other scalability configuration parameters?
>
> Yes, run more fetcher jobs simultaneously on Hadoop.
>
> >
> > 2. Ability to run the same job on different hosts.
> > Is this supported by Nutch?
>
> Different hosts? You mean running the fetch job on different machines, as
> opposed to having Nutch crawl different hosts in one job? In that case, see
> the previous answer: Hadoop.
>
> > 3. Ability to run jobs in parallel.
> > Example: I run a “fetch” job. It produces new, not yet crawled URLs.
> > Can I run another job to process these uncrawled URLs before the first
> > job is done?
>
> You can run jobs in parallel, but I advise against running fetcher jobs in
> parallel. During generation, Nutch selects a list of URLs to fetch, then it
> runs the fetcher. If you start a new job right away, the generator will
> generate the same list of URLs unless some parameter is set. I am not sure
> that is supported in 2.x; 1.x supports it, but I would not recommend it
> because there is little reason to: you can just generate a much larger list
> for the fetcher.
>
> > 4. Database scalability.
> > Can I use multiple MongoDB instances for crawling?
>
> Probably not a good idea. I don't know Mongo, but can it run as a
> cluster? If so, use a larger cluster, not multiple databases.
> By the way, Mongo seems to be a bad idea anyway, as I read today on this
> mailing list, because it does not support ids larger than 512 bytes, while
> URLs can easily be longer than that. For example, YJK characters in UTF-8
> take up to 4 bytes each (3 for most), so a YJK URL of as few as 128
> characters can already hit the limit.
>
> >
> > Thank you in advance,
> > Vladimir.
> >
> >
>