Hi Julien,

I have 15 domains, and they are all being fetched in a single map task, which does not fetch all the URLs no matter what depth or topN I give. I am submitting the Nutch job jar, which seems to be using the Crawl.java class. How do I use the crawl script on a Hadoop cluster? Are there any pointers you can share?
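For reference, this is roughly how I have been submitting the job so far (a sketch of my current invocation from memory; the jar name, seed dir, and values are specific to my setup):

    # Submit the deprecated Crawl class directly to the cluster
    hadoop jar apache-nutch-1.7.job org.apache.nutch.crawl.Crawl urls \
        -dir crawl -depth 3 -topN 1000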
Thanks.

On Aug 29, 2014 4:40 AM, "Julien Nioche" <lists.digitalpeb...@gmail.com> wrote:

> Hi Meraj,
>
> The generator will place all the URLs in a single segment if they all
> belong to the same host, for politeness reasons. Otherwise it will use
> whichever value is passed with the -numFetchers parameter in the
> generation step.
>
> Why don't you use the crawl script in /bin instead of tinkering with the
> (now deprecated) Crawl class? It comes with a good default configuration
> and should make your life easier.
>
> Julien
>
> On 28 August 2014 06:47, Meraj A. Khan <mera...@gmail.com> wrote:
>
> > Hi All,
> >
> > I am running Nutch 1.7 on a Hadoop 2.3.0 cluster, and I noticed that
> > there is only a single reducer in the generate-partition job. I am
> > running into a situation where the subsequent fetch runs in only a
> > single map task (I believe as a consequence of the single reducer in
> > the earlier phase). How can I force Nutch to fetch in multiple map
> > tasks? Is there a setting to force more than one reducer in the
> > generate-partition job, so that there are more map tasks?
> >
> > Please also note that I have commented out the code in Crawl.java so
> > that it does not do the LinkInversion phase, as I don't need the
> > scoring of the URLs that Nutch crawls; every URL is equally important
> > to me.
> >
> > Thanks.
>
> --
> Open Source Solutions for Text Engineering
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
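PS: If I understand the suggestion correctly, the -numFetchers value Julien mentions is passed to the generate step, along these lines (a sketch based on the Nutch 1.7 generate usage; the paths and the value 4 are illustrative, from my setup rather than from the script):

    # Ask the generator to partition the fetch list into 4 fetch lists,
    # which should give 4 fetch map tasks
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -numFetchers 4

My understanding is that URLs from the same host still land in the same fetch partition for politeness, so the number of fetch map tasks is also capped by the number of distinct hosts in the fetch list.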