I think that is a typo, and it is actually CrawlDirectory. For the single map task issue: although I have not tried it yet, we can control the number of fetchers with the numFetchers parameter when doing the generate via bin/generate.

On Sep 7, 2014 9:23 AM, "Simon Z" <simonz.nu...@gmail.com> wrote:
> Hi Julien,
>
> What do you mean by "<crawlID>" please? I am using Nutch 1.8 and followed the
> instructions in the tutorial as mentioned before, and seem to have a similar
> situation, that is, fetch runs in only one map task. I am running on a
> cluster of four nodes on Hadoop 2.4.1.
>
> Notice that the map task can be assigned to any node, but only one map each
> round.
>
> I have set
>
> numSlaves=4
> mode=distributed
>
> The seed URL list includes five different websites from different hosts.
>
> Are there any settings I missed?
>
> Thanks in advance.
>
> Regards,
>
> Simon
>
> On Fri, Aug 29, 2014 at 10:39 PM, Julien Nioche <
> lists.digitalpeb...@gmail.com> wrote:
>
> > No, just do 'bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>' from
> > the master node. It internally calls the nutch script for the individual
> > commands, which takes care of sending the job jar to your Hadoop cluster;
> > see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271
> >
> > On 29 August 2014 15:24, S.L <simpleliving...@gmail.com> wrote:
> >
> > > Sorry Julien, I overlooked the directory names.
> > >
> > > My understanding is that the Hadoop job is submitted to a cluster by
> > > using the following command on the RM node: bin/hadoop <.job file> <params>
> > >
> > > Are you suggesting I submit the script instead of the Nutch .job jar,
> > > like below?
> > >
> > > bin/hadoop bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
> > >
> > > On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche <
> > > lists.digitalpeb...@gmail.com> wrote:
> > >
> > > > As the name runtime/deploy suggests, it is used exactly for that
> > > > purpose ;-) Just make sure HADOOP_HOME/bin is added to the path and
> > > > run the script, that's all.
> > > > Look at the bottom of the nutch script for details.
> > > > Julien
> > > >
> > > > PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (
> > > > http://sched.co/1pbE15n) where we'll cover things like these.
> > > >
> > > > On 29 August 2014 14:30, S.L <simpleliving...@gmail.com> wrote:
> > > >
> > > > > Thanks, can this be used on a Hadoop cluster?
> > > > >
> > > > > Sent from my HTC
> > > > >
> > > > > ----- Reply message -----
> > > > > From: "Julien Nioche" <lists.digitalpeb...@gmail.com>
> > > > > To: "user@nutch.apache.org" <user@nutch.apache.org>
> > > > > Subject: Nutch 1.7 fetch happening in a single map task.
> > > > > Date: Fri, Aug 29, 2014 9:00 AM
> > > > >
> > > > > See
> > > > > http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
> > > > >
> > > > > Just go to runtime/deploy/bin and run the script from there.
> > > > >
> > > > > Julien
> > > > >
> > > > > On 29 August 2014 13:38, Meraj A. Khan <mera...@gmail.com> wrote:
> > > > >
> > > > > > Hi Julien,
> > > > > >
> > > > > > I have 15 domains and they are all being fetched in a single map
> > > > > > task, which does not fetch all the URLs no matter what depth or
> > > > > > topN I give.
> > > > > >
> > > > > > I am submitting the Nutch job jar, which seems to be using the
> > > > > > Crawl.java class. How do I use the crawl script on a Hadoop
> > > > > > cluster? Are there any pointers you can share?
> > > > > >
> > > > > > Thanks.
> > > > > > On Aug 29, 2014 4:40 AM, "Julien Nioche" <
> > > > > > lists.digitalpeb...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Meraj,
> > > > > > >
> > > > > > > The generator will place all the URLs in a single segment if
> > > > > > > they all belong to the same host, for politeness reasons.
> > > > > > > Otherwise it will use whichever value is passed with the
> > > > > > > -numFetchers parameter in the generation step.
> > > > > > > Why don't you use the crawl script in /bin instead of tinkering
> > > > > > > with the (now deprecated) Crawl class? It comes with a good
> > > > > > > default configuration and should make your life easier.
> > > > > > >
> > > > > > > Julien
> > > > > > >
> > > > > > > On 28 August 2014 06:47, Meraj A. Khan <mera...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed
> > > > > > > > that there is only a single reducer in the generate-partition
> > > > > > > > job. I am running into a situation where the subsequent fetch
> > > > > > > > runs in only a single map task (I believe as a consequence of
> > > > > > > > the single reducer in the earlier phase). How can I force Nutch
> > > > > > > > to fetch in multiple map tasks? Is there a setting to force
> > > > > > > > more than one reducer in the generate-partition job, so as to
> > > > > > > > have more map tasks?
> > > > > > > >
> > > > > > > > Please also note that I have commented out the code in
> > > > > > > > Crawl.java so as not to do the LinkInversion phase, as I don't
> > > > > > > > need the scoring of the URLs that Nutch crawls; every URL is
> > > > > > > > equally important to me.
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > --
> > > > > > > Open Source Solutions for Text Engineering
> > > > > > >
> > > > > > > http://digitalpebble.blogspot.com/
> > > > > > > http://www.digitalpebble.com
> > > > > > > http://twitter.com/digitalpebble
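For reference, a minimal sketch of the two invocations discussed in this thread: the recommended bin/crawl wrapper run from runtime/deploy, and a manual generate step with -numFetchers to get more than one fetch list (and hence more than one fetch map task). The seed directory, crawl ID, Solr URL, round count, and fetcher count below are hypothetical placeholders to adjust for your own cluster; the commands are only assembled and printed here, not executed:

```shell
#!/bin/sh
# Hypothetical values -- substitute your own seed dir, crawl ID, Solr URL, etc.
NUTCH_DEPLOY=runtime/deploy        # run from here so the .job jar is shipped to Hadoop
SEED_DIR=urls                      # HDFS directory containing the seed URL list
CRAWL_ID=crawl                     # base directory holding crawldb/ and segments/
SOLR_URL=http://localhost:8983/solr
NUM_ROUNDS=2
NUM_FETCHERS=4                     # one fetch list (and fetch map task) per fetcher

# Recommended route: the crawl script drives the inject/generate/fetch/parse/update
# steps for each round and internally calls bin/nutch, which submits the job jar.
CRAWL_CMD="$NUTCH_DEPLOY/bin/crawl $SEED_DIR $CRAWL_ID $SOLR_URL $NUM_ROUNDS"

# Manual route: ask generate for several fetch lists so fetch gets more than one
# map task (URLs from a single host still share one segment, for politeness).
GENERATE_CMD="$NUTCH_DEPLOY/bin/nutch generate $CRAWL_ID/crawldb $CRAWL_ID/segments -numFetchers $NUM_FETCHERS"

echo "$CRAWL_CMD"
echo "$GENERATE_CMD"
```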