Re: Nutch 1.7 fetch happening in a single map task.
Hi Meraj,

The *nutch* and deploy are at the same level; do I need to change the location of the job file, please?

Thanks in advance,

On Mon, Sep 8, 2014 at 10:03 PM, Meraj A. Khan wrote:
> AFAIK, the script does not go by the mode you set, but by the presence of
> the *nutch*.job file in the directory one level above the script itself,
> i.e. ../*.job. Can you please check if you have the Hadoop job file at
> the appropriate location?
Re: Nutch 1.7 fetch happening in a single map task.
AFAIK, the script does not go by the mode you set, but by the presence of the *nutch*.job file in the directory one level above the script itself, i.e. ../*.job.

Can you please check if you have the Hadoop job file at the appropriate location?

On Mon, Sep 8, 2014 at 9:22 AM, Simon Z wrote:
> Thank you very much, Meraj, for your reply; I also thought it was a typo.
> I had set numFetchers via numSlaves, and the echo from the generator
> showed that numFetcher is 8, but the generator output showed the run mode
> as "local" and produced exactly one mapper, although I had changed
> mode=distributed. Any idea about this, please?
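Meraj's point above, that the script keys off the presence of a job file rather than any mode variable, can be sketched roughly like this (a paraphrase of the logic in bin/nutch; the function name and paths are illustrative, not the script's actual code):

```shell
# Paraphrase of the mode detection described above: the script runs on
# Hadoop only if a *nutch*.job file sits one level above the script's own
# directory; otherwise it falls back to local mode. Names are illustrative.
detect_mode() {
  script_dir="$1"                        # directory containing the script (bin/)
  nutch_home=$(dirname "$script_dir")    # one level above the script
  if ls "$nutch_home"/*nutch*.job >/dev/null 2>&1; then
    echo distributed
  else
    echo local
  fi
}
```

So if the crawl keeps running in local mode despite mode=distributed, the first thing to verify is that the job file actually exists one level above the bin/ directory the script runs from (e.g. under runtime/deploy).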
Re: Nutch 1.7 fetch happening in a single map task.
Thank you very much, Meraj, for your reply; I also thought it was a typo.

I had set the numFetchers via numSlaves, and the echo from the generator showed that numFetcher is 8 (numTasks=`expr $numSlaves \* 2`, that is 4 * 2), but the output of the generator showed that the run mode is "local" and it generated exactly one mapper, although I had changed mode=distributed. Any idea about this, please?

Many regards,

Simon

On Mon, Sep 8, 2014 at 7:18 AM, Meraj A. Khan wrote:
> I think that is a typo, and it is actually CrawlDirectory. As for the
> single map task issue, although I have not tried it yet, we can control
> the number of fetchers with the numFetchers parameter when doing the
> generate step.
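For reference, the fetcher count Simon quotes comes from straightforward shell arithmetic in the crawl script, reproduced here in isolation:

```shell
# The crawl script derives the requested number of fetch map tasks from
# the numSlaves setting:
numSlaves=4
numTasks=`expr $numSlaves \* 2`   # 4 * 2 = 8
echo "$numTasks"
```

So with numSlaves=4 the script asks for 8 fetch lists; the puzzle in this thread is why only one map task is actually used despite that.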
Re: Nutch 1.7 fetch happening in a single map task.
I think that is a typo, and it is actually CrawlDirectory. As for the single map task issue: although I have not tried it yet, we can control the number of fetchers with the numFetchers parameter when doing the generate step via bin/nutch generate.

On Sep 7, 2014 9:23 AM, "Simon Z" wrote:
> Hi Julien, what do you mean by "" please? I am using Nutch 1.8, followed
> the instructions in the tutorial as mentioned before, and seem to have a
> similar situation: fetch runs in only one map task. I am running on a
> cluster of four nodes on Hadoop 2.4.1.
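A hedged example of what Meraj suggests, passing numFetchers explicitly at generate time (the crawldb/segments paths and the -topN value here are illustrative placeholders, not values from the thread):

```shell
# Assemble a generate invocation with an explicit fetcher count, as
# suggested above. Paths and -topN are illustrative.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
NUM_FETCHERS=8
GEN_CMD="bin/nutch generate $CRAWLDB $SEGMENTS -topN 50000 -numFetchers $NUM_FETCHERS"
echo "$GEN_CMD"
```

With more than one fetcher requested, and URLs spread over several hosts, the generated fetch list is partitioned into that many parts, each fetched by its own map task.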
Re: Nutch 1.7 fetch happening in a single map task.
Hi Julien,

What do you mean by "" please? I am using Nutch 1.8, followed the instructions in the tutorial as mentioned before, and seem to have a similar situation, that is, fetch runs in only one map task. I am running on a cluster of four nodes on Hadoop 2.4.1.

Note that the map task can be assigned to any node, but there is only one map each round.

I have set

numSlaves=4
mode=distributed

The seed URL list includes five different websites from different hosts.

Are there any settings I missed?

Thanks in advance.

Regards,

Simon

On Fri, Aug 29, 2014 at 10:39 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> No, just do 'bin/crawl' from the master node. It internally calls the
> nutch script for the individual commands, which takes care of sending the
> job jar to your hadoop cluster, see
> https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271
Re: Nutch 1.7 fetch happening in a single map task.
Julien,

Thank you for the decisive advice. Using the crawl script seems to have solved the problem of abrupt termination of the crawl; the bin/crawl script respects the depth and topN parameters and iterates accordingly.

However, I have an issue with the number of map tasks used for the fetch phase: it is always 1. I see that the script sets the numFetchers parameter at generate time equal to the number of slaves, which is 3 in my case, yet only a single map task is used, under-utilizing my Hadoop cluster and slowing down the crawl.

I also see that in the crawldb update phase there are millions of 'db_unfetched' URLs, yet the generate phase creates only a single segment with about 20-30k URLs, and as a result only a single map task is used for the fetch phase. I guess I need to make the generate phase produce more than one segment; how do I do that using the bin/crawl script?

Please note that this is for Nutch 1.7 on Hadoop 2.3.0.

Thanks.

On Fri, Aug 29, 2014 at 10:39 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> No, just do 'bin/crawl' from the master node. It internally calls the
> nutch script for the individual commands, which takes care of sending the
> job jar to your hadoop cluster.
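On the "only one segment per round" question Meraj raises: the 1.x generator accepts a -maxNumSegments option that lets one pass over the crawldb emit several segments. A hedged sketch (the flag is from the generator's usage, not from this thread; the values are illustrative, and the stock bin/crawl script of that era may not expose it directly):

```shell
# Ask the generator for up to 4 segments in one pass over the crawldb,
# each capped by -topN, with 3 fetch lists per segment. Values illustrative.
GEN_CMD="bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 3 -maxNumSegments 4"
echo "$GEN_CMD"
```

Each generated segment can then be fetched in its own job, so a backlog of db_unfetched URLs is drained across more map tasks per round.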
Re: Nutch 1.7 fetch happening in a single map task.
No, just do 'bin/crawl' from the master node. It internally calls the nutch script for the individual commands, which takes care of sending the job jar to your Hadoop cluster; see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271

On 29 August 2014 15:24, S.L wrote:
> Sorry Julien, I overlooked the directory names. My understanding is that
> the Hadoop job is submitted to a cluster by running bin/hadoop with the
> .job file on the RM node. Are you suggesting I submit the script instead
> of the Nutch .job jar, like below?
>
> bin/hadoop bin/crawl

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
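The archive stripped the angle-bracket placeholders out of Julien's 'bin/crawl' command. As a hedged reconstruction of the shape of the invocation at the time (argument names and values here are illustrative; check the usage message at the top of bin/crawl itself):

```shell
# Hedged reconstruction of a crawl-script invocation: seed directory,
# crawl directory, Solr URL, number of rounds. All values illustrative.
CRAWL_CMD="bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2"
echo "$CRAWL_CMD"
```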
Re: Nutch 1.7 fetch happening in a single map task.
Sorry Julien, I overlooked the directory names.

My understanding is that the Hadoop job is submitted to a cluster by running bin/hadoop with the .job file on the RM node.

Are you suggesting I submit the script instead of the Nutch .job jar, like below?

bin/hadoop bin/crawl

On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> As the name runtime/deploy suggests, it is used exactly for that purpose
> ;-) Just make sure HADOOP_HOME/bin is added to the path and run the
> script, that's all. Look at the bottom of the nutch script for details.
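For contrast, the manual submission S.L describes ("bin/hadoop with the .job file") typically looked something like this (jar name, seed path, depth and topN are illustrative; org.apache.nutch.crawl.Crawl is the deprecated entry point Julien advises against):

```shell
# The deprecated manual route: submit the Nutch job jar straight to Hadoop
# with the Crawl class as the driver. All values illustrative.
SUBMIT_CMD="bin/hadoop jar apache-nutch-1.7.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 1000"
echo "$SUBMIT_CMD"
```

The crawl script replaces this single monolithic job with a loop of inject/generate/fetch/parse/update jobs, which is what makes the per-phase parallelism tunable.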
Re: Nutch 1.7 fetch happening in a single map task.
As the name runtime/deploy suggests, it is used exactly for that purpose ;-) Just make sure HADOOP_HOME/bin is added to the path and run the script, that's all.

Look at the bottom of the nutch script for details.

Julien

PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (http://sched.co/1pbE15n) where we'll cover things like these

On 29 August 2014 14:30, S.L wrote:
> Thanks, can this be used on a hadoop cluster?

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: Nutch 1.7 fetch happening in a single map task.
Thanks, can this be used on a Hadoop cluster?

Sent from my HTC

----- Reply message -----
From: "Julien Nioche"
To: "user@nutch.apache.org"
Subject: Nutch 1.7 fetch happening in a single map task.
Date: Fri, Aug 29, 2014 9:00 AM

> See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
> Just go to runtime/deploy/bin and run the script from there.
Re: Nutch 1.7 fetch happening in a single map task.
See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script

Just go to runtime/deploy/bin and run the script from there.

Julien

On 29 August 2014 13:38, Meraj A. Khan wrote:
> I have 15 domains and they are all being fetched in a single map task
> which does not fetch all the URLs, no matter what depth or topN I give.
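For what it's worth, a minimal sketch of what running the crawl script on a cluster can look like. The argument order follows the 1.8-era crawl script usage, and the seed directory, crawl directory, Solr URL, and round count below are example values, not settings from this thread; adjust them for your install:

```shell
# Sketch only: run the crawl script from the deploy build so the *.job
# file is submitted to the Hadoop cluster instead of running in local mode.
export PATH="$HADOOP_HOME/bin:$PATH"   # the hadoop command must be resolvable
cd runtime/deploy
# 1.8-era signature: crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
bin/crawl urls crawl http://localhost:8983/solr/ 2
```

The key point is the working directory: launched from runtime/deploy, the script finds the job file one level above bin and submits it to Hadoop rather than falling back to local mode.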
Re: Nutch 1.7 fetch happening in a single map task.
Hi Julien,

I have 15 domains and they are all being fetched in a single map task which does not fetch all the URLs, no matter what depth or topN I give.

I am submitting the Nutch job jar, which seems to be using the Crawl.java class. How do I use the crawl script on a Hadoop cluster? Are there any pointers you can share?

Thanks.

On Aug 29, 2014 4:40 AM, "Julien Nioche" wrote:
> The generator will place all the URLs in a single segment if they all
> belong to the same host, for politeness reasons.
Re: Nutch 1.7 fetch happening in a single map task.
Hi Meraj,

The generator will place all the URLs in a single segment if they all belong to the same host, for politeness reasons. Otherwise it will use whichever value is passed with the -numFetchers parameter in the generation step.

Why don't you use the crawl script in /bin instead of tinkering with the (now deprecated) Crawl class? It comes with a good default configuration and should make your life easier.

Julien

On 28 August 2014 06:47, Meraj A. Khan wrote:
> I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and noticed that there
> is only a single reducer in the generate-partition job.
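[Editorial note: the politeness behaviour described above amounts to host-based partitioning. The snippet below is an illustrative model of that idea, not Nutch's actual partitioner code; the function name and the use of CRC32 as the hash are assumptions for the sketch.]

```python
import zlib
from urllib.parse import urlparse

def partition_by_host(urls, num_fetchers):
    """Group URLs into fetch partitions keyed by host, so that all URLs
    of one host end up in the same fetcher (politeness)."""
    partitions = [[] for _ in range(num_fetchers)]
    for url in urls:
        host = urlparse(url).netloc
        # A stable hash of the host picks the partition: same host -> same partition.
        partitions[zlib.crc32(host.encode()) % num_fetchers].append(url)
    return partitions

urls = [
    "http://site-a.example/page1",
    "http://site-a.example/page2",
    "http://site-b.example/page1",
]
parts = partition_by_host(urls, 4)
# Both site-a.example URLs land in the same partition by construction, so a
# crawl whose seeds all share one host fills exactly one partition,
# which surfaces as a single fetch map task.
```

This is why a seed list drawn from a single host (or a generate step run with -numFetchers 1) produces one fetch map task no matter how many nodes are available.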
Nutch 1.7 fetch happening in a single map task.
Hi All,

I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed that there is only a single reducer in the generate-partition job. As a consequence (I believe), the subsequent fetch runs in only a single map task. How can I force Nutch to fetch in multiple map tasks? Is there a setting to force more than one reducer in the generate-partition job, so that there are more map tasks?

Please also note that I have commented out the code in Crawl.java so as not to do the LinkInversion phase, as I don't need the scoring of the URLs that Nutch crawls; every URL is equally important to me.

Thanks.