Re: Nutch 1.7 fetch happening in a single map task.

2014-09-09 Thread Simon Z
Hi Meraj,

The *nutch*.job file and the deploy directory are at the same level. Do I
need to change the location of the job file?

Thanks in advance,

Re: Nutch 1.7 fetch happening in a single map task.

2014-09-08 Thread Meraj A. Khan
AFAIK, the script does not go by the mode you set, but by the presence of the
*nutch*.job file in the directory one level above the script itself, i.e.
../*.job.

Can you please check if you have the Hadoop job file at the appropriate
location?
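
Roughly, the detection in bin/nutch looks like this (a paraphrased sketch of
the script's logic, not the verbatim code; the file name is illustrative):

  # NUTCH_HOME is the directory one level above the script, e.g. runtime/deploy
  NUTCH_HOME="$(dirname "$0")/.."
  if ls "$NUTCH_HOME"/*nutch*.job >/dev/null 2>&1; then
    # ../*.job is present -> deploy mode: submit the job file via Hadoop
    EXEC="hadoop jar $NUTCH_HOME/apache-nutch-1.7.job"
  else
    # no job file next to bin/ -> silently fall back to local mode
    EXEC="java"
  fi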

Re: Nutch 1.7 fetch happening in a single map task.

2014-09-08 Thread Simon Z
Thank you very much, Meraj, for your reply; I also thought it was a typo.

I had set numFetchers via numSlaves, and the generator echo showed that
numFetchers is 8 (numTasks=`expr $numSlaves \* 2`, i.e. 4 times 2), but the
generator output showed that the run mode is "local" and it generated exactly
one mapper, even though I had changed mode=distributed. Any idea about this,
please?
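
For reference, these are the two conditions I understand the script checks
(paths assume the standard runtime/deploy layout, so adjust as needed):

  ls runtime/deploy/apache-nutch-*.job   # the ../*.job file bin/nutch looks for
  which hadoop                           # 'hadoop' must also be on the PATH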

Many regards,

Simon

Re: Nutch 1.7 fetch happening in a single map task.

2014-09-07 Thread Meraj A. Khan
I think that is a typo, and it is actually CrawlDirectory. As for the single
map task issue: although I have not tried it yet, we can control the number
of fetchers with the -numFetchers parameter when doing the generate step via
bin/nutch generate.
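
An illustrative invocation would be something like this (the paths and
numbers are placeholders, not values I have tested):

  # ask the generator for 8 fetch lists, i.e. 8 fetch map tasks
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 8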

Re: Nutch 1.7 fetch happening in a single map task.

2014-09-07 Thread Simon Z
Hi Julien,

What do you mean by "" please? I am using Nutch 1.8 and followed the
instructions in the tutorial as mentioned before, and I seem to have a
similar situation, that is, fetch runs in only one map task. I am running on
a cluster of four nodes on Hadoop 2.4.1.

Note that the map task can be assigned to any node, but there is only one map
task each round.

I have set

numSlaves=4
mode=distributed


The seed URL list includes five different websites from different hosts.


Are there any settings I missed?

Thanks in advance.

Regards,

Simon

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-31 Thread Meraj A. Khan
Julien,

Thank you for the decisive advice; using the crawl script seems to have
solved the problem of abrupt termination of the crawl. The bin/crawl script
respects the depth and topN parameters and iterates accordingly.

However, I have an issue with the number of map tasks used for the fetch
phase: it is always 1. I see that the script sets the numFetchers parameter
at generate time equal to the number of slaves, which is 3 in my case, yet
only a single map task is being used, under-utilizing my Hadoop cluster and
slowing down the crawl.

I see that in the CrawlDb update phase there are millions of 'db_unfetched'
URLs, yet the generate phase creates only a single segment with about 20-30k
URLs, and as a result only a single map task is used for the fetch phase. I
guess I need to make the generate phase produce more than one segment; how do
I do that using the bin/crawl script?
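
For reference, the knobs involved look roughly like this in the script (a
paraphrased sketch with abridged names, not the verbatim bin/crawl):

  numSlaves=3
  numTasks=`expr $numSlaves \* 2`            # fetch lists (= fetch maps) per segment
  sizeFetchlist=`expr $numSlaves \* 50000`   # URLs selected per round

  bin/nutch generate crawl/crawldb crawl/segments \
    -topN $sizeFetchlist -numFetchers $numTasks -noFilter

If I read it correctly, the script generates exactly one segment per round,
so rather than more segments the answer would be to raise sizeFetchlist (the
-topN value) and run more rounds; -numFetchers then splits that one segment
into several fetch lists, each fetched by its own map task.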

Please note that this is for Nutch 1.7 on Hadoop 2.3.0.

Thanks.

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Julien Nioche
No, just do 'bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>' from
the master node. It internally calls the nutch script for the individual
commands, which takes care of sending the job jar to your Hadoop cluster;
see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271
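
For example (the seed directory, crawl directory and Solr URL below are
placeholders for your own setup):

  cd runtime/deploy
  bin/crawl urls crawl http://solrhost:8983/solr/ 2   # 2 rounds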


Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread S.L
Sorry Julien, I overlooked the directory names.

My understanding is that the Hadoop job is submitted to a cluster by using
the following command on the RM node: bin/hadoop jar <.job file> <args>

Are you suggesting I submit the script instead of the Nutch .job jar, like
below?

bin/hadoop  bin/crawl


Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Julien Nioche
As the name runtime/deploy suggests, it is used exactly for that purpose ;-)
Just make sure HADOOP_HOME/bin is added to the path and run the script;
that's all.
Look at the bottom of the nutch script for details.
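
Something along these lines (assuming HADOOP_HOME points at your Hadoop
installation):

  export PATH="$HADOOP_HOME/bin:$PATH"
  which hadoop   # should now resolve, so the script runs in deploy mode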

Julien

PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (
http://sched.co/1pbE15n) where we'll cover things like these


Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread S.L
Thanks, can this be used on a hadoop cluster?

Sent from my HTC


Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Julien Nioche
See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script

just go to runtime/deploy/bin and run the script from there.

Julien


Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Meraj A. Khan
Hi Julien,

I have 15 domains and they are all being fetched in a single map task, which
does not fetch all the URLs no matter what depth or topN I give.

I am submitting the Nutch job jar, which seems to be using the Crawl.java
class. How do I use the crawl script on a Hadoop cluster? Are there any
pointers you can share?

Thanks.


Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Julien Nioche
Hi Meraj,

The generator will place all the URLs in a single segment if they all belong
to the same host, for politeness reasons. Otherwise it will use whichever
value is passed with the -numFetchers parameter in the generation step.

Why don't you use the crawl script in /bin instead of tinkering with the
(now deprecated) Crawl class? It comes with a good default configuration
and should make your life easier.

Julien

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Nutch 1.7 fetch happening in a single map task.

2014-08-27 Thread Meraj A. Khan
Hi All,

I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed that there is
only a single reducer in the generate-partition job. I am running into a
situation where the subsequent fetch runs in only a single map task (I
believe as a consequence of the single reducer in the earlier phase). How can
I force Nutch to fetch in multiple map tasks? Is there a setting to force
more than one reducer in the generate-partition job, so as to get more map
tasks?
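
As far as I understand it, each reducer of the generate step writes one fetch
list under the segment's crawl_generate directory, and the fetch job gets one
map task per fetch list, so counting those part files shows how many fetch
maps to expect (the segment path below is a placeholder):

  hadoop fs -ls crawl/segments/*/crawl_generate   # one part-* file per fetch list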

Please also note that I have commented out the code in Crawl.java so that it
does not do the LinkInversion phase, as I don't need the scoring of the URLs
that Nutch crawls; every URL is equally important to me.

Thanks.