Re: 2.2.1 Compilation Failure

2014-10-30 Thread Sagar Handore
Hi Lixiang,

Please go through the following link:

http://nlp.solutions.asia/?p=362

I hope it solves your problem.


On Fri, Oct 31, 2014 at 7:48 AM, Lixiang Ao  wrote:

> Hi all,
>
> I'm new to Nutch. I just followed the instructions at
> http://wiki.apache.org/nutch/Nutch2Tutorial, but I got the following
> errors:
>
> [full build log snipped; it is reproduced in the original post below]
>
> Could anyone tell me what might have gone wrong?
>
> Thanks
> Lixiang



-- 
Thanks & Regards,
Handore Sagar
Zymr, Inc.
Software Engineer



2.2.1 Compilation Failure

2014-10-30 Thread Lixiang Ao
Hi all,

I'm new to Nutch. I just followed the instructions at
http://wiki.apache.org/nutch/Nutch2Tutorial, but I got the following errors:


aolixiang@lbt:~/IdeaProjects/dwc/nutch$ ant runtime
Buildfile: /home/aolixiang/IdeaProjects/dwc/nutch/build.xml
Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource
org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource
org/sonar/ant/antlib.xml. It could not be found.

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:
[mkdir] Created dir: /home/aolixiang/IdeaProjects/dwc/nutch/build
[mkdir] Created dir:
/home/aolixiang/IdeaProjects/dwc/nutch/build/classes
[mkdir] Created dir:
/home/aolixiang/IdeaProjects/dwc/nutch/build/release
[mkdir] Created dir: /home/aolixiang/IdeaProjects/dwc/nutch/build/test
[mkdir] Created dir:
/home/aolixiang/IdeaProjects/dwc/nutch/build/test/classes

clean-lib:

resolve-default:
[ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file =
/home/aolixiang/IdeaProjects/dwc/nutch/ivy/ivysettings.xml
  [taskdef] Could not load definitions from resource
org/sonar/ant/antlib.xml. It could not be found.

copy-libs:

compile-core:
[javac] Compiling 180 source files to
/home/aolixiang/IdeaProjects/dwc/nutch/build/classes
[javac] warning: [options] bootstrap class path not set in conjunction
with -source 1.6
[javac]
/home/aolixiang/IdeaProjects/dwc/nutch/src/java/org/apache/nutch/storage/WebPage.java:28:
error: cannot find symbol
[javac] import org.apache.avro.ipc.AvroRemoteException;
[javac]   ^
[javac]   symbol:   class AvroRemoteException
[javac]   location: package org.apache.avro.ipc
[javac]
/home/aolixiang/IdeaProjects/dwc/nutch/src/java/org/apache/nutch/storage/WebPage.java:35:
error: cannot find symbol
[javac] import org.apache.gora.persistency.StateManager;
[javac]   ^
[javac]   symbol:   class StateManager
[javac]   location: package org.apache.gora.persistency

...
(lots of errors like that)

[javac]
/home/aolixiang/IdeaProjects/dwc/nutch/src/java/org/apache/nutch/storage/Host.java:136:
error: cannot find symbol
[javac] getStateManager().setDirty(this, 2);
[javac] ^
[javac]   symbol:   method getStateManager()
[javac]   location: class Host
[javac] 78 errors
[javac] 6 warnings

BUILD FAILED
/home/aolixiang/IdeaProjects/dwc/nutch/build.xml:101: Compile failed; see
the compiler error output for details.


Could anyone tell me what might have gone wrong?


Thanks
Lixiang
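
These particular symptoms (org.apache.avro.ipc.AvroRemoteException and
org.apache.gora.persistency.StateManager not found) typically mean that Ivy
resolved newer Avro/Gora jars than the ones Nutch 2.2.1's generated storage
classes were written against; StateManager, for example, was removed in the
Gora 0.3 persistency redesign. A minimal sketch of the pins to check in
ivy/ivy.xml, assuming stock Nutch 2.2.1 defaults (the revisions shown are
assumptions, not verified against any particular tree):

<!-- ivy/ivy.xml (sketch): Nutch 2.2.1's generated sources compile against
     these older APIs; newer revisions drop or relocate AvroRemoteException
     and StateManager. Revisions here are assumed, not authoritative. -->
<dependency org="org.apache.avro" name="avro" rev="1.3.3" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

If these look right in ivy.xml, a stale Ivy cache resolving different jars is
the other usual suspect; clearing ~/.ivy2/cache before rebuilding rules it out.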


Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-30 Thread Meraj A. Khan
Thanks for the info, Julien. For the hypothetical example below:

 topN = 200,000
 generate.max.count = 10,000
 generate.count.mode = host

If the number of hosts is 10, and we assume that each of those hosts has
more than 10,000 unfetched URLs in the CrawlDB, then with
generate.max.count set to 10,000, exactly 100,000 URLs would be fetched.

Would the remaining URLs be fetched in the next crawl cycle? Do we need to
worry about losing any URLs in this scenario?
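
For reference, a sketch of how the two properties interact in this example
(the property names are the real ones quoted from nutch-default.xml below;
the values, and the note about leftovers, apply to this example):

<!-- nutch-site.xml sketch for the example above. With count mode 'host',
     each host contributes at most 10,000 URLs per fetchlist, so 10 hosts
     cap the segment at 100,000 URLs even though topN asks for 200,000.
     URLs that are not selected simply stay in the CrawlDB and remain
     eligible for subsequent generate runs, so they are not lost. -->
<property>
  <name>generate.max.count</name>
  <value>10000</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>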





On Thu, Oct 30, 2014 at 6:22 AM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Hi Meraj
>
> You can control the # of URLs per segment with
>
> <property>
>   <name>generate.max.count</name>
>   <value>-1</value>
>   <description>The maximum number of urls in a single
>   fetchlist.  -1 if unlimited. The urls are counted according
>   to the value of the parameter generator.count.mode.
>   </description>
> </property>
>
> <property>
>   <name>generate.count.mode</name>
>   <value>host</value>
>   <description>Determines how the URLs are counted for generator.max.count.
>   Default value is 'host' but can be 'domain'. Note that we do not count
>   per IP in the new version of the Generator.
>   </description>
> </property>
>
> The URLs are grouped into inputs for the map tasks accordingly.
>
> Julien
>
>
> [earlier quoted messages snipped; see Julien's reply below for the full
> thread history]

bin/crawl script losing status updates from the MR job.

2014-10-30 Thread Meraj A. Khan
Hi All,

I am running the bin/crawl script that comes with Nutch 1.7 on Hadoop YARN,
redirecting its output to a log file as shown below.

/opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 >
/tmp/nutch.log 2>&1 &

The issue I am facing is that this script randomly loses track of the
progress updates (like "map 80% reduce 67%") from a running job and gets
stuck there; in the meantime the job completes successfully while the
script keeps waiting for further updates, and as a result the loop of
generate-fetch-update jobs terminates prematurely.

This is so random that I am not able to figure out a particular pattern to
the issue, and I end up restarting the script every so often. Sometimes
this happens in a job as short in duration as the inject phase of Nutch.

Just wondering if anyone has faced this issue. Is the fact that I am
redirecting the output to a log file playing a part in this? What are the
best practices for running a long-running script like bin/crawl? I am
using CentOS 7.x.

Thanks.
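
One variable worth ruling out (an assumption, not a confirmed diagnosis): if
the shell that launched the backgrounded script exits or the SSH session
drops, the script can receive SIGHUP and stop consuming the job's progress
output. Running it fully detached takes that out of the picture, e.g.:

# Same invocation as above, just detached from the launching shell with
# nohup so a dropped terminal session cannot interrupt the script.
nohup /opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 \
  > /tmp/nutch.log 2>&1 &

Running the script inside screen or tmux achieves the same thing and
additionally lets you reattach to watch progress.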


Re: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.

2014-10-30 Thread Lewis John Mcgibbney
Hi Sagar,

On Wed, Oct 29, 2014 at 11:40 PM,  wrote:

>
> Follow these steps:
> 1. Execute 'ant job' i.e. open build.xml and execute 'runtime(default)'
> target.
> It will generate 'runtime' folder in project.
> 2. Open nutch-default.xml and update "plugin.folders" to
> "/home//nutchProject/runtime/local/plugins"
> 3. Refresh the project.
> 4. Run the project.
>

I take it you are talking about running Nutch in Eclipse?
The problem you are having is that the nutch-default.xml you are editing is
not on your Eclipse classpath. Additionally, you need to navigate to the
'Order and Export' tab of the Properties dialogue box and ensure that the
same files are at THE TOP of the classpath there.

HTH
Lewis
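
For reference, the property mentioned in step 2 above looks like this (the
path is hypothetical and must point at your own checkout; overriding it in
nutch-site.xml rather than editing nutch-default.xml is the usual practice):

<property>
  <name>plugin.folders</name>
  <!-- hypothetical path; substitute your own user and project directory -->
  <value>/home/YOUR_USER/nutchProject/runtime/local/plugins</value>
</property>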


Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-30 Thread Julien Nioche
Hi Meraj

You can control the # of URLs per segment with

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generator.max.count.
  Default value is 'host' but can be 'domain'. Note that we do not count
  per IP in the new version of the Generator.
  </description>
</property>

The URLs are grouped into inputs for the map tasks accordingly.

Julien





On 26 October 2014 19:08, Meraj A. Khan  wrote:

> Julien,
>
> On further analysis, I found that it was not a delay at reduce time but
> a long-running fetch map task. When I have multiple fetch map tasks
> running on a single segment, I see that one of the map tasks runs for an
> excessively longer period of time than the other fetch map tasks. It
> seems this is happening because of the disproportionate distribution of
> URLs per map task, meaning that if I have a topN of 1,000,000 and 10
> fetch map tasks, it is apparently not guaranteed that each fetch map
> task will get 100,000 URLs to fetch.
>
> Is it possible to set an upper limit on the max number of URLs per
> fetch map task, along with the collective topN for the whole fetch phase?
>
> Thanks,
> Meraj.
>
> On Sat, Oct 18, 2014 at 2:28 AM, Julien Nioche <
> lists.digitalpeb...@gmail.com> wrote:
>
> > Hi Meraj,
> >
> > What do the logs for the map tasks tell you about the URLs being fetched?
> >
> > J.
> >
> > On 17 October 2014 19:08, Meraj A. Khan  wrote:
> >
> > > Julien,
> > >
> > > Thanks for your suggestion. I looked at the jstack thread dumps, and
> > > I could see that the fetcher threads are in a waiting state and that
> > > the map phase is actually not yet complete, judging by the JobClient
> > > console.
> > >
> > > 14/10/15 12:09:48 INFO mapreduce.Job:  map 95% reduce 31%
> > > 14/10/16 07:11:20 INFO mapreduce.Job:  map 96% reduce 31%
> > > 14/10/17 01:20:56 INFO mapreduce.Job:  map 97% reduce 31%
> > >
> > > And the following is the kind of statement I see in the jstack thread
> > > dump for the Hadoop child processes. Is it possible that these map
> > > tasks are actually waiting on a particular host with some excessive
> > > crawl delay? I already have fetcher.threads.per.queue set to 5,
> > > fetcher.server.delay to 0, fetcher.max.crawl.delay to 10, and
> > > http.max.delays to 1000.
> > >
> > > Please see the jstack log info for the child processes below.
> > >
> > > Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode):
> > >
> > > "Attach Listener" daemon prio=10 tid=0x7fecf8c58000 nid=0x32e5 waiting on condition [0x]
> > >    java.lang.Thread.State: RUNNABLE
> > >
> > > "IPC Client (638223659) connection to /170.75.153.162:40980 from job_1413149941617_0059" daemon prio=10 tid=0x01a5c000 nid=0xce8 in Object.wait() [0x7fecdf80e000]
> > >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> > >         at java.lang.Object.wait(Native Method)
> > >         - waiting on <0x99f8bf48> (a org.apache.hadoop.ipc.Client$Connection)
> > >         at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:899)
> > >         - locked <0x99f8bf48> (a org.apache.hadoop.ipc.Client$Connection)
> > >         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:944)
> > >
> > > "fetcher#5" daemon prio=10 tid=0x7fecf8c49000 nid=0xce7 in Object.wait() [0x7fecdf90f000]
> > >    java.lang.Thread.State: WAITING (on object monitor)
> > >         at java.lang.Object.wait(Native Method)
> > >         - waiting on <0x99f62a68> (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
> > >         at java.lang.Object.wait(Object.java:503)
> > >         at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368)
> > >         - locked <0x99f62a68> (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
> > >         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161)
> > >
> > > "fetcher#4" daemon prio=10 tid=0x7fecf8c47000 nid=0xce6 in Object.wait() [0x7fecdfa1]
> > >    java.lang.Thread.State: WAITING (on object monitor)
> > >         at java.lang.Object.wait(Native Method)
> > >         - waiting on <0x99f62a68> (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
> > >         at java.lang.Object.wait(Object.java:503)
> > >         at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368)
> > >         - locked <0x99f62a68> (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
> > >         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161)
> > >
> > > "fetcher#3" daemon prio=10 tid=0x7fecf8c45800 nid=0xce5 in Object.wait() [0x7fecdfb11

Re: Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.

2014-10-30 Thread Julien Nioche
Thanks for sharing this, Meraj. It's already proving useful to other users.

On 25 September 2014 17:04, Meraj A. Khan  wrote:

> Just wanted to update and let everyone know that this issue of a single
> map task for the fetch was occurring because Generator.java had logic
> around the MRv1 property *mapred.job.tracker*. I had to change that logic
> since I am running this on YARN, and now multiple fetch tasks operate on
> a single segment.
>
> Also, I had misunderstood that multiple segments would need to be
> generated to achieve parallelism; that does not seem to be the case.
> Parallelism at fetch time is achieved by having multiple fetch tasks
> operate on a single segment.
>
> Thanks everyone for your help on resolving this issue.
>
>
>
> On Wed, Sep 24, 2014 at 6:14 PM, Meraj A. Khan  wrote:
>
> > Folks,
> >
> > As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop
> > YARN cluster.
> >
> > In order to scale, I would need to fetch concurrently with multiple map
> > tasks on multiple nodes. I think the first step would be to generate
> > multiple segments in the generate phase so that multiple fetch map
> > tasks can operate in parallel. To generate multiple segments I have
> > made the following changes, but unfortunately I have been unsuccessful
> > in doing so.
> >
> > I have tweaked the following parameters in bin/crawl to do so: I added
> > the *maxNumSegments* and *numFetchers* parameters in the call to
> > generate in the *bin/crawl* script, as can be seen below.
> >
> >
> > *$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb
> > $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers
> $numFetchers
> > -noFilter*
> >
> > (Here $numFetchers has a value of 15)
> >
> > The *generate.max.count*, *generate.count.mode*, and *topN* parameters
> > are all at their default values, meaning I am not providing any values
> > for them.
> >
> > Also, the CrawlDB status before the generate phase is shown below. It
> > shows that the number of unfetched URLs is more than *75 million*, so
> > it is not that there are too few URLs for Generate to produce multiple
> > segments.
> >
> > * CrawlDB status*
> > * db_fetched=318708*
> > * db_gone=4774*
> > * db_notmodified=2274*
> > * db_redir_perm=2253*
> > * db_redir_temp=2527*
> > * db_unfetched=7524*
> >
> > However, I do see this message in the logs consistently during the
> > generate phase.
> >
> >  *Generator: jobtracker is 'local', generating exactly one partition.*
> >
> > Is this "one partition" referring to the single segment that is going
> > to be generated? If so, how do I address this?
> >
> >
> > I feel like I have exhausted all the options but I am unable to have the
> > Generate phase generate more than one segment at a time.
> >
> > Can someone let me know if there is anything else that I should be
> > trying here?
> >
> > *Thanks and any help is much appreciated!*
> >
> >
> >
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
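
For anyone else hitting the "jobtracker is 'local'" message on YARN: the
check Meraj changed looks roughly like the sketch below (paraphrased from
Nutch 1.x's Generator.java; treat the exact code as an approximation, and
the wrapper class here is purely illustrative). Under YARN the MRv1 property
mapred.job.tracker is normally absent, so its default of "local" applies and
the partition count is forced to 1:

import org.apache.hadoop.mapred.JobConf;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative wrapper class; only the if-block mirrors Generator's logic.
public class GeneratorPartitionCheck {
  private static final Logger LOG =
      LoggerFactory.getLogger(GeneratorPartitionCheck.class);

  // Approximation of the Nutch 1.x check: in MRv1 "local" mode a single
  // fetchlist partition makes sense, but on YARN the property is unset,
  // falls back to "local", and wrongly collapses numLists to 1.
  static int adjustNumLists(JobConf job, int numLists) {
    if ("local".equals(job.get("mapred.job.tracker")) && numLists != 1) {
      LOG.info("Generator: jobtracker is 'local', generating exactly one partition.");
      return 1;
    }
    return numLists;
  }
}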