Re: 2.2.1 Compilation Failure
Hi Lixiang,

Please go through the following link: http://nlp.solutions.asia/?p=362

Hope it will solve your problem.

On Fri, Oct 31, 2014 at 7:48 AM, Lixiang Ao wrote:
> Hi all,
>
> I'm a beginner with Nutch and I just followed the instructions here
> http://wiki.apache.org/nutch/Nutch2Tutorial, but I got the following
> errors:
>
> [...full 'ant runtime' build log trimmed; see the original message below...]
>
> Could anyone tell me what might have gone wrong?
>
> Thanks
> Lixiang

-- 
Thanks & Regards,
Handore Sagar
ZYRM INC
Software Engineer
2.2.1 Compilation Failure
Hi all,

I'm a beginner with Nutch and I just followed the instructions here http://wiki.apache.org/nutch/Nutch2Tutorial, but I got the following errors:

aolixiang@lbt:~/IdeaProjects/dwc/nutch$ ant runtime
Buildfile: /home/aolixiang/IdeaProjects/dwc/nutch/build.xml
Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:
    [mkdir] Created dir: /home/aolixiang/IdeaProjects/dwc/nutch/build
    [mkdir] Created dir: /home/aolixiang/IdeaProjects/dwc/nutch/build/classes
    [mkdir] Created dir: /home/aolixiang/IdeaProjects/dwc/nutch/build/release
    [mkdir] Created dir: /home/aolixiang/IdeaProjects/dwc/nutch/build/test
    [mkdir] Created dir: /home/aolixiang/IdeaProjects/dwc/nutch/build/test/classes

clean-lib:

resolve-default:
[ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = /home/aolixiang/IdeaProjects/dwc/nutch/ivy/ivysettings.xml
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

copy-libs:

compile-core:
    [javac] Compiling 180 source files to /home/aolixiang/IdeaProjects/dwc/nutch/build/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] /home/aolixiang/IdeaProjects/dwc/nutch/src/java/org/apache/nutch/storage/WebPage.java:28: error: cannot find symbol
    [javac] import org.apache.avro.ipc.AvroRemoteException;
    [javac]                           ^
    [javac]   symbol:   class AvroRemoteException
    [javac]   location: package org.apache.avro.ipc
    [javac] /home/aolixiang/IdeaProjects/dwc/nutch/src/java/org/apache/nutch/storage/WebPage.java:35: error: cannot find symbol
    [javac] import org.apache.gora.persistency.StateManager;
    [javac]                                   ^
    [javac]   symbol:   class StateManager
    [javac]   location: package org.apache.gora.persistency

... (lots of errors like that)

    [javac] /home/aolixiang/IdeaProjects/dwc/nutch/src/java/org/apache/nutch/storage/Host.java:136: error: cannot find symbol
    [javac]     getStateManager().setDirty(this, 2);
    [javac]     ^
    [javac]   symbol:   method getStateManager()
    [javac]   location: class Host
    [javac] 78 errors
    [javac] 6 warnings

BUILD FAILED
/home/aolixiang/IdeaProjects/dwc/nutch/build.xml:101: Compile failed; see the compiler error output for details.

Could anyone tell me what might have gone wrong?

Thanks
Lixiang
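A hedged sketch of one recovery path, not a verified fix: "cannot find symbol" errors for org.apache.avro and org.apache.gora classes after an otherwise successful `resolve-default` often point at a stale or partially-downloaded Ivy cache, so one thing worth trying is clearing those cached artifacts and rebuilding. The checkout path below is taken from the log in this thread; adjust it to your own.

```shell
# Hedged sketch, not a guaranteed fix: clear the cached Gora/Avro jars so
# Ivy resolves them fresh, then rebuild from a clean state.
cd ~/IdeaProjects/dwc/nutch                  # adjust to your checkout
ant clean                                    # remove the old build/ output
rm -rf ~/.ivy2/cache/org.apache.gora \
       ~/.ivy2/cache/org.apache.avro        # force a fresh dependency resolve
ant runtime                                  # re-resolve jars and recompile
```

If the errors persist after a clean resolve, comparing the jar versions under build/lib against the ones pinned in ivy/ivy.xml is a reasonable next step.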
Re: Reduce phase in Fetcher taking excessive time to finish.
Thanks for the info, Julien.

For the hypothetical example below:

topN = 200,000
generate.max.count = 10,000
generate.count.mode = host

If the number of hosts is 10, and we assume that each of those hosts has more than 10,000 unfetched URLs in the CrawlDB, then since generate.max.count is set to 10,000, exactly 100,000 URLs would be fetched. Would the remaining URLs be fetched in the next crawl cycle? Do we need to consider any data loss (URLs) in this scenario?

On Thu, Oct 30, 2014 at 6:22 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi Meraj
>
> You can control the # of URLs per segment with
>
> <property>
>   <name>generate.max.count</name>
>   <value>-1</value>
>   <description>The maximum number of urls in a single
>   fetchlist. -1 if unlimited. The urls are counted according
>   to the value of the parameter generator.count.mode.
>   </description>
> </property>
>
> <property>
>   <name>generate.count.mode</name>
>   <value>host</value>
>   <description>Determines how the URLs are counted for generator.max.count.
>   Default value is 'host' but can be 'domain'. Note that we do not count
>   per IP in the new version of the Generator.
>   </description>
> </property>
>
> the urls are grouped into inputs for the map tasks accordingly.
>
> Julien
>
> On 26 October 2014 19:08, Meraj A. Khan wrote:
>
> > Julien,
> >
> > On further analysis I found that it was not a delay at reduce time, but
> > a long-running fetch map task. When I have multiple fetch map tasks
> > running on a single segment, I see that one of the map tasks runs for an
> > excessively longer period of time than the other fetch map tasks. It seems
> > this is happening because of the disproportionate distribution of URLs per
> > map task, meaning if I have a topN of 1,000,000 and 10 fetch map tasks, it
> > seems it's not guaranteed that each fetch map task will have 100,000 URLs
> > to fetch.
> >
> > Is it possible to set an upper limit on the max number of URLs per
> > fetch map task, along with the collective topN for the whole fetch phase?
> >
> > Thanks,
> > Meraj.
> >
> > On Sat, Oct 18, 2014 at 2:28 AM, Julien Nioche <
> > lists.digitalpeb...@gmail.com> wrote:
> >
> > > Hi Meraj,
> > >
> > > What do the logs for the map tasks tell you about the URLs being fetched?
> > >
> > > J.
> > >
> > > On 17 October 2014 19:08, Meraj A. Khan wrote:
> > >
> > > > Julien,
> > > >
> > > > Thanks for your suggestion. I looked at the jstack thread dumps, and I
> > > > could see that the fetcher threads are in a waiting state, and actually
> > > > the map phase is not yet complete, looking at the JobClient console.
> > > >
> > > > 14/10/15 12:09:48 INFO mapreduce.Job:  map 95% reduce 31%
> > > > 14/10/16 07:11:20 INFO mapreduce.Job:  map 96% reduce 31%
> > > > 14/10/17 01:20:56 INFO mapreduce.Job:  map 97% reduce 31%
> > > >
> > > > And the following is the kind of statement I see in the jstack thread
> > > > dump for Hadoop child processes. Is it possible that these map tasks are
> > > > actually waiting on a particular host with some excessive crawl-delay? I
> > > > already have fetcher.threads.per.queue set to 5, fetcher.server.delay to 0,
> > > > fetcher.max.crawl.delay to 10 and http.max.delays to 1000.
> > > >
> > > > Please see the jstack log info for the child processes below.
> > > >
> > > > Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode):
> > > >
> > > > "Attach Listener" daemon prio=10 tid=0x7fecf8c58000 nid=0x32e5 waiting
> > > > on condition [0x]
> > > >    java.lang.Thread.State: RUNNABLE
> > > >
> > > > "IPC Client (638223659) connection to /170.75.153.162:40980 from
> > > > job_1413149941617_0059" daemon prio=10 tid=0x01a5c000 nid=0xce8 in
> > > > Object.wait() [0x7fecdf80e000]
> > > >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> > > >         at java.lang.Object.wait(Native Method)
> > > >         - waiting on <0x99f8bf48> (a org.apache.hadoop.ipc.Client$Connection)
> > > >         at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:899)
> > > >         - locked <0x99f8bf48> (a org.apache.hadoop.ipc.Client$Connection)
> > > >         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:944)
> > > >
> > > > "fetcher#5" daemon prio=10 tid=0x7fecf8c49000 nid=0xce7 in
> > > > Object.wait() [0x7fecdf90f000]
> > > >    java.lang.Thread.State: WAITING (on object monitor)
> > > >         at java.lang.Object.wait(Native Method)
> > > >         - waiting on <0x99f62a68> (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
> > > >         at java.lang.Object.wait(Object.java:503)
> > > >         at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368)
> > > >         - locked <0x99f62a68> (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
> > > >         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161)
> > > >
> > > > "fetch
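The per-host capping behaviour asked about above can be sketched in a few lines. This is illustrative only, not the actual Nutch Generator code; the function name and structure are invented for the example.

```python
# Illustrative sketch (not Nutch code): how generate.max.count caps the
# number of URLs selected per host when building a fetchlist.
def select_for_fetchlist(unfetched_per_host, top_n, max_count_per_host):
    """Return how many URLs are selected this cycle; the rest simply stay
    in the CrawlDB and remain eligible for the next generate cycle."""
    selected = 0
    for host, pending in unfetched_per_host.items():
        if selected >= top_n:
            break
        # Each host contributes at most max_count_per_host URLs,
        # and never more than the remaining topN budget.
        take = min(pending, max_count_per_host, top_n - selected)
        selected += take
    return selected

# The scenario from the thread: 10 hosts, each with >10,000 unfetched URLs,
# topN = 200,000, generate.max.count = 10,000 -> only 100,000 URLs selected.
hosts = {f"host{i}": 15_000 for i in range(10)}
print(select_for_fetchlist(hosts, 200_000, 10_000))  # -> 100000
```

Under this model there is no data loss: URLs not selected are untouched in the CrawlDB and become candidates again in the following generate-fetch-update cycle.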
bin/crawl script losing status updates from the MR job.
Hi All,

I am running the bin/crawl script that comes with Nutch 1.7 on Hadoop YARN, redirecting its output to a log file as shown below:

/opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 > /tmp/nutch.log 2>&1 &

The issue I am facing is that, randomly, while running a job this script loses track of updates like "Map 80% Reduce 67%" and gets stuck there. In the meantime the job completes successfully while the script keeps waiting for further updates, and as a result the loop of generate-fetch-update jobs terminates prematurely.

This is so random that I am not able to figure out a particular pattern to the issue, and I end up restarting the script every so often. Sometimes this happens in a job as short in duration as the inject phase of Nutch.

Just wondering if anyone has faced this issue? Is the fact that I am redirecting the output to a log file playing a part in this? What are the best practices for running a long-running script like bin/crawl?

I am using CentOS 7.x.

Thanks.
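One way to make such a long-running launch more defensive is to detach it fully from the login session, so a dropped terminal cannot stall or signal the script. This is a sketch, not a tested recipe; the paths are the ones from the message above.

```shell
# Sketch: launch bin/crawl detached from the terminal session.
# nohup ignores SIGHUP; setsid puts the script in its own session so it
# has no controlling terminal at all. The log redirect is unchanged.
nohup setsid /opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 \
    > /tmp/nutch.log 2>&1 &
echo $! > /tmp/nutch.pid     # keep the PID for later monitoring
# tail -f /tmp/nutch.log     # watch progress without attaching to the job
```

Running the script inside screen or tmux achieves similar isolation and is a common alternative for multi-day crawls.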
Re: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
Hi Segar,

On Wed, Oct 29, 2014 at 11:40 PM, wrote:
>
> Follow these steps:
> 1. Execute 'ant job', i.e. open build.xml and execute the 'runtime (default)' target.
>    It will generate a 'runtime' folder in the project.
> 2. Open nutch-default.xml and update "plugin.folders" to
>    "/home//nutchProject/runtime/local/plugins"
> 3. Refresh the project.
> 4. Run the project.

I take it you are talking about running Nutch in Eclipse? The problem you are having is that the nutch-default.xml you are editing is not on your Eclipse classpath. Additionally, you need to navigate to the 'Order and Export' tab of the Properties dialogue box and ensure that the same files are at THE TOP of the classpath there.

HTH
Lewis
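For reference, the usual place for this override is nutch-site.xml (which takes precedence over nutch-default.xml) on the classpath Lewis describes. A minimal sketch; the path is a placeholder, not the poster's actual directory:

```xml
<!-- nutch-site.xml, on the Eclipse classpath ahead of nutch-default.xml.
     Placeholder path: point it at your own runtime/local/plugins dir. -->
<property>
  <name>plugin.folders</name>
  <value>/path/to/your/nutch/runtime/local/plugins</value>
  <description>Directories where Nutch plugins are located.</description>
</property>
```

Once this file is on the classpath (and exported at the top per the 'Order and Export' note above), the URLNormalizer extension point should resolve.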
Re: Reduce phase in Fetcher taking excessive time to finish.
Hi Meraj

You can control the # of URLs per segment with

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of urls in a single
  fetchlist. -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generator.max.count.
  Default value is 'host' but can be 'domain'. Note that we do not count
  per IP in the new version of the Generator.
  </description>
</property>

the urls are grouped into inputs for the map tasks accordingly.

Julien

On 26 October 2014 19:08, Meraj A. Khan wrote:

> Julien,
>
> On further analysis I found that it was not a delay at reduce time, but
> a long-running fetch map task. When I have multiple fetch map tasks
> running on a single segment, I see that one of the map tasks runs for an
> excessively longer period of time than the other fetch map tasks. It seems
> this is happening because of the disproportionate distribution of URLs per
> map task, meaning if I have a topN of 1,000,000 and 10 fetch map tasks, it
> seems it's not guaranteed that each fetch map task will have 100,000 URLs
> to fetch.
>
> Is it possible to set an upper limit on the max number of URLs per
> fetch map task, along with the collective topN for the whole fetch phase?
>
> Thanks,
> Meraj.
>
> On Sat, Oct 18, 2014 at 2:28 AM, Julien Nioche <
> lists.digitalpeb...@gmail.com> wrote:
>
> > Hi Meraj,
> >
> > What do the logs for the map tasks tell you about the URLs being fetched?
> >
> > J.
> >
> > On 17 October 2014 19:08, Meraj A. Khan wrote:
> >
> > > Julien,
> > >
> > > Thanks for your suggestion. I looked at the jstack thread dumps, and I
> > > could see that the fetcher threads are in a waiting state, and actually
> > > the map phase is not yet complete, looking at the JobClient console.
> > >
> > > 14/10/15 12:09:48 INFO mapreduce.Job:  map 95% reduce 31%
> > > 14/10/16 07:11:20 INFO mapreduce.Job:  map 96% reduce 31%
> > > 14/10/17 01:20:56 INFO mapreduce.Job:  map 97% reduce 31%
> > >
> > > And the following is the kind of statement I see in the jstack thread
> > > dump for Hadoop child processes. Is it possible that these map tasks are
> > > actually waiting on a particular host with some excessive crawl-delay? I
> > > already have fetcher.threads.per.queue set to 5, fetcher.server.delay to 0,
> > > fetcher.max.crawl.delay to 10 and http.max.delays to 1000.
> > >
> > > Please see the jstack log info for the child processes below.
> > >
> > > Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode):
> > >
> > > "Attach Listener" daemon prio=10 tid=0x7fecf8c58000 nid=0x32e5 waiting
> > > on condition [0x]
> > >    java.lang.Thread.State: RUNNABLE
> > >
> > > "IPC Client (638223659) connection to /170.75.153.162:40980 from
> > > job_1413149941617_0059" daemon prio=10 tid=0x01a5c000 nid=0xce8 in
> > > Object.wait() [0x7fecdf80e000]
> > >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> > >         at java.lang.Object.wait(Native Method)
> > >         - waiting on <0x99f8bf48> (a org.apache.hadoop.ipc.Client$Connection)
> > >         at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:899)
> > >         - locked <0x99f8bf48> (a org.apache.hadoop.ipc.Client$Connection)
> > >         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:944)
> > >
> > > "fetcher#5" daemon prio=10 tid=0x7fecf8c49000 nid=0xce7 in
> > > Object.wait() [0x7fecdf90f000]
> > >    java.lang.Thread.State: WAITING (on object monitor)
> > >         at java.lang.Object.wait(Native Method)
> > >         - waiting on <0x99f62a68> (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
> > >         at java.lang.Object.wait(Object.java:503)
> > >         at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368)
> > >         - locked <0x99f62a68> (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
> > >         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161)
> > >
> > > "fetcher#4" daemon prio=10 tid=0x7fecf8c47000 nid=0xce6 in
> > > Object.wait() [0x7fecdfa1]
> > >    java.lang.Thread.State: WAITING (on object monitor)
> > >         at java.lang.Object.wait(Native Method)
> > >         - waiting on <0x99f62a68> (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
> > >         at java.lang.Object.wait(Object.java:503)
> > >         at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368)
> > >         - locked <0x99f62a68> (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
> > >         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161)
> > >
> > > "fetcher#3" daemon prio=10 tid=0x7fecf8c45800 nid=0xce5 in
> > > Object.wait() [0x7fecdfb11
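To actually apply the two properties Julien quotes, the override goes in conf/nutch-site.xml rather than nutch-default.xml. A hedged sketch; the 10,000 value is illustrative, matching the per-host cap discussed in this thread, not a recommendation:

```xml
<!-- conf/nutch-site.xml: overrides nutch-default.xml. Illustrative values. -->
<configuration>
  <property>
    <name>generate.max.count</name>
    <value>10000</value>  <!-- cap each host at 10,000 URLs per fetchlist -->
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value>   <!-- count per host; 'domain' is the alternative -->
  </property>
</configuration>
```

URLs excluded by the cap remain in the CrawlDB and are eligible again on the next generate cycle.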
Re: Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.
Thanks for sharing this Meraj. It's already proving useful to other users.

On 25 September 2014 17:04, Meraj A. Khan wrote:

> Just wanted to update and let everyone know that this issue with a single map
> task for fetch was occurring because Generator.java had logic around the MRv1
> property mapred.job.tracker. I had to change that logic as I am running this
> on YARN, and now multiple fetch tasks operate on a single segment.
>
> Also, I misunderstood that multiple segments would need to be generated to
> achieve parallelism; that does not seem to be the case. Parallelism at
> fetch time is achieved by having multiple fetch tasks operate on a single
> segment.
>
> Thanks everyone for your help on resolving this issue.
>
> On Wed, Sep 24, 2014 at 6:14 PM, Meraj A. Khan wrote:
>
> > Folks,
> >
> > As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN
> > cluster.
> >
> > In order to scale, I would need to fetch concurrently with multiple map
> > tasks on multiple nodes. I think the first step would be to generate
> > multiple segments in the generate phase so that multiple fetch map tasks
> > can operate in parallel, and in order to generate multiple segments at
> > generate time I have made the following changes, but unfortunately I have
> > been unsuccessful in doing so.
> >
> > I have tweaked the following parameters in bin/crawl: I added the
> > maxNumSegments and numFetchers parameters to the call to generate in the
> > bin/crawl script, as can be seen below.
> >
> > $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
> >
> > (Here $numFetchers has a value of 15.)
> >
> > generate.max.count, generate.count.mode and topN are all at their default
> > values, meaning I am not providing any values for them.
> >
> > Also, the CrawlDB status before the generate phase is as shown below. It
> > shows that the number of unfetched URLs is more than 75 million, so it's
> > not that there are not enough URLs for generate to produce multiple
> > segments.
> >
> > CrawlDB status:
> > db_fetched=318708
> > db_gone=4774
> > db_notmodified=2274
> > db_redir_perm=2253
> > db_redir_temp=2527
> > db_unfetched=7524
> >
> > However, I do see this message in the logs consistently during the
> > generate phase:
> >
> > Generator: jobtracker is 'local', generating exactly one partition.
> >
> > Is this "one partition" referring to the single segment that is going to
> > be generated? If so, how do I address this?
> >
> > I feel like I have exhausted all the options, but I am unable to have the
> > generate phase generate more than one segment at a time.
> >
> > Can someone let me know if there is anything else that I should be trying
> > here?
> >
> > Thanks, and any help is much appreciated!

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
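Meraj's actual fix was to patch the mapred.job.tracker check in Generator.java. An alternative worth checking first, sketched here as an assumption (it applies when the submitting node's client configuration was simply missing, which is one common way to end up with the "jobtracker is 'local'" message on a Hadoop 2.x/YARN cluster):

```xml
<!-- mapred-site.xml on the node that submits the crawl. A sketch only:
     with Hadoop 2.x this makes the client submit to YARN instead of the
     local job runner, which is what makes legacy mapred.job.tracker
     checks report 'local' and collapse the job to one partition. -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```

If the client already submits to YARN and the message persists, then the Generator's MRv1-era check is the remaining culprit, as Meraj found.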