Is that roughly when the memory goes out of control? It could be a dodgy URL sending the URL normalisation into a spin: one gets all sorts of horrors after a while.
Maybe try running the generation step with '-noNorm' and see whether that has any impact. It would also be good to know in which job, and in which map or reduce task, the issue is happening; you can use the Hadoop jobtracker GUI in pseudo-distributed mode to see that.

Thanks

Julien

On 3 February 2011 00:28, axierr <[email protected]> wrote:

>
>
> Here are the results, I'm going to do now without url partitioning :
>
> nutch generator output -
> Generator: starting at 2011-02-02 20:11:00
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
>
> jstack output -
> Full thread dump Java HotSpot(TM) Client VM (17.1-b03 mixed mode, sharing):
>
> "communication thread" daemon prio=10 tid=0x0a104800 nid=0x637 runnable [0xb3cad000]
>    java.lang.Thread.State: RUNNABLE
>         at java.lang.Object.getClass(Native Method)
>         at java.util.ArrayList.<init>(ArrayList.java:134)
>         at org.apache.hadoop.fs.FileSystem.getAllStatistics(FileSystem.java:1567)
>         - locked <0x8ef584c8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
>         at org.apache.hadoop.mapred.Task.updateCounters(Task.java:652)
>         - locked <0x66a3d020> (a org.apache.hadoop.mapred.ReduceTask)
>         at org.apache.hadoop.mapred.Task.access$600(Task.java:56)
>         at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:539)
>         at java.lang.Thread.run(Thread.java:662)
>
> "Attach Listener" daemon prio=10 tid=0x0a0a9800 nid=0x7e31 waiting on condition [0x00000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Thread-13" prio=10 tid=0xb3b27800 nid=0x6fd9 runnable [0xb3cfe000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.ArrayList.size(ArrayList.java:177)
>         at java.util.AbstractList$Itr.hasNext(AbstractList.java:339)
>         at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:168)
>         - locked <0x66a48778> (a org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer)
>         at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:179)
>         - locked <0x66a48778> (a org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer)
>         at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
>         at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:244)
>         at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:109)
>         at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
>
> "Low Memory Detector" daemon prio=10 tid=0x09ec8800 nid=0x6fb3 runnable [0x00000000]
>    java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x09ec6800 nid=0x6fb2 waiting on condition [0x00000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x09ec4c00 nid=0x6fb1 runnable [0x00000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x09ec0800 nid=0x6fb0 in Object.wait() [0xb46cc000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x65410258> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>         - locked <0x65410258> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x09ebbc00 nid=0x6faf in Object.wait() [0xb471d000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x654102e8> (a java.lang.ref.Reference$Lock)
>         at java.lang.Object.wait(Object.java:485)
>         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>         - locked <0x654102e8> (a java.lang.ref.Reference$Lock)
>
> "main" prio=10 tid=0x09e97000 nid=0x6fad runnable [0xb6c97000]
>    java.lang.Thread.State: RUNNABLE
>         at java.text.DecimalFormatSymbols.initialize(DecimalFormatSymbols.java:509)
>         at java.text.DecimalFormatSymbols.<init>(DecimalFormatSymbols.java:77)
>         at java.text.DecimalFormat.<init>(DecimalFormat.java:416)
>         at org.apache.hadoop.util.StringUtils.formatPercent(StringUtils.java:113)
>         at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1283)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
>         at org.apache.nutch.crawl.Generator.generate(Generator.java:526)
>         at org.apache.nutch.crawl.Generator.run(Generator.java:692)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.Generator.main(Generator.java:648)
>
> "VM Thread" prio=10 tid=0x09eba400 nid=0x6fae runnable
>
> "VM Periodic Task Thread" prio=10 tid=0x09ecac00 nid=0x6fb4 waiting on condition
>
> JNI global references: 1419
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-1-2-performance-and-memory-issues-tp2407256p2410061.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
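
The thread dump shows the reduce thread ("Thread-13") inside RegexURLNormalizer.regexNormalize, which fits the dodgy-URL theory. Below is a minimal sketch of how one might hunt for such a URL outside Nutch, assuming the normalisation rules behave like ordinary java.util.regex substitutions: time each candidate URL against the rules and print the slow ones. The class name NormalizerTiming, the two rules and the 100 ms threshold are all made up for illustration; they are not the contents of Nutch's regex-normalize.xml and this is not Nutch code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class NormalizerTiming {
    // Hypothetical rules for illustration only; NOT Nutch's regex-normalize.xml.
    static final Map<Pattern, String> RULES = new LinkedHashMap<Pattern, String>();
    static {
        RULES.put(Pattern.compile("(?i);jsessionid=[^?#&]*"), ""); // strip session ids
        RULES.put(Pattern.compile("&{2,}"), "&");                  // collapse repeated '&'
    }

    // Apply every rule in insertion order, as a chain of regex substitutions.
    static String normalize(String url) {
        String out = url;
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            out = rule.getKey().matcher(out).replaceAll(rule.getValue());
        }
        return out;
    }

    // Wall-clock time, in milliseconds, spent normalising one URL.
    static long timeMillis(String url) {
        long start = System.nanoTime();
        normalize(url);
        return (System.nanoTime() - start) / 1000000L;
    }

    public static void main(String[] args) {
        long thresholdMillis = 100; // arbitrary cut-off, tune to taste
        for (String url : args) {
            long took = timeMillis(url);
            if (took > thresholdMillis) {
                // Report only the URLs that were slow to normalise.
                System.out.println(took + " ms\t" + url);
            }
        }
    }
}
```

Pass the suspect URLs as command-line arguments (for example a sample taken from the crawldb). Anything printed took longer than the threshold to normalise and is worth a closer look; if nothing stands out, the normaliser is less likely to be the culprit.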

