I'll fix NUTCH-2466 this afternoon. 
 
-----Original message-----
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Wednesday 17th January 2018 14:09
> To: user@nutch.apache.org
> Subject: Re: SitemapProcessor destroyed our CrawlDB
> 
> It was finally Omkar who brought NUTCH-2442 forward.
> Time to review the patch of NUTCH-2466!
> 
> On 01/17/2018 01:53 PM, Markus Jelsma wrote:
> > Ah thanks!
> > 
> > I knew you'd fixed some of these; now I know my patch for NUTCH-2466 
> > silently removes your commit!
> > 
> > My bad, thanks!
> > Markus 
> >  
> > -----Original message-----
> >> From:Sebastian Nagel <wastl.na...@googlemail.com>
> >> Sent: Wednesday 17th January 2018 13:32
> >> To: user@nutch.apache.org
> >> Subject: Re: SitemapProcessor destroyed our CrawlDB
> >>
> >> Hi Markus,
> >>
> >> the problem should be fixed with NUTCH-2442; it wasn't yet fixed in the 
> >> first version of the sitemap processor. It is also mandatory to check 
> >> the return value of job.waitForCompletion(true); only checking for 
> >> exceptions isn't enough!
> >>
> >> Sebastian
> >>
> >> On 01/17/2018 11:51 AM, Markus Jelsma wrote:
> >>> Hello,
> >>>
> >>> We noticed some abnormalities in our crawl cycle caused by a sudden 
> >>> reduction of our CrawlDB's size. The SitemapProcessor ran, failed (timed 
> >>> out, see below) and left us with a decimated CrawlDB.
> >>>
> >>> This is odd, because the code cleans up on failure:
> >>>
> >>>     } catch (Exception e) {
> >>>       if (fs.exists(tempCrawlDb))
> >>>         fs.delete(tempCrawlDb, true);
> >>>
> >>>       LockUtil.removeLockFile(fs, lock);
> >>>       throw e;
> >>>     }
> >>>
> >>> Any ideas?
> >>>
> >>> Thanks,
> >>> Markus
> >>>
> >>> Full thread dump OpenJDK 64-Bit Server VM (25.151-b12 mixed mode):
> >>>
> >>> "Thread-52" #74 prio=5 os_prio=0 tid=0x00007fe2adc85000 nid=0x6cf8 runnable [0x00007fe28a86d000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>> at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3797)
> >>> at java.util.regex.Pattern$Start.match(Pattern.java:3461)
> >>> at java.util.regex.Matcher.search(Matcher.java:1248)
> >>> at java.util.regex.Matcher.find(Matcher.java:637)
> >>> at java.util.regex.Matcher.replaceAll(Matcher.java:951)
> >>> at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:193)
> >>> at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:200)
> >>> at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:319)
> >>> at org.apache.nutch.util.SitemapProcessor$SitemapMapper.filterNormalize(SitemapProcessor.java:176)
> >>> at org.apache.nutch.util.SitemapProcessor$SitemapMapper.generateSitemapUrlDatum(SitemapProcessor.java:225)
> >>> at org.apache.nutch.util.SitemapProcessor$SitemapMapper.generateSitemapUrlDatum(SitemapProcessor.java:264)
> >>> at org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(SitemapProcessor.java:154)
> >>> at org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(SitemapProcessor.java:95)
> >>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
> >>> at org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:273)
> >>>
> >>> "SpillThread" #34 daemon prio=5 os_prio=0 tid=0x00007fe2ada12000 
> >>> nid=0x6c2f waiting on condition [0x00007fe28d2ad000]
> >>>    java.lang.Thread.State: WAITING (parking) 
> >>> at sun.misc.Unsafe.park(Native Method) 
> >>> - parking to wait for  <0x00000000ede6dc80> (a 
> >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) 
> >>> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 
> >>> at 
> >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> >>>  
> >>> at 
> >>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1530)
> >>>
> >>> "org.apache.hadoop.hdfs.PeerCache@1fc0053e" #33 daemon prio=5 os_prio=0 
> >>> tid=0x00007fe2ad7fe000 nid=0x6be7 waiting on condition 
> >>> [0x00007fe28d3ae000]
> >>>    java.lang.Thread.State: TIMED_WAITING (sleeping) 
> >>> at java.lang.Thread.sleep(Native Method) 
> >>> at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:253) 
> >>> at org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46) 
> >>> at org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124) 
> >>> at java.lang.Thread.run(Thread.java:748)
> >>>
> >>> "communication thread" #28 daemon prio=5 os_prio=0 tid=0x00007fe2ad975800 
> >>> nid=0x6b9e in Object.wait() [0x00007fe28d8b1000]
> >>>    java.lang.Thread.State: TIMED_WAITING (on object monitor) 
> >>> at java.lang.Object.wait(Native Method) 
> >>> at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:799) 
> >>> - locked <0x00000000ede69ae8> (a java.lang.Object) 
> >>> at java.lang.Thread.run(Thread.java:748)
> >>>
> >>> "client DomainSocketWatcher" #27 daemon prio=5 os_prio=0 
> >>> tid=0x00007fe2ad952000 nid=0x6b95 runnable [0x00007fe28d9b2000]
> >>>    java.lang.Thread.State: RUNNABLE 
> >>> at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method) 
> >>> at 
> >>> org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52)
> >>>  
> >>> at 
> >>> org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:503)
> >>>  
> >>> at java.lang.Thread.run(Thread.java:748)
> >>>
> >>> "Thread for syncLogs" #26 daemon prio=5 os_prio=0 tid=0x00007fe2ad820000 
> >>> nid=0x6b81 waiting on condition [0x00007fe28deb3000]
> >>>    java.lang.Thread.State: TIMED_WAITING (parking) 
> >>> at sun.misc.Unsafe.park(Native Method) 
> >>> - parking to wait for  <0x00000000e7118190> (a 
> >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) 
> >>> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) 
> >>> at 
> >>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> >>>  
> >>> at 
> >>> java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
> >>>  
> >>> at 
> >>> java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
> >>>  
> >>> at 
> >>> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> >>>  
> >>> at 
> >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> >>>  
> >>> at 
> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >>>  
> >>> at java.lang.Thread.run(Thread.java:748)
> >>>
> >>> "org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner"
> >>>  #24 daemon prio=5 os_prio=0 tid=0x00007fe2ad746800 nid=0x6b79 in 
> >>> Object.wait() [0x00007fe28e1cc000]
> >>>    java.lang.Thread.State: WAITING (on object monitor) 
> >>> at java.lang.Object.wait(Native Method) 
> >>> - waiting on <0x00000000e7171060> (a java.lang.ref.ReferenceQueue$Lock) 
> >>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143) 
> >>> - locked <0x00000000e7171060> (a java.lang.ref.ReferenceQueue$Lock) 
> >>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164) 
> >>> at 
> >>> org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3212)
> >>>  
> >>> at java.lang.Thread.run(Thread.java:748)
> >>>
> >>> "IPC Parameter Sending Thread #0" #23 daemon prio=5 os_prio=0 
> >>> tid=0x00007fe2ad637000 nid=0x6b6d waiting on condition 
> >>> [0x00007fe28e4cd000]
> >>>    java.lang.Thread.State: TIMED_WAITING (parking) 
> >>> at sun.misc.Unsafe.park(Native Method) 
> >>> - parking to wait for  <0x00000000e7117338> (a 
> >>> java.util.concurrent.SynchronousQueue$TransferStack) 
> >>> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) 
> >>> at 
> >>> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
> >>>  
> >>> at 
> >>> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
> >>>  
> >>> at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941) 
> >>> at 
> >>> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
> >>>  
> >>> at 
> >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> >>>  
> >>> at 
> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >>>  
> >>> at java.lang.Thread.run(Thread.java:748)
> >>>
> >>> "IPC Client (1373419525) connection to /89.188.14.3:36783 from 
> >>> job_1516025831039_0247" #22 daemon prio=5 os_prio=0 
> >>> tid=0x00007fe2ad632000 nid=0x6b6c in Object.wait() [0x00007fe28e5ce000]
> >>>    java.lang.Thread.State: TIMED_WAITING (on object monitor) 
> >>> at java.lang.Object.wait(Native Method) 
> >>> at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1008) 
> >>> - locked <0x00000000e7119130> (a org.apache.hadoop.ipc.Client$Connection) 
> >>> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1052)
> >>>
> >>> "Timer for 'MapTask' metrics system" #21 daemon prio=5 os_prio=0 
> >>> tid=0x00007fe2ad525800 nid=0x6b5f in Object.wait() [0x00007fe28f551000]
> >>>    java.lang.Thread.State: TIMED_WAITING (on object monitor) 
> >>> at java.lang.Object.wait(Native Method) 
> >>> at java.util.TimerThread.mainLoop(Timer.java:552) 
> >>> - locked <0x00000000e713ca30> (a java.util.TaskQueue) 
> >>> at java.util.TimerThread.run(Timer.java:505)
> >>>
> >>> "Service Thread" #17 daemon prio=9 os_prio=0 tid=0x00007fe2ac0f8800 
> >>> nid=0x6b36 runnable [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C1 CompilerThread11" #16 daemon prio=9 os_prio=0 tid=0x00007fe2ac0eb800 
> >>> nid=0x6b34 waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C1 CompilerThread10" #15 daemon prio=9 os_prio=0 tid=0x00007fe2ac0e9800 
> >>> nid=0x6b33 waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C1 CompilerThread9" #14 daemon prio=9 os_prio=0 tid=0x00007fe2ac0e7000 
> >>> nid=0x6b32 waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C1 CompilerThread8" #13 daemon prio=9 os_prio=0 tid=0x00007fe2ac0e5000 
> >>> nid=0x6b31 waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C2 CompilerThread7" #12 daemon prio=9 os_prio=0 tid=0x00007fe2ac0e3000 
> >>> nid=0x6b30 waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C2 CompilerThread6" #11 daemon prio=9 os_prio=0 tid=0x00007fe2ac0e1000 
> >>> nid=0x6b2f waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C2 CompilerThread5" #10 daemon prio=9 os_prio=0 tid=0x00007fe2ac0de800 
> >>> nid=0x6b2d waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C2 CompilerThread4" #9 daemon prio=9 os_prio=0 tid=0x00007fe2ac0d4800 
> >>> nid=0x6b2b waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C2 CompilerThread3" #8 daemon prio=9 os_prio=0 tid=0x00007fe2ac0d2800 
> >>> nid=0x6b2a waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C2 CompilerThread2" #7 daemon prio=9 os_prio=0 tid=0x00007fe2ac0ce000 
> >>> nid=0x6b29 waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C2 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007fe2ac0cc000 
> >>> nid=0x6b28 waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007fe2ac0c9000 
> >>> nid=0x6b26 waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007fe2ac0c7000 
> >>> nid=0x6b24 waiting on condition [0x0000000000000000]
> >>>    java.lang.Thread.State: RUNNABLE
> >>>
> >>> "Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007fe2ac0a0000 nid=0x6ab7 
> >>> in Object.wait() [0x00007fe29592c000]
> >>>    java.lang.Thread.State: WAITING (on object monitor) 
> >>> at java.lang.Object.wait(Native Method) 
> >>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143) 
> >>> - locked <0x00000000e72f0140> (a java.lang.ref.ReferenceQueue$Lock) 
> >>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164) 
> >>> at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
> >>>
> >>> "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007fe2ac09b800 
> >>> nid=0x6ab6 in Object.wait() [0x00007fe295a2d000]
> >>>    java.lang.Thread.State: WAITING (on object monitor) 
> >>> at java.lang.Object.wait(Native Method) 
> >>> at java.lang.Object.wait(Object.java:502) 
> >>> at java.lang.ref.Reference.tryHandlePending(Reference.java:191) 
> >>> - locked <0x00000000e72f0180> (a java.lang.ref.Reference$Lock) 
> >>> at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
> >>>
> >>> "main" #1 prio=5 os_prio=0 tid=0x00007fe2ac014000 nid=0x6a69 in Object.wait() [0x00007fe2b5747000]
> >>>    java.lang.Thread.State: WAITING (on object monitor)
> >>> at java.lang.Object.wait(Native Method)
> >>> - waiting on <0x00000000edb32e88> (a org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner)
> >>> at java.lang.Thread.join(Thread.java:1252)
> >>> - locked <0x00000000edb32e88> (a org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner)
> >>> at java.lang.Thread.join(Thread.java:1326)
> >>> at org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper.run(MultithreadedMapper.java:146)
> >>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
> >>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> >>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
> >>> at java.security.AccessController.doPrivileged(Native Method)
> >>> at javax.security.auth.Subject.doAs(Subject.java:422)
> >>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
> >>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> >>>
> >>> "VM Thread" os_prio=0 tid=0x00007fe2ac093800 nid=0x6aad runnable 
> >>>
> >>> "GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007fe2ac029000 
> >>> nid=0x6a77 runnable 
> >>>
> >>> "GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007fe2ac02b000 
> >>> nid=0x6a79 runnable 
> >>>
> >>> "GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007fe2ac02c800 
> >>> nid=0x6a7b runnable 
> >>>
> >>> "GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007fe2ac02e800 
> >>> nid=0x6a7d runnable 
> >>>
> >>> "GC task thread#4 (ParallelGC)" os_prio=0 tid=0x00007fe2ac030000 
> >>> nid=0x6a80 runnable 
> >>>
> >>> "GC task thread#5 (ParallelGC)" os_prio=0 tid=0x00007fe2ac032000 
> >>> nid=0x6a95 runnable 
> >>>
> >>> "GC task thread#6 (ParallelGC)" os_prio=0 tid=0x00007fe2ac033800 
> >>> nid=0x6a96 runnable 
> >>>
> >>> "GC task thread#7 (ParallelGC)" os_prio=0 tid=0x00007fe2ac035800 
> >>> nid=0x6a97 runnable 
> >>>
> >>> "GC task thread#8 (ParallelGC)" os_prio=0 tid=0x00007fe2ac037000 
> >>> nid=0x6a98 runnable 
> >>>
> >>> "GC task thread#9 (ParallelGC)" os_prio=0 tid=0x00007fe2ac039000 
> >>> nid=0x6a99 runnable 
> >>>
> >>> "GC task thread#10 (ParallelGC)" os_prio=0 tid=0x00007fe2ac03a800 
> >>> nid=0x6a9a runnable 
> >>>
> >>> "GC task thread#11 (ParallelGC)" os_prio=0 tid=0x00007fe2ac03c800 
> >>> nid=0x6a9b runnable 
> >>>
> >>> "GC task thread#12 (ParallelGC)" os_prio=0 tid=0x00007fe2ac03e000 
> >>> nid=0x6a9c runnable 
> >>>
> >>> "VM Periodic Task Thread" os_prio=0 tid=0x00007fe2ac0fb000 nid=0x6b38 
> >>> waiting on condition 
> >>>
> >>> JNI global references: 275
> >>>
> >>> Heap
> >>>  PSYoungGen      total 116224K, used 105934K [0x00000000f7b00000, 
> >>> 0x0000000100000000, 0x0000000100000000)
> >>>   eden space 100864K, 89% used 
> >>> [0x00000000f7b00000,0x00000000fd3a6228,0x00000000fdd80000)
> >>>   from space 15360K, 98% used 
> >>> [0x00000000fdd80000,0x00000000fec4d7a0,0x00000000fec80000)
> >>>   to   space 19456K, 0% used 
> >>> [0x00000000fed00000,0x00000000fed00000,0x0000000100000000)
> >>>  ParOldGen       total 273408K, used 189187K [0x00000000e7000000, 
> >>> 0x00000000f7b00000, 0x00000000f7b00000)
> >>>   object space 273408K, 69% used 
> >>> [0x00000000e7000000,0x00000000f28c0c88,0x00000000f7b00000)
> >>>  Metaspace       used 33001K, capacity 33602K, committed 34048K, reserved 
> >>> 1079296K
> >>>   class space    used 3581K, capacity 3675K, committed 3840K, reserved 
> >>> 1048576K
> >>>
> >>
> >>
> 
> 
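
For reference, Sebastian's point about job.waitForCompletion(true) could be sketched roughly as below. This is only an illustration and not the NUTCH-2442 patch: SketchJob, CrawlDbUpdateSketch and the recorded action strings are hypothetical stand-ins so the example runs without Hadoop. The one thing taken from Hadoop is that Job.waitForCompletion(boolean) returns false for a failed job, so a catch block alone never sees that failure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the one Hadoop call discussed in the thread.
// Like Job.waitForCompletion(boolean), it returns false if the job failed.
interface SketchJob {
  boolean waitForCompletion(boolean verbose) throws Exception;
}

class CrawlDbUpdateSketch {
  // Records the steps taken, so no real file system or lock is needed.
  final List<String> actions = new ArrayList<>();

  void update(SketchJob job) throws Exception {
    try {
      // A failed job (for example, a task timeout) is reported through the
      // return value, not necessarily through an exception.
      boolean success = job.waitForCompletion(true);
      if (!success) {
        // Turn the failure into an exception so it takes the cleanup path.
        throw new IllegalStateException("job did not succeed");
      }
      // Only a successful job may replace the current CrawlDb.
      actions.add("install temp CrawlDb");
      actions.add("remove lock");
    } catch (Exception e) {
      // Same cleanup steps as in the catch block quoted in the thread.
      actions.add("delete temp CrawlDb");
      actions.add("remove lock");
      throw e;
    }
  }

  public static void main(String[] args) throws Exception {
    CrawlDbUpdateSketch ok = new CrawlDbUpdateSketch();
    ok.update(verbose -> true);
    System.out.println("succeeded: " + ok.actions);

    CrawlDbUpdateSketch failed = new CrawlDbUpdateSketch();
    try {
      failed.update(verbose -> false);
    } catch (IllegalStateException e) {
      System.out.println("failed: " + failed.actions);
    }
  }
}
```

Without the check on the return value, the failed run would fall through to the install step, which matches the decimated CrawlDB described at the start of the thread.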
