RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
Hello Lewis, We do have some weird and complicated rules, but these should not time out for 450 seconds, i.e. keep the JVM busy for that amount of time. We haven't fully investigated yet, so it is possible that some sitemap entries are very long and complicated. But 450 seconds, very odd,

Re: Getting Error

2018-01-17 Thread govind nitk
Hi Sebastian and Lewis, I did a build on another machine and diffed the runtime log. The issue became pretty clear: the build was not proper. Got it resolved. Happy crawling. Regards, GoViNd On Mon, Jan 15, 2018 at 2:04 AM, Sebastian Nagel wrote: > Hi Govind, > >

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
I'll fix NUTCH-2466 this afternoon. -Original message- > From:Sebastian Nagel > Sent: Wednesday 17th January 2018 14:09 > To: user@nutch.apache.org > Subject: Re: SitemapProcessor destroyed our CrawlDB > > It was finally Omkar who brought NUTCH-2442

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
Ah thanks! I knew you'd fixed some of these, now I know my patch of NUTCH-2466 silently removes your commit! My bad, thanks! Markus -Original message- > From: Sebastian Nagel > Sent: Wednesday 17th January 2018 13:32 > To: user@nutch.apache.org > Subject:

Re: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Sebastian Nagel
It was finally Omkar who brought NUTCH-2442 forward. Time to review the patch of NUTCH-2466! On 01/17/2018 01:53 PM, Markus Jelsma wrote: > Ah thanks! > > I knew you'd fixed some of these, now I know my patch of NUTCH-2466 silently > removes your commit! > > My bad, thanks! > Markus > >

SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
Hello, We noticed some abnormalities in our crawl cycle caused by a sudden reduction of our CrawlDB's size. The SitemapProcessor ran, failed (timed out, see below) and left us with a decimated CrawlDB. This is odd because of: } catch (Exception e) { if (fs.exists(tempCrawlDb))