Can you point me to the protocol plugin in the SVN repository, please? http://svn.apache.org/repos/asf/nutch/
Thank you

Lewis

On Wed, Aug 29, 2012 at 3:22 PM, Matteo Simoncini <[email protected]> wrote:
> Sorry, I forgot it.
>
> 1.5
>
> Matteo
>
> 2012/8/29 Lewis John Mcgibbney <[email protected]>:
>> What version of Nutch is this?
>>
>> Lewis
>>
>> On Wed, Aug 29, 2012 at 9:58 AM, xpow <[email protected]> wrote:
>>> Hello,
>>>
>>> I've tried to use the protocol-smb plugin with nutch. The nutch read and
>>> parsed the documents correctly, but afterward, when it hit the crawldb,
>>> crawl.CrawlDbReducer, i got a lot of 'crawl.CrawlDbReducer - Missing fetch
>>> and old value, signature=[B@34d0cdd0', which causing no documents get
>>> indexed with solr ...
>>>
>>> Can anyone help me to pinpoint what was going on??
>>>
>>> Thanks
>>>
>>> Here's the log file:
>>> 2012-08-29 13:54:52,641 INFO parse.ParseSegment - Parsing: smb://192.168.3.6/share/putusan/putusan_sidang_PUTUSAN 48-2011 TELAH baca.pdf
>>> 2012-08-29 13:54:53,576 INFO parse.ParseSegment - Parsing: smb://192.168.3.6/share/putusan/putusan_sidang_Putusan 55 PUU-2010-TELAH BACA.pdf
>>> 2012-08-29 13:54:53,612 INFO parse.ParseSegment - Parsing: smb://192.168.3.6/share/putusan/putusan_sidang_Putusan Sela 108 PHPU 2011.pdf
>>> 2012-08-29 13:54:53,930 INFO regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
>>> 2012-08-29 13:54:55,087 INFO parse.ParseSegment - ParseSegment: finished at 2012-08-29 13:54:55, elapsed: 00:00:28
>>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: starting at 2012-08-29 13:54:55
>>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: db: crawl/crawldb
>>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: segments: [crawl/segments/20120829134849]
>>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
>>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
>>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
>>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
>>> 2012-08-29 13:54:55,104 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>>> 2012-08-29 13:54:55,584 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
>>> 2012-08-29 13:54:55,765 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
>>> 2012-08-29 13:54:56,121 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
>>> 2012-08-29 13:54:56,160 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>> 2012-08-29 13:54:56,160 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>> 2012-08-29 13:54:56,160 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>> 2012-08-29 13:54:56,198 WARN crawl.CrawlDbReducer - Missing fetch and old value, signature=[B@34d0cdd0
>>> 2012-08-29 13:54:56,199 WARN crawl.CrawlDbReducer - Missing fetch and old value, signature=[B@78782dc6
>>> 2012-08-29 13:54:56,199 WARN crawl.CrawlDbReducer - Missing fetch and old value, signature=[B@1a055ff4
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/Nutch-SMB-protocol-tp4003945.html
>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>> --
>> Lewis

--
Lewis

