I'm not so familiar with SVN. Is this what you mean? http://svn.apache.org/repos/asf/nutch/branches/branch-1.5/
Matteo

2012/8/29 Lewis John Mcgibbney <[email protected]>

> In the SVN area can you point me to the protocol plugin please?
>
> http://svn.apache.org/repos/asf/nutch/
>
> Thank you
>
> Lewis
>
> On Wed, Aug 29, 2012 at 3:22 PM, Matteo Simoncini <[email protected]> wrote:
> > Sorry, I forgot it.
> >
> > 1.5
> >
> > Matteo
> >
> > 2012/8/29 Lewis John Mcgibbney <[email protected]>:
> >> What version of Nutch is this?
> >>
> >> Lewis
> >>
> >> On Wed, Aug 29, 2012 at 9:58 AM, xpow <[email protected]> wrote:
> >>> Hello,
> >>>
> >>> I've tried to use the protocol-smb plugin with nutch. The nutch read and
> >>> parsed the documents correctly, but afterward, when it hit the crawldb,
> >>> crawl.CrawlDbReducer, i got a lot of 'crawl.CrawlDbReducer - Missing fetch
> >>> and old value, signature=[B@34d0cdd0', which causing no documents get
> >>> indexed with solr ...
> >>>
> >>> Can anyone help me to pinpoint what was going on??
> >>>
> >>> Thanks
> >>>
> >>> Here's the log file:
> >>> 2012-08-29 13:54:52,641 INFO parse.ParseSegment - Parsing: smb://192.168.3.6/share/putusan/putusan_sidang_PUTUSAN 48-2011 TELAH baca.pdf
> >>> 2012-08-29 13:54:53,576 INFO parse.ParseSegment - Parsing: smb://192.168.3.6/share/putusan/putusan_sidang_Putusan 55 PUU-2010-TELAH BACA.pdf
> >>> 2012-08-29 13:54:53,612 INFO parse.ParseSegment - Parsing: smb://192.168.3.6/share/putusan/putusan_sidang_Putusan Sela 108 PHPU 2011.pdf
> >>> 2012-08-29 13:54:53,930 INFO regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
> >>> 2012-08-29 13:54:55,087 INFO parse.ParseSegment - ParseSegment: finished at 2012-08-29 13:54:55, elapsed: 00:00:28
> >>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: starting at 2012-08-29 13:54:55
> >>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: db: crawl/crawldb
> >>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: segments: [crawl/segments/20120829134849]
> >>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> >>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> >>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> >>> 2012-08-29 13:54:55,103 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
> >>> 2012-08-29 13:54:55,104 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> >>> 2012-08-29 13:54:55,584 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> >>> 2012-08-29 13:54:55,765 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> >>> 2012-08-29 13:54:56,121 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> >>> 2012-08-29 13:54:56,160 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >>> 2012-08-29 13:54:56,160 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> >>> 2012-08-29 13:54:56,160 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> >>> 2012-08-29 13:54:56,198 WARN crawl.CrawlDbReducer - Missing fetch and old value, signature=[B@34d0cdd0
> >>> 2012-08-29 13:54:56,199 WARN crawl.CrawlDbReducer - Missing fetch and old value, signature=[B@78782dc6
> >>> 2012-08-29 13:54:56,199 WARN crawl.CrawlDbReducer - Missing fetch and old value, signature=[B@1a055ff4
> >>>
> >>> --
> >>> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-SMB-protocol-tp4003945.html
> >>> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >> --
> >> Lewis
>
> --
> Lewis

