Hello,

I've tried to use the protocol-smb plugin with Nutch. Nutch read and
parsed the documents correctly, but afterward, during the CrawlDb update
(crawl.CrawlDbReducer), I got a lot of 'crawl.CrawlDbReducer - Missing fetch
and old value, signature=[B@34d0cdd0' warnings, and as a result no documents
get indexed in Solr.

Can anyone help me pinpoint what is going on?

Thanks

Here's the log file:
2012-08-29 13:54:52,641 INFO  parse.ParseSegment - Parsing: smb://192.168.3.6/share/putusan/putusan_sidang_PUTUSAN 48-2011 TELAH baca.pdf
2012-08-29 13:54:53,576 INFO  parse.ParseSegment - Parsing: smb://192.168.3.6/share/putusan/putusan_sidang_Putusan 55 PUU-2010-TELAH BACA.pdf
2012-08-29 13:54:53,612 INFO  parse.ParseSegment - Parsing: smb://192.168.3.6/share/putusan/putusan_sidang_Putusan Sela 108 PHPU 2011.pdf
2012-08-29 13:54:53,930 INFO  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2012-08-29 13:54:55,087 INFO  parse.ParseSegment - ParseSegment: finished at 2012-08-29 13:54:55, elapsed: 00:00:28
2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: starting at 2012-08-29 13:54:55
2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: db: crawl/crawldb
2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: segments: [crawl/segments/20120829134849]
2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: additions allowed: true
2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: URL normalizing: true
2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: URL filtering: true
2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: 404 purging: false
2012-08-29 13:54:55,104 INFO  crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2012-08-29 13:54:55,584 INFO  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2012-08-29 13:54:55,765 INFO  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2012-08-29 13:54:56,121 INFO  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2012-08-29 13:54:56,160 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-08-29 13:54:56,160 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2012-08-29 13:54:56,160 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2012-08-29 13:54:56,198 WARN  crawl.CrawlDbReducer - Missing fetch and old value, signature=[B@34d0cdd0
2012-08-29 13:54:56,199 WARN  crawl.CrawlDbReducer - Missing fetch and old value, signature=[B@78782dc6
2012-08-29 13:54:56,199 WARN  crawl.CrawlDbReducer - Missing fetch and old value, signature=[B@1a055ff4




