Re: Nutch - SMB protocol

Lewis John Mcgibbney Wed, 29 Aug 2012 07:28:23 -0700

Can you point me to the protocol-smb plugin in the SVN repository, please?

http://svn.apache.org/repos/asf/nutch/


Thank you

Lewis

On Wed, Aug 29, 2012 at 3:22 PM, Matteo Simoncini <[email protected]> wrote:
> Sorry, I forgot it.
>
> 1.5
>
> Matteo
>
> 2012/8/29 Lewis John Mcgibbney <[email protected]>:
>> What version of Nutch is this?
>>
>> Lewis
>>
>> On Wed, Aug 29, 2012 at 9:58 AM, xpow <[email protected]> wrote:
>>> Hello,
>>>
>>> I've tried to use the protocol-smb plugin with Nutch. Nutch read and
>>> parsed the documents correctly, but afterward, during the CrawlDb update
>>> (crawl.CrawlDbReducer), I got a lot of 'crawl.CrawlDbReducer - Missing fetch
>>> and old value, signature=[B@34d0cdd0' warnings, which results in no
>>> documents getting indexed in Solr.
>>>
>>> Can anyone help me pinpoint what is going on?
>>>
>>> Thanks
>>>
>>> Here's the log file:
>>> 2012-08-29 13:54:52,641 INFO  parse.ParseSegment - Parsing:
>>> smb://192.168.3.6/share/putusan/putusan_sidang_PUTUSAN 48-2011 TELAH
>>> baca.pdf
>>> 2012-08-29 13:54:53,576 INFO  parse.ParseSegment - Parsing:
>>> smb://192.168.3.6/share/putusan/putusan_sidang_Putusan 55 PUU-2010-TELAH
>>> BACA.pdf
>>> 2012-08-29 13:54:53,612 INFO  parse.ParseSegment - Parsing:
>>> smb://192.168.3.6/share/putusan/putusan_sidang_Putusan Sela 108 PHPU
>>> 2011.pdf
>>> 2012-08-29 13:54:53,930 INFO  regex.RegexURLNormalizer - can't find rules
>>> for scope 'outlink', using default
>>> 2012-08-29 13:54:55,087 INFO  parse.ParseSegment - ParseSegment: finished at
>>> 2012-08-29 13:54:55, elapsed: 00:00:28
>>> 2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: starting at
>>> 2012-08-29 13:54:55
>>> 2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: db:
>>> crawl/crawldb
>>> 2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: segments:
>>> [crawl/segments/20120829134849]
>>> 2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: additions
>>> allowed: true
>>> 2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: URL
>>> normalizing: true
>>> 2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: URL filtering:
>>> true
>>> 2012-08-29 13:54:55,103 INFO  crawl.CrawlDb - CrawlDb update: 404 purging:
>>> false
>>> 2012-08-29 13:54:55,104 INFO  crawl.CrawlDb - CrawlDb update: Merging
>>> segment data into db.
>>> 2012-08-29 13:54:55,584 INFO  regex.RegexURLNormalizer - can't find rules
>>> for scope 'crawldb', using default
>>> 2012-08-29 13:54:55,765 INFO  regex.RegexURLNormalizer - can't find rules
>>> for scope 'crawldb', using default
>>> 2012-08-29 13:54:56,121 INFO  regex.RegexURLNormalizer - can't find rules
>>> for scope 'crawldb', using default
>>> 2012-08-29 13:54:56,160 INFO  crawl.FetchScheduleFactory - Using
>>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>> 2012-08-29 13:54:56,160 INFO  crawl.AbstractFetchSchedule -
>>> defaultInterval=2592000
>>> 2012-08-29 13:54:56,160 INFO  crawl.AbstractFetchSchedule -
>>> maxInterval=7776000
>>> 2012-08-29 13:54:56,198 WARN  crawl.CrawlDbReducer - Missing fetch and old
>>> value, signature=[B@34d0cdd0
>>> 2012-08-29 13:54:56,199 WARN  crawl.CrawlDbReducer - Missing fetch and old
>>> value, signature=[B@78782dc6
>>> 2012-08-29 13:54:56,199 WARN  crawl.CrawlDbReducer - Missing fetch and old
>>> value, signature=[B@1a055ff4
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: 
>>> http://lucene.472066.n3.nabble.com/Nutch-SMB-protocol-tp4003945.html
>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
>>
>> --
>> Lewis



-- 
Lewis
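
A side note on reading the warnings in the log above: a value such as
[B@34d0cdd0 is what Java prints when a byte[] is converted to a string
without encoding it (the array type "[B" followed by an identity hash), so
it does not show the signature bytes themselves. The sketch below is only an
illustration of that behaviour and does not explain the warning; the class
name SignatureHex, the helper toHex, and the MD5-of-sample-text stand-in for
a signature are all assumptions made for the example, not Nutch code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SignatureHex {

    // Hypothetical helper: render a byte array as lowercase hex,
    // two characters per byte.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Stand-in for a page signature (assumption): MD5 of sample text.
        byte[] signature = MessageDigest.getInstance("MD5")
                .digest("sample content".getBytes(StandardCharsets.UTF_8));

        // String concatenation falls back to Object.toString() for arrays,
        // which yields the "[B@<hash>" form seen in the log.
        System.out.println("signature=" + signature);

        // Hex encoding shows the actual bytes and is stable across runs.
        System.out.println("signature=" + toHex(signature));
    }
}
```

If the real signatures need comparing between runs, logging them in an
encoded form like this is one option; the identity hash in the default
output will generally differ from run to run even for identical content.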
