Hi,
I'm sure this is an "old" topic, but I still have no luck crawling with it.
It's a bit harder than crawling the web over HTTP :(
Here are the most important files I configured:
(1) urls/seed.txt, containing one line:
file://opt/searchengine/test/
The directory itself contains one file:
-rw-r--r-- 1 bayu bayu 3272 Jun 5 10:02 Testdocumentsaja.pdf
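One thing I noticed while debugging (my own check with Python's standard library, so I don't know whether Nutch's protocol-file plugin parses the URL the same way): with two slashes after `file:`, the next token is treated as a hostname, so `file://opt/...` and `file:///opt/...` resolve to different paths.

```python
from urllib.parse import urlparse

# My seed URL: "opt" ends up parsed as the host, not part of the path.
u1 = urlparse("file://opt/searchengine/test/")
print(u1.netloc, u1.path)   # opt /searchengine/test/

# With three slashes the host is empty and the full path survives.
u2 = urlparse("file:///opt/searchengine/test/")
print(u2.netloc, u2.path)   # (empty) /opt/searchengine/test/
```

I'm not sure if this matters for Nutch, but it seemed worth mentioning.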
(2) regex-urlfilter.txt: allowing the file: protocol and accepting the path URL:
-^(ftp|mailto):
+^file://opt/searchengine/test
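As far as I understand, RegexURLFilter walks the rules top-down and the first matching pattern decides (`+` accepts, `-` rejects; no match means reject). A rough sketch of that logic (my own approximation, not actual Nutch code) confirms the seed URL should pass these two rules:

```python
import re

# The rules from my regex-urlfilter.txt, in order: (sign, pattern).
RULES = [
    ("-", re.compile(r"^(ftp|mailto):")),
    ("+", re.compile(r"^file://opt/searchengine/test")),
]

def accepts(url):
    """First matching rule wins; a URL matching no rule is rejected."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False

print(accepts("file://opt/searchengine/test/"))  # True
print(accepts("ftp://example.com/"))             # False
```

So I don't think the filter is what rejects the URL (the injector log below also shows 0 rejected).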
(3) nutch-site.xml : enabling protocol-file
<property>
<name>plugin.includes</name>
<value>protocol-(http|file)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
For the crawl I use the Nutch script with the common steps (inject - generate - fetch - parse - updatedb - solrindex - solrdedup).
From the hadoop.log below, Nutch can fetch the file: protocol path, but it never parses the file inside /opt/searchengine/test/.
hadoop.log:
2014-06-05 10:33:33,274 INFO crawl.Injector - Injector: starting at 2014-06-05 10:33:33
2014-06-05 10:33:33,276 INFO crawl.Injector - Injector: crawlDb: /opt/searchengine/nutch/BWCrawl/crawldb
2014-06-05 10:33:33,276 INFO crawl.Injector - Injector: urlDir: /opt/searchengine/nutch/urls/seed.txt
2014-06-05 10:33:33,277 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2014-06-05 10:33:33,714 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:33,807 WARN snappy.LoadSnappy - Snappy native library not loaded
2014-06-05 10:33:34,717 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2014-06-05 10:33:35,127 INFO crawl.Injector - Injector: total number of urls rejected by filters: 0
2014-06-05 10:33:35,131 INFO crawl.Injector - Injector: total number of urls injected after normalization and filtering: 1
2014-06-05 10:33:35,132 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2014-06-05 10:33:35,396 INFO crawl.Injector - Injector: overwrite: false
2014-06-05 10:33:35,397 INFO crawl.Injector - Injector: update: false
2014-06-05 10:33:36,357 INFO crawl.Injector - Injector: finished at 2014-06-05 10:33:36, elapsed: 00:00:03
2014-06-05 10:33:37,857 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:37,863 INFO crawl.Generator - Generator: starting at 2014-06-05 10:33:37
2014-06-05 10:33:37,863 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2014-06-05 10:33:37,864 INFO crawl.Generator - Generator: filtering: true
2014-06-05 10:33:37,865 INFO crawl.Generator - Generator: normalizing: true
2014-06-05 10:33:37,876 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2014-06-05 10:33:38,915 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-06-05 10:33:38,916 INFO crawl.AbstractFetchSchedule - defaultInterval=129600
2014-06-05 10:33:38,917 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2014-06-05 10:33:38,929 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2014-06-05 10:33:39,006 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-06-05 10:33:39,007 INFO crawl.AbstractFetchSchedule - defaultInterval=129600
2014-06-05 10:33:39,007 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2014-06-05 10:33:39,015 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2014-06-05 10:33:39,384 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
2014-06-05 10:33:40,386 INFO crawl.Generator - Generator: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
2014-06-05 10:33:40,593 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2014-06-05 10:33:41,540 INFO crawl.Generator - Generator: finished at 2014-06-05 10:33:41, elapsed: 00:00:03
2014-06-05 10:33:42,634 INFO fetcher.Fetcher - Fetcher: starting at 2014-06-05 10:33:42
2014-06-05 10:33:42,635 INFO fetcher.Fetcher - Fetcher: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
2014-06-05 10:33:43,056 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:43,719 INFO fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:43,720 INFO fetcher.Fetcher - Fetcher: threads: 10
2014-06-05 10:33:43,720 INFO fetcher.Fetcher - Fetcher: time-out divisor: 4
2014-06-05 10:33:43,739 INFO fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
2014-06-05 10:33:44,102 INFO fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,103 INFO fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,104 INFO fetcher.Fetcher - fetching file://opt/searchengine/test/ (queue crawl delay=5000ms)
2014-06-05 10:33:44,106 INFO fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,107 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,111 INFO fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,111 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,118 INFO fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,120 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,121 INFO fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,122 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,122 INFO fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,127 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,129 INFO fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,130 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,131 INFO fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,132 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,133 INFO fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,146 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,149 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
2014-06-05 10:33:44,149 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
2014-06-05 10:33:44,150 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,423 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2014-06-05 10:33:45,151 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2014-06-05 10:33:45,153 INFO fetcher.Fetcher - -activeThreads=0
2014-06-05 10:33:45,497 INFO fetcher.Fetcher - Fetcher: finished at 2014-06-05 10:33:45, elapsed: 00:00:02
2014-06-05 10:33:46,660 INFO parse.ParseSegment - ParseSegment: starting at 2014-06-05 10:33:46
2014-06-05 10:33:46,661 INFO parse.ParseSegment - ParseSegment: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
2014-06-05 10:33:47,094 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:48,527 INFO parse.ParseSegment - ParseSegment: finished at 2014-06-05 10:33:48, elapsed: 00:00:01
2014-06-05 10:33:49,949 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:49,995 INFO crawl.CrawlDb - CrawlDb update: starting at 2014-06-05 10:33:49
2014-06-05 10:33:49,996 INFO crawl.CrawlDb - CrawlDb update: db: /opt/searchengine/nutch/BWCrawl/crawldb
2014-06-05 10:33:49,997 INFO crawl.CrawlDb - CrawlDb update: segments: [/opt/searchengine/nutch/BWCrawl/segments/20140605103340]
2014-06-05 10:33:50,002 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2014-06-05 10:33:50,003 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
2014-06-05 10:33:50,003 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
2014-06-05 10:33:50,003 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
2014-06-05 10:33:50,006 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2014-06-05 10:33:51,150 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2014-06-05 10:33:51,242 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2014-06-05 10:33:51,399 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-06-05 10:33:51,399 INFO crawl.AbstractFetchSchedule - defaultInterval=129600
2014-06-05 10:33:51,399 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2014-06-05 10:33:51,537 INFO crawl.CrawlDb - CrawlDb update: finished at 2014-06-05 10:33:51, elapsed: 00:00:01
2014-06-05 10:33:53,008 INFO indexer.IndexingJob - Indexer: starting at 2014-06-05 10:33:53
2014-06-05 10:33:53,024 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
2014-06-05 10:33:53,025 INFO indexer.IndexingJob - Indexer: URL filtering: false
2014-06-05 10:33:53,027 INFO indexer.IndexingJob - Indexer: URL normalizing: false
2014-06-05 10:33:53,373 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2014-06-05 10:33:53,385 INFO indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
2014-06-05 10:33:53,396 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /opt/searchengine/nutch/BWCrawl/crawldb
2014-06-05 10:33:53,396 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/opt/searchengine/nutch/BWCrawl/segments/20140605103340
2014-06-05 10:33:53,464 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:54,214 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2014-06-05 10:33:54,532 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: content dest: content
2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: title dest: title
2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: author dest: author
2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: host dest: host
2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: segment dest: segment
2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: boost dest: boost
2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: digest dest: digest
2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: url dest: id
2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: url dest: url
2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: content dest: content
2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: title dest: title
2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: author dest: author
2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: host dest: host
2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: segment dest: segment
2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: boost dest: boost
2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: digest dest: digest
2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: url dest: id
2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: url dest: url
2014-06-05 10:33:55,063 INFO indexer.IndexingJob - Indexer: finished at 2014-06-05 10:33:55, elapsed: 00:00:02
Result of nutch readdb:
CrawlDb statistics start: BWCrawl/crawldb/
Statistics for CrawlDb: BWCrawl/crawldb/
TOTAL urls: 1
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 3 (db_gone): 1
CrawlDb statistics: done
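For my own reference, this is how I read the status line above (my understanding of the CrawlDatum db status codes in Nutch 1.x, so please correct me if the numbering is off): status 3 is db_gone, i.e. the fetch was treated as a permanent failure rather than a success that simply failed to parse.

```python
# My summary of the Nutch 1.x CrawlDatum db status codes
# (an assumption on my part, not copied from the Nutch source).
DB_STATUS = {
    1: "db_unfetched",
    2: "db_fetched",
    3: "db_gone",
    4: "db_redir_temp",
    5: "db_redir_perm",
    6: "db_notmodified",
}

print(DB_STATUS[3])  # db_gone -- matches the readdb output above
```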
Here are some of the documents I've read:
- http://wiki.apache.org/nutch/IntranetDocumentSearch
- http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
- http://lucene.472066.n3.nabble.com/Crawling-the-local-file-system-with-Nutch-Document-td607747.html
System: Ubuntu 14.04, Nutch 1.8, Solr 4.8.0.
I would really appreciate it if someone could share some hints or any proven, working references for this subject.
Thank you.
--
wassalam,
[bayu]