I'm trying to crawl pages from a number of domains, and one of these domains has been giving me trouble. The really irritating thing is that it did work at least once, which led me to believe that I'd solved the problem. At this point I can't think of anything to do but paste the log of a failed crawl and solrindex run and hope that someone spots something I've overlooked. Does anything look strange here?
Thanks,
Chip

2011-12-19 16:31:01,010 WARN crawl.Crawl - solrUrl is not set, indexing will be skipped...
2011-12-19 16:31:01,404 INFO crawl.Crawl - crawl started in: mit-c-crawl
2011-12-19 16:31:01,420 INFO crawl.Crawl - rootUrlDir = mit-c-urls
2011-12-19 16:31:01,420 INFO crawl.Crawl - threads = 10
2011-12-19 16:31:01,420 INFO crawl.Crawl - depth = 1
2011-12-19 16:31:01,420 INFO crawl.Crawl - solrUrl=null
2011-12-19 16:31:01,420 INFO crawl.Crawl - topN = 500000
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: starting at 2011-12-19 16:31:01
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: crawlDb: mit-c-crawl/crawldb
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: urlDir: mit-c-urls
2011-12-19 16:31:01,436 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2011-12-19 16:31:02,854 INFO plugin.PluginRepository - Plugins: looking in: C:\Apache\apache-nutch-1.4\runtime\local\plugins
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Plugins:
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - URL Meta Indexing Filter (urlmeta)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Extension-Points:
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-12-19 16:31:02,964 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2011-12-19 16:31:05,722 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2011-12-19 16:31:07,014 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-12-19 16:31:07,897 INFO crawl.Injector - Injector: finished at 2011-12-19 16:31:07, elapsed: 00:00:06
2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: starting at 2011-12-19 16:31:07
2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: filtering: true
2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: normalizing: true
2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: topN: 500000
2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2011-12-19 16:31:09,157 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 16:31:09,157 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 16:31:09,157 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 16:31:09,157 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2011-12-19 16:31:09,189 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 16:31:09,189 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 16:31:09,189 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 16:31:09,189 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2011-12-19 16:31:10,071 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
2011-12-19 16:31:11,080 INFO crawl.Generator - Generator: segment: mit-c-crawl/segments/20111219163111
2011-12-19 16:31:12,309 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2011-12-19 16:31:13,223 INFO crawl.Generator - Generator: finished at 2011-12-19 16:31:13, elapsed: 00:00:05
2011-12-19 16:31:13,239 INFO fetcher.Fetcher - Fetcher: starting at 2011-12-19 16:31:13
2011-12-19 16:31:13,239 INFO fetcher.Fetcher - Fetcher: segment: mit-c-crawl/segments/20111219163111
2011-12-19 16:31:14,515 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 16:31:14,515 INFO fetcher.Fetcher - Fetcher: threads: 10
2011-12-19 16:31:14,515 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
2011-12-19 16:31:14,515 INFO fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - fetching http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
2011-12-19 16:31:14,562 INFO httpclient.Http - http.proxy.host = null
2011-12-19 16:31:14,562 INFO httpclient.Http - http.proxy.port = 8080
2011-12-19 16:31:14,562 INFO httpclient.Http - http.timeout = 10000
2011-12-19 16:31:14,562 INFO httpclient.Http - http.content.limit = -1
2011-12-19 16:31:14,562 INFO httpclient.Http - http.agent = PHFAWS/Nutch-1.3 (American Institute of Physics: Physics History Finding Aids Web Site; http://www.aip.org/history/nbl/findingaids.html; [email protected])
2011-12-19 16:31:14,562 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2011-12-19 16:31:14,799 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2011-12-19 16:31:15,539 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2011-12-19 16:31:15,539 INFO fetcher.Fetcher - -activeThreads=0
2011-12-19 16:31:16,390 INFO fetcher.Fetcher - Fetcher: finished at 2011-12-19 16:31:16, elapsed: 00:00:03
2011-12-19 16:31:16,390 INFO parse.ParseSegment - ParseSegment: starting at 2011-12-19 16:31:16
2011-12-19 16:31:16,390 INFO parse.ParseSegment - ParseSegment: segment: mit-c-crawl/segments/20111219163111
2011-12-19 16:31:18,533 INFO parse.ParseSegment - ParseSegment: finished at 2011-12-19 16:31:18, elapsed: 00:00:02
2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: starting at 2011-12-19 16:31:18
2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: db: mit-c-crawl/crawldb
2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: segments: [mit-c-crawl/segments/20111219163111]
2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2011-12-19 16:31:19,873 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2011-12-19 16:31:20,046 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2011-12-19 16:31:20,204 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 16:31:20,204 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 16:31:20,204 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 16:31:20,771 INFO crawl.CrawlDb - CrawlDb update: finished at 2011-12-19 16:31:20, elapsed: 00:00:02
2011-12-19 16:31:20,787 INFO crawl.LinkDb - LinkDb: starting at 2011-12-19 16:31:20
2011-12-19 16:31:20,787 INFO crawl.LinkDb - LinkDb: linkdb: mit-c-crawl/linkdb
2011-12-19 16:31:20,787 INFO crawl.LinkDb - LinkDb: URL normalize: true
2011-12-19 16:31:20,787 INFO crawl.LinkDb - LinkDb: URL filter: true
2011-12-19 16:31:20,787 INFO crawl.LinkDb - LinkDb: adding segment: file:/C:/apache/apache-nutch-1.4/runtime/local/mit-c-crawl/segments/20111219163111
2011-12-19 16:31:22,898 INFO crawl.LinkDb - LinkDb: finished at 2011-12-19 16:31:22, elapsed: 00:00:02
2011-12-19 16:31:22,898 INFO crawl.Crawl - crawl finished: mit-c-crawl
2011-12-19 16:32:08,061 INFO solr.SolrIndexer - SolrIndexer: starting at 2011-12-19 16:32:08
2011-12-19 16:32:08,093 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: mit-c-crawl/crawldb
2011-12-19 16:32:08,093 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: mit-c-crawl/linkdb
2011-12-19 16:32:08,093 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: mit-c-crawl/segments/20111219163111
2011-12-19 16:32:09,984 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-12-19 16:32:10,141 INFO plugin.PluginRepository - Plugins: looking in: C:\Apache\apache-nutch-1.4\runtime\local\plugins
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Registered Plugins:
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - URL Meta Indexing Filter (urlmeta)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Registered Extension-Points:
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-12-19 16:32:10,252 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2011-12-19 16:32:10,283 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2011-12-19 16:32:10,283 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2011-12-19 16:32:10,283 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
2011-12-19 16:32:11,276 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2011-12-19 16:32:11,276 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2011-12-19 16:32:11,276 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2011-12-19 16:32:11,276 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
2011-12-19 16:32:11,402 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2011-12-19 16:32:11,402 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2011-12-19 16:32:11,402 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2011-12-19 16:32:11,402 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
2011-12-19 16:32:11,544 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2011-12-19 16:32:11,544 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2011-12-19 16:32:11,544 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2011-12-19 16:32:11,544 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
2011-12-19 16:32:11,686 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2011-12-19 16:32:11,686 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2011-12-19 16:32:11,686 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2011-12-19 16:32:11,686 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
2011-12-19 16:32:11,906 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2011-12-19 16:32:11,906 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2011-12-19 16:32:11,906 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2011-12-19 16:32:11,906 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
2011-12-19 16:32:11,985 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2011-12-19 16:32:11,985 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2011-12-19 16:32:11,985 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2011-12-19 16:32:11,985 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: content dest: content
2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: site dest: site
2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: title dest: title
2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: host dest: host
2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: segment dest: segment
2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: boost dest: boost
2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: digest dest: digest
2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: url dest: id
2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: url dest: url
2011-12-19 16:32:13,309 INFO solr.SolrIndexer - SolrIndexer: finished at 2011-12-19 16:32:13, elapsed: 00:00:05

