Hi,
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=null
topN = 5
Injector: starting at 2013-02-04 13:05:18
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-02-04 13:05:33, elapsed: 00:00:14
Generator: starting at 2013-02-04 13:05:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130204130541
Generator: finished at 2013-02-04 13:05:48, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 
'http.robots.agents' property.
Fetcher: starting at 2013-02-04 13:05:48
Fetcher: segment: crawl/segments/20130204130541
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/ (queue crawl delay=5000ms)
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-02-04 13:05:58, elapsed: 00:00:10
ParseSegment: starting at 2013-02-04 13:05:58
ParseSegment: segment: crawl/segments/20130204130541
Error parsing: http://nutch.apache.org/: failed(2,200): 
org.apache.nutch.parse.ParseException: Unable to successfully parse content
Parsed (15ms):http://nutch.apache.org/
ParseSegment: finished at 2013-02-04 13:06:05, elapsed: 00:00:07
CrawlDb update: starting at 2013-02-04 13:06:05
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130204130541]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-02-04 13:06:18, elapsed: 00:00:13
Generator: starting at 2013-02-04 13:06:18
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-02-04 13:06:25
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: 
file:/E:/SearchEngine/workspace/Nutch2.1/crawl/segments/20130204130541
LinkDb: finished at 2013-02-04 13:06:32, elapsed: 00:00:07
crawl finished: crawl

After running Nutch 2.1 from Eclipse (on Windows), the output above shows some problems, highlighted in red: the 'http.agent.name' warning and the parse error on http://nutch.apache.org/. Can anybody give me the right instructions to fix these?
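
In case it matters, this is roughly how I understood the agent settings in conf/nutch-site.xml should look (the name "MyCrawler" is just a placeholder; I am not sure this is what the warning is asking for):

```xml
<!-- conf/nutch-site.xml (sketch; "MyCrawler" is a placeholder value) -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
  <property>
    <!-- the warning says the agent name should be listed first here -->
    <name>http.robots.agents</name>
    <value>MyCrawler,*</value>
  </property>
</configuration>
```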

Best Regards
Amelia (Meiping Wang)
