I'm not sure if I understood you correctly. Here is the complete output of my crawl:
tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
solrUrl is not set, indexing will be skipped...
crawl started in: /Users/toom/Downloads/nutch-1.3/sites
rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled
threads = 10
depth = 3
solrUrl=null
topN = 50
Injector: starting at 2011-07-12 12:28:49
Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb
Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-12 12:28:53, elapsed: 00:00:04
Generator: starting at 2011-07-12 12:28:53
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
Generator: finished at 2011-07-12 12:28:57, elapsed: 00:00:04
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-07-12 12:28:57
Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://nutch.apache.org/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-12 12:29:01, elapsed: 00:00:03
ParseSegment: starting at 2011-07-12 12:29:01
ParseSegment: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
ParseSegment: finished at 2011-07-12 12:29:03, elapsed: 00:00:02
CrawlDb update: starting at 2011-07-12 12:29:03
CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
CrawlDb update: segments: [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-12 12:29:06, elapsed: 00:00:02
Generator: starting at 2011-07-12 12:29:06
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908 Generator: finished at 2011-07-12 12:29:10, elapsed: 00:00:03 Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. Fetcher: starting at 2011-07-12 12:29:10 Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908 Fetcher: threads: 10 QueueFeeder finished: total 50 records + hit by time limit :0 fetching http://www.cafepress.com/nutch/ fetching http://creativecommons.org/press-releases/entry/5064 fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html fetching http://www.apache.org/dist/nutch/CHANGES-1.0.txt fetching http://eu.apachecon.com/c/aceu2009/sessions/138 fetching http://www.us.apachecon.com/c/acus2009/ fetching http://issues.apache.org/jira/browse/NUTCH fetching http://forrest.apache.org/ fetching http://hadoop.apache.org/ fetching http://wiki.apache.org/nutch/ fetching http://nutch.apache.org/credits.html fetching http://tika.apache.org/ fetching http://lucene.apache.org/solr/ fetching http://osuosl.org/news_folder/nutch fetching http://www.eu.apachecon.com/c/aceu2009/ -activeThreads=10, spinWaiting=1, fetchQueues.totalSize=35 -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=35 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35 fetching http://www.apache.org/ fetching http://eu.apachecon.com/c/aceu2009/sessions/251 fetching http://nutch.apache.org/skin/fontsize.js -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=32 fetching http://www.us.apachecon.com/c/acus2009/schedule fetching http://wiki.apache.org/nutch/NutchTutorial -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=30 fetching http://lucene.apache.org/java/ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29 fetching 
http://www.apache.org/dyn/closer.cgi/nutch/ -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=28 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=28 fetching http://eu.apachecon.com/c/aceu2009/sessions/197 fetching http://nutch.apache.org/nightly.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26 fetching http://wiki.apache.org/nutch/FAQ -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=25 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=25 fetching http://www.apache.org/licenses/ -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=24 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=24 fetching http://eu.apachecon.com/c/aceu2009/sessions/136 fetching http://nutch.apache.org/apidocs-1.3/index.html -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=22 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=22 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=22 fetching http://www.apache.org/dist/nutch/CHANGES-1.2.txt -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=21 fetching http://nutch.apache.org/skin/breadcrumbs.js fetching http://eu.apachecon.com/c/aceu2009/sessions/165 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=19 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=19 fetching http://www.apache.org/dist/nutch/CHANGES-0.9.txt -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=18 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=18 fetching http://eu.apachecon.com/c/aceu2009/sessions/201 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=17 fetching http://nutch.apache.org/skin/getMenu.js -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16 fetching http://www.apache.org/dist/nutch/CHANGES-1.1.txt -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15 
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=15 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15 fetching http://eu.apachecon.com/c/aceu2009/sessions/137 fetching http://nutch.apache.org/index.html -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13 fetching http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12 fetching http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt fetching http://eu.apachecon.com/c/aceu2009/sessions/250 -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=10 fetching http://nutch.apache.org/mailing_lists.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 fetching http://www.apache.org/dist/nutch/CHANGES-1.3.txt -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=8 fetching http://nutch.apache.org/bot.html -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=7 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 fetching http://nutch.apache.org/issue_tracking.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6 
fetching http://nutch.apache.org/about.html -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=5 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 fetching http://nutch.apache.org/i18n.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466617719 now = 1310466613063 0. http://nutch.apache.org/version_control.html 1. http://nutch.apache.org/skin/getBlank.js 2. http://nutch.apache.org/index.pdf 3. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466617719 now = 1310466614064 0. http://nutch.apache.org/version_control.html 1. http://nutch.apache.org/skin/getBlank.js 2. http://nutch.apache.org/index.pdf 3. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466617719 now = 1310466615066 0. http://nutch.apache.org/version_control.html 1. http://nutch.apache.org/skin/getBlank.js 2. http://nutch.apache.org/index.pdf 3. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466617719 now = 1310466616068 0. http://nutch.apache.org/version_control.html 1. http://nutch.apache.org/skin/getBlank.js 2. http://nutch.apache.org/index.pdf 3. 
http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466617719 now = 1310466617069 0. http://nutch.apache.org/version_control.html 1. http://nutch.apache.org/skin/getBlank.js 2. http://nutch.apache.org/index.pdf 3. http://nutch.apache.org/apidocs-1.2/index.html fetching http://nutch.apache.org/version_control.html -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466617719 now = 1310466618071 0. http://nutch.apache.org/skin/getBlank.js 1. http://nutch.apache.org/index.pdf 2. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466623151 now = 1310466619073 0. http://nutch.apache.org/skin/getBlank.js 1. http://nutch.apache.org/index.pdf 2. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466623151 now = 1310466620075 0. http://nutch.apache.org/skin/getBlank.js 1. http://nutch.apache.org/index.pdf 2. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466623151 now = 1310466621077 0. http://nutch.apache.org/skin/getBlank.js 1. http://nutch.apache.org/index.pdf 2. 
http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466623151 now = 1310466622078 0. http://nutch.apache.org/skin/getBlank.js 1. http://nutch.apache.org/index.pdf 2. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466623151 now = 1310466623080 0. http://nutch.apache.org/skin/getBlank.js 1. http://nutch.apache.org/index.pdf 2. http://nutch.apache.org/apidocs-1.2/index.html fetching http://nutch.apache.org/skin/getBlank.js -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466628578 now = 1310466624082 0. http://nutch.apache.org/index.pdf 1. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466628578 now = 1310466625084 0. http://nutch.apache.org/index.pdf 1. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466628578 now = 1310466626086 0. http://nutch.apache.org/index.pdf 1. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466628578 now = 1310466627088 0. http://nutch.apache.org/index.pdf 1. 
http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466628578 now = 1310466628089 0. http://nutch.apache.org/index.pdf 1. http://nutch.apache.org/apidocs-1.2/index.html fetching http://nutch.apache.org/index.pdf -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=1 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466628578 now = 1310466629090 0. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466634844 now = 1310466630092 0. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466634844 now = 1310466631094 0. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466634844 now = 1310466632095 0. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466634844 now = 1310466633097 0. http://nutch.apache.org/apidocs-1.2/index.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 * queue: http://nutch.apache.org maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466634844 now = 1310466634099 0. 
http://nutch.apache.org/apidocs-1.2/index.html
fetching http://nutch.apache.org/apidocs-1.2/index.html
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-activeThreads=5, spinWaiting=4, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-12 12:30:37, elapsed: 00:01:27
ParseSegment: starting at 2011-07-12 12:30:37
ParseSegment: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
Error parsing: http://nutch.apache.org/skin/breadcrumbs.js: failed(2,0): Can't retrieve Tika parser for mime-type application/javascript
Error parsing: http://nutch.apache.org/skin/fontsize.js: failed(2,0): Can't retrieve Tika parser for mime-type application/javascript
Error parsing: http://nutch.apache.org/skin/getBlank.js: failed(2,0): Can't retrieve Tika parser for mime-type application/javascript
Error parsing: http://nutch.apache.org/skin/getMenu.js: failed(2,0): Can't retrieve Tika parser for mime-type application/javascript
ParseSegment: finished at 2011-07-12 12:30:46, elapsed: 00:00:08
CrawlDb update: starting at 2011-07-12 12:30:46
CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
CrawlDb update: segments: [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-12 12:30:48, elapsed: 00:00:02
Generator: starting at 2011-07-12 12:30:48
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
Generator: finished at 2011-07-12 12:30:52, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-07-12 12:30:52
Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
Fetcher: threads: 10
QueueFeeder finished: total 50 records + hit by time limit :0
fetching http://www.onehippo.com/
fetching http://apacheconeu.blogspot.com/
fetching http://www.day.com/
fetching http://www.func.nl/apacheconeu2009
fetching http://www.thawte.com/
fetching http://eu.apachecon.com/c/aceu2009/about
fetching http://www.us.apachecon.com/c/acus2009/sessions/333
fetching http://www.joost.com/
fetching http://developer.yahoo.com/blogs/hadoop/
fetching http://www.springsource.com/
fetching http://www.isi.edu/~koehn/europarl/
fetching http://www.topicus.nl/
fetching http://opensource.hp.com/
fetching http://nutch.apache.org/apidocs-1.3/overview-frame.html
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=36
fetching http://www.haloworldwide.com/
fetching https://builds.apache.org/job/Nutch-trunk/javadoc/
fetch of https://builds.apache.org/job/Nutch-trunk/javadoc/ failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
fetching http://www.hotwaxmedia.com/
fetching http://lucene.apache.org/hadoop
fetching http://www.cloudera.com/
fetching http://code.google.com/opensource/
fetching http://www.lucidimagination.com/
fetching http://apache.lehtivihrea.org/nutch/
fetching
http://www.eu.apachecon.com/c/aceu2009/about/meetups -activeThreads=10, spinWaiting=4, fetchQueues.totalSize=27 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27 fetching http://www.us.apachecon.com/c/acus2009/sessions/334 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26 fetching http://nutch.apache.org/apidocs-1.2/allclasses-frame.html fetching http://eu.apachecon.com/c/aceu2009/about/crowdvine -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=24 fetching http://www.eu.apachecon.com/c/aceu2009/about/videoStreaming -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=23 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=23 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=23 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=23 fetching http://www.us.apachecon.com/c/acus2009/sessions/335 fetching http://nutch.apache.org/apidocs-1.2/overview-summary.html -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21 fetching http://eu.apachecon.com/c/aceu2009/speakers -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=20 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=20 fetching http://www.eu.apachecon.com/c/aceu2009/sponsors/sponsor -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19 fetching http://www.us.apachecon.com/c/acus2009/sessions/461 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=18 fetching http://nutch.apache.org/apidocs-1.3/allclasses-frame.html -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=17 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=17 fetching http://eu.apachecon.com/c/aceu2009/articles -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=16 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16 fetching 
http://www.us.apachecon.com/c/acus2009/sessions/427 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=15 fetching http://nutch.apache.org/apidocs-1.2/overview-frame.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14 fetching http://eu.apachecon.com/c/aceu2009/sessions/ -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13 fetching http://www.us.apachecon.com/c/acus2009/sessions/430 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=12 fetching http://nutch.apache.org/apidocs-1.3/overview-summary.html -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11 fetching http://eu.apachecon.com/c/aceu2009/sponsors/sponsors -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=10 fetching http://www.us.apachecon.com/c/acus2009/sessions/375 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 fetching http://eu.apachecon.com/c/ fetching http://www.us.apachecon.com/c/acus2009/sessions/462 -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=7 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 fetching http://www.us.apachecon.com/c/acus2009/sessions/428 fetching http://eu.apachecon.com/c/aceu2009/schedule -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=5 
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 fetching http://www.us.apachecon.com/c/acus2009/sessions/331 fetching http://eu.apachecon.com/c/aceu2009/ -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3 * queue: http://eu.apachecon.com maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466704235 now = 1310466704428 0. http://eu.apachecon.com/js/jquery.akslideshow.js * queue: http://www.us.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466709214 now = 1310466704428 0. http://www.us.apachecon.com/c/acus2009/sessions/437 1. http://www.us.apachecon.com/c/acus2009/sessions/332 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3 * queue: http://eu.apachecon.com maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466704235 now = 1310466705429 0. http://eu.apachecon.com/js/jquery.akslideshow.js * queue: http://www.us.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466709214 now = 1310466705430 0. http://www.us.apachecon.com/c/acus2009/sessions/437 1. http://www.us.apachecon.com/c/acus2009/sessions/332 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 * queue: http://eu.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466710968 now = 1310466706431 0. http://eu.apachecon.com/js/jquery.akslideshow.js * queue: http://www.us.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466709214 now = 1310466706431 0. http://www.us.apachecon.com/c/acus2009/sessions/437 1. 
http://www.us.apachecon.com/c/acus2009/sessions/332 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 * queue: http://eu.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466710968 now = 1310466707433 0. http://eu.apachecon.com/js/jquery.akslideshow.js * queue: http://www.us.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466709214 now = 1310466707433 0. http://www.us.apachecon.com/c/acus2009/sessions/437 1. http://www.us.apachecon.com/c/acus2009/sessions/332 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 * queue: http://eu.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466710968 now = 1310466708435 0. http://eu.apachecon.com/js/jquery.akslideshow.js * queue: http://www.us.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466709214 now = 1310466708435 0. http://www.us.apachecon.com/c/acus2009/sessions/437 1. http://www.us.apachecon.com/c/acus2009/sessions/332 fetching http://www.us.apachecon.com/c/acus2009/sessions/437 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=2 * queue: http://eu.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466710968 now = 1310466709442 0. http://eu.apachecon.com/js/jquery.akslideshow.js * queue: http://www.us.apachecon.com maxThreads = 1 inProgress = 1 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466709214 now = 1310466709442 0. http://www.us.apachecon.com/c/acus2009/sessions/332 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 * queue: http://eu.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466710968 now = 1310466710444 0. 
http://eu.apachecon.com/js/jquery.akslideshow.js * queue: http://www.us.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466714813 now = 1310466710444 0. http://www.us.apachecon.com/c/acus2009/sessions/332 fetching http://eu.apachecon.com/js/jquery.akslideshow.js -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 * queue: http://www.us.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466714813 now = 1310466711446 0. http://www.us.apachecon.com/c/acus2009/sessions/332 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 * queue: http://www.us.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466714813 now = 1310466712447 0. http://www.us.apachecon.com/c/acus2009/sessions/332 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 * queue: http://www.us.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466714813 now = 1310466713448 0. http://www.us.apachecon.com/c/acus2009/sessions/332 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 * queue: http://www.us.apachecon.com maxThreads = 1 inProgress = 0 crawlDelay = 5000 minCrawlDelay = 0 nextFetchTime = 1310466714813 now = 1310466714450 0. 
http://www.us.apachecon.com/c/acus2009/sessions/332
fetching http://www.us.apachecon.com/c/acus2009/sessions/332
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-12 12:31:55, elapsed: 00:01:03
ParseSegment: starting at 2011-07-12 12:31:55
ParseSegment: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
ParseSegment: finished at 2011-07-12 12:31:59, elapsed: 00:00:03
CrawlDb update: starting at 2011-07-12 12:31:59
CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
CrawlDb update: segments: [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03 LinkDb: starting at 2011-07-12 12:32:03 LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb LinkDb: URL normalize: true LinkDb: URL filter: true LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238 LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732 LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256 LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856 LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908 LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051 Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190) at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149) at org.apache.nutch.crawl.Crawl.run(Crawl.java:142) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawl.main(Crawl.java:54) 2011/7/12 Julien Nioche 
<[email protected]>:
>> Actually I'm not sure if I am looking at the right log lines. Please
>> explain in more detail what exactly I should look for. Anyway I
>> found the following line just before the error:
>>
>> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
>> failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
>>
>> But I can see that parsing errors like this already appeared earlier
>> during the crawl.
>>
>
> This simply means that the javascript parser is not enabled in your conf
> (which is the default behaviour) and as a consequence the default parser
> (Tika) was used to try and parse it but has no resources for doing so.
>
> Note: we should probably add .js to the default url filters. The javascript
> parser has been deactivated by default because it generates atrocious URLs
> so we might as well prevent such URLs from being fetched in the first place.
>
> Anyway this is not the source of the problem. You seem to have unparsed
> segments among the ones specified. It could be that you interrupted a previous
> crawl or had a problem with it and did not delete these segments or the
> whole crawl directory. Removing the segments and calling the last couple of
> steps manually should do the trick.
>
>>
>> 2011/7/12 Markus Jelsma <[email protected]>:
>> > Were there errors during parsing of that last segment?
>> >
>> >> I'm starting with nutch and I ran a simple job as described in the
>> >> nutch tutorial. After a while I get the following error:
>> >>
>> >> CrawlDb update: URL filtering: true
>> >> CrawlDb update: Merging segment data into db.
>> >> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
>> >> LinkDb: starting at 2011-07-12 12:32:03
>> >> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
>> >> LinkDb: URL normalize: true
>> >> LinkDb: URL filter: true
>> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
>> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
>> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
>> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
>> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
>> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
>> >> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
>> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
>> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
>> >>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>> >>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>> >>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>> >>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> >>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> >>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> >>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> >>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>> >>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>> >>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
>> >>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
>> >
>> >
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
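
For anyone hitting the same exception: the advice above (find the unparsed segments and remove them) could be checked with a small script along these lines. This is only a rough sketch, not part of Nutch. The function name `unparsed_segments` is made up for illustration, and it assumes that a segment counts as parsed when it contains a `parse_data` subdirectory, which is the path the exception complains about.

```python
from pathlib import Path


def unparsed_segments(segments_dir):
    """Return the names of segment directories that lack a parse_data subdirectory.

    Assumption: a segment without parse_data was never parsed (for example
    because an earlier crawl was interrupted). This helper is hypothetical
    and only inspects the directory layout; it does not call Nutch.
    """
    root = Path(segments_dir)
    return sorted(
        segment.name
        for segment in root.iterdir()
        if segment.is_dir() and not (segment / "parse_data").is_dir()
    )


# Example call (path taken from the log above):
# unparsed_segments("/Users/toom/Downloads/nutch-1.3/sites/segments")
```

With the log above, one would expect it to report the three segments named in the "Input path does not exist" lines. Those could then be deleted (or parsed) before re-running the link inversion step by hand; as far as I know the relevant commands in 1.3 are `bin/nutch parse` and `bin/nutch invertlinks`, but please check the usage output of `bin/nutch` to be sure.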

