I don't see this segment 20110712114256 being parsed.

On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote:
> I'm not sure if I understood you correctly. Here is the complete output
> of my crawl:
>
>
> tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
> solrUrl is not set, indexing will be skipped...
> crawl started in: /Users/toom/Downloads/nutch-1.3/sites
> rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled
> threads = 10
> depth = 3
> solrUrl=null
> topN = 50
> Injector: starting at 2011-07-12 12:28:49
> Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-07-12 12:28:53, elapsed: 00:00:04
> Generator: starting at 2011-07-12 12:28:53
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> Generator: finished at 2011-07-12 12:28:57, elapsed: 00:00:04
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2011-07-12 12:28:57
> Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching http://nutch.apache.org/
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-12 12:29:01, elapsed: 00:00:03
> ParseSegment: starting at 2011-07-12 12:29:01
> ParseSegment: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> ParseSegment: finished at 2011-07-12 12:29:03, elapsed: 00:00:02
> CrawlDb update: starting at 2011-07-12 12:29:03
> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> CrawlDb update: segments: [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-12 12:29:06, elapsed: 00:00:02
> Generator: starting at 2011-07-12 12:29:06
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> Generator: finished at 2011-07-12 12:29:10, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2011-07-12 12:29:10
> Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> Fetcher: threads: 10
> QueueFeeder finished: total 50 records + hit by time limit :0
> fetching http://www.cafepress.com/nutch/
> fetching http://creativecommons.org/press-releases/entry/5064
> fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
> fetching http://www.apache.org/dist/nutch/CHANGES-1.0.txt
> fetching http://eu.apachecon.com/c/aceu2009/sessions/138
> fetching http://www.us.apachecon.com/c/acus2009/
> fetching http://issues.apache.org/jira/browse/NUTCH
> fetching http://forrest.apache.org/
> fetching http://hadoop.apache.org/
> fetching http://wiki.apache.org/nutch/
> fetching http://nutch.apache.org/credits.html
> fetching http://tika.apache.org/
> fetching http://lucene.apache.org/solr/
> fetching http://osuosl.org/news_folder/nutch
> fetching http://www.eu.apachecon.com/c/aceu2009/
> -activeThreads=10, spinWaiting=1, fetchQueues.totalSize=35
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=35
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35
> fetching http://www.apache.org/
> fetching http://eu.apachecon.com/c/aceu2009/sessions/251
> fetching http://nutch.apache.org/skin/fontsize.js
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=32
> fetching http://www.us.apachecon.com/c/acus2009/schedule
> fetching http://wiki.apache.org/nutch/NutchTutorial
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=30
> fetching http://lucene.apache.org/java/
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29
> fetching http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=28
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=28
> fetching http://eu.apachecon.com/c/aceu2009/sessions/197
> fetching http://nutch.apache.org/nightly.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
> fetching http://wiki.apache.org/nutch/FAQ
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=25
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=25
> fetching http://www.apache.org/licenses/
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=24
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=24
> fetching http://eu.apachecon.com/c/aceu2009/sessions/136
> fetching http://nutch.apache.org/apidocs-1.3/index.html
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=22
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=22
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=22
> fetching http://www.apache.org/dist/nutch/CHANGES-1.2.txt
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=21
> fetching http://nutch.apache.org/skin/breadcrumbs.js
> fetching http://eu.apachecon.com/c/aceu2009/sessions/165
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=19
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=19
> fetching http://www.apache.org/dist/nutch/CHANGES-0.9.txt
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=18
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=18
> fetching http://eu.apachecon.com/c/aceu2009/sessions/201
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=17
> fetching http://nutch.apache.org/skin/getMenu.js
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16
> fetching http://www.apache.org/dist/nutch/CHANGES-1.1.txt
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=15
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15
> fetching http://eu.apachecon.com/c/aceu2009/sessions/137
> fetching http://nutch.apache.org/index.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13
> fetching http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> fetching http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt
> fetching http://eu.apachecon.com/c/aceu2009/sessions/250
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=10
> fetching http://nutch.apache.org/mailing_lists.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> fetching http://www.apache.org/dist/nutch/CHANGES-1.3.txt
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=8
> fetching http://nutch.apache.org/bot.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> fetching http://nutch.apache.org/issue_tracking.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> fetching http://nutch.apache.org/about.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> fetching http://nutch.apache.org/i18n.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466617719
> now = 1310466613063
> 0. http://nutch.apache.org/version_control.html
> 1. http://nutch.apache.org/skin/getBlank.js
> 2. http://nutch.apache.org/index.pdf
> 3. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466617719
> now = 1310466614064
> 0. http://nutch.apache.org/version_control.html
> 1. http://nutch.apache.org/skin/getBlank.js
> 2. http://nutch.apache.org/index.pdf
> 3. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466617719
> now = 1310466615066
> 0. http://nutch.apache.org/version_control.html
> 1. http://nutch.apache.org/skin/getBlank.js
> 2. http://nutch.apache.org/index.pdf
> 3. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466617719
> now = 1310466616068
> 0. http://nutch.apache.org/version_control.html
> 1. http://nutch.apache.org/skin/getBlank.js
> 2. http://nutch.apache.org/index.pdf
> 3. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466617719
> now = 1310466617069
> 0. http://nutch.apache.org/version_control.html
> 1. http://nutch.apache.org/skin/getBlank.js
> 2. http://nutch.apache.org/index.pdf
> 3. http://nutch.apache.org/apidocs-1.2/index.html
> fetching http://nutch.apache.org/version_control.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466617719
> now = 1310466618071
> 0. http://nutch.apache.org/skin/getBlank.js
> 1. http://nutch.apache.org/index.pdf
> 2. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466623151
> now = 1310466619073
> 0. http://nutch.apache.org/skin/getBlank.js
> 1. http://nutch.apache.org/index.pdf
> 2. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466623151
> now = 1310466620075
> 0. http://nutch.apache.org/skin/getBlank.js
> 1. http://nutch.apache.org/index.pdf
> 2. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466623151
> now = 1310466621077
> 0. http://nutch.apache.org/skin/getBlank.js
> 1. http://nutch.apache.org/index.pdf
> 2. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466623151
> now = 1310466622078
> 0. http://nutch.apache.org/skin/getBlank.js
> 1. http://nutch.apache.org/index.pdf
> 2. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466623151
> now = 1310466623080
> 0. http://nutch.apache.org/skin/getBlank.js
> 1. http://nutch.apache.org/index.pdf
> 2. http://nutch.apache.org/apidocs-1.2/index.html
> fetching http://nutch.apache.org/skin/getBlank.js
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466628578
> now = 1310466624082
> 0. http://nutch.apache.org/index.pdf
> 1. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466628578
> now = 1310466625084
> 0. http://nutch.apache.org/index.pdf
> 1. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466628578
> now = 1310466626086
> 0. http://nutch.apache.org/index.pdf
> 1. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466628578
> now = 1310466627088
> 0. http://nutch.apache.org/index.pdf
> 1. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466628578
> now = 1310466628089
> 0. http://nutch.apache.org/index.pdf
> 1. http://nutch.apache.org/apidocs-1.2/index.html
> fetching http://nutch.apache.org/index.pdf
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466628578
> now = 1310466629090
> 0. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466634844
> now = 1310466630092
> 0. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466634844
> now = 1310466631094
> 0. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466634844
> now = 1310466632095
> 0. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466634844
> now = 1310466633097
> 0. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466634844
> now = 1310466634099
> 0. http://nutch.apache.org/apidocs-1.2/index.html
> fetching http://nutch.apache.org/apidocs-1.2/index.html
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-12 12:30:37, elapsed: 00:01:27
> ParseSegment: starting at 2011-07-12 12:30:37
> ParseSegment: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> Error parsing: http://nutch.apache.org/skin/breadcrumbs.js: failed(2,0): Can't retrieve Tika parser for mime-type application/javascript
> Error parsing: http://nutch.apache.org/skin/fontsize.js: failed(2,0): Can't retrieve Tika parser for mime-type application/javascript
> Error parsing: http://nutch.apache.org/skin/getBlank.js: failed(2,0): Can't retrieve Tika parser for mime-type application/javascript
> Error parsing: http://nutch.apache.org/skin/getMenu.js: failed(2,0): Can't retrieve Tika parser for mime-type application/javascript
> ParseSegment: finished at 2011-07-12 12:30:46, elapsed: 00:00:08
> CrawlDb update: starting at 2011-07-12 12:30:46
> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> CrawlDb update: segments: [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-12 12:30:48, elapsed: 00:00:02
> Generator: starting at 2011-07-12 12:30:48
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> Generator: finished at 2011-07-12 12:30:52, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2011-07-12 12:30:52
> Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> Fetcher: threads: 10
> QueueFeeder finished: total 50 records + hit by time limit :0
> fetching http://www.onehippo.com/
> fetching http://apacheconeu.blogspot.com/
> fetching http://www.day.com/
> fetching http://www.func.nl/apacheconeu2009
> fetching http://www.thawte.com/
> fetching http://eu.apachecon.com/c/aceu2009/about
> fetching http://www.us.apachecon.com/c/acus2009/sessions/333
> fetching http://www.joost.com/
> fetching http://developer.yahoo.com/blogs/hadoop/
> fetching http://www.springsource.com/
> fetching http://www.isi.edu/~koehn/europarl/
> fetching http://www.topicus.nl/
> fetching http://opensource.hp.com/
> fetching http://nutch.apache.org/apidocs-1.3/overview-frame.html
> -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=36
> fetching http://www.haloworldwide.com/
> fetching https://builds.apache.org/job/Nutch-trunk/javadoc/
> fetch of https://builds.apache.org/job/Nutch-trunk/javadoc/ failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
> fetching http://www.hotwaxmedia.com/
> fetching http://lucene.apache.org/hadoop
> fetching http://www.cloudera.com/
> fetching http://code.google.com/opensource/
> fetching http://www.lucidimagination.com/
> fetching http://apache.lehtivihrea.org/nutch/
> fetching http://www.eu.apachecon.com/c/aceu2009/about/meetups
> -activeThreads=10, spinWaiting=4, fetchQueues.totalSize=27
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
> fetching http://www.us.apachecon.com/c/acus2009/sessions/334
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
> fetching http://nutch.apache.org/apidocs-1.2/allclasses-frame.html
> fetching http://eu.apachecon.com/c/aceu2009/about/crowdvine
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=24
> fetching http://www.eu.apachecon.com/c/aceu2009/about/videoStreaming
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=23
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=23
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=23
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=23
> fetching http://www.us.apachecon.com/c/acus2009/sessions/335
> fetching http://nutch.apache.org/apidocs-1.2/overview-summary.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21
> fetching http://eu.apachecon.com/c/aceu2009/speakers
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=20
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=20
> fetching http://www.eu.apachecon.com/c/aceu2009/sponsors/sponsor
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> fetching http://www.us.apachecon.com/c/acus2009/sessions/461
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=18
> fetching http://nutch.apache.org/apidocs-1.3/allclasses-frame.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=17
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=17
> fetching http://eu.apachecon.com/c/aceu2009/articles
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=16
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16
> fetching http://www.us.apachecon.com/c/acus2009/sessions/427
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=15
> fetching http://nutch.apache.org/apidocs-1.2/overview-frame.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
> fetching http://eu.apachecon.com/c/aceu2009/sessions/
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13
> fetching http://www.us.apachecon.com/c/acus2009/sessions/430
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=12
> fetching http://nutch.apache.org/apidocs-1.3/overview-summary.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> fetching http://eu.apachecon.com/c/aceu2009/sponsors/sponsors
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=10
> fetching http://www.us.apachecon.com/c/acus2009/sessions/375
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> fetching http://eu.apachecon.com/c/
> fetching http://www.us.apachecon.com/c/acus2009/sessions/462
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> fetching http://www.us.apachecon.com/c/acus2009/sessions/428
> fetching http://eu.apachecon.com/c/aceu2009/schedule
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> fetching http://www.us.apachecon.com/c/acus2009/sessions/331
> fetching http://eu.apachecon.com/c/aceu2009/
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3
> * queue: http://eu.apachecon.com
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466704235
> now = 1310466704428
> 0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466709214
> now = 1310466704428
> 0. http://www.us.apachecon.com/c/acus2009/sessions/437
> 1. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3
> * queue: http://eu.apachecon.com
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466704235
> now = 1310466705429
> 0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466709214
> now = 1310466705430
> 0. http://www.us.apachecon.com/c/acus2009/sessions/437
> 1. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://eu.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466710968
> now = 1310466706431
> 0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466709214
> now = 1310466706431
> 0. http://www.us.apachecon.com/c/acus2009/sessions/437
> 1. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://eu.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466710968
> now = 1310466707433
> 0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466709214
> now = 1310466707433
> 0. http://www.us.apachecon.com/c/acus2009/sessions/437
> 1. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://eu.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466710968
> now = 1310466708435
> 0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466709214
> now = 1310466708435
> 0. http://www.us.apachecon.com/c/acus2009/sessions/437
> 1. http://www.us.apachecon.com/c/acus2009/sessions/332
> fetching http://www.us.apachecon.com/c/acus2009/sessions/437
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=2
> * queue: http://eu.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466710968
> now = 1310466709442
> 0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
> maxThreads = 1
> inProgress = 1
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466709214
> now = 1310466709442
> 0. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://eu.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466710968
> now = 1310466710444
> 0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466714813
> now = 1310466710444
> 0. http://www.us.apachecon.com/c/acus2009/sessions/332
> fetching http://eu.apachecon.com/js/jquery.akslideshow.js
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.us.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466714813
> now = 1310466711446
> 0. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.us.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466714813
> now = 1310466712447
> 0. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.us.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466714813
> now = 1310466713448
> 0. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.us.apachecon.com
> maxThreads = 1
> inProgress = 0
> crawlDelay = 5000
> minCrawlDelay = 0
> nextFetchTime = 1310466714813
> now = 1310466714450
> 0. http://www.us.apachecon.com/c/acus2009/sessions/332
> fetching http://www.us.apachecon.com/c/acus2009/sessions/332
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-12 12:31:55, elapsed: 00:01:03
> ParseSegment: starting at 2011-07-12 12:31:55
> ParseSegment: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
> ParseSegment: finished at 2011-07-12 12:31:59, elapsed: 00:00:03
> CrawlDb update: starting at 2011-07-12 12:31:59
> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> CrawlDb update: segments: [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
> LinkDb: starting at 2011-07-12 12:32:03
> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
> at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
>
> 2011/7/12 Julien Nioche <[email protected]>:
> >> Actually I'm not sure if I look at the right log lines. Please
> >> explain in more detail for what exactly I should look for. Anyway I
> >> found the following line just before the error:
> >>
> >> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
> >> failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
> >>
> >> But I can see that parsing errors like this already appeared earlier
> >> during the crawl.
> >
> > This simply means that the javascript parser is not enabled in your conf
> > (which is the default behaviour) and as a consequence the default parser
> > (Tika) was used to try and parse it but has no resources for doing so.
> >
> > Note : we should probably add .js to the default url filters. The
> > javascript parser has been deactivated by default because it generates
> > atrocious URLs so we might as well prevent such URLs from being fetched
> > in the first place.
> >
> > Anyway this is not the source of the problem. You seem to have unparsed
> > segments among the ones specified. Could be that you interrupted a
> > previous crawl or got a problem with it and did not delete these
> > segments or the whole crawl directory. Removing the segments and calling
> > the last couple of steps manually should do the trick.
> >
> >> 2011/7/12 Markus Jelsma <[email protected]>:
> >> > Were there errors during parsing of that last segment?
> >> >
> >> >> I'm starting with nutch and I ran a simple job as described in the
> >> >> nutch tutorial. After a while I get the following error:
> >> >>
> >> >>
> >> >> CrawlDb update: URL filtering: true
> >> >> CrawlDb update: Merging segment data into db.
> >> >> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
> >> >> LinkDb: starting at 2011-07-12 12:32:03
> >> >> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
> >> >> LinkDb: URL normalize: true
> >> >> LinkDb: URL filter: true
> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> >> >> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
> >> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
> >> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
> >> >> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> >> >> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> >> >> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> >> >> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> >> >> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> >> >> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >> >> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >> >> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> >> >> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
> >> >> at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
> >> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >> >> at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
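
A rough sketch of the url-filter change Julien suggests, assuming the stock conf/regex-urlfilter.txt shipped with Nutch 1.3 (rule names and placement may differ in your checkout):

    # conf/regex-urlfilter.txt
    # Keep javascript files out of the crawl entirely, so the Tika fallback
    # is never asked for an application/javascript or text/javascript parser.
    # Place the rule before the final catch-all accept rule (+.).
    -\.(js|JS)$

Alternatively, js|JS can be folded into the existing suffix-skipping regex near the top of that file.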
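
And a sketch of the cleanup Julien describes, using the paths from the log above; the three segments removed here are exactly the ones the InvalidInputException reports as missing parse_data, i.e. leftovers of earlier interrupted runs. The invertlinks syntax follows the Nutch 1.x tutorial and may need adjusting for other setups:

    cd /Users/toom/Downloads/nutch-1.3
    # remove the unparsed segments left behind by the earlier crawls
    rm -r sites/segments/20110707140238 \
          sites/segments/20110712113732 \
          sites/segments/20110712114256
    # redo the link inversion over the remaining, fully parsed segments
    bin/nutch invertlinks sites/linkdb -dir sites/segments

After that, bin/nutch solrindex can be run once a Solr URL is available, since the crawl above skipped indexing ("solrUrl is not set").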

