Okay, and what does that mean? How can I repair the error?
2011/7/12 Markus Jelsma <[email protected]>:
> I don't see this segment 20110712114256 being parsed.
>
> On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote:
>> I'm not sure if I understood you correctly. Here is the complete output of my crawl:
>>
>> tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
>> solrUrl is not set, indexing will be skipped...
>> crawl started in: /Users/toom/Downloads/nutch-1.3/sites
>> rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled
>> threads = 10
>> depth = 3
>> solrUrl=null
>> topN = 50
>> Injector: starting at 2011-07-12 12:28:49
>> Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb
>> Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled
>> Injector: Converting injected urls to crawl db entries.
>> Injector: Merging injected urls into crawl db.
>> Injector: finished at 2011-07-12 12:28:53, elapsed: 00:00:04
>> Generator: starting at 2011-07-12 12:28:53
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: filtering: true
>> Generator: normalizing: true
>> Generator: topN: 50
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls for politeness.
>> Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
>> Generator: finished at 2011-07-12 12:28:57, elapsed: 00:00:04
>> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
>> Fetcher: starting at 2011-07-12 12:28:57
>> Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
>> Fetcher: threads: 10
>> QueueFeeder finished: total 1 records + hit by time limit :0
>> fetching http://nutch.apache.org/
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=1
>> -finishing thread FetcherThread, activeThreads=1
>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> -finishing thread FetcherThread, activeThreads=0
>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> -activeThreads=0
>> Fetcher: finished at 2011-07-12 12:29:01, elapsed: 00:00:03
>> ParseSegment: starting at 2011-07-12 12:29:01
>> ParseSegment: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
>> ParseSegment: finished at 2011-07-12 12:29:03, elapsed: 00:00:02
>> CrawlDb update: starting at 2011-07-12 12:29:03
>> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
>> CrawlDb update: segments: [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: finished at 2011-07-12 12:29:06, elapsed: 00:00:02
>> Generator: starting at 2011-07-12 12:29:06
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: filtering: true
>> Generator: normalizing: true
>> Generator: topN: 50
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls for politeness.
>> Generator: segment: >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908 >> Generator: finished at 2011-07-12 12:29:10, elapsed: 00:00:03 >> Fetcher: Your 'http.agent.name' value should be listed first in >> 'http.robots.agents' property. >> Fetcher: starting at 2011-07-12 12:29:10 >> Fetcher: segment: >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908 Fetcher: >> threads: 10 >> QueueFeeder finished: total 50 records + hit by time limit :0 >> fetching http://www.cafepress.com/nutch/ >> fetching http://creativecommons.org/press-releases/entry/5064 >> fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html >> fetching http://www.apache.org/dist/nutch/CHANGES-1.0.txt >> fetching http://eu.apachecon.com/c/aceu2009/sessions/138 >> fetching http://www.us.apachecon.com/c/acus2009/ >> fetching http://issues.apache.org/jira/browse/NUTCH >> fetching http://forrest.apache.org/ >> fetching http://hadoop.apache.org/ >> fetching http://wiki.apache.org/nutch/ >> fetching http://nutch.apache.org/credits.html >> fetching http://tika.apache.org/ >> fetching http://lucene.apache.org/solr/ >> fetching http://osuosl.org/news_folder/nutch >> fetching http://www.eu.apachecon.com/c/aceu2009/ >> -activeThreads=10, spinWaiting=1, fetchQueues.totalSize=35 >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=35 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35 >> fetching http://www.apache.org/ >> fetching http://eu.apachecon.com/c/aceu2009/sessions/251 >> fetching http://nutch.apache.org/skin/fontsize.js >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=32 >> fetching http://www.us.apachecon.com/c/acus2009/schedule >> fetching http://wiki.apache.org/nutch/NutchTutorial >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=30 >> fetching http://lucene.apache.org/java/ >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29 >> fetching http://www.apache.org/dyn/closer.cgi/nutch/ >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=28 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=28 >> fetching http://eu.apachecon.com/c/aceu2009/sessions/197 >> fetching http://nutch.apache.org/nightly.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26 >> fetching http://wiki.apache.org/nutch/FAQ >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=25 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=25 >> fetching http://www.apache.org/licenses/ >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=24 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=24 >> fetching http://eu.apachecon.com/c/aceu2009/sessions/136 >> fetching http://nutch.apache.org/apidocs-1.3/index.html >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=22 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=22 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=22 >> fetching http://www.apache.org/dist/nutch/CHANGES-1.2.txt >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=21 >> fetching http://nutch.apache.org/skin/breadcrumbs.js >> fetching http://eu.apachecon.com/c/aceu2009/sessions/165 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19 >> -activeThreads=10, 
spinWaiting=10, fetchQueues.totalSize=19 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=19 >> fetching http://www.apache.org/dist/nutch/CHANGES-0.9.txt >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=18 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=18 >> fetching http://eu.apachecon.com/c/aceu2009/sessions/201 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=17 >> fetching http://nutch.apache.org/skin/getMenu.js >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16 >> fetching http://www.apache.org/dist/nutch/CHANGES-1.1.txt >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=15 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15 >> fetching http://eu.apachecon.com/c/aceu2009/sessions/137 >> fetching http://nutch.apache.org/index.html >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13 >> fetching http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12 >> fetching >> http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_0 >> 4_21.txt fetching http://eu.apachecon.com/c/aceu2009/sessions/250 >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=10 >> fetching http://nutch.apache.org/mailing_lists.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 >> fetching http://www.apache.org/dist/nutch/CHANGES-1.3.txt >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=8 >> fetching http://nutch.apache.org/bot.html >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=7 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 >> fetching http://nutch.apache.org/issue_tracking.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6 >> fetching http://nutch.apache.org/about.html >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=5 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 >> fetching http://nutch.apache.org/i18n.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466617719 >> now = 1310466613063 >> 0. http://nutch.apache.org/version_control.html >> 1. http://nutch.apache.org/skin/getBlank.js >> 2. 
http://nutch.apache.org/index.pdf >> 3. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466617719 >> now = 1310466614064 >> 0. http://nutch.apache.org/version_control.html >> 1. http://nutch.apache.org/skin/getBlank.js >> 2. http://nutch.apache.org/index.pdf >> 3. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466617719 >> now = 1310466615066 >> 0. http://nutch.apache.org/version_control.html >> 1. http://nutch.apache.org/skin/getBlank.js >> 2. http://nutch.apache.org/index.pdf >> 3. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466617719 >> now = 1310466616068 >> 0. http://nutch.apache.org/version_control.html >> 1. http://nutch.apache.org/skin/getBlank.js >> 2. http://nutch.apache.org/index.pdf >> 3. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466617719 >> now = 1310466617069 >> 0. http://nutch.apache.org/version_control.html >> 1. http://nutch.apache.org/skin/getBlank.js >> 2. http://nutch.apache.org/index.pdf >> 3. http://nutch.apache.org/apidocs-1.2/index.html >> fetching http://nutch.apache.org/version_control.html >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 1 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466617719 >> now = 1310466618071 >> 0. http://nutch.apache.org/skin/getBlank.js >> 1. http://nutch.apache.org/index.pdf >> 2. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466623151 >> now = 1310466619073 >> 0. http://nutch.apache.org/skin/getBlank.js >> 1. http://nutch.apache.org/index.pdf >> 2. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466623151 >> now = 1310466620075 >> 0. http://nutch.apache.org/skin/getBlank.js >> 1. http://nutch.apache.org/index.pdf >> 2. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466623151 >> now = 1310466621077 >> 0. http://nutch.apache.org/skin/getBlank.js >> 1. http://nutch.apache.org/index.pdf >> 2. 
http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466623151 >> now = 1310466622078 >> 0. http://nutch.apache.org/skin/getBlank.js >> 1. http://nutch.apache.org/index.pdf >> 2. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466623151 >> now = 1310466623080 >> 0. http://nutch.apache.org/skin/getBlank.js >> 1. http://nutch.apache.org/index.pdf >> 2. http://nutch.apache.org/apidocs-1.2/index.html >> fetching http://nutch.apache.org/skin/getBlank.js >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466628578 >> now = 1310466624082 >> 0. http://nutch.apache.org/index.pdf >> 1. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466628578 >> now = 1310466625084 >> 0. http://nutch.apache.org/index.pdf >> 1. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466628578 >> now = 1310466626086 >> 0. http://nutch.apache.org/index.pdf >> 1. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466628578 >> now = 1310466627088 >> 0. http://nutch.apache.org/index.pdf >> 1. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466628578 >> now = 1310466628089 >> 0. http://nutch.apache.org/index.pdf >> 1. http://nutch.apache.org/apidocs-1.2/index.html >> fetching http://nutch.apache.org/index.pdf >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=1 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 1 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466628578 >> now = 1310466629090 >> 0. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466634844 >> now = 1310466630092 >> 0. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466634844 >> now = 1310466631094 >> 0. 
http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466634844 >> now = 1310466632095 >> 0. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466634844 >> now = 1310466633097 >> 0. http://nutch.apache.org/apidocs-1.2/index.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 >> * queue: http://nutch.apache.org >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466634844 >> now = 1310466634099 >> 0. http://nutch.apache.org/apidocs-1.2/index.html >> fetching http://nutch.apache.org/apidocs-1.2/index.html >> -finishing thread FetcherThread, activeThreads=9 >> -finishing thread FetcherThread, activeThreads=8 >> -finishing thread FetcherThread, activeThreads=7 >> -finishing thread FetcherThread, activeThreads=6 >> -finishing thread FetcherThread, activeThreads=5 >> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=0 >> -finishing thread FetcherThread, activeThreads=4 >> -finishing thread FetcherThread, activeThreads=3 >> -finishing thread FetcherThread, activeThreads=2 >> -finishing thread FetcherThread, activeThreads=1 >> -finishing thread FetcherThread, activeThreads=0 >> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 >> -activeThreads=0 >> Fetcher: finished at 2011-07-12 12:30:37, elapsed: 00:01:27 >> ParseSegment: starting at 2011-07-12 12:30:37 >> ParseSegment: segment: >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908 >> Error parsing: http://nutch.apache.org/skin/breadcrumbs.js: >> failed(2,0): Can't retrieve Tika parser for mime-type >> application/javascript >> Error parsing: http://nutch.apache.org/skin/fontsize.js: failed(2,0): >> Can't retrieve Tika parser for mime-type application/javascript >> Error parsing: http://nutch.apache.org/skin/getBlank.js: failed(2,0): >> Can't retrieve Tika parser for mime-type application/javascript >> Error parsing: http://nutch.apache.org/skin/getMenu.js: failed(2,0): >> Can't retrieve Tika parser for mime-type application/javascript >> ParseSegment: finished at 2011-07-12 12:30:46, elapsed: 00:00:08 >> CrawlDb update: starting at 2011-07-12 12:30:46 >> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb >> CrawlDb update: segments: >> [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908] >> CrawlDb update: additions allowed: true >> CrawlDb update: URL normalizing: true >> CrawlDb update: URL filtering: true >> CrawlDb update: Merging segment data into db. >> CrawlDb update: finished at 2011-07-12 12:30:48, elapsed: 00:00:02 >> Generator: starting at 2011-07-12 12:30:48 >> Generator: Selecting best-scoring urls due for fetch. >> Generator: filtering: true >> Generator: normalizing: true >> Generator: topN: 50 >> Generator: jobtracker is 'local', generating exactly one partition. >> Generator: Partitioning selected urls for politeness. >> Generator: segment: >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051 >> Generator: finished at 2011-07-12 12:30:52, elapsed: 00:00:03 >> Fetcher: Your 'http.agent.name' value should be listed first in >> 'http.robots.agents' property. 
>> Fetcher: starting at 2011-07-12 12:30:52 >> Fetcher: segment: >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051 Fetcher: >> threads: 10 >> QueueFeeder finished: total 50 records + hit by time limit :0 >> fetching http://www.onehippo.com/ >> fetching http://apacheconeu.blogspot.com/ >> fetching http://www.day.com/ >> fetching http://www.func.nl/apacheconeu2009 >> fetching http://www.thawte.com/ >> fetching http://eu.apachecon.com/c/aceu2009/about >> fetching http://www.us.apachecon.com/c/acus2009/sessions/333 >> fetching http://www.joost.com/ >> fetching http://developer.yahoo.com/blogs/hadoop/ >> fetching http://www.springsource.com/ >> fetching http://www.isi.edu/~koehn/europarl/ >> fetching http://www.topicus.nl/ >> fetching http://opensource.hp.com/ >> fetching http://nutch.apache.org/apidocs-1.3/overview-frame.html >> -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=36 >> fetching http://www.haloworldwide.com/ >> fetching https://builds.apache.org/job/Nutch-trunk/javadoc/ >> fetch of https://builds.apache.org/job/Nutch-trunk/javadoc/ failed >> with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found >> for url=https >> fetching http://www.hotwaxmedia.com/ >> fetching http://lucene.apache.org/hadoop >> fetching http://www.cloudera.com/ >> fetching http://code.google.com/opensource/ >> fetching http://www.lucidimagination.com/ >> fetching http://apache.lehtivihrea.org/nutch/ >> fetching http://www.eu.apachecon.com/c/aceu2009/about/meetups >> -activeThreads=10, spinWaiting=4, fetchQueues.totalSize=27 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27 >> fetching http://www.us.apachecon.com/c/acus2009/sessions/334 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26 >> fetching http://nutch.apache.org/apidocs-1.2/allclasses-frame.html >> fetching http://eu.apachecon.com/c/aceu2009/about/crowdvine >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=24 >> fetching http://www.eu.apachecon.com/c/aceu2009/about/videoStreaming >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=23 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=23 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=23 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=23 >> fetching http://www.us.apachecon.com/c/acus2009/sessions/335 >> fetching http://nutch.apache.org/apidocs-1.2/overview-summary.html >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21 >> fetching http://eu.apachecon.com/c/aceu2009/speakers >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=20 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=20 >> fetching http://www.eu.apachecon.com/c/aceu2009/sponsors/sponsor >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19 >> fetching http://www.us.apachecon.com/c/acus2009/sessions/461 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=18 >> fetching http://nutch.apache.org/apidocs-1.3/allclasses-frame.html >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=17 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=17 >> fetching http://eu.apachecon.com/c/aceu2009/articles >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=16 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16 >> fetching 
http://www.us.apachecon.com/c/acus2009/sessions/427 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=15 >> fetching http://nutch.apache.org/apidocs-1.2/overview-frame.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14 >> fetching http://eu.apachecon.com/c/aceu2009/sessions/ >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13 >> fetching http://www.us.apachecon.com/c/acus2009/sessions/430 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=12 >> fetching http://nutch.apache.org/apidocs-1.3/overview-summary.html >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11 >> fetching http://eu.apachecon.com/c/aceu2009/sponsors/sponsors >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=10 >> fetching http://www.us.apachecon.com/c/acus2009/sessions/375 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9 >> fetching http://eu.apachecon.com/c/ >> fetching http://www.us.apachecon.com/c/acus2009/sessions/462 >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=7 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7 >> fetching http://www.us.apachecon.com/c/acus2009/sessions/428 >> fetching http://eu.apachecon.com/c/aceu2009/schedule >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=5 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5 >> fetching http://www.us.apachecon.com/c/acus2009/sessions/331 >> fetching http://eu.apachecon.com/c/aceu2009/ >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3 >> * queue: http://eu.apachecon.com >> maxThreads = 1 >> inProgress = 1 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466704235 >> now = 1310466704428 >> 0. http://eu.apachecon.com/js/jquery.akslideshow.js >> * queue: http://www.us.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466709214 >> now = 1310466704428 >> 0. http://www.us.apachecon.com/c/acus2009/sessions/437 >> 1. http://www.us.apachecon.com/c/acus2009/sessions/332 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3 >> * queue: http://eu.apachecon.com >> maxThreads = 1 >> inProgress = 1 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466704235 >> now = 1310466705429 >> 0. 
http://eu.apachecon.com/js/jquery.akslideshow.js >> * queue: http://www.us.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466709214 >> now = 1310466705430 >> 0. http://www.us.apachecon.com/c/acus2009/sessions/437 >> 1. http://www.us.apachecon.com/c/acus2009/sessions/332 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 >> * queue: http://eu.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466710968 >> now = 1310466706431 >> 0. http://eu.apachecon.com/js/jquery.akslideshow.js >> * queue: http://www.us.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466709214 >> now = 1310466706431 >> 0. http://www.us.apachecon.com/c/acus2009/sessions/437 >> 1. http://www.us.apachecon.com/c/acus2009/sessions/332 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 >> * queue: http://eu.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466710968 >> now = 1310466707433 >> 0. http://eu.apachecon.com/js/jquery.akslideshow.js >> * queue: http://www.us.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466709214 >> now = 1310466707433 >> 0. http://www.us.apachecon.com/c/acus2009/sessions/437 >> 1. http://www.us.apachecon.com/c/acus2009/sessions/332 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3 >> * queue: http://eu.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466710968 >> now = 1310466708435 >> 0. http://eu.apachecon.com/js/jquery.akslideshow.js >> * queue: http://www.us.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466709214 >> now = 1310466708435 >> 0. http://www.us.apachecon.com/c/acus2009/sessions/437 >> 1. http://www.us.apachecon.com/c/acus2009/sessions/332 >> fetching http://www.us.apachecon.com/c/acus2009/sessions/437 >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=2 >> * queue: http://eu.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466710968 >> now = 1310466709442 >> 0. http://eu.apachecon.com/js/jquery.akslideshow.js >> * queue: http://www.us.apachecon.com >> maxThreads = 1 >> inProgress = 1 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466709214 >> now = 1310466709442 >> 0. http://www.us.apachecon.com/c/acus2009/sessions/332 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2 >> * queue: http://eu.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466710968 >> now = 1310466710444 >> 0. http://eu.apachecon.com/js/jquery.akslideshow.js >> * queue: http://www.us.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466714813 >> now = 1310466710444 >> 0. http://www.us.apachecon.com/c/acus2009/sessions/332 >> fetching http://eu.apachecon.com/js/jquery.akslideshow.js >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 >> * queue: http://www.us.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466714813 >> now = 1310466711446 >> 0. 
http://www.us.apachecon.com/c/acus2009/sessions/332 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 >> * queue: http://www.us.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466714813 >> now = 1310466712447 >> 0. http://www.us.apachecon.com/c/acus2009/sessions/332 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 >> * queue: http://www.us.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466714813 >> now = 1310466713448 >> 0. http://www.us.apachecon.com/c/acus2009/sessions/332 >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 >> * queue: http://www.us.apachecon.com >> maxThreads = 1 >> inProgress = 0 >> crawlDelay = 5000 >> minCrawlDelay = 0 >> nextFetchTime = 1310466714813 >> now = 1310466714450 >> 0. http://www.us.apachecon.com/c/acus2009/sessions/332 >> fetching http://www.us.apachecon.com/c/acus2009/sessions/332 >> -finishing thread FetcherThread, activeThreads=9 >> -finishing thread FetcherThread, activeThreads=8 >> -finishing thread FetcherThread, activeThreads=7 >> -finishing thread FetcherThread, activeThreads=6 >> -finishing thread FetcherThread, activeThreads=5 >> -finishing thread FetcherThread, activeThreads=4 >> -finishing thread FetcherThread, activeThreads=3 >> -finishing thread FetcherThread, activeThreads=2 >> -finishing thread FetcherThread, activeThreads=1 >> -finishing thread FetcherThread, activeThreads=0 >> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 >> -activeThreads=0 >> Fetcher: finished at 2011-07-12 12:31:55, elapsed: 00:01:03 >> ParseSegment: starting at 2011-07-12 12:31:55 >> ParseSegment: segment: >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051 >> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js: >> failed(2,0): Can't retrieve Tika parser for mime-type text/javascript >> ParseSegment: finished at 2011-07-12 12:31:59, elapsed: 00:00:03 >> CrawlDb update: starting at 2011-07-12 12:31:59 >> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb >> CrawlDb update: segments: >> [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051] >> CrawlDb update: additions allowed: true >> CrawlDb update: URL normalizing: true >> CrawlDb update: URL filtering: true >> CrawlDb update: Merging segment data into db. 
>> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
>> LinkDb: starting at 2011-07-12 12:32:03
>> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
>> LinkDb: URL normalize: true
>> LinkDb: URL filter: true
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
>> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
>> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
>> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
>>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>>         at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
>>
>> 2011/7/12 Julien Nioche <[email protected]>:
>> >> Actually I'm not sure if I am looking at the right log lines. Please
>> >> explain in more detail what exactly I should look for. Anyway, I
>> >> found the following line just before the error:
>> >>
>> >> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
>> >> failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
>> >>
>> >> But I can see that parsing errors like this already appeared earlier
>> >> during the crawl.
>> >
>> > This simply means that the javascript parser is not enabled in your conf
>> > (which is the default behaviour), and as a consequence the default parser
>> > (Tika) was used to try to parse it but has no resources for doing so.
>> >
>> > Note: we should probably add .js to the default url filters. The
>> > javascript parser has been deactivated by default because it generates
>> > atrocious URLs, so we might as well prevent such URLs from being fetched
>> > in the first place.
>> >
>> > Anyway, this is not the source of the problem. You seem to have unparsed
>> > segments among the ones specified. It could be that you interrupted a
>> > previous crawl, or had a problem with it and did not delete those
>> > segments or the whole crawl directory. Removing the segments and calling
>> > the last couple of steps manually should do the trick.
>> >
>> >> 2011/7/12 Markus Jelsma <[email protected]>:
>> >> > Were there errors during parsing of that last segment?
>> >> >
>> >> >> I'm starting with nutch and I ran a simple job as described in the
>> >> >> nutch tutorial. After a while I get the following error:
>> >> >>
>> >> >> CrawlDb update: URL filtering: true
>> >> >> CrawlDb update: Merging segment data into db.
>> >> >> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
>> >> >> LinkDb: starting at 2011-07-12 12:32:03
>> >> >> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
>> >> >> LinkDb: URL normalize: true
>> >> >> LinkDb: URL filter: true
>> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
>> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
>> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
>> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
>> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
>> >> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
>> >> >> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
>> >> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
>> >> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
>> >> >>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>> >> >>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>> >> >>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>> >> >>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> >> >>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> >> >>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> >> >>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> >> >>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>> >> >>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>> >> >>         at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
>> >> >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >> >>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
>> >
>> > --
>> > Open Source Solutions for Text Engineering
>> >
>> > http://digitalpebble.blogspot.com/
>> > http://www.digitalpebble.com
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
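
In concrete terms, Julien's suggestion above (remove the segments that never got parse data, then run the last couple of steps by hand) might look roughly like this for the directory layout used in this thread. This is only a sketch: the three segment names are the ones listed in the InvalidInputException, and the Solr URL is a placeholder.

  cd /Users/toom/Downloads/nutch-1.3

  # remove the leftover segments that have no parse_data
  rm -r sites/segments/20110707140238
  rm -r sites/segments/20110712113732
  rm -r sites/segments/20110712114256
  # (alternatively, "bin/nutch parse <segment>" should be able to create the
  #  missing parse data for a segment that was fetched but never parsed)

  # rebuild the linkdb from the remaining, fully parsed segments
  bin/nutch invertlinks sites/linkdb -dir sites/segments

  # optionally, index once a Solr instance is running (placeholder URL)
  bin/nutch solrindex http://localhost:8983/solr/ sites/crawldb sites/linkdb sites/segments/*

After that, re-running the crawl (or the remaining generate/fetch/parse/updatedb rounds) should no longer trip over the old segments.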
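
As for Julien's note about adding .js to the default url filters: presumably that means extending the suffix-exclusion rule in conf/regex-urlfilter.txt so that .js URLs are never selected for fetching in the first place. The exact default pattern differs between releases, but the change would be along these lines (note the added js|JS):

  # skip suffixes we do not want to fetch at all
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|zip|ZIP|exe|EXE|js|JS)$

The other route, enabling the javascript parser (parse-js) through the plugin.includes property in nutch-site.xml, would also silence the "Can't retrieve Tika parser" messages, but as Julien says it is disabled by default for a reason.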

