I don't see segment 20110712114256 being parsed anywhere in this output.
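
A quick way to check this against a saved copy of the console output is to collect the segment ids that each stage reports. The snippet below is only a rough sketch: the function name `segments_by_stage` is made up for illustration, and it assumes the log looks like the quoted output, where a `Generator: segment:`, `Fetcher: segment:` or `ParseSegment: segment:` header is followed by the segment path on the same line or on the next (mail-wrapped) line.

```python
import re

# A segment id is the 14-digit timestamp directory under .../segments/
SEGMENT_ID = re.compile(r"/segments/(\d{14})")
# Stage headers as they appear in the quoted console output (assumed format)
STAGE_HEADER = re.compile(r"(Generator|Fetcher|ParseSegment): segment:")


def segments_by_stage(log_text):
    """Return {stage name: set of segment ids} found in the log text."""
    stages = {}
    pending = None  # stage whose path may have wrapped onto the next line
    for raw in log_text.splitlines():
        # Drop mail quote markers ("> ") and surrounding whitespace
        line = raw.lstrip("> ").strip()
        header = STAGE_HEADER.match(line)
        stage = header.group(1) if header else pending
        pending = None
        if stage is None:
            continue
        found = SEGMENT_ID.search(line)
        if found:
            stages.setdefault(stage, set()).add(found.group(1))
        elif header:
            # Header without a path: look at the following line only
            pending = stage
    return stages


sample = """\
> Generator: segment:
> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> ParseSegment: segment:
> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
"""
stages = segments_by_stage(sample)
print("20110712114256" in stages.get("ParseSegment", set()))  # prints False
```

If the id is missing from the `ParseSegment` set but a directory with that name exists under `segments/`, that segment was most likely left over from an earlier run and never went through the parse step.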

On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote:
> I'm not sure I understood you correctly. Here is the complete output
> of my crawl:
> 
> 
> tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
> solrUrl is not set, indexing will be skipped...
> crawl started in: /Users/toom/Downloads/nutch-1.3/sites
> rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled
> threads = 10
> depth = 3
> solrUrl=null
> topN = 50
> Injector: starting at 2011-07-12 12:28:49
> Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-07-12 12:28:53, elapsed: 00:00:04
> Generator: starting at 2011-07-12 12:28:53
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment:
> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> Generator: finished at 2011-07-12 12:28:57, elapsed: 00:00:04
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-07-12 12:28:57
> Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching http://nutch.apache.org/
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-12 12:29:01, elapsed: 00:00:03
> ParseSegment: starting at 2011-07-12 12:29:01
> ParseSegment: segment:
> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> ParseSegment: finished at 2011-07-12 12:29:03, elapsed: 00:00:02
> CrawlDb update: starting at 2011-07-12 12:29:03
> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> CrawlDb update: segments:
> [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-12 12:29:06, elapsed: 00:00:02
> Generator: starting at 2011-07-12 12:29:06
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment:
> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> Generator: finished at 2011-07-12 12:29:10, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-07-12 12:29:10
> Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> Fetcher: threads: 10
> QueueFeeder finished: total 50 records + hit by time limit :0
> fetching http://www.cafepress.com/nutch/
> fetching http://creativecommons.org/press-releases/entry/5064
> fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
> fetching http://www.apache.org/dist/nutch/CHANGES-1.0.txt
> fetching http://eu.apachecon.com/c/aceu2009/sessions/138
> fetching http://www.us.apachecon.com/c/acus2009/
> fetching http://issues.apache.org/jira/browse/NUTCH
> fetching http://forrest.apache.org/
> fetching http://hadoop.apache.org/
> fetching http://wiki.apache.org/nutch/
> fetching http://nutch.apache.org/credits.html
> fetching http://tika.apache.org/
> fetching http://lucene.apache.org/solr/
> fetching http://osuosl.org/news_folder/nutch
> fetching http://www.eu.apachecon.com/c/aceu2009/
> -activeThreads=10, spinWaiting=1, fetchQueues.totalSize=35
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=35
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35
> fetching http://www.apache.org/
> fetching http://eu.apachecon.com/c/aceu2009/sessions/251
> fetching http://nutch.apache.org/skin/fontsize.js
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=32
> fetching http://www.us.apachecon.com/c/acus2009/schedule
> fetching http://wiki.apache.org/nutch/NutchTutorial
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=30
> fetching http://lucene.apache.org/java/
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29
> fetching http://www.apache.org/dyn/closer.cgi/nutch/
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=28
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=28
> fetching http://eu.apachecon.com/c/aceu2009/sessions/197
> fetching http://nutch.apache.org/nightly.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
> fetching http://wiki.apache.org/nutch/FAQ
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=25
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=25
> fetching http://www.apache.org/licenses/
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=24
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=24
> fetching http://eu.apachecon.com/c/aceu2009/sessions/136
> fetching http://nutch.apache.org/apidocs-1.3/index.html
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=22
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=22
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=22
> fetching http://www.apache.org/dist/nutch/CHANGES-1.2.txt
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=21
> fetching http://nutch.apache.org/skin/breadcrumbs.js
> fetching http://eu.apachecon.com/c/aceu2009/sessions/165
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=19
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=19
> fetching http://www.apache.org/dist/nutch/CHANGES-0.9.txt
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=18
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=18
> fetching http://eu.apachecon.com/c/aceu2009/sessions/201
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=17
> fetching http://nutch.apache.org/skin/getMenu.js
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16
> fetching http://www.apache.org/dist/nutch/CHANGES-1.1.txt
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=15
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15
> fetching http://eu.apachecon.com/c/aceu2009/sessions/137
> fetching http://nutch.apache.org/index.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13
> fetching http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> fetching http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt
> fetching http://eu.apachecon.com/c/aceu2009/sessions/250
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=10
> fetching http://nutch.apache.org/mailing_lists.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> fetching http://www.apache.org/dist/nutch/CHANGES-1.3.txt
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=8
> fetching http://nutch.apache.org/bot.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> fetching http://nutch.apache.org/issue_tracking.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> fetching http://nutch.apache.org/about.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> fetching http://nutch.apache.org/i18n.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466617719
>   now           = 1310466613063
>   0. http://nutch.apache.org/version_control.html
>   1. http://nutch.apache.org/skin/getBlank.js
>   2. http://nutch.apache.org/index.pdf
>   3. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466617719
>   now           = 1310466614064
>   0. http://nutch.apache.org/version_control.html
>   1. http://nutch.apache.org/skin/getBlank.js
>   2. http://nutch.apache.org/index.pdf
>   3. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466617719
>   now           = 1310466615066
>   0. http://nutch.apache.org/version_control.html
>   1. http://nutch.apache.org/skin/getBlank.js
>   2. http://nutch.apache.org/index.pdf
>   3. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466617719
>   now           = 1310466616068
>   0. http://nutch.apache.org/version_control.html
>   1. http://nutch.apache.org/skin/getBlank.js
>   2. http://nutch.apache.org/index.pdf
>   3. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466617719
>   now           = 1310466617069
>   0. http://nutch.apache.org/version_control.html
>   1. http://nutch.apache.org/skin/getBlank.js
>   2. http://nutch.apache.org/index.pdf
>   3. http://nutch.apache.org/apidocs-1.2/index.html
> fetching http://nutch.apache.org/version_control.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 1
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466617719
>   now           = 1310466618071
>   0. http://nutch.apache.org/skin/getBlank.js
>   1. http://nutch.apache.org/index.pdf
>   2. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466623151
>   now           = 1310466619073
>   0. http://nutch.apache.org/skin/getBlank.js
>   1. http://nutch.apache.org/index.pdf
>   2. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466623151
>   now           = 1310466620075
>   0. http://nutch.apache.org/skin/getBlank.js
>   1. http://nutch.apache.org/index.pdf
>   2. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466623151
>   now           = 1310466621077
>   0. http://nutch.apache.org/skin/getBlank.js
>   1. http://nutch.apache.org/index.pdf
>   2. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466623151
>   now           = 1310466622078
>   0. http://nutch.apache.org/skin/getBlank.js
>   1. http://nutch.apache.org/index.pdf
>   2. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466623151
>   now           = 1310466623080
>   0. http://nutch.apache.org/skin/getBlank.js
>   1. http://nutch.apache.org/index.pdf
>   2. http://nutch.apache.org/apidocs-1.2/index.html
> fetching http://nutch.apache.org/skin/getBlank.js
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466628578
>   now           = 1310466624082
>   0. http://nutch.apache.org/index.pdf
>   1. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466628578
>   now           = 1310466625084
>   0. http://nutch.apache.org/index.pdf
>   1. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466628578
>   now           = 1310466626086
>   0. http://nutch.apache.org/index.pdf
>   1. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466628578
>   now           = 1310466627088
>   0. http://nutch.apache.org/index.pdf
>   1. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466628578
>   now           = 1310466628089
>   0. http://nutch.apache.org/index.pdf
>   1. http://nutch.apache.org/apidocs-1.2/index.html
> fetching http://nutch.apache.org/index.pdf
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 1
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466628578
>   now           = 1310466629090
>   0. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466634844
>   now           = 1310466630092
>   0. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466634844
>   now           = 1310466631094
>   0. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466634844
>   now           = 1310466632095
>   0. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466634844
>   now           = 1310466633097
>   0. http://nutch.apache.org/apidocs-1.2/index.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://nutch.apache.org
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466634844
>   now           = 1310466634099
>   0. http://nutch.apache.org/apidocs-1.2/index.html
> fetching http://nutch.apache.org/apidocs-1.2/index.html
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-12 12:30:37, elapsed: 00:01:27
> ParseSegment: starting at 2011-07-12 12:30:37
> ParseSegment: segment:
> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> Error parsing: http://nutch.apache.org/skin/breadcrumbs.js:
> failed(2,0): Can't retrieve Tika parser for mime-type
> application/javascript
> Error parsing: http://nutch.apache.org/skin/fontsize.js: failed(2,0):
> Can't retrieve Tika parser for mime-type application/javascript
> Error parsing: http://nutch.apache.org/skin/getBlank.js: failed(2,0):
> Can't retrieve Tika parser for mime-type application/javascript
> Error parsing: http://nutch.apache.org/skin/getMenu.js: failed(2,0):
> Can't retrieve Tika parser for mime-type application/javascript
> ParseSegment: finished at 2011-07-12 12:30:46, elapsed: 00:00:08
> CrawlDb update: starting at 2011-07-12 12:30:46
> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> CrawlDb update: segments:
> [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-12 12:30:48, elapsed: 00:00:02
> Generator: starting at 2011-07-12 12:30:48
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment:
> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> Generator: finished at 2011-07-12 12:30:52, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-07-12 12:30:52
> Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> Fetcher: threads: 10
> QueueFeeder finished: total 50 records + hit by time limit :0
> fetching http://www.onehippo.com/
> fetching http://apacheconeu.blogspot.com/
> fetching http://www.day.com/
> fetching http://www.func.nl/apacheconeu2009
> fetching http://www.thawte.com/
> fetching http://eu.apachecon.com/c/aceu2009/about
> fetching http://www.us.apachecon.com/c/acus2009/sessions/333
> fetching http://www.joost.com/
> fetching http://developer.yahoo.com/blogs/hadoop/
> fetching http://www.springsource.com/
> fetching http://www.isi.edu/~koehn/europarl/
> fetching http://www.topicus.nl/
> fetching http://opensource.hp.com/
> fetching http://nutch.apache.org/apidocs-1.3/overview-frame.html
> -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=36
> fetching http://www.haloworldwide.com/
> fetching https://builds.apache.org/job/Nutch-trunk/javadoc/
> fetch of https://builds.apache.org/job/Nutch-trunk/javadoc/ failed
> with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found
> for url=https
> fetching http://www.hotwaxmedia.com/
> fetching http://lucene.apache.org/hadoop
> fetching http://www.cloudera.com/
> fetching http://code.google.com/opensource/
> fetching http://www.lucidimagination.com/
> fetching http://apache.lehtivihrea.org/nutch/
> fetching http://www.eu.apachecon.com/c/aceu2009/about/meetups
> -activeThreads=10, spinWaiting=4, fetchQueues.totalSize=27
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
> fetching http://www.us.apachecon.com/c/acus2009/sessions/334
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
> fetching http://nutch.apache.org/apidocs-1.2/allclasses-frame.html
> fetching http://eu.apachecon.com/c/aceu2009/about/crowdvine
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=24
> fetching http://www.eu.apachecon.com/c/aceu2009/about/videoStreaming
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=23
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=23
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=23
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=23
> fetching http://www.us.apachecon.com/c/acus2009/sessions/335
> fetching http://nutch.apache.org/apidocs-1.2/overview-summary.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21
> fetching http://eu.apachecon.com/c/aceu2009/speakers
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=20
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=20
> fetching http://www.eu.apachecon.com/c/aceu2009/sponsors/sponsor
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> fetching http://www.us.apachecon.com/c/acus2009/sessions/461
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=18
> fetching http://nutch.apache.org/apidocs-1.3/allclasses-frame.html
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=17
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=17
> fetching http://eu.apachecon.com/c/aceu2009/articles
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=16
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16
> fetching http://www.us.apachecon.com/c/acus2009/sessions/427
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=15
> fetching http://nutch.apache.org/apidocs-1.2/overview-frame.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
> fetching http://eu.apachecon.com/c/aceu2009/sessions/
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13
> fetching http://www.us.apachecon.com/c/acus2009/sessions/430
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=12
> fetching http://nutch.apache.org/apidocs-1.3/overview-summary.html
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> fetching http://eu.apachecon.com/c/aceu2009/sponsors/sponsors
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=10
> fetching http://www.us.apachecon.com/c/acus2009/sessions/375
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> fetching http://eu.apachecon.com/c/
> fetching http://www.us.apachecon.com/c/acus2009/sessions/462
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> fetching http://www.us.apachecon.com/c/acus2009/sessions/428
> fetching http://eu.apachecon.com/c/aceu2009/schedule
> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> fetching http://www.us.apachecon.com/c/acus2009/sessions/331
> fetching http://eu.apachecon.com/c/aceu2009/
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3
> * queue: http://eu.apachecon.com
>   maxThreads    = 1
>   inProgress    = 1
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466704235
>   now           = 1310466704428
>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466709214
>   now           = 1310466704428
>   0. http://www.us.apachecon.com/c/acus2009/sessions/437
>   1. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3
> * queue: http://eu.apachecon.com
>   maxThreads    = 1
>   inProgress    = 1
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466704235
>   now           = 1310466705429
>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466709214
>   now           = 1310466705430
>   0. http://www.us.apachecon.com/c/acus2009/sessions/437
>   1. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://eu.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466710968
>   now           = 1310466706431
>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466709214
>   now           = 1310466706431
>   0. http://www.us.apachecon.com/c/acus2009/sessions/437
>   1. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://eu.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466710968
>   now           = 1310466707433
>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466709214
>   now           = 1310466707433
>   0. http://www.us.apachecon.com/c/acus2009/sessions/437
>   1. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> * queue: http://eu.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466710968
>   now           = 1310466708435
>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466709214
>   now           = 1310466708435
>   0. http://www.us.apachecon.com/c/acus2009/sessions/437
>   1. http://www.us.apachecon.com/c/acus2009/sessions/332
> fetching http://www.us.apachecon.com/c/acus2009/sessions/437
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=2
> * queue: http://eu.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466710968
>   now           = 1310466709442
>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
>   maxThreads    = 1
>   inProgress    = 1
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466709214
>   now           = 1310466709442
>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> * queue: http://eu.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466710968
>   now           = 1310466710444
>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> * queue: http://www.us.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466714813
>   now           = 1310466710444
>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> fetching http://eu.apachecon.com/js/jquery.akslideshow.js
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.us.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466714813
>   now           = 1310466711446
>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.us.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466714813
>   now           = 1310466712447
>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.us.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466714813
>   now           = 1310466713448
>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> * queue: http://www.us.apachecon.com
>   maxThreads    = 1
>   inProgress    = 0
>   crawlDelay    = 5000
>   minCrawlDelay = 0
>   nextFetchTime = 1310466714813
>   now           = 1310466714450
>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> fetching http://www.us.apachecon.com/c/acus2009/sessions/332
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-12 12:31:55, elapsed: 00:01:03
> ParseSegment: starting at 2011-07-12 12:31:55
> ParseSegment: segment:
> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
> failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
> ParseSegment: finished at 2011-07-12 12:31:59, elapsed: 00:00:03
> CrawlDb update: starting at 2011-07-12 12:31:59
> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> CrawlDb update: segments:
> [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
> LinkDb: starting at 2011-07-12 12:32:03
> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
> LinkDb: adding segment:
> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
> LinkDb: adding segment:
> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
> LinkDb: adding segment:
> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> LinkDb: adding segment:
> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> LinkDb: adding segment:
> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> Exception in thread "main"
> org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
>       at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>       at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>       at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>       at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>       at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> 
> 2011/7/12 Julien Nioche <[email protected]>:
> >> Actually I'm not sure if I am looking at the right log lines. Please
> >> explain in more detail what exactly I should look for. Anyway, I
> >> found the following line just before the error:
> >> 
> >> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
> >> failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
> >> 
> >> But I can see that parsing errors like this already appeared earlier
> >> during the crawl.
> > 
> > This simply means that the javascript parser is not enabled in your conf
> > (which is the default behaviour) and as a consequence the default parser
> > (Tika) was used to try and parse it but has no resources for doing so.
> > 
> > Note : we should probably add .js to the default url filters. The
> > javascript parser has been deactivated by default because it generates
> > atrocious URLs so we might as well prevent such URLs from being fetched
> > in the first place.
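
To illustrate the suggestion above about filtering .js URLs: Nutch's
regex URL filter reads rules of the form "-regex" (reject) or "+regex"
(accept), and the first rule that matches a URL decides. The sketch
below only emulates that behaviour in Python; it is not Nutch code, and
the rule list and the name accepts are hypothetical, chosen for this
example.

```python
import re

# Hypothetical rules in the style of conf/regex-urlfilter.txt:
# "-" rejects matching URLs, "+" accepts them; the first match wins.
RULES = [
    ("-", re.compile(r"\.(js|JS)$")),  # assumed rule: skip JavaScript files
    ("+", re.compile(r".")),           # accept everything else
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False
```

With such a rule in place, the akslideshow.js URL from the log would be
dropped before fetching, while http://nutch.apache.org/ would still be
accepted.
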
> > 
> > Anyway this is not the source of the problem. You seem to have unparsed
> > segments among the ones specified. Could be that you interrupted a
> > previous crawl or got a problem with it and did not delete these
> > segments or the whole crawl directory. Removing the segments and calling
> > the last couple of steps manually should do the trick.
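
A quick way to spot the unparsed segments mentioned above is to look for
segment directories that have no parse_data subdirectory, since that is
the path the LinkDb step complains about. This is a minimal sketch, not
part of Nutch; the function name unparsed_segments is made up for this
example, and it assumes the segments live on the local filesystem as in
the log.

```python
from pathlib import Path


def unparsed_segments(segments_dir):
    """Return the names of segment directories without a parse_data subdirectory."""
    root = Path(segments_dir)
    return sorted(
        seg.name
        for seg in root.iterdir()
        if seg.is_dir() and not (seg / "parse_data").is_dir()
    )


# Example call (path taken from the log above):
# unparsed_segments("/Users/toom/Downloads/nutch-1.3/sites/segments")
```

Segments reported by this check would then be the candidates to remove
(or to parse) before running the remaining steps by hand.
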
> > 
> >> 2011/7/12 Markus Jelsma <[email protected]>:
> >> > Were there errors during parsing of that last segment?
> >> > 
> >> >> I'm starting with nutch and I ran a simple job as described in the
> >> >> nutch tutorial. After a while I get the following error:
> >> >> 
> >> >> 
> >> >> CrawlDb update: URL filtering: true
> >> >> CrawlDb update: Merging segment data into db.
> >> >> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
> >> >> LinkDb: starting at 2011-07-12 12:32:03
> >> >> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
> >> >> LinkDb: URL normalize: true
> >> >> LinkDb: URL filter: true
> >> >> LinkDb: adding segment:
> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
> >> >> LinkDb: adding segment:
> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
> >> >> LinkDb: adding segment:
> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
> >> >> LinkDb: adding segment:
> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> >> >> LinkDb: adding segment:
> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> >> >> LinkDb: adding segment:
> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> >> >> Exception in thread "main"
> >> >> org.apache.hadoop.mapred.InvalidInputException:
> >> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
> >> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
> >> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
> >> >>       at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> >> >>       at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> >> >>       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> >> >>       at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> >> >>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> >> >>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >> >>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >> >>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> >> >>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
> >> >>       at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
> >> >>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >> >>       at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> > 
> > --
> > Open Source Solutions for Text Engineering
> > 
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
