Hi,

Something that turned up on another list [0] was a scenario where the following URL [1] was being fetched for processing.
Having tried fetching and parsing the URL unsuccessfully outside of Nutch, I decided to try the parsechecker, with the following output. More comments below the output...

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ ./bin/nutch parsechecker http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning
[5] 3086
[6] 3087
[7] 3088
[4] Done ./bin/nutch parsechecker http://en.wikipedia.org/w/api.php?action=query
lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ fetching: http://en.wikipedia.org/w/api.php?action=query
parsing: http://en.wikipedia.org/w/api.php?action=query
contentType: text/html
signature: e29908847945e7dc482c2f6d6129a11c
--------- Url ---------------
http://en.wikipedia.org/w/api.php?action=query
--------- ParseData ---------
Version: 5
Status: success(1,0)
Title: MediaWiki API Result
Outlinks: 2
  outlink: toUrl: https://www.mediawiki.org/wiki/API anchor: complete documentation
  outlink: toUrl: http://en.wikipedia.org/w/api.php anchor: API help
Content Metadata: Vary=Accept-Encoding,X-Forwarded-Proto Date=Thu, 21 Jun 2012 22:14:21 GMT Content-Length=427 Content-Encoding=gzip Connection=close X-Cache-Lookup=MISS from amssq38.esams.wikimedia.org:80 Content-Type=text/html; charset=utf-8 X-Cache=MISS from amssq38.esams.wikimedia.org Server=Apache Cache-Control=private X-Content-Type-Options=nosniff _ip=91.198.174.225
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

1) What do the integers within the []'s represent?
2) After encountering the first ampersand, the URL seems to be truncated. Is this normalization or something else? My urlfilter regex is the default.
3) The parser chokes and doesn't finish its job. Any ideas about how these URLs should be dealt with, or any suggestions for preventing the parser from freezing on us?

Thanks in advance.
Lewis

[0] http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201206.mbox/%3CCAPeLbhNzuepW90V33TLvZ4n-eWRrHUspACbm3qK34wsTY6xTxQ%40mail.gmail.com%3E
[1] http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning

--
Lewis

