Hi,

Something that turned up on another list [0] was a scenario where
the following URL [1] was being fetched for processing.

Having tried unsuccessfully to fetch and parse the URL outside of
Nutch, I decided to try the parsechecker, which gave the following
output. More comments below the output.

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$
./bin/nutch parsechecker
http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning
[5] 3086
[6] 3087
[7] 3088
[4]   Done                    ./bin/nutch parsechecker
http://en.wikipedia.org/w/api.php?action=query
lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ fetching:
http://en.wikipedia.org/w/api.php?action=query
parsing: http://en.wikipedia.org/w/api.php?action=query
contentType: text/html
signature: e29908847945e7dc482c2f6d6129a11c
---------
Url
---------------
http://en.wikipedia.org/w/api.php?action=query
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: MediaWiki API Result
Outlinks: 2
  outlink: toUrl: https://www.mediawiki.org/wiki/API anchor: complete
documentation
  outlink: toUrl: http://en.wikipedia.org/w/api.php anchor: API help
Content Metadata: Vary=Accept-Encoding,X-Forwarded-Proto Date=Thu, 21
Jun 2012 22:14:21 GMT Content-Length=427 Content-Encoding=gzip
Connection=close X-Cache-Lookup=MISS from
amssq38.esams.wikimedia.org:80 Content-Type=text/html; charset=utf-8
X-Cache=MISS from amssq38.esams.wikimedia.org Server=Apache
Cache-Control=private X-Content-Type-Options=nosniff
_ip=91.198.174.225
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

1) What do the integers within the []'s represent?
2) After encountering the first ampersand the URL seems to be
truncated. Is this normalization or something else? My urlfilter regex
is default.
3) The parser chokes and doesn't finish its job.

Any ideas about how these URLs should be dealt with, or suggestions
for preventing the parser from freezing on us?
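
One thing I have not ruled out is the shell itself rather than Nutch. In
bash an unquoted & ends the command and runs it in the background, and
the shell then prints a job number in brackets followed by a process ID.
That would be consistent with both the bracketed integers and the URL
stopping at the first ampersand. A minimal sketch of passing the URL
quoted, assuming bash or another POSIX shell; show_arg is a hypothetical
stand-in used only to show what a program would receive as its argument:

```shell
# Hypothetical stand-in for the real command: prints the first argument it is given.
show_arg() { printf '%s\n' "$1"; }

# Single quotes keep every & literal, so the shell does not split the line.
url='http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning'

# Double quotes around the variable pass the whole URL as one argument.
received=$(show_arg "$url")
printf 'argument received: %s\n' "$received"

# The equivalent parsechecker call would then be (not run here):
#   ./bin/nutch parsechecker "$url"
```

If the quoted form still truncates the URL or hangs, then the problem
lies somewhere other than the shell.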

Thanks in advance.

Lewis

[0] 
http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201206.mbox/%3CCAPeLbhNzuepW90V33TLvZ4n-eWRrHUspACbm3qK34wsTY6xTxQ%40mail.gmail.com%3E
[1] 
http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning

-- 
Lewis
