Hi Lewis,

You got fooled by the ampersand on Unix shells, which sends a command to the background. The integers in [] are the shell's job numbers, and the number printed after each one is the Unix process ID of the command that was backgrounded.

$ a&b&c is not one but three commands, sending a and b to the background. Your shell prints the [job number] followed by "Done" when a backgrounded command has finished. Enclose your URL in quotes and you are safe.

Cheers,
Markus

-----Original message-----
> From:Lewis John Mcgibbney <[email protected]>
> Sent: Fri 22-Jun-2012 00:36
> To: [email protected]
> Subject: Parser choking on irregular url
>
> Hi,
>
> Something that turned up on another list [0] was a scenario where
> the following URL [1] was being fetched for processing.
>
> Having tried fetching and parsing the URL unsuccessfully outside of
> Nutch I decided to try the parsechecker with the following output.
> More comments below the output...
>
> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$
> ./bin/nutch parsechecker
> http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning
> [5] 3086
> [6] 3087
> [7] 3088
> [4] Done ./bin/nutch parsechecker
> http://en.wikipedia.org/w/api.php?action=query
> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ fetching:
> http://en.wikipedia.org/w/api.php?action=query
> parsing: http://en.wikipedia.org/w/api.php?action=query
> contentType: text/html
> signature: e29908847945e7dc482c2f6d6129a11c
> ---------
> Url
> ---------------
> http://en.wikipedia.org/w/api.php?action=query
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: MediaWiki API Result
> Outlinks: 2
> outlink: toUrl: https://www.mediawiki.org/wiki/API anchor: complete
> documentation
> outlink: toUrl: http://en.wikipedia.org/w/api.php anchor: API help
> Content Metadata: Vary=Accept-Encoding,X-Forwarded-Proto Date=Thu, 21
> Jun 2012 22:14:21 GMT Content-Length=427 Content-Encoding=gzip
> Connection=close X-Cache-Lookup=MISS from
> amssq38.esams.wikimedia.org:80 Content-Type=text/html; charset=utf-8
> X-Cache=MISS from amssq38.esams.wikimedia.org Server=Apache
> Cache-Control=private X-Content-Type-Options=nosniff
> _ip=91.198.174.225
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>
> 1) What do the integers within the []'s represent?
> 2) After encountering the first ampersand the URL seems to be
> truncated. Is this normalization or something else? My urlfilter regex
> is default.
> 3) The parser chokes and doesn't finish its job.
>
> Any ideas about how these urls should be dealt with, or of course what
> suggestions there may be to prevent the parser from freezing on us?
>
> Thanks in advance.
>
> Lewis
>
> [0]
> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201206.mbox/%3CCAPeLbhNzuepW90V33TLvZ4n-eWRrHUspACbm3qK34wsTY6xTxQ%40mail.gmail.com%3E
> [1]
> http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning
>
> --
> Lewis
>
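
To illustrate the quoting advice in the reply, here is a minimal sketch that should run in any POSIX shell. The count_args helper and the variable names are made up for this example and are not part of Nutch, and printf stands in for ./bin/nutch parsechecker so that nothing is actually fetched.

```shell
# Hypothetical helper (not part of Nutch): report how many arguments arrive.
count_args() { echo "$#"; }

url='http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning'

# Quoted: the whole URL, ampersands included, is passed as a single argument.
quoted_count=$(count_args "$url")
echo "arguments when quoted: $quoted_count"

# Unquoted: a child shell parses the raw text, so each & ends a command and
# backgrounds it. printf (standing in for ./bin/nutch parsechecker) only
# receives the text before the first &.
seen_unquoted=$(sh -c "printf '%s\n' $url")
echo "unquoted command saw: $seen_unquoted"

# Quoted inside the child shell as well: the full URL survives.
seen_quoted=$(sh -c "printf '%s\n' '$url'")
echo "quoted command saw:   $seen_quoted"
```

On an interactive command line the same idea amounts to wrapping the URL in single quotes after ./bin/nutch parsechecker, which keeps the shell from interpreting both the & and the ? characters.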

