Excellent

On Fri, Jun 22, 2012 at 12:17 AM, Markus Jelsma <[email protected]> wrote:
> Hi Lewis,
>
> You got fooled by the ampersand operator in Unix shells, which sends a command
> to the background. The integers in [] are the shell's job numbers, and the
> numbers after them are the Unix process IDs of the commands you have given.
>
> $ a&b&c is not one but three commands, sending a and b to the background.
> Your shell will print the [job number] followed by "Done" when a backgrounded
> command has finished.
>
> Enclose your URL in quotes and you are safe.
>
> Cheers,
> Markus
>
>
>
> -----Original message-----
>> From:Lewis John Mcgibbney <[email protected]>
>> Sent: Fri 22-Jun-2012 00:36
>> To: [email protected]
>> Subject: Parser choking on irregular url
>>
>> Hi,
>>
>> Something that turned up on another list [0] was a scenario where
>> the following URL [1] was being fetched for processing.
>>
>> Having tried fetching and parsing the URL unsuccessfully outside of
>> Nutch, I decided to try the parsechecker, with the following output.
>> More comments below the output...
>>
>> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$
>> ./bin/nutch parsechecker
>> http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning
>> [5] 3086
>> [6] 3087
>> [7] 3088
>> [4] Done ./bin/nutch parsechecker
>> http://en.wikipedia.org/w/api.php?action=query
>> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ fetching:
>> http://en.wikipedia.org/w/api.php?action=query
>> parsing: http://en.wikipedia.org/w/api.php?action=query
>> contentType: text/html
>> signature: e29908847945e7dc482c2f6d6129a11c
>> ---------
>> Url
>> ---------------
>> http://en.wikipedia.org/w/api.php?action=query
>> ---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title: MediaWiki API Result
>> Outlinks: 2
>> outlink: toUrl: https://www.mediawiki.org/wiki/API anchor: complete
>> documentation
>> outlink: toUrl: http://en.wikipedia.org/w/api.php anchor: API help
>> Content Metadata: Vary=Accept-Encoding,X-Forwarded-Proto Date=Thu, 21
>> Jun 2012 22:14:21 GMT Content-Length=427 Content-Encoding=gzip
>> Connection=close X-Cache-Lookup=MISS from
>> amssq38.esams.wikimedia.org:80 Content-Type=text/html; charset=utf-8
>> X-Cache=MISS from amssq38.esams.wikimedia.org Server=Apache
>> Cache-Control=private X-Content-Type-Options=nosniff
>> _ip=91.198.174.225
>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>>
>> 1) What do the integers within the []'s represent?
>> 2) After encountering the first ampersand the URL seems to be
>> truncated. Is this normalization or something else? My urlfilter regex
>> is default.
>> 3) The parser chokes and doesn't finish its job.
>>
>> Any ideas about how these URLs should be dealt with, or of course what
>> suggestions there may be to prevent the parser from freezing on us?
>>
>> Thanks in advance.
>>
>> Lewis
>>
>> [0]
>> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201206.mbox/%3CCAPeLbhNzuepW90V33TLvZ4n-eWRrHUspACbm3qK34wsTY6xTxQ%40mail.gmail.com%3E
>> [1]
>> http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning
>>
>> --
>> Lewis
>>
--
Lewis
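
For anyone finding this thread later, here is a minimal sketch of the quoting advice, assuming a POSIX shell. show_arg is a hypothetical stand-in for ./bin/nutch parsechecker (it is not part of Nutch) and only reports the first argument the shell hands to it, so the effect of the ampersands can be seen without fetching anything.

```shell
#!/bin/sh
# show_arg is a hypothetical stand-in for `./bin/nutch parsechecker`:
# it only reports the first argument the shell passes to it.
show_arg() {
    printf 'argument: %s\n' "$1"
}

# Unquoted: each & ends a command and sends it to the background, so
# show_arg only sees the URL up to the first ampersand. The remaining
# pieces (list=search, srwhat=text, srsearch=meaning) are run by the
# shell as separate commands, here plain variable assignments.
unquoted=$(show_arg http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning)

# Quoted: single quotes keep & (and ?) literal, so the whole URL
# arrives as one argument.
quoted=$(show_arg 'http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning')

printf 'unquoted -> %s\n' "$unquoted"
printf 'quoted   -> %s\n' "$quoted"
```

The unquoted call should report a URL ending at action=query, which matches the truncated URL in the parsechecker output above, while the quoted call should report the full URL. Double quotes would work here as well, since the URL contains no $ or backtick characters.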

