Hi Lewis,

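You got fooled by the ampersand operator in Unix shells, which sends a command 
to the background. In a line such as "[5] 3086", the integer in brackets is the 
shell's job number and the number after it is the Unix process ID of the 
command you have just backgrounded.

$ a&b&c is not one but three commands, sending a and b to the background. Your 
shell prints "[job number] PID" when a command is backgrounded and 
"[job number] Done" once it has finished. This is also why your URL looks 
truncated: only the part before the first ampersand reached parsechecker.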
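A minimal sketch of the same behaviour, using sleep as a stand-in for your 
commands (the variable names are only for illustration). Note that the 
"[job] PID" lines only appear in an interactive shell; in a script you can 
read the PID of the last backgrounded command from $! instead:

```shell
# Two commands sent to the background with '&', one run in the foreground.
sleep 1 &
first_pid=$!    # PID of the most recently backgrounded command

sleep 1 &
second_pid=$!

echo "foreground command runs without waiting"
echo "background PIDs: $first_pid $second_pid"

# Block until both background commands have finished.
wait "$first_pid" "$second_pid"
```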

Enclose your URL in quotes and you are safe: the shell then passes it to Nutch 
as a single argument, ampersands included.
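As a rough illustration of the quoting, here is a small sketch. show_arg is a 
hypothetical helper I made up for this mail, not part of Nutch; it just prints 
what the shell actually hands to a command:

```shell
# show_arg is a hypothetical helper, not part of Nutch: it prints the
# argument count and the first argument exactly as the shell passed them.
show_arg() {
  printf '%s argument(s): %s\n' "$#" "$1"
}

url='http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning'

# Quoted: the ampersands are literal characters, so one intact argument arrives.
show_arg "$url"

# The same idea applied to the command from your message:
#   ./bin/nutch parsechecker "$url"
```

Single quotes around the literal URL work just as well as the variable.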

Cheers,
Markus

 
 
-----Original message-----
> From:Lewis John Mcgibbney <[email protected]>
> Sent: Fri 22-Jun-2012 00:36
> To: [email protected]
> Subject: Parser choking on irregular url
> 
> Hi,
> 
> Something that turned up on another list [0] was a scenario where
> the following URL [1] was being fetched for processing.
> 
> Having tried fetching and parsing the URL unsuccessfully outside of
> Nutch I decided to try the parsechecker with the following output.
> More comments below the output...
> 
> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$
> ./bin/nutch parsechecker
> http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning
> [5] 3086
> [6] 3087
> [7] 3088
> [4]   Done                    ./bin/nutch parsechecker
> http://en.wikipedia.org/w/api.php?action=query
> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ fetching:
> http://en.wikipedia.org/w/api.php?action=query
> parsing: http://en.wikipedia.org/w/api.php?action=query
> contentType: text/html
> signature: e29908847945e7dc482c2f6d6129a11c
> ---------
> Url
> ---------------
> http://en.wikipedia.org/w/api.php?action=query
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: MediaWiki API Result
> Outlinks: 2
>   outlink: toUrl: https://www.mediawiki.org/wiki/API anchor: complete
> documentation
>   outlink: toUrl: http://en.wikipedia.org/w/api.php anchor: API help
> Content Metadata: Vary=Accept-Encoding,X-Forwarded-Proto Date=Thu, 21
> Jun 2012 22:14:21 GMT Content-Length=427 Content-Encoding=gzip
> Connection=close X-Cache-Lookup=MISS from
> amssq38.esams.wikimedia.org:80 Content-Type=text/html; charset=utf-8
> X-Cache=MISS from amssq38.esams.wikimedia.org Server=Apache
> Cache-Control=private X-Content-Type-Options=nosniff
> _ip=91.198.174.225
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> 
> 1) What do the integers within the []'s represent?
> 2) After encountering the first ampersand the URL seems to be
> truncated. Is this normalization or something else? My urlfilter regex
> is default.
> 3) The parser chokes and doesn't finish its job.
> 
> Any ideas about how these urls should be dealt with, or of course what
> suggestions there may be to prevent the parser from freezing on us?
> 
> Thanks in advance.
> 
> Lewis
> 
> [0] 
> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201206.mbox/%3CCAPeLbhNzuepW90V33TLvZ4n-eWRrHUspACbm3qK34wsTY6xTxQ%40mail.gmail.com%3E
> [1] 
> http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning
> 
> -- 
> Lewis
> 
