Excellent
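
For anyone finding this thread later, here is a minimal sketch of the quoting fix described below. It assumes a POSIX sh; show_arg is a hypothetical stand-in for ./bin/nutch parsechecker that only echoes its first argument, so nothing here is Nutch-specific.

```shell
#!/bin/sh
# Hypothetical stand-in for "./bin/nutch parsechecker": prints the first
# argument it receives, so we can see what the shell actually passed.
show_arg() {
    printf '%s\n' "$1"
}

url='http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning'

# Quoted: the whole URL reaches the command as a single argument.
quoted=$(show_arg "$url")

# Unquoted: each '&' ends a command and backgrounds it, so show_arg only
# sees the text up to the first '&'. The remaining pieces (list=search,
# srwhat=text, srsearch=meaning) are parsed as separate variable assignments.
unquoted=$(show_arg http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning; wait)

echo "quoted:   $quoted"
echo "unquoted: $unquoted"
```

With the real command, the same idea would be to wrap the URL in single quotes on the parsechecker command line.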

On Fri, Jun 22, 2012 at 12:17 AM, Markus Jelsma
<[email protected]> wrote:
> Hi Lewis,
>
> You got fooled by the ampersand, which in a Unix shell sends a command
> to the background. The integers in [] are the shell's job numbers, and the
> numbers after them are the Unix process IDs of the commands you have given.
>
> $ a&b&c is not one but three commands, sending a and b to the background.
> Your shell prints "[job number] Done" once a backgrounded command has finished.
>
> Encapsulate your URL with quotes and you are safe.
>
> Cheers,
> Markus
>
>
>
> -----Original message-----
>> From:Lewis John Mcgibbney <[email protected]>
>> Sent: Fri 22-Jun-2012 00:36
>> To: [email protected]
>> Subject: Parser choking on irregular url
>>
>> Hi,
>>
>> Something that turned up on another list [0] was a scenario where
>> the following URL [1] was being fetched for processing.
>>
>> Having tried fetching and parsing the URL unsuccessfully outside of
>> Nutch I decided to try the parsechecker with the following output.
>> More comments below the output...
>>
>> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$
>> ./bin/nutch parsechecker
>> http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning
>> [5] 3086
>> [6] 3087
>> [7] 3088
>> [4]   Done                    ./bin/nutch parsechecker
>> http://en.wikipedia.org/w/api.php?action=query
>> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ fetching:
>> http://en.wikipedia.org/w/api.php?action=query
>> parsing: http://en.wikipedia.org/w/api.php?action=query
>> contentType: text/html
>> signature: e29908847945e7dc482c2f6d6129a11c
>> ---------
>> Url
>> ---------------
>> http://en.wikipedia.org/w/api.php?action=query
>> ---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title: MediaWiki API Result
>> Outlinks: 2
>>   outlink: toUrl: https://www.mediawiki.org/wiki/API anchor: complete
>> documentation
>>   outlink: toUrl: http://en.wikipedia.org/w/api.php anchor: API help
>> Content Metadata: Vary=Accept-Encoding,X-Forwarded-Proto Date=Thu, 21
>> Jun 2012 22:14:21 GMT Content-Length=427 Content-Encoding=gzip
>> Connection=close X-Cache-Lookup=MISS from
>> amssq38.esams.wikimedia.org:80 Content-Type=text/html; charset=utf-8
>> X-Cache=MISS from amssq38.esams.wikimedia.org Server=Apache
>> Cache-Control=private X-Content-Type-Options=nosniff
>> _ip=91.198.174.225
>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>>
>> 1) What do the integers within the []'s represent?
>> 2) After encountering the first ampersand the URL seems to be
>> truncated. Is this normalization or something else? My urlfilter regex
>> is default.
>> 3) The parser chokes and doesn't finish its job.
>>
>> Any ideas about how these urls should be dealt with, or of course what
>> suggestions there may be to prevent the parser from freezing on us?
>>
>> Thanks in advance.
>>
>> Lewis
>>
>> [0] 
>> http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201206.mbox/%3CCAPeLbhNzuepW90V33TLvZ4n-eWRrHUspACbm3qK34wsTY6xTxQ%40mail.gmail.com%3E
>> [1] 
>> http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning
>>
>> --
>> Lewis
>>



-- 
Lewis
