[Haskell-cafe] bug in Prelude.words?

2011-03-28 Thread malcolm.wallace
Does anyone else think it odd that Prelude.words will break a string at a
non-breaking space?

Prelude> words "abc def\xA0ghi"
["abc","def","ghi"]

I would have expected this to be the obvious behaviour:

Prelude> words "abc def\xA0ghi"
["abc","def\160ghi"]

Regards,
Malcolm
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] bug in Prelude.words?

2011-03-28 Thread Colin Adams
It doesn't seem odd to me.

Consider an HTML page with that "sentence" displayed on it. If you ask the
viewer of the page how many words are in the sentence, then surely you will
get the answer 3?

On 28 March 2011 16:55, malcolm.wallace malcolm.wall...@me.com wrote:

> Does anyone else think it odd that Prelude.words will break a string at a
> non-breaking space?
>
> Prelude> words "abc def\xA0ghi"
> ["abc","def","ghi"]
>
> I would have expected this to be the obvious behaviour:
>
> Prelude> words "abc def\xA0ghi"
> ["abc","def\160ghi"]
>
> Regards,
> Malcolm


-- 
Colin Adams
Preston, Lancashire, ENGLAND
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments


Re: [Haskell-cafe] bug in Prelude.words?

2011-03-28 Thread Christopher Done
On 28 March 2011 17:55, malcolm.wallace malcolm.wall...@me.com wrote:

> Does anyone else think it odd that Prelude.words will break a string at a
> non-breaking space?
>
> Prelude> words "abc def\xA0ghi"
> ["abc","def","ghi"]


I think it's predictable, isSpace (which words is based on) is based on
generalCategory, which returns the proper Unicode category:

λ generalCategory '\xa0'
Space

So:

-- | Selects white-space characters in the Latin-1 range.
-- (In Unicode terms, this includes spaces and some control characters.)
isSpace :: Char -> Bool
-- isSpace includes non-breaking space
-- Done with explicit equalities both for efficiency, and to avoid a tiresome
-- recursion with GHC.List elem
isSpace c =  c == ' '     ||
             c == '\t'    ||
             c == '\n'    ||
             c == '\r'    ||
             c == '\f'    ||
             c == '\v'    ||
             c == '\xa0'  ||
             iswspace (fromIntegral (ord c)) /= 0
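
For comparison, here is a minimal sketch of the behaviour Malcolm expected: a words-like function that still splits on ordinary white space but not on non-breaking spaces. The names wordsWith, isNonBreaking, isBreakingSpace and wordsNB are hypothetical helpers defined here, not part of the Prelude or base, and the choice of U+00A0, U+2007 and U+202F as "the" non-breaking spaces is an assumption.

```haskell
import Data.Char (isSpace)

-- | Like 'words', but splits on a caller-supplied separator predicate.
-- Hypothetical helper, defined locally for this sketch.
wordsWith :: (Char -> Bool) -> String -> [String]
wordsWith isSep s =
  case dropWhile isSep s of
    "" -> []
    s' -> let (w, rest) = break isSep s'
          in  w : wordsWith isSep rest

-- | Assumption: these three code points are the non-breaking spaces
-- (no-break space, figure space, narrow no-break space).
isNonBreaking :: Char -> Bool
isNonBreaking c = c `elem` ['\x00A0', '\x2007', '\x202F']

-- | White space according to 'isSpace', minus the non-breaking spaces.
isBreakingSpace :: Char -> Bool
isBreakingSpace c = isSpace c && not (isNonBreaking c)

-- | A 'words' variant that keeps text joined by a non-breaking space together.
wordsNB :: String -> [String]
wordsNB = wordsWith isBreakingSpace
```

With these definitions, wordsNB "abc def\xA0ghi" keeps "def\xA0ghi" as a single element, while Prelude.words splits it into three.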


Re: [Haskell-cafe] bug in Prelude.words?

2011-03-28 Thread malcolm.wallace
> Consider an HTML page with that "sentence" displayed on it. If you ask the
> viewer of the page how many words are in the sentence, then surely you will
> get the answer 3?

But what about the author? Surely there is no reason to use a non-breaking
space unless they intend it to mean that the characters before and after it
belong to the same logical unit-of-comprehension?

Regards,
Malcolm


Re: [Haskell-cafe] bug in Prelude.words?

2011-03-28 Thread James Cook

On Mar 28, 2011, at 12:05 PM, Christopher Done wrote:

> On 28 March 2011 17:55, malcolm.wallace malcolm.wall...@me.com wrote:
> > Does anyone else think it odd that Prelude.words will break a string
> > at a non-breaking space?
> >
> > Prelude> words "abc def\xA0ghi"
> > ["abc","def","ghi"]
>
> I think it's predictable, isSpace (which words is based on) is based
> on generalCategory, which returns the proper Unicode category:
>
> λ generalCategory '\xa0'
> Space


I agree, and I also agree that it would make sense the other way (not
breaking on non-breaking spaces).  Perhaps it would be a good idea to
add a remark to the documentation which specifies the treatment of
non-breaking spaces.


-- James


Re: [Haskell-cafe] bug in Prelude.words?

2011-03-28 Thread malcolm.wallace
> > I think it's predictable, isSpace (which words is based on) is based on
> > generalCategory, which returns the proper Unicode category:
> >
> > λ generalCategory '\xa0'
> > Space
>
> I agree, and I also agree that it would make sense the other way (not
> breaking on non-breaking spaces). Perhaps it would be a good idea to add a
> remark to the documentation which specifies the treatment of non-breaking
> spaces.

I note that Java has two distinct properties concerning whitespace:

Character.isSpaceChar('\xA0') == True
Character.isWhitespace('\xA0') == False

Contrast with

-- \x20 is ASCII space
Character.isSpaceChar('\x20') == True
Character.isWhitespace('\x20') == True

-- \x2060 is the word-joiner (zero-width non-breaking space)
Character.isSpaceChar('\x2060') == False
Character.isWhitespace('\x2060') == False

-- \x202F is the narrow non-breaking space
Character.isSpaceChar('\x202F') == True
Character.isWhitespace('\x202F') == False

-- \x2009 is the thin space
Character.isSpaceChar('\x2009') == True
Character.isWhitespace('\x2009') == True
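
The two Java properties above could be approximated in Haskell roughly as follows. This is only a sketch: isSpaceChar and isWhitespace are hypothetical names defined here (base has no such functions), and the definitions follow my reading of the Java documentation for Character.isSpaceChar and Character.isWhitespace rather than any tested equivalence. Results for unusual code points may also depend on the Unicode tables shipped with the compiler.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Approximation of Java's Character.isSpaceChar: the character's
-- general category is a space, line, or paragraph separator.
isSpaceChar :: Char -> Bool
isSpaceChar c =
  generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]

-- | Approximation of Java's Character.isWhitespace: a Unicode separator
-- that is not one of the non-breaking spaces, or one of a fixed set of
-- ASCII control characters.
isWhitespace :: Char -> Bool
isWhitespace c =
     (isSpaceChar c && c `notElem` ['\x00A0', '\x2007', '\x202F'])
  || c `elem` "\t\n\v\f\r"
  || (c >= '\x1C' && c <= '\x1F')
```

Under these definitions the five characters listed above give the same pairs of answers as in the Java table, so words built on isWhitespace instead of isSpace would not split at '\xA0'.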


Re: [Haskell-cafe] bug in Prelude.words?

2011-03-28 Thread Nick Bowler
On 2011-03-28 16:20 +, malcolm.wallace wrote:
> But what about the author?  Surely there is no reason to use a
> non-breaking space unless they intend it to mean that the characters
> before and after it belong to the same logical unit-of-comprehension?

The "non-breaking" part of "non-breaking space" refers to breaking text
into lines.  In other words, if two words are separated by a
non-breaking space, then a line break will not be put between those
words.  A non-breaking space does *not* make two words into one word.
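
The distinction can be sketched in code: a simple greedy line wrapper that never breaks at a non-breaking space, sitting alongside Prelude.words, which still counts three words. The names isBreakOpportunity, chunks and wrap are hypothetical, only U+00A0 is treated as non-breaking (an assumption), and real line breaking (for example the Unicode line breaking algorithm) is considerably more involved than this.

```haskell
import Data.Char (isSpace)

-- | Assumption: every white-space character except U+00A0 allows a line break.
isBreakOpportunity :: Char -> Bool
isBreakOpportunity c = isSpace c && c /= '\xA0'

-- | Split text into chunks that must stay on one line. A non-breaking
-- space stays inside its chunk.
chunks :: String -> [String]
chunks s =
  case dropWhile isBreakOpportunity s of
    "" -> []
    s' -> let (c, rest) = break isBreakOpportunity s'
          in  c : chunks rest

-- | Greedy wrapping: fill each line with as many chunks as fit in the
-- given width. A chunk longer than the width gets a line of its own.
wrap :: Int -> String -> [String]
wrap width = go [] . chunks
  where
    -- The first argument holds the current line's chunks in reverse order.
    go [] []       = []
    go line []     = [unwords (reverse line)]
    go [] (c : cs) = go [c] cs
    go line (c : cs)
      | length (unwords (reverse (c : line))) <= width = go (c : line) cs
      | otherwise = unwords (reverse line) : go [c] cs
```

Here wrap 7 "abc def ghi" may break between "def" and "ghi", whereas wrap 7 "abc def\xA0ghi" keeps "def\xA0ghi" on one line, even though words reports three words for both strings.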

-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)



Re: [Haskell-cafe] bug in Prelude.words?

2011-03-28 Thread Thomas Davie

On 28 Mar 2011, at 17:20, malcolm.wallace wrote:

> > Consider an HTML page with that "sentence" displayed on it. If you ask the
> > viewer of the page how many words are in the sentence, then surely you will
> > get the answer 3?
>
> But what about the author?  Surely there is no reason to use a non-breaking
> space unless they intend it to mean that the characters before and after it
> belong to the same logical unit-of-comprehension?

I'm not sure that a logical unit-of-comprehension is the same as a word, though.
As an aside – in publishing, non-breaking spaces are commonly used for other
purposes too, for example forcing a word onto a certain line to stop a space
river appearing in a paragraph.

Bob

