Re: Bug report : Spell checking doesn't know about HTML entities

2007-03-23 Thread A.J.Mechelynck

Bram Moolenaar wrote:

Tony Mechelynck wrote:

In languages using accented letters, the Vim spell checker doesn't recognise 
HTML entities (in HTML text): for example, the letters outside of the &...; 
entities are highlighted as spellBad (after :set spell spelllang=fr) in 
the following French words:


o&ugrave;                         meaning: où (where)
apr&egrave;s                      après  (after)
c&eacute;r&eacute;monie           cérémonie  (ceremony)
courrou&ccedil;a                  courrouça  ([he] angered)
d&eacute;sesp&eacute;r&eacute;    désespéré  (desperate)
n&eacute;cessaire                 nécessaire  (necessary)
ann&eacute;e                      année  (year)

etc.

They are perfectly valid French words, if one takes into account the following 
equivalences:


&ugrave; = ù
&egrave; = è
&eacute; = é
&ccedil; = ç
etc.

I don't know how to solve the problem; maybe an interpretation layer to 
resolve the entities between the HTML text and the French (or other 
non-English language) dictionary?
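
[Editor's illustration] The "interpretation layer" idea can be sketched outside Vim. This is a minimal sketch and not how Vim's spell checker works: it assumes Python's standard `html.unescape` as the entity resolver, and the helper names `decode_entities` and `words_for_spellcheck` are hypothetical.

```python
import html
import re


def decode_entities(text: str) -> str:
    """Replace HTML character references with the characters they stand for."""
    return html.unescape(text)


def words_for_spellcheck(text: str) -> list[str]:
    """Return the words a dictionary lookup would see after decoding.

    Assumption: a "word" is a run of Unicode letters; apostrophes and
    hyphens are ignored to keep the sketch short.
    """
    decoded = decode_entities(text)
    return re.findall(r"[^\W\d_]+", decoded)


sample = "c&eacute;r&eacute;monie et ann&eacute;e"
words = words_for_spellcheck(sample)
```

With the entities resolved first, the French dictionary would be asked about "cérémonie" and "année" rather than about the fragments between the entities.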


Well, words with HTML things in them are NOT French words.  Why don't
you use utf-8 encoded HTML?


I started that particular site some years ago, in 7-bit ASCII plus entities. 
I'm loath to change it now, and risk making it incompatible with some older 
browsers. It already holds quite a bit of text.


I disagree with the statement that these words are not French words. In an 
HTML file, where HTML syntax must be taken into account, they are.




If you really want to recognize these words, you could take the French
dictionary, do a global replace and build a spell file from that.
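
[Editor's illustration] The "global replace" step might look like the sketch below. It assumes the dictionary is available as a plain word list, one word per line, and uses Python's standard `html.entities.codepoint2name` table; the helper name `to_entities` is hypothetical. The converted list would still have to be compiled with Vim's :mkspell, and whether & and ; are accepted inside words is untested here.

```python
from html.entities import codepoint2name


def to_entities(word: str) -> str:
    """Rewrite each non-ASCII character as a named HTML entity when one exists."""
    out = []
    for ch in word:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name is not None:
            out.append(f"&{name};")
        else:
            # ASCII letters, and characters with no named entity, stay as-is.
            out.append(ch)
    return "".join(out)


word_list = ["où", "après", "cérémonie", "courrouça"]
converted = [to_entities(w) for w in word_list]
```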


Actually, I don't use spell (I am blessed with a good sense of orthography); 
but I wondered if there couldn't (someday) be a solution for people who don't 
share the same blessing.


The proposed solution would mean creating an additional spell file, slightly 
larger than the French dictionary, for use only with HTML text. I'm not 
convinced of such a solution's viability, especially since it would have to be 
repeated for German, Swedish, Turkish, Polish, etc., etc., etc. Maybe even for 
words like risqué and garçon in English.




You'll have to check if using & and ; in the middle of a word is causing
trouble.  Adding them to word characters will probably create different
problems.



The semicolon can also mean a semicolon, which is a punctuation mark and not a 
word character, and can be used as such after a word with no intervening space 
(or with &nbsp; preceding it, depending on typesetting conventions). The case 
of the ampersand is simpler: to obtain a true ampersand in the rendered text, 
one must use one of &amp; (symbolic entity), &#38; (decimal entity) or &#x26; 
(hex entity) in the HTML.
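
[Editor's illustration] A quick check that the named, decimal and hexadecimal references all denote the same character, using Python's standard `html.unescape` (an assumption of this sketch, not something used in the thread):

```python
import html

# Named, decimal and hexadecimal references to the ampersand.
forms = ["&amp;", "&#38;", "&#x26;"]
decoded = [html.unescape(form) for form in forms]

# All three forms should resolve to the same single character.
all_same = len(set(decoded)) == 1
```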



Best regards,
Tony.


Re: Bug report : Spell checking doesn't know about HTML entities

2007-03-23 Thread A.J.Mechelynck

François Pinard wrote:

[Bram Moolenaar]


Tony Mechelynck wrote:


In languages using accented letters, the Vim spell checker doesn't 
recognise HTML entities (in HTML text) [...]


You'll have to check if using & and ; in the middle of a word is 
causing trouble.  Adding them to word characters will probably create 
different problems.


Character entities come from the old time people were still trying to 
salvage the 8th bit of each byte, on communication channels, to convey 
byte parity.  And also, whatever justification people may invent, to 
protect their laziness about using tools able to do more than ASCII.


They also bypass compatibility problems for users who have to upload HTML 
pages to servers where they don't control the headers which will be sent with 
the HTML. (Yes, now I know about the BOM and the META 
HTTP-EQUIV=Content-Type tag, but the former isn't mentioned, and the latter 
is only mentioned but not explained, in the books I have about HTML.)


Even now, email channels aren't guaranteed to be able to convey 8-bit text 
other than by downgrading it to 7-bit by means of conversion schemes like 
quoted-printable or base64: some servers are 8-bit-compliant, others still 
aren't. In the email I get, I sometimes notice that the body has been 
autoconverted between 8-bit, quoted-printable and base64 by my ISP's 
routers, with no obviously apparent rule to such behaviour.
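
[Editor's illustration] The two 7-bit conversion schemes mentioned above can be tried with Python's standard `quopri` and `base64` modules. This is a minimal sketch; the sample text and its Latin-1 encoding are assumptions chosen for the example, not taken from any actual message.

```python
import base64
import quopri

# An 8-bit payload: "café" encoded as Latin-1 contains one byte above 127.
payload = "café".encode("latin-1")

# Both schemes re-express the payload using 7-bit ASCII bytes only.
as_quoted_printable = quopri.encodestring(payload)
as_base64 = base64.b64encode(payload)

# Decoding either form recovers the original 8-bit bytes.
from_quoted_printable = quopri.decodestring(as_quoted_printable)
from_base64 = base64.b64decode(as_base64)
```

Because both conversions are reversible, a relay can switch between them without changing the text, which is consistent with the autoconversion described above.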




One property of character entities which is apparently not so well known 
(or maybe that property was withdrawn since then) is that the semicolon 
is optional.  It is only mandatory where ambiguity would otherwise arise 
(for example, when a letter follows, a fairly common case after all).


That property is not part of the present rules; it is obsolete and deprecated: 
ce n'est pas la règle, c'est une tolérance (it is not the rule, it is a 
tolerance). It is only recognised for downward compatibility; IIUC, it does 
not apply to XHTML. The semicolon has of course always been mandatory when the 
entity is immediately followed by a letter or semicolon (or by a digit, but 
that is rarer).




I presume that if software (or people) generating HTML were sparing 
those semicolons wherever they may be spared, a lot of other software 
would break, we would get a riot against people following standards :-).




I suppose that's why the most recent standards require the semicolons.


Best regards,
Tony.
--
Everything is worth precisely as much as a belch, the difference being
that a belch is more satisfying.
-- Ingmar Bergman