Justin Mason wrote:
John GALLET writes:
Well, thanks for writing it. I think its main weak point for French and
other accented languages is handling the different encodings of the same
accented character, some kind of "synonyms" list. The same letter, say "a
with an accent", can be misspelled as a plain "a", encoded in various
charsets (Latin, UTF-8) as a "normal" à, or HTML-encoded as agrave (I left
the & and ; out). I do not know if it is possible at all; it might
complicate things *a lot*.
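
To make the "synonyms" idea concrete, here is a minimal Python sketch of one way to fold those encodings together before any pattern matching. It is not part of the tool being discussed; the helper names canonical() and strip_accents() are made up for illustration, and it assumes the charset of each byte string is already known.

```python
import html
import unicodedata

def canonical(raw, charset="utf-8"):
    """Hypothetical helper: decode bytes, resolve HTML entities, compose to NFC."""
    text = raw.decode(charset) if isinstance(raw, bytes) else raw
    return unicodedata.normalize("NFC", html.unescape(text))

def strip_accents(text):
    """Hypothetical helper: drop combining marks so an accented letter matches its plain form."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Four spellings of the same word: Latin-1 bytes, UTF-8 bytes,
# HTML entities, and a base letter followed by a combining accent.
variants = [
    canonical(b"d\xe9j\xe0", "latin-1"),
    canonical("d\u00e9j\u00e0".encode("utf-8")),
    canonical("d&eacute;j&agrave;"),
    canonical("de\u0301ja\u0300"),
]
```

After canonical() all four variants compare equal, and strip_accents() additionally folds them onto the unaccented misspelling. Whether to normalise the corpus this way or to encode the variants in the rule itself is a design choice; this only shows that the mapping is mechanical.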
The tool can take care of this -- it will replace mutating single characters
with a /./. It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
"any" patterns.
If the number of permutations is small (as would be the case for
accented letters and the equivalent unaccented ones, or for that matter
obfuscation with lookalike characters), wouldn't it be better for it to
replace the character with a [] list of those permutations (i.e. replace
something that mutates between e and é with [eé], or replace obfuscation
of i with l and 1 by [il1])?
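
As a rough illustration of what I mean, here is a Python sketch, not the tool's actual code. The function name merge_variants() and the max_class_size threshold are assumptions of mine, and it only handles variants of equal length. It emits a [] list when few characters mutate at a position and falls back to . otherwise.

```python
import re

def merge_variants(variants, max_class_size=4):
    """Hypothetical sketch: merge equal-length strings into one regex.

    A position where the strings differ becomes a character class if it
    has at most max_class_size distinct characters, otherwise a bare '.'.
    """
    if len({len(v) for v in variants}) != 1:
        raise ValueError("variants must all have the same length")
    parts = []
    for column in zip(*variants):
        seen = sorted(set(column))
        if len(seen) == 1:
            # Position does not mutate: keep the literal character.
            parts.append(re.escape(seen[0]))
        elif len(seen) <= max_class_size:
            # Few alternatives: list them explicitly.
            parts.append("[" + "".join(re.escape(c) for c in seen) + "]")
        else:
            # Too many alternatives: fall back to "any character".
            parts.append(".")
    return "".join(parts)

accent_rule = merge_variants(["gratuit\u00e9", "gratuite"])
lookalike_rule = merge_variants(["viagra", "v1agra", "vlagra"])
```

The point of the class over the wildcard is precision: [1il] still catches the three obfuscations seen in the corpus but no longer matches an arbitrary character in that position.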
John.
--
-- Over 3000 webcams from ski resorts around the world - www.snoweye.com
-- Translate your technical documents and web pages - www.tradoc.fr