On 02/12/2013 15:53, Steven D'Aprano wrote:
On Mon, Dec 02, 2013 at 06:11:04AM -0800, Albert-Jan Roskam wrote:
Hi,

I created the code below because I want to compare two fields while
ignoring the diacritical signs.

Why would you want to do that? That's like comparing two fields while
ignoring the difference between "e" and "i", or "s" and "z", or "c" and
"k". Or indeed between "s", "z", "c" and "k".

*only half joking*


I think the right way to ignore diacritics and other combining marks is
with a function like this:

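import unicodedata

def strip_marks(s):
    # NFD splits each accented character into a base character
    # followed by its combining marks.
    decomposed = unicodedata.normalize('NFD', s)
    # combining() returns 0 for base characters, non-zero for marks.
    base_chars = [c for c in decomposed if not unicodedata.combining(c)]
    return ''.join(base_chars)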


Example:

py> strip_marks("I will coöperate with Müller's résumé mañana.")
"I will cooperate with Muller's resume manana."

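For the original goal of comparing two fields, a minimal sketch might
look like the following. The helper name same_ignoring_marks is made up
for illustration; the snippet repeats strip_marks so that it runs on its
own. Because both sides go through NFD, precomposed and decomposed
spellings of the same text also compare equal.

```python
import unicodedata

def strip_marks(s):
    # Decompose, then drop every combining mark.
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def same_ignoring_marks(a, b):
    # Hypothetical helper: two fields match if they are equal once
    # all combining marks have been stripped from both.
    return strip_marks(a) == strip_marks(b)

print(same_ignoring_marks("résumé", "resume"))
print(same_ignoring_marks("re\u0301sume\u0301", "résumé"))
print(same_ignoring_marks("resume", "resumes"))
```

Note that this comparison is still case-sensitive; it ignores marks and
nothing else.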

Beware: stripping accents may completely change the meaning of the word
in many languages! Even in English, stripping the accents from "résumé"
makes the word ambiguous (do you mean a CV, or the verb to start
something again?). In other languages, stripping accents may completely
change the word, or even turn it into nonsense.

For example, I understand that in Danish, å is not the letter a with a
circle accent on it, but a distinct letter of the alphabet which should
not be touched. And I haven't even considered non-Western European
languages, like Greek, Polish, Russian, Arabic, Hebrew...
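
One possible way to leave such letters alone is sketched below. The
function name strip_marks_except and its keep parameter are
hypothetical, and the default of "åÅ" is only an assumption for
illustration; which letters count as distinct depends on the language
of the data.

```python
import unicodedata

def strip_marks_except(s, keep="åÅ"):
    # keep: letters to pass through untouched (an assumed default).
    keep_set = {unicodedata.normalize('NFC', ch) for ch in keep}
    out = []
    # Compose first, so that "a" + combining ring is seen as one "å".
    for ch in unicodedata.normalize('NFC', s):
        if ch in keep_set:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize('NFD', ch)
            out.append(''.join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return ''.join(out)

print(strip_marks_except("Århus på résumé"))
print(strip_marks_except("Århus på résumé", keep=""))
```

Letters such as ø or ł have no canonical decomposition, so NFD-based
stripping leaves them unchanged either way.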

You've actually shown a perfect example above: the Spanish letter ñ has
become the quite distinct Spanish letter n. And let's not go here:
http://spanish.about.com/b/2010/11/29/two-letters-dropped-from-spanish-alphabet.htm
We should just stick with English, as we all know that's easy, don't we?
http://www.i18nguy.com/chaos.html :)

--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

