On Mon, 12/2/13, Steven D'Aprano <[email protected]> wrote:
Subject: Re: [Tutor] ignoring diacritical signs
To: [email protected]
Date: Monday, December 2, 2013, 4:53 PM
On Mon, Dec 02, 2013 at 06:11:04AM -0800, Albert-Jan Roskam wrote:
> Hi,
>
> I created the code below because I want to compare two fields while
> ignoring the diacritical signs.
Why would you want to do that? That's like comparing two fields while
ignoring the difference between "e" and "i", or "s" and "z", or "c" and
"k". Or indeed between "s", "z", "c" and "k".
*only half joking*
====> ;-) Unaccented characters that really should be accented are a fact of
life. We often need to merge datasets and if one of them comes from a system
that dates back to the Pleistocene... well...
I think the right way to ignore diacritics and other combining marks is
with a function like this:
import unicodedata

def strip_marks(s):
    decomposed = unicodedata.normalize('NFD', s)
    base_chars = [c for c in decomposed if not unicodedata.combining(c)]
    return ''.join(base_chars)
Example:

py> strip_marks("I will coöperate with Müller's résumé mañana.")
"I will cooperate with Muller's resume manana."
====> woaaah, very different approach compared to mine. Nice! I have to read up
on unicodedata. I have used it a few times (e.g. where the re module is not
enough), but many of the abbreviations are still a mystery to me. This seems a
good start: http://www.unicode.org/reports/tr44/tr44-6.html
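====> For what it's worth, the module itself can decode most of those TR44
abbreviations interactively; a small experiment along these lines:

```python
import unicodedata

# Look up the official name and general category of a character.
print(unicodedata.name('ü'))      # LATIN SMALL LETTER U WITH DIAERESIS
print(unicodedata.category('ü'))  # Ll = Letter, lowercase

# NFD splits an accented letter into base letter + combining mark.
decomposed = unicodedata.normalize('NFD', 'ü')
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER U', 'COMBINING DIAERESIS']

# combining() is nonzero for combining marks, 0 for ordinary characters.
print(unicodedata.combining(decomposed[1]))  # 230
```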
Beware: stripping accents may completely change the meaning of the word
in many languages! Even in English, stripping the accents from "résumé"
makes the word ambiguous (do you mean a CV, or the verb to start
something again?). In other languages, stripping accents may completely
change the word, or even turn it into nonsense.
For example, I understand that in Danish, å is not the letter a with a
circle accent on it, but a distinct letter of the alphabet which should
not be touched. And I haven't even considered non-Western European
languages, like Greek, Polish, Russian, Arabic, Hebrew...
=====> Similarly, ñ is a letter in Spanish and Tagalog. So they have (at
least?) 27 letters in their alphabet.
Another issue: depending on the language, it may be better to replace
certain accents with letter combinations. For example, a German might
prefer to see Müller transformed to Mueller. (Although Herr Müller
probably won't, as people tend to be very sensitive about their names.)
=====> Strangely, the Nazi Goebbels is never referred to as "Göbbels".
Also, the above function leaves LATIN CAPITAL LETTER O WITH STROKE as Ø
instead of stripping the stroke. I'm not sure whether that is an
oversight or by design. Likewise for the lowercase version. You might
want to do some post-processing:
def strip_marks2(s):
    # Post-process letter O with stroke.
    decomposed = unicodedata.normalize('NFD', s)
    result = ''.join([c for c in decomposed if not unicodedata.combining(c)])
    return result.replace('Ø', 'O').replace('ø', 'o')
If you have a lot of characters to post-process (e.g. ß to "ss" or "sz")
I recommend you look into the str.translate method, which is more
efficient than repeatedly calling replace.
====> Efficiency certainly counts here, with millions of records to check. It
may even be more important than readability. Then again, accented letters are
fairly rare in my language.
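====> For the record, here is a minimal sketch of the str.translate idea
(the table entries below are just illustrative; you'd extend them for
your own data). The table is built once with str.maketrans, then applied
per record:

```python
import unicodedata

# Build the translation table once, outside the per-record loop.
# Illustrative mappings only: sharp s and the Scandinavian O with stroke.
TABLE = str.maketrans({'ß': 'ss', 'Ø': 'O', 'ø': 'o'})

def strip_marks_fast(s):
    # Apply the special-case replacements, then strip combining marks.
    decomposed = unicodedata.normalize('NFD', s.translate(TABLE))
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_marks_fast('Straße på Øst'))  # Strasse pa Ost
```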
No *simple* function can take into account the myriad of language-
specific rules for accents. The best you can do is code up a limited set
of rules for whichever languages you care about, and in the general case
fall back on just stripping accents like an ignorant American.
(No offence intended to ignorant Americans *wink*)
====> You are referring to this recipe, right?
http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/
;-)
> I thought it'd be cool to overload __eq__ for this. Is this a good
> approach, or have I been fixated too much on using the __eq__ special
> method?
This isn't Java, no need for a class :-)
On the other hand, if you start building up a set of language-specific
normalization functions, a class might be what you want. For example:
class DefaultAccentStripper:
    exceptions = {'Ø': 'O', 'ø': 'o'}
    mode = 'NFD'  # Or possibly 'NFKD' for some uses?

    def __call__(self, s):
        decomposed = []
        for c in s:
            if c in self.exceptions:
                decomposed.append(self.exceptions[c])
            else:
                decomposed.append(unicodedata.normalize(self.mode, c))
        result = ''.join([c for c in decomposed
                          if not unicodedata.combining(c)])
        return result
class GermanAccentStripper(DefaultAccentStripper):
    exceptions = DefaultAccentStripper.exceptions.copy()
    exceptions.update({'Ä': 'AE', 'ä': 'ae',
                       'Ë': 'EE', 'ë': 'ee',
                       'Ï': 'IE', 'ï': 'ie',
                       'Ö': 'OE', 'ö': 'oe',
                       # there seems to be a pattern here...
                       'Ü': 'UE', 'ü': 'ue',
                       'ß': 'sz',
                       })

class DanishAccentStripper(DefaultAccentStripper):
    exceptions = {'Å': 'Å', 'å': 'å'}
And there you go, three accent-strippers. Just instantiate the classes
once, and you're ready to go:
accent_stripper = GermanAccentStripper()
====> very slick. Cool!
====> regarding casefold (in your next mail). What is the difference between
lower and casefold?
Help on built-in function casefold:

casefold(...)
    S.casefold() -> str

    Return a version of S suitable for caseless comparisons.
>>> "Alala alala".casefold() == "Alala alala".lower()
True
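====> Answering my own question after some experimenting: they only
differ once full Unicode case folding kicks in, e.g. with the German
sharp s, which lower() leaves alone but casefold() expands to "ss":

```python
# lower() merely lowercases; casefold() applies Unicode full case
# folding, which also expands characters like the German sharp s.
print("Straße".lower())     # straße  -- ß survives
print("Straße".casefold())  # strasse -- ß folded to ss

# So casefold() is the one to use for caseless comparisons:
print("STRASSE".casefold() == "Straße".casefold())  # True
print("STRASSE".lower() == "Straße".lower())        # False
```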
====> And then this article............. sheeeeeesshhh!!!! What a short fuse!
Wouldn't it be easier to say "Look, man, the diacritics of my phone suck"?
_______________________________________________
Tutor maillist - [email protected]
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor