Hi,

Out of curiosity, I wrote the throw-away script below to find a character that is classified as a digit in one locale (via LC_CTYPE) but not in another. I ran it with 5000 locale combinations in Python 2 and did not find any before somebody shut down my computer. I have just modified the code so that it also runs in Python 3. Is this the correct way to find such locale-dependent regex matches?
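Before the full script, here is a smaller sketch of the same idea, in case it makes the question clearer. It is Python 3 only and checks single bytes rather than all code points, because as far as I can tell from the re documentation, re.LOCALE is meant for bytes patterns in Python 3. The helper names matching_bytes and classification_diff are made up for this illustration and are not part of the script.

```python
import locale
import re


def matching_bytes(pattern, locale_name):
    """Return the set of byte values (0-255) matched by the bytes regex
    `pattern` while LC_CTYPE is set to `locale_name`.

    Hypothetical helper for illustration. Raises locale.Error if the
    requested locale is not installed on the system.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        regex = re.compile(pattern, re.LOCALE)
        return {i for i in range(256) if regex.match(bytes([i]))}
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore the locale


def classification_diff(pattern, locale1, locale2):
    """Return the byte values matched in exactly one of the two locales."""
    return matching_bytes(pattern, locale1) ^ matching_bytes(pattern, locale2)
```

Calling classification_diff(br"\d", locale1, locale2) with two names from the `locale -a` list should return an empty set if the two locales classify every byte the same way for that pattern. Other patterns such as br"\w" can be passed in the same way.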
albertjan@debian:~/Downloads$ uname -a && python --version && python3 --version
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04) x86_64 GNU/Linux
Python 2.7.9
Python 3.3.4
albertjan@debian:~/Downloads$ cat lc_ctype.py
# -*- coding: utf-8 -*-

"""
Find two locales where a different character classification causes a regex
to match a given character in one locale, but fail in another.
This is to demonstrate the effect that re.LOCALE (in particular the LC_CTYPE
locale category) might have on locale-aware regexes like \w or \d.
E.g., a character might be classified as digit in one locale but not in
another.
"""

from __future__ import print_function, division
import subprocess
import locale
import itertools
import sys
import re

try:
    xrange
except NameError:
    xrange = range
    unichr = chr

if sys.version_info.major > 2:
    unicode = str

proc = subprocess.Popen("locale -a", stdout=subprocess.PIPE, shell=True)
locales = proc.communicate()
locales = sorted(locales[0].split(b"\n"))  # this is the list: http://pastebin.com/FVxUnrWK
if sys.version_info.major > 2:
    locales = [loc.decode("utf-8") for loc in locales]

regex = re.compile(r"\d+", re.LOCALE)  # is this the correct place?
total = len(list(itertools.combinations(locales, 2)))
for n, (locale1, locale2) in enumerate(itertools.combinations(locales, 2), 1):

    if not locale1 or not locale2:
        continue

    if n % 10 == 0 or n == 1:
        sys.stdout.write(" %d (%3.2f%%) ... " % (n, (n / total * 100)))
        sys.stdout.flush()  # python 2 print *function* does not have flush param

    for i in xrange(sys.maxunicode + 1):  # 1114111
        s = unichr(i)  #.encode("utf8")
        try:
            locale.setlocale(locale.LC_CTYPE, locale1)
            m1 = bool(regex.match(s))
            locale.setlocale(locale.LC_CTYPE, locale2)
            m2 = bool(regex.match(s))
            if m1 ^ m2:  # m1 != m2
                msg = ("@@ ordinal: %s | character: %s (%r) | "
                       " digit in locale '%s': %s | digit in locale '%s': %s ")
                print(msg % (i, unichr(i), unichr(i), locale1, m1, locale2, m2))
                break
        except locale.Error as e:
            #print("Error: %s with %s and/or %s" % (e, locale1, locale2))
            continue

print("---Done---")

Thank you!

Albert-Jan