Public bug reported:

I want to do a regular expression match on UTF-8 encoded strings.
A simple example is matching a string consisting of 1 or 2 uppercase
characters, including Ä, Ë, Ï, Ö, Ü.
The extended regular expression I use is:

'^[A-ZÄ-Ü]{1,2}$'

Expected behaviour:

Input   Expect
------------------
Ä       Match
ÄB      Match
ABC     Fail

Testing with grep works as expected (the last command prints nothing):
$ echo Ä |grep -E '^[A-ZÄ-Ü]{1,2}$'
Ä
$ echo ÄB |grep -E '^[A-ZÄ-Ü]{1,2}$'
ÄB
$ echo ABC |grep -E '^[A-ZÄ-Ü]{1,2}$'

The same test with a simple test program that uses regcomp/regexec
(regex_test.c, attached) fails for ÄB:


$ ./regex Ä '^[A-ZÄ-Ü]{1,2}$'
MATCH (Ä) (^[A-ZÄ-Ü]{1,2}$)

$ ./regex ÄB '^[A-ZÄ-Ü]{1,2}$'
MISS  (ÄB) (^[A-ZÄ-Ü]{1,2}$)

$ ./regex ABC '^[A-ZÄ-Ü]{1,2}$'
MISS  (ABC) (^[A-ZÄ-Ü]{1,2}$)
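
The attached regex_test.c is not reproduced in this report. As a rough
sketch only, a test of this kind built on regcomp/regexec might look like
the following. The helper name match_ere is hypothetical, and the
setlocale call is an assumption about what a locale-aware test needs, not
a statement about what the attachment does.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 on match, 0 on no match, and -1 if the
 * pattern fails to compile. */
int match_ere(const char *text, const char *pattern)
{
    regex_t re;
    int rc;

    /* Adopt the locale from the environment (for example en_US.UTF-8).
     * A C program starts in the "C" locale, where the regex functions
     * may treat every byte as a separate character. If the environment's
     * locale is unavailable, setlocale returns NULL and the current
     * locale stays in effect. */
    setlocale(LC_ALL, "");

    /* REG_EXTENDED selects extended regular expressions, as grep -E does.
     * REG_NOSUB is used because only match/no-match is needed. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Called as match_ere("ÄB", "^[A-ZÄ-Ü]{1,2}$") under a UTF-8 locale, this
exercises the case that the expected-behaviour table lists as a match.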

It seems that the single character Ä is counted as two characters here
(presumably its two UTF-8 bytes), because this works:

$ ./regex Ä '^[A-ZÄ-Ü]{2}$'
MATCH (Ä) (^[A-ZÄ-Ü]{2}$)
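
This observation is consistent with Ä being encoded as two bytes in
UTF-8 (0xC3 0x84). The sketch below shows the difference between counting
bytes and counting characters. The helper name utf8_char_count is
hypothetical, and the code assumes its input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts UTF-8 encoded characters by skipping
 * continuation bytes (those of the form 10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "\xC3\x84" (Ä), strlen reports 2 bytes while
utf8_char_count reports 1 character, so a matcher that works byte by
byte would need {2} to consume a single Ä.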


Additional information:

$ lsb_release -rd
Description:    Ubuntu 14.04.2 LTS
Release:        14.04

libc6:amd64 version 2.19-0ubuntu6.5

Locale: en_US.UTF-8.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: libc6 2.19-0ubuntu6.5
ProcVersionSignature: Ubuntu 3.13.0-35.62-gatso 3.13.11.6
Uname: Linux 3.13.0-35-gatso x86_64
ApportVersion: 2.14.1-0ubuntu3.7
Architecture: amd64
CurrentDesktop: Unity
Date: Wed Mar  4 11:51:24 2015
Dependencies:
 gcc-4.9-base 4.9.1-0ubuntu1
 libc6 2.19-0ubuntu6.5
 libgcc1 1:4.9.1-0ubuntu1
 multiarch-support 2.19-0ubuntu6.5
InstallationDate: Installed on 2014-09-26 (158 days ago)
InstallationMedia: Ubuntu-Server 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.3)
SourcePackage: eglibc
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: eglibc (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug libc6 regcomp regex regexec trusty unicode

** Attachment added: "regex_test.c"
   https://bugs.launchpad.net/bugs/1428091/+attachment/4334307/+files/regex_test.c

https://bugs.launchpad.net/bugs/1428091

Title:
  regexec/regcomp fails on regular expression containing UTF-8
  multi-byte characters
