The following issue has been SUBMITTED.
======================================================================
http://austingroupbugs.net/view.php?id=1077
======================================================================
Reported By:                deadpixi
Assigned To:
======================================================================
Project:                    1003.1(2013)/Issue7+TC1
Issue ID:                   1077
Category:                   System Interfaces
Type:                       Enhancement Request
Severity:                   Editorial
Priority:                   normal
Status:                     New
Name:                       Rob King
Organization:
User Reference:
Section:                    regcomp
Page Number:                -
Line Number:                -
Interp Status:              ---
Final Accepted Text:
======================================================================
Date Submitted:             2016-09-11 17:47 UTC
Last Modified:              2016-09-11 17:47 UTC
======================================================================
Summary:                    Recommend support for wide-character regcomp and regexec and/or specify multi-byte behavior
Description:
The existing mandated regular expression interfaces are specified to work solely on regular expressions and inputs that are specified using single-byte character strings. The standard is silent on what the regular expression functions should do if either the regular expression or the input contains multi-byte encoded characters in the current (or any) multi-byte encoding.

This makes it impossible to rely on the regular expression interfaces in a portable manner, as their behavior is unspecified with multi-byte characters. For example, in UTF-8, a single logical code point in a regular expression may be encoded as up to four bytes. If such a code point were used in, e.g., a regular expression character class, a naive implementation that expected each character to occupy a single byte would treat each individual byte as a separate member of the character class, and not as part of a single character.

Desired Action:
The Standard should specify the behavior of the regular expression interfaces when the expression to be compiled or the input to be matched contains multi-byte characters in the current character encoding.

Alternatively (and perhaps preferably), the Standard should specify additional wide-character regular expression interfaces (perhaps named regwcomp and regwexec) to perform regular expression compilation and matching on expressions and inputs specified using wide characters; this would avoid any issue with multi-byte encoding.

The specification of the additional wide-character interfaces would likely not be too burdensome:

- The GNU C library (when compiled with RE_ENABLE_I18N defined) already represents regular expressions and the input as wide characters internally.
- The Mac OS X standard library supports the "regwcomp" and "regwexec" interfaces.
- A free, portable, well-licensed, and popular implementation (libtre) exists and has been incorporated into several popular C libraries (musl, Darwin's libc, etc).
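
The character-class problem described above can be illustrated with a minimal sketch in C. It deliberately does not call regcomp or regexec, since their multi-byte behavior is exactly what is unspecified; instead it contrasts a byte-wise membership test with a code-point-aware one. The helper names (utf8_decode, naive_class_match, utf8_class_match) are hypothetical and not part of any standard or existing library, and the decoder is simplified: it assumes UTF-8 input and does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 at end of string or on malformed input.
 * Simplified: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;

    assert(s != NULL && cp != NULL);
    if (p[0] == 0)
        return 0;
    if (p[0] < 0x80) {
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {
        *cp = p[0] & 0x1F;
        len = 2;
    } else if ((p[0] & 0xF0) == 0xE0) {
        *cp = p[0] & 0x0F;
        len = 3;
    } else if ((p[0] & 0xF8) == 0xF0) {
        *cp = p[0] & 0x07;
        len = 4;
    } else {
        return 0;
    }
    for (i = 1; i < len; i++) {
        /* A NUL or any other non-continuation byte ends decoding. */
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return len;
}

/* Naive model: every byte of `members` is its own class member,
 * and only the first byte of `input` is tested. */
static int naive_class_match(const char *members, const char *input)
{
    return input[0] != '\0' && strchr(members, input[0]) != NULL;
}

/* Code-point-aware model: decode the first character of `input`
 * and compare it against each decoded member of the class. */
static int utf8_class_match(const char *members, const char *input)
{
    uint32_t want, have;
    size_t n;

    if (utf8_decode(input, &want) == 0)
        return 0;
    while ((n = utf8_decode(members, &have)) != 0) {
        if (have == want)
            return 1;
        members += n;
    }
    return 0;
}
```

With a class containing only U+00E9 (bytes C3 A9) and an input of U+00EA (bytes C3 AA), naive_class_match reports a match because both characters share the lead byte C3, while utf8_class_match correctly reports no match.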
======================================================================
Issue History
Date Modified    Username       Field                    Change
======================================================================
2016-09-11 17:47 deadpixi       New Issue
2016-09-11 17:47 deadpixi       Name                      => Rob King
2016-09-11 17:47 deadpixi       Section                   => regcomp
2016-09-11 17:47 deadpixi       Page Number               => -
2016-09-11 17:47 deadpixi       Line Number               => -
======================================================================