[1003.1(2013)/Issue7+TC1 0001077]: Recommend support for wide-character regcomp and regexec and/or specify multi-byte behavior

Austin Group Bug Tracker Sun, 11 Sep 2016 10:50:05 -0700

The following issue has been SUBMITTED. 
====================================================================== 
http://austingroupbugs.net/view.php?id=1077 
====================================================================== 
Reported By:                deadpixi
Assigned To:                
====================================================================== 
Project:                    1003.1(2013)/Issue7+TC1
Issue ID:                   1077
Category:                   System Interfaces
Type:                       Enhancement Request
Severity:                   Editorial
Priority:                   normal
Status:                     New
Name:                       Rob King 
Organization:                
User Reference:              
Section:                    regcomp 
Page Number:                - 
Line Number:                - 
Interp Status:              --- 
Final Accepted Text:         
====================================================================== 
Date Submitted:             2016-09-11 17:47 UTC
Last Modified:              2016-09-11 17:47 UTC
====================================================================== 
Summary:                    Recommend support for wide-character regcomp and
regexec and/or specify multi-byte behavior
Description: 
The mandated regular expression interfaces are specified to operate only on
regular expressions and inputs given as single-byte character strings. The
standard is silent on what these functions should do if either the regular
expression or the input contains multi-byte characters in the current (or
any) multi-byte encoding.


This makes it impossible to rely on the regular expression interfaces in a
portable manner, as their behavior is unspecified with multi-byte
characters.

For example, in UTF-8, a single logical code point may be encoded as up to
four bytes. If such a character appeared in, e.g., a bracket expression
(character class), a naive implementation that expected each character to
occupy a single byte would treat each byte as a separate character to be
matched, rather than treating the whole sequence as a single character.
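
The failure mode can be sketched in a few lines of C. This is only an
illustration of the hypothetical naive implementation described above, not
the code of any real library; the function names are invented for the
example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical naive bracket-expression test: every byte of `set` is
 * treated as its own "character". Not taken from any real library. */
static int naive_class_has_byte(const char *set, unsigned char byte)
{
    return byte != 0 && memchr(set, byte, strlen(set)) != NULL;
}

/* Returns 1 if the naive matcher accepts the first byte of `input`. */
static int naive_class_matches(const char *set, const char *input)
{
    return naive_class_has_byte(set, (unsigned char)input[0]);
}
```

With the class contents "\xF0\x9F\x98\x80" (U+1F600, four bytes in UTF-8),
the naive matcher also accepts the first byte of "\xF0\x9F\x92\xA9"
(U+1F4A9), a different character that happens to share the lead byte 0xF0.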
Desired Action: 
The Standard should specify behavior of the regular expression interfaces
when the expression to be compiled or the input to be matched contains
multi-byte characters in the current character encoding.
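
One way to observe what a given implementation currently does is to measure
how many bytes "." consumes from a multi-byte character. The following is a
minimal sketch using only regcomp, regexec, regfree and setlocale; the
helper names are invented for the example, and the locale name passed to
probe_dot (for instance "en_US.UTF-8") is an assumption that may not exist
on a given system.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Length in bytes of the first match of `pattern` in `input`, or -1 if
 * the pattern fails to compile or does not match. */
static long first_match_len(const char *pattern, const char *input)
{
    regex_t re;
    regmatch_t m[1];
    long len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, input, 1, m, 0) == 0)
        len = (long)(m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return len;
}

/* Probe: how many bytes does "." consume from a three-byte UTF-8
 * character (U+20AC) under the named locale? Returns -1 if the locale
 * is unavailable or nothing matches. Changes the process-wide locale. */
static long probe_dot(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;
    return first_match_len(".", "\xE2\x82\xAC");
}
```

A result of 3 would suggest the implementation treats the sequence as one
character in that locale; a result of 1 would suggest byte-wise matching.
Because the standard does not say which applies, neither can be relied on.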

Alternatively (and perhaps preferably), the Standard should specify
additional wide-character regular expression interfaces (perhaps named
regwcomp and regwexec) to perform regular expression compilation and
matching on expressions and inputs specified using wide characters; this
would avoid any issue with multi-byte encoding.

The specification of the additional wide-character interfaces would likely
not be too burdensome: the GNU C library (when compiled with RE_ENABLE_I18N
defined) already represents regular expressions and the input as wide
characters internally. The Mac OS X standard library supports the
"regwcomp" and "regwexec" interfaces. A free, portable, permissively
licensed, and popular implementation (libtre) exists and has been
incorporated into several popular C libraries (musl, Darwin's libc, etc.).
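
As a rough sketch of how a caller might prepare input for such interfaces,
a multi-byte string can be converted to wide characters with the standard
mbstowcs function. The helper name to_wide is invented for the example, and
the calls to regwcomp and regwexec themselves are deliberately not shown,
since they are not part of the Standard and their availability is assumed
only on platforms such as those named above.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a multi-byte string (in the current locale's encoding) to a
 * newly allocated wide-character string. Returns NULL on allocation
 * failure or if `mb` contains an invalid multi-byte sequence. The caller
 * frees the result. */
static wchar_t *to_wide(const char *mb)
{
    /* Each wide character consumes at least one byte of input, so the
     * byte length plus a terminator is always enough room. */
    size_t cap = strlen(mb) + 1;
    wchar_t *w = malloc(cap * sizeof *w);

    if (w == NULL)
        return NULL;
    if (mbstowcs(w, mb, cap) == (size_t)-1) {
        free(w);
        return NULL;
    }
    /* On a platform that provides them, `w` could now be passed to a
     * wide-character compile or match function such as regwcomp or
     * regwexec (assumption: not portable, hence not called here). */
    return w;
}
```

Because the pattern and the input would both arrive as arrays of wchar_t,
each element is one character, and the question of how many bytes a
character occupies does not arise inside the matcher.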
====================================================================== 

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2016-09-11 17:47 deadpixi       New Issue                                    
2016-09-11 17:47 deadpixi       Name                      => Rob King        
2016-09-11 17:47 deadpixi       Section                   => regcomp         
2016-09-11 17:47 deadpixi       Page Number               => -               
2016-09-11 17:47 deadpixi       Line Number               => -               
======================================================================
