Public bug reported:

In theory the regular expression ^.*$ should match every line, including
an empty one. However, one specific Korean character, U+D56D (항), which
one of my scripts happened to come across, breaks this expected behavior
in egrep:

$ echo '' | egrep '^.*$'; echo $?

0
$ echo 'foo' | egrep '^.*$'; echo $?
foo
0
$ echo 'bar' | egrep '^.*$'; echo $?
bar
0
$ echo 'の名' | egrep '^.*$'; echo $?
の名
0
$ echo '항' | egrep '^.*$'; echo $?
1

Have I lost my mind...or should I go buy a lottery ticket? Here are some
rambling one-liners to illustrate the behavior further.

# An attempt to match the pattern ^.*$ (beginning of string, anything, end of
# string) against this Korean character fails:
$ echo '항' | egrep '^.*$'; echo $?
1

# As you can see here a match works when the $ is dropped from the pattern:
$ echo '항' | egrep '^.*'; echo $?
항
0

# Also using the -P flag from grep instead of -E correctly matches the original
# pattern:
$ echo '항' | grep -P '^.*$'; echo $?
항
0
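
As a further cross-check outside grep, here is a minimal Python sketch that
applies the same pattern with Python's re module. The helper name
matches_whole_line is mine, purely for illustration. It says nothing about how
grep's own matcher works, only that another regex engine accepts these inputs.

```python
import re

# Same pattern as in the egrep examples; compiled once for reuse.
PATTERN = re.compile(r"^.*$")


def matches_whole_line(line):
    """Return True if ^.*$ matches the line (hypothetical helper name)."""
    return PATTERN.search(line) is not None


# Inputs taken from this report, without the trailing newline that echo adds.
for sample in ["", "foo", "の名", "항", "유", "항유", "유항", "유항유"]:
    print(repr(sample), matches_whole_line(sample))
```

Every sample, including the ones containing U+D56D, is reported as a match
here, which is the behavior I expected from egrep as well.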

# Sending a different Korean character (U+C720) to the same original pattern
# works as expected as well:
$ echo '유' | egrep '^.*$'; echo $?
유
0

# Combining the two leads to the original failure mentioned:
$ echo '항유' | egrep '^.*$'; echo $?
1

# And reversing the order of the combination does not affect the outcome:
$ echo '유항' | egrep '^.*$'; echo $?
1

# But dropping the $ from the pattern gives the expected match:
$ echo '유항' | egrep '^.*'; echo $?
유항
0

# Dropping the ^ from the pattern also gives the expected match:
$ echo '유항' | egrep '.*$'; echo $?
유항
0

# Surrounding U+D56D with U+C720 does not alter the behavior:
$ echo '유항유' | egrep '^.*$'; echo $?
1

# But again dropping U+D56D (항) from the input string returns egrep to the
# expected behavior:
$ echo '유유' | egrep '^.*$'; echo $?
유유
0

# And to make it very clear what the input is, here I'm using python to give a
# raw dump of the input:
$ echo '유항유' | python -c 'import sys; print(repr(sys.stdin.read().encode("unicode-escape")))'
b'\\uc720\\ud56d\\uc720\\n'
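
The unicode-escape dump above shows code points, whereas grep in a UTF-8
locale works on the encoded bytes. The following is a small sketch, assuming
the input reaches grep as UTF-8 (the helper name utf8_hex is hypothetical),
that lists the bytes behind each character:

```python
def utf8_hex(text):
    """Return (character, code point, UTF-8 bytes as hex) for each character.

    Hypothetical helper for illustration; assumes UTF-8 is the encoding in use.
    """
    rows = []
    for ch in text:
        encoded = ch.encode("utf-8")
        hex_bytes = " ".join(f"{byte:02x}" for byte in encoded)
        rows.append((ch, f"U+{ord(ch):04X}", hex_bytes))
    return rows


# Dump the same three-character input used in the one-liner above.
for ch, code_point, hex_bytes in utf8_hex("유항유"):
    print(ch, code_point, hex_bytes)
```

For the two characters in question this gives ec 9c a0 for U+C720 and
ed 95 ad for U+D56D, which may be useful to whoever looks at the matcher.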

# My grep/egrep version:
$ grep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
$ egrep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

# My bash version
$ bash --version
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.


===========================

If somebody could explain this behavior I would appreciate it. If it
could be fixed, even better. In the meantime I think I will prefer 'grep
-P' over 'egrep' when I expect strings to contain Korean text. In this
contrived example the '^' and '$' anchors are redundant, but I thought
it best to provide the simplest possible reproduction case rather than
spell out my full use case.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: grep 3.4-1
ProcVersionSignature: Ubuntu 5.4.0-65.73-generic 5.4.78
Uname: Linux 5.4.0-65-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair wl
ApportVersion: 2.20.11-0ubuntu27.16
Architecture: amd64
CasperMD5CheckResult: skip
Date: Mon Feb 15 17:10:42 2021
InstallationDate: Installed on 2020-01-22 (389 days ago)
InstallationMedia: Ubuntu 18.04.3 LTS "Bionic Beaver" - Release amd64 (20190805)
SourcePackage: grep
UpgradeStatus: Upgraded to focal on 2021-02-01 (13 days ago)

** Affects: grep (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug focal

https://bugs.launchpad.net/bugs/1915738

Title:
  egrep: U+D56D (항) breaks ^/$ matching

Status in grep package in Ubuntu:
  New

