Re: [PATCH] specify encoding for sed command
>> I think the best way to move forward with this is a new patch that uses >> `awk` instead of `sed`: I tested several `awk` variants and the command >> was portable without requiring any changes to LANG or LC_ALL. >> >> Does that sound like a good plan? > > No ;) > Could you please give the patch below a try? Your patch works great and is clearly a way better approach than parsing the output of `set`. > I'm just not sure whether we should burden this otherwise nice and short > commit message with the details of this Powerline-macOS-Bash-sed issue... It might be worth mentioning it briefly, in order to reduce the chances of a regression in the future. Maybe something like... "In Bash we can do better: run the 'compgen -v __gitcomp_builtin_' builtin command, which lists the same variables, but without the risk of errors caused by poor Unicode handling in some shells' 'set' builtin command or the overhead of executing 'sed'."
Re: [PATCH] specify encoding for sed command
> I did a little more digging into this issue today. > The issue isn't actually caused by `sed`: it's caused by the way that > the `set` builtin outputs characters in the Unicode Private Use Area > (PUA) in the build of Bash 3.2.57 that macOS Sierra ships with. > > Powerline uses several PUA code points to make some of its pretty text > UI elements: > > Code point (hex value): description > U+E0A0 (0xEE82A0): Version control branch > U+E0A1 (0xEE82A1): LN (line) symbol > U+E0A2 (0xEE82A2): Closed padlock > U+E0B0 (0xEE82B0): Rightwards black arrowhead > U+E0B1 (0xEE82B1): Rightwards arrowhead > U+E0B2 (0xEE82B2): Leftwards black arrowhead > U+E0B3 (0xEE82B3): Leftwards arrowhead > > macOS Bash 3.2.57's `set` builtin has garbled output where Powerline's > special symbols should be in the PS1 variable, but Bash 4.4.19 > (installed on macOS via homebrew) and Bash 4.3.38 (Ubuntu 16.04) both > display it correctly in the output of `set`. `echo $PS1` does display > the symbols correctly on these versions of Bash. Thanks for digging. > I think the best way to move forward with this is a new patch that uses > `awk` instead of `sed`: I tested several `awk` variants and the command > was portable without requiring any changes to LANG or LC_ALL. > > Does that sound like a good plan? No ;) Could you please give the patch below a try? I intended it as a fix for a minor performance regression introduced in the same commit that triggered this issue, but if it can work around this issue, too, all the better. I'm just not sure whether we should burden this otherwise nice and short commit message with the details of this Powerline-macOS-Bash-sed issue... -- >8 -- Subject: [PATCH] completion: reduce overhead of clearing cached --options To get the names of all '$__git_builtin_*' variables caching --options of builtin commands in order to unset them, 8b0eaa41f2 (completion: clear cached --options when sourcing the completion script, 2018-03-22) runs a 'set |sed s///' pipeline. This works both in Bash and in ZSH, but has a higher than necessasry overhead with the extra processes. In Bash we can do better: run the 'compgen -v __gitcomp_builtin_' builtin command, which lists the same variables, but without a pipeline and 'sed' it can do so with lower overhead. Signed-off-by: SZEDER Gábor TODO: as an unexpected bonus, this might work around the issue with 'sed' and invalid UTF-8 characters in the environment as well, at least in Bash. https://public-inbox.org/git/0102016293c8dca7-6626fcde-548d-476e-b61f-c83ecdeedfe1-000...@eu-west-1.amazonses.com/ --- contrib/completion/git-completion.bash | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash index b09c8a2362..4ef59a51be 100644 --- a/contrib/completion/git-completion.bash +++ b/contrib/completion/git-completion.bash @@ -282,7 +282,11 @@ __gitcomp () # Clear the variables caching builtins' options when (re-)sourcing # the completion script. -unset $(set |sed -ne 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null +if [[ -n ${ZSH_VERSION-} ]]; then + unset $(set |sed -ne 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null +else + unset $(compgen -v __gitcomp_builtin_) +fi # This function is equivalent to # -- 2.17.0.366.gbe216a3084
Re: [PATCH] specify encoding for sed command
I did a little more digging into this issue today. > On Apr 11, 2018, at 4:42 PM, Matt Coleman wrote: > > I found another (possibly better) way to fix this: > >> On Apr 10, 2018, at 3:18 AM, Matt Coleman wrote: >> >>> 1) What platform OS / version / sed version is this on? >> I'm experiencing this on macOS Sierra (10.12.6). The issue occurs with the >> OS's native sed, which is FreeBSD sed so the version number is kind of >> ambiguous. >> >> The error goes away if I set LANG=C or LC_ALL=C or change it to use GNU sed >> (installed via homebrew as gsed). > > If I change it to use awk instead of sed, it works with mawk, gawk, and macOS > awk: > unset $(set | awk -F '=' '/^__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*=/ > {print $1}') 2>/dev/null > > I compared sed vs. awk on Linux and Mac and they all take about the same > amount of time to run (within 0.002ms). The issue isn't actually caused by `sed`: it's caused by the way that the `set` builtin outputs characters in the Unicode Private Use Area (PUA) in the build of Bash 3.2.57 that macOS Sierra ships with. Powerline uses several PUA code points to make some of its pretty text UI elements: Code point (hex value): description U+E0A0 (0xEE82A0): Version control branch U+E0A1 (0xEE82A1): LN (line) symbol U+E0A2 (0xEE82A2): Closed padlock U+E0B0 (0xEE82B0): Rightwards black arrowhead U+E0B1 (0xEE82B1): Rightwards arrowhead U+E0B2 (0xEE82B2): Leftwards black arrowhead U+E0B3 (0xEE82B3): Leftwards arrowhead macOS Bash 3.2.57's `set` builtin has garbled output where Powerline's special symbols should be in the PS1 variable, but Bash 4.4.19 (installed on macOS via homebrew) and Bash 4.3.38 (Ubuntu 16.04) both display it correctly in the output of `set`. `echo $PS1` does display the symbols correctly on these versions of Bash. So... > 3) There's other invocations of "sed" in the file, aren't those affected as > well? The short answer: no. Slightly longer: not by the same thing that's affecting the line in the patch, at least. Long: As described above, the problem isn't actually `sed`: it's the `set` builtin in macOS's build of Bash. The other invocations of `sed` should be safe, because `sed` properly handles the PUA code points on its own: $ echo $'\xee\x82\xb0' | sed 's/./@/' @ The way that `set` is displaying the PS1 variable seems to be sending them to `sed` individually or somehow split up: $ for character in $'\xee' $'\x82' $'\xb0' $'\xee\x82' $'\x82\xb0'; do echo $character | sed 's/./@/'; done sed: RE error: illegal byte sequence sed: RE error: illegal byte sequence sed: RE error: illegal byte sequence sed: RE error: illegal byte sequence sed: RE error: illegal byte sequence Interestingly, Bash 3.2.25's `set` builtin on CentOS 5 correctly displays the octal values for the symbols (prompt edited to keep this email ASCII): $ PS1=$'\xee\x82\xb0 ' > set | grep PS1 PS1=$'\356\202\260 ' I haven't started digging through Bash's codebase, but it could either be a bug that was introduced between 3.2.25 and 3.2.57 or Apple used some silly flags when compiling Bash. Ideally, this should be fixed in Bash, but since Apple's using such an old version of Bash for license reasons, I think it's unlikely that they'll fix the issue. I think the best way to move forward with this is a new patch that uses `awk` instead of `sed`: I tested several `awk` variants and the command was portable without requiring any changes to LANG or LC_ALL. Does that sound like a good plan?
Re: [PATCH] specify encoding for sed command
I found another (possibly better) way to fix this: > On Apr 10, 2018, at 3:18 AM, Matt Coleman wrote: > >> 1) What platform OS / version / sed version is this on? > I'm experiencing this on macOS Sierra (10.12.6). The issue occurs with the > OS's native sed, which is FreeBSD sed so the version number is kind of > ambiguous. > > The error goes away if I set LANG=C or LC_ALL=C or change it to use GNU sed > (installed via homebrew as gsed). If I change it to use awk instead of sed, it works with mawk, gawk, and macOS awk: unset $(set | awk -F '=' '/^__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*=/ {print $1}') 2>/dev/null I compared sed vs. awk on Linux and Mac and they all take about the same amount of time to run (within 0.002ms).
Re: [PATCH] specify encoding for sed command
> 1) What platform OS / version / sed version is this on? I'm experiencing this on macOS Sierra (10.12.6). The issue occurs with the OS's native sed, which is FreeBSD sed so the version number is kind of ambiguous. The error goes away if I set LANG=C or LC_ALL=C or change it to use GNU sed (installed via homebrew as gsed). > 2) What's the output from "set" that's causing this error? Do we have an > isolated test case for that? On my system, it's caused by whatever powerline is doing to Bash's PS1 variable: $ set | grep ^PS1= | sed 's/foo/bar/' sed: RE error: illegal byte sequence I'm not sure exactly which character is bad here: $ xxd <<<$PS1 : 5c5b 1b5b 303b 3338 3b32 3b32 3535 3b32 \[.[0;38;2;255;2 0010: 3535 3b32 3535 3b34 383b 323b 303b 3133 55;255;48;2;0;13 0020: 353b 3137 353b 316d 5c5d c2a0 6d61 7474 5;175;1m\]..matt 0030: c2a0 5c5b 1b5b 303b 3338 3b32 3b30 3b31 ..\[.[0;38;2;0;1 0040: 3335 3b31 3735 3b34 383b 323b 3838 3b38 35;175;48;2;88;8 0050: 383b 3838 3b32 326d 5c5d ee82 b0c2 a05c 8;88;22m\].\ 0060: 5b1b 5b30 3b33 383b 323b 3230 383b 3230 [.[0;38;2;208;20 0070: 383b 3230 383b 3438 3b32 3b38 383b 3838 8;208;48;2;88;88 0080: 3b38 383b 316d 5c5d 7ec2 a05c 5b1b 5b30 ;88;1m\]~..\[.[0 0090: 3b33 383b 323b 3133 383b 3133 383b 3133 ;38;2;138;138;13 00a0: 383b 3438 3b32 3b38 383b 3838 3b38 383b 8;48;2;88;88;88; 00b0: 3232 6d5c 5dee 82b1 c2a0 5c5b 1b5b 303b 22m\].\[.[0; 00c0: 3338 3b32 3b32 3038 3b32 3038 3b32 3038 38;2;208;208;208 00d0: 3b34 383b 323b 3838 3b38 383b 3838 3b31 ;48;2;88;88;88;1 00e0: 6d5c 5dc2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 m\]. 00f0: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 0100: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 0110: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 0120: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 0130: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a05c ...\ 0140: 5b1b 5b30 3b33 383b 323b 3438 3b34 383b [.[0;38;2;48;48; 0150: 3438 3b34 383b 323b 3838 3b38 383b 3838 48;48;2;88;88;88 0160: 3b32 326d 5c5d c2a0 ee82 b25c 5b1b 5b30 ;22m\].\[.[0 0170: 3b33 383b 323b 3135 383b 3135 383b 3135 ;38;2;158;158;15 0180: 383b 3438 3b32 3b34 383b 3438 3b34 386d 8;48;2;48;48;48m 0190: 5c5d c2a0 3230 3138 2d30 342d 3130 5c5b \]..2018-04-10\[ 01a0: 1b5b 303b 3338 3b32 3b39 383b 3938 3b39 .[0;38;2;98;98;9 01b0: 383b 3438 3b32 3b34 383b 3438 3b34 383b 8;48;2;48;48;48; 01c0: 3232 6d5c 5dc2 a0ee 82b3 5c5b 1b5b 303b 22m\].\[.[0; 01d0: 3338 3b32 3b32 3038 3b32 3038 3b32 3038 38;2;208;208;208 01e0: 3b34 383b 323b 3438 3b34 383b 3438 3b31 ;48;2;48;48;48;1 01f0: 6d5c 5dc2 a030 333a 3136 c2a0 5c5b 1b5b m\]..03:16..\[.[ 0200: 306d 5c5d 205c 5b1b 5b30 3b33 383b 323b 0m\] \[.[0;38;2; 0210: 3135 383b 3135 383b 3135 383b 3438 3b32 158;158;158;48;2 0220: 3b34 383b 3438 3b34 386d 5c5d c2a0 c2a0 ;48;48;48m\] 0230: 5c5b 1b5b 303b 3338 3b32 3b34 383b 3438 \[.[0;38;2;48;48 0240: 3b34 383b 3439 3b32 326d 5c5d ee82 b0c2 ;48;49;22m\] 0250: a05c 5b1b 5b30 6d5c 5d0a .\[.[0m\].
Re: [PATCH] specify encoding for sed command
Stephon Harris writes: > Fixes issue with seeing `sed: RE error: illegal byte sequence` when running > git-completion.bash > --- > contrib/completion/git-completion.bash | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/contrib/completion/git-completion.bash > b/contrib/completion/git-completion.bash > index b09c8a23626b4..52a4ab5e2165a 100644 > --- a/contrib/completion/git-completion.bash > +++ b/contrib/completion/git-completion.bash > @@ -282,7 +282,7 @@ __gitcomp () > > # Clear the variables caching builtins' options when (re-)sourcing > # the completion script. > -unset $(set |sed -ne > 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null > +unset $(set |LANG=C sed -ne > 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null Shouldn't LC_ALL and LANG both set and exported to C, as (1) ancient systems understand only LANG but not LC_*; and (2) modern ones that understand both give precedence to LC_ALL over LANG? If we were to set only one, it is probably more sensible to set LC_ALL, I suspect, but it may be better to set both, which sends a sign to the readers that we know what we are doing ;-)
Re: [PATCH] specify encoding for sed command
On Thu, Apr 5, 2018 at 8:53 AM, Ævar Arnfjörð Bjarmason wrote: > > On Thu, Apr 05 2018, Stephon Harris wrote: > >> Fixes issue with seeing `sed: RE error: illegal byte sequence` when running >> git-completion.bash Please: - prefix the subject line with "completion:". - nit: replace "running" with "sourcing". - wrap the body of the commit message around 70 characters. - sign off your patch; see Documentation/SubmittingPatches and 'git commit -s'. >> --- >> contrib/completion/git-completion.bash | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/contrib/completion/git-completion.bash >> b/contrib/completion/git-completion.bash >> index b09c8a23626b4..52a4ab5e2165a 100644 >> --- a/contrib/completion/git-completion.bash >> +++ b/contrib/completion/git-completion.bash >> @@ -282,7 +282,7 @@ __gitcomp () >> >> # Clear the variables caching builtins' options when (re-)sourcing >> # the completion script. >> -unset $(set |sed -ne >> 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null >> +unset $(set |LANG=C sed -ne >> 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null I was wondering how could you see an error message from 'sed', when the stderr of the whole thing is redirected to /dev/null. It turns out that such a redirection doesn't affect the stderr of the commands executed inside the command substitution: $ echo $(echo foo ; echo >&2 error) 2>/dev/null error foo Interesting, I didn't know that. > This is getting closer to the issue than your previous patch, but > there's still some open questions: > > 1) What platform OS / version / sed version is this on? > > 2) What's the output from "set" that's causing this error? Do we have an >isolated test case for that? A quick internet search suggests that this error message tends to appear on Macs, when its 'sed' encounters an invalid UTF-8 character in its input. > 3) There's other invocations of "sed" in the file, aren't those affected >as well? The two 'sed' invocations in _git_stash() are suspect, as they process the output of 'git stash list', which includes the stashes' descriptions, which can contain whatever the users wished to store there (though they would probably get a "Warning: commit message did not conform to UTF-8" message from 9e83266525 (commit-tree: encourage UTF-8 commit messages., 2006-12-22) when doing so). The 'sed' invocation in __git_complete_revlist_file() is suspect as well, as it processes the output of 'git ls-tree'. Pathnames can contain anything, and while 'git ls-tree' quotes pathnames with "unusual" characters, I don't know whether it quotes all characters that can possibly upset Stephon's 'sed'. What about 'awk' on Stephon's platform? We already have one 'awk' invocation in __git_match_ctag(), which we use for 'git grep ' and 'git log -L:'. That 'awk' script uses a regexp to match the symbol name in a ctags file; does any programming language allow such unusual characters in symbol names? What about 'sed' and 'awk' scripts that don't use regexps at all? (I intend to add such an 'awk' script in a day or two...) > 4) Any reason we wouldn't just set LC_AlL=C for the whole file? I see we >already do it for our invocation to "git merge". That would perhaps be the safest thing to do... But how can we set it for a whole file, when the file is not executed as a script but sourced into the user's shell environment?
Re: [PATCH] specify encoding for sed command
On Thu, Apr 5, 2018 at 2:53 AM, Ævar Arnfjörð Bjarmason wrote: > On Thu, Apr 05 2018, Stephon Harris wrote: >> Fixes issue with seeing `sed: RE error: illegal byte sequence` when running >> git-completion.bash > > This is getting closer to the issue than your previous patch, but > there's still some open questions: > > 1) What platform OS / version / sed version is this on? > > 2) What's the output from "set" that's causing this error? Do we have an >isolated test case for that? > > 3) There's other invocations of "sed" in the file, aren't those affected >as well? > > 4) Any reason we wouldn't just set LC_AlL=C for the whole file? I see we >already do it for our invocation to "git merge". Also, missing Signed-off-by: (see Documentation/SubmittingPatches). Thanks.
Re: [PATCH] specify encoding for sed command
On Thu, Apr 05 2018, Stephon Harris wrote: > Fixes issue with seeing `sed: RE error: illegal byte sequence` when running > git-completion.bash > --- > contrib/completion/git-completion.bash | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/contrib/completion/git-completion.bash > b/contrib/completion/git-completion.bash > index b09c8a23626b4..52a4ab5e2165a 100644 > --- a/contrib/completion/git-completion.bash > +++ b/contrib/completion/git-completion.bash > @@ -282,7 +282,7 @@ __gitcomp () > > # Clear the variables caching builtins' options when (re-)sourcing > # the completion script. > -unset $(set |sed -ne > 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null > +unset $(set |LANG=C sed -ne > 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null This is getting closer to the issue than your previous patch, but there's still some open questions: 1) What platform OS / version / sed version is this on? 2) What's the output from "set" that's causing this error? Do we have an isolated test case for that? 3) There's other invocations of "sed" in the file, aren't those affected as well? 4) Any reason we wouldn't just set LC_AlL=C for the whole file? I see we already do it for our invocation to "git merge".