Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-08-01 Thread Steffen Nurpmeso
Hello.

David A. Wheeler wrote in
 :
 |On Fri, 31 Jul 2020 16:51:56 + (UTC), shwaresyst  \
 |wrote:
 |> Please look at my former message.  It stands that \Uu is ISO
 |> 10646, and that does not represent characters but codepoints,
 |> multiple of which may be necessary to represent one real
 |> character, which then may be a valid character in the locale
 |> encoding.
 |
 |I think in $'...' the \u and \U should be either:
 |(1) omitted, or
 |(2) only required to support UTF-8. Do NOT require translations \
 |of \u and \U to other locales. Also, if \u and \U are included, they \
 |should simply emit the code points as requested; that's it. If a user \
 |specifies nonsense, like a combining character with no character to \
 |combine with, it's not the shell's job to disbelieve (maybe it will \
 |be concatenated later!).
 |
 |Also, I believe \nnn octal, which was in my original proposal,
 |SHOULD NOT be included. 
 |
 |Details below.
 |
 |--- David A. Wheeler
 |
 |=== DETAILS ===
 |
 |Supporting internationalization is important, but the shell
 |should be implementable in a small(ish) size.
 |The original proposal for $'...' did NOT include \u or \U at all,
 |as you can see here: https://austingroupbugs.net/view.php?id=249
 |
 |I don't think \u or \U are really *necessary* for international use.
 |If you want to include characters for
 |arbitrary language in an encoding that is currently in use... just \
 |USE them.
 |The \u... format is NOT as clear as using the actual characters, because
 |if you "just use the characters" then editors (etc.)
 |can display them as the actual characters.
 |What's way more important is supporting things like \n
 |(so you can finally set values terminating in newline) and
 |\xHH hex values (to generate escape sequences & other byte sequences).
 |Those are *not* easily seen in shell source code.
 |
 |Also, an aside: I now think \nnn octal, which was in my original proposal,
 |SHOULD NOT be included. The \nnn syntax is incompatible with bash's \
 |\0nnn syntax.
 |Requiring \0nnn syntax would be mysteriously different from C's \nnn syntax.
 |Generally people use hex (not octal) nowadays for identifying characters \
 |& bytes,
 |so let's just standardize \xHH. It's really the better thing to use anyway.
 |
 |I don't *object* to \u and \U as long as their implementation requirements
 |are not complicated. Obviously some shells support \u and \U, e.g., bash.

I like $'' because quoting and expansion are a vivid core of the
POSIX shell, and $'' can almost act as printf(1) -- it gets the job
done, and in a way that makes the construct, in comparison,
visually comprehensible.  Even more so if \$ could also be
embedded.  That is, i would rather go for more, not less; i think
yet another "mutilated", only partially usable quote mechanism
is not what i would call an improvement.
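
To sketch what i mean (bash/ksh93 syntax; the second line is
hypothetical, no current shell expands \$ inside $''):

  echo 'Don'\''t you worry!'" ${USER} "$'\u263A'   # today: a sequence of quote types
  echo $'Don\'t you worry! \${USER} \u263A'        # wished for: one $'' unit does it all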

 |HOWEVER: I think that *requiring* all shells to be able to translate between
 |encodings is excessive and unnecessary.
 |I don't think it's what current shells do.
 |Instead, simply require that shells at *least* support UTF-8.

But POSIX still allows multiple character sets, and therefore the
shell can run in such an environment?  I mean, it would help if one
could state a character set, as in RFC 2231 (say, something like
$.utf8'', where . is a placeholder), but that is not an option, of
course.

 |Trying to support *all* encodings in a shell is a potentially big
 |complication, especially in small-memory systems like TVs.
 |E.g., if someone is using Latin-1 (ISO-8859-1) as their locale,
 |I think it is *NOT* reasonable to expect the shell to convert
 |unicode to the other locale. Instead, call on a specialty program to \
 |do that!
 |
 |I did a quick test with bash, and it does *not* appear to
 |try to do any locale conversions. Instead, it appears to assume that
 |if you ask for \u you want the UTF-8 byte sequence.
 |E.g., small y, acute accent (ý)
 |in Latin-1 is decimal 253 and hex 0xfd (and HTML ).
 |Unsurprisingly, it's Unicode code point U+00fd. But when I set the
 |locale to ISO-8859-1, it generates its *UTF-8* encoding:
 |(LANG=ISO-8859-1 echo $'\u00fd' | od -c )

The other response gives the reasoning for that (it depends on your
locale, and existing shells do the conversion).

But you know, here lies the actual problem: for example bash uses
the ISO C / POSIX multibyte character interface (the
mbtowc(3)/wctomb(3) family) to convert \Uu if available, and only
uses iconv(3) if not.  And that is a problem.  It will not work out
for many languages of the world, since that MB interface is
inherently incapable of dealing with ISO 10646, as that, to
reiterate, does _not_ represent a 1:1 mapping of codepoint and
"character", but may require multiple codepoints to actually form
something that maps to a "character", where i define character as
something that can be seen on a terminal screen.
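
A tiny illustration of that distinction (bash/ksh93 $'' syntax,
UTF-8 locale assumed):

  printf '%s' $'u\u0308' | wc -c   # 3 bytes, 2 codepoints, 1 visible "character" (u with diaeresis)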

And because of this i was by then adding something to the
discussion in issue 249, because if you define this interface
anew, and add support for \Uu, then this 

Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-31 Thread David A. Wheeler
On Fri, 31 Jul 2020 16:51:56 + (UTC), shwaresyst  wrote:
> Please look at my former message.  It stands that \Uu is ISO
> 10646, and that does not represent characters but codepoints,
> multiple of which may be necessary to represent one real
> character, which then may be a valid character in the locale
> encoding.

I think in $'...' the \u and \U should be either:
(1) omitted, or
(2) only required to support UTF-8. Do NOT require translations of \u and \U 
to other locales. Also, if \u and \U are included, they should simply emit the 
code points as requested; that's it. If a user specifies nonsense, like a 
combining character with no character to combine with, it's not the shell's job 
to disbelieve (maybe it will be concatenated later!).

Also, I believe \nnn octal, which was in my original proposal,
SHOULD NOT be included. 

Details below.

--- David A. Wheeler

=== DETAILS ===

Supporting internationalization is important, but the shell
should be implementable in a small(ish) size.
The original proposal for $'...' did NOT include \u or \U at all,
as you can see here: https://austingroupbugs.net/view.php?id=249

I don't think \u or \U are really *necessary* for international use.
If you want to include characters for
arbitrary language in an encoding that is currently in use... just USE them.
The \u... format is NOT as clear as using the actual characters, because
if you "just use the characters" then editors (etc.)
can display them as the actual characters.
What's way more important is supporting things like \n
(so you can finally set values terminating in newline) and
\xHH hex values (to generate escape sequences & other byte sequences).
Those are *not* easily seen in shell source code.
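
For example (bash/ksh93 $'...' syntax; the variable names are just
illustrative):

  nl_value=$'line one\nline two\n'   # a value that really ends in a newline
  bold_on=$'\x1b[1m'                 # an ESC byte starting a terminal escape sequence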

Also, an aside: I now think \nnn octal, which was in my original proposal,
SHOULD NOT be included. The \nnn syntax is incompatible with bash's \0nnn 
syntax.
Requiring \0nnn syntax would be mysteriously different from C's \nnn syntax.
Generally people use hex (not octal) nowadays for identifying characters & 
bytes,
so let's just standardize \xHH. It's really the better thing to use anyway.

I don't *object* to \u and \U as long as their implementation requirements
are not complicated. Obviously some shells support \u and \U, e.g., bash.

HOWEVER: I think that *requiring* all shells to be able to translate between
encodings is excessive and unnecessary.
I don't think it's what current shells do.
Instead, simply require that shells at *least* support UTF-8.
Trying to support *all* encodings in a shell is a potentially big
complication, especially in small-memory systems like TVs.
E.g., if someone is using Latin-1 (ISO-8859-1) as their locale,
I think it is *NOT* reasonable to expect the shell to convert
unicode to the other locale. Instead, call on a specialty program to do that!

I did a quick test with bash, and it does *not* appear to
try to do any locale conversions. Instead, it appears to assume that
if you ask for \u you want the UTF-8 byte sequence.
E.g., small y, acute accent (ý)
in Latin-1 is decimal 253 and hex 0xfd (and HTML ).
Unsurprisingly, it's Unicode code point U+00fd. But when I set the
locale to ISO-8859-1, it generates its *UTF-8* encoding:
(LANG=ISO-8859-1 echo $'\u00fd' | od -c )
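
(For reference, a hedged restatement of the same test with the locale
in place before the shell parses the $'' word; ý is U+00FD, UTF-8 bytes
0xC3 0xBD, Latin-1 byte 0xFD:)

  LANG=ISO-8859-1 bash -c 'echo $'\''\u00fd'\'' | od -c'
  # the UTF-8 behaviour prints octal 303 275 (then \n);
  # a locale-converting shell would instead print 375 (then \n)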

What *is* reasonable to require, if \u and \U are supported?
I think POSIX should require at least support for UTF-8;
this is very widely used, and would provide a "minimum and useful floor"
without a complex implementation. Here are some options:
1. The standard could require that \u and \U *always* generate UTF-8.
All shells could easily implement/support that, and they could then
pass UTF-8 to some other program if it needs to be converted to
another locale. That would make supporting other locales easy,
as long as you're willing to call out to another program.
2. The standard could say it must be supported if UTF-8 is the encoding, and
say nothing otherwise. The problem is that \u and \U would then
only be sure to work in that case.
3. Like #2, but also generate UTF-8 if the locale is C or POSIX.
I like this #3 option, it's a useful compromise for common cases.
Then shells don't have to implement the universe of weird special cases.
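
A sketch of how option 1 would look in practice (the shell always emits
UTF-8, and iconv(1) handles any further conversion):

  printf '%s\n' $'\u00fd' | iconv -f UTF-8 -t ISO-8859-1   # shell emits 0xC3 0xBD; iconv yields the Latin-1 byte 0xFD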

To see what a current shell implementation does, I looked at bash's docs here:
https://www.gnu.org/software/bash/manual/bash.html
which say:
\uHHHH - the Unicode (ISO/IEC 10646) character whose value is the hexadecimal 
value HHHH (one to four hex digits) 
\UHHHHHHHH - the Unicode (ISO/IEC 10646) character whose value is the 
hexadecimal value HHHHHHHH (one to eight hex digits) 
That text says "character", but they don't mean stand-alone characters.
For example, U+032A is a combining character, and this works fine:
echo $'y\u032a'
(with LANG=en_US.UTF-8).

The shell should not try to verify that some sequence is
a valid sequence of characters in some particular encoding;
there's no reason to believe they would be.
In practice most POSIX systems allow filenames to 

Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-31 Thread shwaresyst

Yes, right-assoc does exist too, and the standard supports the DBCS varieties 
in charmaps as an election, even. There are also medials, which have left and 
right associativity, and a number of other types, depending on primary script 
family. I agree with your last point that such sequence conversions are 
plausible; it's just that the "how" has no portable specification currently, it 
is left as unspecified.
On Friday, July 31, 2020 Steffen Nurpmeso  wrote:
shwaresyst wrote in
 <1371185781.9853799.1596158030...@mail.yahoo.com>:
 |It is not "some sensible \u sequences" alone. First off, there's little \
 |agreement on what constitutes 'sensible'. Just the headache of the \
 |U+0300 diacritics adds to XBD6 significantly, if they're to be supported, \
 |as one example. The 'sensible' present solution is to not support them \
 |at all; others will argue the 'sensible' thing is to support them because \
 |Unicode does include these code points. The headache stems from it \
 |is not simply arbitrarily saying let's have the utility support these \
 |in $'', it's ensuring there are interfaces for the utilities to be \
 |written in that understand left-associative combining sequences, and \

i think right-associative also exists.  I have not worked with
this stuff for a long time.

 |these interfaces are portable because requirements in XBD add that support.

Please look at my former message.  It stands that \Uu is ISO
10646, and that does not represent characters but codepoints,
multiple of which may be necessary to represent one real
character, which then may be a valid character in the locale
encoding.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter          he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-31 Thread Steffen Nurpmeso
shwaresyst wrote in
 <1371185781.9853799.1596158030...@mail.yahoo.com>:
 |It is not "some sensible \u sequences" alone. First off, there's little \
 |agreement on what constitutes 'sensible'. Just the headache of the \
 |U+0300 diacritics adds to XBD6 significantly, if they're to be supported, \
 |as one example. The 'sensible' present solution is to not support them \
 |at all; others will argue the 'sensible' thing is to support them because \
 |Unicode does include these code points. The headache stems from it \
 |is not simply arbitrarily saying let's have the utility support these \
 |in $'', it's ensuring there are interfaces for the utilities to be \
 |written in that understand left-associative combining sequences, and \

i think right-associative also exists.  I have not worked with
this stuff for a long time.

 |these interfaces are portable because requirements in XBD add that support.

Please look at my former message.  It stands that \Uu is ISO
10646, and that does not represent characters but codepoints,
multiple of which may be necessary to represent one real
character, which then may be a valid character in the locale
encoding.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-31 Thread Steffen Nurpmeso
Hello.

Robert Elz wrote in
 <12526.1596151...@jinx.noi.kre.to>:
 |Date:Thu, 30 Jul 2020 15:53:53 +0200
 |From:Steffen Nurpmeso 
 |Message-ID:  <20200730135353.qwslp%stef...@sdaoden.eu>
 |
 || The problem being that what is in the wild does not work out for
 || many languages.
 |
 |I admit to not knowing a lot of the internationalisation issues,
 |or of unicode, but I don't understand this at all.
 |
 |The quoting mechanisms in the shell provide a means to create
 |specific bit patterns to assign to variables, pass as parameters
 |to programs, etc.   I don't see that the mechanism by which they're
 |encoded in the sh language should matter all that much, the same
 |thing could be read from a file instead ( var=$(cat file) ) in which
 |case the shell spec has no control over the bit patterns at all.
 |
 |Of course the quoting mechanisms make a difference to the ease of
 |use for the sh programmer, but that's an entirely different issue.

No, no.  Ah, i had to reread the bug report now.  But i am being
misunderstood.

 || The in-use shell quote pattern consisting of small, isolated parts
 || which depend on which kind of escaping and expanding is necessary
 || just does not work out for many languages.
 |
 |Can you give an example of something which cannot be done (assuming
 |$'' as currently intended to be specified)?   Note: not an example of
 |someone using the mechanisms to do the wrong thing - there are zillions
 |of ways to write bad code, but an example of something which cannot be
 |done correctly as specified.   Then we'll see if that really matters.
 |
 ||   ? echo Don"'"t you worry$'\x21' The sun shines on us. $'\u263A'
 ||
 || The latter is what i mean.  There are many languages on this world
 || where these \u expansions do not work out that way, but where the
 || "entire sentence must be interpreted as a unity" in order to get
 || the iconv(3) conversion to nl_langinfo(CODESET) correctly, aka
 || the way it is _desired_.
 |
 |Surely this depends upon how the shell works - if the shell is attempting
 |to convert just the \u escape into some other codeset, I can see your \
 |point,
 |but it doesn't need to work like that - it can work internally in 10646
 |code points (whether encoded in 16 or 32 bit values, or as UTF-8), and
 |only convert to the desired charset when actually used (that is, when
 |about to run "echo") at which point the entire string is available.

Yes, it could.  This would solve the issue, the only remaining
point being that the \u escape can be used to specify Unicode
codepoints, which will then be converted to the locale character
set via iconv(3), and that this may yield different results
depending on the context that has to be processed.  As a primitive
example of a western language i know that

  u$'\u{DIAERESIS}'

cannot be converted to LATIN1, even though $'u\u{DIAERESIS}' could.
In theory.  In practice only if iconv(3) applied a normalization
step for Unicode input, i.e. took care of "combining" marks.  The
result would then be U+00FC (LATIN SMALL LETTER U WITH DIAERESIS).
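
A hedged illustration with iconv(1) (GNU iconv assumed; the octal
escapes spell the UTF-8 bytes):

  printf 'u\314\210' | iconv -f UTF-8 -t LATIN1   # u + U+0308: likely fails, no normalization is applied
  printf '\303\274'  | iconv -f UTF-8 -t LATIN1   # precomposed U+00FC: converts to the single byte 0xFC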

This is a primitive example.  There are languages which have
complex rules, and where multiple Unicode codepoints, aka multiple
adjacent \u sequences, form a "grapheme".  This is because Unicode
does not provide a codepoint for each and every character of all
languages which are supported, but it uses combining marks and
other categories of codepoints, which glue together to form the
actual character.

For example i know an Australian who lives in Southeast Asia (that
happens more often than one would think), now in Malaysia but also
Vietnam and Thailand, whatever, and he said

  In Thai vowels can be in front, behind, below, above, in front and
  behind, in front and above, in front and behind and above. And
  also have tonal markers above. So can be triple stacked.

Such things are very often represented via combined codepoints in
Unicode.  When he said that he was at odds with an ncurses-based
Unix terminal application, by the way.
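
As a concrete sketch (bash/ksh93 syntax, UTF-8 locale assumed; the
three codepoints spell the usual Thai example word for "water"):

  printf '%s\n' $'\u0e19\u0e49\u0e33'   # base consonant + tone mark above + vowel sign: one visible syllable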

 |In any case, if the user has specified a specific unicode code point,
 |shouldn't that always be what is generated, regardless of whether it
 |makes sense or not?
 |
 || And for that it would be tremendous if $'' would be defined so
 || that it can be used as the sole quoting mechanism,
 |
 |No thanks.   Partly because $'' is already implemented (widely)
 |and used (perhaps slightly less yet) - so that ship has sailed.
 |
 |I believe I've seen $" ... " used that way somewhere though (don't
 |recall where) and I believe it is a mistake.

That $"" is used by bash for translation aka gettext(3) purposes
i think.

 |As soon as you have multiple different types of expansions that
 |can occur, there are problems with which one gets priority, which
 |is performed first.   So, assuming there is a $"..." which works
 |as you desire, what happens with
 |
 | $"${VAR+foo\x7Dbar}"
 |
 |Do we get foo}bar or foobar} ?   (assuming VAR was set of course).

Well, for one i do not understand 

Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-30 Thread shwaresyst

It is not "some sensible \u sequences" alone. First off, there's little 
agreement on what constitutes 'sensible'. Just the headache of the U300 
diacritics adds to XBD6 significantly, if they're to be supported, as one 
example. The 'sensible' present solution is to not support them at all; others 
will argue the 'sensible' thing is to support them because Unocode does include 
these code points. The headache stems from it is not simply arbitrarily saying 
let's have the utility support these in $'', it's ensuring there are interfaces 
for the utilities to be written in that understand left-associative combining 
sequences, and these interfaces are portable because requirements in XBD add 
that support.
On Thursday, July 30, 2020 Steffen Nurpmeso  wrote:
shwaresyst wrote in
 <1127836834.9524758.1596121054...@mail.yahoo.com>:
 |Yes, the additions necessary still for even limited Unicode support \
 |above the broken bandaids C11+ provide are one of those issues. Where \
 |Unicode is incompatible with POSIX, and is therefore (by design) broken \
 |too needs addressing also. The white papers detailing most of these \
 |changes have yet to be written, or published if some have been.

Hmm, the ISO C reference is of course true.  But then this is
about Unix/POSIX shells, and then adding some sensible \u
sequences and defining their conversion to locale charset can only
be an improvement, i think.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter          he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-30 Thread Chet Ramey
On 7/30/20 7:29 PM, Robert Elz wrote:

>   | And for that it would be tremendous if $'' would be defined so
>   | that it can be used as the sole quoting mechanism,
> 
> No thanks.   Partly because $'' is already implemented (widely)
> and used (perhaps slightly less yet) - so that ship has sailed.
> 
> I believe I've seen $" ... " used that way somewhere though (don't
> recall where) and I believe it is a mistake.

None of the existing implementations of $"..." use it in that way.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/



Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-30 Thread Robert Elz
Date:Thu, 30 Jul 2020 15:53:53 +0200
From:Steffen Nurpmeso 
Message-ID:  <20200730135353.qwslp%stef...@sdaoden.eu>


  | The problem being that what is in the wild does not work out for
  | many languages.

I admit to not knowing a lot of the internationalisation issues,
or of unicode, but I don't understand this at all.

The quoting mechanisms in the shell provide a means to create
specific bit patterns to assign to variables, pass as parameters
to programs, etc.   I don't see that the mechanism by which they're
encoded in the sh language should matter all that much, the same
thing could be read from a file instead ( var=$(cat file) ) in which
case the shell spec has no control over the bit patterns at all.

Of course the quoting mechanisms make a difference to the ease of
use for the sh programmer, but that's an entirely different issue.

  | The in-use shell quote pattern consisting of small, isolated parts
  | which depend on which kind of escaping and expanding is necessary
  | just does not work out for many languages.

Can you give an example of something which cannot be done (assuming
$'' as currently intended to be specified)?   Note: not an example of
someone using the mechanisms to do the wrong thing - there are zillions
of ways to write bad code, but an example of something which cannot be
done correctly as specified.   Then we'll see if that really matters.

  |   ? echo Don"'"t you worry$'\x21' The sun shines on us. $'\u263A'
  |
  | The latter is what i mean.  There are many languages on this world
  | where these \u expansions do not work out that way, but where the
  | "entire sentence must be interpreted as a unity" in order to get
  | the iconv(3) conversion to nl_langinfo(CODESET) correctly, aka
  | the way it is _desired_.

Surely this depends upon how the shell works - if the shell is attempting
to convert just the \u escape into some other codeset, I can see your point,
but it doesn't need to work like that - it can work internally in 10646
code points (whether encoded in 16 or 32 bit values, or as UTF-8), and
only convert to the desired charset when actually used (that is, when
about to run "echo") at which point the entire string is available.

In any case, if the user has specified a specific unicode code point,
shouldn't that always be what is generated, regardless of whether it
makes sense or not?

  | And for that it would be tremendous if $'' would be defined so
  | that it can be used as the sole quoting mechanism,

No thanks.   Partly because $'' is already implemented (widely)
and used (perhaps slightly less yet) - so that ship has sailed.

I believe I've seen $" ... " used that way somewhere though (don't
recall where) and I believe it is a mistake.

As soon as you have multiple different types of expansions that
can occur, there are problems with which one gets priority, which
is performed first.   So, assuming there is a $"..." which works
as you desire, what happens with

$"${VAR+foo\x7Dbar}"

Do we get foo}bar or foobar} ?   (assuming VAR was set of course).

Whichever way you pick, there will be arguments for doing it
the other way, in some other case.   This stuff simply becomes
a mess.   Please, don't go there.   If we wanted to add C type
encodings along with the others, we'd need to do it in a way that
is consistent with the other expansions, perhaps using something
like $[x7D] or $[u263A] or $[n] (but no, this is not a serious
suggestion).

And I cannot fathom how this in any way overcomes your earlier
objection, quoted strings in sh are not units, they're simply
pieces of some longer word (or can be) - your Don"'"t example
above (and the worry$'\x21') are both examples of that.

kre




Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-30 Thread Steffen Nurpmeso
shwaresyst wrote in
 <1127836834.9524758.1596121054...@mail.yahoo.com>:
 |Yes, the additions necessary still for even limited Unicode support \
 |above the broken bandaids C11+ provide are one of those issues. Where \
 |Unicode is incompatible with POSIX, and is therefore (by design) broken \
 |too needs addressing also. The white papers detailing most of these \
 |changes have yet to be written, or published if some have been.

Hmm, the ISO C reference is of course true.  But then this is
about Unix/POSIX shells, and then adding some sensible \u
sequences and defining their conversion to locale charset can only
be an improvement, i think.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-30 Thread Steffen Nurpmeso
David A. Wheeler wrote in
 :
 |Steffen Nurpmeso  wrote:
 |>> And for that it would be tremendous if $'' would be defined so
 |>> that it can be used as the sole quoting mechanism, and that would
 |>> then also include expansion of $VAR (i use \$VAR or \${VAR} in my
 |>> mailer).  But to know exactly how problematic splitting of quotes
 |>> is for many languages of the world, including right-to-left
 |>> direction and shift state changes etc., and changing of meaning as
 |>> such if the sentence cannot be interpreted as a unity, a real
 |>> expert had to be asked.  Anyhow, the Unicode effort mandates
 |>> processing of entire strings and denotes isolated treatment as
 |>> a complete error.
 |
 |I think eliminating old quoting mechanisms would be a mistake.

That is an unfortunate misunderstanding, sorry.  I do not want to
obsolete them from the standard side.  All i would like to see is
that $'' gets the few tweaks it needs to cover the possibilities
of the other quoting mechanisms, and in effect this is only ""
($VAR and `` thereof).  And this is because (a) that way the
entire string expansion can be fed into iconv(3), and (b) because
i think for users, and for program/script source audits, it is
much easier to grasp than having to sequence quote types, for
example to embed a $VAR expansion into a string.
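
What i mean by "having to sequence quote types", sketched in
bash/ksh93 syntax (the variable name is only illustrative):

  # three quote segments just to mix a tab escape, an expansion and
  # a trailing newline in one value
  msg=$'Subject:\t'"${subject}"$'\n'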

 |On Thu, 30 Jul 2020 16:09:56 +0200, Joerg Schilling  wrote:
 |Even if it would become part of the standard today, you still would need
 |> to wait some years until all implementations take it up.
 |
 |That's true for almost all standards changes.
 |However, many shells *already* implement $'...'.
 |It's also relatively trivial to implement, and it provides
 |very useful capabilities (such as the ability to easily assign terminating \
 |newlines).
 |
 |I'd still like to see the addition of $'...'.

Me too, i am all in favour of $'', and i hope it is not because of
me that issue 249 is still open.  It is anyway implemented the way
it is as of today!

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-30 Thread Steffen Nurpmeso
Joerg Schilling wrote in
 <5f22d4b4.8vf9+w1hbegjrn1d%joerg.schill...@fokus.fraunhofer.de>:
 |Steffen Nurpmeso  wrote:
 |
 |> And for that it would be tremendous if $'' would be defined so
 |> that it can be used as the sole quoting mechanism, and that would
 |> then also include expansion of $VAR (i use \$VAR or \${VAR} in my
 |> mailer).  But to know exactly how problematic splitting of quotes
 |> is for many languages of the world, including right-to-left
 |> direction and shift state changes etc., and changing of meaning as
 |> such if the sentence cannot be interpreted as a unity, a real
 |> expert had to be asked.  Anyhow, the Unicode effort mandates
 |> processing of entire strings and denotes isolated treatment as
 |> a complete error.
 |
 |Even if it would become part of the standard today, you still would need
 |to wait some years until all implementations take it up.

I must admit the last time i looked into an iconv(3) implementation
(GNU) it was not like that either, it was a plain "1:1" conversion.
(I hope i am not lying now, it is what i remember.)

But even if it is for the future: if you write u$'\u0308' nothing
can happen, whereas if you write $'u\u0308' then an iconv(3) which
does its job really well could recognize the COMBINING DIAERESIS
and create the ü you want in your LATIN1 environment.  (This is
a simple example, but \u is meant to embed Unicode, and then
graphemes come into play; i mean, even ncurses has been capable of
properly dealing with this stuff for many years, and this is
something yet to be standardized.)
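
Spelled out (bash syntax; only the second form would hand such
a normalizing iconv(3) both codepoints as one unit):

  a=u$'\u0308'    # the base letter u never enters the $'' string
  b=$'u\u0308'    # base letter plus COMBINING DIAERESIS in one $'' unit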

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-30 Thread David A. Wheeler
Steffen Nurpmeso  wrote:
> > And for that it would be tremendous if $'' would be defined so
> > that it can be used as the sole quoting mechanism, and that would
> > then also include expansion of $VAR (i use \$VAR or \${VAR} in my
> > mailer).  But to know exactly how problematic splitting of quotes
> > is for many languages of the world, including right-to-left
> > direction and shift state changes etc., and changing of meaning as
> > such if the sentence cannot be interpreted as a unity, a real
> > expert had to be asked.  Anyhow, the Unicode effort mandates
> > processing of entire strings and denotes isolated treatment as
> > a complete error.

I think eliminating old quoting mechanisms would be a mistake.

On Thu, 30 Jul 2020 16:09:56 +0200, Joerg Schilling 
 wrote:
> Even if it would become part of the standard today, you still would need
> to wait some years until all implementations take it up.

That's true for almost all standards changes.
However, many shells *already* implement $'...'.
It's also relatively trivial to implement, and it provides
very useful capabilities (such as the ability to easily assign terminating 
newlines).

I'd still like to see the addition of $'...'.

--- David A. Wheeler



Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-30 Thread shwaresyst

Yes, the additions necessary still for even limited Unicode support above the 
broken bandaids C11+ provide are one of those issues. Where Unicode is 
incompatible with POSIX, and is therefore (by design) broken too needs 
addressing also. The white papers detailing most of these changes have yet to 
be written, or published if some have been.
On Thursday, July 30, 2020 Steffen Nurpmeso  wrote:
shwaresyst wrote in
 <311169368.9432836.1596108598...@mail.yahoo.com>:
 |On Thursday, July 30, 2020 Geoff Clare  wrote:
 |Robert Elz  wrote, on 29 Jul 2020:
 |>
 |> Speaking of which, what is the current holdup with resolving
 |> whichever bug it is (I hate searching in mantis, so I won't
 |> try here) which specifies $'...' ?  Perhaps whatever the
 |> problem was (before my time) with the specification of that
 |> is no longer a problem?
 |
 |It's bug 249. It was reopened in Oct 2015 and several notes were
 |added to the bug after that, starting with 
 |
 |https://austingroupbugs.net/view.php?id=249#c2893
 |
 |My guess is the conference calls postponed returning to it because
 |there was ongoing discussion, but by the time the discussion ended
 |it had "gone off the radar".
 ...
 |Also, as something new, its inclusion is part of a later draft of Issue \
 |8. Additional issues it depends on need to be addressed first, specified \
 |fully, and incorporated. This is more why it went on the back burner, \
 |that I recall. Various other bugs are in similar state; the prerequisites \
 |to finish specifying them so they can be considered portable aren't \
 |done yet either.

The problem being that what is in the wild does not work out for
many languages.  The in-use shell quote pattern consisting of
small, isolated parts which depend on which kind of escaping and
expanding is necessary just does not work out for many languages.
Period.

I (the mailer i maintain, using POSIX-incompatible sh(1)ell-style
command line input) for example claim

  ? echo 'Quotes '${HOME}' and 'tokens" differ!"# no comment
  ? echo Quotes ${HOME} and tokens differ! # comment
  ? echo Don"'"t you worry$'\x21' The sun shines on us. $'\u263A'

The latter is what i mean.  There are many languages on this world
where these \u expansions do not work out that way, but where the
"entire sentence must be interpreted as a unity" in order to get
the iconv(3) conversion to nl_langinfo(CODESET) correctly, aka
the way it is _desired_.  Of course you can move it all to the
twilight zone of "undefined behaviour", but if you do not, then
quoting must extend to the largest possible extent, and be
interpreted as a unity.

And for that it would be tremendous if $'' would be defined so
that it can be used as the sole quoting mechanism, and that would
then also include expansion of $VAR (i use \$VAR or \${VAR} in my
mailer).  But to know exactly how problematic splitting of quotes
is for many languages of the world, including right-to-left
direction and shift state changes etc., and changing of meaning as
such if the sentence cannot be interpreted as a unity, a real
expert had to be asked.  Anyhow, the Unicode effort mandates
processing of entire strings and denotes isolated treatment as
a complete error.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter          he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-30 Thread Joerg Schilling
Steffen Nurpmeso  wrote:

> And for that it would be tremendous if $'' would be defined so
> that it can be used as the sole quoting mechanism, and that would
> then also include expansion of $VAR (i use \$VAR or \${VAR} in my
> mailer).  But to know exactly how problematic splitting of quotes
> is for many languages of the world, including right-to-left
> direction and shift state changes etc., and changing of meaning as
> such if the sentence cannot be interpreted as a unity, a real
> expert had to be asked.  Anyhow, the Unicode effort mandates
> processing of entire strings and denotes isolated treatment as
> a complete error.

Even if it would become part of the standard today, you still would need
to wait some years until all implementations take it up.

Jörg

-- 
 EMail: jo...@schily.net (home)  Jörg Schilling  D-13353 Berlin
 joerg.schill...@fokus.fraunhofer.de (work)  Blog: http://schily.blogspot.com/
 URL: http://cdrecord.org/private/  http://sf.net/projects/schilytools/files/



Re: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-30 Thread Steffen Nurpmeso
shwaresyst wrote in
 <311169368.9432836.1596108598...@mail.yahoo.com>:
 |On Thursday, July 30, 2020 Geoff Clare  wrote:
 |Robert Elz  wrote, on 29 Jul 2020:
 |>
 |> Speaking of which, what is the current holdup with resolving
 |> whichever bug it is (I hate searching in mantis, so I won't
 |> try here) which specifies $'...' ?  Perhaps whatever the
 |> problem was (before my time) with the specification of that
 |> is no longer a problem?
 |
 |It's bug 249. It was reopened in Oct 2015 and several notes were
 |added to the bug after that, starting with 
 |
 |https://austingroupbugs.net/view.php?id=249#c2893
 |
 |My guess is the conference calls postponed returning to it because
 |there was ongoing discussion, but by the time the discussion ended
 |it had "gone off the radar".
 ...
 |Also, as something new, its inclusion is part of a later draft of Issue \
 |8. Additional issues it depends on need to be addressed first, specified \
 |fully, and incorporated. This is more why it went on the back burner, \
 |that I recall. Various other bugs are in similar state; the prerequisites \
 |to finish specifying them so they can be considered portable aren't \
 |done yet either.

The problem being that what is in the wild does not work out for
many languages.  The in-use shell quote pattern consisting of
small, isolated parts which depend on which kind of escaping and
expanding is necessary just does not work out for many languages.
Period.

I (the mailer i maintain, using POSIX-incompatible sh(1)ell-style
command line input) for example claim

  ? echo 'Quotes '${HOME}' and 'tokens" differ!"# no comment
  ? echo Quotes ${HOME} and tokens differ! # comment
  ? echo Don"'"t you worry$'\x21' The sun shines on us. $'\u263A'

The latter is what i mean.  There are many languages on this world
where these \u expansions do not work out that way, but where the
"entire sentence must be interpreted as a unity" in order to get
the iconv(3) conversion to nl_langinfo(CODESET) correctly, aka
the way it is _desired_.  Of course you can move it all to the
twilight zone of "undefined behaviour", but if you do not, then
quoting must extend to the largest possible extent, and be
interpreted as a unity.

And for that it would be tremendous if $'' would be defined so
that it can be used as the sole quoting mechanism, and that would
then also include expansion of $VAR (i use \$VAR or \${VAR} in my
mailer).  But to know exactly how problematic splitting of quotes
is for many languages of the world, including right-to-left
direction and shift state changes etc., and changing of meaning as
such if the sentence cannot be interpreted as a unity, a real
expert had to be asked.  Anyhow, the Unicode effort mandates
processing of entire strings and denotes isolated treatment as
a complete error.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



RE: Status of $'...' addition (was: ksh93 job control behaviour)

2020-07-30 Thread shwaresyst

Also, as something new, its inclusion is part of a later draft of Issue 8. 
Additional issues it depends on need to be addressed first, specified fully, 
and incorporated. This is more why it went on the back burner, that I recall. 
Various other bugs are in similar state; the prerequisites to finish 
specifying them so they can be considered portable aren't done yet either.
On Thursday, July 30, 2020 Geoff Clare  wrote:
Robert Elz  wrote, on 29 Jul 2020:
>
> Speaking of which, what is the current holdup with resolving
> whichever bug it is (I hate searching in mantis, so I won't
> try here) which specifies $'...' ?  Perhaps whatever the
> problem was (before my time) with the specification of that
> is no longer a problem?

It's bug 249. It was reopened in Oct 2015 and several notes were
added to the bug after that, starting with 

https://austingroupbugs.net/view.php?id=249#c2893

My guess is the conference calls postponed returning to it because
there was ongoing discussion, but by the time the discussion ended
it had "gone off the radar".

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England