mandoc(1) -Tutf8 misrenders accents

Ingo Schwarze Fri, 28 Feb 2014 15:29:33 -0800

Hi Ted and Stuart,

Ted Unangst wrote on Mon, Feb 24, 2014 at 03:11:16PM -0500:
> On Mon, Feb 24, 2014 at 19:48, Stuart Henderson wrote:
>> On 2014/02/24 10:46, Ted Unangst wrote:


>>> CVSROOT:    /cvs
>>> Module name:        src
>>> Changes by: t...@cvs.openbsd.org    2014/02/24 10:46:37
>>> 
>>> Modified files:
>>>     etc            : man.conf 
>>> 
>>> Log message:
>>> default to locale awareness. safer than changing internal mandoc defaults.

>> As an example, each of the following commands will input the program
>> ??slithy_toves.?? and write its indented text to ??slithy_toves.out??:

> As near as I can tell, this is the "correct" output.

No, it is not.  That's indeed a bug in mandoc(1).

> Here's the source.
> 
> As an example, each of the following commands will input the program
> \`slithy_toves.c\' and write its indented text to
> \`slithy_toves.out\':
> 
> This is, I believe, asking for combining characters. i.e., it is the
> markup one would use to create an accented c, which plain ascii output
> only approximates as c'. The leading ` may or may not appear depending
> on whether xterm or whatever decides to combine it with the space.

No, this is not asking for combining characters.

Both the Ossanna/Kernighan/Ritter troff manual and the GNU troff
documentation document \' as equivalent to \(aa and \` as equivalent
to \(ga, and all of these are supposed to be rendered stand-alone.
In particular, groff doesn't produce combining Unicode characters
in this case.
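
To make the difference concrete, here is a minimal Python sketch, not anything taken from mandoc or groff. It assumes NFC normalization is a fair stand-in for how a combining mark attaches to the character before it. Code point 769 (U+0301) is what mandoc currently emits for \', and 180 (U+00B4) is the stand-alone acute accent.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # code point 769, current mandoc mapping for \'
SPACING_ACUTE = "\u00b4"    # code point 180, stand-alone acute accent


def render(base, accent):
    """NFC-normalize base + accent, approximating how a combining
    mark attaches to the preceding character."""
    return unicodedata.normalize("NFC", base + accent)


# "slithy_toves.c" followed by \' ends in "c" plus the accent.
combined = render("c", COMBINING_ACUTE)
standalone = render("c", SPACING_ACUTE)

print(len(combined), [unicodedata.name(ch) for ch in combined])
print(len(standalone), [unicodedata.name(ch) for ch in standalone])
```

With the combining code point, the trailing c of the file name fuses with the accent into one accented letter; with the spacing code point, the c and the closing quote remain two separate characters.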

If you want combining Unicode characters, there are several ways,
none of them very portable:

 1. base character + combining character using U escape
 2. base character + combining character using C escape
 3. base character + combining character using name-u escape
 4. name-combine escape using numeric names
 5. name-combine escape using mnemonic names
 6. name escape with leading accent
 7. name escape with trailing accent

For example, the French "e accent aigu" can be expressed as follows.
The various forms are rendered correctly by the following programs:

    input          groff  mandoc  Heirloom  plan9
 1. e\U'0301'      no     no      yes       no
 2. e\C'u0301'     yes    yes     unlikely  no
 3. e\[u0301]      yes    yes     unlikely  no
 4. \[u0065_0301]  yes    no      unlikely  no
 5. \[e aa]        yes    no      unlikely  no
 6. \('e           yes    yes     unlikely  no
 7. \(e'           no     no      unlikely  yes

Where the invocations are:

  groff -P-c -mandoc -Tutf8 ...
  mandoc -Tutf8 ...
  9 nroff -Tutf -man ...
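
If somebody wants to repeat the comparison, a test page can be generated along the following lines. This is only a sketch: the page title ACCENTS and the use of .br between the lines are my assumptions for illustration, and the seven input forms are copied from the table above.

```python
# The seven ways of writing "e accent aigu", in table order.
FORMS = [
    r"e\U'0301'",
    r"e\C'u0301'",
    r"e\[u0301]",
    r"\[u0065_0301]",
    r"\[e aa]",
    r"\('e",
    r"\(e'",
]


def build_test_page(forms):
    """Return man(7) source with one numbered line per input form."""
    lines = [".TH ACCENTS 1", ".SH DESCRIPTION"]  # hypothetical page title
    for number, form in enumerate(forms, start=1):
        lines.append(f"{number}. {form}")
        lines.append(".br")  # force a line break after each form
    return "\n".join(lines) + "\n"


page = build_test_page(FORMS)
print(page, end="")
```

Redirect the output to a file and pass that file to each of the invocations listed above.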

I don't have Heirloom troff set up for testing right now;
if somebody wants to build a port, you are welcome to.

> I've been running the diff for a while and didn't notice anything
> unusual in our manpages because we use the correct markup.

You are right insofar as the example above shouldn't use
accents at all, but quotes, or even better a .Pa macro.

That said, here is a patch to mandoc that I intend to commit
after unlock.  Now is your rare chance to OK a mandoc patch,
or raise concerns before commit.

Yours,
  Ingo


Index: chars.in
===================================================================
RCS file: /cvs/src/usr.bin/mandoc/chars.in,v
retrieving revision 1.20
diff -u -p -r1.20 chars.in
--- chars.in    22 Jan 2014 20:58:35 -0000      1.20
+++ chars.in    28 Feb 2014 16:46:26 -0000
@@ -49,21 +49,21 @@
 CHAR("}",                      "",             0)
 
 /* Accents. */
-CHAR("a\"",                    "\"",           779)
+CHAR("a\"",                    "\"",           733)
 CHAR("a-",                     "-",            175)
 CHAR("a.",                     ".",            729)
-CHAR("a^",                     "^",            770)
-CHAR("\'",                     "\'",           769)
-CHAR("aa",                     "\'",           769)
-CHAR("ga",                     "`",            768)
-CHAR("`",                      "`",            768)
-CHAR("ab",                     "`",            774)
-CHAR("ac",                     ",",            807)
-CHAR("ad",                     "\"",           776)
+CHAR("a^",                     "^",            94)
+CHAR("\'",                     "\'",           180)
+CHAR("aa",                     "\'",           180)
+CHAR("ga",                     "`",            96)
+CHAR("`",                      "`",            96)
+CHAR("ab",                     "`",            728)
+CHAR("ac",                     ",",            184)
+CHAR("ad",                     "\"",           168)
 CHAR("ah",                     "v",            711)
 CHAR("ao",                     "o",            730)
-CHAR("a~",                     "~",            771)
-CHAR("ho",                     ",",            808)
+CHAR("a~",                     "~",            126)
+CHAR("ho",                     ",",            731)
 CHAR("ha",                     "^",            94)
 CHAR("ti",                     "~",            126)
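
The intent of the patch can be checked mechanically. The following Python sketch is not part of the patch; it copies the old and new code points from the changed lines of the diff and asks the Unicode database whether each one is a combining character. The PATCH table and the is_combining() helper are names made up for this illustration.

```python
import unicodedata

# (escape name, old code point, new code point), copied from the diff.
PATCH = [
    ('a"', 779, 733),
    ("a^", 770, 94),
    ("'", 769, 180),
    ("aa", 769, 180),
    ("ga", 768, 96),
    ("`", 768, 96),
    ("ab", 774, 728),
    ("ac", 807, 184),
    ("ad", 776, 168),
    ("a~", 771, 126),
    ("ho", 808, 731),
]


def is_combining(codepoint):
    """True if the code point has a nonzero canonical combining class."""
    return unicodedata.combining(chr(codepoint)) != 0


for name, old, new in PATCH:
    print(
        f"{name:3} U+{old:04X} {unicodedata.name(chr(old))}"
        f" -> U+{new:04X} {unicodedata.name(chr(new))}"
    )

old_all_combining = all(is_combining(old) for _, old, _ in PATCH)
new_none_combining = not any(is_combining(new) for _, _, new in PATCH)
```

Every code point removed by the patch is a combining mark, and every replacement is a spacing character that renders stand-alone.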
