Bug#549233: docbook-to-man: Does not accept (some) (unicode) characters
On Thu, Feb 27, 2020 at 03:48:44PM +0100, Agustin Martin wrote:
>
> I recently tried to play with linuxdoc and utf-8 documents and ran into the
> same problem,
>
> onsgmls: ... 01.precmdout:1559:71:E: non SGML character number 141
>
> This time I was lucky and a web search pointed me to
> https://bugzilla.redhat.com/show_bug.cgi?id=66179. After that suggestion,
>
> SP_CHARSET_FIXED=yes SP_ENCODING=xml sgml2html FAQ-CervanTeX-utf8.sgml

Better, for utf-8:

$ SP_CHARSET_FIXED=yes SP_ENCODING=utf-8 sgml2html FAQ-CervanTeX-utf8.sgml

--
Agustin
Bug#549233: docbook-to-man: Does not accept (some) (unicode) characters
On Sat, Aug 25, 2018 at 10:02:27AM +0200, Helge Kreutzmann wrote:
> reopen 549233
> found 549233 1:2.0.0-42
> severity 549233 minor
> thanks
>
> Hello Chris,
> On Mon, Aug 20, 2018 at 10:27:11AM +, Debian Bug Tracking System wrote:
> > This is an automatic notification regarding your Bug report
> > which was filed against the docbook-to-man package:
> >
> > #549233: docbook-to-man: Does not accept (some) (unicode) characters
> >
> > > It appears that docbook-to-man is not UTF-8 ready. If you compile the
> > > attached man page "as is" then you'll get the following error:
> > > /usr/bin/nsgmls:demo.man.sgml:60:6:E: non SGML character number 156
> > > /usr/bin/nsgmls:demo.man.sgml:60:6: open elements: REFENTRY REFSECT1[1]
> > > PARA[1] (#PCDATA[1])
> > > /usr/bin/nsgmls:demo.man.sgml:62:9:E: non SGML character number 159
> > > /usr/bin/nsgmls:demo.man.sgml:62:9: open elements: REFENTRY REFSECT1[1]
> > > PARA[1] (#PCDATA[1])
> >
> > This is no longer reproducible; so closing :)
>
> Well, in my environment (current testing) it is:
> helge@samd:~/download$ recode latin1..utf8 demo.man.sgml
> helge@samd:~/download$ file *.sgml
> demo.man.sgml: HTML document, UTF-8 Unicode text
> helge@samd:~/download$ docbook-to-man demo.man.sgml > demo.1
> /usr/bin/nsgmls:demo.man.sgml:60:6:E: non SGML character number 156
> /usr/bin/nsgmls:demo.man.sgml:60:6: open elements: REFENTRY REFSECT1[1]
> PARA[1] (#PCDATA[1])
> /usr/bin/nsgmls:demo.man.sgml:62:9:E: non SGML character number 159
> /usr/bin/nsgmls:demo.man.sgml:62:9: open elements: REFENTRY REFSECT1[1]
> PARA[1] (#PCDATA[1])
>
> The same error happens with the file from Paul. (I did not see his e-mail
> earlier, because he did not CC me and addressed only the bug) and the
> output is the same for both.

Hi,

I recently tried to play with linuxdoc and utf-8 documents and ran into the
same problem,

onsgmls: ... 01.precmdout:1559:71:E: non SGML character number 141

This time I was lucky and a web search pointed me to
https://bugzilla.redhat.com/show_bug.cgi?id=66179. After that suggestion,

SP_CHARSET_FIXED=yes SP_ENCODING=xml sgml2html FAQ-CervanTeX-utf8.sgml

made those messages disappear with opensp. I am including that in
linuxdoc-tools as part of preliminary utf-8 support, and it may be of help
here.

> > > Interestingly, some characters (like "ü") are accepted without
> > > problems while others (Ü,ß) yield the above errors.

Maybe it complains only about one part of the multi-byte representation,
not present in lowercase characters.

--
Agustin
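The guess about "one part of the multi-byte representation" can be checked by looking at the UTF-8 bytes of the three characters named in the report. The following is only a minimal sketch in Python; the helper names `utf8_byte_values` and `c1_bytes` are made up for illustration, and the idea that the parser rejects bytes in the 128..159 range when it reads the input one byte at a time is an assumption, not something confirmed in this thread.

```python
# Characters named in the report: "ü" is accepted, "Ü" and "ß" are rejected.
CHARS = ["ü", "Ü", "ß"]


def utf8_byte_values(ch):
    """Return the decimal values of the bytes that encode ch in UTF-8."""
    return list(ch.encode("utf-8"))


def c1_bytes(ch):
    """Return the UTF-8 bytes of ch that fall in the range 128..159.

    Assumption: a parser that treats the input as a single-byte charset
    sees these values as C1 control positions and may reject them.
    """
    return [b for b in utf8_byte_values(ch) if 128 <= b <= 159]


for ch in CHARS:
    print(ch, utf8_byte_values(ch), c1_bytes(ch))
```

For "Ü" and "ß" the second byte is 156 and 159 respectively, which matches the character numbers in the nsgmls errors, while both bytes of "ü" lie above 159.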
Bug#549233: docbook-to-man: Does not accept (some) (unicode) characters
Helge,

I looked at this bug report because I was looking into other things
related to DocBook and man pages.

I think this is not a bug. If you look at the file you attached with
"less", you will see characters such as "" and "". Those are the
hexadecimal values of Latin1 characters, not parts of UTF-8 characters.
Maybe you accidentally attached the wrong file; I don't know. That would
explain why there are no complaints when you try to convert this as a
Latin1 document though.

I am attaching a version of your document re-encoded as UTF-8 for you to
experiment with. I did not try to process it. I also changed the
"doctype" word on the first line to "DOCTYPE"; I think it is always
supposed to be upper-case even if some tools don't complain.

Because this is a DocBook file, you could try giving the filename the
suffix ".xml" and insert this as the first line in the file to declare
its encoding as UTF-8:

    <?xml version="1.0" encoding="UTF-8"?>

You might also need to modify the system identifier in the DOCTYPE line.

In summary, I think the problem is that the document you attached is not
a valid UTF-8 document. There might be other reasons why it is also not a
valid SGML document.

The "emacs" editor will recognize SGML documents as such if the filename
ends in ".sgml". I do not know how well it validates general SGML files
though. If you end the filename in ".xml", you can enable Nxml mode in
emacs. See https://www.emacswiki.org/emacs/UsingNxmlModeWithDocBook for
more information. Note that some older emacs instructions might say you
need to download the Nxml package and install it for emacs, but recent
versions of emacs include it.

I do not know what other validation tools are available for general SGML
files on Debian, apart from xsltproc.

I am not the maintainer of this package, and DocBook 4.1 is very, very
old at this point, but I am posting this in case it helps you resolve
this bug. Hopefully, this is also enough information for you to decide to
close the bug.

Good luck,

Paul Hardy

FIXME"> Niedermeyer"> FIXME 2009"> 1"> chias...@bsi.bund.de"> CHIASMUS"> Debian"> GNU"> GPL"> "> ]>
FIXME 2009
Demo for docbook-to-man problems
-hilfe
-beispiel

BESCHREIBUNG

is some programme

Für Hinweise zur Sicherheit siehe "HINWEISE", für erste Schritte siehe
"EINFÜHRUNG" und für weiter Anwendungsbeispiele "BEISPIELE". Testing ß.

OPTIONEN

Zwischen einer Option und dem zugehörigen Parameter können, müssen aber
keine Leerzeichen stehen. Die Optionen können in beliebiger Reihenfolge
angegeben werden. Wird eine Option mehrfach angegeben, so wird nur das
letzte Auftreten der Option (bzw. der zugehörige Parameter) ausgewertet.
Wildcards werden von chiasmus nicht unterstützt.

-hilfe
    Gibt eine kurze Hilfe aus.

-beispiel
    Gibt Beispiele zur Verwendung von Chiasmus aus.

-m something
    some text
    Beispiele:
    a. This is something
    b. Something else
    c. Even more so
    d. A fourth item
    e. A fifth item

-q something
    Does somthing more
    Beispiele:
    a. Should be a) (restarted)
    b. Should be b)
    % demo something ...
    Hinweis: Die Option -q braucht nicht mit angegeben zu werden. Das Kommando
    % demo something else
    leistet dasselbe wie das Kommando in Beispiel a.

-z Options
    FIXME some text
    Beispiele:
    a. Should be again a)
    b. Should be again b)
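Whether a given file really is valid UTF-8, as discussed in the message above, can be tested without any SGML tooling. This is a minimal sketch in Python under stated assumptions: `is_valid_utf8` is a hypothetical helper, and the sample string is just a few words borrowed from the demo page, encoded both ways for comparison.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative sample text only; not read from the attached file.
sample = "Für EINFÜHRUNG ß"
latin1_bytes = sample.encode("latin-1")
utf8_bytes = sample.encode("utf-8")

print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(utf8_bytes))
```

To check an actual file, the same function could be applied to the bytes returned by `open(path, "rb").read()`. A passing check only says the bytes form valid UTF-8; it says nothing about whether the document is valid SGML.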
Bug#549233: docbook-to-man: Does not accept (some) (unicode) characters
Package: docbook-to-man
Version: 1:2.0.0-27
Severity: important
Tags: l10n

It appears that docbook-to-man is not UTF-8 ready. If you compile the
attached man page as is then you'll get the following error:
/usr/bin/nsgmls:demo.man.sgml:60:6:E: non SGML character number 156
/usr/bin/nsgmls:demo.man.sgml:60:6: open elements: REFENTRY REFSECT1[1] PARA[1] (#PCDATA[1])
/usr/bin/nsgmls:demo.man.sgml:62:9:E: non SGML character number 159
/usr/bin/nsgmls:demo.man.sgml:62:9: open elements: REFENTRY REFSECT1[1] PARA[1] (#PCDATA[1])

(The man page looks fine, though).

Interestingly, some characters (like ü) are accepted without problems
while others (Ü,ß) yield the above errors.

If you recode the file to latin1, then the errors vanish (Note that in my
UTF-8 environment, the generated man page appears now broken because all
umlauts and ß appear to be silently removed - this can be fixed by
recoding the generated man page back to UTF-8).

-- System Information:
Debian Release: 5.0.3
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: powerpc (ppc)

Kernel: Linux 2.6.24.3-grsec
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages docbook-to-man depends on:
ii  docbook   4.5-4            standard SGML representation syste
ii  libc6     2.7-18           GNU C Library: Shared libraries
ii  sp        1.3.4-1.2.1-47   James Clark's SGML parsing tools

docbook-to-man recommends no packages.

docbook-to-man suggests no packages.

-- no debconf information

--
Dr. Helge Kreutzmann    deb...@helgefjell.de
Dipl.-Phys.             http://www.helgefjell.de/debian.php
64bit GNU powered       gpg signed mail preferred
Help keep free software libre: http://www.ffii.de/

!doctype refentry PUBLIC -//OASIS//DTD DocBook V4.1//EN [ !ENTITY dhfirstname firstnameFIXME/firstname !ENTITY dhsurname surnameNiedermeyer/surname !ENTITY dhdate dateFIXME 2009/date !ENTITY dhsection manvolnum1/manvolnum !ENTITY dhemail emailchias...@bsi.bund.de/email !ENTITY dhusername Max Mustermann !ENTITY dhucpackage refentrytitleCHIASMUS/refentrytitle !ENTITY dhpackage demo !ENTITY debian productnameDebian/productname !ENTITY gnu acronymGNU/acronym !ENTITY gpl gnu; acronymGPL/acronym !ENTITY demo commanddhpackage;/command ] refentry refentryinfo address dhemail; /address author dhfirstname; dhsurname; /author copyright yearFIXME 2009/year holderdhusername;/holder /copyright dhdate; /refentryinfo refmeta dhucpackage; dhsection; /refmeta refnamediv refnamedhpackage;/refname refpurposeDemo for docbook-to-man problems/refpurpose /refnamediv refsynopsisdiv cmdsynopsis sepchar= demo; arg choice=plain -hilfe/arg /cmdsynopsis cmdsynopsis demo; arg choice=plain -beispiel/arg /cmdsynopsis /refsynopsisdiv refsect1 titleBESCHREIBUNG/title para demo; is some programme /para para Für Hinweise zur Sicherheit siehe HINWEISE, für erste Schritte siehe EINFÜHRUNG und für weiter Anwendungsbeispiele BEISPIELE. Testing ß. /para /refsect1 refsect1 titleOPTIONEN/title para Zwischen einer Option und dem zugehörigen Parameter können, müssen aber keine Leerzeichen stehen. Die Optionen können in beliebiger Reihenfolge angegeben werden. Wird eine Option mehrfach angegeben, so wird nur das letzte Auftreten der Option (bzw. der zugehörige Parameter) ausgewertet. Wildcards werden von chiasmus nicht unterstützt. /para variablelist varlistentry termoption-hilfe/option/term listitem para Gibt eine kurze Hilfe aus. /para /listitem /varlistentry varlistentry termoption-beispiel/option/term listitem para Gibt Beispiele zur Verwendung von Chiasmus aus.
/para /listitem /varlistentry varlistentry termoption-m/option optionsomething/option/term listitem para some text /para para Beispiele: /para orderedlist numeration=loweralpha continuation='restarts' listitem para a. This is something /para /listitem listitem para b. Something else /para /listitem listitem para c. Even more so /para /listitem listitem para d. A fourth item /para /listitem listitem para e. A fifth item /para /listitem /orderedlist /listitem /varlistentry varlistentry termoption-q/option optionsomething/option/term listitem para Does somthing more /para para Beispiele: /para orderedlist numeration=loweralpha continuation='restarts' listitem para a. Should be a) (restarted)