Hi,

up front, thanks for finding and fixing this and sorry for not coming
round to testing the patch before commit.

Crystal Kolipe wrote on Wed, Sep 27, 2023 at 06:04:01PM -0300:
> On Wed, Sep 27, 2023 at 02:05:14PM -0600, Todd C. Miller wrote:

>> As written, deroff will not emit a line that does not end with a
>> newline.  That could be changed in a subsequent commit if it so
>> desired.

> That's always been the behaviour in every version that I've seen.
> It does have the possibly unexpected effect that /usr/bin/spell
> silently ignores the last line of input if it's not newline terminated.

Well, if a file does not end in a newline character, it is not a text
file according to POSIX.  Similarly, if a file contains a sequence
of LINE_MAX bytes and none of those bytes is a newline, it is not a
text file either.  The behaviour of troff(1), nroff(1), and deroff(1)
is undefined on non-text files.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html

Definition "3.403 Text File":

  A file that contains characters organized into zero or more
  lines. The lines do not contain NUL characters and none can
  exceed {LINE_MAX} bytes in length, including the <newline>
  character. Although POSIX.1-2017 does not distinguish between text
  files and binary files (see the ISO C standard), many utilities
  only produce predictable or meaningful output when operating on
  text files. The standard utilities that have such restrictions
  always specify "text files" in their STDIN or INPUT FILES sections.

All that said, i do agree that our text-handling tools should handle
non-text files gracefully whereever that is possible without excessive
effort.  But it isn't really surprising that an extremely old, almost
obsolete utility like deroff(1) relies on these rules.

> Whichever way we decide on for this, the future regression test

I certainly don't object that you commit a regression test you already
have, but systematically writing more regressions tests looks like a
waste of time to me.  That time would be better spent figuring out
how we can finally delete deroff(1) from the tree.  That has been
discussed in the past, but sadly never came to fruition.  The main
obstacle to overcome is that spell(1) relies on it - but nothing else
as far as i remember.

Note that while using deroff(1) in spell(1) kind of works, it is *not*
a good solution, and has never been, it's an awful hack and always was.
deroff(1) is an ad-hoc simplified parser for various complex languages,
so it will almost certainly misparse some valid roff(7) and high-level
macro constructions, both missing some relevant parts of the input
text and inserting bogus words that have nothing to do with the text
contained in the input into the output stream.  The error rate is low
enough that it still kind of works for spell(1), especially considering
that a tool like spell(1) also has its own, quite significant rate of
both false negative and false positive misclassifications.  But none
of that makes the basic approach a good idea.

On top of that, the source code of deroff(1) looks badly obfuscated.
Every language, including roff(7) and the macro sets on top of it,
evolves over time, and i fail to see any reasoable way to maintain
the mess inside deroff.c during such evolution.  I guess no one
even tried to maintain it for roughly 30 years now.  What is the
point of regression tests for code that no one ever changes?

The proper way to improve spell(1) would probably be to use
nroff -Tascii output instead of deroff(1).  Now, we don't have
nroff(1) in base.  For the case of -mdoc, -man, and -mandoc, using
mandoc(1) -Tascii would be a significant improvement over using
deroff(1).  For the cases of other ancient macrosets like -me, -mm, -ms
that we no longer support in base, asking people to install
the textproc/groff port to support them seems reasonable to me.
Why should base be able to run spell(1) on a file that base
cannot format in the first place?

For processing a modern macro set like -mom, you already have to
install groff(1) right now - deroff(1) is going to yield very
poor results with that.

Still, some work is needed in this direction before we may become
able to send deroff(1) to its well-earned retirement.

Oh, and before you ask, the demandoc(1) still contained in the
portable mandoc distribution, which we never had in base, should
also be deleted from portable mandoc.  It is also a fundamentally
flawed idea and totally unmaintained for over a decade.  Simply
using mandoc -Tascii is the way to go even in portable mandoc.

Yours,
  Ingo

Reply via email to