Hi, up front, thanks for finding and fixing this and sorry for not coming round to testing the patch before commit.
Crystal Kolipe wrote on Wed, Sep 27, 2023 at 06:04:01PM -0300: > On Wed, Sep 27, 2023 at 02:05:14PM -0600, Todd C. Miller wrote: >> As written, deroff will not emit a line that does not end with a >> newline. That could be changed in a subsequent commit if it so >> desired. > That's always been the behaviour in every version that I've seen. > It does have the possibly unexpected effect that /usr/bin/spell > silently ignores the last line of input if it's not newline terminated. Well, if a file does not end in a newline character, it is not a text file according to POSIX. Similarly, if a file contains a sequence of LINE_MAX bytes and none of those bytes is a newline, it is not a text file either. The behaviour of troff(1), nroff(1), and deroff(1) is undefined on non-text files. https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html Definition "3.403 Text File": A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character. Although POSIX.1-2017 does not distinguish between text files and binary files (see the ISO C standard), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify "text files" in their STDIN or INPUT FILES sections. All that said, i do agree that our text-handling tools should handle non-text files gracefully whereever that is possible without excessive effort. But it isn't really surprising that an extremely old, almost obsolete utility like deroff(1) relies on these rules. > Whichever way we decide on for this, the future regression test I certainly don't object that you commit a regression test you already have, but systematically writing more regressions tests looks like a waste of time to me. That time would be better spent figuring out how we can finally delete deroff(1) from the tree. That has been discussed in the past, but sadly never came to fruition. The main obstacle to overcome is that spell(1) relies on it - but nothing else as far as i remember. Note that while using deroff(1) in spell(1) kind of works, it is *not* a good solution, and has never been, it's an awful hack and always was. deroff(1) is an ad-hoc simplified parser for various complex languages, so it will almost certainly misparse some valid roff(7) and high-level macro constructions, both missing some relevant parts of the input text and inserting bogus words that have nothing to do with the text contained in the input into the output stream. The error rate is low enough that it still kind of works for spell(1), especially considering that a tool like spell(1) also has its own, quite significant rate of both false negative and false positive misclassifications. But none of that makes the basic approach a good idea. On top of that, the source code of deroff(1) looks badly obfuscated. Every language, including roff(7) and the macro sets on top of it, evolves over time, and i fail to see any reasoable way to maintain the mess inside deroff.c during such evolution. I guess no one even tried to maintain it for roughly 30 years now. What is the point of regression tests for code that no one ever changes? The proper way to improve spell(1) would probably be to use nroff -Tascii output instead of deroff(1). Now, we don't have nroff(1) in base. For the case of -mdoc, -man, and -mandoc, using mandoc(1) -Tascii would be a significant improvement over using deroff(1). For the cases of other ancient macrosets like -me, -mm, -ms that we no longer support in base, asking people to install the textproc/groff port to support them seems reasonable to me. Why should base be able to run spell(1) on a file that base cannot format in the first place? For processing a modern macro set like -mom, you already have to install groff(1) right now - deroff(1) is going to yield very poor results with that. Still, some work is needed in this direction before we may become able to send deroff(1) to its well-earned retirement. Oh, and before you ask, the demandoc(1) still contained in the portable mandoc distribution, which we never had in base, should also be deleted from portable mandoc. It is also a fundamentally flawed idea and totally unmaintained for over a decade. Simply using mandoc -Tascii is the way to go even in portable mandoc. Yours, Ingo