Re: Line-breaking style in documentation source (was Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page)

2024-11-25 Thread G. Branden Robinson
At 2024-11-24T23:04:23-0600, Dave Kemper wrote:
> Every time I look at a diff in groff.texi, and I have to spend some
> time figuring out what *actually* changed versus what merely got
> reflowed, I wish that that manual used manpage-style line breaking
> internally.

You might be interested in or amused by a recent patch series to GNU
Bash, starting here:

https://lists.gnu.org/archive/html/bug-bash/2024-11/msg00161.html

> Is there a reason for it not to, besides historical practice?

Not to my knowledge.

> Going through the entire manual and changing its format would be a
> slog (though perhaps at least partly automatable), and would make "git
> blame" useless for tracking down any particular change from before
> that point--but might make diffs after that point enough easier to
> follow to be a net win.
> 
> Alternately, everyone[1] who hacks on the manual could agree to use
> the new format for any future edits.  This would make the formatting
> change more gradual and manageable, but would also result in
> inconsistent line-breaking style within the manual for, realistically,
> decades (or at least until AI takes over all content creation, and no
> humans need use groff ever again).  This, too, might be worth
> tolerating for the readability of diffs going forward.
> 
> [1] Yeah, "everyone" is pretty much just Branden, credited for 1255 of
> the 1272 commits in the last five years (and even that is an
> undercount, as several of those commits are credited to me but
> actually applied and sometimes refined by Branden).

I happen to have a change pending to our Texinfo manual.  I'll pilot
this and see what happens.  (It _does_ happen to warrant a parallel
change to one of our man pages.)

Regards,
Branden


signature.asc
Description: PGP signature


Line-breaking style in documentation source (was Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page)

2024-11-24 Thread Dave Kemper
On Sat, Nov 2, 2024 at 5:08 AM G. Branden Robinson wrote:
> At 2024-11-01T21:07:29+0100, Alejandro Colomar wrote:
> > No, this isn't outdated, since that reduces the quality of the diff.
> > Also, I review a lot of patches in the mail client, without running
> > git(1).  And it's not just for reviewing diffs, but also for writing
> > them.  Semantic newlines reduce the amount of work for producing the
> > diffs.
>
> It's a real win for diffs.
>
> Here's a very recent example from groff.
>
> diff --git a/man/groff.7.man b/man/groff.7.man
> index 1fb635f2b..1d248b237 100644
> --- a/man/groff.7.man
> +++ b/man/groff.7.man
> @@ -1281,6 +1281,7 @@ .SH Identifiers
>  typeface,
>  color,
>  special character or character class,
> +hyphenation language code,
>  environment,
>  or stream.

Every time I look at a diff in groff.texi, and I have to spend some
time figuring out what *actually* changed versus what merely got
reflowed, I wish that that manual used manpage-style line breaking
internally.  Is there a reason for it not to, besides historical
practice?
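
To make the difference concrete, here is a small shell sketch.  The
file names, the `changed_lines` helper, and the lead-in words "An
identifier names a" are made up for illustration; the list itself is
borrowed from the groff.7 hunk quoted above.  It counts how many lines
diff(1) reports as touched when the same phrase is added to a filled
paragraph and to one broken on semantic boundaries.

```shell
#!/bin/sh
# Illustrative only: compare diff noise for filled vs. semantic line breaks.
demo_dir=$(mktemp -d)

# Filled style: adding a phrase forces the tail of the paragraph to reflow.
cat >"$demo_dir/filled.old" <<'EOF'
An identifier names a typeface, color,
special character or character class,
environment, or stream.
EOF
cat >"$demo_dir/filled.new" <<'EOF'
An identifier names a typeface, color,
special character or character class,
hyphenation language code, environment,
or stream.
EOF

# Semantic newlines: the same change is one added line.
cat >"$demo_dir/semantic.old" <<'EOF'
An identifier names a
typeface,
color,
special character or character class,
environment,
or stream.
EOF
cat >"$demo_dir/semantic.new" <<'EOF'
An identifier names a
typeface,
color,
special character or character class,
hyphenation language code,
environment,
or stream.
EOF

# Count the lines diff(1) marks as removed ("<") or added (">").
changed_lines() {
    diff "$1" "$2" | grep -c '^[<>]'
}

echo "filled:   $(changed_lines "$demo_dir/filled.old" "$demo_dir/filled.new") lines touched"
echo "semantic: $(changed_lines "$demo_dir/semantic.old" "$demo_dir/semantic.new") lines touched"
```

With these inputs the filled version reports 3 touched lines and the
semantic version reports 1, and only the latter shows at a glance what
actually changed.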

Going through the entire manual and changing its format would be a
slog (though perhaps at least partly automatable), and would make "git
blame" useless for tracking down any particular change from before
that point--but might make diffs after that point enough easier to
follow to be a net win.

Alternately, everyone[1] who hacks on the manual could agree to use
the new format for any future edits.  This would make the formatting
change more gradual and manageable, but would also result in
inconsistent line-breaking style within the manual for, realistically,
decades (or at least until AI takes over all content creation, and no
humans need use groff ever again).  This, too, might be worth
tolerating for the readability of diffs going forward.

[1] Yeah, "everyone" is pretty much just Branden, credited for 1255 of
the 1272 commits in the last five years (and even that is an
undercount, as several of those commits are credited to me but
actually applied and sometimes refined by Branden).



Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-03 Thread Colin Watson
On Sun, Nov 03, 2024 at 02:59:34AM +0100, Alejandro Colomar wrote:
> On Sun, Nov 03, 2024 at 12:47:23AM +, Colin Watson wrote:
> > I'm not trying to stop you committing whatever you want to your
> > repository, of course, but I want to be clear that this doesn't actually
> > solve the right problem for manual page indexing.  The point of the
> > parsing code in mandb(8) - and I'm not claiming that it's great code or
> > the perfect design, just that it works most of the time - is to extract
> > the names and summary-descriptions from each page so that they can be
> > used by tools such as apropos(1) and whatis(1).  Splitting on section
> > boundaries is just the simplest part of that problem, and I don't think
> > that doing it in a separate program really gains anything.
> 
> Splitting on section boundaries is the minimum thing so that mandb(8)
> can use groff(1) directly to parse the section (instead of rolling your
> own man(7) parser).

No, it doesn't help, because mandb(8) still has to do a bunch of other
man(7) parsing on top of that (including the problem that caused me to
be CCed into this thread in the first place).  Delegating just the
section splitting to a separate tool would add quite a bit of complexity
without removing the need for man-db's own parser.

A separate tool is only useful if it solves the whole problem at hand,
rather than maybe 10% of it.  And even then it would need some careful
thought around integration.

Thanks,

-- 
Colin Watson (he/him)  [cjwat...@debian.org]



Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread G. Branden Robinson
Hi Alex,

At 2024-11-02T11:39:37+0100, Alejandro Colomar wrote:
> And diffs are a real win for text.  Thus, semantic newlines are a real
> win for text.  "Write poems, not prose."  (Any chance we may get that
> warning added to groff(1)?  :D)

Yes, but I've kicked it out to groff 1.25 because a gift-wrapped
opportunity came along.  We get to retire a warning category and its
number.

groff(7) [1.23.0]:

Warnings
...
   el 16   The el request was encountered with no prior
   corresponding ie request.

groff 1.24.0 [in preparation] NEWS:

*  The "el" warning category has been withdrawn.  If enabled (which it
   was not by default), the formatter would emit a diagnostic if it
   inferred an imbalance between `ie` and `el` requests.  Unfortunately
   its technique wasn't reliable and sometimes spuriously issued these
   warnings, and making it perfectly reliable did not look tractable.
   We recommend using brace escape sequences `\{` and `\}` to ensure
   that your control flow structures remain maintainable.

This was a 35-year-old bug (or incomplete feature) in GNU troff that as
far as I know first came to attention 10 years ago when the
then-Heirloom Doctools maintainer pointed out an incompatibility between
AT&T troff (from which Heirloom Doctools descends) and GNU troff.

https://savannah.gnu.org/bugs/?45502

More recently, Paul Eggert scored big-time grognard points by actually
depending on the AT&T troff behavior in the zic(8) man page.

https://savannah.gnu.org/bugs/?65474

We therefore _had_ to fix it.

The consequence is that the warning category `el` and bit 4 in the
warning mask integer are undefined for groff 1.24.
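
As a quick check of the arithmetic: the `el` category carried code 16
in groff(7), and 16 is 2 to the 4th power, hence "bit 4".  A minimal
shell sketch, where the `warning_bit` helper is hypothetical and not
part of groff:

```shell
#!/bin/sh
# Print the bit position of a power-of-two warning code.
warning_bit() {
    code=$1
    bit=0
    while [ "$code" -gt 1 ]; do
        code=$((code / 2))
        bit=$((bit + 1))
    done
    echo "$bit"
}

warning_bit 16          # the retired "el" category
echo $((1 << 4))        # and back again: bit 4 as a mask value
```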

This was irresistible serendipity, because this warning category was (1)
not enabled by default and (2) probably used only by people who wouldn't
object to style warnings anyway.

In groff 1.25, I want to revive bit 4 as new warning category `style`.

Ending sentences before the end of a text line is something we can warn
about as discussed a while back, and I plan to do so.

https://lists.gnu.org/archive/html/groff/2022-06/msg00052.html

I've been collecting specimens of other contemplated style warnings.

https://savannah.gnu.org/bugs/?62776

Regards,
Branden




Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Colin Watson
On Sat, Nov 02, 2024 at 07:50:23PM -0500, G. Branden Robinson wrote:
> At 2024-11-02T19:06:53+, Colin Watson wrote:
> > How embarrassing.  Could somebody please file a bug on
> > https://gitlab.com/man-db/man-db/-/issues to remind me to fix that?
> 
> Done; .

Thanks, working on it.

> > I already know that getting acceptable performance for
> > this requires care, as illustrated by one of the NEWS entries for
> > man-db 2.10.0:
> > 
> >  * Significantly improve `mandb(8)` and `man -K` performance in the
> >common case where pages are of moderate size and compressed using
> >`zlib`: `mandb -c` goes from 344 seconds to 10 seconds on a test
> >system.
> > 
> > ... so I'm prepared to bet that forking nroff one page at a time will
> > be unacceptably slow.
> 
> Probably, but there is little reason to run nroff that way (as of groff
> 1.23).  It already works well, but I have ideas for further hardening
> groff's man(7) and mdoc(7) packages such that they return to a
> well-defined state when changing input documents.

Being able to keep track of which output goes with which input pages is
critical to the indexer, though (as you acknowledge later in your
reply).  It can't just throw the whole lot at nroff and call it a day.

One other thing: mandb/lexgrog also looks for preprocessing filter hints
in pages (`'\" te` and the like).  This is obscure, to be sure, but
either a replacement would need to do the same thing or we'd need to be
certain that it's no longer required.
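
For readers who haven't met the hint: by convention a page may begin
with a comment line such as `'\" te`, whose letters name the
preprocessors to run (`t` for tbl, `e` for eqn, and so on).  A minimal
sketch of detecting it follows.  This is not man-db's actual code; the
`preproc_hint` function and the sample page are hypothetical.

```shell
#!/bin/sh
# Print the preprocessor letters from a leading '\" line, if there is one.
# Returns nonzero when the first line carries no such hint.
preproc_hint() {
    first=
    IFS= read -r first <"$1"
    case $first in
        "'\\\" "*) printf '%s\n' "${first#????}" ;;  # strip the 4-char prefix
        *) return 1 ;;
    esac
}

demo_page=$(mktemp)
cat >"$demo_page" <<'EOF'
'\" te
.TH demo 1 2024-11-02 "demo 1.0"
.SH NAME
demo \- an example page
EOF

preproc_hint "$demo_page"
```

Any replacement for the current parser would need an equivalent of
this check, or evidence that nothing relies on the hint any more.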

> > and of course care would be needed around error handling and so on.
> 
> I need to give this thought, too.  What sorts of error scenarios do you
> foresee?  GNU troff itself, if it can't open a file to be formatted,
> reports an error diagnostic and continues to the next `argv` string
> until it reaches the end of input.

That might be sufficient, or man-db might need to be able to detect
which pages had errors.  I'm not currently sure.

> > but on the other hand this starts to feel like a much less natural fit
> > for the way nroff is run in every other situation, where you're
> > processing one document at a time.
> 
> This I disagree with.  Or perhaps more precisely, it's another example
> of the exception (man(1)) swallowing the rule (nroff/troff).  nroff and
> troff were written as Unix filters; they read the standard input stream
> (and/or argument list)[1], do some processing, and write to standard
> output.[2]
> 
> Historically, troff (or one of its preprocessors) was commonly used with
> multiple input files to catenate them.

But this application is not conceptually like catenation (even if it
might be possible to implement it that way).  The collection of all
manual pages on a system is not like one long document that happens to
be split over multiple files, certainly not from an indexer's point of
view.

-- 
Colin Watson (he/him)  [cjwat...@debian.org]



Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Alejandro Colomar
Hi Colin,

On Sun, Nov 03, 2024 at 12:47:23AM +, Colin Watson wrote:
> I'm not trying to stop you committing whatever you want to your
> repository, of course, but I want to be clear that this doesn't actually
> solve the right problem for manual page indexing.  The point of the
> parsing code in mandb(8) - and I'm not claiming that it's great code or
> the perfect design, just that it works most of the time - is to extract
> the names and summary-descriptions from each page so that they can be
> used by tools such as apropos(1) and whatis(1).  Splitting on section
> boundaries is just the simplest part of that problem, and I don't think
> that doing it in a separate program really gains anything.

Splitting on section boundaries is the minimum thing so that mandb(8)
can use groff(1) directly to parse the section (instead of rolling your
own man(7) parser).

groff(1) could also be used --avoiding a shell script--, but that would
need a new feature in groff(1) --which Branden has suggested--.  I
prefer avoiding the growth of groff(1), if a simple sed(1) invocation
can do it.

The script will be useful for now to me, so I'll probably commit it.
Feel free to use it if you find it useful.  (If so, please let me know
so that I keep the interface stable.)


Cheers,
Alex

> (That's leaving aside things like localized man pages, which I know some
> folks on the groff list tend to sniff at but I think they're important,
> and the fact that the NAME section has both semantic and presentational
> meaning means that like it or not the parser needs to be aware of this.)
> 
> -- 
> Colin Watson (he/him)  [cjwat...@debian.org]
> 
> 

-- 





Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Colin Watson
(now with some local vim macros fixed to stop accidentally corrupting
the To: lines of some of my outgoing emails ...)

On Sat, Nov 02, 2024 at 08:09:29PM -0500, G. Branden Robinson wrote:
> At 2024-11-03T00:47:23+, Colin Watson wrote:
> > and the fact that the NAME section has both semantic and
> > presentational meaning means that like it or not the parser needs to
> > be aware of this.)
> 
> Even if mandb(8) doesn't run groff to extract the summary descriptions/
> apropos lines, I think this feature might be useful to you for
> coverage/regression testing.  Presumably, for valid inputs, groff and
> mandb(8) should reach similar conclusions about how the text of a "Name"
> section is to be formatted.

Yes, that's a good point and I agree with that.

-- 
Colin Watson (he/him)  [cjwat...@debian.org]



Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread G. Branden Robinson
Hi Colin,

At 2024-11-03T00:47:23+, Colin Watson wrote:
> (That's leaving aside things like localized man pages, which I know
> some folks on the groff list tend to sniff

I can think of only one, the maintainer of a rival formatter.  ;-)

> at but I think they're important,

Me too.  I agree with the sniffer that no language is ever likely to
reach 100% parity with English in something like the Debian
distribution, but more modest domains exist.

I've put effort into l10n issues in man(7) and in groff generally.  In
particular, I really want seamless multilingual document support and
achievement of that goal will be, I think, much closer in groff 1.24.
(My pending push is gated on deciding how to change the me(7) and ms(7)
packages to accommodate a formatter-level fix to an ugly wart in the
l10n department; see .)

> and the fact that the NAME section has both semantic and
> presentational meaning means that like it or not the parser needs to
> be aware of this.)

Even if mandb(8) doesn't run groff to extract the summary descriptions/
apropos lines, I think this feature might be useful to you for
coverage/regression testing.  Presumably, for valid inputs, groff and
mandb(8) should reach similar conclusions about how the text of a "Name"
section is to be formatted.

Regards,
Branden




Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Colin Watson
I'm not trying to stop you committing whatever you want to your
repository, of course, but I want to be clear that this doesn't actually
solve the right problem for manual page indexing.  The point of the
parsing code in mandb(8) - and I'm not claiming that it's great code or
the perfect design, just that it works most of the time - is to extract
the names and summary-descriptions from each page so that they can be
used by tools such as apropos(1) and whatis(1).  Splitting on section
boundaries is just the simplest part of that problem, and I don't think
that doing it in a separate program really gains anything.

(That's leaving aside things like localized man pages, which I know some
folks on the groff list tend to sniff at but I think they're important,
and the fact that the NAME section has both semantic and presentational
meaning means that like it or not the parser needs to be aware of this.)

-- 
Colin Watson (he/him)  [cjwat...@debian.org]



Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread G. Branden Robinson
Hi Colin,

At 2024-11-02T19:06:53+, Colin Watson wrote:
> How embarrassing.  Could somebody please file a bug on
> https://gitlab.com/man-db/man-db/-/issues to remind me to fix that?

Done; .

> lexgrog(1) is a useful (if oddly-named, sorry) debugging tool, but if
> you focus on that then you'll end up with a design that's not very
> useful.  What really matters is indexing the whole system's manual
> pages, and mandb(8) does not do that by invoking lexgrog(1) one page
> at a time, but rather by running more or less the same code
> in-process.

Ah, I see it now--"lexgrog.l" is in both the Automake macros
"lexgrog_SOURCES" and "mandb_SOURCES".  Nice and DRY!

> I already know that getting acceptable performance for
> this requires care, as illustrated by one of the NEWS entries for
> man-db 2.10.0:
> 
>  * Significantly improve `mandb(8)` and `man -K` performance in the
>common case where pages are of moderate size and compressed using
>`zlib`: `mandb -c` goes from 344 seconds to 10 seconds on a test
>system.
> 
> ... so I'm prepared to bet that forking nroff one page at a time will
> be unacceptably slow.

Probably, but there is little reason to run nroff that way (as of groff
1.23).  It already works well, but I have ideas for further hardening
groff's man(7) and mdoc(7) packages such that they return to a
well-defined state when changing input documents.

> (This also combines with the fact that man-db applies some sandboxing
> when it's calling nroff just in case it might happen that a
> moderately-sized C++ project has less than 100% perfect security when
> doing text processing, which I'm sure everyone agrees would never
> happen.)

Inconceivable, yes!  But fortunately you can run nroff over N documents
and pay its own startup overhead costs as well as those of sandboxing
only once.

> If it were possible to run nroff over a whole batch of pages and get
> output for each of them in one go, then maybe.

That's already true for formatting the entire page.  It's how this was
created.

https://www.gnu.org/software/groff/manual/groff-man-pages.utf8.txt

(...best viewed with "less -R")

With the `-d EXTRACT` feature I have in mind, in its
as-simple-as-possible first-cut form, the problem you anticipate...

> man-db would need a reliable way to associate each line (or sometimes
> multiple lines) of output with each source file,

...would remain.  I'll have to think of a good way to write out
"metadata" (the input file name and the arguments to the `TH` request)
as each page is encountered, and of an interface to enable that.  I
don't see it happening before groff 1.25.
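
To illustrate the kind of association being asked for, here is a rough
sketch that tags each line of a "NAME" section with the file it came
from while reading many pages in a single process.  It uses awk(1)
rather than groff, only recognizes a literal `.SH NAME` or `.SH Name`,
and ignores the `TH` arguments, so it is a picture of the desired
output shape and nothing more.  The `name_lines` function and the demo
pages are hypothetical.

```shell
#!/bin/sh
# Print each line of the NAME section, prefixed by its source file name.
name_lines() {
    awk '
        FNR == 1 { in_name = 0 }    # new input file: reset state
        /^\.SH/  { in_name = ($2 == "NAME" || $2 == "Name"); next }
        in_name  { print FILENAME ": " $0 }
    ' "$@"
}

demo_dir=$(mktemp -d)
cat >"$demo_dir/a.1" <<'EOF'
.TH a 1 2024-11-02 "demo 1.0"
.SH NAME
a \- first page
.SH DESCRIPTION
Text.
EOF
cat >"$demo_dir/b.1" <<'EOF'
.TH b 1 2024-11-02 "demo 1.0"
.SH NAME
b \- second page
.SH DESCRIPTION
More text.
EOF

name_lines "$demo_dir/a.1" "$demo_dir/b.1"
```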

> and of course care would be needed around error handling and so on.

I need to give this thought, too.  What sorts of error scenarios do you
foresee?  GNU troff itself, if it can't open a file to be formatted,
reports an error diagnostic and continues to the next `argv` string
until it reaches the end of input.

> I can see the appeal, in terms of processing the actual language
> rather than a pile of hacks that try to guess what to do with it

...a major selling point, IMO...

> but on the other hand this starts to feel like a much less natural fit
> for the way nroff is run in every other situation, where you're
> processing one document at a time.

This I disagree with.  Or perhaps more precisely, it's another example
of the exception (man(1)) swallowing the rule (nroff/troff).  nroff and
troff were written as Unix filters; they read the standard input stream
(and/or argument list)[1], do some processing, and write to standard
output.[2]

Historically, troff (or one of its preprocessors) was commonly used with
multiple input files to catenate them.

Here's an example of this practice from 1980.

https://minnie.tuhs.org/cgi-bin/utree.pl?file=3BSD/usr/doc/pascal/makefile

Regards,
Branden

[1] ...including this option from Seventh Edition Unix (1979) or
earlier, which survives in GNU troff to this day.

 -i Read standard input after the input files are
exhausted.

[2] Seventh Edition troff didn't write to stdout by default, but tried
to open the typesetter device.  But it had an option to write to
standard output.

 -t Direct output to the standard output instead of the
phototypesetter.

   Running old school Unix under emulation these days, you _have_ to use
   this option to avoid the dreaded "Typesetter busy." diagnostic.

   When Kernighan refactored troff for device-independence, he
   reseated it more squarely in the Unix filter tradition by writing
   its plain-text page description language to stdout.  The output
   driver, such as "dpost" for PostScript, also read its standard input,
   and could thus become just one more stage in a pipeline.  [CSTR #97]




Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Alejandro Colomar
[CC trimmed]

Hi Colin,

On Sun, Nov 03, 2024 at 12:24:54AM +, Colin Watson wrote:
> On Sun, Nov 03, 2024 at 01:05:34AM +0100, Alejandro Colomar wrote:
> > Are you sure?  With a small tweak, I get the following comparison:
> > 
> > alx@devuan:~/src/linux/man-pages/man-pages/main$ time lexgrog man/*/* | wc
> > lexgrog: can't resolve man7/groff_man.7
> >   12475   99295  919842
> 
> Comparing anything to lexgrog isn't very interesting; it's a debugging
> tool and is not in itself very performance-sensitive.  As I've explained
> elsewhere, the interesting thing is mandb, which uses the same code
> in-process to scan a whole tree of pages in one go.  I do not expect to
> ever want to replace that with a shell pipeline.

I don't know how to compare to mandb(8), since it does other stuff, and
skips some when things haven't changed.  In any case, if this is of any
use, you may use it to compare, if you have an idea of what's more or
less the percentage of time that mandb(8) spends on this task:

alx@devuan:~/src/linux/man-pages/man-pages/master$ time mansect NAME man/ | wc
   4851   23548  169216

real    0m0.044s
user    0m0.033s
sys     0m0.015s
alx@devuan:~/src/linux/man-pages/man-pages/master$ time mandb man/ |& wc
     30     179    2487

real    0m1.341s
user    0m1.065s
sys     0m0.302s
alx@devuan:~/src/linux/man-pages/man-pages/master$ time mandb man/ |& wc
     15      80    1116

real    0m0.030s
user    0m0.013s
sys     0m0.008s


This has been run on the Linux man-pages repository, with uncompressed
pages.  I've optimized mansect(1) to be 3x faster, and slightly simpler
and more robust, compared to the version posted on the list (and xargs
doesn't need -L1 anymore).
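
For anyone wondering why dropping -L1 matters so much: with -L1,
xargs(1) starts one process per input line, while without it the file
names are batched into as few invocations as possible.  A tiny sketch,
in which the `count_invocations` helper and the file names are
illustrative only and echo(1) stands in for sed(1):

```shell
#!/bin/sh
# Count how many times xargs runs the command for three input lines.
# Each echo invocation prints exactly one line, so wc -l counts processes.
count_invocations() {
    printf '%s\n' a.1 b.1 c.1 | xargs "$@" echo | wc -l | tr -d ' '
}

echo "with -L1:    $(count_invocations -L1) processes"
echo "without -L1: $(count_invocations) processes"
```

Three processes versus one here; over thousands of pages the fork and
exec overhead dominates.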


Cheers,
Alex

-- 





Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Colin Watson
On Sun, Nov 03, 2024 at 01:05:34AM +0100, Alejandro Colomar wrote:
> Are you sure?  With a small tweak, I get the following comparison:
> 
>   alx@devuan:~/src/linux/man-pages/man-pages/main$ time lexgrog man/*/* | wc
>   lexgrog: can't resolve man7/groff_man.7
> 12475   99295  919842

Comparing anything to lexgrog isn't very interesting; it's a debugging
tool and is not in itself very performance-sensitive.  As I've explained
elsewhere, the interesting thing is mandb, which uses the same code
in-process to scan a whole tree of pages in one go.  I do not expect to
ever want to replace that with a shell pipeline.

-- 
Colin Watson (he/him)  [cjwat...@debian.org]



Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Alejandro Colomar
On Sun, Nov 03, 2024 at 01:05:42AM +0100, Alejandro Colomar wrote:
> Hi Colin,
> 
> On Sat, Nov 02, 2024 at 11:47:14PM +, Colin Watson wrote:
> > On Sat, Nov 02, 2024 at 10:36:20PM +0100, Alejandro Colomar wrote:
> > > This is quite naive, and will not work with pages that define their own
> > > stuff, since this script is not groff(1).  But it should be as fast as
> > > is possible, which is what Colin wants, is as simple as it can be (and
> > > thus relatively safe), and should work with most pages (as far as
> > > indexing is concerned, probably all?).
> > 
> > I seem to be being invoked here for something I actually don't think I
> > want at all, which suggests that wires have been crossed somewhere.  Can
> > you explain why I'd want to replace some part of a fairly well-optimized
> > and established C program with a shell pipeline?  I'm pretty certain it
> > would not be faster, at least.
> 
> Are you sure?  With a small tweak, I get the following comparison:
> 
> >   alx@devuan:~/src/linux/man-pages/man-pages/main$ time lexgrog man/*/* | wc
> >   lexgrog: can't resolve man7/groff_man.7
> > 12475   99295  919842
> > 
> >   real    0m6.166s
> >   user    0m5.132s
> >   sys     0m1.336s
> >   alx@devuan:~/src/linux/man-pages/man-pages/main$ time mansect NAME man/ \
> >   | groff -man -Tutf8 | wc
> >  9830   27109  689478
> > 
> >   real    0m0.156s
> >   user    0m0.219s
> >   sys     0m0.019s
> 
> Yes, I'm working with uncompressed pages.  We'd need to add support for
> handling compressed pages.  Also, we'd need to compare the performance
> of lexgrog(1) with compressed pages.  But for a starter, this suggests
> some good performance.
> 
> (I say with a small tweak, because the version I've posted uses
>  xargs -L1, but I've tested for performance without the -L1, which is
>  the main bottleneck.  It has no consequences for the NAME.  I need to
>  work out some nasty details with sed -n1 for the generic version,

s/n1/n/

>  though.)
> 
> 
> Have a lovely night!
> Alex
> 
> -- 
> 



-- 





Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Alejandro Colomar
Hi Colin,

On Sat, Nov 02, 2024 at 11:47:14PM +, Colin Watson wrote:
> On Sat, Nov 02, 2024 at 10:36:20PM +0100, Alejandro Colomar wrote:
> > This is quite naive, and will not work with pages that define their own
> > stuff, since this script is not groff(1).  But it should be as fast as
> > is possible, which is what Colin wants, is as simple as it can be (and
> > thus relatively safe), and should work with most pages (as far as
> > indexing is concerned, probably all?).
> 
> I seem to be being invoked here for something I actually don't think I
> want at all, which suggests that wires have been crossed somewhere.  Can
> you explain why I'd want to replace some part of a fairly well-optimized
> and established C program with a shell pipeline?  I'm pretty certain it
> would not be faster, at least.

Are you sure?  With a small tweak, I get the following comparison:

alx@devuan:~/src/linux/man-pages/man-pages/main$ time lexgrog man/*/* | wc
lexgrog: can't resolve man7/groff_man.7
  12475   99295  919842

real    0m6.166s
user    0m5.132s
sys     0m1.336s
alx@devuan:~/src/linux/man-pages/man-pages/main$ time mansect NAME man/ \
| groff -man -Tutf8 | wc
   9830   27109  689478

real    0m0.156s
user    0m0.219s
sys     0m0.019s

Yes, I'm working with uncompressed pages.  We'd need to add support for
handling compressed pages.  Also, we'd need to compare the performance
of lexgrog(1) with compressed pages.  But for a starter, this suggests
some good performance.

(I say with a small tweak, because the version I've posted uses
 xargs -L1, but I've tested for performance without the -L1, which is
 the main bottleneck.  It has no consequences for the NAME.  I need to
 work out some nasty details with sed -n1 for the generic version,
 though.)


Have a lovely night!
Alex

-- 





Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Colin Watson
On Sat, Nov 02, 2024 at 10:36:20PM +0100, Alejandro Colomar wrote:
> This is quite naive, and will not work with pages that define their own
> stuff, since this script is not groff(1).  But it should be as fast as
> is possible, which is what Colin wants, is as simple as it can be (and
> thus relatively safe), and should work with most pages (as far as
> indexing is concerned, probably all?).

I seem to be being invoked here for something I actually don't think I
want at all, which suggests that wires have been crossed somewhere.  Can
you explain why I'd want to replace some part of a fairly well-optimized
and established C program with a shell pipeline?  I'm pretty certain it
would not be faster, at least.

Thanks,

-- 
Colin Watson (he/him)  [cjwat...@debian.org]



Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Alejandro Colomar
Hi Branden, Colin,

On Sat, Nov 02, 2024 at 11:40:13AM +0100, Alejandro Colomar wrote:
> > I also of course have ideas for generalizing the feature, so that you
> > can request any (sub)section by name, and, with a bit more ambition,[4]
> > paragraph tags (`TP`) too.
> > 
> > So you could do things like:
> > 
> > nroff -man -d EXTRACT="RETURN VALUE" man3/bsearch.3
> 
> I certainly use this.
> 
>   #  man_section()  prints specific manual page sections (DESCRIPTION, SYNOPSIS,
>   # ...) of all manual pages in a directory (or in a single manual page file).
>   # Usage example:  .../man-pages$ man_section man2 SYNOPSIS 'SEE ALSO';
> 
>   man_section()
>   {
>       if [ $# -lt 2 ]; then
>           >&2 echo "Usage: ${FUNCNAME[0]}  ...";
>           return $EX_USAGE;
>       fi
> 
>       local page="$1";
>       shift;
>       local sect="$*";
> 
>       find "$page" -type f \
>       |xargs wc -l \
>       |grep -v -e '\b1 ' -e '\btotal\b' \
>       |awk '{ print $2 }' \
>       |sort \
>       |while read -r manpage; do
>           (sed -n '/^\.TH/,/^\.SH/{/^\.SH/!p}' <"$manpage";
>            for s in $sect; do
>               <"$manpage" \
>               sed -n \
>                   -e "/^\.SH $s/p" \
>                   -e "/^\.SH $s/,/^\.SH/{/^\.SH/!p}";
>            done;) \
>           |mandoc -Tutf8 2>/dev/null \
>           |col -pbx;
>       done;
>   }

On the other hand, you may want to just package this small shell script
(or rather a part of it) as a program.

How about this?

$ cat /usr/local/bin/mansect
#!/bin/sh

if [ $# -lt 1 ]; then
    >&2 echo "Usage: $0 SECTION [FILE ...]";
    exit 1;
fi

s="$1";
shift;


if test -z "$*"; then
    # No files given: filter standard input.
    sed -n \
        -e '/^\.TH/,/^\.SH/{/^\.SH/!p}' \
        -e '/^\.SH '"$s"'$/p' \
        -e '/^\.SH '"$s"'$/,/^\.SH/{/^\.SH/!p}' \
        ;
else
    # Drop wc's total line and one-line files, then filter each page.
    find "$@" -not -type d \
    | xargs wc -l \
    | sed '${/ total$/d}' \
    | grep -v '\b1 ' \
    | awk '{ print $2 }' \
    | xargs -L1 sed -n \
        -e '/^\.TH/,/^\.SH/{/^\.SH/!p}' \
        -e '/^\.SH '"$s"'$/p' \
        -e '/^\.SH '"$s"'$/,/^\.SH/{/^\.SH/!p}' \
        ;
fi;

This only filters the source of the page, producing output that's
suitable for the groff pipeline.

alx@devuan:~$ man -w proc | xargs cat | mansect NAME
.TH proc 5 2024-06-15 "Linux man-pages 6.9.1-158-g2ac94c631"
.SH NAME
proc \- process information, system information, and sysctl pseudo-filesystem
alx@devuan:~$ man -w strtol strtoul | xargs mansect 'NAME'
.TH strtol 3 2024-07-23 "Linux man-pages 6.9.1-158-g2ac94c631"
.SH NAME
strtol, strtoll, strtoq \- convert a string to a long integer
.TH strtoul 3 2024-07-23 "Linux man-pages 6.9.1-158-g2ac94c631"
.SH NAME
strtoul, strtoull, strtouq \- convert a string to an unsigned long integer

You can request several sections with a regex:

$ man -w strtol strtoul | xargs mansect '\(NAME\|SEE ALSO\)'
.TH strtol 3 2024-07-23 "Linux man-pages 6.9.1-158-g2ac94c631"
.SH NAME
strtol, strtoll, strtoq \- convert a string to a long integer
.SH SEE ALSO
.BR atof (3),
.BR atoi (3),
.BR atol (3),
.BR strtod (3),
.BR strtoimax (3),
.BR strtoul (3)
.TH strtoul 3 2024-07-23 "Linux man-pages 6.9.1-158-g2ac94c631"
.SH NAME
strtoul, strtoull, strtouq \- convert a string to an unsigned long integer
.SH SEE ALSO
.BR a64l (3),
.BR atof (3),
.BR atoi (3),
.BR atol (3),
.BR strtod (3),
.BR strtol (3),
.BR strtoumax (3)

And it can then be piped to groff(1) to format the entire set of pages:

$ man -w strtol strtoul | xargs mansect '\(NAME\|SEE ALSO\)' | groff -man -Tutf8
strtol(3)  Library Functions Manual  strtol(3)

NAME
 strtol, strtoll, strtoq - convert a string to a long integer

SEE ALSO
 atof(3), atoi(3), atol(3), strtod(3), strtoimax(3), strtoul(3)

Linux man‐pages 6.9.1‐158‐g2ac...  2024‐07‐23  strtol(3)

───
strtoul(3)  Library Functions Manual  strtoul(3)

NAME
 strtoul, strtoull, strtouq - conve

Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Colin Watson
On Sat, Nov 02, 2024 at 05:08:37AM -0500, G. Branden Robinson wrote:
> On GNU/Linux systems, the only man page indexer I know of is Colin
> Watson's man-db--specifically, its mandb(8) program.  But it's nicely
> designed so that the "topic and summary description extraction" task is
> delegated to a standalone tool, lexgrog(1), and we can use that.
> 
> $ lexgrog /tmp/proc_pid_fdinfo_mini.5
> /tmp/proc_pid_fdinfo_mini.5: parse failed
> 
> Oh, damn.  I wasn't expecting that.  Maybe this is what defeats Michael
> Kerrisk's scraper with respect to groff's man pages.[1]

How embarrassing.  Could somebody please file a bug on
https://gitlab.com/man-db/man-db/-/issues to remind me to fix that?  (Of
course there'll be a lead time for fixes to get into distributions.)

> Well, I can find a silver lining here, because it gives me an even
> better reason than I had to pitch an idea I've been kicking around for a
> while.  Why not enhance groff man(7) to support a mode where _it_ will
> spit out the "Name"/"NAME" section, and only that, _for_ you?
> 
> This would be as easy as checking for an option, say '-d EXTRACT=Name',
> and having the package's "TH" and "SH" macro definitions divert
> (literally, with the `di` request) everything _except_ the section of
> interest to a diversion that is then never called/output.  (This is
> similar to an m4 feature known as the "black hole diversion".)
> 
> All of the features necessary to implement this[2] were part of troff as
> far back as the birth of the man(7) package itself.  It's not clear
> to me why it wasn't done back in the 1980s.
> 
> lexgrog(1) itself will of course have to stay around for years to come,
> but this could take a significant distraction off of Colin's plate--I
> believe I have seen him grumble about how much *roff syntax he has to
> parse to have the feature be workable, and that's without upstart groff
> maintainers exploring up to every boundary that existed even in 1979 and
> cheerfully exercising their findings in man pages.

lexgrog(1) is a useful (if oddly-named, sorry) debugging tool, but if
you focus on that then you'll end up with a design that's not very
useful.  What really matters is indexing the whole system's manual
pages, and mandb(8) does not do that by invoking lexgrog(1) one page at
a time, but rather by running more or less the same code in-process.  I
already know that getting acceptable performance for this requires care,
as illustrated by one of the NEWS entries for man-db 2.10.0:

 * Significantly improve `mandb(8)` and `man -K` performance in the common
   case where pages are of moderate size and compressed using `zlib`: `mandb
   -c` goes from 344 seconds to 10 seconds on a test system.

... so I'm prepared to bet that forking nroff one page at a time will be
unacceptably slow.  (This also combines with the fact that man-db
applies some sandboxing when it's calling nroff just in case it might
happen that a moderately-sized C++ project has less than 100% perfect
security when doing text processing, which I'm sure everyone agrees
would never happen.)

If it were possible to run nroff over a whole batch of pages and get
output for each of them in one go, then maybe.  man-db would need a
reliable way to associate each line (or sometimes multiple lines) of
output with each source file, and of course care would be needed around
error handling and so on.  I can see the appeal, in terms of processing
the actual language rather than a pile of hacks that try to guess what
to do with it - but on the other hand this starts to feel like a much
less natural fit for the way nroff is run in every other situation,
where you're processing one document at a time.
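
As a rough illustration of the bookkeeping described above (associating
each output line with its source file when a whole batch is handled by
one process), here is a minimal sketch.  It works on raw man(7) source
with awk(1), not on nroff output, and it is not how man-db does it; the
function name name_lines is made up, and it only recognizes an unquoted
"NAME" or "Name" heading.

```shell
# Sketch only: print every line of each page's NAME section, prefixed
# with the name of the file it came from, using a single awk process
# for the whole batch.
name_lines() {
    awk '
        # A new file starts: forget any section state from the last one.
        FNR == 1 { in_name = 0 }
        # Any section heading switches the state on or off.
        /^\.SH/  { in_name = ($0 ~ /^\.SH (NAME|Name)$/); next }
        # Inside the NAME section: tag the line with its source file.
        in_name  { print FILENAME ": " $0 }
    ' "$@"
}
```

Called as `name_lines a.1 b.1`, it emits one `file: text` line per NAME
line, so a multi-line summary stays attributable to its page.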

Cheers,

-- 
Colin Watson (he/him)  [cjwat...@debian.org]



Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread Alejandro Colomar
Hi Branden,

On Sat, Nov 02, 2024 at 05:08:37AM -0500, G. Branden Robinson wrote:
> [adding Colin Watson to CC; and the groff list because I started musing]
> 
> Hi Alex,
> 
> At 2024-11-01T21:07:29+0100, Alejandro Colomar wrote:
> > > > > -/proc/pid/fdinfo/ \- information about file descriptors
> > > > > +.IR /proc/ pid /fdinfo " \- information about file descriptors"
> > > >
> > > > I wouldn't add formatting here for now.  That's something I prefer
> > > > to be cautious about, and if we do it, we should do it in a
> > > > separate commit.
> > > 
> > > I'll move it to a separate patch. Is the caution due to a lack of
> > > test infrastructure? That could be something to get resolved,
> > > perhaps through Google summer-of-code and the like.
> > 
> > That change might be controversial.
> 
> Then let those with objections step forward and make them!

Sure!  But that in itself (and the length of your mail) makes a strong
reason to have this in a separate commit.  :)

I'm not opposed to the change.  Only cautious.

> 
> (I may be one of them; see below.)
> 
> > We'd first need to check that all software that reads the NAME section
> > would behave well for this.
> 
> Not _all_ software, surely.  Anybody can write a craptastic man(7)
> scraper, and several have, mainly back when Web 1.0 was going to eat the
> world.  Most of those have withered on the vine.

Ahh, yeah, I committed the same mistake I criticise in others every now
and then.  $all does not really mean "all".  (-Wall, `make all`, ...)

I meant all [that I care about], which is basically groff(1) and
mandoc(1).  :)

> This is the _Linux_ man-pages project, so what matters are (1) man page
> formatters and (2) man page indexers that GNU/Linux systems actually
> use.  Where people get nervous with the "NAME" section is because of the
> indexer; if one's man(7) _formatter_ can't handle an `IR` call, it
> hasn't earned the name.

Yup.

> 
> Here's a sample input.
> 
> $ cat /tmp/proc_pid_fdinfo_mini.5
> .TH proc_pid_fdinfo_mini 5 2024-11-02 "example"
> .SH Name
> .IR /proc/ pid /fdinfo " \- information about file descriptors"
> .SH Description
> Text text text text.
> 
> Starting with formatters, let's see how they do.
> 
> $ nroff -man /tmp/proc_pid_fdinfo_mini.5
> proc_pid_fdinfo_mini(5)   File Formats Manual  proc_pid_fdinfo_mini(5)
> 
> Name
>    /proc/pid/fdinfo - information about file descriptors
> 
> Description
>    Text text text text.
> 
> example   2024‐11‐02   proc_pid_fdinfo_mini(5)
> $ mandoc /tmp/proc_pid_fdinfo_mini.5 | ul
> proc_pid_fdinfo_mini(5)   File Formats Manual  proc_pid_fdinfo_mini(5)
> 
> Name
>    /proc/pid/fdinfo - information about file descriptors
> 
> Description
>    Text text text text.
> 
> example   2024-11-02   proc_pid_fdinfo_mini(5)
> $ ~/heirloom/bin/nroff -man /tmp/proc_pid_fdinfo_mini.5 | ul
> proc_pid_fdinfo_mini(5)   File Formats Manual  proc_pid_fdinfo_mini(5)
> 
> 
> 
> Name
>    /proc/pid/fdinfo - information about file descriptors
> 
> Description
>    Text text text text.
> 
> 
> 
> example   2024-11-02   proc_pid_fdinfo_mini(5)
> $ DWBHOME=~/dwb ~/dwb/bin/nroff -man /tmp/proc_pid_fdinfo_mini.5 | cat -s | ul
> 
>proc_pid_fdinfo_mini(5)example (2024-11-02)roc_pid_fdinfo_mini(5)
> 
>Name
> /proc/pid/fdinfo - information about file descriptors
> 
>Description
> Text text text text.
> 
>Page 1(printed 11/2/2024)
> 
> I leave the execution of these to perceive the correct font style
> changes as an exercise for the reader, but they all get the
> "/proc/pid/fdinfo" line right.
> 
> On GNU/Linux systems, the only man page indexer I know of is Colin
> Watson's man-db--specifically, its mandb(8) program.  But it's nicely
> designed so that the "topic and summary description extraction" task is
> delegated to a standalone tool, lexgrog(1), and we can use that.
> 
> $ lexgrog /tmp/proc_pid_fdinfo_mini.5
> /tmp/proc_pid_fdinfo_mini.5: parse failed
> 
> Oh, damn.  I wasn't expecting that.  Maybe this is what defeats Michael
> Kerrisk's scraper with respect to groff's man pages.[1]
> 
> Well, I can find a silver lining here, because it gives me an even
> better reason than I had to pitch an idea I've been kicking around for a
> while.  Why not enhance groff man(7) to support a mode where _it_ will
> spit out the "Name"/"NAME" section, and only that, _for_ you?
> 
> This would be as easy as checking for an option, say '-d EXTRACT=Name',
> and having the package's "TH" and "SH" macro definitions divert
> (literally, with the `di` request) everything _except_ the section of
> interest to a diversion that is then never called/output.  (This is
> similar to an m4 feature known as the "black hole diversion".)

Sounds good.  And then lexgrog(1) would be a one-liner that calls
groff(1) with

Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

2024-11-02 Thread G. Branden Robinson
[adding Colin Watson to CC; and the groff list because I started musing]

Hi Alex,

At 2024-11-01T21:07:29+0100, Alejandro Colomar wrote:
> > > > -/proc/pid/fdinfo/ \- information about file descriptors
> > > > +.IR /proc/ pid /fdinfo " \- information about file descriptors"
> > >
> > > I wouldn't add formatting here for now.  That's something I prefer
> > > to be cautious about, and if we do it, we should do it in a
> > > separate commit.
> > 
> > I'll move it to a separate patch. Is the caution due to a lack of
> > test infrastructure? That could be something to get resolved,
> > perhaps through Google summer-of-code and the like.
> 
> That change might be controversial.

Then let those with objections step forward and make them!

(I may be one of them; see below.)

> We'd first need to check that all software that reads the NAME section
> would behave well for this.

Not _all_ software, surely.  Anybody can write a craptastic man(7)
scraper, and several have, mainly back when Web 1.0 was going to eat the
world.  Most of those have withered on the vine.

This is the _Linux_ man-pages project, so what matters are (1) man page
formatters and (2) man page indexers that GNU/Linux systems actually
use.  Where people get nervous with the "NAME" section is because of the
indexer; if one's man(7) _formatter_ can't handle an `IR` call, it
hasn't earned the name.

Here's a sample input.

$ cat /tmp/proc_pid_fdinfo_mini.5
.TH proc_pid_fdinfo_mini 5 2024-11-02 "example"
.SH Name
.IR /proc/ pid /fdinfo " \- information about file descriptors"
.SH Description
Text text text text.

Starting with formatters, let's see how they do.

$ nroff -man /tmp/proc_pid_fdinfo_mini.5
proc_pid_fdinfo_mini(5)   File Formats Manual  proc_pid_fdinfo_mini(5)

Name
   /proc/pid/fdinfo - information about file descriptors

Description
   Text text text text.

example   2024‐11‐02   proc_pid_fdinfo_mini(5)
$ mandoc /tmp/proc_pid_fdinfo_mini.5 | ul
proc_pid_fdinfo_mini(5)   File Formats Manual  proc_pid_fdinfo_mini(5)

Name
   /proc/pid/fdinfo - information about file descriptors

Description
   Text text text text.

example   2024-11-02   proc_pid_fdinfo_mini(5)
$ ~/heirloom/bin/nroff -man /tmp/proc_pid_fdinfo_mini.5 | ul
proc_pid_fdinfo_mini(5)   File Formats Manual  proc_pid_fdinfo_mini(5)



Name
   /proc/pid/fdinfo - information about file descriptors

Description
   Text text text text.



example   2024-11-02   proc_pid_fdinfo_mini(5)
$ DWBHOME=~/dwb ~/dwb/bin/nroff -man /tmp/proc_pid_fdinfo_mini.5 | cat -s | ul

   proc_pid_fdinfo_mini(5)example (2024-11-02)roc_pid_fdinfo_mini(5)

   Name
/proc/pid/fdinfo - information about file descriptors

   Description
Text text text text.

   Page 1(printed 11/2/2024)

I leave the execution of these to perceive the correct font style
changes as an exercise for the reader, but they all get the
"/proc/pid/fdinfo" line right.

On GNU/Linux systems, the only man page indexer I know of is Colin
Watson's man-db--specifically, its mandb(8) program.  But it's nicely
designed so that the "topic and summary description extraction" task is
delegated to a standalone tool, lexgrog(1), and we can use that.

$ lexgrog /tmp/proc_pid_fdinfo_mini.5
/tmp/proc_pid_fdinfo_mini.5: parse failed

Oh, damn.  I wasn't expecting that.  Maybe this is what defeats Michael
Kerrisk's scraper with respect to groff's man pages.[1]

Well, I can find a silver lining here, because it gives me an even
better reason than I had to pitch an idea I've been kicking around for a
while.  Why not enhance groff man(7) to support a mode where _it_ will
spit out the "Name"/"NAME" section, and only that, _for_ you?

This would be as easy as checking for an option, say '-d EXTRACT=Name',
and having the package's "TH" and "SH" macro definitions divert
(literally, with the `di` request) everything _except_ the section of
interest to a diversion that is then never called/output.  (This is
similar to an m4 feature known as the "black hole diversion".)

All of the features necessary to implement this[2] were part of troff as
far back as the birth of the man(7) package itself.  It's not clear
to me why it wasn't done back in the 1980s.

lexgrog(1) itself will of course have to stay around for years to come,
but this could take a significant distraction off of Colin's plate--I
believe I have seen him grumble about how much *roff syntax he has to
parse to have the feature be workable, and that's without upstart groff
maintainers exploring up to every boundary that existed even in 1979 and
cheerfully exercising their findings in man pages.

I also of course have ideas for generalizing the feature, so that you
can request any (sub)section by name, and, with a bit more ambition,[4]
paragraph tags (`TP`) too.

So you could do thing