[PATCH 0/2] cindex: future-proof blob OID indexing

2023-12-05 Thread Eric Wong
1/2 fixes a bug while checking over the blob OID indexing code Eric Wong (2): searchidx: drop redundant decl in index_git_blob_id cindex: index full (40/64 char) hex blob OIDs lib/PublicInbox/CodeSearchIdx.pm | 15 +-- lib/PublicInbox/SearchIdx.pm | 1 - t/cindex.t

[PATCH 2/2] cindex: index full (40/64 char) hex blob OIDs

2023-12-05 Thread Eric Wong
This future proofs the index against git auto-abbreviation needing more characters as the repo grows. It'll be useful for joining against inboxes using dfpre. As with emails, we'll continue indexing abbreviated blob OIDs down to 7 hex characters so a SHA-1 git repo will have all abbreviations of
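
A rough sketch of what indexing every abbreviation of a blob OID could look like, written in Python rather than the Perl used by CodeSearchIdx.pm. The helper name `oid_abbrevs' and the `min_len' parameter are illustrative assumptions; the actual patch may generate its terms differently.

```python
import re

# Hypothetical helper for illustration; not taken from the patch.
_FULL_OID = re.compile(r"[0-9a-f]{40}|[0-9a-f]{64}")

def oid_abbrevs(oid_hex, min_len=7):
    """Return every prefix of a full hex OID, from min_len chars up to the full OID."""
    if not _FULL_OID.fullmatch(oid_hex):
        raise ValueError("expected a 40- or 64-char lowercase hex OID")
    return [oid_hex[:n] for n in range(min_len, len(oid_hex) + 1)]
```

With a 7-character floor, a SHA-1 OID yields 34 terms and a SHA-256 OID yields 58, so a query using any abbreviation git might print still matches.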

[PATCH 0/4] DragonFly-related fixes

2023-11-30 Thread Eric Wong
2/4 probably affects NetBSD and OpenBSD, too, but tests don't always fail... Eric Wong (4): t/xap_helper: make sendmsg errors more obvious xap_helper.h: fix non-assignable stderr case tests: note kevent+tmpfs failures on DragonFly <= 6.4 xap_helper: enable stderr assignment on Dragon

[PATCH 3/4] tests: note kevent+tmpfs failures on DragonFly <= 6.4

2023-11-30 Thread Eric Wong
I forgot to set TMPDIR=/path/to/non-tmpfs again. --- lib/PublicInbox/TestCommon.pm | 23 ++- t/dir_idle.t | 7 +-- t/kqnotify.t | 2 +- 3 files changed, 28 insertions(+), 4 deletions(-) diff --git a/lib/PublicInbox/TestCommon.pm

[PATCH 1/4] t/xap_helper: make sendmsg errors more obvious

2023-11-30 Thread Eric Wong
By ignoring SIGPIPE, we hit our own error path and emit an informative error message instead of dying abruptly and requiring somebody to run `echo $?' to see the child status from their shell. --- t/xap_helper.t | 1 + 1 file changed, 1 insertion(+) diff --git a/t/xap_helper.t b/t/xap_helper.t
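
The effect of ignoring SIGPIPE can be sketched as follows (in Python; the test itself is Perl). With the signal ignored, writing to a pipe whose reader is gone fails with EPIPE instead of killing the process, so the caller gets to print its own message. `write_or_report' is a made-up name for illustration. Python already ignores SIGPIPE at startup, so the explicit call only makes the intent visible.

```python
import errno
import os
import signal

def write_or_report(fd, data):
    """Write data to fd; return None on success or an error string on failure."""
    # Ignore SIGPIPE so a vanished reader surfaces as an EPIPE error
    # rather than terminating the process (must run in the main thread).
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
    try:
        os.write(fd, data)
        return None
    except OSError as e:
        name = errno.errorcode.get(e.errno, str(e.errno))
        return "write failed: %s: %s" % (name, e.strerror)
```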

[PATCH 4/4] xap_helper: enable stderr assignment on DragonFly

2023-11-30 Thread Eric Wong
It looks like DragonFly inherited this from FreeBSD to allow us to save some syscalls. --- lib/PublicInbox/xap_helper.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h index c1ab66f3..1f8c426b 100644 ---

[PATCH 2/4] xap_helper.h: fix non-assignable stderr case

2023-11-30 Thread Eric Wong
I mixed up "flush" with "close" :x Fixes: 87b7f633f241 (xap_helper: implement mset endpoint for WWW, IMAP, etc...) --- lib/PublicInbox/xap_helper.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/xap_helper.h b/lib/PublicInbox/xap_helper.h index

[PATCH] doc: config: fix grammar for nameIsUrl

2023-11-30 Thread Eric Wong
Kyle Meyer wrote: > Eric Wong writes: > > +Treat the name of the public inbox as it's unqualified URL when > > s/it's/its/ Thanks, will push this fix out: ---8<-- Subject: [PATCH] doc: config: fix grammar for nameIsUrl Reported-by: Kyle Meyer Link: https://pub

[PATCH v2] codesearch: use retry_reopen for WWW

2023-11-30 Thread Eric Wong
As with mail search, a cindex may be updated while WWW is serving requests. Thus we must reopen the Xapian DB when the revision we're using becomes stale. --- v2: avoid reintroducing load_ct as noted in https://public-inbox.org/meta/20231130213641.M35664@dcvr/ lib/PublicInbox/CodeSearch.pm |

Re: [PATCH 06/15] cindex: store extensions.objectFormat with repo data

2023-11-30 Thread Eric Wong
Eric Wong wrote: > +++ b/lib/PublicInbox/CodeSearch.pm > @@ -242,15 +247,21 @@ sub paths2roots { > \%ret; > } > > +sub load_ct { # retry_reopen cb > + my ($self, $git_dir) = @_; > + my @ids = docids_of_git_dir $self, $git_dir or return; > + for

[PATCH 08/15] cindex: skip getpid guard for most OnDestroy use

2023-11-30 Thread Eric Wong
We no longer fork after cidx_init, so there's no need to spend CPU cycles on the getpid() syscall, especially since it's no longer cached on glibc while syscalls are also more expensive these days due to CPU vulnerability mitigations. --- lib/PublicInbox/CodeSearchIdx.pm | 22

[PATCH 14/15] inbox: shrink data structures for publicinbox.*.hide

2023-11-30 Thread Eric Wong
We no longer vivify the intermediate $ibx->{-hide} hashref, instead we use $ibx->{-hide_$KEY} directly. This avoids an intermediate hashref and extra hash table lookups. --- lib/PublicInbox/CodeSearch.pm | 2 +- lib/PublicInbox/Inbox.pm | 8 ++-- lib/PublicInbox/WwwListing.pm | 2 +- 3

[PATCH 13/15] www_listing: support publicInbox.nameIsUrl

2023-11-30 Thread Eric Wong
This is a convenient (and slightly memory-saving) alternative to specifying a `publicinbox.*.url' entry for every single inbox when using publicinbox.wwwListing. --- Documentation/public-inbox-config.pod | 19 ++- lib/PublicInbox/WwwListing.pm | 21 + 2

[PATCH 12/15] git_async_cat: use git from "all" extindex if possible

2023-11-30 Thread Eric Wong
For inboxes associated with an extindex (currently only the special "all" one), we can share the git process across all those inboxes unambiguously when retrieving full SHA-1 blobs. The comment for my proposed patch is also out-of-date as that git speedup has been a part of git since 2.33. ---

[PATCH 15/15] codesearch: use retry_reopen for WWW

2023-11-30 Thread Eric Wong
As with mail search, a cindex may be updated while WWW is serving requests. Thus we must reopen the Xapian DB when the revision we're using becomes stale. --- lib/PublicInbox/CodeSearch.pm | 25 +++-- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git

[PATCH 10/15] cindex: speed up initial scan setup phase

2023-11-30 Thread Eric Wong
This brings a no-op -cindex scan of a git.kernel.org mirror down from 70s to 10s with a hot cache on a busy machine. CPU-intensive SHA-256 fingerprinting of the `git show-ref' result can be parallelized on shard workers. Future changes can move more of the initial scan setup phase into shard
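
A minimal sketch of the fingerprinting idea, in Python instead of the project's Perl: hash the `git show-ref' output and compare it with a previously stored value to decide whether a repo needs scanning at all. The function names and the way the stored fingerprint is passed in are assumptions for illustration; how cindex actually stores and compares fingerprints is not shown in this snippet.

```python
import hashlib

def refs_fingerprint(show_ref_output):
    """SHA-256 hex digest of the raw bytes printed by `git show-ref`."""
    return hashlib.sha256(show_ref_output).hexdigest()

def needs_scan(show_ref_output, stored_fingerprint):
    """True when the refs changed (or no fingerprint was stored yet)."""
    return refs_fingerprint(show_ref_output) != stored_fingerprint
```

Since each repo is hashed independently, the work can be spread across workers, which is the kind of parallelism the commit message describes.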

[PATCH 09/15] spawn: drop IO layer support from redirects

2023-11-30 Thread Eric Wong
When setting up stdin for commands, the write_file API is convenient enough nowadays to not be worth having special support with process spawning. When reading stdout of commands, we should probably be using utf8_maybe everywhere since there'll always be legacy encodings in git repos. Reading

[PATCH 02/15] codesearch: allow inbox count to exceed matches

2023-11-30 Thread Eric Wong
It's entirely possible for public inboxes to have zero patches in them, so the number of match slots may not match the number of joined ekeys. --- lib/PublicInbox/CodeSearch.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/CodeSearch.pm

[PATCH 11/15] inbox: expire resources more aggressively

2023-11-30 Thread Eric Wong
We no longer trigger git cleanups from the Inbox package since `git cat-file' users have their own cleanup to support git coderepos not associated with any inbox. This change means we unconditionally expire SQLite and Xapian FDs and some internal caches regardless of git activity. The old logic

[PATCH 07/15] git: share unlinked pack checking code with gcf2

2023-11-30 Thread Eric Wong
It saves some code in case we keep libgit2 around. --- lib/PublicInbox/Gcf2.pm | 16 lib/PublicInbox/Git.pm | 27 ++- 2 files changed, 18 insertions(+), 25 deletions(-) diff --git a/lib/PublicInbox/Gcf2.pm b/lib/PublicInbox/Gcf2.pm index

[PATCH 00/15] various cindex fixes + speedups

2023-11-30 Thread Eric Wong
Notable changes: 10/15 provides a huge speedup which will hopefully make future developments faster. 12/15 probably obsoletes libgit2 for extindex "all" users. 13/15 can save some memory with many inboxes while making configuration easier. Eric Wong (15): cindex: fix store_repo+r

[PATCH 06/15] cindex: store extensions.objectFormat with repo data

2023-11-30 Thread Eric Wong
This will allow WWW to use a combined LeiALE-like thing to reduce git processes. --- lib/PublicInbox/CodeSearch.pm| 27 -- lib/PublicInbox/CodeSearchIdx.pm | 161 +-- 2 files changed, 127 insertions(+), 61 deletions(-) diff --git

[PATCH 04/15] cindex: only create {-cidx_err} field on failures

2023-11-30 Thread Eric Wong
We only use it as a boolean flag, and there's no need to waste space for common, non-error cases. --- lib/PublicInbox/CodeSearchIdx.pm | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm index

[PATCH 05/15] cindex: keep batch pipe for pruning SHA-256 repos

2023-11-30 Thread Eric Wong
This fixes the case where we're running both SHA-256 and SHA-1. There's no tests for SHA-256, yet, but the bug is pretty obvious upon reading the code. --- lib/PublicInbox/CodeSearchIdx.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/CodeSearchIdx.pm

[PATCH 01/15] cindex: fix store_repo+repo_stored on no-op

2023-11-30 Thread Eric Wong
It's possible to update the fingerprint for a given repo when we have no commits to index on because they were already done for another repo. Thus we'll always vivify $repo_ctx->{active} before calling store_repo since $active may've been undef. --- lib/PublicInbox/CodeSearchIdx.pm | 6 +++--- 1

[PATCH 03/15] config: reject newlines consistently in dir names

2023-11-30 Thread Eric Wong
Explicitly drop support for "\n" in git coderepo pathnames as we do other stuff. Gcf2 (our libgit2 helper) was always broken with "\n" in pathnames, and I'm not sure if cgit config files work with them, either. Dealing with newline characters requires extra complexity that I'm not willing to

Re: OpenBSD debugging

2023-11-29 Thread Eric Wong
Štěpán Němec wrote: > > I apologize for the late response. No worries, I still have mails in other places from months ago I've been meaning to get to :x > On Mon, 23 Oct 2023 19:58:18 +0000 > Eric Wong wrote: > > > Thanks for the info. Just curious, what

Re: [PATCH 2/2] doc: fix a few typos and wording issues

2023-11-29 Thread Eric Wong
Thanks, both patches in this series pushed

Re: extra search flags and params? (ispatch, replycount, ...)

2023-11-28 Thread Eric Wong
Konstantin Ryabitsev wrote: > On Tue, Nov 28, 2023 at 06:20:03PM +0000, Eric Wong wrote: > > Though being able to find unanswered threads could be helpful. > > Note, I'm not saying it's not a cool feature. :) However, I imagine people > would be more interested in searching

Re: extra search flags and params? (ispatch, replycount, ...)

2023-11-28 Thread Eric Wong
Konstantin Ryabitsev wrote: > On Tue, Nov 28, 2023 at 05:35:09PM +0000, Eric Wong wrote: > > > I understand the reasoning, but I'm not sure we should be trying too hard > > > to > > > make public-inbox a patch tracking platform. What makes lei great is > > &

[PATCH 15/14] www: load cindex join data for ->ALL, too

2023-11-28 Thread Eric Wong
This ensures the /all/ extindex can have auto-associations with coderepos just like normal inboxes do. --- lib/PublicInbox/CodeSearch.pm | 9 + 1 file changed, 9 insertions(+) diff --git a/lib/PublicInbox/CodeSearch.pm b/lib/PublicInbox/CodeSearch.pm index 7c0dd063..5c5774cf 100644 ---

[PATCH 1/4] lei q: fix --no-import-before completion + docs

2023-11-28 Thread Eric Wong
--no-import-before skips importing entire messages, not just keywords, so it can cause permanent data loss if -o is pointed to precious data. --- Documentation/lei-q.pod | 5 +++-- lib/PublicInbox/LEI.pm | 1 + t/lei-q-kw.t| 19 --- 3 files changed, 20

[PATCH 2/4] www: mail_diff: fix optional address obfuscation

2023-11-28 Thread Eric Wong
We need to load the proper package and fully-qualify the sub call since we shouldn't load Hval in lei. Some users use this feature even if it's broken, oh well :< --- lib/PublicInbox/MailDiff.pm | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/lib/PublicInbox/MailDiff.pm

[PATCH 4/4] www: mail_diff: add missing tag

2023-11-28 Thread Eric Wong
Found by tidy(1) while dealing with other stuff. --- lib/PublicInbox/MailDiff.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/MailDiff.pm b/lib/PublicInbox/MailDiff.pm index 89284e39..e4e262ef 100644 --- a/lib/PublicInbox/MailDiff.pm +++

[PATCH 0/4] non-cindex-related stuff

2023-11-28 Thread Eric Wong
Well, I actually found the mail_diff bugs while looking into micro-optimizing -cindex. Eric Wong (4): lei q: fix --no-import-before completion + docs www: mail_diff: fix optional address obfuscation www: mail_diff: add final newline before diffing www: mail_diff: add missing tag

[PATCH 3/4] www: mail_diff: add final newline before diffing

2023-11-28 Thread Eric Wong
This gets rid of the "\ No newline at end of file" since it's distracting noise. --- lib/PublicInbox/MailDiff.pm | 2 +- t/lei-mail-diff.t | 1 + t/psgi_v2.t | 1 + 3 files changed, 3 insertions(+), 1 deletion(-) diff --git a/lib/PublicInbox/MailDiff.pm

Re: extra search flags and params? (ispatch, replycount, ...)

2023-11-28 Thread Eric Wong
Konstantin Ryabitsev wrote: > On Tue, Nov 28, 2023 at 12:10:28AM +0000, Eric Wong wrote: > > Would they be useful? > > > > It's not currently possible to quickly search for whether or not > > a term (e.g. patchid:) is present in a Xapian document. Having > >

[PATCH 05/14] xap_helper.h: move cindex endpoints to separate file

2023-11-28 Thread Eric Wong
It ought to help a bit with organization since xap_helper.h is getting somewhat large and we'll need new endpoints to support WWW, lei, and whatever else that needs to come. --- MANIFEST| 1 + lib/PublicInbox/XapHelperCxx.pm | 10 +- lib/PublicInbox/xap_helper.h|

[PATCH 02/14] t/cindex*: require SCM_RIGHTS for these tests

2023-11-28 Thread Eric Wong
Code search will require SCM_RIGHTS, and Inline::C on BSDs probably isn't too onerous a dependency for new features as all the ones I've tested have it packaged. Furthermore, requiring SCM_RIGHTS isn't far off since OpenBSD's Perl is patched to route the `syscall' perlop through libc[1], while

[PATCH 12/14] admin: resolve_git_dir respects symlinks

2023-11-28 Thread Eric Wong
Absolute pathnames of git coderepos are stored in the cindex, but we should favor paths relative to $ENV{PWD} since it respects symlinks in the hierarchy. Respecting symlinks makes it easier to migrate cindex to new storage as old storage wears out and to relocate the storage device onto another

[PATCH 09/14] git: speed up ->git_path for non-worktrees

2023-11-28 Thread Eric Wong
Only worktrees need to use `git rev-parse --git-path', so avoid the spawn overhead of a new process. With the SolverGit.pm limit on coderepo scans disabled and scanning over 800 git repos for git@vger matches, this reduces xt/solver.t times by roughly 25%. --- lib/PublicInbox/Git.pm | 17

[PATCH 07/14] hval: use File::Spec to make relative paths for href

2023-11-28 Thread Eric Wong
File::Spec->abs2rel doesn't touch the filesystem at all when given an absolute base arg ($env->{PATH_INFO}), so we can rely on it to generate relative links that work with the `mount' from Plack::Builder and also people running `wget -r' mirrors. --- lib/PublicInbox/Hval.pm | 12 +++- 1
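
For comparison, the same purely lexical computation can be sketched in Python with posixpath.relpath, which likewise avoids the filesystem when both arguments are absolute. `rel_href' is a hypothetical name, and this sketch assumes the base (PATH_INFO) names a directory-like URL path; it is not a port of the Hval.pm change.

```python
import posixpath

def rel_href(target, path_info):
    """Relative link from the page at path_info to target (both absolute URL paths)."""
    # With two absolute paths, relpath only normalizes and compares
    # components; it never consults the current directory or the disk.
    return posixpath.relpath(target, start=path_info)
```

For example, rel_href("/linux.git/tree/README", "/linux.git/log/") gives "../tree/README", which keeps working whether the app is mounted under a prefix or mirrored to static files.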

[PATCH 04/14] solver: schedule cleanup after synchronous git->check

2023-11-28 Thread Eric Wong
We don't want hundreds of git cat-file processes for coderepos lingering around. --- lib/PublicInbox/Git.pm | 7 ++- lib/PublicInbox/SolverGit.pm | 3 +++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm index

[PATCH 14/14] www: start working on a repo listing

2023-11-28 Thread Eric Wong
The HTML is still extremely rough, but links seem to be mostly working... --- MANIFEST | 1 + lib/PublicInbox/CodeSearch.pm | 8 +++ lib/PublicInbox/RepoList.pm| 39 ++ lib/PublicInbox/WwwCoderepo.pm | 3 +++

[PATCH 01/14] test_common: create_*: detect changes all parameters

2023-11-28 Thread Eric Wong
Data::Dumper+B::Deparse seems fast enough to generate cache keys with, so this makes updating and developing tests easier (as opposed to forcing the developer to change the identifier). The main downside is we'll have to deal with cache expiration, but "make clean" seems overly aggressive already

[PATCH 10/14] cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT'

2023-11-28 Thread Eric Wong
Accepting @ARGV without switches ends up being ambiguous with optional parameters for --join and --show. Requiring users to specify `--join=' or `--show=' is a bit awkward (as it is with -clone --objstore= and the like, but that is historical baggage we need to carry at this point...) ---

[PATCH 00/14] IT'S ALIVE! www loads cindex join data

2023-11-28 Thread Eric Wong
<5 minutes if done frequently. New performance problem: solver could definitely be smarter about dealing with common roots/groups. For the longest time, I've only had 1 coderepo per-inbox, having hundreds is wacky. Actual searching against the cindex isn't done, yet, but that's kinda straightf

[PATCH 06/14] xap_helper: implement mset endpoint for WWW, IMAP, etc...

2023-11-28 Thread Eric Wong
The C++ version will allow us to take full advantage of Xapian's APIs for better queries, and the Perl bindings version can still be advantageous in the future since we'll be able to support timeouts effectively. --- MANIFEST| 1 + Makefile.PL | 8

[PATCH 11/14] git: speed up Git->new by 5% or so

2023-11-28 Thread Eric Wong
This becomes noticeable when loading lots of coderepos on my local mirror of git.kernel.org now that we can load repos from cindex. --- lib/PublicInbox/Git.pm | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm index

[PATCH 08/14] www: load and use cindex join data

2023-11-28 Thread Eric Wong
This is a major step in solving the problem of having to manually associate hundreds/thousands of coderepos with hundreds/thousands of public-inboxes to power solver (and more). --- lib/PublicInbox/CodeSearch.pm| 153 +-- lib/PublicInbox/CodeSearchIdx.pm | 42

[PATCH 03/14] codesearch: eliminate redundant substitutions

2023-11-28 Thread Eric Wong
We store the full path name and xap_terms already removes the `P' character, so the loop and substr calls are a no-op replacing `/' with `/'. --- lib/PublicInbox/CodeSearch.pm | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/PublicInbox/CodeSearch.pm b/lib/PublicInbox/CodeSearch.pm index

[PATCH 13/14] cindex: extra quit checks

2023-11-28 Thread Eric Wong
We don't want to be accessing uninitialized variables on process teardown since much of our control flow revolves around DESTROY for dependency handling. --- lib/PublicInbox/CodeSearchIdx.pm | 5 + 1 file changed, 5 insertions(+) diff --git a/lib/PublicInbox/CodeSearchIdx.pm

extra search flags and params? (ispatch, replycount, ...)

2023-11-27 Thread Eric Wong
Would they be useful? It's not currently possible to quickly search for whether or not a term (e.g. patchid:) is present in a Xapian document. Having the ability to do so would make it easier to find non-patch messages, or easily filter down to cover letters, bot replies, etc... Thus adding

[PATCH] disallow NUL characters in Message-ID and List-Id

2023-11-27 Thread Eric Wong
While MTAs seem to stop '\0' from appearing in headers, users fetching archives via git remain susceptible to having '\0' land in archives. So we'll filter them out of Xapian and SQLite DBs to avoid interoperability problems with CLI tools since there's no known messages in lore or any of my
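
One way to picture the filtering, as a Python sketch rather than the project's Perl: header values containing a NUL byte are skipped before they reach the index. Whether the real patch drops such values entirely or handles them some other way is not visible in this snippet, so treat `indexable_ids' and its drop-the-value behavior as assumptions.

```python
def indexable_ids(values):
    """Return only the Message-ID / List-Id values that contain no NUL character."""
    return [v for v in values if "\0" not in v]
```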

Re: [BUG] Unescaped '&' ampersands in atom header links

2023-11-27 Thread Eric Wong
Thanks, pushed as commit 577e421a0815e66f965bd4317adad5aeea3cc52a with your Tested-By (sent privately in <87leaj4ea1@collabora.com>)

[PATCH 1/2] xap_helper: avoid strerror(3) inside signal handler

2023-11-27 Thread Eric Wong
It's not async-signal-safe and the glibc implementation uses malloc via asnprintf. Practically it's not a problem unless the kernel OOMs and the write(2) fails to the self-pipe. --- lib/PublicInbox/xap_helper.h | 29 - 1 file changed, 12 insertions(+), 17 deletions(-)

[PATCH 0/2] xap_helper C++ fixes

2023-11-27 Thread Eric Wong
Already pushed out since I forgot which VM I was on :x Eric Wong (2): xap_helper: avoid strerror(3) inside signal handler xap_helper.h: avoid some off_t vs size_t problems lib/PublicInbox/xap_helper.h | 59 ++-- 1 file changed, 30 insertions(+), 29 deletions(-)

[PATCH 2/2] xap_helper.h: avoid some off_t vs size_t problems

2023-11-27 Thread Eric Wong
We'll introduce a helper to cast off_t to size_t consistently for mmap/munmap/calloc calls which require size_t. Also, an extra check for multiplication overflow can be helpful just in case we end up with a gigantic roots file. --- lib/PublicInbox/xap_helper.h | 30

Re: [BUG] Unescaped '&' ampersands in atom header links

2023-11-27 Thread Eric Wong
Ricardo Cañuelo wrote: > where the '&' character is escaped in the text of the tag but > not in the href attributes. Shouldn't these be escaped as well? If so, > the fix should be most likely located in WwwAtomStream.pm:atom_header(). Thanks for the bug report. Yes, '&' should be escaped,
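
To illustrate the class of bug, here is a small Python sketch (the real code is Perl, in WwwAtomStream.pm): a URL placed in an XML attribute needs the same escaping as text content, otherwise a bare '&' makes the feed ill-formed. `atom_link' is a hypothetical helper, not the function that was patched.

```python
from html import escape

def atom_link(href):
    """Build an Atom <link/> element with the href attribute value escaped."""
    # quote=True also escapes double quotes, which matters inside attributes.
    return '<link href="%s"/>' % escape(href, quote=True)
```

A query string such as "?q=a&x=A" comes out as "?q=a&amp;x=A" inside the attribute, and XML parsers decode it back to the original URL.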

[PATCH] t/nntpd-tls: avoid test failure on OpenBSD 7.3

2023-11-26 Thread Eric Wong
The LibreSSL 3.7.2 on my OpenBSD 7.3 VM seems to return 7 bytes of junk data before EOF/ECONNRESET when a client attempts to write plain-text to a TLS socket. --- t/nntpd-tls.t | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/t/nntpd-tls.t b/t/nntpd-tls.t index

[PATCH v2] drop redundant calls to DS->Reset

2023-11-26 Thread Eric Wong
Reset gets called on END{} anyways to workaround DBI lifetime problems, so there's no need to call it near exit. We can't replace calls to POSIX::_exit with `exit' to force END{} to run just yet, as there are still some lingering destruction ordering problems on newer DBI and/or Perls. ---

[PATCH 0/7] more I/O + process reliability and cleanups

2023-11-25 Thread Eric Wong
6/7 ought to fix another hang in t/lei-q-save.t when writing to v2 outputs. Much of this stuff will be relevant to code search since Xapian searches will be moved to C++ (if available) to support features which aren't usable from Perl bindings and allow more predictable performance anyways. Eric

[PATCH 5/7] git: move rbuf handling to PublicInbox::IO

2023-11-25 Thread Eric Wong
The long-term plan is to share non-blocking read buffering logic with HTTP/NNTP/IMAP/POP3 and also XapClient. --- lib/PublicInbox/Gcf2Client.pm | 1 - lib/PublicInbox/Git.pm| 59 ++- lib/PublicInbox/IO.pm | 53 ++- 3

[PATCH 3/7] xap_client: pass arguments to top-level xap_helper

2023-11-25 Thread Eric Wong
This ensures our tests actually test the -j0 and -j1 cases properly. --- lib/PublicInbox/XapClient.pm | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/PublicInbox/XapClient.pm b/lib/PublicInbox/XapClient.pm index 7737e30d..1f9ddccc 100644 --- a/lib/PublicInbox/XapClient.pm +++

[PATCH 7/7] drop redundant calls to DS->Reset

2023-11-25 Thread Eric Wong
Reset gets called on END{} anyways to workaround DBI lifetime problems, so there's no need to call it near exit. We'll also replace many calls to POSIX::_exit with the normal `exit'. This ensures END{} gets called since all of our destructors are fork-safe nowadays so POSIX::_exit is

[PATCH 6/7] git: improve coupling with {sock} and {inflight} fields

2023-11-25 Thread Eric Wong
While the {inflight} array should be tied to the IO object even more tightly, that's not an easy task with our current code. So take some small steps by introducing a gcf_inflight helper to validate the ownership of the process and to drain the inflight array via the awaitpid callback. This

[PATCH 2/7] xap_client: attach PID to the IO object

2023-11-25 Thread Eric Wong
As with our popen_* uses, we can simplify callers by using attach_pid to handle automatic reaping upon close. --- lib/PublicInbox/CodeSearchIdx.pm | 10 ++ lib/PublicInbox/XapClient.pm | 4 +++- t/xap_helper.t | 3 +-- 3 files changed, 6 insertions(+), 11

[PATCH 1/7] xap_helper_cxx: do not copy xap_helper.h source

2023-11-25 Thread Eric Wong
No need to waste memory bandwidth when we can just rely on the preprocessor to load the header. --- lib/PublicInbox/XapHelperCxx.pm | 10 +++--- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/lib/PublicInbox/XapHelperCxx.pm b/lib/PublicInbox/XapHelperCxx.pm index

[PATCH 4/7] xap_helper: allow PI_NO_CXX to disable C++ in more places

2023-11-25 Thread Eric Wong
This also reduces repetition in the setup code. --- lib/PublicInbox/XapClient.pm| 4 +--- lib/PublicInbox/XapHelperCxx.pm | 1 + t/xap_helper.t | 2 +- 3 files changed, 3 insertions(+), 4 deletions(-) diff --git a/lib/PublicInbox/XapClient.pm b/lib/PublicInbox/XapClient.pm

[PATCH 2/3] select+poll: have caller retry on EINTR

2023-11-25 Thread Eric Wong
We can't assume signals are blocked when neither signalfd nor EVFILT_SIGNAL are in use. So just return an empty result so the caller can recalculate the timeout. I found this bug while making xt/httpd-async-stream.t use our event loop to reap processes but have abandoned that effort for now

[PATCH 0/3] ds: event loop-related fixes

2023-11-25 Thread Eric Wong
Eric Wong (3): http: fix HTTP/1.1 pipelining during long async requests select+poll: have caller retry on EINTR ds: long_step: eliminate redundant fileno call lib/PublicInbox/DS.pm | 1 - lib/PublicInbox/DSPoll.pm | 6 ++-- lib/PublicInbox/HTTP.pm | 17 +- lib/PublicInbox

[PATCH 3/3] ds: long_step: eliminate redundant fileno call

2023-11-25 Thread Eric Wong
We already stash the associated FD for reporting at startup and don't need to call `fileno' again. Found via manual code inspection while considering the effort to make async {forward} from PublicInbox::HTTP more like the generic long_response API and {long_cb} field used by IMAP/NNTP/POP3. ---

[PATCH 1/3] http: fix pipelining during long async requests

2023-11-25 Thread Eric Wong
We must not attempt to read request bodies from the HTTP client while processing a long request since that drains pipelined requests. The NNTP/IMAP/POP3 event_step callbacks follow the same behavior when {long_cb} is present from ->long_response. This bug has little real-world consequence since

libgit2 [was: Re: t/cindex.t "associate w/o search" test hangs for me]

2023-11-25 Thread Eric Wong
Eric Wong wrote: > Konstantin Ryabitsev wrote: > > I'm quite happy to not require libgit2 -- I've always found it easier to > > just > > use git plumbing commands even if this requires exec'ing an external > > executable. Sorry, I forget, individual inboxes via

[PATCH v3] doc/extindex: document --dedupe switch

2023-11-25 Thread Eric Wong
Eric Wong wrote: > Štěpán Němec wrote: > > Eric Wong wrote: > > > +Rerun deduplication on messages of with the given Message-ID or > >^^^ > > not so fast :-P > > Thanks. Will s/of // when I commit when more awake. >

Re: [PATCH] doc/extindex: document --dedupe switch

2023-11-25 Thread Eric Wong
Štěpán Němec wrote: > Eric Wong wrote: > > > > I'm also wondering if it's necessary to have a blurb about NOT > > supporting comma-delimited Message-IDs on the CLI, since some > > strange Message-IDs may have a comma in them. > > I think the description is a

[PATCH] examples/unsubscribe.milter: limit scope of munging

2023-11-24 Thread Eric Wong
We don't want the milter to munge List-Unsubscribe headers from external (incoming) mlmmj lists, only lists hosted on the server running unsubscribe.milter. Adding support for an allow_domains file should've been enough, but this further restricts the milter to only operating on Postfix

[PATCH] t/cindex-join: fix warnings from a missing comma

2023-11-24 Thread Eric Wong
Yes, that was valid Perl syntax :x --- t/cindex-join.t | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/cindex-join.t b/t/cindex-join.t index 0972afa4..2836eb6c 100644 --- a/t/cindex-join.t +++ b/t/cindex-join.t @@ -41,7 +41,7 @@ EOM while (my ($url, $v, $ng) =

Re: [PATCH] doc/extindex: document --dedupe switch

2023-11-24 Thread Eric Wong
Štěpán Němec wrote: > Eric Wong wrote: > > +++ b/Documentation/public-inbox-extindex.pod > > @@ -47,6 +47,20 @@ C set to C and their respective Xapian > > public-inboxes where cross-posting is common, this allows > > significant space savings on Xapian indices. >

[PATCH] cindex: fix --join=reset and speed up incremental joins

2023-11-24 Thread Eric Wong
`reset' means we want to ignore existing join data, while the default (non-reset) means we perform an incremental join while taking into account existing (fuzzy) join data. --- lib/PublicInbox/CodeSearchIdx.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git

[PATCH] doc/extindex: document --dedupe switch

2023-11-23 Thread Eric Wong
We've had it since v1.7.0 when -extindex was introduced, but it was never documented outside of commit messages. --- Documentation/public-inbox-extindex.pod | 26 + 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/Documentation/public-inbox-extindex.pod

[PATCH] lei_saved_search: don't create Git object during ->DESTROY

2023-11-22 Thread Eric Wong
This fixes t/lei-q-save.t getting stuck since $self->{ale} is already gone by the time DESTROY gets called. --- lib/PublicInbox/LeiSavedSearch.pm | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/lib/PublicInbox/LeiSavedSearch.pm b/lib/PublicInbox/LeiSavedSearch.pm

[PATCH] watch: support `watch=false' to negate watchspam

2023-11-21 Thread Eric Wong
For users hosting read-only mirrors (via clone|fetch) and feeding inboxes via -watch --- I'm also considering a `fetchonly' directive for -learn/-mda, too; but I think overloading watch can coexist with that... Documentation/public-inbox-watch.pod | 5 - lib/PublicInbox/Watch.pm

[PATCH] lei_to_mail: don't close STDOUT unless it is a mbox* output

2023-11-21 Thread Eric Wong
We only care about error checking when stdout is an mbox output pointed to a pathname. This is noticeable with `lei up' with multiple non-mbox* destinations. We'll also ensure throwing exceptions to trigger lei->x_it from lei->do_env results in the epoll/kqueue watch being discarded, otherwise

[squash 4/3] t/cindex-join: fix alternates setup

2023-11-21 Thread Eric Wong
We'll also disable GC since fetch/clone already leaves us with packs. --- Will squash this into 3/3 t/cindex-join.t | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/t/cindex-join.t b/t/cindex-join.t index fad30d93..0972afa4 100644 --- a/t/cindex-join.t +++

[PATCH 0/3] cindex: rename `associate' to `join'

2023-11-21 Thread Eric Wong
3/3 fleshes out more join functionality, including storing the join data in compressed JSON as Xapian metadata and loading it as a Perl hash won't be excessive (compared to having 30-50k inbox names+paths in memory). Eric Wong (3): cindex: avoid unneeded and redundant `local' calls doc/cindex

[PATCH 1/3] cindex: avoid unneeded and redundant `local' calls

2023-11-21 Thread Eric Wong
We only set $MAX_SIZE at startup, and there's no need to use a local $self->{roots} for the per-repo roots array. --- lib/PublicInbox/CodeSearchIdx.pm | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/lib/PublicInbox/CodeSearchIdx.pm

[PATCH 3/3] cindex: rename --associate to --join, test w/ real repos

2023-11-21 Thread Eric Wong
The association data is just stored as deflated JSON in Xapian metadata keys of shard[0] for now. It should be reasonably compact and fit in memory for now since we'll assume sane, non-malicious git coderepo history, for now. The new cindex-join.t test requires TEST_REMOTE_JOIN=1 to be set in
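
The storage format described here, JSON compressed with deflate, can be sketched in a few lines of Python. The function names and the shape of the sample data are hypothetical; the real join data layout and the Xapian metadata keys used by cindex are not shown in this snippet.

```python
import json
import zlib

def pack_join_data(data):
    """Serialize to compact JSON, then deflate (zlib format) for storage as metadata."""
    raw = json.dumps(data, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)

def unpack_join_data(blob):
    """Inverse of pack_join_data: inflate, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

The round trip is lossless for plain JSON types, so a reader only needs to fetch one metadata value and inflate it to get the whole mapping in memory.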

[PATCH 2/3] doc/cindex: point no-fsync,dangerous to -index(1)

2023-11-21 Thread Eric Wong
There's no point in duplicating --no-fsync documentation across manpages. --dangerous can be useful for reducing SSD wear, so add a pointer to it as well. --- Documentation/public-inbox-cindex.pod | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git

[PATCH] searchidx: run `git patch-id' in parallel

2023-11-20 Thread Eric Wong
Informal benchmarks show a rough 5% indexing improvement on an SMP system when there are idle cores due to Xapian shards being I/O bound (since `git patch-id' is mainly CPU bound). This is only parallelized on a per-patch basis. Further increasing parallelism would increase complexity and

[PATCH] git: return upon self->close

2023-11-20 Thread Eric Wong
I encountered the odd lack of `return' while chasing Gcf2 bugs on CentOS 7.x which resulted in commit 7d06b126e939 ("gcf2: fix autodie usage for older Perl") and commit e618c7654794 ("gcf2client: add alias for PublicInbox::Git::fail") before realizing the lack of `return' here wasn't the culprit

[PATCH] test_common: fix excessive wait for GNU tail inotify

2023-11-19 Thread Eric Wong
We want to use the filenames tail will watch, not the number of args passed to the `tail_f' subroutine. Fixes: 9231d2e7b93f (tests: map CLOFORK->FD_CLOEXEC temporarily for `tail -f') --- lib/PublicInbox/TestCommon.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git

[RFC] altid: start supporting indexfilter type (was: Alternate permalink URLs)

2023-11-19 Thread Eric Wong
"Robin H. Johnson" wrote: > Hi, > > This is more of a feature request / request for pointers on how to tweak > the design to support something, and it might be suited to maintaining > as a local patch. Since the indexing internals are somewhat in flux and tied to Xapian and Perl, I'm happy to

Re: publicinbox watch path globbing

2023-11-19 Thread Eric Wong
"Robin H. Johnson" wrote: > The date is based on arrival time at the archive ingest. > > For some of the very old lists, we do have a list of message-ids that we > know existed but aren't captured in the archive, and those mails have > been added to the old locations if they are ever found

Re: publicinbox watch path globbing

2023-11-19 Thread Eric Wong
"Robin H. Johnson" wrote: > Hi! > > Writing to see about work in converting Gentoo's (now-broken) other > archives web interface over into using public-inbox instead. > > This is the first of a few questions/bumps along the way. > > For historical reasons on the scaling side, the archive

[PATCH] extindex: warn and hint about --gc on bad ibx_id

2023-11-16 Thread Eric Wong
Stale entries from newsgroup name changes (including adding a `publicinbox.<name>.newsgroup' entry when none existed before) can wreak havoc during a --reindex. So give the hint to users about running -extindex with --gc to clean up stale entries. --- Documentation/public-inbox-extindex.pod | 5 +++--

Re: t/cindex.t "associate w/o search" test hangs for me

2023-11-15 Thread Eric Wong
Konstantin Ryabitsev wrote: > I'm quite happy to not require libgit2 -- I've always found it easier to just > use git plumbing commands even if this requires exec'ing an external > executable. Yeah, I don't have libgit2 installed on most of my systems, either. Hoping git itself eventually makes

[PATCH 4/4] lei q|up|convert: common finish_output to detect errors

2023-11-15 Thread Eric Wong
We need to consistently check the exit code of pigz|gzip|xz|bzip2 when writing to compressed mboxes (or bad storage). --- lib/PublicInbox/LeiConvert.pm | 4 ++-- lib/PublicInbox/LeiToMail.pm | 11 +++ lib/PublicInbox/LeiXSearch.pm | 9 + 3 files changed, 14 insertions(+), 10

[PATCH 3/4] lei: avoid extra fork for v2 outputs

2023-11-15 Thread Eric Wong
We've always forced LeiToMail to only have one process for v2 outputs anyways since v2 has its own sharding and IPC. Thus we can use the single LeiToMail process directly to avoid extra IPC overhead. --- lib/PublicInbox/LeiConvert.pm | 7 ++- lib/PublicInbox/LeiToMail.pm | 19

[PATCH 0/4] lei convert: support idempotent v2 outputs

2023-11-15 Thread Eric Wong
working on 3/4. Eric Wong (4): lei: fix idempotent STDERR redirect in workers lei convert: fix repeat and idempotent v2 output lei: avoid extra fork for v2 outputs lei q|up|convert: common finish_output to detect errors lib/PublicInbox/LEI.pm | 2 +- lib/PublicInbox/LeiConvert.pm
