Re: [WIP 12/17] search: free up 'Q' prefix for a real unique identifier

2018-02-22 Thread Eric Wong
"Eric Wong (Contractor, The Linux Foundation)" wrote: > This will allow easier-compatibility with v2 code which will > introduce content_id as the unique identifier. > The old "XMID" becomes "XM" as a free text searchable term. > "Q" becomes "XMID" as a boolean prefix. Leaning towards rejecting t

Re: [v2] introduction of content_id

2018-02-22 Thread Eric Wong
Eric Wong wrote: > In addition to the git object_id (blob SHA-1) and Message-Id > header; it seems necessary to introduce an in-between identifier > for deduplicating which isn't as loose as Message-Id or as > strict as object_id: content_id I think this will only be calculated-on-the-fly in case

[WIP PATCH 0/12] v2: git repo rotation + parallel Xapian indexing

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
The key thing is sharding git and sharding of Xapian are not tied together: git repos are sharded to reduce clone/repack costs; so we shard them based on size (currently 1G or so). Xapian DBs are sharded to take advantage of SMP during the indexing phase. Current import times are as follows:

[PATCH 02/12] extmsg: fix broken Xapian MID lookup

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
This likely has no real world implications, though, as we fall back to Msgmap lookups anyways. Broken since commit 7eeadcb62729b0efbcb53cd9b7b181897c92cf9a ("search: remove unnecessary abstractions and functionality") --- lib/PublicInbox/ExtMsg.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletio

[PATCH 12/12] searchidxpart: increase pipe size for partitions

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
We want to reduce the time the main V2Writable process spends writing to the pipe, as the main process itself is the primary source of contention. While we're at it, always flush after writing to ensure the child sees it at once. (Grr... Perl doesn't use writev) --- lib/PublicInbox/SearchIdxP

[PATCH 11/12] v2writable: round-robin to partitions based on article number

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
Instead of relying on the git object_id hash to partition, round-robin to these partitions based on the NNTP article number. This reduces the partition pipes as a source of contention when two (or more) sequential messages end up going to the same partition. --- lib/PublicInbox/V2Writable.pm | 34
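A minimal sketch of the difference between the two partitioning schemes described in this patch, not the patch itself. The function names, the partition count of 4, and the use of a plain modulo are assumptions for illustration. Hashing spreads messages evenly over time, but nothing stops two sequential messages from landing on the same partition; the article number modulo the partition count never repeats a partition back to back when there is more than one partition.

```python
import hashlib

NPARTS = 4  # hypothetical partition count


def part_by_hash(raw: bytes, nparts: int = NPARTS) -> int:
    """Pick a partition from a content hash (stand-in for the git object_id)."""
    digest = hashlib.sha1(raw).hexdigest()
    return int(digest[:8], 16) % nparts


def part_by_article(num: int, nparts: int = NPARTS) -> int:
    """Pick a partition round-robin from the NNTP article number."""
    return num % nparts


def longest_run(parts) -> int:
    """Length of the longest run of consecutive messages on one partition."""
    best = run = 0
    prev = None
    for p in parts:
        run = run + 1 if p == prev else 1
        best = max(best, run)
        prev = p
    return best


# Article numbers increase monotonically, so neighbours never collide.
round_robin = [part_by_article(n) for n in range(1, 9)]
```

With hash-based assignment, `longest_run` over a large batch will typically exceed 1, which is the contention on a single partition pipe that the patch aims to avoid.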

[PATCH 04/12] www: stop assuming mainrepo == git_dir

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
It won't be in v2 --- lib/PublicInbox/WWW.pm | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm index 4ddc187..e95fba0 100644 --- a/lib/PublicInbox/WWW.pm +++ b/lib/PublicInbox/WWW.pm @@ -150,7 +150,6 @@ sub invalid_inbox ($$) { my $www = $ct

[PATCH 01/12] import: allow the epoch (0s) as a valid time

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
Despite email not existing until 1971; "Jan 1, 1970 00:00:00" seems like a common default timestamp for some test emails to use as a Date: header. --- lib/PublicInbox/Import.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm

[PATCH 08/12] import_vger_from_inbox: allow "-V" option

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
This will let us quickly test between v2 and v1 inboxes. --- scripts/import_vger_from_mbox | 24 +++- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox index d30e8a3..abc2d37 100644 --- a/scripts/import_

[PATCH 10/12] v2: parallelize Xapian indexing

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes. git-fast-import is fast, so we don't bother parallelizing it. Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NN

[PATCH 09/12] import_vger_from_mbox: use PublicInbox::MIME and avoid clobbering

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
It is less confusing without the clobber assignment; and PublicInbox::MIME exists to work around bugs in older Email::MIME (which is in Debian 9 (stretch)) --- scripts/import_vger_from_mbox | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/scripts/import_vger_from_mbox b/s

[PATCH 07/12] v2: support Xapian + SQLite indexing

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
This is too slow, currently. Working with only 2017 LKML archives: git-only: ~1 minute git + SQLite: ~12 minutes git+Xapian+SQLite: ~45 minutes So yes, it looks like we'll need to parallelize Xapian indexing, at least. --- lib/PublicInbox/Import.pm | 1 + lib/PublicInbox/Inbo

[PATCH 06/12] git: reload alternates file on missing blob

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
Since we'll be adding new repositories to the `alternates' file in git, we must restart the `git cat-file --batch' process as git currently does not detect changes to the alternates file in long-running cat-file processes. Don't bother with the `--batch-check' process since we won't be using it wi

[PATCH 03/12] search: stop assuming Message-ID is unique

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
In general, they are, but there's no way for a general purpose mail server to enforce that. This is a step in allowing us to handle more corner cases which existing lists throw at us. --- lib/PublicInbox/ExtMsg.pm| 2 +- lib/PublicInbox/Search.pm| 14 -- lib/PublicInbox/Sear

[PATCH 05/12] v2writable: initial cut for repo-rotation

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing. --- MANIFEST | 1 + lib/PublicInbox/Git.pm| 12 +++ lib/PublicInbox/Import.pm | 9 ++-
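A rough sketch of size-threshold rotation as described above, assuming a writer that tracks how many bytes each repo has received and opens a new one when the next message would push it past the limit. The class name, the byte accounting, and the tiny threshold are hypothetical; the real code wraps the Import package and measures git repos, which this does not attempt.

```python
class RotatingStore:
    """Hypothetical writer that starts a new repo once a size limit is hit."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.repo_sizes = [0]  # bytes written to each repo, oldest first

    def add(self, raw: bytes) -> int:
        """Store one message and return the index of the repo it went to."""
        current = self.repo_sizes[-1]
        # Rotate only if the current repo is non-empty, so a single
        # oversized message still gets stored somewhere.
        if current > 0 and current + len(raw) > self.max_bytes:
            self.repo_sizes.append(0)
        self.repo_sizes[-1] += len(raw)
        return len(self.repo_sizes) - 1


store = RotatingStore(max_bytes=10)  # stand-in for "1G or so"
placements = [store.add(b"x" * n) for n in (6, 6, 4, 1)]
```

Unlike time-based rotation, the number of repos here depends only on volume, so growing traffic produces more repos rather than larger ones.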

SQLite, UPDATE, and LIMIT

2018-02-22 Thread Jonathan Corbet
So I'm messing with public-inbox a bit. I had to make the following change before things would work on my Fedora system. (This is with v1.0.0). The symptom was errors out of SQLite; it wouldn't pass "make test" much less work properly. Some digging turns up this: If SQLite is built wit
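For context, SQLite only accepts a LIMIT clause on UPDATE when it was compiled with SQLITE_ENABLE_UPDATE_DELETE_LIMIT, and distribution builds differ on that. The sketch below shows one portable way to update a single row without relying on that option, by selecting the target rowid in a subquery. It is not the change from this thread; the `msgs` table and `state` column are made up for illustration.

```python
import sqlite3


def update_one(conn: sqlite3.Connection, old: str, new: str) -> int:
    """Change at most one row from `old` to `new`; return rows changed."""
    cur = conn.execute(
        "UPDATE msgs SET state = ? WHERE rowid IN "
        "(SELECT rowid FROM msgs WHERE state = ? ORDER BY rowid LIMIT 1)",
        (new, old),
    )
    return cur.rowcount


def demo():
    """Build a throwaway in-memory table and update exactly one row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE msgs (state TEXT)")
    conn.executemany("INSERT INTO msgs (state) VALUES (?)", [("a",)] * 3)
    changed = update_one(conn, "a", "b")
    states = [r[0] for r in conn.execute("SELECT state FROM msgs ORDER BY rowid")]
    conn.close()
    return changed, states
```

LIMIT inside a SELECT is always available, so this form works regardless of how the SQLite library was built.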

Re: SQLite, UPDATE, and LIMIT

2018-02-22 Thread Eric Wong
Jonathan Corbet wrote: > So I'm messing with public-inbox a bit. I had to make the following > change before things would work on my Fedora system. (This is with > v1.0.0). > > The symptom was errors out of SQLite; it wouldn't pass "make test" much > less work properly. Some digging turns up t

Re: SQLite, UPDATE, and LIMIT

2018-02-22 Thread Jonathan Corbet
On Thu, 22 Feb 2018 23:24:00 + Eric Wong wrote: > So with your patch, is everything else good on Fedora? It seems to be. I spent forever in this loop: while (it still doesn't work) install_next_perl_module() I've stashed the list in an ansible config, so I can pass

Re: SQLite, UPDATE, and LIMIT

2018-02-22 Thread Eric Wong
Jonathan Corbet wrote: > On Thu, 22 Feb 2018 23:24:00 + > Eric Wong wrote: > > > So with your patch, is everything else good on Fedora? > > It seems to be. I spent forever in this loop: > > while (it still doesn't work) > install_next_perl_module() > > I've stashed th

Re: [WIP PATCH 0/12] v2: git repo rotation + parallel Xapian indexing

2018-02-22 Thread Eric Wong
> git+Xapian+SQLite 4 parts: ~15 minutes (2 + 2 hyperthread) That was for 2017, which was 11.6% of the 2000-2017 archives. For the 2000-2017 set, it took 5h20m; which is disappointing to me but probably tolerable for now. 2017-only didn't trigger repo rotation at ~1G (being only 750MB or so) T
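As a quick check on why the full-archive time is disappointing, here is the linear extrapolation from the 2017-only run. This assumes indexing time scales with the 11.6% share quoted above, and that both runs used the same 4-partition setup; neither is stated outright in the message.

```python
def linear_estimate_minutes(sample_minutes: float, sample_fraction: float) -> float:
    """Scale a sample run up to the full set, assuming linear cost."""
    return sample_minutes / sample_fraction


estimate = linear_estimate_minutes(15, 0.116)  # 2017 run: ~15 min, 11.6% of total
actual = 5 * 60 + 20                           # 5h20m for 2000-2017, in minutes
slowdown = actual / estimate
```

The estimate comes to roughly 129 minutes (a bit over 2 hours), so the observed 320 minutes is about 2.5 times slower than linear scaling would predict.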