"Eric Wong (Contractor, The Linux Foundation)" wrote:
> This will allow easier compatibility with v2 code which will
> introduce content_id as the unique identifier.
> The old "XMID" becomes "XM" as a free text searchable term.
> "Q" becomes "XMID" as a boolean prefix.
Leaning towards rejecting t
Eric Wong wrote:
> In addition to the git object_id (blob SHA-1) and Message-Id
> header; it seems necessary to introduce an in-between identifier
> for deduplicating which isn't as loose as Message-Id or as
> strict as object_id: content_id
I think this will only be calculated-on-the-fly in case
The key thing is sharding git and sharding of Xapian are not
tied together:
git repos are sharded to reduce clone/repack costs; so we shard
them based on size (currently 1G or so).
Xapian DBs are sharded to take advantage of SMP during the indexing
phase.
Current import times are as follows:
This likely has no real world implications, though, as we
fall back to Msgmap lookups anyways.
Broken since commit 7eeadcb62729b0efbcb53cd9b7b181897c92cf9a
("search: remove unnecessary abstractions and functionality")
---
lib/PublicInbox/ExtMsg.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletio
We want to reduce the time the main V2Writable process
spends writing to the pipe, as the main process itself is
the primary source of contention.
While we're at it, always flush after writing to ensure
the child sees it at once. (Grr... Perl doesn't use writev)
---
lib/PublicInbox/SearchIdxP
Instead of relying on the git object_id hash to partition,
round-robin to these partitions based on the NNTP article
number. This reduces the partition pipes as a source of
contention when two (or more) sequential messages end up
going to the same partition.
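
A minimal sketch of the difference between the two partitioning schemes described above. This is not the V2Writable code; the function names, the partition count, and the sample message bodies are all hypothetical, and Python is used here only for illustration.

```python
import hashlib

def partition_by_oid(oid_hex: str, nparts: int) -> int:
    """Hash-based: derive the partition from the leading hex digits of an object ID."""
    return int(oid_hex[:8], 16) % nparts

def partition_by_article(num: int, nparts: int) -> int:
    """Round-robin: consecutive article numbers map to consecutive partitions."""
    return num % nparts

nparts = 4               # hypothetical partition count
articles = range(1, 9)   # hypothetical NNTP article numbers

# Round-robin never sends two sequential articles to the same partition
# (as long as nparts > 1).
round_robin = [partition_by_article(n, nparts) for n in articles]

# Hash-based assignment is effectively random, so neighbouring articles
# may collide on one partition (and therefore on one pipe).
hashed = [
    partition_by_oid(hashlib.sha1(b"msg %d" % n).hexdigest(), nparts)
    for n in articles
]

print("round-robin:", round_robin)
print("hash-based: ", hashed)
```

With round-robin, a collision between neighbouring messages is impossible by construction; with hashing it is only unlikely.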
---
lib/PublicInbox/V2Writable.pm | 34
It won't be in v2
---
lib/PublicInbox/WWW.pm | 1 -
1 file changed, 1 deletion(-)
diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm
index 4ddc187..e95fba0 100644
--- a/lib/PublicInbox/WWW.pm
+++ b/lib/PublicInbox/WWW.pm
@@ -150,7 +150,6 @@ sub invalid_inbox ($$) {
my $www = $ct
Despite email not existing until 1971; "Jan 1, 1970 00:00:00"
seems like a common default timestamp for some test emails
to use as a Date: header.
---
lib/PublicInbox/Import.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
This will let us quickly test between v2 and v1 inboxes.
---
scripts/import_vger_from_mbox | 24 +++-
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index d30e8a3..abc2d37 100644
--- a/scripts/import_
The parallelization requires splitting Msgmap, text+term
indexing, and thread-linking out into separate processes.
git-fast-import is fast, so we don't bother parallelizing it.
Msgmap (SQLite) and thread-linking (Xapian) must be serialized
because they rely on monotonically increasing numbers (NN
It is less confusing without the clobber assignment; and
PublicInbox::MIME exists to work around bugs in older
Email::MIME (which is in Debian 9 (stretch))
---
scripts/import_vger_from_mbox | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/scripts/import_vger_from_mbox b/s
This is too slow, currently. Working with only 2017 LKML
archives:
git-only: ~1 minute
git + SQLite: ~12 minutes
git + Xapian + SQLite: ~45 minutes
So yes, it looks like we'll need to parallelize Xapian indexing,
at least.
---
lib/PublicInbox/Import.pm | 1 +
lib/PublicInbox/Inbo
Since we'll be adding new repositories to the `alternates' file
in git, we must restart the `git cat-file --batch' process as
git currently does not detect changes to the alternates file
in long-running cat-file processes.
Don't bother with the `--batch-check' process since we won't be
using it wi
In general, they are, but there's no way for a general purpose
mail server to enforce that. This is a step in allowing us
to handle more corner cases which existing lists throw at us.
---
lib/PublicInbox/ExtMsg.pm| 2 +-
lib/PublicInbox/Search.pm| 14 --
lib/PublicInbox/Sear
Wrap the old Import package to enable creating new repos based
on size thresholds. This is better than relying on time-based
rotation as LKML traffic seems to be increasing.
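
A rough sketch of size-based rotation, assuming a writer that counts the bytes it has added and opens a new repository once a threshold would be exceeded. The class name, its fields, and the tiny threshold are hypothetical and do not reflect the actual wrapper around Import.

```python
class SizeRotatingWriter:
    """Toy model: assign each message to a repo index, rotating on a size threshold."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes   # rotation threshold (hypothetical)
        self.repo_index = 0          # which repo is currently being written
        self.bytes_in_repo = 0       # bytes added to the current repo so far

    def add(self, raw: bytes) -> int:
        """Record a message and return the index of the repo it went to."""
        would_overflow = self.bytes_in_repo + len(raw) > self.max_bytes
        # Only rotate a non-empty repo, so an oversized message still lands somewhere.
        if self.bytes_in_repo > 0 and would_overflow:
            self.repo_index += 1
            self.bytes_in_repo = 0
        self.bytes_in_repo += len(raw)
        return self.repo_index


writer = SizeRotatingWriter(max_bytes=10)
placements = [writer.add(b"x" * 4) for _ in range(4)]
print(placements)
```

Unlike time-based rotation, the number of repos here depends only on how much data arrives, so a busier list simply rotates sooner.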
---
MANIFEST | 1 +
lib/PublicInbox/Git.pm| 12 +++
lib/PublicInbox/Import.pm | 9 ++-
So I'm messing with public-inbox a bit. I had to make the following
change before things would work on my Fedora system. (This is with
v1.0.0).
The symptom was errors out of SQLite; it wouldn't pass "make test" much
less work properly. Some digging turns up this:
If SQLite is built wit
Jonathan Corbet wrote:
> So I'm messing with public-inbox a bit. I had to make the following
> change before things would work on my Fedora system. (This is with
> v1.0.0).
>
> The symptom was errors out of SQLite; it wouldn't pass "make test" much
> less work properly. Some digging turns up t
On Thu, 22 Feb 2018 23:24:00 +
Eric Wong wrote:
> So with your patch, is everything else good on Fedora?
It seems to be. I spent forever in this loop:
while (it still doesn't work)
install_next_perl_module()
I've stashed the list in an ansible config, so I can pass
Jonathan Corbet wrote:
> On Thu, 22 Feb 2018 23:24:00 +
> Eric Wong wrote:
>
> > So with your patch, is everything else good on Fedora?
>
> It seems to be. I spent forever in this loop:
>
> while (it still doesn't work)
> install_next_perl_module()
>
> I've stashed th
> git+Xapian+SQLite 4 parts: ~15 minutes (2 + 2 hyperthread)
That was for 2017, which was 11.6% of the 2000-2017 archives.
For the 2000-2017 set, it took 5h20m; which is disappointing to
me but probably tolerable for now.
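
For comparison, a back-of-the-envelope check of those numbers, assuming the ~15 minute figure and the 5h20m figure come from comparable runs and that the 11.6% share is a fair proxy for the amount of work (neither is stated above):

```python
minutes_2017 = 15          # ~15 minutes for the 2017-only import
fraction_2017 = 0.116      # 2017 is 11.6% of the 2000-2017 archives
actual_full = 5 * 60 + 20  # 5h20m for the full set, in minutes

# If import time scaled linearly with archive size:
linear_estimate = minutes_2017 / fraction_2017

# How much slower the real run was than that linear projection:
slowdown = actual_full / linear_estimate

print(f"linear estimate: {linear_estimate:.0f} min, slowdown: {slowdown:.2f}x")
```

Under those assumptions the full import took roughly 2.5 times longer than a linear projection of about 2h10m would suggest.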
2017-only didn't trigger repo rotation at ~1G (being only 750MB
or so)
T