Re: [v2] one file to rule them all?

2018-02-15 Thread Eric Wong
Eric Wong  wrote:
> Timing "git rev-list --objects --all |wc -l" reveals much bigger
> differences.  Timings are a bit fuzzy since (AFAIK) this is a
> shared system, but it's not really a contest:
> 
> 2-38 ~5 minutes
> 2-2-2-34 ~30 seconds
> 2-2-36   ~30 seconds
> 1-file   ~5 seconds
> 
> Smaller trees are way faster :)

The LKML 2000-2017 archives I have (16GB of uncompressed mboxes)
come to 6.3G of git objects with 1-file storage, and a full
import took around 33 minutes using a single core and a single
git repo (no deduplication checks).
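
A minimal sketch of that kind of import loop (not the actual
importer; the ref, committer ident, target repo, and the naive
mbox splitting are all placeholders):

  use strict;
  use warnings;

  # naive mbox split (placeholder); the real parser is more careful
  my @messages = split(/^(?=From )/m, do { local $/; <STDIN> });

  open my $fi, '|-', qw(git --git-dir=lkml.git fast-import --quiet)
          or die "fast-import: $!";
  my $mark = 0;
  for my $msg (@messages) {
          my $blob = ++$mark;
          print $fi "blob\nmark :$blob\ndata ", length($msg), "\n$msg\n";
          my $cmsg = "import message\n";
          # one commit per message, always replacing the same top-level path
          print $fi "commit refs/heads/master\n",
                  "committer importer <i\@example.com> ", time, " +0000\n",
                  "data ", length($cmsg), "\n$cmsg",
                  "M 100644 :$blob msg\n\n";
  }
  close $fi or die "fast-import failed: $?\n";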

"git repack -adb" takes about 2 minutes

"git rev-list --objects --all |wc -l"
takes around 1 minute with over 8 million objects

As a baseline, pure Perl parsing of the mboxes (no writing to
git) was around 23 minutes on a single core; so git-fast-import
does add some overhead but probably not as much as Xapian will
add for the initial import.

The v1 2-38 code slowed to a crawl as more data got into the
repo, and I gave up after it hit 18G.  I also hit snags with
badly-formatted dates (worked around by relying on Date::Parse
instead of git's RFC2822 parser).
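
The workaround is roughly this (a sketch, not the actual code;
falling back to the import time for unparseable dates is my
assumption):

  use Date::Parse qw(str2time);

  # return an epoch suitable for a fast-import raw date ("<epoch> +0000")
  sub commit_time {
          my ($date_hdr) = @_;          # raw Date: header value
          my $t = str2time($date_hdr);  # tolerant of badly-formatted dates
          defined $t ? $t : time;       # assumption: fall back to import time
  }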

Side note:  Using 4 parallel processes for the parse-only tests
took around 10.5 minutes; while 2 processes took around 11-12
minutes.  Then I realized 2 of the 4 processors were HT,
so it appears HT doesn't help much with Perl parsing...

> In other words, git scales infinitely well to deep history
> depths, but not to breadth of large trees[1].

Serving ~6G of data per clone is still a lot of bandwidth; so
partitioning the git repos to limit the size of each clone
seems worth it.

Yearly partitions are probably too frequent, and we'd end up
with too many packs (and, as a result, more open FDs, cache
misses, and metadata stored in Xapian).  I think partitioning
based on message count/size might be a better metric for
splitting, as LKML seems to get more traffic year-after-year.
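
e.g. something along these lines (made-up threshold and layout,
just to illustrate the metric):

  # rotate to a new git partition once ~1G of raw mail has been imported
  my $MAX_BYTES = 1024 * 1024 * 1024;   # hypothetical threshold
  my ($part, $bytes) = (0, 0);

  sub git_dir_for {
          my ($msg_bytes) = @_;
          if (($bytes += $msg_bytes) > $MAX_BYTES) {
                  $part++;              # start a new partition
                  $bytes = $msg_bytes;
          }
          "lkml/git/$part.git";         # hypothetical on-disk layout
  }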

> Marking spam and handling message removals might be a little
> trickier as chronology will have to be taken into account...
> (will post more on this, later)

Keeping track of everything in Xapian while walking backwards
through git history shouldn't be a big problem, actually.
(Xapian has read-after-write consistency)
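
For example, a removal noticed during the walk could be handled
with something like this (Search::Xapian sketch; the "XM"
Message-ID term prefix, paths, and variable names are
assumptions, not actual public-inbox code):

  use Search::Xapian;

  my $xap_dir = '/path/to/xapian';          # placeholder
  my $mid = 'some-message-id@example.com';  # Message-ID of the removed mail
  my $wdb = Search::Xapian::WritableDatabase->new($xap_dir,
                  Search::Xapian::DB_CREATE_OR_OPEN);
  # drop the message from the index as soon as the removal is seen
  $wdb->delete_document_by_term('XM' . $mid);
  $wdb->flush;  # this writer sees its own changes immediately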

However, trying to reason about partitioning of Xapian DBs
across time/message-count boundaries was making my head hurt and
now I'm not sure if it's necessary to partition Xapian DBs.

While normal text indexing is trivial to partition and
parallelize, associating messages with thread IDs requires
"global" knowledge spanning all partitions (since mail threads
span multiple periods).  Unfortunately, that requires a lot of
synchronization, and synchronization hurts parallelism.
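
Concretely, resolving a thread ID for a new message means asking
every partition whether it already knows any Message-ID the
message references, e.g. (Search::Xapian sketch; the "XM" term
prefix and value slot 0 for the thread ID are assumptions, not
the real schema):

  use Search::Xapian;

  # $mids: Message-IDs from References/In-Reply-To
  # @part_dirs: paths to the Xapian partitions
  sub lookup_thread_id {
          my ($mids, @part_dirs) = @_;
          my $db = Search::Xapian::Database->new(shift @part_dirs);
          $db->add_database(Search::Xapian::Database->new($_)) for @part_dirs;
          my $enq = Search::Xapian::Enquire->new($db);
          for my $mid (@$mids) {
                  # assumption: messages are indexed with an "XM"+Message-ID term
                  $enq->set_query(Search::Xapian::Query->new('XM' . $mid));
                  my ($hit) = $enq->matches(0, 1);
                  # assumption: the thread ID is stored in value slot 0
                  return $hit->get_document->get_value(0) if $hit;
          }
          return;  # no partition knows this thread; caller allocates a new ID
  }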

Partitioning Xapian DBs is useful to speed up full-indexing and
not much else.  Full-(re)indexing is a rare event, and can be
done on a cold DB while the hot one is taking traffic.  In fact,
I would expect lookups on partitioned DBs to be slower, since a
search has more files to go through and must map internal
document_ids to non-conflicting ones across partitions.

Also, we don't serve Xapian data for cloning, and limiting clone
size is the main reason to partition git storage in the first
place...



[v2] one file to rule them all?

2018-02-09 Thread Eric Wong
Since 95acd5901491e4f333f5d2bbeed6fb5e6b53e07c
("searchmsg: add git object ID to doc_data")
the need for having files stored in trees is reduced,
since Xapian stores the git object_id and asks git to
retrieve it without doing tree lookups.

So, as long as git knows an object exists, it should be no
problem to just continually replace a single blob at the top
level.
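
In fast-import terms, each message becomes something like this
(sketch; lengths and the committer ident are placeholders, and
the "msg" path matches the filter script at the end of this
mail):

  blob
  mark :1
  data <byte count>
  <raw message>

  commit refs/heads/master
  committer A U Thor <a@example.com> <epoch> +0000
  data <byte count>
  <one-line commit subject>
  M 100644 :1 msg

so every commit just overwrites the same top-level "msg" path
and the tree never grows.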

Testing with git@vger history (https://public-inbox.org/git/ at
10066bdacf246bf885f7a11d06a5a87946d7af73 <20180208172153.ga30...@tor.lan>
 by Torsten Bögershausen  @ Thu Feb 8 18:21:53 2018 +0100)


For the 2-2-2-34 and 2-2-36 trees, I took the last 16 bytes of
each name into account, since that's what git.git uses for
sorting/grouping when packing (pack_name_hash in pack-objects.h)


2-38 914M (baseline)
2-2-2-34 849M
2-2-36   832M
1-file   839M

2-2-2-34 has the most trees, so it's not great in terms of
space. 2-2-36 optimizes deltas better than the 1-file route;
but not significantly so.

It seems optimizing for deltafication isn't worth the effort...


Timing "git rev-list --objects --all |wc -l" reveals much bigger
differences.  Timings are a bit fuzzy since (AFAIK) this is a
shared system, but it's not really a contest:

2-38 ~5 minutes
2-2-2-34 ~30 seconds
2-2-36   ~30 seconds
1-file   ~5 seconds

Smaller trees are way faster :)

The downside of this change is that squashing history will no
longer be possible; but it also won't be needed for efficiency
reasons.

In other words, git scales infinitely well to deep history
depths, but not to breadth of large trees[1].


Marking spam and handling message removals might be a little
trickier as chronology will have to be taken into account...
(will post more on this, later)

I also considered storing messages in the commit object itself
but that would be tougher to reconcile if rewriting git history
is necessary for legal reasons (DMCA).



[1] - we currently process history with --reverse to walk
  in chronological order to ease processing of message
  removals; but --reverse has an O(n) cost associated
  with it, so we should avoid it.  The thread association
  logic should be robust enough to be time-independent.
#!/usr/bin/perl -w
# Copyright 2018 The Linux Foundation
# License: AGPL-3.0+ 
use strict;
use warnings;
use Email::MIME;
use Digest::MD5 qw(md5_hex);
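
# This filter reads a "git fast-export" stream on stdin and writes a
# modified stream to stdout, rewriting every 2/38-style blob path to a
# single top-level "msg" file (the "1-file" layout tested above).
# Usage sketch (repo paths are placeholders):
#   git fast-export --all | perl this-filter | git --git-dir=new.git fast-import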

$| = 0;
my $h = '[0-9a-f]';
my $state = '';
my $blob;
my $suff; # 16 bytes for git hashing
while (<STDIN>) {
        if ($_ eq "blob\n") {
                $state = 'blob';
        } elsif (/^commit /) {
                $state = 'commit';
        } elsif ($state eq 'commit') {
                # rewrite 2/38-style paths to the single top-level "msg" file
                if (m{^(M 100644 :\d+) ${h}{2}/${h}{38}}o) {
                        my ($pfx) = ($1);
                        print "$pfx msg\n";
                        next;
                }
                # copy the commit message payload through unchanged
                if (/^data (\d+)/) {
                        print $_;
                        my $len = $1;
                        if ($len) {
                                my $tmp;
                                read(STDIN, $tmp, $len) or die "read: $!\n";
                                print $tmp;
                        }
                        next;
                }
        } elsif ($state eq 'blob') {
                # copy blob data through unchanged
                if (/^data (\d+)/) {
                        my $len = $1;
                        print $_;
                        next unless $len;

                        read(STDIN, $blob, $len) or die "read: $!\n";
                        print $blob;
                        next;
                }
        }
        print $_;       # pass everything else through as-is
}
#!/usr/bin/perl -w
# Copyright 2018 The Linux Foundation
# License: AGPL-3.0+ 
use strict;
use warnings;
use Email::MIME;
use Digest::MD5 qw(md5_hex);
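
# Same structure as the previous filter, but rewrites each 2/38-style
# blob path into a 2/2/36-style path plus a 16-byte suffix, so the
# name's last 16 bytes are meaningful to pack_name_hash() sorting/
# grouping (see above).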

$| = 0;
my $h = '[0-9a-f]';
my $state = '';
my $blob;
my $suff; # 16 bytes for git hashing
while (<STDIN>) {
        if ($_ eq "blob\n") {
                $state = 'blob';
        } elsif (/^commit /) {
                $state = 'commit';
        } elsif ($state eq 'commit') {
                # rewrite 2/38-style paths to 2/2/36 plus a 16-byte suffix
                if (m{^(M 100644 :\d+) (${h}{2})/(${h}{2})(${h}{36})}o) {
                        my ($pfx, $x2, $x4, $x36) = ($1, $2, $3, $4);
                        print "$pfx $x2/$x4/$x36.$suff\n";
                        next;
                }
                # copy the commit message payload through unchanged
                if (/^data (\d+)/) {
                        print $_;
                        my $len = $1;
                        if ($len) {
                                my $tmp;
                                read(STDIN, $tmp, $len) or die "read: $!\n";
                                print $tmp;
                        }
                        next;
                }
        } elsif ($state eq 'blob') {
                if (/^data (\d+)/) {
                        my $len = $1;
                        print $_;
                        next unless $len;

                        read(STDIN, $blob, $len) or die "read: $!\n";