There's a lot of crap in archives and git-fast-import
accepts empty names and email addresses for authors
just fine.
---
lib/PublicInbox/Import.pm | 27 +++
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
The mboxes I got from cregit have two spaces after the email
address, while the "git format-patch" output I'm used to dealing
with only has one space.
It's still a "strict" match in that it checks for something
resembling a timestamp, but it relaxes the number of spaces
between the email address
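
For illustration only, the relaxed match described above might look like the following Python sketch. The pattern and the name `is_from_line` are assumptions made for this example and are not the regex used in lib/PublicInbox/Import.pm; the point is only that the run of spaces after the address is `+` rather than exactly one, while the rest still has to resemble a timestamp.

```python
import re

# Hypothetical pattern (not the project's): "From ", an address-like token,
# ONE OR MORE spaces, then something resembling an asctime()-style timestamp.
FROM_LINE = re.compile(
    r"^From \S+ +"                           # envelope sender, then 1+ spaces
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d "  # weekday, month, space-padded day
    r"\d{2}:\d{2}:\d{2} \d{4}$"              # HH:MM:SS and four-digit year
)


def is_from_line(line):
    """Return True if the line looks like an mbox "From " separator."""
    return FROM_LINE.match(line.rstrip("\r\n")) is not None
```

Both the one-space and the two-space form are accepted, while ordinary body text that merely starts with "From " is still rejected because it lacks the timestamp.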
Check for this before doing the Xapian-based v2 importer.
---
t/import.t | 25 -
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/t/import.t b/t/import.t
index fb6238e..92c82b9 100644
--- a/t/import.t
+++ b/t/import.t
@@ -6,7 +6,10 @@ use Test::More;
use
Big lists are orders of magnitude more efficient with v2.
---
scripts/import_vger_from_mbox | 24 ++--
1 file changed, 6 insertions(+), 18 deletions(-)
diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index 3fa5c77..6ea2ca5 100644
---
Wrap "get-mark" and "checkpoint" commands for git-fast-import
while documenting/cementing parts of the API.
---
lib/PublicInbox/Import.pm | 28 ++--
t/import.t | 4 +++-
2 files changed, 25 insertions(+), 7 deletions(-)
diff --git
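
As a rough illustration of what such wrappers do, here is a Python sketch that writes the two git-fast-import commands to a stream. The function names and the stream argument are hypothetical and do not reflect the Perl API added by this patch; only the command syntax ("checkpoint" and "get-mark :<idnum>") comes from git-fast-import itself. Reading the reply to get-mark is left out.

```python
def checkpoint(out):
    """Ask git-fast-import to flush packs, refs and marks to disk."""
    out.write(b"checkpoint\n")
    out.flush()


def get_mark(out, idnum):
    """Ask git-fast-import for the object name behind mark :<idnum>.

    Only the request is written here; the caller would read the reply
    from the fast-import process's output stream.
    """
    out.write(b"get-mark :" + str(int(idnum)).encode("ascii") + b"\n")
    out.flush()
```

In real use `out` would be the stdin pipe of a running `git fast-import` process; any binary file-like object works for experimenting.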
This will allow easier compatibility with v2 code which will
introduce content_id as the unique identifier.
The old "XMID" becomes "XM" as a free text searchable term.
"Q" becomes "XMID" as a boolean prefix.
There are no user-visible changes in this, but there needs to
be a schema version bump
There are a lot of weird characters which show up in LKML archives
which we did not support before. Furthermore, allow spaces
before the '>' in the From: line, as at least one non-spam
poster used it.
---
lib/PublicInbox/Address.pm | 3 ++-
t/address.t | 5 +++--
2 files changed, 5
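
To make the "spaces before the '>'" case concrete, here is a small Python sketch. The regex and the name `emails` are assumptions for this example only and are not the pattern in lib/PublicInbox/Address.pm; it simply tolerates whitespace just inside the angle brackets.

```python
import re

# Hypothetical pattern (not the project's): an address inside <...>,
# tolerating whitespace after '<' and before '>'.
ADDR_IN_BRACKETS = re.compile(r"<\s*([^<>\s]+@[^<>\s]+)\s*>")


def emails(from_header):
    """Return all bracketed addresses found in a From: header value."""
    return ADDR_IN_BRACKETS.findall(from_header)
```

With this, a header such as `User <user@example.com >` yields the same address as the conventional form without the stray space.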
I decided not to copy the notmuch implementation regarding
serialization of integers to Xapian metadata.
---
lib/PublicInbox/SearchIdx.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0ee0779..fa5057f 100644
Call order will need to change a bit since this is going to be
tied to Xapian
---
MANIFEST | 1 +
lib/PublicInbox/ContentId.pm | 30 ++
lib/PublicInbox/Import.pm | 74 +++-
3 files changed, 91 insertions(+), 14
Hostnames can contain '-' and this allows public-inbox-watch(1)
to work on machines which generate Maildir files with '-' in
them.
---
lib/PublicInbox/WatchMaildir.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/WatchMaildir.pm
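
A simplified Python sketch of the idea follows. The character class and the name `looks_like_maildir_file` are assumptions for illustration and are not the check in lib/PublicInbox/WatchMaildir.pm; real Maildir names may contain further characters that this deliberately narrow pattern rejects.

```python
import re

# Hypothetical, simplified basename check: the character class now
# includes '-', so hostnames such as "my-host" are accepted.
# An optional ":2,<flags>" info suffix is allowed.
MAILDIR_NAME = re.compile(r"^[A-Za-z0-9_.\-]+(?::2,[A-Za-z]*)?$")


def looks_like_maildir_file(basename):
    """Return True if the basename fits the simplified pattern above."""
    return MAILDIR_NAME.match(basename) is not None
```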
The basic idea is to outsource deduplication to Xapian
and use git as dumb storage. This yields huge dividends
in object traversal based on preliminary tests:
https://public-inbox.org/meta/20180209205140.GA11047@dcvr/
Additionally, insertion time does not degrade due to giant
tree objects which
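
The general shape of "deduplicate in an index, store blobs dumbly" can be sketched in Python as below. Everything here is an assumption for illustration: the choice of headers, the use of SHA-256, and the in-memory `DedupIndex` class standing in for the search index. It is not the content_id computation or the Xapian schema from this series.

```python
import hashlib
from email import message_from_bytes

# Assumption: these headers plus the body decide whether two messages
# are "the same"; the real content_id may use a different set.
ID_HEADERS = ("Message-ID", "From", "Subject", "Date")


def content_id(raw):
    """Hash selected headers and the body of a raw RFC 822 message."""
    msg = message_from_bytes(raw)
    h = hashlib.sha256()
    for name in ID_HEADERS:
        value = str(msg.get(name, ""))
        h.update(f"{name}:{value}\n".encode("utf-8", "replace"))
    # get_payload(decode=True) returns None for multipart messages
    h.update(msg.get_payload(decode=True) or b"")
    return h.hexdigest()


class DedupIndex:
    """In-memory stand-in for the index that answers "seen before?"."""

    def __init__(self):
        self._seen = set()

    def add(self, raw):
        """Record the message; return False if it was already known."""
        cid = content_id(raw)
        if cid in self._seen:
            return False
        self._seen.add(cid)
        return True
```

Only messages for which `add` returns True would then be written to storage, so the storage layer never has to look anything up by name.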
Eric Wong wrote:
> Timing "git rev-list --objects --all |wc -l" reveals much bigger
> differences. Timings are a bit fuzzy since (AFAIK) this is a
> shared system, but it's not really a contest:
>
> 2-38 ~5 minutes
> 2-2-2-34 ~30 seconds
> 2-2-36 ~30 seconds
>