[WIP 17/17] import: quiet down warnings from bogus From: lines

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
There's a lot of crap in archives and git-fast-import accepts empty names and email addresses for authors just fine. --- lib/PublicInbox/Import.pm | 27 +++ 1 file changed, 19 insertions(+), 8 deletions(-) diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm

[WIP 03/17] scripts/import_vger_from_mbox: relax From_ line match slightly

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
The mboxes I got from cregit have two spaces after the email address, while the "git format-patch" output I'm used to dealing with only has one space. It's still a "strict" match in that it checks for something resembling a timestamp, but it relaxes the number of spaces between the email address
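
A minimal sketch of the relaxed match described above, written in Python for illustration only. The pattern and the name `is_from_line` are assumptions, not the regex used in scripts/import_vger_from_mbox: it accepts one or more spaces after the address but still requires something shaped like an asctime timestamp.

```python
import re

# Hypothetical pattern, NOT the one from scripts/import_vger_from_mbox.
FROM_LINE = re.compile(
    r"^From \S+ +"                            # address, then one or more spaces
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d "   # weekday, month, space-padded day
    r"\d{2}:\d{2}:\d{2} \d{4}$"               # HH:MM:SS and four-digit year
)


def is_from_line(line: str) -> bool:
    """Return True if `line` looks like an mbox From_ separator."""
    return FROM_LINE.match(line.rstrip("\n")) is not None
```

Both the one-space and two-space forms pass, while a body line that merely starts with "From " does not, because the timestamp check is still enforced.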

[WIP 07/17] t/import: test for last_object_id insertion

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
Check for this before doing the Xapian-based v2 importer. --- t/import.t | 25 - 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/t/import.t b/t/import.t index fb6238e..92c82b9 100644 --- a/t/import.t +++ b/t/import.t @@ -6,7 +6,10 @@ use Test::More; use

[WIP 16/17] scripts/import_vger_from_mbox: use v2 layout for import

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
Big lists are orders of magnitude more efficient with v2. --- scripts/import_vger_from_mbox | 24 ++-- 1 file changed, 6 insertions(+), 18 deletions(-) diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox index 3fa5c77..6ea2ca5 100644 ---

[WIP 11/17] import: APIs to support v2 use

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
Wrap "get-mark" and "checkpoint" commands for git-fast-import while documenting/cementing parts of the API. --- lib/PublicInbox/Import.pm | 28 ++-- t/import.t| 4 +++- 2 files changed, 25 insertions(+), 7 deletions(-) diff --git
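
For illustration, the two git-fast-import commands named above can be serialized as follows. This is a sketch in Python, not the wrapper added to lib/PublicInbox/Import.pm; the helper names `get_mark_cmd` and `checkpoint_cmd` are hypothetical. Only the command syntax ("get-mark :<idnum>" and "checkpoint") comes from the git-fast-import documentation.

```python
def get_mark_cmd(mark: int) -> bytes:
    """Build a 'get-mark' command asking fast-import for the object ID of a mark."""
    # Mark 0 is reserved by fast-import, so only positive integers are valid.
    if mark < 1:
        raise ValueError("fast-import marks must be positive integers")
    return b"get-mark :%d\n" % mark


def checkpoint_cmd() -> bytes:
    """Build a 'checkpoint' command, which makes fast-import flush its state."""
    # The trailing blank line is optional in the grammar but harmless.
    return b"checkpoint\n\n"
```

A caller would write these bytes to the stdin of a running "git fast-import" process and, for get-mark, read the reply from its stdout (or from the descriptor given with --cat-blob-fd).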

[WIP 12/17] search: free up 'Q' prefix for a real unique identifier

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
This will allow easier compatibility with v2 code, which will introduce content_id as the unique identifier. The old "XMID" becomes "XM" as a free text searchable term. "Q" becomes "XMID" as a boolean prefix. There are no user-visible changes in this, but there needs to be a schema version bump

[WIP 14/17] address: extract more characters from email addresses

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
There are a lot of weird characters in LKML archives which we did not support before. Furthermore, allow spaces before the '>' in the From: line, as at least one non-spam poster used them. --- lib/PublicInbox/Address.pm | 3 ++- t/address.t| 5 +++-- 2 files changed, 5

[WIP 13/17] searchidx: fix comment around next_thread_id

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
I decided not to copy the notmuch implementation regarding serialization of integers to Xapian metadata. --- lib/PublicInbox/SearchIdx.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index 0ee0779..fa5057f 100644

[WIP 06/17] import: initial handling for v2

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
Call order will need to change a bit since this is going to be tied to Xapian --- MANIFEST | 1 + lib/PublicInbox/ContentId.pm | 30 ++ lib/PublicInbox/Import.pm| 74 +++- 3 files changed, 91 insertions(+), 14

[WIP 02/17] watch_maildir: allow '-' in mail filename

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
Hostnames can contain '-' and this allows public-inbox-watch(1) to work on machines which generate Maildir files with '-' in them. --- lib/PublicInbox/WatchMaildir.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/WatchMaildir.pm
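
A rough sketch of the kind of filename check this implies, in Python. The pattern and the name `is_maildir_msg` are assumptions made for illustration and are not taken from lib/PublicInbox/WatchMaildir.pm. The point is only that '-' belongs in the accepted character set, since the hostname is usually embedded in the Maildir filename.

```python
import re

# Hypothetical filename pattern, NOT the one used by WatchMaildir.pm.
MAILDIR_NAME = re.compile(
    r"\A[A-Za-z0-9][\w.,=+-]*"   # unique part; '-' is allowed for hostnames
    r"(?::2,[A-Z]*)?\Z"          # optional ":2," info suffix with flag letters
)


def is_maildir_msg(name: str) -> bool:
    """Return True if `name` looks like a Maildir message filename."""
    return MAILDIR_NAME.match(name) is not None
```

With this pattern a name such as "1518652800.M1P2.mail-host:2,S" is accepted, whereas a character class without '-' would skip it.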

[WIP 0/17] initial v2 work based on one-file tree

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
The basic idea is to outsource deduplication to Xapian and use git as dumb storage. This yields huge dividends in object traversal based on preliminary tests: https://public-inbox.org/meta/20180209205140.GA11047@dcvr/ Additionally, insertion time does not degrade due to giant tree objects which

Re: [v2] one file to rule them all?

2018-02-15 Thread Eric Wong
Eric Wong wrote:
> Timing "git rev-list --objects --all |wc -l" reveals much bigger
> differences. Timings are a bit fuzzy since (AFAIK) this is a
> shared system, but it's not really a contest:
>
>   2-38      ~5 minutes
>   2-2-2-34  ~30 seconds
>   2-2-36    ~30 seconds
>