[WIP 17/17] import: quiet down warnings from bogus From: lines

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
There's a lot of crap in archives and git-fast-import accepts empty names and email addresses for authors just fine. --- lib/PublicInbox/Import.pm | 27 +++ 1 file changed, 19 insertions(+), 8 deletions(-) diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm

[WIP 14/17] address: extract more characters from email addresses

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
There's a lot of weird characters which show up in LKML archives which we did not support before. Furthermore, allow spaces before the '>' in the From: line as at least some non-spam poster used it. --- lib/PublicInbox/Address.pm | 3 ++- t/address.t| 5 +++-- 2 files changed, 5

[WIP 12/17] search: free up 'Q' prefix for a real unique identifier

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
This will allow easier compatibility with v2 code which will introduce content_id as the unique identifier. The old "XMID" becomes "XM" as a free text searchable term. "Q" becomes "XMID" as a boolean prefix. There are no user-visible changes in this, but there needs to be a schema version bump

[WIP 03/17] scripts/import_vger_from_mbox: relax From_ line match slightly

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
The mboxes I got from cregit have two spaces after the email address, while the "git format-patch" output I'm used to dealing with only has one space. It's still a "strict" match in that it checks for something resembling a timestamp, but it relaxes the number of spaces between the email address
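
The relaxed match described above can be sketched roughly as follows. This is a Python illustration, not the Perl from the patch; the pattern and the name `is_from_line` are hypothetical. It assumes an asctime-style timestamp and accepts one or more spaces after the address.

```python
import re

# Hypothetical pattern (not taken from the patch): "From ", an address-like
# token, ONE OR MORE spaces, then something resembling an asctime timestamp.
FROM_LINE = re.compile(
    r"^From \S+ +"                           # address, then 1+ spaces (the relaxed part)
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d "  # weekday, month, space-padded day
    r"\d{2}:\d{2}:\d{2} \d{4}$"              # HH:MM:SS and four-digit year
)


def is_from_line(line: str) -> bool:
    """Return True if `line` looks like an mbox From_ separator line."""
    return FROM_LINE.match(line.rstrip("\n")) is not None
```

The timestamp check keeps the match "strict" enough that an ordinary body line beginning with "From " is unlikely to be mistaken for a message separator.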

[WIP 07/17] t/import: test for last_object_id insertion

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
Check for this before doing the Xapian-based v2 importer. --- t/import.t | 25 - 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/t/import.t b/t/import.t index fb6238e..92c82b9 100644 --- a/t/import.t +++ b/t/import.t @@ -6,7 +6,10 @@ use Test::More; use

[WIP 16/17] scripts/import_vger_from_mbox: use v2 layout for import

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
Big lists are orders of magnitude more efficient with v2. --- scripts/import_vger_from_mbox | 24 ++-- 1 file changed, 6 insertions(+), 18 deletions(-) diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox index 3fa5c77..6ea2ca5 100644 ---

[WIP 11/17] import: APIs to support v2 use

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
Wrap "get-mark" and "checkpoint" commands for git-fast-import while documenting/cementing parts of the API. --- lib/PublicInbox/Import.pm | 28 ++-- t/import.t| 4 +++- 2 files changed, 25 insertions(+), 7 deletions(-) diff --git

[WIP 13/17] searchidx: fix comment around next_thread_id

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
I decided not to copy the notmuch implementation regarding serialization of integers to Xapian metadata. --- lib/PublicInbox/SearchIdx.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index 0ee0779..fa5057f 100644

[WIP 06/17] import: initial handling for v2

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
Call order will need to change a bit since this is going to be tied to Xapian --- MANIFEST | 1 + lib/PublicInbox/ContentId.pm | 30 ++ lib/PublicInbox/Import.pm| 74 +++- 3 files changed, 91 insertions(+), 14

[WIP 02/17] watch_maildir: allow '-' in mail filename

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
Hostnames can contain '-' and this allows public-inbox-watch(1) to work on machines which generate Maildir files with '-' in them. --- lib/PublicInbox/WatchMaildir.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/WatchMaildir.pm

[WIP 0/17] initial v2 work based on one-file tree

2018-02-15 Thread Eric Wong (Contractor, The Linux Foundation)
plagued the initial v1 design. There's also a couple of small fixes along the way to make it tolerate some crap in older archives. The search indexer and content-based deduplication will still need to be worked on. Eric Wong (Contractor, The Linux Foundation) (17): AUTHORS: add The Linux

[PATCH 10/12] v2: parallelize Xapian indexing

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes. git-fast-import is fast, so we don't bother parallelizing it. Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers

[PATCH 09/12] import_vger_from_mbox: use PublicInbox::MIME and avoid clobbering

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
It is less confusing without the clobber assignment, and PublicInbox::MIME exists to work around bugs in older Email::MIME (which is in Debian 9 (stretch)) --- scripts/import_vger_from_mbox | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/scripts/import_vger_from_mbox

[PATCH 07/12] v2: support Xapian + SQLite indexing

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
This is too slow, currently. Working with only 2017 LKML archives: git-only: ~1 minute git + SQLite: ~12 minutes git+Xapian+SQlite: ~45 minutes So yes, it looks like we'll need to parallelize Xapian indexing, at least. --- lib/PublicInbox/Import.pm | 1 +

[PATCH 12/12] searchidxpart: increase pipe size for partitions

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
We want to reduce the time the main V2Writable process spends writing to the pipe, as the main process itself is the primary source of contention. While we're at it, always flush after writing to ensure the child sees it at once. (Grr... Perl doesn't use writev) ---

[PATCH 01/12] import: allow the epoch (0s) as a valid time

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
Despite email not existing until 1971, "Jan 1, 1970 00:00:00" seems like a common default timestamp for some test emails to use as a Date: header. --- lib/PublicInbox/Import.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/Import.pm

[PATCH 08/12] import_vger_from_inbox: allow "-V" option

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
This will let us quickly test between v2 and v1 inboxes. --- scripts/import_vger_from_mbox | 24 +++- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox index d30e8a3..abc2d37 100644 ---

[PATCH 06/12] git: reload alternates file on missing blob

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
Since we'll be adding new repositories to the `alternates' file in git, we must restart the `git cat-file --batch' process as git currently does not detect changes to the alternates file in long-running cat-file processes. Don't bother with the `--batch-check' process since we won't be using it

[PATCH 03/12] search: stop assuming Message-ID is unique

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
In general, they are, but there's no way for a general purpose mail server to enforce that. This is a step in allowing us to handle more corner cases which existing lists throw at us. --- lib/PublicInbox/ExtMsg.pm| 2 +- lib/PublicInbox/Search.pm| 14 --

[PATCH 05/12] v2writable: initial cut for repo-rotation

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing. --- MANIFEST | 1 + lib/PublicInbox/Git.pm| 12 +++ lib/PublicInbox/Import.pm | 9

[PATCH 02/12] extmsg: fix broken Xapian MID lookup

2018-02-22 Thread Eric Wong (Contractor, The Linux Foundation)
This likely has no real world implications, though, as we fall back to Msgmap lookups anyways. Broken since commit 7eeadcb62729b0efbcb53cd9b7b181897c92cf9a ("search: remove unnecessary abstractions and functionality") --- lib/PublicInbox/ExtMsg.pm | 2 +- 1 file changed, 1 insertion(+), 1

[PATCH v2 0/21] UI bits and v2 import fixes

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
rors and bugs quickly in the PSGI interface. For sorting, relying on the Date: header seems unreliable as kernel developers seem more prone to having bad clocks than other lists I've imported. I'll probably switch the internal timestamps to use the Received: date as a result. Eric Wong (Contract

[PATCH 06/21] use PublicInbox::MIME consistently

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
It works around some bugs in older Email::MIME which we'll find useful. --- lib/PublicInbox/MIME.pm | 2 ++ lib/PublicInbox/SearchIdx.pm| 2 -- lib/PublicInbox/V2Writable.pm | 2 -- lib/PublicInbox/WatchMaildir.pm | 2 -- lib/PublicInbox/WwwAttach.pm| 3 +--

[PATCH 01/21] v2writable: warn on duplicate Message-IDs

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
This should give us an idea of how much a problem deduplication will be. --- lib/PublicInbox/SearchIdx.pm | 6 -- lib/PublicInbox/V2Writable.pm | 2 +- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index

[PATCH 17/21] v2writable: commit to skeleton via remote partitions

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
We need to ensure Xapian transaction commits are made to remote partitions before associated commits hit the skeleton DB. This causes unnecessary commits to be made to the skeleton DB; but they're mostly harmless. Further work will be necessary to ensure proper ordering and avoidance of

[PATCH 05/21] searchidxpart: binmode

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
Probably unnecessary, but set binmode for consistency across platforms. --- lib/PublicInbox/SearchIdxPart.pm | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm index 64e5263..63fc704 100644 ---

[PATCH 19/21] searchidx: do not modify Xapian DB while iterating

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
Iterating through a list of documents while modifying them does not seem to be supported in Xapian and it can trigger DatabaseCorruptError exceptions. This only worked with past datasets out of dumb luck. With the work-in-progress "v2" public-inbox layout, this problem might become more visible

[PATCH 02/21] v2/ui: some hacky things to get the PSGI UI to show up

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
Fortunately, Xapian multiple database support makes things easier but we still need to handle the skeleton DB separately. --- lib/PublicInbox/Inbox.pm | 21 + lib/PublicInbox/Search.pm | 42 -- 2 files changed, 57 insertions(+), 6

[PATCH 18/21] searchidxskeleton: extra error checking

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
I added these while chasing down the DatabaseCorruptError exceptions which turned out to be caused by Xapian DB modifications during iteration. --- lib/PublicInbox/SearchIdxSkeleton.pm | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm

[PATCH 15/21] search: use different Enquire object for skeleton queries

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
A different Xapian DB requires the use of a different Enquire object. This is necessary for get_thread and thread skeleton to work in the PSGI UI. --- lib/PublicInbox/Search.pm | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/lib/PublicInbox/Search.pm

[PATCH 11/21] searchidxthread: load doc data for references

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
Otherwise, references and thread linking don't happen across subject mismatches. Oops, this is important. --- lib/PublicInbox/SearchIdxThread.pm | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/PublicInbox/SearchIdxThread.pm b/lib/PublicInbox/SearchIdxThread.pm index fd133d1..57bb293

[PATCH 16/21] rename SearchIdxThread to SearchIdxSkeleton

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
Interchangeably using "all", "skel", "threader", etc. was confusing. Standardize on the "skeleton" term to describe this class since it's also used for retrieval of basic headers. --- MANIFEST | 2 +- lib/PublicInbox/Search.pm |

[PATCH 03/21] v2/ui: retry DB reopens in a few more places

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
Relying more on Xapian requires retrying reopens in more places to ensure it does not fall down and show errors to the user. --- lib/PublicInbox/Inbox.pm | 8 +--- lib/PublicInbox/Search.pm | 3 ++- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/lib/PublicInbox/Inbox.pm

[PATCH 13/21] search: reopen skeleton DB as well

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
Any Xapian DB is subject to the same errors and retries. Perhaps in the future this can be made more granular to avoid unnecessary reopens. --- lib/PublicInbox/Search.pm | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm

[PATCH 07/21] searchidxpart: chomp line before splitting

2018-02-28 Thread Eric Wong (Contractor, The Linux Foundation)
This was adding a needless newline into doc_data --- lib/PublicInbox/SearchIdxPart.pm | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm index 63fc704..e37887a 100644 --- a/lib/PublicInbox/SearchIdxPart.pm +++

[PATCH 4/7] view: avoid offset during pagination

2018-04-03 Thread Eric Wong (Contractor, The Linux Foundation)
OFFSET in SQLite gets painful to deal with. Instead, rely on timestamps (from Received:) for pagination. This also sets us up for more precise Date searching in case we want it. --- lib/PublicInbox/Feed.pm | 25 ++ lib/PublicInbox/Inbox.pm | 4 +-- lib/PublicInbox/Over.pm | 24
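
A minimal sketch of timestamp-based ("keyset") pagination, as opposed to OFFSET, might look like the following. It uses Python's sqlite3 module rather than the project's Perl/DBI code, and the table `messages`, its columns `ts` and `mid`, and the function names are all hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory DB; schema is hypothetical, not public-inbox's."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (ts INTEGER, mid TEXT)")
    conn.execute("CREATE INDEX idx_ts ON messages (ts)")
    conn.executemany(
        "INSERT INTO messages (ts, mid) VALUES (?, ?)",
        [(t, "msg%d@example.com" % t) for t in range(1, 8)],
    )
    return conn


def page_before(conn, before_ts=None, limit=3):
    """Return up to `limit` rows, newest first, strictly older than `before_ts`."""
    if before_ts is None:
        cur = conn.execute(
            "SELECT ts, mid FROM messages ORDER BY ts DESC LIMIT ?", (limit,)
        )
    else:
        # The index on ts lets SQLite seek straight to the boundary instead of
        # walking and discarding OFFSET rows.
        cur = conn.execute(
            "SELECT ts, mid FROM messages WHERE ts < ? ORDER BY ts DESC LIMIT ?",
            (before_ts, limit),
        )
    return cur.fetchall()
```

Each page passes the last timestamp it saw as `before_ts` for the next request. A real implementation would need a tiebreaker for messages sharing a timestamp; this sketch ignores that case.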

[PATCH 0/7] optimize V2

2018-04-03 Thread Eric Wong (Contractor, The Linux Foundation)
be improved across the board and commands like XOVER/XHDR are over 20x faster than they were in v1 with Xapian. There's also a few cleanups and code simplifications which should make future work easier. Eric Wong (Contractor, The Linux Foundation) (7): t/thread-all.t: modernize test to support

[PATCH 7/7] nntp: simplify the long_response API

2018-04-03 Thread Eric Wong (Contractor, The Linux Foundation)
We worked around the default range/termination conditions of long_response in many cases to reduce calls to SQLite or Xapian. So continue that trend and become more like the PSGI API which doesn't force callers to specify an article range or work inside a loop. --- lib/PublicInbox/Msgmap.pm |

[PATCH 6/7] msgmap: replace id_batch with ids_after

2018-04-03 Thread Eric Wong (Contractor, The Linux Foundation)
id_batch had an overly complicated interface, replace it with ids_after which is simpler and takes advantage of selectcol_arrayref in DBI. This allows simplification of callers and the diffstat agrees with me. --- lib/PublicInbox/Mbox.pm | 11 ++- lib/PublicInbox/Msgmap.pm | 19

[PATCH 2/7] rename+rewrite test using Benchmark module

2018-04-03 Thread Eric Wong (Contractor, The Linux Foundation)
There'll be more performance-related tests in the future. --- MANIFEST | 2 +- t/perf-threading.t | 32 t/thread-all.t | 32 3 files changed, 33 insertions(+), 33 deletions(-) create mode 100644 t/perf-threading.t

[PATCH 1/7] t/thread-all.t: modernize test to support modern inboxes

2018-04-03 Thread Eric Wong (Contractor, The Linux Foundation)
We'll be adding more tests in the same vein as this to improve NNTP performance. --- t/thread-all.t | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/t/thread-all.t b/t/thread-all.t index d4e8c1f..820fba8 100644 --- a/t/thread-all.t +++ b/t/thread-all.t

[PATCH 3/4] over: use only supported and safe SQLite APIs

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
Some of this jankiness was from early performance problems and they turned out to be unnecessary measures. --- lib/PublicInbox/OverIdx.pm| 9 - lib/PublicInbox/V2Writable.pm | 3 +-- script/public-inbox-compact | 13 + 3 files changed, 6 insertions(+), 19 deletions(-)

[PATCH 4/4] search: index and allow searching by date-time

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
Dscho found this useful for finding matching git commits based on AuthorDate in git. Add it to the overview DB format, too; so in the future we can support v2 repos without Xapian. https://public-inbox.org/git/nycvar.qro.7.76.6.1804041821420...@zvavag-6oxh6da.rhebcr.pbec.zvpebfbsg.pbz

[PATCH 2/4] v2writable: refer to each git repository as "epoch"

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
This hopefully helps for people who try to understand this design. --- lib/PublicInbox/V2Writable.pm | 54 +-- 1 file changed, 27 insertions(+), 27 deletions(-) diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index

[PATCH 4/4] ensure Xapian and SQLite are still optional for v1 tests

2018-04-06 Thread Eric Wong (Contractor, The Linux Foundation)
Xapian is size-intensive and SQLite is not strictly necessary for v1. --- script/public-inbox-compact | 2 +- scripts/import_vger_from_mbox | 2 +- t/convert-compact.t | 2 +- t/v2mirror.t | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git

[PATCH 2/4] nntp: set Xref across multiple inboxes

2018-04-06 Thread Eric Wong (Contractor, The Linux Foundation)
Noted by Jonathan Corbet in https://lwn.net/Articles/748184/ --- lib/PublicInbox/NNTP.pm | 43 --- t/nntp.t| 6 -- 2 files changed, 32 insertions(+), 17 deletions(-) diff --git a/lib/PublicInbox/NNTP.pm b/lib/PublicInbox/NNTP.pm index

[PATCH 8/8] msgmap: speed up minmax with separate queries

2018-04-06 Thread Eric Wong (Contractor, The Linux Foundation)
This significantly improves the performance of the NNTP GROUP command with 2.7 million messages from over 250ms to 700us. SQLite is weird about this, but at least there's a way to optimize it. --- lib/PublicInbox/Msgmap.pm | 10 +++--- t/perf-nntpd.t| 13 + 2 files
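
The optimization above can be illustrated with a small sketch, assuming the speedup comes from SQLite's documented MIN/MAX optimization: a query containing a single MIN() or MAX() on an indexed column can be answered with one index lookup. The Python below is a stand-in for the Perl code, and the table and column names are assumptions.

```python
import sqlite3


def make_demo_db(nums):
    """In-memory stand-in for the Msgmap table; schema here is an assumption."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE msgmap (num INTEGER PRIMARY KEY, mid TEXT)")
    conn.executemany(
        "INSERT INTO msgmap (num, mid) VALUES (?, ?)",
        [(n, "m%d@example.com" % n) for n in nums],
    )
    return conn


def minmax(conn):
    """Fetch the lowest and highest article numbers with two separate queries."""
    # Each query has exactly one aggregate, so each can use a single lookup
    # on the primary key instead of scanning the table.
    lo = conn.execute("SELECT MIN(num) FROM msgmap").fetchone()[0]
    hi = conn.execute("SELECT MAX(num) FROM msgmap").fetchone()[0]
    return lo, hi
```

A combined `SELECT MIN(num), MAX(num)` returns the same values, but it is not expected to qualify for the single-lookup path.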

[PATCH 3/8] over: remove forked subprocess

2018-04-06 Thread Eric Wong (Contractor, The Linux Foundation)
Since the overview stuff is a synchronization point anyways, move it into the main V2Writable process and allow us to drop a bunch of code. This is another step towards making Xapian optional for v2. In other words, the fan-out point is moved and the Xapian partitions no longer need to

[PATCH 4/8] v2writable: reduce barriers

2018-04-06 Thread Eric Wong (Contractor, The Linux Foundation)
Since we handle the overview info synchronously, we only need barriers in tests, now. We will use asynchronous checkpoints to sync less-important Xapian data. For data deduplication, this requires us to hoist out the cat-blob support in ::Import for reading uncommitted data in git. ---

[PATCH 7/8] store less data in the Xapian document

2018-04-06 Thread Eric Wong (Contractor, The Linux Foundation)
Since we only query the SQLite over DB for OVER/XOVER, we do not need to waste space storing fields To/Cc/:bytes/:lines or the XNUM term. We only use From/Subject/References/Message-ID/:blob in various places of the PSGI code. For reindexing, we will take advantage of docid stability in

[PATCH 1/8] psgi: ensure /$INBOX/$MESSAGE_ID/T/ endpoint is chronological

2018-04-06 Thread Eric Wong (Contractor, The Linux Foundation)
We only need to call get_thread beyond 1000 messages for fetching entire mboxes. It's probably too much for the HTML display otherwise. --- lib/PublicInbox/Mbox.pm | 6 +++--- lib/PublicInbox/Over.pm | 18 +++--- 2 files changed, 14 insertions(+), 10 deletions(-) diff --git

[PATCH 1/7] v2writable: recount partitions after acquiring lock

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
The partition count can change if public-inbox-compact runs while public-inbox-watch or public-inbox-index is running. --- lib/PublicInbox/V2Writable.pm | 44 --- 1 file changed, 29 insertions(+), 15 deletions(-) diff --git a/lib/PublicInbox/V2Writable.pm

[PATCH 6/7] support altid mechanism for v2

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
There's enough gmane links out there in wild that it makes sense to maintain support for these mappings. --- MANIFEST | 1 + lib/PublicInbox/AltId.pm | 20 +++--- lib/PublicInbox/Filter/RubyLang.pm | 22 ++- t/altid_v2.t

[PATCH 7/7] compact: better handling of over.sqlite3* files

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
Let's not scare users when they encounter files that are supposed to be there. Then, preserve the journal and pipe.lock, even if they're supposedly unused due to us holding the inbox-wide lock. --- script/public-inbox-compact | 9 + 1 file changed, 9 insertions(+) diff --git

[PATCH 5/7] searchview: minor cleanup

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
$mset->size is probably more obvious than relying on a tied array and saves us a line. --- lib/PublicInbox/SearchView.pm | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm index c789795..d038dfc 100644 ---

[PATCH 0/7] v2 odds and ends

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
Eric Wong (Contractor, The Linux Foundation) (7): v2writable: recount partitions after acquiring lock searchmsg: remove unused `tid' and `path' methods search: remove unnecessary OP_AND of query mbox: do not sort search results searchview: minor cleanup support altid mechanism for v2

[PATCH 2/7] searchmsg: remove unused `tid' and `path' methods

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
These internal attributes are not exposed and no longer used in our APIs. --- lib/PublicInbox/SearchMsg.pm | 5 - 1 file changed, 5 deletions(-) diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm index 6c0780e..d43853a 100644 --- a/lib/PublicInbox/SearchMsg.pm +++

[PATCH 4/7] mbox: do not sort search results

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
Sorting large msets is a waste when it comes to mboxes since MUAs should thread and sort them as the user desires. This forces us to rework each of the mbox download mechanisms to be more independent of each other, but might make things easier to reason about. --- lib/PublicInbox/Mbox.pm | 139

[PATCH 3/7] search: remove unnecessary OP_AND of query

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
This was vestigial code from the switch to the overview DB --- lib/PublicInbox/Search.pm | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index eca2b0f..4e014f4 100644 --- a/lib/PublicInbox/Search.pm +++ b/lib/PublicInbox/Search.pm @@ -216,7

[PATCH] v2writable: remove redundant remove from Over DB

2018-04-05 Thread Eric Wong (Contractor, The Linux Foundation)
The Xapian partitions will trigger the removal anyways. Test this and fix some description/spelling errors while we're at it. --- lib/PublicInbox/V2Writable.pm | 1 - t/v2writable.t| 6 -- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git

[PATCH] searchidx: ensure duplicated Message-IDs can be linked together

2018-04-04 Thread Eric Wong (Contractor, The Linux Foundation)
This allows us to emulate the display of thread-aware MUAs when multiple messages share the same Message-ID. This also is a place where "public-inbox-index --reindex" is useful to fix existing messages and no schema version bump is necessary. --- lib/PublicInbox/SearchIdx.pm | 13 +++--

[PATCH 2/4] import: rewrite less history during purge

2018-04-04 Thread Eric Wong (Contractor, The Linux Foundation)
We do not need to rewrite old commits unaffected by the object_id purge, only newer commits. This was a state management bug :x We will also return the new commit ID of rewritten history to aid in incremental indexing of mirrors for the next change. --- lib/PublicInbox/Import.pm | 25

[PATCH 1/4] init: s/GIT_DIR/REPO_DIR/ in usage

2018-04-04 Thread Eric Wong (Contractor, The Linux Foundation)
--- script/public-inbox-init | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/script/public-inbox-init b/script/public-inbox-init index d6a6482..3ef6c3b 100755 --- a/script/public-inbox-init +++ b/script/public-inbox-init @@ -5,7 +5,7 @@ # Initializes a public-inbox, basically

[PATCH 4/4] v2writable: do not modify DBs while iterating for ->remove

2018-04-04 Thread Eric Wong (Contractor, The Linux Foundation)
Xapian may become unhappy if a DB is modified during iteration: nntp://news.gmane.org/20180228004400.gu12...@survex.com --- lib/PublicInbox/V2Writable.pm | 46 +-- 1 file changed, 27 insertions(+), 19 deletions(-) diff --git a/lib/PublicInbox/V2Writable.pm
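
The general pattern (finish iterating, then modify) can be sketched as below. This is only an illustration: a plain Python dict stands in for the database, no Xapian API is used, and `remove_matching` is a hypothetical name.

```python
def remove_matching(db, predicate):
    """Delete every entry whose value satisfies `predicate`; return removed keys.

    `db` is a plain dict standing in for a document store (an assumption for
    illustration; this is not how Xapian is accessed).
    """
    # Phase 1: iterate read-only and remember what needs to go.
    doomed = [docid for docid, doc in db.items() if predicate(doc)]
    # Phase 2: modify the store only after the iteration has finished.
    for docid in doomed:
        del db[docid]
    return doomed
```

Deleting inside the loop over `db.items()` would raise a RuntimeError in Python; a database may fail less visibly, which is why the two phases are kept apart.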

[PATCH] searchidx: revert default BATCH_BYTES to 1_000_000

2018-04-18 Thread Eric Wong (Contractor, The Linux Foundation)
This increases indexing time by around 10% but roughly halves memory usage of an -index process. We will probably make this tunable in the future for people with bigger/smaller machines. --- lib/PublicInbox/SearchIdx.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git

[PATCH 01/12] feed: respect feedmax, again

2018-04-18 Thread Eric Wong (Contractor, The Linux Foundation)
Gigantic feeds probably make some clients unhappy, clamp it to what it was in the past. Fixes: b9534449ecce2c59 ("view: avoid offset during pagination") --- lib/PublicInbox/Feed.pm | 2 +- lib/PublicInbox/View.pm | 7 +++ 2 files changed, 4 insertions(+), 5 deletions(-) diff --git

[PATCH 03/12] compact: do not merge v2 repos by default

2018-04-18 Thread Eric Wong (Contractor, The Linux Foundation)
--no-renumber does not allow merging, and merging is not ideal for reindexing, either. --- script/public-inbox-compact | 15 ++- t/convert-compact.t | 2 -- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/script/public-inbox-compact

[PATCH 06/12] v2: generate better Message-IDs for duplicates

2018-04-18 Thread Eric Wong (Contractor, The Linux Foundation)
While hunting duplicates, I noticed a leading '-' in some Message-IDs as a result of RFC4648 encoding. While '-' seems allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon and make using Message-IDs as arguments for command-line tools more difficult. So prefix them with a datestamp

[PATCH 00/12] better dedupe, contiguous article numbers

2018-04-18 Thread Eric Wong (Contractor, The Linux Foundation)
Hopefully the final round of patches, most notably [7/12] improving deduplication and [9/12] to avoid causing difficulties for NNTP readers on v1 inboxes. And a few more bugfixes along the way... Eric Wong (Contractor, The Linux Foundation) (12): feed: respect feedmax, again v1: remove

[PATCH 11/12] use %H consistently to disable abbreviations

2018-04-18 Thread Eric Wong (Contractor, The Linux Foundation)
We generally do not want git to waste time finding abbreviations and we do not want the possibility of them becoming ambiguous over time, either. --- lib/PublicInbox/Feed.pm | 2 +- lib/PublicInbox/V2Writable.pm | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git

[PATCH 12/12] searchidx: increase term positions for all text terms

2018-04-18 Thread Eric Wong (Contractor, The Linux Foundation)
We do not want phrase searches to cross between independent fields (filenames/Message-ID vs bodies) --- lib/PublicInbox/SearchIdx.pm | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index 2239c90..6e44887 100644 ---

[PATCH 05/12] search: preserve References in Xapian smsg for x=t view

2018-04-18 Thread Eric Wong (Contractor, The Linux Foundation)
I'm not sure how useful this view is, but it exists for now. --- lib/PublicInbox/SearchIdx.pm | 1 + t/psgi_v2.t | 11 +-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index fd76627..0b1dc21

[PATCH 04/12] v2writable: reduce partitions by one

2018-04-18 Thread Eric Wong (Contractor, The Linux Foundation)
git fast-import and the main V2Writable process combined take about one CPU, so avoid having too many Xapian partitions which cause unnecessary I/O contention. --- lib/PublicInbox/V2Writable.pm | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git

[PATCH 08/12] import: cat_blob drops leading 'From ' lines like Inbox

2018-04-18 Thread Eric Wong (Contractor, The Linux Foundation)
In case people were running old buggy versions from 2016... (and -convert should probably clean those up, eventually) --- lib/PublicInbox/Import.pm | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm index c7a96e1..b25427e 100644 ---

[PATCH 10/12] extmsg: remove expensive git path checks

2018-04-18 Thread Eric Wong (Contractor, The Linux Foundation)
Searching across different inboxes is expensive without SQLite (or Xapian) installed, so avoid doing expensive tree lookups in git. Since SQLite is required for Xapian support anyways, we won't need to check Xapian, either. Sites without SQLite installed will simply 404 if somebody requests a

[PATCH] extmsg: use Xapian only for partial matches

2018-04-22 Thread Eric Wong (Contractor, The Linux Foundation)
"LIKE" in SQLite (and other SQL implementations I've seen) is expensive with nearly 3 million messages in the archives. This caused some partial Message-ID lookups to take over 600ms on my workstation (~300ms on a faster Xeon). Cut that to under 30ms on average on my workstation by relying

[PATCH] fix tests to run without Xapian installed

2018-04-19 Thread Eric Wong (Contractor, The Linux Foundation)
We'll be ensuring we can run more of the HTTP and all of the NNTP interface with only SQLite (and not Xapian) installed in the future. --- t/over.t | 3 ++- t/v1-add-remove-add.t | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/t/over.t b/t/over.t index

[PATCH 09/11] www: get rid of unnecessary 'inbox' name reference

2018-03-27 Thread Eric Wong (Contractor, The Linux Foundation)
We use the actual Inbox object everywhere else and don't need the name of the inbox separated from the object. --- lib/PublicInbox/WWW.pm | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm index 7bf866f..f5ed271 100644 --- a/lib/PublicInbox/WWW.pm

[PATCH 05/11] feed: fix new.html for v2

2018-03-27 Thread Eric Wong (Contractor, The Linux Foundation)
I forget this endpoint is still accessible (even if not linked). This also simplifies new.html all around and removes some unused clutter from the old days while we're at it. --- lib/PublicInbox/Feed.pm | 82 - t/psgi_v2.t | 6 2

[PATCH 06/11] view: permalink (per-message) view shows multiple messages

2018-03-27 Thread Eric Wong (Contractor, The Linux Foundation)
This needs tests and further refinement, but current tests pass. --- lib/PublicInbox/Mbox.pm | 12 ++--- lib/PublicInbox/SearchMsg.pm | 7 +++ lib/PublicInbox/View.pm | 107 ++- lib/PublicInbox/WWW.pm | 7 +-- t/psgi_v2.t

[PATCH 10/11] searchview: remove unnecessary imports from MID module

2018-03-27 Thread Eric Wong (Contractor, The Linux Foundation)
We do not need many of these, anymore. --- lib/PublicInbox/SearchView.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm index 55c588c..6e537b4 100644 --- a/lib/PublicInbox/SearchView.pm +++

[PATCH 11/11] view: depend on SearchMsg for Message-ID

2018-03-27 Thread Eric Wong (Contractor, The Linux Foundation)
Since we need to handle messages with multiple and duplicate Message-ID headers, our thread skeleton display must account for that. Since we have a "preferred" Message-ID in case of conflicts, use it as the UUID in an Atom feed so readers do not get confused by conflicts. ---

[PATCH 07/11] searchidx: warn about vivifying multiple ghosts

2018-03-27 Thread Eric Wong (Contractor, The Linux Foundation)
This should help us detect bugs sooner in case we have space waste problems. --- lib/PublicInbox/SearchIdx.pm | 4 1 file changed, 4 insertions(+) diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index 7ac16ec..446cfb0 100644 --- a/lib/PublicInbox/SearchIdx.pm +++

[PATCH 04/11] t/psgi_v2: minimal test for Atom feed and t.mbox.gz

2018-03-27 Thread Eric Wong (Contractor, The Linux Foundation)
Some test coverage is better than none, here. --- t/psgi_v2.t | 22 ++ 1 file changed, 22 insertions(+) diff --git a/t/psgi_v2.t b/t/psgi_v2.t index 6a2ea5b..7389798 100644 --- a/t/psgi_v2.t +++ b/t/psgi_v2.t @@ -43,6 +43,7 @@ $mime->body_set("hello world!\n"); my @warn;

[PATCH 03/11] search: reopen DB if each_smsg_by_mid fails

2018-03-27 Thread Eric Wong (Contractor, The Linux Foundation)
This gives more up-to-date data and allows us to avoid reopening in more places ourselves. --- lib/PublicInbox/Search.pm | 5 + t/psgi_v2.t | 10 +++--- 2 files changed, 8 insertions(+), 7 deletions(-) diff --git a/lib/PublicInbox/Search.pm

[PATCH 00/11] duplicate support in UI + tests

2018-03-27 Thread Eric Wong (Contractor, The Linux Foundation)
e-ID. I'm not sure if messages with the same Message-ID should be rendered as relatives, or if they should only be grouped together in the same rootset (as if their "Subject:" matches) but not rendered as ancestors. Eric Wong (Contractor, The Linux Foundation) (11): import: consolida

[PATCH 2/4] search: move permissions handling to InboxWritable

2018-03-30 Thread Eric Wong (Contractor, The Linux Foundation)
We'll be making sure V2Writable uses this. --- lib/PublicInbox/InboxWritable.pm | 71 lib/PublicInbox/SearchIdx.pm | 80 t/search.t | 28 +-- 3 files changed, 93 insertions(+), 86 deletions(-) diff

[PATCH 4/4] v2: respect core.sharedRepository in git configs

2018-03-30 Thread Eric Wong (Contractor, The Linux Foundation)
Ensure -convert and -compact do not make repositories unreadable on live servers. --- lib/PublicInbox/InboxWritable.pm | 5 +- lib/PublicInbox/V2Writable.pm| 26 script/public-inbox-compact | 110 +-- script/public-inbox-convert | 23 +--

[PATCH 3/4] t/v2writable: use simplify permissions reading

2018-03-30 Thread Eric Wong (Contractor, The Linux Foundation)
We have Git::qx nowadays. --- t/v2writable.t | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/t/v2writable.t b/t/v2writable.t index 0eda432..7abb14f 100644 --- a/t/v2writable.t +++ b/t/v2writable.t @@ -43,13 +43,9 @@ if ('ensure git configs are correct') {

[PATCH 1/4] convert: avoid redundant "done\n" statement for fast-import

2018-03-30 Thread Eric Wong (Contractor, The Linux Foundation)
This bug was hidden due to timing problems with eatmydata or running with tmpfs for TMPDIR. --- script/public-inbox-convert | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/script/public-inbox-convert b/script/public-inbox-convert index 2b0a385..e6fb4f5 100755 ---

[PATCH 0/4] v2 permissions

2018-03-30 Thread Eric Wong (Contractor, The Linux Foundation)
Eric Wong (Contractor, The Linux Foundation) (4): convert: avoid redundant "done\n" statement for fast-import search: move permissions handling to InboxWritable t/v2writable: use simplify permissions reading v2: respect core.sharedRepository in git configs lib/P

[PATCH] search: retry_reopen on first_smsg_by_mid

2018-03-29 Thread Eric Wong (Contractor, The Linux Foundation)
This was causing errors while attempting to load messages via the WWW interface while mass-importing LKML. While we're at it, remove unnecessary eval from lookup_article. --- lib/PublicInbox/Search.pm | 46 ++ 1 file changed, 22 insertions(+), 24

[PATCH 4/4] mda: support v2 inboxes

2018-03-29 Thread Eric Wong (Contractor, The Linux Foundation)
I mainly focus on -watch for mirroring busy mailing lists, but using -mda should remain an option. --- MANIFEST | 1 + lib/PublicInbox/Emergency.pm | 3 ++- script/public-inbox-mda | 14 +-- t/v2mda.t| 59

[PATCH 2/4] v2writable: initializing an existing inbox is idempotent

2018-03-29 Thread Eric Wong (Contractor, The Linux Foundation)
And we do not want to start making confused repos if somebody leaves out "-V2" the second time around. --- lib/PublicInbox/V2Writable.pm | 1 - script/public-inbox-init | 17 - t/init.t | 32 ++-- 3 files changed, 46

[PATCH 0/4] tooling updates for v2

2018-03-29 Thread Eric Wong (Contractor, The Linux Foundation)
I suppose people will use -mda for large mailing lists, too (as opposed to -watch); so it now supports v2. After importing a giant archive into a v2 inbox, the new -compact wrapper can consolidate and reduce Xapian space (and FD) overhead. Eric Wong (Contractor, The Linux Foundation) (4

[PATCH 08/14] search: cleanup uniqueness checking

2018-03-29 Thread Eric Wong (Contractor, The Linux Foundation)
The only Xapian term which should be unique is the NNTP article number; so we no longer need find_unique_doc_id. --- lib/PublicInbox/Search.pm | 24 1 file changed, 8 insertions(+), 16 deletions(-) diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index

[PATCH 10/14] search: move find_doc_ids to searchidx

2018-03-29 Thread Eric Wong (Contractor, The Linux Foundation)
We do not need this subroutine for read-only use in Search.pm --- lib/PublicInbox/Search.pm| 8 lib/PublicInbox/SearchIdx.pm | 8 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index 7d42aaa..6f5e062

[PATCH 07/14] v2writable: support purging messages from git entirely

2018-03-29 Thread Eric Wong (Contractor, The Linux Foundation)
Purging existing messages is fairly straightforward since we can take advantage of Xapian and look up the git object_id with it. Unfortunately, purging an already "removed" message (which is no longer in Xapian) is not as easy and we'll need to expose ->purge_oids to purge by the git object_id

[PATCH 03/14] lookup by Message-ID favors the "primary" one

2018-03-29 Thread Eric Wong (Contractor, The Linux Foundation)
The Message-ID mapped to an NNTP article number is stronger, so we will favor that for attachment lookups. --- lib/PublicInbox/Inbox.pm | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm index 47b8630..4c7305f

[PATCH 05/14] searchmsg: document why we store To: and Cc: for NNTP

2018-03-29 Thread Eric Wong (Contractor, The Linux Foundation)
Otherwise I would forget and be tempted to remove them. --- lib/PublicInbox/SearchMsg.pm | 4 1 file changed, 4 insertions(+) diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm index e314fed..e55d401 100644 --- a/lib/PublicInbox/SearchMsg.pm +++
