Re: [RFC patch 2/2] lib: index message files with duplicate message-ids

2017-03-22 Thread Jani Nikula
On Thu, 16 Mar 2017, David Bremner wrote: > Daniel Kahn Gillmor writes: > >> On Wed 2017-03-15 21:57:28 -0400, David Bremner wrote: >>> The corresponding xapian document just gets more terms added to it, >>> but this doesn't seem to break anything. >>

[PATCH 5/7] lib/index: generalize filter name

2017-03-22 Thread David Bremner
We can't very well call it uuencode if it is going to filter other things as well. --- lib/index.cc | 92 +++- 1 file changed, 48 insertions(+), 44 deletions(-) diff --git a/lib/index.cc b/lib/index.cc index 02b35b81..3bb1ac1c 100644 ---

[PATCH 6/7] lib/index.cc: generalize filter state machine

2017-03-22 Thread David Bremner
To match things more complicated than fixed strings, we need states with multiple out arrows. --- lib/index.cc | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/lib/index.cc b/lib/index.cc index 3bb1ac1c..fd66762c 100644 --- a/lib/index.cc +++

[PATCH 7/7] lib/index: add simple html filter

2017-03-22 Thread David Bremner
Just drop all tags --- lib/index.cc | 21 - test/T680-html-indexing.sh | 5 - 2 files changed, 24 insertions(+), 2 deletions(-) diff --git a/lib/index.cc b/lib/index.cc index fd66762c..324e6e79 100644 --- a/lib/index.cc +++ b/lib/index.cc @@ -206,6 +206,22

[PATCH 2/7] lib: add content type argument to uuencode filter.

2017-03-22 Thread David Bremner
The idea is to support more general types of filtering, based on content type. --- lib/index.cc | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/lib/index.cc b/lib/index.cc index 8c145540..1c04cc3d 100644 --- a/lib/index.cc +++ b/lib/index.cc @@ -56,6 +56,7 @@

[PATCH 3/7] lib/index: Add another layer of indirection in filtering

2017-03-22 Thread David Bremner
We could add a second gmime filter subclass, but prefer to avoid duplicating the boilerplate. --- lib/index.cc | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/lib/index.cc b/lib/index.cc index 1c04cc3d..74a750b9 100644 --- a/lib/index.cc +++ b/lib/index.cc @@

Drop HTML tags when indexing

2017-03-22 Thread David Bremner
Steven Allen pointed out [2] that the previous scanner [1] was a little too simplistic. This version handles (or claims to) quoted strings in attributes, which can apparently contain '>' and '<' characters. This required generalizing the state machine runner a bit [3] to handle states with
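
For readers skimming the thread, the idea of a tag-dropping scanner that tolerates '<' and '>' inside quoted attribute values can be sketched roughly as below. This is only an illustration and is not the code from the patch series: the names strip_tags and State are hypothetical, the sketch works on a whole std::string rather than as a streaming gmime filter, and it makes no attempt to handle HTML comments, script bodies, or entities.

```cpp
#include <string>

// Hypothetical illustration, not notmuch's lib/index.cc.
// Text:         outside any tag, characters are copied to the output.
// Tag:          inside <...>, characters are dropped.
// DoubleQuoted: inside a "..." attribute value within a tag.
// SingleQuoted: inside a '...' attribute value within a tag.
enum class State { Text, Tag, DoubleQuoted, SingleQuoted };

std::string strip_tags(const std::string &in)
{
    std::string out;
    State state = State::Text;

    for (char c : in) {
        switch (state) {
        case State::Text:
            if (c == '<')
                state = State::Tag;
            else
                out.push_back(c);
            break;
        case State::Tag:
            // This state has several outgoing transitions, which a
            // fixed-string matcher cannot express.
            if (c == '>')
                state = State::Text;
            else if (c == '"')
                state = State::DoubleQuoted;
            else if (c == '\'')
                state = State::SingleQuoted;
            break;
        case State::DoubleQuoted:
            // '<' and '>' are ordinary characters until the closing quote.
            if (c == '"')
                state = State::Tag;
            break;
        case State::SingleQuoted:
            if (c == '\'')
                state = State::Tag;
            break;
        }
    }
    return out;
}
```

With this sketch, a '>' inside a quoted attribute value does not end the tag, so the attribute text never reaches the output.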

[PATCH 1/7] test: add known broken test for indexing html

2017-03-22 Thread David Bremner
'quite' on IRC reported that notmuch new was grinding to a halt during initial indexing, and we eventually narrowed the problem down to some html parts with large embedded images. These cause the number of terms added to the Xapian database to explode (the first 400 messages generated 4.6M unique

[PATCH 4/7] lib/index: separate state table definition from scanner.

2017-03-22 Thread David Bremner
We want to reuse the scanner definition with a different table --- lib/index.cc | 81 +++- 1 file changed, 47 insertions(+), 34 deletions(-) diff --git a/lib/index.cc b/lib/index.cc index 74a750b9..02b35b81 100644 --- a/lib/index.cc +++

Re: [PATCH 1/6] lib: bump SONAME to libnotmuch5

2017-03-22 Thread David Bremner
David Bremner writes: > We plan a sequence of ABI breaking changes. Put the SONAME change in a > separate commit to make reordering easier. I have pushed this series to master. I don't plan on bumping the SONAME for every breakage before the next release, so if you are

Re: [PATCH 4/4] lib: make notmuch_query_add_tag_exclude return a status value

2017-03-22 Thread David Bremner
David Bremner writes: > Since this is an ABI breaking change, bump the SONAME. pushed, although the SONAME bump was already there from the previous series. d ___ notmuch mailing list notmuch@notmuchmail.org

Re: [David Bremner] Re: RFC: drop html tags

2017-03-22 Thread David Bremner
David Bremner writes: > From: David Bremner > Subject: Re: RFC: drop html tags > To: Steven Allen > Date: Tue, 21 Mar 2017 14:03:10 -0300 > > Steven Allen writes: > >> In the JavaScript regex format, I believe

Re: Drop HTML tags when indexing

2017-03-22 Thread Daniel Lublin (quite)
This patch is good. notmuch now gets through my whole archive of 175k mails, memory usage peaking at 430M.