Re: [RFC patch 2/2] lib: index message files with duplicate message-ids

2017-03-22 Thread Jani Nikula
On Thu, 16 Mar 2017, David Bremner  wrote:
> Daniel Kahn Gillmor  writes:
>
>> On Wed 2017-03-15 21:57:28 -0400, David Bremner wrote:
>>> The corresponding xapian document just gets more terms added to it,
>>> but this doesn't seem to break anything.
>>
>> this is an interesting suggestion.  thanks for proposing it!
>>
>> A couple questions:
>>
>>  0) what happens when one of the files gets deleted from the message
>> store? do the terms it contributes get removed from the index?
>>
>
> That's a good question, and an issue I hadn't thought about.
> Currently there's no way to do this short of deleting all the terms for
> all the files (excepting tags and properties, presumably) and
> reindexing. This will require some more thought, I think.

We already see some of this issue. First file gets indexed, second file
gets added, first file gets removed.

There's also the related problem of reindexing potentially changing the
file being indexed and returned. The first time around the indexing
order is likely the order the message files were received in; on
reindexing it's the order the message files are encountered in the file
system. I presume the patch at hand keeps the search terms that find the
messages the same regardless of the indexing order.

BR,
Jani.
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: Drop HTML tags when indexing

2017-03-22 Thread Daniel Lublin (quite)
This patch is good. notmuch now gets through my whole archive of 175k mails,
memory usage peaking at 430M.


Re: [PATCH 4/4] lib: make notmuch_query_add_tag_exclude return a status value

2017-03-22 Thread David Bremner
David Bremner  writes:

> Since this is an ABI breaking change, bump the SONAME.

pushed, although the SONAME bump was already there from the previous
series.

d


Re: [PATCH 1/6] lib: bump SONAME to libnotmuch5

2017-03-22 Thread David Bremner
David Bremner  writes:

> We plan a sequence of ABI breaking changes. Put the SONAME change in a
> separate commit to make reordering easier.

I have pushed this series to master. I don't plan on bumping the SONAME
for every breakage before the next release, so if you are building
snapshots, be warned.

d


[PATCH 1/7] test: add known broken test for indexing html

2017-03-22 Thread David Bremner
'quite' on IRC reported that notmuch new was grinding to a halt during
initial indexing, and we eventually narrowed the problem down to some
html parts with large embedded images. These cause the number of terms
added to the Xapian database to explode (the first 400 messages
generated 4.6M unique terms), and of course the resulting terms are
not much use for searching.

The second test is a sanity check for any "improved" indexing of HTML.
---
 test/T680-html-indexing.sh   | 19 +++
 test/corpora/README  |  3 ++
 test/corpora/html/attribute-text | 15 +
 test/corpora/html/embedded-image | 69 
 4 files changed, 106 insertions(+)
 create mode 100755 test/T680-html-indexing.sh
 create mode 100644 test/corpora/html/attribute-text
 create mode 100644 test/corpora/html/embedded-image

diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
new file mode 100755
index ..5e9cc4cb
--- /dev/null
+++ b/test/T680-html-indexing.sh
@@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+test_description="indexing of html parts"
+. ./test-lib.sh || exit 1
+
+add_email_corpus html
+
+test_begin_subtest 'embedded images should not be indexed'
+test_subtest_known_broken
+notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_begin_subtest 'non tag text should be indexed'
+notmuch search hunter2 | notmuch_search_sanitize > OUTPUT
+cat <<EOF >EXPECTED
+thread:XXX   2009-11-17 [1/1] David Bremner; test html attachment (inbox unread)
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
+test_done
diff --git a/test/corpora/README b/test/corpora/README
index 77c48e6e..c9a35fed 100644
--- a/test/corpora/README
+++ b/test/corpora/README
@@ -9,3 +9,6 @@ default
 broken
   The broken corpus contains messages that are broken and/or RFC
   non-compliant, ensuring we deal with them in a sane way.
+
+html
+  The html corpus contains html parts
diff --git a/test/corpora/html/attribute-text b/test/corpora/html/attribute-text
new file mode 100644
index ..6dae8194
--- /dev/null
+++ b/test/corpora/html/attribute-text
@@ -0,0 +1,15 @@
+From: David Bremner 
+To: David Bremner 
+Subject: test html attachment
+Date: Tue, 17 Nov 2009 21:28:38 +0600
+Message-ID: <87d1dajhgf@example.net>
+MIME-Version: 1.0
+Content-Type: text/html
+Content-Disposition: inline; filename=test.html
+
+
+  
+
+  
+  hunter2
+
diff --git a/test/corpora/html/embedded-image b/test/corpora/html/embedded-image
new file mode 100644
index ..40851530
--- /dev/null
+++ b/test/corpora/html/embedded-image
@@ -0,0 +1,69 @@
+From: =?utf-8?b?bWFsbW9ib3Jn?= 
+To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?= 
+Date: Tue, 19 Jul 2016 11:54:24 +0200
+X-Feed2Imap-Version: 1.2.5
+Message-Id: 
+Subject: =?utf-8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?=
+Content-Type: multipart/alternative; boundary="=-1468922508-176605-12427-9500-21-="
+MIME-Version: 1.0
+
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/plain; charset=utf-8; format=flowed
+Content-Transfer-Encoding: 8bit
+
+
+
+Malmö 2016-07-09
+
+I skrivande stund är vi i färd med att avetablera vår entreprenad på 
+Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större 
+dräneringsarbete som i sin tur har inneburit vissa 
+trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några 
+veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi 
+kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu 
+kommer den vackra fastigheten att klara sig torrskodd under många år 
+framöver [A]
+
+ 
+
+[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif
+-- 
+Feed: Förvaltnings AB Malmöborg
+
+Item: Tack alla trafikanter och fotgängare!
+
+Date: 2016-07-19 11:54:24 +0200
+Author: malmoborg
+Filed under: Nyheter
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/html; charset=utf-8
+Content-Transfer-Encoding: 8bit
+
+
+
+Feed:
+http://malmoborg.se;>
+Förvaltnings AB Malmöborg
+
+Item:
+http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/;>Tack
 alla trafikanter och fotgängare!
+
+
+
+Malmö 2016-07-09
+I skrivande stund är vi i färd med att avetablera vår entreprenad på Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större dräneringsarbete som i sin tur har inneburit vissa trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu kommer den vackra fastigheten att klara sig torrskodd under många år framöver
+
+
+
+Date:2016-07-19 11:54:24 

[PATCH 4/7] lib/index: separate state table definition from scanner.

2017-03-22 Thread David Bremner
We want to reuse the scanner definition with a different table
---
 lib/index.cc | 81 +++-
 1 file changed, 47 insertions(+), 34 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 74a750b9..02b35b81 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -31,6 +31,15 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
 
 typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
char **out, size_t *outlen, size_t *outprespace);
+
+typedef struct {
+int state;
+int a;
+int b;
+int next_if_match;
+int next_if_not_match;
+} scanner_state_t;
+
 /**
  * NotmuchFilterDiscardUuencode:
  *
@@ -119,46 +128,18 @@ filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t pres
 }
 
 static void
-filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
-   char **outbuf, size_t *outlen, size_t *outprespace)
+do_filter (const scanner_state_t states[],
+  int first_skipping_state,
+  GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+  char **outbuf, size_t *outlen, size_t *outprespace)
 {
 NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
 register const char *inptr = inbuf;
 const char *inend = inbuf + inlen;
 char *outptr;
-
+int next;
 (void) prespace;
 
-/* Simple, linear state-transition diagram for our filter.
- *
- * If the character being processed is within the range of [a, b]
- * for the current state then we transition next_if_match
- * state. If not, we transition to the next_if_not_match state.
- *
- * The final two states are special in that they are the states in
- * which we discard data. */
-static const struct {
-   int state;
-   int a;
-   int b;
-   int next_if_match;
-   int next_if_not_match;
-} states[] = {
-   {0,  'b',  'b',  1,  0},
-   {1,  'e',  'e',  2,  0},
-   {2,  'g',  'g',  3,  0},
-   {3,  'i',  'i',  4,  0},
-   {4,  'n',  'n',  5,  0},
-   {5,  ' ',  ' ',  6,  0},
-   {6,  '0',  '7',  7,  0},
-   {7,  '0',  '7',  8,  0},
-   {8,  '0',  '7',  9,  0},
-   {9,  ' ',  ' ',  10, 0},
-   {10, '\n', '\n', 11, 10},
-   {11, 'M',  'M',  12, 0},
-   {12, ' ',  '`',  12, 11}
-};
-int next;
 
 g_mime_filter_set_size (gmime_filter, inlen, FALSE);
 outptr = gmime_filter->outbuf;
@@ -174,7 +155,7 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
next = states[filter->state].next_if_not_match;
}
 
-   if (filter->state < 11)
+   if (filter->state < first_skipping_state)
*outptr++ = *inptr;
 
filter->state = next;
@@ -187,6 +168,38 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
 }
 
 static void
+filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+   char **outbuf, size_t *outlen, size_t *outprespace)
+{
+/* Simple, linear state-transition diagram for our filter.
+ *
+ * If the character being processed is within the range of [a, b]
+ * for the current state then we transition next_if_match
+ * state. If not, we transition to the next_if_not_match state.
+ *
+ * The final two states are special in that they are the states in
+ * which we discard data. */
+static const scanner_state_t states[] = {
+   {0,  'b',  'b',  1,  0},
+   {1,  'e',  'e',  2,  0},
+   {2,  'g',  'g',  3,  0},
+   {3,  'i',  'i',  4,  0},
+   {4,  'n',  'n',  5,  0},
+   {5,  ' ',  ' ',  6,  0},
+   {6,  '0',  '7',  7,  0},
+   {7,  '0',  '7',  8,  0},
+   {8,  '0',  '7',  9,  0},
+   {9,  ' ',  ' ',  10, 0},
+   {10, '\n', '\n', 11, 10},
+   {11, 'M',  'M',  12, 0},
+   {12, ' ',  '`',  12, 11}
+};
+
+do_filter(states, 11,
+ gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
 filter_complete (GMimeFilter *filter, char *inbuf, size_t inlen, size_t prespace,
 char **outbuf, size_t *outlen, size_t *outprespace)
 {
-- 
2.11.0
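
For readers following the refactoring, here is a minimal, self-contained
sketch of the table-driven scanner that this patch factors out. It is not
the notmuch code: scan_copy and uu_states are hypothetical names, the
GMime buffer handling is replaced by plain arrays, and only the state
table itself is copied from the patch.

```c
#include <stddef.h>

/* One row per state: if the input byte lies in [a, b] the scanner moves
 * to next_if_match, otherwise to next_if_not_match. */
typedef struct {
    int state;
    int a;
    int b;
    int next_if_match;
    int next_if_not_match;
} scanner_state_t;

/* Table copied from the patch: recognise "begin NNN <name>\n", then
 * discard lines that start with 'M' and stay within ' '..'`'. */
static const scanner_state_t uu_states[] = {
    {0,  'b',  'b',  1,  0},
    {1,  'e',  'e',  2,  0},
    {2,  'g',  'g',  3,  0},
    {3,  'i',  'i',  4,  0},
    {4,  'n',  'n',  5,  0},
    {5,  ' ',  ' ',  6,  0},
    {6,  '0',  '7',  7,  0},
    {7,  '0',  '7',  8,  0},
    {8,  '0',  '7',  9,  0},
    {9,  ' ',  ' ',  10, 0},
    {10, '\n', '\n', 11, 10},
    {11, 'M',  'M',  12, 0},
    {12, ' ',  '`',  12, 11}
};

/* Hypothetical stand-in for do_filter: copy bytes from in to out,
 * dropping every byte that is read while the scanner is in a state
 * >= first_skipping_state. *state persists between calls so input can
 * arrive in chunks. Returns the number of bytes written to out, which
 * must have room for len bytes. */
static size_t
scan_copy (const scanner_state_t states[], int first_skipping_state,
           int *state, const char *in, size_t len, char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        const scanner_state_t *s = &states[*state];
        int next = (in[i] >= s->a && in[i] <= s->b)
            ? s->next_if_match : s->next_if_not_match;

        /* The keep/drop decision uses the state before the transition. */
        if (*state < first_skipping_state)
            out[n++] = in[i];

        *state = next;
    }
    return n;
}
```

Calling scan_copy (uu_states, 11, &state, ...) with state starting at 0
mirrors the do_filter (states, 11, ...) call in the patch. Because the
keep/drop decision is taken before the transition, the byte that ends a
skipping run is itself dropped in this sketch.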



[PATCH 6/7] lib/index.cc: generalize filter state machine

2017-03-22 Thread David Bremner
To match things more complicated than fixed strings, we need states
with multiple out arrows.
---
 lib/index.cc | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 3bb1ac1c..fd66762c 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -122,23 +122,25 @@ do_filter (const scanner_state_t states[],
 register const char *inptr = inbuf;
 const char *inend = inbuf + inlen;
 char *outptr;
-int next;
+int next, current;
 (void) prespace;
 
 
 g_mime_filter_set_size (gmime_filter, inlen, FALSE);
 outptr = gmime_filter->outbuf;
 
+current = filter->state;
 while (inptr < inend) {
-   if (*inptr >= states[filter->state].a &&
-   *inptr <= states[filter->state].b)
-   {
-   next = states[filter->state].next_if_match;
-   }
-   else
-   {
-   next = states[filter->state].next_if_not_match;
-   }
+   /* do "fake transitions" until we fire a rule, or run out of rules */
+   do {
+   if (*inptr >= states[current].a && *inptr <= states[current].b)  {
+   next = states[current].next_if_match;
+   } else  {
+   next = states[current].next_if_not_match;
+   }
+
+   current = next;
+   } while (next != states[next].state);
 
if (filter->state < first_skipping_state)
*outptr++ = *inptr;
-- 
2.11.0
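
To make the "fake transitions" idea concrete, here is a small standalone
sketch of a runner that allows several rows per state. It is an
illustration only: scan_copy_multi and html_states are hypothetical
names, plain arrays replace the GMime buffers, and the table is the HTML
table from patch 7/7 of this series. The sketch assumes, as that table
does, that a state's number equals the index of its first row.

```c
#include <stddef.h>

typedef struct {
    int state;
    int a;
    int b;
    int next_if_match;
    int next_if_not_match;
} scanner_state_t;

/* State 1 ("inside a tag") has three rows. A miss on one row falls
 * through to the next row of the same state. */
static const scanner_state_t html_states[] = {
    {0,  '<',  '<',  1,  0},
    {1,  '\'', '\'', 4,  2},  /* scanning for quote or > */
    {1,  '"',  '"',  5,  3},
    {1,  '>',  '>',  0,  1},
    {4,  '\'', '\'', 1,  4},  /* inside single quotes */
    {5,  '"',  '"',  1,  5},  /* inside double quotes */
};

/* Hypothetical runner: like the two-way scanner, but keeps following
 * rows until it lands on a row index that is a real state, i.e. one
 * where states[next].state == next. Bytes read while in a state
 * >= first_skipping_state are dropped. Returns bytes written to out. */
static size_t
scan_copy_multi (const scanner_state_t states[], int first_skipping_state,
                 int *state, const char *in, size_t len, char *out)
{
    size_t n = 0;
    int current = *state;

    for (size_t i = 0; i < len; i++) {
        int row = current;
        int next;

        /* do "fake transitions" until a rule fires or rows run out */
        do {
            if (in[i] >= states[row].a && in[i] <= states[row].b)
                next = states[row].next_if_match;
            else
                next = states[row].next_if_not_match;
            row = next;
        } while (next != states[next].state);

        if (current < first_skipping_state)
            out[n++] = in[i];

        current = next;
    }
    *state = current;
    return n;
}
```

With first_skipping_state set to 1, a '>' inside a quoted attribute value
no longer ends the tag. The opening '<' is still copied, because it is
read while the scanner is in state 0; that should be harmless for term
generation but it is a property of this sketch worth knowing about.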



[PATCH 7/7] lib/index: add simple html filter

2017-03-22 Thread David Bremner
Just drop all tags
---
 lib/index.cc   | 21 -
 test/T680-html-indexing.sh |  5 -
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index fd66762c..324e6e79 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -206,6 +206,22 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
 }
 
 static void
+filter_filter_html (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+   char **outbuf, size_t *outlen, size_t *outprespace)
+{
+static const scanner_state_t states[] = {
+   {0,  '<',  '<',  1,  0},
+   {1,  '\'', '\'', 4,  2},  /* scanning for quote or > */
+   {1,  '"',  '"',  5,  3},
+   {1,  '>',  '>',  0,  1},
+   {4,  '\'', '\'', 1,  4},  /* inside single quotes */
+   {5,  '"', '"',   1,  5},  /* inside double quotes */
+};
+do_filter(states, 1,
+ gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
 filter_complete (GMimeFilter *filter, char *inbuf, size_t inlen, size_t prespace,
 char **outbuf, size_t *outlen, size_t *outprespace)
 {
@@ -252,7 +268,10 @@ notmuch_filter_discard_non_terms_new (GMimeContentType *content_type)
 filter = (NotmuchFilterDiscardNonTerms *) g_object_newv (type, 0, NULL);
 filter->state = 0;
 filter->content_type = content_type;
-filter->real_filter = filter_filter_uuencode;
+if (g_mime_content_type_is_type (content_type, "text", "html"))
+   filter->real_filter = filter_filter_html;
+else
+   filter->real_filter = filter_filter_uuencode;
 return (GMimeFilter *) filter;
 }
 
diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
index 5e9cc4cb..74f33708 100755
--- a/test/T680-html-indexing.sh
+++ b/test/T680-html-indexing.sh
@@ -5,10 +5,13 @@ test_description="indexing of html parts"
 add_email_corpus html
 
 test_begin_subtest 'embedded images should not be indexed'
-test_subtest_known_broken
 notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
 test_expect_equal_file /dev/null OUTPUT
 
+test_begin_subtest 'ignore > in attribute text'
+notmuch search swordfish | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
 test_begin_subtest 'non tag text should be indexed'
 notmuch search hunter2 | notmuch_search_sanitize > OUTPUT
 cat <<EOF >EXPECTED
-- 
2.11.0



[PATCH 2/7] lib: add content type argument to uuencode filter.

2017-03-22 Thread David Bremner
The idea is to support more general types of filtering, based on
content type.
---
 lib/index.cc | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 8c145540..1c04cc3d 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -56,6 +56,7 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
  **/
 struct _NotmuchFilterDiscardUuencode {
 GMimeFilter parent_object;
+GMimeContentType *content_type;
 int state;
 };
 
@@ -63,7 +64,7 @@ struct _NotmuchFilterDiscardUuencodeClass {
 GMimeFilterClass parent_class;
 };
 
-static GMimeFilter *notmuch_filter_discard_uuencode_new (void);
+static GMimeFilter *notmuch_filter_discard_uuencode_new (GMimeContentType *content);
 
 static void notmuch_filter_discard_uuencode_finalize (GObject *object);
 
@@ -102,8 +103,9 @@ notmuch_filter_discard_uuencode_finalize (GObject *object)
 static GMimeFilter *
 filter_copy (GMimeFilter *gmime_filter)
 {
-(void) gmime_filter;
-return notmuch_filter_discard_uuencode_new ();
+NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+
+return notmuch_filter_discard_uuencode_new (filter->content_type);
 }
 
 static void
@@ -196,7 +198,7 @@ filter_reset (GMimeFilter *gmime_filter)
  * Returns: a new #NotmuchFilterDiscardUuencode filter.
  **/
 static GMimeFilter *
-notmuch_filter_discard_uuencode_new (void)
+notmuch_filter_discard_uuencode_new (GMimeContentType *content_type)
 {
 static GType type = 0;
 NotmuchFilterDiscardUuencode *filter;
@@ -220,6 +222,7 @@ notmuch_filter_discard_uuencode_new (void)
 
 filter = (NotmuchFilterDiscardUuencode *) g_object_newv (type, 0, NULL);
 filter->state = 0;
+filter->content_type = content_type;
 
 return (GMimeFilter *) filter;
 }
@@ -396,7 +399,7 @@ _index_mime_part (notmuch_message_t *message,
 g_mime_stream_mem_set_owner (GMIME_STREAM_MEM (stream), FALSE);
 
 filter = g_mime_stream_filter_new (stream);
-discard_uuencode_filter = notmuch_filter_discard_uuencode_new ();
+discard_uuencode_filter = notmuch_filter_discard_uuencode_new (content_type);
 
 g_mime_stream_filter_add (GMIME_STREAM_FILTER (filter),
  discard_uuencode_filter);
-- 
2.11.0



[PATCH 3/7] lib/index: Add another layer of indirection in filtering

2017-03-22 Thread David Bremner
We could add a second gmime filter subclass, but prefer to avoid
duplicating the boilerplate.
---
 lib/index.cc | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 1c04cc3d..74a750b9 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -29,6 +29,8 @@
 typedef struct _NotmuchFilterDiscardUuencode NotmuchFilterDiscardUuencode;
 typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeClass;
 
+typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
+   char **out, size_t *outlen, size_t *outprespace);
 /**
  * NotmuchFilterDiscardUuencode:
  *
@@ -57,6 +59,7 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
 struct _NotmuchFilterDiscardUuencode {
 GMimeFilter parent_object;
 GMimeContentType *content_type;
+filter_fun real_filter;
 int state;
 };
 
@@ -110,7 +113,14 @@ filter_copy (GMimeFilter *gmime_filter)
 
 static void
 filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
-  char **outbuf, size_t *outlen, size_t *outprespace)
+  char **outbuf, size_t *outlen, size_t *outprespace) {
+NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+(*filter->real_filter)(gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
+filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+   char **outbuf, size_t *outlen, size_t *outprespace)
 {
 NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
 register const char *inptr = inbuf;
@@ -223,7 +233,7 @@ notmuch_filter_discard_uuencode_new (GMimeContentType *content_type)
 filter = (NotmuchFilterDiscardUuencode *) g_object_newv (type, 0, NULL);
 filter->state = 0;
 filter->content_type = content_type;
-
+filter->real_filter = filter_filter_uuencode;
 return (GMimeFilter *) filter;
 }
 
-- 
2.11.0
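
The indirection added here can be pictured with a small standalone
sketch: one filter object holds a function pointer, and the constructor
decides which implementation it points to. Everything below is
hypothetical and only loosely modelled on the series: demo_filter_t,
demo_filter_init and demo_filter_run are illustrative names, the content
type is passed as two plain strings instead of a GMimeContentType, and
the "html" implementation is a deliberately naive tag dropper rather than
the state machine used in notmuch.

```c
#include <stddef.h>
#include <string.h>

struct demo_filter;

/* Signature shared by all concrete filters (hypothetical, much simpler
 * than the GMime filter callback). */
typedef size_t (*demo_filter_fun) (struct demo_filter *f, const char *in,
                                   size_t len, char *out);

typedef struct demo_filter {
    demo_filter_fun real_filter;  /* chosen once, at construction time */
    int in_tag;                   /* state used by the naive HTML filter */
} demo_filter_t;

/* Copies the input unchanged. */
static size_t
filter_passthrough (demo_filter_t *f, const char *in, size_t len, char *out)
{
    (void) f;
    memcpy (out, in, len);
    return len;
}

/* Naive illustration: drops everything from '<' up to and including
 * the next '>', with no handling of quoted attribute values. */
static size_t
filter_naive_html (demo_filter_t *f, const char *in, size_t len, char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        if (in[i] == '<')
            f->in_tag = 1;
        else if (in[i] == '>' && f->in_tag)
            f->in_tag = 0;
        else if (! f->in_tag)
            out[n++] = in[i];
    }
    return n;
}

/* Pick the implementation from the content type. Exact, case-sensitive
 * comparison is an assumption made to keep the sketch short. */
static void
demo_filter_init (demo_filter_t *f, const char *type, const char *subtype)
{
    f->in_tag = 0;
    if (strcmp (type, "text") == 0 && strcmp (subtype, "html") == 0)
        f->real_filter = filter_naive_html;
    else
        f->real_filter = filter_passthrough;
}

/* The single entry point: callers never need to know which concrete
 * filter is installed. */
static size_t
demo_filter_run (demo_filter_t *f, const char *in, size_t len, char *out)
{
    return f->real_filter (f, in, len, out);
}
```

The appeal of this shape is that a new content type only needs one more
function and one more branch in the constructor, rather than a second
GObject subclass with its own boilerplate.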



Drop HTML tags when indexing

2017-03-22 Thread David Bremner
Steven Allen pointed out [2] that the previous scanner [1] was a
little too simplistic. This version handles (or claims to) quoted
strings in attributes, which can apparently contain '>' and '<'
characters. This required generalizing the state machine runner a bit
[3] to handle states with out-degree more than two.


[1]: id:20170321131549.19557-1-da...@tethera.net
[2]: id:87wpbipl9z@tesseract.cs.unb.ca
[3]:
diff --git a/lib/index.cc b/lib/index.cc
index 03223f7d..324e6e79 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -122,23 +122,25 @@ do_filter (const scanner_state_t states[],
 register const char *inptr = inbuf;
 const char *inend = inbuf + inlen;
 char *outptr;
-int next;
+int next, current;
 (void) prespace;
 
 
 g_mime_filter_set_size (gmime_filter, inlen, FALSE);
 outptr = gmime_filter->outbuf;
 
+current = filter->state;
 while (inptr < inend) {
-   if (*inptr >= states[filter->state].a &&
-   *inptr <= states[filter->state].b)
-   {
-   next = states[filter->state].next_if_match;
-   }
-   else
-   {
-   next = states[filter->state].next_if_not_match;
-   }
+   /* do "fake transitions" until we fire a rule, or run out of rules */
+   do {
+   if (*inptr >= states[current].a && *inptr <= states[current].b)  {
+   next = states[current].next_if_match;
+   } else  {
+   next = states[current].next_if_not_match;
+   }
+
+   current = next;
+   } while (next != states[next].state);
 
if (filter->state < first_skipping_state)
*outptr++ = *inptr;
@@ -209,7 +211,11 @@ filter_filter_html (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t
 {
 static const scanner_state_t states[] = {
{0,  '<',  '<',  1,  0},
+   {1,  '\'', '\'', 4,  2},  /* scanning for quote or > */
+   {1,  '"',  '"',  5,  3},
{1,  '>',  '>',  0,  1},
+   {4,  '\'', '\'', 1,  4},  /* inside single quotes */
+   {5,  '"', '"',   1,  5},  /* inside double quotes */
 };
 do_filter(states, 1,
  gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
index ee69209c..74f33708 100755
--- a/test/T680-html-indexing.sh
+++ b/test/T680-html-indexing.sh
@@ -8,4 +8,15 @@ test_begin_subtest 'embedded images should not be indexed'
 notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
 test_expect_equal_file /dev/null OUTPUT
 
+test_begin_subtest 'ignore > in attribute text'
+notmuch search swordfish | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_begin_subtest 'non tag text should be indexed'
+notmuch search hunter2 | notmuch_search_sanitize > OUTPUT
+cat <<EOF >EXPECTED
+thread:XXX   2009-11-17 [1/1] David Bremner; test html attachment (inbox unread)
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 test_done
diff --git a/test/corpora/html/attribute-text b/test/corpora/html/attribute-text
new file mode 100644
index ..6dae8194
--- /dev/null
+++ b/test/corpora/html/attribute-text
@@ -0,0 +1,15 @@
+From: David Bremner 
+To: David Bremner 
+Subject: test html attachment
+Date: Tue, 17 Nov 2009 21:28:38 +0600
+Message-ID: <87d1dajhgf@example.net>
+MIME-Version: 1.0
+Content-Type: text/html
+Content-Disposition: inline; filename=test.html
+
+
+  
+
+  
+  hunter2
+



[PATCH 5/7] lib/index: generalize filter name

2017-03-22 Thread David Bremner
We can't very well call it uuencode if it is going to filter other
things as well.
---
 lib/index.cc | 92 +++-
 1 file changed, 48 insertions(+), 44 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 02b35b81..3bb1ac1c 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -26,8 +26,8 @@
 
 /* Oh, how I wish that gobject didn't require so much noisy boilerplate!
  * (Though I have at least eliminated some of the stock set...) */
-typedef struct _NotmuchFilterDiscardUuencode NotmuchFilterDiscardUuencode;
-typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeClass;
+typedef struct _NotmuchFilterDiscardNonTerms NotmuchFilterDiscardNonTerms;
+typedef struct _NotmuchFilterDiscardNonTermsClass NotmuchFilterDiscardNonTermsClass;
 
 typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
char **out, size_t *outlen, size_t *outprespace);
@@ -41,44 +41,29 @@ typedef struct {
 } scanner_state_t;
 
 /**
- * NotmuchFilterDiscardUuencode:
+ * NotmuchFilterDiscardNonTerms:
  *
  * @parent_object: parent #GMimeFilter
  * @encode: encoding vs decoding
  * @state: State of the parser
  *
- * A filter to discard uuencoded portions of an email.
- *
- * A uuencoded portion is identified as beginning with a line
- * matching:
- *
- * begin [0-7][0-7][0-7] .*
- *
- * After that detection, and beginning with the following line,
- * characters will be discarded as long as the first character of each
- * line begins with M and subsequent characters on the line are within
- * the range of ASCII characters from ' ' to '`'.
- *
- * This is not a perfect UUencode filter. It's possible to have a
- * message that will legitimately match that pattern, (so that some
- * legitimate content is discarded). And for most UUencoded files, the
- * final line of encoded data (the line not starting with M) will be
- * indexed.
+ * A filter to discard non terms portions of an email, i.e. stuff not
+ * worth indexing.
  **/
-struct _NotmuchFilterDiscardUuencode {
+struct _NotmuchFilterDiscardNonTerms {
 GMimeFilter parent_object;
 GMimeContentType *content_type;
 filter_fun real_filter;
 int state;
 };
 
-struct _NotmuchFilterDiscardUuencodeClass {
+struct _NotmuchFilterDiscardNonTermsClass {
 GMimeFilterClass parent_class;
 };
 
-static GMimeFilter *notmuch_filter_discard_uuencode_new (GMimeContentType *content);
+static GMimeFilter *notmuch_filter_discard_non_terms_new (GMimeContentType *content);
 
-static void notmuch_filter_discard_uuencode_finalize (GObject *object);
+static void notmuch_filter_discard_non_terms_finalize (GObject *object);
 
 static GMimeFilter *filter_copy (GMimeFilter *filter);
 static void filter_filter (GMimeFilter *filter, char *in, size_t len, size_t prespace,
@@ -91,14 +76,14 @@ static void filter_reset (GMimeFilter *filter);
 static GMimeFilterClass *parent_class = NULL;
 
 static void
-notmuch_filter_discard_uuencode_class_init (NotmuchFilterDiscardUuencodeClass *klass)
+notmuch_filter_discard_non_terms_class_init (NotmuchFilterDiscardNonTermsClass *klass)
 {
 GObjectClass *object_class = G_OBJECT_CLASS (klass);
 GMimeFilterClass *filter_class = GMIME_FILTER_CLASS (klass);
 
 parent_class = (GMimeFilterClass *) g_type_class_ref (GMIME_TYPE_FILTER);
 
-object_class->finalize = notmuch_filter_discard_uuencode_finalize;
+object_class->finalize = notmuch_filter_discard_non_terms_finalize;
 
 filter_class->copy = filter_copy;
 filter_class->filter = filter_filter;
@@ -107,7 +92,7 @@ notmuch_filter_discard_uuencode_class_init (NotmuchFilterDiscardUuencodeClass *k
 }
 
 static void
-notmuch_filter_discard_uuencode_finalize (GObject *object)
+notmuch_filter_discard_non_terms_finalize (GObject *object)
 {
 G_OBJECT_CLASS (parent_class)->finalize (object);
 }
@@ -115,15 +100,15 @@ notmuch_filter_discard_uuencode_finalize (GObject *object)
 static GMimeFilter *
 filter_copy (GMimeFilter *gmime_filter)
 {
-NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
 
-return notmuch_filter_discard_uuencode_new (filter->content_type);
+return notmuch_filter_discard_non_terms_new (filter->content_type);
 }
 
 static void
 filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
   char **outbuf, size_t *outlen, size_t *outprespace) {
-NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
 (*filter->real_filter)(gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
 }
 
@@ -133,7 +118,7 @@ do_filter (const scanner_state_t states[],
   GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t 
prespace,

Re: [David Bremner] Re: RFC: drop html tags

2017-03-22 Thread David Bremner
David Bremner  writes:

> From: David Bremner 
> Subject: Re: RFC: drop html tags
> To: Steven Allen 
> Date: Tue, 21 Mar 2017 14:03:10 -0300
>
> Steven Allen  writes:
>
>> In the JavaScript regex format, I believe the correct way to parse this is:
>>
>> /<("[^"]*"|'[^']*'|[^"'>]*)*>/g
>>
>> Basically, while inside a tag, ignore everything between double and single 
>> quotes.
>
> Thanks for the reality check. It should be possible to handle quotes. In
> my limited understanding of that regex, we can do a bit better by
> forcing pairs of quotes to match, since I  is
> probably legal.

Actually, I'm wrong. My eyes just glaze over when faced with any
non-trivial regex, I guess.

d
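
For anyone who wants to try the regex quoted above outside JavaScript,
the following sketch feeds the same pattern to the POSIX regex API in
extended mode. The helper name find_first_tag is hypothetical, and the
assumption that the pattern carries over unchanged to POSIX ERE syntax is
mine, not something stated in the thread. Only the first match is
reported.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: locate the first tag in text using the regex
 * from the thread, written as a C string literal. On a match, stores
 * the byte offsets [*start, *end) and returns 0. Returns 1 when there
 * is no match and -1 if the pattern fails to compile. */
static int
find_first_tag (const char *text, size_t *start, size_t *end)
{
    /* <("[^"]*"|'[^']*'|[^"'>]*)*> */
    static const char pattern[] = "<(\"[^\"]*\"|'[^']*'|[^\"'>]*)*>";
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp (&re, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec (&re, text, 1, m, 0);
    if (rc == 0) {
        *start = (size_t) m[0].rm_so;
        *end = (size_t) m[0].rm_eo;
    }
    regfree (&re);

    return rc == 0 ? 0 : 1;
}
```

For an input such as x <img alt="a>b"> y the match is expected to cover
the whole tag, including the '>' inside the quoted attribute value,
whereas a scanner that stops at the first '>' would end the tag early.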