[RFC patch 2/2] lib: index message files with duplicate message-ids

2017-03-15 Thread David Bremner
The corresponding xapian document just gets more terms added to it,
but this doesn't seem to break anything.
---
 lib/database.cc            | 3 +++
 test/T670-duplicate-mid.sh | 1 -
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/database.cc b/lib/database.cc
index a679cbab..e83017ed 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -2582,6 +2582,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
             if (ret)
                 goto DONE;
         } else {
+            ret = _notmuch_message_index_file (message, message_file);
+            if (ret)
+                goto DONE;
             ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
         }
 
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index d28afc91..41c53bc8 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -6,7 +6,6 @@ add_message [id]=id:duplicate '[subject]="message 1"'
 add_message [id]=id:duplicate '[subject]="message 2"'
 
 test_begin_subtest 'Search for second subject'
-test_subtest_known_broken
 cat 

a first step for the duplicate message-id dilemma

2017-03-15 Thread David Bremner

These are mainly RFC because I'm not 100% sure about the performance
impact.  It seems OK for me: about 3% slower indexing my 500k
messages with about 35k duplicates. I didn't see a noticeable increase
in database size (in both cases it's 5.8G / 3.5G before/after notmuch
compact).

There are also tons of UI issues: for example in the test case here,
notmuch search subject:'"message 2"' will happily print

thread:0001   2001-01-05 [1/1] Notmuch Test Suite; message 1 (inbox unread)

I claim it's still an improvement over the current code, where that
second message is not findable by any terms unique to it.
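
To illustrate the idea of one document collecting terms from several files,
here is a toy model in C. It is only a sketch: toy_doc_t, toy_doc_add_term,
toy_doc_has_term and toy_index_subject are hypothetical names made up for
this example, and none of this is the Xapian or notmuch API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_TERMS 16

/* Toy stand-in for a single indexed document: a small set of terms.
 * Hypothetical type, not part of notmuch or Xapian. */
typedef struct {
    const char *terms[MAX_TERMS];
    int n_terms;
} toy_doc_t;

/* Return true if the document already carries the given term. */
bool toy_doc_has_term (const toy_doc_t *doc, const char *term)
{
    assert (doc != NULL && term != NULL);
    for (int i = 0; i < doc->n_terms; i++)
        if (strcmp (doc->terms[i], term) == 0)
            return true;
    return false;
}

/* Add a term unless it is already present. */
void toy_doc_add_term (toy_doc_t *doc, const char *term)
{
    if (toy_doc_has_term (doc, term))
        return;
    assert (doc->n_terms < MAX_TERMS);
    doc->terms[doc->n_terms++] = term;
}

/* "Index" one message file by adding each subject word as a term.
 * Calling this again for a second file with the same message-id
 * simply grows the same document's term set. */
void toy_index_subject (toy_doc_t *doc, const char *const *words, int n_words)
{
    for (int i = 0; i < n_words; i++)
        toy_doc_add_term (doc, words[i]);
}
```

Indexing the words of "message 1" and then "message 2" into the same toy_doc_t
leaves three terms (message, 1, 2), so a search for the second subject can
match. The toy model keeps no per-file display data, which mirrors the UI issue
above: the match is found, but what gets printed still comes from the first
file.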
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


[RFC patch 1/2] test: add known broken test for duplicate message id

2017-03-15 Thread David Bremner
There are many other problems that could be tested, but this one we
have some hope of fixing because it doesn't require UI changes, just
indexing changes.
---
 test/T670-duplicate-mid.sh | 17 +
 1 file changed, 17 insertions(+)
 create mode 100755 test/T670-duplicate-mid.sh

diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
new file mode 100755
index ..d28afc91
--- /dev/null
+++ b/test/T670-duplicate-mid.sh
@@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+test_description="duplicate message ids"
+. ./test-lib.sh || exit 1
+
+add_message [id]=id:duplicate '[subject]="message 1"'
+add_message [id]=id:duplicate '[subject]="message 2"'
+
+test_begin_subtest 'Search for second subject'
+test_subtest_known_broken
+cat  
OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
+test_done
-- 
2.11.0



Re: [PATCH 2/2] lib: clamp return value of g_mime_utils_header_decode_date to >=0

2017-03-15 Thread David Bremner
David Bremner writes:

> For reasons not completely understood at this time, gmime (as of
> 2.6.22) is returning a date before 1900 on bad date input. Since this
> confuses some other software, we clamp such dates to 0,
> i.e. 1970-01-01.

series pushed, amended per Tomi's suggestion. It's possible I've been
writing an unhealthy amount of scheme lately. Dunno what else would make
the ternary if operator look sensible.

d


Re: [PATCH 2/2] lib: clamp return value of g_mime_utils_header_decode_date to >=0

2017-03-15 Thread Tomi Ollila
On Sun, Mar 12 2017, David Bremner wrote:

> For reasons not completely understood at this time, gmime (as of
> 2.6.22) is returning a date before 1900 on bad date input. Since this
> confuses some other software, we clamp such dates to 0,
> i.e. 1970-01-01.
> ---
>  lib/message.cc | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/lib/message.cc b/lib/message.cc
> index 007f1171..8a8a25b4 100644
> --- a/lib/message.cc
> +++ b/lib/message.cc
> @@ -1034,10 +1034,15 @@ _notmuch_message_set_header_values (notmuch_message_t *message,
>  
>  /* GMime really doesn't want to see a NULL date, so protect its
>   * sensibilities. */
> -if (date == NULL || *date == '\0')
> +if (date == NULL || *date == '\0') {
>   time_value = 0;

"Too bad" we already do this time_value = 0, otherwise I'd have suggested
-21 

$ perl -le 'print scalar localtime -21'
Sat Feb  7 21:54:38 1903

That is somewhere the Julian calendar is also in the 20th century ;)

> -else
> +} else {
>   time_value = g_mime_utils_header_decode_date (date, NULL);
> + /*
> +  * Workaround for https://bugzilla.gnome.org/show_bug.cgi?id=779923
> +  */
> + time_value = (time_value < 0) ? 0 : time_value;

Although the above probably compiles to the same thing, I'd propose (IMO for
clarity)

    if (time_value < 0)
        time_value = 0;

Anyway, LGTM.
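
For comparison, the two spellings of the clamp can be put side by side in a
small standalone C sketch. clamp_timestamp, clamp_timestamp_ternary and
clamp_self_check are hypothetical helpers written for this example, not
functions in notmuch; the sketch also assumes a signed time_t, as on common
platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Clamp with a plain if statement (the form proposed for clarity).
 * Hypothetical helper, not part of notmuch. */
time_t clamp_timestamp (time_t t)
{
    if (t < 0)
        t = 0;
    return t;
}

/* Clamp with the ternary operator (the form in the patch as posted). */
time_t clamp_timestamp_ternary (time_t t)
{
    return (t < 0) ? 0 : t;
}

/* Check that both forms agree on a few sample values,
 * including the boundary at 0. */
void clamp_self_check (void)
{
    const time_t samples[] = { -100000, -1, 0, 1, 1489536000 };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert (clamp_timestamp (samples[i])
                == clamp_timestamp_ternary (samples[i]));
}
```

Both functions map any negative value to 0 and leave everything else alone, so
the choice between them is purely about readability.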

Tomi


Btw: I added notmuch show --format=json '*' >&6 to the test script, and it
printed:

[[[{"id": "msg-001@notmuch-test-suite", "match": true, "excluded": false,
"filename": ["/home/too/vc/ext/notmuch/test/tmp.T111-x/mail/msg-001"],
"timestamp": 2085892096, "date_relative": "1899-12-31", "tags": ["inbox",
"unread"], "headers": {"Subject": "Test message #1", "From": "Notmuch Test
Suite ", "To": "Notmuch Test Suite
", "Date": "Sun, 31 Dec 1899 00:00:00 +"},
"body": [{"id": 1, "content-type": "text/plain", "content": "This is just a
test message (#1)\n"}]}, [

(... which one can see I just pasted to a new file... ;)


$ perl -le 'print scalar localtime 2085892096' 
Wed Feb  6 08:28:16 2036

So, it looks like we store the large negative time_value to a 32-bit signed
integer...
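
The arithmetic is consistent with that guess. As a sketch (the names
DATE_1899_12_31, low_32_bits and wraparound_self_check are made up for this
example, and it says nothing about where in the code the truncation actually
happens): 1899-12-31 00:00:00 UTC is 25568 days before the epoch, and keeping
only the low 32 bits of that negative value gives exactly the timestamp in the
JSON output.

```c
#include <assert.h>
#include <stdint.h>

/* 1899-12-31 00:00:00 UTC as seconds since 1970-01-01:
 * 25568 days before the epoch at 86400 seconds per day,
 * i.e. -2209075200. */
#define DATE_1899_12_31 (-25568LL * 86400LL)

/* Keep only the low 32 bits of a 64-bit time value. Conversion to an
 * unsigned type is defined as reduction modulo 2^32. */
uint32_t low_32_bits (int64_t t)
{
    return (uint32_t) t;
}

/* -2209075200 + 2^32 = 2085892096, the "timestamp" printed above.
 * The result is below 2^31, so it reads the same as a signed
 * 32-bit integer. */
void wraparound_self_check (void)
{
    assert (DATE_1899_12_31 == -2209075200LL);
    assert (low_32_bits (DATE_1899_12_31) == 2085892096u);
    assert (low_32_bits (DATE_1899_12_31) < 2147483648u);
}
```

So a date of 1899-12-31 turning into a timestamp in 2036 is what a wrap
through 32 bits would produce.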



> +}
>  
>  message->doc.add_value (NOTMUCH_VALUE_TIMESTAMP,
>   Xapian::sortable_serialise (time_value));
> -- 
> 2.11.0
>