Duplicate message ids

2017-08-27 Thread Mark Walters

Hi

A concern for notmuch is some form of attack via someone sending a
message with a duplicate message id. I think I have seen it asserted
that it is not so much of a problem as the first message received by
notmuch will have priority.

However, I believe that this is not the case. The script below gives a
demonstration (on my system at least). I have written it as a "test" so
(I hope) it doesn't mess up the system. It should work if you put it in
the test directory and execute it.

It adds a message, runs notmuch new, adds a message with the same id,
runs notmuch new, and then runs notmuch search  and notmuch show
. The former shows the subject of the first message, and the latter
the subject of the second message.

Best wishes

Mark

#!/usr/bin/env bash
test_description='Do duplicate message ids get shown in arrival order'
. ./test-lib.sh || exit 1


find ${MAIL_DIR}
generate_message [id]=duplicate-id [subject]=first [body]=first
mkdir "${MAIL_DIR}"/b
mv "$gen_msg_filename" "${MAIL_DIR}"/b
notmuch new
generate_message [id]=duplicate-id [subject]=second [body]=second
mv "$gen_msg_filename" "${MAIL_DIR}"/a
notmuch new
find ${MAIL_DIR}

echo
echo SEARCH: observe \"first\" is the subject
notmuch search id:duplicate-id

echo
echo SHOW: observe \"second\" is the subject and body
notmuch show --format=json id:duplicate-id |json_pp
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


[Patch v4 06/12] lib: index message files with duplicate message-ids

2017-07-21 Thread David Bremner
The corresponding xapian document just gets more terms added to it,
but this doesn't seem to break anything. Values on the other hand get
overwritten, which is a bit annoying, but arguably it is not worse to
take the values (from, subject, date) from the last file indexed
rather than the first.
---
 lib/add-message.cc | 19 +++
 test/T160-json.sh  |  4 ++--
 test/T670-duplicate-mid.sh |  9 +++--
 3 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index 2922eaa9..f0a80c4f 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -529,19 +529,22 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (is_ghost)
/* Convert ghost message to a regular message */
_notmuch_message_remove_term (message, "type", "ghost");
-   ret = _notmuch_database_link_message (notmuch, message,
+   }
+
+   ret = _notmuch_database_link_message (notmuch, message,
  message_file, is_ghost);
-   if (ret)
-   goto DONE;
+   if (ret)
+   goto DONE;
 
+   if (is_new || is_ghost)
_notmuch_message_set_header_values (message, date, from, subject);
 
-   ret = _notmuch_message_index_file (message, message_file);
-   if (ret)
-   goto DONE;
-   } else {
+   ret = _notmuch_message_index_file (message, message_file);
+   if (ret)
+   goto DONE;
+
+   if (! is_new && !is_ghost)
ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
-   }
 
_notmuch_message_sync (message);
 } catch (const Xapian::Error ) {
diff --git a/test/T160-json.sh b/test/T160-json.sh
index ac51895e..07955a2b 100755
--- a/test/T160-json.sh
+++ b/test/T160-json.sh
@@ -71,8 +71,8 @@ test_begin_subtest "Format version: too high"
 test_expect_code 21 "notmuch search --format-version=999 \\*"
 
 test_begin_subtest "Show message: multiple filenames"
-add_message "[id]=message...@example.com [filename]=copy1"
-add_message "[id]=message...@example.com [filename]=copy2"
+add_message '[id]=message...@example.com [filename]=copy1 [date]="Fri, 05 Jan 
2001 15:43:52 +"'
+add_message '[id]=message...@example.com [filename]=copy2 [date]="Fri, 05 Jan 
2001 15:43:52 +"'
 cat < EXPECTED
 [
 [
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index ced28a21..137cb6a5 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -5,8 +5,14 @@ test_description="duplicate message ids"
 add_message '[id]="duplicate"' '[subject]="message 1" [filename]=copy1'
 add_message '[id]="duplicate"' '[subject]="message 2" [filename]=copy2'
 
+test_begin_subtest 'First subject preserved'
+cat < EXPECTED
+thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; message 1 (inbox unread)
+EOF
+notmuch search id:duplicate | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
 test_begin_subtest 'Search for second subject'
-test_subtest_known_broken
 cat <EXPECTED
 MAIL_DIR/copy1
 MAIL_DIR/copy2
@@ -16,7 +22,6 @@ test_expect_equal_file EXPECTED OUTPUT
 
 add_message '[id]="duplicate"' '[body]="sekrit" [filename]=copy3'
 test_begin_subtest 'search for body in duplicate file'
-test_subtest_known_broken
 cat <EXPECTED
 MAIL_DIR/copy1
 MAIL_DIR/copy2
-- 
2.13.2

___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


[PATCH] fixup! lib: index message files with duplicate message-ids

2017-06-10 Thread David Bremner
---

This seems to do the trick for preserving the subject as additional files are 
indexed.

 lib/add-message.cc | 3 ++-
 test/T670-duplicate-mid.sh | 7 +++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index 26405742..711ed9fa 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -536,7 +536,8 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (ret)
goto DONE;
 
-   _notmuch_message_set_header_values (message, date, from, subject);
+   if (is_new || is_ghost)
+   _notmuch_message_set_header_values (message, date, from, subject);
 
ret = _notmuch_message_index_file (message, message_file);
if (ret)
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index da5ce5d4..ea5e1d6a 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -5,6 +5,13 @@ test_description="duplicate message ids"
 add_message '[id]="duplicate"' '[subject]="message 1" [filename]=copy1'
 add_message '[id]="duplicate"' '[subject]="message 2" [filename]=copy2'
 
+test_begin_subtest 'First subject preserved'
+cat < EXPECTED
+thread:XXX   2001-01-05 [1/1(2)] Notmuch Test Suite; message 1 (inbox unread)
+EOF
+notmuch search id:duplicate | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
 test_begin_subtest 'Search for second subject'
 cat <EXPECTED
 MAIL_DIR/copy1
-- 
2.11.0

___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: [patch v3 06/12] lib: index message files with duplicate message-ids

2017-06-09 Thread David Bremner
David Bremner <da...@tethera.net> writes:

> Daniel Kahn Gillmor <d...@fifthhorseman.net> writes:

>> for example, i could follow up on the current message with another
>> message with Message-Id: 20170604123235.24466-7-da...@tethera.net and
>> give it a subject "Re: [patch v3 06/12] lib: do *not* index message
>> files with duplicate message-ids".  that's a bit odd, no?
>
> Yes, I agree that's a bit strange.  We should make some effort to
> display the subject that belongs with a given message body. I think it's
> not too hard [1] to preserve the old behaviour of keeping the first
> subject, date, and from. This leaves us with a version of the original
> hiding message attack, but only for the special case of regex searches,
> since those rely exclusively on the value slots.

I had a slightly radical idea for how to deal with that. Subject/from
from extra files could be appended to the value slot (e.g. separated by
newlines). Then regexp searches would behave similarly to term based
searches in that matching any file would match the message. We'd have to
be slightly careful about what anchors meant.  A further enhancement
would be to expose the search result as an array. This kind of approach
doesn't really make sense for dates, as we essentially search for those
as numbers, and such a hack would break the built-in xapian range
search.  
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: [patch v3 06/12] lib: index message files with duplicate message-ids

2017-06-05 Thread David Bremner
Daniel Kahn Gillmor <d...@fifthhorseman.net> writes:

> On Sun 2017-06-04 09:32:29 -0300, David Bremner wrote:
>> The corresponding xapian document just gets more terms added to it,
>> but this doesn't seem to break anything. Values on the other hand get
>> overwritten, which is a bit annoying, but arguably it is not worse to
>> take the values (from, subject, date) from the last file indexed
>> rather than the first.
[snip]
> for example, i could follow up on the current message with another
> message with Message-Id: 20170604123235.24466-7-da...@tethera.net and
> give it a subject "Re: [patch v3 06/12] lib: do *not* index message
> files with duplicate message-ids".  that's a bit odd, no?

Yes, I agree that's a bit strange.  We should make some effort to
display the subject that belongs with a given message body. I think it's
not too hard [1] to preserve the old behaviour of keeping the first
subject, date, and from. This leaves us with a version of the original
hiding message attack, but only for the special case of regex searches,
since those rely exclusively on the value slots.

[1]: should be just a matter of guarding the call to
_notmuch_message_set_header_values() with if (is_new || is_ghost), but
that needs testing.
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: [patch v3 06/12] lib: index message files with duplicate message-ids

2017-06-05 Thread Daniel Kahn Gillmor
On Sun 2017-06-04 09:32:29 -0300, David Bremner wrote:
> The corresponding xapian document just gets more terms added to it,
> but this doesn't seem to break anything. Values on the other hand get
> overwritten, which is a bit annoying, but arguably it is not worse to
> take the values (from, subject, date) from the last file indexed
> rather than the first.

It's certainly a change in behavior, though.  This suggests that i can
send you mail and have it change how an existing message shows up in the
summary view, for example.

for example, i could follow up on the current message with another
message with Message-Id: 20170604123235.24466-7-da...@tethera.net and
give it a subject "Re: [patch v3 06/12] lib: do *not* index message
files with duplicate message-ids".  that's a bit odd, no?

I'm not saying it's wrong, i'm just hoping to acknowledge that this
might be a controversial change.

this whole series is pretty elaborate, of course, but so far it's
looking like reasonable steps toward a clearer and more hackable
codebase.

  --dkg


signature.asc
Description: PGP signature
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


[patch v3 06/12] lib: index message files with duplicate message-ids

2017-06-04 Thread David Bremner
The corresponding xapian document just gets more terms added to it,
but this doesn't seem to break anything. Values on the other hand get
overwritten, which is a bit annoying, but arguably it is not worse to
take the values (from, subject, date) from the last file indexed
rather than the first.
---
 lib/add-message.cc | 20 +++-
 test/T160-json.sh  |  4 ++--
 test/T670-duplicate-mid.sh |  2 --
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index 2922eaa9..ae9b14a7 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -529,19 +529,21 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (is_ghost)
/* Convert ghost message to a regular message */
_notmuch_message_remove_term (message, "type", "ghost");
-   ret = _notmuch_database_link_message (notmuch, message,
+   }
+
+   ret = _notmuch_database_link_message (notmuch, message,
  message_file, is_ghost);
-   if (ret)
-   goto DONE;
+   if (ret)
+   goto DONE;
 
-   _notmuch_message_set_header_values (message, date, from, subject);
+   _notmuch_message_set_header_values (message, date, from, subject);
 
-   ret = _notmuch_message_index_file (message, message_file);
-   if (ret)
-   goto DONE;
-   } else {
+   ret = _notmuch_message_index_file (message, message_file);
+   if (ret)
+   goto DONE;
+
+   if (! is_new && !is_ghost)
ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
-   }
 
_notmuch_message_sync (message);
 } catch (const Xapian::Error ) {
diff --git a/test/T160-json.sh b/test/T160-json.sh
index ac51895e..07955a2b 100755
--- a/test/T160-json.sh
+++ b/test/T160-json.sh
@@ -71,8 +71,8 @@ test_begin_subtest "Format version: too high"
 test_expect_code 21 "notmuch search --format-version=999 \\*"
 
 test_begin_subtest "Show message: multiple filenames"
-add_message "[id]=message...@example.com [filename]=copy1"
-add_message "[id]=message...@example.com [filename]=copy2"
+add_message '[id]=message...@example.com [filename]=copy1 [date]="Fri, 05 Jan 
2001 15:43:52 +"'
+add_message '[id]=message...@example.com [filename]=copy2 [date]="Fri, 05 Jan 
2001 15:43:52 +"'
 cat < EXPECTED
 [
 [
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index ced28a21..f1952555 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -6,7 +6,6 @@ add_message '[id]="duplicate"' '[subject]="message 1" 
[filename]=copy1'
 add_message '[id]="duplicate"' '[subject]="message 2" [filename]=copy2'
 
 test_begin_subtest 'Search for second subject'
-test_subtest_known_broken
 cat 

[PATCH 06/12] lib: index message files with duplicate message-ids

2017-05-13 Thread David Bremner
The corresponding xapian document just gets more terms added to it,
but this doesn't seem to break anything. Values on the other hand get
overwritten, which is a bit annoying, but arguably it is not worse to
take the values (from, subject, date) from the last file indexed
rather than the first.
---
 lib/add-message.cc | 20 +++-
 test/T160-json.sh  |  4 ++--
 test/T670-duplicate-mid.sh |  2 --
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index 2922eaa9..ae9b14a7 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -529,19 +529,21 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (is_ghost)
/* Convert ghost message to a regular message */
_notmuch_message_remove_term (message, "type", "ghost");
-   ret = _notmuch_database_link_message (notmuch, message,
+   }
+
+   ret = _notmuch_database_link_message (notmuch, message,
  message_file, is_ghost);
-   if (ret)
-   goto DONE;
+   if (ret)
+   goto DONE;
 
-   _notmuch_message_set_header_values (message, date, from, subject);
+   _notmuch_message_set_header_values (message, date, from, subject);
 
-   ret = _notmuch_message_index_file (message, message_file);
-   if (ret)
-   goto DONE;
-   } else {
+   ret = _notmuch_message_index_file (message, message_file);
+   if (ret)
+   goto DONE;
+
+   if (! is_new && !is_ghost)
ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
-   }
 
_notmuch_message_sync (message);
 } catch (const Xapian::Error ) {
diff --git a/test/T160-json.sh b/test/T160-json.sh
index ac51895e..07955a2b 100755
--- a/test/T160-json.sh
+++ b/test/T160-json.sh
@@ -71,8 +71,8 @@ test_begin_subtest "Format version: too high"
 test_expect_code 21 "notmuch search --format-version=999 \\*"
 
 test_begin_subtest "Show message: multiple filenames"
-add_message "[id]=message...@example.com [filename]=copy1"
-add_message "[id]=message...@example.com [filename]=copy2"
+add_message '[id]=message...@example.com [filename]=copy1 [date]="Fri, 05 Jan 
2001 15:43:52 +"'
+add_message '[id]=message...@example.com [filename]=copy2 [date]="Fri, 05 Jan 
2001 15:43:52 +"'
 cat < EXPECTED
 [
 [
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index ced28a21..f1952555 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -6,7 +6,6 @@ add_message '[id]="duplicate"' '[subject]="message 1" 
[filename]=copy1'
 add_message '[id]="duplicate"' '[subject]="message 2" [filename]=copy2'
 
 test_begin_subtest 'Search for second subject'
-test_subtest_known_broken
 cat 

[PATCH 06/10] lib: index message files with duplicate message-ids

2017-05-13 Thread David Bremner
The corresponding xapian document just gets more terms added to it,
but this doesn't seem to break anything. Values on the other hand get
overwritten, which is a bit annoying, but arguably it is not worse to
take the values (from, subject, date) from the last file indexed
rather than the first.
---
 lib/add-message.cc | 20 +++-
 test/T160-json.sh  |  4 ++--
 test/T670-duplicate-mid.sh |  2 --
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index 2922eaa9..ae9b14a7 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -529,19 +529,21 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (is_ghost)
/* Convert ghost message to a regular message */
_notmuch_message_remove_term (message, "type", "ghost");
-   ret = _notmuch_database_link_message (notmuch, message,
+   }
+
+   ret = _notmuch_database_link_message (notmuch, message,
  message_file, is_ghost);
-   if (ret)
-   goto DONE;
+   if (ret)
+   goto DONE;
 
-   _notmuch_message_set_header_values (message, date, from, subject);
+   _notmuch_message_set_header_values (message, date, from, subject);
 
-   ret = _notmuch_message_index_file (message, message_file);
-   if (ret)
-   goto DONE;
-   } else {
+   ret = _notmuch_message_index_file (message, message_file);
+   if (ret)
+   goto DONE;
+
+   if (! is_new && !is_ghost)
ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
-   }
 
_notmuch_message_sync (message);
 } catch (const Xapian::Error ) {
diff --git a/test/T160-json.sh b/test/T160-json.sh
index ac51895e..07955a2b 100755
--- a/test/T160-json.sh
+++ b/test/T160-json.sh
@@ -71,8 +71,8 @@ test_begin_subtest "Format version: too high"
 test_expect_code 21 "notmuch search --format-version=999 \\*"
 
 test_begin_subtest "Show message: multiple filenames"
-add_message "[id]=message...@example.com [filename]=copy1"
-add_message "[id]=message...@example.com [filename]=copy2"
+add_message '[id]=message...@example.com [filename]=copy1 [date]="Fri, 05 Jan 
2001 15:43:52 +"'
+add_message '[id]=message...@example.com [filename]=copy2 [date]="Fri, 05 Jan 
2001 15:43:52 +"'
 cat < EXPECTED
 [
 [
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index ced28a21..f1952555 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -6,7 +6,6 @@ add_message '[id]="duplicate"' '[subject]="message 1" 
[filename]=copy1'
 add_message '[id]="duplicate"' '[subject]="message 2" [filename]=copy2'
 
 test_begin_subtest 'Search for second subject'
-test_subtest_known_broken
 cat 

[PATCH 06/10] lib: index message files with duplicate message-ids

2017-05-13 Thread David Bremner
The corresponding xapian document just gets more terms added to it,
but this doesn't seem to break anything. Values on the other hand get
overwritten, which is a bit annoying, but arguably it is not worse to
take the values (from, subject, date) from the last file indexed
rather than the first.
---
 lib/add-message.cc | 20 +++-
 test/T160-json.sh  |  4 ++--
 test/T670-duplicate-mid.sh |  2 --
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index 2922eaa9..ae9b14a7 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -529,19 +529,21 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (is_ghost)
/* Convert ghost message to a regular message */
_notmuch_message_remove_term (message, "type", "ghost");
-   ret = _notmuch_database_link_message (notmuch, message,
+   }
+
+   ret = _notmuch_database_link_message (notmuch, message,
  message_file, is_ghost);
-   if (ret)
-   goto DONE;
+   if (ret)
+   goto DONE;
 
-   _notmuch_message_set_header_values (message, date, from, subject);
+   _notmuch_message_set_header_values (message, date, from, subject);
 
-   ret = _notmuch_message_index_file (message, message_file);
-   if (ret)
-   goto DONE;
-   } else {
+   ret = _notmuch_message_index_file (message, message_file);
+   if (ret)
+   goto DONE;
+
+   if (! is_new && !is_ghost)
ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
-   }
 
_notmuch_message_sync (message);
 } catch (const Xapian::Error ) {
diff --git a/test/T160-json.sh b/test/T160-json.sh
index ac51895e..07955a2b 100755
--- a/test/T160-json.sh
+++ b/test/T160-json.sh
@@ -71,8 +71,8 @@ test_begin_subtest "Format version: too high"
 test_expect_code 21 "notmuch search --format-version=999 \\*"
 
 test_begin_subtest "Show message: multiple filenames"
-add_message "[id]=message...@example.com [filename]=copy1"
-add_message "[id]=message...@example.com [filename]=copy2"
+add_message '[id]=message...@example.com [filename]=copy1 [date]="Fri, 05 Jan 
2001 15:43:52 +"'
+add_message '[id]=message...@example.com [filename]=copy2 [date]="Fri, 05 Jan 
2001 15:43:52 +"'
 cat < EXPECTED
 [
 [
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index ced28a21..f1952555 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -6,7 +6,6 @@ add_message '[id]="duplicate"' '[subject]="message 1" 
[filename]=copy1'
 add_message '[id]="duplicate"' '[subject]="message 2" [filename]=copy2'
 
 test_begin_subtest 'Search for second subject'
-test_subtest_known_broken
 cat 

[rfc patch v3 6/6] lib: index message files with duplicate message-ids

2017-04-03 Thread David Bremner
The corresponding xapian document just gets more terms added to it,
but this doesn't seem to break anything.
---
 lib/database.cc|  3 +++
 test/T670-duplicate-mid.sh | 22 +++---
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 5bc131a3..3b9f7828 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -2582,6 +2582,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (ret)
goto DONE;
} else {
+   ret = _notmuch_message_index_file (message, message_file);
+   if (ret)
+   goto DONE;
ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
}
 
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index 88bd12cb..2c77e11e 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -2,11 +2,10 @@
 test_description="duplicate message ids"
 . ./test-lib.sh || exit 1
 
-add_message '[id]="id:duplicate"' '[subject]="message 1"'
-add_message '[id]="id:duplicate"' '[subject]="message 2"'
+add_message '[id]="duplicate"' '[subject]="message 1"'
+add_message '[id]="duplicate"' '[subject]="message 2"'
 
 test_begin_subtest 'Search for second subject'
-test_subtest_known_broken
 cat <EXPECTED
 MAIL_DIR/msg-001
 MAIL_DIR/msg-002
@@ -14,4 +13,21 @@ EOF
 notmuch search --output=files subject:'"message 2"' | notmuch_dir_sanitize > 
OUTPUT
 test_expect_equal_file EXPECTED OUTPUT
 
+add_message '[id]="duplicate"' '[body]="sekrit"'
+test_begin_subtest 'search for body in duplicate file'
+cat <EXPECTED
+MAIL_DIR/msg-001
+MAIL_DIR/msg-002
+MAIL_DIR/msg-003
+EOF
+notmuch search --output=files "sekrit" | notmuch_dir_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
+test_begin_subtest 'reindex removes terms from duplicate file'
+rm $MAIL_DIR/msg-003
+notmuch reindex id:duplicate
+cp /dev/null EXPECTED
+notmuch search --output=files "sekrit" | notmuch_dir_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
 test_done
-- 
2.11.0

___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


[rfc patch v2 5/5] lib: index message files with duplicate message-ids

2017-04-02 Thread David Bremner
The corresponding xapian document just gets more terms added to it,
but this doesn't seem to break anything.
---
 lib/database.cc|  3 +++
 test/T670-duplicate-mid.sh | 22 +++---
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 5bc131a3..3b9f7828 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -2582,6 +2582,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (ret)
goto DONE;
} else {
+   ret = _notmuch_message_index_file (message, message_file);
+   if (ret)
+   goto DONE;
ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
}
 
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index 88bd12cb..2c77e11e 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -2,11 +2,10 @@
 test_description="duplicate message ids"
 . ./test-lib.sh || exit 1
 
-add_message '[id]="id:duplicate"' '[subject]="message 1"'
-add_message '[id]="id:duplicate"' '[subject]="message 2"'
+add_message '[id]="duplicate"' '[subject]="message 1"'
+add_message '[id]="duplicate"' '[subject]="message 2"'
 
 test_begin_subtest 'Search for second subject'
-test_subtest_known_broken
 cat <EXPECTED
 MAIL_DIR/msg-001
 MAIL_DIR/msg-002
@@ -14,4 +13,21 @@ EOF
 notmuch search --output=files subject:'"message 2"' | notmuch_dir_sanitize > 
OUTPUT
 test_expect_equal_file EXPECTED OUTPUT
 
+add_message '[id]="duplicate"' '[body]="sekrit"'
+test_begin_subtest 'search for body in duplicate file'
+cat <EXPECTED
+MAIL_DIR/msg-001
+MAIL_DIR/msg-002
+MAIL_DIR/msg-003
+EOF
+notmuch search --output=files "sekrit" | notmuch_dir_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
+test_begin_subtest 'reindex removes terms from duplicate file'
+rm $MAIL_DIR/msg-003
+notmuch reindex id:duplicate
+cp /dev/null EXPECTED
+notmuch search --output=files "sekrit" | notmuch_dir_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
 test_done
-- 
2.11.0

___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC patch 2/2] lib: index message files with duplicate message-ids

2017-03-22 Thread Jani Nikula
On Thu, 16 Mar 2017, David Bremner  wrote:
> Daniel Kahn Gillmor  writes:
>
>> On Wed 2017-03-15 21:57:28 -0400, David Bremner wrote:
>>> The corresponding xapian document just gets more terms added to it,
>>> but this doesn't seem to break anything.
>>
>> this is an interesting suggestion.  thanks for proposing it!
>>
>> A couple questions:
>>
>>  0) what happens when one of the files gets deleted from the message
>> store? do the terms it contributes get removed from the index?
>>
>
> That's a good guestion, and an issue I hadn't thought about.
> Currently there's no way to do this short of deleting all the terms (for
> all the files (excepting tags and properties, presumably) and
> reindexing. This will require some more thought, I think.

We already see some of this issue. First file gets indexed, second file
gets added, first file gets removed.

There's also the related problem of reindexing potentially changing the
file being indexed and returned. The first time around the indexing
order is likely the order the message files were received in; on
reindexing it's the order the message files are encountered in the file
system. I presume the patch at hand keeps the search terms that find the
messages the same regardless of the indexing order.

BR,
Jani.
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC patch 2/2] lib: index message files with duplicate message-ids

2017-03-18 Thread David Bremner
Daniel Kahn Gillmor  writes:

> On Thu 2017-03-16 20:34:22 -0400, David Bremner wrote:
>> Daniel Kahn Gillmor  writes:
>>>  0) what happens when one of the files gets deleted from the message
>>> store? do the terms it contributes get removed from the index?
>>
>> That's a good guestion, and an issue I hadn't thought about.
>> Currently there's no way to do this short of deleting all the terms (for
>> all the files (excepting tags and properties, presumably) and
>> reindexing. This will require some more thought, I think.
>
> i didn't mean to raise the concern to drag this work down, i just want
> to make sure the problem is on the table.  dropping all terms on
> deletion and re-indexing remaining files with the same message ID isn't
> terribly efficient, but i don't think it's going to be terribly costly
> either.  we're not talking about hundreds of files per message-id in
> most normal cases; usually only two (sent-to-self,
> recvd-from-mailing-list), and maybe a half-dozen at most (messages sent
> to multiple mailboxes that all forward to me).

I can think of 3 general approaches at the moment. They each have (at
least) one gotcha; more precisely they each require some added
complexity somewhere else in the codebase.

One is this one, just add all the terms to one xapian document. The
gotcha is needing some reindexing facility (we want this for other
reasons, so that might not be so bad).

The second approach that occurs to me is to still add the terms to one
xapian document, but to prefix them with a number identifying the file
copy (1,2, etc). The complexity here is in the generation of queries,
each one needs to be OR_ed with eg. SUBJECT:foo or 1#SUBJECT:foo or
2#SUBJECT:foo. I'm not really sure offhand how to do that without field
processors. I'm also not sure about the performance impact.

The third approach is create extra xapian documents per file, which have
a different document type (from the notmuch point of view). Here the
complexity will be dealing with the returned documents from a xapian
query. We can probably use a wildcard search on the type (mail, mail1,
mail2, etc...) to make the queries reasonably easy. My gut feeling is
that this is the "right" approach, althought it will be a bit more
complicated to get started.  It will also require changing our idea of
threads in the "structured output" where a thread looks something like

(thread
   (message
  (instance/file)
  (instance/file))
   (message
  (instance/file))
  
  





___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC patch 2/2] lib: index message files with duplicate message-ids

2017-03-17 Thread Daniel Kahn Gillmor
On Thu 2017-03-16 20:34:22 -0400, David Bremner wrote:
> Daniel Kahn Gillmor  writes:
>>  0) what happens when one of the files gets deleted from the message
>> store? do the terms it contributes get removed from the index?
>
> That's a good guestion, and an issue I hadn't thought about.
> Currently there's no way to do this short of deleting all the terms (for
> all the files (excepting tags and properties, presumably) and
> reindexing. This will require some more thought, I think.

i didn't mean to raise the concern to drag this work down, i just want
to make sure the problem is on the table.  dropping all terms on
deletion and re-indexing remaining files with the same message ID isn't
terribly efficient, but i don't think it's going to be terribly costly
either.  we're not talking about hundreds of files per message-id in
most normal cases; usually only two (sent-to-self,
recvd-from-mailing-list), and maybe a half-dozen at most (messages sent
to multiple mailboxes that all forward to me).

of course, if multiple files are deleted concurrently, and notmuch
notices that one of them is missing, then re-indexing the other will
depend on whether it was also deleted in that same batch.

>>  1) when a message is displayed to the user as a result of a match, it
>> gets pulled from one of the files, not both.  if it's pulled from
>> the file that didn't have the term the user searched for, that's
>> likely to be confusing.  do you have a way to avoid that confusion?
>
> I was looking for an incremental improvement, so I imagined something
> like various output flagging "yes, there are duplicate files for this
> message", and letting users dig those out using something like the
> --duplicate= option.

This kind of output flagging would be worthwhile in its own right, and
maybe is an even less controversial place to start for the incremental
improvement.

>> It also occurs to me that one of the things i'd love to have is
>> well-indexed notes about any given e-mail.  So if this was adopted, i
>> could presumably just write a file that has the same Message-Id as the
>> message, put my notes in it, and index it.  that's a little weird,
>> though.  would there be a better way to do such a thing?
>
> One option would be to use a note=foo mesage property. That's not
> immediately searchable though, although we could kludge together
> something like the subject regexp search which would be slower.

right, i think i'd want the notes to be searchable, if possible.

Now i'm thinking about attack scenarios for this multi-indexed scheme,
though.  If i know that you've already gotten an e-mail with message-id
X, then i can go ahead and remotely, silently add search terms to that
message by sending you new messages that have the same message-id.  That
seems troubling :/ The status quo at least requires the attacker to win
a race to get their message indexed first, obscuring the real message.
in the proposed new scenario, the attacker doesn't need to win any race.
they can't prevent the true message from being indexed, but they can
associate it with whatever toxicity (e.g. "viagra", or "From:
killfiled-user") they want which might be useful in suppressing the
message in a post-processing run.

ugh, mail,

  --dkg


signature.asc
Description: PGP signature
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC patch 2/2] lib: index message files with duplicate message-ids

2017-03-16 Thread Mark Walters

Hi

Just a comment on your last point:

> It also occurs to me that one of the things i'd love to have is
> well-indexed notes about any given e-mail.  So if this was adopted, i
> could presumably just write a file that has the same Message-Id as the
> message, put my notes in it, and index it.  that's a little weird,
> though.  would there be a better way to do such a thing?

A different way which might get pretty close to what you would be to
start a reply and then postpone it.

Ideally we would wrap this in a "note" function would delete the
to/cc/bcc headers to make sure it doesn't accidentally get sent and add a
+note tag when saving.

Best wishes

Mark



___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC patch 2/2] lib: index message files with duplicate message-ids

2017-03-16 Thread David Bremner
Daniel Kahn Gillmor  writes:

> On Wed 2017-03-15 21:57:28 -0400, David Bremner wrote:
>> The corresponding xapian document just gets more terms added to it,
>> but this doesn't seem to break anything.
>
> this is an interesting suggestion.  thanks for proposing it!
>
> A couple questions:
>
>  0) what happens when one of the files gets deleted from the message
> store? do the terms it contributes get removed from the index?
>

That's a good guestion, and an issue I hadn't thought about.
Currently there's no way to do this short of deleting all the terms (for
all the files (excepting tags and properties, presumably) and
reindexing. This will require some more thought, I think.

>  1) when a message is displayed to the user as a result of a match, it
> gets pulled from one of the files, not both.  if it's pulled from
> the file that didn't have the term the user searched for, that's
> likely to be confusing.  do you have a way to avoid that confusion?

I was looking for an incremental improvement, so I imagined something
like various output flagging "yes, there are duplicate files for this
message", and letting users dig those out using something like the
--duplicate= option.

> It also occurs to me that one of the things i'd love to have is
> well-indexed notes about any given e-mail.  So if this was adopted, i
> could presumably just write a file that has the same Message-Id as the
> message, put my notes in it, and index it.  that's a little weird,
> though.  would there be a better way to do such a thing?
>
>  --dkg

One option would be to use a note=foo mesage property. That's not
immediately searchable though, although we could kludge together
something like the subject regexp search which would be slower.

d
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC patch 2/2] lib: index message files with duplicate message-ids

2017-03-16 Thread Daniel Kahn Gillmor
On Wed 2017-03-15 21:57:28 -0400, David Bremner wrote:
> The corresponding xapian document just gets more terms added to it,
> but this doesn't seem to break anything.

this is an interesting suggestion.  thanks for proposing it!

A couple questions:

 0) what happens when one of the files gets deleted from the message
store? do the terms it contributes get removed from the index?

 1) when a message is displayed to the user as a result of a match, it
gets pulled from one of the files, not both.  if it's pulled from
the file that didn't have the term the user searched for, that's
likely to be confusing.  do you have a way to avoid that confusion?

It also occurs to me that one of the things i'd love to have is
well-indexed notes about any given e-mail.  So if this was adopted, i
could presumably just write a file that has the same Message-Id as the
message, put my notes in it, and index it.  that's a little weird,
though.  would there be a better way to do such a thing?

 --dkg
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


[RFC patch 2/2] lib: index message files with duplicate message-ids

2017-03-15 Thread David Bremner
The corresponding xapian document just gets more terms added to it,
but this doesn't seem to break anything.
---
 lib/database.cc| 3 +++
 test/T670-duplicate-mid.sh | 1 -
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/database.cc b/lib/database.cc
index a679cbab..e83017ed 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -2582,6 +2582,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (ret)
goto DONE;
} else {
+   ret = _notmuch_message_index_file (message, message_file);
+   if (ret)
+   goto DONE;
ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
}
 
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index d28afc91..41c53bc8 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -6,7 +6,6 @@ add_message [id]=id:duplicate '[subject]="message 1"'
 add_message [id]=id:duplicate '[subject]="message 2"'
 
 test_begin_subtest 'Search for second subject'
-test_subtest_known_broken
 cat