Re: [RFC PATCH] Re: excessive thread fusing

2021-12-31 Thread David Bremner
Mark Walters  writes:

>>> I haven't tracked through all the logic of the existing algorithm for
>>> this case. But I don't like hearing that notmuch constructs different
>>> threads for the same messages presented in different orders. This sounds
>>> like a bug separate from what we've discussed above. 
>
> I think I have now found this bug and it is separate from the malformed
> In-Reply-To problems.
>
> The problem is that when we merge threads we update all the thread-ids
> of documents in the loser thread. But we don't (if I understand the code
> correctly) update dangling "metadata" references to threads which don't
> (yet) have any documents.

This bug seems fixed by the introduction of ghost messages discussed
later in the thread, so marking it fixed. The question of malformed
In-Reply-To headers (and whether there is a bug handling those) is
discussed in the thread id:20211226161716.2079457-1-da...@tethera.net.
___
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-le...@notmuchmail.org


Re: [RFC PATCH] Re: excessive thread fusing

2014-04-21 Thread Mark Walters

 I haven't tracked through all the logic of the existing algorithm for
 this case. But I don't like hearing that notmuch constructs different
 threads for the same messages presented in different orders. This sounds
 like a bug separate from what we've discussed above. 

I think I have now found this bug and it is separate from the malformed
In-Reply-To problems.

The problem is that when we merge threads we update all the thread-ids
of documents in the loser thread. But we don't (if I understand the code
correctly) update dangling metadata references to threads which don't
(yet) have any documents.

To make this explicit consider the 2 messages 17,18 in the set. 

Message 17 has id 87wrkidfrh@pinto.chemeng.ucl.ac.uk and has no
references/in-reply-to so has no parents.

Message 18 has a reference to 87wrkidfrh@pinto.chemeng.ucl.ac.uk
and an in-reply-to to e.fr...@ucl.ac.uk and
87wrkidfrh@pinto.chemeng.ucl.ac.uk

If you add 17 first then it gets thread-id 1 and then when you add 18 message 
18 gets
thread-id 2 as does the dangling link to the unseen message
e.fr...@ucl.ac.uk, and then message 17 is moved to thread-id 2.

However, if you add 18 first then it gets thread-id 1 and the dangling
link gets id 1. When 17 is added it gets thread-id 2, message 18 gets
thread-id updated to 2 *but* the dangling link to e.fr...@ucl.ac.uk does
not get updated so stays thread-id 1.

In particular when 52 comes along with its link to e.fr...@ucl.ac.uk
then:

In the first case it gets put in thread-id 3 and the other two
messages get moved into thread 3.

In the second case, message 52 gets put in thread 3 and thread 1
(the dangling link) gets moved into thread 3 leaving thread 2
(containing messages 17 and 18) untouched.

Best wishes

Mark





___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC PATCH] Re: excessive thread fusing

2014-04-21 Thread Austin Clements
Quoth Mark Walters on Apr 21 at  8:20 am:
 
  I haven't tracked through all the logic of the existing algorithm for
  this case. But I don't like hearing that notmuch constructs different
  threads for the same messages presented in different orders. This sounds
  like a bug separate from what we've discussed above. 
 
 I think I have now found this bug and it is separate from the malformed
 In-Reply-To problems.
 
 The problem is that when we merge threads we update all the thread-ids
 of documents in the loser thread. But we don't (if I understand the code
 correctly) update dangling metadata references to threads which don't
 (yet) have any documents.

This exactly the problem I wrote
id:1395608456-9673-1-git-send-email-amdra...@mit.edu to test, but I
had convinced myself everything was okay because we link a message to
both its parents and all of its children.  But that's only true if you
eventually receive the linking message (which in the test I made, you
do).  In this case, you never receive the linking message, so even
though notmuch has enough information to bring the two threads
together, it doesn't.

Maybe I should create a second variant of that test where all of the
messages reference their entire heritage (rather than just their
immediate parent) and test that they're *always* in one thread
regardless of receipt order (rather than only checking once they've
all been received)?  I think that would weed out this case.

 To make this explicit consider the 2 messages 17,18 in the set. 
 
 Message 17 has id 87wrkidfrh@pinto.chemeng.ucl.ac.uk and has no
 references/in-reply-to so has no parents.
 
 Message 18 has a reference to 87wrkidfrh@pinto.chemeng.ucl.ac.uk
 and an in-reply-to to e.fr...@ucl.ac.uk and
 87wrkidfrh@pinto.chemeng.ucl.ac.uk
 
 If you add 17 first then it gets thread-id 1 and then when you add 18 message 
 18 gets
 thread-id 2 as does the dangling link to the unseen message
 e.fr...@ucl.ac.uk, and then message 17 is moved to thread-id 2.
 
 However, if you add 18 first then it gets thread-id 1 and the dangling
 link gets id 1. When 17 is added it gets thread-id 2, message 18 gets
 thread-id updated to 2 *but* the dangling link to e.fr...@ucl.ac.uk does
 not get updated so stays thread-id 1.
 
 In particular when 52 comes along with its link to e.fr...@ucl.ac.uk
 then:
 
 In the first case it gets put in thread-id 3 and the other two
 messages get moved into thread 3.
 
 In the second case, message 52 gets put in thread 3 and thread 1
 (the dangling link) gets moved into thread 3 leaving thread 2
 (containing messages 17 and 18) untouched.

So, there's an obvious, messy way to fix this: update the metadata
references when we do thread renumbering.  This is messy because that
data *isn't indexed*.  The only way to find the records we need to
update is to scan them all.  This isn't completely terrible because
it's a sequential scan and we could cache it in memory, but it
certainly isn't going to help notmuch new's performance.  (My database
has 6,749 of these, which takes ~1 second to scan on a cold cache,
though that's with an SSD [1]).


But let me propose an idea I've been kicking around for a while: ghost
message documents.  Rather than using user metadata for tracking these
missing messages, use regular documents with the exact same terms we
use now for message IDs and thread IDs, but with a Tghost term instead
of a Tmail term to distinguish their type.  This solves the problem
using infrastructure we already have in place, simplifies the message
linking code, and may even make it faster.  It's a schema update, but
a simple and fast one.  I think the hardest part is that things like
notmuch_database_find_message would need to distinguish ghosts and
regular messages (which may require pre-fetching the Tghost or Tmail
posting list to do efficiently).

This also sets us up to do some cool things in the future, though
they're more invasive.  If we have message-like documents for these
ghosts, we can store other message-like metadata as well.  If we store
tags on them, then we can keep tags around for deleted messages and
*reapply them* if the message comes back.  This would finally fix the
races we have now where, if a message is renamed or moved during a
notmuch new, we may think it's deleted only to reindex it with default
tags on the next run.  We could also pre-tag messages that haven't
been indexed yet, say from procmail or when sending a message.  This
could simplify or even obviate notmuch insert.  If we add message
ctimes as proposed by Dave Mazières, this would give us a place to
store and query ctimes of deleted messages (otherwise it's unclear how
you find out about deletions without a full database scan).  In
effect, the database becomes truly monotonic.

[1] Curious?
yes n | xapian-inspect postlist.DB | \
awk '!/Key/ {next} /Key: \\x00\\xc0thread_id_/ {N++} /Key: \\x00\\xd0/ 
{exit} END {print N}'

Re: [RFC PATCH] Re: excessive thread fusing

2014-04-21 Thread Carl Worth
Austin Clements amdra...@mit.edu writes:
 But let me propose an idea I've been kicking around for a while: ghost
 message documents.  Rather than using user metadata for tracking these
 missing messages, use regular documents with the exact same terms we
 use now for message IDs and thread IDs, but with a Tghost term instead
 of a Tmail term to distinguish their type.

For what it's worth. I like this proposed fix very much. It's obviously
what I should have done in the first place.

 In effect, the database becomes truly monotonic.

Yes. All of these benefits are very compelling.

 [1] Curious?
 yes n | xapian-inspect postlist.DB | \
 awk '!/Key/ {next} /Key: \\x00\\xc0thread_id_/ {N++} /Key: \\x00\\xd0/ 
 {exit} END {print N}'

16725 for me.

-Carl



pgpuS8ZLLxizB.pgp
Description: PGP signature
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC PATCH] Re: excessive thread fusing

2014-04-20 Thread Carl Worth
Mark Walters markwalters1...@gmail.com writes:
 I have done dome debugging of this.

Thanks for looking closely, Mark!

 There is a patch below which fixes this test case but who knows what
 it breaks! Please DO NOT apply unless someone who knows this code says
 it's OK.

I wrote much of the original code being patched here, so hopefully I
understand it and can say something useful.

I agree that the patch should not be applied. I don't like to see one
piece of code not trusting another in the same code base. If the
parse_references() function doesn't deal well with a malformed header,
then we should fix it, not step around it.

Meanwhile, not treating all potential referenced message IDs
consistently could definitely make the notmuch algorithm more fragile
and sensitive to the order of message indexing, etc. So let's not do
that.

Instead, let's track down and fix the actual bug.

Thanks for the idea of using two-digit names for these messages. That
makes it much easier to inspect the relevant headers.

Below, I've grepped out the actual References and In-Reply-To headers
From the messages, and then simply substituted minimal, and
easy-to-understand values for the message IDs.

With these minimally modified headers, it's easy to manually inspect the
relationships and see that messages 17 and 18 belong in one thread, and
messages 32-52 belong in a separate thread.

It's also quite easy to see the potential source of the bug. The
In-Reply-To headers for messages 18, 32, and 52 all share a common
string (an email address) formatted to look like a message-id,
e.fr...@ucl.ac.uk. If notmuch looks at those headers, and treats
that string as a message-id, then all of theses messages will be
connected into a single thread.

And since that's the reported behavior, it seems likely that
e.fr...@ucl.ac.uk is the cause of this bug.

 I put some debug stuff in _notmuch_database_link_message_to_parents and
 I think that the problem comes from the call to parse_references on line
 1767 which adds the malformed in-reply-to header to the hash table, so
 this malformed line gets added as a potential parent. 

Am I correct that your debugging showed that e.fr...@ucl.ac.uk is
being added to the hash table?

My inspection of _parse_references() and parse_message_id() suggests
that that's exactly what notmuch is doing, (treating both of the
angle-bracketed portions (e.fr...@ucl.ac.uk as well as the actual
message-ID, ID17 or ID31 or ID39) as message IDs.

So it seems like we need a new _parse_in_reply_to() function to use in
place of _parse_references() and the new function will need a better
heuristic for dealing with the unpredictability of In-Reply-To.

The only real reason that we are trying to grab multiple message ID
values from an In-Reply-To header is that RFC 2822 explicitly allows
that, (to support a message simultaneously replying to multiple
messages). I don't believe that that's common, but we might as well
support it. At the same time, RFC 2822 also explicitly specifies that
the In-Reply-To header will consist of nothing but message IDs.

So perhaps the heuristic here could be to notice any characters outside
of angle brackets, (like Message in the headers below), and in that
case go to a strictly not RFC 2822 mode and look for exactly one
message ID. At that point, JWZ would recommend the first -bracketed
text found therein, but that would give precisely the wrong answer in
this particular case. Here the correct Message ID appears in the last
-bracketed text. I have not surveyed a large email corpus to determine
how often last -bracketed text would fail as a heuristic.

Another idea would be to trigger specifically on common forms. Judging
From the samples in this particular thread, it seems like a workable
heuristic would be:

If the In-Reply-To header begins with '':

Parse that initial portion as a message ID

Else if it ends with '':

Parse that final portion as a message ID

Else

Ignore this garbage-valued header.

That's probably the best and most reliably thing to do here.

Does anyone have any better ideas?

 As a clear example that I don't understand this code I don't know why
 this no longer causes a problem if message 17 gets added too.

I wanted to test my own knowledge of the code to see if I could explain
this. But I didn't precisely follow your explanation of the behavior you
saw. In cases (1) and (2) of your description, what order are you using
to add all messages or add all apart from 52?

Then, for cases (3) and (4), what is done before adding the messages
mentioned in these cases? Add all other messages? Again, in what order?

I haven't tracked through all the logic of the existing algorithm for
this case. But I don't like hearing that notmuch constructs different
threads for the same messages presented in different orders. This sounds
like a bug separate from what we've discussed above. 

-Carl

18:References: ID17

Re: [RFC PATCH] Re: excessive thread fusing

2014-04-20 Thread Mark Walters

On Sun, 20 Apr 2014, Carl Worth cwo...@cworth.org wrote:
 Mark Walters markwalters1...@gmail.com writes:
 I have done dome debugging of this.

 Thanks for looking closely, Mark!

 There is a patch below which fixes this test case but who knows what
 it breaks! Please DO NOT apply unless someone who knows this code says
 it's OK.

 I wrote much of the original code being patched here, so hopefully I
 understand it and can say something useful.

 I agree that the patch should not be applied. I don't like to see one
 piece of code not trusting another in the same code base. If the
 parse_references() function doesn't deal well with a malformed header,
 then we should fix it, not step around it.


 Meanwhile, not treating all potential referenced message IDs
 consistently could definitely make the notmuch algorithm more fragile
 and sensitive to the order of message indexing, etc. So let's not do
 that.

I agree. This bug first came up in id:874nvcekjk@qmul.ac.uk; I think
that got mostly fixed by cf8aaafbad68
(id:1361836225-17279-1-git-send-email-aarone...@gmail.com and related
thread) so we may want to check whether that change is still wanted if
we fix the actual bug.

 Instead, let's track down and fix the actual bug.

 Thanks for the idea of using two-digit names for these messages. That
 makes it much easier to inspect the relevant headers.

 Below, I've grepped out the actual References and In-Reply-To headers
 From the messages, and then simply substituted minimal, and
 easy-to-understand values for the message IDs.

 With these minimally modified headers, it's easy to manually inspect the
 relationships and see that messages 17 and 18 belong in one thread, and
 messages 32-52 belong in a separate thread.

 It's also quite easy to see the potential source of the bug. The
 In-Reply-To headers for messages 18, 32, and 52 all share a common
 string (an email address) formatted to look like a message-id,
 e.fr...@ucl.ac.uk. If notmuch looks at those headers, and treats
 that string as a message-id, then all of theses messages will be
 connected into a single thread.

 And since that's the reported behavior, it seems likely that
 e.fr...@ucl.ac.uk is the cause of this bug.

 I put some debug stuff in _notmuch_database_link_message_to_parents and
 I think that the problem comes from the call to parse_references on line
 1767 which adds the malformed in-reply-to header to the hash table, so
 this malformed line gets added as a potential parent. 

 Am I correct that your debugging showed that e.fr...@ucl.ac.uk is
 being added to the hash table?

Yes that is correct.

 My inspection of _parse_references() and parse_message_id() suggests
 that that's exactly what notmuch is doing, (treating both of the
 angle-bracketed portions (e.fr...@ucl.ac.uk as well as the actual
 message-ID, ID17 or ID31 or ID39) as message IDs.

 So it seems like we need a new _parse_in_reply_to() function to use in
 place of _parse_references() and the new function will need a better
 heuristic for dealing with the unpredictability of In-Reply-To.

 The only real reason that we are trying to grab multiple message ID
 values from an In-Reply-To header is that RFC 2822 explicitly allows
 that, (to support a message simultaneously replying to multiple
 messages). I don't believe that that's common, but we might as well
 support it. At the same time, RFC 2822 also explicitly specifies that
 the In-Reply-To header will consist of nothing but message IDs.

 So perhaps the heuristic here could be to notice any characters outside
 of angle brackets, (like Message in the headers below), and in that
 case go to a strictly not RFC 2822 mode and look for exactly one
 message ID. At that point, JWZ would recommend the first -bracketed
 text found therein, but that would give precisely the wrong answer in
 this particular case. Here the correct Message ID appears in the last
 -bracketed text. I have not surveyed a large email corpus to determine
 how often last -bracketed text would fail as a heuristic.

 Another idea would be to trigger specifically on common forms. Judging
 From the samples in this particular thread, it seems like a workable
 heuristic would be:

   If the In-Reply-To header begins with '':

   Parse that initial portion as a message ID

   Else if it ends with '':

   Parse that final portion as a message ID

   Else

   Ignore this garbage-valued header.

 That's probably the best and most reliably thing to do here.

 Does anyone have any better ideas?

Is there a case coming before all the above: if the In-Reply-To header
is correctly formed then parse as we do currently? (You sort of suggest
so above but I just wanted to check)

 As a clear example that I don't understand this code I don't know why
 this no longer causes a problem if message 17 gets added too.

 I wanted to test my own knowledge of the code to see if I could explain
 this. But I didn't precisely follow your 

Re: [RFC PATCH] Re: excessive thread fusing

2014-04-20 Thread David Bremner
Carl Worth cwo...@cworth.org writes:

 Another idea would be to trigger specifically on common forms. Judging
 From the samples in this particular thread, it seems like a workable
 heuristic would be:

   If the In-Reply-To header begins with '':

   Parse that initial portion as a message ID

   Else if it ends with '':

   Parse that final portion as a message ID

   Else

   Ignore this garbage-valued header.


using the hacky script below, I scanned my own mail collection of about
300k messages. I can make the following observations

- I have some RFC compliant in-reply-to's with multiple ids
- I have have a non-trivial number of Message from $NAME address of $date id
- I didn't see any cases where using the last angle bracketed thing
  would fail.
- I did see some some cases where the header starts with '' but the
  matching '' was missing
- I also noticed some rfc2047 encoding of in-reply-to headers.


##
# hacky script follows
dir=$1
echo Scanning $dir

tempdir=$(mktemp -d)
echo Writing to ${tempdir}

find $dir -exec sh -c formail -c -xIn-reply-to  {} \; \
   ${tempdir}/ids

sed  -e 's/\t/ /' -e 's/   */ /g' -e 's/[^ ]*/id/g' -e 's/(.*)/(comment)/' 
 ${tempdir}/ids | sort | uniq | tee ${tempdir}/report
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch