Re: [RFC PATCH] Re: excessive thread fusing
Mark Walters writes: >>> I haven't tracked through all the logic of the existing algorithm for >>> this case. But I don't like hearing that notmuch constructs different >>> threads for the same messages presented in different orders. This sounds >>> like a bug separate from what we've discussed above. > > I think I have now found this bug and it is separate from the malformed > In-Reply-To problems. > > The problem is that when we merge threads we update all the thread-ids > of documents in the loser thread. But we don't (if I understand the code > correctly) update dangling "metadata" references to threads which don't > (yet) have any documents. This bug seems fixed by the introduction of ghost messages discussed later in the thread, so marking it fixed. The question of malformed In-Reply-To headers (and whether there is a bug handling those) is discussed in the thread id:20211226161716.2079457-1-da...@tethera.net. ___ notmuch mailing list -- notmuch@notmuchmail.org To unsubscribe send an email to notmuch-le...@notmuchmail.org
Re: [RFC PATCH] Re: excessive thread fusing
Austin Clements writes: > But let me propose an idea I've been kicking around for a while: ghost > message documents. Rather than using user metadata for tracking these > missing messages, use regular documents with the exact same terms we > use now for message IDs and thread IDs, but with a Tghost term instead > of a Tmail term to distinguish their type. For what it's worth. I like this proposed fix very much. It's obviously what I should have done in the first place. > In effect, the database becomes truly monotonic. Yes. All of these benefits are very compelling. > [1] Curious? > yes n | xapian-inspect postlist.DB | \ > awk '!/Key/ {next} /Key: \\x00\\xc0thread_id_/ {N++} /Key: \\x00\\xd0/ > {exit} END {print N}' 16725 for me. -Carl pgpuS8ZLLxizB.pgp Description: PGP signature ___ notmuch mailing list notmuch@notmuchmail.org http://notmuchmail.org/mailman/listinfo/notmuch
Re: [RFC PATCH] Re: excessive thread fusing
Quoth Mark Walters on Apr 21 at 8:20 am: > > >> I haven't tracked through all the logic of the existing algorithm for > >> this case. But I don't like hearing that notmuch constructs different > >> threads for the same messages presented in different orders. This sounds > >> like a bug separate from what we've discussed above. > > I think I have now found this bug and it is separate from the malformed > In-Reply-To problems. > > The problem is that when we merge threads we update all the thread-ids > of documents in the loser thread. But we don't (if I understand the code > correctly) update dangling "metadata" references to threads which don't > (yet) have any documents. This exactly the problem I wrote id:1395608456-9673-1-git-send-email-amdra...@mit.edu to test, but I had convinced myself everything was okay because we link a message to both its parents and all of its children. But that's only true if you eventually receive the linking message (which in the test I made, you do). In this case, you never receive the linking message, so even though notmuch has enough information to bring the two threads together, it doesn't. Maybe I should create a second variant of that test where all of the messages reference their entire heritage (rather than just their immediate parent) and test that they're *always* in one thread regardless of receipt order (rather than only checking once they've all been received)? I think that would weed out this case. > To make this explicit consider the 2 messages 17,18 in the set. > > Message 17 has id <87wrkidfrh@pinto.chemeng.ucl.ac.uk> and has no > references/in-reply-to so has no parents. > > Message 18 has a reference to <87wrkidfrh@pinto.chemeng.ucl.ac.uk> > and an in-reply-to to and > <87wrkidfrh@pinto.chemeng.ucl.ac.uk> > > If you add 17 first then it gets thread-id 1 and then when you add 18 message > 18 gets > thread-id 2 as does the dangling link to the "unseen" message > e.fr...@ucl.ac.uk, and then message 17 is moved to thread-id 2. > > However, if you add 18 first then it gets thread-id 1 and the dangling > link gets id 1. When 17 is added it gets thread-id 2, message 18 gets > thread-id updated to 2 *but* the dangling link to e.fr...@ucl.ac.uk does > not get updated so stays thread-id 1. > > In particular when 52 comes along with its link to e.fr...@ucl.ac.uk > then: > > In the first case it gets put in thread-id 3 and the other two > messages get moved into thread 3. > > In the second case, message 52 gets put in thread 3 and thread 1 > (the dangling link) gets moved into thread 3 leaving thread 2 > (containing messages 17 and 18) untouched. So, there's an obvious, messy way to fix this: update the metadata references when we do thread renumbering. This is messy because that data *isn't indexed*. The only way to find the records we need to update is to scan them all. This isn't completely terrible because it's a sequential scan and we could cache it in memory, but it certainly isn't going to help notmuch new's performance. (My database has 6,749 of these, which takes ~1 second to scan on a cold cache, though that's with an SSD [1]). But let me propose an idea I've been kicking around for a while: ghost message documents. Rather than using user metadata for tracking these missing messages, use regular documents with the exact same terms we use now for message IDs and thread IDs, but with a Tghost term instead of a Tmail term to distinguish their type. This solves the problem using infrastructure we already have in place, simplifies the message linking code, and may even make it faster. It's a schema update, but a simple and fast one. I think the hardest part is that things like notmuch_database_find_message would need to distinguish ghosts and regular messages (which may require pre-fetching the Tghost or Tmail posting list to do efficiently). This also sets us up to do some cool things in the future, though they're more invasive. If we have message-like documents for these ghosts, we can store other message-like metadata as well. If we store tags on them, then we can keep tags around for deleted messages and *reapply them* if the message comes back. This would finally fix the races we have now where, if a message is renamed or moved during a notmuch new, we may think it's deleted only to reindex it with default tags on the next run. We could also pre-tag messages that haven't been indexed yet, say from procmail or when sending a message. This could simplify or even obviate notmuch insert. If we add message ctimes as proposed by Dave Mazières, this would give us a place to store and query ctimes of deleted messages (otherwise it's unclear how you find out about deletions without a full database scan). In effect, the database becomes truly monotonic. [1] Curious? yes n | xapian-inspect postlist.DB | \ awk '!/Key/ {next} /Key: \\x00\\xc0thread_id_/ {N++} /Key: \\x00\\
Re: [RFC PATCH] Re: excessive thread fusing
>> I haven't tracked through all the logic of the existing algorithm for >> this case. But I don't like hearing that notmuch constructs different >> threads for the same messages presented in different orders. This sounds >> like a bug separate from what we've discussed above. I think I have now found this bug and it is separate from the malformed In-Reply-To problems. The problem is that when we merge threads we update all the thread-ids of documents in the loser thread. But we don't (if I understand the code correctly) update dangling "metadata" references to threads which don't (yet) have any documents. To make this explicit consider the 2 messages 17,18 in the set. Message 17 has id <87wrkidfrh@pinto.chemeng.ucl.ac.uk> and has no references/in-reply-to so has no parents. Message 18 has a reference to <87wrkidfrh@pinto.chemeng.ucl.ac.uk> and an in-reply-to to and <87wrkidfrh@pinto.chemeng.ucl.ac.uk> If you add 17 first then it gets thread-id 1 and then when you add 18 message 18 gets thread-id 2 as does the dangling link to the "unseen" message e.fr...@ucl.ac.uk, and then message 17 is moved to thread-id 2. However, if you add 18 first then it gets thread-id 1 and the dangling link gets id 1. When 17 is added it gets thread-id 2, message 18 gets thread-id updated to 2 *but* the dangling link to e.fr...@ucl.ac.uk does not get updated so stays thread-id 1. In particular when 52 comes along with its link to e.fr...@ucl.ac.uk then: In the first case it gets put in thread-id 3 and the other two messages get moved into thread 3. In the second case, message 52 gets put in thread 3 and thread 1 (the dangling link) gets moved into thread 3 leaving thread 2 (containing messages 17 and 18) untouched. Best wishes Mark ___ notmuch mailing list notmuch@notmuchmail.org http://notmuchmail.org/mailman/listinfo/notmuch
Re: [RFC PATCH] Re: excessive thread fusing
Carl Worth writes: > > Another idea would be to trigger specifically on common forms. Judging > From the samples in this particular thread, it seems like a workable > heuristic would be: > > If the In-Reply-To header begins with '<': > > Parse that initial portion as a message ID > > Else if it ends with '>': > > Parse that final portion as a message ID > > Else > > Ignore this garbage-valued header. > using the hacky script below, I scanned my own mail collection of about 300k messages. I can make the following observations - I have some RFC compliant in-reply-to's with multiple ids - I have have a non-trivial number of Message from $NAME of $date - I didn't see any cases where using the last angle bracketed thing would fail. - I did see some some cases where the header starts with '<' but the matching '>' was missing - I also noticed some rfc2047 encoding of in-reply-to headers. ## # hacky script follows dir=$1 echo Scanning $dir tempdir=$(mktemp -d) echo Writing to ${tempdir} find $dir -exec sh -c "formail -c -xIn-reply-to < {}" \; \ > ${tempdir}/ids sed -e 's/\t/ /' -e 's/ */ /g' -e 's/<[^ ]*>//g' -e 's/(.*)/(comment)/' < ${tempdir}/ids | sort | uniq | tee ${tempdir}/report ___ notmuch mailing list notmuch@notmuchmail.org http://notmuchmail.org/mailman/listinfo/notmuch
Re: [RFC PATCH] Re: excessive thread fusing
On Sun, 20 Apr 2014, Carl Worth wrote: > Mark Walters writes: >> I have done dome debugging of this. > > Thanks for looking closely, Mark! > >> There is a patch below which fixes this test case but who knows what >> it breaks! Please DO NOT apply unless someone who knows this code says >> it's OK. > > I wrote much of the original code being patched here, so hopefully I > understand it and can say something useful. > > I agree that the patch should not be applied. I don't like to see one > piece of code not trusting another in the same code base. If the > parse_references() function doesn't deal well with a malformed header, > then we should fix it, not step around it. > > Meanwhile, not treating all potential referenced message IDs > consistently could definitely make the notmuch algorithm more fragile > and sensitive to the order of message indexing, etc. So let's not do > that. I agree. This bug first came up in id:874nvcekjk@qmul.ac.uk; I think that got mostly fixed by cf8aaafbad68 (id:1361836225-17279-1-git-send-email-aarone...@gmail.com and related thread) so we may want to check whether that change is still wanted if we fix the actual bug. > Instead, let's track down and fix the actual bug. > > Thanks for the idea of using two-digit names for these messages. That > makes it much easier to inspect the relevant headers. > > Below, I've grepped out the actual References and In-Reply-To headers > From the messages, and then simply substituted minimal, and > easy-to-understand values for the message IDs. > > With these minimally modified headers, it's easy to manually inspect the > relationships and see that messages 17 and 18 belong in one thread, and > messages 32-52 belong in a separate thread. > > It's also quite easy to see the potential source of the bug. The > In-Reply-To headers for messages 18, 32, and 52 all share a common > string (an email address) formatted to look like a message-id, > "". If notmuch looks at those headers, and treats > that string as a message-id, then all of theses messages will be > connected into a single thread. > > And since that's the reported behavior, it seems likely that > "" is the cause of this bug. > >> I put some debug stuff in _notmuch_database_link_message_to_parents and >> I think that the problem comes from the call to parse_references on line >> 1767 which adds the malformed in-reply-to header to the hash table, so >> this malformed line gets added as a potential parent. > > Am I correct that your debugging showed that "" is > being added to the hash table? Yes that is correct. > My inspection of _parse_references() and parse_message_id() suggests > that that's exactly what notmuch is doing, (treating both of the > angle-bracketed portions ("" as well as the actual > message-ID, "" or "" or "") as message IDs. > > So it seems like we need a new _parse_in_reply_to() function to use in > place of _parse_references() and the new function will need a better > heuristic for dealing with the unpredictability of In-Reply-To. > > The only real reason that we are trying to grab multiple message ID > values from an In-Reply-To header is that RFC 2822 explicitly allows > that, (to support a message simultaneously replying to multiple > messages). I don't believe that that's common, but we might as well > support it. At the same time, RFC 2822 also explicitly specifies that > the In-Reply-To header will consist of nothing but message IDs. > > So perhaps the heuristic here could be to notice any characters outside > of angle brackets, (like "Message" in the headers below), and in that > case go to a strictly "not RFC 2822" mode and look for exactly one > message ID. At that point, JWZ would recommend "the first <>-bracketed > text found therein", but that would give precisely the wrong answer in > this particular case. Here the correct Message ID appears in the last > <>-bracketed text. I have not surveyed a large email corpus to determine > how often "last <>-bracketed text" would fail as a heuristic. > > Another idea would be to trigger specifically on common forms. Judging > From the samples in this particular thread, it seems like a workable > heuristic would be: > > If the In-Reply-To header begins with '<': > > Parse that initial portion as a message ID > > Else if it ends with '>': > > Parse that final portion as a message ID > > Else > > Ignore this garbage-valued header. > > That's probably the best and most reliably thing to do here. > > Does anyone have any better ideas? Is there a case coming before all the above: if the In-Reply-To header is correctly formed then parse as we do currently? (You sort of suggest so above but I just wanted to check) >> As a clear example that I don't understand this code I don't know why >> this no longer causes a problem if message 17 gets added too. > > I wanted to test my own knowledge of the code to see if I could explain > this. But I didn't precise
Re: [RFC PATCH] Re: excessive thread fusing
Mark Walters writes: > I have done dome debugging of this. Thanks for looking closely, Mark! > There is a patch below which fixes this test case but who knows what > it breaks! Please DO NOT apply unless someone who knows this code says > it's OK. I wrote much of the original code being patched here, so hopefully I understand it and can say something useful. I agree that the patch should not be applied. I don't like to see one piece of code not trusting another in the same code base. If the parse_references() function doesn't deal well with a malformed header, then we should fix it, not step around it. Meanwhile, not treating all potential referenced message IDs consistently could definitely make the notmuch algorithm more fragile and sensitive to the order of message indexing, etc. So let's not do that. Instead, let's track down and fix the actual bug. Thanks for the idea of using two-digit names for these messages. That makes it much easier to inspect the relevant headers. Below, I've grepped out the actual References and In-Reply-To headers From the messages, and then simply substituted minimal, and easy-to-understand values for the message IDs. With these minimally modified headers, it's easy to manually inspect the relationships and see that messages 17 and 18 belong in one thread, and messages 32-52 belong in a separate thread. It's also quite easy to see the potential source of the bug. The In-Reply-To headers for messages 18, 32, and 52 all share a common string (an email address) formatted to look like a message-id, "". If notmuch looks at those headers, and treats that string as a message-id, then all of theses messages will be connected into a single thread. And since that's the reported behavior, it seems likely that "" is the cause of this bug. > I put some debug stuff in _notmuch_database_link_message_to_parents and > I think that the problem comes from the call to parse_references on line > 1767 which adds the malformed in-reply-to header to the hash table, so > this malformed line gets added as a potential parent. Am I correct that your debugging showed that "" is being added to the hash table? My inspection of _parse_references() and parse_message_id() suggests that that's exactly what notmuch is doing, (treating both of the angle-bracketed portions ("" as well as the actual message-ID, "" or "" or "") as message IDs. So it seems like we need a new _parse_in_reply_to() function to use in place of _parse_references() and the new function will need a better heuristic for dealing with the unpredictability of In-Reply-To. The only real reason that we are trying to grab multiple message ID values from an In-Reply-To header is that RFC 2822 explicitly allows that, (to support a message simultaneously replying to multiple messages). I don't believe that that's common, but we might as well support it. At the same time, RFC 2822 also explicitly specifies that the In-Reply-To header will consist of nothing but message IDs. So perhaps the heuristic here could be to notice any characters outside of angle brackets, (like "Message" in the headers below), and in that case go to a strictly "not RFC 2822" mode and look for exactly one message ID. At that point, JWZ would recommend "the first <>-bracketed text found therein", but that would give precisely the wrong answer in this particular case. Here the correct Message ID appears in the last <>-bracketed text. I have not surveyed a large email corpus to determine how often "last <>-bracketed text" would fail as a heuristic. Another idea would be to trigger specifically on common forms. Judging From the samples in this particular thread, it seems like a workable heuristic would be: If the In-Reply-To header begins with '<': Parse that initial portion as a message ID Else if it ends with '>': Parse that final portion as a message ID Else Ignore this garbage-valued header. That's probably the best and most reliably thing to do here. Does anyone have any better ideas? > As a clear example that I don't understand this code I don't know why > this no longer causes a problem if message 17 gets added too. I wanted to test my own knowledge of the code to see if I could explain this. But I didn't precisely follow your explanation of the behavior you saw. In cases (1) and (2) of your description, what order are you using to "add all messages" or "add all apart from 52"? Then, for cases (3) and (4), what is done before adding the messages mentioned in these cases? Add all other messages? Again, in what order? I haven't tracked through all the logic of the existing algorithm for this case. But I don't like hearing that notmuch constructs different threads for the same messages presented in different orders. This sounds like a bug separate from what we've discussed above. -Carl 18:References: 32:References: 33:References: 34:References: 35:References: 36