[RFC PATCH] Re: excessive thread fusing
Carl Worth writes:

> Another idea would be to trigger specifically on common forms. Judging
> from the samples in this particular thread, it seems like a workable
> heuristic would be:
>
>     If the In-Reply-To header begins with '<':
>         Parse that initial portion as a message ID
>     Else if it ends with '>':
>         Parse that final portion as a message ID
>     Else
>         Ignore this garbage-valued header.

Using the hacky script below, I scanned my own mail collection of about
300k messages. I can make the following observations:

- I have some RFC-compliant In-Reply-To headers with multiple IDs.
- I have a non-trivial number of
  "Message from $NAME <address> of $date <id>".
- I didn't see any cases where using the last angle-bracketed thing
  would fail.
- I did see some cases where the header starts with '<' but the
  matching '>' was missing.
- I also noticed some RFC 2047 encoding of In-Reply-To headers.

# hacky script follows
dir=$1
echo Scanning $dir
tempdir=$(mktemp -d)
echo Writing to ${tempdir}
find $dir -exec sh -c "formail -c -xIn-reply-to < {}" \; \
    > ${tempdir}/ids
sed -e 's/\t/ /' -e 's/  */ /g' -e 's/<[^ ]*>/<id>/g' -e 's/(.*)/(comment)/' < ${tempdir}/ids | sort | uniq | tee ${tempdir}/report
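For readers who want to experiment with the heuristic quoted above, here is a minimal C++ sketch. It is not notmuch code: `pick_in_reply_to_id`, `trim` and `plausible_id` are hypothetical names, and the extra plausibility check (no whitespace or nested brackets inside the ID) is an assumption added to cope with the "starts with '<' but the matching '>' is missing" case. RFC-compliant headers carrying several IDs would need separate handling; this sketch only ever returns one.

```cpp
#include <string>

// Hypothetical helper (not a notmuch function): strip surrounding whitespace.
static std::string trim(const std::string &s)
{
    const char *ws = " \t\r\n";
    const std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos)
        return "";
    const std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Assumption: a usable ID is non-empty and contains no whitespace or
// stray angle brackets.
static bool plausible_id(const std::string &id)
{
    return !id.empty() && id.find_first_of(" \t\r\n<>") == std::string::npos;
}

// Returns the chosen message ID without its angle brackets, or an empty
// string when the header is treated as garbage.
std::string pick_in_reply_to_id(const std::string &header)
{
    const std::string h = trim(header);
    if (h.empty())
        return "";

    // Case 1: header begins with '<': take the initial bracketed portion.
    if (h[0] == '<') {
        const std::string::size_type close = h.find('>');
        if (close != std::string::npos) {
            const std::string id = h.substr(1, close - 1);
            if (plausible_id(id))
                return id;
        }
    }

    // Case 2: header ends with '>': take the final bracketed portion.
    if (h[h.size() - 1] == '>') {
        const std::string::size_type open = h.rfind('<');
        if (open != std::string::npos) {
            const std::string id = h.substr(open + 1, h.size() - open - 2);
            if (plausible_id(id))
                return id;
        }
    }

    // Case 3: ignore this garbage-valued header.
    return "";
}
```

With a header of the "Message from $NAME <address> of $date <id>" form, the first branch is skipped and the second one returns the trailing ID rather than the address.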
[Pablo Oliveira] Bug#745303: notmuch new corrupts database throwing 'Xapian::DatabaseCorruptError'
An embedded message was scrubbed...
From: Pablo Oliveira <pa...@sifflez.org>
Subject: Bug#745303: notmuch new corrupts database throwing 'Xapian::DatabaseCorruptError'
Date: Sun, 20 Apr 2014 12:27:15 +0200
Size: 7122
URL: <http://notmuchmail.org/pipermail/notmuch/attachments/20140420/05b58575/attachment.mht>
excessive thread fusing
Quoth myself on Apr 20 at 12:48 pm:
> Quoth Andrei POPESCU on Apr 20 at 12:04 am:
> > On Sb, 19 apr 14, 18:52:02, Eric wrote:
> > >
> > > This may not actually be any help, but both hypermail and mhonarc agree
> > > that two messages form a separate thread from the rest. I believe that
> > > the latter, at least, is the JWZ algorithm.
> >
> > mutt concurs.
>
> Can anyone explain why JWZ *doesn't* have the same problem? I don't
> see how this heuristic doesn't doom it to the same fate:
>
>   The References field is populated from the ``References'' and/or
>   ``In-Reply-To'' headers. If both headers exist, take the first thing
>   in the In-Reply-To header that looks like a Message-ID, and append
>   it to the References header.
>
> Given this, even considering only messages 18 and 52 (which "should"
> be in different threads), JWZ should find the common "parent"
> <e.fraga at ucl.ac.uk> and link them in to the same thread:
>
> Add 18 (step 1)
> - The combined "references" list is <ID17> <e.fraga at ucl.ac.uk>
> - Creates and links containers 17 <- e.fraga at ucl.ac.uk <- 18 where the
>   first two are empty
>
> Add 52 (step 1)
> - The combined "references" list is
>   <ID31> <ID32> <ID39> <e.fraga at ucl.ac.uk>
> - Creates and links containers 31 <- 32 <- 39
> - Also considers container e.fraga at ucl.ac.uk, but this is already
>   linked, so it doesn't change it
> - Creates container 52 and links e.fraga at ucl.ac.uk <- 52 (step 1C)
>
> 18 and 52 will later get promoted over their empty parent (step 4),
> but will remain in the same thread.
>
> What am I missing? Or are these other MUAs not using pure JWZ?

I dug in to mutt's mutt_sort_threads a bit. It's not using JWZ, though
it's something similar. The most salient thing may be how it handles
in-reply-to and references:

1. If a message has both in-reply-to and references, the parent chain
   is the *last* in-reply-to ID and then the references from right to
   left (skipping the last reference ID if it's the same as the last
   in-reply-to ID). (See also mutt_parse_references.)

2. If a message has only in-reply-to, the parent chain is *all* of the
   IDs in in-reply-to *from right to left* (e.g., the right-most one
   is the immediate parent).

3. If a message has only references, the parent chain is that, from
   right to left.

Like JWZ, mutt creates and links together "empty containers" as it
scans the parent chain towards the root, though unlike JWZ it stops
when it finds a non-empty container or a container that already has a
parent.
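The three parent-chain rules described above can be sketched roughly as follows. This is a paraphrase of the description in this message, not mutt's source: `parent_chain` is a hypothetical function, and it assumes both headers have already been split into ID lists in left-to-right header order.

```cpp
#include <string>
#include <vector>

// Build the parent chain, immediate parent first, from already-parsed
// ID lists (left-to-right as they appear in the headers).
std::vector<std::string>
parent_chain(const std::vector<std::string> &in_reply_to,
             const std::vector<std::string> &references)
{
    std::vector<std::string> chain;

    if (!in_reply_to.empty() && !references.empty()) {
        // Rule 1: last In-Reply-To ID, then References right to left,
        // skipping the last reference if it repeats that ID.
        chain.push_back(in_reply_to.back());
        std::vector<std::string>::const_reverse_iterator it = references.rbegin();
        if (*it == in_reply_to.back())
            ++it;
        chain.insert(chain.end(), it, references.rend());
    } else if (!in_reply_to.empty()) {
        // Rule 2: every In-Reply-To ID, right to left.
        chain.assign(in_reply_to.rbegin(), in_reply_to.rend());
    } else {
        // Rule 3: References only, right to left.
        chain.assign(references.rbegin(), references.rend());
    }
    return chain;
}
```

Under rule 1, a malformed In-Reply-To whose first bracketed token is an email address and whose last is the real ID contributes only that last ID, so the address never enters the chain. That would be consistent with mutt keeping the two threads apart, though this sketch does not model the container-linking step.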
[RFC PATCH] Re: excessive thread fusing
On Sun, 20 Apr 2014, Carl Worth wrote:
> Mark Walters writes:
>> I have done some debugging of this.
>
> Thanks for looking closely, Mark!
>
>> There is a patch below which fixes this test case but who knows what
>> it breaks! Please DO NOT apply unless someone who knows this code says
>> it's OK.
>
> I wrote much of the original code being patched here, so hopefully I
> understand it and can say something useful.
>
> I agree that the patch should not be applied. I don't like to see one
> piece of code not trusting another in the same code base. If the
> parse_references() function doesn't deal well with a malformed header,
> then we should fix it, not step around it.
>
> Meanwhile, not treating all potential referenced message IDs
> consistently could definitely make the notmuch algorithm more fragile
> and sensitive to the order of message indexing, etc. So let's not do
> that.

I agree. This bug first came up in id:874nvcekjk.fsf at qmul.ac.uk; I
think that got mostly fixed by cf8aaafbad68
(id:1361836225-17279-1-git-send-email-aaronecay at gmail.com and
related thread) so we may want to check whether that change is still
wanted if we fix the actual bug.

> Instead, let's track down and fix the actual bug.
>
> Thanks for the idea of using two-digit names for these messages. That
> makes it much easier to inspect the relevant headers.
>
> Below, I've grepped out the actual References and In-Reply-To headers
> from the messages, and then simply substituted minimal, and
> easy-to-understand values for the message IDs.
>
> With these minimally modified headers, it's easy to manually inspect the
> relationships and see that messages 17 and 18 belong in one thread, and
> messages 32-52 belong in a separate thread.
>
> It's also quite easy to see the potential source of the bug. The
> In-Reply-To headers for messages 18, 32, and 52 all share a common
> string (an email address) formatted to look like a message-id,
> "<e.fraga at ucl.ac.uk>". If notmuch looks at those headers, and treats
> that string as a message-id, then all of these messages will be
> connected into a single thread.
>
> And since that's the reported behavior, it seems likely that
> "<e.fraga at ucl.ac.uk>" is the cause of this bug.
>
>> I put some debug stuff in _notmuch_database_link_message_to_parents and
>> I think that the problem comes from the call to parse_references on line
>> 1767 which adds the malformed in-reply-to header to the hash table, so
>> this malformed line gets added as a potential parent.
>
> Am I correct that your debugging showed that "<e.fraga at ucl.ac.uk>" is
> being added to the hash table?

Yes that is correct.

> My inspection of _parse_references() and parse_message_id() suggests
> that that's exactly what notmuch is doing, (treating both of the
> angle-bracketed portions ("<e.fraga at ucl.ac.uk>" as well as the actual
> message-ID, "<ID17>" or "<ID31>" or "<ID39>") as message IDs.
>
> So it seems like we need a new _parse_in_reply_to() function to use in
> place of _parse_references() and the new function will need a better
> heuristic for dealing with the unpredictability of In-Reply-To.
>
> The only real reason that we are trying to grab multiple message ID
> values from an In-Reply-To header is that RFC 2822 explicitly allows
> that, (to support a message simultaneously replying to multiple
> messages). I don't believe that that's common, but we might as well
> support it. At the same time, RFC 2822 also explicitly specifies that
> the In-Reply-To header will consist of nothing but message IDs.
>
> So perhaps the heuristic here could be to notice any characters outside
> of angle brackets, (like "Message" in the headers below), and in that
> case go to a strictly "not RFC 2822" mode and look for exactly one
> message ID. At that point, JWZ would recommend "the first <>-bracketed
> text found therein", but that would give precisely the wrong answer in
> this particular case. Here the correct Message ID appears in the last
> <>-bracketed text. I have not surveyed a large email corpus to determine
> how often "last <>-bracketed text" would fail as a heuristic.
>
> Another idea would be to trigger specifically on common forms. Judging
> from the samples in this particular thread, it seems like a workable
> heuristic would be:
>
>     If the In-Reply-To header begins with '<':
>         Parse that initial portion as a message ID
>     Else if it ends with '>':
>         Parse that final portion as a message ID
>     Else
>         Ignore this garbage-valued header.
>
> That's probably the best and most reliable thing to do here.
>
> Does anyone have any better ideas?

Is there a case coming before all the above: if the In-Reply-To header
is correctly formed then parse as we do currently? (You sort of suggest
so above but I just wanted to check)

>> As a clear example that I don't understand this code I don't know why
>> this no longer causes a problem if message 17 gets added too.
>
> I wanted to test my own knowledge of the code to see if I could explain
> this. But I didn't precisely follow your
excessive thread fusing
Quoth Andrei POPESCU on Apr 20 at 12:04 am:
> On Sb, 19 apr 14, 18:52:02, Eric wrote:
> >
> > This may not actually be any help, but both hypermail and mhonarc agree
> > that two messages form a separate thread from the rest. I believe that
> > the latter, at least, is the JWZ algorithm.
>
> mutt concurs.

Can anyone explain why JWZ *doesn't* have the same problem? I don't
see how this heuristic doesn't doom it to the same fate:

  The References field is populated from the ``References'' and/or
  ``In-Reply-To'' headers. If both headers exist, take the first thing
  in the In-Reply-To header that looks like a Message-ID, and append
  it to the References header.

Given this, even considering only messages 18 and 52 (which "should"
be in different threads), JWZ should find the common "parent"
<e.fraga at ucl.ac.uk> and link them in to the same thread:

Add 18 (step 1)
- The combined "references" list is <ID17> <e.fraga at ucl.ac.uk>
- Creates and links containers 17 <- e.fraga at ucl.ac.uk <- 18 where the
  first two are empty

Add 52 (step 1)
- The combined "references" list is
  <ID31> <ID32> <ID39> <e.fraga at ucl.ac.uk>
- Creates and links containers 31 <- 32 <- 39
- Also considers container e.fraga at ucl.ac.uk, but this is already
  linked, so it doesn't change it
- Creates container 52 and links e.fraga at ucl.ac.uk <- 52 (step 1C)

18 and 52 will later get promoted over their empty parent (step 4),
but will remain in the same thread.

What am I missing? Or are these other MUAs not using pure JWZ?
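To make the walk-through above concrete, here is a small C++ sketch of the quoted JWZ rule for building the combined references list. It is an illustration only: `bracketed_ids` and `jwz_reference_list` are hypothetical names, "looks like a Message-ID" is simplified to "any <...> token", and the header strings in the usage note are stand-ins rather than the real headers of messages 18 and 52.

```cpp
#include <string>
#include <vector>

// Collect every "<...>" token of a header value, in order, without the
// angle brackets. An unterminated '<' ends the scan.
std::vector<std::string> bracketed_ids(const std::string &header)
{
    std::vector<std::string> ids;
    std::string::size_type pos = 0;
    while ((pos = header.find('<', pos)) != std::string::npos) {
        const std::string::size_type close = header.find('>', pos + 1);
        if (close == std::string::npos)
            break;
        ids.push_back(header.substr(pos + 1, close - pos - 1));
        pos = close + 1;
    }
    return ids;
}

// Quoted JWZ rule: start from the References IDs and append the *first*
// ID-like token found in In-Reply-To.
std::vector<std::string> jwz_reference_list(const std::string &references,
                                            const std::string &in_reply_to)
{
    std::vector<std::string> refs = bracketed_ids(references);
    const std::vector<std::string> irt = bracketed_ids(in_reply_to);
    if (!irt.empty())
        refs.push_back(irt.front());
    return refs;
}
```

Fed a References value of `<ID17>` and an In-Reply-To of the form `Message from X <addr@example.invalid> of some date <ID17>`, the list ends in the address rather than the real ID. A second message with different References but the same address in In-Reply-To ends in the same token, which is the shared "parent" the argument above relies on.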
[PATCH v2 0/4] doc: notmuch-show improvements
This LGTM +1

Best wishes

Mark

On Fri, 18 Apr 2014, Austin Clements wrote:
> This is v2 of id:1397834332-25175-1-git-send-email-amdragon at mit.edu.
> It expands the explanation of "non-MIME" message parts and moves it to
> the --part documentation.
[RFC PATCH] Re: excessive thread fusing
On Sat, 19 Apr 2014, David Bremner wrote:
> Gregor Zattler mentioned some problems with threading at
>
>     id:20120126004024.GA13704 at shi.workgroup
>
> After some off list discussions, I believe I have a smaller test case.
>
> The attached maildir contains 24 messages from the org mode list.
>
> According to notmuch, these form one thread, but I can't figure out
> exactly why. It seems like the chronologically first two messages should
> be a separate thread. There are several of the infamous malformed ME-E
> In-reply-to headers, but each of these messages also has a References
> header; this seems to indicate a case missed by commit cf8aaafbad68.

Hi

I have done some debugging of this. There is a patch below which fixes
this test case but who knows what it breaks! Please DO NOT apply unless
someone who knows this code says it's OK.

First, the bug is quite sensitive. The attached 24 messages are
numbered and I will use the last two digits to refer to them (i.e. the
2 digits are the ?? in 1397885606.0002??.mbox:2,). The numbers range
from 17-52; 17 and 18 should be one thread and the rest a different
thread.

1) If you add all messages you get one thread.
2) If you add all apart from 52 you get 2 threads. However, then
   adding 52 still gives two threads.
3) If you add 18 and then 52 you get 1 thread.
4) If you add 17 and 18 then 52 you get 2 threads.

I think notmuch will use inode sort and since the tar file contains
these three files in the order 18 52 17 we get cases 1 and 2 above.

I put some debug stuff in _notmuch_database_link_message_to_parents and
I think that the problem comes from the call to parse_references on
line 1767 which adds the malformed in-reply-to header to the hash
table, so this malformed line gets added as a potential parent.

As a clear example that I don't understand this code I don't know why
this no longer causes a problem if message 17 gets added too.

Best wishes

Mark

---
 lib/database.cc |   21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 1efb14d..373a255 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1763,20 +1763,23 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
 				     this_message_id,
 				     parents, refs);
 
-    in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
-    in_reply_to_message_id = parse_references (message,
-					       this_message_id,
-					       parents, in_reply_to);
-
     /* For the parent of this message, use the last message ID of the
      * References header, if available. If not, fall back to the
-     * first message ID in the In-Reply-To header. */
+     * first message ID in the In-Reply-To header. We only parse the
+     * In-Reply-To header if we need to as otherwise we might
+     * contaminate the hash table if it is malformed. */
     if (last_ref_message_id) {
 	_notmuch_message_add_term (message, "replyto",
 				   last_ref_message_id);
-    } else if (in_reply_to_message_id) {
-	_notmuch_message_add_term (message, "replyto",
-				   in_reply_to_message_id);
+    } else {
+	in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
+	in_reply_to_message_id = parse_references (message,
+						   this_message_id,
+						   parents, in_reply_to);
+	if (in_reply_to_message_id) {
+	    _notmuch_message_add_term (message, "replyto",
+				       in_reply_to_message_id);
+	}
     }
 
     keys = g_hash_table_get_keys (parents);
-- 
1.7.10.4
excessive thread fusing
On Sb, 19 apr 14, 18:52:02, Eric wrote:
>
> This may not actually be any help, but both hypermail and mhonarc agree
> that two messages form a separate thread from the rest. I believe that
> the latter, at least, is the JWZ algorithm.

mutt concurs.

Kind regards,
Andrei
-- 
If you can't explain it simply, you don't understand it well enough.
(Albert Einstein)
http://nuvreauspam.ro/gpg-transition.txt
Re: [RFC PATCH] Re: excessive thread fusing
Mark Walters <markwalters1...@gmail.com> writes:
> I have done some debugging of this.

Thanks for looking closely, Mark!

> There is a patch below which fixes this test case but who knows what
> it breaks! Please DO NOT apply unless someone who knows this code says
> it's OK.

I wrote much of the original code being patched here, so hopefully I
understand it and can say something useful.

I agree that the patch should not be applied. I don't like to see one
piece of code not trusting another in the same code base. If the
parse_references() function doesn't deal well with a malformed header,
then we should fix it, not step around it.

Meanwhile, not treating all potential referenced message IDs
consistently could definitely make the notmuch algorithm more fragile
and sensitive to the order of message indexing, etc. So let's not do
that.

Instead, let's track down and fix the actual bug.

Thanks for the idea of using two-digit names for these messages. That
makes it much easier to inspect the relevant headers.

Below, I've grepped out the actual References and In-Reply-To headers
from the messages, and then simply substituted minimal, and
easy-to-understand values for the message IDs.

With these minimally modified headers, it's easy to manually inspect the
relationships and see that messages 17 and 18 belong in one thread, and
messages 32-52 belong in a separate thread.

It's also quite easy to see the potential source of the bug. The
In-Reply-To headers for messages 18, 32, and 52 all share a common
string (an email address) formatted to look like a message-id,
"<e.fr...@ucl.ac.uk>". If notmuch looks at those headers, and treats
that string as a message-id, then all of these messages will be
connected into a single thread.

And since that's the reported behavior, it seems likely that
"<e.fr...@ucl.ac.uk>" is the cause of this bug.

> I put some debug stuff in _notmuch_database_link_message_to_parents and
> I think that the problem comes from the call to parse_references on line
> 1767 which adds the malformed in-reply-to header to the hash table, so
> this malformed line gets added as a potential parent.

Am I correct that your debugging showed that "<e.fr...@ucl.ac.uk>" is
being added to the hash table?

My inspection of _parse_references() and parse_message_id() suggests
that that's exactly what notmuch is doing, (treating both of the
angle-bracketed portions ("<e.fr...@ucl.ac.uk>" as well as the actual
message-ID, "<ID17>" or "<ID31>" or "<ID39>") as message IDs.

So it seems like we need a new _parse_in_reply_to() function to use in
place of _parse_references() and the new function will need a better
heuristic for dealing with the unpredictability of In-Reply-To.

The only real reason that we are trying to grab multiple message ID
values from an In-Reply-To header is that RFC 2822 explicitly allows
that, (to support a message simultaneously replying to multiple
messages). I don't believe that that's common, but we might as well
support it. At the same time, RFC 2822 also explicitly specifies that
the In-Reply-To header will consist of nothing but message IDs.

So perhaps the heuristic here could be to notice any characters outside
of angle brackets, (like "Message" in the headers below), and in that
case go to a strictly "not RFC 2822" mode and look for exactly one
message ID. At that point, JWZ would recommend "the first <>-bracketed
text found therein", but that would give precisely the wrong answer in
this particular case. Here the correct Message ID appears in the last
<>-bracketed text. I have not surveyed a large email corpus to determine
how often "last <>-bracketed text" would fail as a heuristic.

Another idea would be to trigger specifically on common forms. Judging
from the samples in this particular thread, it seems like a workable
heuristic would be:

    If the In-Reply-To header begins with '<':
        Parse that initial portion as a message ID
    Else if it ends with '>':
        Parse that final portion as a message ID
    Else
        Ignore this garbage-valued header.

That's probably the best and most reliable thing to do here.

Does anyone have any better ideas?

> As a clear example that I don't understand this code I don't know why
> this no longer causes a problem if message 17 gets added too.

I wanted to test my own knowledge of the code to see if I could explain
this. But I didn't precisely follow your explanation of the behavior you
saw.

In cases (1) and (2) of your description, what order are you using to
"add all messages" or "add all apart from 52"?

Then, for cases (3) and (4), what is done before adding the messages
mentioned in these cases? Add all other messages? Again, in what order?

I haven't tracked through all the logic of the existing algorithm for
this case. But I don't like hearing that notmuch constructs different
threads for the same messages presented in different orders. This sounds
like a bug separate from what we've discussed above.

-Carl

18:References: <ID17>
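The "notice any characters outside of angle brackets" test suggested in this message could be sketched in C++ as below. This is only an illustration: `only_bracketed_ids` is a hypothetical name, not a proposed notmuch function, and as a simplification it treats RFC 2822 comments in parentheses as non-conforming, which a real parser would probably want to tolerate.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if the header value consists solely of one or more non-empty
// "<...>" tokens separated by whitespace. Anything else (stray words,
// an unterminated '<', an empty "<>") makes it return false.
bool only_bracketed_ids(const std::string &header)
{
    bool inside = false;    // currently between '<' and '>'
    bool seen_id = false;   // at least one complete token found
    std::size_t len = 0;    // length of the token being scanned

    for (std::string::size_type i = 0; i < header.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(header[i]);
        if (inside) {
            if (c == '>') {
                if (len == 0)
                    return false;
                inside = false;
                seen_id = true;
            } else if (c == '<' || std::isspace(c)) {
                return false;
            } else {
                ++len;
            }
        } else if (c == '<') {
            inside = true;
            len = 0;
        } else if (!std::isspace(c)) {
            return false;   // character outside angle brackets
        }
    }
    return seen_id && !inside;
}
```

A caller could keep the existing multi-ID parsing when this returns true and fall back to a single-ID heuristic otherwise.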
[Pablo Oliveira] Bug#745303: notmuch new corrupts database throwing 'Xapian::DatabaseCorruptError'
---BeginMessage---
Package: notmuch
Version: 0.17-5+b1
Severity: normal

Dear Maintainer,

I'm using an offlineimap postsynchook to index my mail with `notmuch
new`. Twice this week, the following error was thrown:

    Hook stderr:terminate called after throwing an instance of 'Xapian::DatabaseCorruptError'
    /home/poliveira/bin/index-mail: line 2: 7613 Aborted notmuch new

Afterwards, running `notmuch new` produces the following output:

    notmuch new
    Welcome to a new version of notmuch! Your database will now be upgraded.
    Your notmuch database has now been upgraded to database format version 1.
    A Xapian exception occurred adding message: No termlist for document 70352.
    Error: A Xapian exception occurred. Halting processing.
    Processed 1 file in almost no time.
    No new mail.
    Note: A fatal error was encountered: A Xapian exception occurred

Yet notmuch was not recently updated on my system.

All further notmuch commands fail with:

    notmuch search tag:spam
    terminate called after throwing an instance of 'Xapian::DatabaseCorruptError'
    Aborted

To restore a working system I must dump notmuch tags, reindex all my
mail, and restore the tags. (Which is pretty long, since I keep a large
amount of mails).

Apart from notmuch, two other clients access the notmuch database:

* emacs-notmuch
* afew (9744c18c)

Thanks,

Pablo

*** Reporter, please consider answering these questions, where appropriate ***

* What led up to the situation?
* What exactly did you do (or not do) that was effective (or ineffective)?
* What was the outcome of this action?
* What outcome did you expect instead?

*** End of the template - remove these template lines ***

-- System Information:
Debian Release: jessie/sid
  APT prefers testing-updates
  APT policy: (500, 'testing-updates'), (500, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 3.13-1-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages notmuch depends on:
ii  libc6           2.18-4
ii  libglib2.0-0    2.40.0-2
ii  libgmime-2.6-0  2.6.19-3
ii  libnotmuch3     0.17-5+b1
ii  libtalloc2      2.1.0-1

Versions of packages notmuch recommends:
ii  gnupg-agent    2.0.22-3
ii  notmuch-emacs  0.17-5

notmuch suggests no packages.

-- no debconf information
---End Message---
Re: excessive thread fusing
Quoth Andrei POPESCU on Apr 20 at 12:04 am:
> On Sb, 19 apr 14, 18:52:02, Eric wrote:
> >
> > This may not actually be any help, but both hypermail and mhonarc agree
> > that two messages form a separate thread from the rest. I believe that
> > the latter, at least, is the JWZ algorithm.
>
> mutt concurs.

Can anyone explain why JWZ *doesn't* have the same problem? I don't see how this heuristic doesn't doom it to the same fate:

  The References field is populated from the ``References'' and/or
  ``In-Reply-To'' headers. If both headers exist, take the first thing
  in the In-Reply-To header that looks like a Message-ID, and append
  it to the References header.

Given this, even considering only messages 18 and 52 (which "should" be in different threads), JWZ should find the common "parent" e.fr...@ucl.ac.uk and link them in to the same thread:

Add 18 (step 1)
- The combined "references" list is <ID17> <e.fr...@ucl.ac.uk>
- Creates and links containers 17 <- e.fr...@ucl.ac.uk <- 18 where the first two are empty

Add 52 (step 1)
- The combined "references" list is <ID31> <ID32> <ID39> <e.fr...@ucl.ac.uk>
- Creates and links containers 31 <- 32 <- 39
- Also considers container e.fr...@ucl.ac.uk, but this is already linked, so it doesn't change it
- Creates container 52 and links e.fr...@ucl.ac.uk <- 52 (step 1C)

18 and 52 will later get promoted over their empty parent (step 4), but will remain in the same thread.

What am I missing? Or are these other MUAs not using pure JWZ?
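The two-message walkthrough above can be replayed with a stripped-down model of JWZ step 1. This is a sketch under assumptions, not a full JWZ implementation: only the container creation and linking of step 1 is modelled, all class and function names are invented for the example, and the string "shared" stands in for the e.fr...@ucl.ac.uk ID.

```python
class Container:
    """One node in the ID table; empty until a real message claims it."""

    def __init__(self, msgid):
        self.msgid = msgid
        self.parent = None
        self.has_message = False


def get_container(table, msgid):
    """Find the container for msgid, creating an empty one if needed."""
    if msgid not in table:
        table[msgid] = Container(msgid)
    return table[msgid]


def is_ancestor(a, b):
    """True if a is b or one of b's ancestors (used to avoid loops)."""
    while b is not None:
        if b is a:
            return True
        b = b.parent
    return False


def add_message(table, msgid, references):
    """Link one message into the table, following JWZ step 1."""
    this = get_container(table, msgid)
    this.has_message = True
    prev = None
    for ref in references:
        cur = get_container(table, ref)
        # Link prev <- cur, but leave containers that already have a parent alone.
        if prev is not None and cur.parent is None and not is_ancestor(cur, prev):
            cur.parent = prev
        prev = cur
    # Step 1C: the last reference becomes this message's parent.
    if prev is not None and not is_ancestor(this, prev):
        this.parent = prev


def root_of(container):
    """Walk parent links up to the top of the thread."""
    while container.parent is not None:
        container = container.parent
    return container


def demo():
    """Replay 'Add 18' and 'Add 52' with the combined reference lists above."""
    table = {}
    add_message(table, "18", ["ID17", "shared"])
    add_message(table, "52", ["ID31", "ID32", "ID39", "shared"])
    return table
```

In this model, root_of(table["18"]) and root_of(table["52"]) are the same container (ID17), while 31 <- 32 <- 39 stays a separate chain, which matches the outcome described in the walkthrough.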
Re: excessive thread fusing
Quoth myself on Apr 20 at 12:48 pm:
> Quoth Andrei POPESCU on Apr 20 at 12:04 am:
> > On Sb, 19 apr 14, 18:52:02, Eric wrote:
> > >
> > > This may not actually be any help, but both hypermail and mhonarc agree
> > > that two messages form a separate thread from the rest. I believe that
> > > the latter, at least, is the JWZ algorithm.
> >
> > mutt concurs.
>
> Can anyone explain why JWZ *doesn't* have the same problem? I don't
> see how this heuristic doesn't doom it to the same fate:
>
>   The References field is populated from the ``References'' and/or
>   ``In-Reply-To'' headers. If both headers exist, take the first thing
>   in the In-Reply-To header that looks like a Message-ID, and append
>   it to the References header.
>
> Given this, even considering only messages 18 and 52 (which "should"
> be in different threads), JWZ should find the common "parent"
> e.fr...@ucl.ac.uk and link them in to the same thread:
>
> Add 18 (step 1)
> - The combined "references" list is <ID17> <e.fr...@ucl.ac.uk>
> - Creates and links containers 17 <- e.fr...@ucl.ac.uk <- 18 where the
>   first two are empty
>
> Add 52 (step 1)
> - The combined "references" list is <ID31> <ID32> <ID39> <e.fr...@ucl.ac.uk>
> - Creates and links containers 31 <- 32 <- 39
> - Also considers container e.fr...@ucl.ac.uk, but this is already
>   linked, so it doesn't change it
> - Creates container 52 and links e.fr...@ucl.ac.uk <- 52 (step 1C)
>
> 18 and 52 will later get promoted over their empty parent (step 4),
> but will remain in the same thread.
>
> What am I missing? Or are these other MUAs not using pure JWZ?

I dug in to mutt's mutt_sort_threads a bit. It's not using JWZ, though it's something similar. The most salient thing may be how it handles in-reply-to and references:

1. If a message has both in-reply-to and references, the parent chain is the *last* in-reply-to ID and then the references from right to left (skipping the last reference ID if it's the same as the last in-reply-to ID). (See also mutt_parse_references.)

2. If a message has only in-reply-to, the parent chain is *all* of the IDs in in-reply-to *from right to left* (e.g., the right-most one is the immediate parent).

3. If a message has only references, the parent chain is that, from right to left.

Like JWZ, mutt creates and links together empty containers as it scans the parent chain towards the root, though unlike JWZ it stops when it finds a non-empty container or a container that already has a parent.
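The three rules above can be written down as a small function. This is a sketch of the rules as described in this message, not a transcription of mutt's C code; the name parent_chain and the list-based inputs are assumptions for illustration. Both arguments are lists of message IDs in header order, left to right, and the result lists the immediate parent first.

```python
def parent_chain(in_reply_to, references):
    """Return candidate ancestors, immediate parent first.

    Hypothetical helper: in_reply_to and references are lists of message
    IDs in the order they appear in the respective headers.
    """
    if in_reply_to and references:
        # Rule 1: last In-Reply-To ID, then References right to left,
        # skipping the last reference if it repeats that ID.
        parent = in_reply_to[-1]
        refs = list(references)
        if refs[-1] == parent:
            refs.pop()
        return [parent] + refs[::-1]
    if in_reply_to:
        # Rule 2: all In-Reply-To IDs, right to left.
        return in_reply_to[::-1]
    # Rule 3: References only (or nothing at all), right to left.
    return references[::-1]
```

For example, parent_chain(["a", "b"], ["r1", "r2", "b"]) gives ["b", "r2", "r1"]: the right-most In-Reply-To ID is the immediate parent and the duplicate trailing reference is dropped. The stopping rule for container linking described in the last paragraph is not modelled here.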
Re: [PATCH] doc: make notmuch-new summary line more generic
David Bremner da...@tethera.net writes:

> Since 'notmuch new' now takes multiple options, it's confusing to show
> only one of them in the summary.

pushed,

d
Re: [PATCH 0/7] doc: Python 3 compat, rst2man.py support, etc.
Tomi Ollila tomi.oll...@iki.fi writes:

> In this series IMO the patches 1-4:
>
>   id:8d518408f2da8bc96ae3123f05791142da26b9bc.1396718720.git.wk...@tremily.us
>   id:543aee63407956e60f85dc11a2d25855e98c10c3.1396718720.git.wk...@tremily.us
>   id:5e4509ab08699afe2681110fb35075e1d0bbdc7e.1396718720.git.wk...@tremily.us
>   id:c5ec510ac25c867ad600c475a0070a003440a4b8.1396718720.git.wk...@tremily.us
>
> could go in as those are.
>
> 5:
>
>   id:adce76bb9a0ca728d856da4ecaf6b282e22e7440.1396718720.git.wk...@tremily.us
>
> if, for consistency reason (we don't use absolute paths with other
> commands either), rst2man/rst2man.py is used as is (and commit
> message adjusted accordingly).

I've queued 1-4 for merging. Any patches that might break the build (e.g. 5 and 6 in this series) have to go in pretty quick if they are to be in 0.18; patch 7 we can sort out during the freeze.

I'm not sure I completely understand the state of the discussion around patch 5. Personally I don't like either undefined or empty RST2MAN as a boolean a priori. I'd rather keep HAVE_RST2MAN for consistency.

d