[RFC PATCH] Re: excessive thread fusing

2014-04-20 Thread David Bremner
Carl Worth  writes:
>
> Another idea would be to trigger specifically on common forms. Judging
> from the samples in this particular thread, it seems like a workable
> heuristic would be:
>
>   If the In-Reply-To header begins with '<':
>
>   Parse that initial portion as a message ID
>
>   Else if it ends with '>':
>
>   Parse that final portion as a message ID
>
>   Else
>
>   Ignore this garbage-valued header.
>

Using the hacky script below, I scanned my own mail collection of about
300k messages. I can make the following observations:

- I have some RFC-compliant In-Reply-To headers with multiple IDs.
- I have a non-trivial number of the form
  "Message from $NAME <address> of $date <id>".
- I didn't see any cases where using the last angle-bracketed thing
  would fail.
- I did see some cases where the header starts with '<' but the
  matching '>' was missing.
- I also noticed some RFC 2047 encoding of In-Reply-To headers.


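##
# hacky script follows
dir=$1
echo "Scanning $dir"

tempdir=$(mktemp -d)
echo "Writing to ${tempdir}"

# Extract the In-Reply-To header of every file below $dir.
find "$dir" -type f -exec sh -c 'formail -c -xIn-reply-to < "$1"' sh {} \; \
  > "${tempdir}/ids"

# Normalise whitespace, replace each <...> token with <id> and each
# parenthesised comment with (comment), then list the distinct forms.
sed -e 's/\t/ /' -e 's/   */ /g' -e 's/<[^ ]*>/<id>/g' -e 's/(.*)/(comment)/' \
  < "${tempdir}/ids" | sort | uniq | tee "${tempdir}/report"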
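
The normalisation step of the script can also be sketched in Python, which
makes it easier to count how often each header shape occurs. This is only an
approximation of the sed pipeline: header_shape and shape_report are made-up
names, and the sample values are invented rather than taken from any real
mailbox.

```python
import re
from collections import Counter


def header_shape(value):
    """Reduce one In-Reply-To value to its shape, mirroring the sed expressions."""
    shape = value.replace("\t", " ", 1)                     # s/\t/ /    (first tab only)
    shape = re.sub(r" {2,}", " ", shape)                    # s/   */ /g (runs of spaces)
    shape = re.sub(r"<[^ ]*>", "<id>", shape)               # s/<[^ ]*>/<id>/g
    shape = re.sub(r"\(.*\)", "(comment)", shape, count=1)  # s/(.*)/(comment)/
    return shape


def shape_report(values):
    """Count how often each shape occurs (like sort | uniq, plus a count)."""
    return Counter(header_shape(v) for v in values)


# Invented sample values, for illustration only.
samples = [
    "<a@example.org>",
    "<a@example.org> <b@example.org>",
    'Message from Some One  <someone@example.org> of "Sat, 19 Apr 2014" <c@example.org>',
    "<d@example.org> (Some One's message of Sat, 19 Apr 2014)",
]
for shape, count in sorted(shape_report(samples).items()):
    print(count, shape)
```

A header whose '<' has no matching '>' is left untouched by the substitution,
so such values stay visible in the report instead of collapsing to <id>.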


[Pablo Oliveira] Bug#745303: notmuch new corrupts database throwing 'Xapian::DatabaseCorruptError'

2014-04-20 Thread David Bremner
An embedded message was scrubbed...
From: Pablo Oliveira <pa...@sifflez.org>
Subject: Bug#745303: notmuch new corrupts database throwing 
'Xapian::DatabaseCorruptError'
Date: Sun, 20 Apr 2014 12:27:15 +0200
Size: 7122
URL: 
<http://notmuchmail.org/pipermail/notmuch/attachments/20140420/05b58575/attachment.mht>


excessive thread fusing

2014-04-20 Thread Austin Clements
Quoth myself on Apr 20 at 12:48 pm:
> Quoth Andrei POPESCU on Apr 20 at 12:04 am:
> > On Sb, 19 apr 14, 18:52:02, Eric wrote:
> > > 
> > > This may not actually be any help, but both hypermail and mhonarc agree
> > > that two messages form a separate thread from the rest. I believe that
> > > the latter, at least, is the JWZ algorithm.
> > 
> > mutt concurs.
> 
> Can anyone explain why JWZ *doesn't* have the same problem?  I don't
> see how this heuristic doesn't doom it to the same fate:
> 
>   The References field is populated from the ``References'' and/or
>   ``In-Reply-To'' headers. If both headers exist, take the first thing
>   in the In-Reply-To header that looks like a Message-ID, and append
>   it to the References header.
> 
> Given this, even considering only messages 18 and 52 (which "should"
> be in different threads), JWZ should find the common "parent"
> e.fraga at ucl.ac.uk and link them in to the same thread:
> 
> Add 18 (step 1)
> - The combined "references" list is <ID17> <e.fraga at ucl.ac.uk>
> - Creates and links containers 17 <- e.fraga at ucl.ac.uk <- 18 where the
>   first two are empty
> 
> Add 52 (step 1)
> - The combined "references" list is <ID31> <ID32> <ID39>
>   <e.fraga at ucl.ac.uk>
> - Creates and links containers 31 <- 32 <- 39
> - Also considers container e.fraga at ucl.ac.uk, but this is already
>   linked, so it doesn't change it
> - Creates container 52 and links e.fraga at ucl.ac.uk <- 52 (step 1C)
> 
> 18 and 52 will later get promoted over their empty parent (step 4),
> but will remain in the same thread.
> 
> What am I missing?  Or are these other MUAs not using pure JWZ?

I dug in to mutt's mutt_sort_threads a bit.  It's not using JWZ,
though it's something similar.  The most salient thing may be how it
handles in-reply-to and references:

1. If a message has both in-reply-to and references, the parent chain
   is the *last* in-reply-to ID and then the references from right to
   left (skipping the last reference ID if it's the same as the last
   in-reply-to ID).  (See also mutt_parse_references.)
2. If a message has only in-reply-to, the parent chain is *all* of the
   IDs in in-reply-to *from right to left* (e.g., the right-most one
   is the immediate parent).
3. If a message has only references, the parent chain is that, from
   right to left.

Like JWZ, mutt creates and links together "empty containers" as it
scans the parent chain towards the root, though unlike JWZ it stops
when it finds a non-empty container or a container that already has a
parent.
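
The three cases above can be sketched as a small function. This is a reading
of the description in this message, not a translation of mutt's source:
mutt_parent_chain is a hypothetical name, and both inputs are assumed to be
lists of message IDs already extracted from each header in left-to-right
order.

```python
def mutt_parent_chain(in_reply_to_ids, reference_ids):
    """Return ancestor IDs, nearest parent first, following the three cases."""
    if in_reply_to_ids and reference_ids:
        # Case 1: last In-Reply-To ID, then References from right to left,
        # skipping the last reference if it repeats the In-Reply-To ID.
        parent = in_reply_to_ids[-1]
        refs = list(reversed(reference_ids))
        if refs[0] == parent:
            refs = refs[1:]
        return [parent] + refs
    if in_reply_to_ids:
        # Case 2: all In-Reply-To IDs, right-most is the immediate parent.
        return list(reversed(in_reply_to_ids))
    # Case 3: References only, from right to left.
    return list(reversed(reference_ids))


# With the malformed header of message 52, the bogus "e.fraga" token is the
# first In-Reply-To ID, so this reading never puts it in the chain.
print(mutt_parent_chain(["e.fraga", "ID39"], ["ID31", "ID32", "ID39"]))
```

Under these assumptions the email address never becomes a parent when a
References header is present, which would be consistent with mutt keeping the
two threads apart.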


[RFC PATCH] Re: excessive thread fusing

2014-04-20 Thread Mark Walters

On Sun, 20 Apr 2014, Carl Worth  wrote:
> Mark Walters  writes:
>> I have done some debugging of this.
>
> Thanks for looking closely, Mark!
>
>> There is a patch below which fixes this test case but who knows what
>> it breaks! Please DO NOT apply unless someone who knows this code says
>> it's OK.
>
> I wrote much of the original code being patched here, so hopefully I
> understand it and can say something useful.
>
> I agree that the patch should not be applied. I don't like to see one
> piece of code not trusting another in the same code base. If the
> parse_references() function doesn't deal well with a malformed header,
> then we should fix it, not step around it.

>
> Meanwhile, not treating all potential referenced message IDs
> consistently could definitely make the notmuch algorithm more fragile
> and sensitive to the order of message indexing, etc. So let's not do
> that.

I agree. This bug first came up in id:874nvcekjk.fsf at qmul.ac.uk; I think
that got mostly fixed by cf8aaafbad68
(id:1361836225-17279-1-git-send-email-aaronecay at gmail.com and related
thread) so we may want to check whether that change is still wanted if
we fix the actual bug.

> Instead, let's track down and fix the actual bug.
>
> Thanks for the idea of using two-digit names for these messages. That
> makes it much easier to inspect the relevant headers.
>
> Below, I've grepped out the actual References and In-Reply-To headers
> from the messages, and then simply substituted minimal, and
> easy-to-understand values for the message IDs.
>
> With these minimally modified headers, it's easy to manually inspect the
> relationships and see that messages 17 and 18 belong in one thread, and
> messages 32-52 belong in a separate thread.
>
> It's also quite easy to see the potential source of the bug. The
> In-Reply-To headers for messages 18, 32, and 52 all share a common
> string (an email address) formatted to look like a message-id,
> "<e.fraga at ucl.ac.uk>". If notmuch looks at those headers, and treats
> that string as a message-id, then all of these messages will be
> connected into a single thread.
>
> And since that's the reported behavior, it seems likely that
> "<e.fraga at ucl.ac.uk>" is the cause of this bug.
>
>> I put some debug stuff in _notmuch_database_link_message_to_parents and
>> I think that the problem comes from the call to parse_references on line
>> 1767 which adds the malformed in-reply-to header to the hash table, so
>> this malformed line gets added as a potential parent. 
>
> Am I correct that your debugging showed that "<e.fraga at ucl.ac.uk>" is
> being added to the hash table?

Yes that is correct.

> My inspection of _parse_references() and parse_message_id() suggests
> that that's exactly what notmuch is doing: treating both of the
> angle-bracketed portions ("<e.fraga at ucl.ac.uk>" as well as the actual
> message-ID, "<ID17>" or "<ID31>" or "<ID39>") as message IDs.
>
> So it seems like we need a new _parse_in_reply_to() function to use in
> place of _parse_references() and the new function will need a better
> heuristic for dealing with the unpredictability of In-Reply-To.
>
> The only real reason that we are trying to grab multiple message ID
> values from an In-Reply-To header is that RFC 2822 explicitly allows
> that, (to support a message simultaneously replying to multiple
> messages). I don't believe that that's common, but we might as well
> support it. At the same time, RFC 2822 also explicitly specifies that
> the In-Reply-To header will consist of nothing but message IDs.
>
> So perhaps the heuristic here could be to notice any characters outside
> of angle brackets, (like "Message" in the headers below), and in that
> case go to a strictly "not RFC 2822" mode and look for exactly one
> message ID. At that point, JWZ would recommend "the first <>-bracketed
> text found therein", but that would give precisely the wrong answer in
> this particular case. Here the correct Message ID appears in the last
> <>-bracketed text. I have not surveyed a large email corpus to determine
> how often "last <>-bracketed text" would fail as a heuristic.
>
> Another idea would be to trigger specifically on common forms. Judging
> from the samples in this particular thread, it seems like a workable
> heuristic would be:
>
>   If the In-Reply-To header begins with '<':
>
>   Parse that initial portion as a message ID
>
>   Else if it ends with '>':
>
>   Parse that final portion as a message ID
>
>   Else
>
>   Ignore this garbage-valued header.
>
> That's probably the best and most reliable thing to do here.
>
> Does anyone have any better ideas?

Is there a case coming before all the above: if the In-Reply-To header
is correctly formed then parse as we do currently? (You sort of suggest
so above but I just wanted to check)

>> As a clear example that I don't understand this code I don't know why
>> this no longer causes a problem if message 17 gets added too.
>
> I wanted to test my own knowledge of the code to see if I could explain
> this. But I didn't 

excessive thread fusing

2014-04-20 Thread Austin Clements
Quoth Andrei POPESCU on Apr 20 at 12:04 am:
> On Sb, 19 apr 14, 18:52:02, Eric wrote:
> > 
> > This may not actually be any help, but both hypermail and mhonarc agree
> > that two messages form a separate thread from the rest. I believe that
> > the latter, at least, is the JWZ algorithm.
> 
> mutt concurs.

Can anyone explain why JWZ *doesn't* have the same problem?  I don't
see how this heuristic doesn't doom it to the same fate:

  The References field is populated from the ``References'' and/or
  ``In-Reply-To'' headers. If both headers exist, take the first thing
  in the In-Reply-To header that looks like a Message-ID, and append
  it to the References header.

Given this, even considering only messages 18 and 52 (which "should"
be in different threads), JWZ should find the common "parent"
e.fraga at ucl.ac.uk and link them in to the same thread:

Add 18 (step 1)
- The combined "references" list is <ID17> <e.fraga at ucl.ac.uk>
- Creates and links containers 17 <- e.fraga at ucl.ac.uk <- 18 where the
  first two are empty

Add 52 (step 1)
- The combined "references" list is <ID31> <ID32> <ID39>
  <e.fraga at ucl.ac.uk>
- Creates and links containers 31 <- 32 <- 39
- Also considers container e.fraga at ucl.ac.uk, but this is already
  linked, so it doesn't change it
- Creates container 52 and links e.fraga at ucl.ac.uk <- 52 (step 1C)

18 and 52 will later get promoted over their empty parent (step 4),
but will remain in the same thread.

What am I missing?  Or are these other MUAs not using pure JWZ?
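
The walk-through above can be replayed with a stripped-down model of JWZ
step 1. The sketch assumes that message 18 lists 17 and the email address as
its combined references, and that message 52 lists 31, 32, 39 and the email
address. It uses short labels in place of real message IDs and leaves out
pruning and subject grouping, so it only demonstrates the linking and is not a
full JWZ implementation. All names in it are invented for illustration.

```python
class Container:
    """A JWZ-style container; message stays None while it is "empty"."""

    def __init__(self, msg_id):
        self.msg_id = msg_id
        self.message = None
        self.parent = None


def is_ancestor(a, b):
    """Return True if a is b itself or one of b's ancestors."""
    while b is not None:
        if b is a:
            return True
        b = b.parent
    return False


def get_container(table, msg_id):
    if msg_id not in table:
        table[msg_id] = Container(msg_id)
    return table[msg_id]


def add_message(table, msg_id, references):
    """Simplified step 1: index the message and link its references chain."""
    this = get_container(table, msg_id)
    this.message = msg_id                       # step 1A
    prev = None
    for ref in references:                      # step 1B
        c = get_container(table, ref)
        # Only link containers that have no parent yet, and never create a loop.
        if prev is not None and c.parent is None and not is_ancestor(c, prev):
            c.parent = prev
        prev = c
    if prev is not None and not is_ancestor(this, prev):
        this.parent = prev                      # step 1C


def root_of(c):
    while c.parent is not None:
        c = c.parent
    return c


table = {}
add_message(table, "18", ["17", "e.fraga"])
add_message(table, "52", ["31", "32", "39", "e.fraga"])
print(root_of(table["18"]).msg_id, root_of(table["52"]).msg_id)  # prints: 17 17
```

In this model both messages end up under the same root, while the 31/32/39
chain forms a separate tree that message 52 is not attached to.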


[PATCH v2 0/4] doc: notmuch-show improvements

2014-04-20 Thread Mark Walters

This LGTM +1

Best wishes

Mark

On Fri, 18 Apr 2014, Austin Clements  wrote:
> This is v2 of id:1397834332-25175-1-git-send-email-amdragon at mit.edu.
> It expands the explanation of "non-MIME" message parts and moves it to
> the --part documentation.
>
> ___
> notmuch mailing list
> notmuch at notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch


[RFC PATCH] Re: excessive thread fusing

2014-04-20 Thread Mark Walters

On Sat, 19 Apr 2014, David Bremner  wrote:
> Gregor Zattler mentioned some problems with threading at 
>
>id:20120126004024.GA13704 at shi.workgroup
>
> After some off list discussions, I believe I have a smaller test case.
>
> The attached maildir contains 24 messages from the org mode list.
>
> According to notmuch, these form one thread, but I can't figure out
> exactly why. It seems like the chronologically first two messages should
> be a separate thread. There are several of the infamous malformed ME-E
> In-reply-to headers, but each of these messages also has a References
> header; this seems to indicate a case missed by commit cf8aaafbad68.

Hi 

I have done some debugging of this. There is a patch below which fixes
this test case but who knows what it breaks! Please DO NOT apply unless
someone who knows this code says it's OK.

First, the bug is quite sensitive. The attached 24 messages are numbered
and I will use the last two digits to refer to them (i.e. the 2 digits are
the ?? in 1397885606.0002??.mbox:2,). The numbers range from 17-52; 17
and 18 should be one thread and the rest a different thread.

1) If you add all messages you get one thread. 
2) If you add all apart from 52 you get 2 threads. However, then adding
52 still gives two threads.
3) If you add 18 and then 52 you get 1 thread.
4) If you add 17 and 18 then 52 you get 2 threads.

I think notmuch will use inode sort and since the tar file contains
these three files in the order 18 52 17 we get cases 1 and 2 above.

I put some debug stuff in _notmuch_database_link_message_to_parents and
I think that the problem comes from the call to parse_references on line
1767 which adds the malformed in-reply-to header to the hash table, so
this malformed line gets added as a potential parent. 
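
The effect described here can be reproduced with a toy model of the parsing
step. It assumes, as the debugging suggests, that every angle-bracketed token
in either header ends up in the set of potential parents. collect_parent_ids
is an invented stand-in and not the real parse_references(), and the header
strings are made-up examples in the shape of the malformed ones.

```python
import re

ANGLE_TOKEN = re.compile(r"<([^<>\s]+)>")


def collect_parent_ids(header, this_id, parents):
    """Add every <...> token found in header (except this_id) to parents."""
    for token in ANGLE_TOKEN.findall(header or ""):
        if token != this_id:
            parents.add(token)


def potential_parents(this_id, references, in_reply_to):
    parents = set()
    collect_parent_ids(references, this_id, parents)
    collect_parent_ids(in_reply_to, this_id, parents)
    return parents


# Made-up headers shaped like the malformed ones in this thread.
p18 = potential_parents(
    "ID18", "<ID17>",
    'Message from Eric <e.fraga@example.org> of "Sat, 19 Apr 2014" <ID17>')
p52 = potential_parents(
    "ID52", "<ID31> <ID32> <ID39>",
    'Message from Eric <e.fraga@example.org> of "Sun, 20 Apr 2014" <ID39>')

# The email address is the only "parent" the two messages share.
print(p18 & p52)
```

With well-formed In-Reply-To headers the intersection would be empty, and the
two messages would have no common potential parent.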

As a clear example that I don't understand this code I don't know why
this no longer causes a problem if message 17 gets added too.

Best wishes

Mark

---
 lib/database.cc |   21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 1efb14d..373a255 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1763,20 +1763,23 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
                                            this_message_id,
                                            parents, refs);
 
-    in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
-    in_reply_to_message_id = parse_references (message,
-                                               this_message_id,
-                                               parents, in_reply_to);
-
     /* For the parent of this message, use the last message ID of the
      * References header, if available.  If not, fall back to the
-     * first message ID in the In-Reply-To header. */
+     * first message ID in the In-Reply-To header. We only parse the
+     * In-Reply-To header if we need to as otherwise we might
+     * contaminate the hash table if it is malformed. */
     if (last_ref_message_id) {
         _notmuch_message_add_term (message, "replyto",
                                    last_ref_message_id);
-    } else if (in_reply_to_message_id) {
-        _notmuch_message_add_term (message, "replyto",
-                                   in_reply_to_message_id);
+    } else {
+        in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
+        in_reply_to_message_id = parse_references (message,
+                                                   this_message_id,
+                                                   parents, in_reply_to);
+        if (in_reply_to_message_id) {
+            _notmuch_message_add_term (message, "replyto",
+                                       in_reply_to_message_id);
+        }
     }
 
     keys = g_hash_table_get_keys (parents);
-- 
1.7.10.4






excessive thread fusing

2014-04-20 Thread Andrei POPESCU
On Sb, 19 apr 14, 18:52:02, Eric wrote:
> 
> This may not actually be any help, but both hypermail and mhonarc agree
> that two messages form a separate thread from the rest. I believe that
> the latter, at least, is the JWZ algorithm.

mutt concurs.

Kind regards,
Andrei
-- 
If you can't explain it simply, you don't understand it well enough.
(Albert Einstein)
http://nuvreauspam.ro/gpg-transition.txt


Re: [RFC PATCH] Re: excessive thread fusing

2014-04-20 Thread Carl Worth
Mark Walters <markwalters1...@gmail.com> writes:
> I have done some debugging of this.

Thanks for looking closely, Mark!

> There is a patch below which fixes this test case but who knows what
> it breaks! Please DO NOT apply unless someone who knows this code says
> it's OK.

I wrote much of the original code being patched here, so hopefully I
understand it and can say something useful.

I agree that the patch should not be applied. I don't like to see one
piece of code not trusting another in the same code base. If the
parse_references() function doesn't deal well with a malformed header,
then we should fix it, not step around it.

Meanwhile, not treating all potential referenced message IDs
consistently could definitely make the notmuch algorithm more fragile
and sensitive to the order of message indexing, etc. So let's not do
that.

Instead, let's track down and fix the actual bug.

Thanks for the idea of using two-digit names for these messages. That
makes it much easier to inspect the relevant headers.

Below, I've grepped out the actual References and In-Reply-To headers
from the messages, and then simply substituted minimal, and
easy-to-understand values for the message IDs.

With these minimally modified headers, it's easy to manually inspect the
relationships and see that messages 17 and 18 belong in one thread, and
messages 32-52 belong in a separate thread.

It's also quite easy to see the potential source of the bug. The
In-Reply-To headers for messages 18, 32, and 52 all share a common
string (an email address) formatted to look like a message-id,
"<e.fr...@ucl.ac.uk>". If notmuch looks at those headers, and treats
that string as a message-id, then all of these messages will be
connected into a single thread.

And since that's the reported behavior, it seems likely that
"<e.fr...@ucl.ac.uk>" is the cause of this bug.

> I put some debug stuff in _notmuch_database_link_message_to_parents and
> I think that the problem comes from the call to parse_references on line
> 1767 which adds the malformed in-reply-to header to the hash table, so
> this malformed line gets added as a potential parent.

Am I correct that your debugging showed that "<e.fr...@ucl.ac.uk>" is
being added to the hash table?

My inspection of _parse_references() and parse_message_id() suggests
that that's exactly what notmuch is doing: treating both of the
angle-bracketed portions ("<e.fr...@ucl.ac.uk>" as well as the actual
message-ID, "<ID17>" or "<ID31>" or "<ID39>") as message IDs.

So it seems like we need a new _parse_in_reply_to() function to use in
place of _parse_references() and the new function will need a better
heuristic for dealing with the unpredictability of In-Reply-To.

The only real reason that we are trying to grab multiple message ID
values from an In-Reply-To header is that RFC 2822 explicitly allows
that, (to support a message simultaneously replying to multiple
messages). I don't believe that that's common, but we might as well
support it. At the same time, RFC 2822 also explicitly specifies that
the In-Reply-To header will consist of nothing but message IDs.

So perhaps the heuristic here could be to notice any characters outside
of angle brackets, (like "Message" in the headers below), and in that
case go to a strictly "not RFC 2822" mode and look for exactly one
message ID. At that point, JWZ would recommend "the first <>-bracketed
text found therein", but that would give precisely the wrong answer in
this particular case. Here the correct Message ID appears in the last
<>-bracketed text. I have not surveyed a large email corpus to determine
how often "last <>-bracketed text" would fail as a heuristic.
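
One way to picture the two modes described above: treat the header as
RFC 2822 shaped only if nothing but whitespace is left once the
angle-bracketed tokens are removed, and otherwise keep just the last such
token. This is a sketch of that idea, not proposed notmuch code.
in_reply_to_ids is an invented name, the sample headers are made up, and
parenthesised comments are deliberately treated as "not RFC 2822" here to
keep the example short.

```python
import re

ANGLE_TOKEN = re.compile(r"<[^<>\s]+>")


def in_reply_to_ids(header):
    """Return every ID for an IDs-only header, else only the last <>-token."""
    tokens = ANGLE_TOKEN.findall(header)
    leftover = ANGLE_TOKEN.sub("", header).strip()
    if not leftover:
        # Nothing outside angle brackets: accept all IDs, as RFC 2822 allows.
        return [t[1:-1] for t in tokens]
    # Stray text present: fall back to exactly one ID, the last one.
    return [tokens[-1][1:-1]] if tokens else []


print(in_reply_to_ids("<ID31> <ID32>"))
print(in_reply_to_ids('Message from Eric <e.fraga@example.org> of "some date" <ID39>'))
```

The first call keeps both IDs, while the second drops the email address and
keeps only the trailing ID.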

Another idea would be to trigger specifically on common forms. Judging
from the samples in this particular thread, it seems like a workable
heuristic would be:

  If the In-Reply-To header begins with '<':

      Parse that initial portion as a message ID

  Else if it ends with '>':

      Parse that final portion as a message ID

  Else

      Ignore this garbage-valued header.

That's probably the best and most reliable thing to do here.

Does anyone have any better ideas?
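
A minimal sketch of the begins-with/ends-with heuristic, written in Python
rather than against notmuch's C code. parse_in_reply_to_sketch is a
hypothetical name, it returns at most one ID, and it assumes that a leading
'<' without a matching '>' should be treated as garbage. The sample headers
are made up.

```python
import re

LEADING_ID = re.compile(r"<([^<>\s]+)>")
TRAILING_ID = re.compile(r"<([^<>\s]+)>$")


def parse_in_reply_to_sketch(header):
    """Return one message ID chosen by the common-forms heuristic, or None."""
    value = header.strip()
    if value.startswith("<"):
        match = LEADING_ID.match(value)       # parse the initial portion
    elif value.endswith(">"):
        match = TRAILING_ID.search(value)     # parse the final portion
    else:
        match = None                          # garbage-valued header
    return match.group(1) if match else None


print(parse_in_reply_to_sketch("<ID17> (some comment)"))
print(parse_in_reply_to_sketch('Message from Eric <e.fraga@example.org> of "some date" <ID39>'))
print(parse_in_reply_to_sketch("see my earlier mail"))
```

Supporting several IDs in a well-formed header would need an extra case in
front of these three, which this sketch leaves out.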

> As a clear example that I don't understand this code I don't know why
> this no longer causes a problem if message 17 gets added too.

I wanted to test my own knowledge of the code to see if I could explain
this. But I didn't precisely follow your explanation of the behavior you
saw. In cases (1) and (2) of your description, what order are you using
to add all messages or add all apart from 52?

Then, for cases (3) and (4), what is done before adding the messages
mentioned in these cases? Add all other messages? Again, in what order?

I haven't tracked through all the logic of the existing algorithm for
this case. But I don't like hearing that notmuch constructs different
threads for the same messages presented in different orders. This sounds
like a bug separate from what we've discussed above. 

-Carl

18:References: <ID17>

[Pablo Oliveira] Bug#745303: notmuch new corrupts database throwing 'Xapian::DatabaseCorruptError'

2014-04-20 Thread David Bremner
---BeginMessage---
Package: notmuch
Version: 0.17-5+b1
Severity: normal

Dear Maintainer,

I'm using an offlineimap postsynchook to index my mail with `notmuch new`.
Twice this week, the following error was thrown:

Hook stderr:terminate called after throwing an instance of 'Xapian::DatabaseCorruptError'
/home/poliveira/bin/index-mail: line 2:  7613 Aborted    notmuch new

Afterwards, running `notmuch new` produces the following output:

notmuch new
Welcome to a new version of notmuch! Your database will now be upgraded.
Your notmuch database has now been upgraded to database format version 1.
A Xapian exception occurred adding message: No termlist for document 70352.
Error: A Xapian exception occurred. Halting processing.
Processed 1 file in almost no time.
No new mail.
Note: A fatal error was encountered: A Xapian exception occurred

Yet notmuch was not recently updated on my system.
All further notmuch commands fail with:
notmuch search tag:spam
terminate called after throwing an instance of 'Xapian::DatabaseCorruptError'
Aborted

To restore a working system I must dump notmuch tags, reindex all my
mail, and restore the tags. (Which is pretty long, since I keep
a large amount of mails).

Apart from notmuch, two other clients access the notmuch database:
* emacs-notmuch
* afew (9744c18c)

Thanks,

Pablo

*** Reporter, please consider answering these questions, where appropriate ***

   * What led up to the situation?
   * What exactly did you do (or not do) that was effective (or
 ineffective)?
   * What was the outcome of this action?
   * What outcome did you expect instead?

*** End of the template - remove these template lines ***


-- System Information:
Debian Release: jessie/sid
  APT prefers testing-updates
  APT policy: (500, 'testing-updates'), (500, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 3.13-1-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages notmuch depends on:
ii  libc6           2.18-4
ii  libglib2.0-0    2.40.0-2
ii  libgmime-2.6-0  2.6.19-3
ii  libnotmuch3     0.17-5+b1
ii  libtalloc2      2.1.0-1

Versions of packages notmuch recommends:
ii  gnupg-agent    2.0.22-3
ii  notmuch-emacs  0.17-5

notmuch suggests no packages.

-- no debconf information

---End Message---


Re: [RFC PATCH] Re: excessive thread fusing

2014-04-20 Thread David Bremner
Carl Worth cwo...@cworth.org writes:

 Another idea would be to trigger specifically on common forms. Judging
 From the samples in this particular thread, it seems like a workable
 heuristic would be:

   If the In-Reply-To header begins with '':

   Parse that initial portion as a message ID

   Else if it ends with '':

   Parse that final portion as a message ID

   Else

   Ignore this garbage-valued header.


using the hacky script below, I scanned my own mail collection of about
300k messages. I can make the following observations

- I have some RFC compliant in-reply-to's with multiple ids
- I have a non-trivial number of "Message from $NAME <address> of $date <id>"
- I didn't see any cases where using the last angle bracketed thing
  would fail.
- I did see some cases where the header starts with '<' but the
  matching '>' was missing
- I also noticed some rfc2047 encoding of in-reply-to headers.
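
For reference, here is a minimal sketch of the quoted heuristic, with the
observations above in mind. It is written in Python for brevity rather than
in notmuch's C, and the name parse_in_reply_to and the regular expressions
are assumptions made for illustration, not existing notmuch code. A header
that starts with '<' but lacks the matching '>' falls through to "ignore".
rfc2047-encoded headers are not decoded here.

```python
import re
from typing import Optional

# Hypothetical patterns: one angle-bracketed token without whitespace.
_LEADING_ID = re.compile(r"<([^<>\s]+)>")
_TRAILING_ID = re.compile(r"<([^<>\s]+)>\Z")


def parse_in_reply_to(value: str) -> Optional[str]:
    """Return a single message ID from an In-Reply-To value, or None."""
    value = value.strip()
    if value.startswith("<"):
        # Parse the initial portion as a message ID. With several
        # RFC-compliant IDs this picks the first one.
        m = _LEADING_ID.match(value)
        return m.group(1) if m else None
    if value.endswith(">"):
        # Parse the final portion as a message ID.
        m = _TRAILING_ID.search(value)
        return m.group(1) if m else None
    # Garbage-valued header: ignore it.
    return None


# Made-up header values, only to exercise each branch.
print(parse_in_reply_to("<a@example.org> <b@example.org>"))
print(parse_in_reply_to(
    "Message from Eve <eve@example.org> of Sat, 19 Apr 2014 <id17@example.org>"))
print(parse_in_reply_to("<broken@example.org"))
print(parse_in_reply_to("your message of yesterday"))
```

Whether the first or the last ID should win when a compliant header carries
several IDs is a separate question that this sketch does not settle.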


##
# hacky script follows
dir=$1
echo Scanning $dir

tempdir=$(mktemp -d)
echo Writing to ${tempdir}

find $dir -exec sh -c "formail -c -xIn-reply-to < {}" \; \
  > ${tempdir}/ids

sed  -e 's/\t/ /' -e 's/   */ /g' -e 's/<[^ ]*>/<id>/g' -e 's/(.*)/(comment)/' \
  < ${tempdir}/ids | sort | uniq | tee ${tempdir}/report
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: excessive thread fusing

2014-04-20 Thread Austin Clements
Quoth Andrei POPESCU on Apr 20 at 12:04 am:
 On Sb, 19 apr 14, 18:52:02, Eric wrote:
  
  This may not actually be any help, but both hypermail and mhonarc agree
  that two messages form a separate thread from the rest. I believe that
  the latter, at least, is the JWZ algorithm.
 
 mutt concurs.

Can anyone explain why JWZ *doesn't* have the same problem?  I don't
see how this heuristic doesn't doom it to the same fate:

  The References field is populated from the ``References'' and/or
  ``In-Reply-To'' headers. If both headers exist, take the first thing
  in the In-Reply-To header that looks like a Message-ID, and append
  it to the References header.

Given this, even considering only messages 18 and 52 (which should
be in different threads), JWZ should find the common parent
e.fr...@ucl.ac.uk and link them in to the same thread:

Add 18 (step 1)
- The combined references list is ID17 e.fr...@ucl.ac.uk
- Creates and links containers 17 -> e.fr...@ucl.ac.uk -> 18 where the
  first two are empty

Add 52 (step 1)
- The combined references list is ID31 ID32 ID39
  e.fr...@ucl.ac.uk
- Creates and links containers 31 -> 32 -> 39
- Also considers container e.fr...@ucl.ac.uk, but this is already
  linked, so it doesn't change it
- Creates container 52 and links e.fr...@ucl.ac.uk -> 52 (step 1C)

18 and 52 will later get promoted over their empty parent (step 4),
but will remain in the same thread.
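
The walk-through above can be replayed with a small sketch of JWZ step 1.
This is a simplified reading of the published algorithm, not code from any
MUA: the Container class and the helper names are invented for illustration,
only the linking rules of steps 1A to 1C are modelled, and the truncated
address is used as an opaque string exactly as it appears above.

```python
class Container:
    """A JWZ-style container; it is empty until its message is seen."""

    def __init__(self, msg_id):
        self.msg_id = msg_id
        self.has_message = False
        self.parent = None


def has_ancestor(node, candidate):
    """True if candidate is node itself or one of its ancestors."""
    while node is not None:
        if node is candidate:
            return True
        node = node.parent
    return False


def add_message(table, msg_id, references):
    """Link one message given its combined references list."""
    prev = None
    for ref in references:
        cur = table.setdefault(ref, Container(ref))
        # Step 1B: link in reference order, but never change an existing
        # parent link and never introduce a loop.
        if prev is not None and cur.parent is None \
                and not has_ancestor(prev, cur):
            cur.parent = prev
        prev = cur
    me = table.setdefault(msg_id, Container(msg_id))
    me.has_message = True
    # Step 1C: the last reference becomes the parent of this message.
    if prev is not None and not has_ancestor(prev, me):
        me.parent = prev


def root(node):
    while node.parent is not None:
        node = node.parent
    return node


table = {}
add_message(table, "18", ["ID17", "e.fr...@ucl.ac.uk"])
add_message(table, "52", ["ID31", "ID32", "ID39", "e.fr...@ucl.ac.uk"])

# Both messages hang off the same container, hence the same thread.
print(table["18"].parent.msg_id, table["52"].parent.msg_id)
print(root(table["18"]) is root(table["52"]))
```

Under these rules 18 and 52 share a root, which is the fusing described
above.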

What am I missing?  Or are these other MUAs not using pure JWZ?


Re: excessive thread fusing

2014-04-20 Thread Austin Clements
Quoth myself on Apr 20 at 12:48 pm:
 Quoth Andrei POPESCU on Apr 20 at 12:04 am:
  On Sb, 19 apr 14, 18:52:02, Eric wrote:
   
   This may not actually be any help, but both hypermail and mhonarc agree
   that two messages form a separate thread from the rest. I believe that
   the latter, at least, is the JWZ algorithm.
  
  mutt concurs.
 
 Can anyone explain why JWZ *doesn't* have the same problem?  I don't
 see how this heuristic doesn't doom it to the same fate:
 
   The References field is populated from the ``References'' and/or
   ``In-Reply-To'' headers. If both headers exist, take the first thing
   in the In-Reply-To header that looks like a Message-ID, and append
   it to the References header.
 
 Given this, even considering only messages 18 and 52 (which should
 be in different threads), JWZ should find the common parent
 e.fr...@ucl.ac.uk and link them in to the same thread:
 
 Add 18 (step 1)
 - The combined references list is ID17 e.fr...@ucl.ac.uk
 - Creates and links containers 17 -> e.fr...@ucl.ac.uk -> 18 where the
   first two are empty
 
 Add 52 (step 1)
 - The combined references list is ID31 ID32 ID39
   e.fr...@ucl.ac.uk
 - Creates and links containers 31 -> 32 -> 39
 - Also considers container e.fr...@ucl.ac.uk, but this is already
   linked, so it doesn't change it
 - Creates container 52 and links e.fr...@ucl.ac.uk -> 52 (step 1C)
 
 18 and 52 will later get promoted over their empty parent (step 4),
 but will remain in the same thread.
 
 What am I missing?  Or are these other MUAs not using pure JWZ?

I dug in to mutt's mutt_sort_threads a bit.  It's not using JWZ,
though it's something similar.  The most salient thing may be how it
handles in-reply-to and references:

1. If a message has both in-reply-to and references, the parent chain
   is the *last* in-reply-to ID and then the references from right to
   left (skipping the last reference ID if it's the same as the last
   in-reply-to ID).  (See also mutt_parse_references.)
2. If a message has only in-reply-to, the parent chain is *all* of the
   IDs in in-reply-to *from right to left* (e.g., the right-most one
   is the immediate parent).
3. If a message has only references, the parent chain is that, from
   right to left.
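
As a rough illustration of the three rules above, here is a sketch in
Python. It is not mutt's code: parent_chain and the regular expression are
made-up names, and the header values for message 52 are an assumption
reconstructed from the earlier walk-through (the real headers are not
quoted in this thread). The chain is returned with the immediate parent
first.

```python
import re

# Hypothetical pattern: one angle-bracketed token without whitespace.
_MSGID = re.compile(r"<([^<>\s]+)>")


def extract_ids(header):
    """All angle-bracketed IDs in a header value, left to right."""
    return _MSGID.findall(header or "")


def parent_chain(in_reply_to=None, references=None):
    """Parent chain per the three rules, immediate parent first."""
    irt = extract_ids(in_reply_to)
    refs = extract_ids(references)
    if irt and refs:
        # Rule 1: last In-Reply-To ID, then References right to left,
        # skipping the last reference if it repeats that ID.
        tail = refs[::-1]
        if tail[0] == irt[-1]:
            tail = tail[1:]
        return [irt[-1]] + tail
    if irt:
        # Rule 2: every In-Reply-To ID, right to left.
        return irt[::-1]
    # Rule 3: References only, right to left (empty if neither exists).
    return refs[::-1]


# Assumed shape of message 52's headers, for illustration only.
in_reply_to_52 = "Message from someone <e.fr...@ucl.ac.uk> of some date <ID39>"
references_52 = "<ID31> <ID32> <ID39>"
print(parent_chain(in_reply_to_52, references_52))
```

With headers of that shape the address in In-Reply-To never enters the
chain, because only the last In-Reply-To ID is used when References is
present.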

Like JWZ, mutt creates and links together empty containers as it
scans the parent chain towards the root, though unlike JWZ it stops
when it finds a non-empty container or a container that already has a
parent.


Re: [PATCH] doc: make notmuch-new summary line more generic

2014-04-20 Thread David Bremner
David Bremner <da...@tethera.net> writes:

 Since 'notmuch new' now takes multiple options, it's confusing to show
 only one of them in the summary.

pushed,

d


Re: [PATCH 0/7] doc: Python 3 compat, rst2man.py support, etc.

2014-04-20 Thread David Bremner
Tomi Ollila <tomi.oll...@iki.fi> writes:

 In this series IMO the patches 1-4:

 id:8d518408f2da8bc96ae3123f05791142da26b9bc.1396718720.git.wk...@tremily.us
 id:543aee63407956e60f85dc11a2d25855e98c10c3.1396718720.git.wk...@tremily.us
 id:5e4509ab08699afe2681110fb35075e1d0bbdc7e.1396718720.git.wk...@tremily.us
 id:c5ec510ac25c867ad600c475a0070a003440a4b8.1396718720.git.wk...@tremily.us

 could go in as those are. 5:

 id:adce76bb9a0ca728d856da4ecaf6b282e22e7440.1396718720.git.wk...@tremily.us

 if, for consistency reasons (we don't use absolute paths with other commands
 either), rst2man/rst2man.py is used as is (and commit message adjusted
 accordingly).

I've queued 1-4 for merging. Any patches that might break the build
(e.g. 5 and 6 in this series) have to go in pretty quick if they are to
be in 0.18; patch 7 we can sort out during the freeze.

I'm not sure I completely understand the state of the discussion around
patch 5. Personally I don't like either undefined or empty RST2MAN as a
boolean a priori. I'd rather keep HAVE_RST2MAN for consistency.

d