[PATCH] Next attempt to get guessing of From addresses correct in replies

2010-04-14 Thread Dirk Hohndel
On Wed, 14 Apr 2010 10:21:42 -0700, Carl Worth  wrote:
> On Fri, 09 Apr 2010 15:53:04 -0700, Dirk Hohndel  
> wrote:
> > + * WARNING - if the caller is asking for a header that could occur
> > + * multiple times than they MUST first call this function with a 
> > + * a value of NULL for header_desired to ensure that all of the
> > + * headers are parsed and concatenated before a match is returned
> ...
> > +   } else {
> > +   /* not sure this makes sense for all headers that can occur
> > +* multiple times, but for now just concatenate headers
> > +*/
> > +   newhdr = strlen(decoded_value);
> > +   hdrsofar = strlen(header_sofar);
> 
> I'm a little nervous about this semantic change.

So am I - but I haven't found a message where this would have bitten me.

> For example, I know that my mail collection contains at least some
> messages with multiple Message-ID headers, (I'm not sure that's legal,
> but they are there).

No, that is absolutely NOT RFC compliant. Wonder what creates those
messages...

> I found those when doing detailed comparisons of
> the database created by sup with that created by very early versions of
> what became the indexing code for notmuch. [Sup prefers the
> last-encountered Message-Id in the file, while Notmuch prefers the
> first.]

Actually, prior to another fix that I sent (and that you already
applied), notmuch would use the first if you came across this header for
the first time when searching for it (but since you'd search for
Message-Id fairly early on, that's likely what happened). But if your
header was remembered "en-passant" while searching for a different
header later in the file, notmuch would actually remember the last.

But again, I fixed that before making the change to concatenate
duplicates instead.

> So I'm concerned about the change above introducing subtle problems that
> might be hard to notice.

Yes, absolutely - a concatenated Message-Id would SUCK.

> How about an argument that asks explicitly for concatenated header
> values, (and this could just trigger a rescan of the headers and ignore
> the hash). I think that will be fine for your use case where you're just
> opening this message file to get this one concatenated header out,
> right?

That would work just fine. And avoid the potential unintended side
effects. 

I'm about to head for the airport - do you want to make that
modification yourself or should I do it tonight?

/D

-- 
Dirk Hohndel
Intel Open Source Technology Center


[PATCH] Next attempt to get guessing of From addresses correct in replies

2010-04-14 Thread Carl Worth
On Fri, 09 Apr 2010 15:53:04 -0700, Dirk Hohndel  
wrote:
> + * WARNING - if the caller is asking for a header that could occur
> + * multiple times than they MUST first call this function with a 
> + * a value of NULL for header_desired to ensure that all of the
> + * headers are parsed and concatenated before a match is returned
...
> + } else {
> + /* not sure this makes sense for all headers that can occur
> +  * multiple times, but for now just concatenate headers
> +  */
> + newhdr = strlen(decoded_value);
> + hdrsofar = strlen(header_sofar);

I'm a little nervous about this semantic change.

For example, I know that my mail collection contains at least some
messages with multiple Message-ID headers, (I'm not sure that's legal,
but they are there). I found those when doing detailed comparisons of
the database created by sup with that created by very early versions of
what became the indexing code for notmuch. [Sup prefers the
last-encountered Message-Id in the file, while Notmuch prefers the
first.]

So I'm concerned about the change above introducing subtle problems that
might be hard to notice.

How about an argument that asks explicitly for concatenated header
values, (and this could just trigger a rescan of the headers and ignore
the hash). I think that will be fine for your use case where you're just
opening this message file to get this one concatenated header out,
right?

-Carl
-- next part --
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 



Re: [PATCH] Next attempt to get guessing of From addresses correct in replies

2010-04-14 Thread Carl Worth
On Fri, 09 Apr 2010 15:53:04 -0700, Dirk Hohndel hohn...@infradead.org wrote:
 + * WARNING - if the caller is asking for a header that could occur
 + * multiple times than they MUST first call this function with a 
 + * a value of NULL for header_desired to ensure that all of the
 + * headers are parsed and concatenated before a match is returned
...
 + } else {
 + /* not sure this makes sense for all headers that can occur
 +  * multiple times, but for now just concatenate headers
 +  */
 + newhdr = strlen(decoded_value);
 + hdrsofar = strlen(header_sofar);

I'm a little nervous about this semantic change.

For example, I know that my mail collection contains at least some
messages with multiple Message-ID headers, (I'm not sure that's legal,
but they are there). I found those when doing detailed comparisons of
the database created by sup with that created by very early versions of
what became the indexing code for notmuch. [Sup prefers the
last-encountered Message-Id in the file, while Notmuch prefers the
first.]

So I'm concerned about the change above introducing subtle problems that
might be hard to notice.

How about an argument that asks explicitly for concatenated header
values, (and this could just trigger a rescan of the headers and ignore
the hash). I think that will be fine for your use case where you're just
opening this message file to get this one concatenated header out,
right?

-Carl


pgpaaaFmNK3bs.pgp
Description: PGP signature
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [PATCH] Next attempt to get guessing of From addresses correct in replies

2010-04-14 Thread Dirk Hohndel
On Wed, 14 Apr 2010 10:21:42 -0700, Carl Worth cwo...@cworth.org wrote:
 On Fri, 09 Apr 2010 15:53:04 -0700, Dirk Hohndel hohn...@infradead.org 
 wrote:
  + * WARNING - if the caller is asking for a header that could occur
  + * multiple times than they MUST first call this function with a 
  + * a value of NULL for header_desired to ensure that all of the
  + * headers are parsed and concatenated before a match is returned
 ...
  +   } else {
  +   /* not sure this makes sense for all headers that can occur
  +* multiple times, but for now just concatenate headers
  +*/
  +   newhdr = strlen(decoded_value);
  +   hdrsofar = strlen(header_sofar);
 
 I'm a little nervous about this semantic change.

So am I - but I haven't found a message where this would have bitten me.
 
 For example, I know that my mail collection contains at least some
 messages with multiple Message-ID headers, (I'm not sure that's legal,
 but they are there).

No, that is absolutely NOT RFC compliant. Wonder what creates those
messages...

 I found those when doing detailed comparisons of
 the database created by sup with that created by very early versions of
 what became the indexing code for notmuch. [Sup prefers the
 last-encountered Message-Id in the file, while Notmuch prefers the
 first.]

Actually, prior to another fix that I sent (and that you already
applied), notmuch would use the first if you came across this header for
the first time when searching for it (but since you'd search for
Message-Id fairly early on, that's likely what happened). But if your
header was remembered en-passant while searching for a different
header later in the file, notmuch would actually remember the last.

But again, I fixed that before making the change to concatenate
duplicates instead.
 
 So I'm concerned about the change above introducing subtle problems that
 might be hard to notice.

Yes, absolutely - a concatenated Message-Id would SUCK.
 
 How about an argument that asks explicitly for concatenated header
 values, (and this could just trigger a rescan of the headers and ignore
 the hash). I think that will be fine for your use case where you're just
 opening this message file to get this one concatenated header out,
 right?

That would work just fine. And avoid the potential unintended side
effects. 

I'm about to head for the airport - do you want to make that
modification yourself or should I do it tonight?

/D

-- 
Dirk Hohndel
Intel Open Source Technology Center
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


[PATCH] Next attempt to get guessing of From addresses correct in replies

2010-04-09 Thread Dirk Hohndel

We now go through the following searches in order:
0) one of the users addresses was in a To: or Cc: header
1) check for an Envelope-to: header that matches one of the addresses
2) check for an X-Original-To: header that matches one of the addresses
3) check for a (for ) clause in Received: headers
4) check for the domain part of known email addresses in the
'by' part of Received headers
Finally, if none of these work, we give up and use the primary email address.

Signed-off-by: Dirk Hohndel 
---
 lib/message-file.c |   35 ++-
 notmuch-reply.c|  128 +++
 2 files changed, 122 insertions(+), 41 deletions(-)

diff --git a/lib/message-file.c b/lib/message-file.c
index 0c152a3..0114761 100644
--- a/lib/message-file.c
+++ b/lib/message-file.c
@@ -209,15 +209,21 @@ copy_header_unfolding (header_value_closure_t *value,

 /* As a special-case, a value of NULL for header_desired will force
  * the entire header to be parsed if it is not parsed already. This is
- * used by the _notmuch_message_file_get_headers_end function. */
+ * used by the _notmuch_message_file_get_headers_end function. 
+ * 
+ * WARNING - if the caller is asking for a header that could occur
+ * multiple times than they MUST first call this function with a 
+ * a value of NULL for header_desired to ensure that all of the
+ * headers are parsed and concatenated before a match is returned
+ */
 const char *
 notmuch_message_file_get_header (notmuch_message_file_t *message,
 const char *header_desired)
 {
 int contains;
-char *header, *decoded_value;
+char *header, *decoded_value, *header_sofar, *combined_header;
 const char *s, *colon;
-int match;
+int match,newhdr,hdrsofar;
 static int initialized = 0;

 if (! initialized) {
@@ -312,20 +318,27 @@ notmuch_message_file_get_header (notmuch_message_file_t 
*message,

NEXT_HEADER_LINE (>value);

-   if (header_desired == 0)
+   if (header_desired == NULL)
match = 0;
else
match = (strcasecmp (header, header_desired) == 0);

decoded_value = g_mime_utils_header_decode_text (message->value.str);
-   if (g_hash_table_lookup (message->headers, header) == NULL) {
-   /* Only insert if we don't have a value for this header, yet.
-* This way we always return the FIRST instance of any header
-* we search for
-* FIXME: we should be returning ALL instances of a header
-*or at least provide a way to iterate over them
-*/
+   header_sofar = (char *)g_hash_table_lookup (message->headers, header);
+   if (header_sofar == NULL) {
+   /* Only insert if we don't have a value for this header, yet. */
g_hash_table_insert (message->headers, header, decoded_value);
+   } else {
+   /* not sure this makes sense for all headers that can occur
+* multiple times, but for now just concatenate headers
+*/
+   newhdr = strlen(decoded_value);
+   hdrsofar = strlen(header_sofar);
+   combined_header = xmalloc(hdrsofar + newhdr + 2);
+   strncpy(combined_header,header_sofar,hdrsofar);
+   *(combined_header+hdrsofar) = ' ';
+   strncpy(combined_header+hdrsofar+1,decoded_value,newhdr+1);
+   g_hash_table_insert (message->headers, header, combined_header);
}
if (match)
return decoded_value;
diff --git a/notmuch-reply.c b/notmuch-reply.c
index 39377e1..91a2094 100644
--- a/notmuch-reply.c
+++ b/notmuch-reply.c
@@ -285,33 +285,100 @@ add_recipients_from_message (GMimeMessage *reply,
 static const char *
 guess_from_received_header (notmuch_config_t *config, notmuch_message_t 
*message)
 {
-const char *received,*primary;
-char **other;
-char *by,*mta,*ptr,*token;
+const char *received,*primary,*by;
+char **other,*tohdr;
+char *mta,*ptr,*token;
 char *domain=NULL;
 char *tld=NULL;
 const char *delim=". \t";
 size_t i,other_len;

+const char *to_headers[] = {"Envelope-to", "X-Original-To"};
+
+primary = notmuch_config_get_user_primary_email (config);
+other = notmuch_config_get_user_other_email (config, _len);
+
+/* sadly, there is no standard way to find out to which email
+ * address a mail was delivered - what is in the headers depends
+ * on the MTAs used along the way. So we are trying a number of
+ * heuristics which hopefully will answer this question.
+
+ * We only got here if none of the users email addresses are in
+ * the To: or Cc: header. From here we try the following in order:
+ * 1) check for an Envelope-to: header
+ * 2) check for an X-Original-To: header
+ * 3) check for a (for ) clause in Received: headers
+ * 4) check for the domain part of known email addresses in the 
+ *'by' part of Received headers
+ * If