[PATCH 1/1] Store and search for canonical Unicode text [WIP]

2015-08-30 Thread Rob Browning
WARNING: this version is very preliminary, and might eat your data.

Unicode has multiple sequences representing what should normally be
considered the same text.  For example here's a combining Á and a
noncombining Á.

Depending on the way you view this, you may or may not see a
difference, but the former is the canonical form, and is represented
by two Unicode code points: a capital A (U+0041) followed by a
"combining acute accent" (U+0301); the latter is the single code
point (U+00C1), which is probably what most people would type.

Before this change, notmuch would index two strings that differ only
with respect to canonicalization, like tóken and tóken, as separate
terms, even though they may be visually indistinguishable, and do (for
most purposes) represent the same text.  After indexing, searching for
one would not find the other, and which one you present to notmuch
when you search depends on your tools.  See test/T570-normalization.sh
for a working example.

Since we're talking about differing representations that one wouldn't
normally want to distinguish, this patch unifies the various
representations by converting all incoming text to its canonical form
before indexing, and canonicalizing all query strings.

Up to now, notmuch has let Xapian handle converting the incoming bytes
to UTF-8.  Xapian treats any byte sequence as UTF-8, and interprets
any invalid UTF-8 bytes as Latin-1.  This patch maintains the existing
behavior (excepting the new canonicalization) by using Xapian's
Utf8Iterator to handle the initial Unicode character parsing.

Note that the parsing approach in this patch is not particularly
efficient, both because it traverses the incoming bytes three times:

   - once to determine how long the input is (currently the iterator
 can't directly handle null terminated char*'s),

   - once to determine how long the final UTF-8 allocation needs to
 be,

   - and once for the conversion.

And because when the input is already UTF-8, it just blindly converts
from UTF-8 to Unicode code points, and then back to UTF-8 (after
canonicalization), during each pass.  There are certainly
opportunities to optimize, though it may be worth discussing the
detection of data encodings more broadly first.

FIXME: document current encoding behavior clearly in
new/insert/search-terms.

FIXME: what about existing indexed text?
---

 Posted for preliminary discussion, and as a milestone (it appears to
 mostly work now).  Though I doubt I'm handling things correctly
 everywhere notmuch-wise, wrt talloc, etc.

 lib/Makefile.local |  1 +
 lib/database.cc| 17 --
 lib/message.cc | 51 +++-
 lib/notmuch.h  |  3 ++
 lib/query.cc   |  6 ++--
 lib/text-util.cc   | 82 ++
 test/Makefile.local| 10 --
 test/T150-tagging.sh   | 54 +++---
 test/T240-dump-restore.sh  |  4 +--
 test/T480-hex-escaping.sh  |  4 +--
 test/T570-normalization.sh | 28 
 test/corpus/cur/52:2,  |  6 ++--
 test/to-utf8.c | 44 +
 13 files changed, 267 insertions(+), 43 deletions(-)
 create mode 100644 lib/text-util.cc
 create mode 100755 test/T570-normalization.sh
 create mode 100644 test/to-utf8.c

diff --git a/lib/Makefile.local b/lib/Makefile.local
index 3a07090..41fd1e1 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -48,6 +48,7 @@ libnotmuch_cxx_srcs = \
$(dir)/index.cc \
$(dir)/message.cc   \
$(dir)/query.cc \
+   $(dir)/text-util.cc \
$(dir)/thread.cc

 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
diff --git a/lib/database.cc b/lib/database.cc
index 6a15174..7a01f95 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -436,6 +436,7 @@ find_document_for_doc_id (notmuch_database_t *notmuch, 
unsigned doc_id)
 char *
 _notmuch_message_id_compressed (void *ctx, const char *message_id)
 {
+// Assumes message_id is normalized utf-8.
 char *sha1, *compressed;

 sha1 = _notmuch_sha1_of_string (message_id);
@@ -457,12 +458,20 @@ notmuch_database_find_message (notmuch_database_t 
*notmuch,
 if (message_ret == NULL)
return NOTMUCH_STATUS_NULL_POINTER;

-if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
-   message_id = _notmuch_message_id_compressed (notmuch, message_id);
+const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);
+
+// Is strlen still appropriate?
+if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX)
+{
+   message_id = _notmuch_message_id_compressed (notmuch, u8_id);
+   talloc_free ((char *) u8_id);
+} else
+   message_id = u8_id;

 try {
status = _notmuch_database_find_unique_doc_id (notmuch, "id",
   message_id, _id);
+   talloc_free ((char *) 

Re: [RFC PATCH 4/5] cli: change the data structure for notmuch address deduplication

2015-08-30 Thread Jani Nikula
On Sun, 30 Aug 2015, David Bremner da...@tethera.net wrote:
 Jani Nikula j...@nikula.org writes:

  
 +static int
 +strcase_equal (const void *a, const void *b)
 +{
 +return strcasecmp (a, b) == 0;
 +}
 +
 +static unsigned int
 +strcase_hash (const void *ptr)
 +{
 +const char *s = ptr;
 +
 +/* This is the djb2 hash. */
 +unsigned int hash = 5381;
 +while (s  *s) {
 +hash = ((hash  5) + hash) + tolower (*s);
 +s++;
 +}
 +
 +return hash;
 +}
 +

 as discussed, these functions probably need to be factored out into
 libutil.

Ack.


 +l = g_list_find_custom (list, mailbox, mailbox_compare);
 +if (l) {
 +talloc_free (mailbox);
 +mailbox = l-data;
 +mailbox-count++;
 +return TRUE;
 +}

 I found this use of mailbox as a temporary variable confusing; despite
 the obvious return I thought it might have something to do with the
 g_list_append below.  Maybe just make a block scope temporary variable?

This is how the function would turn out with that. Better, I guess? I
also tried to think of ways to combine the two g_list_append paths here,
but in the end doing it like this has most clarity I think.

static notmuch_bool_t
is_duplicate (const search_context_t *ctx, const char *name, const char *addr)
{
char *key;
GList *list, *l;
mailbox_t *mailbox;

list = g_hash_table_lookup (ctx-addresses, addr);
if (list) {
mailbox_t find = {
.name = name,
.addr = addr,
};

l = g_list_find_custom (list, find, mailbox_compare);
if (l) {
mailbox = l-data;
mailbox-count++;
return TRUE;
}

mailbox = new_mailbox (ctx-format, name, addr);
if (! mailbox)
return FALSE;

g_list_append (list, mailbox);
return FALSE;
}

key = talloc_strdup (ctx-format, addr);
if (! key)
return FALSE;

mailbox = new_mailbox (ctx-format, name, addr);
if (! mailbox)
return FALSE;

list = g_list_append (NULL, mailbox);
if (! list)
return FALSE;

g_hash_table_insert (ctx-addresses, key, list);

return FALSE;
}

 +
 +g_list_append (list, mailbox);
 +return FALSE;
  }
  
 -mailbox = new_mailbox (ctx-format, name, addr);
 -if (! mailbox)
 +key = talloc_strdup (ctx-format, addr);
 +if (! key)
  return FALSE;

 I guess this doesn't make the error handling worse; both old and new
 code silently ignore OOM if I understand correctly. Do you happen to
 understand the original choice of using ctx-format rather than that
 ctx-notmuch for a talloc parent? it doesn't seem to get deallocated any
 earlier.

I don't know or understand that part of the history. It doesn't really
matter though because the deallocation is explicitly done on all keys
and values via g_hash_table_unref.

BR,
Jani.
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC PATCH 4/5] cli: change the data structure for notmuch address deduplication

2015-08-30 Thread Jani Nikula
On Sun, 30 Aug 2015, David Bremner da...@tethera.net wrote:
 I guess this doesn't make the error handling worse; both old and new
 code silently ignore OOM if I understand correctly.

Oh, and current git will not silently ignore OOM. It will segfault... ;)

BR,
Jani.
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC PATCH 4/5] cli: change the data structure for notmuch address deduplication

2015-08-30 Thread David Bremner
Jani Nikula j...@nikula.org writes:

 I found this use of mailbox as a temporary variable confusing; despite
 the obvious return I thought it might have something to do with the
 g_list_append below.  Maybe just make a block scope temporary variable?

 This is how the function would turn out with that. Better, I guess? I
 also tried to think of ways to combine the two g_list_append paths here,
 but in the end doing it like this has most clarity I think.


Your (new) version is fine for me. As it happens I was thinking of a
smaller tweak:


l = g_list_find_custom (list, mailbox, mailbox_compare);
if (l) {
mailbox_t *found;

talloc_free (mailbox);

found = l-data;
found-count++;

return TRUE;
}

___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC PATCH 3/5] cli: add support for not deduplicating notmuch address results

2015-08-30 Thread Jani Nikula
On Sun, 30 Aug 2015, David Bremner da...@tethera.net wrote:
 Jani Nikula j...@nikula.org writes:


 +{ NOTMUCH_OPT_KEYWORD, ctx-dupe, deduplicate, 'x',

 probably you want 'D' or 'd' here. Not that it makes a practical
 difference at this point.

 +  (notmuch_keyword_t []){ { yes, -1 },

 I'm not very enthusiastic about reusing ctx-dupe for this.
 In particular the use of -1 for yes is off-putting
 It seems better to allocate another word of memory and use an
 enum, as elsewhere.

Agreed on both.

BR,
Jani.

___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC PATCH 5/5] cli: add support for deduplicating based on case insensitive address

2015-08-30 Thread David Bremner
Jani Nikula j...@nikula.org writes:

 Consider all variants of an email address as one, and print the most
 common variant.

Other than the quibbles already mentioned, the series looks ok to
me. For production it should have one or two tests I guess. Oh, and man
page updates. But you knew that I guess.

I can live with the current argument syntax, but since a certain a mount
of bikeshedding is expected, here is an alternative suggestion

--deduplication={none|mailbox|address}
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [PATCH 2/2] emacs: add function to resend message to new recipients

2015-08-30 Thread David Bremner
Tomi Ollila tomi.oll...@iki.fi writes:

 The new function notmuch-show-message-resend re-sends
 message to new recipients using #'message-resend.

 Recipients are read from minibuffer as a comma-separated
 string (with some keyboard support including tab completion).


I couldn't get the tab completion to work, at least if I go

  M-x notmuch-show-resend-message

nor when evaluating (notmuch-address-from-minibuffer foo:)

Do I need to bind a key to test this?

 I remember that Emacs VM might have had 'b' bound to this functionality
 but I cannot be sure. A few weeks ago I looked gnus, rmail  mh-e to
 figure out whether 'b' would have been bound to similar functionality
 there but I cannot find it...

mutt uses 'b'. AFAICT, gnus uses some sequence ending in b to resend
bounced messages (i.e. from mailer-daemon).

quoting the manual:

S D b

If you have sent a mail, but the mail was bounced back to you for
some reason (wrong address, transient failure), you can use this
command to resend that bounced mail
(gnus-summary-resend-bounced-mail). [...]

S D r

Not to be confused with the previous command,
gnus-summary-resend-message will prompt you for an address to send
the current message off to, and then send it to that place. [...]

Be that as it may, we already use 'r' for reply, and I'm not sure we
want to go (more) in the way of multi-letter sequences.

___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [PATCH 1/2] emacs: add defsubst notmuch-address--message-insinuated

2015-08-30 Thread David Bremner
Tomi Ollila tomi.oll...@iki.fi writes:

 +(defsubst notmuch-address--message-insinuated ()
 +  (memq notmuch-address-message-alist-member message-completion-alist))
 +

Is there some advantage to defsubst other than (maybe?) performance?  It
just seems like one more construct for people to get up to speed with
the codebase.

d
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [PATCH 1/2] emacs: add defsubst notmuch-address--message-insinuated

2015-08-30 Thread Tomi Ollila
On Sun, Aug 30 2015, David Bremner da...@tethera.net wrote:

 Tomi Ollila tomi.oll...@iki.fi writes:

 +(defsubst notmuch-address--message-insinuated ()
 +  (memq notmuch-address-message-alist-member message-completion-alist))
 +

 Is there some advantage to defsubst other than (maybe?) performance?  It
 just seems like one more construct for people to get up to speed with
 the codebase.

No advantage. The reason for defsubst may originally be that I did it for
least execution chance -- and iirc I was reading some (gnus?) code that
had quite a few defsubsts which infected my coding. Now that I checked
there is no (other) defsubsts code in notmuch-emacs... so that can be
just be changed to `defun' to be consistent in that sense.


 d

Tomi
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [PATCH 2/2] emacs: add function to resend message to new recipients

2015-08-30 Thread Tomi Ollila
On Sun, Aug 30 2015, David Bremner da...@tethera.net wrote:

 Tomi Ollila tomi.oll...@iki.fi writes:

 The new function notmuch-show-message-resend re-sends
 message to new recipients using #'message-resend.

 Recipients are read from minibuffer as a comma-separated
 string (with some keyboard support including tab completion).


 I couldn't get the tab completion to work, at least if I go

   M-x notmuch-show-resend-message

 nor when evaluating (notmuch-address-from-minibuffer foo:)

 Do I need to bind a key to test this?

Nope. both of the above should work -- and worked for me just now
(this is how I tested just now:

  emacs -q -L $PWD/emacs -l emacs/notmuch.el -f notmuch --eval '(progn (setq 
notmuch-address-command nottoomuch-addresses.sh) 
(notmuch-address-message-insinuate))'

)

 I remember that Emacs VM might have had 'b' bound to this functionality
 but I cannot be sure. A few weeks ago I looked gnus, rmail  mh-e to
 figure out whether 'b' would have been bound to similar functionality
 there but I cannot find it...

 mutt uses 'b'. AFAICT, gnus uses some sequence ending in b to resend
 bounced messages (i.e. from mailer-daemon).

 quoting the manual:

 S D b

 If you have sent a mail, but the mail was bounced back to you for
 some reason (wrong address, transient failure), you can use this
 command to resend that bounced mail
 (gnus-summary-resend-bounced-mail). [...]
 
 S D r

 Not to be confused with the previous command,
 gnus-summary-resend-message will prompt you for an address to send
 the current message off to, and then send it to that place. [...]

 Be that as it may, we already use 'r' for reply, and I'm not sure we
 want to go (more) in the way of multi-letter sequences.

I agree that we don't want to to multi-letter sequences (early). I
personally would be fine w/ 'b'...

Tomi
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [PATCH 2/2] emacs: add function to resend message to new recipients

2015-08-30 Thread David Bremner
Tomi Ollila tomi.oll...@iki.fi writes:


   emacs -q -L $PWD/emacs -l emacs/notmuch.el -f notmuch --eval '(progn (setq 
 notmuch-address-command nottoomuch-addresses.sh) 
 (notmuch-address-message-insinuate))'

Ah, I missed notmuch-address-message-insinuate; it does work if I run
that. I wonder if this can be simplified at all now that
notmuch-message-mode exists. In general I'm leery of running code when
we load notmuch*.el, as this has a history of causing problems for
people using notmuch e.g. as a search tool for gnus.

The completion interface takes a little getting used to, but it's
definitely usable. And of course much better than what we have now ;).

One thing I noticed which I _think_ is a bug, is that after calling
notmuch-show-resend-message, notmuch-view-raw-message doesn't work
anymore, complaining about a read-only buffer.

d
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch