Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
Rob Browning writes:

> David Bremner writes:
>> It seems plausible to specify UTF-8 input for the library, but what
>> about the CLI? It seems like the canonicalization operation increases
>> the chance of mangling user input in non-UTF-8 locales.
>
> Yes, the key question: what does notmuch intend? i.e. given a sequence
> of bytes, how will notmuch interpret them? I think we should decide
> that, and document it clearly somewhere.
>
> The commit message describes my understanding of how things currently
> work, and if/when I get time, I'd like to propose some related
> documentation updates (perhaps to notmuch-search-terms or
> notmuch-insert/new?).
>
> Oh, and if I do understand things correctly, notmuch may already stand a
> chance of mangling any bytes that aren't an invalid UTF-8 byte sequence,
> but also aren't actually in UTF-8 (excepting encodings that are a strict
> subset of UTF-8, like ASCII).
>
> For example (if I did this right), [0xd1 0xa1] is valid UTF-8, producing
> omega "ѡ", and also valid Latin-1, producing "Ñ¡".

So on this particular point, I'm perhaps too used to thinking about the
general encoding problem, and wasn't thinking about our specific
constraints.

If (1) "normal" message bodies are required to be US-ASCII (which I'd
neglected to remember might be the case), and (2) MIME handles the rest,
then perhaps notmuch will only receive raw bytes via user input
(i.e. query strings, etc.). In which case, we could just document that
notmuch interprets user input as UTF-8 (and we might or might not
mention the Latin-1 fallback). Later locale support could be added if
desired, and none of this would involve the quite nasty problem of
encoding detection.

-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch
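[Editor's note: the two-byte ambiguity quoted above is easy to reproduce
directly; a minimal Python sketch, purely illustrative:]

```python
# The same two bytes are valid in both UTF-8 and Latin-1,
# but decode to entirely different text.
data = bytes([0xd1, 0xa1])

as_utf8 = data.decode("utf-8")      # one code point: Cyrillic small omega
as_latin1 = data.decode("latin-1")  # two code points: N-tilde, inverted bang

print(as_utf8)    # ѡ  (U+0461)
print(as_latin1)  # Ñ¡ (U+00D1 U+00A1)
```

Without out-of-band information about the source encoding, a consumer of
those bytes has no way to tell which reading was intended.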
Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
David Bremner writes:

> One way to break this up into more bite sized pieces would be to first
> create one or more tests that fail with current notmuch, and mark those
> as broken.

Right - for the moment I just wanted to post what I had for
consideration. I didn't want to spend too much more time on the approach
if it was uninteresting/inappropriate.

One simple place to start might be the included T570-normalization.sh.
Though perhaps that should be "canonicalization"?

> Can you explain why notmuch is the right place to do this, and not
> Xapian? I know we talked back and forth about this, but I never really
> got a solid sense of what the conclusion was. Is it just dependencies?

I have no strong opinion there, but to do the work in Xapian will
require a new release at a minimum, and likely new dependencies.

And generally speaking, I suppose I have a suspicion that application
needs with respect to encoding "detection", tokenization, stemming, stop
words, synonyms, phrase detection, etc. may be domain specific and
complex enough that Xapian won't want to try to accommodate the broad
array of possibilities, at least not in its core library. Though it
might try to handle some or all of that by providing suitable
customizability (presumably via callbacks or subclassing or...). And
since I'm new to Xapian, I'm not completely sure what's already
available.

> It seems plausible to specify UTF-8 input for the library, but what
> about the CLI? It seems like the canonicalization operation increases
> the chance of mangling user input in non-UTF-8 locales.

Yes, the key question: what does notmuch intend? i.e. given a sequence
of bytes, how will notmuch interpret them? I think we should decide
that, and document it clearly somewhere.

The commit message describes my understanding of how things currently
work, and if/when I get time, I'd like to propose some related
documentation updates (perhaps to notmuch-search-terms or
notmuch-insert/new?).
Oh, and if I do understand things correctly, notmuch may already stand a
chance of mangling any bytes that aren't an invalid UTF-8 byte sequence,
but also aren't actually in UTF-8 (excepting encodings that are a strict
subset of UTF-8, like ASCII).

For example (if I did this right), [0xd1 0xa1] is valid UTF-8, producing
omega "ѡ", and also valid Latin-1, producing "Ñ¡".

> I suppose some upgrade code to canonicalize all the terms? That sounds
> pretty slow.

Perhaps, or I suppose you could just document that older indexed data
might not be canonicalized, and that you should reindex if that matters
to you. Although I suppose anyone with affected characters might well
want to reindex if the canonical form isn't the one people normally
receive (which seemed possible).

Hmm, another question -- for terms, does notmuch store ordinal
positions, Unicode character offsets, input byte offsets, or...?
Canonicalization will of course change the latter. I imagine it might be
possible to traverse the index terms and just detect and merge those
affected, but no idea if that would be reasonable.

> I really didn't look at the code very closely, but there were a
> surprising number of calls to talloc_free. But those kind of details
> can wait.

Right, I wasn't sure what the policies were, so in most cases, I just
tried to release the data when it was no longer needed.

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
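[Editor's note: the point about canonicalization shifting byte offsets
can be made concrete with the two spellings of "tóken" from the commit
message; a small Python sketch, illustrative only:]

```python
# Two visually identical spellings of the same word.
nfd = "to\u0301ken"  # 'o' followed by U+0301 COMBINING ACUTE: 6 code points
nfc = "t\u00f3ken"   # precomposed U+00F3 'ó': 5 code points

# Both character and byte positions of later characters differ,
# so any stored offsets become stale after canonicalization.
print(len(nfd), len(nfc))               # 6 5
print(nfd.encode("utf-8").index(b"k"))  # byte offset 4
print(nfc.encode("utf-8").index(b"k"))  # byte offset 3
```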
Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
Rob Browning writes:

> Before this change, notmuch would index two strings that differ only
> with respect to canonicalization, like tóken and tóken, as separate
> terms, even though they may be visually indistinguishable, and do (for
> most purposes) represent the same text. After indexing, searching for
> one would not find the other, and which one you present to notmuch
> when you search depends on your tools. See test/T570-normalization.sh
> for a working example.

One way to break this up into more bite sized pieces would be to first
create one or more tests that fail with current notmuch, and mark those
as broken.

> Up to now, notmuch has let Xapian handle converting the incoming bytes
> to UTF-8. Xapian treats any byte sequence as UTF-8, and interprets
> any invalid UTF-8 bytes as Latin-1. This patch maintains the existing
> behavior (excepting the new canonicalization) by using Xapian's
> Utf8Iterator to handle the initial Unicode character parsing.

Can you explain why notmuch is the right place to do this, and not
Xapian? I know we talked back and forth about this, but I never really
got a solid sense of what the conclusion was. Is it just dependencies?

> And because when the input is already UTF-8, it just blindly converts
> from UTF-8 to Unicode code points, and then back to UTF-8 (after
> canonicalization), during each pass. There are certainly
> opportunities to optimize, though it may be worth discussing the
> detection of data encodings more broadly first.

It seems plausible to specify UTF-8 input for the library, but what
about the CLI? It seems like the canonicalization operation increases
the chance of mangling user input in non-UTF-8 locales.

> FIXME: what about existing indexed text?

I suppose some upgrade code to canonicalize all the terms? That sounds
pretty slow.

> ---
>
> Posted for preliminary discussion, and as a milestone (it appears to
> mostly work now).
> Though I doubt I'm handling things correctly everywhere notmuch-wise,
> wrt talloc, etc.

I really didn't look at the code very closely, but there were a
surprising number of calls to talloc_free. But those kind of details can
wait.
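[Editor's note: the "treat invalid UTF-8 bytes as Latin-1" fallback
quoted above can be sketched as follows. This is not Xapian's actual
Utf8Iterator implementation, just a Python model of the described
semantics; the function name is made up for illustration:]

```python
def bytes_to_text_xapian_style(data: bytes) -> str:
    """Decode as UTF-8 where possible; treat any byte that cannot start
    a valid UTF-8 sequence as a single Latin-1 character."""
    out = []
    i = 0
    while i < len(data):
        # Try progressively shorter slices so a valid multi-byte
        # sequence at position i is decoded as UTF-8.
        for width in (4, 3, 2, 1):
            chunk = data[i:i + width]
            try:
                out.append(chunk.decode("utf-8"))
                i += width
                break
            except UnicodeDecodeError:
                continue
        else:
            # No valid UTF-8 reading: fall back to Latin-1 for one byte.
            out.append(data[i:i + 1].decode("latin-1"))
            i += 1
    return "".join(out)
```

Under this model, valid UTF-8 passes through unchanged, while stray
bytes (e.g. 0xFF, or a truncated sequence) each become one Latin-1
character rather than an error.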
[PATCH 1/1] Store and search for canonical Unicode text [WIP]
WARNING: this version is very preliminary, and might eat your data.

Unicode has multiple sequences representing what should normally be
considered the same text. For example here's a combining Á and a
noncombining Á. Depending on the way you view this, you may or may not
see a difference, but the former is the canonical form, and is
represented by two Unicode code points: a capital A (U+0041) followed
by a "combining acute accent" (U+0301); the latter is the single code
point (U+00C1), which is probably what most people would type.

Before this change, notmuch would index two strings that differ only
with respect to canonicalization, like tóken and tóken, as separate
terms, even though they may be visually indistinguishable, and do (for
most purposes) represent the same text. After indexing, searching for
one would not find the other, and which one you present to notmuch when
you search depends on your tools. See test/T570-normalization.sh for a
working example.

Since we're talking about differing representations that one wouldn't
normally want to distinguish, this patch unifies the various
representations by converting all incoming text to its canonical form
before indexing, and canonicalizing all query strings.

Up to now, notmuch has let Xapian handle converting the incoming bytes
to UTF-8. Xapian treats any byte sequence as UTF-8, and interprets any
invalid UTF-8 bytes as Latin-1. This patch maintains the existing
behavior (excepting the new canonicalization) by using Xapian's
Utf8Iterator to handle the initial Unicode character parsing.

Note that the parsing approach in this patch is not particularly
efficient, both because it traverses the incoming bytes three times:

  - once to determine how long the input is (currently the iterator
    can't directly handle null terminated char*'s),
  - once to determine how long the final UTF-8 allocation needs to be,
  - and once for the conversion.
And because when the input is already UTF-8, it just blindly converts
from UTF-8 to Unicode code points, and then back to UTF-8 (after
canonicalization), during each pass. There are certainly opportunities
to optimize, though it may be worth discussing the detection of data
encodings more broadly first.

FIXME: document current encoding behavior clearly in
new/insert/search-terms.
FIXME: what about existing indexed text?
---

Posted for preliminary discussion, and as a milestone (it appears to
mostly work now). Though I doubt I'm handling things correctly
everywhere notmuch-wise, wrt talloc, etc.

 lib/Makefile.local         |  1 +
 lib/database.cc            | 17 --
 lib/message.cc             | 51 +++-
 lib/notmuch.h              |  3 ++
 lib/query.cc               |  6 ++--
 lib/text-util.cc           | 82 ++
 test/Makefile.local        | 10 --
 test/T150-tagging.sh       | 54 +++---
 test/T240-dump-restore.sh  |  4 +--
 test/T480-hex-escaping.sh  |  4 +--
 test/T570-normalization.sh | 28
 test/corpus/cur/52:2,      |  6 ++--
 test/to-utf8.c             | 44 +
 13 files changed, 267 insertions(+), 43 deletions(-)
 create mode 100644 lib/text-util.cc
 create mode 100755 test/T570-normalization.sh
 create mode 100644 test/to-utf8.c

diff --git a/lib/Makefile.local b/lib/Makefile.local
index 3a07090..41fd1e1 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -48,6 +48,7 @@ libnotmuch_cxx_srcs = \
 	$(dir)/index.cc \
 	$(dir)/message.cc \
 	$(dir)/query.cc \
+	$(dir)/text-util.cc \
 	$(dir)/thread.cc

 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)

diff --git a/lib/database.cc b/lib/database.cc
index 6a15174..7a01f95 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -436,6 +436,7 @@ find_document_for_doc_id (notmuch_database_t *notmuch, unsigned doc_id)
 char *
 _notmuch_message_id_compressed (void *ctx, const char *message_id)
 {
+    // Assumes message_id is normalized utf-8.
     char *sha1, *compressed;

     sha1 = _notmuch_sha1_of_string (message_id);

@@ -457,12 +458,20 @@ notmuch_database_find_message (notmuch_database_t *notmuch,
     if (message_ret == NULL)
 	return NOTMUCH_STATUS_NULL_POINTER;

-    if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
-	message_id = _notmuch_message_id_compressed (notmuch, message_id);
+    const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);
+
+    // Is strlen still appropriate?
+    if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX)
+    {
+	message_id = _notmuch_message_id_compressed (notmuch, u8_id);
+	talloc_free ((char *) u8_id);
+    } else
+	message_id = u8_id;

     try {
 	status = _notmuch_database_find_unique_doc_id (notmuch, "id",
						       message_id, &doc_id);
+	talloc_free ((char *
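[Editor's note: the patch above is truncated in the archive. The
canonical equivalence the commit message describes (combining A + acute
vs. precomposed Á) can be demonstrated language-neutrally with Python's
unicodedata module; this sketches the concept only, not the patch's C++
implementation, and which normalization form the patch targets is not
specified in the thread:]

```python
import unicodedata

combining = "A\u0301"    # U+0041 LATIN CAPITAL LETTER A + U+0301 COMBINING ACUTE
precomposed = "\u00c1"   # U+00C1 LATIN CAPITAL LETTER A WITH ACUTE

# Distinct code-point sequences, so naive string comparison fails...
print(combining == precomposed)                                 # False
# ...but they are canonically equivalent under normalization.
print(unicodedata.normalize("NFC", combining) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == combining)   # True
```

Canonicalizing both indexed terms and query strings to one such form is
exactly what makes a search for one spelling find the other.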