Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]

2015-09-05 Thread Rob Browning
Rob Browning writes:
> David Bremner writes:
>> It seems plausible to specify UTF-8 input for the library, but what
>> about the CLI? It seems like the canonicalization operation increases
>> the chance of mangling user input in non-UTF-8 locales.

Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]

2015-09-02 Thread David Bremner
Rob Browning writes:
>
> Before this change, notmuch would index two strings that differ only
> with respect to canonicalization, like tóken and tóken, as separate
> terms, even though they may be visually indistinguishable, and do (for
> most purposes) represent the same
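To illustrate the indexing behaviour described in the quoted text, here is a minimal sketch using a toy in-memory index. It is not notmuch's or Xapian's code: `build_index` and its `canonicalize` flag are hypothetical names, and the choice of NFC as the canonical form is an assumption, since the excerpt does not say which normalization form the patch uses.

```python
import unicodedata
from collections import defaultdict


def build_index(docs, canonicalize):
    """Map each whitespace-separated term to the set of doc ids containing it.

    Hypothetical helper for illustration only; NFC is an assumed choice.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            if canonicalize:
                term = unicodedata.normalize("NFC", term)
            index[term].add(doc_id)
    return index


# Two spellings of the same word:
# doc 1 uses precomposed U+00F3, doc 2 uses 'o' followed by combining U+0301.
docs = {1: "t\u00f3ken", 2: "to\u0301ken"}

plain = build_index(docs, canonicalize=False)  # two distinct terms
canon = build_index(docs, canonicalize=True)   # one shared term
```

Without canonicalization a search for one spelling would miss the message that uses the other; with it, both messages land under a single term.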

Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]

2015-09-02 Thread Rob Browning
David Bremner writes:
> One way to break this up into more bite sized pieces would be to first
> create one or more tests that fail with current notmuch, and mark those
> as broken.

Right - for the moment I just wanted to post what I had for consideration. I didn't want to

[PATCH 1/1] Store and search for canonical Unicode text [WIP]

2015-08-30 Thread Rob Browning
WARNING: this version is very preliminary, and might eat your data. Unicode has multiple sequences representing what should normally be considered the same text. For example here's a combining Á and a noncombining Á. Depending on the way you view this, you may or may not see a difference,
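The difference between the two forms of Á can be made visible with a short sketch. It uses Python's standard `unicodedata` module rather than anything from the patch itself, and the use of NFC here is an assumption made for illustration.

```python
import unicodedata

# Precomposed form: U+00C1 LATIN CAPITAL LETTER A WITH ACUTE (one code point)
precomposed = "\u00c1"
# Combining form: U+0041 'A' followed by U+0301 COMBINING ACUTE ACCENT
decomposed = "A\u0301"

# The raw strings differ even though they usually render identically.
raw_equal = precomposed == decomposed

# After normalizing both to the same form, they compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(len(precomposed), len(decomposed), raw_equal, nfc_equal)
```

Any of the standard forms would make the two compare equal, as long as the same form is applied both when storing text and when searching for it.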