Re: [basex-talk] More Diacritic Questions

2014-11-30 Thread Graydon Saunders
Hi Christian -- On Sat, Nov 29, 2014 at 6:03 PM, Christian Grün christian.gr...@gmail.com wrote: Hi Graydon, //text()[contains(.,'&lt;')] gives me three hits. I think there should be four against the relevant bit of XML with full-text search, since with no diacritics, U+226E should

Re: [basex-talk] More Diacritic Questions

2014-11-30 Thread Christian Grün
Hi Graydon, So I would expect that, with a full text search that ignores diacritics, I'd get four hits. By adding some collation hints to one of the standard string functions, the comparison will succeed: fn:compare('≮','&lt;','?lang=en;strength=primary') In the example, I used the BaseX
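The effect of a primary-strength collation can be sketched in plain Java. This is only an illustration, assuming that a collation hint such as strength=primary behaves like java.text.Collator set to PRIMARY strength; the class name CollationSketch and its compare helper are hypothetical and not part of BaseX. The sketch uses the word athgabáil from this thread and makes no claim about how U+226E compares.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class, not BaseX code: shows how collation strength
// decides whether an accent counts as a difference.
public class CollationSketch {

    // Compares two strings with an English collator at the given strength.
    public static int compare(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        // Decompose precomposed characters before comparing them.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String accented = "athgab\u00e1il"; // athgabáil
        String plain = "athgabail";
        // PRIMARY ignores accent differences, TERTIARY does not.
        System.out.println(compare(accented, plain, Collator.PRIMARY));
        System.out.println(compare(accented, plain, Collator.TERTIARY));
    }
}
```

At primary strength only base letters are compared, so accent and case differences are ignored; at tertiary strength the accent makes the two strings unequal.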

Re: [basex-talk] More Diacritic Questions

2014-11-29 Thread Graydon Saunders
Hi Christian -- After various adventures re-learning Perl's encoding management quirks, I generated a simple XML file of all the codepoints between 0x20 and 0xD7FF; this isn't complete for XML but I thought it would be enough to be interesting. If I load that file into current BaseX dev version

Re: [basex-talk] More Diacritic Questions

2014-11-29 Thread Christian Grün
Hi Graydon, //text()[contains(.,'&lt;')] gives me three hits. I think there should be four against the relevant bit of XML with full-text search, since with no diacritics, U+226E should match. So you would expect this node to be returned as well? <glyph>≮</glyph> For this, you'll

Re: [basex-talk] More Diacritic Questions

2014-11-24 Thread Christopher Yocum
Hi Christian, Great. Thank you for handling this so quickly. When is the next version due out? I hesitate to run snapshots as my users are rather vocal when things don't work right. All the best, Chris On Mon, Nov 24, 2014 at 1:13 AM, Christian Grün christian.gr...@gmail.com wrote: Hi

Re: [basex-talk] More Diacritic Questions

2014-11-24 Thread Christian Grün
Hi Chris, Great. Thank you for handling this so quickly. When is the next version due out? I hesitate to run snapshots as my users are rather vocal when things don't work right. Our snapshots are usually very stable, so you should not have many worries. The next official release is planned

Re: [basex-talk] More Diacritic Questions

2014-11-24 Thread Christopher Yocum
Thanks. I will give it a spin on my test machine first. Darn, I will be on holiday to Prague around that time but not at the actual conference. Chris On Mon, Nov 24, 2014 at 11:15 AM, Christian Grün christian.gr...@gmail.com wrote: Hi Chris, Great. Thank you for handling this so quickly.

[basex-talk] More Diacritic Questions

2014-11-23 Thread Chris Yocum
Hi Everyone, I am rather confused again about diacritic handling in BaseX. For instance, with Full Text turned on, a word like athgabáil will match both athgabail and athgabáil with diacritics insensitive, which is what I would expect. However, if

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Chris, Thanks for the observation. I can confirm that some characters like ṡ (U+1E61) do not seem to be properly normalized yet. I have added an issue for that [1], and I hope I will soon have it fixed. If you encounter some other surprising behavior like this, feel free to tell us. Best,

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Chris Yocum
Hi Christian, Thanks for letting me know! I also need ḟ U+1E1F. All the best, Chris On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote: Hi Chris, Thanks for the observation. I can confirm that some characters like ṡ (U+1E61) do

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Graydon, I just had a look. In BaseX, "without diacritics" can be explained by this single, glorious mapping table [1]. It's quite obvious that there are just too many cases which are not covered by this mapping. We introduced this solution in the very beginnings of our full-text

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
I just found a mapping table proposed by John Cowan [1]. It's already pretty old, so it doesn't cover newer Unicode versions, but it's surely better than our current solution. [1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Graydon Saunders
Hi Christian -- That is indeed a glorious table! :) Unicode defines whether or not a character has a decomposition; so e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a combining acute accent.) I think the presence of a decomposition is a recoverable character property in Java.
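The decomposition property described above can be queried from Java's standard library. The following is a minimal sketch using java.text.Normalizer, not the BaseX implementation; the class DiacriticsSketch and its method names are hypothetical. It checks whether a code point has a canonical decomposition and strips combining marks after decomposing.

```java
import java.text.Normalizer;

// Hypothetical helper class, not BaseX code: uses canonical decomposition
// (NFD) instead of a hand-written mapping table.
public class DiacriticsSketch {

    // True if the code point changes under canonical decomposition,
    // i.e. Unicode defines a decomposition for it.
    public static boolean hasDecomposition(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return !Normalizer.isNormalized(s, Normalizer.Form.NFD);
    }

    // Decomposes the input, then removes all combining marks (category M).
    public static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(hasDecomposition(0x00E9));           // e with acute
        System.out.println(stripDiacritics("athgab\u00e1il")); // athgabáil
        System.out.println(stripDiacritics("\u1E61\u1E1F"));   // ṡ and ḟ
        System.out.println(stripDiacritics("\u226E"));         // not less-than
    }
}
```

Because U+226E decomposes into U+003C plus the combining mark U+0338, this approach reduces it to a plain less-than sign, which is the behaviour asked for earlier in the thread. Characters without a canonical decomposition, such as ø, are left unchanged and would still need a mapping table.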

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Chris, I am glad to report that the latest snapshot of BaseX [1] now provides much better support for diacritical characters. Please find more details in my next mail to Graydon. Hope this helps, Christian [1] http://files.basex.org/releases/latest/

Re: [basex-talk] More Diacritic Questions

2014-11-23 Thread Christian Grün
Hi Graydon, Thanks for your detailed reply, much appreciated. For today, I decided to choose a pragmatic solution that provides support for many more cases than before. I have added some more (glorious) mappings motivated by John Cowan's mail, which can now be found in a new class [1]. However,