Hi Christian --
On Sat, Nov 29, 2014 at 6:03 PM, Christian Grün
christian.gr...@gmail.com wrote:
Hi Graydon,
//text()[contains(.,'&lt;')]
gives me three hits.
I think there should be four against the relevant bit of XML
with full-text search, since with no diacritics, U+226E should
Hi Graydon,
So I would expect that, with a full text search that ignores
diacritics, I'd get four hits.
By adding some collation hints to one of the standard string
functions, the comparison will succeed:
fn:compare('≮','&lt;','?lang=en;strength=primary')
In the example, I used the BaseX
Hi Christian --
After various adventures re-learning Perl's encoding management
quirks, I generated a simple XML file of all the codepoints between
0x20 and 0xD7FF; this isn't complete for XML but I thought it would be
enough to be interesting.
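The generation step described above can be sketched in Java (this is not the Perl script used here, only a rough equivalent). The `glyph` element name is taken from this thread; the `glyphs` root element, the class name and the output file name are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CodepointXml {

    /** Returns the code point as XML character data, escaping markup characters. */
    static String escape(int codePoint) {
        switch (codePoint) {
            case '<': return "&lt;";
            case '&': return "&amp;";
            case '>': return "&gt;";
            default:  return new String(Character.toChars(codePoint));
        }
    }

    /** Builds a document with one glyph element per code point in [first, last]. */
    static String build(int first, int last) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<glyphs>\n"); // hypothetical root element name
        for (int cp = first; cp <= last; cp++) {
            sb.append("<glyph>").append(escape(cp)).append("</glyph>\n");
        }
        sb.append("</glyphs>\n");
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0x20..0xD7FF is the range mentioned in the mail; the file name is an assumption.
        byte[] bytes = build(0x20, 0xD7FF).getBytes(StandardCharsets.UTF_8);
        Files.write(Paths.get("codepoints.xml"), bytes);
    }
}
```

Every code point in this range is a legal XML 1.0 character, so only the three markup characters need escaping.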
If I load that file into current BaseX dev version
Hi Graydon,
//text()[contains(.,'&lt;')]
gives me three hits.
I think there should be four against the relevant bit of XML
with full-text search, since with no diacritics, U+226E should match.
So you would expect this node to be returned as well?
<glyph>≮</glyph>
For this, you'll
Hi Christian,
Great. Thank you for handling this so quickly. When is the next version
due out? I hesitate to run snapshots as my users are rather vocal when
things don't work right.
All the best,
Chris
On Mon, Nov 24, 2014 at 1:13 AM, Christian Grün christian.gr...@gmail.com
wrote:
Hi
Hi Chris,
Great. Thank you for handling this so quickly. When is the next version
due out? I hesitate to run snapshots as my users are rather vocal when
things don't work right.
Our snapshots are usually very stable, so you should not have many
worries. The next official release is planned
Thanks. I will give it a spin on my test machine first. Darn, I will be
on holiday to Prague around that time but not at the actual conference.
Chris
On Mon, Nov 24, 2014 at 11:15 AM, Christian Grün christian.gr...@gmail.com
wrote:
Hi Chris,
Great. Thank you for handling this so quickly.
Hi Everyone,
I am rather confused again about diacritic handling in BaseX. For
instance, with full text turned on, a word like athgabáil matches
both athgabail and athgabáil when diacritics are insensitive, which is
what I would expect. However, if
Hi Chris,
Thanks for the observation. I can confirm that some characters like ṡ
(U+1E61) do not seem to be properly normalized yet. I have added an issue
for that [1], and I hope I will soon have it fixed.
If you encounter some other surprising behavior like this, feel free to tell us.
Best,
Hi Christian,
Thanks for letting me know! I also need ḟ U+1E1F.
All the best,
Chris
On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote:
Hi Chris,
Thanks for the observation. I can confirm that some characters like ṡ
(U+1E61) do
Hi Graydon,
I just had a look. In BaseX, "without diacritics" can be explained by
a single, glorious mapping table [1].
It's quite obvious that there are just too many cases which are not
covered by this mapping. We introduced this solution in the very
beginnings of our full-text
I just found a mapping table proposed by John Cowan [1]. It's already
pretty old, so it doesn't cover newer Unicode versions, but it's
surely better than our current solution.
[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html
On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün
Hi Christian --
That is indeed a glorious table! :)
Unicode defines whether or not a character has a decomposition; so
e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a
combining acute accent.) I think the presence of a decomposition is a
recoverable character property in Java.
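A minimal sketch of that idea, assuming the standard `java.text.Normalizer` class is acceptable for the purpose: a character has a canonical decomposition if it is not already in NFD form, and diacritics can be dropped by decomposing and then removing all combining marks. The class and method names are hypothetical, and this is not how BaseX implements its diacritics option.

```java
import java.text.Normalizer;

public class DiacriticsSketch {

    /** True if the code point changes under canonical decomposition (NFD). */
    static boolean hasCanonicalDecomposition(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return !Normalizer.isNormalized(s, Normalizer.Form.NFD);
    }

    /** Decomposes to NFD, then removes every combining mark (general category M). */
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // U+00E9 decomposes into U+0065 + U+0301, as described above.
        System.out.println(hasCanonicalDecomposition(0x00E9)); // prints "true"
        System.out.println(stripDiacritics("athgab\u00E1il")); // prints "athgabail"
        // U+226E decomposes into U+003C + U+0338, leaving a plain less-than sign.
        System.out.println(stripDiacritics("\u226E"));         // prints "<"
    }
}
```

This covers only characters that Unicode defines a canonical decomposition for; letters such as ø (U+00F8), which have none, would still need a mapping table.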
Hi Chris,
I am glad to report that the latest snapshot of BaseX [1] now provides
much better support for diacritical characters.
Please find more details in my next mail to Graydon.
Hope this helps,
Christian
[1] http://files.basex.org/releases/latest/
Hi Graydon,
Thanks for your detailed reply, much appreciated.
For today, I decided to choose a pragmatic solution that provides
support for many more cases than before. I have added some more
(glorious) mappings motivated by John Cowan's mail, which can now be
found in a new class [1].
However,
15 matches