Hi Everyone,
I am rather confused again about diacritic handling in BaseX. For
instance, with Full Text turned on, a word like athgabáil will match
both athgabail and athgabáil with "diacritics insensitive", which is
what I would expect. However, if
Hi Chris,
Thanks for the observation. I can confirm that some characters like ṡ
(U+1E61) do not seem to be properly normalized yet. I have added an issue
for that [1], and I hope to have it fixed soon.
If you encounter some other surprising behavior like this, feel free to tell us.
Best,
Hi Christian,
Thanks for letting me know! I also need ḟ U+1E1F.
All the best,
Chris
On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote:
Hi Chris,
Thanks for the observation. I can confirm that some characters like ṡ
(U+1E61) do
Hi,
I was using an XSLT 2 stylesheet to transform xqdoc XML to Markdown. I
already had a stylesheet that did the same with HTML. This stylesheet
is XSLT 2 and uses Saxon HE which is located inside the BaseX lib
directory. Saxon HE is picked up just fine with xslt:transform and
Hi Andy,
Thanks, that works for me. I always prefer less complex code :-) so it would
be nice if this feature made a return at some point.
So it did: In the latest snapshot, scores will again be propagated
when using and, or, and predicates [2].
Cheers,
Christian
[1]
Hmm, removing all functions and named templates didn't help either.
Something is causing my HTML conversion stylesheet to use Saxon,
whereas the other stylesheet insists on using Xalan.
Here's another error I got when I added a function and called it.
[bxerr:BXSL0001] ERROR: 'Cannot
Stop the press.
No, really. I probably have to stop now. I was switching BaseX
versions and used a version to which I hadn't added Saxon yet. The HTML
stylesheet I used didn't use any XSLT 2 features.
Face palm. Sorry for wasting your time.
Can I get Google to forget this ;-)
--Marc
Can I get Google to forget this ;-)
Never ever, sorry…
Hi Graydon,
I just had a look. In BaseX, matching without diacritics can be
explained by this single, glorious mapping table [1].
It's quite obvious that there are just too many cases which are not
covered by this mapping. We introduced this solution in the very
beginnings of our full-text
I just found a mapping table proposed by John Cowan [1]. It's already
pretty old, so it doesn't cover newer Unicode versions, but it's
surely better than our current solution.
[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html
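To illustrate why a hand-maintained mapping table leaves gaps, here is a minimal Java sketch of the general technique. The class name, the method name and the table contents are all hypothetical; this is not the BaseX table and not John Cowan's list, just a tiny stand-in that shows the lookup and what happens to a character nobody listed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of table-based diacritics folding.
// The entries are illustrative only and NOT taken from BaseX.
final class MappingTableSketch {
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00E1', 'a'); // a with acute
        TABLE.put('\u00E9', 'e'); // e with acute
        TABLE.put('\u00FC', 'u'); // u with diaeresis
    }

    /** Replaces every character found in the table; all others pass through unchanged. */
    static String fold(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            char mapped = TABLE.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("athgab\u00E1il"));
        // U+1E61 (s with dot above) is not in the table, so it is left as-is.
        System.out.println(fold("\u1E61"));
    }
}
```

Any character missing from the table is silently left untouched, which matches the kind of behavior reported for U+1E61 and U+1E1F earlier in the thread.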
On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün
Hi Christian --
That is indeed a glorious table! :)
Unicode defines whether or not a character has a decomposition; so
e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a
combining acute accent.) I think the presence of a decomposition is a
recoverable character property in Java.
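As a rough illustration of the decomposition idea, the sketch below uses the JDK's java.text.Normalizer to decompose a string canonically (NFD) and then drops the combining marks. The class and method names are made up for this example, and it is only one possible approach, not what BaseX actually does.

```java
import java.text.Normalizer;

// Hypothetical sketch: strip diacritics via Unicode canonical decomposition.
final class DecompositionSketch {
    /** Decomposes to NFD, then removes all combining marks (general category M). */
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** True if the code point is changed by NFD, i.e. it has a canonical decomposition. */
    static boolean hasDecomposition(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return !Normalizer.isNormalized(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("athgab\u00E1il")); // a with acute
        System.out.println(stripDiacritics("\u1E61\u1E1F"));   // s and f with dot above
        System.out.println(hasDecomposition(0x00E9));          // e with acute
    }
}
```

Note that this only covers characters with a canonical decomposition. Letters such as ø (U+00F8) have none and would still need an explicit mapping.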
Hi Chris,
I am glad to report that the latest snapshot of BaseX [1] now provides
much better support for diacritical characters.
Please find more details in my next mail to Graydon.
Hope this helps,
Christian
[1] http://files.basex.org/releases/latest/
Hi Graydon,
Thanks for your detailed reply, much appreciated.
For today, I decided to choose a pragmatic solution that provides
support for many more cases than before. I have added some more
(glorious) mappings motivated by John Cowan's mail, which can now be
found in a new class [1].
However,