Re: Tagging text as being in arbitrary complex-script languages

2019-04-23 Thread Richard Wordingham
On Tue, 23 Apr 2019 17:35:10 +0200
Eike Rathke  wrote:

> Hi Richard,
> 
> On Thursday, 2019-04-18 20:40:01 +0100, Richard Wordingham wrote:

> > It sounds as though one has to specify the script where there is
> > doubt as to what type of script will dominate. Is it an issue if
> > there are two competing scripts of the same type, e.g. Thai v. Lanna
> > for Northern Thai?  A dual script dictionary would correct
> > inefficiently.  

> Competing in the sense two different scripts under one language tag?
> I wouldn't do that and IMHO it would be wrong.

It's worse than that.  The spoken language nod-TH resolves, ignoring
subregional variations, into the three written groups:

nod-Lana-TH
nod-Thai-etymo-TH (name but not concept declared unsuitable on 10 Jan)
nod-Thai-phonetic-TH (ditto)

The scheme 'nod-Thai-etymo-TH' often accompanies published material in
nod-Lana-TH. The New Testament is published in nod-Lana-TH and
'nod-Thai-phonetic-TH'.

Until I can find names for the Thai-script variants more specific to
Northern Thai, my plan is to handle the difference by letting the user
choose the dictionary if I ever get round to Thai script Northern Thai
dictionaries.  The biggest need I see for the variant tags is user
interfaces.

The Lana script dictionary is highly desirable for
handling the visual ambiguities in the script for the vernacular
languages and has high priority.  Eyeballs are probably good enough for
the Thai script.

Richard.
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/libreoffice

Re: Tagging text as being in arbitrary complex-script languages

2019-04-23 Thread Richard Wordingham
On Tue, 23 Apr 2019 18:00:22 +0200
Eike Rathke  wrote:

> On Friday, 2019-04-19 03:32:34 +0100, Richard Wordingham wrote:

> > In answer to what was intended to be a rhetorical question, I
> > suppose und-Latn-t-sa-m0-iast and und-Latn-t-sa-m0-iso would work
> > for the normative forms.  
> 
> Seem.. at least when entered at https://r12a.github.io/app-subtags/ in
> the Check form it doesn't overly complain.

It seems that some people think that IAST also defines a Cyrillic
representation, so I think the 'Latn' is justified.

> However, I'd avoid 'und', to me it annotates as "can't determine what
> this could be" and in fact it is listed as Undetermined.

Well, as the two systems are international standards (the 'i' in
'iast' and 'iso'), it should be hard to tell whether the intended
audience is English, German, Japanese or whatever.  The what of the
underlying content is contained in the extension - in this case the
'sa'.


> Yes, that's ugly, but unavoidable. For which sa-Latn would be a better
> solution.

And allow for mixtures of the two schemes!

Richard.

Re: Tagging text as being in arbitrary complex-script languages

2019-04-18 Thread Richard Wordingham
On Thu, 18 Apr 2019 20:40:01 +0100
Richard Wordingham  wrote:

> On Thu, 18 Apr 2019 12:25:11 +0200
> Eike Rathke  wrote:

> > Though with sa-Latn
> > I doubt there's a use case, so I wouldn't call that "correct" in
> > common sense.  
> 
> So how do you suggest we tag Sanskrit in Latin script?

In answer to what was intended to be a rhetorical question, I suppose
und-Latn-t-sa-m0-iast and und-Latn-t-sa-m0-iso would work for the
normative forms. I've successfully loaded a mocked up extension for the
former (as explicitly using a Western script), though I don't much like
the consequent tagging  in
the document's content.xml. That's a problem with the 't' extension.
Transliteration may change the language of place names in isolation,
but it doesn't really change the language of paragraphs of text.
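
To make the structure of such tags concrete, here is a minimal sketch that splits a tag carrying a 't' extension into the outer tag, the source language and the 'm0' mechanism. The function names are my own, and the handling is deliberately simplified (only the source tag and 'm0' are extracted); it is not a full BCP 47 parser.

```python
def is_field_key(subtag):
    """Field keys in the 't' extension are a letter followed by a digit, e.g. m0."""
    return len(subtag) == 2 and subtag[0].isalpha() and subtag[1].isdigit()


def split_t_extension(tag):
    """Split a tag into outer tag, transform source and m0 mechanism (sketch only)."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "t" not in lowered[1:]:
        return {"outer": tag, "source": None, "mechanism": None}
    t_at = lowered.index("t", 1)
    source, mechanism, key = [], [], None
    for sub in lowered[t_at + 1:]:
        if len(sub) == 1:          # another singleton starts a different extension
            break
        if is_field_key(sub):
            key = sub              # remember which field we are inside
        elif key is None:
            source.append(sub)     # still in the source language tag
        elif key == "m0":
            mechanism.append(sub)  # value of the mechanism field
    return {
        "outer": "-".join(subtags[:t_at]),
        "source": "-".join(source) or None,
        "mechanism": "-".join(mechanism) or None,
    }


for example in ("und-Latn-t-sa-m0-iast", "und-Latn-t-sa-m0-iso", "sa-Latn"):
    print(example, split_t_extension(example))
```

The point the sketch illustrates is that the language of the content ('sa') sits inside the extension, while the primary language subtag stays 'und'.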

Richard.

Re: Tagging text as being in arbitrary complex-script languages

2019-04-18 Thread Richard Wordingham
On Thu, 18 Apr 2019 12:25:11 +0200
Eike Rathke  wrote:

> What I usually did is, lookup the language at SIL and the Ethnologue
> and use the most prevalent script as implied default script. Which
> here https://www.ethnologue.com/language/san would lead to
> Devanagari, but in this case more important is also what MS assigned
> the LCID for.

So I shouldn't be misled by the fact that the CTL script I most
frequently write Sanskrit in is Thai -:)  Seriously, though, I believe
the script of sa-TH is Thai rather than Devanagari, and I am quite
sure that the script of sa-MM is Mymr.

It sounds as though one has to specify the script where there is doubt
as to what type of script will dominate. Is it an issue if there are
two competing scripts of the same type, e.g. Thai v. Lanna for Northern
Thai?  A dual script dictionary would correct inefficiently.

> > "sa-150" Sanskrit written using European conventions - so, any
> > script, but, at least for Devanagari, the anusvara sign is not used
> > for homorganic nasals.  
> 
> Though valid, LibreOffice doesn't use the numeric UN M.49 code, it may
> be accepted but might not work everywhere.
> 
> > "sa-Deva-150" Sanskrit written in Devanagari in the manner used in
> > Europe.  
> 
> Same here.
> 
> > "sa-Latn" Sanskrit written in the Roman script.
> > 
> > "sa-Latf" Sanskrit written in Fraktur (I'm not sure that this
> > exists. It might need a hint as to where to find a Fraktur script
> > with a combining candrabindu.)  
> 
> Both perfectly valid, if they serve any purpose. Though with sa-Latn
> I doubt there's a use case, so I wouldn't call that "correct" in
> common sense.

So how do you suggest we tag Sanskrit in Latin script?  Within English
works, it's not uncommon for any Sanskrit quoted precisely to be in the
Latin script; about half the English language articles in the
'International Journal of Sanskrit
Research' (http://www.anantaajournal.com/) that quote Sanskrit passages
quote them in the Latin script.  Several papers would benefit from the
application of sa-Latn proofing tools, though I don't deny that
proofing Sanskrit may be difficult.

Moreover, I've only ever seen U+0310 COMBINING CANDRABINDU in examples
of Sanskrit in Latin text. 

> I also just learned that sa-Latf somehow exists..

That example is in the same spirit as en-Thai (which I've successfully
used for privacy) and notes I've seen kept in en-Runr on a publicly
accessible whiteboard.
 
I was wondering whether Sanskrit was printed in Antiqua or Fraktur in
early 20th Century Germany.  You seem to think neither.

Richard.

Re: Tagging text as being in arbitrary complex-script languages

2019-04-17 Thread Richard Wordingham
On Wed, 17 Apr 2019 13:53:25 +0200
Eike Rathke  wrote:

> > > On 4/15/19 12:26 PM, Eike Rathke wrote:  
> > > > Adding arbitrary dictionary languages (as long as they strictly
> > > > follow the BCP 47 language tag specification) works since quite
> > > > a while (2014?) already.  

> > An interesting experiment would be to try adding a language to both
> > Western and CTL (as with Mongolian and some minor SEA languages) or
> > Western and CJK (various Zhuang writing systems), though I suppose
> > it won't hurt to simply disambiguate by script.  
> 
> In fact you have to, or use an ISO 639-1/2/3 language code that
> implies a default script for one and specify an ISO 15924 script code
> for the other, which I was referring with "correct BCP 47 language
> tags".

Is there a pointer as to which tag sequences that "strictly follow the
BCP 47 language tag specification" are "correct"?

As far as I can tell, the following all strictly follow the
specification:

"sa" Sanskrit, with no specification of the script or spelling
conventions.

"sa-IN" Sanskrit as used in India - so far as I can tell, that could be
in, for example, Devanagari, Grantha or even the Tamil script!  For
Devanagari at least, I understand that this implies that homorganic
nasals may be written using U+0902 DEVANAGARI SIGN ANUSVARA.

"sa-150" Sanskrit written using European conventions - so, any script,
but, at least for Devanagari, the anusvara sign is not used for
homorganic nasals.

"sa-Deva-150" Sanskrit written in Devanagari in the manner used in
Europe.

"sa-Latn" Sanskrit written in the Roman script.

"sa-Latf" Sanskrit written in Fraktur (I'm not sure that this exists.
It might need a hint as to where to find a Fraktur script with a
combining candrabindu.)

The only Sanskrit tag sequence I can find in isolang.cxx is "sa-IN".
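
For reference, a rough sketch of how the tags above break down into language, script and region subtags. It only handles the simple language-script-region shape (no variants, extensions or private use), and the function name is just for illustration.

```python
import re


def split_tag(tag):
    """Split a simple BCP 47 tag into language, script and region subtags."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for sub in parts[1:]:
        # A script is four letters and must come before any region.
        if (re.fullmatch(r"[A-Za-z]{4}", sub)
                and result["script"] is None and result["region"] is None):
            result["script"] = sub.title()
        # A region is two letters or a three-digit UN M.49 code.
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", sub) and result["region"] is None:
            result["region"] = sub.upper()
    return result


for tag in ("sa", "sa-IN", "sa-150", "sa-Deva-150", "sa-Latn", "sa-Latf"):
    print(tag, split_tag(tag))
```

All six tags parse under this sketch, which is the sense in which they "strictly follow" the syntax; whether an application treats them as "correct" is a separate question.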

Richard.

Re: Tagging text as being in arbitrary complex-script languages

2019-04-16 Thread Richard Wordingham
On Mon, 15 Apr 2019 15:14:49 +
jonathon  wrote:

> On 4/15/19 12:26 PM, Eike Rathke wrote:

> > Adding arbitrary dictionary languages (as long as they strictly
> > follow the BCP 47 language tag specification) works since quite a
> > while (2014?) already.

Only if you hacked the text to declare the dictionary's language as the
CTL or CJK language, as appropriate. Otherwise, you could only
use such a dictionary for a 'Western' script.

As recently as 2015, another issue was that I was having to regenerate
hunspell/utf_info.cxx for a LibreOffice build so that it would accept
word characters as word characters.  I don't know how well that file
tracks the Unicode standard nowadays.  When can Pali spell-checking
in the extended Lao script (Pali support to 1930s standards was only
added this year) be expected to have problems only because of the
inadequacy of the dictionaries?

> > New(er) in the mentioned mechanism is the
> > ability to add a language also to the CTL or CJK sections where
> > previously it was only possible to add to the (misnamed) "Western"
> > section, and give the language list entries a proper UI name
> > instead of showing just the language tag.

> Thanks.
> I wasn't aware that that functionality was present.

> I'll play with over the next month or so, then write about in my
> long-neglected blog.

An interesting experiment would be to try adding a language to both
Western and CTL (as with Mongolian and some minor SEA languages) or
Western and CJK (various Zhuang writing systems), though I suppose it
won't hurt to simply disambiguate by script. In general, tagging has the
potential to get very messy, e.g. Pali in Lanna script as used in
Northern Thailand as opposed to Pali in Lanna script as used in
North-eastern Thailand. (Yes, there are systematic spelling differences
between the two.)

Richard.

Re: Tagging text as being in arbitrary complex-script languages

2019-04-10 Thread Richard Wordingham
On Wed, 10 Apr 2019 15:13:52 +0200
Eike Rathke  wrote:

> Hi Richard,
> 
> On Wednesday, 2019-04-10 04:02:53 +0100, Richard Wordingham wrote:
> 
> > I was also able to get SIL's oxttools to work sufficiently  
> 
> What are those oxttools and where to get them?

Tools for assembling extensions for LibreOffice, particularly
dictionaries and the like.  They're available at
https://github.com/silnrsi/oxttools .  It looks as though there may be
some tools for assembling dictionaries, but I haven't dug deeply into
them.

Richard.

Re: Tagging text as being in arbitrary complex-script languages

2019-04-09 Thread Richard Wordingham
On Mon, 8 Apr 2019 16:17:38 +0200
Eike Rathke  wrote:

> ScriptType value 3 here means CTL. The values are explained in
> officecfg/registry/schema/org/openoffice/VCL.xcs under
> 

Thank you for the information, and thanks to Stephan Bergmann for the
localisation information.

For plodders like me, the definitions are:

officecfg/registry/schema/org/openoffice/VCL.xcs (content, as stated by
Eike)


officecfg/registry/component-schema.dtd (syntax of VCL.xcs)

officecfg/registry/component-update.dtd (syntax and some semantics of
extension writer's dictionaries.xcu; the allowed information content is
given in VCL.xcs.)

I was also able to get SIL's oxttools to work sufficiently to work out
what I needed.  A dictionaries.xcu that works is:

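[The XML element markup of this file was lost in archiving; only the
namespace declarations and element values survive:]

 xmlns:oor="http://openoffice.org/2001/registry"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"

 Northern Thai
 3
 %origin%/nod_TH.aff %origin%/nod_TH.dic
 DICT_SPELL
 nod-TH
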

The LibreOffice extension manager seems tolerant and has some helpful
error reporting.  *My* next step is to sort out copyright issues so
that I can share the dictionary.
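
As an illustration of the two pieces of registry data involved, here is a small generator sketch. Several things in it are assumptions rather than a record of my file: the node and property names (ExtraLanguages, Name, ScriptType for the VCL component; ServiceManager, Dictionaries, Locations, Format, Locales for the Linguistic component) are as I understand the schemas, the node name "HunSpellDic_nod-TH" is just a label, and the two components are emitted as separate documents because the sketch does not try to show how they are combined in one dictionaries.xcu.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<oor:component-data xmlns:oor="http://openoffice.org/2001/registry"\n'
    ' xmlns:xs="http://www.w3.org/2001/XMLSchema"\n'
    ' oor:name={name} oor:package={package}>\n'
)
FOOTER = "</oor:component-data>\n"


def prop(name, value_type, value):
    """Render one <prop> element holding a single <value>."""
    return "  <prop oor:name={} oor:type={}><value>{}</value></prop>\n".format(
        quoteattr(name), quoteattr(value_type), escape(str(value))
    )


def extra_language(tag, ui_name, script_type):
    """VCL component (assumed names): UI name and script type, 3 meaning CTL."""
    return (
        HEADER.format(name=quoteattr("VCL"), package=quoteattr("org.openoffice"))
        + ' <node oor:name="ExtraLanguages">\n'
        + ' <node oor:name={} oor:op="replace">\n'.format(quoteattr(tag))
        + prop("Name", "xs:string", ui_name)
        + prop("ScriptType", "xs:int", script_type)
        + " </node>\n </node>\n"
        + FOOTER
    )


def spell_dictionary(tag, aff, dic):
    """Linguistic component (assumed names): where the Hunspell files live."""
    return (
        HEADER.format(name=quoteattr("Linguistic"),
                      package=quoteattr("org.openoffice.Office"))
        + ' <node oor:name="ServiceManager">\n <node oor:name="Dictionaries">\n'
        + ' <node oor:name={} oor:op="fuse">\n'.format(quoteattr("HunSpellDic_" + tag))
        + prop("Locations", "oor:string-list",
               "%origin%/{} %origin%/{}".format(aff, dic))
        + prop("Format", "xs:string", "DICT_SPELL")
        + prop("Locales", "oor:string-list", tag)
        + " </node>\n </node>\n </node>\n"
        + FOOTER
    )


for doc in (extra_language("nod-TH", "Northern Thai", 3),
            spell_dictionary("nod-TH", "nod_TH.aff", "nod_TH.dic")):
    ET.fromstring(doc.encode("utf-8"))  # raises ParseError if not well-formed
    print(doc)
```

The well-formedness check only proves the output is XML; whether the extension manager accepts it is something to test against a real installation.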

Richard.

Tagging text as being in arbitrary complex-script languages

2019-04-06 Thread Richard Wordingham
https://wiki.documentfoundation.org/ReleaseNotes/5.4 says,

"The language list for text attribution now also displays BCP47
language tags provided by dictionaries if a language is not known in
the predefined set of languages. (Eike Rathke (Red Hat, Inc.))

Such additional language tags are placed in curly brackets /
braces, for example {en-DK}, and are displayed at the top of the
list after the [None] entry."

Is some additional information required in the .oxt file for the
dictionary if the script is not "Western text"?  For example, I have
installed a dictionary (of my own devising) for language nod-TH, but it
only shows up (in LibreOffice 6.2.2.2) in the language list for
Western text.  (The language is only written in CTL scripts - Thai
and Lanna.)

The work-around of manually editing the XML of a Writer document to
insert the language and country still works.

Richard.

Special Fonts for Spell Checking Northern Thai in Lanna Script

2017-10-15 Thread Richard Wordingham
I am trying to put together a workable solution for spell-checking
Northern Thai in the Lanna (a.k.a. Tai Tham) script.  I have a good idea
how to do it, and it is already working in Firefox.  The solution may
not be suitable for run of the mill users, but I don't believe run of
the mill users need the solution.  Additionally, a Thai or English user
interface is probably better than a Northern Thai interface.

There are a number of problems, but the significant ones all relate to
fonts.  The others are all soluble.

1) The Universal Script Engine

The Universal Script Engine inserts far too many dotted circles into
Tai Tham text.  Most closed syllables cannot be written in accordance
with Unicode's principle of phonetic ordering, and some cannot be
written at all.  This I have overcome by creating a font that removes
inappropriate dotted circles.

This turns the Universal Script Engine into a solution for DirectWrite,
HarfBuzz and AAT.

2) Scriptio Continua

The Tai languages in the Tai Tham script do not separate words by
spaces.  The old solution to this problem, U+200B ZERO WIDTH SPACE,
works.  (By contrast, Pali, at least in modern texts, tends to have
spaces between words, as is done in Pali in the Thai script.
Significant sandhi may suppress the word-breaks.)
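
A minimal sketch of what the ZWSP convention gives a spell-checker: words can be recovered by splitting on U+200B as well as on ordinary white space. The function name and the Latin placeholder text are only for illustration; note that Python's str.split() does not treat U+200B as white space, so it is split on explicitly.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def words(text):
    """Split text into words on white space and on ZWSP."""
    tokens = []
    for chunk in text.split():
        # Drop empty pieces produced by leading, trailing or doubled ZWSP.
        tokens.extend(piece for piece in chunk.split(ZWSP) if piece)
    return tokens


print(words("ab\u200bcd ef\u200b"))
```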

3) Northern Thai is not supported by LibreOffice

It is, however, supported by Open Document Format.  The solution is to
edit the XML file to set the CTL language in the XML, and then propagate
and edit text for which nod-TH is the CTL language.

The lack of a Northern Thai interface is probably not a problem.  Any
need for it is emotional rather than practical.

It is possible that Burmese, Chinese, English and possibly Lao
interfaces will similarly cater for Tai Khuen and Tai Lue users. 

4) Visually Ambiguous Spelling

Words that normally look identical may be sorted and pronounced
differently.  Actually, there are surprisingly few visual homographs
with such differences.

So that users may see what they are typing, the solution I have adopted
is to colour code the glyphs so that users can see whether a consonant
precedes or follows the vowel of the syllable in coding and phonetic
order.

5) Font Support

Does LibreOffice support any type of multi-colour font?  I may have to
devise a shape difference to indicate the spelling, which is less
appealing.  This would be most important in choosing a spelling
correction.

To see what it is that one has actually typed, switching to a
transliteration font and then undoing the change is one approach. 

6) Font Selection

How does one control the font used in the spell-checking interface?  I
am particularly interested in the solution for Ubuntu, but it would be
good to also know the solution for Windows.  For Ubuntu, I suspect the
answer will lie in Fontconfig, but I first need to know how to identify
the font that LibreOffice tries to use.  Fontconfig would work by
controlling the fallback.

Even without grammar coding, there may be an issue in that some Lanna
script fonts are barely usable in the User Interface - readable Northern
Thai text can need much greater vertical extent than English, depending
on the style.

7) Dictionary Creation

I currently have a large, working Northern Thai dictionary.  I do need
to sort out IP issues before I can share it.  Even then, there needs to
be a lot of shake-down testing to eliminate my typographical errors,
and birds, fish and trees need to be added.



Version of gcc for LibreOffice

2015-10-09 Thread Richard Wordingham
On Wed, 07 Oct 2015 11:10:08 +0200
Jan-Marek Glogowski <glo...@fbihome.de> wrote:
(when topic was 'Can't track flow of characters in from Input Method
Editor')

> Am 06.10.2015 um 23:51 schrieb Richard Wordingham:
> > I think my compiler (gcc
> > Version 4.6.3) is too old to compile Version 5.0, which is where I
> > noticed the problem.
> 
> ...
> 
> > I am running Ubuntu 12.04 with the default desktop.

> LO 5.0 builds just fine in Precise / 12.04. See
> https://launchpad.net/~libreoffice/+archive/ubuntu/ppa?field.series_filter=precise
> for newer packages.

OK. I found a tar ball for 5.0.2.2 which *does* build on Ubuntu 12.04.

However, when I try building from 'trunk' (or whatever it's called)
pulling in the source via git, compilation still fails, just as (well,
one line number's changed) happened just over three months ago
(https://ask.libreoffice.org/en/question/52435/what-version-of-gcc-do-i-need-to-build-libreoffice/
).  I did not get a usable answer then.

In response to my example patch at
https://bugs.documentfoundation.org/show_bug.cgi?id=94753 , I've been
told to use gerrit to discuss patch proposals.  Presumably I should at
least confirm that my patches compile in the developing form of
LibreOffice.  So, what version of gcc do I need to build LibreOffice?
Or is there a bug in include/rtl/ustring.hxx?  I don't know C++ well
enough to understand the problem.

Richard.


Re: Can't track flow of characters in from Input Method Editor

2015-10-08 Thread Richard Wordingham
On Thu, 8 Oct 2015 01:17:14 +0100
Richard Wordingham <richard.wording...@ntlworld.com> wrote:

> Thank you all for your inputs.

I've finally found where the problem materialises.  There is a callback
of GtkSalFrame::IMHandler::signalIMDeleteSurrounding() to delete one
'character'.  I now need to work out where the interfacing is in
error.  The intent of the call is to delete one Unicode character; it
is now a question of where the conversion from Unicode characters to
code units should be made.  It might be anywhere from KMfL to
signalIMDeleteSurrounding().  For hacking, there is the good news that
when KMfL decides to delete two Unicode characters, there are two calls
of the function, so I could fix *my* problem straightforwardly.

Does this appear to relate to any other known problems in interfacing
with ibus?

Richard.


Re: Can't track flow of characters in from Input Method Editor

2015-10-08 Thread Richard Wordingham
On Thu, 08 Oct 2015 10:18:15 +0100
Caolán McNamara <caol...@redhat.com> wrote:

> On Thu, 2015-10-08 at 08:52 +0100, Richard Wordingham wrote:
> > The intent of the call is to delete one Unicode character;

On reading the GTK documentation, it is clear that the arguments are
in terms of Unicode characters, and not UTF-16 code units. 

> I imagine you need to change signalIMDeleteSurrounding where we have
> nDeletePos = nPosition + offset and
> nDeleteEnd = nDeletePos + nchars
> and instead of adding "offset" and adding "nchars" you need to call
> getText on xText to get the string, then use
> OUString::iterateCodePoints to count forward from nPosition by
> "offset" IM codepoints to get the utf-16 offset for LibreOffice, and
> similarly iterateCodePoints by IM nchars to get the LibreOffice
> utf-16 nchars to delete.
> 
> might suck rocks for performance.

I can't fathom how getText() works - obfuscation by abstraction!
However, as using OUString::iterateCodePoints would appear to involve,
at the very least, copying a long string, I have coded up a similar
function that works directly with the 'editable accessible' string
(and associated data).  I have added a patch to the bug report
https://bugs.documentfoundation.org/show_bug.cgi?id=94753 .
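
The conversion at issue can be sketched independently of LibreOffice. The following is not the patch, only an illustration of the arithmetic: a request expressed in Unicode characters (cursor, offset, count) is turned into a range of UTF-16 code unit indices. The function names are made up for the example, and Python strings are used because they index by code point.

```python
def utf16_units(ch):
    """Number of UTF-16 code units needed for one character."""
    return 2 if ord(ch) > 0xFFFF else 1


def delete_range_utf16(text, cursor_cp, offset_cp, n_cp):
    """Map a deletion request in code points onto UTF-16 indices."""
    start_cp = cursor_cp + offset_cp
    end_cp = start_cp + n_cp
    start16 = sum(utf16_units(c) for c in text[:start_cp])
    end16 = sum(utf16_units(c) for c in text[:end_cp])
    return start16, end16


# Two supplementary characters, cursor after both, delete the one before it.
text = "\U0001148F\U000114C0"
print(delete_range_utf16(text, cursor_cp=2, offset_cp=-1, n_cp=1))
```

Adding the offset and count directly to a UTF-16 cursor position (4 here) would instead give the range 3 to 4, which covers only the low surrogate of the second character.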

Richard



Re: Can't track flow of characters in from Input Method Editor

2015-10-07 Thread Richard Wordingham
Thank you all for your inputs.

On Wed, 7 Oct 2015 09:57:14 +0200
Miklos Vajna  wrote:

> Writer "main text" gets all keyboard input in SwEditWin::KeyInput(),
> sw/source/uibase/docvw/edtwin.cxx. It's VCL that calls that member
> function, and in your case it's probably the VCL KDE backend in
> particular.

On Wed, 7 Oct 2015 22:20:01 +0800
Hung Mark  wrote:

> Since you mentioned that Writer exhibit the problem but Calc
> doesn't,you might want to take a look at
> sw/source/core/doc/extinput.cxx.

SwEditWin::KeyInput() is receiving the input not generated by the IME,
e.g. Latin and Thai as I have my keyboards set up, but the normal
character input generated by the IME (BMP Tai Tham and SMP Tirhuta) is
going to SwExtTextInput::SetInputData instead!  Backspaces generated by
hitting the 'rubout' key (labelled with a right-to-left arrow) follow
the non-IME route.  I do not yet know what happens to backspaces
generated by the IME.

On Wed, 07 Oct 2015 11:10:08 +0200
Jan-Marek Glogowski  wrote:

> I guess you're running Kubuntu 12.04, as you talk about KDE in this
> post.

The KDE code was a red herring.  The characters are coming in from the
basic X system via GtkSalFrame::signalKey, as one would expect for a
primarily Gnome system, despite the graphical shell being Unity.  So,
it's basically Ubuntu.

> LO 5.0 builds just fine in Precise / 12.04. See
> https://launchpad.net/~libreoffice/+archive/ubuntu/ppa?field.series_filter=precise
> for newer packages.

I'll give it another try.  Pre-release versions obtained via Git
wouldn't compile.

> We also had problems with Qt4 / all KDE applications and ibus. At the
> end we backported the 14.04 / trusty version of fcitx and use this
> currently :-(

I hope we haven't got a race condition.  I don't understand the order
of my monitoring outputs.  I was able to run LibreOffice under gdb
running from Emacs Version 23, whereas the combination failed under
Emacs 24.  (The two Emacsen use different interfaces to gdb,
which may be the reason for the difference.)  However, not only was I
not able to set a break point where I wanted (probably my lack of
competence), I could not reproduce the error.  I got no lone
surrogate!  This better behaviour has not been reproduced.

Richard.


Can't track flow of characters in from Input Method Editor

2015-10-06 Thread Richard Wordingham
On Sunday I raised bug report 94753 about the apparent generation of
lone surrogates in response to the use of Keyman for Linux under ibus
as the input method editor. I have compiled Version 4.4.4.3.0+ with
debug to facilitate my investigation; I think my compiler (gcc Version
4.6.3) is too old to compile Version 5.0, which is where I noticed the
problem.

I use Emacs as an IDE for debugging, but Emacs Version 24 does not seem
able to cope with Version 4.4.4.3.0+.  The debugger gdb run from the
terminal appears to be able to cope.  I have been trying to narrow down
the source of the error by inserting fprintf() calls.  However, I cannot
find where characters enter the program from the IME.  I am running
Ubuntu 12.04 with the default desktop.  The IME is KMfL running under
ibus.

I set up fprintf() and abort() calls to monitor the apparent sole call
of XmbLookupString (there are no visible calls of XwcLookupString) and
also within the call of SalKDEDisplay::checkdirectInputEvent().
However, inputting text from the Supplementary Multilingual Plane using
the IME to input characters generates neither output from the fprintf()
calls nor a core dump from abort().  Have I overlooked another route by
which characters are reaching the program?

My current suspicion is that Qt is not handling KMfL's replacement of
one supplementary character by another properly, but I cannot
demonstrate that.  My test input text sequence is the three characters
dYH, which when applied to an instrumented program using X generates
the characters U+1148F, U+114C0, U+0008 (also as symbol), U+114BF.  I
suspect that U+0008 is only cancelling the low surrogate of U+114C0,
and that this is happening in Qt code. I have seen similar behaviour
with Konsole, which I believe is a Qt application.  Claws mail,
Gnome-terminal, Emacs Version 24, gedit, Abiword and even LibreOffice
Calc all exhibit receipt of the correct sequence of characters, namely
<U+1148F, U+114BF>.  (Some of these do not display it properly, but
that is another issue.)
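
To show what the suspected failure would look like in UTF-16 terms, here is a small simulation. It is only a model of the hypothesis (the backspace removing one code unit instead of one character), not a trace of what Qt or LibreOffice actually does, and the helper names are mine.

```python
def to_utf16(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def lone_surrogates(units):
    """Return the indices of surrogate code units that are not part of a pair."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            i += 2  # well-formed pair, skip both units
            continue
        if 0xD800 <= unit <= 0xDFFF:
            lone.append(i)
        i += 1
    return lone


before = to_utf16("\U0001148F\U000114C0")
replacement = to_utf16("\U000114BF")
correct = before[:-2] + replacement    # backspace removes the whole character
suspected = before[:-1] + replacement  # backspace removes one code unit only

print([hex(u) for u in correct], lone_surrogates(correct))
print([hex(u) for u in suspected], lone_surrogates(suspected))
```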

Richard.


Re: Unicode 8.0?

2015-07-16 Thread Richard Wordingham
On Thu, 16 Jul 2015 17:40:06 +0100
Caolán McNamara <caol...@redhat.com> wrote:

> On Thu, 2015-07-16 at 11:53 +0200, Viktor Kovács wrote:
> > I would like to ask when will be adopted Old Hungarian fonts. It is
> > defined in the UNICODE 8.0, central-europe subgroup, and it must be
> > typed right to left writing.
> 
> The underlying requirement will be a version of icu that supports
> unicode 8, so someone needs to bump the icu version we're building
> against to version 56, and that's only at milestone 1 level at the
> moment so apparently not ready for stable use yet.
> 
> I imagine then there would be a need to extend the RTL support to
> include some additional language if there is a serious attempt to
> support this as a real thing.

Viktor, Have you tried LibreOffice with an Old Hungarian font of your
own?

ICU should have known years ago that the part of the SMP used for this
script is reserved for right-to-left scripts.  For the Bidi algorithm,
the characters had the appropriate properties long before they were
assigned to 'Old Hungarian'.
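
For anyone wanting to check the properties, a quick sketch using Python's unicodedata module. It assumes a Python whose Unicode database is at least version 8.0, and it only reports the classes of the assigned letters (U+10C80..U+10CB2 and U+10CC0..U+10CF2); the module does not expose the default classes of unassigned code points, so it cannot show the earlier state of affairs.

```python
import unicodedata


def bidi_classes(first, last):
    """Return the set of Bidi_Class values in the inclusive code point range."""
    return {unicodedata.bidirectional(chr(cp)) for cp in range(first, last + 1)}


print(unicodedata.unidata_version)
print("capitals:", bidi_classes(0x10C80, 0x10CB2))
print("small letters:", bidi_classes(0x10CC0, 0x10CF2))
```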

If you need to specify the language, I suggest you set it to
Hebrew.

Richard.


Re: Univerbation

2015-07-07 Thread Richard Wordingham
On Tue, 07 Jul 2015 09:55:38 +0100
Caolán McNamara <caol...@redhat.com> wrote:

> On Mon, 2015-07-06 at 09:13 +0100, Richard Wordingham wrote:
> > What mechanisms does ODF have to indicate that a sequence of word
> > characters constitutes a word?
> 
> But generally we follow the rules of the underlying icu version that
> LibreOffice is built against.

Thanks for answering.  For the problem, see
http://bugs.icu-project.org/trac/ticket/11766 . I am therefore checking
for possible solutions in the likely event that U+2060 and U+FEFF
suppress word breaks and no new character (I intend to suggest U+2065)
is provided to suppress word breaks.

Richard.


Univerbation

2015-07-06 Thread Richard Wordingham
What mechanisms does ODF have to indicate that a sequence of word
characters constitutes a word?

Having such a mechanism is useful for spell-checking Thai and other
languages where the boundaries between words are not marked.  At
present, one can cancel spurious boundaries by inserting U+2060 WORD
JOINER.  Words formed thus can be entered in personal spelling
dictionaries.  This is the only mechanism I am aware of.  However, it is
currently intended (announcement to private Unicore list only) to
modify the Unicode Standard for Version 8.0 this month to state that
U+2060 should not have any effect on determining word boundaries;
its function will merely be to suppress line breaks.

I view this as a kick in the teeth of users of languages such as Thai,
but so far I am the only one to have responded.  The only workaround
I can see is to add a word joining character (e.g. U+2065) to Unicode
and hope that LibreOffice supports U+2060 as a word-joining character
until the new character becomes available.

Richard.


Re: Adding Languages to Writer's Character, Font Menu

2015-07-02 Thread Richard Wordingham
On Wed, 24 Jun 2015 23:40:10 +0200
Michael Stahl <mst...@redhat.com> wrote:

> On 24.06.2015 23:26, toki wrote:

> > That is part of the reason why I think the whole Western/CJKV/CTL
> > split should be thrown out, and replaced with language/writing
> > system, supplemented by locale data.

> that's a great idea in theory, unfortunately it would throw out any
> hope of compatibility with Microsoft Office as well

How does one achieve compatibility with per script font-selection as
shown in
http://blogs.msdn.com/b/officeinteroperability/archive/2013/04/22/office-open-xml-themes-schemes-and-fonts.aspx
 ?

For that matter, how does the current scheme square with a style having
separate fonts for ASCII and other Latin characters - the *four*-way
split ASCII / 'High ANSI' / Complex Script / East Asian?

Richard.



Re: Adding Languages to Writer's Character, Font Menu

2015-06-30 Thread Richard Wordingham
On Tue, 30 Jun 2015 17:48:05 +0200
Eike Rathke <er...@redhat.com> wrote:

> On Monday, 2015-06-29 20:40:46 +0200, Khaled Hosny wrote:

> > We already handle this at the text shaping level in VCL for
> > platforms where HarfBuzz is used.
> 
> I think we talk about two different things here.

Yes.  Khaled and I are focused on handling text, whether fundamentally
present or generated by field codes and the like.  What you are talking
of makes most sense for when there is no relevant user-input text. 

> My view is from
> correct language tag attribution that we need anyway, for document
> storage

I don't understand that one.

> and spell-checkers

Seems to work for 'unsupported' nod-TH.  Tai Tham script is encountered
and identified as complex (as demonstrated by the choice of font), so
the language is taken to be nod-TH and the text is checked against the
nod-TH spelling dictionaries.

(Mind you, they're only populated as nod-Lana-TH.  The fun starts when
we want to distinguish what might be called nod-Thai-TH-etymological,
nod-Thai-TH-Chiangmai and nod-Thai-TH-Chiangrai.)

 and locale dependent representation.

Presumably for generated text.  Yes, here language and country will in
general be inadequate.

 When
 I mention language tag I'm always talking about BCP 47 language
 tags. You, and possibly Richard, have the runtime view and what could
 be automatically detected. So, even if detected automatically we'll
 have to assign a language tag that for the non-default script of a
 language includes the ISO 15924 script code.

 snip arbitrary Western/CTL/CJK classification snip

 The correct route to go is probably to
 assign known scripts to these classes, whether detected automatically
 or not,

Which is already being done, though conceivably going directly from
character to class.

 and distribute language tags according to their (implied or
 not) script over those classes.

I'm not sure I follow you here.  A supported language tag will have
corresponding strings for automatically generated text, and these
strings will generally imply the font.  The only exception I can think
of is common script text, where perhaps script information will be
required to select the styling.  This just requires a default script
for each supported language code (i.e. minimal BCP 47 tag), though we
could get away with default script class.
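
A rough Python sketch of the "default script for each supported language
code" idea may help here.  Both tables are illustrative assumptions (in
particular, treating Lana as the default script for nod), not data taken
from LibreOffice, and parse_tag() only understands simple
language[-Script][-REGION] tags.

```python
# Illustrative tables only; every entry is an assumption made for this sketch.
DEFAULT_SCRIPT = {"en": "Latn", "sr": "Cyrl", "th": "Thai", "nod": "Lana", "ja": "Jpan"}
SCRIPT_CLASS = {
    "Latn": "Western", "Cyrl": "Western",
    "Thai": "CTL", "Lana": "CTL",
    "Jpan": "CJK",
}


def parse_tag(tag):
    """Split a simple language[-Script][-REGION] tag; other subtags are ignored."""
    parts = tag.split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha() and script is None:
            script = part.title()      # ISO 15924 codes are four letters
        elif len(part) == 2 and part.isalpha() and region is None:
            region = part.upper()      # ISO 3166-1 alpha-2 region
    return language, script, region


def script_class(tag):
    """Return 'Western', 'CTL' or 'CJK', or None when the script is unknown."""
    language, script, _ = parse_tag(tag)
    script = script or DEFAULT_SCRIPT.get(language)
    return SCRIPT_CLASS.get(script)


def minimal_tag(tag):
    """Drop the script subtag when it is the language's default script."""
    language, script, region = parse_tag(tag)
    if script == DEFAULT_SCRIPT.get(language):
        script = None
    return "-".join(p for p in (language, script, region) if p)
```

With these tables, nod-TH and nod-Thai-TH both land in the CTL class,
while pi-Latn and pi-Thai land in different classes, which is the case
where the bare language code is not enough.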

Richard.


Licence to Convert Dictionary to Spell-Checker Dictionary

2015-06-29 Thread Richard Wordingham
One way of producing a spelling dictionary is to take the words from
a near-normal dictionary and use them.  Does publishing such a
dictionary require the permission of the dictionary's copyright
holder?  If it's relevant, the dictionary was published in Thailand.

I appreciate that one ought to do a lot more work than just that step
to make a good spelling dictionary.

If I need permission, what licences would be suitable for making the
spelling dictionary available via LibreOffice?

Richard.


Re: Adding Languages to Writer's Character, Font Menu

2015-06-29 Thread Richard Wordingham
On Mon, 29 Jun 2015 20:40:46 +0200
Khaled Hosny khaledho...@eglug.org wrote:

 On Mon, Jun 29, 2015 at 12:14:44PM +0200, Eike Rathke wrote:
  Hi Richard,
  
  On Wednesday, 2015-06-24 20:54:54 +0100, Richard Wordingham wrote:
  
   The script is generally implicit in the text.
  
  You want to rely on automatic detection of scripts depending on the
  language chosen? Do you plan to implement that? However, even then
  the resulting tag would include the script code if it wasn't the
  default script of the language.
 
 Almost every character in Unicode has a script property; the
 exceptions are characters that have Inherited (usually combining marks)
 or Common (mostly punctuation), but there is a simple and pretty
 reliable way to resolve the script of those characters from the
 context.

Indeed, the route I had in mind was:

1) Determine script from character(s).

2) Categorise script as Western/CTL/CJK

3) Locale is then the Western locale, the CTL locale or the CJK locale
as appropriate.

Unless one first categorises the script, one does not know what the
language is.
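
The three steps above can be sketched in Python roughly as follows.
This is only an illustration: the range table is a hand-picked fragment
rather than the real Unicode Script property, the function names are
made up for the example, and Common/Inherited characters are resolved
by the simple rule of borrowing the script of the nearest resolved
neighbour (preceding first, then following).

```python
from itertools import groupby

# Hand-picked code point ranges, for illustration only.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"), (0x0061, 0x007A, "Latn"),
    (0x00C0, 0x00D6, "Latn"), (0x00D8, 0x00F6, "Latn"), (0x00F8, 0x024F, "Latn"),
    (0x0400, 0x04FF, "Cyrl"),
    (0x0E01, 0x0E3A, "Thai"), (0x0E40, 0x0E5B, "Thai"),
    (0x1A20, 0x1AAD, "Lana"),
    (0x4E00, 0x9FFF, "Hani"),
]
SCRIPT_CLASS = {"Latn": "Western", "Cyrl": "Western",
                "Thai": "CTL", "Lana": "CTL", "Hani": "CJK"}


def script_of(ch):
    """Step 1: script of one character, or None for Common/Inherited/unknown."""
    cp = ord(ch)
    for lo, hi, script in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return script
    return None


def resolve_scripts(text):
    """Give unresolved characters the script of a neighbouring character."""
    scripts = [script_of(ch) for ch in text]
    last = None
    for i, s in enumerate(scripts):          # borrow from the left
        if s is None:
            scripts[i] = last
        else:
            last = s
    nxt = None
    for i in range(len(scripts) - 1, -1, -1):  # leading characters borrow from the right
        if scripts[i] is None:
            scripts[i] = nxt
        else:
            nxt = scripts[i]
    return scripts


def tag_runs(text, locales):
    """Steps 2 and 3: classify each run and pick the locale for its class."""
    classes = [SCRIPT_CLASS.get(s) for s in resolve_scripts(text)]
    runs, pos = [], 0
    for cls, group in groupby(classes):
        n = len(list(group))
        runs.append((text[pos:pos + n], locales.get(cls)))
        pos += n
    return runs
```

For example, with locales = {"Western": "en-GB", "CTL": "th-TH", "CJK":
"ja-JP"}, tag_runs("Hi, กุหลาบ!", locales) yields one en-GB run and one
th-TH run.  Latin and Cyrillic both fall in the Western class, so this
route cannot tell the two Serbian scripts apart.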

Now, with more support, one may need the script.  For example, a
Serbian date field should depend on the script (Latin v. Cyrillic) as
well as just the language, and Serbian is not the only language using
competing scripts in the same class.  However, what a date field picks
up from its environment is curious.  If I copy a Thai date field and
paste it into the middle of an English word, I get a date in English!

Richard.


Re: Adding Languages to Writer's Character, Font Menu

2015-06-29 Thread Richard Wordingham
On Wed, 24 Jun 2015 21:26:50 +
toki toki.kant...@gmail.com wrote:

 I'll simply point to the current version of Microsoft Office, which is
 claimed, by Microsoft, to support more than 7,000 languages.
 
 As far as UI design goes, there are at least four options.
 1) Offer everything, listed alphabetically;
 2) Select the writing system, which is roughly 200 choices, then the
 language, and then, when needed, the locale;
 3) Select the writing system, which is roughly 200 choices, then the
 locale, which is roughly 250 choices, and then the language, which, in
 the worst case scenario, is a thousand options;

Do you mean 'script' when you say 'writing system'?  Few languages
share a writing system - Welsh, English, French and German have four
different writing systems.

Richard.


Re: Adding Languages to Writer's Character, Font Menu

2015-06-25 Thread Richard Wordingham
On Wed, 24 Jun 2015 20:54:54 +0100
Richard Wordingham richard.wording...@ntlworld.com wrote:

 On Wed, 24 Jun 2015 12:31:16 +0200
 Eike Rathke er...@redhat.com wrote:

  Simply in a css::lang::Locale set the Language field to qlt and in
  the Variant have the language tag, see
  http://api.libreoffice.org/docs/idl/ref/structcom_1_1sun_1_1star_1_1lang_1_1Locale.html

 It may be 'simply' to you, but my macro to set the language doesn't
 progress beyond the '::' before 'Locale', failing with Object not
 accessible.

Part of my trouble was using '::' instead of '.' in the multi-part
names when writing in Basic.  Another part was forgetting that I could
pass an integer or a struct in the same field.

However, the approach using executeDispatch() failed.  The unusual
languages were simply reported as en-GB, and were recorded thus in
saved .odt files.

However, I now have successful macros of the form:

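Sub Lue
    Dim region As Object
    Dim aLocale As New com.sun.star.lang.Locale

    ' "qlt" in Language means the full language tag is held in Variant
    aLocale.Country = ""
    aLocale.Language = "qlt"
    aLocale.Variant = "khb-CN"

    ' Apply the locale to the complex-script text of the first selected range
    region = ThisComponent.CurrentSelection.getByIndex(0)
    region.CharLocaleComplex = aLocale
End Sub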

As I can now fairly readily mark complex-script text as khb-CN, kkh-MM,
nod-TH and tts-TH (and all within a few lines of one another), what
problems should I expect?  (I suppose I should try to make this into an
extension.)
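
For reference, the convention the macro relies on (Language set to
"qlt", full tag in Variant) can be modelled outside LibreOffice.  The
Locale class below is a plain-Python stand-in for the three string
fields of the UNO struct, not the UNO binding itself, and leaving
Country empty simply mirrors the macro above; whether LibreOffice wants
Country filled in alongside "qlt" is not something this sketch settles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locale:
    """Stand-in for the Language/Country/Variant fields of css.lang.Locale."""
    Language: str = ""
    Country: str = ""
    Variant: str = ""


def locale_from_tag(tag):
    """Wrap an arbitrary language tag the way the macro does."""
    return Locale(Language="qlt", Country="", Variant=tag)


def tag_from_locale(locale):
    """Recover the language tag from either style of Locale."""
    if locale.Language == "qlt":
        return locale.Variant
    # Old-style locale: language and country only.
    return "-".join(p for p in (locale.Language, locale.Country) if p)


for tag in ("khb-CN", "kkh-MM", "nod-TH", "tts-TH"):
    assert tag_from_locale(locale_from_tag(tag)) == tag
```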

Richard.


Re: Adding Languages to Writer's Character, Font Menu

2015-06-24 Thread Richard Wordingham
On Tue, 23 Jun 2015 21:07:12 +
toki toki.kant...@gmail.com wrote:

 On 06/22/2015 07:30 PM, Richard Wordingham wrote:
 
  How do I add a language to this menu so that fonts that can will
  render text in the style appropriate to the language? 

I've been getting a fair bit of information off list, though it's not of
immediate use to me.

Most relevantly, arbitrary recognised languages can be entered for the
Western scripts -
https://wiki.documentfoundation.org/ReleaseNotes/4.3#Adding_a_new_language_tag .
There is a bug report out for this prima facie racist behaviour - Bug
#81714: https://bugs.documentfoundation.org/show_bug.cgi?id=81714 ,
where there is a brief explanation of why the capability is so limited.

It's been claimed that this is the tip of a big iceberg:

The ideal would be to allow the following capabilities:

* Tag text according to its language tag rather than using an LCID,
  given even Windows uses langtags now.
* Allow arbitrary lang tags to be used in a text anywhere
* Add ability to read language support from say ldml file as
  configuration (should this go with a doc, no idea)
* Be able to associate a language with CTL/CJK.

Each of these points are huge undertakings (well in pairs perhaps),
which would take considerable community political will to see happen.
But as a wise man once said: a single minority language has virtually
no cost benefit, but 2000 languages changes the equation considerably.

I presume LibreOffice is intended to support OpenDocument.  On this
basis, I would say: 

* Tag text according to its language tag rather than using an LCID,
  given even Windows uses langtags now.

OpenDocument does this.

* Allow arbitrary lang tags to be used in a text anywhere

OpenDocument allows these - it is just a question of how much
LibreOffice supports this.  I believe the UNO interface supports this,
but I won't be sure until I've tried it.  One problem is that
OpenDocument depends on an undefined split of text into Western, CTL
and CJK text - a useful trick but a bad design.

* Add ability to read language support from say ldml file as
  configuration (should this go with a doc, no idea)

This hits the problem that ICU looks broken in this respect.  One is
meant to compile in the supported languages - and the line-breaking
algorithms require human intervention, because the definitions cannot be
compiled to efficient code.

* Be able to associate a language with CTL/CJK.

This is impossible for a few languages.  Several languages exist in
competing scripts of different categories - Sanskrit and Pali may be
written in the Latin script as well as in Indic scripts, and I think
Sanskrit is also available in CJK.  Several languages are used in both
the Latin script and in the national CTL script or in the Arabic script.

However, why is this association necessary?  In the bug report, Eike
Rathke wrote, "The existing predefined CTL/CJK tags respectively their
corresponding LCID values occur in various switch cases to be acted
differently upon."  This needs further elucidation.  It's not obvious
why mistagging is to be preferred.

To return to Jonathon (= Toki)'s advice:

 My recommendation is that you file an RFE for each language and locale
 that you'd like to use in LibO.

With something like 2000 languages, the pick lists will be
overwhelmed.  My preference would be to allow them all, but to have
a flexible method of selecting which appear in the lists.  Most people
won't use many - and a dozen or so choices of varieties of English or
French in a 1-D list is overwhelming, and two dozen choices for Arabic
is also excessive.

 Whilst one can fake it, by using a different language/locale with
 similar characteristics, that doesn't help, if one wants to do spell
 checking and grammar checking in your documents of those specific
 languages.

I'm surprised at the central control of these, especially at the
experimental level.  Is one really meant to mislabel text while
developing and testing such tools?

Richard.


Re: Adding Languages to Writer's Character, Font Menu

2015-06-24 Thread Richard Wordingham
On Wed, 24 Jun 2015 11:52:49 +0200
Eike Rathke er...@redhat.com wrote:

  If I have some text with khb-CN as the language and
  region and then try to set the language for a greater expanse of
  text, khb-CN does not come up in the menu.

N.B. By 'language' and 'region', I mean language and region for complex
text.  I tend to forget that one doesn't tag characters with a language,
but sets a tag conditional on the character's script class.

 Does it come up if the cursor is positioned on a portion of text that
 already has the tag assigned?

Yes, khb-CN comes up if the cursor is between the characters tagged as
such when complex.  If the character before is not khb-CN and the
character after is khb-CN, it does not come up.

Richard. 


Re: Adding Languages to Writer's Character, Font Menu

2015-06-24 Thread Richard Wordingham
On Wed, 24 Jun 2015 12:31:16 +0200
Eike Rathke er...@redhat.com wrote:

  * Allow arbitrary lang tags to be used in a text anywhere

  OpenDocument allows these - it is just a question of how much
  LibreOffice supports this.
 
 It does.
 
  I believe the UNO interface supports this,
  but I won't be sure until I've tried it.
 
 Simply in a css::lang::Locale set the Language field to qlt and in
 the Variant have the language tag, see
 http://api.libreoffice.org/docs/idl/ref/structcom_1_1sun_1_1star_1_1lang_1_1Locale.html

It may be 'simply' to you, but my macro to set the language doesn't
progress beyond the '::' before 'Locale', failing with "Object not
accessible. Invalid object reference."  I was using vanilla LibreOffice
4.3.3.2. My macro shorn of superfluous comments read:

sub Tai_Lue3
dim dispatcher as object
document = ThisComponent.CurrentController.Frame
dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
' dim args1(0) as new com.sun.star.beans.PropertyValue
' dim args1(0) as new css::lang::Locale
' dim args1(0) as new com::sun::star::lang::Locale
dim args1(0) as new com::sun::star::lang::locale
args1(0).Language = "qlt"
args1(0).Variant  = "khb-CN"
dispatcher.executeDispatch(document, ".uno:Language", "", 0, args1())
end sub

The macro recorded from using the combobox just records the LCID
generated on the fly, which is not much use.  It wouldn't mean the same
from editing session to editing session.

  * Be able to associate a language with CTL/CJK.

  This is impossible for a few languages.  Several languages exist in
  competing scripts of different categories - Sanskrit and Pali may be
  written in the Latin script as well as in Indic scripts, and I think
  Sanskrit is also available in CJK.  Several languages are used in
  both the Latin script and in the national CTL script or in the
  Arabic script.
 
 Then you will have different language tags that include the script,
 and have one associated with Western and one with CTL. I don't see
 the problem.

I am having great difficulty seeing why one should want to specify the
script for a barely supported writing system, let alone the class of
script. My thought was that the language code would suffice.  The script
is generally implicit in the text. As far as text properties are
concerned, the class of script would be implicit in the box in which
the language name was entered.

Richard.


Re: Adding Languages to Writer's Character, Font Menu

2015-06-23 Thread Richard Wordingham
(Copy to list for reference - I accidentally replied to Caolán alone.)

On Tue, 23 Jun 2015 08:59:04 +0100
Caolán McNamara caol...@redhat.com wrote:

 The language combo-box allows you to enter arbitrary language tags.
 What happens if you just enter khb-CN in there.

Using vanilla Version: 4.3.3.2, Build ID:
9bb7eadab57b6755b1265afa86e04bf45fbfc644 on Ubuntu 12.04 with Unity
desktop, I can't enter text in that box.  If I tab to the box so that
it is highlighted, 'k' changes the selection to Kannada, 'h' changes the
selection to 'Khmer', 'b', '-' and 'c' have no effect, and 'N' changes
the selection to N'ko.  (The effects seem to depend on the speed of
typing - the 'h' can change the selection to Hebrew and 'b' can change
it to Bengali (Bangladesh).)

The trick of inserting the language in the .xml files does not
work well.  If I have some text with khb-CN as the language and region
and then try to set the language for a greater expanse of text, khb-CN
does not come up in the menu.

Ubuntu's Version: 4.4.4.2 Build ID: 40m0(Build:2) is no better.

Richard.


Adding Languages to Writer's Character, Font Menu

2015-06-22 Thread Richard Wordingham
How do I add a language to this menu so that fonts that can will render
text in the style appropriate to the language?  I am reconciled to
having to create a bespoke version of LibreOffice, though I'd rather
not.

Manually editing a document's XML files would be the last
resort - it seems to work!  While it gets the language into the pick
list, this is only while the selection includes text in that language.
I haven't explored this method, though it does suggest a workable clumsy
technique.

An example is a Tai Lue style (language khb-CN).  LibreOffice (more
precisely, HarfBuzz) includes data to enable the conversion of 'khb' to
OpenType tag 'XBD '.  I'm styling Lanna script glyphs for language, and
selecting Lao as the complex-script language gives me Lao-style glyphs
from the font in LibreOffice.
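
As a small illustration of that conversion, here is a Python sketch of
a language-to-OpenType lookup.  The table holds only the pairs
mentioned in this message ('khb' to 'XBD ', and PLG for the Palaung
codes); it is not HarfBuzz's table or API, and ot_language_tag() is a
name invented for the example.

```python
# Pairs taken from this message only; a real shaper carries a far larger table.
OT_LANGSYS = {"khb": "XBD", "pce": "PLG", "pll": "PLG", "rbb": "PLG"}


def ot_language_tag(bcp47_tag):
    """Return the 4-byte OpenType language system tag, or None if not listed."""
    language = bcp47_tag.split("-")[0].lower()
    name = OT_LANGSYS.get(language)
    if name is None:
        return None
    # OpenType tags are four bytes, padded on the right with spaces.
    return name.ljust(4).encode("ascii")
```

Only the language subtag is consulted here, so khb-CN and khb-LA map to
the same OpenType tag.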

The only mechanism I can see that might work for Tai Lue is to request
the installation of a scrappy dictionary for the language (perhaps even
empty?) but this method feels wrong.  My wish list for additions,
assuming I only have to include language and country, is:

khb-CN  Tai Lue
nod-TH  Northern Thai
kkh-MM  Tai Khuen
tts-TH  North-Eastern Thai

and, lower down the scale of desire, support for

khb-LA  Tai Lue of Laos

and for various Palaung languages (relevant one unestablished), all
OpenType code PLG:

pce-MM – Ruching
pll-MM – Shwe
rbb-MM – Rumai

If script matters, it gets more complicated.  The above 8 are all for
the Lanna script, but for generally useful dictionary support for
some of the languages one should concentrate on:

nod_Thai-TH
tts_Thai-TH
khb_Laoo-LA

and I'm not sure which script for Palaung - probably Myanmar, but
definitely not Lanna. 

A dictionary for pi_Lana (Pali) would be good to have - I'm not
sure about the relevance of national variations, though.  I'm not
sure how well a multiscript dictionary will work.  A pi_Latn dictionary
is reported to have been developed, but it's not available for
download.  Pali is on the pick list for Western scripts, but is not
available for 'complex' scripts.  The national variations in Pali
generally go with script. (Of course, there are two very different, but
apparently equivalent, orthographies for pi_Thai-TH.)

Richard.


Re: Adding Extension for Experimental Thai Spelling

2012-09-27 Thread Richard Wordingham
On Thu, 27 Sep 2012 11:52:26 +0700
Nathan Wells sungk...@gmail.com wrote:

 1. If you are shutting off the ICU breakiterator for text following,
 we
 should probably also do it for text preceding. Thus if there is a
 ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break
 iteration is disabled for the whole sentence.
 
 Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU
 break iteration should be disabled for the whole sentence.

What is the logic of this?

The use cases I see are:

1) The user always marks word breaks with ZWSP.

In this case, the ideal is to switch off the break iterator for the
language.

2) The user never marks word breaks.

In this case, the user is totally dependent on the break iterator, and
cannot be helped when it fails.

3) The user only marks word breaks and non-word breaks when the iterator
fails.

In this case, the iterator need only be switched off from the point of
override until it can clearly re-synch.  The obvious re-synching points
are word external punctuation, such as end-of-line, white space,
quotation marks, commas and dandas (and as dandas I would include U+0E2F
THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5
KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai
ฯลฯ and ฯเปฯ).

Now, it may be easier to explain the rule if it applies to the whole
'word' - for what we are looking at is pretty much a 'word' as
understood by dictionariless editors.

4) Different parts of the text come from different sources - some mark
word breaks, others expect the application to correctly identify them.

A ZWSP in a chunk of text would then tag the text as having come from a
user in case 1 or 3; we have no reliable way of distinguishing the
two cases.  A WJ (U+2060) or ZWNBSP (U+FEFF) (when not a BOM, so
paragraph initial is suspect) would strongly suggest use case 3 - but
might occur in use case 1 if the user has had to fight a break
iterator.

(end of use cases)

Considering these four use cases, it seems simplest to let ZWSP, WJ and
ZWNBSP disable the iterator for the extent of the dictionariless word
in which it occurs.
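
To make the suggestion concrete, here is a minimal Python sketch.  It
assumes a caller-supplied dictionary_break() function standing in for
the ICU dictionary-based iterator, and a deliberately small set of
re-sync characters; both are assumptions for the example, and the
exceptions for ฯลฯ and ฯเปฯ mentioned above are not handled.

```python
import re

ZWSP, WJ, ZWNBSP = "\u200b", "\u2060", "\ufeff"
OVERRIDES = (ZWSP, WJ, ZWNBSP)

# Re-sync points: white space plus some word-external punctuation,
# including U+0E2F PAIYANNOI and U+17D5 BARIYOOSAN.  The set is illustrative.
PUNCT = ",.;:!?\"'\u0e2f\u17d5"
RESYNC = re.compile(r"[\s" + re.escape(PUNCT) + "]+")


def split_words(text, dictionary_break):
    """Split text into words, honouring user overrides per dictionariless word."""
    words = []
    for chunk in RESYNC.split(text):      # each chunk is a dictionariless word
        if not chunk:
            continue
        if any(c in chunk for c in OVERRIDES):
            # The user has taken control here: break only at ZWSP.
            for piece in chunk.split(ZWSP):
                piece = piece.replace(WJ, "").replace(ZWNBSP, "")
                if piece:
                    words.append(piece)
        else:
            # No override: rely on the dictionary-based iterator.
            words.extend(dictionary_break(chunk))
    return words
```

A ZWSP in one dictionariless word has no effect on the next one, which
is where the slightly greater use of WJ noted below comes from.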

What is the definition of an ICU sentence boundary?  I see no evidence
from CLDR 2.9 that it should be even approximately right for Khmer (or
Thai). Splitting Thai text into sentences is known to be challenging -
we can therefore expect different applications to split text
differently.

The one downside I can see to my suggestion is that if all word
boundaries are marked, switching the iterator off dictionariless word
by dictionariless word will require slightly greater use of WJ, for a
ZWSP later in the sentence will not necessarily be in the same
dictionariless word.

A related issue that seems not to be handled is the repetition mark
U+0E46 THAI CHARACTER MAIYAMOK.  It should be separated from the preceding
alphabetic characters by a space, but LibreOffice doesn't recognise
the sequence as a possible continuation of the word.  Sometimes it
is a necessary part of a word.  I don't know what the situation is in
Khmer.

Richard.


Re: Adding Extension for Experimental Thai Spelling

2012-09-27 Thread Richard Wordingham
On Thu, 27 Sep 2012 21:08:13 +0700
Nathan Wells sungk...@gmail.com wrote:

 Firstly, you are right, I was mistaken about ICU and the breakiterator
 working for sentences (I just tried it right now and it does work,
 but just not with the normal khan or period of Khmer rather it
 works with Latin sentence markers which is not enough).  I had
 thought when we put in the code for the breakiterator that it also
 covered the sentence, but I guess not (I will work towards getting it
 working for Khmer).

It may be worth modifying the CLDR definition - sentence breaks can be
customised, though it is presently only done for Greek.  However, if
you want Khmer *sentence* rather than *clause* breaking, it will need a
lot of work - papers are still being published on breaking Thai into
sentences (e.g. www.mt-archive.info/Coling-2010-Slayden.pdf ).

 In response to your comments:
 
  1) The user always marks word breaks with ZWSP.
  In this case, the ideal is to switch off the break iterator for the
  language.
 
 
 There is some truth to this - and that is why I had it as my last
 option (just turning the whole thing off). But the ICU breakiterator
 for Khmer actually works quite well with normal language - it breaks
 down when there are proper names. So turning it off is an option, but
 not the most ideal solution. Some users will continue to always mark
 breaks with a ZWSP (for full control), but I also think having the
 option to turn it off for more complex sentences would be ideal.
 
  2) The user never marks word breaks.
  In this case, the user is totally dependent on the break iterator,
  and cannot be helped when it fails.
 
 As I said above, I think a both/and solution would be ideal for Khmer.
 But if in the end it would work better for Thai to have an off and
 on option only, that would be fine for Khmer as well for now, until
 we can come up with a more ideal solution.
 
 
  3) The user only marks word breaks and non-word breaks when the
  iterator fails.
 
 The problem with this in Khmer is the user cannot tell when the
 breakiterator fails, unless it is on a line-break.  A word could be
 broken up into three parts and the user would never know it.

I usually notice iterator failures in Thai with unrecognised words,
which prompts red ink over strange extents. Usually the words are not
recognised because they're misspelt, but not always.  The problem I see
in Thai is usually not so much as extra word boundaries as misplaced
word boundaries. 

 Actually, if users could see where the
 breakiterator is breaking words, that would simplify things a lot.

That is a very significant observation.

 The only problem with this would be at the beginning of a document or
 the beginning of any new re-syncing segment because you might run
 into something like this:

 User input (example in English so others can make sense of it I hope):
 wordwordwordwordword.
 How the sentence is broken up by the breakiterator: wo r d word word
 wo rd word.
 User adds ZWSP to fix broken word on line-break: wo r d word word
 ZWSPwordword.

This example confuses me.  The problem here seems to be extra word
breaks rather than missing word breaks, and I don't see how confirming
a word break helps.

 But user has no idea the first word is broken incorrectly and that it
 is also spelled incorrectly.

 This is why it would be best (I think) as Martin suggested that when
 a ZWSP is detected it also turn off break iteration for the previous
 words up until a re-sync point.  This would practically give the user
 an off option for the whole document if they so chose, and without
 the confusion of having to find some option in the Tools menu to turn
 it on or off - it would just be automatic, depending on the user's
 habit.

I was clearly not clear enough.  In the example above,
'wordwordwordwordword' is what I would call a dictionariless word - a
word-breaker without a dictionary (e.g. a shell's parser) would see it
as just one 'word'.  Therefore, once ZWSP is inserted and
word-breaking disabled, dictionary-based word-breaking is not applied to
wordwordwordZWSPwordword, and, typically, red squiggles appear under
wordwordword and wordword.  The boundary may be revealed by a phase
discontinuity or gap in the squiggle.  Under the proposed scheme, the
user has to introduce another three ZWSPs even if the dictionary contains
all the words.

 I agree with this:
 
  Considering these four use cases, it seems simplest to let ZWSP, WJ
  and ZWNBSP disable the iterator for the extent of the
  dictionariless word in which it occurs.

 Except, it also should disable the breakiterator up to the previous
 re-sync point...

But that is what I meant!

 But actually, there is a rule in ICU for the MAIYAMOK
 so unless that is not working properly, I am not sure why LibreOffice
 doesn't break correctly...

I'll have to look further into this - and check that misbehaviour is
still happening.  Squiggly lines is what I chiefly remember.  There may
also be a Hunspell issue 

Re: Adding Extension for Experimental Thai Spelling

2012-07-27 Thread Richard Wordingham
On Thu, 26 Jul 2012 16:33:00 +0700
Martin Hosken martin_hos...@sil.org wrote:

 1. use of U+2060 makes string searching and spell checking harder
 (unless WJ chars are stripped for searching and spell checking). They
 are not part of the spelling of a word, so their introduction in the
 underlying text stream is problematic for other text processing
 processes (like searching as mentioned). This is less of an issue for
 U+200B ZWSP because that occurs between words and searching across
 word boundaries is a rarer activity. Likewise spell checking across
 word boundaries isn't really needed.

U+2060 WJ should definitely be skipped for searching and, once it has
done its gluing job, spell-checking look-up, just like U+00AD SOFT
HYPHEN.  They're both indubitable complete ignorables for collation and
therefore for UCA (Unicode Collation Algorithm) search.
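
In code, the skipping amounts to removing the two characters before the
comparison or dictionary look-up.  A minimal Python sketch, with
function names invented for the example:

```python
# U+2060 WORD JOINER and U+00AD SOFT HYPHEN are dropped before comparing.
IGNORABLE = {0x2060: None, 0x00AD: None}


def lookup_key(word):
    """Form of a word to use for dictionary look-up or searching."""
    return word.translate(IGNORABLE)


def find_ignoring_joiners(haystack, needle):
    """True if needle occurs in haystack once WJ and SHY are ignored."""
    return lookup_key(needle) in lookup_key(haystack)
```

The stored text keeps its WJ; only the key used for matching loses it.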

 Now what happens if I want to put zw around a word that occurs  20
 chars after my last zw? The on off nature of the zw has now been
 inverted. One option is to say that zw must always occur in pairs and
 you would have to bracket your first or second word there. But then
 management of which zw is on and which is off will get confusing for
 users.

I think that is the wrong way of looking at it.  Various characters,
some ZWSP, others more natural, such as SP, tell the break iterators
where some word boundaries are.  The rule we would have is that the
break iterator should not try to break runs of less than, say, 20
characters if one of the boundaries is provided by ZWSP.  I am not
proposing that we limit how many breaks it makes in a run - 21
characters could be broken into seven words.  The short runs the break
iterator is prohibited from breaking can still be checked for spelling.
If they are not words, then the user can respond to the red wiggly line
appropriately, e.g. by putting extra word breaks in.
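
A minimal Python sketch of that rule, assuming a caller-supplied
dictionary_break() in place of the real break iterator, a chunk that is
already delimited by natural boundaries such as spaces, and a count of
code points rather than anything more refined:

```python
ZWSP = "\u200b"
MIN_RUN = 20  # the "say, 20 characters" threshold


def break_with_threshold(chunk, dictionary_break, min_run=MIN_RUN):
    """Break one space-delimited chunk, leaving short ZWSP-bounded runs whole."""
    if ZWSP not in chunk:
        return list(dictionary_break(chunk))
    words = []
    for run in chunk.split(ZWSP):
        # Every run here has at least one boundary supplied by ZWSP.
        if not run:
            continue
        if len(run) < min_run:
            words.append(run)                    # left whole for the spell checker
        else:
            words.extend(dictionary_break(run))  # long runs are still broken up
    return words
```

A short run that is not a word simply fails the spelling check, and the
user can respond by adding more breaks.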

In the example you gave, one would have to split the words between the
delimited words.  I think the users must accept that - the rule we
would be working with is that the break iterator does not break short
runs created by inserted ZWSP, and that is a simple rule to
understand.  I suppose there may be some question of what to count -
base consonants perhaps? (In Unicode jargon, that would be extended
grapheme clusters.)  That might be a luxury feature we never need to
add.

Richard.


Re: Adding Extension for Experimental Thai Spelling

2012-02-17 Thread Richard Wordingham
On Fri, 17 Feb 2012 14:10:21 +
Caolán McNamara caol...@redhat.com wrote:

 On Thu, 2012-02-16 at 23:24 +, Richard Wordingham wrote:
 Indeed, yeah, I suppose, assuming it's as complicated as Thai, that
 the right direction would be for someone to write for icu new
 dictionary-based breakiterators for the nod(?) language and then the
 rather trivial changes to LibreOffice to know about the language in
 order to mark text as that language to bubble that info down to icu

Northern Thai's not quite as simple or standardised as Siamese!  One can
meet (at least) the following spelling systems:

1) Chiangmai phonetics
2) Chiangrai phonetics (different mapping of tones to Siamese spelling
rules)
3) Transliteration from Tai Tham script (probably rare for connected
text)
4) Tai Tham script

However, perhaps dictionary-based break iterators are something to be
treated like dictionaries.  There are several other writing systems
that could probably benefit from them:

Thai script:
  Northern Thai
  NE Thai (for recording songs - use of Siamese tone rules scrambles
  the tonemarks compared to Siamese cognates)

Khmer script:
  Khmer - there's already a project for this set up on SourceForge.
  Pali

Tai Tham script:
  Tai Khuen
  Tai Lue
  Pali

Lao script:
  Lao

Tibetan script:
  Tibetan

I've a feeling Burmese may also have a need for dictionary based text
breaking, though it's better behaved for syllable breaking than most of
the others listed here.  Shan would come in the same category.

The above list is not exhaustive.  Tai Lue in Lao script probably
belongs in the list.

Not all Thai script writing systems need a break iterator - some of the
minority languages separate words with spaces, but that's partially a
matter of literacy - Thais start writing Thai with interword gaps and
then learn to suppress the gaps.  Pali written in Thai also separates
words with spaces - but Pali has some very long words!

Richard.


Re: Adding Extension for Experimental Thai Spelling

2012-02-16 Thread Richard Wordingham
On Tue, 14 Feb 2012 16:19:17 +
Caolán McNamara caol...@redhat.com wrote:

 I think this change:
 http://cgit.freedesktop.org/libreoffice/core/commit/?id=475d0c59c66fb7752d230f76130b17145aad0c12
 should improve matters a lot.

It's a vast improvement - it gives LibreOffice a real Thai
spell-checker.  Thank you.  I have one worry for Siamese - Németh László
suggested that there might be a licensing issue back in
http://openoffice.2283327.n4.nabble.com/Thai-line-breaking-td2791315.html .

If there isn't such an issue, does this mean we can hope to see your
fix in LibreOffice 3.5.1?

> Makes กุหลาบ get treated as a single
> word in the unit test there now anyway, though the Northern Thai one
> is still not considered a single word, that might be due to the
> oldish icu we're still using.

I wouldn't expect a dictionary-based line breaker to handle words from
other languages.  (There's a whole slew of Mon-Khmer languages in
Thailand, and they mostly use the Thai script when they happen to get
written.)  I can work my way round the problem using the sticking
plaster of ZWSP and WJ (no-break no-space), and I think some use of
them or an equivalent is inevitable when the sequence of visible
characters doesn't define the breaks.  In particular, after gluing
กุ๊หลาบ together with WJ, Hunspell offered me กุหลาบ as a correction,
which is good.

There may be some rough edges with ZWSP and WJ going into the
dictionary (TBC), but what you've done will justify LibreOffice claiming
a Thai spell checking capability.
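
A minimal sketch of the ZWSP/WJ workaround described above, in Python.
The helper names glue() and strip_controls() are hypothetical and are
not part of Hunspell or LibreOffice; stripping the invisible characters
before a dictionary lookup is only an assumption about one way the
rough edges might be smoothed.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: invisible break opportunity
WJ = "\u2060"    # WORD JOINER: invisible, forbids a break at that point


def glue(*parts):
    """Join syllables with WJ so a breaker keeps them together (hypothetical helper)."""
    return WJ.join(parts)


def strip_controls(word):
    """Remove ZWSP and WJ, e.g. before looking the word up in a dictionary (assumption)."""
    return word.replace(WJ, "").replace(ZWSP, "")


# The Northern Thai spelling from this thread, glued at the syllable boundary.
glued = glue("กุ๊", "หลาบ")
plain = strip_controls(glued)
```

Here glued carries one invisible U+2060 between the two syllables, and
plain is the visible spelling with that character removed again.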

Minority language support may not be compatible with libthai - at least
one language uses a combining underline, and some of the mark
combinations used for minority languages would get rejected by the WTT
rules that libthai supports.

Richard.


Re: Adding Extension for Experimental Thai Spelling

2012-02-13 Thread Richard Wordingham
Thank you to everyone who's offered me advice.

On Mon, 13 Feb 2012 15:08:20 +
Caolán McNamara caol...@redhat.com wrote:

> I don't think we have any way to override our breakiterators from
> extensions.

Ah well, I'll just have to try to get Thai spell-checking working for
myself and then worry about sharing my changes - assuming I succeed.

> I'd be sort of interested in confirming that what we have right now
> actually works correctly, in the sense that Thai text definitely *is*
> getting run through the special Thai-specific icu word break handler.

It's definitely going through a Siamese-specific word-breaker for
line-breaking.  For example, the two-syllable Thai word กุหลาบ
'rose' moves to the next line, but when I convert it to the Northern
Thai form กุ๊หลาบ (not the spelling I'd favour) by adding a
(non-spacing) tone mark, it's promptly broken between lines along the
syllable boundary, although the first syllable does not constitute a
word, at least not one recorded in the Royal Institute Dictionary. I'm
glad to find that inserting U+2060 WJ prevents that break. The
spell-checker seems to break up a phrase consisting of just กุหลาบ into
3 or 4 words.
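
To illustrate why one extra tone mark defeats a dictionary-based
breaker, here is a toy greedy longest-match segmenter in Python.  It is
only a sketch: segment() and the one-word DICTIONARY are hypothetical,
and ICU's real Thai breaker uses a different and more elaborate
algorithm with its own fallback for unknown text.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (toy stand-in for a dictionary breaker).

    Text that matches no dictionary entry is emitted one code point at a time.
    """
    max_len = max(map(len, dictionary))
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                break
        else:
            size = 1  # nothing matched: fall back to a single code point
        words.append(text[i:i + size])
        i += size
    return words


DICTIONARY = {"กุหลาบ"}  # hypothetical one-word Siamese dictionary

siamese = segment("กุหลาบ", DICTIONARY)    # found whole in the dictionary
northern = segment("กุ๊หลาบ", DICTIONARY)  # extra tone mark: no entry matches
```

The Siamese spelling comes back as a single word, while the Northern
Thai spelling falls apart into more than one piece because no
dictionary entry covers it.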

Richard.


Adding Extension for Experimental Thai Spelling

2012-02-11 Thread Richard Wordingham
As I understand it, the lack of a usable Thai spell-checker for
LibreOffice (unlike, say, a Khmer spell-checker) is due to the Thai
break iterator.  (I had expected Thai and Khmer to face similar
problems, for neither has a visible word separator and syllable
boundaries are often unclear in both.)  Tagging Thai script text as
Khmer does not work (at least, not in Version 3.4.5); the word
boundaries are still determined by the Thai break iterator.

Is it possible to create an experimental alternative to the Thai
break iterator that can be shared with other people as a LibreOffice
extension? I would be prepared to routinely use U+200B ZERO WIDTH SPACE
(ZWSP) to separate words in the Thai script, but I suspect Thais would
not.  Also, I can see my first useful version fouling up the
rendering of pre-existing text.  I can't work out how to create a break
iterator as an *extension*. Could someone please advise me how, e.g. by
pointing to the documentation or an example?  I can find documentation
for *publishing* an extension, but that does not address *creating* an
extension.
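
For illustration only, a break iterator that relies purely on explicit
U+200B could be sketched in Python as below.  The function name
words_from_zwsp() is hypothetical, and this is not how LibreOffice or
ICU determines Thai word boundaries; it simply treats ZWSP and ordinary
white space as the only word separators.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE, typed by the author between words

# ZWSP is a format character, not white space, so it is listed explicitly.
_SEPARATORS = re.compile(f"[{ZWSP}\\s]+")


def words_from_zwsp(text):
    """Split text into words at ZWSP or white space (hypothetical helper)."""
    return [w for w in _SEPARATORS.split(text) if w]


words = words_from_zwsp("กุหลาบ" + ZWSP + "แดง")
```

Text typed without ZWSP would come back as one long "word", which is
the practical drawback noted above.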

Richard.