Re: Tagging text as being in arbitrary complex-script languages
On Tue, 23 Apr 2019 17:35:10 +0200
Eike Rathke wrote:

> Hi Richard,
>
> On Thursday, 2019-04-18 20:40:01 +0100, Richard Wordingham wrote:
> > It sounds as though one has to specify the script where there is
> > doubt as to what type of script will dominate. Is it an issue if
> > there are two competing scripts of the same type, e.g Thai v. Lanna
> > for Northern Thai? A dual script dictionary would correct
> > inefficiently.

> Competing in the sense two different scripts under one language tag?
> I wouldn't do that and IMHO it would be wrong.

It's worse than that. The spoken language nod-TH resolves, ignoring
subregional variations, into the three written groups:

nod-Lana-TH
nod-Thai-etymo-TH (name but not concept declared unsuitable on 10 Jan)
nod-Thai-phonetic-TH (ditto)

The scheme 'nod-Thai-etymo-TH' often accompanies published material in
nod-Lana-TH. The New Testament is published in nod-Lana-TH and
'nod-Thai-phonetic-TH'.

Until I can find names for the Thai-script variants more specific to
Northern Thai, my plan is to handle the difference by letting the user
choose the dictionary if I ever get round to Thai script Northern Thai
dictionaries. The biggest need I see for the variant tags is user
interfaces.

The Lana script dictionary is highly desirable for handling the visual
ambiguities in the script for the vernacular languages and has high
priority. Eyeballs are probably good enough for the Thai script.

Richard.
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: Tagging text as being in arbitrary complex-script languages
On Tue, 23 Apr 2019 18:00:22 +0200
Eike Rathke wrote:

> On Friday, 2019-04-19 03:32:34 +0100, Richard Wordingham wrote:
> > In answer to what was intended to be a rhetorical question, I
> > suppose und-Latn-t-sa-m0-iast and und-Latn-t-sa-m0-iso would work
> > for the normative forms.
>
> Seem.. at least when entered at https://r12a.github.io/app-subtags/ in
> the Check form it doesn't overly complain.

It seems that some people think that IAST also defines a Cyrillic
representation, so I think the 'Latn' is justified.

> However, I'd avoid 'und', to me it annotates as "can't determine what
> this could be" and in fact it is listed as Undetermined.

Well, as the two systems are international standards (the 'i' in 'iast'
and 'iso'), it should be hard to tell whether the intended audience is
English, German, Japanese or whatever. The what of the underlying
content is contained in the extension - in this case the 'sa'.

> Yes, that's ugly, but unavoidable. For which sa-Latn would be a better
> solution.

And allow for mixtures of the two schemes!

Richard.
Re: Tagging text as being in arbitrary complex-script languages
On Thu, 18 Apr 2019 20:40:01 +0100
Richard Wordingham wrote:

> On Thu, 18 Apr 2019 12:25:11 +0200
> Eike Rathke wrote:
> > Though with sa-Latn
> > I doubt there's a use case, so I wouldn't call that "correct" in
> > common sense.
>
> So how do you suggest we tag Sanskrit in Latin script?

In answer to what was intended to be a rhetorical question, I suppose
und-Latn-t-sa-m0-iast and und-Latn-t-sa-m0-iso would work for the
normative forms. I've successfully loaded a mocked up extension for the
former (as explicitly using a Western script), though I don't much like
the consequent tagging in the document's content.xml.

That's a problem with the 't' extension. Transliteration may change the
language of place names in isolation, but it doesn't really change the
language of paragraphs of text.

Richard.
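
To make the structure of a tag such as und-Latn-t-sa-m0-iast concrete, here is a minimal Python sketch that pulls the 't' (transformed content) extension apart. The function name parse_t_extension is hypothetical, and the parsing is deliberately simplified: it assumes the 't' singleton is not hidden behind a private-use 'x' section and it does not validate subtags against any registry.

```python
import re

def parse_t_extension(tag):
    """Split out the 't' extension of a language tag (simplified sketch).

    Returns None when the tag has no 't' singleton. Otherwise returns the
    source language subtags and the key/value fields that follow them.
    Assumption: no private-use ('x') section precedes the 't' singleton.
    """
    subtags = tag.lower().split("-")
    try:
        start = subtags.index("t", 1)  # a singleton never comes first
    except ValueError:
        return None
    source, fields, key = [], {}, None
    for sub in subtags[start + 1:]:
        if len(sub) == 1:
            break  # the next singleton ends the 't' extension
        if re.fullmatch(r"[a-z][0-9]", sub):
            key = sub  # field key such as 'm0'
            fields[key] = []
        elif key is None:
            source.append(sub)  # still inside the source language tag
        else:
            fields[key].append(sub)
    return {
        "source": "-".join(source),
        "fields": {k: "-".join(v) for k, v in fields.items()},
    }

print(parse_t_extension("und-Latn-t-sa-m0-iast"))
```

For the tag above, the source language comes out as 'sa' and the 'm0' field as 'iast', which matches the reading that the "what" of the content lives in the extension rather than in the primary language subtag.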
Re: Tagging text as being in arbitrary complex-script languages
On Thu, 18 Apr 2019 12:25:11 +0200
Eike Rathke wrote:

> What I usually did is, lookup the language at SIL and the Ethnologue
> and use the most prevalent script as implied default script. Which
> here https://www.ethnologue.com/language/san would lead to
> Devanagari, but in this case more important is also what MS assigned
> the LCID for.

So I shouldn't be misled by the fact that the CTL script I most
frequently write Sanskrit in is Thai -:) Seriously, though, I believe
the script of sa-TH is Thai rather than Devanagari, and I am quite sure
that the script of sa-MM is Mymr.

It sounds as though one has to specify the script where there is doubt
as to what type of script will dominate. Is it an issue if there are
two competing scripts of the same type, e.g. Thai v. Lanna for Northern
Thai? A dual script dictionary would correct inefficiently.

> > "sa-150" Sanskrit written using European conventions - so, any
> > script, but, at least for Devanagari, the anusvara sign is not used
> > for homorganic nasals.
>
> Though valid, LibreOffice doesn't use the numeric UN M.49 code, it may
> be accepted but might not work everywhere.
>
> > "sa-Deva-150" Sanskrit written in Devanagari in the manner used in
> > Europe.
>
> Same here.
>
> > "sa-Latn" Sanskrit written in the Roman script.
> >
> > "sa-Latf" Sanskrit written in Fraktur (I'm not sure that this
> > exists. It might need a hint as to where to find a Fraktur script
> > with a combining candrabindu.)
>
> Both perfectly valid, if they serve any purpose. Though with sa-Latn
> I doubt there's a use case, so I wouldn't call that "correct" in
> common sense.

So how do you suggest we tag Sanskrit in Latin script? Within English
works, it's not uncommon for any Sanskrit quoted precisely to be in the
Latin script; about half the English language articles in the
'International Journal of Sanskrit Research'
(http://www.anantaajournal.com/) that quote Sanskrit passages quote
them in the Latin script. Several papers would benefit from the
application of sa-Latn proofing tools, though I don't deny that
proofing Sanskrit may be difficult. Moreover, I've only ever seen
U+0310 COMBINING CANDRABINDU in examples of Sanskrit in Latin text.

> I also just learned that sa-Latf somehow exists..

That example is in the same spirit as en-Thai (which I've successfully
used for privacy) and notes I've seen kept in en-Runr on a publicly
accessible whiteboard. I was wondering whether Sanskrit was printed in
Antiqua or Fraktur in early 20th Century Germany. You seem to think
neither.

Richard.
Re: Tagging text as being in arbitrary complex-script languages
On Wed, 17 Apr 2019 13:53:25 +0200
Eike Rathke wrote:

> > > On 4/15/19 12:26 PM, Eike Rathke wrote:
> > > > Adding arbitrary dictionary languages (as long as they strictly
> > > > follow the BCP 47 language tag specification) works since quite
> > > > a while (2014?) already.

> > An interesting experiment would be to try adding a language to both
> > Western and CTL (as with Mongolian and some minor SEA languages) or
> > Western and CJK (various Zhuang writing systems), though I suppose
> > it won't hurt to simply disambiguate by script.
>
> In fact you have to, or use an ISO 639-1/2/3 language code that
> implies a default script for one and specify an ISO 15924 script code
> for the other, which I was referring with "correct BCP 47 language
> tags".

Is there a pointer as to which tag sequences that "strictly follow the
BCP 47 language tag specification" are "correct"? As far as I can tell,
the following all strictly follow the specification:

"sa" Sanskrit, with no specification of the script or spelling
conventions.

"sa-IN" Sanskrit as used in India - so far as I can tell, that could be
in, for example, Devanagari, Grantha or even the Tamil script! For
Devanagari at least, I understand that this implies that homorganic
nasals may be written using U+0902 DEVANAGARI SIGN ANUSVARA.

"sa-150" Sanskrit written using European conventions - so, any script,
but, at least for Devanagari, the anusvara sign is not used for
homorganic nasals.

"sa-Deva-150" Sanskrit written in Devanagari in the manner used in
Europe.

"sa-Latn" Sanskrit written in the Roman script.

"sa-Latf" Sanskrit written in Fraktur (I'm not sure that this exists.
It might need a hint as to where to find a Fraktur script with a
combining candrabindu.)

The only Sanskrit tag sequence I can find in isolang.cxx is "sa-IN".

Richard.
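
As an illustration of why the tags listed above all count as well-formed, the following Python sketch splits a tag into language, script, region and variant subtags. It is a simplified reading of the BCP 47 grammar, not LibreOffice's own parser: the name split_tag is hypothetical, and extensions, private-use sections, extended language subtags and grandfathered tags are deliberately ignored. It checks shape only; it does not consult the subtag registry.

```python
import re

# Simplified shape: language[-script][-region][-variant...]
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)$"
)

def split_tag(tag):
    """Return the subtags of a simple language tag, or raise ValueError."""
    m = _TAG.match(tag)
    if m is None:
        raise ValueError("not a simple well-formed tag: " + tag)
    script, region = m.group("script"), m.group("region")
    return {
        "language": m.group("language").lower(),
        "script": script.title() if script else None,   # e.g. 'Deva'
        "region": region.upper() if region else None,   # 'IN' or '150'
        "variants": [v for v in m.group("variants").split("-") if v],
    }

for tag in ("sa", "sa-IN", "sa-150", "sa-Deva-150", "sa-Latn", "sa-Latf"):
    print(tag, split_tag(tag))
```

Under this sketch "sa-150" and "sa-IN" differ only in the region subtag, and "sa-Latn" and "sa-Latf" only in the script subtag; whether an application does anything useful with each is a separate question from well-formedness.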
Re: Tagging text as being in arbitrary complex-script languages
On Mon, 15 Apr 2019 15:14:49 +
jonathon wrote:

> On 4/15/19 12:26 PM, Eike Rathke wrote:
> > Adding arbitrary dictionary languages (as long as they strictly
> > follow the BCP 47 language tag specification) works since quite a
> > while (2014?) already.

Only if you hacked the text to declare the CTL or CJK language as
appropriate to be the one of the dictionary. Otherwise, you could only
use such a dictionary for a 'Western' script.

As recently as 2015, another issue was that I was having to regenerate
hunspell/utf_info.cxx for a LibreOffice build so that it would accept
word characters as word characters. I don't know how well that file
tracks the Unicode standard nowadays. When should Pali spell-checking
in the extended Lao script (Pali support to 1930's standards was only
added this year) only have problems due to the inadequacy of the
dictionaries?

> > New(er) in the mentioned mechanism is the
> > ability to add a language also to the CTL or CJK sections where
> > previously it was only possible to add to the (misnamed) "Western"
> > section, and give the language list entries a proper UI name
> > instead of showing just the language tag.

> Thanks.
> I wasn't aware that that functionality was present.
> I'll play with over the next month or so, then write about in my
> long-neglected blog.

An interesting experiment would be to try adding a language to both
Western and CTL (as with Mongolian and some minor SEA languages) or
Western and CJK (various Zhuang writing systems), though I suppose it
won't hurt to simply disambiguate by script.

In general, tagging has the potential to get very messy, e.g. Pali in
Lanna script as used in Northern Thailand as opposed to Pali in Lanna
script as used in North-eastern Thailand. (Yes, there are systematic
spelling differences between the two.)

Richard.
Re: Tagging text as being in arbitrary complex-script languages
On Wed, 10 Apr 2019 15:13:52 +0200
Eike Rathke wrote:

> Hi Richard,
>
> On Wednesday, 2019-04-10 04:02:53 +0100, Richard Wordingham wrote:
>
> > I was also able to get SIL's oxttools to work sufficiently
>
> What are those oxttools and where to get them?

Tools for assembling extensions for LibreOffice, particularly
dictionaries and the like. They're available at
https://github.com/silnrsi/oxttools . It looks as though there may be
some tools for assembling dictionaries, but I haven't dug deeply into
them.

Richard.
Re: Tagging text as being in arbitrary complex-script languages
On Mon, 8 Apr 2019 16:17:38 +0200
Eike Rathke wrote:

> ScriptType value 3 here means CTL. The values are explained in
> officecfg/registry/schema/org/openoffice/VCL.xcs under
>

Thank you for the information, and thanks to Stephan Bergmann for the
localisation information.

For plodders like me, the definitions are:

officecfg/registry/schema/org/openoffice/VCL.xcs (content, as stated by
Eike)

officecfg/registry/component-schema.dtd (syntax of VCL.xcs)

officecfg/registry/component-update.dtd (syntax and some semantics of
extension writer's dictionaries.xcu; the allowed information content is
given in VCL.xcs.)

I was also able to get SIL's oxttools to work sufficiently to work out
what I needed.

A dictionaries.xcu that works is:

http://openoffice.org/2001/registry; xmlns:xs="http://www.w3.org/2001/XMLSchema;>
Northern Thai
3
%origin%/nod_TH.aff %origin%/nod_TH.dic
DICT_SPELL
nod-TH

The LibreOffice extension manager seems tolerant and has some helpful
error reporting.

*My* next step is to sort out copyright issues so that I can share the
dictionary.

Richard.
Tagging text as being in arbitrary complex-script languages
https://wiki.documentfoundation.org/ReleaseNotes/5.4 says, "The
language list for text attribution now also displays BCP47 language
tags provided by dictionaries if a language is not known in the
predefined set of languages. (Eike Rathke (Red Hat, Inc.)) Such
additional language tags are placed in curly brackets / braces, for
example {en-DK}, and are displayed at the top of the list after the
[None] entry."

Is some additional information required in the .oxt file for the
dictionary if the script is not "Western text"? For example, I have
installed a dictionary (of my own devising) for language nod-TH, but it
only shows up (in LibreOffice 6.2.2.2) in the language list for Western
text. (The language is only written in CTL scripts - Thai and Lanna.)

The work-around of manually editing the XML of a writer document to
insert the language and country still works.

Richard.
Special Fonts for Spell Checking Northern Thai in Lanna Script
I am trying to put together a workable solution for spell-checking
Northern Thai in the Lanna (a.k.a. Tai Tham) script. I have a good idea
how to do it, and it is already working in Firefox. The solution may
not be suitable for run of the mill users, but I don't believe run of
the mill users need the solution. Additionally, a Thai or English user
interface is probably better than a Northern Thai interface.

There are a number of problems, but the significant ones all relate to
fonts. The others are all soluble.

1) The Universal Script Engine

The Universal Script Engine inserts far too many dotted circles into
Tai Tham text. Most closed syllables cannot be written in accordance
with Unicode's principle of phonetic ordering, and some cannot be
written at all. This I have overcome by creating a font that removes
inappropriate dotted circles. This turns the Universal Script Engine
into a solution for DirectWrite, HarfBuzz and AAT.

2) Scriptio Continua

The Tai languages in the Tai Tham script do not separate words by
spaces. The old solution to this problem, U+200B ZERO WIDTH SPACE,
works. (By contrast, Pali, at least in modern texts, tends to have
spaces between words, as is done in Pali in the Thai script.
Significant sandhi may suppress the word-breaks.)

3) Northern Thai is not supported by LibreOffice

It is, however, supported by Open Document Format. The solution is to
edit the XML file to set the CTL language in the XML, and then
propagate and edit text for which nod-TH is the CTL language.

The lack of a Northern Thai interface is probably not a problem. Any
need for it is emotional rather than practical. It is possible that
Burmese, Chinese, English and possibly Lao interfaces will similarly
cater for Tai Khuen and Tai Lue users.

4) Visually Ambiguous Spelling

Words that normally look identical may be sorted and pronounced
differently. Actually, there are surprisingly few visual homographs
with such differences.

So that users may see what they are typing, the solution I have adopted
is to colour code the glyphs so that users can see whether a consonant
precedes or follows the vowel of the syllable in coding and phonetic
order.

5) Font Support

Does LibreOffice support any type of multi-colour font? I may have to
devise a shape difference to indicate the spelling, which is less
appealing. This would be most important in choosing a spelling
correction. To see what it is that one has actually typed, switching to
a transliteration font and then undoing the change is one approach.

6) Font Selection

How does one control the font used in the spell-checking interface? I
am particularly interested in the solution for Ubuntu, but it would be
good to also know the solution for Windows. For Ubuntu, I suspect the
answer will lie in Fontconfig, but I first need to know how to identify
the font that LibreOffice tries to use. Fontconfig would work by
controlling the fallback.

Even without grammar coding, there may be an issue in that some Lanna
script fonts are barely usable in the User Interface - readable
Northern Thai text can need much greater vertical extent than English,
depending on the style.

7) Dictionary Creation

I currently have a large, working Northern Thai dictionary. I do need
to sort out IP issues before I can share it. Even then, there needs to
be a lot of shake-down testing to eliminate my typographical errors,
and birds, fish and trees need to be added.
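
The scriptio continua point (item 2) can be illustrated with a short Python sketch: if the writer marks word boundaries with U+200B ZERO WIDTH SPACE, a spell-checker front end can recover the words by splitting on that character as well as on ordinary white space. This is only a toy model of the idea, not how LibreOffice or Hunspell tokenise text; the function name split_words is hypothetical, and the sample uses Latin-letter stand-ins so that it displays without a Tai Tham font.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

def split_words(text):
    """Split text into words at white space and at ZERO WIDTH SPACE."""
    words = []
    for chunk in text.split():  # str.split() does not treat U+200B as white space
        words.extend(w for w in chunk.split(ZWSP) if w)
    return words

# Stand-in text: two 'words' joined only by an invisible ZWSP, then a space.
sample = "kham" + ZWSP + "mueang tua" + ZWSP + "mueang"
print(split_words(sample))
```

Each resulting word can then be looked up in the dictionary separately, which is all the checker needs from the boundary marks.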
Version of gcc for LibreOffice
On Wed, 07 Oct 2015 11:10:08 +0200
Jan-Marek Glogowski <glo...@fbihome.de> wrote:
(when topic was 'Can't track flow of characters in from Input Method
Editor')

> On 06.10.2015 at 23:51, Richard Wordingham wrote:

> > I think my compiler (gcc
> > Version 4.6.3) is too old to compile Version 5.0, which is where I
> > noticed the problem.
> > ...
> >
> > I am running Ubuntu 12.04 with the default desktop.

> LO 5.0 builds just fine in Precise / 12.04. See
> https://launchpad.net/~libreoffice/+archive/ubuntu/ppa?field.series_filter=precise
> for newer packages.

OK. I found a tar ball for 5.0.2.2 which *does* build on Ubuntu 12.04.
However, when I try building from 'trunk' (or whatever it's called)
pulling in the source via git, compilation still fails, just as (well,
one line number's changed) happened just over three months ago
(https://ask.libreoffice.org/en/question/52435/what-version-of-gcc-do-i-need-to-build-libreoffice/ ).
I did not get a usable answer then.

In response to my example patch at
https://bugs.documentfoundation.org/show_bug.cgi?id=94753 , I've been
told to use gerrit to discuss patch proposals. Presumably I should at
least confirm that my patches compile in the developing form of
LibreOffice.

So, what version of gcc do I need to build LibreOffice? Or is there a
bug in include/rtl/ustring.hxx? I don't know C++ well enough to
understand the problem.

Richard.
Re: Can't track flow of characters in from Input Method Editor
On Thu, 8 Oct 2015 01:17:14 +0100
Richard Wordingham <richard.wording...@ntlworld.com> wrote:

> Thank you all for your inputs.

I've finally found where the problem materialises. There is a callback
of GtkSalFrame::IMHandler::signalIMDeleteSurrounding() to delete one
'character'. I now need to work out where the interfacing is in error.

The intent of the call is to delete one Unicode character; it is now a
question of where the conversion from Unicode characters to code units
should be made. It might be anywhere from KMfL to
signalIMDeleteSurrounding(). For hacking, there is the good news that
when KMfL decides to delete two Unicode characters, there are two calls
of the function, so I could fix *my* problem straightforwardly.

Does this appear to relate to any other known problems in interfacing
with ibus?

Richard.
Re: Can't track flow of characters in from Input Method Editor
On Thu, 08 Oct 2015 10:18:15 +0100
Caolán McNamara <caol...@redhat.com> wrote:

> On Thu, 2015-10-08 at 08:52 +0100, Richard Wordingham wrote:
> > The intent of the call is to delete one Unicode character;

On reading the GTK documentation, it is clear that the arguments are in
terms of Unicode characters, and not UTF-16 code units.

> I imagine you need to change signalIMDeleteSurrounding where we have
> nDeletePos = nPosition + offset and
> nDeleteEnd = nDeletePos + nchars
> and instead of adding "offset" and adding "nchars" you need to call
> getText on xText to get the string, then use
> OUString::iterateCodePoints to count forward from nPosition by
> "offset" IM codepoints to get the utf-16 offset for LibreOffice, and
> similarly iterateCodePoints by IM nchars to get the LibreOffice
> utf-16 nchars to delete.
>
> might suck rocks for performance.

I can't fathom how getText() works - obfuscation by abstraction!
However, as using OUString::iterateCodePoints would appear to involve,
at the very least, copying a long string, I have coded up a similar
function that works directly with the 'editable accessible' string (and
associated data). I have added a patch to the bug report
https://bugs.documentfoundation.org/show_bug.cgi?id=94753 .

Richard
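
The conversion discussed above, from a count of Unicode characters to a count of UTF-16 code units, can be sketched in Python. This is not the patch attached to the bug report and not LibreOffice code; the function names are hypothetical, and only the parameter names (nPosition, offset, nchars) are borrowed from the quoted description. The cursor position is taken in UTF-16 code units, while the offset and the character count are taken in code points.

```python
def utf16_starts(text):
    """UTF-16 offset of each code point in text, plus the end offset."""
    starts, pos = [], 0
    for ch in text:
        starts.append(pos)
        pos += 2 if ord(ch) > 0xFFFF else 1  # supplementary chars need 2 units
    starts.append(pos)
    return starts

def delete_range(text, n_position, offset, n_chars):
    """Return (delete_pos, delete_end) in UTF-16 code units.

    n_position: cursor position in UTF-16 code units.
    offset, n_chars: counted in Unicode code points, as the input method does.
    """
    starts = utf16_starts(text)
    cursor = starts.index(n_position)  # ValueError if inside a surrogate pair
    first = cursor + offset
    last = first + n_chars
    if not 0 <= first <= last < len(starts):
        raise ValueError("range outside text")
    return starts[first], starts[last]

text = "\U0001148F\U000114C0"  # two supplementary characters = 4 code units
print(delete_range(text, 4, -1, 1))  # whole surrogate pair of the last character
```

With the cursor at code unit 4, deleting one character backwards yields the range (2, 4). Adding the offset and count directly to nPosition would instead give (3, 4) and remove only the low surrogate.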
Re: Can't track flow of characters in from Input Method Editor
Thank you all for your inputs.

On Wed, 7 Oct 2015 09:57:14 +0200
Miklos Vajna wrote:

> Writer "main text" gets all keyboard input in SwEditWin::KeyInput(),
> sw/source/uibase/docvw/edtwin.cxx. It's VCL that calls that member
> function, and in your case it's probably the VCL KDE backend in
> particular.

On Wed, 7 Oct 2015 22:20:01 +0800
Hung Mark wrote:

> Since you mentioned that Writer exhibit the problem but Calc
> doesn't,you might want to take a look at
> sw/source/core/doc/extinput.cxx.

SwEditWin::KeyInput() is receiving the input not generated by the IME,
e.g. Latin and Thai as I have my keyboards set up, but the normal
character input generated by the IME (BMP Tai Tham and SMP Tirhuta) is
going to SwExtTextInput::SetInputData instead! Backspaces generated by
hitting the 'rubout' key (labelled with a right-to-left arrow) follow
the non-IME route. I do not yet know what happens to backspaces
generated by the IME.

On Wed, 07 Oct 2015 11:10:08 +0200
Jan-Marek Glogowski wrote:

> I guess you're running Kubuntu 12.04, as you talk about KDE in this
> post.

The KDE code was a red herring. The characters are coming in from the
basic X system via GtkSalFrame::signalKey, as one would expect for a
primarily Gnome system, despite the graphical shell being Unity. So,
it's basically Ubuntu.

> LO 5.0 builds just fine in Precise / 12.04. See
> https://launchpad.net/~libreoffice/+archive/ubuntu/ppa?field.series_filter=precise
> for newer packages.

I'll give it another try. Pre-release versions obtained via Git
wouldn't compile.

> We also had problems with Qt4 / all KDE applications and ibus. At the
> end we backported the 14.04 / trusty version of fcitx and use this
> currently :-(

I hope we haven't got a race condition. I don't understand the order of
my monitoring outputs. I was able to run LibreOffice under gdb running
from Emacs Version 23, whereas the combination failed under Emacs 24.
(The two Emacsen use different interfaces to gdb, which may be the
reason for the difference.) However, not only was I not able to set a
break point where I wanted (probably my lack of competence), I could
not reproduce the error. I got no lone surrogate! This better behaviour
has not been reproduced.

Richard.
Can't track flow of characters in from Input Method Editor
On Sunday I raised bug report 94753 about the apparent generation of
lone surrogates in response to the use of Keyman for Linux under ibus
as the input method editor. I have compiled Version 4.4.4.3.0+ with
debug to facilitate my investigation; I think my compiler (gcc Version
4.6.3) is too old to compile Version 5.0, which is where I noticed the
problem. I use emacs as an IDE for debugging, but Emacs Version 24 does
not seem able to cope with Version 4.4.4.3.0+. The debugger gdb run
from the terminal appears to be able to cope.

I have been trying to narrow down the source of the error by inserting
fprintf() calls. However, I cannot find where characters enter the
program from the IME.

I am running Ubuntu 12.04 with the default desktop. The IME is KMfL
running under ibus. I set up fprintf() and abort() calls to monitor the
apparent sole call of XmbLookupString (there are no visible calls of
XwcLookupString) and also within the call of
SalKDEDisplay::checkdirectInputEvent(). However, inputting text from
the Supplementary Multilingual Plane using the IME to input characters
generates neither output from the fprintf() calls nor a core dump from
abort(). Have I overlooked another route by which characters are
reaching the program?

My current suspicion is that Qt is not handling KMfL's replacement of
one supplementary character by another properly, but I cannot
demonstrate that. My test input text sequence is the three characters
dYH, which when applied to an instrumented program using X generates
the characters U+1148F, U+114C0, U+0008 (also as symbol), U+114BF. I
suspect that U+0008 is only cancelling the low surrogate of U+114C0,
and that this is happening in Qt code. I have seen similar behaviour
with Konsole, which I believe is a Qt application. Claws mail,
Gnome-terminal, Emacs Version 24, gedit, Abiword and even LibreOffice
Calc all exhibit receipt of the correct sequence of characters, namely
. (Some of these do not display it properly, but that is another
issue.)

Richard.
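
The suspected failure, a backspace that cancels only the low surrogate of U+114C0, can be reproduced on paper with a few lines of Python. This is only an illustration of the arithmetic, not a trace of what Qt, ibus or LibreOffice actually do; the helper names utf16 and lone_surrogates are hypothetical.

```python
def utf16(text):
    """Return text as a list of UTF-16 code units."""
    data = text.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def lone_surrogates(units):
    """Indices of surrogate code units that are not part of a valid pair."""
    lone, i = [], 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            i += 2  # well-formed pair
            continue
        if 0xD800 <= u <= 0xDFFF:
            lone.append(i)
        i += 1
    return lone

typed = utf16("\U0001148F\U000114C0")
replacement = utf16("\U000114BF")

buggy = typed[:-1] + replacement  # backspace removed one code unit only
fixed = typed[:-2] + replacement  # backspace removed the whole character

print([hex(u) for u in buggy], lone_surrogates(buggy))
print([hex(u) for u in fixed], lone_surrogates(fixed))
```

In the one-unit case the high surrogate 0xD805 of U+114C0 is left stranded in front of the pair for U+114BF, which is the kind of lone surrogate the bug report describes.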
Re: Unicode 8.0?
On Thu, 16 Jul 2015 17:40:06 +0100
Caolán McNamara caol...@redhat.com wrote:

> On Thu, 2015-07-16 at 11:53 +0200, Viktor Kovács wrote:
> > I would like to ask when will be adopted Old Hungarian fonts. It is
> > defined in the UNICODE 8.0, central-europe subgroup, and it must be
> > typed right to left writing.

> The underlying requirement will be a version of icu that supports
> unicode 8, so someone needs to bump the icu version we're building
> against to version 56, and that's only at milestone 1 level at the
> moment so apparently not ready for stable use yet. I imagine then
> there would be a need to extend the RTL support to include some
> additional language if there is a serious attempt to support this as
> a real thing.

Viktor,

Have you tried LibreOffice with an Old Hungarian font of your own? ICU
should have known years ago that the part of the SMP used for this
script is reserved for right-to-left scripts. For the Bidi algorithm,
the characters had the appropriate properties long before they were
assigned to 'Old Hungarian'. If you need to specify the language, I
suggest you set it to Hebrew.

Richard.
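
For anyone wanting to check the bidirectional class of these characters, Python's standard unicodedata module reports it directly. The sketch below assumes a Python build whose Unicode data is at version 8.0 or later, so that Old Hungarian is assigned; it says nothing about the default classes of unassigned code points, which unicodedata does not expose.

```python
import unicodedata

def bidi_classes(text):
    """Return the Bidi_Class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

old_hungarian_a = "\U00010C80"  # OLD HUNGARIAN CAPITAL LETTER A
hebrew_alef = "\u05D0"          # HEBREW LETTER ALEF

print(unicodedata.unidata_version)
print(bidi_classes(old_hungarian_a + hebrew_alef + "a"))
```

Both the Old Hungarian letter and the Hebrew letter report class 'R', which is why tagging the text as Hebrew is a workable stand-in as far as directionality goes.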
Re: Univerbation
On Tue, 07 Jul 2015 09:55:38 +0100
Caolán McNamara caol...@redhat.com wrote:

> On Mon, 2015-07-06 at 09:13 +0100, Richard Wordingham wrote:
> > What mechanisms does ODF have to indicate that a sequence of word
> > characters constitutes a word?

> But generally we follow the rules of the underlying icu version that
> LibreOffice is built against.

Thanks for answering. For the problem, see
http://bugs.icu-project.org/trac/ticket/11766 . I am therefore checking
for possible solutions in the likely event that U+2060 and U+FEFF
suppress word breaks and no new character (I intend to suggest U+2065)
is provided to suppress word breaks.

Richard.
Univerbation
What mechanisms does ODF have to indicate that a sequence of word
characters constitutes a word? Having such a mechanism is useful for
spell-checking Thai and other languages where the boundaries between
words are not marked.

At present, one can cancel spurious boundaries by inserting U+2060 WORD
JOINER. Words formed thus can be entered in personal spelling
dictionaries. This is the only mechanism I am aware of.

However, it is currently intended (announcement to private Unicore list
only) to modify the Unicode Standard for Version 8.00 this month to
state that U+2060 should not have any effect on determining word
boundaries; its function will merely be to suppress line breaks. I view
this as a kick in the teeth of users of languages such as Thai, but so
far I am the only one to have responded.

The only work around I can see is to add a word joining character (e.g.
U+2065) to Unicode and hope that LibreOffice supports U+2060 as a
word-joining character until the new character becomes available.

Richard.
Re: Adding Languages to Writer's Character, Font Menu
On Wed, 24 Jun 2015 23:40:10 +0200
Michael Stahl mst...@redhat.com wrote:

> On 24.06.2015 23:26, toki wrote:
> > That is part of the reason why I think the whole Western/CJKV/CTL
> > split should be thrown out, and replaced with language/writing
> > system, supplemented by locale data.

> that's a great idea in theory, unfortunately it would throw out any
> hope of compatibility with Microsoft Office as well

How does one achieve compatibility with per script font-selection as
shown in
http://blogs.msdn.com/b/officeinteroperability/archive/2013/04/22/office-open-xml-themes-schemes-and-fonts.aspx
?

For that matter, how does the current scheme square with a style having
separate fonts for ASCII and other Latin characters - the *four*-way
split ASCII / 'High ANSI' / Complex Script / East Asian?

Richard.
Re: Adding Languages to Writer's Character, Font Menu
On Tue, 30 Jun 2015 17:48:05 +0200
Eike Rathke er...@redhat.com wrote:

> On Monday, 2015-06-29 20:40:46 +0200, Khaled Hosny wrote:
> > We already handle this at the text shaping level in VCL for
> > platforms where HarfBuzz is used.

> I think we talk about two different things here.

Yes. Khaled and I are focused on handling text, whether fundamentally
present or generated by field codes and the like. What you are talking
of makes most sense for when there is no relevant user-input text.

> My view is from correct language tag attribution that we need anyway,
> for document storage

I don't understand that one.

> and spell-checkers

Seems to work for 'unsupported' nod-TH. Tai Tham script is encountered,
identified as complex (as demonstrated by the choice of font), so
language nod-TH and corrected using the nod-TH spelling dictionaries.
(Mind you, they're only populated as nod-Lana-TH. The fun starts when
we want to distinguish what might be called nod-Thai-TH-etymological,
nod-Thai-TH-Chiangmai and nod-Thai-TH-Chiangrai.)

> and locale dependent representation.

Presumably for generated text. Yes, here language and country will in
general be inadequate.

> When I mention language tag I'm always talking about BCP 47 language
> tags. You, and possibly Richard, have the runtime view and what could
> be automatically detected. So, even if detected automatically we'll
> have to assign a language tag that for the non-default script of a
> language includes the ISO 15924 script code.

<snip arbitrary Western/CTL/CJK classification snip>

> The correct route to go is probably to assign known scripts to these
> classes, whether detected automatically or not,

Which is already being done, though conceivably going directly from
character to class.

> and distribute language tags according to their (implied or not)
> script over those classes.

I'm not sure I follow you here. A supported language tag will have
corresponding strings for automatically generated text, and these
strings will generally imply the font. The only exception I can think
of is common script text, where perhaps script information will be
required to select the styling. This just requires a default script for
each supported language code (i.e. minimal BCP 47 tag), though we could
get away with default script class.

Richard.
Licence to Convert Dictionary to Spell-Checker Dictionary
One way of producing a spelling dictionary is to take the words from a near-normal dictionary and use them. Does publishing such a dictionary require the permission of the dictionary's copyright holder? If it's relevant, the dictionary was published in Thailand. I appreciate that one ought to do a lot more work than just that step to make a good spelling dictionary. If I need permission, what licences would be suitable for making the spelling dictionary available via LibreOffice? Richard.
Re: Adding Languages to Writer's Character, Font Menu
On Mon, 29 Jun 2015 20:40:46 +0200 Khaled Hosny khaledho...@eglug.org wrote: On Mon, Jun 29, 2015 at 12:14:44PM +0200, Eike Rathke wrote: Hi Richard, On Wednesday, 2015-06-24 20:54:54 +0100, Richard Wordingham wrote:

The script is generally implicit in the text.

You want to rely on automatic detection of scripts depending on the language chosen? Do you plan to implement that? However, even then the resulting tag would include the script code if it wasn't the default script of the language.

Almost every character in Unicode has a script property; the exceptions are characters that have Inherited (usually combining marks) or Common (punctuation mostly), but there is a simple and pretty reliable way to resolve the script of those characters from the context.

Indeed, the route I had in mind was:

1) Determine script from character(s).
2) Categorise script as Western/CTL/CJK.
3) Locale is then the Western locale, the CTL locale or the CJK locale as appropriate.

Unless one first categorises the script, one does not know what the language is.

Now, with more support, one may need the script. For example, a Serbian date field should depend on the script (Latin v. Cyrillic) as well as just the language, and Serbian is not the only language using competing scripts in the same class. However, what a date field picks up from its environment is curious. If I copy a Thai date field and paste it into the middle of an English word, I get a date in English!

Richard.
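The three-step route in this message (script from character, script to class, class to locale) can be sketched in Python. This is only an illustration, not LibreOffice's code: the range table is a tiny hand-picked subset of the Unicode script data, the class assignments and the names `SCRIPT_RANGES`, `SCRIPT_CLASS`, `resolve_scripts` and `locale_runs` are assumptions made for the example, and Common/Inherited characters are simply resolved from the nearest neighbouring character.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical, heavily abridged script table (a real one would be built
# from the Unicode Scripts.txt data file).
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"), (0x0061, 0x007A, "Latn"),
    (0x00C0, 0x00D6, "Latn"), (0x00D8, 0x00F6, "Latn"),
    (0x0400, 0x04FF, "Cyrl"),
    (0x0E01, 0x0E3A, "Thai"), (0x0E40, 0x0E5B, "Thai"),
    (0x0E81, 0x0EDF, "Laoo"),
    (0x1A20, 0x1AAD, "Lana"),
    (0x4E00, 0x9FFF, "Hani"),
]

# Step 2: assumed assignment of scripts to the three classes.
SCRIPT_CLASS = {"Latn": "Western", "Cyrl": "Western", "Thai": "CTL",
                "Laoo": "CTL", "Lana": "CTL", "Hani": "CJK"}


def char_script(ch: str) -> Optional[str]:
    """Step 1: script of one character; None stands for Common/Inherited/unknown."""
    cp = ord(ch)
    for lo, hi, script in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return script
    return None


def resolve_scripts(text: str) -> List[Optional[str]]:
    """Resolve unknown characters from context: preceding script, else following."""
    scripts = [char_script(ch) for ch in text]
    last = None
    for i, s in enumerate(scripts):          # inherit from the preceding character
        if s is None:
            scripts[i] = last
        else:
            last = s
    nxt = None
    for i in range(len(scripts) - 1, -1, -1):  # leading characters look ahead
        if scripts[i] is None:
            scripts[i] = nxt
        else:
            nxt = scripts[i]
    return scripts


def locale_runs(text: str, locales: Dict[str, str]) -> List[Tuple[str, str]]:
    """Step 3: split text into runs, each tagged with the locale of its class."""
    runs: List[Tuple[str, str]] = []
    for ch, script in zip(text, resolve_scripts(text)):
        loc = locales[SCRIPT_CLASS.get(script, "Western")]
        if runs and runs[-1][1] == loc:
            runs[-1] = (runs[-1][0] + ch, loc)
        else:
            runs.append((ch, loc))
    return runs
```

Note that the sketch yields one locale per script class, which is exactly the limitation raised above: it cannot tell Serbian in Latin from Serbian in Cyrillic, because both fall in the Western class.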
Re: Adding Languages to Writer's Character, Font Menu
On Wed, 24 Jun 2015 21:26:50 + toki toki.kant...@gmail.com wrote: I'll simply point to the current version of Microsoft Office, which is claimed, by Microsoft, to support more than 7,000 languages. As far as UI design goes, there are at least four options. 1) Offer everything, listed alphabetically; 2) Select the writing system, which is roughly 200 choices, then the language, and then, when needed, the locale; 3) Select the writing system, which is roughly 200 choices, then the locale, which is roughly 250 choices, and then the language, which, in the worst case scenario, is a thousand options; Do you mean 'script' when you say 'writing system'? Few languages share a writing system - Welsh, English, French and German have four different writing systems. Richard.
Re: Adding Languages to Writer's Character, Font Menu
On Wed, 24 Jun 2015 20:54:54 +0100 Richard Wordingham richard.wording...@ntlworld.com wrote: On Wed, 24 Jun 2015 12:31:16 +0200 Eike Rathke er...@redhat.com wrote:

Simply in a css::lang::Locale set the Language field to qlt and in the Variant have the language tag, see http://api.libreoffice.org/docs/idl/ref/structcom_1_1sun_1_1star_1_1lang_1_1Locale.html

It may be 'simply' to you, but my macro to set the language doesn't progress beyond the '::' before 'Locale', failing with Object not accessible.

Part of my trouble was using '::' instead of '.' in the multi-part names when writing in Basic. Another part was forgetting that I could pass an integer or a struct in the same field. However, the approach using executeDispatch() failed. The unusual languages were simply reported as en-GB, and were recorded thus in saved .odt files.

However, I now have successful macros of the form:

    Sub Lue
        ' Tag the first selected range as Tai Lue (khb-CN) for complex-script text.
        dim region as object
        dim aLocale As New com.sun.star.lang.Locale
        aLocale.Country = ""
        aLocale.Language = "qlt"      ' placeholder code; the real tag goes in Variant
        aLocale.Variant = "khb-CN"
        region = ThisComponent.CurrentSelection.getByIndex(0)
        region.CharLocaleComplex = aLocale
    end sub

As I can now fairly readily mark complex-script text as khb-CN, kkh-MM, nod-TH and tts-TH (and all within a few lines of one another), what problems should I expect? (I suppose I should try to make this into an extension.)

Richard.
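For readers who want to generate the same Locale values for several tags, here is a small Python sketch of the convention used in the macro: Language set to qlt, Country left empty, and the whole tag in Variant. The `Locale` class is only a stand-in with the same three field names as com.sun.star.lang.Locale, and `qlt_locale` and its loose shape check are hypothetical helpers. The sketch does not talk to LibreOffice, and whether Country ought to be filled in for such tags is not settled in this thread.

```python
import re
from dataclasses import dataclass


@dataclass
class Locale:
    """Stand-in mirroring the three string fields of com.sun.star.lang.Locale."""
    Language: str = ""
    Country: str = ""
    Variant: str = ""


# Loose shape check only (an assumption for the example, not a BCP 47 validator):
# a 2-8 letter primary subtag followed by hyphen-separated alphanumeric subtags.
_TAG_SHAPE = re.compile(r"^[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*$")


def qlt_locale(tag: str) -> Locale:
    """Build the field values used in the macro: qlt plus the full tag in Variant."""
    if not _TAG_SHAPE.match(tag):
        raise ValueError(f"not shaped like a language tag: {tag!r}")
    return Locale(Language="qlt", Country="", Variant=tag)


# The four tags mentioned in the message.
LOCALES = {tag: qlt_locale(tag) for tag in ("khb-CN", "kkh-MM", "nod-TH", "tts-TH")}
```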
Re: Adding Languages to Writer's Character, Font Menu
On Tue, 23 Jun 2015 21:07:12 + toki toki.kant...@gmail.com wrote: On 06/22/2015 07:30 PM, Richard Wordingham wrote: How do I add a language to this menu so that fonts that can will render text in the style appropriate to the language? I've been getting a fair bit of information off list, though it's not of immediate use to me. Most relevantly, arbitrary recognised languages can be entered for the Western scripts - https://wiki.documentfoundation.org/ReleaseNotes/4.3#Adding_a_new_language_tag . There is a bug report out for this prima facie racist behaviour - Bug #81714: https://bugs.documentfoundation.org/show_bug.cgi?id=81714 , where there is a brief explanation of why the capability is so limited. It's been claimed that this is the tip of a big iceberg: The ideal would be to allow the following capabilities: * Tag text according to its language tag rather than using an LCID, given even windows uses langtags now. * Allow arbitrary lang tags to be used in a text anywhere * Add ability to read language support from say ldml file as configuration (should this go with a doc, no idea) * Be able to associate a language with CTL/CJK. Each of these points are huge undertakings (well in pairs perhaps), which would take considerable community political will to see happen. But as a wise man once said: a single minority language has virtually no cost benefit, but 2000 languages changes the equation considerably. I presume LibreOffice is intended to support OpenDocument. On this basis, I would say: * Tag text according to its language tag rather than using an LCID, given even windows uses langtags now. OpenDocument does this. * Allow arbitrary lang tags to be used in a text anywhere OpenDocument allows these - it is just a question of how much LibreOffice supports this. I believe the UNO interface supports this, but I won't be sure until I've tried it. 
One problem is that OpenDocument depends on an undefined split of text into Western, CTL and CJK text - a useful trick but a bad design.

* Add ability to read language support from say ldml file as configuration (should this go with a doc, no idea)

This hits the problem that ICU looks broken in this respect. One is meant to compile in the support languages - and the line-breaking algorithms require human intervention, because the definitions cannot be compiled to efficient code.

* Be able to associate a language with CTL/CJK.

This is impossible for a few languages. Several languages exist in competing scripts of different categories - Sanskrit and Pali may be written in the Latin script as well as in Indic scripts, and I think Sanskrit is also available in CJK. Several languages are used in both the Latin script and in the national CTL script or in the Arabic script. However, why is this association necessary? In the bug report, Eike Rathke wrote, The existing predefined CTL/CJK tags respectively their corresponding LCID values occur in various switch cases to be acted differently upon. This needs further elucidation. It's not obvious why mistagging is to be preferred.

To return to Jonathon (= Toki)'s advice: My recommendation is that you file an RFE for each language and locale that you'd like to use in LibO.

With something like 2000 languages, the pick lists will be overwhelmed. My preference would be to allow them all, but to have a flexible method of selecting which appear in the lists. Most people won't use many - and a dozen or so choices of varieties of English or French in a 1-D list is overwhelming, and two dozen choices for Arabic is also excessive.

Whilst one can fake it, by using a different language/locale with similar characteristics, that doesn't help, if one wants to do spell checking and grammar checking in your documents of those specific languages.

I'm surprised at the central control of these, especially at the experimental level.
Is one really meant to mislabel text while developing and testing such tools? Richard.
Re: Adding Languages to Writer's Character, Font Menu
On Wed, 24 Jun 2015 11:52:49 +0200 Eike Rathke er...@redhat.com wrote: If I have some text with khb-CN as the language and region and then try to set the language for a greater expanse of text, khb-CN does not come up in the menu. N.B. By 'language' and 'region', I mean language and region for complex text. I tend to forget that one doesn't tag characters with a language, but sets a tag conditional on the character's script class. Does it come up if the cursor is positioned on a portion of text that already has the tag assigned? Yes, khb-CN comes up if the cursor is between the characters tagged as such when complex. If the character before is not khb-CN and the character after is khb-CN, it does not come up. Richard.
Re: Adding Languages to Writer's Character, Font Menu
On Wed, 24 Jun 2015 12:31:16 +0200 Eike Rathke er...@redhat.com wrote:

* Allow arbitrary lang tags to be used in a text anywhere OpenDocument allows these - it is just a question of how much LibreOffice supports this.

It does.

I believe the UNO interface supports this, but I won't be sure until I've tried it.

Simply in a css::lang::Locale set the Language field to qlt and in the Variant have the language tag, see http://api.libreoffice.org/docs/idl/ref/structcom_1_1sun_1_1star_1_1lang_1_1Locale.html

It may be 'simply' to you, but my macro to set the language doesn't progress beyond the '::' before 'Locale', failing with Object not accessible. Invalid object reference. I was using vanilla LibreOffice 4.3.3.2. My macro shorn of superfluous comments read:

    sub Tai_Lue3
        dim dispatcher as object
        document = ThisComponent.CurrentController.Frame
        dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
        ' dim args1(0) as new com.sun.star.beans.PropertyValue
        ' dim args1(0) as new css::lang::Locale
        ' dim args1(0) as new com::sun::star::lang::Locale
        dim args1(0) as new com::sun::star::lang::locale
        args1(0).Language = "qlt"
        args1(0).Variant = "khb-CN"
        dispatcher.executeDispatch(document, ".uno:Language", "", 0, args1())
    end sub

The macro recorded from using the combobox just records the LCID generated on the fly, which is not much use. It wouldn't mean the same from editing session to editing session.

* Be able to associate a language with CTL/CJK. This is impossible for a few languages. Several languages exist in competing scripts of different categories - Sanskrit and Pali may be written in the Latin script as well as in Indic scripts, and I think Sanskrit is also available in CJK. Several languages are used in both the Latin script and in the national CTL script or in the Arabic script.

Then you will have different language tags that include the script, and have one associated with Western and one with CTL. I don't see the problem.
I am having great difficulty seeing why one should want to specify the script for a barely supported writing system, let alone the class of script. My thought was that the language code would suffice. The script is generally implicit in the text. As far as text properties are concerned, the class of script would be implicit in the box in which the language name was entered. Richard.
Re: Adding Languages to Writer's Character, Font Menu
(Copy to list for reference - I accidentally replied to Caolán alone.)

On Tue, 23 Jun 2015 08:59:04 +0100 Caolán McNamara caol...@redhat.com wrote: The language combo-box allows you to enter arbitrary language tags. What happens if you just enter khb-CN in there.

Using vanilla Version: 4.3.3.2, Build ID: 9bb7eadab57b6755b1265afa86e04bf45fbfc644 on Ubuntu 12.04 with Unity desktop, I can't enter text in that box. If I tab to the box so that it is highlighted, 'k' changes the selection to Kannada, 'h' changes the selection to 'Khmer', 'b', '-' and 'c' have no effect, and 'N' changes the selection to N'ko. (The effects seem to depend on the speed of typing - the 'h' can change the selection to Hebrew and 'b' can change it to Bengali (Bangladesh).)

The trick of manually inserting the language in the .xml files does not work well. If I have some text with khb-CN as the language and region and then try to set the language for a greater expanse of text, khb-CN does not come up in the menu.

Ubuntu's Version: 4.4.4.2 Build ID: 40m0(Build:2) is no better.

Richard.
Adding Languages to Writer's Character, Font Menu
How do I add a language to this menu so that fonts that can do so will render text in the style appropriate to the language? I am reconciled to having to create a bespoke version of LibreOffice, though I'd rather not. Manually editing a document's XML files would be the last resort - it seems to work! While it gets the language into the pick list, this is only while the selection includes text in that language. I haven't explored this method, though it does suggest a workable clumsy technique.

An example is a Tai Lue style (language khb-CN). LibreOffice (more precisely, HarfBuzz) includes data to enable the conversion of 'khb' to OpenType tag 'XBD '. I'm styling Lanna script glyphs for language, and selecting Lao as the complex-script language gives me Lao-style glyphs from the font in LibreOffice. The only mechanism I can see that might work for Tai Lue is to request the installation of a scrappy dictionary for the language (perhaps even empty?) but this method feels wrong.

My wish list for additions, assuming I only have to include language and country, is:

khb-CN Tai Lue
nod-TH Northern Thai
kkh-MM Tai Khuen
tts-TH North-Eastern Thai

and, lower down the scale of desire, support for

khb-LA Tai Lue of Laos

and for various Palaung languages (relevant one unestablished), all OpenType code PLG:

pce-MM – Ruching
pll-MM – Shwe
rbb-MM – Rumai

If script matters, it gets more complicated. The above 8 are all for the Lanna script, but for generally useful dictionary support for some of the languages one should concentrate on:

nod_Thai-TH
tts_Thai-TH
khb_Laoo-LA

and I'm not sure which script for Palaung - probably Myanmar, but definitely not Lanna.

A dictionary for pi_Lana (Pali) would be good to have - I'm not sure about the relevance of national variations, though. I'm not sure how well a multiscript dictionary will work. A pi_Latn dictionary is reported to have been developed, but it's not available for download.
Pali is on the pick list for Western scripts, but is not available for 'complex' scripts. The national variations in Pali generally go with script. (Of course, there are two very different, but apparently equivalent, orthographies for pi_Thai-TH.) Richard.
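The wish list in this message writes some tags with an underscore before the script (nod_Thai-TH, pi_Lana) and others as plain language-region pairs (khb-CN). A minimal Python sketch of splitting such strings into language, script and region and re-emitting them in hyphenated BCP 47 form follows. The names `Tag`, `parse_tag` and `format_tag` are invented for the example, and only these three subtag kinds are handled; variants, extensions and private-use subtags are rejected rather than guessed at.

```python
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_tag(text: str) -> Tag:
    """Split 'nod_Thai-TH' or 'khb-CN' into language, script and region."""
    parts = text.replace("_", "-").split("-")
    language = parts[0].lower()
    script: Optional[str] = None
    region: Optional[str] = None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha() and script is None and region is None:
            script = part.title()          # ISO 15924 codes are written 'Thai', 'Lana'
        elif len(part) == 2 and part.isalpha() and region is None:
            region = part.upper()          # ISO 3166 codes are written 'TH', 'CN'
        else:
            raise ValueError(f"unsupported subtag {part!r} in {text!r}")
    return Tag(language, script, region)


def format_tag(tag: Tag) -> str:
    """Join the subtags that are present with hyphens."""
    return "-".join(p for p in tag if p)
```

For instance, `format_tag(parse_tag("nod_Thai-TH"))` gives `nod-Thai-TH`, which keeps the script subtag that distinguishes the Thai-script dictionary from the Lanna-script one.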
Re: Adding Extension for Experimental Thai Spelling
On Thu, 27 Sep 2012 11:52:26 +0700 Nathan Wells sungk...@gmail.com wrote:

1. If you are shutting off the ICU breakiterator for text following, we should probably also do it for text preceding. Thus if there is a ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled for the whole sentence.

Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU break iteration should be disabled for the whole sentence.

What is the logic of this? The use cases I see are:

1) The user always marks word breaks with ZWSP. In this case, the ideal is to switch off the break iterator for the language.

2) The user never marks word breaks. In this case, the user is totally dependent on the break iterator, and cannot be helped when it fails.

3) The user only marks word breaks and non-word breaks when the iterator fails. In this case, the iterator need only be switched off from the point of override until it can clearly re-synch. The obvious re-synching points are word external punctuation, such as end-of-line, white space, quotation marks, commas and dandas (and as dandas I would include U+0E2F THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5 KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai ฯลฯ and ฯเปฯ). Now, it may be easier to explain the rule if it applies to the whole 'word' - for what we are looking at is pretty much a 'word' as understood by dictionariless editors.

4) Different parts of the text come from different sources - some mark word breaks, others expect the application to correctly identify them. A ZWSP in a chunk of text would then tag the text as having come from a user in case 1 or 3; we have no reliable way of distinguishing the two cases. A WJ (U+2060) or ZWNBSP (U+FEFF) (when not a BOM, so paragraph initial is suspect) would strongly suggest use case 3 - but might occur in use case 1 if the user has had to fight a break iterator.

(end of use cases)

Considering these four use cases, it seems simplest to let ZWSP, WJ and ZWNBSP disable the iterator for the extent of the dictionariless word in which it occurs.

What is the definition of an ICU sentence boundary? I see no evidence from CLDR 2.9 that it should be even approximately right for Khmer (or Thai). Splitting Thai text into sentences is known to be challenging - we can therefore expect different applications to split text differently.

The one downside I can see to my suggestion is that if all word boundaries are marked, switching the iterator off dictionariless word by dictionariless word will require slightly greater use of WJ, for a ZWSP later in the sentence will not necessarily be in the same dictionariless word.

A related issue that seems not to be handled is repetition mark U+0E46 THAI CHARACTER MAIYAMOK. It should be separated from the preceding alphabetic characters by a space, but LibreOffice doesn't recognise the sequence as a possible continuation of the word. Sometimes it is a necessary part of a word. I don't know what the situation is in Khmer.

Richard.
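The suggestion that ZWSP, WJ and ZWNBSP disable the iterator for the extent of the dictionariless word can be sketched in Python as follows. This is an illustration under stated assumptions, not ICU or LibreOffice code: `dictionary_break` is a hypothetical callable standing in for the dictionary-based break iterator, the re-sync set contains only white space, commas, a few quotation marks and the two dandas named above, and the suggested exception for ฯลฯ and ฯเปฯ is not handled.

```python
import re
from typing import Callable, List

ZWSP = "\u200b"    # ZERO WIDTH SPACE: user-supplied word break
WJ = "\u2060"      # WORD JOINER: user-supplied "no break here"
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE (when not a BOM)
OVERRIDES = {ZWSP, WJ, ZWNBSP}

# Re-sync points: white space, commas, quotation marks, PAIYANNOI, BARIYOOSAN.
RESYNC_CHARS = ",\"'\u201c\u201d\u2018\u2019\u0e2f\u17d5"
RESYNC = re.compile(r"(?:\s|[" + re.escape(RESYNC_CHARS) + r"])+")


def break_words(text: str, dictionary_break: Callable[[str], List[str]]) -> List[str]:
    """Split text into words, trusting the user inside any chunk they have marked."""
    words: List[str] = []
    for chunk in RESYNC.split(text):       # each chunk is one dictionariless word
        if not chunk:
            continue
        if any(ch in OVERRIDES for ch in chunk):
            # Iterator disabled for this chunk: break only where the user put ZWSP.
            for piece in chunk.split(ZWSP):
                piece = piece.replace(WJ, "").replace(ZWNBSP, "")
                if piece:
                    words.append(piece)
        else:
            words.extend(dictionary_break(chunk))
    return words
```

A chunk without any of the three characters still goes to the dictionary-based breaker, so users in use case 2 see no change, while a single ZWSP is enough to switch the breaker off back to the previous re-sync point.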
Re: Adding Extension for Experimental Thai Spelling
On Thu, 27 Sep 2012 21:08:13 +0700 Nathan Wells sungk...@gmail.com wrote:

Firstly, you are right, I was mistaken about ICU and the breakiterator working for sentences (I just tried it right now and it does work, but just not with the normal khan or period of Khmer; rather, it works with Latin sentence markers, which is not enough). I had thought when we put in the code for the breakiterator that it also covered the sentence, but I guess not (I will work towards getting it working for Khmer).

It may be worth modifying the CLDR definition - sentence breaks can be customised, though it is presently only done for Greek. However, if you want Khmer *sentence* rather than *clause* breaking, it will need a lot of work - papers are still being published on breaking Thai into sentences (e.g. www.mt-archive.info/Coling-2010-Slayden.pdf ).

In response to your comments:

1) The user always marks word breaks with ZWSP. In this case, the ideal is to switch off the break iterator for the language.

There is some truth to this - and that is why I had it as my last option (just turning the whole thing off). But the ICU breakiterator for Khmer actually works quite well with normal language - it breaks down when there are proper names. So turning it off is an option, but not the most ideal solution. Some users will continue to always mark breaks with a ZWSP (for full control), but I also think having the option to turn it off for more complex sentences would be ideal.

2) The user never marks word breaks. In this case, the user is totally dependent on the break iterator, and cannot be helped when it fails.

As I said above, I think a both/and solution would be ideal for Khmer. But if in the end it would work better for Thai to have an off and on option only, that would be fine for Khmer as well for now, until we can come up with a more ideal solution.

3) The user only marks word breaks and non-word breaks when the iterator fails.
The problem with this in Khmer is the user cannot tell when the breakiterator fails, unless it is on a line-break. A word could be broken up into three parts and the user would never know it.

I usually notice iterator failures in Thai with unrecognised words, which prompts red ink over strange extents. Usually the words are not recognised because they're misspelt, but not always. The problem I see in Thai is usually not so much extra word boundaries as misplaced word boundaries.

Actually, if users could see where the breakiterator is breaking words, that would simplify things a lot.

That is a very significant observation.

The only problem with this would be at the beginning of a document or the beginning of any new re-syncing segment because you might run into something like this:

User input (example in English so others can make sense of it I hope):
    wordwordwordwordword.
How the sentence is broken up by the breakiterator:
    wo r d word word wo rd word.
User adds ZWSP to fix broken word on line-break:
    wo r d word word ZWSPwordword.

This example confuses me. The problem here seems to be extra word breaks rather than missing word breaks, and I don't see how confirming a word break helps.

But user has no idea the first word is broken incorrectly and that it is also spelled incorrectly. This is why it would be best (I think) as Martin suggested that when a ZWSP is detected it also turn off break iteration for the previous words up until a re-sync point. This would practically give the user an off option for the whole document if they so chose, and without the confusion of having to find some option in the Tools menu to turn it on or off - it would just be automatic, depending on the user's habit.

I was clearly not clear enough. In the example above, 'wordwordwordwordword' is what I would call a dictionariless word - a word-breaker without a dictionary (e.g. a shell's parser) would see it as just one 'word'.
Therefore, once ZWSP is inserted and word-breaking disabled, dictionary-based word-breaking is not applied to wordwordwordZWSPwordword, and, typically, red squiggles appear under wordwordword and wordword. The boundary may be revealed by a phase discontinuity or gap in the squiggle. Under the proposed scheme, user has to introduce another three ZWSPs even if the dictionary contains all the words. I agree with this: Considering these four use cases, it seems simplest to let ZWSP, WJ and ZWNBSP disable the iterator for the extent of the dictionariless word in which it occurs. Except, it also should disable the breakiterator up to the previous re-sync point... But that is what I meant! But actually, there is a rule in ICU for the MAIYAMOK so unless that is not working properly, I am not sure why LibreOffice doesn't break correctly... I'll have to look further into this - and check that misbehaviour is still happening. Squiggly lines is what I chiefly remember. There may also be a Hunspell issue
Re: Adding Extension for Experimental Thai Spelling
On Thu, 26 Jul 2012 16:33:00 +0700 Martin Hosken martin_hos...@sil.org wrote: 1. use of U+2060 makes string searching and spell checking harder (unless WJ chars are stripped for searching and spell checking). They are not part of the spelling of a word, so their introduction in the underlying text stream is problematic for other text processing processes (like searching as mentioned). This is less of an issue for U+200B ZWSP because that occurs between words and searching across word boundaries is a rarer activity. Likewise spell checking across word boundaries isn't really needed. U+2060 WJ should definitely be skipped for searching and, once it has done its gluing job, spell-checking look-up, just like U+00AD SOFT HYPHEN. They're both indubitable complete ignorables for collation and therefore for UCA (Unicode Collation Algorithm) search. Now what happens if I want to put zw around a word that occurs 20 chars after my last zw? The on off nature of the zw has now been inverted. One option is to say that zw must always occur in pairs and you would have to bracket your first or second word there. But then management of which zw is on and which is off will get confusing for users. I think that is the wrong way of looking at it. Various characters, some ZWSP, others more natural, such as SP, tell the break iterators where some word boundaries are. The rule we would have is that the break iterator should not try to break runs of less than, say, 20 characters if one of the boundaries is provided by ZWSP. I am not proposing that we limit how many breaks it makes in a run - 21 characters could be broken into seven words. The short runs the break iterator is prohibited from breaking can still be checked for spelling. If they are not words, then the user can respond to the red wiggly line appropriately, e.g. by putting extra word breaks in. In the example you gave, one would have to split the words between the delimited words. 
I think the users must accept that - the rule we would be working with is that the break iterator does not break short runs created by inserted ZWSP, and that is a simple rule to understand. I suppose there may be some question of what to count - base consonants perhaps? (In Unicode jargon, that would be extended default graphemes.) That might be a luxury feature we never need to add. Richard.
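The proposed rule, that the break iterator does not break a run shorter than about 20 characters when one of its boundaries is a ZWSP, can be sketched in Python like this. It is a sketch only: `dictionary_break` is again a hypothetical stand-in for the real break iterator, runs are delimited just by ordinary spaces and ZWSP, and length is counted in code points rather than in base consonants or grapheme clusters, which the message leaves as an open question.

```python
import re
from typing import Callable, List

ZWSP = "\u200b"
MIN_RUN = 20  # "say, 20 characters" in the proposal; the exact figure is open


def break_runs(text: str,
               dictionary_break: Callable[[str], List[str]],
               min_run: int = MIN_RUN) -> List[str]:
    """Leave short ZWSP-delimited runs alone; pass everything else to the breaker."""
    # With a capturing group, re.split keeps the delimiters:
    # tokens alternate run, delimiter, run, delimiter, ...
    tokens = re.split("([ \u200b])", text)
    words: List[str] = []
    for i in range(0, len(tokens), 2):
        run = tokens[i]
        if not run:
            continue
        left = tokens[i - 1] if i > 0 else ""
        right = tokens[i + 1] if i + 1 < len(tokens) else ""
        if len(run) < min_run and ZWSP in (left, right):
            words.append(run)                     # user-delimited short run: hands off
        else:
            words.extend(dictionary_break(run))   # may be split into many words
    return words
```

A short run kept intact this way is still handed to the spell-checker as one word, so a run that is not a word draws the red wiggly line and the user can add further breaks.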
Re: Adding Extension for Experimental Thai Spelling
On Fri, 17 Feb 2012 14:10:21 + Caolán McNamara caol...@redhat.com wrote: On Thu, 2012-02-16 at 23:24 +, Richard Wordingham wrote:

Indeed, yeah, I suppose, assuming it's as complicated as Thai, that the right direction would be for someone to write for icu new dictionary-based breakiterators for the nod(?) language and then the rather trivial changes to LibreOffice to know about the language in order to mark text as that language to bubble that info down to icu

Northern Thai's not quite as simple or standardised as Siamese! One can meet (at least) the following spelling systems:

1) Chiangmai phonetics
2) Chiangrai phonetics (different mapping of tones to Siamese spelling rules)
3) Transliteration from Tai Tham script (probably rare for connected text)
4) Tai Tham script

However, perhaps dictionary-based break iterators are something to be treated like dictionaries. There are several other writing systems that could probably benefit from them:

Thai script:
    Northern Thai
    NE Thai (for recording songs - use of Siamese tone rules scrambles the tonemarks compared to Siamese cognates)
Khmer script:
    Khmer - there's already a project for this set up on SourceForge.
    Pali
Tai Tham script:
    Tai Khuen
    Tai Lue
    Pali
Lao script:
    Lao
Tibetan script:
    Tibetan

I've a feeling Burmese may also have a need for dictionary based text breaking, though it's better behaved for syllable breaking than most of the others listed here. Shan would come in the same category.

The above list is not exhaustive. Tai Lue in Lao script probably belongs in the list. Not all Thai script writing systems need a break iterator - some of the minority languages separate words with spaces, but that's partially a matter of literacy - Thais start writing Thai with interword gaps and then learn to suppress the gaps. Pali written in Thai also separates words with spaces - but Pali has some very long words!

Richard.
Re: Adding Extension for Experimental Thai Spelling
On Tue, 14 Feb 2012 16:19:17 + Caolán McNamara caol...@redhat.com wrote: I think this change: http://cgit.freedesktop.org/libreoffice/core/commit/?id=475d0c59c66fb7752d230f76130b17145aad0c12 should improve matters a lot. It's a vast improvement - it gives LibreOffice a real Thai spell-checker. Thank you. I have one worry for Siamese - Németh László suggested that there might be a licensing issue back in http://openoffice.2283327.n4.nabble.com/Thai-line-breaking-td2791315.html . If there isn't such an issue, does this mean we can hope to see your fix in LibreOffice 3.5.1? Makes กุหลาบ get treated as a single word in the unit test there now anyway, though the Northern Thai one is still not considered a single word, that might be due to the oldish icu we're still using. I wouldn't expect a dictionary-based line breaker to handle words from other languages. (There's a whole slew of Mon-Khmer languages in Thailand, and they mostly use the Thai script when they happen to get written.) I can work my way round the problem using the sticking plaster of ZWSP and WJ (no-break no-space), and I think some use of them or an equivalent is inevitable when the sequence of visible characters doesn't define the breaks. In particular, after gluing กุ๊หลาบ together with WJ, Hunspell offered me กุหลาบ as a correction, which is good. There may be some rough edges with ZWSP and WJ going into the dictionary (TBC), but what you've done will justify LibreOffice claiming a Thai spell checking capability. Minority language support may not be compatible with libthai - at least one language uses a combining underline, and some of the mark combinations used for minority languages would get rejected by the WTT rules that libthai supports. Richard. ___ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice
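One way to avoid rough edges from WJ reaching the dictionary is to strip it before look-up, in the same way as SOFT HYPHEN. The Python sketch below shows the idea. The names `lookup_key` and `is_known` and the plain set used as a dictionary are assumptions for illustration; this is not how Hunspell or LibreOffice is actually wired up.

```python
# Characters glued into a word for line-breaking purposes but not part of its
# spelling: U+2060 WORD JOINER and U+00AD SOFT HYPHEN.
_STRIP = str.maketrans("", "", "\u2060\u00ad")


def lookup_key(word: str) -> str:
    """Return the word as it should be looked up in a spelling dictionary."""
    return word.translate(_STRIP)


def is_known(word: str, dictionary: set) -> bool:
    """Check a word against a dictionary (here just a set of strings)."""
    return lookup_key(word) in dictionary
```

ZWSP is deliberately not in the stripped set, since it marks a boundary between words rather than sitting inside one.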
Re: Adding Extension for Experimental Thai Spelling
Thank you to everyone who's offered me advice.

On Mon, 13 Feb 2012 15:08:20 + Caolán McNamara caol...@redhat.com wrote: I don't think we have any way to override our breakiterators from extensions.

Ah well, I'll just have to try to get Thai spell-checking working for myself and then worry about sharing my changes - assuming I succeed.

I'd be sort of interested in confirming that what we have right now actually works correctly, in the sense that Thai text definitely *is* getting run through the special Thai-specific icu word break handler.

It's definitely going through a Siamese-specific word-breaker for line-breaking. For example the two-syllable Thai word กุหลาบ 'rose' moves to the next line, but when I convert it to the Northern Thai form กุ๊หลาบ (not the spelling I'd favour) by adding a (non-spacing) tone mark, it's promptly broken between lines along the syllable boundary, although the first syllable does not constitute a word, at least not one recorded in the Royal Institute Dictionary. I'm glad to find that inserting U+2060 WJ prevents that break.

The spell-checker seems to break up a phrase consisting of just กุหลาบ into 3 or 4 words.

Richard.
Adding Extension for Experimental Thai Spelling
As I understand it, the lack of a usable Thai spell-checker for LibreOffice (unlike, say, a Khmer spell-checker) is due to the Thai break iterator. (I had expected Thai and Khmer to face similar problems, for neither has a visible word separator and syllable boundaries are often unclear in both.) Tagging Thai script text as Khmer does not work (at least, not in Version 3.4.5); the word boundaries are still determined by the Thai break iterator.

Is it possible to create an experimental alternative to the Thai break iterator that can be shared with other people as a LibreOffice extension? I would be prepared to routinely use U+200B ZERO WIDTH SPACE (ZWSP) to separate words in the Thai script, but I suspect Thais would not. Also, I can see my first useful version fouling up the rendering of pre-existing text.

I can't work out how to create a break iterator as an *extension*. Could someone please advise me how, e.g. by pointing to the documentation or an example. I can find documentation for *publishing* an extension, but that does not address *creating* an extension.

Richard.