Re: Adding Extension for Experimental Thai Spelling

2012-09-27 Thread Nathan Wells
Thanks for your input Richard,

Firstly, you are right: I was mistaken about the ICU breakiterator
working for sentences. I just tried it, and it does work, but only with
Latin sentence markers and not with the normal Khmer khan (period), which
is not enough.  I had thought that the code we put in for the
breakiterator also covered sentences, but I guess not (I will work
towards getting it working for Khmer).

In response to your comments:

1) The user always marks word breaks with ZWSP.
 In this case, the ideal is to switch off the break iterator for the
 language.


There is some truth to this - and that is why I had it as my last option
(just turning the whole thing off). But the ICU breakiterator for Khmer
actually works quite well with normal language - it breaks down when there
are proper names. So turning it off is an option, but not the most ideal
solution. Some users will continue to always mark breaks with a ZWSP (for
full control), but I also think having the option to turn it off for more
complex sentences would be ideal.

2) The user never marks word breaks.
 In this case, the user is totally dependent on the break iterator, and
 cannot be helped when it fails.

As I said above, I think a both/and solution would be ideal for Khmer. But
if in the end it would work better for Thai to have only an on/off
option, that would be fine for Khmer as well for now, until we can
come up with a more ideal solution.


3) The user only marks word breaks and non-word breaks when the iterator
 fails.

The problem with this in Khmer is the user cannot tell when the
breakiterator fails, unless it is on a line-break.  A word could be broken
up into three parts and the user would never know it. This is why the issue
is so complex. Actually, if users could see where the breakiterator is
breaking words, that would simplify things a lot. Though I still think the
option to turn the breakiterator on or off for certain sentences would
be ideal (especially sentences with a ton of proper nouns where the ICU
breakiterator for Khmer has the most trouble).

As for finding re-syncing points (when to turn the breakiterator back on
after it has been turned off by a ZWSP), I agree with you:

 The obvious re-synching points
 are word external punctuation, such as end-of-line, white space,
 quotation marks, commas and dandas (and as dandas I would include U+0E2F
 THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5
 KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai
 ฯลฯ and ฯเปฯ).


The only problem with this would be at the beginning of a document or the
beginning of any new re-syncing segment because you might run into
something like this:

User input (example in English so others can make sense of it I hope):
wordwordwordwordword.
How the sentence is broken up by the breakiterator: wo r d word word wo rd
word.
User adds ZWSP to fix broken word on line-break: wo r d word word
ZWSPwordword.
But user has no idea the first word is broken incorrectly and that it is
also spelled incorrectly.

This is why it would be best (I think), as Martin suggested, that when a ZWSP
is detected it also turns off break iteration for the previous words up
until a re-sync point.  This would practically give the user an off option
for the whole document if they so chose, without the confusion of
having to find some option in the Tools menu to turn it on or off - it
would just be automatic, depending on the user's habit.

I agree with this:

 Considering these four use cases, it seems simplest to let ZWSP, WJ and
 ZWNBSP disable the iterator for the extent of the dictionariless word
 in which it occurs.


Except that it should also disable the breakiterator up to the previous
re-sync point, so that users can in effect turn off the breakiterator if
they so choose. For Khmer this is necessary because a book editor like
myself will want to put the breaks in manually and not let the
breakiterator do anything automatically. But the feature is nice for the
casual user, because for Cambodians it is much faster and more intuitive
not to type spaces between words.

 A related issue that seems not to be handled is the repetition mark U+0E46
 THAI CHARACTER MAIYAMOK.  It should be separated from the preceding
 alphabetic characters by a space, but LibreOffice doesn't recognise
 the sequence as a possible continuation of the word.  Sometimes it
 is a necessary part of a word.  I don't know what the situation is in
 Khmer.


In Khmer the repeat character (U+17D7 LEK TOO) is not separated from the
preceding word by a space, but is connected, so this is not an issue for
us.  But actually, there is a rule in ICU for the MAIYAMOK so unless that
is not working properly, I am not sure why LibreOffice doesn't break
correctly...

Here's the code from ICU4c for the Thai  MAIYAMOK from dictbe.cpp if anyone
is interested...

if (uc == 

Re: Adding Extension for Experimental Thai Spelling

2012-09-27 Thread Martin Hosken
Dear Nathan,

 Here are some new ideas, ordered by desirability, with number one being the
 most desired, to number three being the least.
 
 1) When a zero-width space is detected (U+200B), shut off ICU breakiterator
 for Khmer spell checking for characters following the zero-width space
 until encounters real space (U+0020) or end of sentence (detect end of
 sentence using ICU Sentence Boundary).

I think this is a good direction to head. I have two follow-on comments:

1. If you are shutting off the ICU breakiterator for text following, we should 
probably also do it for text preceding. Thus if there is a ZWSP or ZWNBSP 
(U+2060 WJ) anywhere in a text then ICU break iteration is disabled for the 
whole sentence.

2. Why limit this to Khmer? I suspect as a model it should work for any 
non-space broken text.

Yours,
Martin



 

Re: Adding Extension for Experimental Thai Spelling

2012-09-27 Thread Richard Wordingham
On Thu, 27 Sep 2012 11:52:26 +0700
Nathan Wells sungk...@gmail.com wrote:

 1. If you are shutting off the ICU breakiterator for text following,
 we
 should probably also do it for text preceding. Thus if there is a
 ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break
 iteration is disabled for the whole sentence.
 
 Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU
 break iteration should be disabled for the whole sentence.

What is the logic of this?

The use cases I see are:

1) The user always marks word breaks with ZWSP.

In this case, the ideal is to switch off the break iterator for the
language.

2) The user never marks word breaks.

In this case, the user is totally dependent on the break iterator, and
cannot be helped when it fails.

3) The user only marks word breaks and non-word breaks when the iterator
fails.

In this case, the iterator need only be switched off from the point of
override until it can clearly re-synch.  The obvious re-synching points
are word external punctuation, such as end-of-line, white space,
quotation marks, commas and dandas (and as dandas I would include U+0E2F
THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5
KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai
ฯลฯ and ฯเปฯ).

Now, it may be easier to explain the rule if it applies to the whole
'word' - for what we are looking at is pretty much a 'word' as
understood by dictionariless editors.

4) Different parts of the text come from different sources - some mark
word breaks, others expect the application to correctly identify them.

A ZWSP in a chunk of text would then tag the text as having come from a
user in case 1 or 3; we have no reliable way of distinguishing the
two cases.  A WJ (U+2060) or ZWNBSP (U+FEFF) (when not a BOM, so
paragraph initial is suspect) would strongly suggest use case 3 - but
might occur in use case 1 if the user has had to fight a break
iterator.

(end of use cases)

Considering these four use cases, it seems simplest to let ZWSP, WJ and
ZWNBSP disable the iterator for the extent of the dictionariless word
in which it occurs.
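
A minimal sketch of this rule, in Python, may make the behaviour easier to
see. Everything named here is hypothetical: `segment`, the `auto_break`
callable standing in for the dictionary-based iterator, and the deliberately
incomplete set of re-sync characters. It also ignores the possible exception
for the Thai abbreviations mentioned above.

```python
import re

ZWSP = "\u200b"    # ZERO WIDTH SPACE
WJ = "\u2060"      # WORD JOINER
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE
OVERRIDES = {ZWSP, WJ, ZWNBSP}

# Illustrative re-sync characters only: comma, quotation marks,
# U+0E2F THAI CHARACTER PAIYANNOI and U+17D5 KHMER SIGN BARIYOOSAN.
# White space (including end-of-line) is covered by \s below.
RESYNC_CHARS = ",\"\u201c\u201d\u0e2f\u17d5"
RESYNC = re.compile(r"[\s" + re.escape(RESYNC_CHARS) + "]+")


def segment(text, auto_break):
    """Split text into words, one dictionaryless word at a time.

    auto_break is any callable taking a string and returning a list of
    words; it stands in for the automatic break iterator.
    """
    words = []
    for chunk in RESYNC.split(text):
        if not chunk:
            continue
        if any(ch in OVERRIDES for ch in chunk):
            # The user has intervened in this dictionaryless word:
            # break only at ZWSP and drop the joiners, which are not
            # part of the spelling.
            for piece in chunk.split(ZWSP):
                piece = piece.replace(WJ, "").replace(ZWNBSP, "")
                if piece:
                    words.append(piece)
        else:
            # No user marks here, so the iterator stays in charge.
            words.extend(auto_break(chunk))
    return words
```

With a toy `auto_break` that cuts every four characters, the input
"wordwordword" + ZWSP + "wordword wordword" gives "wordwordword" and
"wordword" for the marked chunk, and "word", "word" for the unmarked one.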

What is the definition of an ICU sentence boundary?  I see no evidence
from CLDR 2.9 that it should be even approximately right for Khmer (or
Thai). Splitting Thai text into sentences is known to be challenging -
we can therefore expect different applications to split text
differently.

The one downside I can see to my suggestion is that if all word
boundaries are marked, switching the iterator off dictionariless word
by dictionariless word will require slightly greater use of WJ, for a
ZWSP later in the sentence will not necessarily be in the same
dictionariless word.

A related issue that seems not to be handled is the repetition mark U+0E46
THAI CHARACTER MAIYAMOK.  It should be separated from the preceding
alphabetic characters by a space, but LibreOffice doesn't recognise
the sequence as a possible continuation of the word.  Sometimes it
is a necessary part of a word.  I don't know what the situation is in
Khmer.

Richard.
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: Adding Extension for Experimental Thai Spelling

2012-09-27 Thread Richard Wordingham
On Thu, 27 Sep 2012 21:08:13 +0700
Nathan Wells sungk...@gmail.com wrote:

 Firstly, you are right, I was mistaken about ICU and the breakiterator
 working for sentences (I just tried it right now and it does work,
 but just not with the normal khan or period of Khmer rather it
 works with Latin sentence markers which is not enough).  I had
 thought when we put in the code for the breakiterator that it also
 covered the sentence, but I guess not (I will work towards getting it
 working for Khmer).

It may be worth modifying the CLDR definition - sentence breaks can be
customised, though it is presently only done for Greek.  However, if
you want Khmer *sentence* rather than *clause* breaking, it will need a
lot of work - papers are still being published on breaking Thai into
sentences (e.g. www.mt-archive.info/Coling-2010-Slayden.pdf ).

 In response to your comments:
 
  1) The user always marks word breaks with ZWSP.
  In this case, the ideal is to switch off the break iterator for the
  language.
 
 
 There is some truth to this - and that is why I had it as my last
 option (just turning the whole thing off). But the ICU breakiterator
 for Khmer actually works quite well with normal language - it breaks
 down when there are proper names. So turning it off is an option, but
 not the most ideal solution. Some users will continue to always mark
 breaks with a ZWSP (for full control), but I also think having the
 option to turn it off for more complex sentences would be ideal.
 
  2) The user never marks word breaks.
  In this case, the user is totally dependent on the break iterator,
  and cannot be helped when it fails.
 
 As I said above, I think a both/and solution would be ideal for Khmer.
 But if in the end it would work better for Thai to have only an on/off
 option, that would be fine for Khmer as well for now, until
 we can come up with a more ideal solution.
 
 
  3) The user only marks word breaks and non-word breaks when the
  iterator fails.
 
 The problem with this in Khmer is the user cannot tell when the
 breakiterator fails, unless it is on a line-break.  A word could be
 broken up into three parts and the user would never know it.

I usually notice iterator failures in Thai with unrecognised words,
which prompts red ink over strange extents. Usually the words are not
recognised because they're misspelt, but not always.  The problem I see
in Thai is usually not so much extra word boundaries as misplaced
word boundaries.

 Actually, if users could see where the
 breakiterator is breaking words, that would simplify things a lot.

That is a very significant observation.

 The only problem with this would be at the beginning of a document or
 the beginning of any new re-syncing segment because you might run
 into something like this:

 User input (example in English so others can make sense of it I hope):
 wordwordwordwordword.
 How the sentence is broken up by the breakiterator: wo r d word word
 wo rd word.
 User adds ZWSP to fix broken word on line-break: wo r d word word
 ZWSPwordword.

This example confuses me.  The problem here seems to be extra word
breaks rather than missing word breaks, and I don't see how confirming
a word break helps.

 But user has no idea the first word is broken incorrectly and that it
 is also spelled incorrectly.

 This is why it would be best (I think), as Martin suggested, that when
 a ZWSP is detected it also turns off break iteration for the previous
 words up until a re-sync point.  This would practically give the user
 an off option for the whole document if they so chose, without
 the confusion of having to find some option in the Tools menu to turn
 it on or off - it would just be automatic, depending on the user's
 habit.

I was clearly not clear enough.  In the example above,
'wordwordwordwordword' is what I would call a dictionariless word - a
word-breaker without a dictionary (e.g. a shell's parser) would see it
as just one 'word'.  Therefore, once ZWSP is inserted and
word-breaking disabled, dictionary-based word-breaking is not applied to
wordwordwordZWSPwordword, and, typically, red squiggles appear under
wordwordword and wordword.  The boundary may be revealed by a phase
discontinuity or gap in the squiggle.  Under the proposed scheme, the user
has to introduce another three ZWSPs even if the dictionary contains
all the words.

 I agree with this:
 
  Considering these four use cases, it seems simplest to let ZWSP, WJ
  and ZWNBSP disable the iterator for the extent of the
  dictionariless word in which it occurs.

 Except, it also should disable the breakiterator up to the previous
 re-sync point...

But that is what I meant!

 But actually, there is a rule in ICU for the MAIYAMOK
 so unless that is not working properly, I am not sure why LibreOffice
 doesn't break correctly...

I'll have to look further into this - and check that misbehaviour is
still happening.  Squiggly lines is what I chiefly remember.  There may
also be a Hunspell issue 

Re: Adding Extension for Experimental Thai Spelling

2012-09-26 Thread Nathan Wells
Hello Again,

Thank you all for your input!

This is a deeper problem than I first thought...sorry for the delayed
response, but I hope a solution can be found, even though the current ICU
breakiterator is not at 100% for Khmer.

Here are some new ideas, ordered by desirability, with number one being the
most desired, to number three being the least.

1) When a zero-width space (U+200B) is detected, shut off the ICU
breakiterator for Khmer spell checking for the characters following the
zero-width space until it encounters a real space (U+0020) or the end of
the sentence (detected using the ICU sentence boundary).

2) Disable use of the ICU breakiterator for Khmer spell checking by default,
but allow users to enable it by adding a check-box in the
Tools > Options > Language Settings > Writing Aids > Options dialogue
when a Khmer Hunspell dictionary is present (
http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version
 ).

3) Disable use of ICU breakiterator for Khmer spell checking until the ICU
breakiterator for Khmer is more accurate.
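
To make option 1 concrete, here is a rough sketch in Python. It is only an
illustration: `auto_break_mask` is a made-up name, and the small set of
sentence terminators is an assumed stand-in for a real ICU sentence
boundary check.

```python
ZWSP = "\u200b"   # ZERO WIDTH SPACE
SPACE = "\u0020"  # ordinary space

# Assumed stand-in for sentence-boundary detection: a few Latin
# terminators plus U+17D4 KHMER SIGN KHAN.
SENTENCE_END = {".", "!", "?", "\u17d4"}


def auto_break_mask(text):
    """Return one boolean per character.

    True means automatic breaking is still switched on at that
    character; False means a preceding ZWSP has switched it off.
    """
    mask = []
    enabled = True
    for ch in text:
        if ch == ZWSP:
            # A manual break mark switches the iterator off.
            enabled = False
        mask.append(enabled)
        if ch == SPACE or ch in SENTENCE_END:
            # A real space or the end of the sentence switches the
            # iterator back on for whatever follows.
            enabled = True
    return mask
```

For "ab" + ZWSP + "cd ef" the iterator is on for "ab", off from the ZWSP
through the space, and on again for "ef". Note that this version only looks
forward from the ZWSP.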

Currently, the ICU breakiterator for Khmer enabled in LibreOffice 3.6
causes a lot of spelling errors to go unnoticed, since it breaks words up
incorrectly. So hopefully we can find a solution that will work with the
current ICU breakiterator - though with ICU 50.1 the breakiterator for
Khmer will have some improvements. But I do feel that if solution 1 or 2
(or a better idea from someone else) cannot be implemented, the
breakiterator for spelling with Khmer should be turned off in LibreOffice
until the ICU breakiterator for Khmer is more accurate.


Thanks again for your help and time, your input is greatly appreciated!

Sincerely,

Nathan




Re: Adding Extension for Experimental Thai Spelling

2012-09-26 Thread Nathan Wells
Thanks Martin,


1. If you are shutting off the ICU breakiterator for text following, we
 should probably also do it for text preceding. Thus if there is a ZWSP or
 ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled
 for the whole sentence.


Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU break
iteration should be disabled for the whole sentence.


2. Why limit this to Khmer? I suspect as a model it should work for any
 non-space broken text.


I am only limiting it to Khmer because that is my expertise and I didn't
want to cause problems for other languages - but it is possible these
changes would be beneficial for other languages that are not broken by
spaces (like Thai).


Thanks,
Nathan


Re: Adding Extension for Experimental Thai Spelling

2012-07-27 Thread Richard Wordingham
On Thu, 26 Jul 2012 16:33:00 +0700
Martin Hosken martin_hos...@sil.org wrote:

 1. use of U+2060 makes string searching and spell checking harder
 (unless WJ chars are stripped for searching and spell checking). They
 are not part of the spelling of a word, so their introduction in the
 underlying text stream is problematic for other text processing
 processes (like searching as mentioned). This is less of an issue for
 U+200B ZWSP because that occurs between words and searching across
 word boundaries is a rarer activity. Likewise spell checking across
 word boundaries isn't really needed.

U+2060 WJ should definitely be skipped for searching and, once it has
done its gluing job, spell-checking look-up, just like U+00AD SOFT
HYPHEN.  They're both indubitable complete ignorables for collation and
therefore for UCA (Unicode Collation Algorithm) search.
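
As a small illustration of skipping these characters before a search or
dictionary look-up, something along these lines would do in Python. The
function name `lookup_key` is invented for this sketch, and it covers only
the two characters named above, not the full set of ignorables.

```python
# Map the code points to None so that str.translate deletes them.
IGNORABLE = {
    0x2060: None,  # U+2060 WORD JOINER
    0x00AD: None,  # U+00AD SOFT HYPHEN
}


def lookup_key(word):
    """Return the form of word to use for searching or spell checking."""
    return word.translate(IGNORABLE)
```

U+200B ZWSP is left alone here on purpose, since it marks a word boundary
rather than sitting inside a word.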

 Now what happens if I want to put zw around a word that occurs more than
 20 chars after my last zw? The on/off nature of the zw has now been
 inverted. One option is to say that zw must always occur in pairs and
 you would have to bracket your first or second word there. But then
 management of which zw is on and which is off will get confusing for
 users.

I think that is the wrong way of looking at it.  Various characters,
some ZWSP, others more natural, such as SP, tell the break iterators
where some word boundaries are.  The rule we would have is that the
break iterator should not try to break runs of less than, say, 20
characters if one of the boundaries is provided by ZWSP.  I am not
proposing that we limit how many breaks it makes in a run - 21
characters could be broken into seven words.  The short runs the break
iterator is prohibited from breaking can still be checked for spelling.
If they are not words, then the user can respond to the red wiggly line
appropriately, e.g. by putting extra word breaks in.
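
A sketch of that rule in Python, under some assumptions: `segment_runs` and
`auto_break` are hypothetical names, only SP and ZWSP are treated as given
boundaries, and length is counted in code points rather than in any more
refined unit.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE
LIMIT = 20       # runs shorter than this are protected


def segment_runs(text, auto_break, limit=LIMIT):
    """Break text into words, leaving short ZWSP-delimited runs alone.

    auto_break is any callable taking a string and returning a list of
    words; it stands in for the automatic break iterator.
    """
    words = []
    # The capturing group keeps the delimiters, so parts alternates
    # run, delimiter, run, delimiter, ..., run.
    parts = re.split("([ \u200b])", text)
    for i in range(0, len(parts), 2):
        run = parts[i]
        if not run:
            continue
        left = parts[i - 1] if i > 0 else None
        right = parts[i + 1] if i + 1 < len(parts) else None
        if ZWSP in (left, right) and len(run) < limit:
            # One boundary came from the user and the run is short:
            # pass it to the spell checker unbroken.
            words.append(run)
        else:
            words.extend(auto_break(run))
    return words
```

A run of 21 characters next to a ZWSP is still handed to `auto_break`, which
may split it into as many words as it likes; only the short runs are kept
whole.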

In the example you gave, one would have to split the words between the
delimited words.  I think the users must accept that - the rule we
would be working with is that the break iterator does not break short
runs created by inserted ZWSP, and that is a simple rule to
understand.  I suppose there may be some question of what to count -
base consonants perhaps? (In Unicode jargon, that would be extended
default graphemes.)  That might be a luxury feature we never need to
add.

Richard.


Re: Adding Extension for Experimental Thai Spelling

2012-07-26 Thread Martin Hosken
Dear All,

  An automatic word and line breaker is very necessary for Khmer and
  Thai because traditionally they have no spaces between words, and so
  line-breaking and spell checking require the use of a zero-width space
  between words which is counterintuitive for most native speakers, and
  so spell checking goes widely unused.

I agree that automatic word breaking is a good thing and I am relieved to see 
that libreoffice does it based on language selection and not on automatic 
language guessing based on scripts. There are more languages that use Thai 
script and Khmer script than just Thai and Khmer. So one of my fears is already 
alleviated :)

  But now with the ICU code you implemented, Thai and Khmer can be
  automatically broken, and the results are quite good. But with its
  implementation in the real world, I have found some issues that I
  wanted to raise and also suggest possible solutions. I write this as
  an end-user, not so much as a programmer, nor do I claim to fully
  understand the inner-workings of ICU and LibreOffice (because I don't!
  ).
  
  First, I will do my best to explain the current results of the ICU
  break iterator with Khmer:
  
  Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
  
  Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
  
  Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
  ឈ្មោះ|សិវកឥវលិយៈ
  
  The differences should be clear – the ICU break iterator does not
  break the words with 100% accuracy.
  
  One possible solution to this issue is by how the ICU Break Iterator
  interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU
  code was enabled to automatically break Khmer, if an end-user wanted
  to spell check Khmer, they had to manually place U+200B characters to
  separate words. This solution worked quite well, but was
  counterintuitive to most native speakers, because Khmer has no spaces
  (as stated before). But with this solution, an end-user could be sure
  that their document was broken with 100% accuracy, if there was no
  human error (something automatic solutions cannot do – it is more
  along the lines of 80% accurate). What I propose, is that the break
  iterator code in LibreOffice looks for U+200B characters in a given
  string and considers them as a sign to NOT automatically break, but to
  allow the end-user full control to manually break words. Let me
  explain:
  
   1. The code starts processing the text and automatically breaking
  it until it comes across a U+200B character. If one is found,
  it searches to see if there are any additional U+200B or
  U+0020 characters in the following 20 characters (or so), and
  if there are, the break iterator skips over those characters
  and starts again from the second U+200B character (or U+0020,
  but a U+0020 character would only signify the “close” of the
  manual break because sometimes a phrase will end and there
  will be an actual space – so if the word that the user wants
  to manually break has a “real” U+0020 space at the end of it,
  then the user does not need to put an additional U+200B
  character to close it) which then repeats, looking for U+200B
  characters etc.
  
   2. This would allow end-users to choose to manually break their
  whole document so they can have precise control, as well as
  allow end-users to place U+200B characters around names of
  people, places or transliterations in order to tell the break
  iterator to not try to break those words.
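
The two steps above could be sketched roughly as follows. This is only an illustration of the proposal as worded, not existing LibreOffice code. The function name `protected_spans` and the choice to return index pairs are assumptions; the window of 20 characters comes from the proposal itself.

```python
ZWSP = "\u200b"   # ZERO WIDTH SPACE
SPACE = " "       # U+0020
LOOKAHEAD = 20    # "the following 20 characters (or so)"


def protected_spans(text, lookahead=LOOKAHEAD):
    """Return (start, end) index pairs the automatic breaker should skip.

    A span opens just after a ZWSP and closes at the next ZWSP or space,
    provided that closer lies within `lookahead` characters.
    """
    spans = []
    i = text.find(ZWSP)
    while i != -1:
        start = i + 1
        window = text[start:start + lookahead]
        # Offsets of the candidate closing delimiters inside the window.
        hits = [p for p in (window.find(ZWSP), window.find(SPACE)) if p != -1]
        if not hits:
            i = text.find(ZWSP, start)   # no closer: resume normal breaking
            continue
        end = start + min(hits)
        if end > start:
            spans.append((start, end))
        # A ZWSP closer may itself open the next span; a space cannot.
        i = end if text[end] == ZWSP else text.find(ZWSP, end + 1)
    return spans
```

With a fully hand-broken document every run between two ZWSPs becomes a protected span, while a single name wrapped in ZWSPs yields exactly one span and the rest is left to the iterator.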

In principle I like this approach. I like the idea of being able to force 
breaks and non-breaks. But I don't think we are quite there with this solution 
yet. Here are my difficulties with it:

1. use of U+2060 makes string searching and spell checking harder (unless WJ 
chars are stripped for searching and spell checking). They are not part of the 
spelling of a word, so their introduction in the underlying text stream is 
problematic for other text processing processes (like searching as mentioned). 
This is less of an issue for U+200B ZWSP because that occurs between words and 
searching across word boundaries is a rarer activity. Likewise spell checking 
across word boundaries isn't really needed.
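
A small sketch of the stripping idea mentioned in point 1, assuming searching and spell checking both work on plain strings. The helper names are hypothetical and not part of LibreOffice or Hunspell.

```python
WJ = "\u2060"  # WORD JOINER


def strip_joiners(text):
    """Remove WORD JOINER characters, which are not part of any spelling."""
    return text.replace(WJ, "")


def find_ignoring_joiners(haystack, needle):
    """Return True if needle occurs in haystack once WJ is ignored in both."""
    return strip_joiners(needle) in strip_joiners(haystack)
```

ZWSP is deliberately left alone here, since it sits between words rather than inside them.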

2. How do we decide the range within which text between two ZWSP chars counts
as one word rather than two? How close to the end of a string must a ZWSP
occur to disable all breaking before the end of the string? Does
abcdef<ZWSP>uvwxyz block all breaks in the string? I think we need to think
harder (deeper) about the use of ZWSP in this way and see if we can come up
with something with a little less ambiguity. Having said that, I think we are
going to have to think really hard, because I don't think this is an easy
problem.

   4. I then notice that ម្នាក់ទៀត line breaks together (since the
  

Re: Adding Extension for Experimental Thai Spelling

2012-07-25 Thread Caolán McNamara
I'll cc this to the list if you don't mind, in order to archive it. I
have no immediate great ideas. But I wonder if a view-word boundaries
mode would be helpful, i.e. something that indicates the boundaries of
the words that the software thinks exist.

On Sun, 2012-07-15 at 21:40 +0700, Nathan Wells wrote:
 
 I hope you don't mind if I write and ask some more questions and ask
 for additional help in making the break iterator more functional in
 LibreOffice. Thank you again for your help implementing ICU for Khmer
 in LibreOffice. I downloaded a recent beta build with your code
 implemented and did some testing – it is great! But it also brought to
 my attention some issues that hamper the usability of the automatic
 breaking for Khmer (and I also believe for Thai – see this discussion
 -
 http://www.thaivisa.com/forum/topic/444360-thai-in-openoffice-on-ubuntu-lucid-lynx/#entry5160455).
  
 
 
 An automatic word and line breaker is very necessary for Khmer and
 Thai because traditionally they have no spaces between words, and so
 line-breaking and spell checking require the use of a zero-width space
 between words which is counterintuitive for most native speakers, and
 so spell checking goes widely unused.
 But now with the ICU code you implemented, Thai and Khmer can be
 automatically broken, and the results are quite good. But with its
 implementation in the real world, I have found some issues that I
 wanted to raise and also suggest possible solutions. I write this as
 an end-user, not so much as a programmer, nor do I claim to fully
 understand the inner-workings of ICU and LibreOffice (because I don't!).
 
 First, I will do my best to explain the current results of the ICU
 break iterator with Khmer:
 
 Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
 
 Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
 
 Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
 ឈ្មោះ|សិវកឥវលិយៈ
 
 The differences should be clear – the ICU break iterator does not
 break the words with 100% accuracy.
 
 But, obviously with a dictionary approach, no automatic word breaker
 will ever break correctly 100% of the time. There is no solution that
 will currently automatically break Thai or Khmer 100% correctly (I
 have used Hidden Markov Model breakers, dictionary probability
 breakers, and plain dictionary breakers – none work 100% of the time)
 because, especially for names and places, words in Khmer can just defy
 all rules and patterns. Perhaps in the future, a solution will arise
 that can break Khmer words with 100% accuracy, but at this time, we
 are far from any such solution.
 
 And this is an important reality to remember, because it
 differentiates Thai and Khmer (and possibly other languages that do
 not use spaces between words) from Western languages such as English,
 where a line-breaker and word-breaker can be correct 100% of the time.
 
 As an end user, this inability of the ICU break iterator to break
 Khmer words with 100% accuracy causes usability issues when it comes to
 correcting the automatic breaks that are broken in error.
 
 Here are some reasons why:
 
  1. In LibreOffice a user cannot see where the words have been
 broken, they are invisible.
 
  2. Therefore, trying to use a U+2060 (No Width Word Joiner) to
 correct an error in order to correctly spell check is very
 difficult, because the user cannot see where to place the
 joiner in order to join the word (as in the example case above
 the word សិវ|កឥ|វលិ|យៈ actually needs three U+2060 characters
 to join it to be treated as one word, but the end user does
 not know this because the breaks are invisible).

FWIW with view-field shading on you should see a little gray mark where
the word joiner exists. At least I do anyway.

  3. Even if LibreOffice were able to change their code so that the
 end user could see the word-breaks, adding three U+2060
 characters is quite laborious just to fix one word so that it
 can be spell checked correctly (as one word, rather than spell
 checked as four individual words).
 
 
 
 One possible solution to this issue is by how the ICU Break Iterator
 interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU
 code was enabled to automatically break Khmer, if an end-user wanted
 to spell check Khmer, they had to manually place U+200B characters to
 separate words. This solution worked quite well, but was
 counterintuitive to most native speakers, because Khmer has no spaces
 (as stated before). But with this solution, an end-user could be sure
 that their document was broken with 100% accuracy, if there was no
 human error (something automatic solutions cannot do – it is more
 along the lines of 80% accurate). What I propose, is that the break
 iterator code in LibreOffice looks for U+200B characters in a given
 string and considers them as a sign to NOT 

Re: Adding Extension for Experimental Thai Spelling

2012-07-25 Thread Nathan Wells
Thanks for your reply.

Yes, a "view-word boundaries" mode would be very helpful (or
even incorporating the current view-field shading to include viewing
'gray marks' at the automatic ICU breaking so that users can see what is
being done). Would this be hard to implement?

Also, we are making some changes to the ICU break iterator dictionary for
Khmer - and I've heard there will be some changes in ICU 50 which should
improve results for Khmer.

If anyone has any ideas - it would be appreciated.

Thanks!
Nathan



Re: Adding Extension for Experimental Thai Spelling

2012-07-12 Thread Caolán McNamara
On Sun, 2012-07-08 at 08:08 -0700, sungkhum wrote:
 I have two questions: is there a way to have the LibreOffice spelling
 checker (Hunspell) also recognize word-breaks using the ICU break iterator
 for Khmer so that Cambodians no longer have to add zero-width spaces
 manually (as it seems to work for Thai now?)? Currently, lines without
 zero-width spaces are seen as one long word to the spelling checker in
 LibreOffice 3.6. But since the line-breaking is working, it would seem
 breaking words for the spelling checker should also be able to work. Should
 I submit a bug? How should I proceed?

Sounds like a bug really. I mean, hunspell itself generally doesn't do
the parsing of text into words; the app gives each word to hunspell. And
we're *supposed* to be using the icu breakiterator to split words. I
suspect it's a similar bug to this original one.
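
To make that division of labour concrete, here is a minimal sketch in which the application does the segmentation and the checker only ever sees single words. `word_breaker` and `known_words` are hypothetical stand-ins for the icu breakiterator and the hunspell dictionary; neither is the real API.

```python
def misspelled_words(text, word_breaker, known_words):
    """Return the words the checker rejects.

    word_breaker: callable splitting text into words (a stand-in for the
                  break iterator; an assumption, not the ICU API).
    known_words:  set of accepted spellings (a stand-in for hunspell).
    """
    return [w for w in word_breaker(text) if w and w not in known_words]
```

If the breaker hands back the whole line as one "word", which is the behaviour reported for Khmer, the checker can only reject the whole line, however good the dictionary is.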

So... sure, file a bug, assign it to me (caol...@redhat.com) and paste a
short two word example text into the bug and indicate where the word
break should be and I'll add a regression test for it and see if it's a
trivial fix for Khmer too now that we're using the latest-and-greatest
icu.

 Also, since many other programs do not incorporate ICU's code, is there a
 way to make the line breaks real when a document is saved in another
 format (such as a .doc?). And by real I mean that a zero-width space is
 actually added to the text where a line-break should be.

That should at least be theoretically possible, albeit a bit tricky
seeing as the layout code is the bit that knows the width of the page
and does the line breaking, while the export filters don't get to know
that information. There was something similar done in the past IIRC to
pass around soft-page-break information so that export filters could
know where the layout last put the page breaks. I forget the details of
that though.
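
Setting the layout question aside, a simpler variant of the request (a ZWSP at every word boundary rather than only where a line actually ends) could be sketched as below. This is an illustration only, not LibreOffice export code; `word_breaker` is a hypothetical stand-in for the break iterator.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def materialise_breaks(text, word_breaker):
    """Join the breaker's words with ZWSP so other programs can see the breaks.

    word_breaker is a hypothetical callable returning the words of text in
    order; concatenating them must reproduce text.
    """
    out = []
    for word in word_breaker(text):
        if not word:
            continue
        # No ZWSP at the very start, or next to an existing space or ZWSP.
        needs_zwsp = (
            out
            and not out[-1].endswith((" ", ZWSP))
            and not word.startswith((" ", ZWSP))
        )
        if needs_zwsp:
            out.append(ZWSP)
        out.append(word)
    return "".join(out)
```

Because every boundary gets a ZWSP, the result does not depend on page width, so an export filter would not need any information from the layout code.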

C.



Re: Adding Extension for Experimental Thai Spelling

2012-07-12 Thread sungkhum
Thanks for your reply Caolán,
I have submitted a bug and assigned you to it. I really appreciate you
being willing to look into this!
Here's the bug url:
https://www.libreoffice.org/bugzilla/show_bug.cgi?id=52020
Please let me know if there is anything else I can provide. I have a little
working knowledge of ICU (I helped implement the breakiterator for Khmer by
providing the dictionary and tests), but I am not a programmer by trade.

 There was something similar done in the past IIRC to
 pass around soft-page-break information so that export filters could
 know where the layout last put the page breaks. I forget the details of
 that though.

This would be a very useful feature for Cambodians (and I would assume Thai
as well, although Thai tends to have more programs that currently support
wordbreaking already) - would it be best to seek to do this with an
extension rather than LibreOffice core?

Thanks again for your time,
Nathan





Re: Adding Extension for Experimental Thai Spelling

2012-07-08 Thread sungkhum
I hope no one minds if I piggy-back on this thread. Recently I contributed
to the ICU break iterator for Khmer and it was added to ICU 4.8 (I just
helped with the dictionary, another volunteer did the code). LibreOffice 3.6
added the updated ICU code and now uses the code to line-break Khmer even if
zero-width spaces have not been provided. 

I have two questions: is there a way to have the LibreOffice spelling
checker (Hunspell) also recognize word-breaks using the ICU break iterator
for Khmer so that Cambodians no longer have to add zero-width spaces
manually (as it seems to work for Thai now?)? Currently, lines without
zero-width spaces are seen as one long word to the spelling checker in
LibreOffice 3.6. But since the line-breaking is working, it would seem
breaking words for the spelling checker should also be able to work. Should
I submit a bug? How should I proceed?

Also, since many other programs do not incorporate ICU's code, is there a
way to make the line breaks real when a document is saved in another
format (such as .doc)? And by real I mean that a zero-width space is
actually added to the text where a line-break should be. This also would
make LibreOffice a great tool for Cambodians, since most do not like to type
spaces between words (since the language traditionally doesn't have spaces),
but would then allow them to use their work with other programs without
having to manually type spaces between words.



Re: Adding Extension for Experimental Thai Spelling

2012-02-17 Thread Németh László
Hi,

2012/2/17 Richard Wordingham richard.wording...@ntlworld.com:
 It's a vast improvement - it gives LibreOffice a real Thai
 spell-checker.  Thank you.  I have one worry for Siamese - Németh László
 suggested that there might be a licensing issue back in
 http://openoffice.2283327.n4.nabble.com/Thai-line-breaking-td2791315.html .

There is no problem with the license of the ICU. I'm also very glad of the fix.

Regards,
László


Re: Adding Extension for Experimental Thai Spelling

2012-02-17 Thread Caolán McNamara
On Thu, 2012-02-16 at 23:24 +, Richard Wordingham wrote:
 I wouldn't expect a dictionary-based line breaker to handle words from
 other languages.  (There's a whole slew of Mon-Khmer languages in
 Thailand, and they mostly use the Thai script when they happen to get
 written.)

Indeed, yeah. I suppose, assuming it's as complicated as Thai, that the
right direction would be for someone to write new dictionary-based
breakiterators for icu for the nod(?) language, and then make the
rather trivial changes to LibreOffice so that it knows about the language,
can mark text as that language, and bubbles that info down to icu.

C.



Re: Adding Extension for Experimental Thai Spelling

2012-02-17 Thread Richard Wordingham
On Fri, 17 Feb 2012 14:10:21 +
Caolán McNamara caol...@redhat.com wrote:

 On Thu, 2012-02-16 at 23:24 +, Richard Wordingham wrote:
 Indeed, yeah, I suppose, assuming its as complicated as Thai, that
 the right direction would be for someone to write for icu new
 dictionary-based breakiterators for the nod(?) language and then the
 rather trivial changes to LibreOffice to know about the language in
 order to mark text as that language to bubble that info down to icu

Northern Thai's not quite as simple or standardised as Siamese!  One can
meet (at least) the following spelling systems:

1) Chiangmai phonetics
2) Chiangrai phonetics (different mapping of tones to Siamese spelling
rules)
3) Transliteration from Tai Tham script (probably rare for connected
text)
4) Tai Tham script

However, perhaps dictionary-based break iterators are something to be
treated like dictionaries.  There are several other writing systems
that could probably benefit from them:

Thai script:
  Northern Thai
  NE Thai (for recording songs - use of Siamese tone rules scrambles
  the tonemarks compared to Siamese cognates)

Khmer script:
  Khmer - there's already a project for this set up on SourceForge.
  Pali

Tai Tham script:
  Tai Khuen
  Tai Lue
  Pali

Lao script:
  Lao

Tibetan script:
  Tibetan

I've a feeling Burmese may also have a need for dictionary-based text
breaking, though it's better behaved for syllable breaking than most of
the others listed here.  Shan would come in the same category.

The above list is not exhaustive.  Tai Lue in Lao script probably
belongs in the list.

Not all Thai script writing systems need a break iterator - some of the
minority languages separate words with spaces, but that's partially a
matter of literacy - Thais start writing Thai with interword gaps and
then learn to suppress the gaps.  Pali written in Thai also separates
words with spaces - but Pali has some very long words!

Richard.


Re: Adding Extension for Experimental Thai Spelling

2012-02-16 Thread Richard Wordingham
On Tue, 14 Feb 2012 16:19:17 +
Caolán McNamara caol...@redhat.com wrote:

 I think this change:
 http://cgit.freedesktop.org/libreoffice/core/commit/?id=475d0c59c66fb7752d230f76130b17145aad0c12
 should improve matters a lot.

It's a vast improvement - it gives LibreOffice a real Thai
spell-checker.  Thank you.  I have one worry for Siamese - Németh László
suggested that there might be a licensing issue back in
http://openoffice.2283327.n4.nabble.com/Thai-line-breaking-td2791315.html .

If there isn't such an issue, does this mean we can hope to see your
fix in LibreOffice 3.5.1?

 Makes กุหลาบ get treated as a single
 word in the unit test there now anyway, though the Northern Thai one
 is still not considered a single word, that might be due to the
 oldish icu we're still using.

I wouldn't expect a dictionary-based line breaker to handle words from
other languages.  (There's a whole slew of Mon-Khmer languages in
Thailand, and they mostly use the Thai script when they happen to get
written.)  I can work my way round the problem using the sticking
plaster of ZWSP and WJ (no-break no-space), and I think some use of
them or an equivalent is inevitable when the sequence of visible
characters doesn't define the breaks.  In particular, after gluing
กุ๊หลาบ together with WJ, Hunspell offered me กุหลาบ as a correction,
which is good.

There may be some rough edges with ZWSP and WJ going into the
dictionary (TBC), but what you've done will justify LibreOffice claiming
a Thai spell checking capability.

Minority language support may not be compatible with libthai - at least
one language uses a combining underline, and some of the mark
combinations used for minority languages would get rejected by the WTT
rules that libthai supports.

Richard.


Re: Adding Extension for Experimental Thai Spelling

2012-02-14 Thread Caolán McNamara
On Mon, 2012-02-13 at 22:39 +, Richard Wordingham wrote:
 The spell-checker seems to break up a phrase consisting of just กุหลาบ
 into 3 or 4 words.

Hmm, so I played around with this and here's what I think is the
problem...

We have some customized break iterator rules in LibreOffice, so we're
using those ones and *not* the built-in icu ones. But we lack a
customized Thai one, so we're using some ultra-generic word breaking
stuff for Thai and not going near the special built-into-icu Thai
iterator :-(

I think this change:
http://cgit.freedesktop.org/libreoffice/core/commit/?id=475d0c59c66fb7752d230f76130b17145aad0c12
should improve matters a lot. Makes กุหลาบ get treated as a single
word in the unit test there now anyway, though the Northern Thai one is
still not considered a single word, that might be due to the oldish icu
we're still using.

After some googling I'm unsure if the right way to go to further
improve Thai break iterators is to simply have another go at upgrading
icu to get the latest and greatest there, or for someone to have a go
at integrating libthai into LibreOffice and hand off break iteration for
Thai to that. Either way, link above and related unit test give an entry
point to the relevant code.

C. 



Re: Adding Extension for Experimental Thai Spelling

2012-02-14 Thread Eike Rathke
Hi,

On Tuesday, 2012-02-14 16:19:17 +, Caolán McNamara wrote:

 We have some customized break iterator rules in LibreOffice, so we're
 using those ones and *not* the built-in icu ones. But we lack a
 customized Thai one, so we're using some ultra-generic word breaking
 stuff for Thai and not going near the special built-into-icu Thai
 iterator :-(

Right, I think the generic customized one dates back to a time when
ICU didn't have a specialized Thai break iterator (not sure about that,
but ...), so it should be good to have that switched to ICU for 'th'.

  Eike

-- 
LibreOffice Calc developer. Number formatter stricken i18n transpositionizer.
GnuPG key 0x293C05FD : 997A 4C60 CE41 0149 0DB3  9E96 2F1A D073 293C 05FD




Re: Adding Extension for Experimental Thai Spelling

2012-02-13 Thread Michael Stahl
On 11/02/12 17:23, Richard Wordingham wrote:
 As I understand it, the lack of a usable Thai spell-checker for
 LibreOffice (unlike, say, a Khmer spell-checker) is due to the Thai
 break iterator.  (I had expected Thai and Khmer to face similar
 problems, for neither has a visible word separator and syllable
 boundaries are often unclear in both.)  Tagging Thai script text as
 Khmer does not work (at least, not in Version 3.4.5); the word
 boundaries are still determined by the Thai break iterator.
 
 Is it possible to create an experimental alternative to the Thai
 break iterator that can be shared with other people as a LibreOffice
 extension? I would be prepared to routinely use U+200B ZERO WIDTH SPACE
 (ZWSP) to separate words in the Thai script, but I suspect Thais would
 not.  Also, I can see my first useful version fouling up the
 rendering of pre-existing text.  I can't work out how to create a break
 iterator as an *extension*. Could someone please advise me how, e.g. by
 pointing to the documentation or an example.  I can find documentation
 for *publishing* an extension, but that does not address *creating* an
 extension.

hi Richard,

while i don't know anything about break iterators, since OOo 3.0.1 there
is a new grammar checking API, which AFAIK operates on a whole paragraph
at a time; perhaps that API would make implementing a spelling checker
for such languages easier (if LO cannot determine the word boundaries
then the checker can always do it on its own).

http://wiki.services.openoffice.org/wiki/Grammar_Checking
http://www.openoffice.org/lingucomponent/grammar.html

regards,
 michael



Re: Adding Extension for Experimental Thai Spelling

2012-02-13 Thread Michael Meeks

On Sat, 2012-02-11 at 16:23 +, Richard Wordingham wrote:
 As I understand it, the lack of a usable Thai spell-checker for
 LibreOffice (unlike, say, a Khmer spell-checker) is due to the Thai
 break iterator.

In common with many, I know nothing about Thai ;-) but my friend Tim
does - quite possibly he can help you ? (or do you know each other
already) ?

Thanks !

Michael

[ who abnormally leaves the context intact for Tim ;-]

   (I had expected Thai and Khmer to face similar
 problems, for neither has a visible word separator and syllable
 boundaries are often unclear in both.)  Tagging Thai script text as
 Khmer does not work (at least, not in Version 3.4.5); the word
 boundaries are still determined by the Thai break iterator.
 
 Is it possible to create an experimental alternative to the Thai
 break iterator that can be shared with other people as a LibreOffice
 extension? I would be prepared to routinely use U+200B ZERO WIDTH SPACE
 (ZWSP) to separate words in the Thai script, but I suspect Thais would
 not.  Also, I can see my first useful version fouling up the
 rendering of pre-existing text.  I can't work out how to create a break
 iterator as an *extension*. Could someone please advise me how, e.g. by
 pointing to the documentation or an example.  I can find documentation
 for *publishing* an extension, but that does not address *creating* an
 extension.
 
 Richard.

-- 
michael.me...@suse.com  , Pseudo Engineer, itinerant idiot



Re: Adding Extension for Experimental Thai Spelling

2012-02-13 Thread Caolán McNamara
On Sat, 2012-02-11 at 16:23 +, Richard Wordingham wrote:
 Is it possible to create an experimental alternative to the Thai
 break iterator that can be shared with other people as a LibreOffice
 extension?

I don't think we have any way to override our breakiterators from
extensions.

FWIW, i18npool/source/breakiterator is where we have our word,
character, sentence and line break iterators implemented. 

Typically we forward everything on to icu to do the real work, albeit
with some customization of the default icu rules.

What I'd *expect* to happen is that text marked as Thai should end up
getting broken into words by the default icu word break iterator;
http://userguide.icu-project.org/boundaryanalysis claims that ICU
provides a special dictionary-based break iterator for Thai.

So, assuming that nothing is simply broken, improving the icu Thai break
iterator should improve LibreOffice for free.
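
For readers unfamiliar with the term, the following is a rough sketch of what "dictionary-based" word breaking means. It uses greedy longest-match against a word list, which is an assumption chosen for illustration and is not claimed to be the algorithm ICU actually uses. The function name segment and the two-word dictionary are hypothetical.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation.

    Returns the list of boundary offsets, always starting with 0 and
    ending with len(text). A character that starts no dictionary word
    becomes a one-character segment.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    boundaries = [0]
    pos = 0
    while pos < len(text):
        step = 1  # fallback when no dictionary word starts here
        for length in range(min(max_len, len(text) - pos), 0, -1):
            if text[pos:pos + length] in dictionary:
                step = length
                break
        pos += step
        boundaries.append(pos)
    return boundaries


dictionary = {"กุหลาบ", "แดง"}          # hypothetical word list
print(segment("กุหลาบแดง", dictionary))  # two dictionary words
print(segment("กุ๊หลาบ", dictionary))     # spelling not in the word list
```

With a scheme like this, a spelling missing from the word list falls apart into fragments, so the quality of the word list largely determines the quality of the breaks.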

I'd be sort of interested in confirming that what we have right now
actually works correctly, in the sense that Thai text definitely *is*
getting run through the special Thai-specific icu word break handler.

There is an i18npool/qa/cppunit/test_breakiterator.cxx which we use to
make sure that some existing edge-cases continue to work. If you wanted
to hack that to add some Thai word break tests that'd be helpful, and/or
simply pass me on some sample text where we *are* doing the right thing
and where we *aren't* and I could populate a test in there with that
data and turn the problem into a developer-friendly "enable this
word-break unit test and make it work" problem.
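
One possible way to hand over such sample text is to mark the expected breaks inline and convert them to offsets mechanically. The sketch below assumes a "|" marker, and the helpers parse_sample and compare_breaks are hypothetical; it is unrelated to how test_breakiterator.cxx itself is written.

```python
def parse_sample(annotated, marker="|"):
    """Turn 'ab|cde|f' into ('abcdef', [2, 5]).

    The offsets are the expected interior word boundaries, counted in
    code points of the text with the markers removed.
    """
    parts = annotated.split(marker)
    offsets = []
    pos = 0
    for part in parts[:-1]:
        pos += len(part)
        offsets.append(pos)
    return "".join(parts), offsets


def compare_breaks(expected, actual):
    """Return (missing, spurious) boundary offsets as sorted lists."""
    expected, actual = set(expected), set(actual)
    return sorted(expected - actual), sorted(actual - expected)


text, expected = parse_sample("ab|cde|f")
# 'actual' would come from whatever break iterator is under test.
missing, spurious = compare_breaks(expected, actual=[2, 3])
print(text, expected, missing, spurious)
```

Offsets here count code points; a test against a UTF-16 API would need them converted accordingly.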

C.



Re: Adding Extension for Experimental Thai Spelling

2012-02-13 Thread Richard Wordingham
Thank you to everyone who's offered me advice.

On Mon, 13 Feb 2012 15:08:20 +
Caolán McNamara caol...@redhat.com wrote:

 I don't think we have any way to override our breakiterators from
 extensions.

Ah well, I'll just have to try to get Thai spell-checking working for
myself and then worry about sharing my changes - assuming I succeed.

 I'd be sort of interested in confirming that what we have right now
 actually works correctly, in the sense that Thai text definitely *is*
 getting run through the special Thai-specific icu word break handler.

It's definitely going through a Siamese-specific word-breaker for
line-breaking.  For example the two-syllable Thai word กุหลาบ
'rose' moves to the next line, but when I convert it to the Northern
Thai form กุ๊หลาบ (not the spelling I'd favour) by adding a
(non-spacing) tone mark, it's promptly broken between lines along the
syllable boundary, although the first syllable does not constitute a
word, at least not one recorded in the Royal Institute Dictionary. I'm
glad to find that inserting U+2060 WJ prevents that break. The
spell-checker seems to break up a phrase consisting of just กุหลาบ into
3 or 4 words.
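
For anyone wanting to reproduce this, the sketch below spells out the code points involved. It assumes the added tone mark is U+0E4A, as it appears in the text above, and that the WJ goes at the syllable boundary after the first three code points. The helper names describe and with_joiner are hypothetical.

```python
import unicodedata

WJ = "\u2060"  # U+2060 WORD JOINER

siamese = "\u0e01\u0e38\u0e2b\u0e25\u0e32\u0e1a"         # กุหลาบ
northern = "\u0e01\u0e38\u0e4a\u0e2b\u0e25\u0e32\u0e1a"  # กุ๊หลาบ (tone mark added)


def describe(text):
    """Return (code point, Unicode general category) for each character."""
    return [("U+%04X" % ord(ch), unicodedata.category(ch)) for ch in text]


def with_joiner(text, offset):
    """Insert WJ at the given code point offset to forbid a break there."""
    return text[:offset] + WJ + text[offset:]


# Assumption: the unwanted line break falls after the first syllable,
# i.e. after the consonant, the vowel sign and the tone mark.
joined = with_joiner(northern, 3)

for label, s in (("siamese", siamese), ("northern", northern), ("joined", joined)):
    print(label, describe(s))
```

The category column shows which characters are non-spacing marks (Mn) and that WJ is a format character (Cf).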

Richard.


Adding Extension for Experimental Thai Spelling

2012-02-11 Thread Richard Wordingham
As I understand it, the lack of a usable Thai spell-checker for
LibreOffice (unlike, say, a Khmer spell-checker) is due to the Thai
break iterator.  (I had expected Thai and Khmer to face similar
problems, for neither has a visible word separator and syllable
boundaries are often unclear in both.)  Tagging Thai script text as
Khmer does not work (at least, not in Version 3.4.5); the word
boundaries are still determined by the Thai break iterator.

Is it possible to create an experimental alternative to the Thai
break iterator that can be shared with other people as a LibreOffice
extension? I would be prepared to routinely use U+200B ZERO WIDTH SPACE
(ZWSP) to separate words in the Thai script, but I suspect Thais would
not.  Also, I can see my first useful version fouling up the
rendering of pre-existing text.  I can't work out how to create a break
iterator as an *extension*. Could someone please advise me how, e.g. by
pointing to the documentation or an example.  I can find documentation
for *publishing* an extension, but that does not address *creating* an
extension.

Richard.