Re: Unicode and languages

2020-06-07 Thread David V Glasgow via use-livecode
Ha!  You beat me to it, Alex.  The only thing I would add is that Paul might 
be able to pick out very common but distinctive markers for each language, and 
build a simple algorithm around them.  
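
A rough sketch of what such a marker-based algorithm could look like in LiveCode script. The handler name and the marker word lists are invented for illustration and would need tuning against real transcripts; the `trueWord` chunk needs LiveCode 7 or later.

```livecode
-- Guess a language by counting hits against short lists of very
-- common words. The lists below are illustrative assumptions only.
function guessLanguage pText
   local tMarkersA, tScoresA, tBest, tBestScore
   put "the,and,of,is,with" into tMarkersA["English"]
   put "le,les,et,des,est" into tMarkersA["French"]
   put "der,und,das,ist,nicht" into tMarkersA["German"]

   -- trueWord splits on word boundaries, so punctuation is ignored
   repeat for each trueWord tWord in pText
      repeat for each key tLanguage in tMarkersA
         if tWord is among the items of tMarkersA[tLanguage] then
            add 1 to tScoresA[tLanguage]
         end if
      end repeat
   end repeat

   -- Pick the language with the most marker hits
   put 0 into tBestScore
   repeat for each key tLanguage in tScoresA
      if tScoresA[tLanguage] > tBestScore then
         put tLanguage into tBest
         put tScoresA[tLanguage] into tBestScore
      end if
   end repeat
   return tBest -- empty when no marker word was found
end guessLanguage
```

Feeding it only the first few hundred characters of the field keeps it fast, and an empty result can be treated as "unknown, do not spell check".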

It made me wonder how Google Translate does it when it is set to 'detect language'.

Cheers,

David G 

> On 6 Jun 2020, at 2:11 pm, Alex Tweedly via use-livecode wrote:
> 
> If you simply need to protect users in the scenario you describe, then you 
> could try a simple heuristic
> 
>  - extract the first 100 (200? - 500?) characters (or first 20 words)
> 
>  - spell check that
> 
>  - if there are more than 10 (20? - 50??) spelling errors then flag it as a 
> likely language mismatch.
>  - and if not, proceed to do the spellcheck.
> 
> Adjust the numbers until it gives protection without too many false positives.
> 
> Alex.
> 


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: Unicode and languages

2020-06-06 Thread Jim Lambert via use-livecode
Not LC native but take a look at Google Translate API 


Jim Lambert

> Language detection
> When you don’t know your source text language — for instance, in 
> user-generated content that doesn’t include a language code — our translation 
> products automatically identify languages with high accuracy.


Re: Unicode and languages

2020-06-06 Thread Alex Tweedly via use-livecode
If you simply need to protect users in the scenario you describe, then 
you could try a simple heuristic


 - extract the first 100 (200? - 500?) characters (or first 20 words)

 - spell check that

 - if there are more than 10 (20? - 50??) spelling errors then flag it 
as a likely language mismatch.

 - and if not, proceed to do the spellcheck.

Adjust the numbers until it gives protection without too many false 
positives.
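
A minimal sketch of that heuristic in LiveCode script. The handler names, the parameters and the `pKnownWordsA` array are made up for illustration; `wordIsKnown` is a stand-in for whatever lookup your spell checker offers and is not an lclSpell call.

```livecode
-- Returns true when a short sample of pText produces more than
-- pMaxErrors unknown words, i.e. the text probably does not match
-- the loaded dictionary.
function looksLikeWrongLanguage pText, pKnownWordsA, pSampleSize, pMaxErrors
   local tChecked, tErrors
   put 0 into tChecked
   put 0 into tErrors
   -- trueWord (LiveCode 7+) splits on word boundaries, not just spaces
   repeat for each trueWord tWord in pText
      add 1 to tChecked
      if not wordIsKnown(tWord, pKnownWordsA) then add 1 to tErrors
      -- Bail out early: no need to look at the rest of the text
      if tErrors > pMaxErrors then return true
      if tChecked >= pSampleSize then exit repeat
   end repeat
   return false
end looksLikeWrongLanguage

-- Hypothetical lookup: pKnownWordsA is an array keyed by lower-case word
function wordIsKnown pWord, pKnownWordsA
   return pKnownWordsA[toLower(pWord)] is not empty
end wordIsKnown
```

A call such as `looksLikeWrongLanguage(the text of field 1, tKnownWordsA, 20, 10)` checks at most 20 words and flags the text once more than 10 of them are unknown; both numbers are the knobs to adjust.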


Alex.





Re: Unicode and languages

2020-06-05 Thread Richmond via use-livecode
I doubt that. But if you can determine the Unicode range that is being 
used you can at least know which writing system is being used. You could 
then trap for individual glyphs (such as 'џ', which is used in 
Macedonian and Serbian but not in Russian or Bulgarian) to narrow 
things down a bit.
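
A small sketch of the range idea in LiveCode script (7 or later). The handler names are invented, the block list is deliberately short, and the boundaries are the published Unicode block ranges written in decimal; treat it as a starting point rather than a complete classifier.

```livecode
-- Rough writing-system name for the first codepoint of pChar,
-- based on a handful of Unicode blocks.
function writingSystemOf pChar
   local tNum
   put codepointToNum(codepoint 1 of pChar) into tNum
   if tNum >= 65 and tNum <= 90 then return "Latin"          -- A-Z
   if tNum >= 97 and tNum <= 122 then return "Latin"         -- a-z
   -- U+00C0..U+024F: accented Latin (rough, also catches the × and ÷ signs)
   if tNum >= 192 and tNum <= 591 then return "Latin"
   if tNum >= 880 and tNum <= 1023 then return "Greek"       -- U+0370..U+03FF
   if tNum >= 1024 and tNum <= 1279 then return "Cyrillic"   -- U+0400..U+04FF
   if tNum >= 1424 and tNum <= 1535 then return "Hebrew"     -- U+0590..U+05FF
   if tNum >= 1536 and tNum <= 1791 then return "Arabic"     -- U+0600..U+06FF
   if tNum >= 12352 and tNum <= 12543 then return "Kana"     -- U+3040..U+30FF
   if tNum >= 19968 and tNum <= 40959 then return "CJK"      -- U+4E00..U+9FFF
   if tNum >= 44032 and tNum <= 55215 then return "Hangul"   -- U+AC00..U+D7AF
   return "other"
end writingSystemOf

-- True when pText contains the glyph with the given codepoint number,
-- e.g. 1119 for 'џ' (U+045F)
function containsMarker pText, pCodepointNumber
   return numToCodepoint(pCodepointNumber) is in pText
end containsMarker
```

The first handler answers "which writing system?", and `containsMarker(tText, 1119)` is the kind of single-glyph trap that can then narrow a Cyrillic text down further.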


On 5.06.20 20:15, Paul Dupuis via use-livecode wrote:
> In all the added stuff the LC7 and higher Unicode engine includes, is 
> there any way to determine the LANGUAGE of a range of text?







Re: Unicode and languages

2020-06-05 Thread Paul Dupuis via use-livecode

On 6/5/2020 1:46 PM, Mark Waddingham via use-livecode wrote:
> On 2020-06-05 18:15, Paul Dupuis via use-livecode wrote:
>> I don't even know if language exists in the IBM Unicode engine as some
>> exportable property a future version of LiveCode could expose.
>>
>> Any clever ideas or thoughts on this problem are welcome.
>
> Unicode doesn't deal in languages but 'scripts' e.g. English and 
> French are both written in the Latin script, whereas Ukrainian can be 
> written in either the Latin or Cyrillic script.
>
> LiveCode gives you access to the unicode properties of all codepoints 
> (as held by ICU) via the 'codepointProperty(codepoint, property)' 
> function.
>
> e.g. codepointProperty("A", "Script") => "Latin"
>      codepointProperty(numToCodepoint(0x03B1), "Script") => "Greek"
>
> This might at least help to cut out trying to spellcheck things which 
> are definitely not the languages you do have dictionaries for.
>
> Warmest Regards,
>
> Mark.



Thank you Mark!

That is at least a start toward minimizing the problem for our customers. 
We can make a table mapping Scripts to Dictionaries and at least prevent 
erroneous spell checking of Chinese with an English Dictionary!
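
A sketch of what such a table might look like in LiveCode script. The handler name, the dictionary names and the script-to-dictionary pairings are all hypothetical; the script names follow the "Latin" and "Greek" style of values shown for codepointProperty.

```livecode
-- True when at least one loaded dictionary is written in pScript.
-- pLoadedDictionaries is a comma-separated list of dictionary names.
function scriptHasDictionary pScript, pLoadedDictionaries
   local tTableA
   -- Hypothetical table: script name -> dictionaries using that script
   put "English,French,Spanish" into tTableA["Latin"]
   put "Russian" into tTableA["Cyrillic"]
   put "Greek" into tTableA["Greek"]

   -- A script with no entry gives an empty list, so the loop is skipped
   repeat for each item tDictionary in tTableA[pScript]
      if tDictionary is among the items of pLoadedDictionaries then
         return true
      end if
   end repeat
   return false
end scriptHasDictionary
```

With only an English dictionary loaded, `scriptHasDictionary("Han", "English")` returns false, so the Chinese part of a transcript can be skipped before the spell checker ever sees it.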






Re: Unicode and languages

2020-06-05 Thread Mark Waddingham via use-livecode

On 2020-06-05 18:15, Paul Dupuis via use-livecode wrote:
> I don't even know if language exists in the IBM Unicode engine as some
> exportable property a future version of LiveCode could expose.
>
> Any clever ideas or thoughts on this problem are welcome.


Unicode doesn't deal in languages but in 'scripts': for example, English 
and French are both written in the Latin script, whereas Ukrainian can 
be written in either the Latin or Cyrillic script.


LiveCode gives you access to the Unicode properties of all codepoints 
(as held by ICU) via the 'codepointProperty(codepoint, property)' 
function.


e.g. codepointProperty("A", "Script") => "Latin"
     codepointProperty(numToCodepoint(0x03B1), "Script") => "Greek"

This might at least help to cut out trying to spellcheck things which 
are definitely not the languages you do have dictionaries for.
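
As a rough illustration, the function can be used to find the script that dominates a piece of text. The handler name is invented, and the sketch assumes that spaces, digits and punctuation report the script "Common" and combining marks report "Inherited"; check what your LiveCode version actually returns before relying on that.

```livecode
-- Returns the script name used by most codepoints in pText,
-- or empty if the text has no script-specific characters.
function dominantScript pText
   local tCountsA, tScript, tBest, tBestCount
   repeat for each codepoint tCodepoint in pText
      put codepointProperty(tCodepoint, "Script") into tScript
      -- Assumed values for characters shared between scripts
      if tScript is "Common" or tScript is "Inherited" then next repeat
      add 1 to tCountsA[tScript]
   end repeat

   -- Pick the script with the highest count
   put 0 into tBestCount
   repeat for each key tKey in tCountsA
      if tCountsA[tKey] > tBestCount then
         put tKey into tBest
         put tCountsA[tKey] into tBestCount
      end if
   end repeat
   return tBest
end dominantScript
```

Running it on a sample, for example `dominantScript(char 1 to 200 of field 1)`, is usually enough to decide whether a spell check is worth starting.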


Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



Unicode and languages

2020-06-05 Thread Paul Dupuis via use-livecode
In all the added stuff the LC7 and higher Unicode engine includes, is 
there any way to determine the LANGUAGE of a range of text?


USE-CASE

We have a tool that helps researchers transcribe text from digital 
media. It is used internationally. We have added spell checking using 
lclSpell from Live Code Labs, a LiveCode store add-on.


For lclSpell, we only have Dictionaries for a small set of languages. 
You can build your own Dictionaries for lclSpell, but we'll still only 
have Dictionaries for a small subset of the languages people transcribe 
in. We also have people who do BOTH transcription AND translations.


For example, someone may transcribe a Chinese-language media recording, 
typing in the Simplified or Traditional Chinese characters, AND then 
translate it to English, typing the English translation after the 
transcription.


With lclSpell (or I suspect ANY LiveCode compatible spell checker) if 
you try to spell check a reasonably large chunk of text that is NOT in 
the same language as your Dictionary, it ties up LiveCode forever, or at 
least for such a long time that most people would force-quit. It is, 
after all, marking every word as misspelled and trying to do whatever it 
does to determine that.


Now, you could respond that the researcher should just KNOW better than 
to spell check a text in a language that is not their loaded Dictionary! 
However, people are people, and will do such things and expect software 
to protect them from their own mistakes. Also, with mixed transcription 
and translation, you do want to spell check the English part and skip 
the Chinese (if you do not have a Chinese Dictionary).


So, we're looking for a way to detect the LANGUAGE of a range of text, 
in a LiveCode field, to be able to then determine whether it matches the 
current (or any available) dictionary or not and act accordingly.


There is a "fontLanguage" function in LC, but that seems to predate 
Unicode Everywhere and seems pretty useless now.


For example, in a new stack, with a single scrolling field, we paste in 
a Chinese text and then execute:


put the fontLanguage of (the effective textfont of char 1 to -1 of fld 1)

and get "ansi". Even if you set the range (char 2 to 3) to one that is 
specifically Chinese (no white space), it still returns "ansi". The 
textFont returns empty and the effective textFont returns "Segoe UI".


I don't even know if language exists in the IBM Unicode engine as some 
exportable property a future version of LiveCode could expose.


Any clever ideas or thoughts on this problem are welcome.


