character sets and languages in openEHR

2004-04-06 Thread Thomas Beale
Hoylen Sue wrote:

It is not necessary for openEHR to specify the encoding
format (UTF-8, UTF-16, etc).

Since openEHR does not dictate an implementation or
transport format, it does not need to -- and should not --
specify the character encoding format.

Just saying text will be using the Unicode character set
(and maybe indicating which particular version of Unicode is
being used, version 4.0 is currently the latest) is
sufficient.
  

I wonder if this is true for people using openEHR-based components via 
an API rather than communicating via data messages. I assume that the 
Unicode implementation used in the String type in most of today's 
languages makes it easy to determine what width of Unicode characters you 
have in the data?

I agree that being able to commit to as little as possible and still get 
the effect of standardisation is completely desirable.

- thomas beale


-
If you have any questions about using this list,
please send a message to d.lloyd at openehr.org



character sets and languages in openEHR

2004-03-23 Thread Hoylen Sue

It is not necessary for openEHR to specify the encoding
format (UTF-8, UTF-16, etc).

Since openEHR does not dictate an implementation or
transport format, it does not need to -- and should not --
specify the character encoding format.

Just saying text will be using the Unicode character set
(and maybe indicating which particular version of Unicode is
being used, version 4.0 is currently the latest) is
sufficient.

For example, if you are encoding openEHR records using XML,
the XML format already has its own mechanism for identifying
the character encoding of the document (the XML declaration,
BOM, etc).  Having the character encoding in the Entry would
be meaningless and a potential source of conflict.
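
A minimal sketch of the point above, assuming Python 3.8+ and its
standard xml.etree.ElementTree module. The element name "entry" and the
sample text are hypothetical and are not taken from openEHR. The same
text is serialised twice, and the parser recovers it from the XML
declaration and BOM alone, with no encoding attribute in the content:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text; not taken from any openEHR record.
SAMPLE = "Hoest / \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac / \u043a\u0430\u0448\u0435\u043b\u044c"


def to_xml_bytes(text: str, encoding: str) -> bytes:
    """Serialise a one-element document that declares its own encoding."""
    root = ET.Element("entry")  # element name is illustrative only
    root.text = text
    return ET.tostring(root, encoding=encoding, xml_declaration=True)


def read_text(document: bytes) -> str:
    """Parse bytes; the parser takes the encoding from the declaration/BOM."""
    return ET.fromstring(document).text


utf8_doc = to_xml_bytes(SAMPLE, "utf-8")
utf16_doc = to_xml_bytes(SAMPLE, "utf-16")

# Different bytes on the wire, identical text after parsing.
print(utf8_doc != utf16_doc)
print(read_text(utf8_doc) == read_text(utf16_doc) == SAMPLE)
```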

Hoylen
-- 
__ Dr Hoylen Sue
h.sue at dstc.edu.au    http://www.dstc.edu.au/
DSTC Pty Ltd --- Australian W3C Office   +61 7 3365 4310




character sets and languages in openEHR

2004-03-20 Thread gfrer
I agree. :-)

GF


--  private --
Gerard Freriks, arts
Huigsloterdijk 378
2158 LR Buitenkaag
The Netherlands

+31 252 544896
+31 654 792800
On 19 Mar 2004, at 15:36, Thomas Beale wrote:

 ENTRY class has
 - a mandatory language attribute
 - a mandatory character encoding attribute (says which flavour of 
 unicode). This forces the whole ENTRY to be encoded the same way no 
 matter what, but also allows distinct ENTRYs to be encoded in e.g. 
 UTF-8, UTF-16.

 DV_TEXT class has
 - an optional language attribute, which is understood to override the 
 one from its enclosing ENTRY.

 further thoughts from the group?


character sets and languages in openEHR

2004-03-18 Thread Tim Churches
On Thu, 2004-03-18 at 00:51, gfrer wrote:
 Hi,

 Anamnesis in psychiatry:

 And then the disturbed patient said: Merdre. [Translation: shit]

 Family history:

 My father was diagnosed as suffering from: Engelse ziekte
 [Translation: Rickets disease]

 Coding systems

 ICPC-1 Dutch version.
 Code: R05.
 Displayed text: Hoest
 Added translation: Cough

Yes, I thought of examples which were similar to these. And it is not
just a matter of the recording health professional not knowing what
Engelse ziekte means, and thus having to record it verbatim and
untranslated - many diagnoses have no equivalent in other
languages/cultures, and are thus untranslatable (at least not without
some information loss). Given that the foreign language text may
require accented characters, or even a completely different character
set, then the Unicode encoding used for the entry will need to be
captured as well as the language, unless openEHR will be restricted
purely to one Unicode encoding, such as UTF-8. Remember the golden rule
with Unicode: If you don't know the encoding, you don't know nuffin'.
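
A small illustration of that golden rule, as a sketch in Python. The
sample word is an arbitrary assumption, chosen only because it contains
one non-ASCII letter. The same stored bytes give different text
depending on which encoding the reader assumes:

```python
# "Sjogren" with o-umlaut (U+00F6); an arbitrary sample word.
original = "Sj\u00f6gren"
stored = original.encode("utf-8")      # bytes as they might sit on disk

right = stored.decode("utf-8")         # encoding known
wrong = stored.decode("iso-8859-1")    # encoding guessed incorrectly

print(right)
print(wrong)
```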

The only problem with UTF-8 everywhere is that it is Roman alphabet
chauvinistic, in that the basic Roman characters are all represented
with one byte, but everything else needs two or more bytes. That dooms all
Russian openEHR records to using twice as much storage as the equivalent
English openEHR records. In these days of massive cheap disc storage and
high speed networks, that fact probably doesn't matter, but it just
seems unfair, although I can't think of a better alternative. As an
English speaker, I would not be keen if openEHR mandated the use of
UTF-16, thus forcing me to use two bytes for every letter. Yet that's
what UTF-8 forces Russians, and Greeks, and Thais and Vietnamese and
just about every other non-Roman alphabetic language speaker to do. Of
course, ideographic languages like Chinese are doomed to use more than
one byte per character, but then the language itself encodes a lot more
information in each character, so it probably works out about the same
in the end.
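
The storage argument above can be checked with a rough sketch in
Python. The sample words are assumptions chosen for illustration (each
is meant to read "cough") and do not come from the thread:

```python
# Sample words are illustrative assumptions, not data from the thread.
SAMPLES = {
    "English": "cough",
    "Russian": "\u043a\u0430\u0448\u0435\u043b\u044c",
    "Thai": "\u0e44\u0e2d",
    "Chinese": "\u54b3\u55fd",
}


def byte_widths(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    # "utf-16-le" is used so that the 2-byte BOM is not counted.
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
    )


for language, word in SAMPLES.items():
    chars, utf8, utf16 = byte_widths(word)
    print(f"{language:8} chars={chars} utf-8={utf8} utf-16={utf16}")
```

In this sketch the Cyrillic word takes the same number of bytes in both
encodings, while the Thai and Chinese words take three bytes per
character in UTF-8 against two in UTF-16.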

-- 

Tim C

PGP/GnuPG Key 1024D/EAF993D0 available from keyservers everywhere
or at http://members.optushome.com.au/tchur/pubkey.asc
Key fingerprint = 8C22 BF76 33BA B3B5 1D5B  EB37 7891 46A9 EAF9 93D0




character sets and languages in openEHR

2004-03-18 Thread Thomas Beale
Tim Churches wrote:


Yes, I thought of examples which were similar to these. And it is not
just a matter of the recording health professional not knowing what
Engelse ziekte means, and thus having to record it verbatim and
untranslated - many diagnoses have no equivalent in other
languages/cultures, and are thus untranslatable (at least not without
some information loss).

actually, these kinds of expressions are not the problem - they can 
happily be recorded inside a DV_TEXT object which has the language set 
to English or Dutch or whatever it may be; an inline occurrence of a 
'foreign' term that is routinely used by speakers of a different 
language (the way we use 'gesundheit' or 'triage' in English) can be 
assumed to be understood and is probably even in the dictionary of the 
language of narration.

The problem is when there are text fragments recorded where the words 
are viable in more than one language, and do not usually have the same 
meaning in each. Words in Danish and Norwegian should be almost the same, 
but I assume there are by now some small differences; there are 
certainly words in most of the European languages which occur in another 
language, and are completely unrelated. So in theory a language marker 
is needed to ensure that a later reader knows what language the words 
were in (maybe even to allow them to know what kind of translator to 
call). So the question remains - do we need the ability to have multiple 
languages inside a single entry? For Gerard's examples - would it really 
be necessary to indicate what the other languages were or not, given 
that they are probably obvious to most users who will use them?

The real reason for the question is that having to record language 
everywhere all the time means wasting a certain amount of data storage 
on every text fragment stored in the record; the alternative seems to be 
to record it on Entry. If we decide that it has to be possible to have 
text fragments within an Entry for which the name of a different 
language is actually recorded, we can use an optional language attribute 
on DV_TEXT which is understood as overriding the value elsewhere. In 
general I am against this kind of overriding of values in lower objects 
in a composition - it is not OO, and it is often misunderstood by 
programmers given the specifications; in general it is dangerous. 
However, maybe this is an exception which justifies its use.
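
The override idea described above might look roughly like this in
Python. This is only a sketch: the class names echo DV_TEXT and ENTRY
from the thread, but the fields, the effective_language helper and the
language codes are assumptions and not the openEHR reference model:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DvText:
    value: str
    # Optional: when set, it overrides the language of the enclosing entry.
    language: Optional[str] = None


@dataclass
class Entry:
    language: str  # mandatory on the entry
    items: List[DvText] = field(default_factory=list)

    def effective_language(self, item: DvText) -> str:
        """Hypothetical helper: resolve the language that applies to item."""
        return item.language if item.language is not None else self.language


entry = Entry(
    language="en",
    items=[
        DvText("My father was diagnosed as suffering from:"),
        DvText("Engelse ziekte", language="nl"),
    ],
)
for item in entry.items:
    print(entry.effective_language(item), "|", item.value)
```

Every reader of a DvText has to go through the enclosing Entry to learn
its language, which is the kind of indirection the message warns about.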

As for Unicode, obviously we cannot do much about the standard; but I 
guess someone had to have the 8-bit part of the code space.


  



-- 
___
CTO Ocean Informatics (http://www.OceanInformatics.biz)
Hon. Research Fellow, University College London

openEHR (http://www.openEHR.org)
Archetypes (http://www.oceaninformatics.biz/adl.html)
Community Informatics (http://www.deepthought.com.au/ci/rii/Output/mainTOC.html)





character sets and languages in openEHR

2004-03-18 Thread gfrer
Hi,

The examples I provided were those that I could think of.

The real question to be asked is:
Why would we want to record the 'language' of a text fragment?
The only correct answer will be:
Because of computational reasons.

In the light of this there is no real use case for this attribute in 
question other than to indicate in what language the author is 
documenting its provision of healthcare.
Coding systems will have to be used to indicate in an 'absolute' sense 
the meaning of things in a computational and language independent way.

If and when this assumption is true then the level of Composition 
(somewhere high) will be appropriate to record this optional attribute.

Gerard

--  private --
Gerard Freriks, arts
Huigsloterdijk 378
2158 LR Buitenkaag
The Netherlands

+31 252 544896
+31 654 792800
On 17 Mar 2004, at 23:46, Thomas Beale wrote:

 actually, these kinds of expressions are not the problem - they can 
 happily be recorded inside a DV_TEXT object which has the language set 
 to English or Dutch or whatever it may be; an inline occurrence of a 
 'foreign' term that is routinely used by speakers of a different 
 language (the way we use 'gesundheit' or 'triage' in English) can be 
 assumed to be understood and is probably even in the dictionary of the 
 language of narration.

 The problem is when there are text fragments recorded where the words 
 are viable in more than one language, and do not usually have the same 
 meaning in each. Words in Danish and Norwegian should be almost the 
 same, but I assume there are by now some small differences; there are 
 certainly words in most of the European languages which occur in 
 another language, and are completely unrelated. So in theory a 
 language marker is needed to ensure that a later reader knows what 
 language the words were in (maybe even to allow them to know what kind 
 of translator to call). So the question remains - do we need the 
 ability to have multiple languages inside a single entry? For Gerard's 
 examples - would it really be necessary to indicate what the other 
 languages were or not, given that they are probably obvious to most 
 users who will use them?


character sets and languages in openEHR

2004-03-18 Thread Thomas Beale
gfrer wrote:

 Hi,

 The examples I provided were those that I could think of.

 The real question to be asked is:
 Why would we want to record the 'language' of a text fragment?
 The only correct answer will be:
 Because of computational reasons.

 In the light of this there is no real use case for this attribute in 
 question other than to indicate in what language the author is 
 documenting its provision of healthcare.
 Coding systems will have to be used to indicate in an 'absolute' sense 
 the meaning of things in a computational and language independent way.

I agree about the use of codes; but when we have narrative text which is 
not coded, the meaning could be ambiguous for human readers, and also 
natural language processors, if not more mundane computing functions. I 
can imagine that this might be more important in psychiatry or other 
disciplines where a lot of narrative is generated.

- thomas






character sets and languages in openEHR

2004-03-18 Thread Tim Cook

Getting in late on comments but.

On Sat, 2004-03-06 at 14:57, Thomas Beale wrote:
 some higher level class - e.g. COMPOSITION, since almost all the time it 
 is the same on DV_TEXT items in a given EHR. We don't think it should be 
 that high, since language cannot be guaranteed the same throughout a 
 COMPOSITION 

I wholly agree with your analysis.  

The key trigger phrase above is "almost all the time". Anytime there is
vagueness then a solution should be taken into account.  This really is
the real reason for this specification and model anyway isn't it?  To
get away from all those "it hardly ever happens, we'll use the notes
field for that" or "five is enough addresses" ... instances in other
models.

The scenarios given have been excellent and I especially appreciate
Dipak's comment: "But when records really are travelling (sic) across
the globe, and such translation software is mature, will we have
prevented a valuable aid to safe health care?" That kind of vision
shared by all those that have worked so hard for so long on this is what
makes it the prime solution that it is going to be.
 
Sorry, broke into a little cheerleading there. <g>


Ciao,
Tim




character sets and languages in openEHR

2004-03-17 Thread Eric Browne
Tom,

I have pondered the same issue before. I think it unlikely that language
would change inside an entry, but I did think of the possibility of
medicines, e.g. Chinese medicines, or parts thereof, being described by
specifically foreign names.

cheers,
eric
[ btw, you may wish to check your computer's date/time. I know Queensland
lags in some respects, but 3 days would make the cows very sore! :-)]


On Sun, 7 Mar 2004, Thomas Beale wrote:


 A couple of technical questions prior to declaring the 0.9 baseline in
 openEHR:

 One of the major openEHR implementors here in Australia has suggested
 moving the attributes 'language' and 'charset' in the class DV_TEXT to
 some higher level class - e.g. COMPOSITION, since almost all the time it
 is the same on DV_TEXT items in a given EHR. We don't think it should be
 that high, since language cannot be guaranteed the same throughout a
 COMPOSITION (in their scheme, you would set the attribute on COMPOSITION
 and then override it on lower nodes if they were different; however, I
 am very wary of this sort of logic - HL7 uses it a lot and it really
 complicates things for developers; at the moment we prefer to avoid it
 completely). One possibility is to move the language attribute to the
 ENTRY class, on the basis that an ENTRY is the minimum indivisible unit
 of information in openEHR (this is true, even for 'large' Entries like a
 microbiology test result). It was initially on DV_TEXT for safety
 reasons - you would always know what language a text fragment is in
 (this is important for words which have the same appearance but different
 meaning in different languages); however, ENTRY is probably just as safe
 from this point of view.

 Q: can anyone think of a scenario where there could be multiple
 languages inside an ENTRY?

 Character set is more difficult to work out. So far, we have specified
 that Unicode should be used in all strings. This means that in theory
 there is no need to record the character set name (e.g. iso-latin-1,
 iso-greek, etc). However, there is still a need to choose between UTF-8,
 UTF-16 and so on in Unicode. And in any case, I am unsure if all
 implementation technologies implement unicode in strings; is there a
 legacy reason to store non-unicode character set names anyway?

 - thomas beale






character sets and languages in openEHR

2004-03-17 Thread gfrer
Hi,

Anamnesis in psychiatry:
And then the disturbed patient said: Merdre. [Translation: shit]

Family history:
My father was diagnosed as suffering from: Engelse ziekte 
[Translation: Rickets disease]

Coding systems
ICPC-1 Dutch version.
Code: R05.
Displayed text: Hoest
Added translation: Cough

Gerard
--  private --
Gerard Freriks, arts
Huigsloterdijk 378
2158 LR Buitenkaag
The Netherlands

+31 252 544896
+31 654 792800
On 06 Mar 2004, at 23:57, Thomas Beale wrote:

 Q: can anyone think of a scenario where there could be multiple 
 languages inside an ENTRY?


character sets and languages in openEHR

2004-03-07 Thread Thomas Beale

A couple of technical questions prior to declaring the 0.9 baseline in 
openEHR:

One of the major openEHR implementors here in Australia has suggested 
moving the attributes 'language' and 'charset' in the class DV_TEXT to 
some higher level class - e.g. COMPOSITION, since almost all the time it 
is the same on DV_TEXT items in a given EHR. We don't think it should be 
that high, since language cannot be guaranteed the same throughout a 
COMPOSITION (in their scheme, you would set the attribute on COMPOSITION 
and then override it on lower nodes if they were different; however, I 
am very wary of this sort of logic - HL7 uses it a lot and it really 
complicates things for developers; at the moment we prefer to avoid it 
completely). One possibility is to move the language attribute to the 
ENTRY class, on the basis that an ENTRY is the minimum indivisible unit 
of information in openEHR (this is true, even for 'large' Entries like a 
microbiology test result). It was initially on DV_TEXT for safety 
reasons - you would always know what language a text fragment is in 
(this is important for words which have the same appearance but different 
meaning in different languages); however, ENTRY is probably just as safe 
from this point of view.

Q: can anyone think of a scenario where there could be multiple 
languages inside an ENTRY?

Character set is more difficult to work out. So far, we have specified 
that Unicode should be used in all strings. This means that in theory 
there is no need to record the character set name (e.g. iso-latin-1, 
iso-greek, etc). However, there is still a need to choose between UTF-8, 
UTF-16 and so on in Unicode. And in any case, I am unsure if all 
implementation technologies implement unicode in strings; is there a 
legacy reason to store non-unicode character set names anyway?
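
On the legacy question, a minimal sketch in Python of why a character
set name matters for non-Unicode data. The decode_legacy helper is
hypothetical, and the codec names are Python's spellings of iso-latin-1
and iso-greek. One and the same byte stands for different characters
until the charset is named, after which the result is ordinary Unicode
text:

```python
def decode_legacy(raw: bytes, charset: str) -> str:
    """Hypothetical helper: convert legacy-encoded bytes to a Unicode string."""
    return raw.decode(charset)


same_byte = b"\xe9"

as_latin1 = decode_legacy(same_byte, "iso-8859-1")  # U+00E9, e with acute
as_greek = decode_legacy(same_byte, "iso-8859-7")   # U+03B9, Greek iota

# Once decoded, both are plain Unicode strings and can be stored as UTF-8.
print(as_latin1, as_latin1.encode("utf-8"))
print(as_greek, as_greek.encode("utf-8"))
```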

- thomas beale


