character sets and languages in openEHR
Hoylen Sue wrote: It is not necessary for openEHR to specify the encoding format (UTF-8, UTF-16, etc). Since openEHR does not dictate an implementation or transport format, it does not need to -- and should not -- specify the character encoding format. Just saying text will be using the Unicode character set (and maybe indicating which particular version of Unicode is being used, version 4.0 is currently the latest) is sufficient.
I wonder if this is true for people using openEHR-based components via an API rather than communicating via data messages. I assume that the Unicode implementation used in the String type in most of today's languages makes it easy to determine what width Unicode characters you have in the data? I agree that being able to commit to as little as possible and still get the effect of standardisation is completely desirable.
- thomas beale
- If you have any questions about using this list, please send a message to d.lloyd at openehr.org
character sets and languages in openEHR
It is not necessary for openEHR to specify the encoding format (UTF-8, UTF-16, etc). Since openEHR does not dictate an implementation or transport format, it does not need to -- and should not -- specify the character encoding format. Just saying text will be using the Unicode character set (and maybe indicating which particular version of Unicode is being used, version 4.0 is currently the latest) is sufficient.
For example, if you are encoding openEHR records using XML, the XML format already has its own mechanism for identifying the character encoding of the document (the XML declaration, BOM, etc). Having the character encoding in the Entry would be meaningless and a potential source of conflict.
Hoylen
-- Dr Hoylen Sue, h.sue at dstc.edu.au, http://www.dstc.edu.au/, DSTC Pty Ltd --- Australian W3C Office, +61 7 3365 4310
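Hoylen's point about XML can be illustrated with a minimal Python sketch. It assumes a made-up document shape (the element names `dv_text` and `value` are hypothetical and are not taken from any openEHR schema) and uses only the standard library. The same text is serialised once as UTF-8 and once as UTF-16; the parser works out the encoding from the XML declaration and BOM, and both documents yield the same Unicode string.

```python
import xml.etree.ElementTree as ET


def serialise(value: str, encoding: str) -> bytes:
    """Wrap a string in a tiny, hypothetical XML document using the given encoding."""
    root = ET.Element("dv_text")          # hypothetical element name
    value_el = ET.SubElement(root, "value")
    value_el.text = value
    # xml_declaration=True makes the document announce its own encoding.
    return ET.tostring(root, encoding=encoding, xml_declaration=True)


def parse_value(document: bytes) -> str:
    """Parse the bytes; the XML parser detects the encoding from the document itself."""
    return ET.fromstring(document).findtext("value")


# Illustrative text mixing Latin and Cyrillic characters.
text = "Engelse ziekte / \u043a\u0430\u0448\u0435\u043b\u044c"

utf8_doc = serialise(text, "utf-8")
utf16_doc = serialise(text, "utf-16")

# Different bytes on the wire, identical text after parsing.
print(len(utf8_doc), len(utf16_doc))
print(parse_value(utf8_doc) == parse_value(utf16_doc))
```

In this sketch the record content never needs to state which encoding was used; that information lives in the serialisation layer, which is the separation argued for above.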
character sets and languages in openEHR
I agree. :-)
GF
-- private -- Gerard Freriks, arts, Huigsloterdijk 378, 2158 LR Buitenkaag, The Netherlands, +31 252 544896, +31 654 792800
On 19 Mar 2004, at 15:36, Thomas Beale wrote:
ENTRY class has
- a mandatory language attribute
- a mandatory character encoding attribute (says which flavour of Unicode). This forces the whole ENTRY to be encoded the same way no matter what, but also allows distinct ENTRYs to be encoded in e.g. UTF-8, UTF-16.
DV_TEXT class has
- an optional language attribute, which is understood to override the one from its enclosing ENTRY.
further thoughts from the group?
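The proposal quoted above can be sketched in a few lines of Python. This is only an illustration of the override rule, not the openEHR reference model: the class names `Entry` and `DvText`, the field names, and the helper `language_of` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DvText:
    """A text fragment; language is optional and overrides the ENTRY's when set."""
    value: str
    language: Optional[str] = None


@dataclass
class Entry:
    """Hypothetical ENTRY with mandatory language and encoding attributes."""
    language: str
    encoding: str
    items: List[DvText] = field(default_factory=list)

    def language_of(self, item: DvText) -> str:
        # The DV_TEXT value wins if present; otherwise fall back to the ENTRY.
        return item.language if item.language is not None else self.language


entry = Entry(
    language="en",
    encoding="UTF-8",
    items=[
        DvText("My father was diagnosed as suffering from:"),
        DvText("Engelse ziekte", language="nl"),  # explicit override
    ],
)

languages = [entry.language_of(item) for item in entry.items]
print(languages)
```

The cost of the design shows up in `language_of`: a reader of a single `DvText` cannot know its language without also holding the enclosing `Entry`, which is the kind of override logic the thread is wary of.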
character sets and languages in openEHR
On Thu, 2004-03-18 at 00:51, gfrer wrote: Hi,
Anamnesis in psychiatry: And then the disturbed patient said: Merdre. [Translation: shit]
Family history: My father was diagnosed as suffering from: Engelse ziekte [Translation: Rickets disease]
Coding systems: ICPC-1 Dutch version. Code: R05. Displayed text: Hoest Added translation: Cough
Yes, I thought of examples which were similar to these. And it is not just a matter of the recording health professional not knowing what Engelse ziekte means, and thus having to record it verbatim and untranslated - many diagnoses have no equivalent in other languages/cultures, and are thus untranslatable (at least not without some information loss).
Given that the foreign language text may require accented characters, or even a completely different character set, then the Unicode encoding used for the entry will need to be captured as well as the language, unless openEHR will be restricted purely to one Unicode encoding, such as UTF-8. Remember the golden rule with Unicode: If you don't know the encoding, you don't know nuffin'.
The only problem with UTF-8 everywhere is that it is Roman alphabet chauvinistic, in that the basic Roman characters are all represented with one byte, but everything else needs two or more bytes. That dooms all Russian openEHR records to using about twice as much storage as the equivalent English openEHR records. In these days of massive cheap disc storage and high speed networks, that fact probably doesn't matter, but it just seems unfair, although I can't think of a better alternative. As an English speaker, I would not be keen if openEHR mandated the use of UTF-16, thus forcing me to use two bytes for every letter. Yet that's what UTF-8 forces Russians, and Greeks, and Thais and Vietnamese and just about every other non-Roman alphabetic language speaker to do.
Of course, ideographic languages like Chinese are doomed to use more than one byte per character, but then the language itself encodes a lot more information in each character, so it probably works out about the same in the end.
-- Tim C
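As a rough check on the storage argument, the sketch below counts bytes for the word "cough" in a few scripts. The sample words are illustrative translations chosen for this example, and the helper name `encoded_sizes` is made up. For background: UTF-8 uses one byte for ASCII, two for Cyrillic and Greek letters, and three for Thai and most Chinese characters, while UTF-16 uses two bytes for every character in the Basic Multilingual Plane.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of code points and the encoded size in UTF-8 and UTF-16."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is counted.
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


# Illustrative translations of "cough" (assumed for this example).
samples = {
    "English": "cough",
    "Russian": "\u043a\u0430\u0448\u0435\u043b\u044c",  # кашель
    "Thai": "\u0e44\u0e2d",                             # ไอ
    "Chinese": "\u54b3\u55fd",                          # 咳嗽
}

for name, word in samples.items():
    print(name, encoded_sizes(word))
```

On these samples the Russian word takes the same number of bytes in either encoding, the English word doubles in UTF-16, and the Thai and Chinese words are larger in UTF-8 than in UTF-16, so neither encoding is smallest for every language.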
character sets and languages in openEHR
Tim Churches wrote: Yes, I thought of examples which were similar to these. And it is not just a matter of the recording health professional not knowing what Engelse ziekte means, and thus having to record it verbatim and untranslated - many diagnoses have no equivalent in other languages/cultures, and are thus untranslatable (at least not without some information loss).
actually, these kinds of expressions are not the problem - they can happily be recorded inside a DV_TEXT object which has the language set to English or Dutch or whatever it may be; an inline occurrence of a 'foreign' term that is routinely used by speakers of a different language (the way we use 'gesundheit' or 'triage' in English) can be assumed to be understood and is probably even in the dictionary of the language of narration.
The problem is when there are text fragments recorded where the words are viable in more than one language, and do not usually have the same meaning in each. Words in Danish and Norwegian should be almost the same, but I assume there are by now some small differences; there are certainly words in most of the European languages which occur in another language, and are completely unrelated. So in theory a language marker is needed to ensure that a later reader knows what language the words were in (maybe even to allow them to know what kind of translator to call).
So the question remains - do we need the ability to have multiple languages inside a single entry? For Gerard's examples - would it really be necessary to indicate what the other languages were or not, given that they are probably obvious to most users who will use them?
The real reason for the question is that having to record language everywhere all the time means wasting a certain amount of data storage on every text fragment stored in the record; the alternative seems to be to record it on Entry; if we decide that it has to be possible to have text fragments within an Entry for which the name of a different language is actually recorded, we can use an optional language attribute on DV_TEXT which is understood as overriding the value elsewhere. In general I am against this kind of overriding of values in lower objects in a composition - it is not OO, and it is often misunderstood by programmers given the specifications; in general it is dangerous. However, maybe this is an exception which justifies its use.
As for Unicode, obviously we cannot do much about the standard; but I guess someone had to have the 8-bit part of the code space.
Tim Churches wrote: Given that the foreign language text may require accented characters, or even a completely different character set, then the Unicode encoding used for the entry will need to be captured as well as the language, unless openEHR will be restricted purely to one Unicode encoding, such as UTF-8. Remember the golden rule with Unicode: If you don't know the encoding, you don't know nuffin'. The only problem with UTF-8 everywhere is that it is Roman alphabet chauvinistic, in that the basic Roman characters are all represented with one byte, but everything else needs two or more bytes. That dooms all Russian openEHR records to using about twice as much storage as the equivalent English openEHR records. In these days of massive cheap disc storage and high speed networks, that fact probably doesn't matter, but it just seems unfair, although I can't think of a better alternative. As an English speaker, I would not be keen if openEHR mandated the use of UTF-16, thus forcing me to use two bytes for every letter. Yet that's what UTF-8 forces Russians, and Greeks, and Thais and Vietnamese and just about every other non-Roman alphabetic language speaker to do. Of course, ideographic languages like Chinese are doomed to use more than one byte per character, but then the language itself encodes a lot more information in each character, so it probably works out about the same in the end.
-- CTO Ocean Informatics (http://www.OceanInformatics.biz), Hon. Research Fellow, University College London, openEHR (http://www.openEHR.org), Archetypes (http://www.oceaninformatics.biz/adl.html), Community Informatics (http://www.deepthought.com.au/ci/rii/Output/mainTOC.html)
character sets and languages in openEHR
Hi, The examples I provided were those that I could think of. The real question to be asked is: Why would we want to record the 'language' of a text fragment? The only correct answer will be: Because of computational reasons. In the light of this there is no real use case for the attribute in question other than to indicate in what language the author is documenting their provision of healthcare. Coding systems will have to be used to indicate in an 'absolute' sense the meaning of things in a computational and language independent way. If and when this assumption is true then the level of Composition (somewhere high) will be appropriate to record this optional attribute.
Gerard
On 17 Mar 2004, at 23:46, Thomas Beale wrote: actually, these kinds of expressions are not the problem - they can happily be recorded inside a DV_TEXT object which has the language set to English or Dutch or whatever it may be; an inline occurrence of a 'foreign' term that is routinely used by speakers of a different language (the way we use 'gesundheit' or 'triage' in English) can be assumed to be understood and is probably even in the dictionary of the language of narration. The problem is when there are text fragments recorded where the words are viable in more than one language, and do not usually have the same meaning in each. Words in Danish and Norwegian should be almost the same, but I assume there are by now some small differences; there are certainly words in most of the European languages which occur in another language, and are completely unrelated. So in theory a language marker is needed to ensure that a later reader knows what language the words were in (maybe even to allow them to know what kind of translator to call). So the question remains - do we need the ability to have multiple languages inside a single entry? For Gerard's examples - would it really be necessary to indicate what the other languages were or not, given that they are probably obvious to most users who will use them?
character sets and languages in openEHR
gfrer wrote: Hi, The examples I provided were those that I could think of. The real question to be asked is: Why would we want to record the 'language' of a text fragment? The only correct answer will be: Because of computational reasons. In the light of this there is no real use case for the attribute in question other than to indicate in what language the author is documenting their provision of healthcare. Coding systems will have to be used to indicate in an 'absolute' sense the meaning of things in a computational and language independent way.
I agree about the use of codes; but when we have narrative text which is not coded, the meaning could be ambiguous for human readers, and also for natural language processors, if not more mundane computing functions. I can imagine that this might be more important in psychiatry or other disciplines where a lot of narrative is generated.
- thomas
character sets and languages in openEHR
Getting in late on comments but.
On Sat, 2004-03-06 at 14:57, Thomas Beale wrote: some higher level class - e.g. COMPOSITION, since almost all the time it is the same on DV_TEXT items in a given EHR. We don't think it should be that high, since language cannot be guaranteed the same throughout a COMPOSITION
I wholly agree with your analysis. The key trigger phrase above is "almost all the time". Anytime there is vagueness then a solution should be taken into account. This really is the real reason for this specification and model anyway, isn't it? To get away from all those "it hardly ever happens", "we'll use the notes field for that" or "five is enough addresses" ... instances in other models.
The scenarios given have been excellent and I especially appreciate Dipak's comment: "But when records really are travelling (sic) across the globe, and such translation software is mature, will we have prevented a valuable aid to safe health care?" That kind of vision shared by all those that have worked so hard for so long on this is what makes it the prime solution that it is going to be.
Sorry, broke into a little cheerleading there. <g>
Ciao, Tim
character sets and languages in openEHR
Tom, I have pondered the same issue before. I think it unlikely that language would change inside an entry, but I did think of the possibility of medicines, e.g. Chinese medicines, or part thereof, being described by specifically foreign names.
cheers, eric
[ btw, you may wish to check your computer's date/time. I know Queensland lags in some respects, but 3 days would make the cows very sore! :-)]
On Sun, 7 Mar 2004, Thomas Beale wrote: A couple of technical questions prior to declaring the 0.9 baseline in openEHR:
One of the major openEHR implementors here in Australia has suggested moving the attributes 'language' and 'charset' in the class DV_TEXT to some higher level class - e.g. COMPOSITION, since almost all the time it is the same on DV_TEXT items in a given EHR. We don't think it should be that high, since language cannot be guaranteed the same throughout a COMPOSITION (in their scheme, you would set the attribute on COMPOSITION and then override it on lower nodes if they were different; however, I am very wary of this sort of logic - HL7 uses it a lot and it really complicates things for developers; at the moment we prefer to avoid it completely).
One possibility is to move the language attribute to the ENTRY class, on the basis that an ENTRY is the minimum indivisible unit of information in openEHR (this is true, even for 'large' Entries like a microbiology test result). It was initially on DV_TEXT for safety reasons - you would always know what language a text fragment is in (this is important for words which have the same appearance but different meanings in different languages); however, ENTRY is probably just as safe from this point of view.
Q: can anyone think of a scenario where there could be multiple languages inside an ENTRY?
Character set is more difficult to work out. So far, we have specified that Unicode should be used in all strings. This means that in theory there is no need to record the character set name (e.g. iso-latin-1, iso-greek, etc). However, there is still a need to choose between UTF-8, UTF-16 and so on in Unicode. And in any case, I am unsure if all implementation technologies implement Unicode in strings; is there a legacy reason to store non-Unicode character set names anyway?
- thomas beale
character sets and languages in openEHR
Hi,
Anamnesis in psychiatry: And then the disturbed patient said: Merdre. [Translation: shit]
Family history: My father was diagnosed as suffering from: Engelse ziekte [Translation: Rickets disease]
Coding systems: ICPC-1 Dutch version. Code: R05. Displayed text: Hoest Added translation: Cough
Gerard
On 06 Mar 2004, at 23:57, Thomas Beale wrote: Q: can anyone think of a scenario where there could be multiple languages inside an ENTRY?
character sets and languages in openEHR
A couple of technical questions prior to declaring the 0.9 baseline in openEHR:
One of the major openEHR implementors here in Australia has suggested moving the attributes 'language' and 'charset' in the class DV_TEXT to some higher level class - e.g. COMPOSITION, since almost all the time it is the same on DV_TEXT items in a given EHR. We don't think it should be that high, since language cannot be guaranteed the same throughout a COMPOSITION (in their scheme, you would set the attribute on COMPOSITION and then override it on lower nodes if they were different; however, I am very wary of this sort of logic - HL7 uses it a lot and it really complicates things for developers; at the moment we prefer to avoid it completely).
One possibility is to move the language attribute to the ENTRY class, on the basis that an ENTRY is the minimum indivisible unit of information in openEHR (this is true, even for 'large' Entries like a microbiology test result). It was initially on DV_TEXT for safety reasons - you would always know what language a text fragment is in (this is important for words which have the same appearance but different meanings in different languages); however, ENTRY is probably just as safe from this point of view.
Q: can anyone think of a scenario where there could be multiple languages inside an ENTRY?
Character set is more difficult to work out. So far, we have specified that Unicode should be used in all strings. This means that in theory there is no need to record the character set name (e.g. iso-latin-1, iso-greek, etc). However, there is still a need to choose between UTF-8, UTF-16 and so on in Unicode. And in any case, I am unsure if all implementation technologies implement Unicode in strings; is there a legacy reason to store non-Unicode character set names anyway?
- thomas beale
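One way to see why the character set question matters for stored or legacy data is to decode the same bytes under two different assumptions. The sketch below is a hedged illustration only: the sample term and the helper name `decode_as` are chosen for the example, and iso-8859-1 stands in for the "iso-latin-1" mentioned above.

```python
def decode_as(data: bytes, charset: str) -> str:
    """Interpret raw bytes under an assumed character encoding."""
    return data.decode(charset)


original = "M\u00e9ni\u00e8re"  # "Ménière", a term with accented characters

# Bytes written by a system that stores text as UTF-8.
stored = original.encode("utf-8")

correct = decode_as(stored, "utf-8")        # reader knows the encoding
garbled = decode_as(stored, "iso-8859-1")   # reader guesses wrongly

print(correct)
print(garbled)

# The reverse mistake: legacy Latin-1 bytes read as if they were UTF-8.
legacy = original.encode("iso-8859-1")
try:
    decode_as(legacy, "utf-8")
    legacy_readable = True
except UnicodeDecodeError:
    legacy_readable = False

print(legacy_readable)
```

The wrong guess in the first case fails silently and yields plausible-looking but corrupted text, which is the practical argument for knowing the encoding at whatever layer holds the bytes, whether or not it is recorded inside the record itself.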