Re: Passing UTF-8 through variables

Dar Scott Sun, 27 Feb 2005 19:57:15 -0800


On Feb 27, 2005, at 7:11 PM, Sivakatirswami wrote:

Ok, I have a text transformation challenge: One member of our team is working on an index for a book that has diacritical fonts, in plain ascii,


There are no diacritical marks in plain ASCII.

set to "any old font" like Geneva, Arial or Verdana, which are the defaults for her processing environment (a RAD tool built with Revolution). The end result of her workflow, prior to importing into InDesign CS, is a very simple XML file, where a single entry looks like this:

Is this using an 8-bit encoding that contains ASCII in the lower half? Which?

Or is it a UTF-8 file?

If it is UTF-8, some characters will be represented by multiple bytes.
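
To make the multi-byte point concrete, here is a small sketch in Python (used only for illustration, not Transcript). The sample characters are my own assumptions, not taken from the index file.

```python
# Illustration only: these sample characters are assumptions, not data from the index.
samples = ["a", "\u0101", "\u1e6d"]  # plain a, a with macron, t with dot below


def utf8_length(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    # Print the character, its UTF-8 byte count, and the bytes in hex.
    print(repr(ch), utf8_length(ch), ch.encode("utf-8").hex())
```

Plain ASCII letters stay at one byte each, while the accented letters take two or three, so a byte count and a character count will no longer agree.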

<indexPara><boldItalicEntry>Å∫ava mârga:</boldItalicEntry> youth susceptibility, 394</indexPara>

Now, in QuarkXPress, if we simply passed this text to a type box, selected it (or set the font in a style sheet, and applied the style sheet) to "MinionD" (a diacritical font) we get all the proper international standards marks: dash over the top of long vowels, dot underneath retroflex consonants etc. Very smooth and predictable.

But, not so with Adobe's InDesign CS. When we import the file we are getting weird strings for certain characters...


Does InDesign know what the encoding is for the input file?

If we set a BBEdit file to UTF-8, and the encoding for the XML file to UTF-8... these strings appear on screen as singular glyphs and a few black squares (meaning BBEdit can't display it).

Looks like InDesign is expecting one encoding and is getting some other encoding.

Since BBEdit set to UTF-8 shows a similar problem, I suspect that InDesign is expecting UTF-8 and is getting something else.
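
A rough way to see this kind of mismatch, sketched in Python for illustration. The assumption here (and it is only an assumption) is that the file was written in MacRoman, a common 8-bit Mac encoding; the sample string is the one from the XML entry above.

```python
# Assumption: the index file was saved in MacRoman. This is a guess for
# illustration; the actual encoding of the file is not known.
def read_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for anything invalid."""
    return raw.decode("utf-8", errors="replace")


text = "Å∫ava mârga"                 # sample entry as it appears in the 8-bit file
raw = text.encode("mac_roman")       # the bytes an 8-bit application would write
shown = read_as_utf8(raw)            # what a reader expecting UTF-8 would make of them
print(shown)
```

The high bytes are not valid UTF-8 on their own, so a UTF-8 reader has to substitute a placeholder for each, which would be consistent with the black squares reported in BBEdit.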

OK so one of our team here identified those characters which were "bad", i.e. not transforming into the expected characters, and he gave me a small array consisting of 16 lines, as follows (I have no idea how this will show in email) ... some characters are not even passed to email!

...


  # create an array from the conversion file
  split tConversionArray with cr and tab

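That is only a problem if the text is in a two-byte encoding such as UTF-16, where the byte values for cr and tab can turn up as half of some other character. In UTF-8 every byte of a multi-byte character is 128 or higher, so cr and tab never appear inside one. Make sure you know which of the two you actually have before you split.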
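
Whatever the encoding turns out to be, one way to sidestep the question is to decode the bytes into characters first and only then split on cr and tab. A minimal sketch in Python (for illustration, not Transcript); `parse_table` and the sample data are hypothetical, and the table is assumed to be tab-separated with cr line endings.

```python
def parse_table(raw: bytes, encoding: str) -> dict[str, str]:
    """Decode first, then split on characters rather than on raw bytes."""
    text = raw.decode(encoding)
    table = {}
    for line in text.split("\r"):        # cr-delimited lines
        if "\t" in line:
            key, value = line.split("\t", 1)
            table[key] = value
    return table


# Hypothetical two-line conversion table, stored here as UTF-16 (little-endian).
sample = "\u0101\tA\r\u1e6d\tB".encode("utf-16-le")
print(parse_table(sample, "utf-16-le"))
```

Because the split happens on decoded characters, it does not matter whether a delimiter's byte value also occurs inside some other character's encoding.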

I am *way* out of my depth here.. any clues from anyone? What are these multi-byte strings, and how do we make them back to the char (129-255) set? (which is where they appear on the font map for MinionD)


Look at the Revolution uniEncode() and uniDecode() functions.

If you are expecting the indexing application to output UTF-8, you can use these functions to convert to that before saving the file.
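
The general shape of that conversion, sketched in Python for illustration only. In Revolution the uniEncode() and uniDecode() functions would play this role; `to_utf8` is a hypothetical helper, and MacRoman as the source encoding is an assumption.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known 8-bit encoding into UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Assumption: the 8-bit text is MacRoman. Substitute the real source encoding.
legacy = "m\u00e2rga".encode("mac_roman")   # one byte per character
converted = to_utf8(legacy, "mac_roman")    # the accented letter becomes two bytes
print(legacy, converted)
```

The important part is that the source encoding has to be known and named explicitly; the conversion cannot guess it.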

This might help:
   http://www.cs.tut.fi/~jkorpela/chars.html

You need to decide what encoding to use and stick with that when you can.


--
**********************************************
    DSC (Dar Scott Consulting & Dar's Lab)
    http://www.swcp.com/dsc/
    Programming Services and Software
**********************************************

_______________________________________________
use-revolution mailing list
[email protected]
http://lists.runrev.com/mailman/listinfo/use-revolution
