[Texmacs-dev] Thoughts about encodings

Norbert Nemec Mon, 28 May 2007 05:10:27 -0700

Hi there,

I have recently spent some time fixing the import and export of special
characters to LaTeX. Took me quite some time to understand the
intricacies of TeXmacs' handling of encodings. Now, after I have fought
my way through it (some patches are already in CVS, others are in the
queue) I want to share a few thoughts about this:


The TeXmacs documentation talks about "special symbols" and "universal
symbols". The latter are a reasonable concept: similar to LaTeX commands
or HTML entities, each character gets a unique identifier that represents
its meaning independent of its graphical representation. The renderer
then has all the information it needs to find the correct glyph
depending on the environment to render the character. In math mode,
universal symbols work perfectly.

"special symbols" however are a nightmare. In fact, the core idea is not
that bad: special symbols are characters in some specific encoding. The
horror comes from the fact that this encoding is
a) difficult to retrieve from the environment
b) seemingly fixed to 1-byte encodings (blocking the path towards UTF-8)
c) silently assumed to be Cork T1 in many places in the code

There are some provisions for Cork T2a to accommodate Cyrillic characters,
but otherwise conversions are at best done ad hoc and incompletely,
which can easily be observed when you try to use special characters in
alternative fonts.

I remember someone talking about a long-term vision of moving to Unicode
internally. This is probably an extremely ambitious goal that has no
chance of being tackled in the near future (other design issues are much
more pressing). So what can be done to clean up the current situation?

I believe the most important thing is to clearly define the current
situation. If the internal encoding is de facto Cork T1, this should be
stated in the documentation instead of talking about the abstract
concept of "special symbols".

One straightforward solution that I see would be to move towards
universal symbols for anything that is not ASCII. The support for
special symbols in text mode should be there. It should be possible to
replace any reference to non-ASCII codes in the sources by the
respective universal symbol. As ASCII is the overlap of most relevant
encodings (namely Cork, latinN and UTF-8), moving to ASCII+universal
symbols would be a simple step towards real encoding-independence.
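
As a rough sketch of what "ASCII+universal symbols" could look like, the
Python below keeps ASCII characters as they are and writes everything
else as a named symbol. The notation is an assumption for this example
only: <#HEX> names a character by its Unicode code point, and <less> and
<gtr> escape literal angle brackets. The function names are made up, and
this is not how the TeXmacs sources do it.

```python
import re

# Hypothetical notation for this sketch: <#HEX> names a character by its
# Unicode code point; <less> and <gtr> escape literal angle brackets.


def to_ascii_plus_symbols(text: str) -> str:
    """Keep ASCII as-is and write every other character as a symbol."""
    out = []
    for ch in text:
        if ch == "<":
            out.append("<less>")
        elif ch == ">":
            out.append("<gtr>")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            out.append("<#%X>" % ord(ch))
    return "".join(out)


_SYMBOL = re.compile(r"<(less|gtr|#[0-9A-F]+)>")


def from_ascii_plus_symbols(data: str) -> str:
    """Expand the symbols written by to_ascii_plus_symbols."""
    def expand(match):
        name = match.group(1)
        if name == "less":
            return "<"
        if name == "gtr":
            return ">"
        return chr(int(name[1:], 16))
    return _SYMBOL.sub(expand, data)


if __name__ == "__main__":
    encoded = to_ascii_plus_symbols("Gr\u00fc\u00dfe <TeXmacs>")
    print(encoded)
    # Pure ASCII, so the bytes do not depend on the chosen encoding.
    print(encoded.encode("latin-1") == encoded.encode("utf-8"))
```

Because the encoded form is pure ASCII, it is byte-for-byte identical
whether it is stored as Latin-1 or UTF-8, and the mapping back to
Unicode characters is one-to-one.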

Once we have eradicated Cork T1 that way, it should also be much easier
to introduce UTF-8 internally, which should be reasonably close to a
1-to-1 mapping on universal symbols.

What do you think?

Greetings,
Norbert


