On 1/17/2026 5:31 PM, Marius Spix via Unicode wrote:
Sent: Sunday, 18 January 2026 at 02:30
From: "Marius Spix" <[email protected]>
To: "Jukka K. Korpela" <[email protected]>
Subject: Re: Re: General Categories Pe, Pf, Pi, Ps

I see. Another example would be the French quotation marks (guillemets), which
point outwards and are separated from the quoted text by a space in French
texts, but point inwards and have no additional spaces in German texts
(especially in books). So these categories come from a time when Unicode was
very English-centric and can be considered “historical heritage”, correct?

Unicode was never "English-centric" by design. And nearly all participants in the early effort were familiar with or even experts in software localization (within the limitations of what that meant in the late '80s).

There are many problems with General_Category, and ultimately they reflect that experience with character properties was limited at the time.

Also, a solid understanding of the differences between properties inherent in a character and properties assumed by a character in the context of a specific orthography emerged over time.

There are some properties that are inherent in a character (or exceptions, if they exist, are very limited). I'm not aware of any orthography that treats "A" as a lowercase letter (there are some that use smallcaps forms, but those would have the lowercase property in Unicode).

When Unicode was created, what set it apart was the insistence that encoded characters had properties beyond their appearance, name and code point value. No other widely used standard at the time did anything like that. It meant that Unicode had a lot of attributes that could be used to identify "what" was being encoded at a given code point, something that required reliance on "customary knowledge" for other standards.

You had to infer that DIGIT ZERO had the numeric value of 0, but Unicode spells that out. And so on.
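As a small illustration of properties being spelled out rather than inferred, the sketch below reads a few of them from the copy of the Unicode Character Database that ships with Python's standard `unicodedata` module. The variable names are chosen here for illustration only, and the data reflects whatever Unicode version the Python build includes.

```python
import unicodedata

zero = "\u0030"         # DIGIT ZERO
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

# The character name, numeric value and General_Category are looked up
# in the character database, not derived from the glyph or the code point.
zero_name = unicodedata.name(zero)
zero_value = unicodedata.digit(zero)
zero_category = unicodedata.category(zero)

# The same lookup works for digits outside ASCII.
arabic_three_value = unicodedata.digit(arabic_three)
```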

Unicode also refused to encode a "decimal period", arguing that the overloaded use of the full stop is indeed the norm and what was encoded is the full stop across all its uses. Of course this went along with a widely shared understanding that many languages use different conventions.

For some reason, the full range of conventions for quotation marks in particular was less well known, presumably because applying language-specific quotation marks by software wasn't as much a "thing" as it is today with autocorrect, etc.

There's another reading of General_Category: this interpretation assumes that these are "defaults", to be applied in contexts where information on language is not available. So, you could think of these properties as applying to the language code "unknown".

There is nothing "historic" about having a default - any time the language is not specified (and cannot be determined) you need to do something.
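The "default for unknown language" reading can be sketched as a lookup that consults language-specific data first and falls back to General_Category only when nothing better is known. Everything named below is hypothetical: the `QUOTE_ROLE_OVERRIDES` table is a hand-written stand-in for real locale data such as CLDR, the role labels are invented for this example, and mapping Pi/Pf directly to "opening"/"closing" is a deliberate simplification.

```python
import unicodedata

# Hypothetical per-language overrides (illustrative, not taken from CLDR).
QUOTE_ROLE_OVERRIDES = {
    "de": {"\u201C": "closing"},  # German: „…“ closes with U+201C
    "fi": {"\u201D": "both"},     # Finnish: ”…” uses U+201D on both sides
}

# Simplified default roles derived from General_Category alone.
DEFAULT_ROLE_BY_CATEGORY = {
    "Ps": "opening",
    "Pi": "opening",
    "Pe": "closing",
    "Pf": "closing",
}


def quote_role(ch, lang=None):
    """Return the role of ch, preferring language data over the default."""
    if lang is not None:
        override = QUOTE_ROLE_OVERRIDES.get(lang, {}).get(ch)
        if override is not None:
            return override
    # Language unknown, or no override for it: fall back to the default.
    return DEFAULT_ROLE_BY_CATEGORY.get(unicodedata.category(ch), "none")
```

With this shape, `quote_role("\u201C")` uses the category default, while `quote_role("\u201C", "de")` picks up the German override.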

A./


Sent: Friday, 16 January 2026 at 20:46
From: "Jukka K. Korpela via Unicode" <[email protected]>
To: "Marius Spix" <[email protected]>
Cc: [email protected]
Subject: Re: General Categories Pe, Pf, Pi, Ps

My guess is that Pe, Pf, Pi and Ps were based on the usage of punctuation
in English and some other languages. If this subclassification is taken too
seriously, problems will arise. For example, software that takes U+201D too
seriously as Pf treats texts like xxx ”xxx” xxx badly: since U+201D is
Pf, a line break is not permitted before it, even when a space intervenes.
This is what MS Word does, irrespective of language settings, even for a
language for which it knows that U+201D is both “start quotation” and “end
quotation”.

Generally, whether a character is closing, final, initial, or opening
punctuation should be based on language-specific information, such as CLDR.

Yucca


On Fri, 16 Jan 2026 at 18:09, Marius Spix via Unicode ([email protected])
wrote:

I wonder what is the point of the General Categories Pe, Pf, Pi and Ps?

Different languages use different quotation marks, for example:

English:  “ (U+201C, Pi) + ” (U+201D, Pf)
German: „ (U+201E, Ps) + “ (U+201C, Pi)
Polish: „ (U+201E, Ps) + ” (U+201D, Pf)
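The categories quoted in the list above can be checked against the character database bundled with Python's standard `unicodedata` module. This is only a quick sketch, and the dictionary name is chosen here for illustration; the guillemets are included for comparison.

```python
import unicodedata

# Quotation marks from the list above, plus the two guillemets.
marks = ["\u201C", "\u201D", "\u201E", "\u00AB", "\u00BB"]

# Map each mark to its General_Category as recorded in the database.
categories = {f"U+{ord(ch):04X}": unicodedata.category(ch) for ch in marks}

for code_point, category in categories.items():
    print(code_point, category)
```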

How does a character classify as closing, final, initial, or opening
punctuation? Are there any general criteria?

Best regards,

Marius


