The design of the general category predates the fuller understanding of how languages (orthographies) actually use quotation marks. Whether something is opening, closing or even paired with another quotation mark is ultimately language dependent, as Jan writes. Mathematical notation will use brackets in their normal or reversed sense, which also makes any generalized opening or closing property useless.

The GC values can be understood, at best, to represent the most common usage of a given punctuation mark.

They are moderately useful when no other information is available (unknown language, or language set to "none" in metadata).
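
As a rough illustration of that fallback role, here is a minimal Python sketch. The names QUOTE_ROLES, GC_FALLBACK and quote_role are hypothetical and made up for this example; the per-language entries only restate the three pairs quoted in Marius's message below and are not read from CLDR. The one real API used is unicodedata.category from the Python standard library.

```python
import unicodedata

# Hypothetical per-language table (illustration only, not CLDR data).
# It restates the English, German and Polish pairs from the quoted message.
QUOTE_ROLES = {
    "en": {"\u201C": "open", "\u201D": "close"},
    "de": {"\u201E": "open", "\u201C": "close"},
    "pl": {"\u201E": "open", "\u201D": "close"},
}

# Language-neutral guess derived from the General_Category value.
GC_FALLBACK = {"Ps": "open", "Pi": "open", "Pe": "close", "Pf": "close"}


def quote_role(ch, lang=None):
    """Return 'open', 'close' or 'unknown' for a single character.

    Language-specific data wins; the General_Category is consulted only
    when the language is unknown or has no entry for this character.
    """
    roles = QUOTE_ROLES.get(lang)
    if roles and ch in roles:
        return roles[ch]
    return GC_FALLBACK.get(unicodedata.category(ch), "unknown")
```

With this sketch, quote_role("\u201C", "de") gives "close", while quote_role("\u201C") with no language falls back to the Pi category and gives "open", which is exactly the kind of disagreement discussed in this thread.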

Specific applications may face the problem that a change in behavior will reflow existing documents when they are opened in a later version.

For Unicode, the issue is similar. Changes to long-established properties are increasingly restricted to cases where side effects on existing documents can be balanced against the benefit, based on the severity of the issue and the practical relevance of the fix in actual use.

Motivations such as "this could have been done better" or "these characters are not treated in a perfectly consistent manner" are increasingly seen as insufficient grounds for any adjustment to standard (language-neutral) properties and algorithms.

Instead, the focus will be on fixing actual use cases that have been or could be raised as bug reports against implementations, assuming that there's no impact on other users.

One exception is that consistency between the segmentation algorithms is useful. This gives a small window to fix inconsistent treatment of edge cases.

A./

On 1/16/2026 11:46 AM, Jukka K. Korpela via Unicode wrote:
My guess is that Pe, Pf, Pi and Ps were based on the usage of punctuation in English and some other languages. If this subclassification is taken too seriously, problems will arise. For example, software that takes U+201D too seriously as Pf treats text like xxx ”xxx” xxx badly: since U+201D is Pf, a line break is not permitted before it, even when a space intervenes. This is what MS Word does, irrespective of language settings, even for a language for which it knows that U+201D is both “start quotation” and “end quotation”.

Generally, whether a character is closing, final, initial, or opening punctuation should be based on language-specific information, such as CLDR.

Yucca


On Fri, 16 Jan 2026 at 18:09, Marius Spix via Unicode ([email protected]) wrote:

    I wonder what is the point of the General Categories Pe, Pf, Pi
    and Ps?

    Different languages use different quotation marks, for example:

    English:  “ (U+201C, Pi) + ” (U+201D, Pf)
    German: „ (U+201E, Ps) + “ (U+201C, Pi)
    Polish: „ (U+201E, Ps) + ” (U+201D, Pf)

    How does a character classify as closing, final, initial, or
    opening punctuation? Are there any general criteria?

    Best regards,

    Marius
