RFC: API to access Unicode db files

Karl Williamson Thu, 21 Jul 2011 08:04:48 -0700

Some applications are finding it necessary to read in the Unicode files that mktables generates. For example, grepping through CPAN indicates that Text::Unicode::Equivalents reads Decomposition.pl. This, and most of the other generated files, are marked for internal use only, because we wish to reserve the right to change them around, etc. But applications currently have no feasible alternative. Prior to 5.14, we delivered the full Unicode db files that the Unicode consortium publishes, and whose format is guaranteed not to change. But we dropped those files in 5.14 to save disk space.

I'm proposing a new function Unicode::UCD::prop_invmap() to return the contents of those files in a Unicode-centric way, so that applications can use it and we can deprecate non-core use of our generated files.

The function returns an inversion map, which is a data structure more used in the Unicode world than the Perl world. It consists of two parallel arrays. I suppose a more Perl-centric data structure would be an array of hashes, but the inversion map seems simpler to me to manipulate.

(This function would be in addition to the previously rfc'd function Unicode::UCD::prop_invlist() which would return a list of all code points that match a property-value.)


=pod

=head2 prop_invmap

C<prop_invmap> is used to get the complete mapping definition for the input
property, in the form of an inversion map.  An inversion map consists of two
parallel arrays.  One is an ordered list of code points that mark range
beginnings, and the other gives the value that all code points in the
corresponding range have.  C<prop_invmap> is called with the name of the
desired property, and references to the two arrays, which it fills.  For
example,

 prop_invmap("Numeric_Value", \@numerics_ranges, \@numerics_maps);

will populate the arrays as shown below:

 @numerics_ranges  @numerics_maps        Note
        0x00             "NaN"          NaN stands for "Not a Number"
        0x30             0              DIGIT 0
        0x31             1
        0x32             2
        ...
        0x37             7
        0x38             8
        0x39             9              DIGIT 9
        0x3A             "NaN"
        0xB2             2              SUPERSCRIPT 2
        0xB3             3              SUPERSCRIPT 3
        0xB4             "NaN"
        0xB9             1              SUPERSCRIPT 1
        0xBA             "NaN"
        0xBC             0.25           VULGAR FRACTION 1/4
        0xBD             0.5            VULGAR FRACTION 1/2
        0xBE             0.75           VULGAR FRACTION 3/4
        0xBF             "NaN"
        0x660            0              ARABIC-INDIC DIGIT ZERO
        ...              ...
     0x110000            undef

The second line means that the value for the code point 0x30 (which is
"DIGIT 0") is 0.  The first line means that all code points in the range
from 0x00 to 0x2F (which is 0x30 (from the second line) - 1) have the value
"NaN".  The final line means that the value for all code points above the
legal Unicode maximum code point is C<undef> (not the string "u-n-d-e-f").


The arrays completely specify the mappings for all possible code points.

The special string S<C<"E<lt>code pointE<gt>">> is used to specify that
the value of a code point is itself.  For example, the beginnings of the
arrays for

 prop_invmap("Uppercase_Mapping", \@uppers_ranges, \@uppers_maps);

look like this:

 @uppers_ranges    @uppers_maps       Note
       0          "<code point>"
      97              65          'a' maps to 'A'
      98              66          'b' => 'B'
      99              67          'c' => 'C'
      ...
     120              88          'x' => 'X'
     121              89          'y' => 'Y'
     122              90          'z' => 'Z'
     123         "<code point>"
     181             924          MICRO SIGN => Greek Cap MU
     182         "<code point>"
     223           [ 83 83 ]      SHARP S => 'SS'
     224             192

The first line means that the uppercase of code point 0 is 0, of 1 is 1, ...
of 96 is 96.  Without the C<"E<lt>code pointE<gt>"> notation, every code
point would have to have an entry.  This would mean that the arrays would
each have more than a million entries to list just the legal Unicode code
points!

In some properties some code points map to a sequence of multiple code
points.  For those, the corresponding entries in the map array are not
scalars, but references to anonymous arrays containing the ordered list of
code points mapped to, as shown in the example above for 223.
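
The three kinds of map entry described so far (a plain code point, the
"<code point>" marker, and an array reference) can be handled uniformly.
The following is only a sketch: C<resolve_mapping> is a hypothetical helper,
not part of the proposed API, and the entries are hard-coded from the
Uppercase_Mapping excerpt above rather than fetched with C<prop_invmap>.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the proposed API): turn one map-array
# entry into the list of code points that $cp maps to.
sub resolve_mapping {
    my ($cp, $entry) = @_;
    return @$entry if ref($entry) eq 'ARRAY';     # multi-code-point mapping
    return $cp     if $entry eq '<code point>';   # code point maps to itself
    return $entry;                                # single code point
}

# Entries copied from the Uppercase_Mapping excerpt above.
my @sharp_s = resolve_mapping(223, [ 83, 83 ]);     # SHARP S => 'SS'
my @a       = resolve_mapping(97,  65);             # 'a' => 'A'
my @at      = resolve_mapping(64,  '<code point>'); # '@' maps to itself

printf "%d => %s\n", 223, join ' ', @sharp_s;
printf "%d => %s\n", 97,  join ' ', @a;
printf "%d => %s\n", 64,  join ' ', @at;
```

Returning a list in every case means callers do not need to know in advance
whether a property has multi-code-point mappings.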

The "Name" property map includes entries such as

 CJK UNIFIED IDEOGRAPH-<code point>

This means that the name for the code point is "CJK UNIFIED IDEOGRAPH-"
with the code point (expressed in hexadecimal) appended to it.  Also, the
notation "E<lt>hangul syllableE<gt>" occurs in this property, meaning that the
name is algorithmically calculated.  These names can be generated via the
function C<charnames::viacode()>.
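
As an illustration of the "<code point>" substitution in names, here is a
minimal sketch.  C<expand_name> is a hypothetical helper, and the choice of
at least four uppercase hexadecimal digits is an assumption based on how
Unicode writes such names.  Entries marked "<hangul syllable>" are not
expanded here; the sketch returns C<undef> for them.

```perl
use strict;
use warnings;

# Hypothetical helper: expand a Name-map entry for code point $cp.
# Returns undef for "<hangul syllable>", whose name has to be computed
# algorithmically (for example with charnames::viacode()).
sub expand_name {
    my ($cp, $entry) = @_;
    return undef if $entry eq '<hangul syllable>';
    # Assumption: the code point is appended as >= 4 uppercase hex digits.
    (my $name = $entry) =~ s/<code point>/sprintf '%04X', $cp/e;
    return $name;
}

print expand_name(0x4E00, 'CJK UNIFIED IDEOGRAPH-<code point>'), "\n";
```

Entries without the marker pass through unchanged, so the same helper can be
applied to every element of the Name map.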

The "Decomposition_Mapping" property also uses "E<lt>hangul syllableE<gt>" for
those code points whose decomposition is algorithmically calculated.  These
can be generated via the function C<Unicode::Normalize::NFD()>.  This property
contains many occurrences of code points whose mappings are ordered lists of
other code points.

The return value is
C<undef> if the property is unknown;
C<s> if all the elements of the map array are simple scalars;
C<n> for the Name property, which has the complications described above;
C<d> for the Decomposition_Mapping property (complications already described);
otherwise C<c> if some of the map array elements are
S<C<"E<lt>code pointE<gt>">>;
and C<l> if additionally some are lists of code points.

A binary search can be used to quickly find the range that contains a given
code point in the inversion list, and hence its corresponding mapping.
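
A minimal sketch of such a lookup follows.  C<range_index> is a hypothetical
helper, not part of the proposed API.  The arrays are hard-coded from the
Numeric_Value example above (with the elided digits 3 through 6 filled in)
and truncated after 0xBF, so the results are only meaningful for code points
up to 0xBF.

```perl
use strict;
use warnings;

# Excerpt of the Numeric_Value arrays from the example above, truncated
# after 0xBF; a real call would return the complete arrays.
my @numerics_ranges = (0x00, 0x30 .. 0x39, 0x3A, 0xB2, 0xB3, 0xB4, 0xB9,
                       0xBA, 0xBC, 0xBD, 0xBE, 0xBF);
my @numerics_maps   = ("NaN", 0 .. 9, "NaN", 2, 3, "NaN", 1,
                       "NaN", 0.25, 0.5, 0.75, "NaN");

# Hypothetical helper: index of the range containing $cp, that is, the
# last index $i with $ranges->[$i] <= $cp.  Assumes the first range
# starts at 0 and that $cp is not negative.
sub range_index {
    my ($ranges, $cp) = @_;
    my ($lo, $hi) = (0, $#$ranges);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);   # round up so the loop ends
        if ($ranges->[$mid] <= $cp) { $lo = $mid }
        else                        { $hi = $mid - 1 }
    }
    return $lo;
}

for my $cp (0x35, 0x41, 0xBD) {
    my $i = range_index(\@numerics_ranges, $cp);
    printf "U+%04X => %s\n", $cp, $numerics_maps[$i];
}
```

Because the two arrays are parallel, the index found in the range array is
used directly to read the value from the map array.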

=cut
