The code points giving you trouble are the noncharacters 0xFDD0..0xFDEF:

http://stackoverflow.com/questions/5188679/whats-the-purpose-of-the-noncharacters-ufdd0-to-ufdef

You can split the range in two to avoid the problematic code points (and
you could use the same approach to combine the distinct ranges you have
above into a single character class).

$  perl6 -e 'say ?("\c[0xFDCF]" ~~
/<[\c[0xE000]..\c[0xFDCF]\c[0xFDF0]..\c[0xFFFD]]>/)'
True
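
As a rough sketch, the split ranges could be folded into one named rule
together with the first range from your message. The token name nonascii
is taken from the CSS rule you quoted, and the sample strings are only
for illustration. #x10000-#x10FFFF is left out because you mentioned it
fails as well (each supplementary plane also ends in two noncharacters,
so it may need the same kind of splitting). Treat this as untested
against any particular Rakudo build.

```perl6
# Sketch only: the token name and the sample strings are illustrative.
# The second CSS range is split around the noncharacters U+FDD0..U+FDEF.
# The supplementary range #x10000-#x10FFFF is deliberately omitted.
my token nonascii {
    <[ \c[0x80]..\c[0xD7FF] \c[0xE000]..\c[0xFDCF] \c[0xFDF0]..\c[0xFFFD] ]>
}

say ?("\c[0xE9]" ~~ /<nonascii>/);   # U+00E9 lies inside the first range
say ?("a" ~~ /<nonascii>/);          # plain ASCII lies outside every range
```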

Note that if you have invalid UTF-8 input, though, you'll still get the
invalid character error, so you'll need to deal with that before trying to
use the rule.

$  perl6 -e 'say ?("\c[0xFDD0]" ~~
/<[\c[0xE000]..\c[0xFDCF]\c[0xFDF0]..\c[0xFFFD]]>/)'
===SORRY!===
Invalid character for UTF-8 encoding

Hope this helps.



On Mon, Feb 18, 2013 at 11:29 PM, David Warring <david.warr...@gmail.com> wrote:

> Hi Guys,
> A quick question.
>
> I'm trying to interpret unicode code-point ranges from the CSS 3 spec -
> http://www.w3.org/TR/css3-syntax/#CHARSETS
>
> The rule in question is
>
> nonascii :== #x80-#xD7FF #xE000-#xFFFD #x10000-#x10FFFF
>
> Where (I think) these are unicode code-point ranges.
>
> The latest rakudo build is fine with:
>
>
> % perl6 -e '/<[\c[0x80]..\c[0xD7FF]]>/'
>
>
> ...but doesn't like the second (or third) range:
>
>
> % perl6 -e '/<[\c[0xE000]..\c[0xFFFD]]>/'
> ===SORRY!===
> Invalid character for UTF-8 encoding
>
>
> ...the individual code points are ok:
>
>
> % perl6 -e '/<[\c[0xE000]]>/'
> % perl6 -e '/<[\c[0xFFFD]]>/'
>
>
> I think I'm getting the above error because not all unicode code-points
> are defined for the range xE000 to xFFFD - see
> http://www.utf8-chartable.de/unicode-utf8-table.pl  .
>
> I'm just having a problem implementing a concise regex/grammar rule for the
> above. Looking for advice.
>
> Cheers,
> David Warring
>



-- 
Will "Coke" Coleda
