Re: URI replacement pseudocode

2010-05-17 Thread Mark J. Reed
On Mon, May 17, 2010 at 3:00 PM, Aaron Sherman wrote:
> FFFE and FEFF are used to manage byte-ordering, so they really shouldn't be
> part of a URI (URIs should exist in a context in which byte ordering is
> assured, would be my take).

Neither U+FFFE nor U+FFFF is a valid character, but U+FEFF is
perfectly cromulent, if deprecated: it's the ZERO WIDTH NO-BREAK
SPACE (U+2060 WORD JOINER is the modern replacement).  The
choice of byte-order mark protocol was well-considered: if U+FEFF is
interpreted as a character instead of a BOM, it's a pretty harmless
character.
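
To make the byte-order point concrete, here is a minimal Python sketch
(not part of the original thread; the helper name detect_utf16_order is
made up for illustration). It shows that U+FEFF serializes as FE FF in
big-endian UTF-16 and FF FE in little-endian, so a reader that sees the
swapped form, which would decode to the noncharacter U+FFFE, can infer
the byte order.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, used as the byte-order mark

# The two UTF-16 serializations of U+FEFF are byte-swapped images of each other.
big_endian = BOM.encode("utf-16-be")     # b'\xfe\xff'
little_endian = BOM.encode("utf-16-le")  # b'\xff\xfe'
assert big_endian == codecs.BOM_UTF16_BE
assert little_endian == codecs.BOM_UTF16_LE


def detect_utf16_order(data: bytes) -> str:
    """Hypothetical helper: guess UTF-16 byte order from a leading BOM."""
    first_unit = int.from_bytes(data[:2], "big")
    if first_unit == 0xFEFF:
        return "big"
    if first_unit == 0xFFFE:
        # Read big-endian, the unit looks like U+FFFE, which is never a
        # character, so the data must be little-endian.
        return "little"
    return "unknown"
```

For example, detect_utf16_order((BOM + "hi").encode("utf-16-le")) returns
"little", while input without a leading BOM returns "unknown".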

> The Unicode spec says that FFFF is guaranteed not to be a valid Unicode
> character, but does not explain why. [
> http://unicode.org/charts/PDF/UFFF0.pdf]

The Unicode specification is a lot more than code charts.  See section
15.8, "Noncharacters", for discussion of these code points.  U+FFFF (and
U+xFFFF for all valid values of x up through 0x10) are invalid so they
can be used as sentinel values within application memory, for
instance.  Whereas U+FFFE is illegal precisely because it's the
inverse of the BOM.
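
As a rough illustration of which code points are involved, here is a
small Python sketch (the function name is_noncharacter is an assumption,
not something from the thread or from any library). Besides the
U+xFFFE / U+xFFFF pair at the end of each of the 17 planes, Unicode also
reserves U+FDD0..U+FDEF as noncharacters, which is not mentioned above
but is included here so the count comes out to the standard's 66.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of Unicode's 66 noncharacter code points."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # The contiguous block U+FDD0..U+FDEF (32 code points).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: low 16 bits are FFFE or FFFF.
    return (cp & 0xFFFE) == 0xFFFE


# Sentinel-style use: a value that can never collide with real text.
END_OF_INPUT = 0xFFFF
assert is_noncharacter(END_OF_INPUT)
assert not is_noncharacter(0xFEFF)  # the BOM itself is a real character
```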

-- 
Mark J. Reed 


Re: URI replacement pseudocode

2010-05-17 Thread Moritz Lenz
Hi,

Aaron Sherman wrote:
> Over the past week, I've been using my scant bits of nighttime coding to
> cobble together a pseudocode version of what I think the URI module should
> look like. There's already one available as example code, but it doesn't
> actually implement either the URI or IRI spec correctly. Instead, this
> approach uses a pluggable grammar so that you can:
> 
>   my URI $uri .= new( get_url_from_user(), :spec )
> 
> which would parse the given URL using the RFC3987 IRI grammar. By default,
> it will use RFC3986 to parse URIs, which does not implement the UCS
> extensions. It can even handle the "legacy" RFC2396 and regex-based RFC3986
> variations.
> 
> Here's the code:
> 
> https://docs.google.com/leaf?id=0B41eVYcoggK7YjdkMzVjODctMTAxMi00ZGE0LWE1OTAtZTg1MTY0Njk5YjY4&hl=en

I think your code would benefit greatly from actually being run
with Rakudo (or at least the parts that are already implemented), as well
as from a version control system.

> So, my questions are:
> 
> * Is this code doing anything that is explicitly not Perl 6ish?

Some things I've noticed:
* you put lots of subs into roles - you probably meant methods
* Don't inherit from roles, implement them with 'does'
* the grammars contain a mixture of tokens for parsing and of
methods/subs for data extraction; yet Perl 6 offers a nice way to
separate the two, in the form of action/reduction methods; your code
might benefit from them.
* class URI::GrammarType seems not very extensible... maybe keep a hash
of URI names that map to URIs, which can be extended by method calls?

> * Is this style of pluggable grammar the correct approach?

Looks good, from a first glance.

> * Should I hold off until R* to even try to convert this into working code?

No need for that. The support for grammars and roles is pretty good;
character classes and match objects are still a bit unstable/wacky.

> * What's the best way to write tests/package?

Every Perl 6 compiler comes with a Test.pm module, so use that. It
outputs TAP, so you can use the 'prove' command from Perl 5's TAP::Harness.

> * Am I correct in assuming that <...> in a regex is intended to allow the
> creation of interface roles for grammars?

You lost me here. In a regex, <name> calls a named rule (with arguments).
Could you rephrase your question?

> * I guessed wildly at how I should be invoking the match against a saved
> "token" reference:
> if $s ~~ m/^ <.$.spec.gtype.URI_reference> $/ {
>   is that correct?

probably just $s ~~ /^ $regex $/;

> * Are implementations going to be OK with massive character classes like:
> <+[\xA0 .. \xD7FF] + [\xF900 .. \xFDCF] + [\xFDF0 .. \xFFEF] +
>   [\x10000 .. \x1FFFD] + [\x20000 .. \x2FFFD] +
>   [\x30000 .. \x3FFFD] + [\x40000 .. \x4FFFD] +
>   [\x50000 .. \x5FFFD] + [\x60000 .. \x6FFFD] +
>   [\x70000 .. \x7FFFD] + [\x80000 .. \x8FFFD] +
>   [\x90000 .. \x9FFFD] + [\xA0000 .. \xAFFFD] +
>   [\xB0000 .. \xBFFFD] + [\xC0000 .. \xCFFFD] +
>   [\xD0000 .. \xDFFFD] + [\xE1000 .. \xEFFFD]>
> (from the IRI specification)

Funny thing, why does it exclude the FFFE and FFFF codepoints?
Anyway, I can't answer that question.
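
On the FFFE/FFFF question: each supplementary-plane range in RFC 3987's
ucschar production stops at xFFFD, so the two code points it skips per
plane are exactly the noncharacters discussed earlier in this thread.
Below is a hedged Python sketch (not from the thread; UCSCHAR_RANGES,
is_ucschar and UCSCHAR_RE are illustrative names) that rebuilds the
ranges and compiles them into one character class. It says nothing
about how Perl 6 implementations cope with such a class; it only shows
that the set is small when expressed as 17 ranges.

```python
import re

# BMP ranges from the ucschar production in RFC 3987.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 0xD: everything except the trailing xFFFE and xFFFF.
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 0xE)]
# Plane 0xE starts at E1000 in the RFC, not E0000.
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(cp: int) -> bool:
    """Return True if code point cp falls in one of the ucschar ranges."""
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


# The same set as a single regex character class of 17 ranges.
UCSCHAR_RE = re.compile(
    "[" + "".join("%s-%s" % (chr(lo), chr(hi)) for lo, hi in UCSCHAR_RANGES) + "]"
)

assert is_ucschar(0x10000) and not is_ucschar(0x1FFFE)
assert UCSCHAR_RE.fullmatch("\u00e9") is not None
```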

Cheers,
Moritz