multi-character ranges

2010-07-21 Thread Aaron Sherman
[changing the subject because it's now clear we have two different
discussions on our hands. I think we're at or closing in on a consensus for
a .. z, and this discussion is aa .. bb]

On Wed, Jul 21, 2010 at 1:56 AM, Darren Duncan dar...@darrenduncan.net wrote:

 Aaron Sherman wrote:

 2) The spec doesn't put this information anywhere near the definition of the
 range operator. Perhaps we can make a note? This was a source of confusion
 for me.


 My impression is that a Range primarily defines an interval in terms of
 2 endpoint values such that it defines a possibly infinite set of values
 between those endpoints.


I don't think that has much to do with the fact that it was quite reasonable
for me to look to the definition of .. in S03 for what the range between
two characters contains.

 3) It seems that there are two competing multi-character approaches and both
 seem somewhat valid. Should we use a pragma to toggle behavior between A
 and B:

  A: aa .. bb contains az
  B: aa .. bb contains ONLY aa, ab, ba and bb


 I would find A to be the only reasonable answer.


[Before I respond, let's agree that, below, I'm going to say things like
"generates" when talking about ... What I'm describing is the idea that a
value exists in the range given, not that a range is actually a list.]

I would find B to be the only reasonable answer, but enough people seem to
think the other way that I understand there's a valid need to be able to get
both behaviors.
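
To make the two readings concrete, here is a minimal sketch in Python (not Perl 6, and not anything the spec defines). The function names `per_position_range` and `in_interval` are hypothetical, and the sketch assumes equal-length endpoints whose characters ascend by code point at every position.

```python
from itertools import product


def char_span(lo: str, hi: str) -> list[str]:
    """Characters from lo to hi inclusive, ordered by code point."""
    return [chr(c) for c in range(ord(lo), ord(hi) + 1)]


def per_position_range(start: str, end: str) -> list[str]:
    """Reading B: vary each character position independently."""
    if len(start) != len(end):
        raise ValueError("this sketch only handles equal-length endpoints")
    columns = [char_span(a, b) for a, b in zip(start, end)]
    return ["".join(chars) for chars in product(*columns)]


def in_interval(value: str, start: str, end: str) -> bool:
    """Reading A: plain string comparison against both endpoints."""
    return start <= value <= end


print(per_position_range("aa", "bb"))           # ['aa', 'ab', 'ba', 'bb']
print(in_interval("az", "aa", "bb"))            # True
print("az" in per_position_range("aa", "bb"))   # False
```

Under reading A, az lies inside aa .. bb; under reading B it does not, which is exactly the disagreement above.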


 If you want B's semantics then use ... instead; .. should not be
 overloaded for that.


I wasn't really distinguishing between .. and ... as I'm pretty sure
they should have the same behavior, here. The case where I'm not sure they
should have the same behavior is apple .. orange. Frankly, I think that
there's no right solution there. There's the one I proposed in my original
message (treat each character index as a distinct sequence and then
increment in a base defined by all of the sequences), but even I don't like
that. To generate all possible strings of length 5+ that sort between those
two is another suggestion, but then what do you expect father-in-law ..
orange to do? Punctuation throws a whole new dimension in there, and I'm
immediately lost. When you go to my Japanese example from many messages ago,
which I got from a fairly typical Web site and contained 2 Scripts with 4
different General Categories, I begin to need pharmaceuticals.

I don't see any value in having different rules for what .. and ... generate
in these cases, however. (frankly, I'm still on the fence about ... for
single endpoints, which I think should just devolve to .. (... with a list
for LHS is another animal, of course))



 If there were to be any similar pragma, then it should control matters like
 collation, or what nationality/etc-specific subtype of Str the 'aa' and
 'bb' are blessed into on definition, so that their collation/sorting/etc
 rules can be applied when figuring out if a particular $foo~~$bar..$baz is
 TRUE or not.


For inclusion (e.g. does aa .. zz generate cliché) see the
single-character range discussion, which has already touched on locale
issues.

-- 
Aaron Sherman
Email or GTalk: a...@ajs.com
http://www.ajs.com/~ajs


Re: multi-character ranges

2010-07-21 Thread Jon Lang
Aaron Sherman wrote:
 Darren Duncan wrote:
 3) It seems that there are two competing multi-character approaches and both
 seem somewhat valid. Should we use a pragma to toggle behavior between A
 and B:

  A: aa .. bb contains az
  B: aa .. bb contains ONLY aa, ab, ba and bb


 I would find A to be the only reasonable answer.

 [Before I respond, let's agree that, below, I'm going to say things like
 "generates" when talking about ... What I'm describing is the idea that a
 value exists in the range given, not that a range is actually a list.]

 I would find B to be the only reasonable answer, but enough people seem to
 think the other way that I understand there's a valid need to be able to get
 both behaviors.

FWIW, the reasoning behind A is that it's very much like looking up a
word in a dictionary.  Is az greater than, less than, or equal to
aa?  Greater than.  Is az greater than, equal to, or less than
bb?  Less than.  Since it is greater than aa and less than bb,
it is between aa and bb.  This is what infix:<..> tests for.

 If you want B's semantics then use ... instead; .. should not be
 overloaded for that.


 I wasn't really distinguishing between .. and ... as I'm pretty sure
 they should have the same behavior, here. The case where I'm not sure they
 should have the same behavior is apple .. orange. Frankly, I think that
 there's no right solution there. There's the one I proposed in my original
 message (treat each character index as a distinct sequence and then
 increment in a base defined by all of the sequences), but even I don't like
 that. To generate all possible strings of length 5+ that sort between those
 two is another suggestion, but then what do you expect father-in-law ..
 orange to do? Punctuation throws a whole new dimension in there, and I'm
 immediately lost. When you go to my Japanese example from many messages ago,
 which I got from a fairly typical Web site and contained 2 Scripts with 4
 different General Categories, I begin to need pharmaceuticals.

What you're asking about now isn't the range or series operators; it's
the comparison operators: before, after, gt, lt, ge, le, leg, and so
on.  When comparing two strings, establishing an order between them is
generally straightforward as long as both are composed of letters from
the same alphabet and with the same case; but once you start mixing
cases, introducing non-alphabetical characters such as spaces or
punctuation, and/or introducing characters from other alphabets, the
common-sense meaning of order becomes messy.

Traditionally, this has been addressed by falling back on a comparison
of the characters' ordinals: 0x0041 comes before 0x0042, and so on.
It includes counterintuitive situations where d > E, because all
capital letters come earlier in the Unicode sequencing than any
lower-case letters do.  OTOH, it's robust: if all that you want is a
way to ensure that strings can always be sorted, this will do the job.
 It won't always be an _intuitive_ ordering; but there will always be
an ordering.
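
As a rough illustration of the ordinal fallback (in Python, whose default string comparison happens to work by code point; the word list is made up for the example):

```python
words = ["d", "E", "apple", "Zebra", "father-in-law"]

# Code points of the two letters from the example above.
print(hex(ord("E")), hex(ord("d")))   # 0x45 0x64

# Every capital letter has a smaller code point than any lower-case letter.
print("d" > "E")                      # True

# Ordinal ordering: always defined, not always intuitive.
by_ordinal = sorted(words)
print(by_ordinal)   # ['E', 'Zebra', 'apple', 'd', 'father-in-law']

# One possible "more intuitive" ordering, ignoring case.
by_casefold = sorted(words, key=str.casefold)
print(by_casefold)  # ['apple', 'd', 'E', 'father-in-law', 'Zebra']
```

The case-insensitive key is only one of many possible alternatives; it is shown for contrast, not as a recommendation.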

 I don't see any value in having different rules for what .. and ... generate
 in these cases, however. (frankly, I'm still on the fence about ... for
 single endpoints, which I think should just devolve to .. (... with a list
 for LHS is another animal, of course))

The only area where infix:<..> and infix:<...> overlap is when you're
talking about list generation; when using them for matching purposes,
C< $x ~~ 1..3 > is equivalent to C<< $x >= 1 && $x <= 3 >> (that is,
it's a single value that falls somewhere between the two endpoints),
while C< $x ~~ 1...3 > is equivalent to C< $x ~~ (1, 2, 3) > (that is,
$x is a three-element list that contains the values 1, 2, and 3 in
that order) - two very different things.  There simply is not enough
similarity between the two operators for one to degenerate to the
other in anything more than a few edge-cases.
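
The distinction drawn here can be modelled roughly as follows. This is a Python sketch of the description above, not of actual Perl 6 smartmatch semantics; `matches_range` and `matches_series` are hypothetical names.

```python
def matches_range(x, low, high) -> bool:
    """Range-style match: x is a single value between the endpoints."""
    return low <= x <= high


def matches_series(x, low: int, high: int) -> bool:
    """Series-style match: x is itself the list low, low+1, ..., high."""
    if not isinstance(x, (list, tuple)):
        return False
    return list(x) == list(range(low, high + 1))


print(matches_range(2, 1, 3))            # True
print(matches_range(2.5, 1, 3))          # True: any value in the interval
print(matches_series(2, 1, 3))           # False: a single value is not the list
print(matches_series([1, 2, 3], 1, 3))   # True
```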

-- 
Jonathan Dataweaver Lang


Re: multi-character ranges

2010-07-21 Thread yary
On Wed, Jul 21, 2010 at 3:47 PM, Jon Lang datawea...@gmail.com wrote:
 ...  When comparing two strings, establishing an order between them is
 generally straightforward as long as both are composed of letters from
 the same alphabet and with the same case; but once you start mixing
 cases, introducing non-alphabetical characters such as spaces or
 punctuation, and/or introducing characters from other alphabets, the
 common-sense meaning of order becomes messy.

Well, there's locale considerations that can make it less
straightforward, even with the same case and alphabet. EG, in Danish,
aa comes after zz. But at least there are agreed-upon rules, even
if they are locale-specific, so your point about non-alphabetical
characters needing definition holds.
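
A toy sketch of that Danish case, in Python. It assumes a deliberately simplified rule (treat the digraph aa as the letter å, which sorts after z, æ and ø) and is not a real collation implementation; `toy_danish_key` and `DANISH_ALPHABET` are hypothetical names, and any character outside the listed alphabet raises `ValueError`.

```python
# Simplified Danish alphabet order: a..z, then æ, ø, å.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"


def toy_danish_key(word: str) -> list[int]:
    """Sort key under the simplified rule that 'aa' collates as 'å'."""
    folded = word.lower().replace("aa", "å")
    return [DANISH_ALPHABET.index(ch) for ch in folded]


words = ["zz", "aa", "ab"]
print(sorted(words))                      # ['aa', 'ab', 'zz']
print(sorted(words, key=toy_danish_key))  # ['ab', 'zz', 'aa']
```

With the ordinal ordering, aa comes first; with the toy locale-aware key, it comes after zz.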

-y