multi-character ranges
[changing the subject because it's now clear we have two different discussions on our hands. I think we're at or closing in on a consensus for "a" .. "z", and this discussion is "aa" .. "bb"]

On Wed, Jul 21, 2010 at 1:56 AM, Darren Duncan dar...@darrenduncan.net wrote:

> Aaron Sherman wrote:
> > 2) The spec doesn't put this information anywhere near the definition of the range operator. Perhaps we can make a note? This was a source of confusion for me.
>
> My impression is that a Range primarily defines an interval in terms of 2 endpoint values such that it defines a possibly infinite set of values between those endpoints.

I don't think that has much to do with the fact that it was quite reasonable for me to look to the definition of .. in S03 for what the range between two characters contains.

> > 3) It seems that there are two competing multi-character approaches and both seem somewhat valid. Should we use a pragma to toggle behavior between A and B:
> >
> > A: "aa" .. "bb" contains "az"
> > B: "aa" .. "bb" contains ONLY "aa", "ab", "ba" and "bb"
>
> I would find A to be the only reasonable answer.

[Before I respond, let's agree that, below, I'm going to say things like "generates" when talking about "..". What I'm describing is the idea that a value exists in the range given, not that a range is actually a list.]

I would find B to be the only reasonable answer, but enough people seem to think the other way that I understand there's a valid need to be able to get both behaviors.

> If you want B's semantics then use ... instead; .. should not be overloaded for that.

I wasn't really distinguishing between .. and ... as I'm pretty sure they should have the same behavior, here. The case where I'm not sure they should have the same behavior is "apple" .. "orange". Frankly, I think that there's no right solution there. There's the one I proposed in my original message (treat each character index as a distinct sequence and then increment in a base defined by all of the sequences), but even I don't like that.

To generate all possible strings of length 5+ that sort between those two is another suggestion, but then what do you expect "father-in-law" .. "orange" to do? Punctuation throws a whole new dimension in there, and I'm immediately lost. When you go to my Japanese example from many messages ago, which I got from a fairly typical Web site and contained 2 Scripts with 4 different General Categories, I begin to need pharmaceuticals.

I don't see any value in having different rules for what .. and ... generate in these cases, however. (frankly, I'm still on the fence about ... for single endpoints, which I think should just devolve to .. (... with a list for LHS is another animal, of course))

> If there were to be any similar pragma, then it should control matters like collation, or what nationality/etc-specific subtype of Str the 'aa' and 'bb' are blessed into on definition, so that their collation/sorting/etc rules can be applied when figuring out if a particular $foo~~$bar..$baz is TRUE or not.

For inclusion (e.g. does "aa" .. "zz" generate "cliché") see the single-character range discussion, which has already touched on locale issues.

--
Aaron Sherman
Email or GTalk: a...@ajs.com
http://www.ajs.com/~ajs
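The two readings of "aa" .. "bb" debated above can be modelled in a few lines. This is only a sketch, written in Python rather than Perl 6, and the function names in_interval and expand_per_position are invented for illustration. It assumes plain code-point comparison and endpoints of equal length, and it does not describe what any Perl 6 implementation actually does.

```python
from itertools import product


def in_interval(lo, hi, s):
    """Reading A: s belongs to the range if it sorts between the endpoints.

    Assumption: strings are compared by code point, as Python does by default.
    """
    return lo <= s <= hi


def expand_per_position(lo, hi):
    """Reading B: treat each character position as its own small range.

    Assumption: both endpoints have the same length and each character of
    lo is not greater than the matching character of hi.
    """
    if len(lo) != len(hi):
        raise ValueError("this sketch only handles endpoints of equal length")
    columns = []
    for a, b in zip(lo, hi):
        if a > b:
            raise ValueError("each position must run upwards in this sketch")
        columns.append([chr(c) for c in range(ord(a), ord(b) + 1)])
    # Combine the per-position choices, leftmost position varying slowest.
    return ["".join(chars) for chars in product(*columns)]


if __name__ == "__main__":
    print(in_interval("aa", "bb", "az"))             # reading A accepts "az"
    print(expand_per_position("aa", "bb"))           # reading B lists four strings
    print("az" in expand_per_position("aa", "bb"))   # reading B rejects "az"
```

Under reading A the range is a test on one value, so nothing has to be enumerated; under reading B the range is defined by the list it would produce, which is why unequal-length endpoints such as "apple" .. "orange" have no obvious answer in this model.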
Re: multi-character ranges
Aaron Sherman wrote:
> Darren Duncan wrote:
> > > 3) It seems that there are two competing multi-character approaches and both seem somewhat valid. Should we use a pragma to toggle behavior between A and B:
> > >
> > > A: "aa" .. "bb" contains "az"
> > > B: "aa" .. "bb" contains ONLY "aa", "ab", "ba" and "bb"
> >
> > I would find A to be the only reasonable answer.
>
> [Before I respond, let's agree that, below, I'm going to say things like "generates" when talking about "..". What I'm describing is the idea that a value exists in the range given, not that a range is actually a list.]
>
> I would find B to be the only reasonable answer, but enough people seem to think the other way that I understand there's a valid need to be able to get both behaviors.

FWIW, the reasoning behind A is that it's very much like looking up a word in a dictionary. Is "az" greater than, less than, or equal to "aa"? Greater than. Is "az" greater than, equal to, or less than "bb"? Less than. Since it is greater than "aa" and less than "bb", it is between "aa" and "bb". This is what infix:<..> tests for.

> > If you want B's semantics then use ... instead; .. should not be overloaded for that.
>
> I wasn't really distinguishing between .. and ... as I'm pretty sure they should have the same behavior, here. The case where I'm not sure they should have the same behavior is "apple" .. "orange". Frankly, I think that there's no right solution there. There's the one I proposed in my original message (treat each character index as a distinct sequence and then increment in a base defined by all of the sequences), but even I don't like that.
>
> To generate all possible strings of length 5+ that sort between those two is another suggestion, but then what do you expect "father-in-law" .. "orange" to do? Punctuation throws a whole new dimension in there, and I'm immediately lost. When you go to my Japanese example from many messages ago, which I got from a fairly typical Web site and contained 2 Scripts with 4 different General Categories, I begin to need pharmaceuticals.
What you're asking about now isn't the range or series operators; it's the comparison operators: before, after, gt, lt, ge, le, leg, and so on. When comparing two strings, establishing an order between them is generally straightforward as long as both are composed of letters from the same alphabet and with the same case; but once you start mixing cases, introducing non-alphabetical characters such as spaces or punctuation, and/or introducing characters from other alphabets, the common-sense meaning of order becomes messy.

Traditionally, this has been addressed by falling back on a comparison of the characters' ordinals: 0x0041 comes before 0x0042, and so on. It includes counterintuitive situations where "d" > "E", because all capital letters come earlier in the Unicode sequencing than any lower-case letters do. OTOH, it's robust: if all that you want is a way to ensure that strings can always be sorted, this will do the job. It won't always be an _intuitive_ ordering; but there will always be an ordering.

> I don't see any value in having different rules for what .. and ... generate in these cases, however. (frankly, I'm still on the fence about ... for single endpoints, which I think should just devolve to .. (... with a list for LHS is another animal, of course))

The only area where infix:<..> and infix:<...> overlap is when you're talking about list generation; when using them for matching purposes, C< $x ~~ 1..3 > is equivalent to C< $x >= 1 && $x <= 3 > (that is, it's a single value that falls somewhere between the two endpoints), while C< $x ~~ 1...3 > is equivalent to C< $x ~~ (1, 2, 3) > (that is, $x is a three-element list that contains the values 1, 2, and 3 in that order) - two very different things. There simply is not enough similarity between the two operators for one to degenerate to the other in anything more than a few edge-cases.

--
Jonathan "Dataweaver" Lang
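A small Python sketch of the two points made in the message above: ordering by code point is always defined but puts capital letters before lower-case ones, and matching against a range is a bounds test while matching against a series compares against the whole generated list. The helper names are hypothetical, and the integer-only, step-by-one series is a simplification rather than real Perl 6 smartmatch semantics.

```python
def codepoint_sorted(words):
    """Sort strings by the ordinals of their characters.

    Python's default string comparison already works this way, so this is
    just sorted() under a descriptive (hypothetical) name.
    """
    return sorted(words)


def matches_range(x, lo, hi):
    """Model of matching a single value against lo..hi: a bounds test."""
    return lo <= x <= hi


def matches_series(x, lo, hi):
    """Model of matching against lo...hi for integers stepping by 1.

    The candidate must itself be the whole list lo, lo+1, ..., hi in order.
    """
    return isinstance(x, (list, tuple)) and list(x) == list(range(lo, hi + 1))


if __name__ == "__main__":
    # 'E' is 0x45 and 'd' is 0x64, so "E" sorts first and "d" > "E" holds.
    print(codepoint_sorted(["d", "E", "apple", "Orange"]))
    print("d" > "E")
    # A single value can match the range but never the series.
    print(matches_range(2, 1, 3), matches_series(2, 1, 3))
    # A three-element list matches the series.
    print(matches_series([1, 2, 3], 1, 3))
```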
Re: multi-character ranges
On Wed, Jul 21, 2010 at 3:47 PM, Jon Lang datawea...@gmail.com wrote:

> ... When comparing two strings, establishing an order between them is generally straightforward as long as both are composed of letters from the same alphabet and with the same case; but once you start mixing cases, introducing non-alphabetical characters such as spaces or punctuation, and/or introducing characters from other alphabets, the common-sense meaning of order becomes messy.

Well, there are locale considerations that can make it less straightforward, even with the same case and alphabet. E.g., in Danish, "aa" comes after "zz". But at least there are agreed-upon rules, even if they are locale-specific, so your point about non-alphabetical characters needing definition holds.

-y
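To make the Danish example concrete, here is a deliberately tiny Python model of a locale-specific sort key. It is an assumption-laden toy, not real Danish collation: it only knows lower-case letters, it treats every "aa" as the letter "å", and the names DANISH_ALPHABET and toy_danish_key are made up for this sketch. A real program would rely on a proper collation library instead.

```python
# Toy alphabet order: the three extra letters follow "z".
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"


def toy_danish_key(word):
    """Return a sort key under the toy rules described above.

    Characters outside DANISH_ALPHABET (spaces, punctuation, other scripts)
    raise ValueError, which mirrors the point that their order would still
    need to be defined.
    """
    folded = word.lower().replace("aa", "å")
    return [DANISH_ALPHABET.index(ch) for ch in folded]


if __name__ == "__main__":
    words = ["zz", "aa", "ab"]
    print(sorted(words))                      # code-point order
    print(sorted(words, key=toy_danish_key))  # toy locale-aware order
```

With code-point comparison "aa" sorts first; with the toy key it sorts last, so whether "aa" .. "zz" is even a non-empty interval depends on which ordering the endpoints are compared under.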