
[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-11-03 Thread Andrew Barnert via Python-ideas

On Nov 2, 2019, at 20:33, Random832  wrote:
> 
>> On Sun, Oct 27, 2019, at 03:10, Andrew Barnert wrote:
>>> On Oct 26, 2019, at 19:59, Random832  wrote:
>>> 
>>> A string representation consisting of (say) a UTF-8 string, plus an 
>>> auxiliary list of byte indices of, say, 256-codepoint-long chunks [along 
>>> with perhaps a flag to say that the chunk is all-ASCII or not] would 
>>> provide O(1) random access, though, of course, despite both being O(1), 
>>> "single index access" vs "single index access then either another index 
>>> access or up to 256 iterate-forward operations" aren't *really* the same 
>>> speed.
>> 
>> Yes, but that means constructing a string takes linear time, because 
>> you have to construct that index. You can’t just take a 
>> read/recv/mmap/result of a C library/whatever and use it as a string 
>> without doing linear work on it first. 
> 
> constructing a string already takes linear time because you have to copy it 
> into memory managed by the python garbage collector.

Not necessarily. There are certainly _some_ strings that come into the 
interpreter (or extension module) as externally-allocated things that have to 
be copied. But not all, or even most, strings. Things like reading a file or 
recving from a socket, you allocate a buffer which is managed by your GC, and 
the string gets placed there, so there’s no need to copy it. When you mmap a 
file, you know the lifetime of the string is the lifetime of the mmap, so you 
don’t track it separately, much less copy it. And so on.
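
As a rough illustration of the "allocate the buffer yourself" case in today's Python: the sketch below uses `io.BytesIO` as a stand-in for a file or socket, and the names `buf` and `view` are made up for the example. `readinto` fills a buffer the caller already owns, and a `memoryview` slices it without copying.

```python
import io

source = io.BytesIO("héllo wörld".encode("utf-8"))  # stand-in for a file or socket

buf = bytearray(64)              # buffer owned by the caller
n = source.readinto(buf)         # data lands directly in buf
view = memoryview(buf)[:n]       # a window onto buf, not a copy

# Evidence that nothing was copied: a write through buf is visible via view.
buf[0] = ord("H")
text = view.tobytes().decode("utf-8")  # decoding is the first actual copy
```

This only shows that the bytes can arrive without an extra copy; turning them into a `str` in current CPython still costs a decode pass.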

Also, even when you do need to copy, a memcpy is a whole lot faster than a 
loop, even though they are both linear. Especially when that loop has 
additional operations (maybe even including a conditional that branches 80/20 
or worse). But even without that, copying byte by byte, rather than by whatever 
chunks the CPU likes, can already be 16x as slow. Go often ends up copying 
strings unnecessarily, but the memcpy is still so much faster than the decode 
that Java/C#/Python/Ruby does that Go fanatics like to brag about their fast 
text handling (until you show them some Rust or Swift code that’s even faster 
as well as more readable…).

> And you can track whether you'll need the index in one pass while copying, 
> rather than, as currently, having to do one pass to pick a representation and 
> another to actually perform the copying and conversion, so my suggestion may 
> have a cache locality advantage over the other way.

Sure, the existing implementation of building strings is slow, and that’s what 
keeping strings in UTF-8 is intended to solve, and if your suggestion makes it 
take 1/4th as long (which seems possible, but obviously it’s just a number I 
pulled out of thin air), that’s nice—but nowhere near as nice as completely 
eliminating that cost.

And most strings, you never need to randomly access (or only need to randomly 
access because other parts of the API, like str.find and re.search, make you), 
so why should you pay any cost, even if it’s only 1/4th the cost you pay in 
Python 3.8? (Also, for some random-access uses, it really is going to be faster 
to just decode to UTF-32 and subscript that; why build an index plus decoding 
when you can just decode?) If you’re already making a radical breaking change, 
why not get the radical benefits?

Also, consider this: if str is unindexed and non-subscriptable, it’s trivial to 
build a class IndexedStr(str) whose __new__ builds the index (or copies it if 
passed an IndexedStr) and that adds __getitem__, while still acting as a str 
even at the C API level. Whenever you need random access, you construct an 
IndexedStr, the rest of the time you don’t bother. And you can even create 
special-purpose variants for special strings (I know this is always ASCII, or I 
know it’s always under 16 chars…) or specific use cases (I know I’m going to 
iterate backward, so repeatedly counting forward from index[idx%16] would be 
hugely wasteful). But if str builds an index, there’s no way to write a class 
FastStr(str) that skips that, or any of the variants that does it differently.
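
A minimal sketch of that layering, under loud assumptions: today's str is already subscriptable, so a made-up `Utf8Str` class stands in for the hypothetical unindexed string, and `IndexedStr` here subclasses that stand-in rather than the real str. The chunk size of 16 and the helper `_next` are arbitrary choices for the example, not anything proposed in the thread.

```python
class Utf8Str:
    """Hypothetical unindexed string: UTF-8 bytes, iterable but not subscriptable."""

    def __init__(self, data: bytes):
        self._data = bytes(data)

    def __iter__(self):
        return iter(self._data.decode("utf-8"))

    def __str__(self):
        return self._data.decode("utf-8")


class IndexedStr(Utf8Str):
    """Opt-in variant: pays the linear indexing cost once, then supports s[i]."""

    CHUNK = 16  # code points per index entry (arbitrary for this sketch)

    def __init__(self, data: bytes):
        super().__init__(data)
        self._offsets = []  # byte offset of code points 0, CHUNK, 2*CHUNK, ...
        count = 0
        for pos, byte in enumerate(self._data):
            if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
                if count % self.CHUNK == 0:
                    self._offsets.append(pos)
                count += 1
        self._length = count

    def __len__(self):
        return self._length

    def __getitem__(self, i: int) -> str:
        if i < 0:
            i += self._length
        if not 0 <= i < self._length:
            raise IndexError("string index out of range")
        pos = self._offsets[i // self.CHUNK]
        for _ in range(i % self.CHUNK):  # walk forward at most CHUNK - 1 code points
            pos = self._next(pos)
        return self._data[pos:self._next(pos)].decode("utf-8")

    def _next(self, pos: int) -> int:
        """Return the byte offset of the code point after the one starting at pos."""
        pos += 1
        while pos < len(self._data) and self._data[pos] & 0xC0 == 0x80:
            pos += 1
        return pos
```

Code that never subscripts keeps using the plain `Utf8Str` and never builds the offset list; only code that wraps a value in `IndexedStr` pays for it.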

___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GTCGPA3NIXVK63QIFGR5H74YRIDGK3SR/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-11-02 Thread Random832

On Sun, Oct 27, 2019, at 03:10, Andrew Barnert wrote:
> On Oct 26, 2019, at 19:59, Random832  wrote:
> > 
> > A string representation consisting of (say) a UTF-8 string, plus an 
> > auxiliary list of byte indices of, say, 256-codepoint-long chunks [along 
> > with perhaps a flag to say that the chunk is all-ASCII or not] would 
> > provide O(1) random access, though, of course, despite both being O(1), 
> > "single index access" vs "single index access then either another index 
> > access or up to 256 iterate-forward operations" aren't *really* the same 
> > speed.
> 
> Yes, but that means constructing a string takes linear time, because 
> you have to construct that index. You can’t just take a 
> read/recv/mmap/result of a C library/whatever and use it as a string 
> without doing linear work on it first. 

constructing a string already takes linear time because you have to copy it 
into memory managed by the python garbage collector. And you can track whether 
you'll need the index in one pass while copying, rather than, as currently, 
having to do one pass to pick a representation and another to actually perform 
the copying and conversion, so my suggestion may have a cache locality 
advantage over the other way.
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/UY5IIRWPSB37XRKDHLJYECIIFWPZS5SN/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-28 Thread Steven D'Aprano

I think that we're more or less in broad agreement, but I wanted to 
comment on this:

On Sun, Oct 27, 2019 at 09:41:00PM -0700, Andrew Barnert wrote:

> Yes, that’s the whole point of the message you were responding to: 
> extended grapheme clusters are the Unicode approximation of 
> characters; code units are not.

I don't think that's quite correct. See:

http://www.unicode.org/glossary/#abstract_character

http://www.unicode.org/glossary/#character

http://www.unicode.org/glossary/#extended_grapheme_cluster

http://www.unicode.org/glossary/#code_point

From the glossary definition of code point: "A value, or position, for a 
character, in any coded character set." In other words, the code point 
is a numeric code such as U+0041 that represents a character such as "A". 
(Except when it is a numeric code that represents a non-character.)

And from definitions D60 and D61 here:

http://www.unicode.org/versions/Unicode12.1.0/ch03.pdf

"Grapheme clusters and extended grapheme clusters may not have any 
particular linguistic significance"

"The grapheme cluster represents a horizontally segmentable unit of 
text, consisting of some grapheme base (which may consist of a Korean 
SYLLABLE) together with any number of nonspacing marks applied to it."
[Emphasis added.]

"A grapheme cluster is similar, but not identical to a combining 
character sequence."

So it is much more complicated than just "code point != character, 
extended grapheme cluster = character". Lots of code points are 
characters; lots of graphemes aren't characters but syllables or some 
other linguistic entity, or no linguistic entity at all; and lots of 
things that are characters aren't graphemes, such as combining 
character sequences.

And none of this mentions what to do with variation selectors, flags 
etc. The whole thing is very complicated and I don't pretend to 
understand all the details. (Until now, I thought that combining 
character sequences were grapheme clusters. Apparently they aren't.)

-- 
Steven
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/EWQL4T7QGVSSPBYTAM7BSLFVZ2WSB5SO/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-27 Thread Andrew Barnert via Python-ideas

On Oct 27, 2019, at 18:00, Steven D'Aprano  wrote:
> 
> On Sun, Oct 27, 2019 at 10:07:41AM -0700, Andrew Barnert via Python-ideas 
> wrote:
> 
>>> File "/home/rosuav/tmp/demo.py", line 1
>>>   print("Hello, world!')
>>>^
>>> SyntaxError: EOL while scanning string literal
>> 
>> So if those 12 glyphs take 14 code units 
> 
> I'm not really sure how glyphs (the graphical representation of a 
> character) comes into this discussion

Because, assuming you’re using a monospace font, the number of glyphs to the 
error is how many spaces you need to indent.

This example happens to be pure ASCII, so the counts of glyphs, extended 
grapheme clusters, code units, and code points happen to be the same. But just 
change that e to an è made of two code points, a base letter plus a combining 
mark (as the ç in your previous example might have been), and now there are 
still the same number of glyphs and clusters, but one more code point and at 
least one more code unit.

Extended grapheme clusters are intended to be the best approximation of 
“characters” in the Unicode standard. Code units are not.

> but for what it's worth, I 
> count 22, not 12 (excluding the leading spaces).

Sorry; that was a typo. Plus, I miscounted on top of the typo; I meant to count 
the spaces.

>> because you’re using Stephen’s string and it’s in NFKD, getting 14 and 
>> then indenting two spaces too many (as Python does today)
> 
> You mean something like this?
> 
> 
>py> value = äë +* 42
>  File "", line 1
>value = äë +* 42
>  ^
>SyntaxError: invalid syntax
> 
> (the identifier is 'a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS}')
> 
> Yes, that looks like a bug to me, but a super low priority one to fix.

Yes. And this is a general bug: everywhere that you count code units intending 
to use that as a count of glyphs or characters, both in Python itself and in 
third-party libraries and in applications. This is one of the most trivial 
examples, and you obviously wouldn’t break backward compatibility with 
everything solely to fix this example.
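
To make the two-columns-too-far effect concrete, here is a small sketch. The helper `caret_column` is hypothetical, and skipping combining marks is only a crude stand-in for real extended grapheme cluster segmentation (it ignores wide characters, ZWJ sequences, and so on).

```python
import unicodedata

def caret_column(line: str, offset: int) -> int:
    """Approximate the display column of line[offset] by skipping combining marks."""
    return sum(1 for ch in line[:offset] if unicodedata.combining(ch) == 0)

# The identifier is decomposed: a + COMBINING DIAERESIS, e + COMBINING DIAERESIS.
line = unicodedata.normalize("NFD", "value = äë +* 42")
offset = line.index("*")            # code point offset of the offending token

naive = " " * offset + "^"          # counts code points: two columns too far
better = " " * caret_column(line, offset) + "^"
```

Here `offset` is 14 while `caret_column(line, offset)` is 12, which is the difference between the misplaced and the intended caret.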

And I don’t know why I have to keep repeating this, but one more time: I’m not 
proposing to change Python, I’m arguing to _not_ change Python, because it’s 
already good enough, and the suggested improvement wouldn’t make it right 
despite breaking lots of code, and making it right is a big thing that would 
break even more code. If I were designing a new language, I would do it right 
from the start, and it would not have this bug, or any of the other 
manifestations of the same issue, but Python 4000 (or even 5000) is not an 
opportunity to design a new language.

(And to be clear: Python’s design made perfect sense when it was chosen; 
Unicode has just gotten more complicated since then. In fact, most other 
languages that adopted Unicode as early as Python got permanently stuck with 
the UCS-2 assumption, forcing all user code to deal with UTF-16 code units 
forever.)

> (This is assuming that the Python interpreter promises to line the caret 
> up with the offending symbol "always", rather than just making a best 
> effort to do so.)

Well, the reason I called it a good-enough best effort is because I assume that 
it’s only meant to be a best effort, and I think it’s good enough for that.

I’m not the one who said people would be up in arms if that were broken, I’m 
the one arguing that people are fine with it being broken as long as it’s 
usually good enough.

> And probably tough to fix too: I think you need to count in grapheme 
> clusters, not code points, 

Yes, that’s the whole point of the message you were responding to: extended 
grapheme clusters are the Unicode approximation of characters; code units are 
not. And a randomly-accessible sequence of grapheme clusters is impossible to 
do efficiently, but a sized iterable container, or a sequence-like thing that’s 
indexable by special indexes but not by integers, is. So tying the string type 
even more closely to code units would not fix it; changing the way it works as 
a Sequence would not fix it.

> but even that won't fix the issue since it 
> leaves you open to the *opposite* problem of undercounting if the 
> terminal or IDE fails to display combining characters properly:
> 
>value = a¨e¨ +* 42
>^
>SyntaxError: invalid syntax
> 
> I had to fake the above, because I couldn't find a terminal on my system 
> which would misdisplay COMBINING DIAERESIS, but I've seen editors do it.

That’s a matter of working around broken editors, terminals, and IDEs—which do 
exist, but are uncommon, and getting less common. Not having a workaround for a 
broken editor that most people don’t use is not a bug in the same way as being 
broken in a properly-working environment is.

(Not having a workaround for something broken that half the users in the world 
have to deal with, like Windows cmd.exe, would be a different story, of course. 
You can claim that it’s Windows’ bug, not yours, but

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-27 Thread Andrew Barnert via Python-ideas

On Oct 27, 2019, at 05:49, Chris Angelico  wrote:

>> Given zero-based indexing, and the string:
>> 
>>"abÇÐεф"
>> 
>> the index of "ф" better damn well be 5 rather than 8 (UTF-8), 10
>> (UTF-16) or 20 (UTF-32) or I'll be knocking on the API designer's door
>> with a pitchfork and a flaming torch *wink*
>> 
>> And returning  is even worse.
>> 
> 
> And in response to the notion that you don't actually need the index,
> just a position marker... consider this:
> 
>  File "/home/rosuav/tmp/demo.py", line 1
>print("Hello, world!')
> ^
> SyntaxError: EOL while scanning string literal

So if those 12 glyphs take 14 code units because you’re using Stephen’s string 
and it’s in NFKD, getting 14 and then indenting two spaces too many (as Python 
does today) is not just a good-enough best effort, but something we actually 
want to ensure at all costs by making sure you always deal in code unit indexes?

> Well, either that, or we need to make it so that " "* object at 0xb7ce1bf0> results in the correct number of spaces to
> indent it to that position. That ought to bring in plenty of
> pitchforks...

Would you still bring pitchforks for " " * StrIndex(chars=12, points=14, 
bytes=22)? 

If so, then you require code to spell it as " " * index.chars instead of " " * 
index.

It’s not like the namedtuple/structseq/dataclass/etc. repr is some innovative 
new idea nobody’s ever thought of to get a useful display, or like people can’t 
figure out how to get the index out of a regex match object. This is all simple 
stuff; I don’t get the incredulity that it could possibly be done. (Especially 
given that there are other languages that do exactly the same thing, like 
Swift, which ought to be proof that it’s not impossible.)
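
A sketch of what such an index object could look like in today's Python. `StrIndex` and `find` are hypothetical names taken from or invented for this message, and "chars" is approximated by counting code points that are not combining marks rather than by real grapheme cluster segmentation.

```python
from typing import NamedTuple
import unicodedata

class StrIndex(NamedTuple):
    chars: int   # approximate characters (code points that are not combining marks)
    points: int  # code points
    bytes: int   # UTF-8 code units

def find(haystack: str, needle: str) -> StrIndex:
    """Like str.index, but return all three offsets; raises ValueError if absent."""
    points = haystack.index(needle)
    prefix = haystack[:points]
    chars = sum(1 for ch in prefix if unicodedata.combining(ch) == 0)
    return StrIndex(chars, points, len(prefix.encode("utf-8")))

line = unicodedata.normalize("NFD", 'print("Héllo, wörld!\')')
index = find(line, "'")
caret = " " * index.chars + "^"
```

The repr comes for free as `StrIndex(chars=20, points=22, bytes=24)`, and `" " * index` raises TypeError, so callers have to say which count they mean.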

(Could it be done without breaking a whole ton of existing code? I strongly 
doubt it. But my whole argument for why we shouldn’t be trying to “fix” strings 
in “Python 4000” in the first place is that the right fix probably cannot be 
done in a way that’s remotely feasible for backward compatibility. So I hope 
you wouldn’t expect that something additional that I suggested could be 
considered only if that unfeasible fix were implemented would itself be 
feasible.)
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/X5B5SQ2RLRHMBPAZABRQ5TSRQ74JAXW5/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-27 Thread Andrew Barnert via Python-ideas

> On Oct 27, 2019, at 05:38, Steven D'Aprano  wrote:
> 
>> On Sun, Oct 27, 2019 at 12:10:22AM -0700, Andrew Barnert via Python-ideas 
>> wrote:
>> 
>> If you redesign your find, re.search, etc. APIs to not return 
>> character indexes, then I think you can get away with not having 
>> character-indexable strings.
> 
> If string.index(c) doesn't return the index of c in string, then what 
> does it return?
> 
> I think you are conflating the public API based on characters (to be 
> precise: code points) for some underlying implementation based on bytes. 

No, what I’m doing is avoiding conflating the public API based on characters 
with the underlying representation based on code points, treating them as no more 
fundamental than the code units.

You can still iterate the code points if you want to, because that’s 
occasionally useful. And you can also iterate the UTF-8 code units, because 
that’s also occasionally useful.

> Given zero-based indexing, and the string:
> 
>"abÇÐεф"
> 
> the index of "ф" better damn well be 5 rather than 8 (UTF-8), 10 
> (UTF-16) or 20 (UTF-32) or I'll be knocking on the API designer's door 
> with a pitchfork and a flaming torch *wink*

Really? Even if the string is in NFKD, as it would be if this were, say, the 
name of a file on a standard Mac file system? Then that Ç character is stored 
as the code unit U+0043 followed by the code unit U+0327, rather than the 
single unit U+00C7. So had it still better be 5, not 6? If so, Python 3 is 
broken, and always has been; where’s your pitchfork?
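
This is easy to check in current Python with the standard `unicodedata` module; the variable names below are just for the example.

```python
import unicodedata

s_nfc = unicodedata.normalize("NFC", "abÇÐεф")
s_nfkd = unicodedata.normalize("NFKD", s_nfc)

nfc_index = s_nfc.index("ф")    # Ç is the single code point U+00C7
nfkd_index = s_nfkd.index("ф")  # Ç is now U+0043 followed by U+0327

# Byte offsets of "ф" in the NFC string for each encoding.
prefix = s_nfc[:nfc_index]
byte_offsets = {
    enc: len(prefix.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}
```

The same visible text gives index 5 in one normalization form and 6 in the other, while the byte offsets are the 8, 10 and 20 quoted above.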

And what were you going to do with that 5 anyway that it has to be an integer? 
Without a use case, you’re just demanding infinite flexibility regardless of 
what the cost might be. You _could_ make this work by building a grapheme 
cluster index at construction time for every string, or by storing strings as 
an array of grapheme clusters that are themselves arrays of code points rather 
than as a flat array, or by normalizing every string at construction time. But 
do you actually want to do any of those things, or is guaranteeing 5 rather 
than 6 there not worth the cost?

Also, have you ever used seek and tell on a text file? What do you think tell 
gives you? According to the language spec, it could be anything and you have to 
treat it as an abstract index; I think in current CPython it’s a byte index. 
Where’s your pitchfork there?
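
A small sketch of using the text-file position as an abstract index. The file name and contents are made up; the code deliberately does not depend on what number `tell` returns.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("abÇÐεф\nsecond line\n")

    with open(path, encoding="utf-8") as f:
        head = f.read(5)      # five code points: "abÇÐε"
        cookie = f.tell()     # opaque position; not promised to equal 5
        rest = f.readline()
        f.seek(cookie)        # the supported use: hand it back to seek()
        again = f.readline()
```

Seeking back to `cookie` re-reads the same text, which is all the value is documented to be good for.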

> And returning  is even worse.

Why?

That object can be used to index/slice/start finding at/etc.

I suggested earlier that it could also have attributes that give you the 
integer character, code unit (byte), and, if you really want it, code point 
index. If you have a use for one of those, you use the one you have a use for. 
If not, why do you need it to be equal to any of those three integers, much 
less the least useful of them?

If you’re just concerned about the REPL, then it can be , or even something eval-able like CharIndex(chars=5, units=6, 
bytes=10). Which isn’t as nice as a number I can just spot a few lines back and 
retype (as I mentioned before, this is occasionally an annoyance when dealing 
with Swift), but that’s a tradeoff that allows you to see the number 5 that 
you’re insisting you’d better be able to get even though you can’t actually use 
the number 5.

> Strings might not be implemented as an array of characters. They could 
> be a rope, a linked list, a piece table, a gap buffer, or something 
> else. The public API which operates on code points should not depend on 
> the implementation. Regardless of how your string is implemented, it is 
> conceptually a sequential array of N code points indexed from 0 to N-1.

If you want a public API that’s independent of implementation, where a string 
could be a linked list, then you want a public API that doesn’t include 
indexing. If your language comes with fundamental builtin types where the [] 
operator takes linear time, then your language doesn’t have a [] operator like 
Python’s, or C++’s or most other languages with the same syntax; it has 
something that looks misleadingly like [] in other languages but has to be used 
differently.

Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/MHOYCKINBLZKEITIAQVDP46U2RTWJ7US/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-27 Thread Chris Angelico

On Sun, Oct 27, 2019 at 11:43 PM Steven D'Aprano  wrote:
>
> On Sun, Oct 27, 2019 at 12:10:22AM -0700, Andrew Barnert via Python-ideas 
> wrote:
>
> > If you redesign your find, re.search, etc. APIs to not return
> > character indexes, then I think you can get away with not having
> > character-indexable strings.
>
> If string.index(c) doesn't return the index of c in string, then what
> does it return?
>
> I think you are conflating the public API based on characters (to be
> precise: code points) for some underlying implementation based on bytes.
> Given zero-based indexing, and the string:
>
> "abÇÐεф"
>
> the index of "ф" better damn well be 5 rather than 8 (UTF-8), 10
> (UTF-16) or 20 (UTF-32) or I'll be knocking on the API designer's door
> with a pitchfork and a flaming torch *wink*
>
> And returning  is even worse.
>

And in response to the notion that you don't actually need the index,
just a position marker... consider this:

  File "/home/rosuav/tmp/demo.py", line 1
print("Hello, world!')
 ^
SyntaxError: EOL while scanning string literal

Well, either that, or we need to make it so that " "* results in the correct number of spaces to
indent it to that position. That ought to bring in plenty of
pitchforks...

ChrisA
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/U2Z574GFGT6GIMS737WQO3QLFPEQZXIN/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-27 Thread Steven D'Aprano

On Sun, Oct 27, 2019 at 12:10:22AM -0700, Andrew Barnert via Python-ideas wrote:

> If you redesign your find, re.search, etc. APIs to not return 
> character indexes, then I think you can get away with not having 
> character-indexable strings.

If string.index(c) doesn't return the index of c in string, then what 
does it return?

I think you are conflating the public API based on characters (to be 
precise: code points) for some underlying implementation based on bytes. 
Given zero-based indexing, and the string:

"abÇÐεф"

the index of "ф" better damn well be 5 rather than 8 (UTF-8), 10 
(UTF-16) or 20 (UTF-32) or I'll be knocking on the API designer's door 
with a pitchfork and a flaming torch *wink*

And returning  is even worse.

Strings might not be implemented as an array of characters. They could 
be a rope, a linked list, a piece table, a gap buffer, or something 
else. The public API which operates on code points should not depend on 
the implementation. Regardless of how your string is implemented, it is 
conceptually a sequential array of N code points indexed from 0 to N-1.

-- 
Steven
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/YESU3G7CRBNTO43ULYCC652KTI4YVLBF/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-27 Thread Steven D'Aprano

On Sun, Oct 27, 2019 at 03:33:16PM +1100, Steven D'Aprano wrote:

> else:
> assert c <= '\U0001':

Oops, misplaced a zero there. That was supposed to be '\U0010FFFF'.


-- 
Steven
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/G5AFW2FOE2YC45I67OKWKOTYM6AAJC3M/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-27 Thread Random832

On Sun, Oct 27, 2019, at 03:39, Andrew Barnert via Python-ideas wrote:
> (Actually, IIRC, one of the two has a str type that, despite being 2.x, 
> is unicode rather than bytes, but with some extra undocumented 
> functionality to smuggle bytes around in a str and have it sometimes 
> work.)

I do like the way GNU Emacs represents strings - abstractly, a string can 
contain any character, or any byte > 127 distinct from a character. Concretely, 
IIRC they are represented either as pure byte strings or as UTF-8 with "bytes > 
127" represented as the extended UTF-8 representations of code points 0x3FFF80 
through 0x3FFFFF [values between 0x110000 and 0x3FFF7F are used for other 
purposes].
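
For comparison, Python has a loosely similar escape hatch: the `surrogateescape` error handler maps each undecodable byte to a lone surrogate in U+DC80 through U+DCFF so that it round-trips. In the sketch below, `emacs_raw_byte_char` is a hypothetical helper that only does the arithmetic implied by a raw-byte range of 0x3FFF80 through 0x3FFFFF; it is not taken from the Emacs sources.

```python
def emacs_raw_byte_char(byte: int) -> int:
    """Map a raw byte 0x80..0xFF onto 0x3FFF80..0x3FFFFF (arithmetic only)."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes above 127 get a raw-byte code")
    return 0x3FFF00 + byte

raw = b"caf\xe9 \xff"                      # Latin-1 bytes, not valid UTF-8
text = raw.decode("utf-8", "surrogateescape")
escaped = [hex(ord(ch)) for ch in text if "\udc80" <= ch <= "\udcff"]
round_trip = text.encode("utf-8", "surrogateescape")
```

Both schemes keep stray bytes distinct from any real character, so the original byte string can be recovered exactly.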
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GLM57Y6TC2HR4BEGXA6UPL44BULIIDTH/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-27 Thread Andrew Barnert via Python-ideas

On Oct 26, 2019, at 21:33, Steven D'Aprano  wrote:
> 
> IronPython and Jython use whatever .Net and Java use.

Which makes them sequences of UTF-16 code units, not code points. Which is 
allowed for the Python 2.x unicode type, but would violate the rules for 3.x 
str, but neither one has a 3.x. If you want to deal with code points, you have 
to handle surrogates manually.
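
What "handle surrogates manually" amounts to can be sketched in CPython 3 by encoding to UTF-16 and looking at the code units. The function names `utf16_units` and `code_points` are made up for the example.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

def code_points(units):
    """Combine surrogate pairs by hand to recover code points."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)  # BMP character, or an unpaired surrogate left as-is
            i += 1
    return out

units = utf16_units("a\U0001F600")
```

A string that is two code points long occupies three UTF-16 code units, so a length or index taken in code units does not match the code point count.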

(Actually, IIRC, one of the two has a str type that, despite being 2.x, is 
unicode rather than bytes, but with some extra undocumented functionality to 
smuggle bytes around in a str and have it sometimes work.)
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/E4VWU42A5RXJQRJSQMQDEN4W3D2FJNZS/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-27 Thread Andrew Barnert via Python-ideas

On Oct 26, 2019, at 19:59, Random832  wrote:
> 
> A string representation consisting of (say) a UTF-8 string, plus an 
> auxiliary list of byte indices of, say, 256-codepoint-long chunks [along with 
> perhaps a flag to say that the chunk is all-ASCII or not] would provide O(1) 
> random access, though, of course, despite both being O(1), "single index 
> access" vs "single index access then either another index access or up to 256 
> iterate-forward operations" aren't *really* the same speed.

Yes, but that means constructing a string takes linear time, because you have 
to construct that index. You can’t just take a read/recv/mmap/result of a C 
library/whatever and use it as a string without doing linear work on it first. 

And you have to do that on _every_ string, even though you only need the index 
on a small percentage of them. (Unless you can statically look ahead at the 
code and prove that a string will never be indexed—which a Haskell compiler can 
do, but I don’t think it’s remotely feasible for a language like Python.)

If you redesign your find, re.search, etc. APIs to not return character 
indexes, then I think you can get away with not having character-indexable 
strings. On the rare occasions where you need it, construct a tuple of chars. 
If that isn’t good enough, you can easily write a custom object that wraps a 
string and an index list together that acts like a string and a sequence of 
chars at the same time. There’s no need for the string type itself to do that.
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/KPT2TJWZ4W4JXRHAIHDV557CWS53LEPX/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread David Mertz

PEP 393

The Unicode string type is changed to support multiple internal
representations, depending on the character with the largest Unicode
ordinal (1, 2, or 4 bytes)

... Ah, OK. I get it. One byte representation is only ASCII, which happens
to match utf-8. Well, the latin-1 oddness. But the internal representation
is utf-16 or utf-32 if the string contains code points requiring multi-byte
representation.

On Sun, Oct 27, 2019, 12:19 AM Chris Angelico  wrote:

> On Sun, Oct 27, 2019 at 2:37 PM David Mertz  wrote:
> > What does actual CPython do currently to find that s[1_000_000],
> assuming utf-8 internal representation?
> >
>
> Mu.
>
> CPython does not have a UTF-8 internal representation.
>
> ChrisA
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/UD6M2WXPOCAIPXOGWMWLYEFA77OZPUHH/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread Steven D'Aprano

On Sat, Oct 26, 2019 at 11:34:34PM -0400, David Mertz wrote:

> What does actual CPython do currently to find that s[1_000_000], assuming
> utf-8 internal representation?

CPython doesn't use a UTF-8 internal representation.

MicroPython *may*, but I don't know if they do anything fancy to avoid 
O(N) indexing.

IronPython and Jython use whatever .Net and Java use.

CPython uses a custom implementation, the Flexible String 
Representation, which picks the smallest code unit size required to 
store all the characters in the string.


    # Pseudo-code
    c = max(string)  # Highest code-point
    if c <= '\xFF':
        # effectively ASCII or Latin-1
        use one byte per code point
    elif c <= '\uFFFF':
        # effectively UCS-2, or UTF-16 without the surrogate pairs
        use two bytes per code point
    else:
        assert c <= '\U0010FFFF'
        # effectively UCS-4, or UTF-32
        use four bytes per code point
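
The pseudo-code above can be turned into a small runnable sketch. The helper
name code_unit_width is made up for illustration; it only mirrors the
selection rule described here and is not CPython's actual C implementation.

```python
def code_unit_width(s: str) -> int:
    """Return the bytes per code point the rule above would pick (hypothetical helper)."""
    if not s:
        return 1  # assumption: treat the empty string as the narrowest kind
    c = max(s)  # highest code point in the string
    if c <= '\xFF':
        return 1  # ASCII or Latin-1
    elif c <= '\uFFFF':
        return 2  # UCS-2 range
    else:
        return 4  # UCS-4 / UTF-32 range


for text in ("spam", "caf\xe9", "\u20ac100", "\U0001F600"):
    print(repr(text), code_unit_width(text))
```

A single wide character is enough to widen the whole string, which is the
trade-off that buys constant-time indexing.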


-- 
Steven
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/5ALOHG346WTZ5OFIJPISTZCZR6KDPZQF/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread Chris Angelico

On Sun, Oct 27, 2019 at 2:37 PM David Mertz  wrote:
> What does actual CPython do currently to find that s[1_000_000], assuming 
> utf-8 internal representation?
>

Mu.

CPython does not have a UTF-8 internal representation.

ChrisA
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/JZF35M3NBU42EH5Y37AAN4BCXQCZ63B2/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread David Mertz

Ok, true enough that dereferencing and limited linear search is still O(1).
I could have phrased that slightly more precisely.

But the trade-off part is true. Indexing into character 1 million of a
utf-32 string is just one memory offset calculation, then following the
reference. Indexing into the utf-8-with-offset-list is a couple
dereferences, and on average 128 sequential scans. So it's not worse big-O,
but it's around 100x slower... Still a lot faster than sequential scan of
all 1 million though.

What does actual CPython do currently to find that s[1_000_000], assuming
utf-8 internal representation?

On Sat, Oct 26, 2019, 11:02 PM Random832  wrote:

> On Sat, Oct 26, 2019, at 20:26, David Mertz wrote:
> > Absolutely, utf-8 is a wonderful encoding. And indeed, worst case is
> > the same storage requirement as utf-16 or utf-32. For O(1) random
> > access into all strings, we have to eat 32-bits per character, one way
> > or the other, but of course there are space/speed trade-offs one could
> > make for intermediate behavior.
>
> A string representation consisting of (say) a UTF-8 string, plus an
> auxiliary list of byte indices of, say, 256-codepoint-long chunks [along
> with perhaps a flag to say that the chunk is all-ASCII or not] would
> provide O(1) random access, though, of course, despite both being O(1),
> "single index access" vs "single index access then either another index
> access or up to 256 iterate-forward operations" aren't *really* the same
> speed.
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ENN4Y3ZZOPG2NM5SEQOQMLQ2N7P6L3LI/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread Random832

On Sat, Oct 26, 2019, at 20:26, David Mertz wrote:
> Absolutely, utf-8 is a wonderful encoding. And indeed, worst case is 
> the same storage requirement as utf-16 or utf-32. For O(1) random 
> access into all strings, we have to eat 32-bits per character, one way 
> or the other, but of course there are space/speed trade-offs one could 
> make for intermediate behavior.

A string representation consisting of (say) a UTF-8 string, plus an auxiliary 
list of byte indices of, say, 256-codepoint-long chunks [along with perhaps a 
flag to say that the chunk is all-ASCII or not] would provide O(1) random 
access, though, of course, despite both being O(1), "single index access" vs 
"single index access then either another index access or up to 256 
iterate-forward operations" aren't *really* the same speed.
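
A rough sketch of that layout, under stated assumptions: the class name
ChunkedUtf8 and its attributes are invented for illustration, the optional
all-ASCII flag is left out, and negative indices are not handled. It is meant
only to show the "index lookup, then bounded forward scan" access pattern.

```python
class ChunkedUtf8:
    """UTF-8 bytes plus a byte-offset index every CHUNK code points (illustrative only)."""

    CHUNK = 256

    def __init__(self, text: str):
        self.data = text.encode("utf-8")
        self.length = len(text)
        self.offsets = []  # byte offset where each CHUNK-code-point chunk starts
        count = 0
        for pos, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
                if count % self.CHUNK == 0:
                    self.offsets.append(pos)
                count += 1

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < self.length:
            raise IndexError("index out of range")
        pos = self.offsets[index // self.CHUNK]  # one index lookup
        for _ in range(index % self.CHUNK):      # then at most CHUNK - 1 forward steps
            pos += self._width(self.data[pos])
        return self.data[pos:pos + self._width(self.data[pos])].decode("utf-8")

    @staticmethod
    def _width(lead: int) -> int:
        """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4


s = ChunkedUtf8("a\xe9\u20ac\U00010000" * 1000)
print(s[2], len(s))
```

Note that the constructor walks every byte to build the offset list, so
construction is linear in the size of the string.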
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/N4ONH5O443FWB7M7E2FF24QR32HXAPAD/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread Random832

On Wed, Oct 23, 2019, at 19:00, Christopher Barker wrote:
> On Sun, Oct 13, 2019 at 12:52 PM Andrew Barnert via Python-ideas 
>  wrote:
> > The main problem is that a str is a sequence of single-character str, each 
> > of which is a one-element sequence of itself, etc. forever. If you wanted 
> > to change this, I think it would make more sense to go the opposite way: 
> > leave str a sequence, but make it a sequence of char objects. (And 
> > likewise, bytes and bytearray could be sequences of byte objects—or just go 
> > all the way to making them sequences of ints.) And then maybe add a c 
> > prefix for defining char constants, and you’ve solved all the problems 
> > without having to add new confusing methods or properties.
> 
> I've thought for a long time that this would be a "good thing". The 
> "string or sequence of strings" issue is pretty much the only 
> hidden-bug-triggering type error I've gotten since "true division".
> 
> The only way we really live with it fairly easily is that strings are 
> pretty much never duck typed -- so I can check if I got a string, and 
> then I know I didn't get a sequence of strings. But I've always 
> wondered how disruptive it would be to add a char type -- it doesn't 
> seem like it would be very disruptive, but I have not thought it 
> through at all. And I'm not sure how much string functionality a char 
> should have -- probably next to none, as the point is that it would be 
> easy to distinguish from a string that happened to have one character.

There's lots of functionality that's on str that if I were designing the 
language I'd put on character.

character type functions are definitely in - and, frankly, str.isnumeric is an 
attractive nuisance, it may well make sense to remove it from str and require 
explicit use of all().
upper/lower is tricky - cases like ß can change the length of a string... maybe 
put it on char but return a string?
No reason not to allow + or * to concatenate chars to each other or to strings, 
multiply a char to a string
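
A short illustration of the two points above, using only today's str methods
(there is no char type to test against, so this is just a sketch of the
behaviour being discussed):

```python
s = "stra\xdfe"          # "straße"
upper = s.upper()         # 'ß' upper-cases to 'SS' in Python's str.upper()
print(upper, len(s), len(upper))

# Explicit per-character check with all(), rather than calling the method on the whole str
digits = "12\u00bd"       # '½' is numeric but not a decimal digit
print(all(c.isnumeric() for c in digits), all(c.isdecimal() for c in digits))
```

The length change on upper-casing is why a hypothetical char.upper() would
probably have to return a string rather than a char.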
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/OD5B2OOVMZ46RURZCMYFHQ7GSUPXVS5F/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread Andrew Barnert via Python-ideas

On Oct 26, 2019, at 16:28, Steven D'Aprano  wrote:
> 
>> On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas 
>> wrote:
>> On Oct 13, 2019, at 12:02, Steve Jorgensen  wrote:
> [...]
>>> This proposal is a serious breakage of backward compatibility, so 
>>> would be something for Python 4.x, not 3.x.
>> 
>> I’m pretty sure almost nobody wants a 3.0-like break again, so this 
>> will probably never happen.
> 
> Indeed, and Guido did rule some time ago that 4.0 would be an ordinary 
> transition, like 3.7 to 3.8, not a big backwards breaking version 
> change.

That _could_ change, especially if 3.9 is followed by 3.10 (or has that already 
been rejected?). But I think almost everyone agrees with Guido, and that’ll 
probably be true until the memory of 2.7 fades (a few years after Apple stops 
shipping it and the last Linux distros go out of LTS). I guess your 5000 
implies about 16 years off, so… ok. But at that point, it makes as much sense 
to talk about a hypothetical new Python-like language.

>> And finally, if you want to break strings, it’s probably worth at 
>> least considering making UTF-8 strings first-class objects. They can’t 
>> be randomly accessed, 
> 
> I don't see why you can't make arrays of UTF-8 indexable and provide 
> random access to any code point. I understand that ``str`` in 
> Micropython is implemented that way.

Most of the time, you really don’t need random access to strings—except in the 
case where you got that integer index back from the find method or a regex 
match object or something, in which case using Swift-style non-integer indexes, 
or Rust-style (and Python file object seek/tell) byte offsets, solves the 
problem just as well.
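
As a sketch of the byte-offset style, here is what it looks like with plain
bytes in today's Python (standing in for a hypothetical UTF-8 string type):
find returns a byte offset, and that offset is only ever handed back to the
same buffer.

```python
data = "na\xefve caf\xe9 au lait".encode("utf-8")   # UTF-8 bytes

needle = "caf\xe9".encode("utf-8")
start = data.find(needle)        # a byte offset, not a code point index
end = start + len(needle)

# Slicing with the offsets needs no counting of code points.
print(start, data[start:end].decode("utf-8"))
```

The byte offset (7) differs from the code point index (6) because 'ï' takes
two bytes, but the caller never needs to know that.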

But when you do want it, it’s very likely you don’t want it to take linear 
time. Providing indexing, but having it be unacceptably slow for anything but 
small strings, isn’t providing a useful feature, it’s providing a cruel tease. 
Logarithmic time is probably acceptable, but building that index takes linear 
time, so now constructing strings becomes slow, which is even worse (especially 
since it affects even strings you were never going to randomly access).

> But why would you want an explicit UTF-8 string object? What benefit 
> do you get from exposing the fact that the implementation happens to be 
> UTF-8 rather than something else? (Not rhetorical questions.)

For novices who only deal with UTF-8, it might mean never having to call encode 
or decode again. But the real benefit is to enable low-level code (that in turn 
makes high-level code easier to write).

Have you ever written code that mmaps a text file and processes it as text? You 
either have to treat it as bytes and not do proper Unicode (which works for 
some string operations—until the first time you get some data where it 
doesn’t), or implement all the Unicode algorithms yourself (especially fun if 
what you’re trying to do is, say, a regex search), or put a buffer in front of 
it and decode on the fly, defeating the whole point of mmap.
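
A minimal sketch of that dilemma, assuming a small temporary file stands in
for the real data. Searching the mapping directly works only with bytes
patterns, and decoding first copies the whole mapping:

```python
import mmap
import re
import tempfile

with tempfile.TemporaryFile() as f:
    f.write("caf\xe9 cr\xe8me\n".encode("utf-8"))
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # A bytes pattern works on the mapping directly, but "." matches one byte,
        # so it stops in the middle of the two-byte sequence for 'é'.
        as_bytes = re.search(rb"caf.", mm).group()
        # Decoding first gives proper text, at the cost of copying the whole mapping.
        as_text = re.search(r"caf.", mm[:].decode("utf-8")).group()

print(as_bytes, as_text)
```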

Have you ever read an HTTP header as bytes to verify that it’s UTF-8 and then 
tried to switch to using the same socket connection as a text file object 
rather than binary? It’s doable, but it’s a pain.

And the reason all of this is a pain is that when Python (and Java and Ruby and 
so on) added Unicode support, the idea of assuming most files and protocols and 
streams are UTF-8 was ridiculous. Making UTF-8 a little easier to deal with by 
making everything else either slower or harder to deal with was a terrible 
trade off then. But in 2019—much less in Python 5000-land—that’s no longer true.

> If the UTF-8 object operates on the basis of Unicode code points, then 
> it's just a str, and the implementation is just an implementation detail.

Ideally, it can iterate any of code units (bytes), code points, or grapheme 
clusters, not just one. Because they’re all useful at different times. But most 
of the string methods would be in terms of grapheme clusters.

> If the UTF-8 object operates on the basis of raw bytes, with no 
> protection against malformed UTF-8 (e.g. allowing you to insert bytes 
> 0x80-0xFF which are never valid in UTF-8, or by splitting apart a two- 
> or three-byte UTF-8 sequence) then it's just a bytes object (or 
> bytearray) initialised with a UTF-8 sequence.

What’s this about inserting bytes? I’m not suggesting making strings mutable; 
that’s insane even for 5.0. :)

Anyway, it’s just a bytes object with all of the string methods, and that duck 
types as a string for all third-party string functions and so on, which is a 
lot different than “just a bytes object”.

But a much better way to see it is that it’s a str object that also offers 
direct access to its UTF-8 bytes. Which you don’t usually need, but it is 
sometimes useful. And it would be more useful if things like sockets and pipes 
and so on had UTF-8 modes where they could just send UTF-8 strings, without you 
having to

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread David Mertz

Absolutely, utf-8 is a wonderful encoding. And indeed, worst case is the
same storage requirement as utf-16 or utf-32. For O(1) random access into
all strings, we have to eat 32-bits per character, one way or the other,
but of course there are space/speed trade-offs one could make for
intermediate behavior.

On Sat, Oct 26, 2019, 7:58 PM Steven D'Aprano  wrote:

> On Sat, Oct 26, 2019 at 07:38:19PM -0400, David Mertz wrote:
> > On Sat, Oct 26, 2019, 7:29 PM Steven D'Aprano
> >
> >
> > > (At worst, a code-point in UTF-8 takes three bytes, compared to four in
> > > UTF-16 or UTF-32.)
> > >
> >
> > http://www.fileformat.info/info/unicode/char/1/index.htm
>
> Oops, you're right, UTF-8 can use four code units (four bytes) too, I
> forgot about that. Thanks for the correction.
>
> So in the worst case, if your string consists of all (let's say)
> Linear-B syllables, UTF-8 will use four bytes per character, the same as
> UTF-32. But for strings consisting of a mix of (say) ASCII, Latin-1, etc
> with only a few Linear-B syllables, UTF-8 will use a lot less memory.
>
>
>
> --
> Steven
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/RMH7GU5JHZ7EW2E4DAFHITHQRYF6PJG4/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread Steven D'Aprano

On Sat, Oct 26, 2019 at 07:38:19PM -0400, David Mertz wrote:
> On Sat, Oct 26, 2019, 7:29 PM Steven D'Aprano
> 
> 
> > (At worst, a code-point in UTF-8 takes three bytes, compared to four in
> > UTF-16 or UTF-32.)
> >
> 
> http://www.fileformat.info/info/unicode/char/1/index.htm

Oops, you're right, UTF-8 can use four code units (four bytes) too, I 
forgot about that. Thanks for the correction.

So in the worst case, if your string consists of all (let's say) 
Linear-B syllables, UTF-8 will use four bytes per character, the same as 
UTF-32. But for strings consisting of a mix of (say) ASCII, Latin-1, etc 
with only a few Linear-B syllables, UTF-8 will use a lot less memory.
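
The sizes are easy to check with str.encode. The helper name encoded_sizes
is made up for this sketch; the "-le" codecs are used so that no byte order
mark is counted.

```python
def encoded_sizes(ch: str) -> list:
    """Bytes needed for ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return [len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")]


samples = {
    "ASCII 'a'": "a",
    "Latin-1 e-acute": "\xe9",
    "Euro sign": "\u20ac",
    "Linear B syllable B008 A": "\U00010000",
}
for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```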

-- 
Steven
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/DNFYA7Z3IGDWYLNMKL7ITZ3AON6JJVKO/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread David Mertz

On Sat, Oct 26, 2019, 7:29 PM Steven D'Aprano


> (At worst, a code-point in UTF-8 takes three bytes, compared to four in
> UTF-16 or UTF-32.)
>

http://www.fileformat.info/info/unicode/char/1/index.htm

>
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ZL24UAC4AWMBHY7L7Y72QVBWXDR5XEXP/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread Steven D'Aprano

On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas wrote:
> On Oct 13, 2019, at 12:02, Steve Jorgensen  wrote:
[...]
> > This proposal is a serious breakage of backward compatibility, so 
> > would be something for Python 4.x, not 3.x.
> 
> I’m pretty sure almost nobody wants a 3.0-like break again, so this 
> will probably never happen.

Indeed, and Guido did rule some time ago that 4.0 would be an ordinary 
transition, like 3.7 to 3.8, not a big backwards breaking version 
change.

I've taken up referring to some hypothetical future 3.0-like version as 
Python 5000 (not 4000) in analogy to Python 3000, but to emphasise just 
how far away it will be.

> And finally, if you want to break strings, it’s probably worth at 
> least considering making UTF-8 strings first-class objects. They can’t 
> be randomly accessed, 

I don't see why you can't make arrays of UTF-8 indexable and provide 
random access to any code point. I understand that ``str`` in 
Micropython is implemented that way.

The obvious implementation means that you lose O(1) indexing (to reach 
the N-th code point, you have to count from the beginning each time) but 
save memory over other encodings. (At worst, a code-point in UTF-8 takes 
three bytes, compared to four in UTF-16 or UTF-32.) There are ways to 
get back O(1) indexing, but they cost more memory.

But why would you want an explicit UTF-8 string object? What benefit 
do you get from exposing the fact that the implementation happens to be 
UTF-8 rather than something else? (Not rhetorical questions.)

If the UTF-8 object operates on the basis of Unicode code points, then 
it's just a str, and the implementation is just an implementation detail.

If the UTF-8 object operates on the basis of raw bytes, with no 
protection against malformed UTF-8 (e.g. allowing you to insert bytes 
0x80-0xFF which are never valid in UTF-8, or by splitting apart a two- 
or three-byte UTF-8 sequence) then it's just a bytes object (or 
bytearray) initialised with a UTF-8 sequence.

That is, as I understand it, what languages like Go do. To paraphrase, 
they offer data types they *call* UTF-8 strings, except that they can 
contain arbitrary bytes and be invalid UTF-8. We can already do this, 
today, without the deeply misleading name:

string.encode('utf-8')

and then work with the bytes. I think this is even quite efficient in 
CPython's "Flexible string representation". For ASCII-only strings, the 
UTF-8 encoding uses the same storage as the original ASCII bytes. For 
others, the UTF-8 representation is cached for later use.

So I don't see any advantage to this UTF-8 object. If the API works on
code points, then it's just an implementation detail of str; if the API 
works on code units, that's just a fancy name for bytes. We already have 
both str and bytes so what is the purpose of this utf8 object?

-- 
Steven
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/RKY73YB2UVJMZ2PNIYJ74AFVKUAIK45K/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-26 Thread Steven D'Aprano

On Fri, Oct 25, 2019 at 08:44:17PM -0700, Ben Rudiak-Gould wrote:

> Nothing good can come of decomposing strings into Unicode code points.

Sure there is. In Python, it's the fastest way to calculate the digit
sum of an integer. It's also useful for implementing classical
encryption algorithms, like Playfair.
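
For instance, the digit sum can be written by iterating over the characters
of the decimal representation. The function name is just for illustration,
and no timing is measured here:

```python
def digit_sum(n: int) -> int:
    """Sum the decimal digits of n by iterating over the characters of str(n)."""
    return sum(int(c) for c in str(abs(n)))


print(digit_sum(2019))
```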

Introspection, e.g. if I want to know if a string contains any
surrogates, I can do this:

any('\uD800' <= c <= '\uDFFF' for c in s)

Or perhaps I want to know if the string contains any "astral
characters", in which case they aren't safe to pass to a Javascript or
Tcl script which doesn't handle them correctly:

any(c > '\uFFFF' for c in s)

How about education? One of the things I can do with strings is:

import unicodedata
for c in string:
    print(unicodedata.name(c))

or possibly even just

# what is that weird symbol in position five?
print(unicodedata.name(string[5]))

to find out what that weird character is called, so I can look it up and
find out what it means. Knowing stuff is good, right?

Or do you think the world would be better off if it was really hard
and "ugly" (your word) for people like me to find out what code points
are called and what their meaning is?

Rather than just telling us that we shouldn't be allowed to access code
points in strings, would you please be explicit about *why* this access
is a bad thing?

And if code points are "bad", then what should we be allowed to do with
strings? If code points are too low level, then what is an appropriate
level?

I guess you're probably going to mention grapheme clusters. (If you
aren't, then I have no idea what your objection is based on.)

Grapheme clusters are a hard problem to solve, since they are dependent
on the language and the locale. There's a Unicode algorithm for
splitting on graphemes, but it ignores the locale differences.

Processing on graphemes is more expensive than on code points. There is,
as far as I can tell, no O(1) access to graphemes in a string without
pre-processing them and keeping a list of their indices.

For many people, and for many purposes, paying that extra cost in either
time or memory is just a total waste, since they're hardly ever going to
come across a grapheme cluster. Few people have to process completely
arbitrary strings: their data tends to come from a particular subset of
natural language strings, and for some such languages, you might go a
whole lifetime without coming across a grapheme cluster of more than one
code point.

(This may be slowly changing, even for American English, driven in part
by the use of emoji and variation selectors.)
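
A small sketch of what "more than one code point" looks like in practice,
using only the standard unicodedata module (which, to be clear, does not
provide grapheme segmentation):

```python
import unicodedata

decomposed = "e\u0301"                                # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)   # single code point U+00E9
flag = "\U0001F1E6\U0001F1FA"                         # two regional indicators, shown as one flag

print(len(decomposed), len(composed), len(flag))
print([unicodedata.name(c) for c in flag])
```

Normalisation helps in the first case, but there is no precomposed form for
the flag, so len() reports two code points for what a reader sees as a
single character.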

If Python came with a grapheme processing API, I would probably use it.
But in the meantime, the code point API is "good enough" for most things
I do with strings. And for the rest, graphemes are too low-level: I need
things like sentences, clauses, words, word stems, prefixes and
suffixes, syllables etc.

But even if Python had an excellent, fast grapheme API, I would still
want a nice, clean, fast interface that operates on code-points.

--
Steven
___
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/OCG64OW4WPVDFUSN3R7AGI6M4NFKGJIP/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-25 Thread Ben Rudiak-Gould

Since this is Python 4000, where everything's made up and the points
don't matter...

I think there shouldn't be a char type, and also strings shouldn't be
iterable, or indexable by integers, or anything else that makes them
appear to be tuples of code points.

Nothing good can come of decomposing strings into Unicode code points.
The code point abstraction is practically as low level as the internal
byte encoding of the strings. Only lexing libraries should look at
strings at that level, and you should use a well written and tested
lexing library, not a hacky hand-coded lexer.

Someone in this thread mentioned that they'd used ' '.join on a string
in production code. Was the intent really to put a space between every
pair of code points of an arbitrary string? Or did they know that only
certain code points would appear in that string? A convenient way of
splitting strings into more meaningful character units would make the
actual intent clear in the code, and it would allow for runtime
testing of the programmer's assumptions.
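
For reference, this is the current behaviour being discussed. The flatten
helper is a made-up example of code that has to special-case str because
every element of a str is itself a str:

```python
s = "abc"
print(" ".join(s))                     # a space between every pair of code points
print(s[0] == s[0][0] == s[0][0][0])   # each element is itself a one-element str


def flatten(items):
    """Naive flatten (illustrative): a str has to be special-cased or it recurses forever."""
    for item in items:
        if isinstance(item, str):
            yield item
        else:
            try:
                yield from flatten(item)
            except TypeError:          # item is not iterable
                yield item


print(list(flatten(["ab", ["cd", [1, 2]]])))
```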

Explicit access to code points should be ugly – s.__codepoints__,
maybe. And that should be a sequence of integers, not strings like
"́".

>it’s probably worth at least considering making UTF-8 strings first-class
>objects. They can’t be randomly accessed,

They can be randomly accessed by abstract indices: objects that look
similar to ints from C code, but that have no extractable integer
value in Python code, so that they're independent of the underlying
string representation.

They can't be randomly accessed by code point index, but there's no
reason you should ever want to randomly access a string by a code
point index. It's a completely meaningless operation.
___
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/MDX4LXOWJQ2DXPIG27DJ3TVETSUSMSVW/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-25 Thread Andrew Barnert via Python-ideas

On Oct 25, 2019, at 06:26, Serhiy Storchaka  wrote:
> 
> 25.10.19 15:53, Andrew Barnert via Python-ideas wrote:
>> If you were designing a new Python-like language today, or if you had a time 
>> machine back to the 90s, it would be a different story.
> 
> Interesting, how far into the past would you need to travel? Initially builtin types 
> did not have methods or properties, and the iterable protocol did not exist.

Well, the str methods are largely carried over from the functions in the string 
module, which was there before 1.0. And I think the ord builtin goes back 
pretty far as well. So ideally, back to the start.

On the other hand, it’s not like there was a huge ecosystem of third-party 
modules using Python 0.9 that Guido couldn’t afford to break, so if your time 
machine couldn’t go back quite that far, it might be ok to do it as late as, 
say, 2.2.

> Adding this will require too much work, and I am not sure Guido would like 
> how much complexity it adds to his simple language.

Adding it today would certainly require too much work—not so much for Python 
itself as for thousands of third-party libraries and even more applications, 
but that’s even worse. Even if it weren’t just to fix a small wart that people 
have been dealing with for decades, it would be too much. That’s why I’m -1 on 
it. But adding it in early Python would have been very little work. Just add 
one more builtin type, and change a dozen or so builtin and string module 
functions, and you’re done. And it would be very little complexity, too. Yes, 
it would mean one more built in type, but nothing else is complex about it. The 
Smalltalk guys who were advertising that their whole language fits on an index 
card could handle chars.

And it’s not like it would be an unprecedented innovation—most languages that 
existed at the time, from Lisp to Perl, had strings as sequences of either 
chars or integers (or, as with C, of chars but char is just a type of integer). 
If anything, it was ABC that was innovative for making strings out of strings 
(although Tcl and BASIC also do); it just turned out to collide with other 
innovations that Python got over the next couple of years, in ways nobody could 
have imagined.

And the hard bit would be describing what the Python 2.x language and ecosystem 
would look like to Guido so you could explain why it would matter. The idea 
that his language would have an “iterable protocol” that could be implemented 
by not just lists and strings but also user-defined types using special 
methods, and automatically by functions using a special coroutine semantics, 
and also by some syntax borrowed from Haskell, and that was checkable via an 
abstract base class using implicit structural testing…
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/BDJZAZ6U4JWEFOACLVANUMDO6TFWJJKS/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-25 Thread Serhiy Storchaka


25.10.19 15:53, Andrew Barnert via Python-ideas wrote:

If you were designing a new Python-like language today, or if you had a time 
machine back to the 90s, it would be a different story.


Interesting, how far into the past would you need to travel? Initially builtin 
types did not have methods or properties, and the iterable protocol did 
not exist. Adding this will require too much work, and I am not sure 
Guido would like how much complexity it adds to his simple language.

___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/DKJXU7CIS2AQZNHDGBBYSZMRLIQHYXZC/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-25 Thread Andrew Barnert via Python-ideas

On Oct 25, 2019, at 01:34, Paul Moore  wrote:
> 
> On Thu, 24 Oct 2019 at 23:47, Andrew Barnert via Python-ideas
>  wrote:
>> But again, I don’t think either of these is the reason Python strings being 
>> iterable is a problem; I think it really is primarily about them being 
>> iterables of strings.
> 
> The *real* problem is that there's a whole load of functions that
> would need rewriting to accept "character or string" arguments - or a
> whole load of debating over whether they should only use one or
> another.

That’s not the problem that this thread is trying to solve, it’s the problem 
with the solution to that problem. :)

I’ve already said that I don’t think this is feasible, because it would be too 
much work and too disruptive to compatibility. If you were designing a new 
Python-like language today, or if you had a time machine back to the 90s, it 
would be a different story. But for Python 4.0—even if we wanted a 3.0-like 
break, which I don’t think anyone does—we can’t break all of the millions of 
lines of code dealing with strings in a way that forces people to not just fix 
that code but rethink the design.

> And no cheating by saying these are cases where you should use
> 1-character strings. The fact that people don't typically distinguish
> between characters and 1-character strings in real life is precisely
> why it's useful that Python currently doesn't make a distinction
> either.

Many of your examples are not cases where people should use 1-character 
strings; they’re cases where we need polymorphic APIs.

But that’s not a problem. Countless other languages already do exactly that, 
and people use them every day. (The languages that are too weak to do that kind 
of polymorphism, like C and PHP, instead have to offer separate functions like 
strstr vs. strchr, which is manifestly less friendly. Fortunately, the fact that 
Python’s core API was originally loosely based on C’s string.h wouldn’t in any 
way force the same problem on Python or a Python-like language.)

And, while there are plenty of functions that would need to treat char and 
1-char str the same, there are also many—not just iter—that should only work 
for one or the other, such as ord, or re.search. And there are also functions 
that should work for both, but not do exactly the same thing—char replacement 
is the same thing as translate; substring replacement is not.
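
The last point can be illustrated with today's str API. This is only a small sketch with made-up sample strings: replacing single characters can be expressed as a translation table, while substring replacement cannot.

```python
# Character-for-character replacement: str.replace and str.translate agree.
text = "banana"
table = str.maketrans({"a": "o"})
char_replaced = text.replace("a", "o")
translated = text.translate(table)

# Substring replacement has no translate equivalent: a translation table
# maps single code points, so it cannot match the two-character run "an".
substring_replaced = text.replace("an", "AN")
```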

In fact, notice what happens if you call ord on a 2-char string today: 
TypeError: ord() expected a character, but string of length 2 found. Python is 
_pretending_ that it has a character type even though it doesn’t.
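
A quick way to see this in an interpreter (a sketch; the exact wording of the error message may differ between Python versions):

```python
# ord() accepts only a length-1 str, as if a separate char type existed.
code = ord("a")  # the code point of "a"

try:
    ord("ab")  # a 2-character str is rejected
except TypeError as exc:
    message = str(exc)  # explains that a single character was expected
else:
    message = ""
```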

And you find the same thing in third-party code: some of the functions people 
have written would need to handle str and char polymorphically, some would need 
to handle both but with different logic, and many would need to handle just one 
or the other. Which is exactly why it couldn’t be fixed with a 3to4 script, and 
people would instead need to rethink the design of a bunch of their existing 
functions before they could upgrade.

> In case it's not glaringly obvious, I'm -1 on this, even as an
> exercise in speculation :-)

I’m -1 on this, but I think speculating on how it could be solved is the best 
way to show that it can’t and shouldn’t be solved.
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/IIRARPLSPKPG6CDPXHYFVNUT27JDET5N/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-25 Thread Paul Moore

On Thu, 24 Oct 2019 at 23:47, Andrew Barnert via Python-ideas
 wrote:
> But again, I don’t think either of these is the reason Python strings being 
> iterable is a problem; I think it really is primarily about them being 
> iterables of strings.

The *real* problem is that there's a whole load of functions that
would need rewriting to accept "character or string" arguments - or a
whole load of debating over whether they should only use one or
another. Examples:

"a" in char_string, vs "word" in sentence. Both useful and used in real code.
",".join(list_of_stuff) vs ", ".join(list_of_stuff)
"".join(list_of_characters)

And return values: string.partition(sep) - if sep is a character,
should the middle return value be a character too? Remember that the
typing module needs to be able to express the type signatures of all
these functions, in a way that usefully allows checking usage.

Plus many, many more. And not just in the stdlib, but in millions of
lines of 3rd party code.

And no cheating by saying these are cases where you should use
1-character strings. The fact that people don't typically distinguish
between characters and 1-character strings in real life is precisely
why it's useful that Python currently doesn't make a distinction
either.

In case it's not glaringly obvious, I'm -1 on this, even as an
exercise in speculation :-)

Paul
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/JCJPLI5UCPDUQCRNTMFJMII4E6EE27ZQ/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-24 Thread Andrew Barnert via Python-ideas

On Oct 24, 2019, at 14:13, Greg Ewing  wrote:
> 
> I'm thinking of things like a function to recursively flatten
> a nested list. You probably want it to stop when it gets to a
> string, and not flatten the string into a list of characters.

A function to recursively flatten a nested list should only work on lists; it 
should stop on a string, but it should also stop on a namedtuple or a 2x2 
ndarray or a dict. A function to recursively flatten arbitrary iterables, on 
the other hand…
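
The two kinds of flatten can be sketched as follows. The function names are illustrative, not from any library. Note that the second one needs the explicit str/bytes special case: without it, a 1-character string yields itself forever and the recursion never bottoms out.

```python
from collections.abc import Iterable

def flatten_lists(obj):
    """Flatten nested lists only; anything else is treated as a leaf."""
    if isinstance(obj, list):
        for item in obj:
            yield from flatten_lists(item)
    else:
        yield obj

def flatten_iterables(obj):
    """Flatten arbitrary iterables, special-casing str and bytes as leaves."""
    if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
        for item in obj:
            yield from flatten_iterables(item)
    else:
        yield obj

nested = [1, ["ab", (2, 3)], {"k": 4}]
only_lists = list(flatten_lists(nested))          # tuple and dict survive intact
all_iterables = list(flatten_iterables(nested))   # dict contributes its keys
```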

And I don’t think there’s any conceptual problem with strings being iterable. A 
C++ string is a sequence of chars. A Haskell string is a plain old (lazy 
linked) list of chars. And similarly in lots of other languages. And it’s 
rarely a problem.

There are other differences that might be relevant here; I don’t think they 
are, but to be fair:

C++ and Haskell implementations are expected to optimize everything well enough 
that you can just use any arbitrary sequence of chars as a string with reasonable 
efficiency, so strings being a thin convenience wrapper above that makes 
intuitive sense. In Python, that isn’t true; a function that loops character by 
character would often be too slow to use.

C++ and Haskell type systems make it a little easier to say “Here’s a function 
that works on generic iterables of T, but when T is char here’s a more specific 
function”.

But again, I don’t think either of these is the reason Python strings being 
iterable is a problem; I think it really is primarily about them being 
iterables of strings.
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/4YF3JNZ6KYZLOCOLFHFLDSGRZL66O4SP/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-24 Thread Greg Ewing


Christopher Barker wrote:
> wouldn't it? once you got to an object that couldn't be iterated, you'd 
> know you had an atomic value.


I'm thinking of things like a function to recursively flatten
a nested list. You probably want it to stop when it gets to a
string, and not flatten the string into a list of characters.

--
Greg
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/SERL4HR6WC56ZXTNPWBYRGDQLQ7324L2/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-24 Thread Christopher Barker

On Thu, Oct 24, 2019 at 1:13 AM Greg Ewing 
wrote:

> Christopher Barker wrote:
> > I've always wondered
> > how disruptive it would be to add a char type
>
> I'm not sure if it would help much. Usually the problem with
> strings being sequences of strings lies in the fact that they're
> sequences at all. Code that operates generically on nested sequences
> often has to special-case strings so that it can treat them as
> atomic values. Having them be sequences of something else
> wouldn't change that.
>

wouldn't it? once you got to an object that couldn't be iterated, you'd
know you had an atomic value. And this is why I was thinking that if chars had
less functionality, it would work. This is really common code for me that I
need to type check:

for filename in sequence_of_filenames:
    open(filename)

if a char could not be used as a filename, then I'd get a similar error if
a single string was passed in as if a list of numbers was passed in, say.

That is, a string is a sequence of chars, not a sequence of strings. and a
char can not be used as a string in many contexts.
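
In current Python the failure mode looks roughly like this. It is a sketch with hypothetical helper names, and nothing is actually opened: a single filename passed by mistake is silently treated as a sequence of one-character "filenames", so the only defence is an explicit type check.

```python
def count_names(filenames):
    """Hypothetical helper: how many items the for-loop above would visit."""
    return sum(1 for _ in filenames)

def as_filename_list(filenames):
    """Hypothetical guard: reject a lone str before iterating over it."""
    if isinstance(filenames, str):
        raise TypeError("expected an iterable of filenames, not a single str")
    return list(filenames)

visited_for_list = count_names(["data.txt"])  # one filename
visited_for_str = count_names("data.txt")     # one "filename" per character

try:
    as_filename_list("data.txt")
except TypeError:
    guard_triggered = True
else:
    guard_triggered = False
```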

> If I were to advocate changing anything in that area, it would
> be to make strings not be sequences. They would support slicing,
> but not indexing single characters, and would not be directly
> iterable.

you are right -- that would be a great solution to the above problem. And I
can't think of many real uses for iterating strings where you don't know
that you want the chars, so .chars() iterator, or maybe str.iter_chars()
would be fine.

Something tells me that I've had uses for char in other contexts, but I
can't think of them now, so maybe not :-)

But again -- too disruptive, we've lived with this for a LONG time.

-CHB

> If you really wanted to iterate over characters, there
> could be a method such as s.chars() giving a sequence view.
> But that would be a disruptive enough change for so little
> benefit that I don't expect it to ever happen.
>
> --
> Greg

-- 
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/DG57PO6KFDSKZ4ZQMK7R7VNYHWYRQELN/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-24 Thread Greg Ewing


Christopher Barker wrote:
> I've always wondered 
> how disruptive it would be to add a char type


I'm not sure if it would help much. Usually the problem with
strings being sequences of strings lies in the fact that they're
sequences at all. Code that operates generically on nested sequences
often has to special-case strings so that it can treat them as
atomic values. Having them be sequences of something else
wouldn't change that.

If I were to advocate changing anything in that area, it would
be to make strings not be sequences. They would support slicing,
but not indexing single characters, and would not be directly
iterable. If you really wanted to iterate over characters, there
could be a method such as s.chars() giving a sequence view.
But that would be a disruptive enough change for so little
benefit that I don't expect it to ever happen.
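
As a rough sketch of what such a type could look like: the class name Text and its chars() method below are hypothetical, and chars() returns a plain tuple rather than a true view. A wrapper can allow slicing while refusing both iteration and single-index access.

```python
class Text:
    """Hypothetical string wrapper that is sliceable but not a sequence."""

    # Setting __iter__ to None makes iter(Text(...)) raise TypeError
    # instead of falling back to __getitem__.
    __iter__ = None

    def __init__(self, value):
        self._value = value

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("Text supports slicing only; use .chars()")
        return Text(self._value[index])

    def __str__(self):
        return self._value

    def chars(self):
        """Return the characters as a sequence (a tuple copy in this sketch)."""
        return tuple(self._value)

t = Text("hello")
middle = str(t[1:3])
letters = t.chars()
```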

--
Greg
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/LUR6URJANVGIVJCWQMIUEM7XASTLV47B/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-24 Thread Anders Hovmöller



> On 24 Oct 2019, at 01:02, Christopher Barker  wrote:
> 
> 
>> On Sun, Oct 13, 2019 at 12:52 PM Andrew Barnert via Python-ideas 
>>  wrote:
> 
>> The main problem is that a str is a sequence of single-character str, each 
>> of which is a one-element sequence of itself, etc. forever. If you wanted to 
>> change this, I think it would make more sense to go the opposite way: leave 
>> str a sequence, but make it a sequence of char objects. (And likewise, bytes 
>> and bytearray could be sequences of byte objects—or just go all the way to 
>> making them sequences of ints.) And then maybe add a c prefix for defining 
>> char constants, and you’ve solved all the problems without having to add new 
>> confusing methods or properties.
> 
> I've thought for a long time that this would be a "good thing". the "string 
> or sequence of strings" issues is pretty much the only hidden-bug-triggering 
> type error I've gotten since "true division".
> 
> The only way we really live with it fairly easily is that strings are pretty 
> much never duck typed -- so I can check if I got a string, and then I know I 
> didn't get a sequence of strings. But I've always wondered how disruptive it 
> would be to add a char type -- it doesn't seem like it would be very 
> disruptive, but I have not thought it through at all. And I'm not sure how 
> much string functionality a char should have -- probably next to none, as the 
> point is that it would be easy to distinguish from a string that happened to 
> have one character.
> 
> By the way, the bytes and bytearray types already does this -- index into or 
> loop through a bytes object, you get an int.

I would think it's fine if we deprecate the iter on str and supply a chars() 
method. Personally I think that can yield str and not int. There could be a 
codes() or char_codes() method for that. 

/ Anders
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/IM4ANTZ6NDDHPOPRO3SAWFS4ECZL5MKV/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-23 Thread Christopher Barker

There's a reason I've never actually proposed adding a char 

On Wed, Oct 23, 2019 at 5:34 PM Andrew Barnert  wrote:

> Well, just adding a char type (and presumably a way of defining char
> literals) wouldn’t be too disruptive.

sure.

> But changing str to iterate chars instead of strs, that probably would be.

And that would be the whole point -- a char type by itself isn't very
useful. In some sense, the only difference between a char and a str would
be that a char isn't iterable -- but the benefit would be that a string is
an iterable (and sequence) of chars, rather than an (infinitely recursable)
iterable of strings.

> Also, you’d have to go through a lot of functions and decide what types
> they should take.

sure would -- a lot of thought to see how disruptive it would be ...

> For example, does str.join still accept a string instead of an iterable
> of strings? Does it accept other iterables of char too?

if it accepted an iterable of either char or str, then I *think* there
would be little disruption.

> Can you pass a char to str.__contains__

yes, that's a no brainer, the whole point is that a string would be a
sequence of chars.

> or str.endswith?

I would think so -- a char would behave like a length-one string as much as
possible.

> What about a tuple of chars?

that's an odd one -- but I'm not sure I see the point, if you have a tuple
of chars, you could "".join() them if you want a string, in any context.

> Or should we take the backward-compat breaking opportunity to eliminate
> the “str or tuple of str” thing and instead use *args, or at least change
> it to “str or iterable of str (which no longer includes str itself)”?

Is this for .endswith() and friends? if so, there was discussion a while
back about that -- but probably not the time to introduce even more
backward incompatible changes.

>> And I'm not sure how much string functionality a char should have --
>> probably next to none, as the point is that it would be easy to distinguish
>> from a string that happened to have one character.

> Surely you’d want to be able to do things like isdigit or swapcase. Even
> C has functions to do most of that kind of stuff on chars.

probably -- it would be least disruptive for a char to act as much as
possible the same as a length-one string -- so maybe iterability and
indexability would be it.

> But I think that, other than join and maybe encode and translate,

not sure why encode or translate should be an issue off the top of my head
-- it would surely be a Unicode char :-)

> there’s an obvious right answer for every str method and operator, so
> this isn’t too much of a problem.

well, we'd have to go through all of them, and do a lot of thinking...

I think the greater confusion is where can you use a char instead of a
string in other places? using it as a filename, for instance would make it
pointless for at least the cases I commonly deal with (list of filenames).

I can only imagine how many "things" take a string where a char would make
sense, but then it gets harder to distinguish them all.

> Speaking of operators, should char+int and char-int and char-char be
> legal? (What about char%int? A thousand students doing the rot13 assignment
> would rejoice, but allowing % without * and // is kind of weird, and
> allowing * and // even weirder -- as well as potentially confusing with
> str*int being legal but meaning something very different.)

I would say no -- in C a char IS an unsigned 8bit int, but that's C -- in
Python a char and a number are very different things.

ord() and chr() would work, of course.
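
For reference, this is roughly how the rot13 exercise looks with explicit ord() and chr() conversions and no arithmetic on characters themselves. It is a minimal sketch that handles ASCII letters only, with illustrative function names.

```python
def rot13_char(ch):
    """Rotate one ASCII letter by 13 places; leave anything else unchanged."""
    if "a" <= ch <= "z":
        base = ord("a")
    elif "A" <= ch <= "Z":
        base = ord("A")
    else:
        return ch
    # Convert to a number, do the modular arithmetic, convert back.
    return chr((ord(ch) - base + 13) % 26 + base)

def rot13(text):
    """Apply rot13_char to every character of text."""
    return "".join(rot13_char(ch) for ch in text)

encoded = rot13("Hello")
round_trip = rot13(rot13("Hello, World!"))
```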

>> By the way, the bytes and bytearray types already does this -- index into
>> or loop through a bytes object, you get an int.

> Sure, but b'abc'.find(66) is -1, and b'abc'.replace(66, 70) is a TypeError,
> and so on.

I wonder if they need to be -- would we need a "byte" type, or would it be
OK to accept an int in all those sorts of places?

> Fixing those inconsistencies is what I meant by “go all the way to making
> them sequences of ints”. But it might be friendlier to undo the changes and
> instead add a byte type like the char type for bytes to be a sequence of.
> I’m not sure which is better.

me neither.

> But anyway, I think all of these questions are questions for a new
> language. If making str not iterate str was too big a change even for 3.0,
> how could it be reasonable for any future version?

Well, I don't know that it was seriously considered -- with the Unicode
changes, that WOULD have been the time to do it!

Again though, it seems like it would be pretty disruptive, so a
non-starter, but maybe not?

-CHB

-- 
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-23 Thread Andrew Barnert via Python-ideas

On Oct 23, 2019, at 16:00, Christopher Barker  wrote:
> 
>> On Sun, Oct 13, 2019 at 12:52 PM Andrew Barnert via Python-ideas 
>>  wrote:
> 
>> The main problem is that a str is a sequence of single-character str, each 
>> of which is a one-element sequence of itself, etc. forever. If you wanted to 
>> change this, I think it would make more sense to go the opposite way: leave 
>> str a sequence, but make it a sequence of char objects. (And likewise, bytes 
>> and bytearray could be sequences of byte objects—or just go all the way to 
>> making them sequences of ints.) And then maybe add a c prefix for defining 
>> char constants, and you’ve solved all the problems without having to add new 
>> confusing methods or properties.
> 
> I've thought for a long time that this would be a "good thing". the "string 
> or sequence of strings" issues is pretty much the only hidden-bug-triggering 
> type error I've gotten since "true division".
> 
> The only way we really live with it fairly easily is that strings are pretty 
> much never duck typed -- so I can check if I got a string, and then I know I 
> didn't get a sequence of strings. But I've always wondered how disruptive it 
> would be to add a char type -- it doesn't seem like it would be very 
> disruptive, but I have not thought it through at all.

Well, just adding a char type (and presumably a way of defining char literals) 
wouldn’t be too disruptive. 

But changing str to iterate chars instead of strs, that probably would be.

Also, you’d have to go through a lot of functions and decide what types they 
should take. For example, does str.join still accept a string instead of an 
iterable of strings? Does it accept other iterables of char too? (I have used ' 
'.join on a string in real life production code, even if I did feel guilty 
about it…) Can you pass a char to str.__contains__ or str.endswith? What about 
a tuple of chars? Or should we take the backward-compat breaking opportunity to 
eliminate the “str or tuple of str” thing and instead use *args, or at least 
change it to “str or iterable of str (which no longer includes str itself)”?
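
For reference, the current behaviour these questions start from can be sketched like this (the sample values are made up): join happily takes a bare string, and endswith takes a str or a tuple of str but no other iterable.

```python
# str.join accepts any iterable of str, and a str is one.
spaced = " ".join("abc")

name = "archive.tar.gz"
single = name.endswith(".gz")               # one suffix as a str
several = name.endswith((".zip", ".gz"))    # several suffixes as a tuple

try:
    name.endswith([".zip", ".gz"])          # a list is not accepted
except TypeError:
    list_rejected = True
else:
    list_rejected = False
```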

> And I'm not sure how much string functionality a char should have -- probably 
> next to none, as the point is that it would be easy to distinguish from a 
> string that happened to have one character.

Surely you’d want to be able to do things like isdigit or swapcase. Even C has 
functions to do most of that kind of stuff on chars.

But I think that, other than join and maybe encode and translate, there’s an 
obvious right answer for every str method and operator, so this isn’t too much 
of a problem.

Speaking of operators, should char+int and char-int and char-char be legal? 
(What about char%int? A thousand students doing the rot13 assignment would 
rejoice, but allowing % without * and // is kind of weird, and allowing * and 
// even weirder—as well as potentially confusing with str*int being legal but 
meaning something very different.)

> By the way, the bytes and bytearray types already does this -- index into or 
> loop through a bytes object, you get an int.

Sure, but b'abc'.find(66) is -1, and b'abc'.replace(66, 70) is a TypeError, and 
so on.
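
A short sketch of the inconsistency in current Python 3: indexing and iteration give ints, and some methods accept an int, but others insist on a bytes-like object.

```python
data = b"abc"

first = data[0]          # indexing yields an int
items = list(data)       # iteration yields ints
one_byte = data[0:1]     # slicing yields bytes

found = data.find(98)    # find accepts an int (98 is ord("b"))
contained = 98 in data   # so does the in operator

try:
    data.replace(98, 66)  # replace does not accept ints
except TypeError:
    replace_rejects_int = True
else:
    replace_rejects_int = False

replaced = data.replace(b"b", b"B")  # it needs bytes-like arguments
```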

Fixing those inconsistencies is what I meant by “go all the way to making them 
sequences of ints”. But it might be friendlier to undo the changes and instead 
add a byte type like the char type for bytes to be a sequence of. I’m not sure 
which is better.

But anyway, I think all of these questions are questions for a new language. If 
making str not iterate str was too big a change even for 3.0, how could it be 
reasonable for any future version?

Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/WRVOKGHNK7JKR66WG7MG73FUFZODLC4R/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-23 Thread Christopher Barker

On Sun, Oct 13, 2019 at 12:52 PM Andrew Barnert via Python-ideas <
python-ideas@python.org> wrote:

> The main problem is that a str is a sequence of single-character str, each
> of which is a one-element sequence of itself, etc. forever. If you wanted
> to change this, I think it would make more sense to go the opposite way:
> leave str a sequence, but make it a sequence of char objects. (And
> likewise, bytes and bytearray could be sequences of byte objects—or just go
> all the way to making them sequences of ints.) And then maybe add a c
> prefix for defining char constants, and you’ve solved all the problems
> without having to add new confusing methods or properties.
>

I've thought for a long time that this would be a "good thing". the "string
or sequence of strings" issues is pretty much the only
hidden-bug-triggering type error I've gotten since "true division".

The only way we really live with it fairly easily is that strings are
pretty much never duck typed -- so I can check if I got a string, and then
I know I didn't get a sequence of strings. But I've always wondered how
disruptive it would be to add a char type -- it doesn't seem like it would
be very disruptive, but I have not thought it through at all. And I'm not
sure how much string functionality a char should have -- probably next to
none, as the point is that it would be easy to distinguish from a string
that happened to have one character.

By the way, the bytes and bytearray types already does this -- index into
or loop through a bytes object, you get an int.

-CHB

-- 
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/QV2SLFQAR2VKOLD5Y7ACRO6LBX4ZE5UQ/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-13 Thread Steve Jorgensen

Yup. I think you're absolutely right.

After I posted this, I had a better idea: 
https://mail.python.org/archives/list/python-ideas@python.org/thread/OVP6SIOFNGGENJAJHXOS2AEUUPWSSRD2/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/6GNP6YMKZT4DWWTKKXWLC34GPEIUHLXZ/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-13 Thread Chris Angelico

On Mon, Oct 14, 2019 at 6:49 AM Andrew Barnert via Python-ideas
 wrote:
> And finally, if you want to break strings, it’s probably worth at least 
> considering making UTF-8 strings first-class objects. They can’t be randomly 
> accessed, but with an iterable-plus API like files, with seek/tell, or a new 
> more powerful iterable API like Swift or C++, a lot of languages have found 
> that to be a useful trade off anyway.
>

Breaking the str type to do this seems like a really REALLY bad idea,
but if you want a first-class UTF8String, you can certainly have it.
Build it on top of some sort of byte buffer (maybe bytearray rather
than bytes) with a whole lot of handy methods, and there you are.
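
A very small sketch of what such a type might look like. The class UTF8String and its seek/tell methods are hypothetical, positions are byte offsets, and the up-front validation is a simplification that costs a linear pass.

```python
class UTF8String:
    """Hypothetical UTF-8 backed text: iterable, with byte-offset seek/tell."""

    def __init__(self, data):
        self._buf = bytes(data)
        self._buf.decode("utf-8")  # validate once so iteration can trust lead bytes
        self._pos = 0

    def tell(self):
        """Return the current byte offset."""
        return self._pos

    def seek(self, offset):
        """Move to a byte offset, which must fall on a code point boundary."""
        if not 0 <= offset <= len(self._buf):
            raise ValueError("offset out of range")
        if offset < len(self._buf) and (self._buf[offset] & 0xC0) == 0x80:
            raise ValueError("offset is inside a multi-byte sequence")
        self._pos = offset

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._buf):
            raise StopIteration
        lead = self._buf[self._pos]
        # The lead byte determines how many bytes the code point occupies.
        if lead < 0x80:
            width = 1
        elif lead < 0xE0:
            width = 2
        elif lead < 0xF0:
            width = 3
        else:
            width = 4
        ch = self._buf[self._pos:self._pos + width].decode("utf-8")
        self._pos += width
        return ch

s = UTF8String("h\u00e9llo".encode("utf-8"))
first = next(s)
offset_after_first = s.tell()
second = next(s)
offset_after_second = s.tell()
```

Random access by character index is deliberately absent here; a caller can only remember an offset from tell() and later return to it with seek().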

ChrisA
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/K6C4VH7XY2I3YJOMI3JCUTPESRROOAG5/

[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

2019-10-13 Thread Andrew Barnert via Python-ideas

On Oct 13, 2019, at 12:02, Steve Jorgensen  wrote:
> 
> There are many cases in which it is awkward that testing whether an object is 
> a sequence returns `True` for instances of `str`, `bytes`, etc.
> 
> This proposal is a serious breakage of backward compatibility, so would be 
> something for Python 4.x, not 3.x.

I’m pretty sure almost nobody wants a 3.0-like break again, so this will 
probably never happen.
> 
> Instead of those objects _being_ sequences, have them provide views that are 
> sequences using a method named something like `members` or `items`.

Nothing else in Python works like this. Dicts do have an `items` method, but 
that provides an iterable (but not indexable) view of key-value pairs, while 
the dict itself is an iterable of its keys. So I think this would be pretty 
confusing.

Also, would you want them to not be iterable either? If so, that would break 
even more code; if not, I don’t think it would actually solve that much in the 
first place.

The main problem is that a str is a sequence of single-character str, each of 
which is a one-element sequence of itself, etc. forever. If you wanted to 
change this, I think it would make more sense to go the opposite way: leave str 
a sequence, but make it a sequence of char objects. (And likewise, bytes and 
bytearray could be sequences of byte objects—or just go all the way to making 
them sequences of ints.) And then maybe add a c prefix for defining char 
constants, and you’ve solved all the problems without having to add new 
confusing methods or properties.
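
The self-similarity described above is easy to see in current Python (a short sketch with a made-up sample string):

```python
from collections.abc import Sequence

s = "abc"
first = s[0]

# Each element of a str is itself a str ...
element_is_str = isinstance(first, str)
# ... whose only element is itself, however deep you index.
fixed_point = first[0] == first and first[0][0][0] == first
elements_of_first = list(first)

# Both str and bytes pass the generic "is this a sequence?" test.
str_is_sequence = isinstance(s, Sequence)
bytes_is_sequence = isinstance(b"abc", Sequence)
```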

Meanwhile, the most common places you run into this problem are in functions 
that take a single str argument or a single iterable-of-str argument. Most such 
cases have already been solved by taking a str or tuple-of-str, which is 
clunky, even if it’s worked since Python 0.9. But a better solution for almost all 
such cases is to just change the function to take a *args parameter for 0 or 
more string arguments.
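
A sketch of the *args approach, using a hypothetical function name: a single string is then unambiguously one argument, and callers holding a list unpack it explicitly.

```python
def shout(*words):
    """Hypothetical API: takes zero or more str arguments."""
    for word in words:
        if not isinstance(word, str):
            raise TypeError(f"expected str, got {type(word).__name__}")
    return [word.upper() for word in words]

one = shout("spam")             # a lone str is one word, not four characters
two = shout("spam", "eggs")
names = ["spam", "eggs"]
unpacked = shout(*names)        # an existing iterable is unpacked at the call site
none_at_all = shout()
```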

While we’re at it, if you really wanted to make a radical breaking change to 
Python involving view objects, I’d prefer one that expanded on dict views, to 
make all kinds of lazy view objects that are sequences or sets (e.g., calling 
map on a sequence gives you a sequence that’s computed on the fly; filtering a 
set gives you a set; reversing a sequence gives you a sequence; etc.), rather 
than making something else that’s kind of similar but doesn’t work the same way.
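
To make the idea concrete, here is a minimal sketch of one such view. The class MappedView is hypothetical and not part of the standard library. It is a real Sequence, computes each element on access, and reflects later changes to the underlying list.

```python
from collections.abc import Sequence

class MappedView(Sequence):
    """Hypothetical lazy view: applies func on access, stores no results."""

    def __init__(self, func, source):
        self._func = func
        self._source = source

    def __len__(self):
        return len(self._source)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing a list copies it, so the sliced view wraps that copy.
            return MappedView(self._func, self._source[index])
        return self._func(self._source[index])

data = [1, 2, 3]
squares = MappedView(lambda x: x * x, data)
second = squares[1]
before = list(squares)

data.append(4)                      # the view sees the new element
last = squares[-1]
backwards = list(reversed(squares))
```

Iteration, reversed() and the in operator come for free from the Sequence mixin methods once __len__ and __getitem__ are defined.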

And finally, if you want to break strings, it’s probably worth at least 
considering making UTF-8 strings first-class objects. They can’t be randomly 
accessed, but with an iterable-plus API like files, with seek/tell, or a new 
more powerful iterable API like Swift or C++, a lot of languages have found 
that to be a useful trade off anyway.

But again, I doubt any of this is likely to happen, as nobody wants to go 
through another decade-long painful transition unless the benefits are a whole 
lot bigger than fixing a couple of minor things people have already learned how 
to deal with.
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/T2TLCUVKZHEZBTY3IUH34MU2XH7VNE4T/
