On Nov 2, 2019, at 20:33, Random832 wrote:
>
>> On Sun, Oct 27, 2019, at 03:10, Andrew Barnert wrote:
>>> On Oct 26, 2019, at 19:59, Random832 wrote:
>>>
>>> A string representation consisting of (say) a UTF-8 string, plus an
>>> auxiliary list of byte indices of, say, 256-codepoint-long
On Sun, Oct 27, 2019, at 03:10, Andrew Barnert wrote:
> On Oct 26, 2019, at 19:59, Random832 wrote:
> >
> > A string representation consisting of (say) a UTF-8 string, plus an
> > auxiliary list of byte indices of, say, 256-codepoint-long chunks [along
> > with perhaps a flag to say that the
I think that we're more or less in broad agreement, but I wanted to
comment on this:
On Sun, Oct 27, 2019 at 09:41:00PM -0700, Andrew Barnert wrote:
> Yes, that’s the whole point of the message you were responding to:
> extended grapheme clusters are the Unicode approximation of
> characters;
On Oct 27, 2019, at 18:00, Steven D'Aprano wrote:
>
> On Sun, Oct 27, 2019 at 10:07:41AM -0700, Andrew Barnert via Python-ideas
> wrote:
>
>>> File "/home/rosuav/tmp/demo.py", line 1
>>> print("Hello, world!')
>>>^
>>> SyntaxError: EOL while scanning string literal
>>
On Oct 27, 2019, at 05:49, Chris Angelico wrote:
>> Given zero-based indexing, and the string:
>>
>>"abÇÐεф"
>>
>> the index of "ф" better damn well be 5 rather than 8 (UTF-8), 10
>> (UTF-16) or 20 (UTF-32) or I'll be knocking on the API designer's door
>> with a pitchfork and a flaming
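The numbers quoted above can be checked directly. This is only a small sketch: the helper name offset_in_encoding is made up for illustration, and the "-le" codec variants are used so that no byte-order mark is counted.

```python
s = "abÇÐεф"
target = "ф"


def offset_in_encoding(text, ch, encoding):
    """Byte offset of the first occurrence of ch once text is encoded.

    Hypothetical helper, not part of any library: it encodes the prefix
    that precedes ch and measures its length in bytes.
    """
    prefix = text[:text.index(ch)]
    return len(prefix.encode(encoding))


# Index in code points, which is what str.index() reports.
code_point_index = s.index(target)

# Byte offsets of the same character in three encodings.
utf8_offset = offset_in_encoding(s, target, "utf-8")
utf16_offset = offset_in_encoding(s, target, "utf-16-le")
utf32_offset = offset_in_encoding(s, target, "utf-32-le")

print(code_point_index, utf8_offset, utf16_offset, utf32_offset)
```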
> On Oct 27, 2019, at 05:38, Steven D'Aprano wrote:
>
>> On Sun, Oct 27, 2019 at 12:10:22AM -0700, Andrew Barnert via Python-ideas
>> wrote:
>>
>> If you redesign your find, re.search, etc. APIs to not return
>> character indexes, then I think you can get away with not having
>>
On Sun, Oct 27, 2019 at 11:43 PM Steven D'Aprano wrote:
>
> On Sun, Oct 27, 2019 at 12:10:22AM -0700, Andrew Barnert via Python-ideas
> wrote:
>
> > If you redesign your find, re.search, etc. APIs to not return
> > character indexes, then I think you can get away with not having
> >
On Sun, Oct 27, 2019 at 12:10:22AM -0700, Andrew Barnert via Python-ideas wrote:
> If you redesign your find, re.search, etc. APIs to not return
> character indexes, then I think you can get away with not having
> character-indexable strings.
If string.index(c) doesn't return the index of c in
On Sun, Oct 27, 2019 at 03:33:16PM +1100, Steven D'Aprano wrote:
> else:
> assert c <= '\U0001':
Oops, misplaced a zero there. That was supposed to be '\U0010'.
--
Steven
___
Python-ideas mailing list -- python-ideas@python.org
On Sun, Oct 27, 2019, at 03:39, Andrew Barnert via Python-ideas wrote:
> (Actually, IIRC, one of the two has a str type that, despite being 2.x,
> is unicode rather than bytes, but with some extra undocumented
> functionality to smuggle bytes around in a str and have it sometimes
> work.)
I do
On Oct 26, 2019, at 21:33, Steven D'Aprano wrote:
>
> IronPython and Jython use whatever .Net and Java use.
Which makes them sequences of UTF-16 code units, not code points. Which is
allowed for the Python 2.x unicode type, but would violate the rules for 3.x
str, but neither one has a 3.x.
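To illustrate the difference between code units and code points, here is a sketch run on CPython 3 (not on IronPython or Jython). The helper utf16_code_units is a made-up name; it simply counts 16-bit units in the UTF-16 encoding of a string.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16.

    Hypothetical helper for illustration only.
    """
    return len(text.encode("utf-16-le")) // 2


# One BMP character plus one character outside the BMP (U+1F600).
s = "a\U0001F600"

code_points = len(s)              # Python 3 str counts code points
code_units = utf16_code_units(s)  # a UTF-16 sequence counts surrogate pairs as two

print(code_points, code_units)
```

A string type that is a sequence of UTF-16 code units would report the second number as its length, and indexing could land in the middle of a surrogate pair.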
On Oct 26, 2019, at 19:59, Random832 wrote:
>
> A string representation consisting of (say) a UTF-8 string, plus an
> auxiliary list of byte indices of, say, 256-codepoint-long chunks [along with
> perhaps a flag to say that the chunk is all-ASCII or not] would provide O(1)
> random access,
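A rough sketch of the kind of representation described in the quote, under stated assumptions: the class name ChunkIndexedUtf8 and all of its attributes are invented here, the chunk size of 256 comes from the quoted text, and this is not how CPython or any other implementation stores strings. Lookup jumps to the start of a chunk and then scans at most 255 characters, so the work per index is bounded by a constant.

```python
class ChunkIndexedUtf8:
    """UTF-8 bytes plus a table of byte offsets, one per 256-code-point chunk.

    Hypothetical illustration only. Strings containing lone surrogates are
    not supported because they cannot be encoded as UTF-8.
    """

    CHUNK = 256  # code points per chunk, as in the quoted example

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.length = len(text)
        self.offsets = []     # byte offset where each chunk starts
        self.ascii_only = []  # True if every character in the chunk is ASCII
        pos = 0
        for start in range(0, len(text), self.CHUNK):
            piece = text[start:start + self.CHUNK]
            encoded = piece.encode("utf-8")
            self.offsets.append(pos)
            self.ascii_only.append(len(encoded) == len(piece))
            pos += len(encoded)

    def __len__(self):
        return self.length

    @staticmethod
    def _width(lead):
        """Length in bytes of the UTF-8 sequence that starts with lead."""
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def __getitem__(self, i):
        if i < 0:
            i += self.length
        if not 0 <= i < self.length:
            raise IndexError("string index out of range")
        chunk, within = divmod(i, self.CHUNK)
        pos = self.offsets[chunk]
        if self.ascii_only[chunk]:
            # All-ASCII chunk: byte offset equals character offset.
            return chr(self.data[pos + within])
        # Otherwise scan forward, at most CHUNK - 1 characters.
        for _ in range(within):
            pos += self._width(self.data[pos])
        return self.data[pos:pos + self._width(self.data[pos])].decode("utf-8")
```

The trade-off is one extra integer (and one flag) per 256 characters in exchange for a bounded scan instead of a scan from the start of the string.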
PEP 393
The Unicode string type is changed to support multiple internal
representations, depending on the character with the largest Unicode
ordinal (1, 2, or 4 bytes)
... Ah, OK. I get it. One byte representation is only ASCII, which happens
to match utf-8. Well, the latin-1 oddness. But the
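As a sketch of the rule quoted from PEP 393: the per-character width follows from the largest code point in the string. The function name pep393_width is invented for illustration and is not a CPython API. Per the PEP, the one-byte form covers code points up to U+00FF (Latin-1); pure-ASCII strings are the special case where those bytes are also valid UTF-8.

```python
def pep393_width(text):
    """Bytes per character that PEP 393 would pick for text (1, 2, or 4).

    Hypothetical helper: it only mirrors the selection rule, it does not
    inspect the actual object layout.
    """
    top = max(map(ord, text), default=0)
    if top <= 0xFF:
        return 1   # Latin-1 range, which includes ASCII
    if top <= 0xFFFF:
        return 2   # rest of the Basic Multilingual Plane
    return 4       # anything beyond U+FFFF


for sample in ("abc", "é", "ф", "\U0001F600"):
    print(repr(sample), pep393_width(sample))
```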
On Sat, Oct 26, 2019 at 11:34:34PM -0400, David Mertz wrote:
> What does actual CPython do currently to find that s[1_000_000], assuming
> utf-8 internal representation?
CPython doesn't use a UTF-8 internal representation.
MicroPython *may*, but I don't know if they do anything fancy to avoid
On Sun, Oct 27, 2019 at 2:37 PM David Mertz wrote:
> What does actual CPython do currently to find that s[1_000_000], assuming
> utf-8 internal representation?
>
Mu.
CPython does not have a UTF-8 internal representation.
ChrisA
Ok, true enough that dereferencing and limited linear search is still O(1).
I could have phrased that slightly more precisely.
But the trade-off part is true. Indexing into character 1 million of a
utf-32 string is just one memory offset calculation, then following the
reference. Indexing into
On Sat, Oct 26, 2019, at 20:26, David Mertz wrote:
> Absolutely, utf-8 is a wonderful encoding. And indeed, worst case is
> the same storage requirement as utf-16 or utf-32. For O(1) random
> access into all strings, we have to eat 32-bits per character, one way
> or the other, but of course
On Wed, Oct 23, 2019, at 19:00, Christopher Barker wrote:
> On Sun, Oct 13, 2019 at 12:52 PM Andrew Barnert via Python-ideas
> wrote:
> > The main problem is that a str is a sequence of single-character str, each
> > of which is a one-element sequence of itself, etc. forever. If you wanted
>
On Oct 26, 2019, at 16:28, Steven D'Aprano wrote:
>
>> On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas
>> wrote:
>> On Oct 13, 2019, at 12:02, Steve Jorgensen wrote:
> [...]
>>> This proposal is a serious breakage of backward compatibility, so
>>> would be something
Absolutely, utf-8 is a wonderful encoding. And indeed, worst case is the
same storage requirement as utf-16 or utf-32. For O(1) random access into
all strings, we have to eat 32-bits per character, one way or the other,
but of course there are space/speed trade-offs one could make for
intermediate
On Sat, Oct 26, 2019 at 07:38:19PM -0400, David Mertz wrote:
> On Sat, Oct 26, 2019, 7:29 PM Steven D'Aprano
>
>
> > (At worst, a code-point in UTF-8 takes three bytes, compared to four in
> > UTF-16 or UTF-32.)
> >
>
> http://www.fileformat.info/info/unicode/char/1/index.htm
Oops, you're
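For reference, the worst case can be checked in Python. This sketch encodes code points at the boundaries of each UTF-8 length class; the dictionary names are made up for illustration.

```python
# Code points at the edges of each UTF-8 length class.
samples = {
    "U+007F": "\x7f",
    "U+07FF": "\u07ff",
    "U+FFFF": "\uffff",
    "U+10000": "\U00010000",
    "U+10FFFF": "\U0010ffff",
}

# Encoded length in bytes for each sample.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, size in utf8_lengths.items():
    print(name, size)
```

Code points above U+FFFF take four bytes in UTF-8, the same as in UTF-16 (a surrogate pair) and UTF-32.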
On Sat, Oct 26, 2019, 7:29 PM Steven D'Aprano
> (At worst, a code-point in UTF-8 takes three bytes, compared to four in
> UTF-16 or UTF-32.)
>
http://www.fileformat.info/info/unicode/char/1/index.htm
>
On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas wrote:
> On Oct 13, 2019, at 12:02, Steve Jorgensen wrote:
[...]
> > This proposal is a serious breakage of backward compatibility, so
> > would be something for Python 4.x, not 3.x.
>
> I’m pretty sure almost nobody
On Fri, Oct 25, 2019 at 08:44:17PM -0700, Ben Rudiak-Gould wrote:
> Nothing good can come of decomposing strings into Unicode code points.
Sure there is. In Python, it's the fastest way to calculate the digit
sum of an integer. It's also useful for implementing classical
encryption algorithms,
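The digit-sum idiom referred to above presumably looks something like the following sketch. The function name digit_sum and the use of abs() for negative numbers are choices made here, and no timing claim is verified.

```python
def digit_sum(n):
    """Sum of the decimal digits of an integer, via its string form."""
    # str() yields one character per digit; iterating the string
    # decomposes it into those characters.
    return sum(map(int, str(abs(n))))


print(digit_sum(12345))
```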
Since this is Python 4000, where everything's made up and the points
don't matter...
I think there shouldn't be a char type, and also strings shouldn't be
iterable, or indexable by integers, or anything else that makes them
appear to be tuples of code points.
Nothing good can come of decomposing
On Oct 25, 2019, at 06:26, Serhiy Storchaka wrote:
>
> 25.10.19 15:53, Andrew Barnert via Python-ideas wrote:
>> If you were designing a new Python-like language today, or if you had a time
>> machine back to the 90s, it would be a different story.
>
> Interesting, how far in past you will need
25.10.19 15:53, Andrew Barnert via Python-ideas wrote:
If you were designing a new Python-like language today, or if you had a time
machine back to the 90s, it would be a different story.
Interesting, how far in past you will need to travel? Initially builtin
types did not have methods or
On Oct 25, 2019, at 01:34, Paul Moore wrote:
>
> On Thu, 24 Oct 2019 at 23:47, Andrew Barnert via Python-ideas
> wrote:
>> But again, I don’t think either of these is the reason Python strings being
>> iterable is a problem; I think it really is primarily about them being
>> iterables of
On Thu, 24 Oct 2019 at 23:47, Andrew Barnert via Python-ideas
wrote:
> But again, I don’t think either of these is the reason Python strings being
> iterable is a problem; I think it really is primarily about them being
> iterables of strings.
The *real* problem is that there's a whole load of
On Oct 24, 2019, at 14:13, Greg Ewing wrote:
>
> I'm thinking of things like a function to recursively flatten
> a nested list. You probably want it to stop when it gets to a
> string, and not flatten the string into a list of characters.
A function to recursively flatten a nested list should
Christopher Barker wrote:
wouldn't it? once you got to an object that couldn't be iterated, you'd
know you had an atomic value.
I'm thinking of things like a function to recursively flatten
a nested list. You probably want it to stop when it gets to a
string, and not flatten the string into a
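A minimal sketch of such a flatten function, showing the special case that strings force today. Treating bytes the same way as str is an assumption made here, not something stated in the thread.

```python
from collections.abc import Iterable


def flatten(obj):
    """Recursively yield the atoms of a nested iterable.

    str and bytes are treated as atoms; without that check the recursion
    would never bottom out, since each character is itself an iterable str.
    """
    if isinstance(obj, (str, bytes)) or not isinstance(obj, Iterable):
        yield obj
    else:
        for item in obj:
            yield from flatten(item)


print(list(flatten([1, ["ab", [2, ("cd", 3)]], "e"])))
```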
On Thu, Oct 24, 2019 at 1:13 AM Greg Ewing
wrote:
> Christopher Barker wrote:
> > I've always wondered
> > how disruptive it would be to add a char type
>
> I'm not sure if it would help much. Usually the problem with
> strings being sequences of strings lies in the fact that they're
> sequences
Christopher Barker wrote:
I've always wondered
how disruptive it would be to add a char type
I'm not sure if it would help much. Usually the problem with
strings being sequences of strings lies in the fact that they're
sequences at all. Code that operates generically on nested sequences
often
> On 24 Oct 2019, at 01:02, Christopher Barker wrote:
>
>
>> On Sun, Oct 13, 2019 at 12:52 PM Andrew Barnert via Python-ideas
>> wrote:
>
>> The main problem is that a str is a sequence of single-character str, each
>> of which is a one-element sequence of itself, etc. forever. If you
There's a reason I've never actually proposed adding a char
On Wed, Oct 23, 2019 at 5:34 PM Andrew Barnert wrote:
> Well, just adding a char type (and presumably a way of defining char
literals) wouldn’t be too disruptive.
sure.
> But changing str to iterate chars instead of strs, that
On Oct 23, 2019, at 16:00, Christopher Barker wrote:
>
>> On Sun, Oct 13, 2019 at 12:52 PM Andrew Barnert via Python-ideas
>> wrote:
>
>> The main problem is that a str is a sequence of single-character str, each
>> of which is a one-element sequence of itself, etc. forever. If you wanted to
On Sun, Oct 13, 2019 at 12:52 PM Andrew Barnert via Python-ideas <
python-ideas@python.org> wrote:
> The main problem is that a str is a sequence of single-character str, each
> of which is a one-element sequence of itself, etc. forever. If you wanted
> to change this, I think it would make more
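The "etc. forever" part can be seen with a short sketch. The helper name descend is invented for illustration; it just takes the first element repeatedly.

```python
def descend(obj, depth):
    """Take obj[0] repeatedly, depth times, and return the result."""
    for _ in range(depth):
        obj = obj[0]
    return obj


# However deep we go, we still get a one-character str back.
ch = descend("abc", 1000)
print(repr(ch), type(ch).__name__, len(ch))
```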
Yup. I think you're absolutely right.
After I posted this, I had a better idea:
https://mail.python.org/archives/list/python-ideas@python.org/thread/OVP6SIOFNGGENJAJHXOS2AEUUPWSSRD2/
On Mon, Oct 14, 2019 at 6:49 AM Andrew Barnert via Python-ideas
wrote:
> And finally, if you want to break strings, it’s probably worth at least
> considering making UTF-8 strings first-class objects. They can’t be randomly
> accessed, but with an iterable-plus API like files, with seek/tell,
On Oct 13, 2019, at 12:02, Steve Jorgensen wrote:
>
> There are many cases in which it is awkward that testing whether an object is
> a sequence returns `True` for instances of `str`, `bytes`, etc.
>
> This proposal is a serious breakage of backward compatibility, so would be
> something