[issue41377] memoryview of str (unicode)

2020-07-24 Thread Guido van Rossum


Guido van Rossum  added the comment:

We should not do this, it would expose internals that we need to keep private. 
The right approach would be to keep things as bytes.
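
As a minimal sketch of the "keep things as bytes" approach (the variable names are illustrative only, and UTF-8 is an assumed encoding): encode once at the boundary, pass zero-copy `memoryview` slices around, and decode only where text is actually needed.

```python
# Sketch: keep the payload as bytes and decode only at the boundary.
payload = "Hello World!".encode("utf-8")  # done once, e.g. when input is read

view = memoryview(payload)   # zero-copy view of the bytes object
word = view[6:11]            # slicing a memoryview does not copy the data

# The slice is still backed by the original bytes object.
assert word.obj is payload

# Copy and decode only at the point where a str is required.
text = bytes(word).decode("utf-8")
print(text)
```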

--
nosy: +gvanrossum
resolution:  -> wont fix
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue41377] memoryview of str (unicode)

2020-07-23 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I concur with Raymond.

Also, it would not help to catch bugs where you get a string instead of the 
expected bytes object. It may "work" in tests while the string is ASCII, but 
fail miserably on real-world non-ASCII data.
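
The ASCII versus non-ASCII pitfall can be illustrated with a small sketch. The helper `byte_length` is hypothetical and UTF-8 is an assumed encoding; the point is only that code-point counts and byte counts agree for ASCII text and diverge otherwise.

```python
def byte_length(text, encoding="utf-8"):
    """Number of bytes the text occupies once encoded (hypothetical helper)."""
    return len(text.encode(encoding))

ascii_text = "cafe"
non_ascii_text = "caf\u00e9"  # same word with a precomposed e-acute

# For ASCII text, code points and UTF-8 bytes line up one to one ...
print(len(ascii_text), byte_length(ascii_text))          # 4 4
# ... but a single non-ASCII character breaks that assumption.
print(len(non_ascii_text), byte_length(non_ascii_text))  # 4 5
```

Code that indexes a buffer using offsets computed on the `str` would pass tests on the first input and silently misbehave on the second.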

--
nosy: +serhiy.storchaka




[issue41377] memoryview of str (unicode)

2020-07-23 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

I think we can close this.  AFAICT, if we exposed the raw internal object with 
a memory view, there would be no practical way to use the data without a user 
having to substantially recreate the logic already present in encode() and the 
other string methods.

--
nosy: +rhettinger




[issue41377] memoryview of str (unicode)

2020-07-23 Thread Karthikeyan Singaravelan


Change by Karthikeyan Singaravelan :


--
nosy: +skrah




[issue41377] memoryview of str (unicode)

2020-07-23 Thread Eric V. Smith


Eric V. Smith  added the comment:

I don't think there's a Python-level API to find out the "kind", but I can't 
say I've looked closely. And there are no doubt problems with doing so in 
implementations other than CPython. I'm not sure we want to expose 
this implementation detail, but maybe it's the case that all implementations 
could expose this. For example, Jython could always just say "I'm UCS-2", or 
something.

--




[issue41377] memoryview of str (unicode)

2020-07-23 Thread jakirkham


jakirkham  added the comment:

Thanks for the clarification, Eric! :)

Is this the sort of thing that we could capture in the `format`[1] field (like 
with `"B"`, `"H"`, and `"I"`[2]) or are there potential issues there?

[1]: https://docs.python.org/3/c-api/buffer.html#c.Py_buffer.format
[2]: https://docs.python.org/3/library/struct.html#format-characters
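
For reference, this is how the `format` field already looks from Python for typed buffers. The sketch uses `array.array` purely as a stand-in exporter (an assumption for illustration; `str` exports no buffer at all), and item sizes are compared against `struct.calcsize` because they are platform dependent.

```python
import struct
from array import array

# Collect (format, itemsize) as reported by memoryview for each typecode.
reported = {}
for typecode in ("B", "H", "I"):
    view = memoryview(array(typecode, [1, 2, 3]))
    reported[typecode] = (view.format, view.itemsize)
    print(typecode, view.format, view.itemsize)

# The exported format matches the array typecode, and the item size
# matches the native size of that struct format character.
for typecode, (fmt, itemsize) in reported.items():
    assert fmt == typecode
    assert itemsize == struct.calcsize(typecode)
```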

--




[issue41377] memoryview of str (unicode)

2020-07-23 Thread Eric V. Smith


Eric V. Smith  added the comment:

> AIUI (though I could be misunderstanding things) `str` objects do use some 
> kind of typed array of unicode characters (either 16-bit narrow or 32-bit 
> wide). 

It's somewhat more complicated. The string data is stored differently depending 
on the maximum code point in the string. See PEP 393.

The "kind" field describes this as:
1 byte (Latin-1)
2 byte (UCS-2)
4 byte (UCS-4)
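
The effect of the "kind" can be observed indirectly from Python. The sketch below is CPython-specific and relies on `sys.getsizeof` growing linearly with string length; the helper name `bytes_per_code_point` is hypothetical, and other implementations may report something entirely different.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point from how sys.getsizeof grows with length."""
    longer = sys.getsizeof(ch * (2 * n))
    shorter = sys.getsizeof(ch * n)
    return (longer - shorter) // n

samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",       # e-acute
    "BMP": "\u20ac",           # euro sign
    "astral": "\U0001f600",    # emoji outside the BMP
}

for label, ch in samples.items():
    print(label, bytes_per_code_point(ch))
```

On a CPython build implementing PEP 393 this should report 1, 1, 2, and 4 bytes respectively, which is why a single fixed item size cannot describe every `str`.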

--
nosy: +eric.smith




[issue41377] memoryview of str (unicode)

2020-07-23 Thread jakirkham

New submission from jakirkham :

When working with lower level C/C++ code, the Python Buffer Protocol[1] has 
been immensely useful as it allows common Python `bytes`-like objects to expose 
the underlying memory buffer in a pointer that C/C++ code can easily work with 
zero-copy. In fact `memoryview` objects can be quite handy when facilitating 
coercion of Python objects supporting the Python Buffer Protocol to something 
that Python and/or C/C++ code can use easily. This works with several Python 
objects, many Python APIs, and is relied on heavily by many 
performance-conscious 3rd party libraries.

However one object that gets a lot of use in Python that doesn't support this 
API is the Python `str` (previously `unicode`) object (see code below).

```python
In [1]: s = "Hello World!"  

In [2]: mv = memoryview(s)  
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
----> 1 mv = memoryview(s)

TypeError: memoryview: a bytes-like object is required, not 'str'
```

The canonical answer today is [to encode to `bytes` 
first](https://stackoverflow.com/a/54449407) and decode to `str` later. While 
this is ok for a smallish piece of text, it can start to slow down considerably 
for larger pieces of text. So being able to skip this encode/decode step can be 
quite impactful.

```python
In [1]: s = "Hello World!"  

In [2]: %timeit s.encode(); 
54.9 ns ± 0.0788 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: s = 100_000_000 * "Hello World!"

In [4]: %timeit s.encode(); 
729 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
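
Outside IPython, the same round trip and its cost can be sketched with the standard `timeit` module. The helper `encode_seconds` is hypothetical, and the large string here is deliberately much smaller than the one above so the sketch runs quickly; absolute timings will vary by machine.

```python
import timeit

def encode_seconds(text, repeat=5):
    """Best-of-`repeat` wall time for a single text.encode() call."""
    return min(timeit.repeat(text.encode, number=1, repeat=repeat))

small = "Hello World!"
large = 100_000 * small

# The encode/decode round trip: encode once, view the bytes, decode later.
raw = large.encode()
view = memoryview(raw)
assert view.nbytes == len(large)   # holds because this text is pure ASCII
assert raw.decode() == large

print(f"small: {encode_seconds(small):.3e} s")
print(f"large: {encode_seconds(large):.3e} s")
```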

AIUI (though I could be misunderstanding things) `str` objects do use some kind 
of typed array of unicode characters (either 16-bit narrow or 32-bit wide). So 
it seems like it *should* be possible to expose this as a 1-D contiguous array 
that C/C++ code could use. Though I may be misunderstanding how `str`s actually 
work under the hood (if so, apologies).
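
For comparison, a 1-D contiguous array of code points is obtainable today, at the cost of one copy, by encoding to UTF-32 in native byte order and casting the resulting buffer. This is only a sketch of the existing route, not the zero-copy access requested here; the helper name `code_point_view` is hypothetical and it assumes a 4-byte native `unsigned int`.

```python
import struct
import sys

def code_point_view(text):
    """Return a memoryview of 32-bit code points (copies the text once)."""
    if struct.calcsize("I") != 4:
        raise RuntimeError("this sketch assumes a 4-byte native unsigned int")
    # Pick the BOM-free UTF-32 variant that matches the native byte order.
    encoding = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
    return memoryview(text.encode(encoding)).cast("I")

sample = "h\u00e9llo \u20ac"
points = code_point_view(sample)

# One 32-bit item per code point, in a 1-D contiguous buffer.
assert points.format == "I" and points.itemsize == 4
assert points.ndim == 1 and points.contiguous
assert list(points) == [ord(ch) for ch in sample]
```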

It would be quite helpful to bypass this encoding/decoding step and instead 
work directly with the underlying buffer in these situations where C/C++ is 
involved to help performance critical code.

[1]: https://docs.python.org/3/c-api/buffer.html

--
components: Library (Lib)
messages: 374147
nosy: jakirkham
priority: normal
severity: normal
status: open
title: memoryview of str (unicode)
type: enhancement
versions: Python 3.10, Python 3.8, Python 3.9
