[issue1125] bytes.split should have same interface as str.split, or different name
Guido van Rossum added the comment:

Committed revision 58093.

----------
resolution:  -> accepted
status: open -> closed

__________________________________
Tracker [EMAIL PROTECTED]
http://bugs.python.org/issue1125
__________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
Stefan Sonnenberg-Carstens added the comment:

IMHO, I also agree that strings and bytes (lists of bytes) should have the
same interface. When talking about strings, most programmers think of the
list of bytes composing them (char *), so the abbreviation should also hold
true in Python.

----------
nosy: +pythonmeister
Guido van Rossum added the comment:

Updated patch that also modifies bytes.*strip().

Index: Objects/bytesobject.c
===================================================================
--- Objects/bytesobject.c (revision 58052)
+++ Objects/bytesobject.c (working copy)
@@ -2104,7 +2104,7 @@
 Py_LOCAL_INLINE(PyObject *)
 split_char(const char *s, Py_ssize_t len, char ch, Py_ssize_t maxcount)
 {
-    register Py_ssize_t i, j, count=0;
+    register Py_ssize_t i, j, count = 0;
     PyObject *str;
     PyObject *list = PyList_New(PREALLOC_SIZE(maxcount));

@@ -2113,7 +2113,7 @@

     i = j = 0;
     while ((j < len) && (maxcount-- > 0)) {
-        for(; j<len; j++) {
+        for(; j < len; j++) {
             /* I found that using memchr makes no difference */
             if (s[j] == ch) {
                 SPLIT_ADD(s, i, j);
@@ -2133,28 +2133,72 @@
     return NULL;
 }

+#define ISSPACE(c) (isspace(Py_CHARMASK(c)) && ((c) & 0x80) == 0)
+
+Py_LOCAL_INLINE(PyObject *)
+split_whitespace(const char *s, Py_ssize_t len, Py_ssize_t maxcount)
+{
+    register Py_ssize_t i, j, count = 0;
+    PyObject *str;
+    PyObject *list = PyList_New(PREALLOC_SIZE(maxcount));
+
+    if (list == NULL)
+        return NULL;
+
+    for (i = j = 0; i < len; ) {
+        /* find a token */
+        while (i < len && ISSPACE(s[i]))
+            i++;
+        j = i;
+        while (i < len && !ISSPACE(s[i]))
+            i++;
+        if (j < i) {
+            if (maxcount-- <= 0)
+                break;
+            SPLIT_ADD(s, j, i);
+            while (i < len && ISSPACE(s[i]))
+                i++;
+            j = i;
+        }
+    }
+    if (j < len) {
+        SPLIT_ADD(s, j, len);
+    }
+    FIX_PREALLOC_SIZE(list);
+    return list;
+
+  onError:
+    Py_DECREF(list);
+    return NULL;
+}
+
 PyDoc_STRVAR(split__doc__,
-"B.split(sep [,maxsplit]) -> list of bytes\n\
+"B.split([sep [, maxsplit]]) -> list of bytes\n\
 \n\
 Return a list of the bytes in the string B, using sep as the\n\
-delimiter. If maxsplit is given, at most maxsplit\n\
-splits are done.");
+delimiter. If sep is not given, B is split on ASCII whitespace\n\
+characters (space, tab, return, newline, formfeed, vertical tab).\n\
+If maxsplit is given, at most maxsplit splits are done.");

 static PyObject *
 bytes_split(PyBytesObject *self, PyObject *args)
 {
     Py_ssize_t len = PyBytes_GET_SIZE(self), n, i, j;
-    Py_ssize_t maxsplit = -1, count=0;
+    Py_ssize_t maxsplit = -1, count = 0;
     const char *s = PyBytes_AS_STRING(self), *sub;
-    PyObject *list, *str, *subobj;
+    PyObject *list, *str, *subobj = Py_None;
 #ifdef USE_FAST
     Py_ssize_t pos;
 #endif

-    if (!PyArg_ParseTuple(args, "O|n:split", &subobj, &maxsplit))
+    if (!PyArg_ParseTuple(args, "|On:split", &subobj, &maxsplit))
         return NULL;
     if (maxsplit < 0)
         maxsplit = PY_SSIZE_T_MAX;
+
+    if (subobj == Py_None)
+        return split_whitespace(s, len, maxsplit);
+
     if (PyBytes_Check(subobj)) {
         sub = PyBytes_AS_STRING(subobj);
         n = PyBytes_GET_SIZE(subobj);
@@ -2167,7 +2211,7 @@
         PyErr_SetString(PyExc_ValueError, "empty separator");
         return NULL;
     }
-    else if (n == 1)
+    if (n == 1)
         return split_char(s, len, sub[0], maxsplit);

     list = PyList_New(PREALLOC_SIZE(maxsplit));
@@ -2293,26 +2337,71 @@
     return NULL;
 }

+Py_LOCAL_INLINE(PyObject *)
+rsplit_whitespace(const char *s, Py_ssize_t len, Py_ssize_t maxcount)
+{
+    register Py_ssize_t i, j, count = 0;
+    PyObject *str;
+    PyObject *list = PyList_New(PREALLOC_SIZE(maxcount));
+
+    if (list == NULL)
+        return NULL;
+
+    for (i = j = len - 1; i >= 0; ) {
+        /* find a token */
+        while (i >= 0 && Py_UNICODE_ISSPACE(s[i]))
+            i--;
+        j = i;
+        while (i >= 0 && !Py_UNICODE_ISSPACE(s[i]))
+            i--;
+        if (j > i) {
+            if (maxcount-- <= 0)
+                break;
+            SPLIT_ADD(s, i + 1, j + 1);
+            while (i >= 0 && Py_UNICODE_ISSPACE(s[i]))
+                i--;
+            j = i;
+        }
+    }
+    if (j >= 0) {
+        SPLIT_ADD(s, 0, j + 1);
+    }
+    FIX_PREALLOC_SIZE(list);
+    if (PyList_Reverse(list) < 0)
+        goto onError;
+
+    return list;
+
+  onError:
+    Py_DECREF(list);
+    return NULL;
+}
+
 PyDoc_STRVAR(rsplit__doc__,
 "B.rsplit(sep [,maxsplit]) -> list of bytes\n\
 \n\
 Return a list of the sections in the byte B, using sep as the\n\
 delimiter, starting at the end of the bytes and working\n\
-to the front. If maxsplit is given, at most maxsplit splits are\n\
-done.");
+to the front. If sep is not given, B is split on ASCII whitespace\n\
+characters (space, tab, return, newline, formfeed, vertical tab).\n\
+If maxsplit is given, at most maxsplit splits are done.");

 static PyObject *
 bytes_rsplit(PyBytesObject *self, PyObject *args)
 {
     Py_ssize_t len = PyBytes_GET_SIZE(self), n, i, j;
-    Py_ssize_t maxsplit = -1, count=0;
+    Py_ssize_t maxsplit = -1, count = 0;
     const char *s = PyBytes_AS_STRING(self), *sub;
-    PyObject *list, *str, *subobj;
+    PyObject *list, *str,
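The strip() part of this patch is cut off in the archive, so the following is only a minimal sketch of the behavior the message describes, assuming a current Python 3 interpreter in which the argument-less bytes.strip(), lstrip() and rstrip() remove ASCII whitespace. The variable names are illustrative.

```python
# Sketch: argument-less strip methods on bytes remove ASCII whitespace only.
padded = b'\t hello world \r\n'

stripped = padded.strip()    # both ends
left = padded.lstrip()       # leading whitespace only
right = padded.rstrip()      # trailing whitespace only

# 0xA0 is not ASCII whitespace, so it is left in place.
nbsp_kept = b'\xa0x\xa0'.strip()

print(stripped, left, right, nbsp_kept)
```

Passing an explicit bytes argument, as in `padded.strip(b'\t ')`, restricts stripping to the listed byte values instead.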
Guido van Rossum added the comment:

New version with corrected docstrings and buffer support for *split() as
well. Added unittests.

Index: Objects/bytesobject.c
===================================================================
--- Objects/bytesobject.c (revision 58052)
+++ Objects/bytesobject.c (working copy)
@@ -2104,7 +2104,7 @@
 Py_LOCAL_INLINE(PyObject *)
 split_char(const char *s, Py_ssize_t len, char ch, Py_ssize_t maxcount)
 {
-    register Py_ssize_t i, j, count=0;
+    register Py_ssize_t i, j, count = 0;
     PyObject *str;
     PyObject *list = PyList_New(PREALLOC_SIZE(maxcount));

@@ -2113,7 +2113,7 @@

     i = j = 0;
     while ((j < len) && (maxcount-- > 0)) {
-        for(; j<len; j++) {
+        for(; j < len; j++) {
             /* I found that using memchr makes no difference */
             if (s[j] == ch) {
                 SPLIT_ADD(s, i, j);
@@ -2133,46 +2133,91 @@
     return NULL;
 }

+#define ISSPACE(c) (isspace(Py_CHARMASK(c)) && ((c) & 0x80) == 0)
+
+Py_LOCAL_INLINE(PyObject *)
+split_whitespace(const char *s, Py_ssize_t len, Py_ssize_t maxcount)
+{
+    register Py_ssize_t i, j, count = 0;
+    PyObject *str;
+    PyObject *list = PyList_New(PREALLOC_SIZE(maxcount));
+
+    if (list == NULL)
+        return NULL;
+
+    for (i = j = 0; i < len; ) {
+        /* find a token */
+        while (i < len && ISSPACE(s[i]))
+            i++;
+        j = i;
+        while (i < len && !ISSPACE(s[i]))
+            i++;
+        if (j < i) {
+            if (maxcount-- <= 0)
+                break;
+            SPLIT_ADD(s, j, i);
+            while (i < len && ISSPACE(s[i]))
+                i++;
+            j = i;
+        }
+    }
+    if (j < len) {
+        SPLIT_ADD(s, j, len);
+    }
+    FIX_PREALLOC_SIZE(list);
+    return list;
+
+  onError:
+    Py_DECREF(list);
+    return NULL;
+}
+
 PyDoc_STRVAR(split__doc__,
-"B.split(sep [,maxsplit]) -> list of bytes\n\
+"B.split([sep [, maxsplit]]) -> list of bytes\n\
 \n\
-Return a list of the bytes in the string B, using sep as the\n\
-delimiter. If maxsplit is given, at most maxsplit\n\
-splits are done.");
+Return a list of the bytes in the string B, using sep as the delimiter.\n\
+If sep is not given, B is split on ASCII whitespace characters\n\
+(space, tab, return, newline, formfeed, vertical tab).\n\
+If maxsplit is given, at most maxsplit splits are done.");

 static PyObject *
 bytes_split(PyBytesObject *self, PyObject *args)
 {
     Py_ssize_t len = PyBytes_GET_SIZE(self), n, i, j;
-    Py_ssize_t maxsplit = -1, count=0;
+    Py_ssize_t maxsplit = -1, count = 0;
     const char *s = PyBytes_AS_STRING(self), *sub;
-    PyObject *list, *str, *subobj;
+    PyObject *list, *str, *subobj = Py_None;
+    PyBuffer vsub;
 #ifdef USE_FAST
     Py_ssize_t pos;
 #endif

-    if (!PyArg_ParseTuple(args, "O|n:split", &subobj, &maxsplit))
+    if (!PyArg_ParseTuple(args, "|On:split", &subobj, &maxsplit))
         return NULL;
     if (maxsplit < 0)
         maxsplit = PY_SSIZE_T_MAX;
-    if (PyBytes_Check(subobj)) {
-        sub = PyBytes_AS_STRING(subobj);
-        n = PyBytes_GET_SIZE(subobj);
-    }
-    /* XXX - use the modern buffer interface */
-    else if (PyObject_AsCharBuffer(subobj, &sub, &n))
+
+    if (subobj == Py_None)
+        return split_whitespace(s, len, maxsplit);
+
+    if (_getbuffer(subobj, &vsub) < 0)
         return NULL;
+    sub = vsub.buf;
+    n = vsub.len;

     if (n == 0) {
         PyErr_SetString(PyExc_ValueError, "empty separator");
+        PyObject_ReleaseBuffer(subobj, &vsub);
         return NULL;
     }
-    else if (n == 1)
+    if (n == 1)
         return split_char(s, len, sub[0], maxsplit);

     list = PyList_New(PREALLOC_SIZE(maxsplit));
-    if (list == NULL)
+    if (list == NULL) {
+        PyObject_ReleaseBuffer(subobj, &vsub);
         return NULL;
+    }

 #ifdef USE_FAST
     i = j = 0;
@@ -2198,10 +2243,12 @@
 #endif
     SPLIT_ADD(s, i, len);
     FIX_PREALLOC_SIZE(list);
+    PyObject_ReleaseBuffer(subobj, &vsub);
     return list;

   onError:
     Py_DECREF(list);
+    PyObject_ReleaseBuffer(subobj, &vsub);
     return NULL;
 }

@@ -2293,44 +2340,90 @@
     return NULL;
 }

+Py_LOCAL_INLINE(PyObject *)
+rsplit_whitespace(const char *s, Py_ssize_t len, Py_ssize_t maxcount)
+{
+    register Py_ssize_t i, j, count = 0;
+    PyObject *str;
+    PyObject *list = PyList_New(PREALLOC_SIZE(maxcount));
+
+    if (list == NULL)
+        return NULL;
+
+    for (i = j = len - 1; i >= 0; ) {
+        /* find a token */
+        while (i >= 0 && Py_UNICODE_ISSPACE(s[i]))
+            i--;
+        j = i;
+        while (i >= 0 && !Py_UNICODE_ISSPACE(s[i]))
+            i--;
+        if (j > i) {
+            if (maxcount-- <= 0)
+                break;
+            SPLIT_ADD(s, i + 1, j + 1);
+            while (i >= 0 && Py_UNICODE_ISSPACE(s[i]))
+                i--;
+            j = i;
+        }
+    }
+    if (j >= 0) {
+        SPLIT_ADD(s, 0, j + 1);
+    }
+    FIX_PREALLOC_SIZE(list);
+    if (PyList_Reverse(list) < 0)
+        goto onError;
+
+    return list;
+
+  onError:
+    Py_DECREF(list);
+    return NULL;
+}
+
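What "buffer support" means for callers can be illustrated from Python. This is a hedged sketch, assuming a current Python 3 interpreter in which bytes.split() accepts any bytes-like separator. It does not exercise the 2007 patch itself, and the variable names are illustrative.

```python
# Sketch: the separator may be any object exposing the buffer interface.
data = b'a--b--c'
separators = [b'--', bytearray(b'--'), memoryview(b'--')]
results = [data.split(sep) for sep in separators]

# An empty separator is rejected, matching the "empty separator" check
# that is visible in the C code above.
try:
    data.split(b'')
    empty_rejected = False
except ValueError:
    empty_rejected = True

print(results, empty_rejected)
```

All three separator types give the same list, which is the point of routing the argument through the buffer interface rather than checking for one concrete type.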
Georg Brandl added the comment:

I don't think so. They can't have the same behavior, and split is the most
reasonable name for what the bytes method does.

There have always been subtle differences between the behavior of string and
unicode methods; that was even more objectionable because they were supposed
to be interchangeable to some degree, whereas 3k strings and bytes are not.

----------
nosy: +georg.brandl
Walter Dörwald added the comment:

Because it's not clear whether b'\xa0' *is* whitespace or not. Bytes have no
meaning, characters do.

----------
nosy: +doerwalter
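The ambiguity around b'\xa0' can be made concrete with a short sketch. It assumes a current Python 3 interpreter and the standard latin-1 and cp437 codecs; the variable names are illustrative.

```python
# Sketch: the same byte 0xA0 means different characters in different encodings.
raw = b'a\xa0b'

as_latin1 = raw.decode('latin-1')   # 0xA0 is NO-BREAK SPACE here
as_cp437 = raw.decode('cp437')      # 0xA0 is the letter 'á' here

latin1_parts = as_latin1.split()    # str.split() treats NO-BREAK SPACE as whitespace
cp437_parts = as_cp437.split()      # nothing to split on
bytes_parts = raw.split()           # bytes.split() only knows ASCII whitespace

print(latin1_parts, cp437_parts, bytes_parts)
```

Only after decoding does the byte acquire a meaning, which is why a bytes-level split cannot decide the question on its own.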
Nir Soffer added the comment:

Why shouldn't bytes use the same default whitespace-splitting behavior as
str?
Guido van Rossum added the comment:

I tend to agree with the author; I've run into this myself.

For whitespace, I propose to use only the following:

tab LF FF VT CR space

These are the whitespace ASCII characters according to isspace() in libc.

(Unicode also treats hex 1C, 1D, 1E and 1F as whitespace; I have no idea
what these mean. In practice I don't think it matters either way.)

----------
assignee:  -> gvanrossum
nosy: +gvanrossum
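The proposed whitespace set can be checked from Python. This is a small sketch, assuming a current Python 3 interpreter whose bytes.split() uses these six ASCII characters; the variable names are illustrative.

```python
# Sketch: the six ASCII whitespace bytes (tab, LF, VT, FF, CR, space) all
# act as separators for an argument-less bytes.split().
sample = b'a\tb\nc\x0bd\x0ce\rf g'
parts = sample.split()

# Bytes 0x1C-0x1F are whitespace for str.split() but not for bytes.split().
controls = b'a\x1cb\x1dc\x1ed\x1fe'
bytes_parts = controls.split()
str_parts = controls.decode('ascii').split()

print(parts, bytes_parts, str_parts)
```

The second half shows the one place where the str and bytes rules diverge even for pure ASCII input.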
Guido van Rossum added the comment:

Here's a patch that fixes bytes.split and .rsplit. I'll hold off for a
while in case there's strong disagreement.

I might add a patch for bytes.strip later (it's simpler).

----------
keywords: +patch

Index: Objects/bytesobject.c
===================================================================
--- Objects/bytesobject.c (revision 58048)
+++ Objects/bytesobject.c (working copy)
@@ -2104,7 +2104,7 @@
 Py_LOCAL_INLINE(PyObject *)
 split_char(const char *s, Py_ssize_t len, char ch, Py_ssize_t maxcount)
 {
-    register Py_ssize_t i, j, count=0;
+    register Py_ssize_t i, j, count = 0;
     PyObject *str;
     PyObject *list = PyList_New(PREALLOC_SIZE(maxcount));

@@ -2113,7 +2113,7 @@

     i = j = 0;
     while ((j < len) && (maxcount-- > 0)) {
-        for(; j<len; j++) {
+        for(; j < len; j++) {
             /* I found that using memchr makes no difference */
             if (s[j] == ch) {
                 SPLIT_ADD(s, i, j);
@@ -2133,28 +2133,72 @@
     return NULL;
 }

+#define ISSPACE(c) (isspace(Py_CHARMASK(c)) && ((c) & 0x80) == 0)
+
+Py_LOCAL_INLINE(PyObject *)
+split_whitespace(const char *s, Py_ssize_t len, Py_ssize_t maxcount)
+{
+    register Py_ssize_t i, j, count = 0;
+    PyObject *str;
+    PyObject *list = PyList_New(PREALLOC_SIZE(maxcount));
+
+    if (list == NULL)
+        return NULL;
+
+    for (i = j = 0; i < len; ) {
+        /* find a token */
+        while (i < len && ISSPACE(s[i]))
+            i++;
+        j = i;
+        while (i < len && !ISSPACE(s[i]))
+            i++;
+        if (j < i) {
+            if (maxcount-- <= 0)
+                break;
+            SPLIT_ADD(s, j, i);
+            while (i < len && ISSPACE(s[i]))
+                i++;
+            j = i;
+        }
+    }
+    if (j < len) {
+        SPLIT_ADD(s, j, len);
+    }
+    FIX_PREALLOC_SIZE(list);
+    return list;
+
+  onError:
+    Py_DECREF(list);
+    return NULL;
+}
+
 PyDoc_STRVAR(split__doc__,
-"B.split(sep [,maxsplit]) -> list of bytes\n\
+"B.split([sep [, maxsplit]]) -> list of bytes\n\
 \n\
 Return a list of the bytes in the string B, using sep as the\n\
-delimiter. If maxsplit is given, at most maxsplit\n\
-splits are done.");
+delimiter. If sep is not given, B is split on ASCII whitespace\n\
+characters (space, tab, return, newline, formfeed, vertical tab).\n\
+If maxsplit is given, at most maxsplit splits are done.");

 static PyObject *
 bytes_split(PyBytesObject *self, PyObject *args)
 {
     Py_ssize_t len = PyBytes_GET_SIZE(self), n, i, j;
-    Py_ssize_t maxsplit = -1, count=0;
+    Py_ssize_t maxsplit = -1, count = 0;
     const char *s = PyBytes_AS_STRING(self), *sub;
-    PyObject *list, *str, *subobj;
+    PyObject *list, *str, *subobj = Py_None;
 #ifdef USE_FAST
     Py_ssize_t pos;
 #endif

-    if (!PyArg_ParseTuple(args, "O|n:split", &subobj, &maxsplit))
+    if (!PyArg_ParseTuple(args, "|On:split", &subobj, &maxsplit))
         return NULL;
     if (maxsplit < 0)
         maxsplit = PY_SSIZE_T_MAX;
+
+    if (subobj == Py_None)
+        return split_whitespace(s, len, maxsplit);
+
     if (PyBytes_Check(subobj)) {
         sub = PyBytes_AS_STRING(subobj);
         n = PyBytes_GET_SIZE(subobj);
@@ -2167,7 +2211,7 @@
         PyErr_SetString(PyExc_ValueError, "empty separator");
         return NULL;
     }
-    else if (n == 1)
+    if (n == 1)
         return split_char(s, len, sub[0], maxsplit);

     list = PyList_New(PREALLOC_SIZE(maxsplit));
@@ -2293,26 +2337,71 @@
     return NULL;
 }

+Py_LOCAL_INLINE(PyObject *)
+rsplit_whitespace(const char *s, Py_ssize_t len, Py_ssize_t maxcount)
+{
+    register Py_ssize_t i, j, count = 0;
+    PyObject *str;
+    PyObject *list = PyList_New(PREALLOC_SIZE(maxcount));
+
+    if (list == NULL)
+        return NULL;
+
+    for (i = j = len - 1; i >= 0; ) {
+        /* find a token */
+        while (i >= 0 && Py_UNICODE_ISSPACE(s[i]))
+            i--;
+        j = i;
+        while (i >= 0 && !Py_UNICODE_ISSPACE(s[i]))
+            i--;
+        if (j > i) {
+            if (maxcount-- <= 0)
+                break;
+            SPLIT_ADD(s, i + 1, j + 1);
+            while (i >= 0 && Py_UNICODE_ISSPACE(s[i]))
+                i--;
+            j = i;
+        }
+    }
+    if (j >= 0) {
+        SPLIT_ADD(s, 0, j + 1);
+    }
+    FIX_PREALLOC_SIZE(list);
+    if (PyList_Reverse(list) < 0)
+        goto onError;
+
+    return list;
+
+  onError:
+    Py_DECREF(list);
+    return NULL;
+}
+
 PyDoc_STRVAR(rsplit__doc__,
 "B.rsplit(sep [,maxsplit]) -> list of bytes\n\
 \n\
 Return a list of the sections in the byte B, using sep as the\n\
 delimiter, starting at the end of the bytes and working\n\
-to the front. If maxsplit is given, at most maxsplit splits are\n\
-done.");
+to the front. If sep is not given, B is split on ASCII whitespace\n\
+characters (space, tab, return, newline, formfeed, vertical tab).\n\
+If maxsplit is given, at most maxsplit splits are done.");

 static PyObject *
 bytes_rsplit(PyBytesObject *self, PyObject *args)
 {
     Py_ssize_t len = PyBytes_GET_SIZE(self), n, i, j;
-    Py_ssize_t maxsplit = -1, count=0;
+
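For readers who do not want to trace the C loop, the scanning logic of split_whitespace can be restated in Python. This is an illustrative port, not code from the patch: the constant `ASCII_WHITESPACE` and the handling of a negative `maxcount` are assumptions made so the sketch is self-contained.

```python
# Hypothetical Python restatement of the split_whitespace() loop in the patch.
ASCII_WHITESPACE = b' \t\n\r\x0b\x0c'   # assumed equivalent of the ISSPACE macro


def split_whitespace(data, maxcount=-1):
    """Split bytes on runs of ASCII whitespace, doing at most maxcount splits."""
    n = len(data)
    if maxcount < 0:
        maxcount = n              # large enough to act as "no limit"
    result = []
    i = j = 0
    while i < n:
        # skip leading whitespace, then scan one token
        while i < n and data[i] in ASCII_WHITESPACE:
            i += 1
        j = i
        while i < n and data[i] not in ASCII_WHITESPACE:
            i += 1
        if j < i:
            if maxcount <= 0:
                break             # split budget used up; j marks the remainder
            maxcount -= 1
            result.append(data[j:i])
            while i < n and data[i] in ASCII_WHITESPACE:
                i += 1
            j = i
    if j < n:
        result.append(data[j:n])  # unsplit remainder, trailing whitespace kept
    return result


print(split_whitespace(b'  foo bar  baz '))
print(split_whitespace(b'  foo bar  baz ', 1))
```

When the split budget runs out, the remainder is appended from the start of the current token, so trailing whitespace stays in the last item, as it does with the built-in method.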
Changes by Guido van Rossum:

----------
components: +Interpreter Core -Library (Lib)
type: rfe -> behavior
New submission from Nir Soffer:

>>> b'foo bar'.split()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: split() takes at least 1 argument (0 given)

>>> b'foo bar'.split(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected an object with the buffer interface

str.split and bytes.split should have the same interface, or different
names.

----------
components: Library (Lib)
messages: 55723
nosy: nirs
severity: normal
status: open
title: bytes.split should have same interface as str.split, or different name
versions: Python 3.0
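For comparison, here is a short sketch of the same two calls, assuming a current Python 3 interpreter rather than the 2007 build in the report. On such an interpreter both forms are accepted and mirror str.split. The variable names are illustrative.

```python
# Sketch: both call forms from the report, run against a current interpreter.
data = b'foo bar'

no_sep = data.split()          # separator omitted
none_sep = data.split(None)    # separator passed explicitly as None
text_parts = 'foo bar'.split() # the str counterpart

print(no_sep, none_sep, text_parts)
```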
Nir Soffer added the comment:

set type

----------
type:  -> rfe