Re: Unicode and Python - how often do you index strings?
On 05.06.2014 20:52, Ryan Hiebert wrote: 2014-06-05 13:42 GMT-05:00 Johannes Bauer dfnsonfsdu...@gmx.de: On 05.06.2014 20:16, Paul Rubin wrote: Johannes Bauer dfnsonfsdu...@gmx.de writes: line = line[:-1] Which truncates the trailing \n of a textfile line. use line.rstrip() for that. rstrip has different functionality than what I'm doing. How so? I was using line=line[:-1] for removing the trailing newline, and just replaced it with rstrip('\n'). What are you doing differently?
Ah, I didn't know rstrip() accepted parameters, and since you wrote line.rstrip() this would also cut away whitespace (which sadly is relevant in odd cases). Thanks for the clarification, I'll definitely introduce that.
Cheers, Johannes
-- Where exactly had you predicted the quake again? At least not publicly! Ah, the latest and to this day most brilliant stroke of our great cosmologists: the secret prediction. - Karl Kaos on Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
On 05.06.2014 22:18, Ian Kelly wrote: Personally I tend toward rstrip('\r\n') so that I don't have to worry about files with alternative line terminators.
Hm, I was under the impression that Python already took care of removing the \r at a line ending. Checking that right now (DOS encoded file "y"):
    >>> for line in open("y", "r"): print(line.encode("utf-8"))
    ...
    b'foo\n'
    b'bar\n'
    b'moo\n'
    b'koo\n'
Yup, the \r was removed automatically. Are there cases when it isn't?
Cheers, Johannes
Re: Unicode and Python - how often do you index strings?
On 2014-06-06 10:47, Johannes Bauer wrote: Personally I tend toward rstrip('\r\n') so that I don't have to worry about files with alternative line terminators. Hm, I was under the impression that Python already took care of removing the \r at a line ending. Checking that right now (DOS encoded file "y"):
    >>> for line in open("y", "r"): print(line.encode("utf-8"))
    ...
    b'foo\n'
    b'bar\n'
    b'moo\n'
    b'koo\n'
Yup, the \r was removed automatically. Are there cases when it isn't?
It's possible if the file is opened as binary:
    >>> f = file('delme.txt', 'wb')
    >>> f.write('hello\r\nworld\r\n')
    >>> f.close()
    >>> f = file('delme.txt', 'rb')
    >>> for row in f: print repr(row)
    ...
    'hello\r\n'
    'world\r\n'
    >>> f.close()
-tkc
Re: Unicode and Python - how often do you index strings?
On Fri, 06 Jun 2014 10:47:44 +0200, Johannes Bauer wrote: Hm, I was under the impression that Python already took care of removing the \r at a line ending. Checking that right now: [snip example]
This is called "Universal Newlines". Technically it is a build-time option which applies when you build the Python interpreter from source, so, yes, some Pythons may not implement it at all. But I think that it has been on by default for a long time, and the option to turn it off may have been removed in Python 3.3 or 3.4. In practical terms, you should normally expect it to be on. Here's the PEP that introduced it: http://legacy.python.org/dev/peps/pep-0278/
The idea is that when universal newlines support is enabled, Python will by default convert any of \n, \r or \r\n into \n when reading from a file in text mode, and convert back the other way when writing the file. In binary mode, newlines are *never* changed. In Python 3, you can return end-of-lines unchanged by passing newline='' to the open() function.
https://docs.python.org/2/library/functions.html#open https://docs.python.org/3/library/functions.html#open
-- Steven D'Aprano http://import-that.dreamwidth.org/
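For anyone who wants to see the three behaviours side by side, here is a minimal Python 3 sketch. The temporary file and the variable names are made up for illustration; only open() and its newline parameter come from the message above.

```python
import os
import tempfile

# Write a small file with DOS line endings. Binary mode is used so that
# nothing gets translated on the way out.
path = os.path.join(tempfile.mkdtemp(), "dos.txt")
with open(path, "wb") as f:
    f.write(b"foo\r\nbar\r\n")

# Default text mode: universal newlines turn \r\n into \n.
with open(path, "r", encoding="utf-8") as f:
    translated = f.readlines()

# newline='' still splits on line boundaries but leaves the terminators alone.
with open(path, "r", encoding="utf-8", newline="") as f:
    untranslated = f.readlines()

# Binary mode never changes newlines.
with open(path, "rb") as f:
    raw = f.readlines()

print(translated)    # ['foo\n', 'bar\n']
print(untranslated)  # ['foo\r\n', 'bar\r\n']
print(raw)           # [b'foo\r\n', b'bar\r\n']
```

So a stray \r at the end of a line usually means the file was opened in binary mode or with newline=''.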
Re: Unicode and Python - how often do you index strings?
On 2014-06-06, Roy Smith r...@panix.com wrote: Roy is using MT-NewsWatcher as a client. Yes. Except for the fact that it hasn't kept up with unicode, I find the U/I pretty much perfect. I imagine at some point I'll be forced to look elsewhere, but then again, netnews is pretty much dead.
There are still a few active groups, but reading e-mail lists via NNTP (in my case using slrn) via gmane is a huge reason to have an efficient, well-designed news client. If usenet does really pack it in someday and I have to switch from comp.lang.python to the mailing list, it will be done by pointing slrn at new.gmane.org -- not by having all those e-mails sent to me so I can try to sort through them...
-- Grant Edwards (grant.b.edwards at gmail.com) Yow! My NOSE is NUMB!
Re: Unicode and Python - how often do you index strings?
On 06/06/2014 01:42 AM, Johannes Bauer wrote: snip Ah, I didn't know rstrip() accepted parameters and since you wrote line.rstrip() this would also cut away whitespaces (which sadly are relevant in odd cases).
No problem. If a parameter is used in the strip() family, then _only_ those characters are stripped. Example:
    s = 'some text \n'
    print('{}'.format(s.rstrip()))      # No parameter, strip all whitespace
    some text
    print('{}'.format(s.rstrip('\n')))  # Parameter is newline, only strip newlines
    some text
-=- Larry
BTW, the strip() parameter (which must be a string) is not limited to whitespace, it can be used with any set of characters.
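The difference between the two outputs above is a trailing space, which is hard to see in plain print output. A small sketch using repr() makes it visible; the "banana.txt" string is just an invented illustration of the "set of characters" point.

```python
s = "some text \n"

# repr() shows the trailing space and newline explicitly.
print(repr(s.rstrip()))        # 'some text'
print(repr(s.rstrip("\n")))    # 'some text '

# The argument is a set of characters, not a suffix: every trailing
# '.', 't' or 'x' is removed, in whatever order they appear.
print(repr("banana.txt".rstrip(".tx")))   # 'banana'
```

That last line is the usual trap: rstrip(".txt") is not "remove the .txt extension".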
Re: Unicode and Python - how often do you index strings?
wxjmfa...@gmail.com: Unicode ? I have the feeling is similar as explaining, i (the imaginary number) is not equal to sqrt(-1). jmf PS Once I gave you a link pointing to unicode.org doc, you obviously did not read it. Sir, you are an artist, a poet even! With admiration, Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
On Thu, 05 Jun 2014 00:06:54 -0700, wxjmfauth wrote: Le mercredi 4 juin 2014 16:50:59 UTC+2, Michael Torrie a écrit : On 06/04/2014 12:50 AM, wxjmfa...@gmail.com wrote: Like many, you are not understanding unicode because you do not understand the coding of characters. If that is true, then I'm sure a well-written paragraph or two can set him straight. You continually berate people for not understanding unicode, but you've posted nothing to explain anything, nor demonstrate your own understanding. That's one reason your posts are so frustrating and considered trolling. You never ever explain yourself, instead just flailing around and muttering about folks not understanding unicode, just as you've done here, true to form. You do not understand the coding of the characters because you do not understand the mathematics behind it. flamebaiting here... FSR *is* UTF-32 internally, compresses off leading zero bits during string creation. You focussed on the wrong problem. Frankly it is you who is focused on the wrong problem, at least with this particular thread. I think you got distracted by the subject line. Chris's original post really has nothing to do with unicode at all. He's simply asking for use cases for string indexing where O(1) is desired or necessary. Could be old Python 2 byte strings, or Python 3 unicode strings. It does not matter. Unicode is orthogonal to his question. Maybe his purpose in asking the question is to justify a fixed-length encoding scheme (which is what FSR actually is), or maybe it is to explore the costs of using a much slower, but more compact, variable-length encoding scheme like UTF-8. Particularly in the context of low-memory applications where unicode support would be nice, but memory is at a premium. But either way, you got hung up on the wrong thing. (All this stuff has been discussed, tested and worked on 20 (twenty) years ago.) Sorry. As am I. = Unicode ? 
I have the feeling is similar as explaining, i (the imaginary number) is not equal to sqrt(-1). jmf PS Once I gave you a link pointing to unicode.org doc, you obviously did not read it.
And you have many times been given a link explaining the problems with posting from Google Groups, but you deliberately choose not to make your replies readable.
-- If you're not part of the solution, you're part of the precipitate.
Re: Unicode and Python - how often do you index strings?
On 6/5/14 10:39 AM, alister wrote: {snipped all the mess} And you have many times been given a link explaining the problems with posting from Google Groups, but you deliberately choose not to make your replies readable.
The problem is that things look fine in Google Groups. What helps is getting to see what the mess looks like from Thunderbird or equivalent.
Re: Unicode and Python - how often do you index strings?
On 04.06.2014 02:39, Chris Angelico wrote: I know the collective experience of python-list can't fail to bring up a few solid examples here :)
Just also grepped lots of code and have surprisingly few instances of index-search. Most are with constant indices. One particular example that comes up a lot is
    line = line[:-1]
which truncates the trailing "\n" of a textfile line. Then some indexing in the form of
    negative = (line[0] == "-")
All in all I'm actually a bit surprised this isn't too common.
Cheers, Johannes
Re: Unicode and Python - how often do you index strings?
On 05/06/2014 16:57, Mark H Harris wrote: On 6/5/14 10:39 AM, alister wrote: {snipped all the mess} And you have many times been given a link explaining the problems with posting from Google Groups, but you deliberately choose not to make your replies readable. The problem is that things look fine in Google Groups. What helps is getting to see what the mess looks like from Thunderbird or equivalent.
Wrong. 99.99% of people when asked politely take action so there is no problem. The remaining 0.01% consists of one complete ignoramus.
-- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence --- This email is free from viruses and malware because avast! Antivirus protection is active. http://www.avast.com
Re: Unicode and Python - how often do you index strings?
On 4 June 2014 15:50, Michael Torrie torr...@gmail.com wrote: On 06/04/2014 12:50 AM, wxjmfa...@gmail.com wrote: [Things] [Reply to things] Please. Just don't. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
On Thu, 05 Jun 2014 18:15:31 +0100, Mark Lawrence wrote: The problem is that things look fine in Google Groups. What helps is getting to see what the mess looks like from Thunderbird or equivalent. Wrong. 99.99% of people when asked politely take action so there is no problem. The remaining 0.01% consists of one complete ignoramus.
Who has actively stated he will not change. Pretty much the same attitude he shows in constantly saying Python's unicode implementation is broken* without any valid supporting evidence.
* Not just incomplete or inefficient but irrevocably broken.
-- Yow! It's some people inside the wall! This is better than mopping!
Re: Unicode and Python - how often do you index strings?
Johannes Bauer dfnsonfsdu...@gmx.de writes: line = line[:-1] Which truncates the trailing \n of a textfile line. use line.rstrip() for that. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
On 05.06.2014 20:16, Paul Rubin wrote: Johannes Bauer dfnsonfsdu...@gmx.de writes: line = line[:-1] Which truncates the trailing \n of a textfile line. use line.rstrip() for that.
rstrip has different functionality than what I'm doing.
Cheers, Johannes
Re: Unicode and Python - how often do you index strings?
2014-06-05 13:42 GMT-05:00 Johannes Bauer dfnsonfsdu...@gmx.de: On 05.06.2014 20:16, Paul Rubin wrote: Johannes Bauer dfnsonfsdu...@gmx.de writes: line = line[:-1] Which truncates the trailing \n of a textfile line. use line.rstrip() for that. rstrip has different functionality than what I'm doing. How so? I was using line=line[:-1] for removing the trailing newline, and just replaced it with rstrip('\n'). What are you doing differently? -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
On Fri, Jun 6, 2014 at 4:52 AM, Ryan Hiebert r...@ryanhiebert.com wrote: 2014-06-05 13:42 GMT-05:00 Johannes Bauer dfnsonfsdu...@gmx.de: On 05.06.2014 20:16, Paul Rubin wrote: Johannes Bauer dfnsonfsdu...@gmx.de writes: line = line[:-1] Which truncates the trailing \n of a textfile line. use line.rstrip() for that. rstrip has different functionality than what I'm doing. How so? I was using line=line[:-1] for removing the trailing newline, and just replaced it with rstrip('\n'). What are you doing differently?
    >>> line = "Hello,\nworld!\n\n"
    >>> line[:-1]
    'Hello,\nworld!\n'
    >>> line.rstrip('\n')
    'Hello,\nworld!'
If it's guaranteed to end with exactly one newline, then and only then will they be identical.
ChrisA
Re: Unicode and Python - how often do you index strings?
Ryan Hiebert r...@ryanhiebert.com writes: How so? I was using line=line[:-1] for removing the trailing newline, and just replaced it with rstrip('\n'). What are you doing differently? rstrip removes all the newlines off the end, whether there are zero or multiple. In perl the difference is chomp vs chop. line=line[:-1] removes one character, that might or might not be a newline. -- https://mail.python.org/mailman/listinfo/python-list
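To make the chomp/chop distinction concrete, here is a small sketch comparing the two idioms from this thread with a helper that removes at most one newline. The name chomp is borrowed from Perl and is hypothetical here; Python has no built-in of that name.

```python
def chomp(line):
    # Hypothetical helper: drop one trailing "\n" if present, otherwise
    # return the line unchanged (roughly what Perl's chomp does).
    return line[:-1] if line.endswith("\n") else line

for line in ("text\n", "text\n\n", "text"):
    print(repr(line[:-1]), repr(line.rstrip("\n")), repr(chomp(line)))
# 'text' 'text' 'text'
# 'text\n' 'text' 'text\n'
# 'tex' 'text' 'text'
```

The three agree only when the line ends in exactly one newline. The last row is the one that bites: the final line of a file often has no terminator, and line[:-1] then eats a real character.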
Re: Unicode and Python - how often do you index strings?
On Thu, Jun 5, 2014 at 2:59 PM, Chris Angelico ros...@gmail.com wrote: On Fri, Jun 6, 2014 at 4:52 AM, Ryan Hiebert r...@ryanhiebert.com wrote: 2014-06-05 13:42 GMT-05:00 Johannes Bauer dfnsonfsdu...@gmx.de: On 05.06.2014 20:16, Paul Rubin wrote: Johannes Bauer dfnsonfsdu...@gmx.de writes: line = line[:-1] Which truncates the trailing \n of a textfile line. use line.rstrip() for that. rstrip has different functionality than what I'm doing. How so? I was using line=line[:-1] for removing the trailing newline, and just replaced it with rstrip('\n'). What are you doing differently?
    >>> line = "Hello,\nworld!\n\n"
    >>> line[:-1]
    'Hello,\nworld!\n'
    >>> line.rstrip('\n')
    'Hello,\nworld!'
If it's guaranteed to end with exactly one newline, then and only then will they be identical.
OK, that's not an issue for my case, and additionally I'm using the open(_, 'U') file iterable, so I shouldn't see multiple trailing newlines anyway.
Re: Unicode and Python - how often do you index strings?
On Thu, Jun 5, 2014 at 1:58 PM, Paul Rubin no.email@nospam.invalid wrote: Ryan Hiebert r...@ryanhiebert.com writes: How so? I was using line=line[:-1] for removing the trailing newline, and just replaced it with rstrip('\n'). What are you doing differently? rstrip removes all the newlines off the end, whether there are zero or multiple. In perl the difference is chomp vs chop. line=line[:-1] removes one character, that might or might not be a newline.
Given the description that the input string is a textfile line, if it has multiple newlines then it's invalid. Personally I tend toward rstrip('\r\n') so that I don't have to worry about files with alternative line terminators.
If you want to be really picky about removing exactly one line terminator, then this captures all the relatively modern variations:
    re.sub('\r?\n$|\n?\r$', '', line, count=1)
Re: Unicode and Python - how often do you index strings?
- Original Message - From: Ian Kelly ian.g.ke...@gmail.com To: Python python-list@python.org Cc: Sent: Thursday, June 5, 2014 10:18 PM Subject: Re: Unicode and Python - how often do you index strings?
On Thu, Jun 5, 2014 at 1:58 PM, Paul Rubin no.email@nospam.invalid wrote: Ryan Hiebert r...@ryanhiebert.com writes: How so? I was using line=line[:-1] for removing the trailing newline, and just replaced it with rstrip('\n'). What are you doing differently? rstrip removes all the newlines off the end, whether there are zero or multiple. In perl the difference is chomp vs chop. line=line[:-1] removes one character, that might or might not be a newline.
Given the description that the input string is a textfile line, if it has multiple newlines then it's invalid. Personally I tend toward rstrip('\r\n') so that I don't have to worry about files with alternative line terminators.
I tend to use: s.rstrip(os.linesep)
If you want to be really picky about removing exactly one line terminator, then this captures all the relatively modern variations:
    re.sub('\r?\n$|\n?\r$', '', line, count=1)
or perhaps:
    re.sub(r"[^ \S]+$", "", line)
Re: Unicode and Python - how often do you index strings?
In article mailman.10767.1402000635.18130.python-l...@python.org, Albert-Jan Roskam fo...@yahoo.com wrote: - Original Message - From: Ian Kelly ian.g.ke...@gmail.com To: Python python-list@python.org Cc: Sent: Thursday, June 5, 2014 10:18 PM Subject: Re: Unicode and Python - how often do you index strings? On Thu, Jun 5, 2014 at 1:58 PM, Paul Rubin no.email@nospam.invalid wrote: Ryan Hiebert r...@ryanhiebert.com writes: How so? I was using line=line[:-1] for removing the trailing newline, and just replaced it with rstrip('\n'). What are you doing differently? rstrip removes all the newlines off the end, whether there are zero or multiple.? In perl the difference is chomp vs chop.? line=line[:-1] removes one character, that might or might not be a newline. Given the description that the input string is a textfile line, if it has multiple newlines then it's invalid. Personally I tend toward rstrip('\r\n') so that I don't have to worry about files with alternative line terminators. I tend to use: s.rstrip(os.linesep) If you want to be really picky about removing exactly one line terminator, then this captures all the relatively modern variations: re.sub('\r?\n$|\n?\r$', line, '', count=1) or perhaps: re.sub([^ \S]+$, , line) Just for fun, I took a screen-shot of what this looks like in my newsreader. URL below. Looks like something chomped on unicode pretty hard :-) http://www.panix.com/~roy/unicode.pdf -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
On Friday, June 6, 2014 2:30:26 AM UTC+5:30, Roy Smith wrote: Just for fun, I took a screen-shot of what this looks like in my newsreader. URL below. Looks like something chomped on unicode pretty hard :-) http://www.panix.com/~roy/unicode.pdf Yii -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
In article 8681edf0-7a1f-4110-9f87-a8cd0988c...@googlegroups.com, Rustom Mody rustompm...@gmail.com wrote: On Friday, June 6, 2014 2:30:26 AM UTC+5:30, Roy Smith wrote: Just for fun, I took a screen-shot of what this looks like in my newsreader. URL below. Looks like something chomped on unicode pretty hard :-) http://www.panix.com/~roy/unicode.pdf Yii
Roy is using MT-NewsWatcher as a client. Because its codebase's origins are back in classic MacOS (<= 9), it has its own *interesting* ways to deal with encodings. BTW, don't upgrade to OS X 10.9 Mavericks if you're dependent on MT-NW; it finally stops working there because what was left of Open Transport support in OS X has finally been ripped out of 10.9.
-- Ned Deily, n...@acm.org
Re: Unicode and Python - how often do you index strings?
On Thu, Jun 5, 2014 at 2:34 PM, Albert-Jan Roskam fo...@yahoo.com wrote: If you want to be really picky about removing exactly one line terminator, then this captures all the relatively modern variations:
    re.sub('\r?\n$|\n?\r$', '', line, count=1)
or perhaps:
    re.sub(r"[^ \S]+$", "", line)
That will remove more than one terminator, plus tabs. Points for including \f and \v though. I suppose if we want to be absolutely correct, we should follow the Unicode standard:
    re.sub(r'\r?\n$|[\r\v\f\x85\u2028\u2029]$', '', line, count=1)
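A runnable variant of the "exactly one terminator" idea might look like the sketch below. Two things in it are assumptions of this sketch, not part of the posts above: the helper name strip_one_terminator, and the use of \Z instead of $ (in Python, $ also matches just before a final newline, which can surprise). Note the argument order of re.sub: pattern, replacement, string.

```python
import re

# One trailing line terminator: \r\n, or a single \n, \r, \v, \f,
# NEL (\x85), LINE SEPARATOR (\u2028) or PARAGRAPH SEPARATOR (\u2029).
# \Z anchors at the very end of the string.
_ONE_TERMINATOR = re.compile(r"(?:\r\n|[\n\r\v\f\x85\u2028\u2029])\Z")

def strip_one_terminator(line):
    # Hypothetical helper name; removes at most one terminator.
    return _ONE_TERMINATOR.sub("", line, count=1)

print(repr(strip_one_terminator("foo\r\n")))    # 'foo'
print(repr(strip_one_terminator("foo\n\n")))    # 'foo\n'
print(repr(strip_one_terminator("foo \t\n")))   # 'foo \t'
print(repr(strip_one_terminator("foo")))        # 'foo'
```

Unlike rstrip(), this leaves trailing spaces and tabs alone and never removes more than one terminator.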
Re: Unicode and Python - how often do you index strings?
In article mailman.10781.1402009056.18130.python-l...@python.org, Ned Deily n...@acm.org wrote: In article 8681edf0-7a1f-4110-9f87-a8cd0988c...@googlegroups.com, Rustom Mody rustompm...@gmail.com wrote: On Friday, June 6, 2014 2:30:26 AM UTC+5:30, Roy Smith wrote: Just for fun, I took a screen-shot of what this looks like in my newsreader. URL below. Looks like something chomped on unicode pretty hard :-) http://www.panix.com/~roy/unicode.pdf Yii Roy is using MT-NewsWatcher as a client.
Yes. Except for the fact that it hasn't kept up with unicode, I find the U/I pretty much perfect. I imagine at some point I'll be forced to look elsewhere, but then again, netnews is pretty much dead.
BTW, don't upgrade to OS X 10.9 Mavericks if you're dependent on MT-NW; it finally stops working there because what was left of Open Transport support in OS X has finally been ripped out of 10.9.
Hmmm, good to know. I'm still on 10.7, and don't see any reason to move. But, then again, you'd expect that from somebody who's still on Python 2.x, wouldn't you?
Re: Unicode and Python - how often do you index strings?
In article roy-2a9d82.20100705062...@news.panix.com, Roy Smith r...@panix.com wrote: In article mailman.10781.1402009056.18130.python-l...@python.org, Ned Deily n...@acm.org wrote: Roy is using MT-NewsWatcher as a client. Yes. Except for the fact that it hasn't kept up with unicode, I find the U/I pretty much perfect. I imagine at some point I'll be forced to look elsewhere, but then again, netnews is pretty much dead.
I agree about the U/I, although I'm sure a lot of that has to do with familiarity. However, netnews isn't dead, it has just morphed a bit. A newsreader, like MT-NW, is great for following mailing lists like this (and most other Python-related lists) via gmane.org's bi-directional mailing list <-> NNTP gateways. And for this list it's usually better to read the mailing list variant via gmane.org NNTP than the Usenet group variant via a traditional USENET NNTP server because there's less spam with the former.
BTW, don't upgrade to OS X 10.9 Mavericks if you're dependent on MT-NW; it finally stops working there because what was left of Open Transport support in OS X has finally been ripped out of 10.9. Hmmm, good to know. I'm still on 10.7, and don't see any reason to move. But, then again, you'd expect that from somebody who's still on Python 2.x, wouldn't you?
Heh. Well, both 10.8 and 10.9 provided various improvements, both feature and performance, over 10.7. Alas, Apple won't likely be supporting 10.7 with security updates for as long as the PSF will be supporting 2.7.x. But, by then, you'll have had a chance to re-implement MT-NW in Python.
-- Ned Deily, n...@acm.org
Re: Unicode and Python - how often do you index strings?
Chris Angelico wrote: On Wed, Jun 4, 2014 at 11:18 AM, Roy Smith r...@panix.com wrote: <sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you? The pattern you wrote also matches "centee" and "centrr".</sarcasm> Maybe there's someone who spells it that way!
Come visit Pirate Island, the centrr of the universe!
-- Pegleg Greg
Re: Unicode and Python - how often do you index strings?
On 04/06/2014 01:39, Chris Angelico wrote: A current discussion regarding Python's Unicode support centres (or centers, depending on how close you are to the cent[er]{2} of the universe) around one critical question: Is string indexing common? Python strings can be indexed with integers to produce characters (strings of length 1). They can also be iterated over from beginning to end. Lots of operations can be built on either one of those two primitives; the question is, how much can NOT be implemented efficiently over iteration, and MUST use indexing? Theories are great, but solid use-cases are better - ideally, examples from actual production code (actual code optional). I know the collective experience of python-list can't fail to bring up a few solid examples here :) Thanks in advance, all!! ChrisA Single characters quite often, iteration rarely if ever, slicing all the time, but does that last one count? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence --- This email is free from viruses and malware because avast! Antivirus protection is active. http://www.avast.com -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
On Wed, Jun 4, 2014 at 6:22 PM, Mark Lawrence breamore...@yahoo.co.uk wrote: Single characters quite often, iteration rarely if ever, slicing all the time, but does that last one count? Yes, slicing counts. What matters here is the potential impact of internally representing strings as UTF-8 streams; when you ask for the Nth character, it would have to scan from either the beginning or end (more likely beginning) of the string and count, instead of doing what CPython 3.3+ does and simply look up the header to find out the kind, bit-shift the index by one less than that, and use that as a memory location. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
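To illustrate the cost being described, here is a rough sketch of what "find the Nth character" means on a UTF-8 buffer. The function name utf8_offset is invented for this example and is not how any Python implementation actually does it; the point is only that the lookup has to scan, whereas a fixed-width layout can compute the position directly.

```python
def utf8_offset(data, index):
    # Byte offset of the index-th code point in UTF-8 encoded bytes.
    # This is a linear scan: O(n) in the length of the string.
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("string index out of range")

text = "123αλφα"
data = text.encode("utf-8")
print([utf8_offset(data, i) for i in range(len(text))])
# [0, 1, 2, 3, 5, 7, 9]
```

With a fixed-width representation the same lookup is just index times the character width, with no loop at all.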
Re: Unicode and Python - how often do you index strings?
Mark Lawrence wrote: On 04/06/2014 01:39, Chris Angelico wrote: A current discussion regarding Python's Unicode support centres (or centers, depending on how close you are to the cent[er]{2} of the universe) around one critical question: Is string indexing common? Python strings can be indexed with integers to produce characters (strings of length 1). They can also be iterated over from beginning to end. Lots of operations can be built on either one of those two primitives; the question is, how much can NOT be implemented efficiently over iteration, and MUST use indexing? Theories are great, but solid use-cases are better - ideally, examples from actual production code (actual code optional). I know the collective experience of python-list can't fail to bring up a few solid examples here :) Thanks in advance, all!! ChrisA
Single characters quite often, iteration rarely if ever, slicing all the time, but does that last one count?
The indices used for slicing typically don't come out of nowhere. A simple example would be
    def strip_prefix(text, prefix):
        if text.startswith(prefix):
            text = text[len(prefix):]
        return text
If both prefix and text use UTF-8 internally the byte offset is already known. The question is then how we can preserve that information. The first approach that comes to mind is an int subtype:
    >>> for i, c in enumerate("123αλφα"):
    ...     print(i, byteoffset(i), c)
    ...
    0 0 1
    1 1 2
    2 2 3
    3 3 α
    4 5 λ
    5 7 φ
    6 9 α
This would work in the strip_prefix() example, but lead to data corruption in most other cases unless limited to a specific string -- in which case it would no longer work with strip_prefix(). So a new interface would be needed. My second try, an object with two byte offsets linked to a specific string:
    >>> span("foobar").startswith("oob")
    >>> p = span("foobar").startswith("foo")
    >>> p.replace("baz")
    'bazbar'
    >>> p.before()
    ''
    >>> p.after()
    'bar'
    >>> span("foo bar baz").find("bar").replace("spam")
    'foo spam baz'
I have no idea if that could work out...
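The "byte offset is already known" point can be seen by doing the same thing on encoded bytes by hand. This is only a sketch of the idea at the bytes level, with an invented function name; it says nothing about how a UTF-8 based str type would expose it.

```python
def strip_prefix_utf8(text_bytes, prefix_bytes):
    # Both arguments are assumed to be valid UTF-8. If the text starts
    # with the prefix, len(prefix_bytes) is already the byte offset of
    # the remainder, so no character counting is needed.
    if text_bytes.startswith(prefix_bytes):
        return text_bytes[len(prefix_bytes):]
    return text_bytes

text = "αλφα-beta".encode("utf-8")
prefix = "αλφα".encode("utf-8")
print(strip_prefix_utf8(text, prefix).decode("utf-8"))   # -beta
```

The slice lands on a character boundary because the prefix is itself complete, valid UTF-8.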
Re: Unicode and Python - how often do you index strings?
On Wed, Jun 4, 2014 at 8:10 PM, Peter Otten __pete...@web.de wrote: The indices used for slicing typically don't come out of nowhere. A simple example would be
    def strip_prefix(text, prefix):
        if text.startswith(prefix):
            text = text[len(prefix):]
        return text
If both prefix and text use UTF-8 internally the byte offset is already known. The question is then how we can preserve that information.
Almost completely useless. First off, it solves only the problem of operating on the string at exactly some point where you just got an index; and secondly, you don't always get that index from a string method. Suppose, for instance, that you iterate over a string thus:
    for i, ch in enumerate(string):
        if ch == '{':
            start = i
        elif ch == '}':
            return string[start:i+1]
Okay, so this could be done by searching, but for something more complicated, I can imagine it being better to enumerate. (But "I can imagine" is much weaker than "Here's code that we use in production", which is why I asked the question.)
Incidentally, the above code highlights the first problem too. With direct indexing, you can ask for inclusive or exclusive slicing by adding or subtracting one from the index. If you do that with a byte-position-retaining special integer, you lose the byte position.
ChrisA
Re: Unicode and Python - how often do you index strings?
On Tue, 03 Jun 2014 21:18:12 -0400, Roy Smith wrote: In article mailman.10656.1401842403.18130.python-l...@python.org, Chris Angelico ros...@gmail.com wrote: A current discussion regarding Python's Unicode support centres (or centers, depending on how close you are to the cent[er]{2} of the universe) <sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you? The pattern you wrote also matches "centee" and "centrr".</sarcasm>
<super pedant mode> The language is ENGLISH so the correct spelling is Centre regional variations my be common but they are incorrect </super pedant mode> :-)
-- Prepare for tomorrow -- get ready. -- Edith Keeler, "The City On the Edge of Forever", stardate unknown
Re: Unicode and Python - how often do you index strings?
On Wed, 04 Jun 2014 18:48:29 +1200, Gregory Ewing wrote: Chris Angelico wrote: On Wed, Jun 4, 2014 at 11:18 AM, Roy Smith r...@panix.com wrote: <sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you? The pattern you wrote also matches "centee" and "centrr".</sarcasm> Maybe there's someone who spells it that way! Come visit Pirate Island, the centrr of the universe!
That should be Cent-argh.
-- I hope the ``Eurythmics'' practice birth control ...
Re: Unicode and Python - how often do you index strings?
On Wednesday, June 4, 2014 4:20:01 PM UTC+5:30, alister wrote: The language is ENGLISH so the correct spelling is Centre regional variations my be common but they are incorrect my? O mee Oo my -- cockney (or Aussie) pedant?? -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
Le mercredi 4 juin 2014 02:39:54 UTC+2, Chris Angelico a écrit : A current discussion regarding Python's Unicode support centres (or centers, depending on how close you are to the cent[er]{2} of the universe) around one critical question: Is string indexing common? Python strings can be indexed with integers to produce characters (strings of length 1). They can also be iterated over from beginning to end. Lots of operations can be built on either one of those two primitives; the question is, how much can NOT be implemented efficiently over iteration, and MUST use indexing? Theories are great, but solid use-cases are better - ideally, examples from actual production code (actual code optional). I know the collective experience of python-list can't fail to bring up a few solid examples here :) Thanks in advance, all!! ChrisA = Like many, you are not understanding unicode because you do not understand the coding of characters. You do not understand the coding of the characters because you do not understand the mathematics behind it. You focussed on the wrong problem. (All this stuff has been discussed, tested and worked on 20 (twenty) years ago.) Sorry. jmf -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
On Wed, 04 Jun 2014 05:52:24 -0700, Rustom Mody wrote: On Wednesday, June 4, 2014 4:20:01 PM UTC+5:30, alister wrote: The language is ENGLISH so the correct spelling is Centre regional variations my be common but they are incorrect my? O mee Oo my -- cockney (or Aussie) pedant?? I made no claims about my typing or spelling being correct. That post was actually quite good fro me usually my typing is worse. -- The difference between genius and stupidity is that genius has its limits. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
On 06/04/2014 12:50 AM, wxjmfa...@gmail.com wrote:
> Like many, you are not understanding unicode because you do not understand the coding of characters.

If that is true, then I'm sure a well-written paragraph or two can set him straight. You continually berate people for not understanding unicode, but you've posted nothing to explain anything, nor demonstrate your own understanding. That's one reason your posts are so frustrating and considered trolling. You never ever explain yourself, instead just flailing around and muttering about folks not understanding unicode, just as you've done here, true to form.

> You do not understand the coding of the characters because you do not understand the mathematics behind it.

flamebaiting here... FSR *is* UTF-32 internally, compresses off leading zero bits during string creation.

> You focussed on the wrong problem.

Frankly it is you who is focused on the wrong problem, at least with this particular thread. I think you got distracted by the subject line. Chris's original post really has nothing to do with unicode at all. He's simply asking for use cases for string indexing where O(1) is desired or necessary. Could be old Python 2 byte strings, or Python 3 unicode strings. It does not matter. Unicode is orthogonal to his question.

Maybe his purpose in asking the question is to justify a fixed-length encoding scheme (which is what FSR actually is), or maybe it is to explore the costs of using a much slower, but more compact, variable-length encoding scheme like UTF-8. Particularly in the context of low-memory applications where unicode support would be nice, but memory is at a premium. But either way, you got hung up on the wrong thing.

> (All this stuff has been discussed, tested and worked on 20 (twenty) years ago.) Sorry.

As am I.
-- https://mail.python.org/mailman/listinfo/python-list
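The remark above about FSR dropping leading zero bits can be observed indirectly. The following is only a rough sketch, assuming CPython 3.3 or later (where the flexible string representation of PEP 393 is in use); `bytes_per_char` is a hypothetical helper invented for this illustration, and other Python implementations may report different numbers.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string lengths.

    Subtracting the two sizes cancels the fixed object header, leaving
    only the per-character cost. CPython-specific (assumption).
    """
    short = sys.getsizeof(ch * n)
    longer = sys.getsizeof(ch * (2 * n))
    return (longer - short) // n

# One sample character from each width class.
samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",       # e with acute accent
    "BMP": "\u20ac",           # euro sign
    "astral": "\U0001F600",    # an emoji outside the BMP
}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
for name, width in widths.items():
    print(name, width)
```

Every character in a given string is stored at the same width, which is why indexing stays O(1) under this scheme.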
Re: Unicode and Python - how often do you index strings?
Chris Angelico ros...@gmail.com Wrote in message:
> On Wed, Jun 4, 2014 at 8:10 PM, Peter Otten __pete...@web.de wrote:
>> The indices used for slicing typically don't come out of nowhere. A simple example would be
>>
>>     def strip_prefix(text, prefix):
>>         if text.startswith(prefix):
>>             text = text[len(prefix):]
>>         return text
>>
>> If both prefix and text use UTF-8 internally the byte offset is already known. The question is then how we can preserve that information.
>
> Almost completely useless. First off, it solves only the problem of operating on the string at exactly some point where you just got an index; and secondly, you don't always get that index from a string method. Suppose, for instance, that you iterate over a string thus:
>
>     for i, ch in enumerate(string):
>         if ch == '{': start = i
>         elif ch == '}': return string[start:i+1]
>
> Okay, so this could be done by searching, but for something more complicated, I can imagine it being better to enumerate. (But "I can imagine" is much weaker than "Here's code that we use in production", which is why I asked the question.)
>
> Incidentally, the above code highlights the first problem too. With direct indexing, you can ask for inclusive or exclusive slicing by adding or subtracting one from the index. If you do that with a byte-position-retaining special integer, you lose the byte position.
>
> ChrisA

A string could have two extra fields in it that hold index and offset for the most recent substring reference. Even though the string is immutable, nothing prevents mutable elements that are externally visible only by performance measurement. So a loop using a subscript of a string would tend to be faster even if written in a naive way.

It's also conceivable to build an array of such pairs in strings over a threshold size. So if you had a megabyte string, there might be 100 evenly spaced pairs, calculated when the string object is first created. And naturally there can be flags indicating that the particular string is pure ASCII.

Clearly this breaks down if there are two alternating references at different offsets, but I think this would be exceedingly rare.

-- DaveA
-- https://mail.python.org/mailman/listinfo/python-list
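The "two extra fields" idea above can be sketched in pure Python. This is a toy model under stated assumptions, not how any real interpreter stores strings: `CachedUtf8` and its `steps` counter are hypothetical names invented here, and only non-negative single-character indexing is handled.

```python
class CachedUtf8:
    """Toy string: UTF-8 storage plus a one-entry cache that remembers
    the (character index, byte offset) of the most recent lookup."""

    def __init__(self, text):
        self._data = text.encode("utf-8")
        self._length = len(text)
        self._last = (0, 0)   # (character index, byte offset)
        self.steps = 0        # bytes examined so far, for illustration only

    def __len__(self):
        return self._length

    def _offset(self, index):
        """Return the byte offset of character `index`, scanning forward
        from the cached position when possible."""
        char_i, byte_i = self._last
        if index < char_i:
            # Cache is past the target: fall back to a scan from the front.
            char_i, byte_i = 0, 0
        data = self._data
        while char_i < index:
            byte_i += 1
            self.steps += 1
            # Skip continuation bytes, which look like 0b10xxxxxx.
            while byte_i < len(data) and (data[byte_i] & 0xC0) == 0x80:
                byte_i += 1
                self.steps += 1
            char_i += 1
        self._last = (char_i, byte_i)
        return byte_i

    def __getitem__(self, index):
        if not 0 <= index < self._length:
            raise IndexError("string index out of range")
        start = self._offset(index)
        end = start + 1
        while end < len(self._data) and (self._data[end] & 0xC0) == 0x80:
            end += 1
        return self._data[start:end].decode("utf-8")


text = "na\u00efve caf\u00e9 \U0001F600!"
s = CachedUtf8(text)
# A naive subscript loop: each lookup resumes from the previous one.
chars = [s[i] for i in range(len(s))]
```

With the cache, the naive loop examines each byte about once overall; two alternating references at distant offsets would defeat it, as the message notes.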
Unicode and Python - how often do you index strings?
A current discussion regarding Python's Unicode support centres (or centers, depending on how close you are to the cent[er]{2} of the universe) around one critical question: Is string indexing common? Python strings can be indexed with integers to produce characters (strings of length 1). They can also be iterated over from beginning to end. Lots of operations can be built on either one of those two primitives; the question is, how much can NOT be implemented efficiently over iteration, and MUST use indexing? Theories are great, but solid use-cases are better - ideally, examples from actual production code (actual code optional). I know the collective experience of python-list can't fail to bring up a few solid examples here :) Thanks in advance, all!! ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
On 2014-06-04 10:39, Chris Angelico wrote:
> A current discussion regarding Python's Unicode support centres (or centers, depending on how close you are to the cent[er]{2} of the universe) around one critical question: Is string indexing common?
>
> Python strings can be indexed with integers to produce characters (strings of length 1). They can also be iterated over from beginning to end. Lots of operations can be built on either one of those two primitives; the question is, how much can NOT be implemented efficiently over iteration, and MUST use indexing? Theories are great, but solid use-cases are better - ideally, examples from actual production code (actual code optional).

Many of my string-indexing uses revolve around a sliding window which can be done with itertools[1], though I often just roll it as something like

    n = 3
    for i in range(1 + len(s) - n):
        do_something(s[i:i+n])

So that could be supplanted by the SO iterator linked below.

The other big use case I have from production code involves a column-offset delimited file where the headers have a row of underscores under them delimiting the field widths, so it looks something like

    EmpID  Name                Cost Center
    ------ ------------------- -----------
    314159 Longstocking, Pippi RJ45
    265358 Davis, Miles        JA22
    979328 Bell, Alexander     RJ15

I then take row 2 and use it to make a mapping of header-name to a slice-object for slicing the subsequent strings:

    import re
    r = re.compile('-+')  # a sequence of 1+ dashes
    f = open("data.txt")
    headers = next(f)
    lines = next(f)
    header_map = dict((
        headers[i.start():i.end()].strip().upper(),
        slice(i.start(), i.end())
        ) for i in r.finditer(lines)
        )
    for row in f:
        print("EmpID = %s" % row[header_map["EMPID"]].strip())
        print("Name = %s" % row[header_map["NAME"]].strip())
        # ...

which I presume uses string indexing under the hood. Perhaps there's a better way of doing that, but it's what I currently use to process these large-ish files (largest max out at 10-20MB each)

There might be other use-cases I've done, but those two leap to mind.

-tkc

[1] http://stackoverflow.com/questions/6822725/rolling-or-sliding-window-iterator-in-python
-- https://mail.python.org/mailman/listinfo/python-list
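For comparison with the indexed loop above, here is one way the sliding window might be written using iteration only. It is a sketch in the spirit of the linked iterator recipe, not a copy of it; `windows` is a name made up for this example.

```python
from collections import deque
from itertools import islice

def windows(iterable, n):
    """Yield successive length-n windows as strings, using iteration only."""
    it = iter(iterable)
    # Fill the first window; the deque drops its oldest item once it is full.
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield "".join(window)
    for ch in it:
        window.append(ch)
        yield "".join(window)

s = "abcde"
n = 3
by_iteration = list(windows(s, n))
# The indexed version from the message, for comparison.
by_indexing = [s[i:i + n] for i in range(1 + len(s) - n)]
```

Both lists hold the same windows, so this particular use does not need O(1) indexing.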
Re: Unicode and Python - how often do you index strings?
In article mailman.10656.1401842403.18130.python-l...@python.org, Chris Angelico ros...@gmail.com wrote:
> A current discussion regarding Python's Unicode support centres (or centers, depending on how close you are to the cent[er]{2} of the universe)

<sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you? The pattern you wrote also matches centee and centrr.</sarcasm>

> around one critical question: Is string indexing common?

Not in our code. I've got 80008 non-blank lines of Python (2.7) source handy. I tried a few heuristics to find patterns which might be string indexing.

    $ find . -name '*.py' | xargs egrep '\[[^]][0-9]+\]'

and then looked them over manually. I see this pattern a bunch of times (in a single-use script):

    data['shard_key'] = hashlib.md5(str(id)).hexdigest()[:4]

We do this once:

    if tz_offset[0] == '-':

We do this somewhere in some command-line parsing:

    process_match = args.process[:15]

There's this little gem:

    return [dedup(x[1:-1].lower()) for x in re.findall('(\[[^\]\[]+\]|\([^\)\(]+\))',title)]

It appears I wrote this one, but I don't remember exactly what I had in mind at the time...

    withhyphen = number if '-' in number else (number[:-2] + '-' + number[-2:])  # big assumption here

Anyway, there's a bunch more, but the bottom line is that in our code, indexing into a string (at least explicitly in application source code) is a pretty rare thing.
-- https://mail.python.org/mailman/listinfo/python-list
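The regex quibble above is easy to check. A minimal sketch (the word list and variable names are chosen here for illustration; `re.fullmatch` needs Python 3.4 or later):

```python
import re

loose = re.compile(r"cent[er]{2}")    # the pattern from the original post
strict = re.compile(r"cent(er|re)")   # the suggested correction

words = ["center", "centre", "centee", "centrr"]

# fullmatch requires the whole word to match the pattern.
loose_hits = [w for w in words if loose.fullmatch(w)]
strict_hits = [w for w in words if strict.fullmatch(w)]

print("cent[er]{2} matches:", loose_hits)
print("cent(er|re) matches:", strict_hits)
```

The character class `[er]{2}` accepts any two of "e" and "r" in any combination, while the alternation accepts only the two real spellings.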
Re: Unicode and Python - how often do you index strings?
On 06/03/2014 05:39 PM, Chris Angelico wrote: A current discussion regarding Python's Unicode support centres (or centers, depending on how close you are to the cent[er]{2} of the universe) around one critical question: Is string indexing common? I use it quite a bit, but the strings are usually quite small (well under 100 characters) so an implementation that wasn't O(1) would not hurt me much. -- ~Ethan~ -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode and Python - how often do you index strings?
On Wed, Jun 4, 2014 at 11:18 AM, Roy Smith r...@panix.com wrote:
> In article mailman.10656.1401842403.18130.python-l...@python.org, Chris Angelico ros...@gmail.com wrote:
>> A current discussion regarding Python's Unicode support centres (or centers, depending on how close you are to the cent[er]{2} of the universe)
>
> <sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you? The pattern you wrote also matches centee and centrr.</sarcasm>

Maybe there's someone who spells it that way! Let's not be excluding people. That'd be rude.

>> around one critical question: Is string indexing common?
>
> Not in our code. I've got 80008 non-blank lines of Python (2.7) source handy. I tried a few heuristics to find patterns which might be string indexing.
>
>     $ find . -name '*.py' | xargs egrep '\[[^]][0-9]+\]'
>
> and then looked them over manually. I see this pattern a bunch of times (in a single-use script):
>
>     data['shard_key'] = hashlib.md5(str(id)).hexdigest()[:4]

Slicing is a form of indexing too, although in this case (slicing from the front) it could be implemented on top of UTF-8 without much problem.

>     withhyphen = number if '-' in number else (number[:-2] + '-' + number[-2:])  # big assumption here

This *definitely* counts; if strings were represented internally in UTF-8, this would involve two scans (although a smart implementation could probably count backward rather than forward). By the way, any time you slice up to the third from the end, you win two extra awesome points, just for putting [:-3] into your code and having it mean something. But I digress.

> Anyway, there's a bunch more, but the bottom line is that in our code, indexing into a string (at least explicitly in application source code) is a pretty rare thing.

Thanks. Of course, the pattern you searched for is looking only for literals; it's a bit harder to find cases where the index (or slice position) comes from a variable or expression, and those situations are also rather harder to optimize (the MD5 prefix is clearly better scanned from the front, the number tail is clearly better scanned from the back - but with a variable?).

ChrisA
-- https://mail.python.org/mailman/listinfo/python-list
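The "count backward rather than forward" remark can be made concrete. What follows is a hedged sketch over a plain `bytes` object, not a description of any real string implementation; `tail_offset` and `split_tail` are hypothetical helpers, and the sample number is made up.

```python
def tail_offset(data, count):
    """Return the byte offset at which the last `count` characters of
    the UTF-8 bytes `data` begin, scanning from the end."""
    pos = len(data)
    while count > 0 and pos > 0:
        pos -= 1
        # Step back over continuation bytes (0b10xxxxxx) to the lead byte.
        while pos > 0 and (data[pos] & 0xC0) == 0x80:
            pos -= 1
        count -= 1
    return pos

def split_tail(text, count):
    """Split text into (text[:-count], text[-count:]) for count >= 1,
    working on the UTF-8 encoding and touching only the tail bytes."""
    data = text.encode("utf-8")
    cut = tail_offset(data, count)
    return data[:cut].decode("utf-8"), data[cut:].decode("utf-8")

number = "5551234"
head, tail = split_tail(number, 2)
withhyphen = head + "-" + tail

# The same split on text with multi-byte characters.
mixed = "na\u00efve\U0001F600\u00e9"
mixed_parts = split_tail(mixed, 2)
```

Scanning from the back costs time proportional to the size of the tail rather than the whole string, which is the saving hinted at for `number[-2:]`.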
Re: Unicode and Python - how often do you index strings?
On Wed, Jun 4, 2014 at 11:11 AM, Tim Chase python.l...@tim.thechases.com wrote:
> I then take row 2 and use it to make a mapping of header-name to a slice-object for slicing the subsequent strings:
>
>     slice(i.start(), i.end())
>
>     print("EmpID = %s" % row[header_map["EMPID"]].strip())
>     print("Name = %s" % row[header_map["NAME"]].strip())
>
> which I presume uses string indexing under the hood.

Yes, it's definitely going to be indexing. If strings were represented internally in UTF-8, each of those calls would need to scan from the beginning of the string, counting and discarding characters until it finds the place to start, then counting and retaining characters until it finds the place to stop. Definite example of what I'm looking for, thanks!

ChrisA
-- https://mail.python.org/mailman/listinfo/python-list
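The scan described above (count and discard up to the start, then count and retain up to the stop) might look roughly like this over raw UTF-8 bytes. It is an illustrative sketch only: `utf8_slice` is a made-up helper, it assumes 0 <= start <= stop, and the sample row is invented in the style of the fixed-width file.

```python
def utf8_slice(data, start, stop):
    """Return the bytes for characters [start:stop) of UTF-8 `data`,
    found by a linear scan from the beginning."""
    begin = end = None
    char_index = 0
    for pos, byte in enumerate(data):
        if (byte & 0xC0) == 0x80:
            continue              # continuation byte: not a new character
        if char_index == start:
            begin = pos           # done discarding, start retaining here
        if char_index == stop:
            end = pos             # done retaining
            break
        char_index += 1
    # Positions past the end of the string clamp to its length.
    if begin is None:
        begin = len(data)
    if end is None:
        end = len(data)
    return data[begin:end]

row = "314159 L\u00f6nnrot, Elias      RJ45"
encoded = row.encode("utf-8")
name_field = utf8_slice(encoded, 7, 26).decode("utf-8")
```

The work grows with the stop position, so slicing a late column out of a long row costs more than slicing an early one, unlike O(1) indexing.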
Re: Unicode and Python - how often do you index strings?
On 2014-06-04 12:16, Chris Angelico wrote:
> On Wed, Jun 4, 2014 at 11:11 AM, Tim Chase python.l...@tim.thechases.com wrote:
>> I then take row 2 and use it to make a mapping of header-name to a slice-object for slicing the subsequent strings:
>>
>>     slice(i.start(), i.end())
>>
>>     print("EmpID = %s" % row[header_map["EMPID"]].strip())
>>     print("Name = %s" % row[header_map["NAME"]].strip())
>>
>> which I presume uses string indexing under the hood.
>
> Yes, it's definitely going to be indexing. If strings were represented internally in UTF-8, each of those calls would need to scan from the beginning of the string, counting and discarding characters until it finds the place to start, then counting and retaining characters until it finds the place to stop. Definite example of what I'm looking for, thanks!

For what it's worth, most of the lines in each file are under ~2k, so even O(N) or O(log N) indexing wouldn't be grievous. Noticeable, but not grievous. Glad my example could give you some fodder.

-tkc
-- https://mail.python.org/mailman/listinfo/python-list