Re: Unicode and Python - how often do you index strings?

2014-06-06 Thread Johannes Bauer
On 05.06.2014 20:52, Ryan Hiebert wrote:
 2014-06-05 13:42 GMT-05:00 Johannes Bauer dfnsonfsdu...@gmx.de:
 
 On 05.06.2014 20:16, Paul Rubin wrote:
 Johannes Bauer dfnsonfsdu...@gmx.de writes:
 line = line[:-1]
 Which truncates the trailing \n of a textfile line.

 use line.rstrip() for that.

 rstrip has different functionality than what I'm doing.
 
 How so? I was using line=line[:-1] for removing the trailing newline, and
 just replaced it with rstrip('\n'). What are you doing differently?

Ah, I didn't know rstrip() accepted parameters and since you wrote
line.rstrip() this would also cut away whitespaces (which sadly are
relevant in odd cases).

Thanks for the clarification, I'll definitely introduce that.

Cheers,
Johannes

-- 
 Where EXACTLY did you predict the quake, again?
 At least not publicly!
Ah, the latest and to this day most ingenious trick of our great
cosmologists: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode and Python - how often do you index strings?

2014-06-06 Thread Johannes Bauer
On 05.06.2014 22:18, Ian Kelly wrote:

 Personally I tend toward rstrip('\r\n') so that I don't have to worry
 about files with alternative line terminators.

Hm, I was under the impression that Python already took care of removing
the \r at a line ending. Checking that right now:

(DOS encoded file y)
 for line in open("y", "r"): print(line.encode("utf-8"))
...
b'foo\n'
b'bar\n'
b'moo\n'
b'koo\n'

Yup, the \r was removed automatically. Are there cases when it isn't?

Cheers,
Johannes



Re: Unicode and Python - how often do you index strings?

2014-06-06 Thread Tim Chase
On 2014-06-06 10:47, Johannes Bauer wrote:
  Personally I tend toward rstrip('\r\n') so that I don't have to
  worry about files with alternative line terminators.
 
 Hm, I was under the impression that Python already took care of
 removing the \r at a line ending. Checking that right now:
 
 (DOS encoded file y)
  for line in open("y", "r"): print(line.encode("utf-8"))
 ...
 b'foo\n'
 b'bar\n'
 b'moo\n'
 b'koo\n'
 
 Yup, the \r was removed automatically. Are there cases when it
 isn't?

It's possible if the file is opened as binary:

 f = file('delme.txt', 'wb')
 f.write('hello\r\nworld\r\n')
 f.close()
 f = file('delme.txt', 'rb')
 for row in f: print repr(row)
... 
'hello\r\n'
'world\r\n'
 f.close()


-tkc



Re: Unicode and Python - how often do you index strings?

2014-06-06 Thread Steven D'Aprano
On Fri, 06 Jun 2014 10:47:44 +0200, Johannes Bauer wrote:

 Hm, I was under the impression that Python already took care of removing
 the \r at a line ending. Checking that right now:
[snip example]

This is called Universal Newlines. Technically it is a build-time 
option which applies when you build the Python interpreter from source, 
so, yes, some Pythons may not implement it at all. But I think that it 
has been on by default for a long time, and the option to turn it off may 
have been removed in Python 3.3 or 3.4. In practical terms, you should 
normally expect it to be on.


Here's the PEP that introduced it: 
http://legacy.python.org/dev/peps/pep-0278/


The idea is that when universal newlines support is enabled, Python will 
by default convert any of \n, \r or \r\n into \n when reading from a file in 
text mode, and convert back the other way when writing the file.

In binary mode, newlines are *never* changed.

In Python 3, you can return end-of-lines unchanged by passing newline='' 
to the open() function.

https://docs.python.org/2/library/functions.html#open
https://docs.python.org/3/library/functions.html#open
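
For anyone who wants to check this at the prompt, here is a minimal sketch, assuming Python 3. The helper name read_lines and the sample bytes are made up for illustration; the temporary file simply stands in for the DOS-encoded file from earlier in the thread.

```python
import os
import tempfile


def read_lines(data, **open_kwargs):
    """Write raw bytes to a temporary file, then read it back line by line."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        with open(path, **open_kwargs) as f:
            return list(f)
    finally:
        os.remove(path)


dos = b"foo\r\nbar\r\n"

# Text mode with the default newline handling: \r\n is translated to \n.
text_default = read_lines(dos, mode="r", encoding="utf-8")

# Text mode with newline='': lines are still split, but terminators are kept.
text_untranslated = read_lines(dos, mode="r", encoding="utf-8", newline="")

# Binary mode: bytes come back untouched.
binary = read_lines(dos, mode="rb")

print(text_default)       # ['foo\n', 'bar\n']
print(text_untranslated)  # ['foo\r\n', 'bar\r\n']
print(binary)             # [b'foo\r\n', b'bar\r\n']
```

So rstrip('\r\n') only starts to matter when the file was opened in binary mode or with newline=''.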




-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Unicode and Python - how often do you index strings?

2014-06-06 Thread Grant Edwards
On 2014-06-06, Roy Smith r...@panix.com wrote:

 Roy is using MT-NewsWatcher as a client.

 Yes.  Except for the fact that it hasn't kept up with unicode, I find 
 the U/I pretty much perfect.  I imagine at some point I'll be forced to 
 look elsewhere, but then again, netnews is pretty much dead.

There are still a few active groups, but reading e-mail lists via NNTP
(in my case using slrn) via gmane is a huge reason to have an
efficient, well-designed news client.

If usenet does really pack it in someday and I have to switch from
comp.lang.python to the mailing list, it will be done by pointing slrn
at news.gmane.org -- not by having all those e-mails sent to me so I
can try to sort through them...

-- 
Grant Edwards   grant.b.edwards at gmail.com   Yow! My NOSE is NUMB!


Re: Unicode and Python - how often do you index strings?

2014-06-06 Thread Larry Hudson

On 06/06/2014 01:42 AM, Johannes Bauer wrote:
snip

Ah, I didn't know rstrip() accepted parameters and since you wrote
line.rstrip() this would also cut away whitespaces (which sadly are
relevant in odd cases).



No problem.  If a parameter is used in the strip() family, then _only_ those characters are 
stripped.  Example:


 s = 'some text \n'
 print('{}'.format(s.rstrip()))  #  No parameter, strip all whitespace
some text
 print('{}'.format(s.rstrip('\n')))  #  Parameter is newline, only strip newlines
some text 

 -=- Larry

BTW, the strip() parameter (which must be a string) is not limited to whitespace; it is treated 
as a set of characters, and any of them is stripped.
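
To make the "set of characters" point concrete, here is a small sketch, assuming Python 3. The sample strings and variable names are invented for the illustration.

```python
s = "  some text \t\n"

# No argument: all trailing whitespace goes (space, tab, newline).
no_arg = s.rstrip()

# Argument given: only the listed characters are stripped.
newline_only = s.rstrip("\n")

# The argument is a set of characters, not a suffix: x, y and z are
# removed from the end in any order and any number.
char_set = "abcxyzzy".rstrip("xyz")

# It works for non-whitespace characters too.
slashes = "path/to/dir///".rstrip("/")

print(repr(no_arg))        # '  some text'
print(repr(newline_only))  # '  some text \t'
print(repr(char_set))      # 'abc'
print(repr(slashes))       # 'path/to/dir'
```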




Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Marko Rauhamaa
wxjmfa...@gmail.com:

 Unicode ?
 I have the feeling is similar as explaining,
 i (the imaginary number) is not equal to
 sqrt(-1).

 jmf

 PS Once I gave you a link pointing
 to unicode.org doc, you obviously did not read it.

Sir, you are an artist, a poet even!

With admiration,


Marko


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread alister
On Thu, 05 Jun 2014 00:06:54 -0700, wxjmfauth wrote:

 Le mercredi 4 juin 2014 16:50:59 UTC+2, Michael Torrie a écrit :
 On 06/04/2014 12:50 AM, wxjmfa...@gmail.com wrote:
 
  Like many, you are not understanding unicode because
 
  you do not understand the coding of characters.
 
 
 
 If that is true, then I'm sure a well-written paragraph or two can set
 
 him straight.  You continually berate people for not understanding
 
 unicode, but you've posted nothing to explain anything, nor demonstrate
 
 your own understanding.  That's one reason your posts are so
 frustrating
 
 and considered trolling.  You never ever explain yourself, instead just
 
 flailing around and muttering about folks not understanding unicode,
 
 just as you've done here, true to form.
 
 
 
 
  
  You do not understand the coding of the characters
 
  because you do not understand the mathematics behind it.
 
 
 
 flamebaiting here... FSR *is* UTF-32 internally, compresses off leading
 
 zero bits during string creation.
 
 
 
  You focussed on the wrong problem.
 
 
 
 Frankly it is you who is focused on the wrong problem, at least with
 
 this particular thread.  I think you got distracted by the subject
 line.
 
  Chris's original post really has nothing to do with unicode at all.
 
 He's simply asking for use cases for string indexing where O(1) is
 
 desired or necessary.  Could be old Python 2 byte strings, or Python 3
 
 unicode strings.  It does not matter.  Unicode is orthogonal to his
 
 question.
 
 
 
 Maybe his purpose in asking the question is to justify a fixed-length
 
 encoding scheme (which is what FSR actually is), or maybe it is to
 
 explore the costs of using a much slower, but more compact,
 
 variable-length encoding scheme like UTF-8.  Particularly in the
 context
 
 of low-memory applications where unicode support would be nice, but
 
 memory is at a premium.  But either way, you got hung up on the wrong
 thing.
 
 
 
 
  
  (All this stuff has been discussed, tested and worked on
 
  20 (twenty) years ago.)
 
 
  
  Sorry.
 
 
 
 As am I.
 
 =
 
 Unicode ?
 I have the feeling is similar as explaining,
 i (the imaginary number) is not equal to sqrt(-1).
 
 jmf
 
 PS Once I gave you a link pointing to unicode.org doc, you obviously did
 not read it.



And you have many times been given a link explaining the problems with 
posting from Google Groups but deliberately choose to not make your 
replies readable.

-- 
If you're not part of the solution, you're part of the precipitate.


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Mark H Harris

On 6/5/14 10:39 AM, alister wrote:

{snipped all the mess}

And you have many times been given a link explaining the problems with
posting from Google Groups but deliberately choose to not make your
replies readable.



The problem is that things look fine in Google Groups. What helps is 
getting to see what the mess looks like from Thunderbird or equivalent.





Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Johannes Bauer
On 04.06.2014 02:39, Chris Angelico wrote:

 I know the collective experience of python-list can't fail to bring up
 a few solid examples here :)

Just also grepped lots of code and have surprisingly few instances of
index-search. Most are with constant indices. One particular example
that comes up a lot is

line = line[:-1]

Which truncates the trailing \n of a textfile line.

Then some indexing in the form of

negative = (line[0] == "-")

All in all I'm actually a bit surprised this isn't too common.

Cheers,
Johannes




Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Mark Lawrence

On 05/06/2014 16:57, Mark H Harris wrote:

On 6/5/14 10:39 AM, alister wrote:

{snipped all the mess}

And you have many times been given a link explaining the problems with
posting from Google Groups but deliberately choose to not make your
replies readable.



The problem is that things look fine in Google Groups. What helps is
getting to see what the mess looks like from Thunderbird or equivalent.



Wrong.  99.99% of people when asked politely take action so there is no 
problem.  The remaining 0.01% consists of one complete ignoramus.


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence

---
This email is free from viruses and malware because avast! Antivirus protection 
is active.
http://www.avast.com




Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Joshua Landau
On 4 June 2014 15:50, Michael Torrie torr...@gmail.com wrote:
 On 06/04/2014 12:50 AM, wxjmfa...@gmail.com wrote:
 [Things]

 [Reply to things]

Please. Just don't.


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread alister
On Thu, 05 Jun 2014 18:15:31 +0100, Mark Lawrence wrote:

 The problem is that things look fine in Google Groups. What helps is
 getting to see what the mess looks like from Thunderbird or equivalent.


 Wrong.  99.99% of people when asked politely take action so there is no
 problem.  The remaining 0.01% consists of one complete ignoramus.

Who has actively stated he will not change.
It's pretty much the same attitude he shows in constantly saying Python's Unicode 
implementation is broken* without any valid supporting evidence.  


* Not just incomplete or inefficient but irrevocably broken.
 


-- 
Yow!  It's some people inside the wall!  This is better than mopping!


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Paul Rubin
Johannes Bauer dfnsonfsdu...@gmx.de writes:
 line = line[:-1]
 Which truncates the trailing \n of a textfile line.

use line.rstrip() for that.


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Johannes Bauer
On 05.06.2014 20:16, Paul Rubin wrote:
 Johannes Bauer dfnsonfsdu...@gmx.de writes:
 line = line[:-1]
 Which truncates the trailing \n of a textfile line.
 
 use line.rstrip() for that.

rstrip has different functionality than what I'm doing.

Cheers,
Johannes



Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Ryan Hiebert
2014-06-05 13:42 GMT-05:00 Johannes Bauer dfnsonfsdu...@gmx.de:

 On 05.06.2014 20:16, Paul Rubin wrote:
  Johannes Bauer dfnsonfsdu...@gmx.de writes:
  line = line[:-1]
  Which truncates the trailing \n of a textfile line.
 
  use line.rstrip() for that.

 rstrip has different functionality than what I'm doing.


How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Chris Angelico
On Fri, Jun 6, 2014 at 4:52 AM, Ryan Hiebert r...@ryanhiebert.com wrote:
 2014-06-05 13:42 GMT-05:00 Johannes Bauer dfnsonfsdu...@gmx.de:

 On 05.06.2014 20:16, Paul Rubin wrote:
  Johannes Bauer dfnsonfsdu...@gmx.de writes:
  line = line[:-1]
  Which truncates the trailing \n of a textfile line.
 
  use line.rstrip() for that.

 rstrip has different functionality than what I'm doing.


 How so? I was using line=line[:-1] for removing the trailing newline, and
 just replaced it with rstrip('\n'). What are you doing differently?

 line = "Hello,\nworld!\n\n"
 line[:-1]
'Hello,\nworld!\n'
 line.rstrip('\n')
'Hello,\nworld!'

If it's guaranteed to end with exactly one newline, then and only then
will they be identical.
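
The same comparison over the three interesting cases (no newline, exactly one, more than one) can be sketched like this, assuming Python 3. The helper names chop and strip_newlines are made up here, loosely borrowing Perl's vocabulary.

```python
def chop(line):
    """Drop the last character, whatever it is (the line[:-1] approach)."""
    return line[:-1]


def strip_newlines(line):
    """Drop every trailing newline (the rstrip approach with a newline argument)."""
    return line.rstrip("\n")


samples = ["no newline", "one newline\n", "two newlines\n\n"]
results = [(s, chop(s), strip_newlines(s)) for s in samples]

for original, chopped, stripped in results:
    # The two only agree when the string ends with exactly one newline.
    print(repr(original), "->", repr(chopped), "vs", repr(stripped),
          "same" if chopped == stripped else "DIFFERENT")
```

Note that the slice silently eats the last real character when a final line has no terminator at all, which is easy to hit with files that do not end in a newline.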

ChrisA


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Paul Rubin
Ryan Hiebert r...@ryanhiebert.com writes:
 How so? I was using line=line[:-1] for removing the trailing newline, and
 just replaced it with rstrip('\n'). What are you doing differently?

rstrip removes all the newlines off the end, whether there are zero or
multiple.  In perl the difference is chomp vs chop.  line=line[:-1]
removes one character, that might or might not be a newline.


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Ryan Hiebert
On Thu, Jun 5, 2014 at 2:59 PM, Chris Angelico ros...@gmail.com wrote:

 On Fri, Jun 6, 2014 at 4:52 AM, Ryan Hiebert r...@ryanhiebert.com wrote:
  2014-06-05 13:42 GMT-05:00 Johannes Bauer dfnsonfsdu...@gmx.de:
 
  On 05.06.2014 20:16, Paul Rubin wrote:
   Johannes Bauer dfnsonfsdu...@gmx.de writes:
   line = line[:-1]
   Which truncates the trailing \n of a textfile line.
  
   use line.rstrip() for that.
 
  rstrip has different functionality than what I'm doing.
 
 
  How so? I was using line=line[:-1] for removing the trailing newline, and
  just replaced it with rstrip('\n'). What are you doing differently?

  line = "Hello,\nworld!\n\n"
  line[:-1]
 'Hello,\nworld!\n'
  line.rstrip('\n')
 'Hello,\nworld!'

 If it's guaranteed to end with exactly one newline, then and only then
 will they be identical.

  OK, that's not an issue for my case, and additionally I'm using the
open(_, 'U') file iterable, so I shouldn't see multiple trailing newlines
anyway.


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Ian Kelly
On Thu, Jun 5, 2014 at 1:58 PM, Paul Rubin no.email@nospam.invalid wrote:
 Ryan Hiebert r...@ryanhiebert.com writes:
 How so? I was using line=line[:-1] for removing the trailing newline, and
 just replaced it with rstrip('\n'). What are you doing differently?

 rstrip removes all the newlines off the end, whether there are zero or
 multiple.  In perl the difference is chomp vs chop.  line=line[:-1]
 removes one character, that might or might not be a newline.

Given the description that the input string is a textfile line, if
it has multiple newlines then it's invalid.

Personally I tend toward rstrip('\r\n') so that I don't have to worry
about files with alternative line terminators.

If you want to be really picky about removing exactly one line
terminator, then this captures all the relatively modern variations:
re.sub('\r?\n$|\n?\r$', '', line, count=1)
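
A minimal sketch of the "remove exactly one terminator" idea, assuming Python 3. The name chomp is borrowed from Perl and is not a Python builtin; the pattern is a variation on the one above that anchors with \Z (end of string only) rather than $.

```python
import re

# One trailing terminator: \r\n, \n\r, a lone \n or a lone \r, at the very end.
_TERMINATOR = re.compile(r"(?:\r\n|\n\r|\n|\r)\Z")


def chomp(line):
    """Remove at most one trailing line terminator and nothing else."""
    return _TERMINATOR.sub("", line, count=1)


print(repr(chomp("foo\r\n")))  # 'foo'
print(repr(chomp("foo\n\n")))  # 'foo\n'  (only one terminator removed)
print(repr(chomp("foo \n")))   # 'foo '   (trailing space kept)
print(repr(chomp("foo")))      # 'foo'    (nothing to remove)
```

Keep in mind that re.sub takes the pattern, then the replacement, then the string to operate on.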


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Albert-Jan Roskam






- Original Message -
 From: Ian Kelly ian.g.ke...@gmail.com
 To: Python python-list@python.org
 Cc: 
 Sent: Thursday, June 5, 2014 10:18 PM
 Subject: Re: Unicode and Python - how often do you index strings?
 
 On Thu, Jun 5, 2014 at 1:58 PM, Paul Rubin no.email@nospam.invalid wrote:
  Ryan Hiebert r...@ryanhiebert.com writes:
  How so? I was using line=line[:-1] for removing the trailing newline, and
  just replaced it with rstrip('\n'). What are you doing differently?
 
  rstrip removes all the newlines off the end, whether there are zero or
  multiple.  In perl the difference is chomp vs chop.  line=line[:-1]
  removes one character, that might or might not be a newline.
 
 Given the description that the input string is a textfile line, if
 it has multiple newlines then it's invalid.
 
 Personally I tend toward rstrip('\r\n') so that I don't have to worry
 about files with alternative line terminators.

I tend to use: s.rstrip(os.linesep)

 If you want to be really picky about removing exactly one line
 terminator, then this captures all the relatively modern variations:
 re.sub('\r?\n$|\n?\r$', '', line, count=1)

or perhaps: re.sub(r"[^ \S]+$", "", line)


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Roy Smith
In article mailman.10767.1402000635.18130.python-l...@python.org,
 Albert-Jan Roskam fo...@yahoo.com wrote:

 





- Original Message -
 From: Ian Kelly ian.g.ke...@gmail.com
 To: Python python-list@python.org
 Cc: 
 Sent: Thursday, June 5, 2014 10:18 PM
 Subject: Re: Unicode and Python - how often do you index strings?
 
 On Thu, Jun 5, 2014 at 1:58 PM, Paul Rubin no.email@nospam.invalid wrote:
  Ryan Hiebert r...@ryanhiebert.com writes:
  How so? I was using line=line[:-1] for removing the trailing newline, and
  just replaced it with rstrip('\n'). What are you doing differently?
 
  rstrip removes all the newlines off the end, whether there are zero or
  multiple.  In perl the difference is chomp vs chop.  line=line[:-1]
  removes one character, that might or might not be a newline.
 
 Given the description that the input string is a textfile line, if
 it has multiple newlines then it's invalid.
 
 Personally I tend toward rstrip('\r\n') so that I don't have to worry
 about files with alternative line terminators.

I tend to use: s.rstrip(os.linesep)

 If you want to be really picky about removing exactly one line
 terminator, then this captures all the relatively modern variations:
 re.sub('\r?\n$|\n?\r$', '', line, count=1)

or perhaps: re.sub(r"[^ \S]+$", "", line)

Just for fun, I took a screen-shot of what this looks like in my 
newsreader.  URL below.  Looks like something chomped on unicode pretty 
hard :-)

http://www.panix.com/~roy/unicode.pdf


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Rustom Mody
On Friday, June 6, 2014 2:30:26 AM UTC+5:30, Roy Smith wrote:
 Just for fun, I took a screen-shot of what this looks like in my 
 newsreader.  URL below.  Looks like something chomped on unicode pretty 
 hard :-)
  
 http://www.panix.com/~roy/unicode.pdf

Yii


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Ned Deily
In article 8681edf0-7a1f-4110-9f87-a8cd0988c...@googlegroups.com,
 Rustom Mody rustompm...@gmail.com wrote:

 On Friday, June 6, 2014 2:30:26 AM UTC+5:30, Roy Smith wrote:
  Just for fun, I took a screen-shot of what this looks like in my 
  newsreader.  URL below.  Looks like something chomped on unicode pretty 
  hard :-)
   
  http://www.panix.com/~roy/unicode.pdf
 
 Yii

Roy is using MT-NewsWatcher as a client.  Because its codebase's origins 
are back in classic MacOS (<= 9), it has its own *interesting* ways to 
deal with encodings.  BTW, don't upgrade to OS X 10.9 Mavericks if 
you're dependent on MT-NW; it finally stops working there because what 
was left of Open Transport support in OS X has finally been ripped out 
of 10.9.

-- 
 Ned Deily,
 n...@acm.org



Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Ian Kelly
On Thu, Jun 5, 2014 at 2:34 PM, Albert-Jan Roskam fo...@yahoo.com wrote:
 If you want to be really picky about removing exactly one line
 terminator, then this captures all the relatively modern variations:
 re.sub('\r?\n$|\n?\r$', '', line, count=1)

 or perhaps: re.sub(r"[^ \S]+$", "", line)

That will remove more than one terminator, plus tabs. Points for
including \f and \v though.

I suppose if we want to be absolutely correct, we should follow the
Unicode standard:
re.sub(r'\r?\n$|[\r\v\f\x85\u2028\u2029]$', '', line, count=1)
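
As a rough sketch of the Unicode-aware variant, assuming Python 3.3 or later (where the re module understands \u escapes in patterns): the name chomp_unicode is made up, and \Z is used instead of $ so that only the very end of the string is considered. str.splitlines() already treats these characters as line boundaries, which the last line illustrates.

```python
import re

# \r\n as one unit, otherwise a single terminator character at the very end.
_UNICODE_EOL = re.compile(r"(?:\r\n|[\n\r\v\f\x85\u2028\u2029])\Z")


def chomp_unicode(line):
    """Remove at most one trailing line terminator from the set above."""
    return _UNICODE_EOL.sub("", line, count=1)


print(repr(chomp_unicode("foo\u2028")))  # 'foo'
print(repr(chomp_unicode("foo\r\n")))    # 'foo'
print(repr(chomp_unicode("foo\t\n")))    # 'foo\t'  (the tab survives)

# str.splitlines() splits on the same characters.
pieces = "a\u2028b\x85c".splitlines()
print(pieces)                            # ['a', 'b', 'c']
```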


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Roy Smith
In article mailman.10781.1402009056.18130.python-l...@python.org,
 Ned Deily n...@acm.org wrote:

 In article 8681edf0-7a1f-4110-9f87-a8cd0988c...@googlegroups.com,
  Rustom Mody rustompm...@gmail.com wrote:
 
  On Friday, June 6, 2014 2:30:26 AM UTC+5:30, Roy Smith wrote:
   Just for fun, I took a screen-shot of what this looks like in my 
   newsreader.  URL below.  Looks like something chomped on unicode pretty 
   hard :-)

   http://www.panix.com/~roy/unicode.pdf
  
  Yii
 
 Roy is using MT-NewsWatcher as a client.

Yes.  Except for the fact that it hasn't kept up with unicode, I find 
the U/I pretty much perfect.  I imagine at some point I'll be forced to 
look elsewhere, but then again, netnews is pretty much dead.

 BTW, don't upgrade to OS X 10.9 Mavericks if you're dependent on 
 MT-NW; it finally stops working there because what was left of Open 
 Transport support in OS X has finally been ripped out of 10.9.

Hmmm, good to know.  I'm still on 10.7, and don't see any reason to 
move.  But, then again, you'd expect that from somebody who's still on 
Python 2.x, wouldn't you?


Re: Unicode and Python - how often do you index strings?

2014-06-05 Thread Ned Deily
In article roy-2a9d82.20100705062...@news.panix.com,
 Roy Smith r...@panix.com wrote:
 In article mailman.10781.1402009056.18130.python-l...@python.org,
  Ned Deily n...@acm.org wrote:
  Roy is using MT-NewsWatcher as a client.
 Yes.  Except for the fact that it hasn't kept up with unicode, I find 
 the U/I pretty much perfect.  I imagine at some point I'll be forced to 
 look elsewhere, but then again, netnews is pretty much dead.

I agree about the U/I, although I'm sure a lot of that has to do with 
familiarity. However, netnews isn't dead, it has just morphed a bit.  A 
newsreader, like MT-NW, is great for following mailing lists like this 
(and most other Python-related lists) via gmane.org's bi-directional 
mailing list - NNTP gateways.  And for this list it's usually better to 
read the mailing list variant via gmane.org NNTP than the Usenet group 
variant via a traditional USENET NNTP server because there's less spam 
with the former.
 
  BTW, don't upgrade to OS X 10.9 Mavericks if you're dependent on 
  MT-NW; it finally stops working there because what was left of Open 
  Transport support in OS X has finally been ripped out of 10.9.
 Hmmm, good to know.  I'm still on 10.7, and don't see any reason to 
 move.  But, then again, you'd expect that from somebody who's still on 
 Python 2.x, wouldn't you?

Heh. Well, both 10.8 and 10.9 provided various improvements, both feature 
and performance, over 10.7.  Alas, Apple won't likely be supporting 10.7 
with security updates for as long as the PSF will be supporting 2.7.x.  
But, by then, you'll have had a chance to re-implement MT-NW in Python.

-- 
 Ned Deily,
 n...@acm.org



Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread Gregory Ewing

Chris Angelico wrote:

On Wed, Jun 4, 2014 at 11:18 AM, Roy Smith r...@panix.com wrote:


<sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you?  The
pattern you wrote also matches "centee" and "centrr".</sarcasm>


Maybe there's someone who spells it that way!


Come visit Pirate Island, the centrr of the universe!

--
Pegleg Greg


Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread Mark Lawrence

On 04/06/2014 01:39, Chris Angelico wrote:

A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe) around one critical question: Is string indexing common?

Python strings can be indexed with integers to produce characters
(strings of length 1). They can also be iterated over from beginning
to end. Lots of operations can be built on either one of those two
primitives; the question is, how much can NOT be implemented
efficiently over iteration, and MUST use indexing? Theories are great,
but solid use-cases are better - ideally, examples from actual
production code (actual code optional).

I know the collective experience of python-list can't fail to bring up
a few solid examples here :)

Thanks in advance, all!!

ChrisA



Single characters quite often, iteration rarely if ever, slicing all the 
time, but does that last one count?


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread Chris Angelico
On Wed, Jun 4, 2014 at 6:22 PM, Mark Lawrence breamore...@yahoo.co.uk wrote:
 Single characters quite often, iteration rarely if ever, slicing all the
 time, but does that last one count?

Yes, slicing counts. What matters here is the potential impact of
internally representing strings as UTF-8 streams; when you ask for the
Nth character, it would have to scan from either the beginning or end
(more likely beginning) of the string and count, instead of doing what
CPython 3.3+ does and simply look up the header to find out the kind,
bit-shift the index by one less than that, and use that as a memory
location.
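
To make the cost difference concrete, here is a rough sketch in pure Python, assuming Python 3. The function names are made up, and this only models the idea; it is not how CPython implements either representation.

```python
def utf8_index(data, n):
    """Return the n-th code point of UTF-8 bytes by scanning from the start: O(n)."""
    if n < 0:
        raise IndexError("negative indices are not handled in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a code point starts here
            count += 1
            if count == n:
                start = pos
            elif count == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    return data[start:].decode("utf-8")


def fixed_width_index(buf, n, width):
    """Return the n-th code point of a fixed-width buffer: one multiplication, O(1)."""
    offset = n * width
    return chr(int.from_bytes(buf[offset:offset + width], "little"))


text = "123αλφα"
print(utf8_index(text.encode("utf-8"), 4))                # λ
print(fixed_width_index(text.encode("utf-32-le"), 4, 4))  # λ
```

With a fixed width per character the offset is simple arithmetic; with UTF-8 every lookup has to walk over the preceding bytes unless some extra index is kept.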

ChrisA


Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread Peter Otten
Mark Lawrence wrote:

 On 04/06/2014 01:39, Chris Angelico wrote:
 A current discussion regarding Python's Unicode support centres (or
 centers, depending on how close you are to the cent[er]{2} of the
 universe) around one critical question: Is string indexing common?

 Python strings can be indexed with integers to produce characters
 (strings of length 1). They can also be iterated over from beginning
 to end. Lots of operations can be built on either one of those two
 primitives; the question is, how much can NOT be implemented
 efficiently over iteration, and MUST use indexing? Theories are great,
 but solid use-cases are better - ideally, examples from actual
 production code (actual code optional).

 I know the collective experience of python-list can't fail to bring up
 a few solid examples here :)

 Thanks in advance, all!!

 ChrisA

 
 Single characters quite often, iteration rarely if ever, slicing all the
 time, but does that last one count?

The indices used for slicing typically don't come out of nowhere. A simple 
example would be

def strip_prefix(text, prefix):
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text

If both prefix and text use UTF-8 internally the byte offset is already 
known. The question is then how we can preserve that information.

The first approach that comes to mind is an int subtype:

 for i, c in enumerate("123αλφα"):
... print(i, byteoffset(i), c)
... 
0 0 1
1 1 2
2 2 3
3 3 α
4 5 λ
5 7 φ
6 9 α

This would work in the strip_prefix() example, but lead to data corruption 
in most other cases unless limited to a specific string -- in which case it 
would no longer work with strip_prefix().
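
The table above can be reproduced in today's Python with a small generator. This is only a sketch assuming Python 3; byte_offsets is a made-up helper, not the hypothetical byteoffset() int subtype described above.

```python
def byte_offsets(text):
    """Yield (character index, UTF-8 byte offset, character) for each character."""
    offset = 0
    for index, char in enumerate(text):
        yield index, offset, char
        offset += len(char.encode("utf-8"))  # 1 byte for ASCII, 2 for Greek letters


for index, offset, char in byte_offsets("123αλφα"):
    print(index, offset, char)
```

The offsets only mean something for the string they were computed from, which is the data corruption risk mentioned above.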

So a new interface would be needed. My second try, an object with two byte 
offsets linked to a specific string:

 span("foobar").startswith("oob")
 p = span("foobar").startswith("foo")
 p.replace("baz")
'bazbar'
 p.before()
''
 p.after()
'bar'
 span("foo bar baz").find("bar").replace("spam")
'foo spam baz'

I have no idea if that could work out...
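
For illustration only, one way such an interface might look is sketched below, assuming Python 3. The class name Span and all of its methods are hypothetical, and it stores character offsets in an ordinary str rather than UTF-8 byte offsets, so it models the interface and not the internal representation.

```python
class Span:
    """A (start, stop) range tied to one specific string."""

    def __init__(self, text, start=0, stop=None):
        self.text = text
        self.start = start
        self.stop = len(text) if stop is None else stop

    def startswith(self, prefix):
        # Return a span covering the prefix, or None when it does not match.
        if self.text.startswith(prefix, self.start, self.stop):
            return Span(self.text, self.start, self.start + len(prefix))
        return None

    def find(self, sub):
        # Return a span covering the first occurrence, or None if absent.
        pos = self.text.find(sub, self.start, self.stop)
        if pos == -1:
            return None
        return Span(self.text, pos, pos + len(sub))

    def before(self):
        return self.text[:self.start]

    def after(self):
        return self.text[self.stop:]

    def replace(self, new):
        # Build a new string with the spanned part swapped out.
        return self.before() + new + self.after()


p = Span("foobar").startswith("foo")
print(repr(p.replace("baz")))  # 'bazbar'
print(repr(p.before()))        # ''
print(repr(p.after()))         # 'bar'
```

Because each span carries a reference to its own string, its offsets cannot be applied to a different string by accident.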



Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread Chris Angelico
On Wed, Jun 4, 2014 at 8:10 PM, Peter Otten __pete...@web.de wrote:
 The indices used for slicing typically don't come out of nowhere. A simple
 example would be

 def strip_prefix(text, prefix):
     if text.startswith(prefix):
         text = text[len(prefix):]
     return text

 If both prefix and text use UTF-8 internally the byte offset is already
 known. The question is then how we can preserve that information.

Almost completely useless. First off, it solves only the problem of
operating on the string at exactly some point where you just got an
index; and secondly, you don't always get that index from a string
method. Suppose, for instance, that you iterate over a string thus:

for i, ch in enumerate(string):
    if ch=='{': start = i
    elif ch=='}': return string[start:i+1]

Okay, so this could be done by searching, but for something more
complicated, I can imagine it being better to enumerate. (But "I can
imagine" is much weaker than "Here's code that we use in production",
which is why I asked the question.)

Incidentally, the above code highlights the first problem too. With
direct indexing, you can ask for inclusive or exclusive slicing by
adding or subtracting one from the index. If you do that with a
byte-position-retaining special integer, you lose the byte position.
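
A self-contained version of that loop, as a sketch assuming Python 3. The function name first_braced and its inclusive flag are made up to show the index arithmetic on the indices that enumerate() hands back.

```python
def first_braced(string, inclusive=True):
    """Return the first {...} group found while enumerating, or None."""
    start = None
    for i, ch in enumerate(string):
        if ch == "{":
            start = i  # remember the most recent opening brace
        elif ch == "}" and start is not None:
            if inclusive:
                return string[start:i + 1]  # keep both braces
            return string[start + 1:i]      # drop both braces
    return None


print(first_braced("a {b} c"))                   # {b}
print(first_braced("a {b} c", inclusive=False))  # b
print(first_braced("no braces here"))            # None
```

The adjustments by one on start and i are exactly the arithmetic that a byte-position-retaining integer would not survive.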

ChrisA


Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread alister
On Tue, 03 Jun 2014 21:18:12 -0400, Roy Smith wrote:

 In article mailman.10656.1401842403.18130.python-l...@python.org,
  Chris Angelico ros...@gmail.com wrote:
 
 A current discussion regarding Python's Unicode support centres (or
 centers, depending on how close you are to the cent[er]{2} of the
 universe)
 
 <sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you?  The
 pattern you wrote also matches "centee" and "centrr".</sarcasm>

<super pedant mode>
The language is ENGLISH so the correct spelling is Centre regional 
variations my be common but they are incorrect
</super pedant mode>
:-)
-- 
Prepare for tomorrow -- get ready.
-- Edith Keeler, The City On the Edge of Forever,
   stardate unknown


Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread alister
On Wed, 04 Jun 2014 18:48:29 +1200, Gregory Ewing wrote:

 Chris Angelico wrote:
 On Wed, Jun 4, 2014 at 11:18 AM, Roy Smith r...@panix.com wrote:
 
<sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you?  The
pattern you wrote also matches "centee" and "centrr".</sarcasm>
 
 Maybe there's someone who spells it that way!
 
 Come visit Pirate Island, the centrr of the universe!

that should be Cent-argh



-- 
I hope the ``Eurythmics'' practice birth control ...


Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread Rustom Mody
On Wednesday, June 4, 2014 4:20:01 PM UTC+5:30, alister wrote:
 The language is ENGLISH so the correct spelling is Centre regional 
 variations my be common but they are incorrect

my?

O mee Oo my -- cockney (or Aussie) pedant??


Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread wxjmfauth
Le mercredi 4 juin 2014 02:39:54 UTC+2, Chris Angelico a écrit :
 A current discussion regarding Python's Unicode support centres (or
 centers, depending on how close you are to the cent[er]{2} of the
 universe) around one critical question: Is string indexing common?
 
 Python strings can be indexed with integers to produce characters
 (strings of length 1). They can also be iterated over from beginning
 to end. Lots of operations can be built on either one of those two
 primitives; the question is, how much can NOT be implemented
 efficiently over iteration, and MUST use indexing? Theories are great,
 but solid use-cases are better - ideally, examples from actual
 production code (actual code optional).
 
 I know the collective experience of python-list can't fail to bring up
 a few solid examples here :)
 
 Thanks in advance, all!!
 
 ChrisA

=

Like many, you are not understanding unicode because
you do not understand the coding of characters.

You do not understand the coding of the characters
because you do not understand the mathematics behind it.

You focussed on the wrong problem.

(All this stuff has been discussed, tested and worked on
20 (twenty) years ago.)

Sorry.

jmf



Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread alister
On Wed, 04 Jun 2014 05:52:24 -0700, Rustom Mody wrote:

 On Wednesday, June 4, 2014 4:20:01 PM UTC+5:30, alister wrote:
 The language is ENGLISH so the correct spelling is Centre regional
 variations my be common but they are incorrect
 
 my?
 
 O mee Oo my -- cockney (or Aussie) pedant??

I made no claims about my typing or spelling being correct.
That post was actually quite good fro me usually my typing is worse.
 



-- 
The difference between genius and stupidity is that genius has its limits.


Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread Michael Torrie
On 06/04/2014 12:50 AM, wxjmfa...@gmail.com wrote:
 Like many, you are not understanding unicode because
 you do not understand the coding of characters.

If that is true, then I'm sure a well-written paragraph or two can set
him straight.  You continually berate people for not understanding
unicode, but you've posted nothing to explain anything, nor demonstrate
your own understanding.  That's one reason your posts are so frustrating
and considered trolling.  You never ever explain yourself, instead just
flailing around and muttering about folks not understanding unicode,
just as you've done here, true to form.

 
 You do not understand the coding of the characters
 because you do not understand the mathematics behind it.

Flamebaiting here... FSR *is* effectively UTF-32 internally; it just
compresses off the leading zero bytes during string creation.

 You focussed on the wrong problem.

Frankly it is you who is focused on the wrong problem, at least with
this particular thread.  I think you got distracted by the subject line.
 Chris's original post really has nothing to do with unicode at all.
He's simply asking for use cases for string indexing where O(1) is
desired or necessary.  Could be old Python 2 byte strings, or Python 3
unicode strings.  It does not matter.  Unicode is orthogonal to his
question.

Maybe his purpose in asking the question is to justify a fixed-length
encoding scheme (which is what FSR actually is), or maybe it is to
explore the costs of using a much slower, but more compact,
variable-length encoding scheme like UTF-8.  Particularly in the context
of low-memory applications where unicode support would be nice, but
memory is at a premium.  But either way, you got hung up on the wrong thing.
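
A rough way to see the fixed-width behaviour for yourself, as a sketch that assumes CPython 3.3 or later (other implementations may store strings differently). The helper name bytes_per_char is made up for this example; it differences the sizes of two freshly built strings so that the fixed object header cancels out:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by differencing two string sizes."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


samples = [
    ("ascii", "a"),
    ("latin1", "\xe9"),
    ("bmp", "\u20ac"),
    ("astral", "\U0001F600"),
]
widths = {name: bytes_per_char(ch) for name, ch in samples}
print(widths)
# On CPython 3.3+ this is expected to show 1, 1, 2 and 4 bytes per character.
```

Each string uses one width throughout, chosen from its widest code point, which is why indexing stays O(1).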

 
 (All this stuff has been discussed, tested and worked on
 20 (twenty) years ago.)
 
 Sorry.

As am I.


Re: Unicode and Python - how often do you index strings?

2014-06-04 Thread Dave Angel
Chris Angelico ros...@gmail.com Wrote in message:
 On Wed, Jun 4, 2014 at 8:10 PM, Peter Otten __pete...@web.de wrote:
 The indices used for slicing typically don't come out of nowhere. A simple
 example would be

 def strip_prefix(text, prefix):
     if text.startswith(prefix):
         text = text[len(prefix):]
     return text

 If both prefix and text use UTF-8 internally the byte offset is already
 known. The question is then how we can preserve that information.
 
 Almost completely useless. First off, it solves only the problem of
 operating on the string at exactly some point where you just got an
 index; and secondly, you don't always get that index from a string
 method. Suppose, for instance, that you iterate over a string thus:
 
 for i, ch in enumerate(string):
     if ch=='{': start = i
     elif ch=='}': return string[start:i+1]
 
 Okay, so this could be done by searching, but for something more
 complicated, I can imagine it being better to enumerate. (But "I can
 imagine" is much weaker than "Here's code that we use in production",
 which is why I asked the question.)
 
 Incidentally, the above code highlights the first problem too. With
 direct indexing, you can ask for inclusive or exclusive slicing by
 adding or subtracting one from the index. If you do that with a
 byte-position-retaining special integer, you lose the byte position.
 
 ChrisA
 

A string could have two extra fields in it that hold index and
 offset for the most recent substring reference.  Even though the
 string is immutable,  nothing prevents mutable elements that are
 externally visible only by performance measurement.
 

So a loop using a subscript of a string would tend to be faster
 even if written in a naive way.
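
A minimal sketch of that idea, under stated assumptions: the class name CachedUtf8 and its layout are invented for illustration, only non-negative integer indexes are handled, and a real implementation would live in C inside the string object:

```python
class CachedUtf8:
    """Sketch: UTF-8 bytes plus a cache of the last (char index, byte offset)."""

    def __init__(self, text):
        self._data = text.encode("utf-8")
        self._last = (0, 0)   # mutable, but visible only through timing

    def _offset(self, index):
        char_pos, byte_pos = self._last
        if index < char_pos:
            char_pos, byte_pos = 0, 0   # cache is past the target: rescan
        data = self._data
        while char_pos < index:
            if byte_pos >= len(data):
                raise IndexError("string index out of range")
            byte_pos += 1
            # Skip continuation bytes, which look like 0b10xxxxxx.
            while byte_pos < len(data) and (data[byte_pos] & 0xC0) == 0x80:
                byte_pos += 1
            char_pos += 1
        if byte_pos >= len(data):
            raise IndexError("string index out of range")
        self._last = (char_pos, byte_pos)
        return byte_pos

    def __getitem__(self, index):
        start = self._offset(index)
        end = start + 1
        while end < len(self._data) and (self._data[end] & 0xC0) == 0x80:
            end += 1
        return self._data[start:end].decode("utf-8")


s = CachedUtf8("a\u00efb\u20acc")
print([s[i] for i in range(5)])
```

A loop over ascending indexes only ever scans forward from the cached position, so the whole loop costs O(N) rather than O(N^2).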

It's also conceivable to build an array of such pairs in strings
 over a threshold size. So if you had a megabyte string, there
 might be 100 evenly spaced pairs, calculated when the string
 object is first created.
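
The evenly spaced pairs could be sketched like this. The function names build_checkpoints and char_at, and the choice of a fixed step, are assumptions made for the example and not anything Python actually does:

```python
def build_checkpoints(data, step):
    """Byte offsets of characters 0, step, 2*step, ... in UTF-8 bytes."""
    checkpoints = []
    char_pos = 0
    for byte_pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:          # first byte of a character
            if char_pos % step == 0:
                checkpoints.append(byte_pos)
            char_pos += 1
    return checkpoints


def char_at(data, checkpoints, step, index):
    """Return the character at `index`, scanning at most step - 1 characters."""
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    byte_pos = checkpoints[index // step]   # IndexError if far out of range
    for _ in range(index % step):
        byte_pos += 1
        while byte_pos < len(data) and (data[byte_pos] & 0xC0) == 0x80:
            byte_pos += 1
    if byte_pos >= len(data):
        raise IndexError("string index out of range")
    end = byte_pos + 1
    while end < len(data) and (data[end] & 0xC0) == 0x80:
        end += 1
    return data[byte_pos:end].decode("utf-8")


text = "a\u00efb\u20acc" * 10
data = text.encode("utf-8")
cps = build_checkpoints(data, 8)
print(len(cps), char_at(data, cps, 8, 13))
```

The cost of a lookup is bounded by the step, at the price of one extra integer per step characters.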

And naturally there can be flags indicating that the particular
 string is pure ASCII.

Clearly this breaks down if there are two alternating references
 at different offsets, but I think this would be exceedingly
 rare.

-- 
DaveA



Unicode and Python - how often do you index strings?

2014-06-03 Thread Chris Angelico
A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe) around one critical question: Is string indexing common?

Python strings can be indexed with integers to produce characters
(strings of length 1). They can also be iterated over from beginning
to end. Lots of operations can be built on either one of those two
primitives; the question is, how much can NOT be implemented
efficiently over iteration, and MUST use indexing? Theories are great,
but solid use-cases are better - ideally, examples from actual
production code (actual code optional).

I know the collective experience of python-list can't fail to bring up
a few solid examples here :)

Thanks in advance, all!!

ChrisA


Re: Unicode and Python - how often do you index strings?

2014-06-03 Thread Tim Chase
On 2014-06-04 10:39, Chris Angelico wrote:
 A current discussion regarding Python's Unicode support centres (or
 centers, depending on how close you are to the cent[er]{2} of the
 universe) around one critical question: Is string indexing common?
 
 Python strings can be indexed with integers to produce characters
 (strings of length 1). They can also be iterated over from beginning
 to end. Lots of operations can be built on either one of those two
 primitives; the question is, how much can NOT be implemented
 efficiently over iteration, and MUST use indexing? Theories are
 great, but solid use-cases are better - ideally, examples from
 actual production code (actual code optional).

Many of my string-indexing uses revolve around a sliding window which
can be done with itertools[1], though I often just roll it as
something like

  n = 3
  for i in range(1 + len(s) - n):
      do_something(s[i:i+n])

So that could be supplanted by the SO iterator linked below.
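
For reference, an iteration-only version might look like the sketch below. It is one possible rendering of the idea, not necessarily the code behind the link, and the function name sliding_window is chosen for this example:

```python
from collections import deque
from itertools import islice


def sliding_window(s, n):
    """Yield each run of n consecutive characters of s, using iteration only."""
    it = iter(s)
    window = deque(islice(it, n), maxlen=n)   # prime the first window
    if len(window) == n:
        yield "".join(window)
    for ch in it:
        window.append(ch)                     # oldest character falls off
        yield "".join(window)


print(list(sliding_window("abcde", 3)))
```

It never indexes the string, so it would behave the same whatever the internal representation is.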

The other big use case I have from production code involves a
column-offset delimited file where the headers have a row of
underscores under them delimiting the field widths, so it looks
something like

  EmpID   Name                 Cost Center
  ------  -------------------  -----------
  314159  Longstocking, Pippi  RJ45
  265358  Davis, Miles         JA22
  979328  Bell, Alexander      RJ15

I then take row 2 and use it to make a mapping of header-name to a
slice-object for slicing the subsequent strings:

  import re
  r = re.compile('-+') # a sequence of 1+ dashes
  f = open("data.txt")
  headers = next(f)  # row 1: the column names
  lines = next(f)    # row 2: the dashes marking each field's width
  header_map = dict(
      (
          headers[i.start():i.end()].strip().upper(),
          slice(i.start(), i.end())
      )
      for i in r.finditer(lines)
  )
  for row in f:
      print("EmpID = %s" % row[header_map["EMPID"]].strip())
      print("Name = %s" % row[header_map["NAME"]].strip())
      # ...

which I presume uses string indexing under the hood.

Perhaps there's a better way of doing that, but it's what I currently
use to process these large-ish files (largest max out at 10-20MB each)

There might be other use-cases I've done, but those two leap to mind.

-tkc


[1]
http://stackoverflow.com/questions/6822725/rolling-or-sliding-window-iterator-in-python







Re: Unicode and Python - how often do you index strings?

2014-06-03 Thread Roy Smith
In article mailman.10656.1401842403.18130.python-l...@python.org,
 Chris Angelico ros...@gmail.com wrote:

 A current discussion regarding Python's Unicode support centres (or
 centers, depending on how close you are to the cent[er]{2} of the
 universe)

<sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you?  The 
pattern you wrote also matches "centee" and "centrr".</sarcasm>

 around one critical question: Is string indexing common?

Not in our code.  I've got 80008 non-blank lines of Python (2.7) source 
handy.  I tried a few heuristics to find patterns which might be string 
indexing.

$ find . -name '*.py' | xargs egrep '\[[^]][0-9]+\]'

and then looked them over manually.  I see this pattern a bunch of times 
(in a single-use script):

data['shard_key'] = hashlib.md5(str(id)).hexdigest()[:4]  

We do this once:

if tz_offset[0] == '-':

We do this somewhere in some command-line parsing:

process_match = args.process[:15]

There's this little gem:

return [dedup(x[1:-1].lower()) for x in 
re.findall('(\[[^\]\[]+\]|\([^\)\(]+\))',title)]

It appears I wrote this one, but I don't remember exactly what I had in 
mind at the time...

withhyphen = number if '-' in number else (number[:-2] + '-' + 
number[-2:]) # big assumption here

Anyway, there's a bunch more, but the bottom line is that in our code, 
indexing into a string (at least explicitly in application source code) 
is a pretty rare thing.


Re: Unicode and Python - how often do you index strings?

2014-06-03 Thread Ethan Furman

On 06/03/2014 05:39 PM, Chris Angelico wrote:


A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe) around one critical question: Is string indexing common?


I use it quite a bit, but the strings are usually quite small (well under 100 characters) so an implementation that 
wasn't O(1) would not hurt me much.


--
~Ethan~


Re: Unicode and Python - how often do you index strings?

2014-06-03 Thread Chris Angelico
On Wed, Jun 4, 2014 at 11:18 AM, Roy Smith r...@panix.com wrote:
 In article mailman.10656.1401842403.18130.python-l...@python.org,
  Chris Angelico ros...@gmail.com wrote:

 A current discussion regarding Python's Unicode support centres (or
 centers, depending on how close you are to the cent[er]{2} of the
 universe)

 <sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you?  The
 pattern you wrote also matches "centee" and "centrr".</sarcasm>

Maybe there's someone who spells it that way! Let's not be excluding
people. That'd be rude.

 around one critical question: Is string indexing common?

 Not in our code.  I've got 80008 non-blank lines of Python (2.7) source
 handy.  I tried a few heuristics to find patterns which might be string
 indexing.

 $ find . -name '*.py' | xargs egrep '\[[^]][0-9]+\]'

 and then looked them over manually.  I see this pattern a bunch of times
 (in a single-use script):

 data['shard_key'] = hashlib.md5(str(id)).hexdigest()[:4]

Slicing is a form of indexing too, although in this case (slicing from
the front) it could be implemented on top of UTF-8 without much
problem.

 withhyphen = number if '-' in number else (number[:-2] + '-' +
 number[-2:]) # big assumption here

This *definitely* counts; if strings were represented internally in
UTF-8, this would involve two scans (although a smart implementation
could probably count backward rather than forward). By the way, any
time you slice up to the third from the end, you win two extra awesome
points, just for putting [:-3] into your code and having it mean
something. But I digress.
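
Counting backward could be sketched as below, working directly on UTF-8 bytes. The function name split_tail is invented for this example, and it assumes the input is valid UTF-8:

```python
def split_tail(data, n):
    """Split UTF-8 bytes into (head, tail), where tail is the last n characters."""
    pos = len(data)
    for _ in range(n):
        if pos == 0:
            break                      # fewer than n characters available
        pos -= 1
        # Step back over continuation bytes (0b10xxxxxx) to the lead byte.
        while pos > 0 and (data[pos] & 0xC0) == 0x80:
            pos -= 1
    return data[:pos], data[pos:]


number = "12345\u20ac7".encode("utf-8")
head, tail = split_tail(number, 2)
print(head.decode("utf-8") + "-" + tail.decode("utf-8"))
```

One backward scan finds the split point for both number[:-2] and number[-2:], so the cost depends on the length of the tail and not on the length of the string.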

 Anyway, there's a bunch more, but the bottom line is that in our code,
 indexing into a string (at least explicitly in application source code)
 is a pretty rare thing.

Thanks. Of course, the pattern you searched for is looking only for
literals; it's a bit harder to find cases where the index (or slice
position) comes from a variable or expression, and those situations
are also rather harder to optimize (the MD5 prefix is clearly better
scanned from the front, the number tail is clearly better scanned from
the back - but with a variable?).

ChrisA


Re: Unicode and Python - how often do you index strings?

2014-06-03 Thread Chris Angelico
On Wed, Jun 4, 2014 at 11:11 AM, Tim Chase
python.l...@tim.thechases.com wrote:
 I then take row 2 and use it to make a mapping of header-name to a
 slice-object for slicing the subsequent strings:

   slice(i.start(), i.end())

 print("EmpID = %s" % row[header_map["EMPID"]].strip())
 print("Name = %s" % row[header_map["NAME"]].strip())

 which I presume uses string indexing under the hood.

Yes, it's definitely going to be indexing. If strings were represented
internally in UTF-8, each of those calls would need to scan from the
beginning of the string, counting and discarding characters until it
finds the place to start, then counting and retaining characters until
it finds the place to stop. Definite example of what I'm looking for,
thanks!
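
Roughly what that scan would look like, as a sketch over raw UTF-8 bytes. The function name utf8_slice is made up here, and it assumes valid UTF-8 and 0 <= start <= stop:

```python
def utf8_slice(data, start, stop):
    """Return characters [start:stop] of UTF-8 bytes by scanning from the front."""
    begin = end = None
    char_pos = 0
    for byte_pos, byte in enumerate(data):
        if (byte & 0xC0) == 0x80:
            continue                   # continuation byte, not a new character
        if char_pos == start:
            begin = byte_pos           # done discarding, start retaining
        if char_pos == stop:
            end = byte_pos             # done retaining
            break
        char_pos += 1
    if begin is None:
        begin = len(data)              # start lies beyond the last character
    if end is None:
        end = len(data)                # stop lies beyond the last character
    return data[begin:end]


row = "314159  L\u00f6nnrot, Elias       RJ45"
print(utf8_slice(row.encode("utf-8"), 8, 22).decode("utf-8"))
```

Every call walks the row from byte 0, which is the cost being weighed against the O(1) lookup of a fixed-width representation.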

ChrisA


Re: Unicode and Python - how often do you index strings?

2014-06-03 Thread Tim Chase
On 2014-06-04 12:16, Chris Angelico wrote:
 On Wed, Jun 4, 2014 at 11:11 AM, Tim Chase
 python.l...@tim.thechases.com wrote:
  I then take row 2 and use it to make a mapping of header-name to a
  slice-object for slicing the subsequent strings:
 
slice(i.start(), i.end())
 
  print("EmpID = %s" % row[header_map["EMPID"]].strip())
  print("Name = %s" % row[header_map["NAME"]].strip())
 
  which I presume uses string indexing under the hood.
 
 Yes, it's definitely going to be indexing. If strings were
 represented internally in UTF-8, each of those calls would need to
 scan from the beginning of the string, counting and discarding
 characters until it finds the place to start, then counting and
 retaining characters until it finds the place to stop. Definite
 example of what I'm looking for, thanks!

For what it's worth, most of the lines in each file are under ~2k, so
even O(N) or O(log N) indexing wouldn't be grievous.  Noticeable, but
not grievous.

Glad my example could give you some fodder.

-tkc


