iterating over strings seems to be really slow?

2014-08-27 Thread Rodrick Brown
I'm confused why the latter function runs significantly faster, when
wc1() builds the hash in a single pass and doesn't waste memory by
returning an array of strings.

I would have thought wc2() would be slower. What's going on here?


#!/usr/bin/env python

s = "The black cat jump over the bigger black cat"


def wc1():
    word = ""
    m = {}
    for c in s:
        if c != " ":
            word += c
        else:
            if m.has_key(word):
                m[word] += 1
            else:
                m[word] = 1
            word = ""
    return m


def wc2():
    m = {}
    for c in s.split():
        if m.has_key(c):
            m[c] += 1
        else:
            m[c] = 1
    return m

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("wc1()", setup="from __main__ import wc1"))
    print(timeit.timeit("wc2()", setup="from __main__ import wc2"))

$ python wordcount.py
7.39647197723
3.15220093727
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: iterating over strings seems to be really slow?

2014-08-27 Thread Tim Chase
On 2014-08-27 16:53, Rodrick Brown wrote:
> I'm confused why the latter function runs significantly faster, when
> wc1() builds the hash in a single pass and doesn't waste memory by
> returning an array of strings.
>
> I would have thought wc2() would be slower. What's going on here?
>
> #!/usr/bin/env python
>
> s = "The black cat jump over the bigger black cat"
>
> def wc1():
>     word = ""
>     m = {}
>     for c in s:
>         if c != " ":
>             word += c

String-building a character at a time is slow.  Also, it doesn't
produce the same results as wc2() does.  Check:

  if wc1() == wc2():
      print("Success")
  else:
      print("doh!")
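Concretely, wc1() never flushes its final word because the string doesn't
end in a space. A minimal Python 3 sketch of both functions (with has_key
swapped for dict.get, since has_key is Python 2 only) makes the
disagreement visible:

```python
s = "The black cat jump over the bigger black cat"

def wc1():
    # Build each word a character at a time; flush it on every space.
    word = ""
    m = {}
    for c in s:
        if c != " ":
            word += c
        else:
            m[word] = m.get(word, 0) + 1
            word = ""
    # Note: the last word is never flushed unless s ends with a space.
    return m

def wc2():
    # Let str.split() do the word extraction.
    m = {}
    for w in s.split():
        m[w] = m.get(w, 0) + 1
    return m

print(wc1()["cat"])  # 1 -- the trailing "cat" was never counted
print(wc2()["cat"])  # 2
```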

> def wc2():
>     m = {}
>     for c in s.split():
>         if m.has_key(c):
>             m[c] += 1
>         else:
>             m[c] = 1
>     return m

The thing that surprises me is that using collections.Counter() and
collections.defaultdict(int) are also slower than your wc2():

from collections import defaultdict, Counter

def wc3():
return Counter(s.split())

def wc4():
m = defaultdict(int)
for c in s.split():
m[c] += 1
return m
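As an aside (my addition, not from the thread): even when Counter loses the
micro-benchmark, it gives sorted tallies for free via most_common(), which
none of the hand-rolled versions offer:

```python
from collections import Counter

s = "The black cat jump over the bigger black cat"
counts = Counter(s.split())

# The two most frequent words with their counts; note "The" and "the"
# are distinct keys because Counter does no case folding.
print(counts.most_common(2))
```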

-tkc






Re: iterating over strings seems to be really slow?

2014-08-27 Thread Chris Kaynor
On Wed, Aug 27, 2014 at 1:53 PM, Rodrick Brown rodrick.br...@gmail.com wrote:

> def wc1():
>     word = ""
>     m = {}
>     for c in s:
>         if c != " ":
>             word += c
>         else:
>             if m.has_key(word):
>                 m[word] += 1
>             else:
>                 m[word] = 1
>             word = ""
>     return m
>
> def wc2():
>     m = {}
>     for c in s.split():
>         if m.has_key(c):
>             m[c] += 1
>         else:
>             m[c] = 1
>     return m


On Wed, Aug 27, 2014 at 2:21 PM, Tim Chase python.l...@tim.thechases.com
wrote:

> The thing that surprises me is that using collections.Counter() and
> collections.defaultdict(int) are also slower than your wc2():
>
> from collections import defaultdict, Counter
>
> def wc3():
>     return Counter(s.split())
>
> def wc4():
>     m = defaultdict(int)
>     for c in s.split():
>         m[c] += 1
>     return m


I ran a couple more experiments, and, at least on my machine, it has to do
with the number of duplicate words found. I also added two more variations
of my own:

def wc5(s):  # I expect this one to be slow.
    m = {}
    for c in s.split():
        m.setdefault(c, 0)
        m[c] += 1
    return m

def wc6(s):  # This one might be better than any other option presented so far.
    m = {}
    for c in s.split():
        try:
            m[c] += 1
        except KeyError:
            m[c] = 1
    return m

I also switched the OP's versions to use the "in" operator rather than
has_key(), as I am running Python 3.4.1.

With the same dataset (plus a trailing space) the OP provided, here are my
times: (s = "The black cat jump over the bigger black cat ")

>>> timeit.timeit("wc1(s)", setup=setup, number=100)
6.076951338314008
>>> timeit.timeit("wc2(s)", setup=setup, number=100)
2.451220378346954
>>> timeit.timeit("wc3(s)", setup=setup, number=100)
5.249674617410577
>>> timeit.timeit("wc4(s)", setup=setup, number=100)
3.531042215121076
>>> timeit.timeit("wc5(s)", setup=setup, number=100)
3.4734603842861205
>>> timeit.timeit("wc6(s)", setup=setup, number=100)
4.322543365103378


When I increase the data set by multiplying the OP's string 1000 times (s =
"The black cat jump over the bigger black cat " * 1000), here are the times
(I reduced the number of repetitions to keep the time reasonable):

>>> timeit.timeit("wc1(s)", setup=setup, number=1000)
5.807871555058417
>>> timeit.timeit("wc2(s)", setup=setup, number=1000)
2.3245083748933535
>>> timeit.timeit("wc3(s)", setup=setup, number=1000)
1.5722138905703211
>>> timeit.timeit("wc4(s)", setup=setup, number=1000)
1.901478857657942
>>> timeit.timeit("wc5(s)", setup=setup, number=1000)
3.065888476414475
>>> timeit.timeit("wc6(s)", setup=setup, number=1000)
2.0125233934956217


It seems that with a large number of duplicate words, the Counter version
is the best; with fewer duplicates, however, the "in" check is better.
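A self-contained sketch of that comparison (my own harness, not the exact
one used above; the input sizes and repetition counts are arbitrary choices):

```python
import timeit
from collections import Counter

def wc_in(text):
    # The "in"-check version (wc2 with has_key replaced, for Python 3).
    m = {}
    for w in text.split():
        if w in m:
            m[w] += 1
        else:
            m[w] = 1
    return m

few_dups = " ".join("word%d" % i for i in range(2000))              # all unique
many_dups = "The black cat jump over the bigger black cat " * 1000  # 7 distinct words

for label, data in [("few dups", few_dups), ("many dups", many_dups)]:
    t_in = timeit.timeit(lambda: wc_in(data), number=100)
    t_counter = timeit.timeit(lambda: Counter(data.split()), number=100)
    print("%-9s in: %.3fs  Counter: %.3fs" % (label, t_in, t_counter))
```

Both implementations produce identical counts; only the timings differ.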

Chris


Re: iterating over strings seems to be really slow?

2014-08-27 Thread Chris Angelico
On Thu, Aug 28, 2014 at 6:53 AM, Rodrick Brown rodrick.br...@gmail.com wrote:
> def wc1():
>     word = ""
>     m = {}
>     for c in s:
>         if c != " ":
>             word += c
>         else:
>             if m.has_key(word):
>                 m[word] += 1
>             else:
>                 m[word] = 1
>             word = ""
>     return m

Your code is all buried behind HTML formatting, which makes it hard to read.

But frankly, I would first look at writing correct and idiomatic code,
unless you're doing a direct translation from C or something (in which
case keeping the style as similar as possible to the original makes it
easier to verify). The second function is MUCH clearer than the first.
Don't worry too much about performance until you've checked that one
against some definition of correctness, and then do all your
performance checking with a quick correctness check against the
idiomatic version.
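One way to sketch that workflow (my own example; using Counter as the
"obviously correct" reference is an assumption, not ChrisA's code):

```python
from collections import Counter

s = "The black cat jump over the bigger black cat "

def reference(text):
    # The idiomatic version: short, clear, and easy to trust.
    return Counter(text.split())

def wc2(text):
    # Candidate implementation you actually want to benchmark.
    m = {}
    for w in text.split():
        m[w] = m.get(w, 0) + 1
    return m

# Correctness gate: run this before spending any time on timing.
assert wc2(s) == dict(reference(s)), "wc2 disagrees with the reference"
print("correctness check passed")
```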

ChrisA