iterating over strings seems to be really slow?
I'm confused why the former function runs significantly faster, when wc1() builds the hash in a single pass and doesn't waste memory by returning an array of strings. I would have thought wc2() would be slower. What's going on here?

#!/usr/bin/env python

s = "The black cat jump over the bigger black cat"

def wc1():
    word = ''
    m = {}
    for c in s:
        if c != ' ':
            word += c
        else:
            if m.has_key(word):
                m[word] += 1
            else:
                m[word] = 1
            word = ''
    return(m)

def wc2():
    m = {}
    for c in s.split():
        if m.has_key(c):
            m[c] += 1
        else:
            m[c] = 1
    return(m)

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("wc1()", setup="from __main__ import wc1"))
    print(timeit.timeit("wc2()", setup="from __main__ import wc2"))

⮀python wordcount.py
7.39647197723
3.15220093727
--
https://mail.python.org/mailman/listinfo/python-list
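[Editor's note: the thread's code is Python 2 (`has_key`). For readers following along on Python 3, here is a sketch of both functions, translated with `dict.get`; the trailing space in `s` and the empty-word guard in `wc1()` are my additions, and the trailing space matters: without a trailing separator, `wc1()` silently drops the final word, a correctness bug Tim points out below.]

```python
s = "The black cat jump over the bigger black cat "  # trailing space is deliberate

def wc1(s):
    # Character-at-a-time scan: accumulate a word, flush it on a space.
    word = ''
    m = {}
    for c in s:
        if c != ' ':
            word += c
        else:
            if word:  # my guard: skip runs of consecutive spaces
                m[word] = m.get(word, 0) + 1
            word = ''
    return m

def wc2(s):
    # Split first, then count whole words.
    m = {}
    for w in s.split():
        m[w] = m.get(w, 0) + 1
    return m

print(wc1(s) == wc2(s))                    # True: trailing space lets wc1 flush "cat"
print(wc1(s.rstrip()) == wc2(s.rstrip()))  # False: wc1 never flushes the last word
```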
Re: iterating over strings seems to be really slow?
On 2014-08-27 16:53, Rodrick Brown wrote:
> I'm confused why the former function runs significantly faster when
> wc1() builds the hash on a single pass and doesn't waste memory by
> returning an array of strings? I would think wc2() to be slower.
> What's going on here?
>
> #!/usr/bin/env python
>
> s = "The black cat jump over the bigger black cat"
>
> def wc1():
>     word = ''
>     m = {}
>     for c in s:
>         if c != ' ':
>             word += c

String-building a character-at-a-time is slow. Also, it doesn't produce
the same results as wc2() does. Check:

    if wc1() == wc2():
        print("Success")
    else:
        print("doh!")

> def wc2():
>     m = {}
>     for c in s.split():
>         if m.has_key(c):
>             m[c] += 1
>         else:
>             m[c] = 1
>     return(m)

The thing that surprises me is that using collections.Counter() and
collections.defaultdict(int) are also slower than your wc2():

    from collections import defaultdict, Counter

    def wc3():
        return Counter(s.split())

    def wc4():
        m = defaultdict(int)
        for c in s.split():
            m[c] += 1
        return m

-tkc
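[Editor's note: Tim's claim that wc3() and wc4() agree with the split-based wc2() is easy to confirm. A minimal Python 3 sketch, with the thread's function names and my reconstruction of the OP's string; Counter and defaultdict are dict subclasses, so they compare equal to a plain dict holding the same items.]

```python
from collections import Counter, defaultdict

s = "The black cat jump over the bigger black cat"

def wc2(s):
    # The OP's split-based counter, ported to Python 3 (no has_key).
    m = {}
    for w in s.split():
        if w in m:
            m[w] += 1
        else:
            m[w] = 1
    return m

def wc3(s):
    # Counter consumes the whole word list in one call.
    return Counter(s.split())

def wc4(s):
    # defaultdict(int) makes the first increment of each word implicit.
    m = defaultdict(int)
    for w in s.split():
        m[w] += 1
    return m

print(wc2(s) == wc3(s) == wc4(s))  # True
```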
Re: iterating over strings seems to be really slow?
On Wed, Aug 27, 2014 at 1:53 PM, Rodrick Brown rodrick.br...@gmail.com wrote:
> def wc1():
>     word = ''
>     m = {}
>     for c in s:
>         if c != ' ':
>             word += c
>         else:
>             if m.has_key(word):
>                 m[word] += 1
>             else:
>                 m[word] = 1
>             word = ''
>     return(m)
>
> def wc2():
>     m = {}
>     for c in s.split():
>         if m.has_key(c):
>             m[c] += 1
>         else:
>             m[c] = 1
>     return(m)

On Wed, Aug 27, 2014 at 2:21 PM, Tim Chase python.l...@tim.thechases.com wrote:
> The thing that surprises me is that using collections.Counter() and
> collections.defaultdict(int) are also slower than your wc2():
>
>     from collections import defaultdict, Counter
>
>     def wc3():
>         return Counter(s.split())
>
>     def wc4():
>         m = defaultdict(int)
>         for c in s.split():
>             m[c] += 1
>         return m

I ran a couple more experiments, and, at least on my machine, it has to
do with the number of duplicate words found. I also added two more
variations of my own:

    def wc5(s):
        # I expect this one to be slow.
        m = {}
        for c in s.split():
            m.setdefault(c, 0)
            m[c] += 1
        return m

    def wc6(s):
        # This one might be better than any other option presented so far.
        m = {}
        for c in s.split():
            try:
                m[c] += 1
            except KeyError:
                m[c] = 1
        return m

I also switched the OP's versions to use the `in` operator rather than
has_key(), as I am running Python 3.4.1.
With the same dataset the OP provided (plus a trailing space), here are
my times (s = "The black cat jump over the bigger black cat "):

    timeit.timeit("wc1(s)", setup=setup, number=100)   6.076951338314008
    timeit.timeit("wc2(s)", setup=setup, number=100)   2.451220378346954  <- best
    timeit.timeit("wc3(s)", setup=setup, number=100)   5.249674617410577
    timeit.timeit("wc4(s)", setup=setup, number=100)   3.531042215121076
    timeit.timeit("wc5(s)", setup=setup, number=100)   3.4734603842861205
    timeit.timeit("wc6(s)", setup=setup, number=100)   4.322543365103378

When I increase the data set by multiplying the OP's string 1000 times
(s = "The black cat jump over the bigger black cat " * 1000), here are
the times (I reduced the number of repetitions to keep the total time
reasonable):

    timeit.timeit("wc1(s)", setup=setup, number=1000)  5.807871555058417
    timeit.timeit("wc2(s)", setup=setup, number=1000)  2.3245083748933535
    timeit.timeit("wc3(s)", setup=setup, number=1000)  1.5722138905703211  <- best
    timeit.timeit("wc4(s)", setup=setup, number=1000)  1.901478857657942
    timeit.timeit("wc5(s)", setup=setup, number=1000)  3.065888476414475
    timeit.timeit("wc6(s)", setup=setup, number=1000)  2.0125233934956217

It seems that with a large number of duplicate words, the Counter
version is the best; however, with fewer duplicates, the containment
check is better.

Chris
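[Editor's note: Chris's observation (Counter wins with many duplicates, the `in` check wins with few) is easy to probe with a small harness. This is my own sketch, not the thread's exact script; the function names are mine, the absolute numbers are machine-dependent, and only the relative ordering is interesting.]

```python
import timeit
from collections import Counter

base = "The black cat jump over the bigger black cat "

def wc_in(s):
    # Membership-test version, like the OP's wc2() ported to Python 3.
    m = {}
    for w in s.split():
        if w in m:
            m[w] += 1
        else:
            m[w] = 1
    return m

def wc_counter(s):
    # Counter version, like Tim's wc3().
    return Counter(s.split())

# Small input = few duplicates per call; big input = many duplicates.
for label, s, n in (("few duplicates", base, 20000),
                    ("many duplicates", base * 1000, 200)):
    for fn in (wc_in, wc_counter):
        t = timeit.timeit(lambda: fn(s), number=n)
        print(f"{label:16s} {fn.__name__:12s} {t:.3f}s")
```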
Re: iterating over strings seems to be really slow?
On Thu, Aug 28, 2014 at 6:53 AM, Rodrick Brown rodrick.br...@gmail.com wrote:
> def wc1():
>     word = ''
>     m = {}
>     for c in s:
>         if c != ' ':
>             word += c
>         else:
>             if m.has_key(word):
>                 m[word] += 1
>             else:
>                 m[word] = 1
>             word = ''
>     return(m)

Your code is all buried behind HTML formatting, which makes it hard to
read. But frankly, I would first look at writing correct and idiomatic
code, unless you're doing a direct translation from C or something (in
which case keeping the style as similar as possible to the original
makes it easier to verify). The second function is MUCH clearer than the
first. Don't worry too much about performance until you've checked that
one against some definition of correctness, and then do all your
performance checking with a quick correctness check against the
idiomatic version.

ChrisA
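[Editor's note: ChrisA's advice (benchmark only against a version you trust) can be made concrete. In this sketch, Counter serves as the idiomatic reference and a hand-rolled variant is gated on agreeing with it before any timing would run; the function names are my own, not from the thread.]

```python
from collections import Counter

s = "The black cat jump over the bigger black cat"

def wc_reference(s):
    # Idiomatic one-liner: short enough to trust by inspection.
    return Counter(s.split())

def wc_candidate(s):
    # A hand-rolled variant we might want to benchmark.
    m = {}
    for w in s.split():
        if w in m:
            m[w] += 1
        else:
            m[w] = 1
    return m

# Correctness gate before any timing runs.
assert wc_candidate(s) == wc_reference(s), "candidate disagrees with reference"
print(wc_reference(s).most_common(2))  # [('black', 2), ('cat', 2)]
```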