On Sat, 24 Apr 2010 09:11:21 am Robert Berman wrote:

> Wow. I feel I have just been b…h slapped across the face.

There's no need to be so sensitive.

> I think Hugo’s test results pretty much confirmed ‘in’ is not the way
> to go

If that is the *only* thing you have learned from this email thread, then you have failed to understand what folks are trying to say. Whether that is your failure or our failure is another question :)

The most important thing is that the right solution depends on what your data is and what you intend doing with it.

If your data is unordered, and all you do is membership testing, then using the `in` operator with a set is the right solution. Membership testing of sets is (approximately) O(1), which is much better than O(log N) for binary search and MUCH better than O(N) for linear search.

If you need a list, and you have the opportunity to control the format of your data, then you can keep the data in a list *and* a set, carefully keeping the two in sync. A class is probably good for that. Use the list for list-like things and the set for membership testing, and that gives you the best of both worlds.
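A minimal sketch of that idea (the class name and methods here are mine, purely for illustration, and the items have to be hashable for the set to work):

class TrackedList(object):
    """A list plus a set, kept in sync with each other.

    The list keeps the items in insertion order; the set exists
    purely so that `in` tests are O(1) instead of O(N).
    """
    def __init__(self, items=()):
        self._items = list(items)
        self._members = set(self._items)

    def append(self, item):
        # Update both structures together so they stay in sync.
        self._items.append(item)
        self._members.add(item)

    def __contains__(self, item):
        # `x in tracked` hits the set, not the list.
        return item in self._members

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

Anything that removes items needs the same two-sided care (and watch out for duplicates in the list), but for an append-and-test workload this gives you list ordering with set-speed membership tests.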
If your data is sorted, then binary search may be the right approach. But if your list is small enough, the extra overhead of binary search will actually make it *slower* than a linear search. And if your data needs to be unsorted (say, you have to store it in the order it is delivered to you), then binary search is *not* suitable.

In the worst case, you may have no choice but a linear search: copying the list into a fresh set each time is wasteful, or the list might include objects which are mutable and so can't go into sets or dicts. Fortunately, Python lists are written in C, so even a linear search using `in` is pretty fast.

But probably the best advice given is: don't worry about optimizing your code until you *know* it needs optimizing. Good advice for searching a list with ten million items is *at best* pointless for searching a ten-item list, and at worst actually harmful. It's hard to predict where bottlenecks are at the best of times, and it is particularly hard for people used to languages like C or Java.

Text books that give optimization advice are often counter-productive when it comes to Python code. For instance, they will often assume that comparisons are cheap and moving data is relatively expensive, which is true for low-level languages like C. But for Python it's the other way around: comparisons are relatively expensive and moving data is relatively cheap.

I've actually seen somebody do something like this in the name of "efficiency":

import bisect

def issorted(alist):
    """Helper function returning true if alist is already sorted."""
    if len(alist) == 0:
        return True
    x = alist[0]
    flag = True
    for item in alist:
        if item < x:
            flag = False
            break
        x = item
    return flag

def bin_search(alist, what):
    if not issorted(alist):
        # Make a copy of the list and work with that.
        alist = alist[:]  # Careful! Don't change the original!
        alist.sort()
    i = bisect.bisect(alist, what)
    if i == 0:
        return False
    else:
        return alist[i-1] == what

If you think carefully about what this is doing, it is hilarious! For beginners who don't quite see why, consider what happens if you give it a sorted list. First it walks the entire list to determine whether it is sorted, which takes N steps. Then it does a binary search, which is O(log N). Since the time is dominated by the O(N) test for sortedness, it would be quicker to just do a linear search. If the list is unsorted, it first walks some fraction of the list to discover that fact, then it does an O(N) copy, plus an O(N*log N) sort, before the binary search. And it does this *every single time*. Total, utter fail.

True confession time: the person who wrote this ridiculous piece of code was... me. My reasoning was that binary search was much faster than linear search, so it was worth sorting the list to get that advantage. Then I realised that I couldn't just sort the list, because other parts of the code might rely on it being unsorted, hence I had to make a copy first. Then I reasoned that I didn't need to copy it if it were already sorted.

For the record, here are some timing benchmarks to see how badly it turned out:

>>> from random import shuffle
>>> data1 = range(10000)
>>> data2 = range(10000)
>>> shuffle(data2)
>>>
>>> setup = """from __main__ import data1, data2, bin_search"""
>>> from timeit import Timer
>>> t1 = Timer("457 in data1", setup)
>>> t2 = Timer("457 in data2", setup)
>>> t3 = Timer("bin_search(data1, 457)", setup)
>>> t4 = Timer("bin_search(data2, 457)", setup)
>>>
>>> for t in (t1, t2, t3, t4):
...     print min(t.repeat(number=1000))
...
0.0346097946167
0.461233854294
3.04955101013
5.70547604561

For comparison, a plain binary search is one that drops the issorted() check and the copy-and-sort, and simply trusts the caller to keep the list sorted.
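A minimal sketch of such a function, in the same style as bin_search() above (the name plain_bin_search is mine, not from the original code):

import bisect

def plain_bin_search(alist, what):
    # No safety net: alist must already be sorted.
    i = bisect.bisect(alist, what)
    if i == 0:
        return False
    return alist[i-1] == what

With the O(N) sortedness check and the copy-and-sort gone, all that is left is the O(log N) bisection, which is the speed advantage binary search was supposed to deliver in the first place.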
There, I have now exposed my secret shame to the whole world. Please don't hate me. *wink*

-- 
Steven D'Aprano