On Sat, 24 Apr 2010 07:21:13 am Alan Gauld wrote: > "Emile van Sebille" <em...@fenx.com> wrote > > > It's expensive enough that for a list this size I'd convert it to a > > dict and use in on that. eg, > > > > a = range(100000) > > d = dict(zip(a,a)) > > Surely that would depend on how often you do the search? > If its a one off occurence I'd expect the overhead of zipping > and converting to a dict would outweight the savings?
It absolutely has to, because zipping it has to iterate over the entire list, then calling dict has to iterate over the entire zipped version. That's iterating over the entire list *twice* before you even START doing a search! In Python 3.x, zip is a lazy iterator so that will reduce the excess iterations from twice to once, but for a one-off test it entirely negates the point of converting. > If the search was inside a loop however then I'd definitely > agree. Although I'd opt for a set rather than a dict... Yes, there's no point in making a dict {a:a} just for membership testing when you can just use a set. > Another option would be to use the bisect module on a > sorted version of the list. But keep in mind that unless the list is HUGE, or your Python version includes the C version of bisect, a linear search using in may end up being faster than a slow pure-Python version of bisect. Also, bisect on its own doesn't do membership testing: >>> data = range(0,100,2) >>> 7 in data False >>> bisect.bisect(data, 7) 4 So you have to weigh up the extra complication needed versus the optimization. Another strategy is to move items closer to the front of the list each time they are accessed, so that commonly used items eventually bubble up to the front of the list and searches for them are fast. And finally, if the list is very large, and your searches tend to be clustered, it becomes wasteful to do a fresh binary search each time you look something up. (E.g. consider looking up these three words in the dictionary, one after the other: "caravan", "cat", "cap".) In this case, a good strategy is sometimes called "hunt and search". The algorithm is something like this: Each search takes a second argument, i, the place to start searching from. This will usually be the place the previous search ended at. First you *hunt* for the item: try to bracket the item you want between i and i+1, then i and i+2, then i and i+4, i+8, and so forth, doubling the size of the bracket each time. (This assumes that the value at index i was too small. If it were too large, you hunt in the opposite direction, with i-1 to i, i-2 to i, etc.) Once you have bracketed the item you are searching for, you *search* for it within those limits, using binary search or even a linear search. If your searches are clustered, most of the hunt phases will be short and even linear search will be fast, rather than doing a binary search over the full list each time. -- Steven D'Aprano _______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor