On Sun, Nov 24, 2013 at 8:12 PM, Steven D'Aprano <st...@pearwood.info> wrote: > On Sun, Nov 24, 2013 at 09:33:11PM +0530, Reuben wrote: > [...] >> From the above output, I see key 'c' is at third position during input, but >> while displaying the output it is displayed at second position > > Dictionaries are deliberately made unordered. That means the order that > items will be displayed is free to vary, not only from how you entered > them, but in principle they might even vary each time you inspect the > dict. (They probably won't, but they could.) > > Python dicts are actually "hash tables", which is a standard computer > science data structure that you can look up, but in a nutshell, a hash > table is an array indexed by the key's hash value. Why do we care about > hash tables? Because they make a really fast and efficient way to look > up keys, and from the key get access to a value. > > Let's start with the simplest way to store a key and value pair, an > unsorted table of pairs with a marker indicating "blank", so you know > when you've reached the end. Here's an example of a half-filled table: > > { ("Hello", value), # key first, then some value > ("World", value), > ("Spam", value), > ("Eggs", value), > ("Cheese", value), > <blank>, > <blank>, > <blank>, > <blank>, > <blank>, > } > > In order to find a key in the table, you have to inspect every position > until you reach either the key you are looking for, or <blank>, in which > case you know that the key is not there. So to find "Eggs", you have to > inspect "Hello", "World", "Spam" and finally "Eggs". On average, a > successful search takes half as many inspections as there are keys, e.g. > if you have 10 keys in the table, you need to inspect 5 positions; if > there are 100 keys, you need to inspect 50 positions on average. If > there are 1000 keys, you need to inspect 500 positions. > > This is pretty slow. Imagine looking up a word in a dictionary (a real > paper dictionary), but with a twist: the words are in whatever order the > dictionary writer first thought of them, not alphabetical order, so for > any word you have to start at the beginning and read every single world > until you find it. > > Obviously we can do better, by keeping the words sorted. Our table will > look like this: > > { ("Cheese", value), > ("Eggs", value), > ("Hello", value), > ("Spam", value), > ("World", value), > <blank>, > <blank>, > <blank>, > <blank>, > <blank>, > } > > In this case, we can use a technique called "binary search" to narrow > down the possible places the key might be. This is very much like what > you probably do when looking up words in a paper dictionary: flip the > dictionary open to the middle, then decide whether you've gone too far > or not far enough. Each time, you split the remaining half in half > again, until you've narrowed down to the page containing the word you > want. > > In a sorted table with binary search, the savings over an unsorted table > can be very high. With 10 keys, on average you will end up inspecting 3 > or 4 items, which isn't much improvement over 5 for the unsorted case, > but with 100 keys, you will inspect 6 or 7, and with 1000 keys, 9 or 10. > With a million keys, you only need to inspect about 20 items to either > find the key you are looking for, or determine it isn't there. > > Can we do better than binary search? Yes we can, and we do so with a > hash table, which is what Python dicts are. > > The secret to the hash table is that instead of putting all the keys at > the beginning of the table, we want the entries to be scattered at > random all over the place. Only not quite at random, since we need a way > to jump straight to the key when asked. That's the *hash function*, > which converts a key to an arbitrary, and distinct, index: > > { <blank>, > ("World", value), # hash of "World" is 1 > ("Cheese", value), # hash of "Cheese" is 2 > <blank>, > <blank>, > ("Eggs", value), # hash of "Eggs" is 5 > <blank>, > ("Spam", value), # hash of "Spam" is 7 > <blank>, > ("Hello", value), # hash of "Hello" is 9 > } > > So the positions of the keys in the table depend on the keys, not the > order you write them down. The advantage of this is that looking for a > key requires (near enough to) constant time. If you want to see whether > "Eggs" is in the table, it doesn't matter whether there are ten keys or > ten billion, it will take exactly the same amount of time: hash the key > "Eggs" to give you index 5, look at index 5, it is either there or it > isn't. > > Now, in reality we can't actually guarantee that every hash value is > distinct. In real life, a hash table has to deal with "collisions", > which is when two different keys have the same hash value, and that will > spoil the nice simple constant time performance. But collisions are > usually rare, and so when discussing hash tables in broad terms we > normally ignore them. > > In Python, you can calculate the hash value of a key with the hash() > function. The exact values you get may depend on the specific version of > Python you are using, but to give you an example: > > py> hash("cat") > 405875055 > py> hash("hat") > 1255409212 > py> hash("fat") > -750391218 > > > So you can see, by changing just one letter of the key, we get a > completely different hash value. A good hash function -- and Python's > hash function is good -- will will mix up the positions of keys in a > very unpredictable way. > > The end result of this is that when displaying a dict, keys are shown in > whatever order they happen to be inside the hash table. That's > unpredicable, depending on the details of the hash function, the actual > keys, and whether there have been any collisions or not. > > If you care about this, it is easy enough to sort the keys before > printing them: > > # instead of this > print mydict.keys() > > # do this > print sorted(mydict.keys()) > > > If you care about keeping the insertion order, rather than alphabetical > order, you can use an OrderedDict: > > from collections import OrderedDict > > > but OrderedDicts are slower and use more memory than regular dicts, > since they have to track the order that the keys were inserted. > > > > -- > Steven > _______________________________________________ > Tutor maillist - Tutor@python.org > To unsubscribe or change subscription options: > https://mail.python.org/mailman/listinfo/tutor
Well explained Stephen -- Joel Goldstick http://joelgoldstick.com _______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor