Michael Broe wrote: > I'm playing with the whole idea of creating bigram (digram?) > frequencies for text analysis and cryptographic and entropy analysis > etc (this is as much an exercise in learning Python and programming > as anything else, I realise everything has already been done > somewhere somehow :) Though I *am* aiming to run this over unicoded > phonetic representations of natural languages, which doesn't seem to > be that common. > > I implemented a single character count using a dictionary, and it > worked really well: > > Character_Count = {} > for character in text: > Character_Count[character] = Character_Count.get(character, 0) + 1 > > And so for bigrams I was thinking of creating a data-structure that > was a nested dictionary, where the key was the character in position > 1, and the value was a sub-dictionary that gave a count of each > character in position 2. > > So the dictionary might begin something like: > > {'a': {'a':1, 'b':8, 'c':10,...}, 'b' : {'a':23, 'b':0, 'c': > 1,...},..., 'z' : {'a':11, 'b':0, 'c':0,...}} > > The count of the character in position one could be retrieved by > summing over the characters in position 2, so the bigram and single- > character counts can be done in one pass. > > I don't want anyone to tell me how to do this :) I'd just like to get > a feel for: > > (i) is this idea of a nested dictionary a good/efficient/tractable > data-structure?
It can work. An alternative is a dictionary keyed by the two-letter pairs. I don't know which will work better. > > (ii) is there a straightforward path to the goal of constructing this > thing in a single pass over the input string (yes or no will suffice > for the moment!), or is there a can of worms lurking in my future. I think so. > > I'd rather have less information than more, but there is something > about grabbing characters two-at-a-time, but moving forward one-at-a- > time, and sticking the count down a level inside the dictionary > that's just a little baffling at the moment. Encapsulating this in a generator is probably a good plan. If you want help look at the online cookbook. The generator can yield two-character strings or tuples of characters, whichever is more convenient. When you have a generator for the pairs the rest of the program will be pretty straightforward. Kent _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor