Re: Recursive generator for combinations of a multiset?
On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan resea...@johnohagan.com wrote:

Short story: the subject says it all, so if you have an answer already, fire away. Below is the long story of what I'm using it for, and why I think it needs to be recursive. It may even be of more general interest in terms of filtering the results of generators.

Any suggestions? I've updated my code at http://stromberg.dnsalias.org/svn/anagrams/trunk/ - it's multiword now. It can blast through the word "punishment" in 4 seconds, but for "The public art galleries" it ate about 7 gigabytes of RAM and ran for more than a day before I killed it. I believe it's an exponential problem. Parallelization might help, but it'd probably take a lot of RAM that way. Maybe the RAM use would be better with CPython, but it's much faster with PyPy; I did most of my testing with PyPy 2.2.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Recursive generator for combinations of a multiset?
On Tue, 26 Nov 2013 10:33:06 +0000 Oscar Benjamin oscar.j.benja...@gmail.com wrote:
On 26 November 2013 06:18, John O'Hagan resea...@johnohagan.com wrote:
[...]

    def _multicombs(prepend, words, r, chkstr):
        """chkstr is the string of remaining available characters"""
        if r == 0:
            yield prepend, chkstr
            return
        (word, count, rem), *remaining = words
        newstr = chkstr
        for k in range(max(r - rem, 0), min(count, r) + 1):
            for letter in word * k:
                if letter in newstr:
                    newstr = newstr.replace(letter, '', 1)
                else:
                    return
            yield from _multicombs(prepend + (word,) * k, remaining, r - k, newstr)

Further micro-optimisations are possible. One is to inline the lower recursion levels. For example, if the termination condition is checked immediately before recursion, you can eliminate a redundant generator function call.

By progressively checking against the available characters, it avoids fruitless recursion branches. I'm not 100% sure I'm doing this right yet, but so far it's promising. This is the quickest non-redundant approach I've used so far (although strangely, the original product-only version, which produces a lot of redundant results, is still by far the quickest).

Bear in mind that the interpreter does a lot of redundant things, so what you might think of as non-redundant code can actually be highly redundant once interpreter overhead is accounted for.

I don't mind redundancy as long as I don't have to see it in the output!

Recently I've been trying out PyPy in my own work and it is *really fast*. In fact it is so much faster that I've given up on the idea of speed-optimising code for CPython: just use PyPy instead - although it is worth optimising highly intensive code for PyPy.

I'll have a look at it.

I'll try to explain better what I'm actually doing. It's quite long, but you seem to be wondering, or maybe someone else is!

[snip]

Okay, I understand now. I was confused because I think of anagrams as being words of the same length.
I didn't understand how your multiword version works, but I think I do now. In my own words: you want to generate all sequences of English words having the same letters as some input string, ignoring the spaces in both the input and output strings (so that the number or length of the individual words need not be the same).

[snip]

For example, take the phrase "break beat". I make a wordlength dictionary of sub-alphabet words, using the system dictionary file, like this:

    {1: ['a'],
     2: ['at', 'be', 're'],
     3: ['are', 'ark', 'art', 'ate', 'baa', 'bar', 'bat', 'bee', 'bet', 'bra',
         'ear', 'eat', 'ebb', 'eke', 'era', 'ere', 'eta', 'rat', 'tab', 'tar',
         'tea', 'tee'],
     4: ['abet', 'area', 'babe', 'bake', 'barb', 'bare', 'bark', 'bate',
         'beak', 'bear', 'beat', 'beer', 'beet', 'beta', 'brat', 'rake',
         'rate', 'reek', 'take', 'tare', 'teak', 'tear', 'tree', 'trek'],
     5: ['abate', 'baker', 'beret', 'brake', 'break', 'eater', 'karat',
         'kebab', 'taker'],
     6: ['aerate', 'beaker', 'beater', 'berate', 'betake', 'karate', 'rebate',
         'retake']}

Okay, how about the following (pseudoish-code)?

    def word_sequences(prepend, target, subwords):
        # subwords is a list rather than a dict
        for n in range(1, len(subwords)):
            for word in subwords[n]:
                if equal_in_a_multiset_sense(word, target):
                    yield prepend + ' ' + target
                elif is_a_submultiset(word, target):
                    recurse_target = multiset_subtract(target, word)
                    subsubwords = list_of_subwords_by_length[:len(recurse_target)]
                    if any(subsubwords):
                        yield from word_sequences(prepend + ' ' + word,
                                                  recurse_target, subsubwords)

[...]

That is very clever: direct, economical, and it does effectively the same pruning job as my approach without all the combinatorial brainstrain. I got it working by changing "for n in range..." to "for words in subwords" (otherwise it missed the one-letter words), and the first yield needed to be prepend + ' ' + word.
I simplified it a bit more, to this:

    def word_sequences(prepend, target, subwords):
        """subwords is a list of lists of subwords grouped by length,
        in order of length"""
        for words in subwords:
            for word in words:
                recurse_target = subset_subtract(target, word)
                if recurse_target:
                    yield from word_sequences(prepend + ' ' + word,
                                              recurse_target,
                                              subwords[:len(recurse_target)])
                elif recurse_target == '':
                    yield prepend + ' ' + word

with just one function to do the subset testing:

    def subset_subtract(target, word):
        for i in word:
            if i in target:
                target = target.replace(i, '', 1)
            else:
                return
        return target

Speed-wise it is somewhat faster than any of my non-duplicate-producing attempts.
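As a self-contained check of the two functions above (the toy word lists here are illustrative, not from the poster's dictionary), the simplified generator can be exercised like this. Note how multiword results come out once per word ordering, which is the duplication discussed later in the thread:

```python
def subset_subtract(target, word):
    # Remove word's letters from target one at a time; return the leftover
    # letters, or None if word is not a sub-multiset of target.
    for letter in word:
        if letter in target:
            target = target.replace(letter, '', 1)
        else:
            return None
    return target

def word_sequences(prepend, target, subwords):
    # subwords is a list of lists of words grouped by length, shortest
    # first, so subwords[:len(rest)] keeps only words that can still fit.
    for words in subwords:
        for word in words:
            rest = subset_subtract(target, word)
            if rest:
                yield from word_sequences(prepend + ' ' + word, rest,
                                          subwords[:len(rest)])
            elif rest == '':
                yield (prepend + ' ' + word).strip()

# Single-word anagrams of 'ate':
print(sorted(word_sequences('', 'ate', [['a'], ['at'], ['ate', 'eat', 'tea']])))
# -> ['ate', 'eat', 'tea']

# Word order is significant, so each multiword result appears once per
# ordering of its words:
print(sorted(word_sequences('', 'atta', [[], ['at', 'ta']])))
# -> ['at at', 'at ta', 'ta at', 'ta ta']
```

The second call shows why an outer helper that removes each word after exhausting it (as proposed later in the thread) is needed to suppress the n! orderings of each unique word multiset.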
Re: Recursive generator for combinations of a multiset?
On 27 November 2013 10:25, John O'Hagan resea...@johnohagan.com wrote:
On Tue, 26 Nov 2013 10:33:06 +0000 Oscar Benjamin oscar.j.benja...@gmail.com wrote:

I simplified it a bit more, to this:

    def word_sequences(prepend, target, subwords):
        """subwords is a list of lists of subwords grouped by length,
        in order of length"""
        for words in subwords:
            for word in words:
                recurse_target = subset_subtract(target, word)
                if recurse_target:
                    yield from word_sequences(prepend + ' ' + word,
                                              recurse_target,
                                              subwords[:len(recurse_target)])
                elif recurse_target == '':
                    yield prepend + ' ' + word

with just one function to do the subset testing:

    def subset_subtract(target, word):
        for i in word:
            if i in target:
                target = target.replace(i, '', 1)
            else:
                return
        return target

Speed-wise it is somewhat faster than any of my non-duplicate-producing attempts, but still way slower than the current champion, the redundant cartesian-product-only version. However, because it iterates through all the remaining words on each recursion, it seems to produce n! copies of each unique result, where n is the number of words in the result, so this is the new champion as far as redundancy is concerned. I'll keep working on it; the totally different approach is interesting.

Whoops, I guess this is what happens when you send untested (pseudo-)code out. It needs an outer helper function that can do something like:

    def word_sequences_top(target, subwords):
        for word in copy(subwords):
            recurse_target = multiset_subtract(target, word)
            yield from word_sequences(word, recurse_target, subwords)
            remove_word_from_list(word, subwords)

This way we yield all matches involving each word once, and then go on to all matches that don't include that word. Also, the partition-length logic from your original version can be used in word_sequences to prune recursion branches.

Oscar
--
https://mail.python.org/mailman/listinfo/python-list
Re: Recursive generator for combinations of a multiset?
On 26 November 2013 06:18, John O'Hagan resea...@johnohagan.com wrote:

Thanks, that is exactly the type of thing I was after. Although it is actually slower as-is than a hack like

    def imulticombs(it, n):
        last = ()
        for i in itertools.combinations(it, n):
            if i > last:
                yield i
                last = i

it can be modified to be a lot faster for my use-case, like this:

    def _multicombs(prepend, words, r, chkstr):
        """chkstr is the string of remaining available characters"""
        if r == 0:
            yield prepend, chkstr
            return
        (word, count, rem), *remaining = words
        newstr = chkstr
        for k in range(max(r - rem, 0), min(count, r) + 1):
            for letter in word * k:
                if letter in newstr:
                    newstr = newstr.replace(letter, '', 1)
                else:
                    return
            yield from _multicombs(prepend + (word,) * k, remaining, r - k, newstr)

Further micro-optimisations are possible. One is to inline the lower recursion levels. For example, if the termination condition is checked immediately before recursion, you can eliminate a redundant generator function call.

By progressively checking against the available characters, it avoids fruitless recursion branches. I'm not 100% sure I'm doing this right yet, but so far it's promising. This is the quickest non-redundant approach I've used so far (although strangely, the original product-only version, which produces a lot of redundant results, is still by far the quickest).

Bear in mind that the interpreter does a lot of redundant things, so what you might think of as non-redundant code can actually be highly redundant once interpreter overhead is accounted for.

Recently I've been trying out PyPy in my own work and it is *really fast*. In fact it is so much faster that I've given up on the idea of speed-optimising code for CPython: just use PyPy instead - although it is worth optimising highly intensive code for PyPy.

I'll try to explain better what I'm actually doing. It's quite long, but you seem to be wondering, or maybe someone else is!

[snip]

Okay, I understand now.
I was confused because I think of anagrams as being words of the same length. I didn't understand how your multiword version works, but I think I do now. In my own words: you want to generate all sequences of English words having the same letters as some input string, ignoring the spaces in both the input and output strings (so that the number or length of the individual words need not be the same).

[snip]

For example, take the phrase "break beat". I make a wordlength dictionary of sub-alphabet words, using the system dictionary file, like this:

    {1: ['a'],
     2: ['at', 'be', 're'],
     3: ['are', 'ark', 'art', 'ate', 'baa', 'bar', 'bat', 'bee', 'bet', 'bra',
         'ear', 'eat', 'ebb', 'eke', 'era', 'ere', 'eta', 'rat', 'tab', 'tar',
         'tea', 'tee'],
     4: ['abet', 'area', 'babe', 'bake', 'barb', 'bare', 'bark', 'bate',
         'beak', 'bear', 'beat', 'beer', 'beet', 'beta', 'brat', 'rake',
         'rate', 'reek', 'take', 'tare', 'teak', 'tear', 'tree', 'trek'],
     5: ['abate', 'baker', 'beret', 'brake', 'break', 'eater', 'karat',
         'kebab', 'taker'],
     6: ['aerate', 'beaker', 'beater', 'berate', 'betake', 'karate', 'rebate',
         'retake']}

Okay, how about the following (pseudoish-code)?

    def word_sequences(prepend, target, subwords):
        # subwords is a list rather than a dict
        for n in range(1, len(subwords)):
            for word in subwords[n]:
                if equal_in_a_multiset_sense(word, target):
                    yield prepend + ' ' + target
                elif is_a_submultiset(word, target):
                    recurse_target = multiset_subtract(target, word)
                    subsubwords = list_of_subwords_by_length[:len(recurse_target)]
                    if any(subsubwords):
                        yield from word_sequences(prepend + ' ' + word,
                                                  recurse_target, subsubwords)

Your general approach seems reasonable to me now that I understand the problem. The word_sequences generator above is the solution I would probably write at first if faced with the problem. I suspect that your current approach using partition sizes is better, though.

Oscar
--
https://mail.python.org/mailman/listinfo/python-list
Re: Recursive generator for combinations of a multiset?
On 21 November 2013 13:01, John O'Hagan resea...@johnohagan.com wrote:

In my use-case the first argument to multicombs is a tuple of words which may contain duplicates, and it produces all unique combinations of a certain length of those words, e.g.:

    >>> list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))
    [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
     ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
     ('in', 'the', 'the')]

I still don't understand what you're actually doing well enough to know whether there is a better general approach to the problem. For the specific thing you requested, here is a recursive multiset combinations generator. Does it do what you wanted?

    #!/usr/bin/env python

    def multicombs(it, r):
        words = []
        last = None
        for N, word in enumerate(it):
            if word == last:
                words[-1][1] += 1
            else:
                words.append([word, 1])
                last = word
        cumulative = 0
        for n in range(len(words) - 1, -1, -1):
            words[n].append(cumulative)
            cumulative += words[n][1]
        return _multicombs((), words, r)

    def _multicombs(prepend, words, r):
        if r == 0:
            yield prepend
            return
        (word, count, rem), *remaining = words
        for k in range(max(r - rem, 0), min(count, r) + 1):
            yield from _multicombs(prepend + (word,) * k, remaining, r - k)

    expected = [
        ('cat', 'hat', 'in'),
        ('cat', 'hat', 'the'),
        ('cat', 'in', 'the'),
        ('cat', 'the', 'the'),
        ('hat', 'in', 'the'),
        ('hat', 'the', 'the'),
        ('in', 'the', 'the'),
    ]

    output = list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))

    assert len(expected) == len(output)
    assert set(expected) == set(output)  # The order is different

Oscar
--
https://mail.python.org/mailman/listinfo/python-list
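The run-length bookkeeping in the generator above (word, count, and the number of items remaining after the word's group) can also be phrased with collections.Counter. This is an equivalent sketch, my reformulation rather than code from the thread, keeping the same (word, count, rem) triples and the same recursive core:

```python
from collections import Counter

def multicombs(it, r):
    # Counter collapses duplicates into (word, count) pairs in first-seen
    # order; rem is the number of items strictly after each word's group,
    # used to prune branches that can no longer reach length r.
    counts = list(Counter(it).items())
    rems, total = [], 0
    for _, c in reversed(counts):
        rems.append(total)
        total += c
    rems.reverse()
    words = [(w, c, rem) for (w, c), rem in zip(counts, rems)]
    return _multicombs((), words, r)

def _multicombs(prepend, words, r):
    if r == 0:
        yield prepend
        return
    (word, count, rem), *remaining = words
    # Take k copies of the current word; at least r - rem are needed,
    # since only rem further items exist to fill the combination.
    for k in range(max(r - rem, 0), min(count, r) + 1):
        yield from _multicombs(prepend + (word,) * k, remaining, r - k)

print(list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3)))
```

Unlike the original, this version does not require the input to be sorted, only that equal words be counted together, which Counter guarantees.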
Re: Recursive generator for combinations of a multiset?
On Mon, 25 Nov 2013 12:15:15 +0000 Oscar Benjamin oscar.j.benja...@gmail.com wrote:
On 21 November 2013 13:01, John O'Hagan resea...@johnohagan.com wrote:

In my use-case the first argument to multicombs is a tuple of words which may contain duplicates, and it produces all unique combinations of a certain length of those words, e.g.:

    >>> list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))
    [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
     ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
     ('in', 'the', 'the')]

I still don't understand what you're actually doing well enough to know whether there is a better general approach to the problem. For the specific thing you requested, here is a recursive multiset combinations generator. Does it do what you wanted?

    #!/usr/bin/env python

    def multicombs(it, r):
        words = []
        last = None
        for N, word in enumerate(it):
            if word == last:
                words[-1][1] += 1
            else:
                words.append([word, 1])
                last = word
        cumulative = 0
        for n in range(len(words) - 1, -1, -1):
            words[n].append(cumulative)
            cumulative += words[n][1]
        return _multicombs((), words, r)

    def _multicombs(prepend, words, r):
        if r == 0:
            yield prepend
            return
        (word, count, rem), *remaining = words
        for k in range(max(r - rem, 0), min(count, r) + 1):
            yield from _multicombs(prepend + (word,) * k, remaining, r - k)

Thanks, that is exactly the type of thing I was after.
Although it is actually slower as-is than a hack like

    def imulticombs(it, n):
        last = ()
        for i in itertools.combinations(it, n):
            if i > last:
                yield i
                last = i

it can be modified to be a lot faster for my use-case, like this:

    def _multicombs(prepend, words, r, chkstr):
        """chkstr is the string of remaining available characters"""
        if r == 0:
            yield prepend, chkstr
            return
        (word, count, rem), *remaining = words
        newstr = chkstr
        for k in range(max(r - rem, 0), min(count, r) + 1):
            for letter in word * k:
                if letter in newstr:
                    newstr = newstr.replace(letter, '', 1)
                else:
                    return
            yield from _multicombs(prepend + (word,) * k, remaining, r - k, newstr)

By progressively checking against the available characters, it avoids fruitless recursion branches. I'm not 100% sure I'm doing this right yet, but so far it's promising. This is the quickest non-redundant approach I've used so far (although strangely, the original product-only version, which produces a lot of redundant results, is still by far the quickest).

This is how I'm calling it:

    def cartesian_product(args, instr):
        if not args:
            yield (), instr
            return
        for items, chkstr in cartesian_product(args[:-1], instr):
            # this prevents bothering with anything useless
            words, repeats = args[-1]
            if repeats == 1:
                for word in words:
                    newstr = chkstr
                    for letter in word:
                        if letter in newstr:
                            newstr = newstr.replace(letter, '', 1)
                        else:
                            break
                    else:
                        yield items + (word,), newstr
            else:
                for item, newstr in multicombs(words, repeats, chkstr):
                    yield items + item, newstr

I'll try to explain better what I'm actually doing. It's quite long, but you seem to be wondering, or maybe someone else is!

It's an old anagram program I wrote for practice a few years ago, which produces the phrases of dictionary words which together contain the same letter-multiset as the input phrase. I would be interested to know if you think there is a better general approach.
My approach is:

- make a dictionary of candidate words composed of the same sub-alphabet as the input, using their lengths as keys, then
- for each integer partition of the length of the input, replace each partition element with the corresponding value from the dictionary, then
- take the cartesian product of the resulting list of lists of words, rejecting any output which exceeds any of the letter frequencies of the input.

I have omitted describing some complex optimisations of the lists, which gave only fractional speedups. However, I recently discovered that I could get an orders-of-magnitude speedup by using my own recursive cartesian product algorithm, which checks letter frequencies on the fly, rather than using itertools.product and testing each output phrase. (This algorithm was in my original post.)

For example, take the phrase "break beat". I make a wordlength dictionary of sub-alphabet words, using the system dictionary file, like this:

    {1: ['a'],
     2: ['at', 'be', 're'],
     3: ['are', 'ark', 'art', 'ate', 'baa', 'bar', 'bat', 'bee', 'bet', 'bra',
         'ear', 'eat', 'ebb', 'eke', 'era', 'ere
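The second step above relies on enumerating the integer partitions of the input length. A minimal recursive generator for them, an illustrative sketch rather than the poster's actual partition code (which is not shown in the thread), might look like:

```python
def partitions(n, largest=None):
    # Yield the integer partitions of n as non-increasing tuples, e.g.
    # 4 -> (4,), (3, 1), (2, 2), (2, 1, 1), (1, 1, 1, 1).
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

# For the 9 letters of "break beat", a partition such as (5, 4) would be
# filled in with the length-5 and length-4 word lists from the dictionary.
print(list(partitions(4)))
# -> [(4,), (3, 1), (2, 2), (2, 1, 1), (1, 1, 1, 1)]
```

Each partition then maps directly to one "list of lists of words" whose cartesian product is filtered against the input's letter frequencies.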
Re: Recursive generator for combinations of a multiset?
On Sat, 23 Nov 2013 04:23:42 +0000 MRAB pyt...@mrabarnett.plus.com wrote:
On 23/11/2013 00:58, John O'Hagan wrote:
On Thu, 21 Nov 2013 12:59:26 -0800 Dan Stromberg drsali...@gmail.com wrote:
On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan resea...@johnohagan.com wrote:

Short story: the subject says it all, so if you have an answer already, fire away. Below is the long story of what I'm using it for, and why I think it needs to be recursive. It may even be of more general interest in terms of filtering the results of generators.

I think you probably need permutations rather than combinations. Also, I think you'll need to form a word (partitioned off by spaces), and then check it against a set containing /usr/share/dict/words before recursing for the remainder of the sentence - this should speed things up a LOT.

Thanks for the reply. If I understand you correctly, you are suggesting permuting the input _characters_ to form words and then seeing if they exist, as opposed to my approach of combining known words and seeing if they are anagrams. (Permutations of words would not help find anagrams, as they merely change the word order.) Here is an attempt at that:

    def anagrams(partition, input_string):
        """Find anagrams which fit given partition of input string length"""
        if not partition:
            yield (), input_string
            return
        for words, checkstring in anagrams(partition[:-1], input_string):
            for word in itertools.permutations(checkstring, partition[-1]):
                word = ''.join(word)
                if word in WORDS:  # WORDS is a collection of dictionary words
                    newstring = checkstring
                    for l in word:
                        newstring = newstring.replace(l, '', 1)
                    yield words + (word,), newstring

There are two problems with this. If there are repeated characters in the input, redundant results are produced; a multiset-permutation algorithm would fix this. But the main problem is that it is incredibly slow: on my run-of-the-mill laptop it chokes on anything longer than about 10 characters, spending most of its time rejecting non-words.
Or have I misunderstood your suggestion?

If you want to know how to get unique permutations, have a look here:
http://mail.python.org/pipermail/python-ideas/2013-October/023610.html

For this particular problem I don't need multiset permutations but multiset combinations (see original post). But that thread was a good read.

This is a little OT, but I did need multiset permutations a couple of years back for a project involving the generation of musical structures. There was zero interest here at the time (fair enough!) and I ended up implementing the C++ function next_permutation in Python. So it was good to read in that thread that there seems to be some interest in incorporating multiset combinatorics into itertools (including your excellent contribution).

IMHO the scepticism there about non-exotic use-cases was unjustified. Leaving aside my probably atypical uses, it crops up in many ordinary situations dealing with arrangements of multiple items of several types where each instance of a type is equivalent. Take stock control: when stacking a warehouse shelf, it doesn't matter which particular box goes where, only how many of each size. Or timetabling: if Mrs Smith teaches the same students maths on Tuesdays and Thursdays, swapping the classes does nothing. The same goes for cyclic and cyclic-multiset (necklace) combinatorics, where the beginning and end of an arrangement is not significant, e.g. 24-hour rostering, laying tiles, etc. And musical scales.

Disclaimer: I am far from expert on combinatorics but seem to end up using it a lot.

--
John
--
https://mail.python.org/mailman/listinfo/python-list
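A Python port of C++'s std::next_permutation, as mentioned above, can be written as follows. This is a sketch of the standard algorithm, not the poster's actual implementation. Because it steps lexicographically in place, it visits each distinct arrangement of a multiset exactly once, unlike itertools.permutations, which treats equal items as distinct:

```python
def next_permutation(seq):
    # Rearrange seq (a mutable sequence) into its next lexicographic
    # permutation in place; return False once seq is fully descending.
    i = len(seq) - 2
    while i >= 0 and seq[i] >= seq[i + 1]:
        i -= 1
    if i < 0:
        return False
    j = len(seq) - 1
    while seq[j] <= seq[i]:
        j -= 1
    seq[i], seq[j] = seq[j], seq[i]
    seq[i + 1:] = reversed(seq[i + 1:])
    return True

def distinct_permutations(iterable):
    # Start from the sorted (smallest) arrangement and step until done.
    seq = sorted(iterable)
    yield tuple(seq)
    while next_permutation(seq):
        yield tuple(seq)

print(list(distinct_permutations('aab')))
# -> [('a', 'a', 'b'), ('a', 'b', 'a'), ('b', 'a', 'a')]
```

For 'aab' this yields 3 results where itertools.permutations would yield 6, which is exactly the redundancy problem described for the anagrams() attempt above.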
Re: Recursive generator for combinations of a multiset?
On Fri, 22 Nov 2013 22:33:29 -0800 Dan Stromberg drsali...@gmail.com wrote:
On Fri, Nov 22, 2013 at 4:58 PM, John O'Hagan resea...@johnohagan.com wrote:
On Thu, 21 Nov 2013 12:59:26 -0800 Dan Stromberg drsali...@gmail.com wrote:
On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan resea...@johnohagan.com wrote:

Short story: the subject says it all, so if you have an answer already, fire away. Below is the long story of what I'm using it for, and why I think it needs to be recursive. It may even be of more general interest in terms of filtering the results of generators.

I think you probably need permutations rather than combinations. Also, I think you'll need to form a word (partitioned off by spaces), and then check it against a set containing /usr/share/dict/words before recursing for the remainder of the sentence - this should speed things up a LOT.

Thanks for the reply. If I understand you correctly, you are suggesting permuting the input _characters_ to form words and then seeing if they exist, as opposed to my approach of combining known words and seeing if they are anagrams. (Permutations of words would not help find anagrams, as they merely change the word order.) Here is an attempt at that:

You've interpreted me correctly. However, I was thinking about this in the back of my mind, and decided it would probably be best to inhale /usr/share/dict/words (if on Linux), pull out words of the correct lengths (as separated by the blanks) over the correct (possible) alphabet, and permute those, afterwards checking whether they form good anagrams of the original sentence. This would probably be much faster, since English isn't that dense a space.

If you look back at my original question, you'll see that's pretty much what I've done. I didn't spell out all the details, but I made a dictionary of wordlength keys containing lists of all dictionary words of that length made of the correct sub-alphabet.
But to recap the problem: to produce non-redundant anagram phrases, I need the cartesian product (not permutations) of these lists if each is unique, but with a subroutine producing multiset combinations if a list is repeated, i.e. if a particular length is called for more than once. The version I have so far is correct but substantially slower than the product-only one, which just goes ahead and produces all the redundant results. This seems counter-intuitive, and my theory is that it is because I am unable to prune the non-recursive combination algorithm I currently have.

Regards,
--
John
--
https://mail.python.org/mailman/listinfo/python-list
Re: Recursive generator for combinations of a multiset?
On Thu, 21 Nov 2013 12:59:26 -0800 Dan Stromberg drsali...@gmail.com wrote:
On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan resea...@johnohagan.com wrote:

Short story: the subject says it all, so if you have an answer already, fire away. Below is the long story of what I'm using it for, and why I think it needs to be recursive. It may even be of more general interest in terms of filtering the results of generators.

I think you probably need permutations rather than combinations. Also, I think you'll need to form a word (partitioned off by spaces), and then check it against a set containing /usr/share/dict/words before recursing for the remainder of the sentence - this should speed things up a LOT.

Thanks for the reply. If I understand you correctly, you are suggesting permuting the input _characters_ to form words and then seeing if they exist, as opposed to my approach of combining known words and seeing if they are anagrams. (Permutations of words would not help find anagrams, as they merely change the word order.) Here is an attempt at that:

    def anagrams(partition, input_string):
        """Find anagrams which fit given partition of input string length"""
        if not partition:
            yield (), input_string
            return
        for words, checkstring in anagrams(partition[:-1], input_string):
            for word in itertools.permutations(checkstring, partition[-1]):
                word = ''.join(word)
                if word in WORDS:  # WORDS is a collection of dictionary words
                    newstring = checkstring
                    for l in word:
                        newstring = newstring.replace(l, '', 1)
                    yield words + (word,), newstring

There are two problems with this. If there are repeated characters in the input, redundant results are produced; a multiset-permutation algorithm would fix this. But the main problem is that it is incredibly slow: on my run-of-the-mill laptop it chokes on anything longer than about 10 characters, spending most of its time rejecting non-words.

Or have I misunderstood your suggestion?

Regards,
--
John
--
https://mail.python.org/mailman/listinfo/python-list
Re: Recursive generator for combinations of a multiset?
On Thu, 21 Nov 2013 18:14:41 -0800 (PST) James hslee...@yahoo.com wrote:
On Thursday, November 21, 2013 5:01:15 AM UTC-8, John O'Hagan wrote:
[...]
On 21 November 2013 06:46, John O'Hagan wrote:
[...]

    def multicombs(it, r):
        result = it[:r]
        yield result
        while 1:
            for i in range(-1, -r - 1, -1):
                rep = result[i]
                if rep < it[i]:
                    break
            else:
                break
            for j, n in enumerate(it):
                if n > rep:
                    break
            result = result[:i] + it[j:j - i]
            yield result

[...]

I neglected to mention that multicombs takes a sorted iterable; it doesn't work right otherwise. I'd forgotten that because my wordlists are guaranteed sorted by the way they're built. Sorry about that.

In my use-case the first argument to multicombs is a tuple of words which may contain duplicates, and it produces all unique combinations of a certain length of those words, e.g.:

    >>> list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))
    [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
     ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
     ('in', 'the', 'the')]

[...]

What I'm looking for is a recursive algorithm which does what multicombs does (order unimportant) so that I can apply a pruning shortcut like the one I used in the recursive cartesian product algorithm in my original post. Multiset combination algorithms seem pretty thin on the ground out there - as I said, I could only find a description of the procedure above, no actual code. The ones I did find are non-recursive. I'm hoping some combinatorics and/or recursion experts can offer advice.

[...]

John

Could you convert the following Perl script to Python?
    use Data::Dump qw(dump);
    dump combo([@ARGV], 3);

    sub combo {
        my ($t, $k) = @_;
        my @T = @$t;
        my @R = ();
        my %g = ();
        if ($k == 1) {
            for (@T) { push @R, $_ unless $g{$_}++; }
        } else {
            while (my $x = shift @T) {
                $p = combo([@T], $k - 1);
                for (@{$p}) {
                    my $q = $x . "," . $_;
                    push @R, $q unless $g{$q}++;
                }
            }
        }
        [@R];
    }

    $ prog.pl cat hat in the the
    [
      "cat,hat,in",
      "cat,hat,the",
      "cat,in,the",
      "cat,the,the",
      "hat,in,the",
      "hat,the,the",
      "in,the,the",
    ]

James

Thanks. Now I just have to learn Perl to understand what that does! :)

Regards,
--
John
--
https://mail.python.org/mailman/listinfo/python-list
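For the record, a fairly literal Python rendering of the Perl above might look like this (my translation, not from the thread): tuples stand in for the comma-joined strings, and a seen-set plays the role of the %g hash.

```python
def combo(items, k):
    # Pick k items in their given order, de-duplicating results with a
    # seen-set, as the Perl combo() does with its %g hash.
    results = []
    seen = set()
    if k == 1:
        for x in items:
            if x not in seen:
                seen.add(x)
                results.append((x,))
        return results
    items = list(items)
    while items:
        x = items.pop(0)  # mirrors Perl's `shift @T`
        for rest in combo(items, k - 1):
            q = (x,) + rest
            if q not in seen:
                seen.add(q)
                results.append(q)
    return results

print(combo(['cat', 'hat', 'in', 'the', 'the'], 3))
```

Like the thread's multicombs, it produces the 7 unique length-3 combinations; unlike the recursive generator, it builds the whole result list eagerly and de-duplicates after the fact rather than never generating duplicates.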
Re: Recursive generator for combinations of a multiset?
On 23/11/2013 00:58, John O'Hagan wrote:
On Thu, 21 Nov 2013 12:59:26 -0800 Dan Stromberg drsali...@gmail.com wrote:
On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan resea...@johnohagan.com wrote:

Short story: the subject says it all, so if you have an answer already, fire away. Below is the long story of what I'm using it for, and why I think it needs to be recursive. It may even be of more general interest in terms of filtering the results of generators.

I think you probably need permutations rather than combinations. Also, I think you'll need to form a word (partitioned off by spaces), and then check it against a set containing /usr/share/dict/words before recursing for the remainder of the sentence - this should speed things up a LOT.

Thanks for the reply. If I understand you correctly, you are suggesting permuting the input _characters_ to form words and then seeing if they exist, as opposed to my approach of combining known words and seeing if they are anagrams. (Permutations of words would not help find anagrams, as they merely change the word order.) Here is an attempt at that:

    def anagrams(partition, input_string):
        """Find anagrams which fit given partition of input string length"""
        if not partition:
            yield (), input_string
            return
        for words, checkstring in anagrams(partition[:-1], input_string):
            for word in itertools.permutations(checkstring, partition[-1]):
                word = ''.join(word)
                if word in WORDS:  # WORDS is a collection of dictionary words
                    newstring = checkstring
                    for l in word:
                        newstring = newstring.replace(l, '', 1)
                    yield words + (word,), newstring

There are two problems with this. If there are repeated characters in the input, redundant results are produced; a multiset-permutation algorithm would fix this. But the main problem is that it is incredibly slow: on my run-of-the-mill laptop it chokes on anything longer than about 10 characters, spending most of its time rejecting non-words.

Or have I misunderstood your suggestion?
If you want to know how to get unique permutations, have a look here: http://mail.python.org/pipermail/python-ideas/2013-October/023610.html -- https://mail.python.org/mailman/listinfo/python-list
Re: Recursive generator for combinations of a multiset?
On Fri, Nov 22, 2013 at 4:58 PM, John O'Hagan resea...@johnohagan.com wrote:
On Thu, 21 Nov 2013 12:59:26 -0800 Dan Stromberg drsali...@gmail.com wrote:
On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan resea...@johnohagan.com wrote:

Short story: the subject says it all, so if you have an answer already, fire away. Below is the long story of what I'm using it for, and why I think it needs to be recursive. It may even be of more general interest in terms of filtering the results of generators.

I think you probably need permutations rather than combinations. Also, I think you'll need to form a word (partitioned off by spaces), and then check it against a set containing /usr/share/dict/words before recursing for the remainder of the sentence - this should speed things up a LOT.

Thanks for the reply. If I understand you correctly, you are suggesting permuting the input _characters_ to form words and then seeing if they exist, as opposed to my approach of combining known words and seeing if they are anagrams. (Permutations of words would not help find anagrams, as they merely change the word order.) Here is an attempt at that:

You've interpreted me correctly. However, I was thinking about this in the back of my mind, and decided it would probably be best to inhale /usr/share/dict/words (if on Linux), pull out words of the correct lengths (as separated by the blanks) over the correct (possible) alphabet, and permute those, afterwards checking whether they form good anagrams of the original sentence. This would probably be much faster, since English isn't that dense a space.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Recursive generator for combinations of a multiset?
On 21 November 2013 06:46, John O'Hagan resea...@johnohagan.com wrote:

I found a verbal description of such an algorithm and came up with this:

def multicombs(it, r):
    result = it[:r]
    yield result
    while 1:
        for i in range(-1, -r - 1, -1):
            rep = result[i]
            if rep < it[i]:
                break
        else:
            break
        for j, n in enumerate(it):
            if n > rep:
                break
        result = result[:i] + it[j:j - i]
        yield result

I'm not really sure what it is you're asking for. I thought if I ran the code I'd understand but that just confused me more. Is the output below correct? If not what should it be?

>>> list(multicombs('abracadabra', 0))
['']
>>> list(multicombs('abracadabra', 1))
['a']
>>> list(multicombs('abracadabra', 2))
['ab', 'br', 'ra']
>>> list(multicombs('abracadabra', 3))
['abr', 'ara', 'bra']
>>> list(multicombs('abracadabra', 4))
['abra']
>>> list(multicombs('abracadabra', 5))
['abrac', 'abrbr', 'abrra', 'braca', 'brara', 'brbra', 'racad', 'racbr', 'racra']

Oscar -- https://mail.python.org/mailman/listinfo/python-list
Re: Recursive generator for combinations of a multiset?
On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan resea...@johnohagan.comwrote: Short story: the subject says it all, so if you have an answer already, fire away. Below is the long story of what I'm using it for, and why I think it needs to be recursive. It may even be of more general interest in terms of filtering the results of generators. I think you probably need permutations rather than combinations. Also, I think you'll need to form a word (partitioned off by spaces), and then check it against a set containing /usr/share/dict/words before recursing for the remainder of the sentence - this should speed things up a LOT. -- https://mail.python.org/mailman/listinfo/python-list
Re: Recursive generator for combinations of a multiset?
On Thursday, November 21, 2013 5:01:15 AM UTC-8, John O'Hagan wrote:

On Thu, 21 Nov 2013 11:42:49 +0000 Oscar Benjamin wrote:

On 21 November 2013 06:46, John O'Hagan wrote:

I found a verbal description of such an algorithm and came up with this:

def multicombs(it, r):
    result = it[:r]
    yield result
    while 1:
        for i in range(-1, -r - 1, -1):
            rep = result[i]
            if rep < it[i]:
                break
        else:
            break
        for j, n in enumerate(it):
            if n > rep:
                break
        result = result[:i] + it[j:j - i]
        yield result

I'm not really sure what it is you're asking for. I thought if I ran the code I'd understand but that just confused me more. Is the output below correct? If not what should it be?

>>> list(multicombs('abracadabra', 0))
['']
>>> list(multicombs('abracadabra', 1))
['a']
>>> list(multicombs('abracadabra', 2))
['ab', 'br', 'ra']
>>> list(multicombs('abracadabra', 3))
['abr', 'ara', 'bra']
>>> list(multicombs('abracadabra', 4))
['abra']
>>> list(multicombs('abracadabra', 5))
['abrac', 'abrbr', 'abrra', 'braca', 'brara', 'brbra', 'racad', 'racbr', 'racra']

I neglected to mention that multicombs takes a sorted iterable; it doesn't work right otherwise. I'd forgotten that because my wordlists are guaranteed sorted by the way they're built. Sorry about that. In my use-case the first argument to multicombs is a tuple of words which may contain duplicates, and it produces all unique combinations of a certain length of those words, eg:

>>> list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))
[('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
 ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
 ('in', 'the', 'the')]

Contrast this with:

>>> list(itertools.combinations(('cat', 'hat', 'in', 'the', 'the'), 3))
[('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'hat', 'the'),
 ('cat', 'in', 'the'), ('cat', 'in', 'the'), ('cat', 'the', 'the'),
 ('hat', 'in', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
 ('in', 'the', 'the')]

which produces results which are redundant for my purposes.
What I'm looking for is a recursive algorithm which does what multicombs does (order unimportant) so that I can apply a pruning shortcut like the one I used in the recursive cartesian product algorithm in my original post. Multiset combination algorithms seem pretty thin on the ground out there - as I said, I could only find a description of the procedure above, no actual code. The ones I did find are non-recursive. I'm hoping some combinatorics and/or recursion experts can offer advice.

Regards,

--

John

Could you convert the following Perl script to Python?

use Data::Dump qw(dump);
dump combo([@ARGV], 3);

sub combo {
    my ($t, $k) = @_;
    my @T = @$t;
    my @R = ();
    my %g = ();
    if ($k == 1) {
        for (@T) {
            push @R, $_ unless $g{$_}++;
        }
    } else {
        while (my $x = shift @T) {
            $p = combo([@T], $k - 1);
            for (@{$p}) {
                my $q = $x . "," . $_;
                push @R, $q unless $g{$q}++;
            }
        }
    }
    [@R];
}

$ prog.pl cat hat in the the
[
  "cat,hat,in",
  "cat,hat,the",
  "cat,in,the",
  "cat,the,the",
  "hat,in,the",
  "hat,the,the",
  "in,the,the",
]

James -- https://mail.python.org/mailman/listinfo/python-list
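A fairly direct Python rendering of the Perl routine above might look like this (my own translation, returning tuples rather than comma-joined strings; only checked against the example run shown):

```python
def combo(t, k):
    """Unique k-combinations of the sequence t, preserving input order.

    Mirrors the Perl combo(): recurse on the tail of the list and use a
    'seen' set in place of the %g hash to drop duplicate results.
    """
    seen, out = set(), []
    if k == 1:
        for x in t:
            if (x,) not in seen:
                seen.add((x,))
                out.append((x,))
        return out
    rest = list(t)
    while rest:
        x = rest.pop(0)           # Perl's: while (my $x = shift @T)
        for sub in combo(rest, k - 1):
            q = (x,) + sub        # Perl's: $x . "," . $_
            if q not in seen:
                seen.add(q)
                out.append(q)
    return out
```

Run on ('cat', 'hat', 'in', 'the', 'the') with k=3 this reproduces the seven combinations listed in the Perl output.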
Re: Recursive generator for combinations of a multiset?
On Thu, 21 Nov 2013 11:42:49 +0000 Oscar Benjamin oscar.j.benja...@gmail.com wrote:

On 21 November 2013 06:46, John O'Hagan resea...@johnohagan.com wrote:

I found a verbal description of such an algorithm and came up with this:

def multicombs(it, r):
    result = it[:r]
    yield result
    while 1:
        for i in range(-1, -r - 1, -1):
            rep = result[i]
            if rep < it[i]:
                break
        else:
            break
        for j, n in enumerate(it):
            if n > rep:
                break
        result = result[:i] + it[j:j - i]
        yield result

I'm not really sure what it is you're asking for. I thought if I ran the code I'd understand but that just confused me more. Is the output below correct? If not what should it be?

>>> list(multicombs('abracadabra', 0))
['']
>>> list(multicombs('abracadabra', 1))
['a']
>>> list(multicombs('abracadabra', 2))
['ab', 'br', 'ra']
>>> list(multicombs('abracadabra', 3))
['abr', 'ara', 'bra']
>>> list(multicombs('abracadabra', 4))
['abra']
>>> list(multicombs('abracadabra', 5))
['abrac', 'abrbr', 'abrra', 'braca', 'brara', 'brbra', 'racad', 'racbr', 'racra']

I neglected to mention that multicombs takes a sorted iterable; it doesn't work right otherwise. I'd forgotten that because my wordlists are guaranteed sorted by the way they're built. Sorry about that. In my use-case the first argument to multicombs is a tuple of words which may contain duplicates, and it produces all unique combinations of a certain length of those words, eg:

>>> list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))
[('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
 ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
 ('in', 'the', 'the')]

Contrast this with:

>>> list(itertools.combinations(('cat', 'hat', 'in', 'the', 'the'), 3))
[('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'hat', 'the'),
 ('cat', 'in', 'the'), ('cat', 'in', 'the'), ('cat', 'the', 'the'),
 ('hat', 'in', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
 ('in', 'the', 'the')]

which produces results which are redundant for my purposes.
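The contrast above can be quantified with a quick check (illustrative only): with one repeated word in a five-word list, almost a third of the raw combinations are duplicates, and the fraction grows with more repeats.

```python
from itertools import combinations

words = ('cat', 'hat', 'in', 'the', 'the')
raw = list(combinations(words, 3))  # C(5, 3) = 10 results, some repeated
unique = set(raw)                   # deduplicated: only 7 distinct results
```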
What I'm looking for is a recursive algorithm which does what multicombs does (order unimportant) so that I can apply a pruning shortcut like the one I used in the recursive cartesian product algorithm in my original post. Multiset combination algorithms seem pretty thin on the ground out there - as I said, I could only find a description of the procedure above, no actual code. The ones I did find are non-recursive. I'm hoping some combinatorics and/or recursion experts can offer advice. Regards, -- John -- https://mail.python.org/mailman/listinfo/python-list
Recursive generator for combinations of a multiset?
Short story: the subject says it all, so if you have an answer already, fire away. Below is the long story of what I'm using it for, and why I think it needs to be recursive. It may even be of more general interest in terms of filtering the results of generators.

I'm playing with an anagram-generating module which works like this:

1) Generate the integer partitions of the length of the input string.

2) For each partition, replace its elements with a list of all dictionary words which are a) the same length as the partition element and b) a sub-multiset of the input string. eg:

'cat in hat' -> ..., [3, 5], ... -> [['act', 'ant', ...], ['antic', 'attic', ...]]

3) Pass each resulting list of wordlists to itertools.product, filtering the output of anything which is not the same multiset of characters as the input string.

This works but gets very slow for long strings. It spends most of its time at the filtering stage because most of the results have too many of some characters (and therefore not enough of others). I do some optimising of the lists prior to running product, but this only shaves off smallish percentages.

I got a big speedup (factor of five for medium-length inputs, much more for longer strings) by replacing itertools.product with this recursive generator:

def cartesian_product(args, input_str):
    if not args:
        yield (), input_str
        return
    for words, check_str in cartesian_product(args[:-1], input_str):
        for word in args[-1]:
            # this bit prevents bothering with anything useless
            new_str = check_str
            for letter in word:
                if letter in new_str:
                    new_str = new_str.replace(letter, '', 1)
                else:
                    break
            else:
                yield words + (word,), new_str

Despite being inherently much slower than itertools.product, it can prune branches of the recursion as soon as it accumulates too many of any character. This means it's much faster and produces correct results without further filtering. But there is another problem.
The initial partitions contain repeated elements, so the corresponding wordlists are also repeated. This means that any cartesian product will contain many redundant results - the same words in a different order. For a medium-sized string, this is most of them.

A solution for repeated sections of a partition of wordlists is to use r-combinations (where r is the number of repeats). In this scenario, though, some words may be usable more than once, and the right number of copies of these words must be added to the list to allow this. This means I need the combinations of a multiset (so I can't use itertools.combinations). I found a verbal description of such an algorithm and came up with this:

def multicombs(it, r):
    result = it[:r]
    yield result
    while 1:
        for i in range(-1, -r - 1, -1):
            rep = result[i]
            if rep < it[i]:
                break
        else:
            break
        for j, n in enumerate(it):
            if n > rep:
                break
        result = result[:i] + it[j:j - i]
        yield result

I added a call to this generator in a branch to the main generator above to deal with repeated elements. This eliminates redundant results, but with a substantial slowdown. The program now spends most of its time inside multicombs. I'm hoping that if I could find a recursive multiset combination generator, I could speed it up using the same pruning approach. Any suggestions?

--

John

-- https://mail.python.org/mailman/listinfo/python-list
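One recursive shape for a multiset-combination generator of the kind asked for here (my own sketch, not code from this thread): group the sorted input into (element, count) pairs, then for each element choose how many copies to take before recursing on the remaining groups. Because it recurses, the same character-pruning used in cartesian_product could be threaded through it.

```python
from itertools import groupby


def multicombs_rec(it, r):
    """Yield each distinct r-combination of the iterable it exactly once."""
    grouped = [(k, len(list(g))) for k, g in groupby(sorted(it))]

    def rec(groups, r):
        if r == 0:
            yield ()
            return
        if not groups:
            return
        (elem, count), rest = groups[0], groups[1:]
        # k copies of elem, constrained so the rest can still fill r - k slots
        remaining = sum(c for _, c in rest)
        for k in range(max(r - remaining, 0), min(count, r) + 1):
            for tail in rec(rest, r - k):
                yield (elem,) * k + tail

    yield from rec(grouped, r)
```

On the thread's example, list(multicombs_rec(('cat', 'hat', 'in', 'the', 'the'), 3)) produces the same seven unique combinations as multicombs, with no duplicates.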