Re: Recursive generator for combinations of a multiset?

2013-12-05 Thread Dan Stromberg
On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan resea...@johnohagan.com wrote:


 Short story: the subject says it all, so if you have an answer already,
 fire away. Below is the long story of what I'm using it for, and why I
 think it needs to be recursive. It may even be of more general
 interest in terms of filtering the results of generators.



 Any suggestions?

I've updated my code at
http://stromberg.dnsalias.org/svn/anagrams/trunk/ - it's multiword now.
It can blast through the word "punishment" in 4
seconds, but for "The public art galleries" it ate about 7 gigabytes of RAM
and ran for more than a day before I killed it.

I believe it's an exponential problem.  Parallelization might help, but
it'd probably take a lot of RAM that way.  Maybe the RAM use would be
better with CPython, but it's much faster with Pypy; I did most of my
testing with Pypy 2.2.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Recursive generator for combinations of a multiset?

2013-11-27 Thread John O'Hagan
On Tue, 26 Nov 2013 10:33:06 +
Oscar Benjamin oscar.j.benja...@gmail.com wrote:

 On 26 November 2013 06:18, John O'Hagan resea...@johnohagan.com
 wrote:
 
[...]
 
  def _multicombs(prepend, words, r, chkstr):
  """chkstr is the string of remaining available characters"""
  if r == 0:
  yield prepend, chkstr
  return
  (word, count, rem), *remaining = words
  newstr = chkstr
  for k in range(max(r-rem, 0), min(count, r) + 1):
  for letter in word * k:
  if letter in newstr:
  newstr = newstr.replace(letter, '' , 1)
  else:
  return
  yield from _multicombs(prepend + (word,) * k, remaining,
  r-k, newstr)
 
 Further micro-optimisations are possible. One is to inline the lower
 recursion levels. For example if the termination condition is checked
 immediately before recursion you can eliminate a redundant generator
 function call.
 
  By progressively checking against the available characters, it
  avoids fruitless recursion branches. I'm not 100% sure I'm doing
  this right yet but so far it's promising. This is the quickest
  non-redundant approach I've used so far (although strangely, the
  original product-only version which produces a lot of redundant
  results is still by far the quickest.)
 
 Bear in mind that the interpreter does a lot of redundant things so
 what you might think of as non-redundant code can actually be highly
 redundant when interpreter over-head is accounted for.

I don't mind redundancy as long as I don't have to see it in the output!

 Recently I've been trying out PyPy in my own work and it is *really
 fast*. In fact it is so much faster that I've given up on the idea of
 speed-optimising code for CPython: just use PyPy instead - although it
 is worth optimising highly intensive code for PyPy.

I'll have a look at it.

  I'll try to explain better what I'm actually doing. It's quite long,
  but you seem to be wondering, or maybe someone else is!
 
 [snip]
 
 Okay I understand now. I was confused because I think of anagrams as
 being words of the same length. I didn't understand how your multiword
 version works but I think I do now. In my own words: You want to
 generate all sequences of English words having the same letters as
 some input string, ignoring the spaces in both the input and output
 strings (so that the number or length of the individual words need not
 be the same).
 
 [snip]
 
  For example, take the phrase "break beat". I make a wordlength
  dictionary of sub-alphabet words, using the system dictionary file,
  like this:
 
   {1: ['a'], 2: ['at', 'be', 're'], 3: ['are', 'ark', 'art', 'ate',
   'baa', 'bar', 'bat', 'bee', 'bet', 'bra', 'ear', 'eat', 'ebb',
  'eke', 'era', 'ere', 'eta', 'rat', 'tab', 'tar', 'tea', 'tee'], 4:
  ['abet', 'area', 'babe', 'bake', 'barb', 'bare', 'bark', 'bate',
  'beak', 'bear', 'beat', 'beer', 'beet', 'beta', 'brat', 'rake',
  'rate', 'reek', 'take', 'tare', 'teak', 'tear', 'tree', 'trek'], 5:
  ['abate', 'baker', 'beret', 'brake', 'break', 'eater', 'karat',
  'kebab', 'taker'], 6: ['aerate', 'beaker', 'beater', 'berate',
  'betake', 'karate', 'rebate', 'retake']}
 
 Okay, how about the following (pseudoish-code)?
 
 def word_sequences(prepend, target, subwords):
 # subwords is a list rather than a dict
 for n in range(1, len(subwords)):
 for word in subwords[n]:
 if equal_in_a_multiset_sense(word, target):
 yield prepend + ' ' + target
 elif is_a_submultiset(word, target):
 recurse_target = multiset_subtract(target, word)
 subsubwords = list_of_subwords_by_length[:len(recurse_target)]
 if any(subsubwords):
 yield from word_sequences(prepend + ' ' + word,
 recurse_target, subsubwords)
 

[...]

That is very clever. Direct, economical and does effectively the
same pruning job as my approach without all the combinatorial
brainstrain. I got it working by changing "for n in range..." to "for
words in subwords" (otherwise it missed the one-letter words), and the
first yield needed to be prepend + ' ' + word.

I simplified it a bit more to this:

def word_sequences(prepend, target, subwords):
    """subwords is a list of lists of subwords grouped by length,
    in order of length"""
    for words in subwords:
        for word in words:
            recurse_target = subset_subtract(target, word)
            if recurse_target:
                yield from word_sequences(prepend + ' ' + word,
                                          recurse_target,
                                          subwords[:len(recurse_target)])
            elif recurse_target == '':
                yield prepend + ' ' + word

with just one function to do the subset testing:

def subset_subtract(target, word):
    for i in word:
        if i in target:
            target = target.replace(i, '', 1)
        else:
            return
    return target
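
As a self-contained check of the two functions above (the tiny three-list word grouping and the "catat" target are made up here purely for illustration):

```python
def subset_subtract(target, word):
    # Remove word's letters from target one at a time; return None
    # (falsy) if word is not a sub-multiset of target.  An empty
    # string means word used up target exactly.
    for ch in word:
        if ch in target:
            target = target.replace(ch, '', 1)
        else:
            return None
    return target

def word_sequences(prepend, target, subwords):
    # subwords is a list of word lists grouped by length:
    # subwords[0] holds the 1-letter words, subwords[1] the
    # 2-letter words, and so on.
    for words in subwords:
        for word in words:
            recurse_target = subset_subtract(target, word)
            if recurse_target:
                yield from word_sequences(prepend + ' ' + word,
                                          recurse_target,
                                          subwords[:len(recurse_target)])
            elif recurse_target == '':
                yield prepend + ' ' + word

# The letters of "cat at"; note each answer comes out once per word
# order, which is the duplication discussed later in the thread.
subwords = [['a'], ['at'], ['cat']]
print(sorted(word_sequences('', 'catat', subwords)))
# -> [' at cat', ' cat at']
```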

Speed-wise it is somewhat faster than any of my non-duplicate-producing

Re: Recursive generator for combinations of a multiset?

2013-11-27 Thread Oscar Benjamin
On 27 November 2013 10:25, John O'Hagan resea...@johnohagan.com wrote:
 On Tue, 26 Nov 2013 10:33:06 +
 Oscar Benjamin oscar.j.benja...@gmail.com wrote:

 I simplified it a bit more to this:

 def word_sequences(prepend, target, subwords):
 """subwords is a list of lists of subwords grouped by length,
 in order of length"""
 for words in subwords:
 for word in words:
 recurse_target = subset_subtract(target, word)
 if recurse_target:
 yield from word_sequences(prepend + ' ' + word,
 recurse_target, subwords[:len(recurse_target)])
 elif recurse_target == '':
 yield prepend + ' ' + word

 with just one function to do the subset testing:

 def subset_subtract(target, word):
 for i in word:
 if i in target:
 target = target.replace(i, '' ,1)
 else:
 return
 return target

 Speed-wise it is somewhat faster than any of my non-duplicate-producing
 attempts, but still way slower than the current champion, the redundant
 cartesian product-only version.

 However, because it iterates through all the remaining words on each
 recursion, it seems to produce n! of each unique result, where n in the
 number of words in the result, so this is the new champion as far as
 redundancy is concerned. I'll keep working on it, the totally
 different approach is interesting.

Whoops, I guess this is what happens when you send untested
(pseudo-)code out. It needs an outer helper function that can do
something like:

def word_sequences_top(target, subwords):
    for word in copy(subwords):
        recurse_target = multiset_subtract(target, word)
        yield from word_sequences(word, recurse_target, subwords)
        remove_word_from_list(word, subwords)

This way we yield all matches involving the word once and then go on
to all matches that don't include the word.

Also the partition length logic from your original version can be used
in word_sequences to prune recursion branches.
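
One concrete way to get the same yield-each-multiset-once effect (a sketch, not Oscar's actual code: it uses collections.Counter and a start index in place of physically removing words from the list):

```python
from collections import Counter

def anagram_phrases(target, words):
    """Yield each multiset of words spelling `target` exactly once.

    The start index plays the role of removing already-tried words:
    branch i may reuse word i, but never revisits earlier words, so
    every multiset of words is produced exactly once.
    """
    letters = Counter(target.replace(' ', ''))
    counts = [Counter(w) for w in words]

    def rec(start, remaining):
        if not remaining:                    # empty Counter: all letters used
            yield ()
            return
        for i in range(start, len(words)):
            if not counts[i] - remaining:    # sub-multiset test
                for rest in rec(i, remaining - counts[i]):
                    yield (words[i],) + rest

    yield from rec(0, letters)

print(list(anagram_phrases('cat at', ['a', 'at', 'cat'])))
# -> [('at', 'cat')]
```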


Oscar


Re: Recursive generator for combinations of a multiset?

2013-11-26 Thread Oscar Benjamin
On 26 November 2013 06:18, John O'Hagan resea...@johnohagan.com wrote:

 Thanks, that is exactly the type of thing I was after. Although it is
 actually slower as is than a hack like

 def imulticombs(it, n):
 last = ()
 for i in itertools.combinations(it,n):
  if i > last:
 yield i
 last = i

 it can be modified to be a lot faster for my use-case like this:

 def _multicombs(prepend, words, r, chkstr):
  """chkstr is the string of remaining available characters"""
 if r == 0:
 yield prepend, chkstr
 return
 (word, count, rem), *remaining = words
 newstr = chkstr
 for k in range(max(r-rem, 0), min(count, r) + 1):
 for letter in word * k:
 if letter in newstr:
 newstr = newstr.replace(letter, '' , 1)
 else:
 return
 yield from _multicombs(prepend + (word,) * k, remaining, r-k,
 newstr)

Further micro-optimisations are possible. One is to inline the lower
recursion levels. For example if the termination condition is checked
immediately before recursion you can eliminate a redundant generator
function call.

 By progressively checking against the available characters, it avoids
 fruitless recursion branches. I'm not 100% sure I'm doing this right yet
 but so far it's promising. This is the quickest non-redundant approach
 I've used so far (although strangely, the original product-only version
 which produces a lot of redundant results is still by far the quickest.)

Bear in mind that the interpreter does a lot of redundant things so
what you might think of as non-redundant code can actually be highly
redundant when interpreter over-head is accounted for.

Recently I've been trying out PyPy in my own work and it is *really
fast*. In fact it is so much faster that I've given up on the idea of
speed-optimising code for CPython: just use PyPy instead - although it
is worth optimising highly intensive code for PyPy.

 I'll try to explain better what I'm actually doing. It's quite long,
 but you seem to be wondering, or maybe someone else is!

[snip]

Okay I understand now. I was confused because I think of anagrams as
being words of the same length. I didn't understand how your multiword
version works but I think I do now. In my own words: You want to
generate all sequences of English words having the same letters as
some input string, ignoring the spaces in both the input and output
strings (so that the number or length of the individual words need not
be the same).

[snip]

 For example, take the phrase "break beat". I make a wordlength
 dictionary of sub-alphabet words, using the system dictionary file,
 like this:

  {1: ['a'], 2: ['at', 'be', 're'], 3: ['are', 'ark', 'art', 'ate',
  'baa', 'bar', 'bat', 'bee', 'bet', 'bra', 'ear', 'eat', 'ebb', 'eke',
  'era', 'ere', 'eta', 'rat', 'tab', 'tar', 'tea', 'tee'], 4: ['abet',
  'area', 'babe', 'bake', 'barb', 'bare', 'bark', 'bate', 'beak',
  'bear', 'beat', 'beer', 'beet', 'beta', 'brat', 'rake', 'rate',
  'reek', 'take', 'tare', 'teak', 'tear', 'tree', 'trek'], 5: ['abate',
  'baker', 'beret', 'brake', 'break', 'eater', 'karat', 'kebab',
  'taker'], 6: ['aerate', 'beaker', 'beater', 'berate', 'betake',
  'karate', 'rebate', 'retake']}

Okay, how about the following (pseudoish-code)?

def word_sequences(prepend, target, subwords):
    # subwords is a list rather than a dict
    for n in range(1, len(subwords)):
        for word in subwords[n]:
            if equal_in_a_multiset_sense(word, target):
                yield prepend + ' ' + target
            elif is_a_submultiset(word, target):
                recurse_target = multiset_subtract(target, word)
                subsubwords = list_of_subwords_by_length[:len(recurse_target)]
                if any(subsubwords):
                    yield from word_sequences(prepend + ' ' + word,
                                              recurse_target, subsubwords)

Your general approach seems reasonable to me now that I understand the
problem. The word_sequences generator above is the solution I would
probably write at first if faced with the problem. I suspect that your
current approach using partition sizes is better though.


Oscar


Re: Recursive generator for combinations of a multiset?

2013-11-25 Thread Oscar Benjamin
On 21 November 2013 13:01, John O'Hagan resea...@johnohagan.com wrote:
 In my use-case the first argument to multicombs is a tuple of words
 which may contain duplicates, and it produces all unique combinations
 of a certain length of those words, eg:

 list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))

 [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
 ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
 ('in', 'the', 'the')]

I still don't understand what you're actually doing well enough to
know whether there is a better general approach to the problem. For
the specific thing you requested, here is a recursive multiset
combinations generator. Does it do what you wanted?

#!/usr/bin/env python

def multicombs(it, r):
    words = []
    last = None
    for N, word in enumerate(it):
        if word == last:
            words[-1][1] += 1
        else:
            words.append([word, 1])
            last = word
    cumulative = 0
    for n in range(len(words)-1, -1, -1):
        words[n].append(cumulative)
        cumulative += words[n][1]
    return _multicombs((), words, r)

def _multicombs(prepend, words, r):
    if r == 0:
        yield prepend
        return
    (word, count, rem), *remaining = words
    for k in range(max(r-rem, 0), min(count, r) + 1):
        yield from _multicombs(prepend + (word,) * k, remaining, r-k)

expected = [
    ('cat', 'hat', 'in'),
    ('cat', 'hat', 'the'),
    ('cat', 'in', 'the'),
    ('cat', 'the', 'the'),
    ('hat', 'in', 'the'),
    ('hat', 'the', 'the'),
    ('in', 'the', 'the'),
]

output = list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))

assert len(expected) == len(output)
assert set(expected) == set(output)  # The order is different


Oscar


Re: Recursive generator for combinations of a multiset?

2013-11-25 Thread John O'Hagan
On Mon, 25 Nov 2013 12:15:15 +
Oscar Benjamin oscar.j.benja...@gmail.com wrote:

 On 21 November 2013 13:01, John O'Hagan resea...@johnohagan.com
 wrote:
  In my use-case the first argument to multicombs is a tuple of words
  which may contain duplicates, and it produces all unique
  combinations of a certain length of those words, eg:
 
  list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))
 
  [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
  ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
  ('in', 'the', 'the')]
 
 I still don't understand what you're actually doing well enough to
 know whether there is a better general approach to the problem. For
 the specific thing you requested, here is a recursive multiset
 combinations generator. Does it do what you wanted?
 
 #!/usr/bin/env python
 
 def multicombs(it, r):
 words = []
 last = None
 for N, word in enumerate(it):
 if word == last:
 words[-1][1] += 1
 else:
 words.append([word, 1])
 last = word
 cumulative = 0
 for n in range(len(words)-1, -1, -1):
 words[n].append(cumulative)
 cumulative += words[n][1]
 return _multicombs((), words, r)
 
 def _multicombs(prepend, words, r):
 if r == 0:
 yield prepend
 return
 (word, count, rem), *remaining = words
 for k in range(max(r-rem, 0), min(count, r) + 1):
 yield from _multicombs(prepend + (word,) * k, remaining, r-k)

Thanks, that is exactly the type of thing I was after. Although it is
actually slower as is than a hack like

def imulticombs(it, n):
    last = ()
    for i in itertools.combinations(it, n):
        if i > last:
            yield i
            last = i

it can be modified to be a lot faster for my use-case like this:

def _multicombs(prepend, words, r, chkstr):
    """chkstr is the string of remaining available characters"""
    if r == 0:
        yield prepend, chkstr
        return
    (word, count, rem), *remaining = words
    newstr = chkstr
    for k in range(max(r-rem, 0), min(count, r) + 1):
        for letter in word * k:
            if letter in newstr:
                newstr = newstr.replace(letter, '', 1)
            else:
                return
        yield from _multicombs(prepend + (word,) * k, remaining, r-k,
                               newstr)

By progressively checking against the available characters, it avoids
fruitless recursion branches. I'm not 100% sure I'm doing this right yet
but so far it's promising. This is the quickest non-redundant approach
I've used so far (although strangely, the original product-only version
which produces a lot of redundant results is still by far the quickest.)

This is how I'm calling it:

def cartesian_product(args, instr):
    if not args:
        yield (), instr
        return
    for items, chkstr in cartesian_product(args[:-1], instr):
        # this prevents bothering with anything useless
        words, repeats = args[-1]
        if repeats == 1:
            for word in words:
                newstr = chkstr
                for letter in word:
                    if letter in newstr:
                        newstr = newstr.replace(letter, '', 1)
                    else:
                        break
                else:
                    yield items + (word,), newstr
        else:
            for item, newstr in multicombs(words, repeats, chkstr):
                yield items + item, newstr

I'll try to explain better what I'm actually doing. It's quite long,
but you seem to be wondering, or maybe someone else is!

It's an old anagram program I wrote for practice a few years ago which
produces the phrases of dictionary words which together contains the
same letter-multiset as the input phrase. I would be interested to know
if you think there is a better general approach. My approach is: 

- make a dictionary of candidate words composed of the same
  sub-alphabet as the input, using their lengths as keys, then
 
- for each integer partition of the length of the input, replace each
  partition element with the corresponding value from the dictionary,
  then 

- Take the cartesian product of the resulting list of lists of words,
  rejecting any output which exceeds any of the letter frequencies of
  the input 
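
The integer-partition step in the outline above can be sketched like this (a generic generator, not the poster's actual code):

```python
def partitions(n, max_part=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if max_part is None or max_part > n:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(max_part, 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

# For "break beat" (9 letters ignoring the space), each partition of 9,
# e.g. (5, 4) or (3, 3, 2, 1), selects which word lengths to combine.
print(list(partitions(4)))
# -> [(4,), (3, 1), (2, 2), (2, 1, 1), (1, 1, 1, 1)]
```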

I have omitted describing some complex optimisations of the lists, which
gave only fractional speedups.

However, I recently discovered that I could get an orders-of-magnitude
speedup by using my own recursive cartesian product algorithm which
checks letter frequencies on the fly, rather than using
itertools.product and testing each output phrase. (This algorithm was
in my original post). 

For example, take the phrase "break beat". I make a wordlength
dictionary of sub-alphabet words, using the system dictionary file,
like this:

 {1: ['a'], 2: ['at', 'be', 're'], 3: ['are', 'ark', 'art', 'ate',
 'baa', 'bar', 'bat', 'bee', 'bet', 'bra', 'ear', 'eat', 'ebb', 'eke',
 'era', 'ere

Re: Recursive generator for combinations of a multiset?

2013-11-23 Thread John O'Hagan
On Sat, 23 Nov 2013 04:23:42 +
MRAB pyt...@mrabarnett.plus.com wrote:

 On 23/11/2013 00:58, John O'Hagan wrote:
  On Thu, 21 Nov 2013 12:59:26 -0800
  Dan Stromberg drsali...@gmail.com wrote:
 
  On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan
  resea...@johnohagan.com wrote:
 
  
   Short story: the subject says it all, so if you have an answer
   already, fire away. Below is the long story of what I'm using it
   for, and why I think it needs to be recursive. It may even be of
   more general interest in terms of filtering the results of
   generators.
  
 
  I think you probably need permutations rather than combinations.
 
  Also, I think you'll need to form a word (partitioned off by
  spaces), and then check it against a set
  containing /usr/share/dict/words before recursing for the
  remainder of the sentence - this should speed things up a LOT.
 
  Thanks for the reply. If I understand you correctly, you are
  suggesting permuting the input _characters_ to form words and then
  seeing if they exist, as opposed to my approach of combining known
  words and seeing if they are anagrams. (Permutations of words would
  not help find anagrams as they merely change the word order). Here
  is an attempt at that:
 
  def anagrams(partition, input_string):
      """Find anagrams which fit given partition of input string length"""
      if not partition:
          yield (), input_string
          return
      for words, checkstring in anagrams(partition[:-1], input_string):
          for word in itertools.permutations(checkstring, partition[-1]):
              word = ''.join(word)
              if word in WORDS: #WORDS is collection of dictionary words
                  newstring = checkstring
                  for l in word:
                      newstring = newstring.replace(l, '', 1)
                  yield words + (word,), newstring
 
  There are two problems with this. If there are repeated characters
  in the input, redundant results are produced; a multiset-permutation
  algorithm would fix this. But the main problem is it is incredibly
  slow: on my run-of-the-mill laptop, it chokes on anything longer
  than about 10 characters, spending most of its time rejecting
  non-words.
 
  Or have I misunderstood your suggestion?
 
 If you want to know how to get unique permutations, have a look here:
 
 http://mail.python.org/pipermail/python-ideas/2013-October/023610.html
 

For this particular problem I don't need multiset permutations but
multiset combinations (see original post). But that thread was a good
read.

This is a little OT, but I did need multiset permutations a couple of
years back for a project involving the generation of musical
structures. There was zero interest here at the time (fair enough!) and
I ended up implementing the C++ function next_permutation in Python.

So it was good to read in that thread that there seems to be some
interest in incorporating multiset combinatorics into itertools
(including your excellent contribution). IMHO the scepticism there about
non-exotic use-cases was unjustified. Leaving aside my probably atypical
uses, it crops up in many ordinary situations dealing with arrangements of
multiple items of several types where each instance of a type is
equivalent. Take stock control: when stacking a warehouse shelf it
doesn't matter which particular box goes where, only how many of each
size. Or timetabling: if Mrs Smith teaches the same students maths on
Tuesdays and Thursdays, swapping the classes does nothing. 

The same goes for cyclic and cyclic-multiset (necklaces)
combinatorics, where the beginning and end of an arrangement is not
significant, eg. 24-hour rostering, laying tiles, etc. And musical
scales.

Disclaimer: I am far from expert on combinatorics but seem to end up
using it a lot.

--

John



Re: Recursive generator for combinations of a multiset?

2013-11-23 Thread John O'Hagan
On Fri, 22 Nov 2013 22:33:29 -0800
Dan Stromberg drsali...@gmail.com wrote:

 On Fri, Nov 22, 2013 at 4:58 PM, John O'Hagan
  resea...@johnohagan.com wrote:
 
  On Thu, 21 Nov 2013 12:59:26 -0800
  Dan Stromberg drsali...@gmail.com wrote:
 
   On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan
    resea...@johnohagan.com wrote:
  
   
Short story: the subject says it all, so if you have an answer
already, fire away. Below is the long story of what I'm using it
for, and why I think it needs to be recursive. It may even be of
more general interest in terms of filtering the results of
generators.
   
  
   I think you probably need permutations rather than combinations.
  
   Also, I think you'll need to form a word (partitioned off by
   spaces), and then check it against a set
   containing /usr/share/dict/words before recursing for the
   remainder of the sentence - this should speed things up a LOT.
 
  Thanks for the reply. If I understand you correctly, you are
  suggesting permuting the input _characters_ to form words and then
  seeing if they exist, as opposed to my approach of combining known
  words and seeing if they are anagrams. (Permutations of words would
  not help find anagrams as they merely change the word order). Here
  is an attempt at that:
 
 
 You've interpreted me correctly.
 
 However, I was thinking about this in the back of my mind, and
 decided it would probably be best to inhale /usr/share/dict/words (if
 on Linux), and pull out words of the corrects lengths (as separated
 by the blanks) over the correct (possible) alphabet, and permute
 Those, afterward checking if they form good anagrams of the original
 sentence.  This would probably be much faster, since English isn't
 that dense of a space.

If you look back at my original question, you'll see that's pretty much
what I've done. I didn't spell out all the details, but  I made a
dictionary of wordlength keys containing lists of all dictionary words
of that length made of the correct sub-alphabet.

But to recap the problem: to produce non-redundant anagram phrases, I
need the cartesian product (not permutations) of these lists if each is
unique, but with a subroutine producing multiset combinations if a list
is repeated, i.e., if a particular length is called for more than once.
The version I have so far is correct but substantially slower than the
product-only one, which just goes ahead and produces all the redundant
results. This seems counter-intuitive, and my theory is that this is
because I am unable to prune the non-recursive combination algorithm
I currently have.
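
The redundancy at stake is easy to see with itertools (an illustrative two-word list, not the program's real data): product over a repeated list yields every ordering, while combinations_with_replacement yields each multiset once.

```python
import itertools

words = ['the', 'tea']  # two candidate words of the same length

# product over the repeated list yields both orderings of each pair...
prod = list(itertools.product(words, repeat=2))
# ...while combinations_with_replacement yields each multiset once.
combs = list(itertools.combinations_with_replacement(words, 2))

print(prod)   # 4 ordered pairs
print(combs)  # 3 multisets
```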

Regards,

--

John


Re: Recursive generator for combinations of a multiset?

2013-11-22 Thread John O'Hagan
On Thu, 21 Nov 2013 12:59:26 -0800
Dan Stromberg drsali...@gmail.com wrote:

 On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan
  resea...@johnohagan.com wrote:
 
 
  Short story: the subject says it all, so if you have an answer
  already, fire away. Below is the long story of what I'm using it
  for, and why I think it needs to be recursive. It may even be of
  more general interest in terms of filtering the results of
  generators.
 
 
 I think you probably need permutations rather than combinations.
 
 Also, I think you'll need to form a word (partitioned off by spaces),
 and then check it against a set containing /usr/share/dict/words
 before recursing for the remainder of the sentence - this should
 speed things up a LOT.

Thanks for the reply. If I understand you correctly, you are suggesting
permuting the input _characters_ to form words and then seeing if
they exist, as opposed to my approach of combining known words and
seeing if they are anagrams. (Permutations of words would not help find
anagrams as they merely change the word order). Here is an attempt at
that:

def anagrams(partition, input_string):
    """Find anagrams which fit given partition of input string length"""
    if not partition:
        yield (), input_string
        return
    for words, checkstring in anagrams(partition[:-1], input_string):
        for word in itertools.permutations(checkstring, partition[-1]):
            word = ''.join(word)
            if word in WORDS: #WORDS is collection of dictionary words
                newstring = checkstring
                for l in word:
                    newstring = newstring.replace(l, '', 1)
                yield words + (word,), newstring

There are two problems with this. If there are repeated characters in
the input, redundant results are produced; a multiset-permutation
algorithm would fix this. But the main problem is it is incredibly
slow: on my run-of-the-mill laptop, it chokes on anything longer than
about 10 characters, spending most of its time rejecting non-words.
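
To put a rough number on that (a back-of-envelope check, not from the thread, using math.perm from modern Python):

```python
from math import perm

# Permutations of k letters drawn from a 10-character input: a single
# partition element of size 6 already means perm(10, 6) candidate
# "words", nearly all of which are rejected as non-words.
print(perm(10, 6))
# -> 151200
```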

Or have I misunderstood your suggestion?

Regards,

--

John



Re: Recursive generator for combinations of a multiset?

2013-11-22 Thread John O'Hagan
On Thu, 21 Nov 2013 18:14:41 -0800 (PST)
James hslee...@yahoo.com wrote:

 On Thursday, November 21, 2013 5:01:15 AM UTC-8, John O'Hagan wrote:

[...]

  On 21 November 2013 06:46, John O'Hagan wrote:

[...]

  
def multicombs(it, r):
    result = it[:r]
    yield result
    while 1:
        for i in range(-1, -r - 1, -1):
            rep = result[i]
            if rep < it[i]:
                break
        else:
            break
        for j, n in enumerate(it):
            if n > rep:
                break
        result = result[:i] + it[j:j - i]
        yield result
  
   

[...]

  I neglected to mention that multicombs takes a sorted iterable;
  it doesn't work right otherwise. I'd forgotten that because my
  wordlists are guaranteed sorted by the way they're built. Sorry
  about that.

  In my use-case the first argument to multicombs is a tuple of words
  which may contain duplicates, and it produces all unique
  combinations of a certain length of those words, eg:

  list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))

  [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
  ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
  ('in', 'the', 'the')]

[...]

  What I'm looking for is a recursive algorithm which does what
  multicombs does (order unimportant) so that I can apply a pruning
  shortcut like the one I used in the recursive cartesian product
  algorithm in my original post.

  Multiset combination algorithms seem pretty thin on the ground out
  there - as I said, I could only find a description of the procedure
  above, no actual code. The ones I did find are non-recursive. I'm
  hoping some combinatorics and/or recursion experts can offer
  advice.

[...]

  John
 
 Could you convert the following Perl script to Python?
 
 use Data::Dump qw(dump);
 dump combo([@ARGV], 3);
 
 sub combo {
     my ($t, $k) = @_;
     my @T = @$t;
     my @R = ();
     my %g = ();
     if ($k == 1) {
         for (@T) {
             push @R, $_ unless $g{$_}++;
         }
     } else {
         while (my $x = shift @T) {
             my $p = combo([@T], $k-1);
             for (@{$p}) {
                 my $q = $x . "," . $_;
                 push @R, $q unless $g{$q}++;
             }
         }
     }
     [@R];
 }
 
 $ prog.pl cat hat in the the
 [
   "cat,hat,in",
   "cat,hat,the",
   "cat,in,the",
   "cat,the,the",
   "hat,in,the",
   "hat,the,the",
   "in,the,the",
 ]
 
 James


Thanks. Now I just have to learn Perl to understand what that
does! :)

Regards,

--

John


Re: Recursive generator for combinations of a multiset?

2013-11-22 Thread MRAB

On 23/11/2013 00:58, John O'Hagan wrote:

On Thu, 21 Nov 2013 12:59:26 -0800
Dan Stromberg drsali...@gmail.com wrote:


On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan
resea...@johnohagan.com wrote:


 Short story: the subject says it all, so if you have an answer
 already, fire away. Below is the long story of what I'm using it
 for, and why I think it needs to be recursive. It may even be of
 more general interest in terms of filtering the results of
 generators.


I think you probably need permutations rather than combinations.

Also, I think you'll need to form a word (partitioned off by spaces),
and then check it against a set containing /usr/share/dict/words
before recursing for the remainder of the sentence - this should
speed things up a LOT.


Thanks for the reply. If I understand you correctly, you are suggesting
permuting the input _characters_ to form words and then seeing if
they exist, as opposed to my approach of combining known words and
seeing if they are anagrams. (Permutations of words would not help find
anagrams as they merely change the word order). Here is an attempt at
that:

def anagrams(partition, input_string):
    """Find anagrams which fit given partition of input string length"""
    if not partition:
        yield (), input_string
        return
    for words, checkstring in anagrams(partition[:-1], input_string):
        for word in itertools.permutations(checkstring, partition[-1]):
            word = ''.join(word)
            if word in WORDS: #WORDS is collection of dictionary words
                newstring = checkstring
                for l in word:
                    newstring = newstring.replace(l, '', 1)
                yield words + (word,), newstring

There are two problems with this. If there are repeated characters in
the input, redundant results are produced; a multiset-permutation
algorithm would fix this. But the main problem is it is incredibly
slow: on my run-of-the-mill laptop, it chokes on anything longer than
about 10 characters, spending most of its time rejecting non-words.

Or have I misunderstood your suggestion?


If you want to know how to get unique permutations, have a look here:

http://mail.python.org/pipermail/python-ideas/2013-October/023610.html



Re: Recursive generator for combinations of a multiset?

2013-11-22 Thread Dan Stromberg
On Fri, Nov 22, 2013 at 4:58 PM, John O'Hagan resea...@johnohagan.com wrote:

 On Thu, 21 Nov 2013 12:59:26 -0800
 Dan Stromberg drsali...@gmail.com wrote:

  On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan
  resea...@johnohagan.com wrote:
 
  
   Short story: the subject says it all, so if you have an answer
   already, fire away. Below is the long story of what I'm using it
   for, and why I think it needs to be recursive. It may even be of
   more general interest in terms of filtering the results of
   generators.
  
 
  I think you probably need permutations rather than combinations.
 
  Also, I think you'll need to form a word (partitioned off by spaces),
  and then check it against a set containing /usr/share/dict/words
  before recursing for the remainder of the sentence - this should
  speed things up a LOT.

 Thanks for the reply. If I understand you correctly, you are suggesting
 permuting the input _characters_ to form words and then seeing if
 they exist, as opposed to my approach of combining known words and
 seeing if they are anagrams. (Permutations of words would not help find
 anagrams as they merely change the word order). Here is an attempt at
 that:


You've interpreted me correctly.

However, I was thinking about this in the back of my mind, and decided it
would probably be best to inhale /usr/share/dict/words (if on Linux), and
pull out words of the correct lengths (as separated by the blanks) over
the correct (possible) alphabet, and permute Those, afterward checking if
they form good anagrams of the original sentence.  This would probably be
much faster, since English isn't that dense of a space.
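That pre-filtering step might look something like the sketch below (the name `candidate_words` is ours; `Counter` subtraction drops non-positive counts, which gives a cheap sub-multiset test):

```python
from collections import Counter

def candidate_words(words, sentence):
    """Keep only words that could appear in an anagram of sentence:
    a length matching one of the blanks, and no letter used more
    often than the sentence supplies."""
    available = Counter(sentence.replace(" ", ""))
    lengths = {len(part) for part in sentence.split()}
    for w in words:
        # An empty Counter difference means w is a sub-multiset
        # of the available letters.
        if len(w) in lengths and not Counter(w) - available:
            yield w

# On Linux one would feed it the system wordlist, e.g.:
#   words = (line.strip() for line in open('/usr/share/dict/words'))
demo = ["cat", "hat", "in", "the", "thin", "zebra", "attic"]
print(list(candidate_words(demo, "cat in the hat")))
# -> ['cat', 'hat', 'in', 'the']
```

Permuting only the surviving words, then checking the full multiset at the end, avoids generating character permutations that can never form dictionary words.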
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Recursive generator for combinations of a multiset?

2013-11-21 Thread Oscar Benjamin
On 21 November 2013 06:46, John O'Hagan resea...@johnohagan.com wrote:

 I found a verbal description of such an algorithm and came up with
 this:

 def multicombs(it, r):
     result = it[:r]
     yield result
     while 1:
         for i in range(-1, -r - 1, -1):
             rep = result[i]
             if rep < it[i]:
                 break
         else:
             break
         for j, n in enumerate(it):
             if n > rep:
                 break
         result = result[:i] + it[j:j - i]
         yield result

I'm not really sure what it is you're asking for. I thought if I ran
the code I'd understand but that just confused me more. Is the output
below correct? If not what should it be?

multicombs("abracadabra", 0)
['']
multicombs("abracadabra", 1)
['a']
multicombs("abracadabra", 2)
['ab', 'br', 'ra']
multicombs("abracadabra", 3)
['abr', 'ara', 'bra']
multicombs("abracadabra", 4)
['abra']
multicombs("abracadabra", 5)
['abrac', 'abrbr', 'abrra', 'braca', 'brara', 'brbra', 'racad',
'racbr', 'racra']


Oscar
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Recursive generator for combinations of a multiset?

2013-11-21 Thread Dan Stromberg
On Wed, Nov 20, 2013 at 10:46 PM, John O'Hagan resea...@johnohagan.comwrote:


 Short story: the subject says it all, so if you have an answer already,
 fire away. Below is the long story of what I'm using it for, and why I
 think it needs to be recursive. It may even be of more general
 interest in terms of filtering the results of generators.


I think you probably need permutations rather than combinations.

Also, I think you'll need to form a word (partitioned off by spaces), and
then check it against a set containing /usr/share/dict/words before
recursing for the remainder of the sentence - this should speed things up a
LOT.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Recursive generator for combinations of a multiset?

2013-11-21 Thread James
On Thursday, November 21, 2013 5:01:15 AM UTC-8, John O'Hagan wrote:
 On Thu, 21 Nov 2013 11:42:49 +
 Oscar Benjamin wrote:

  [quoted code and sample output trimmed]

 I neglected to mention that multicombs takes a sorted iterable;
 it doesn't work right otherwise. I'd forgotten that because my
 wordlists are guaranteed sorted by the way they're built. Sorry about
 that.

 In my use-case the first argument to multicombs is a tuple of words
 which may contain duplicates, and it produces all unique combinations
 of a certain length of those words, eg:

 list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))

 [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
 ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
 ('in', 'the', 'the')]

 Contrast this with:

 list(itertools.combinations(('cat', 'hat', 'in', 'the', 'the'), 3))

 [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'hat', 'the'),
 ('cat', 'in', 'the'), ('cat', 'in', 'the'), ('cat', 'the', 'the'),
 ('hat', 'in', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
 ('in', 'the', 'the')]

 which produces results which are redundant for my purposes.

 What I'm looking for is a recursive algorithm which does what
 multicombs does (order unimportant) so that I can apply a pruning
 shortcut like the one I used in the recursive cartesian product
 algorithm in my original post.

 Multiset combination algorithms seem pretty thin on the ground out
 there - as I said, I could only find a description of the procedure
 above, no actual code. The ones I did find are non-recursive. I'm
 hoping some combinatorics and/or recursion experts can offer advice.

 Regards,

 --

 John

Could convert the following perl script to python?

use Data::Dump qw(dump);
dump combo([@ARGV], 3);

sub combo {
    my ($t, $k) = @_;
    my @T = @$t;
    my @R = ();
    my %g = ();
    if ($k == 1) {
        for (@T) {
            push @R, $_ unless $g{$_}++;
        }
    } else {
        while (my $x = shift @T) {
            $p = combo([@T], $k-1);
            for (@{$p}) {
                my $q = $x . "," . $_;
                push @R, $q unless $g{$q}++;
            }
        }
    }
    [@R];
}

$ prog.pl cat hat in the the
[
  "cat,hat,in",
  "cat,hat,the",
  "cat,in,the",
  "cat,the,the",
  "hat,in,the",
  "hat,the,the",
  "in,the,the",
]
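A rough Python translation of the Perl sketch above, keeping the same shift-and-recurse structure (a `seen` set stands in for the `%g` hash, and tuples replace the joined strings):

```python
def combo(t, k):
    """Unique k-combinations (in input order) of the list t,
    where t may contain duplicate items."""
    seen = set()
    result = []
    if k == 1:
        for x in t:
            if x not in seen:
                seen.add(x)
                result.append((x,))
    else:
        t = list(t)                 # local copy, like Perl's @T = @$t
        while t:
            x = t.pop(0)            # like Perl's shift @T
            for rest in combo(t, k - 1):
                q = (x,) + rest
                if q not in seen:
                    seen.add(q)
                    result.append(q)
    return result

print(combo(['cat', 'hat', 'in', 'the', 'the'], 3))
```

This reproduces the seven unique tuples listed earlier in the thread, though (like the Perl original) it builds whole result lists rather than generating lazily, so it cannot prune partial results the way the recursive generator approach can.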

James
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Recursive generator for combinations of a multiset?

2013-11-21 Thread John O'Hagan
On Thu, 21 Nov 2013 11:42:49 +
Oscar Benjamin oscar.j.benja...@gmail.com wrote:

 On 21 November 2013 06:46, John O'Hagan resea...@johnohagan.com
 wrote:
 
  I found a verbal description of such an algorithm and came up with
  this:
 
  def multicombs(it, r):
      result = it[:r]
      yield result
      while 1:
          for i in range(-1, -r - 1, -1):
              rep = result[i]
              if rep < it[i]:
                  break
          else:
              break
          for j, n in enumerate(it):
              if n > rep:
                  break
          result = result[:i] + it[j:j - i]
          yield result
 
 I'm not really sure what it is you're asking for. I thought if I ran
 the code I'd understand but that just confused me more. Is the output
 below correct? If not what should it be?
 
 multicombs("abracadabra", 0)
 ['']
 multicombs("abracadabra", 1)
 ['a']
 multicombs("abracadabra", 2)
 ['ab', 'br', 'ra']
 multicombs("abracadabra", 3)
 ['abr', 'ara', 'bra']
 multicombs("abracadabra", 4)
 ['abra']
 multicombs("abracadabra", 5)
 ['abrac', 'abrbr', 'abrra', 'braca', 'brara', 'brbra', 'racad',
 'racbr', 'racra']


I neglected to mention that multicombs takes a sorted iterable;
it doesn't work right otherwise. I'd forgotten that because my
wordlists are guaranteed sorted by the way they're built. Sorry about
that.
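With the input sorted first, the generator does yield each multiset combination exactly once. A self-contained check (an editor's sketch, assuming the comparison operators the archive stripped were `<` and `>`):

```python
def multicombs(it, r):
    """Unique r-combinations of the *sorted* sequence it (a multiset)."""
    result = it[:r]
    yield result
    while 1:
        # From the right, find a position that can still be advanced.
        for i in range(-1, -r - 1, -1):
            rep = result[i]
            if rep < it[i]:
                break
        else:
            break
        # Smallest element strictly greater than the one being replaced.
        for j, n in enumerate(it):
            if n > rep:
                break
        result = result[:i] + it[j:j - i]
        yield result

print(list(multicombs(''.join(sorted("abracadabra")), 2)))
# -> ['aa', 'ab', 'ac', 'ad', 'ar', 'bb', 'bc', 'bd', 'br', 'cd', 'cr', 'dr', 'rr']
```

That is all thirteen distinct 2-element sub-multisets of {a×5, b×2, c×1, d×1, r×2}, which matches John's description; the earlier odd-looking output came from passing the string unsorted.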

In my use-case the first argument to multicombs is a tuple of words
which may contain duplicates, and it produces all unique combinations
of a certain length of those words, eg:

list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))

[('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
('in', 'the', 'the')]

Contrast this with: 

list(itertools.combinations(('cat', 'hat', 'in', 'the', 'the'), 3))

[('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'hat', 'the'),
('cat', 'in', 'the'), ('cat', 'in', 'the'), ('cat', 'the', 'the'),
('hat', 'in', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
('in', 'the', 'the')]

which produces results which are redundant for my purposes.

What I'm looking for is a recursive algorithm which does what
multicombs does (order unimportant) so that I can apply a pruning
shortcut like the one I used in the recursive cartesian product
algorithm in my original post.

Multiset combination algorithms seem pretty thin on the ground out
there - as I said, I could only find a description of the procedure
above, no actual code. The ones I did find are non-recursive. I'm
hoping some combinatorics and/or recursion experts can offer advice. 
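For what it's worth, one natural recursive formulation works on (element, count) pairs rather than the raw sequence: at each level, choose how many copies of the first distinct element to take, then recurse on the rest. A sketch (the name `rec_multicombs` is ours, not from the thread):

```python
from collections import Counter

def rec_multicombs(items, r):
    """Yield each unique r-combination (as a tuple) of the multiset items."""
    grouped = sorted(Counter(items).items())

    def rec(groups, r):
        if r == 0:
            yield ()
            return
        if not groups:
            return
        (elem, count), rest = groups[0], groups[1:]
        available_rest = sum(c for _, c in rest)
        # k copies of elem: at most count (or r), and at least enough
        # that the remaining groups can still supply r - k items.
        for k in range(max(r - available_rest, 0), min(count, r) + 1):
            for tail in rec(rest, r - k):
                yield (elem,) * k + tail
    yield from rec(grouped, r)

print(list(rec_multicombs(('cat', 'hat', 'in', 'the', 'the'), 3)))
```

Because the recursion branches on counts of distinct elements, no duplicates are ever produced, and a pruning test (like the remaining-letters check in the cartesian product generator) could be inserted just before the recursive call.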

Regards,

--

John
-- 
https://mail.python.org/mailman/listinfo/python-list


Recursive generator for combinations of a multiset?

2013-11-20 Thread John O'Hagan

Short story: the subject says it all, so if you have an answer already,
fire away. Below is the long story of what I'm using it for, and why I
think it needs to be recursive. It may even be of more general
interest in terms of filtering the results of generators. 

I'm playing with an anagram-generating module which works like this:

1) Generate the integer partitions of the length of the input string.

2) For each partition, replace its elements with a list of all
dictionary words which are a) the same length as the partition
element and b) a sub-multiset of the input string.

eg: "cat in hat" -> ... , [3,5], ... -> [["act", "ant", ...], ["antic",
"attic", ...]]

3) Pass each resulting list of wordlists to itertools.product, filtering
the output of anything which is not the same multiset of characters as
the input string.
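That filter amounts to a multiset-equality test; with collections.Counter it can be written as (a sketch of the check, not the module's actual code — `is_anagram` is a name we made up):

```python
from collections import Counter

def is_anagram(words, input_string):
    """True if the candidate words use exactly the letters of input_string."""
    return Counter(''.join(words)) == Counter(input_string.replace(' ', ''))

print(is_anagram(('act', 'thin', 'a'), 'cat in hat'))   # True: same letters
print(is_anagram(('act', 'tin'), 'cat in hat'))         # False: letters left over
```

The check is exact but has to be run on every candidate tuple, which is why the post-hoc filtering step dominates the running time for long inputs.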

This works but gets very slow for long strings. It spends most of its
time at the filtering stage because most of the results have too many
of some characters (and therefore not enough of others). I do some
optimising of the lists prior to running product, but this only shaves
off smallish percentages.

I got a big speedup (factor of five for medium-length inputs, much more
for longer strings) by replacing itertools.product with this recursive
generator:

def cartesian_product(args, input_str):
    if not args:
        yield (), input_str
        return
    for words, check_str in cartesian_product(args[:-1], input_str):
        for word in args[-1]:
            # this bit prevents bothering with anything useless
            new_str = check_str
            for letter in word:
                if letter in new_str:
                    new_str = new_str.replace(letter, '', 1)
                else:
                    break
            else:
                yield words + (word,), new_str

Despite being inherently much slower than itertools.product, it can
prune branches of the recursion as soon as it accumulates too many of
any character. This means it's much faster and produces correct results
without further filtering.
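For concreteness, here is a quick run of that generator (repeated so the snippet is self-contained) on toy wordlists we chose so that 'catthin' has exactly two phrase anagrams; an empty second element in a yielded pair means every input letter was consumed:

```python
def cartesian_product(args, input_str):
    if not args:
        yield (), input_str
        return
    for words, check_str in cartesian_product(args[:-1], input_str):
        for word in args[-1]:
            # prune: drop this word as soon as it needs a letter
            # that is no longer available
            new_str = check_str
            for letter in word:
                if letter in new_str:
                    new_str = new_str.replace(letter, '', 1)
                else:
                    break
            else:
                yield words + (word,), new_str

wordlists = [['cat', 'tin'], ['thin', 'hint']]
for words, leftover in cartesian_product(wordlists, 'catthin'):
    print(words, repr(leftover))
# ('cat', 'thin') ''
# ('cat', 'hint') ''
```

The 'tin' branch dies at the second level as soon as 'thin' and 'hint' each demand an 'i' that 'cath' cannot supply, which is exactly the pruning that plain itertools.product cannot do.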

But there is another problem. The initial partitions contain repeated
elements, so the corresponding wordlists are also repeated. This means
that any cartesian product will contain many redundant results - the
same words in a different order. For a medium-sized string, this is
most of them. 

A solution for repeated sections of a partition of wordlists is to
use r-combinations (where r is the number of repeats). In this
scenario, though, some words may be usable more than once, and the
right number of copies of these words must be added to the list to
allow this. This means I need the combinations of a multiset (so I
can't use itertools.combinations).

I found a verbal description of such an algorithm and came up with
this:

def multicombs(it, r):
    result = it[:r]
    yield result
    while 1:
        for i in range(-1, -r - 1, -1):
            rep = result[i]
            if rep < it[i]:
                break
        else:
            break
        for j, n in enumerate(it):
            if n > rep:
                break
        result = result[:i] + it[j:j - i]
        yield result

I added a call to this generator in a branch to the main generator
above to deal with repeated elements. This eliminates redundant
results, but with a substantial slowdown. The program now spends most
of its time inside multicombs. 

I'm hoping that if I could find a recursive multiset combination
generator, I could speed it up using the same pruning approach.

Any suggestions?

--

John
-- 
https://mail.python.org/mailman/listinfo/python-list