Hi Tutors,
I am looking for the proper approach regarding the analysis of a dictionary of
combinations I have.
What I need to do is read from a supplied text file that has a unique ID and
that unique ID's associated combination of elements. So let's say I have the
following lines in a text file (real file could be millions of lines):
"ID" "Elements"
1 'A, B, C, D'
2 'A, D'
3 'D, E'
4 'A, D'
5 'A, B'
6 'A, C, D'
and I do something like...
combinationDict = {}
for line in file:
data = line.split('\t')
comb = tuple(data[1].split(','))
if comb not in combinationDict:
combinationDict[comb] = 1
else:
combination[comb] +=1
Now after I read all of the data I end up with a dictionary with the
combination as the key and the associated total qty as its value.
print combinationDict
{('A','B','C','D'):1, ('A','D'):2, ('D','E'):1, ('A','B'):1, ('A', 'C', 'D'):1}
What I am looking for is a starting point for a solution in python to analyze
the combination list so that I can determine for example that ('A', 'D') is the
most popular combination and then determining how many other combinations in
the dictionary contain this combination.
I would like to incorporate some parameters so for example the combination
('A','B','C','D') and ('A', 'C', 'D') contain ('A','D') so they are valid but I
could also say that as long as one element is contained in a combination it is
valid as well provided I add no more than one additional item to the
combination. If I apply this logic then ('D','E') can ('A','B') can contain
('A', 'D') and if I apply this to the combination dictionary I have:
{('B','C', ('A', 'D')):1, ('A','D'):2, ('E', ('A', 'D')):1, ('B', ('A',
'D')):1, ('C', ('A', 'D')):1}
which I could then query the keys for ('A', 'D') inclusion to get a total of 4
for ('A', 'D').
I hope this isn't too long and confusing but I am looking for an approach where
I can analyze for the highest quantity of combinations and then iterate through
the dictionary substituting those combinations that were determined a "highest
qty" combination into other low qty combinations when valid.
I was hoping to have parameters to qualify a high qty combination (e.g. every
combination with qty above 10,000) with the highest quantity of that determined
set taking precedence for substitution for the first pass then moving on to the
next highest combination for the second pass of substitution etc.. The other
parameter would be for the combination that would receive a substitution
whereby I might say that I can only substitute if a substitution results in
only one additional (superfluous) value being added to the combination existing
low qty combination.
I have looked around and this sounds like it might be similar to a packing
problem and in particular the knapsack problem but I can't seem to wrap my head
around an approach for this in python. I am not looking for a solution just
some guidance on a starting point or perhaps libraries that may be helpful.
Thank you.
_________________________________________________________________
_______________________________________________
Tutor maillist - [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor