Norman Khine wrote:

> hello,
>
> i have this tuple:
>
> http://paste.lisp.org/+2F4X
>
> i have this, which does what i want:
>
> from collections import defaultdict
>
> d = defaultdict(set)
> for id, url in result:
>     d[url].add(id)
> for url in sorted(d):
>     if len(d[url]) > 1:
>         print('%d -- %s' % (len(d[url]), url))
>
> so here the code checks for duplicate urls and counts the number of
> occurences.
>
> but i am sort of stuck in that i want to now update the id of the
> related table and update the
>
> basically i have two tables:
>
> id, url
> 24715L, 'http://aqoon.local/muesli/2-muesli-tropical-500g.html'
> 24719L, 'http://aqoon.local/muesli/2-muesli-tropical-500g.html'
>
> id, tid,
> 1, 24715L
> 2, 24719L
>
> so i want to first update t(2)'s tid to t(1)'s id for each duplicate
> and then delete the row id = 24719L

You can use another dictionary that maps each id that shares a url with
other ids to a canonical id (here, the smallest id for that url).

from collections import defaultdict

url_table = [
    (24715, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24719, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24720, "http://example.com/index.html"),
]
id_table = [
    (1, 24715),
    (2, 24719),
    (3, 24720),
]

# Group the ids by url.
dupes = defaultdict(set)
for uid, url in url_table:
    dupes[url].add(uid)

# Map every id of a duplicated url to the smallest id for that url.
lookup = {}
for synonyms in dupes.values():
    if len(synonyms) > 1:
        canonical = min(synonyms)
        for alias in synonyms:
            assert alias not in lookup
            lookup[alias] = canonical

# Rewrite the references; ids without duplicates are left unchanged.
ids = [(id, lookup.get(uid, uid)) for id, uid in id_table]
print(ids)

# Keep one row per url, carrying the canonical id.
urls = [(min(synonyms), url) for url, synonyms in dupes.items()]
print(urls)

Note that if you use a database for these tables you can avoid creating
duplicates in the first place.

Peter
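
To illustrate the note about using a database, here is a minimal sketch of the same clean-up done in SQL, assuming SQLite through the standard sqlite3 module. The table names urls and refs and the index name urls_url_unique are hypothetical, since the original post does not name the tables, and the real schema may differ.

```python
import sqlite3

# Hypothetical schema: "urls", "refs" and their columns are assumptions
# made for illustration; they are not taken from the original post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT NOT NULL);
    CREATE TABLE refs (id INTEGER PRIMARY KEY, tid INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO urls VALUES (?, ?)", [
    (24715, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24719, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24720, "http://example.com/index.html"),
])
conn.executemany("INSERT INTO refs VALUES (?, ?)",
                 [(1, 24715), (2, 24719), (3, 24720)])

# Point every reference at the smallest id that shares its url.
conn.execute("""
    UPDATE refs
    SET tid = (
        SELECT MIN(u2.id)
        FROM urls AS u1
        JOIN urls AS u2 ON u2.url = u1.url
        WHERE u1.id = refs.tid
    )
    WHERE tid IN (SELECT id FROM urls)
""")

# Delete every url row that is not the canonical (smallest-id) row.
conn.execute("""
    DELETE FROM urls
    WHERE id NOT IN (SELECT MIN(id) FROM urls GROUP BY url)
""")

# With the duplicates gone, a unique index rejects new ones on insert.
conn.execute("CREATE UNIQUE INDEX urls_url_unique ON urls (url)")
conn.commit()

refs = conn.execute("SELECT id, tid FROM refs ORDER BY id").fetchall()
urls = conn.execute("SELECT id, url FROM urls ORDER BY id").fetchall()
print(refs)
print(urls)
```

The references are updated before the duplicate rows are deleted, so no tid is left pointing at a row that no longer exists. Once the unique index is in place, inserting an already known url raises sqlite3.IntegrityError, which is one way to avoid creating duplicates in the first place.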