Re: Match items in large list
Thank you all for your answers! I must say I'm really impressed by the
willingness to help and the quick responses in this group (this being my
first post)! I solved the problem using SQL queries: I added an id, the
same for each item in a chain (two or more identical posts), and just
updated the database for each unique post. I had to do this anyway, and I
think it is the simplest and quickest way! Thanks again!
--
http://mail.python.org/mailman/listinfo/python-list
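For the archives, the chain-id idea could be sketched with sqlite3 roughly like this. The table layout, column names, and sample rows below are guesses for illustration, not the poster's actual schema:

```python
import sqlite3

# Hypothetical schema: the thread never shows the real table, so the
# columns (phone, duration, calltime) and the chain_id column are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (id INTEGER PRIMARY KEY, phone TEXT, "
    "duration REAL, calltime TEXT, chain_id INTEGER)")
rows = [("0123", 5.0, "10:42"), ("0123", 5.0, "10:42"), ("0456", 2.5, "11:00")]
conn.executemany(
    "INSERT INTO calls (phone, duration, calltime) VALUES (?, ?, ?)", rows)

# Give every group of identical posts the same chain id: one UPDATE per
# unique combination, mirroring "updated the database for each unique post".
unique_rows = conn.execute(
    "SELECT DISTINCT phone, duration, calltime FROM calls "
    "ORDER BY phone").fetchall()
for chain_id, (phone, duration, calltime) in enumerate(unique_rows, start=1):
    conn.execute(
        "UPDATE calls SET chain_id = ? "
        "WHERE phone = ? AND duration = ? AND calltime = ?",
        (chain_id, phone, duration, calltime))

# Each chain id now groups the identical posts together.
chains = conn.execute(
    "SELECT chain_id, COUNT(*) FROM calls GROUP BY chain_id").fetchall()
```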
Re: Match items in large list
Fisherking <anders.mac...@gmail.com> writes:
> Which are the best way of searching through the list and extract the
> items that are the same.

Sort the list, then scan through looking for duplicates, which will be
adjacent to each other.
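In code, the sort-then-scan idea might look like the sketch below (the sample records are invented; the dicts are first converted to sorted tuples, since dicts themselves cannot be ordered):

```python
# Invented sample data standing in for the poster's 250 000 dicts.
records = [{'a': 1, 'b': '2'}, {'d': 4}, {'a': 1, 'b': '2'}]

# Convert each dict to a canonical tuple, then sort the whole list;
# identical records now sit next to each other.
keys = sorted(tuple(sorted(r.items())) for r in records)

# One linear scan over adjacent pairs finds every duplicated record.
duplicates = []
for prev, cur in zip(keys, keys[1:]):
    if prev == cur and (not duplicates or duplicates[-1] != cur):
        duplicates.append(cur)
```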
Re: Match items in large list
Hi,

2009/2/12 Paul Rubin <http>:
> Fisherking <anders.mac...@gmail.com> writes:
>> Which are the best way of searching through the list and extract the
>> items that are the same.

hmmm how about using sqlite's in-memory database and letting SQL take
care of finding that for you?

hth
Martin
--
http://soup.alt.delete.co.at
http://www.xing.com/profile/Martin_Marcher
http://www.linkedin.com/in/martinmarcher
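A rough sketch of that suggestion, using the sqlite3 module that ships with Python (the table layout and sample rows are assumptions for illustration):

```python
import sqlite3

# In-memory database: nothing touches disk, and SQL does the grouping.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (phone TEXT, duration REAL, calltime TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [("0123", 5.0, "10:42"), ("0123", 5.0, "10:42"), ("0456", 2.5, "11:00")])

# GROUP BY collapses identical rows; HAVING keeps only the duplicated ones.
dupes = conn.execute(
    "SELECT phone, duration, calltime, COUNT(*) AS n "
    "FROM calls GROUP BY phone, duration, calltime HAVING n > 1").fetchall()
```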
Re: Match items in large list
Fisherking wrote:
> I hope you can help me with this optimizing problem!
> I have a large list of dictionaries (about 250 000 of them). Two or more
> of these dictionaries contain the same data (not more than five or six
> of them per item), e.g.
> [{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6}, {'a':1,'b':'2','c':3},
> {'d':4,'e':'5','f':6}, ...].
> (The dictionaries really contain phone numbers, duration time and call
> time.)

I'd assume the dictionaries to look like

{"phone": "0123...", "duration": 5.67, "calltime": "10:42"}

What do the different keys (a, b, c) versus (d, e, f) mean?

> Which are the best way of searching through the list and extract the
> items that are the same. In the first run I want to find every item
> that are the same as {'a':1,'b':'2','c':3}, the second
> {'d':4,'e':'5','f':6} etc. The list are not read-only and I can pop
> them out of the list when I have found them!

items = [dict(a=1, b=2), dict(a=1, b=2), dict(a=3, c=2),
         dict(a=3, c=2.2), dict(a=3, c=2.6)]

# 2.5
seen = set()
for row in items:
    rowkey = frozenset(row.iteritems())
    if rowkey in seen:
        print "dupe", row
    else:
        seen.add(rowkey)

# 2.3 (and older); not tested
seen = {}
for row in items:
    rowkey = row.items()
    rowkey.sort()
    rowkey = tuple(rowkey)
    if rowkey in seen:
        print "dupe", row
    else:
        seen[rowkey] = 1

> (How can I found items that are almost the same? e.g.
> {'a':1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2})

If you are content with detecting similarities on a single axis:

def window(items):
    items = iter(items)
    prev = items.next()
    for cur in items:
        yield cur, prev
        prev = cur

def todict(k, v, rest):
    d = dict(rest)
    d[k] = v
    return d

axis = "c"
d = {}
eps = 0.5
for row in items:
    row = dict(row)
    try:
        value = row.pop(axis)
    except KeyError:
        pass
    else:
        key = frozenset(row.iteritems())
        d.setdefault(key, []).append(value)

for key, values in d.iteritems():
    if len(values) > 1:
        for cur, prev in window(sorted(values)):
            if abs(cur - prev) < eps:
                print "similar rows", todict(axis, cur, key), todict(axis, prev, key)

Otherwise you have to define a distance function and use a clustering
algorithm. Segaran's "Programming Collective Intelligence" has some
examples in Python, though the code can hardly be called idiomatic.

Peter
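The distance-function idea Peter mentions can be illustrated with a toy one-dimensional grouping (eps and the sample values below are made up for the example; real clustering, as in Segaran's book, is considerably more involved):

```python
def cluster_1d(values, eps):
    """Group sorted values whose gap to the previous value is below eps
    (a single-linkage pass along one axis)."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] < eps:
            clusters[-1].append(v)  # close enough: join the current cluster
        else:
            clusters.append([v])    # too far: start a new cluster
    return clusters

# 3.1 and 3.2 are "almost the same" for eps = 0.5; 6.0 stands alone.
groups = cluster_1d([3.1, 6.0, 3.2], eps=0.5)
```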
Re: Match items in large list
Fisherking wrote:
> Hi,
>
> I hope you can help me with this optimizing problem!
> I have a large list of dictionaries (about 250 000 of them). Two or more
> of these dictionaries contains the same data (not more than five or six
> of them per item) e.g.
> [{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6}, {'a':1,'b':'2','c':3},
> {'d':4,'e':'5','f':6}, ...].
> (the dictionaries are really containing phone numbers, duration time and
> call time.)
>
> Which are the best way of searching through the list and extract the
> items that are the same. In the first run I want to find every item
> that are the same as {'a':1,'b':'2','c':3}, the second
> {'d':4,'e':'5','f':6} etc. The list are not read-only and I can pop
> them out of the list when I have found them!
>
> (How can I found items that are almost the same? e.g.
> {'a':1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2})
>
> In the second part of this program I need to take each of these items
> (similar items are treated as one) and find these in a database that
> contains 2.5M posts!
>
> The python version I am bound to is Python 2.3

The best way of looking for duplicates is to make a dict or set of the
items. Unfortunately a dict isn't hashable, so it can't be a key.
Therefore I suggest you convert the dicts into tuples, which are
hashable:

counts = {}
for d in my_list:
    d_as_list = d.items()
    # Standardise the order so that identical dicts become identical tuples.
    d_as_list.sort()
    key = tuple(d_as_list)
    if key in counts:
        counts[key] += 1
    else:
        counts[key] = 1
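For readers on newer Pythons: the same counting idea collapses into a few lines with collections.Counter (added in Python 2.7, so not an option under the thread's 2.3 constraint; the sample data is invented):

```python
from collections import Counter

# Invented sample data; two of the three dicts are identical.
my_list = [{'a': 1, 'b': '2'}, {'d': 4}, {'a': 1, 'b': '2'}]

# Same canonical-tuple trick, with Counter doing the bookkeeping.
counts = Counter(tuple(sorted(d.items())) for d in my_list)
dupes = [key for key, n in counts.items() if n > 1]
```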
Re: Match items in large list
Fisherking wrote:
> Hi,
>
> I hope you can help me with this optimizing problem!
> I have a large list of dictionaries (about 250 000 of them). Two or more
> of these dictionaries contains the same data (not more than five or six
> of them per item) e.g.
> [{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6}, {'a':1,'b':'2','c':3},
> {'d':4,'e':'5','f':6}, ...].
> (the dictionaries are really containing phone numbers, duration time and
> call time.)
>
> Which are the best way of searching through the list and extract the
> items that are the same. In the first run I want to find every item
> that are the same as {'a':1,'b':'2','c':3}, the second
> {'d':4,'e':'5','f':6} etc. The list are not read-only and I can pop
> them out of the list when I have found them!

Popping items out of the middle of a list is a slow O(n) operation.
MRAB's set of tuples is what I would have suggested.

> (How can I found items that are almost the same? e.g.
> {'a':1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2})
>
> In the second part of this program I need to take each of these items
> (similar items are treated as one) and find these in a database that
> contains 2.5M posts!
>
> The python version I am bound to is Python 2.3

For that, a dict of tuples (with fake value == None) might be faster
than the add-on set module.
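The dict-with-dummy-values trick described above might look like the sketch below (sample records invented; on any modern Python the built-in set type replaces both this trick and the old add-on sets module):

```python
# Invented sample data; the first two records are identical.
records = [{'a': 1}, {'a': 1}, {'b': 2}]

seen = {}    # dict used as a set: only the keys matter, values stay None
unique = []  # first occurrence of each distinct record
for r in records:
    key = tuple(sorted(r.items()))  # canonical, hashable form of the dict
    if key not in seen:
        seen[key] = None
        unique.append(r)
```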