Re: Comparing sequences with range objects
Op 11/04/2022 om 02:31 schreef Dan Stromberg: It sounds a little like you're looking for interval arithmetic. Maybe https://pypi.org/project/python-intervals/1.5.3/ ? Not completely but it suggested an idea to explore. -- Antoon. -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
Op 11/04/2022 om 02:01 schreef duncan smith: On 10/04/2022 21:20, Antoon Pardon wrote: Op 9/04/2022 om 02:01 schreef duncan smith: On 08/04/2022 22:08, Antoon Pardon wrote: Well my first thought is that a bitset makes it less obvious to calulate the size of the set or to iterate over its elements. But it is an idea worth exploring. def popcount(n): """ Returns the number of set bits in n """ cnt = 0 while n: n &= n - 1 cnt += 1 return cnt and not tested, def iterinds(n): """ Returns a generator of the indices of the set bits of n """ i = 0 while n: if n & 1: yield i n = n >> 1 i += 1 Sure but these seem rather naive implementation with a time complexity of O(n) where n is the maximum number of possible elements. Using these would turn my O(n) algorithm in a O(n^2) algorithm. I thought your main concern was memory. Of course, dependent on various factors, you might be able to do much better than the above. But I don't know what your O(n) algorithm is, how using a bitset would make it O(n^2), or if the O(n^2) algorithm would actually be slower for typical n. The overall thing sounds broadly like some of the blocking and clustering methods I've come across in record linkage. Using bitsets would make change my algorithm from O(n) into O(n^2) because the bitset operations are O(n) instead of O(1). If I have a 2 people a bitset will take a vector of about 300, 64bit words. With 4 people the bitset will take a vector of about 600. So doubling the population will also double the time for the bitset operations, meaning doubling the population will increase execution time four times. -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
Two more things: 1) There are also modules that do interval arithmetic with reals, not just integers. EG: https://pyinterval.readthedocs.io/en/latest/ 2) If you don't mind a small amount of inaccuracy you can often get things down to less than one bit per element for presence/absence, using a bloom filter. EG: https://pypi.org/project/drs-bloom-filter/ On Sun, Apr 10, 2022 at 5:31 PM Dan Stromberg wrote: > > It sounds a little like you're looking for interval arithmetic. > > Maybe https://pypi.org/project/python-intervals/1.5.3/ ? > > On Thu, Apr 7, 2022 at 4:19 AM Antoon Pardon wrote: > >> I am working with a list of data from which I have to weed out duplicates. >> At the moment I keep for each entry a container with the other entries >> that are still possible duplicates. >> >> The problem is sometimes that is all the rest. I thought to use a range >> object for these cases. Unfortunatly I sometimes want to sort things >> and a range object is not comparable with a list or a tuple. >> >> So I have a list of items where each item is itself a list or range >> object. >> I of course could sort this by using list as a key function but that >> would defeat the purpose of using range objects for these cases. >> >> So what would be a relatively easy way to get the same result without >> wasting >> too much memory on entries that haven't any weeding done on them. >> >> -- >> Antoon Pardon. >> -- >> https://mail.python.org/mailman/listinfo/python-list >> > -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
On 2022-04-10 at 22:20:33 +0200, Antoon Pardon wrote: > > > Op 9/04/2022 om 02:01 schreef duncan smith: > > On 08/04/2022 22:08, Antoon Pardon wrote: > > > > > > Well my first thought is that a bitset makes it less obvious to calulate > > > the size of the set or to iterate over its elements. But it is an idea > > > worth exploring. > > > > > > > > > > > def popcount(n): > > """ > > Returns the number of set bits in n > > """ > > cnt = 0 > > while n: > > n &= n - 1 > > cnt += 1 > > return cnt > > > > and not tested, > > > > def iterinds(n): > > """ > > Returns a generator of the indices of the set bits of n > > """ > > i = 0 > > while n: > > if n & 1: > > yield i > > n = n >> 1 > > i += 1 > > > Sure but these seem rather naive implementation with a time complexity of > O(n) where n is the maximum number of possible elements. Using these would > turn my O(n) algorithm in a O(n^2) algorithm. O(n) where n is the expected number of elements. The loops iterate once for each bit actually contained in the set, which is usually [much] less than the size of the universe. If you have lots and lots of elements in your sets, or your universe is large and n is a long integer, then these may not be as efficient as other methods. You know your data better than we do. -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
It sounds a little like you're looking for interval arithmetic. Maybe https://pypi.org/project/python-intervals/1.5.3/ ? On Thu, Apr 7, 2022 at 4:19 AM Antoon Pardon wrote: > I am working with a list of data from which I have to weed out duplicates. > At the moment I keep for each entry a container with the other entries > that are still possible duplicates. > > The problem is sometimes that is all the rest. I thought to use a range > object for these cases. Unfortunatly I sometimes want to sort things > and a range object is not comparable with a list or a tuple. > > So I have a list of items where each item is itself a list or range object. > I of course could sort this by using list as a key function but that > would defeat the purpose of using range objects for these cases. > > So what would be a relatively easy way to get the same result without > wasting > too much memory on entries that haven't any weeding done on them. > > -- > Antoon Pardon. > -- > https://mail.python.org/mailman/listinfo/python-list > -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
On 10/04/2022 21:20, Antoon Pardon wrote: Op 9/04/2022 om 02:01 schreef duncan smith: On 08/04/2022 22:08, Antoon Pardon wrote: Well my first thought is that a bitset makes it less obvious to calulate the size of the set or to iterate over its elements. But it is an idea worth exploring. def popcount(n): """ Returns the number of set bits in n """ cnt = 0 while n: n &= n - 1 cnt += 1 return cnt and not tested, def iterinds(n): """ Returns a generator of the indices of the set bits of n """ i = 0 while n: if n & 1: yield i n = n >> 1 i += 1 Sure but these seem rather naive implementation with a time complexity of O(n) where n is the maximum number of possible elements. Using these would turn my O(n) algorithm in a O(n^2) algorithm. I thought your main concern was memory. Of course, dependent on various factors, you might be able to do much better than the above. But I don't know what your O(n) algorithm is, how using a bitset would make it O(n^2), or if the O(n^2) algorithm would actually be slower for typical n. The overall thing sounds broadly like some of the blocking and clustering methods I've come across in record linkage. Duncan -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
Op 9/04/2022 om 02:01 schreef duncan smith: On 08/04/2022 22:08, Antoon Pardon wrote: Well my first thought is that a bitset makes it less obvious to calulate the size of the set or to iterate over its elements. But it is an idea worth exploring. def popcount(n): """ Returns the number of set bits in n """ cnt = 0 while n: n &= n - 1 cnt += 1 return cnt and not tested, def iterinds(n): """ Returns a generator of the indices of the set bits of n """ i = 0 while n: if n & 1: yield i n = n >> 1 i += 1 Sure but these seem rather naive implementation with a time complexity of O(n) where n is the maximum number of possible elements. Using these would turn my O(n) algorithm in a O(n^2) algorithm. -- Antoon Pardon. -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
On 09/04/2022 13:14, Christian Gollwitzer wrote: Am 08.04.22 um 09:21 schrieb Antoon Pardon: The first is really hard. Not only may information be missing, no single single piece of information is unique or immutable. Two people may have the same name (I know about several other "Peter Holzer"s), a single person might change their name (when I was younger I went by my middle name - how would you know that "Peter Holzer" and "Hansi Holzer" are the same person?), they will move (= change their address), change jobs, etc. Unless you have a unique immutable identifier that's enforced by some authority (like a social security number[1]), I don't think there is a chance to do that reliably in a program (although with enough data, a heuristic may be good enough). Yes I know all that. That is why I keep a bucket of possible duplicates per "identifying" field that is examined and use some heuristics at the end of all the comparing instead of starting to weed out the duplicates at the moment something differs. The problem is, that when an identifying field is judged to be unusable, the bucket to be associated with it should conceptually contain all other records (which in this case are the indexes into the population list). But that will eat a lot of memory. So I want some object that behaves as if it is a (immutable) list of all these indexes without actually containing them. A range object almost works, with the only problem it is not comparable with a list. Then write your own comparator function? Also, if the only case where this actually works is the index of all other records, then a simple boolean flag "all" vs. "these items in the index list" would suffice - doesn't it? Christian Writing a comparator function is only possible for a given key. So my approach would be: 1) Write a comparator function that takes params X and Y, such that: if key data is missing from X, return 1 If key data is missing from Y return -1 if X > Y return 1 if X < Y return -1 return 0 # They are equal and key data for both is present 2) Sort the data using the comparator function. 3) Run through the data with a trailing enumeration loop, merging matching records together. 4) If there are no records copied out with missing key data, then you are done, so exit. 5) Choose a new key and repeat from step 1). Regards Ian -- Ian Hobson Tel (+66) 626 544 695 -- This email has been checked for viruses by AVG. https://www.avg.com -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
Am 08.04.22 um 09:21 schrieb Antoon Pardon: The first is really hard. Not only may information be missing, no single single piece of information is unique or immutable. Two people may have the same name (I know about several other "Peter Holzer"s), a single person might change their name (when I was younger I went by my middle name - how would you know that "Peter Holzer" and "Hansi Holzer" are the same person?), they will move (= change their address), change jobs, etc. Unless you have a unique immutable identifier that's enforced by some authority (like a social security number[1]), I don't think there is a chance to do that reliably in a program (although with enough data, a heuristic may be good enough). Yes I know all that. That is why I keep a bucket of possible duplicates per "identifying" field that is examined and use some heuristics at the end of all the comparing instead of starting to weed out the duplicates at the moment something differs. The problem is, that when an identifying field is judged to be unusable, the bucket to be associated with it should conceptually contain all other records (which in this case are the indexes into the population list). But that will eat a lot of memory. So I want some object that behaves as if it is a (immutable) list of all these indexes without actually containing them. A range object almost works, with the only problem it is not comparable with a list. Then write your own comparator function? Also, if the only case where this actually works is the index of all other records, then a simple boolean flag "all" vs. "these items in the index list" would suffice - doesn't it? Christian -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
On 08/04/2022 22:08, Antoon Pardon wrote: Op 8/04/2022 om 16:28 schreef duncan smith: On 08/04/2022 08:21, Antoon Pardon wrote: Yes I know all that. That is why I keep a bucket of possible duplicates per "identifying" field that is examined and use some heuristics at the end of all the comparing instead of starting to weed out the duplicates at the moment something differs. The problem is, that when an identifying field is judged to be unusable, the bucket to be associated with it should conceptually contain all other records (which in this case are the indexes into the population list). But that will eat a lot of memory. So I want some object that behaves as if it is a (immutable) list of all these indexes without actually containing them. A range object almost works, with the only problem it is not comparable with a list. Is there any reason why you can't use ints? Just set the relevant bits. Well my first thought is that a bitset makes it less obvious to calulate the size of the set or to iterate over its elements. But it is an idea worth exploring. def popcount(n): """ Returns the number of set bits in n """ cnt = 0 while n: n &= n - 1 cnt += 1 return cnt and not tested, def iterinds(n): """ Returns a generator of the indices of the set bits of n """ i = 0 while n: if n & 1: yield i n = n >> 1 i += 1 Duncan -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
Op 8/04/2022 om 16:28 schreef duncan smith: On 08/04/2022 08:21, Antoon Pardon wrote: Yes I know all that. That is why I keep a bucket of possible duplicates per "identifying" field that is examined and use some heuristics at the end of all the comparing instead of starting to weed out the duplicates at the moment something differs. The problem is, that when an identifying field is judged to be unusable, the bucket to be associated with it should conceptually contain all other records (which in this case are the indexes into the population list). But that will eat a lot of memory. So I want some object that behaves as if it is a (immutable) list of all these indexes without actually containing them. A range object almost works, with the only problem it is not comparable with a list. Is there any reason why you can't use ints? Just set the relevant bits. Well my first thought is that a bitset makes it less obvious to calulate the size of the set or to iterate over its elements. But it is an idea worth exploring. -- Antoon. -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
On 08/04/2022 08:21, Antoon Pardon wrote: Op 8/04/2022 om 08:24 schreef Peter J. Holzer: On 2022-04-07 17:16:41 +0200, Antoon Pardon wrote: Op 7/04/2022 om 16:08 schreef Joel Goldstick: On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon wrote: I am working with a list of data from which I have to weed out duplicates. At the moment I keep for each entry a container with the other entries that are still possible duplicates. [...] Sorry I wasn't clear. The data contains information about persons. But not all records need to be complete. So a person can occur multiple times in the list, while the records are all different because they are missing different bits. So all records with the same firstname can be duplicates. But if I have a record in which the firstname is missing, it can at that point be a duplicate of all other records. There are two problems. The first one is how do you establish identity. The second is how do you ween out identical objects. In your first mail you only asked about the second, but that's easy. The first is really hard. Not only may information be missing, no single single piece of information is unique or immutable. Two people may have the same name (I know about several other "Peter Holzer"s), a single person might change their name (when I was younger I went by my middle name - how would you know that "Peter Holzer" and "Hansi Holzer" are the same person?), they will move (= change their address), change jobs, etc. Unless you have a unique immutable identifier that's enforced by some authority (like a social security number[1]), I don't think there is a chance to do that reliably in a program (although with enough data, a heuristic may be good enough). Yes I know all that. That is why I keep a bucket of possible duplicates per "identifying" field that is examined and use some heuristics at the end of all the comparing instead of starting to weed out the duplicates at the moment something differs. The problem is, that when an identifying field is judged to be unusable, the bucket to be associated with it should conceptually contain all other records (which in this case are the indexes into the population list). But that will eat a lot of memory. So I want some object that behaves as if it is a (immutable) list of all these indexes without actually containing them. A range object almost works, with the only problem it is not comparable with a list. Is there any reason why you can't use ints? Just set the relevant bits. Duncan -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
Op 8/04/2022 om 08:24 schreef Peter J. Holzer: On 2022-04-07 17:16:41 +0200, Antoon Pardon wrote: Op 7/04/2022 om 16:08 schreef Joel Goldstick: On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon wrote: I am working with a list of data from which I have to weed out duplicates. At the moment I keep for each entry a container with the other entries that are still possible duplicates. [...] Sorry I wasn't clear. The data contains information about persons. But not all records need to be complete. So a person can occur multiple times in the list, while the records are all different because they are missing different bits. So all records with the same firstname can be duplicates. But if I have a record in which the firstname is missing, it can at that point be a duplicate of all other records. There are two problems. The first one is how do you establish identity. The second is how do you ween out identical objects. In your first mail you only asked about the second, but that's easy. The first is really hard. Not only may information be missing, no single single piece of information is unique or immutable. Two people may have the same name (I know about several other "Peter Holzer"s), a single person might change their name (when I was younger I went by my middle name - how would you know that "Peter Holzer" and "Hansi Holzer" are the same person?), they will move (= change their address), change jobs, etc. Unless you have a unique immutable identifier that's enforced by some authority (like a social security number[1]), I don't think there is a chance to do that reliably in a program (although with enough data, a heuristic may be good enough). Yes I know all that. That is why I keep a bucket of possible duplicates per "identifying" field that is examined and use some heuristics at the end of all the comparing instead of starting to weed out the duplicates at the moment something differs. The problem is, that when an identifying field is judged to be unusable, the bucket to be associated with it should conceptually contain all other records (which in this case are the indexes into the population list). But that will eat a lot of memory. So I want some object that behaves as if it is a (immutable) list of all these indexes without actually containing them. A range object almost works, with the only problem it is not comparable with a list. -- Antoon Pardon. -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
On Fri, 8 Apr 2022 at 16:26, Peter J. Holzer wrote: > Unless you have a unique immutable identifier that's enforced by > some authority (like a social security number[1]), I don't think there > is a chance to do that reliably in a program (although with enough data, > a heuristic may be good enough). > Not sure what your footnote was supposed to be, but I'll offer two useful footnotes to that: [1] [a] Though they're not actually immutable either, just less frequently changing [1[ [b] Of course, that has privacy implications. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
On 2022-04-07 17:16:41 +0200, Antoon Pardon wrote: > Op 7/04/2022 om 16:08 schreef Joel Goldstick: > > On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon wrote: > > > I am working with a list of data from which I have to weed out duplicates. > > > At the moment I keep for each entry a container with the other entries > > > that are still possible duplicates. [...] > Sorry I wasn't clear. The data contains information about persons. But not > all records need to be complete. So a person can occur multiple times in > the list, while the records are all different because they are missing > different bits. > > So all records with the same firstname can be duplicates. But if I have > a record in which the firstname is missing, it can at that point be > a duplicate of all other records. There are two problems. The first one is how do you establish identity. The second is how do you ween out identical objects. In your first mail you only asked about the second, but that's easy. The first is really hard. Not only may information be missing, no single single piece of information is unique or immutable. Two people may have the same name (I know about several other "Peter Holzer"s), a single person might change their name (when I was younger I went by my middle name - how would you know that "Peter Holzer" and "Hansi Holzer" are the same person?), they will move (= change their address), change jobs, etc. Unless you have a unique immutable identifier that's enforced by some authority (like a social security number[1]), I don't think there is a chance to do that reliably in a program (although with enough data, a heuristic may be good enough). hp -- _ | Peter J. Holzer| Story must make more sense than reality. |_|_) || | | | h...@hjp.at |-- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!" signature.asc Description: PGP signature -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
On 2022-04-07 16:16, Antoon Pardon wrote: Op 7/04/2022 om 16:08 schreef Joel Goldstick: On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon wrote: I am working with a list of data from which I have to weed out duplicates. At the moment I keep for each entry a container with the other entries that are still possible duplicates. The problem is sometimes that is all the rest. I thought to use a range object for these cases. Unfortunatly I sometimes want to sort things and a range object is not comparable with a list or a tuple. So I have a list of items where each item is itself a list or range object. I of course could sort this by using list as a key function but that would defeat the purpose of using range objects for these cases. So what would be a relatively easy way to get the same result without wasting too much memory on entries that haven't any weeding done on them. -- Antoon Pardon. -- https://mail.python.org/mailman/listinfo/python-list I'm not sure I understand what you are trying to do, but if your data has no order, you can use set to remove the duplicates Sorry I wasn't clear. The data contains information about persons. But not all records need to be complete. So a person can occur multiple times in the list, while the records are all different because they are missing different bits. So all records with the same firstname can be duplicates. But if I have a record in which the firstname is missing, it can at that point be a duplicate of all other records. This is how I'd approach it: # Make a list of groups, where each group is a list of potential duplicates. # Initially, all of the records are potential duplicates of each other. records = [list_of_records] # Split the groups into subgroups according to the first name. new_records = [] for group in records: subgroups = defaultdict(list) for record in group: subgroups[record['first_name']].append(record) # Records without a first name could belong to any of the subgroups. missing = subgroups.pop(None, []) for record in missing: for subgroup in subgroups.values(): subgroup.extend(missing) new_records.extend(subgroups.values()) records = new_records # Now repeat for the last name, etc. -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
Antoon Pardon wrote at 2022-4-7 17:16 +0200: > ... >Sorry I wasn't clear. The data contains information about persons. But not >all records need to be complete. So a person can occur multiple times in >the list, while the records are all different because they are missing >different bits. > >So all records with the same firstname can be duplicates. But if I have >a record in which the firstname is missing, it can at that point be >a duplicate of all other records. The description is still not clear enough. Especially, it does not show where `range` objects come into play. Answering on a more abstract level: Apparently, you want to sort a list the elements of which can be other lists or range objects. List objects are ordered lexicographically, i.e. for lists l1, l2: l1 <= l2 iff not l1 or (l2 and l1[0] <= l2[0] and l1[1:] <= l2[1:]). If you want to sort a list containing list elements are compared using this order. For your case, you would need to use a `key` parameter for `sort` that implements this order for `range` objects, too. (Note that Python provides a function which transforms an order definition into an appropriate `key` function). A corresponding `sort` call may expand your range objects completely. An alternative might be to not expand `range` objects but to put them all at the start or end of the sorted list. Of course, this would imply that their expansion does not influence their order in the list -- which may or may not be acceptable (depending on your use case). If it is acceptable, it is likely possible to not put range objects into the list to be sorted in the first place. -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
Op 7/04/2022 om 16:08 schreef Joel Goldstick: On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon wrote: I am working with a list of data from which I have to weed out duplicates. At the moment I keep for each entry a container with the other entries that are still possible duplicates. The problem is sometimes that is all the rest. I thought to use a range object for these cases. Unfortunatly I sometimes want to sort things and a range object is not comparable with a list or a tuple. So I have a list of items where each item is itself a list or range object. I of course could sort this by using list as a key function but that would defeat the purpose of using range objects for these cases. So what would be a relatively easy way to get the same result without wasting too much memory on entries that haven't any weeding done on them. -- Antoon Pardon. -- https://mail.python.org/mailman/listinfo/python-list I'm not sure I understand what you are trying to do, but if your data has no order, you can use set to remove the duplicates Sorry I wasn't clear. The data contains information about persons. But not all records need to be complete. So a person can occur multiple times in the list, while the records are all different because they are missing different bits. So all records with the same firstname can be duplicates. But if I have a record in which the firstname is missing, it can at that point be a duplicate of all other records. -- Antoon Pardon -- https://mail.python.org/mailman/listinfo/python-list
Re: Comparing sequences with range objects
On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon wrote: > > I am working with a list of data from which I have to weed out duplicates. > At the moment I keep for each entry a container with the other entries > that are still possible duplicates. > > The problem is sometimes that is all the rest. I thought to use a range > object for these cases. Unfortunatly I sometimes want to sort things > and a range object is not comparable with a list or a tuple. > > So I have a list of items where each item is itself a list or range object. > I of course could sort this by using list as a key function but that > would defeat the purpose of using range objects for these cases. > > So what would be a relatively easy way to get the same result without wasting > too much memory on entries that haven't any weeding done on them. > > -- > Antoon Pardon. > -- > https://mail.python.org/mailman/listinfo/python-list I'm not sure I understand what you are trying to do, but if your data has no order, you can use set to remove the duplicates -- Joel Goldstick -- https://mail.python.org/mailman/listinfo/python-list
Comparing sequences with range objects
I am working with a list of data from which I have to weed out duplicates. At the moment I keep for each entry a container with the other entries that are still possible duplicates. The problem is sometimes that is all the rest. I thought to use a range object for these cases. Unfortunatly I sometimes want to sort things and a range object is not comparable with a list or a tuple. So I have a list of items where each item is itself a list or range object. I of course could sort this by using list as a key function but that would defeat the purpose of using range objects for these cases. So what would be a relatively easy way to get the same result without wasting too much memory on entries that haven't any weeding done on them. -- Antoon Pardon. -- https://mail.python.org/mailman/listinfo/python-list