Re: Printing dict value for possibly undefined key

2023-11-25 Thread duncan smith via Python-list

On 24/11/2023 16:35, duncan smith wrote:

On 24/11/2023 14:31, Loris Bennett wrote:

Hi,

I want to print some records from a database table where one of the
fields contains a JSON string which is read into a dict.  I am doing
something like

   print(f"{id} {d['foo']} {d['bar']}")

However, the dict does not always have the same keys, so d['foo'] or
d['bar'] may be undefined.  I can obviously do something like

   if not 'foo' in d:
       d['foo']="NULL"
   if not 'bar' in d:
       d['bar']="NULL"
   print(f"{id} {d['foo']} {d['bar']}")

Is there any more compact way of achieving the same thing?

Cheers,

Loris



Yes. e.g.

d.get('foo', "NULL")

Duncan


Or make d a defaultdict.

>>> from collections import defaultdict
>>> dic = defaultdict(lambda: 'NULL')
>>> dic['foo'] = 'astring'
>>> dic['foo']
'astring'
>>> dic['bar']
'NULL'

Duncan
--
https://mail.python.org/mailman/listinfo/python-list


Re: Printing dict value for possibly undefined key

2023-11-25 Thread duncan smith via Python-list

On 24/11/2023 14:31, Loris Bennett wrote:

Hi,

I want to print some records from a database table where one of the
fields contains a JSON string which is read into a dict.  I am doing
something like

   print(f"{id} {d['foo']} {d['bar']}")

However, the dict does not always have the same keys, so d['foo'] or
d['bar'] may be undefined.  I can obviously do something like

   if not 'foo' in d:
       d['foo']="NULL"
   if not 'bar' in d:
       d['bar']="NULL"
   print(f"{id} {d['foo']} {d['bar']}")

Is there any more compact way of achieving the same thing?

Cheers,

Loris



Yes. e.g.

d.get('foo', "NULL")
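
e.g. applied to Loris' print (just a sketch, reusing the names from the
snippet above; no need to mutate d first):

print(f"{id} {d.get('foo', 'NULL')} {d.get('bar', 'NULL')}")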

Duncan
--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-10 Thread duncan smith

On 10/04/2022 21:20, Antoon Pardon wrote:



On 9/04/2022 at 02:01, duncan smith wrote:

On 08/04/2022 22:08, Antoon Pardon wrote:


Well my first thought is that a bitset makes it less obvious to calculate
the size of the set or to iterate over its elements. But it is an idea
worth exploring.





def popcount(n):
    """
    Returns the number of set bits in n
    """
    cnt = 0
    while n:
        n &= n - 1
        cnt += 1
    return cnt

and not tested,

def iterinds(n):
    """
    Returns a generator of the indices of the set bits of n
    """
    i = 0
    while n:
        if n & 1:
            yield i
        n = n >> 1
        i += 1


Sure, but these seem rather naive implementations with a time complexity of
O(n) where n is the maximum number of possible elements. Using these would
turn my O(n) algorithm into an O(n^2) algorithm.



I thought your main concern was memory. Of course, dependent on various 
factors, you might be able to do much better than the above. But I don't 
know what your O(n) algorithm is, how using a bitset would make it 
O(n^2), or if the O(n^2) algorithm would actually be slower for typical 
n. The overall thing sounds broadly like some of the blocking and 
clustering methods I've come across in record linkage.


Duncan
--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-09 Thread duncan smith

On 08/04/2022 22:08, Antoon Pardon wrote:


On 8/04/2022 at 16:28, duncan smith wrote:

On 08/04/2022 08:21, Antoon Pardon wrote:


Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.

The problem is, that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all other
records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is a (immutable) list of all these indexes without actually containing
them. A range object almost works, with the only problem it is not
comparable with a list.



Is there any reason why you can't use ints? Just set the relevant bits.


Well my first thought is that a bitset makes it less obvious to calculate
the size of the set or to iterate over its elements. But it is an idea
worth exploring.





def popcount(n):
    """
    Returns the number of set bits in n
    """
    cnt = 0
    while n:
        n &= n - 1
        cnt += 1
    return cnt

and not tested,

def iterinds(n):
    """
    Returns a generator of the indices of the set bits of n
    """
    i = 0
    while n:
        if n & 1:
            yield i
        n = n >> 1
        i += 1
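
As a quick sanity check (not in the original post):

>>> popcount(0b10110)
3
>>> list(iterinds(0b10110))
[1, 2, 4]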


Duncan


--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-08 Thread duncan smith

On 08/04/2022 08:21, Antoon Pardon wrote:



On 8/04/2022 at 08:24, Peter J. Holzer wrote:

On 2022-04-07 17:16:41 +0200, Antoon Pardon wrote:

On 7/04/2022 at 16:08, Joel Goldstick wrote:
On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon   
wrote:
I am working with a list of data from which I have to weed out 
duplicates.

At the moment I keep for each entry a container with the other entries
that are still possible duplicates.

[...]
Sorry I wasn't clear. The data contains information about persons. But not
all records need to be complete. So a person can occur multiple times in
the list, while the records are all different because they are missing
different bits.

So all records with the same firstname can be duplicates. But if I have
a record in which the firstname is missing, it can at that point be
a duplicate of all other records.

There are two problems. The first one is how do you establish identity.
The second is how do you weed out identical objects. In your first mail
you only asked about the second, but that's easy.

The first is really hard. Not only may information be missing, no single
piece of information is unique or immutable. Two people may have
the same name (I know about several other "Peter Holzer"s), a single
person might change their name (when I was younger I went by my middle
name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
same person?), they will move (= change their address), change jobs,
etc. Unless you have a unique immutable identifier that's enforced by
some authority (like a social security number[1]), I don't think there
is a chance to do that reliably in a program (although with enough data,
a heuristic may be good enough).


Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.

The problem is, that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all other
records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is a (immutable) list of all these indexes without actually containing
them. A range object almost works, with the only problem it is not
comparable with a list.



Is there any reason why you can't use ints? Just set the relevant bits.
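
For example (a minimal sketch, not from the thread), a bucket of candidate
record indexes kept as an int:

bucket = 0
for i in (3, 5, 20):              # add indexes 3, 5 and 20 to the bucket
    bucket |= 1 << i

print(bool(bucket & (1 << 5)))    # membership test -> True
bucket &= ~(1 << 5)               # discard index 5
print(bool(bucket & (1 << 5)))    # -> False

other = (1 << 3) | (1 << 7)
print(bin(bucket & other))        # intersection is bitwise AND -> 0b1000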

Duncan
--
https://mail.python.org/mailman/listinfo/python-list


Re: mutating a deque whilst iterating over it

2021-02-14 Thread duncan smith
On 13/02/2021 19:12, Dan Stromberg wrote:
> On Sat, Feb 13, 2021 at 10:25 AM duncan smith 
> wrote:
> 
>> On 12/02/2021 03:04, Terry Reedy wrote:
>>> On 2/11/2021 3:22 PM, duncan smith wrote:
>>>
>>>>It seems that I can mutate a deque while iterating over it if I
>>>> assign to an index, but not if I append to it. Is this the intended
>>>> behaviour?
>>
>> What I was really wondering was whether the behaviour is as it is
>> because of the implementation or because it's how deques should ideally
>> behave. i.e. In my implementation do I stick strictly to the same API,
>> or allow it to differ? In some places I'm jumping through hoops to stick
>> to the API, and (when it comes to iteration) I'm jumping through
>> different hoops for different types of container (e.g. lists versus
>> deques). BTW, the reason I am implementing these at all is that my
>> containers are on-disk. Cheers.
>>
> collections.deque appears to take the approach of "allow everything we
> can based on our implementation, and trust the client not to overuse
> features".
> 
> In fact, in Python, "over abstraction" is often seen as a bad thing, mostly
> because it slows things down so much.
> 
> If you want something more abstracted, you might have a look at:
> https://stromberg.dnsalias.org/~strombrg/linked-list/
> https://pypi.org/project/linked_list_mod/
> 
> (Both URL's are about the same module)
> 
> Note that linked_list_mod is quite a bit slower than a collections.deque.
> Deques use a clever-but-hidden trick to gain a lot of speed - it's really a
> linked list of built in lists, which gives better locality of reference
> that makes CPU caches very happy.  linked_list_mod goes for abstraction and
> simplicity.
> 

I'm basically nicking the trick. I have a basic, single file, on-disk
deque class that doesn't stick strictly to the Python deque API, then a
second class that implements a deque as a Python deque containing
instances of my basic on-disk deque class (with fixed size apart from
the first and last instances). Of course, this could change when I get
to testing / profiling. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: mutating a deque whilst iterating over it

2021-02-13 Thread duncan smith
On 12/02/2021 03:04, Terry Reedy wrote:
> On 2/11/2021 3:22 PM, duncan smith wrote:
> 
>>    It seems that I can mutate a deque while iterating over it if I
>> assign to an index, but not if I append to it. Is this the intended
>> behaviour?
> 
> Does the deque doc say anything about mutation while iterating? (Knowing
> the author of deque, I would consider everything about it intentional
> without *good* reason to think otherwise.
> 
>>>>> from collections import deque
>>>>> d = deque(range(8))
>>>>> it = iter(d)
>>>>> next(it)
>> 0
>>>>> d[1] = 78
> 
> This does not change the structure of the deque, so next does not
> notice.  It could be considered not to be a mutation.  It could be detected
> by changing deque.__setitem__, but why bother and slow down all
> __setitem__ calls.
> 
>>>>> next(it)
>> 78
>>>>> d.append(8)
> 
> This changes the structure, which can apparently mess-up iteration.
> 
>>>>> next(it)
>> Traceback (most recent call last):
>>    File "<stdin>", line 1, in <module>
>>  next(it)
>> RuntimeError: deque mutated during iteration
>>>>>
> 
> 

What I was really wondering was whether the behaviour is as it is
because of the implementation or because it's how deques should ideally
behave. i.e. In my implementation do I stick strictly to the same API,
or allow it to differ? In some places I'm jumping through hoops to stick
to the API, and (when it comes to iteration) I'm jumping through
different hoops for different types of container (e.g. lists versus
deques). BTW, the reason I am implementing these at all is that my
containers are on-disk. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


mutating a deque whilst iterating over it

2021-02-11 Thread duncan smith
Hello,
  It seems that I can mutate a deque while iterating over it if I
assign to an index, but not if I append to it. Is this the intended
behaviour? It seems a bit inconsistent. Cheers.

Duncan

>>> from collections import deque
>>> d = deque(range(8))
>>> it = iter(d)
>>> next(it)
0
>>> d[1] = 78
>>> next(it)
78
>>> d.append(8)
>>> next(it)
Traceback (most recent call last):
  File "", line 1, in 
next(it)
RuntimeError: deque mutated during iteration
>>>
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How do you debug in Python? Coming from a Matlab and R user. I'm already aware of pdb.

2021-01-27 Thread duncan smith
On 27/01/2021 22:41, C W wrote:
> Great tutorial Irv, very simple with if-else example, gets the point
> across.
> 
> My main takeaway from the discussion so far is that: you can't troubleshoot
> Python without some kind of breakpoint or debugger.
> 

[snip]

Really?

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find word by given characters

2020-11-04 Thread duncan smith
On 04/11/2020 19:12, Avi Gross wrote:
> My comments at end:
> 
> -Original Message-
> From: Python-list  On
> Behalf Of duncan smith
> Sent: Wednesday, November 4, 2020 1:09 PM
> To: python-list@python.org
> Subject: Re: Find word by given characters
> 
> On 04/11/2020 04:21, Avi Gross wrote:
>> Duncan, my comments below yours at end.
>>
>> ---YOURS---
>> The Counter approach only requires iterating over the letters once to 
>> construct the letters bag, then each word once to create the relevant 
>> word bag. After that it's (at worst) a couple of lookups and a 
>> comparison for each unique character in letters (for each word).
>>
>> Your approach requires iteration over the words to create the lists of 
>> characters. Then they are (potentially) iterated over multiple times 
>> looking for the characters in letters. There's also the cost of 
>> removing items from arbitrary positions in the lists. Also, it seems 
>> that the character frequencies must be equal, and your approach only 
>> ensures that the words contain at least the required numbers of
> characters.
>>
>> In computational terms, if the first approach is something like O(n+m) 
>> for n letters and words of length m, your algorithm is more like O(nm).
>> Not to say that it will be slower for all possible letters and 
>> dictionaries, but probably for any realistic cases and a lot slower 
>> for large enough dictionaries.
>>
>> Duncan
>>
>> --MINE---
>>
>> I appreciate your analysis. I have not looked at the "counter"
>> implementation and suspect it does some similar loops within, albeit 
>> it may be implemented in a compiled language like C++.
>>
> 
> Before the introduction of counters I would have used a dict to create a
> mapping of characters to counts. That would require iterating over the
> characters in a string only once to create / update the dict entries, and I
> doubt counters are implemented less efficiently than that.
> 
>> I did not write out my algorithm in Python but have done it for 
>> myself. It runs fast enough with most of the time spent in the slow I/O
> part.
>>
>> We can agree all algorithms have to read in all the words in a data file.
>> There may be ways to store the data such as in a binary tree and even 
>> ways to thus prune the search as once a node is reached where all 
>> required letters have been found, all further words qualify below that 
>> point. If you match say "instant" then instants and instantiation 
>> would be deeper in the tree and also qualify assuming extra letters are
> allowed.
> 
> I don't see how a binary tree would be useful. As I've pointed out in
> another post, there are other data structures that could be useful. What I
> had in mind was a trie (prefix tree). But it's only the distinct characters
> and frequencies that are relevant and so I'd exploit that (and one or two
> other things) to reduce space and improve search.
> 
>> We may differ on
>> the requirement as I think that the number of repeats for something 
>> like a,t,t require to be at least as big as in "attend" but that 
>> "attention" with yet another "t" would also be OK. If I am wrong, 
>> fine, but I note the person requesting this has admitted a certain 
>> lack of credentials while also claiming he made up a scenario just for 
>> fun. So this is not actually a particularly worthy challenge let alone
> with a purpose.
>>
>> My impression is that the average word length would be something small 
>> like 5-7. The number of words in a dictionary might be 100,000 or 
>> more. So if you want efficiency, where do you get the bang for the buck?
>>
>> I would argue that a simple test run on all the words might often 
>> narrow the field to a much smaller number of answers like just a few 
>> thousand or even much less. Say you test for the presence of "aeiou" 
>> in words, in whatever order. That might be done from reading a file 
>> and filtering out a relatively few potential answers. You can save 
>> those for a second round to determine if they are fully qualified by 
>> any additional rules that may involve more expensive operations.
>>
> 
> Your proposed approach didn't involve any trees (or tries) or filtering of
> words. So I don't see how any of this justifies it.
> 
>> How fast (or slow) are regular expressions for this purpose? Obviously 
>> it depends on complexity and something like "^[^aeiou]*[aeiou] 
>> [^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou] 
>> [^aeio

Re: Find word by given characters

2020-11-04 Thread duncan smith
On 04/11/2020 04:21, Avi Gross wrote:
> Duncan, my comments below yours at end.
> 
> ---YOURS---
> The Counter approach only requires iterating over the letters once to
> construct the letters bag, then each word once to create the relevant word
> bag. After that it's (at worst) a couple of lookups and a comparison for
> each unique character in letters (for each word).
> 
> Your approach requires iteration over the words to create the lists of
> characters. Then they are (potentially) iterated over multiple times looking
> for the characters in letters. There's also the cost of removing items from
> arbitrary positions in the lists. Also, it seems that the character
> frequencies must be equal, and your approach only ensures that the words
> contain at least the required numbers of characters.
> 
> In computational terms, if the first approach is something like O(n+m) for n
> letters and words of length m, your algorithm is more like O(nm).
> Not to say that it will be slower for all possible letters and dictionaries,
> but probably for any realistic cases and a lot slower for large enough
> dictionaries.
> 
> Duncan
> 
> --MINE---
> 
> I appreciate your analysis. I have not looked at the "counter"
> implementation and suspect it does some similar loops within, albeit it may
> be implemented in a compiled language like C++.
> 

Before the introduction of counters I would have used a dict to create a
mapping of characters to counts. That would require iterating over the
characters in a string only once to create / update the dict entries,
and I doubt counters are implemented less efficiently than that.
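
i.e. something along these lines (a sketch of the pre-Counter idiom, not
code from the thread):

def char_counts(s):
    # one pass over the string, building a character -> count mapping
    counts = {}
    for char in s:
        counts[char] = counts.get(char, 0) + 1
    return counts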

> I did not write out my algorithm in Python but have done it for myself. It
> runs fast enough with most of the time spent in the slow I/O part. 
> 
> We can agree all algorithms have to read in all the words in a data file.
> There may be ways to store the data such as in a binary tree and even ways
> to thus prune the search as once a node is reached where all required
> letters have been found, all further words qualify below that point. If you
> match say "instant" then instants and instantiation would be deeper in the
> tree and also qualify assuming extra letters are allowed. 

I don't see how a binary tree would be useful. As I've pointed out in
another post, there are other data structures that could be useful. What
I had in mind was a trie (prefix tree). But it's only the distinct
characters and frequencies that are relevant and so I'd exploit that
(and one or two other things) to reduce space and improve search.

> We may differ on
> the requirement as I think that the number of repeats for something like
> a,t,t require to be at least as big as in "attend" but that "attention" with
> yet another "t" would also be OK. If I am wrong, fine, but I note the person
> requesting this has admitted a certain lack of credentials while also
> claiming he made up a scenario just for fun. So this is not actually a
> particularly worthy challenge let alone with a purpose.
> 
> My impression is that the average word length would be something small like
> 5-7. The number of words in a dictionary might be 100,000 or more. So if you
> want efficiency, where do you get the bang for the buck? 
> 
> I would argue that a simple test run on all the words might often narrow the
> field to a much smaller number of answers like just a few thousand or even
> much less. Say you test for the presence of "aeiou" in words, in whatever
> order. That might be done from reading a file and filtering out a relatively
> few potential answers. You can save those for a second round to determine if
> they are fully qualified by any additional rules that may involve more
> expensive operations.
> 

Your proposed approach didn't involve any trees (or tries) or filtering
of words. So I don't see how any of this justifies it.

> How fast (or slow) are regular expressions for this purpose? Obviously it
> depends on complexity and something like 
> "^[^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou]
> [^aeiou]*[aeiou] [^aeiou]*$"
> 
> would be easy to construct once but likely horribly inefficient in searching
> and a bit of overkill here. I suspect there is already some simple C
> function that could be used from within python that looks like
> findall(choices, word) that might return how many of the letters in choices
> were found in word and you simply comparer that to length(word) perhaps more
> efficiently.
> 
> It looks easier to check if a character exists in one of the ways already
> discussed within python using a loop as discussed. Something as simple as
> this:
> 
>   needed = "aeiou"
>   trying = "education"
>   found = all([trying.find(each) >= 0  for each in needed ])
>   print(found)
> 
>   trying = "educated"
>   found = all([trying.find(each) >= 0  for each in needed ])
>   print(found)
>   The above prints 
> 
> My point is you can use the above to winnow down possible answers and only
> subject that 

Re: Find word by given characters

2020-11-03 Thread duncan smith
On 03/11/2020 22:14, Avi Gross wrote:
> I, too, have wondered what exactly the point was of the required
> functionality for SCRABBLE but note you can extend a current word so
> additional letters may be available in a game but only if they are an exact
> fit to put before, after, or in middle of your word. 
> 
> But this seems to be a fairly simple problem to solve unless I misunderstand
> it. Elegance aside, what would be wrong with this approach.
> 
> - Read a word at a time in a loop from the file of dictionary words
> (non-Python meaning of dictionary.) For each one do the following, perhaps
> using a function:
> 
> Break the current word into a list of individual letters.
> Loop over the letters you want and:
>   If the letter is in the list, remove it and continue
>   Else skip the current word as it is not a match.
> 
> At the end of each of the above loops, you only reached here if all the
> letters were found and removed. If the list is now empty, fine. If it has
> extra remaining letters, also fine by the requirements stated. Letters in
> the list multiple times are removed multiple times.
> 
> The above need not use  list of letters and can be done many other ways but
> seems conceptually simple. Each word is processed once. It can be converted
> to using a list comprehension or something similar by using "all" and so on.
> 
> Or am I missing something about other approaches being better or more
> efficient or ... And, yes, counting may have an advantage as the list does
> not need to be modified repeatedly but creating an object or three also has
> overhead.

[snip]

The Counter approach only requires iterating over the letters once to
construct the letters bag, then each word once to create the relevant
word bag. After that it's (at worst) a couple of lookups and a
comparison for each unique character in letters (for each word).

Your approach requires iteration over the words to create the lists of
characters. Then they are (potentially) iterated over multiple times
looking for the characters in letters. There's also the cost of removing
items from arbitrary positions in the lists. Also, it seems that the
character frequencies must be equal, and your approach only ensures that
the words contain at least the required numbers of characters.

In computational terms, if the first approach is something like O(n+m)
for n letters and words of length m, your algorithm is more like O(nm).
Not to say that it will be slower for all possible letters and
dictionaries, but probably for any realistic cases and a lot slower for
large enough dictionaries.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find word by given characters

2020-11-03 Thread duncan smith
On 03/11/2020 23:35, Bischoop wrote:
> On 2020-11-03, duncan smith  wrote:
>>>
>>
>>>>> from collections import Counter
>>>>> letters = 'att'
>>>>> letter_counts = Counter(letters)
>>>>> word = 'tolerate'
>>>>> wd_counts = Counter(word)
>>>>> for char, cnt in letter_counts.items():
>>  print (cnt == wd_counts[char])
>>
>>  
>> True
>> True
>>>>> word = 'auto'
>>>>> wd_counts = Counter(word)
>>>>> for char, cnt in letter_counts.items():
>>  print (cnt == wd_counts[char])
>>
>>  
>> True
>> False
>>>>>
>>
>> or, equivalent to the above loop, but breaking when the first False is
>> generated and returning the single Boolean that you're after,
>>
>>>>> all(cnt == wd_counts[char] for char, cnt in letter_counts.items())
>> False
>>>>>
>>
>> There's still a lot of scope for improvement, but possibly not by doing
>> simple things
> 
> 
> lol
> 
> I'm thinking about it for a couple days and you guys coming with few
> ideas to it like it's nothing.
> Pity I've wasted a decade with woman, now probably to old to learn
> anything new.
> 

It looks like you've learnt something about wasting time with women ;-).
Keep tinkering as long as you're enjoying it (or getting paid for it)
and pick up knowledge of relevant data structures and algorithms as you
go. (That's what I've done. I'm actually a statistician.) The above
solution uses bags (implemented in Python as Counter instances), and the
(repeated) generation of the letters bag was hoisted outside the for
loop. (If the dictionary was going to be searched for multiple sets of
letters the generation of the word bags would also be hoisted.) There's
also the idea of breaking out of the loop once the first False is
generated (implemented most neatly by using all). So there might be
something useful to take from it.
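
For example, if the dictionary will be searched for several sets of letters,
the word bags can be built once up front. A sketch (the word list here is
just a stand-in for the dictionary file, and the names are illustrative):

from collections import Counter

words = ['auto', 'toast', 'toasty', 'tolerant']         # stand-in dictionary
word_bags = [(word, Counter(word)) for word in words]   # built once

def search(letters):
    letter_counts = Counter(letters)                    # built once per query
    return [word for word, bag in word_bags
            if all(cnt == bag[char] for char, cnt in letter_counts.items())]

print(search('att'))    # ['toast', 'toasty', 'tolerant']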

A bag was the obvious data structure because the solution depends only
on the unique characters in a string and their frequencies. But that's
thinking about the problem word by word. We actually have a dictionary
of words, and a dictionary can be structured so that it can be searched
much more efficiently than 'word by word'. The particular data structure
I have in mind is not (yet) in the standard Python library. That's maybe
worth looking at when you've got the word by word approach grokked.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find word by given characters

2020-11-03 Thread duncan smith
On 03/11/2020 14:06, Bischoop wrote:
> On 2020-11-03, dn  wrote:
>>
>>
>> The (full) specs are not clear. There's certainly room for 
>> misunderstanding. I'd be happier if I could 'see' a full spec or 
>> recognise a practical application, because then we'd be better able to 
>> discuss facts. Meantime, we try to help with what we have been given...
>>
>>
>> The OP's sample code only realises "conforming word[s]" (good term 
>> BTW!). The snippet does not report any "count". Similarly, there's no 
>> facts to avoid ("I assume") an assumption about whether "Letters" may 
>> include duplication - indeed: whether duplication has meaning or should 
>> be regarded as user-error.
>>
> 
> Let me clarify what I want to do:
> We all know Scrabble game.
> there's a file with Dictionary containing word in each line, the idea is
> to input some letters for example mentioned earlier: att, the script
> supposed to find all words in which letters: 'a','t','t' occur and print
> these words. It supposed not to print word: 'auto' because there's only
> one 't' but words such as: 'toast', 'toasty', 'tolerant' are meeting the
> criteria becase we have in these words one 'a' and two 't' as user has
> input.
> I've checked collections counter but seems to complicated for me at this
> stage yet and I'm struggling with applying it.
> 

>>> from collections import Counter
>>> letters = 'att'
>>> letter_counts = Counter(letters)
>>> word = 'tolerate'
>>> wd_counts = Counter(word)
>>> for char, cnt in letter_counts.items():
print (cnt == wd_counts[char])


True
True
>>> word = 'auto'
>>> wd_counts = Counter(word)
>>> for char, cnt in letter_counts.items():
print (cnt == wd_counts[char])


True
False
>>>

or, equivalent to the above loop, but breaking when the first False is
generated and returning the single Boolean that you're after,

>>> all(cnt == wd_counts[char] for char, cnt in letter_counts.items())
False
>>>

There's still a lot of scope for improvement, but possibly not by doing
simple things.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find word by given characters

2020-11-02 Thread duncan smith
On 02/11/2020 19:09, dn wrote:
> On 02/11/2020 23:29, Bischoop wrote:
>> On 2020-11-01, duncan smith  wrote:
>>>>
>>>
>>> But this generates the letters counts for each word. They only need to
>>> be generated once (outside the for loop). And using count requires
>>> iterating over the letters / words for each x in letters (rather than
>>> once). For a large enough collection of words you'd notice the
>>> difference.
>>>
>>
>> You're pretty much right, it doesn't go too fast.
>> I've done this years ago with Python2, had a bit free time now and
>> wanted to ReLearn Python from starting old projects I've done in past
>> but can't remember how I made it working lol.
> 
> 
> If you have a working Py2 version, once print-statements were changed
> into functions, what errors were thrown-up?
> 
> 
> Multiple loops written in Python are likely to be slower than same in
> compiled code - which was probably part of the motivation for @Terry's
> response. Plus, "re-use" - why write something ourselves if someone else
> has already done the work?
> 
> 
> How about a change of tactics?
> 
> - str.find() or .index() will locate a character within the string
> (starting from character[0]/the left-hand side)
> - if this fails, tears will fall...
> - repeat, from the right
> - if both results are the same character/position, it must be unique
> within the string
> - repeat for each character in "Letters"
> 

[snip]

But he appears to need the count in order to compare against the counts
in letters. I assume letters could be something like 'att', in which
case knowing a word contains more than one 't' doesn't cut it. He could
try iterating over the characters in each word once, checking that no
count from letters is exceeded, and that the number of remaining
characters gives some chance that the relevant counts from letters could
be achieved. In the worst case (e.g. a conforming word) this requires
iterating over all the characters in a word once. That's not to say it
would actually run quicker 'on average' than a solution using Counter.
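
Something like this, perhaps (a hypothetical helper, not code from the
thread):

from collections import Counter

def conforms(word, letter_counts):
    # letter_counts: Counter of the required letters, e.g. Counter('att')
    seen = Counter()
    missing = sum(letter_counts.values())   # required characters not yet seen
    for i, char in enumerate(word):
        if char in letter_counts:
            seen[char] += 1
            if seen[char] > letter_counts[char]:
                return False                # a count from letters is exceeded
            missing -= 1
        if missing > len(word) - (i + 1):
            return False                    # too few characters left to conform
    return missing == 0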

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find word by given characters

2020-11-01 Thread duncan smith
On 01/11/2020 13:38, Bischoop wrote:
> On 2020-11-01, Bischoop  wrote:
>> I'm working on a script i which user inputs letters and then a printed
>> words containing those letters. The scripts works however I can't solve
>> one problem, it prints also words in which these letters occur more 
>> than once.
>> ---
>> Fore example:
>> Letters: at
>> Output: auto, autobahn.
>> ---
>>
>> I supposed to not print word: "autobahn" because I've given it only one
>> letter "a".
>>
> 
> Problem's solved.
> 
> 
> for word in words:
> if all(word.count(x) == letters.count(x) for x in letters):
> print(word)
> -
> 
> Thanks for suggestions, I didn't had to use counter but going to look
> into as it seems useful and interesting.
> 
> Thank You
> 

But this generates the letters counts for each word. They only need to
be generated once (outside the for loop). And using count requires
iterating over the letters / words for each x in letters (rather than
once). For a large enough collection of words you'd notice the difference.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Confused about np.polyfit()

2020-07-19 Thread duncan smith
On 19/07/2020 16:11, Dino wrote:
> On 7/19/2020 4:54 PM, duncan smith wrote:
>>
>> It depends on what you expect the result to be. There's nothing
>> inherently wrong with transforming variables before using least squares
>> fitting. Whether it gives you the "best" estimates for the coefficients
>> is a different issue.
> 
> Thanks a lot for this, Duncan. I guess I have to follow-up questions at
> this point:
> 
> 1) in which ways is this approach sub-optimal?
> 
> 2) what's the "right" way to do it?
> 
> Thank you
> 

You'll have to read up a bit on ordinary least squares (e.g.
https://en.wikipedia.org/wiki/Ordinary_least_squares). It is based on
assumptions that might not necessarily hold for a given model / dataset.
Depending on which assumptions are violated the estimates can be
affected in different ways. It is usual to fit the model, then check the
residuals to see if the assumptions (approximately) hold. If not, it
might indicate a poor model fit or suggest fitting a transformed model
(to estimate the same coefficients while satisfying the assumptions).
e.g. For the latter case,

Y = a + bX

has the same coefficients as

Y/X = a * 1/X + b

but the latter regression might satisfy the assumption of constant
variance for the errors.
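
In numpy terms, something like this (a sketch with made-up data; both fits
estimate the same a and b, and polyfit returns the highest-degree
coefficient first):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 50)
a, b = 2.0, 0.5
y = a + b * x + x * rng.normal(0.0, 0.1, size=x.size)  # error sd grows with x

b_hat, a_hat = np.polyfit(x, y, 1)                # fit Y = a + bX directly
a_hat_t, b_hat_t = np.polyfit(1.0 / x, y / x, 1)  # fit Y/X = a * (1/X) + b

print(a_hat, b_hat)
print(a_hat_t, b_hat_t)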

Regression analysis is a bit of an art, and it's a long time since I did
any. Ordinary least squares is optimal in a certain sense when the
assumptions hold. When they don't there's no single answer to what the
best alternative is (unless it's employ a good statistician).

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Confused about np.polyfit()

2020-07-19 Thread duncan smith
On 19/07/2020 11:19, Dino wrote:
> 
> Hi, I am looking at someone else's code trying to understand their use
> of numpy.polyfit.
> 
> My understanding was that you can use it to fit polynomials, but
> apparently, the original author has used it for logarithmic and
> exponential curves as well this way:
> 
> Logarithmic
> 
>     def fit_model(self):
>     self.coefficients = np.polyfit(np.log(self.x), self.y, 1)
> 
> Exponential
> 
>     def fit_model(self):
>     self.coefficients = np.polyfit(self.x, np.log(self.y), 1)
> 
> is this an ok use of np.polyfit? Will it yield the expected result.
> 
> Thanks

It depends on what you expect the result to be. There's nothing
inherently wrong with transforming variables before using least squares
fitting. Whether it gives you the "best" estimates for the coefficients
is a different issue.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: inheriting from long on Python2 or int on Python3

2020-03-21 Thread duncan smith
On 21/03/2020 21:55, Skip Montanaro wrote:
>> import sys
>>
>> class C_hash(int if sys.version_info.major >= 3 else long):
> ...
> 
> There's nothing incorrect with your solution. Chris offered another. I
> was going to suggest you consider running your code through 2to3 to
> see what it does. I created a C_hash class similar to yours:
> 
> class C_hash(long):
> def __new__(cls, bits, m, N=1):
> obj = super(C_hash, cls).__new__(cls, bits)
> obj._m = m
> obj._N = N
> return obj
> 
> def meth(self):
> pass
> 
> When I ran it through 2to3, it didn't make any transformations, which
> I thought odd. Still, assuming I did something wrong, you might poke
> around with it a bit to see if you can get it to do the work for you.
> 
> Skip
> 

According to the docs (https://docs.python.org/2/library/2to3.html) it
should have renamed long to int. I also tried something closer to
Chris's suggestion,


try:
long
except NameError:
long = int

class C_hash(long):
# etc.


but (tentatively) opted for something that avoided creating new names at
the module level. If it doesn't look too terrible I might stick with it
for now (although I am quite tempted to go back to the above). Thanks
Skip and Chris.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


inheriting from long on Python2 or int on Python3

2020-03-21 Thread duncan smith
Hello,
  I have a class I wrote for Python2 that needs to be made to work
on Python3. On Python2 it inherited from long, so it needs to inherit
from int on Python3. The following works, but I'm not sure it's the
cleanest solution. It's the first time I've wanted / needed to make a
parent class conditional on the Python version, and I'm not sure it's
even a good idea. I used inheritance because I wanted to perform bitwise
operations on instances, but I could probably switch to composition.
Just wondering how bad the following really is (in a code smell kind of
a way). TIA.

Duncan


import sys

class C_hash(int if sys.version_info.major >= 3 else long):
def __new__(cls, bits, m, N=1):
obj = super(C_hash, cls).__new__(cls, bits)
obj._m = m
obj._N = N
return obj
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: queue versus list

2020-03-20 Thread duncan smith
On 20/03/2020 21:57, Barry wrote:
> 
> 
>> On 20 Mar 2020, at 00:42, duncan smith  wrote:
>>
>> Bingo. Performance is indistinguishable from that of the list. Thread
>> safety is unimportant (for my purposes), but the docs at
>> https://docs.python.org/2/library/collections.html#collections.deque
>> state "Deques support thread-safe, memory efficient appends and pops
>> from either side of the deque with approximately the same O(1)
> 
> I did a performance test of list vs deque as a FIFO that had 10k elements in
> them.
> Then I timed inserting elements at the tail and popping from the head.
> 
> Deque is 10x faster than list, test run on macOS Python 3.8
> 
> Barry
> 
> 

Yes, I avoided the cost of popping elements from the head by just
iterating over the list. So the timings came out very similar. But the
deque allows me to remove those elements efficiently. It's also neat
because it lets me switch from what is basically a breadth first search
of a graph to a depth first search by simply changing popleft() to
pop(). Cheers.
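
i.e. roughly this shape (a sketch of the pattern with a toy adjacency-dict
graph, not my actual code):

from collections import deque

def traverse(graph, start, depth_first=False):
    # graph: mapping node -> iterable of neighbour nodes
    seen = {start}
    Q = deque([start])
    while Q:
        # pop() instead of popleft() turns breadth first into depth first
        node = Q.pop() if depth_first else Q.popleft()
        yield node
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                Q.append(neighbour)

graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(list(traverse(graph, 0)))                    # [0, 1, 2, 3]
print(list(traverse(graph, 0, depth_first=True)))  # [0, 2, 3, 1]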

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: queue versus list

2020-03-19 Thread duncan smith
On 19/03/2020 20:40, MRAB wrote:
> On 2020-03-19 20:08, duncan smith wrote:
>> Hello,
>>    I have generator code along the following lines,
>>
>>
>> Q = queue.Queue()
>> Q.put(x)
>> while not Q.empty():
>>     x = Q.get()
>>     if :
>>         yield x
>>     else:
>>         Q.put()
>>
>>
>> If I change it to,
>>
>>
>> Q = []
>> Q.append(x)
>> for x in Q:
>>     if :
>>         yield x
>>     else:
>>         Q.append()
>>
>>
>> then it runs approximately 3 times more quickly. I seem to remember
>> (from somewhere) that the language specification doesn't guarantee that
>> the latter will continue to work, but that it's unlikely that future
>> implementations will break it. Apart from that, is there any good reason
>> why I shouldn't use the list? Is there another approach that might avoid
>> the overhead of using a queue? Results from profiling the two
>> implementations below in case it makes anything clearer. TIA.
>>
> [snip]

Thanks MRAB and Paul,

> A number of points:
> 
> 1. I'm not sure it's a good idea to mutate a list while iterating over it.
> 

AFAICR that's the bit that depends on the implementation of lists. As
long as iteration involves incrementing an index and indexing into the
list it will work.

> 2. The example with the list retains all of the results, whereas the one
> with the queue consumes them.
> 
> 3. Queues are thread-safe because they're used for communication between
> threads, but lists aren't thread-safe; that might be something to do
> with the difference.
> 
> 4. If it doesn't need to be thread-safe, why not try deques instead?

Bingo. Performance is indistinguishable from that of the list. Thread
safety is unimportant (for my purposes), but the docs at
https://docs.python.org/2/library/collections.html#collections.deque
state "Deques support thread-safe, memory efficient appends and pops
from either side of the deque with approximately the same O(1)
performance in either direction". Looking at the output from the
profiler for my implementation with the queue it looks like a deque is
being used but there's various other stuff happening relating to thread
safety. Is the lesson from this that if all you're doing is appending
and popping, then use a deque (rather than a queue) even if you want
thread safety? (It makes quite a difference in performance for my use
case.) Cheers.
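
A rough way to see the difference (a sketch with made-up sizes; absolute
numbers will obviously vary):

import timeit

setup = "from collections import deque; import queue; items = range(10000)"

deque_fifo = """
d = deque()
for x in items:
    d.append(x)
while d:
    d.popleft()
"""

queue_fifo = """
q = queue.Queue()
for x in items:
    q.put(x)
while not q.empty():
    q.get()
"""

print(timeit.timeit(deque_fifo, setup, number=100))
print(timeit.timeit(queue_fifo, setup, number=100))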

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


queue versus list

2020-03-19 Thread duncan smith
Hello,
  I have generator code along the following lines,


Q = queue.Queue()
Q.put(x)
while not Q.empty():
    x = Q.get()
    if :
        yield x
    else:
        Q.put()


If I change it to,


Q = []
Q.append(x)
for x in Q:
    if :
        yield x
    else:
        Q.append()


then it runs approximately 3 times more quickly. I seem to remember
(from somewhere) that the language specification doesn't guarantee that
the latter will continue to work, but that it's unlikely that future
implementations will break it. Apart from that, is there any good reason
why I shouldn't use the list? Is there another approach that might avoid
the overhead of using a queue? Results from profiling the two
implementations below in case it makes anything clearer. TIA.

Duncan


 88752234 function calls in 68.603 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   68.603   68.603 <string>:1(<module>)
     1009   18.343    0.018   68.602    0.068 bit_trees.py:699(paired_tries_search)
  8425934    6.116    0.000   10.373    0.000 bit_trees.py:751(hamdist)
        1    0.000    0.000    0.000    0.000 bit_trees.py:809(cum_shape)
  2350867    5.970    0.000   18.200    0.000 queue.py:115(put)
  2350867    6.114    0.000   16.639    0.000 queue.py:147(get)
        1    0.000    0.000    0.000    0.000 queue.py:199(_init)
  4701735    1.951    0.000    2.686    0.000 queue.py:202(_qsize)
  2350867    1.149    0.000    1.512    0.000 queue.py:206(_put)
  2350867    1.042    0.000    1.446    0.000 queue.py:210(_get)
        1    0.000    0.000    0.000    0.000 queue.py:27(__init__)
  2350868    2.991    0.000    4.397    0.000 queue.py:90(empty)
        3    0.000    0.000    0.000    0.000 threading.py:215(__init__)
  4701734    2.661    0.000    3.742    0.000 threading.py:239(__enter__)
  4701734    2.097    0.000    2.719    0.000 threading.py:242(__exit__)
  4701734    2.483    0.000    3.872    0.000 threading.py:254(_is_owned)
  4701734    8.185    0.000   12.057    0.000 threading.py:334(notify)
        1    0.000    0.000    0.000    0.000 {built-in method _thread.allocate_lock}
  8425934    1.874    0.000    1.874    0.000 {built-in method builtins.bin}
        1    0.000    0.000   68.603   68.603 {built-in method builtins.exec}
  4701736    0.735    0.000    0.735    0.000 {built-in method builtins.len}
  4701734    1.080    0.000    1.080    0.000 {method '__enter__' of '_thread.lock' objects}
  4701734    0.622    0.000    0.622    0.000 {method '__exit__' of '_thread.lock' objects}
  4701734    1.389    0.000    1.389    0.000 {method 'acquire' of '_thread.lock' objects}
  2350867    0.363    0.000    0.363    0.000 {method 'append' of 'collections.deque' objects}
  8425934    2.384    0.000    2.384    0.000 {method 'count' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  4701734    0.649    0.000    0.649    0.000 {method 'items' of 'dict' objects}
  2350867    0.404    0.000    0.404    0.000 {method 'popleft' of 'collections.deque' objects}



32331417 function calls in 26.992 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.142    0.142   26.992   26.992 <string>:1(<module>)
     1009   16.741    0.017   26.851    0.027 bit_trees.py:699(paired_tries_search)
  8425934    5.270    0.000    9.175    0.000 bit_trees.py:755(hamdist)
        1    0.000    0.000    0.000    0.000 bit_trees.py:813(cum_shape)
  8425934    1.576    0.000    1.576    0.000 {built-in method builtins.bin}
        1    0.000    0.000   26.992   26.992 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
  2350867    0.310    0.000    0.310    0.000 {method 'append' of 'list' objects}
  8425934    2.329    0.000    2.329    0.000 {method 'count' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  4701734    0.624    0.000    0.624    0.000 {method 'items' of 'dict' objects}


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: What I learned today

2020-02-14 Thread duncan smith
On 14/02/2020 23:21, Dan Stromberg wrote:
> On Fri, Feb 14, 2020 at 3:10 PM Stefan Ram  wrote:
> 
>>   By trial and error (never read documentation!) I found
>>   that you can count the number of e's in a text by just
>>
>> Counter( text ).get( 'e' )
>>
>>   (after »from collections import Counter« that is).
>>
> Even simpler, though not suitable for all situations:
> "abcbdbe".count("b")
> 

[snip]

And by far the quickest way (that I've found) for counting the number of
set bits in an int.

>>> def popcount(n):
        cnt = 0
        while n:
            n &= n-1
            cnt += 1
        return cnt

>>> import timeit
>>> timeit.timeit("popcount(19847998494279)", "from __main__ import popcount", number=1)

0.034410387044772506
>>> timeit.timeit("bin(19847998494279).count('1')", number=1)

0.004501901799812913
>>>

OK, things turn around for large integers with very few set bits. But
for my use case bin(n).count('1') wins hands down (which surprised me a
little).

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: distinguishable matplotlib colours / symbols / line styles

2019-12-16 Thread duncan smith
On 16/12/2019 21:08, DL Neil wrote:
> On 17/12/19 5:19 am, Chris Angelico wrote:
>> On Tue, Dec 17, 2019 at 3:16 AM duncan smith 
>> wrote:
>>>
>>> Hello,
>>>    Not really specific to Python or matplotlib (but that's what I'm
>>> using). I'm looking for a good combination of colours and symbols for
>>> scatter plots, and combination of colours and line styles for line
>>> plots. Too often I find myself producing these when I don't know whether
>>> they will end up being printed in colour or greyscale. It would be a lot
>>> easier if I could produce decent plots that would work well in either
>>> case. I can find various stuff on colour palettes, but pretty much
>>> nothing on symbols or line styles. Is anybody aware of an approach that
>>> works well? I'm aware of issues with colour blindness and RGB versus
>>> CMYK. Ideally I'd have something that I could just use that would deal
>>> with all these issues. TIA.
>>>
>>
>> I'd recommend looking at the Web Content Accessibility Guidelines
>> published by the W3C; there are a number of tools out there that are
>> designed to help you pick colours for a web page, and the same sorts
>> of rules will apply here, I think.
>>
>> Also, thank you for even *thinking* about this. A lot of people don't. :)
> 
> +1
> 
> We spend a lot of time teaching this topic (non-Python courses). It
> receives a lot of often highly-polarised comments/discussion. Many folk
> have their 'eyes opened' to an issue which has not affected them
> personally. Some even have to be informed that it is a legal obligation
> in their jurisdiction. However, it also receives the highest number of
> 'why do I have to learn this stuff' complaints...
> 
> I learned (way back) that the incidence of "color blindness" is far
> higher than I had imagined. Secondly, that it affects males more than
> females. Thirdly, that calling it "blindness" is a bit of a misnomer,
> because whilst people often can't see red 'correctly' (most common
> symptom), they do see something (it varies). Which is why they are
> permitted to drive vehicles (traffic lights: red, amber/yellow, green -
> and arrows; plus stop/brake lights), but why many smart-phone apps/web
> pages which encode information-relevance (red is 'wrong' and green/blue
> is acceptable) can become almost unusable (without other cues).
> 
> Those key-words: "accessibility guidelines" will yield a swag of useful
> tools - ignore the ones which are basically 'help choose the color of my
> web page/color palette, because they are often aiming (only) for
> 'pretty'. The best tools enumerate the efficacy of fg/bg
> color-combinations, allowing one to experiment; and will enumerate
> grey-scale variation/comparisons.
> 
> 
> Hmmm, note to self (you've inspired me to specifically review/critique
> the printing-from-screen action): what happens when we take a
> color-checked screen display and print same but end-up viewing it as
> monochrome/grey-scale output? Probably not a main-stream demand, but
> worth tossing at the WCAG experts...

A paper I recently published contained a number of colour images (graphs
- of the nodes and edges variety - and line plots). But it turned out
that the funding we used to have for colour printing charges no longer
exists. So the print version is greyscale with links to online colour
images. Anyone accessing the paper online will be able to download a pdf
with colour images. I don't yet know whether there will be funding for
colour printing charges for the paper I'm currently working on, yet I'm
generating plots. So something that would work for all these possible
end points would be ideal. That brings up the issue of distinguishable
symbols / markers (for scatter plots) and distinguishable line styles
(for line plots). i.e. They might help for greyscale without being too
distracting for colour images. So a single image would do for all the
possible end points. I'm not 100% happy about readers of the print
journal (probably) having to go online to make sense of my recent paper,
but that's done now.

BTW, I've gone with the Seaborn qualitative 'colorblind' palette for
now. Still tinkering with different markers / marker sizes / line
weights etc. Cheers.
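
For what it's worth, the kind of thing I've been tinkering with (a sketch;
the hex values are just one colour-blind-friendly palette and the marker /
line style choices are arbitrary):

import matplotlib.pyplot as plt

colours = ['#0173b2', '#de8f05', '#029e73', '#d55e00']   # illustrative only
markers = ['o', 's', '^', 'D']
linestyles = ['-', '--', '-.', ':']

xs = range(10)
fig, ax = plt.subplots()
for i, (col, mark, ls) in enumerate(zip(colours, markers, linestyles)):
    ys = [x * (i + 1) for x in xs]
    # vary colour, marker and line style together so the series stay
    # distinguishable whether printed in colour or greyscale
    ax.plot(xs, ys, color=col, marker=mark, linestyle=ls, label=f'series {i}')
ax.legend()
plt.show()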

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


distinguishable matplotlib colours / symbols / line styles

2019-12-16 Thread duncan smith
Hello,
  Not really specific to Python or matplotlib (but that's what I'm
using). I'm looking for a good combination of colours and symbols for
scatter plots, and combination of colours and line styles for line
plots. Too often I find myself producing these when I don't know whether
they will end up being printed in colour or greyscale. It would be a lot
easier if I could produce decent plots that would work well in either
case. I can find various stuff on colour palettes, but pretty much
nothing on symbols or line styles. Is anybody aware of an approach that
works well? I'm aware of issues with colour blindness and RGB versus
CMYK. Ideally I'd have something that I could just use that would deal
with all these issues. TIA.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: stuck on time

2019-12-08 Thread duncan smith
On 07/12/2019 17:48, RobH wrote:
> I am trying to do this project on a pi zero:
> 
> http://frederickvandenbosch.be/?p=1365
> 
> After overcoming a few errors, I now have the display working and the
> start of the code showing on the display, that being the time.
> 
> It doesn't move on to the next part of the code ie, no rectangle drawn
> def display_time():
> # Collect current time and date
> if(time_format):
>     current_time = time.strftime("%I:%M")<<< stays at this line
> else:
>     current_time = time.strftime("%H:%M")
>    
> current_date = time.strftime("%d/%m/%Y")
> 
> # Clear image buffer by drawing a black filled box
> draw.rectangle((0,0,width,height), outline=0, fill=0
> 
> In the original code you can see that the author used the
> Minecraftia.ttf font, but I had to download the Minecraftia-Regular.ttf
> font. The link the author gave took me to that.
> 
> The Minecraft-Regular.ttf font produced errors when the code was run,
> saying at the end of the error that it cannot open resource.
> 
> I then used the NotoSerif-Regular.ttf font which let the code start
> without error, but as above it is stuck at the time.

One thing you certainly need to do is add a closing parenthesis to the last line.
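
i.e. (with the variables as in the quoted snippet) that line needs to become

draw.rectangle((0, 0, width, height), outline=0, fill=0)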

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Instantiating sub-class from super

2019-10-19 Thread duncan smith
On 18/10/2019 23:57, DL Neil wrote:
> On 17/10/19 7:52 AM, MRAB wrote:
>> On 2019-10-16 19:43, duncan smith wrote:
>>> On 16/10/2019 04:41, DL Neil wrote:
>>>> On 16/10/19 1:55 PM, duncan smith wrote:
>>>>> On 15/10/2019 21:36, DL Neil wrote:
>>>>>> On 16/10/19 12:38 AM, Rhodri James wrote:
>>>>>>> On 14/10/2019 21:55, DL Neil via Python-list wrote:
>>>>>> ...
>>>>>> So, yes, the "label" is unimportant - except to politicians and
>>>>>> statisticians, who want precise answers from vague collections of
>>>>>> data... (sigh!)
>>>>>>
>>>>>
>>>>> [snip]
>>>>>
>>>>> No not (real) statisticians. People often want us to provide precise
>>>>> answers, but they don't often get them.
>>>>>
>>>>> "It ain’t what you don’t know that gets you into trouble. It’s what
>>>>> you
>>>>> know for sure that just ain’t so." (Mark Twain - perhaps)
>>>>
>>>> +1
>>>>
>>>> Although, you've undoubtedly heard people attempt to make claims of
>>>> having 'accurate figures' (even, "that came from Stats") when you told
>>>> them that the limitations and variations rendered the exercise
>>>> laughable...
>>>>
>>>> My favorite (of the moment) is a local computer store who regularly
>>>> offer such gems as: (underneath the sales (web-) page for an upmarket
>>>> *desktop* computer)  "people who bought this also bought" followed
>>>> by at
>>>> least two portable PC carry cases. They must be rather large
>>>> carry-bags!
>>>> (along with such surprises as keyboard, mouse, ...)
>>>>
>>>> This morning I turned-down a study for a political group. One study has
>>>> already been completed and presented. The antagonist wanted an A/B
>>>> comparison (backing his 'side', of course). I mildly suggested that I
>>>> would do it, if he'd also pay me to do an A/B/C study, where 'C' was a
>>>> costing - the economic opportunity cost of 'the people' waiting for
>>>> 'the
>>>> government' to make a decision - (and delaying that decision by waiting
>>>> for "study" after "study" - The UK and their (MPs') inability to decide
>>>> "Brexit" a particularly disastrous illustration of such)
>>>>
>>>>
>>>> Sorry, don't want to incur the anger of the list-gods - such
>>>> calculations would be performed in Python (of course)
>>>
>>> Clearly, all such analyses should be done in Python. Thank God for rpy2,
>>> otherwise I'd have to write R code. It's bad enough having to read it
>>> occasionally to figure out what's going on under the hood (I like
>>> everything about R - except the syntax).
>>>  > I have too many examples of people ignoring random variation, testing
>>> hypotheses on the data that generated the hypotheses, shifting the
>>> goalposts, using cum / post hoc ergo propter hoc reasoning, assuming
>>> monocausality etc. In some areas these things have become almost
>>> standard practice (and they don't really hinder publication as long as
>>> they are even moderately well hidden). Of course, it's often about
>>> policy promotion, and the economic analyses can be just as bad (e.g.
>>> comparing the negative impacts of a policy on the individual with the
>>> positive impacts aggregated over a very large population). And if it's
>>> about policy promotion a press release is inevitable. So we just need to
>>> survey the news media for specific examples. Unfortunately there's no
>>> reliable service for telling us what's crap and what isn't. (Go on,
>>> somebody pay me, all my data processing / re-analysis will be in Python
>>> ;-).)
>>>
>> Even when using Python, you have to be careful:
>>
>> Researchers find bug in Python script may have affected hundreds of
>> studies
>> https://arstechnica.com/information-technology/2019/10/chemists-discover-cross-platform-python-scripts-not-so-cross-platform/
> 
> 
> 
> I think both of our 'Python' comments were made tongue-in-cheek. Sadly
> the tool won't guarantee the result...
> 
> 
> At my first research project, before I'd even completed my first degree,
> I noticed a similar fault in some code*. There was I, the youngest,
> newest, least-est member of staff,

Re: Instantiating sub-class from super

2019-10-16 Thread duncan smith
On 16/10/2019 19:52, MRAB wrote:
> On 2019-10-16 19:43, duncan smith wrote:
>> On 16/10/2019 04:41, DL Neil wrote:
>>> On 16/10/19 1:55 PM, duncan smith wrote:
>>>> On 15/10/2019 21:36, DL Neil wrote:
>>>>> On 16/10/19 12:38 AM, Rhodri James wrote:
>>>>>> On 14/10/2019 21:55, DL Neil via Python-list wrote:
>>>>> ...
>>>>> So, yes, the "label" is unimportant - except to politicians and
>>>>> statisticians, who want precise answers from vague collections of
>>>>> data... (sigh!)
>>>>>
>>>>
>>>> [snip]
>>>>
>>>> No not (real) statisticians. People often want us to provide precise
>>>> answers, but they don't often get them.
>>>>
>>>> "It ain’t what you don’t know that gets you into trouble. It’s what you
>>>> know for sure that just ain’t so." (Mark Twain - perhaps)
>>>
>>> +1
>>>
>>> Although, you've undoubtedly heard people attempt to make claims of
>>> having 'accurate figures' (even, "that came from Stats") when you told
>>> them that the limitations and variations rendered the exercise
>>> laughable...
>>>
>>> My favorite (of the moment) is a local computer store who regularly
>>> offer such gems as: (underneath the sales (web-) page for an upmarket
>>> *desktop* computer)  "people who bought this also bought" followed by at
>>> least two portable PC carry cases. They must be rather large carry-bags!
>>> (along with such surprises as keyboard, mouse, ...)
>>>
>>> This morning I turned-down a study for a political group. One study has
>>> already been completed and presented. The antagonist wanted an A/B
>>> comparison (backing his 'side', of course). I mildly suggested that I
>>> would do it, if he'd also pay me to do an A/B/C study, where 'C' was a
>>> costing - the economic opportunity cost of 'the people' waiting for 'the
>>> government' to make a decision - (and delaying that decision by waiting
>>> for "study" after "study" - The UK and their (MPs') inability to decide
>>> "Brexit" a particularly disastrous illustration of such)
>>>
>>>
>>> Sorry, don't want to incur the anger of the list-gods - such
>>> calculations would be performed in Python (of course)
>>
>> Clearly, all such analyses should be done in Python. Thank God for rpy2,
>> otherwise I'd have to write R code. It's bad enough having to read it
>> occasionally to figure out what's going on under the hood (I like
>> everything about R - except the syntax).
>>  > I have too many examples of people ignoring random variation, testing
>> hypotheses on the data that generated the hypotheses, shifting the
>> goalposts, using cum / post hoc ergo propter hoc reasoning, assuming
>> monocausality etc. In some areas these things have become almost
>> standard practice (and they don't really hinder publication as long as
>> they are even moderately well hidden). Of course, it's often about
>> policy promotion, and the economic analyses can be just as bad (e.g.
>> comparing the negative impacts of a policy on the individual with the
>> positive impacts aggregated over a very large population). And if it's
>> about policy promotion a press release is inevitable. So we just need to
>> survey the news media for specific examples. Unfortunately there's no
>> reliable service for telling us what's crap and what isn't. (Go on,
>> somebody pay me, all my data processing / re-analysis will be in Python
>> ;-).)
>>
> Even when using Python, you have to be careful:
> 
> Researchers find bug in Python script may have affected hundreds of studies
> https://arstechnica.com/information-technology/2019/10/chemists-discover-cross-platform-python-scripts-not-so-cross-platform/
> 

Yes, the problem of standing on the shoulders of others (not necessarily
giants). I assume the problematic code was tested on an OS that happened
to sort the results as required. Years ago I had a similar issue with a
race condition, but we caught it because I developed the code under
Linux and everybody else on the project ran it under Windows (where the
bug surfaced). Note to self: before I do any more parallel programming
check out how to test for race conditions.
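The usual fix for that class of bug is to impose the ordering explicitly
rather than relying on whatever the filesystem happens to return, e.g. (the
glob pattern below is only illustrative)

import glob

# sort so the processing order is the same on every platform
for filename in sorted(glob.glob('data/*.out')):
    print(filename)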

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Instantiating sub-class from super

2019-10-16 Thread duncan smith
On 16/10/2019 04:41, DL Neil wrote:
> On 16/10/19 1:55 PM, duncan smith wrote:
>> On 15/10/2019 21:36, DL Neil wrote:
>>> On 16/10/19 12:38 AM, Rhodri James wrote:
>>>> On 14/10/2019 21:55, DL Neil via Python-list wrote:
>>> ...
>>> So, yes, the "label" is unimportant - except to politicians and
>>> statisticians, who want precise answers from vague collections of
>>> data... (sigh!)
>>>
>>
>> [snip]
>>
>> No not (real) statisticians. People often want us to provide precise
>> answers, but they don't often get them.
>>
>> "It ain’t what you don’t know that gets you into trouble. It’s what you
>> know for sure that just ain’t so." (Mark Twain - perhaps)
> 
> +1
> 
> Although, you've undoubtedly heard people attempt to make claims of
> having 'accurate figures' (even, "that came from Stats") when you told
> them that the limitations and variations rendered the exercise laughable...
> 
> My favorite (of the moment) is a local computer store who regularly
> offer such gems as: (underneath the sales (web-) page for an upmarket
> *desktop* computer)  "people who bought this also bought" followed by at
> least two portable PC carry cases. They must be rather large carry-bags!
> (along with such surprises as keyboard, mouse, ...)
> 
> This morning I turned-down a study for a political group. One study has
> already been completed and presented. The antagonist wanted an A/B
> comparison (backing his 'side', of course). I mildly suggested that I
> would do it, if he'd also pay me to do an A/B/C study, where 'C' was a
> costing - the economic opportunity cost of 'the people' waiting for 'the
> government' to make a decision - (and delaying that decision by waiting
> for "study" after "study" - The UK and their (MPs') inability to decide
> "Brexit" a particularly disastrous illustration of such)
> 
> 
> Sorry, don't want to incur the anger of the list-gods - such
> calculations would be performed in Python (of course)

Clearly, all such analyses should be done in Python. Thank God for rpy2,
otherwise I'd have to write R code. It's bad enough having to read it
occasionally to figure out what's going on under the hood (I like
everything about R - except the syntax).

I have too many examples of people ignoring random variation, testing
hypotheses on the data that generated the hypotheses, shifting the
goalposts, using cum / post hoc ergo propter hoc reasoning, assuming
monocausality etc. In some areas these things have become almost
standard practice (and they don't really hinder publication as long as
they are even moderately well hidden). Of course, it's often about
policy promotion, and the economic analyses can be just as bad (e.g.
comparing the negative impacts of a policy on the individual with the
positive impacts aggregated over a very large population). And if it's
about policy promotion a press release is inevitable. So we just need to
survey the news media for specific examples. Unfortunately there's no
reliable service for telling us what's crap and what isn't. (Go on,
somebody pay me, all my data processing / re-analysis will be in Python
;-).)

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Instantiating sub-class from super

2019-10-15 Thread duncan smith
On 15/10/2019 21:36, DL Neil wrote:
> On 16/10/19 12:38 AM, Rhodri James wrote:
>> On 14/10/2019 21:55, DL Neil via Python-list wrote:
> ...
> 
>>> It seemed better (at the design-level) to have Man( Person ) and
>>> Woman( Person ) sub-classes to contain the pertinent attributes,
>>> source more detailed and specific questions, and collect such data;
>>> by gender.
>>
>> Knowing a lot of Trans people as I do, may I gently suggest that this
>> solution will find many and varied ways of coming back to bite you?
> 
> 
> [with more respect than the previous humor]
> 
> 
> You are absolutely correct. The use-case was (over-)simplified, per
> list-advice.
> 
> That said, if a "trans" person has ovaries or testes (for example) then
> a non-traditional sexual identification is irrelevant - for medical
> purposes. Diseases in those areas (and now I'm a long way from a
> research questionnaire and from Python - but this is roughly how it was
> explained to me) still apply, and sadly, may in-fact be considerably
> complicated by any medical processes that may have contributed to a
> transition.
> 
> So, yes, the "label" is unimportant - except to politicians and
> statisticians, who want precise answers from vague collections of
> data... (sigh!)
> 

[snip]

No not (real) statisticians. People often want us to provide precise
answers, but they don't often get them.

"It ain’t what you don’t know that gets you into trouble. It’s what you
know for sure that just ain’t so." (Mark Twain - perhaps)

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


duck typing / library functions

2019-09-12 Thread duncan smith
Hello,
  The problem I have relates to writing algorithmic code that can
handle types with a given API, but where some of the required
functionality is implemented as library functions rather than methods.
Specifically I have code that uses numpy arrays, but I want to adapt it
to use sparse arrays that have a (sufficiently) similar API.

For the most part the existing code will work without alteration, but
wherever I have a numpy function (rather than array method) I have an
issue. In some cases I can find a corresponding method, but not for e.g.
numpy.isnan. Roughly the structure of the existing code is:



import numpy


class Algorithm(object):
def __init__(self, arrays):
self.arrays = arrays

def run(self):
shape = a shape calculated from info in self.arrays
arr = numpy.ones(shape)
# other stuff using array methods and numpy functions



I could replace "numpy" above with "sparse" and it would work fine (as
long as the arrays passed to __init__ were of the appropriate type). I
could move the import inside the class and make it conditional on the
types in self.arrays. I could potentially replace the function calls
with something based only on method calls (although that turns out to be
awkward for some functions). But I'm thinking about something more
introspective, maybe identifying the relevant module / package from the
array type and importing the relevant module on the fly? Any ideas? TIA.
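One rough sketch of that introspective idea (illustrative only; it assumes
the array's class lives in a package whose top level exposes the needed
numpy-style functions, which holds for numpy itself and for sparse packages
that copy its API):

import importlib

def array_module(arr):
    # e.g. 'numpy' for numpy.ndarray, 'sparse' for sparse.COO
    root = type(arr).__module__.split('.')[0]
    return importlib.import_module(root)

class Algorithm(object):
    def __init__(self, arrays):
        self.arrays = arrays
        self.xp = array_module(arrays[0])   # numpy-like namespace to use

    def run(self):
        shape = (2, 3)                      # placeholder, as above
        arr = self.xp.ones(shape)
        # numpy functions become self.xp.isnan(arr), self.xp.where(...), etc.
        return self.xp.isnan(arr)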

Duncan

p.s. Ideally the solution should work with Python 3 and recent versions
of Python 2.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Import module from a different subdirectory

2019-05-16 Thread duncan smith
On 16/05/2019 22:50, Rich Shepard wrote:
> I'm developing a Python3 application using Python3-3.7.3 and
> virtualenv-16.5.0 on a Slackware-14.2 host.
> 
> The project directory contains subdirectories, including gui/ (with the
> tkinter views) and classes/ with the SQLAlchemy model.py.
> 
> Within the gui/ directory as the cwd testing the code for a view needs to
> import the 'model' module from the class/ subdirectory and I have not been
> able to use the correct syntax to import that module. The view module
> starts
> with:
> 
> from classes import model as m
> import data_entry_validators
> from commonDlgs import LabelInput
> 
> import tkinter as tk
> from tkinter import ttk
> from datetime import datetime
> 
> When I try to display this UI view I get an error:
> 
> $ python3 test_act_de.py Traceback (most recent call last):
>   File "test_act_de.py", line 1, in 
>     from classes import model as m
> ModuleNotFoundError: No module named 'classes'
> 
> Of course, 'classes' is a subdirectory not a module. My web searches have
> not found the proper syntax to import the 'model' module from the separate
> subdirectory classes and I need to learn how to do this.
> 
> Regards,
> 
> Rich

You could make the subdirectories Python packages. Google (or better
DuckDuckGo) is your friend.
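For example (a minimal sketch; the layout and names are taken from your
description): add an __init__.py to each subdirectory and run the script as
a module from the project root, so the root lands on sys.path:

project/
    classes/
        __init__.py
        model.py
    gui/
        __init__.py
        test_act_de.py

# run from project/:
#     python3 -m gui.test_act_de
# and the original import inside test_act_de.py then works:
from classes import model as m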

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: doctest random output?

2019-04-17 Thread duncan smith
On 28/08/2017 20:17, Leam Hall wrote:
> On 08/28/2017 11:40 AM, Dennis Lee Bieber wrote:
> 
> ... a bunch of good stuff ...
> 
> I'm (re-)learning python and just trying make sure my function works.
> Not at the statistical or cryptographic level.   :)
> 
> Thanks!
> 
> Leam

If it's supposed to generate values that follow a particular
distribution, and they don't, then it doesn't work. I had a bunch of
functions for generating values from various distributions. My line
manager told me to just set the seed so that the outputs were
deterministic. Worse than no test at all. It relied on my original
implementation (that generated the values for comparison) working, and
only tested if the implementation (of random.random() or my code) had
changed. So I ignored my boss and simply generated samples of values and
tested using a KS goodness of fit test. The tests should fail 5% of the
time. Run them a few times and check that no individual test is failing
consistently. I don't see how you can do much better than that. Of
course, this doesn't relate directly to doctest.
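A minimal sketch of that kind of check (scipy assumed available; the
exponential generator just stands in for whatever is being tested):

import random
from scipy import stats

sample = [random.expovariate(2.0) for _ in range(10000)]
statistic, p_value = stats.kstest(sample, 'expon', args=(0, 1 / 2.0))
print(p_value > 0.05)   # False roughly 5% of the time even for a correct generator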

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: in a pickle

2019-03-06 Thread duncan smith
On 07/03/2019 00:18, Ethan Furman wrote:
> On 03/06/2019 10:30 AM, duncan smith wrote:
> 
>>  I've been trying to figure out why one of my classes can be
>> pickled but not unpickled. (I realise the problem is probably with the
>> pickling, but I get the error when I attempt to unpickle.)
>>
>> A relatively minimal example is pasted below.
>>
>>
>> --> import pickle
>> --> class test(dict):
>> ...    def __init__(self, keys, shape=None):
>> ...    self.shape = shape
>> ...    for key in keys:
>> ...    self[key] = None
>> ...    def __setitem__(self, key, val):
>> ...    print (self.shape)
>> ...    dict.__setitem__(self, key, val)
>> ...
>> --> x = test([1,2,3])
>> None
>> None
>> None
>> --> s = pickle.dumps(x)
>> --> y = pickle.loads(s)
>> Traceback (most recent call last):
>>    File "", line 1, in 
>>  y = pickle.loads(s)
>>    File "", line 8, in __setitem__
>>  print (self.shape)
>> AttributeError: 'test' object has no attribute 'shape'
> 
> It seems to me the problem only exists because you are trying to print
> `self.shape` -- stop printing it, and everything works normally.
> 
> What problem are you actually trying to solve?
> 
> -- 
> ~Ethan~
> 

Accessing self.shape at that point in the code. In my actual code I am
updating a secondary data structure and doing some sanity checking at
that point. That fails because self.shape doesn't exist. Peter's post
makes it clear that although it fails in my __setitem__() it is not via
calling my __init__(). It must be being called by pickle.loads(). So on
the basis that pickle.loads() will restore the secondary data structure
correctly, and having no need for sanity checking assuming the original
object was valid, then the problem almost goes away. e.g. I can catch
the AttributeError and just call dict.__setitem__(self, key, val). That
I still need to test. I might yet employ Peter's solution (for when I
thought I needed access to self.shape and other attributes, which I now
doubt) which is to define __new__() so that the relevant attributes do
exist when they are required.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: in a pickle

2019-03-06 Thread duncan smith
On 06/03/2019 20:24, Peter Otten wrote:
> duncan smith wrote:
> 
>> On 06/03/2019 16:14, duncan smith wrote:
>>> Hello,
>>>   I've been trying to figure out why one of my classes can be
>>> pickled but not unpickled. (I realise the problem is probably with the
>>> pickling, but I get the error when I attempt to unpickle.)
>>>
>>> A relatively minimal example is pasted below.
>>>
>>>
>>>>>> import pickle
>>>>>> class test(dict):
>>> def __init__(self, keys, shape=None):
>>> self.shape = shape
>>> for key in keys:
>>> self[key] = None
>>>
>>> def __setitem__(self, key, val):
>>> print (self.shape)
>>> dict.__setitem__(self, key, val)
>>>
>>>
>>>>>> x = test([1,2,3])
>>> None
>>> None
>>> None
>>>>>> s = pickle.dumps(x)
>>>>>> y = pickle.loads(s)
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>> y = pickle.loads(s)
>>>   File "", line 8, in __setitem__
>>> print (self.shape)
>>> AttributeError: 'test' object has no attribute 'shape'
>>>
>>>
>>> I have DUCkDuckGo'ed the issue and have tinkered with __getnewargs__ and
>>> __getnewargs_ex__ without being able to figure out exactly what's going
>>> on. If I comment out the print call, then it seems to be fine. I'd
>>> appreciate any pointers to the underlying problem. I have one or two
>>> other things I can do to try to isolate the issue further, but I think
>>> the example is perhaps small enough that someone in the know could spot
>>> the problem at a glance. Cheers.
>>>
>>> Duncan
>>>
>>
>> OK, this seems to  be a "won't fix" bug dating back to 2003
>> (https://bugs.python.org/issue826897). The workaround,
>>
>>
>> class DictPlus(dict):
>>   def __init__(self, *args, **kwargs):
>> self.extra_thing = ExtraThingClass()
>> dict.__init__(self, *args, **kwargs)
>>   def __setitem__(self, k, v):
>> try:
>>   do_something_with(self.extra_thing, k, v)
>> except AttributeError:
>>   self.extra_thing = ExtraThingClass()
>>   do_something_with(self.extra_thing, k, v)
>> dict.__setitem__(self, k, v)
>>   def __setstate__(self, adict):
>> pass
>>
>>
>> doesn't work around the problem for me because I need the actual value
>> of self.shape from the original instance. But I only need it for sanity
>> checking, and under the assumption that the original instance was valid,
>> I don't need to do this when unpickling. I haven't managed to find a
>> workaround that exploits that (yet?). Cheers.
> 
> I've been playing around with __getnewargs__(), and it looks like you can 
> get it to work with a custom __new__(). Just set the shape attribute there 
> rather than in __init__():
> 
> $ cat pickle_dict_subclass.py 
> import pickle
> 
> 
> class A(dict):
> def __new__(cls, keys=(), shape=None):
> obj = dict.__new__(cls)
> obj.shape = shape
> return obj
> 
> def __init__(self, keys=(), shape=None):
> print("INIT")
> for key in keys:
> self[key] = None
> print("EXIT")
> 
> def __setitem__(self, key, val):
> print(self.shape, ": ", key, " <-- ", val, sep="")
> super().__setitem__(key, val)
> 
> def __getnewargs__(self):
> print("GETNEWARGS")
> return ("xyz", self.shape)
> 
> x = A([1, 2, 3], shape="SHAPE")
> x["foo"] = "bar"
> print("pickling:")
> s = pickle.dumps(x)
> print("unpickling:")
> y = pickle.loads(s)
> print(y)
> $ python3 pickle_dict_subclass.py 
> INIT
> SHAPE: 1 <-- None
> SHAPE: 2 <-- None
> SHAPE: 3 <-- None
> EXIT
> SHAPE: foo <-- bar
> pickling:
> GETNEWARGS
> unpickling:
> SHAPE: 1 <-- None
> SHAPE: 2 <-- None
> SHAPE: 3 <-- None
> SHAPE: foo <-- bar
> {1: None, 2: None, 3: None, 'foo': 'bar'}
> 
> It's not clear to me how the dict items survive when they are not included 
> in the __getnewargs__() result, but apparently they do.
> 
> 
> 

Thanks Peter. The docs for pickle say "When a class instance is
unpickled, its __init__() method is usually not invoked. The default
behaviour first creates an uninitialized instance and then restores the
saved attributes."
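For the record, the dict items survive because with protocol 2 the tuple
returned by __reduce_ex__ carries an iterator over the items as its last
element, and unpickling replays them as obj[key] = value - which is exactly
why __setitem__ gets called while __init__ never is. A self-contained
illustration:

import pprint

class D(dict):                        # minimal stand-in for the class above
    def __setitem__(self, key, val):
        print('setting', key)
        dict.__setitem__(self, key, val)

d = D()
d['foo'] = 'bar'
reduced = d.__reduce_ex__(2)          # what pickle uses under the hood
pprint.pprint(reduced)
print(list(reduced[-1]))              # [('foo', 'bar')], restored via obj[key] = value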

Re: in a pickle

2019-03-06 Thread duncan smith
On 06/03/2019 16:14, duncan smith wrote:
> Hello,
>   I've been trying to figure out why one of my classes can be
> pickled but not unpickled. (I realise the problem is probably with the
> pickling, but I get the error when I attempt to unpickle.)
> 
> A relatively minimal example is pasted below.
> 
> 
>>>> import pickle
>>>> class test(dict):
>   def __init__(self, keys, shape=None):
>   self.shape = shape
>   for key in keys:
>   self[key] = None
> 
>   def __setitem__(self, key, val):
>   print (self.shape)
>   dict.__setitem__(self, key, val)
> 
>   
>>>> x = test([1,2,3])
> None
> None
> None
>>>> s = pickle.dumps(x)
>>>> y = pickle.loads(s)
> Traceback (most recent call last):
>   File "", line 1, in 
> y = pickle.loads(s)
>   File "", line 8, in __setitem__
> print (self.shape)
> AttributeError: 'test' object has no attribute 'shape'
> 
> 
> I have DUCkDuckGo'ed the issue and have tinkered with __getnewargs__ and
> __getnewargs_ex__ without being able to figure out exactly what's going
> on. If I comment out the print call, then it seems to be fine. I'd
> appreciate any pointers to the underlying problem. I have one or two
> other things I can do to try to isolate the issue further, but I think
> the example is perhaps small enough that someone in the know could spot
> the problem at a glance. Cheers.
> 
> Duncan
> 

OK, this seems to  be a "won't fix" bug dating back to 2003
(https://bugs.python.org/issue826897). The workaround,


class DictPlus(dict):
  def __init__(self, *args, **kwargs):
self.extra_thing = ExtraThingClass()
dict.__init__(self, *args, **kwargs)
  def __setitem__(self, k, v):
try:
  do_something_with(self.extra_thing, k, v)
except AttributeError:
  self.extra_thing = ExtraThingClass()
  do_something_with(self.extra_thing, k, v)
dict.__setitem__(self, k, v)
  def __setstate__(self, adict):
pass


doesn't work around the problem for me because I need the actual value
of self.shape from the original instance. But I only need it for sanity
checking, and under the assumption that the original instance was valid,
I don't need to do this when unpickling. I haven't managed to find a
workaround that exploits that (yet?). Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: in a pickle

2019-03-06 Thread duncan smith
[snip]

Sorry, this is Python 3.6 on Linux.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


in a pickle

2019-03-06 Thread duncan smith
Hello,
  I've been trying to figure out why one of my classes can be
pickled but not unpickled. (I realise the problem is probably with the
pickling, but I get the error when I attempt to unpickle.)

A relatively minimal example is pasted below.


>>> import pickle
>>> class test(dict):
def __init__(self, keys, shape=None):
self.shape = shape
for key in keys:
self[key] = None

def __setitem__(self, key, val):
print (self.shape)
dict.__setitem__(self, key, val)


>>> x = test([1,2,3])
None
None
None
>>> s = pickle.dumps(x)
>>> y = pickle.loads(s)
Traceback (most recent call last):
  File "", line 1, in 
y = pickle.loads(s)
  File "", line 8, in __setitem__
print (self.shape)
AttributeError: 'test' object has no attribute 'shape'


I have DUCkDuckGo'ed the issue and have tinkered with __getnewargs__ and
__getnewargs_ex__ without being able to figure out exactly what's going
on. If I comment out the print call, then it seems to be fine. I'd
appreciate any pointers to the underlying problem. I have one or two
other things I can do to try to isolate the issue further, but I think
the example is perhaps small enough that someone in the know could spot
the problem at a glance. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-18 Thread duncan smith
On 14/01/2019 20:11, duncan smith wrote:
> Hello,
>   Just checking to see if anyone has attacked this problem before
> for cases where the population size is unfeasibly large. i.e. The number
> of categories is manageable, but the sum of the frequencies, N,
> precludes simple solutions such as creating a list, shuffling it and
> using the first n items to populate the sample (frequency distribution /
> histogram).
> 
> I note that numpy.random.hypergeometric will allow me to generate a
> sample when I only have two categories, and that I could probably
> implement some kind of iterative / partitioning approach calling this
> repeatedly. But before I do I thought I'd ask if anyone has tackled this
> before. Can't find much on the web. Cheers.
> 
> Duncan
> 

After much tinkering I came up with the following:


import numpy as np
from itertools import chain

def hypgeom_variate(freqs, n):
# recursive partitioning approach
sample = [0] * len(freqs)
cumsum = np.cumsum(list(chain([0], freqs)))
if n > cumsum[-1]:
raise ValueError('n cannot be greater than population size')
hypergeometric = np.random.hypergeometric
    # each entry is (first category, one past the last category, cumulative
    # count at the left edge, cumulative count at the right edge, sample size
    # to allocate across that slice); the list grows as slices are split
    argslist = [(0, len(freqs), 0, cumsum[-1], n)]
    for i, k, ci, ck, m in argslist:
if k == i + 1:
sample[i] = m
else:
j = (i + k) // 2
cj = cumsum[j]
x = hypergeometric(cj - ci, ck - cj, m, 1)[0]
y = m-x
if x:
argslist.append((i, j, ci, cj, x))
if y:
argslist.append((j, k, cj, ck, y))
return sample
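Illustrative usage (made-up frequencies):

freqs = [1000, 500, 250, 125, 60, 30, 15]
print(hypgeom_variate(freqs, 100))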


Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-15 Thread duncan smith
On 15/01/2019 17:59, Ian Hobson wrote:
> Hi,
> 
> If I understand your problem you can do it in two passes through the
> population.
> 

The thing is that I start with the population histogram and I want to
generate a sample histogram. The population itself is too large to deal
with each population member individually.

> First, however, lets work through taking a sample of 2 from 7 to
> demonstrate the method.
> 
> Take the first element with a probability of 2/7. (Note 1).
> If you took it, you only want 1 more, so the probability drops to 1/6.
> If you didn't take it you want 2 from 6, so probability goes to 2/6.
> Take the next in the population with probability 1/6 or 2/6 as appropriate.
> Continue in similar manner until the probability
> drops to 0 (when you have your whole sample). When the
> denominator drops to zero the population is expired.
> 

Yes, based on the chain rule.

> Your first pass has to categorise the population and create your
> histogram, (index N) of frequencies Y(N).
> 
> Then divide up the sample size you wish to take into the histogram,
> giving array X(N) of sample sizes. X(N) need not be integer.
> 
> Then pass through the population again, for each entry:
>    Compute the N it falls in the histogram.
>    Take this entry as a sample with a probability of X(N)/Y(N).  Note 2.
>    If the element was taken, decrement X(N).
>    Decrement Y(N).
>    step to next element.
> 

Ah, I'm not quota sampling. I want a simple random sample without
replacement. I just happen to have the data in the form of categories
and frequencies, and that's the form of output that I want.

> Note 1 - In most languages you can generate a pseudo-random number
> with a uniform distribution from 0 to Y(N)-1. Take the element if it is
> in range 0 to floor(X(N))-1.
> 
> Note 2 - X(N) need not be integer, but you can't actually take a sample
> of 6.5 out of 1000. You will either run out of population having taken
> 6, or, if you take 7, the probability will go negative, and no more
> should be taken (treat as zero). The number taken in slot N will be
> floor(X(N)) or ceiling(X(N)). The average over many tries will however
> be X(N).

> Sorry I did not come back to you sooner. It took a while to drag the
> method out of my memory from some 35 years ago when I was working on an
> audit package. 

Well I'd already forgotten that I'd coded up something for srs without
replacement only a few years ago. In fact I coded up a few algorithms
(that I can't take credit for) that allowed weighted sampling with
replacement, and at least one that didn't require a priori knowledge of
the population size (a single pass algorithm). The problem is that they
also (mostly) require scanning the whole population.

That was where I learned two things you may be interested
> in.

> 1) Auditors significantly under sample. Our Auditors actually took
> samples that were between 10% and 25% of what was necessary to support
> their claims.
> 

It's not just auditors :-(. The journals are full of claims based on
positive results from low powered tests or from "null fields". i.e. A
very high proportion are likely to be false positives (like 99% when it
comes to foodstuffs and the risks of various diseases). A while ago a
mate of mine (Prof. of statistics in Oz) told me about a student who
engineered a statistically significant result by copying and pasting her
data to double her sample size. That's no worse than some of the stuff
I've come across in the (usually medical) journals.

> 2) Very very few standard pseudo-random number generators are actually
> any good.
> 
> Regards
> 
> Ian

[snip]

BTW, the approach I'm currently using is also based on the chain rule.
Generate the number of sample units for the first category by sampling
from a (bivariate) hypergeometric. The number of sample units for the
second category (conditional on the number sampled for the first) is
another hypergeometric. Iterate until the full sample is obtained. It
helps to order the categories from largest to smallest. But I think I'll
get better performance by recursive partitioning (when I have the time
to try it). Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-15 Thread duncan smith
On 15/01/2019 02:41, Spencer Graves wrote:
> 
> 
> On 2019-01-14 18:40, duncan smith wrote:
>> On 14/01/2019 22:59, Gregory Ewing wrote:
>>> duncan smith wrote:
>>>> Hello,
>>>>    Just checking to see if anyone has attacked this problem before
>>>> for cases where the population size is unfeasibly large.
>>> The fastest way I know of is to create a list of cumulative
>>> frequencies, then generate uniformly distributed numbers and
>>> use a binary search to find where they fall in the list.
>>> That's O(log n) per sample in the size of the list once it's
>>> been set up.
>>>
>> That's the sort of thing I've been thinking about. But once I'd found
>> the relevant category I'd need to reduce its frequency by 1 and
>> correspondingly update the cumulative frequencies. Alternatively, I
>> could add an extra step where I selected a unit from the relevant
>> category with probability equal to the proportion of non-sampled units
>> from the category. I could maybe set up an alias table and do something
>> similar.
>>
>> The other thing I was thinking about was iterating through the
>> categories (ideally from largest frequency to smallest frequency),
>> generating the numbers to be sampled from the current category and the
>> remaining categories (using numpy.random.hypergeometric). With a few
>> large frequencies and lots of small frequencies that could be quite
>> quick (on average). Alternatively I could partition the categories into
>> two sets, generate the number to be sampled from each partition, then
>> partition the partitions etc. binary search style.
>>
>> I suppose I'll try the both the alias table + rejection step and the
>> recursive partitioning approach and see how they turn out. Cheers.
> 
> 
>   R has functions "sample" and "sample.int";  see
> "https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample;.
> You can call R from Python,
> "https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python;.
> 
> 
> 
>   These are in the "base" package.  I believe they have been an
> important part of the base R language almost since its inception and
> have been used extensively.  You'd have to work really hard to do
> better, in my judgment.
> 
> 
>       Spencer Graves
> 
> 
> DISCLAIMER:  I'm primarily an R guy and only use Python when I can't
> find a sensible way to do what I want in R.
>>
>> Duncan
> 

Despite being a statistician I'm primarily a Python guy and only use R
when I can't find a sensible way to do what I want in Python :-). The
problem with the R solution is that it doesn't seem to get round the
issue of having an unfeasibly large population size, but a reasonable
number of categories. It turns out I've already coded up a few reservoir
based algorithms for sampling without replacement that work with data
streams. So I can get round the space issues, but it still means
processing each data point rather than generating the sample frequencies
directly.

After much searching all I've been able to find is the approach I
suggested above, iterating through the frequencies. My implementation:


import numpy

def hypgeom_variate(freqs, n):
    # chain rule approach: the count for each category, conditional on the
    # counts already drawn, is hypergeometric
    sample = [0] * len(freqs)
    nbad = sum(freqs)
    hypergeometric = numpy.random.hypergeometric
    for i, ngood in enumerate(freqs):
        nbad -= ngood                               # population outside category i
        x = hypergeometric(ngood, nbad, n, 1)[0]
        if x:
            sample[i] = x
            n -= x                                  # sample size still to allocate
        if not n:
            break
    return sample


Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-14 Thread duncan smith
On 14/01/2019 22:59, Gregory Ewing wrote:
> duncan smith wrote:
>> Hello,
>>   Just checking to see if anyone has attacked this problem before
>> for cases where the population size is unfeasibly large.
> 
> The fastest way I know of is to create a list of cumulative
> frequencies, then generate uniformly distributed numbers and
> use a binary search to find where they fall in the list.
> That's O(log n) per sample in the size of the list once it's
> been set up.
> 

That's the sort of thing I've been thinking about. But once I'd found
the relevant category I'd need to reduce its frequency by 1 and
correspondingly update the cumulative frequencies. Alternatively, I
could add an extra step where I selected a unit from the relevant
category with probability equal to the proportion of non-sampled units
from the category. I could maybe set up an alias table and do something
similar.

The other thing I was thinking about was iterating through the
categories (ideally from largest frequency to smallest frequency),
generating the numbers to be sampled from the current category and the
remaining categories (using numpy.random.hypergeometric). With a few
large frequencies and lots of small frequencies that could be quite
quick (on average). Alternatively I could partition the categories into
two sets, generate the number to be sampled from each partition, then
partition the partitions etc. binary search style.

I suppose I'll try the both the alias table + rejection step and the
recursive partitioning approach and see how they turn out. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


sampling from frequency distribution / histogram without replacement

2019-01-14 Thread duncan smith
Hello,
  Just checking to see if anyone has attacked this problem before
for cases where the population size is unfeasibly large. i.e. The number
of categories is manageable, but the sum of the frequencies, N,
precludes simple solutions such as creating a list, shuffling it and
using the first n items to populate the sample (frequency distribution /
histogram).

I note that numpy.random.hypergeometric will allow me to generate a
sample when I only have two categories, and that I could probably
implement some kind of iterative / partitioning approach calling this
repeatedly. But before I do I thought I'd ask if anyone has tackled this
before. Can't find much on the web. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Injecting methods into instance / class

2018-12-02 Thread duncan smith
On 02/12/2018 18:36, Peter Otten wrote:
> duncan smith wrote:
> 
>> Hello,
>>   I have a lot of functions that take an instance of a particular
>> class as the first argument. I want to create corresponding methods in
>> the class. I have tried the following, which (when called from __init__)
>> creates the relevant methods in an instance (Python 3.6).
>>
>>
>> def init_methods(self):
>> for func_name, method_name in [('add', '__add__'),
>>('subtract', '__sub__')]:
>> setattr(self, method_name,
>> types.MethodType(globals()[func_name], self))
>>
>>
>> The problem is that e.g.
>>
>> x.__sub__(y)
>>
>> works as expected, but
>>
>> x - y
>>
>> does not (because special methods are looked up in the class rather than
>> the instance).
>>
>> I have tried to find examples of injecting methods into classes without
>> success. I have tried a few things that intuition suggested might work,
>> but didn't - like removing the first line above, dedenting and replacing
>> self with the class. This is only to save typing and make the code
>> cleaner, but I would like to get it right. Any pointers appreciated. TIA.
> 
> You can do this in the __ init__() method:
> 
> $ cat tmp.py
> def add(self, other):
> return 42 + other
> 
> def init_methods(self):
> cls = type(self)
> for methodname, funcname in [("__add__", "add")]:
> func = globals()[funcname]
> setattr(cls, methodname, func)
> 
> class Demo:
> def __init__(self):
> init_methods(self)
> 
> x = Demo()
> print(x + 100)
> $ python3 tmp.py
> 142
> 
> but the __add__() method (to take the example above) is shared by all 
> instances. You are either doing too much work by setting it again and again 
> in every Demo.__init__() call or you are changing the behaviour of 
> previously initialised Demo instances (if the last init_methods() sets the 
> method to a different function) and confuse yourself.
> 
> Have you considered a mix-in class instead? Like
> 
> $ cat tmp.py
> def add(self, other):
> return 42 + other
> 
> class CommonMethods:
> __add__ = add
> 
> class Demo(CommonMethods):
> pass
> 
> x = Demo()
> print(x + 100)
> $ python3 tmp.py
> 142
> 
> This is much cleaner, and if you insist you can set up CommonMethods 
> programmatically with
> 
> CommonMethods = type(
> "CommonMethods", (),
> {m: globals()[f] for m, f in [("__add__", "add")]}
> # without globals() and probably better: 
> # {m: f for m, f in [("__add__", add)]}
> )
> 
> though personally I don't see the point.
> 

Ah, I could just bind them within the class,

mean = mean
max = max etc.

but I'd somehow convinced myself that didn't work. As the names are the
same I'd probably still prefer to go with something along the lines of

for name in ['mean', 'max', ...]:
# create a method

But at least I now have something that works, and I could try something
like your suggestion above to see if I prefer it.
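Something along those lines does work provided the loop targets the class
rather than the instances, because special methods are looked up on the
type. A minimal sketch (add, subtract and Thing are illustrative stand-ins
for the real functions and class):

def add(x, y):
    return Thing(x.value + y.value)

def subtract(x, y):
    return Thing(x.value - y.value)

class Thing:
    def __init__(self, value):
        self.value = value

for func, name in [(add, '__add__'), (subtract, '__sub__')]:
    setattr(Thing, name, func)

x, y = Thing(5), Thing(3)
print((x - y).value)    # 2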

The issue was that some of these "functions" are actually callable class
instances (an attempt to emulate numpy's ufuncs).


class Func:
def __init__(self, op):
# op is a numpy ufunc
self.op = op

def __call__(self, *args, **kwargs):
# some logic here to decide
# which method to call on the
# basis of the number of arguments
# and return values

def one_one(self, x):
# the method to call for a single argument
# and single return value


Just rebinding these doesn't work, and these are the callables that need
corresponding special methods. I guess I probably tried binding one of
these first and convinced myself that it didn't work generally.

Many thanks (to everyone who has responded). Not exactly solved, but
I've made some progress. I might just write the methods normally, rather
than looking for a clever shortcut. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Injecting methods into instance / class

2018-12-02 Thread duncan smith
On 02/12/2018 18:26, Stefan Ram wrote:
> duncan smith  writes:
>> I have tried to find examples of injecting methods into classes without
> 
>   Wouldn't the normal approach be to just define a
>   class with your functions as instance methods?
> 
>   main.py
> 
> class C():
> def __init__( self, value=0 ):
> self.value = value
> def __sub__( self, other ):
> return C( self.value - other.value )
> def __str__( self ):
> return 'C '+ str( self.value )
> def yourfunctionhere( self ):
> pass
> 
> c = C(); c.value = 22
> d = C(); d.value = 27
> print( d - c )
> 
>   transcript
> 
> C 5
> 

What I'm trying to avoid is,


def __sub__(self, other):
return subtract(self, other)

def __add__(self, other):
return add(self, other)

def __mul__(self, other):
return multiply(self, other)

def some_func(self, *args, **kwargs):
return some_func(self, *args, **kwargs)


for many existing functions. Injecting them as instance methods was
probably not ideal, but I could get it working (apart from the special
methods). Ideally I'd like to take all the functions defined in a
separate module, create ordinary class methods from most of them (with
the same name), then create special methods from the remaining
functions. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Injecting methods into instance / class

2018-12-02 Thread duncan smith
Hello,
  I have a lot of functions that take an instance of a particular
class as the first argument. I want to create corresponding methods in
the class. I have tried the following, which (when called from __init__)
creates the relevant methods in an instance (Python 3.6).


def init_methods(self):
for func_name, method_name in [('add', '__add__'),
   ('subtract', '__sub__')]:
setattr(self, method_name,
types.MethodType(globals()[func_name], self))


The problem is that e.g.

x.__sub__(y)

works as expected, but

x - y

does not (because special methods are looked up in the class rather than
the instance).

I have tried to find examples of injecting methods into classes without
success. I have tried a few things that intuition suggested might work,
but didn't - like removing the first line above, dedenting and replacing
self with the class. This is only to save typing and make the code
cleaner, but I would like to get it right. Any pointers appreciated. TIA.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Why do integers compare equal to booleans?

2018-11-16 Thread duncan smith
On 16/11/18 14:51, Steve Keller wrote:
> Why do the integers 0 and 1 compare equal to the boolean values False
> and True and all other integers to neither of them?
> 
> $ python3
> Python 3.5.2 (default, Nov 12 2018, 13:43:14)
> [GCC 5.4.0 20160609] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> 0 == False
> True
> >>> 1 == True
> True
> >>> 2 == False
> False
> >>> 2 == True
> False
> >>> -1 == False
> False
> >>> -1 == True
> False
> >>>
> 
> Since these are objects of different types I would expect they cannot
> be equal.  I know that 0 means false and != 0 means true in C, C++,
> etc. but in Python that surprises me.
> 
> Steve
> 

>>> isinstance(False, int)
True
>>> isinstance(True, int)
True
>>> False.real
0
>>> True.real
1
>>>

At least in recent Pythons.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


numpy use cases for where and out

2018-09-11 Thread duncan smith
Hello,
  I'm writing some code for sparse arrays that is intended to pretty
much follow the numpy API. Because my arrays can have different default
values there is an issue with using the 'out' keyword argument for
functions. e.g. If I elementwise multiply 2 arrays with defaults a and
b, then the default of the product becomes a*b. Changing the default of
'out' to a*b (if that is not the existing default) requires pre-existing
values in 'out' equal to the previous default to be filled in. It seems
to be more hassle than it's worth, and basically eliminates sparsity.

So I am wondering if there are any compelling use cases for 'out'. At
the same time, I'm wondering about 'where'. It's less of an issue. But
given my dictionary of keys implementation, and returning new arrays
rather than views, I'm wondering if I could reasonably eliminate both. I
don't think I've ever used either argument when using numpy. So I'm
asking here in case there are use cases I've overlooked. TIA.
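For reference, this is the numpy behaviour a clone would have to mimic; the
untouched slots are what interact badly with a per-array default value:

import numpy as np

a = np.arange(4, dtype=float)
out = np.full(4, -1.0)
np.multiply(a, 2.0, out=out, where=a > 1)
print(out)   # [-1. -1.  4.  6.], slots where the mask is False keep their old values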

Duncan

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Help Needed : script weird result.

2018-09-01 Thread duncan smith
On 01/09/18 18:11, moha...@gmail.com wrote:
> All,
> 
> I m trying to run this small script to find the lowest of the given array of 
> numbers. The script works fine for various combination of inputs but fails in 
> a weird way for a particular set of inputs, can anyone point the mistake in 
> the script and the behavior.
> 
> Script
> 
> x = input ("Enter the numbers separated by space and press ENTER :")
> x = x.split(" ")
> 
> def checkmin(arr):
> lowest = arr[0]
> for count in range(0,len(arr),1):
> if arr[count] < lowest :
> lowest = arr[count]
> else :
> pass
> print (lowest)
> return lowest
> 
> minimum = checkmin(x)
> print ("Lowest : {0}".format (minimum))
> 
> 
> Weird output is as below.
> 
> == RESTART: C:\Users\mohan\Desktop\temp.py ==
> Enter the numbers separated by space and press ENTER :5 90 63 82 59 24
> 5
> 5
> 5
> 5
> 5
> 24
> Lowest : 24
> 
> Regards
> Mohan C
> 


Compare,

>>> min(5, 90, 63, 82, 59, 24)
5
>>> min('5', '90', '63', '82', '59', '24')
'24'
>>>
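The values produced by x.split(" ") are strings, so the comparison is
lexicographic. Converting before calling checkmin gives the numeric answer:

x = [int(item) for item in x]
minimum = checkmin(x)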

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


plotly / dash export data

2018-06-09 Thread duncan smith
Hello,
  I have been trying to work out how to export data from a dash
application. I know little about html or javascript. I can upload data
into a table as per the example at
https://github.com/plotly/dash-core-components/pull/73. I can edit the
data in the table. But I can't find a way of exporting the edited data.
(Saving the data to file after each edit and providing a link to said
file might do it, but doesn't seem particularly clean). The data are
model parameters, and the user can edit these and run simulations. It
seems unreasonable to allow the user to experiment with different
parameters, but not allow them to export the parameters for future use.
Hopefully someone with better knowledge of dash (or html) can point me
in the right direction. TIA.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Instance variables question

2018-04-16 Thread duncan smith
On 16/04/18 17:03, Irv Kalb wrote:
> I have been writing OOP code for many years in other languages and for the 
> past few years in Python.  I am writing new curriculum for a course on OOP in 
> Python.  In order to see how others are explaining OOP concepts, I have been 
> reading as many books and watching as many videos as I can.   I've been 
> watching some videos created by Dr. Chuck Severance in a series called 
> "Python For Everyone".  I think "Dr. Chuck" is an excellent teacher and I 
> think his videos are outstanding.  
> 
> Today I watched this video:   https://www.youtube.com/watch?v=b2vc5uzUfoE 
>   which is about 10 minutes 
> long.  In that video he gives a very basic overview of OOP and classes.  He 
> gives a demonstration using the following example:
> 
> class PartyAnimal():
> x = 0
> 
> def party(self):
> self.x = self.x + 1
> print('So far', self.x)
> 
> an = PartyAnimal()
> an.party()
> an.party()
> an.party()
> 
> # I added this line just to see what it would do
> print('Class variable', PartyAnimal.x)
> 
> 
> And the output is:
> 
> So far 1
> So far 2
> So far 3
> Class variable 0
> 
> But there is something there that seems odd.  My understanding is that the "x 
> = 0" would be defining a class variable, that can be shared by all 
> PartyAnimal objects.  But he explains that because x is defined between the 
> class statement and the "party" method, that this defines an instance 
> variable x.   That way, it can be used in the first line of the "party" 
> method as self.x to increment itself.   
> 
> At the end of the video, he creates two objects from the same class, and each 
> one gets its own self.x where each correctly starts at zero.  Again, I 
> expected x to be a class variable (accessible through PartyAnimal.x).  
> 
> When I want to create an instance variable and to be used later in other 
> methods, I do this:
> 
> class PartyAnimal():
> def __init__(self):
>   self.x = 0  
> 
> def party(self):
> self.x = self.x + 1
> print('So far', self.x)
> 
> an = PartyAnimal()
> an.party()
> an.party()
> an.party()
> 
> That is, I would assign the instance variable in the __init__ method.  Both 
> approaches give the same results.
> 
> I'm certainly not trying to argue with Dr. Chuck.  I am trying to understand 
> his approach, but it's not clear to me why his code works.  Specifically, can 
> anyone explain how his "x = 0" turns x into an instance variable - while also 
> allowing the syntax for a class variable PartyAnimal.x to be used?
> 
> Thanks,
> 
> Irv
> 

My understanding of this:

x is a class variable.

Initially an instance has no instance attribute x, so on the first call
to an.party() the lookup of self.x falls through to the class and finds
the class variable. That value is incremented and the result is assigned
to self.x, and it is this assignment that creates the instance attribute.
On subsequent calls the instance attribute exists, so the lookup never
reaches the class (and PartyAnimal.x stays at 0).
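A quick way to see this, using the PartyAnimal class quoted above:

an = PartyAnimal()
print('x' in vars(an))   # False, no instance attribute yet
an.party()               # reads the class's x (0), then binds self.x = 1
print('x' in vars(an))   # True, created by the assignment in party()
print(PartyAnimal.x)     # still 0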

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Treatment of NANs in the statistics module

2018-03-17 Thread duncan smith
On 16/03/18 23:16, Steven D'Aprano wrote:
> The bug tracker currently has a discussion of a bug in the median(), 
> median_low() and median_high() functions that they wrongly compute the 
> medians in the face of NANs in the data:
> 
> https://bugs.python.org/issue33084
> 
> I would like to ask people how they would prefer to handle this issue:
> 
> (1) Put the responsibility on the caller to strip NANs from their data. 
> If there is a NAN in your data, the result of calling median() is 
> implementation-defined. This is the current behaviour, and is likely to 
> be the fastest.
> 
> (2) Return a NAN.
> 
> (3) Raise an exception.
> 
> (4) median() should strip out NANs.
> 
> (5) All of the above, selected by the caller. (In which case, which would 
> you prefer as the default?)
> 
> 
> Thank you.
> 
> 
> 
> 

(2). A user can check for a returned NaN if necessary (so no real need
for (3)). Silently stripping out NaNs strikes me as a terrible idea. The
user should decide how NaNs should be dealt with. Optional arguments to
govern the handling of NaNs - OK as long as the default behaviour is to
return a NaN. There is no sensible default for handling NaNs (or missing
values). Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: plot / graph connecting re ordered lists

2018-01-23 Thread duncan smith
On 23/01/18 23:42, Vincent Davis wrote:
> On Tue, Jan 23, 2018 at 4:15 PM Dennis Lee Bieber 
> wrote:
> 
>> On Tue, 23 Jan 2018 13:51:55 -0700, Vincent Davis
>>  declaimed the following:
>>
>>> Looking for suggestions. I have an ordered list of names these names will
>>> be reordered. I am looking to make a plot, graph, with the two origins of
>>
>> IE: you have two lists with the same items in different orders...
>>
>>> the names in separate columns and a line connecting them to visually
>>> represent how much they have moved in the reordering.
>>> Surely there is some great example code for this on the net an am not
>>> finding a clean example.
>>>
>>
>> Determine positions:
>>
>> pos = []
>> for p, name in enumerate(first_list):
>> np = second_list.index(name)
>> pos.append( (name, p, np) )
>>
>> for (name, p, np) in pos:
>> draw_line((1,p) , (2, np))
>> label( (1, p), name)
>>
>> Exact details of graphics package and scaling left as an exercise
> 
> 
> Actualy, it’s recomendations for a graphing package And an example using it
> for such a graph that I am most interested in. I know how to relate the
> names on the 2 lists.
> 
> 
>> --
>> Wulfraed Dennis Lee Bieber AF6VN
>> wlfr...@ix.netcom.comHTTP://wlfraed.home.netcom.com/
>>
>> --
>> https://mail.python.org/mailman/listinfo/python-list
>>

Maybe http://graphviz.org/ and http://matthiaseisen.com/articles/graphviz/
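A minimal sketch with the graphviz package (pip install graphviz, plus the
Graphviz binaries; the names and orderings below are made up):

import graphviz

before = ['ann', 'bob', 'cat', 'dan']
after = ['cat', 'ann', 'dan', 'bob']

g = graphviz.Digraph(graph_attr={'rankdir': 'LR'})
for pos, name in enumerate(before):
    new_pos = after.index(name)
    g.node('L%d' % pos, label=name)
    g.node('R%d' % new_pos, label=name)
    g.edge('L%d' % pos, 'R%d' % new_pos)
g.render('reordering', format='png')   # writes reordering.png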

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Speeding up the implementation of Stochastic Gradient Ascent in Python

2018-01-18 Thread duncan smith
On 17/01/18 14:29, leutrim.kal...@gmail.com wrote:
> 
> Hello everyone, 
> 
> I am implementing a time-dependent Recommender System which applies BPR 
> (Bayesian Personalized Ranking), where Stochastic Gradient Ascent is used to 
> learn the parameters of the model. Such that, one iteration involves sampling 
> randomly the quadruple (i.e. userID, positive_item, negative_item, 
> epoch_index) for n times, where n is the total number of positive feedbacks 
> (i.e. the number of ratings given to all items). But, as my implementation 
> takes too much time for learning the parameters (since it requires 100 
> iterations to learn the parameters), I was wondering if there is a way to 
> improve my code and speed up the learning process of the parameters.
> 
> Please find the code of my implementation (the update function named as 
> updateFactors is the one that learns the parameters, and such that I guess is 
> the one that should be improved in order to speed up the process of learning 
> the parameter values) in the following link: 
> https://codereview.stackexchange.com/questions/183707/speeding-up-the-implementation-of-stochastic-gradient-ascent-in-python
> 





dim0 = []
dim1 = []
...
dim19 = []


dim0.append((asin, self.theta_item_per_bin[bin][itemID][0]))
dim1.append((asin,self.theta_item_per_bin[bin][itemID][1]))
...
dim19.append((asin,self.theta_item_per_bin[bin][itemID][19]))


for d in range(self.K2):
if d == 0:
max_value = max(dim0, key=operator.itemgetter(1))
asin_ = max_value[0]
value_ = max_value[1]
print 'dim:',d,', itemID: ', asin_, ', value = ', value_

if d == 1:
max_value = max(dim1, key=operator.itemgetter(1))
asin_ = max_value[0]
value_ = max_value[1]
print 'dim:',d,', itemID: ', asin_, ', value = ', value_



if d == 19:
max_value = max(dim19, key=operator.itemgetter(1))
asin_ = max_value[0]
value_ = max_value[1]
print 'dim:',d,', itemID: ', asin_, ', value = ', value_


How about something like,


dims = [[] for _ in range(20)]

for i in range(20):
dims[i].append((asin, self.theta_item_per_bin[bin][itemID][i]))

for j in range(self.K2):
max_value = max(dims[j], key=operator.itemgetter(1))
asin_ = max_value[0]
value_ = max_value[1]
print 'dim:', j,', itemID: ', asin_, ', value = ', value_


I haven't looked at it in detail, and I'm not sure why you have exactly
20 lists or what happens if self.K2 is not equal to 20. But you can make
the code simpler and shorter, and marginally more efficient by not
testing all those if statements.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Problem with assignment. Python error or mine?

2017-12-21 Thread duncan smith
On 21/12/17 19:06, John Ladasky wrote:
> On Thursday, December 21, 2017 at 7:37:39 AM UTC-8, MRAB wrote:
> 
>> Python never makes a copy unless you ask it to.
>>
>> What x1=X does is make the name x1 refer to the same object that X 
>> refers to. No copying.
> 
> Well, except with very simple, mutable data types like scalars... compare 
> this:
> 
 x=5
 y=x
 x,y
> (5, 5)
 x+=1
 x,y
> (6, 5)
> 
> To this:
> 
 a=[1,2,3]
 b=a
 a,b
> ([1, 2, 3], [1, 2, 3])
 a[1]=9
 a,b
> ([1, 9, 3], [1, 9, 3])
> 

Except ints aren't mutable and there's still no copying.

For

x += 1

(where x is e.g. an int) read

x = x + 1
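A quick check that += rebinds the name rather than mutating anything:

x = 5
y = x
print(x is y)    # True, both names refer to the same int object
x += 1           # rebinds x to a different object; nothing is copied or mutated
print(x, y)      # 6 5
print(x is y)    # False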

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: General Purpose Pipeline library?

2017-11-20 Thread duncan smith
On 20/11/17 15:48, Jason wrote:
> a pipeline can be described as a sequence of functions that are applied to an 
> input with each subsequent function getting the output of the preceding 
> function:
> 
> out = f6(f5(f4(f3(f2(f1(in))
> 
> However this isn't very readable and does not support conditionals.
> 
> Tensorflow has tensor-focused pipepines:
> fc1 = layers.fully_connected(x, 256, activation_fn=tf.nn.relu, 
> scope='fc1')
> fc2 = layers.fully_connected(fc1, 256, activation_fn=tf.nn.relu, 
> scope='fc2')
> out = layers.fully_connected(fc2, 10, activation_fn=None, scope='out')
> 
> I have some code which allows me to mimic this, but with an implied parameter.
> 
> def executePipeline(steps, collection_funcs = [map, filter, reduce]):
>   results = None
>   for step in steps:
>   func = step[0]
>   params = step[1]
>   if func in collection_funcs:
>   print func, params[0]
>   results = func(functools.partial(params[0], 
> *params[1:]), results)
>   else:
>   print func
>   if results is None:
>   results = func(*params)
>   else:
>   results = func(*(params+(results,)))
>   return results
> 
> executePipeline( [
>   (read_rows, (in_file,)),
>   (map, (lower_row, field)),
>   (stash_rows, ('stashed_file', )),
>   (map, (lemmatize_row, field)),
>   (vectorize_rows, (field, min_count,)),
>   (evaluate_rows, (weights, None)),
>   (recombine_rows, ('stashed_file', )),
>   (write_rows, (out_file,))
>   ]
> )
> 
> Which gets me close, but I can't control where rows gets passed in. In the 
> above code, it is always the last parameter.
> 
> I feel like I'm reinventing a wheel here.  I was wondering if there's already 
> something that exists?
> 

Maybe Kamaelia?

http://www.kamaelia.org/Home.html

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


table class - inheritance, delegation?

2017-03-31 Thread duncan smith
Hello,
  I need to code up a table class, which will be based on numpy
arrays. Essentially it needs to behave like a numpy array, but with a
variable associated with each array dimension. It must (at least
partially) satisfy the API of an existing class. The main reason for the
new class is to avoid the creation of new axes and the transpositions
that were required for elementwise operations on instances of the
original class. So when summing across axes I will retain all the
dimensions. My inclination is to go for composition. e.g. (all following
code untested)


class Table(object):
    def __init__(self, values, variables):
        self.values = values # numpy array
        self.all_variables = variables
        self.var_ind_map = dict(zip(variables, range(len(variables))))

    @property
    def variables(self):
        return [v for dim, v in
                zip(self.values.shape, self.all_variables)
                if not dim == 1]


variables is required by the API. Most binary operators will behave
exactly as for numpy arrays. e.g.


    def __mul__(self, other):
        return self.__class__(self.values * other.values,
                              self.all_variables)


I can write a factory method to generate many of these, and maybe
another factory method for __neg__, __pos__, __abs__ etc. But I then
have to deal with __rsub__, __rdiv__ etc. and all I'm doing is emulating
the behaviour of numpy arrays. I also need to handle multiplication /
division etc. by e.g. floats in exactly the same way as numpy.
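
A rough sketch of the factory idea (the helper name _binary_op is mine, and
the class is cut down to the bare minimum):

import operator

def _binary_op(op):
    # returns a method computing op(self.values, other) elementwise
    def method(self, other):
        other_values = other.values if isinstance(other, self.__class__) else other
        return self.__class__(op(self.values, other_values), self.all_variables)
    return method

class Table(object):
    def __init__(self, values, variables):
        self.values = values              # numpy array
        self.all_variables = variables

    __mul__ = _binary_op(operator.mul)
    __add__ = _binary_op(operator.add)
    __sub__ = _binary_op(operator.sub)
    __rmul__ = _binary_op(lambda x, y: y * x)    # computes other * self.values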

What I can't immediately see is a clean way of doing this with
delegation. But I'm unsure that inheritance is the best way to go (the
class will be very lightweight compared to numpy arrays). Any advice
welcome. I'm currently using Python 2.7 (but will make the move to 3 at
some point). Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Surprised by the lack of constant folding

2017-03-24 Thread duncan smith
On 24/03/17 19:35, Tim Chase wrote:
> Playing around, I came across the following
> 
> $ python3
> Python 3.4.2 (default, Oct  8 2014, 10:45:20) 
> [GCC 4.9.1] on linux
> Type "help", "copyright", "credits" or "license" for more information.
 from dis import dis
 def f(x):
> ... return x * 1024 * 1024
> ... 
 dis(f)
>   2   0 LOAD_FAST0 (x)
>   3 LOAD_CONST   1 (1024)
>   6 BINARY_MULTIPLY
>   7 LOAD_CONST   1 (1024)
>  10 BINARY_MULTIPLY
>  11 RETURN_VALUE
> 
> Is there any reason that Python doesn't do the constant folding of
> 1024 * 1024?  Especially as it does constant folding if the order is
> reversed:
> 
 def f(x):
> ... return 1024 * 1024 * x
> ... 
 dis(f)
>   2   0 LOAD_CONST   2 (1048576)
>   3 LOAD_FAST0 (x)
>   6 BINARY_MULTIPLY
>   7 RETURN_VALUE
> 
> 
> I'm cool with defining things as a constant and using those instead
> or front-loading my constants in my expressions, but the disparity
> struck me as a little odd.  Any insights on the reasons/motivations
> for one and not the other?
> 
> -tkc
> 
> 
> 

Left to right evaluation? Depending on the type of x, x * 1024 * 1024
might not return the same result as x * (1024 * 1024). That's my guess.
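
A contrived illustration of why the grouping can matter for an arbitrary x
(the class here is mine, purely to make the grouping visible):

class Grouped(object):
    def __init__(self, expr):
        self.expr = expr
    def __mul__(self, other):
        # record how the products are grouped
        return Grouped('(' + self.expr + ' * ' + str(other) + ')')
    def __repr__(self):
        return self.expr

x = Grouped('x')
print(x * 1024 * 1024)      # ((x * 1024) * 1024)
print(x * (1024 * 1024))    # (x * 1048576)

So folding 1024 * 1024 in the first form could change behaviour, whereas in
1024 * 1024 * x the constant subexpression is evaluated first anyway.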

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Create ordering of points

2017-02-15 Thread duncan smith
On 15/02/17 13:27, spiess.benja...@googlemail.com wrote:
> Hello !:)
> I've got a problem which I would really like to solve. 
> I got a cloud of points (in the simplest example its a 2-dimensional cloud of 
> points). 
> First, I want to set one of the points as the initial (or middle) point. 
> Starting from there, the next points with the closest distance to the initial 
> points are to be found. Afterwards, the closest points to these points are to 
> be written out. These points shall be connected like it is depicted in this 
> figure http://fs5.directupload.net/images/170215/b835zixm.jpg
> 
> With this approach, the "fastest" path from every point to the defined 
> initial point are specified by this pattern. 
> 
> Does somebody know a good technique for this problem? or can even give a hint 
> to a existing python procedure?
> I would be really grateful!
> 
> 
> Regards
> Benny
> 

Maybe https://en.wikipedia.org/wiki/Shortest-path_tree?
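
If it is, a rough sketch (using networkx and scipy, neither of which is
mentioned above): build a k-nearest-neighbour graph with Euclidean edge
weights, then take the shortest-path tree rooted at the chosen initial point.

import networkx as nx
from scipy.spatial import cKDTree

def spt_edges(points, root=0, k=5):
    # k-nearest-neighbour graph with Euclidean edge weights
    dists, inds = cKDTree(points).query(points, k=k + 1)   # first hit is the point itself
    G = nx.Graph()
    for i, (drow, irow) in enumerate(zip(dists, inds)):
        for d, j in zip(drow[1:], irow[1:]):
            G.add_edge(i, int(j), weight=float(d))
    preds, _ = nx.dijkstra_predecessor_and_distance(G, root)
    # each (predecessor, node) pair is an edge of the shortest-path tree
    return [(ps[0], n) for n, ps in preds.items() if ps]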

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: geopandas bug ?

2017-01-14 Thread duncan smith
On 14/01/17 14:59, Xristos Xristoou wrote:
> On Saturday, 14 January 2017 at 4:38:39 PM UTC+2, duncan smith wrote:
>> On 14/01/17 11:18, Xristos Xristoou wrote:
>>> i want  to create a simple spatial joing using geopandas but i thing so 
>>> geopandas has bug ?
>>>
>>>
>>>
>>> geopandas code :
>>>
>>> from geopandas import gpd
>>> import geopandas
>>> points = geopandas.GeoDataFrame.from_file('points.shp') # or geojson etc
>>> polys = geopandas.GeoDataFrame.from_file('polygons.shp')
>>> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
>>>
>>> error :
>>>
>>> Traceback (most recent call last):
>>>   File "/home/sarantis/testshapely/sumpointsinsidepolygon/testgeo.py", line 
>>> 7, in 
>>> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
>>>   File "/usr/local/lib/python2.7/dist-packages/geopandas/tools/sjoin.py", 
>>> line 57, in sjoin
>>> r_idx = np.concatenate(idxmatch.values)
>>> ValueError: need at least one array to concatenate
>>>
>>> and if i change the imports with the some code :
>>>
>>> import geopandas
>>> import pandas as pd
>>> import geopandas as gpd
>>> from geopandas import GeoDataFrame, read_file
>>> from geopandas.tools import sjoin
>>> from shapely.geometry import Point, mapping,shape
>>> import pandas as gpd
>>>
>>> i take that error :
>>>
>>> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
>>> AttributeError: 'module' object has no attribute 'sjoin'
>>>
>>>
>>> any idea why ?
>>>
>>
>>
>> import geopandas as gpd
>> import pandas as gpd
>>
>> Does pandas have an attribute 'sjoin'?
>>
>> Duncan
> 
> i dont know i follow detais i am newbie
> 

You import geopandas as gpd, then import pandas as gpd. So maybe you
think gpd refers to geopandas while it actually refers to pandas. I
don't know geopandas or pandas, but you should check your imports.
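
i.e. keep the aliases distinct, something like:

import geopandas as gpd
import pandas as pd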

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: geopandas bug ?

2017-01-14 Thread duncan smith
On 14/01/17 11:18, Xristos Xristoou wrote:
> i want  to create a simple spatial joing using geopandas but i thing so 
> geopandas has bug ?
> 
> 
> 
> geopandas code :
> 
> from geopandas import gpd
> import geopandas
> points = geopandas.GeoDataFrame.from_file('points.shp') # or geojson etc
> polys = geopandas.GeoDataFrame.from_file('polygons.shp')
> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
> 
> error :
> 
> Traceback (most recent call last):
>   File "/home/sarantis/testshapely/sumpointsinsidepolygon/testgeo.py", line 
> 7, in 
> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
>   File "/usr/local/lib/python2.7/dist-packages/geopandas/tools/sjoin.py", 
> line 57, in sjoin
> r_idx = np.concatenate(idxmatch.values)
> ValueError: need at least one array to concatenate
> 
> and if i change the imports with the some code :
> 
> import geopandas
> import pandas as pd
> import geopandas as gpd
> from geopandas import GeoDataFrame, read_file
> from geopandas.tools import sjoin
> from shapely.geometry import Point, mapping,shape
> import pandas as gpd
> 
> i take that error :
> 
> pointInPoly = gpd.sjoin(points, polys, how='left',op='within')
> AttributeError: 'module' object has no attribute 'sjoin'
> 
> 
> any idea why ?
> 


import geopandas as gpd
import pandas as gpd

Does pandas have an attribute 'sjoin'?

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: OSError: [Errno 12] Cannot allocate memory

2016-12-01 Thread duncan smith
On 01/12/16 01:12, Chris Kaynor wrote:
> On Wed, Nov 30, 2016 at 4:54 PM, duncan smith <duncan@invalid.invalid> wrote:
>>
>> Thanks. So something like the following might do the job?
>>
>> def _execute(command):
>> p = subprocess.Popen(command, shell=False,
>>  stdout=subprocess.PIPE,
>>  stderr=subprocess.STDOUT,
>>  close_fds=True)
>> out_data, err_data = p.communicate()
>> if err_data:
>> print err_data
> 
> I did not notice it when I sent my first e-mail (but noted it in my
> second one) that the docstring in to_image is presuming that
> shell=True. That said, as it seems everybody is at a loss to explain
> your issue, perhaps there is some oddity, and if everything appears to
> work with shell=False, it may be worth changing to see if it does fix
> the problem. With other information since provided, it is unlikely,
> however.
> 
> Not specifying the stdin may help, however it will only reduce the
> file handle count by 1 per call (from 2), so there is probably a root
> problem that it will not help.
> 
> I would expect the communicate change to fix the problem, except for
> your follow-up indicating that you had tried that before without
> success.
> 
> Removing the manual stdout.read may fix it, if the problem is due to
> hanging processes, but again, your follow-up indicates thats not the
> problem - you should have zombie processes if that were the case.
> 
> A few new questions that you have not answered (nor have they been
> asked in this thread): How much memory does your system have? Are you
> running a 32-bit or 64-bit Python? Is your Python process being run
> with any additional limitations via system commands (I don't know the
> command, but I know it exists; similarly, if launched from a third
> app, it could be placing limits)?
> 
> Chris
> 

8 Gig, 64 bit, no additional limitations (other than any that might be
imposed by IDLE). In this case the simulation does consume *a lot* of
memory, but that hasn't been the case when I've hit this in the past. I
suppose that could be the issue here. I'm currently seeing if I can
reproduce the problem after adding the p.communicate(), but it seems to
be using more memory than ever (dog slow and up to 5 Gig of swap). In
the meantime I'm going to try to refactor to reduce memory requirements
- and 32 Gig of DDR3 has been ordered. I'll also dig out some code that
generated the same problem before to see if I can reproduce it. Cheers.

Duncan

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: OSError: [Errno 12] Cannot allocate memory

2016-11-30 Thread duncan smith
On 01/12/16 00:46, Chris Kaynor wrote:
> On Wed, Nov 30, 2016 at 4:12 PM, duncan smith <duncan@invalid.invalid> wrote:
>> On 30/11/16 17:57, Chris Angelico wrote:
>>> On Thu, Dec 1, 2016 at 4:34 AM, duncan smith <duncan@invalid.invalid> wrote:
>>>>
>>>> def _execute(command):
>>>> # shell=True security hazard?
>>>> p = subprocess.Popen(command, shell=True, stdin=subprocess.PIPE,
>>>>  stdout=subprocess.PIPE,
>>>>  stderr=subprocess.STDOUT,
>>>>  close_fds=True)
>>>> output = p.stdout.read()
>>>> p.stdin.close()
>>>> p.stdout.close()
>>>> #p.communicate()
>>>> if output:
>>>> print output
>>>
>>> Do you ever wait() these processes? If not, you might be leaving a
>>> whole lot of zombies behind, which will eventually exhaust your
>>> process table.
>>>
>>> ChrisA
>>>
>>
>> No. I've just called this several thousand times (via calls from a
>> higher level function) and had no apparent problem. Top reports no
>> zombie tasks, and memory consumption and the number of sleeping tasks
>> seem to be reasonably stable. I'll try running the code that generated
>> the error to see if I can coerce it into failing again. OK, no error
>> this time. Great, an intermittent bug that's hard to reproduce ;-). At
>> the end of the day I just want to invoke dot to produce an image file
>> (perhaps many times). Is there perhaps a simpler and more reliable way
>> to do this? Or do I just add the p.wait()? (The commented out
>> p.communicate() is there from a previous, failed attempt to fix this -
>> as, I think, are the shell=True and close_fds=True.) Cheers.
> 
> That would appear to rule out the most common issues I would think of.
> 
> That said, are these calls being done in a tight loop (the full
> call-stack implies it might be a physics simulation)? Are you doing
> any threading (either in Python or when making the calls to Python -
> using a bash command to start new processes without waiting counts)?
> Is there any exception handling at a higher level that might be
> continuing past the error and sometimes allowing a zombie process to
> stay?
> 

In this case the calls *are* in a loop (record linkage using an
expectation maximization algorithm).

> If you are making a bunch of calls in a tight loop, that could be your
> issue, especially as you are not waiting on the process (though the
> communicate does so implicitly, and thus should have fixed the issue).
> This could be intermittent if the processes sometimes complete
> quickly, and other times are delayed. In these cases, a ton of the dot
> processes (and shell with shell=true) could be created before any
> finish, thus causing massive usage. Some of the processes may be
> hanging, rather than outright crashing, and thus leaking some
> resources.
> 

I'll try the p.communicate thing again. The last time I tried it I might
have already got myself into a situation where launching more
subprocesses was bound to fail. I'll edit the code, launch IDLE again
and see if it still happens.

> BTW, the docstring in to_image implies that the shell=True is not an
> attempted fix for this - the example 'unflatten -l 3 | dot' is
> explicitly suggesting the usage of shell=True.
> 


OK. As you can see, I don't really understand what's happening under the
hood :-). Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: OSError: [Errno 12] Cannot allocate memory

2016-11-30 Thread duncan smith
On 30/11/16 17:53, Chris Kaynor wrote:
> On Wed, Nov 30, 2016 at 9:34 AM, duncan smith <duncan@invalid.invalid> wrote:
>> Hello,
>>   I have had an issue with some code for a while now, and I have not
>> been able to solve it. I use the subprocess module to invoke dot
>> (Graphviz) to generate a file. But if I do this repeatedly I end up with
>> an error. The following traceback is from a larger application, but it
>> appears to be repeated calls to 'to_image' that is the issue.
> 
> I don't see any glaring problems that would obviously cause this,
> however have you checked to see if the processes are actually exiting
> (it looks like you are on Linux, so the top command)?
> 
>>
>>
>> Traceback (most recent call last):
>>   File "<pyshell#80>", line 1, in 
>> z = link_exp.sim1((djt, tables), variables, 1000, 400, 600,
>> [0,1,2,3,4,5,6], [6,7,8,9,10], ind_gens=[link_exp.males_gen()],
>> ind_gens_names=['Forename'], seed='duncan')
>>   File "link_exp.py", line 469, in sim1
>> RL_F2 = EM_setup(data)
>>   File "link_exp.py", line 712, in full_EM
>> last_g = prop.djt.g
>>   File "Nin.py", line 848, in draw_model
>> dot_g.to_image(filename, prog='dot', format=format)
>>   File "dot.py", line 597, in to_image
>> to_image(str(self), filename, prog, format)
>>   File "dot.py", line 921, in to_image
>> _execute('%s -T%s -o %s' % (prog, format, filename))
>>   File "dot.py", line 887, in _execute
>> close_fds=True)
>>   File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
>> errread, errwrite)
>>   File "/usr/lib/python2.7/subprocess.py", line 1235, in _execute_child
>> self.pid = os.fork()
>> OSError: [Errno 12] Cannot allocate memory
>>
>>
>> The relevant (AFAICT) code is,
>>
>>
>> def to_image(text, filename, prog='dot', format='dot'):
>> # prog can be a series of commands
>> # like 'unflatten -l 3 | dot'
>> handle, temp_path = tempfile.mkstemp()
>> f = open(temp_path, 'w')
>> try:
>> f.write(text)
>> f.close()
>> progs = prog.split('|')
>> progs[0] = progs[0] + ' %s ' % temp_path
>> prog = '|'.join(progs)
>> _execute('%s -T%s -o %s' % (prog, format, filename))
>> finally:
>> f.close()
>> os.remove(temp_path)
>> os.close(handle)
>>
>> def _execute(command):
>> # shell=True security hazard?
>> p = subprocess.Popen(command, shell=True, stdin=subprocess.PIPE,
>>  stdout=subprocess.PIPE,
>>  stderr=subprocess.STDOUT,
>>  close_fds=True)
>> output = p.stdout.read()
>> p.stdin.close()
>> p.stdout.close()
>> #p.communicate()
>> if output:
>> print output
> 
> This code has a potential dead-lock. If you are calling it from
> multiple threads/processes, it could cause issues. This should be
> obvious, as your program will also not exit. The communicate call is
> safe, but commented out (you'd need to remove the three lines above it
> as well). Additionally, you could just set stdin=None rather than
> PIPE, which avoids the dead-lock, and you aren't using stdin anyways.
> This issues comes if the subprocess may ever wait for something to be
> written to stdin, it will block forever, but your call to read will
> also block until it closes stdout (or possibly other cases). Another
> option would be to close stdin before starting the read, however if
> you ever write to stdin, you'll reintroduce the same issue, depending
> on OS buffer sizes.
> 
> My question above also comes from the fact that I am not 100% sure
> when stdout.read() will return. It is possible that a null or EOF
> could cause it to return before the process actually exits. The
> subprocess could also expliciting close its stdout, causing it to
> return while the process is still running. I'd recommend adding a
> p.wait() or just uncommenting the p.communicate() call to avoid these
> issues.
> 
> Another, unrelated note, the security hazard depends on where the
> arguments to execute are coming from. If any of those are controlled
> from untrusted sources (namely, user input), you have a
> shell-injection attack. Imagine, for example, if the user requests the
> filename "a.jpg|wipehd" (note: I don't know the format command on
> Linux, so replace with your desired command). This will cause your
> code to wipe the HD by piping into the command. If all of the inputs
> are 100% sanitized or come from trusted sources, you're fine, however
> that can be extremely difficult to guarantee.
> 

Thanks. So something like the following might do the job?

def _execute(command):
    p = subprocess.Popen(command, shell=False,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT,
                         close_fds=True)
    out_data, err_data = p.communicate()
    if err_data:
        print err_data


Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: OSError: [Errno 12] Cannot allocate memory

2016-11-30 Thread duncan smith
On 30/11/16 17:57, Chris Angelico wrote:
> On Thu, Dec 1, 2016 at 4:34 AM, duncan smith <duncan@invalid.invalid> wrote:
>>
>> def _execute(command):
>> # shell=True security hazard?
>> p = subprocess.Popen(command, shell=True, stdin=subprocess.PIPE,
>>  stdout=subprocess.PIPE,
>>  stderr=subprocess.STDOUT,
>>  close_fds=True)
>> output = p.stdout.read()
>> p.stdin.close()
>> p.stdout.close()
>> #p.communicate()
>> if output:
>> print output
> 
> Do you ever wait() these processes? If not, you might be leaving a
> whole lot of zombies behind, which will eventually exhaust your
> process table.
> 
> ChrisA
> 

No. I've just called this several thousand times (via calls from a
higher level function) and had no apparent problem. Top reports no
zombie tasks, and memory consumption and the number of sleeping tasks
seem to be reasonably stable. I'll try running the code that generated
the error to see if I can coerce it into failing again. OK, no error
this time. Great, an intermittent bug that's hard to reproduce ;-). At
the end of the day I just want to invoke dot to produce an image file
(perhaps many times). Is there perhaps a simpler and more reliable way
to do this? Or do I just add the p.wait()? (The commented out
p.communicate() is there from a previous, failed attempt to fix this -
as, I think, are the shell=True and close_fds=True.) Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: OSError: [Errno 12] Cannot allocate memory

2016-11-30 Thread duncan smith

[snip]

Sorry, should have said Python 2.7.12 on Ubuntu 16.04.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


OSError: [Errno 12] Cannot allocate memory

2016-11-30 Thread duncan smith
Hello,
  I have had an issue with some code for a while now, and I have not
been able to solve it. I use the subprocess module to invoke dot
(Graphviz) to generate a file. But if I do this repeatedly I end up with
an error. The following traceback is from a larger application, but it
appears to be repeated calls to 'to_image' that is the issue.


Traceback (most recent call last):
  File "", line 1, in 
z = link_exp.sim1((djt, tables), variables, 1000, 400, 600,
[0,1,2,3,4,5,6], [6,7,8,9,10], ind_gens=[link_exp.males_gen()],
ind_gens_names=['Forename'], seed='duncan')
  File "link_exp.py", line 469, in sim1
RL_F2 = EM_setup(data)
  File "link_exp.py", line 712, in full_EM
last_g = prop.djt.g
  File "Nin.py", line 848, in draw_model
dot_g.to_image(filename, prog='dot', format=format)
  File "dot.py", line 597, in to_image
to_image(str(self), filename, prog, format)
  File "dot.py", line 921, in to_image
_execute('%s -T%s -o %s' % (prog, format, filename))
  File "dot.py", line 887, in _execute
close_fds=True)
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1235, in _execute_child
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory


The relevant (AFAICT) code is,


def to_image(text, filename, prog='dot', format='dot'):
    # prog can be a series of commands
    # like 'unflatten -l 3 | dot'
    handle, temp_path = tempfile.mkstemp()
    f = open(temp_path, 'w')
    try:
        f.write(text)
        f.close()
        progs = prog.split('|')
        progs[0] = progs[0] + ' %s ' % temp_path
        prog = '|'.join(progs)
        _execute('%s -T%s -o %s' % (prog, format, filename))
    finally:
        f.close()
        os.remove(temp_path)
        os.close(handle)

def _execute(command):
    # shell=True security hazard?
    p = subprocess.Popen(command, shell=True, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT,
                         close_fds=True)
    output = p.stdout.read()
    p.stdin.close()
    p.stdout.close()
    #p.communicate()
    if output:
        print output


Any help solving this would be appreciated. Searching around suggests
this is something to do with file handles, but my various attempts to
solve it have failed. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: need some kind of "coherence index" for a group of strings

2016-11-04 Thread duncan smith
On 03/11/16 16:18, Fillmore wrote:
> 
> Hi there, apologies for the generic question. Here is my problem let's
> say that I have a list of lists of strings.
> 
> list1:#strings are sort of similar to one another
> 
>   my_nice_string_blabla
>   my_nice_string_blqbli
>   my_nice_string_bl0bla
>   my_nice_string_aru
> 
> 
> list2:#strings are mostly different from one another
> 
>   my_nice_string_blabla
>   some_other_string
>   yet_another_unrelated string
>   wow_totally_different_from_others_too
> 
> 
> I would like an algorithm that can look at the strings and determine
> that strings in list1 are sort of similar to one another, while the
> strings in list2 are all different.
> Ideally, it would be nice to have some kind of 'coherence index' that I
> can exploit to separate lists given a certain threshold.
> 
> I was about to concoct something using levensthein distance, but then I
> figured that it would be expensive to compute and I may be reinventing
> the wheel.
> 
> Thanks in advance to python masters that may have suggestions...
> 
> 
> 

https://pypi.python.org/pypi/jellyfish/
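
For example, a rough "coherence index" built on jellyfish's Levenshtein
distance (the normalization and averaging are mine, not part of the library):

import itertools
import jellyfish

def similarity(a, b):
    # 1.0 for identical strings, 0.0 for completely different ones
    return 1.0 - jellyfish.levenshtein_distance(a, b) / float(max(len(a), len(b)))

def coherence(strings):
    # mean pairwise similarity within the list
    pairs = list(itertools.combinations(strings, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)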

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: retain dimensions for numpy slice

2016-10-24 Thread duncan smith
On 24/10/16 19:05, Peter Otten wrote:
> duncan smith wrote:
> 
>> Hello,
>>   I have several arrays that I need to combine elementwise in
>> various fashions. They are basically probability tables and there is a
>> mapping of axes to variables. I have code for transposing and reshaping
>> that aligns the variables / axes so the usual broadcasting rules achieve
>> the desired objective. But for a specific application I want to avoid
>> the transposing and reshaping. So I've specified arrays that contain the
>> full dimensionality (dimensions equal to the total number of variables).
>> e.g.
>>
>> Arrays with shape,
>>
>> [1,3,3] and [2,3,1]
>>
>> to represent probability tables with variables
>>
>> [B,C] and [A,B].
>>
>> One operation that I need that is not elementwise is summing over axes,
>> but I can use numpy.sum with keepdims=True to retain the appropriate
>> shape.
>>
>> The problem I have is with slicing. This drops dimensions. Does anyone
>> know of a solution to this so that I can e.g. take an array with shape
>> [2,3,1] and generate a slice with shape [2,1,1]? I'm hoping to avoid
>> having to manually reshape it. Thanks.
> 
> Can you clarify your requirement or give an example of what you want?
> 
> Given an array 
> 
>>>> a.shape
> (2, 3, 1)
> 
> you can get a slice with shape (2,1,1) with (for example)
> 
>>>> a[:,:1,:].shape
> (2, 1, 1)
> 
> or even
> 
>>>> newshape = (2, 1, 1)
>>>> a[tuple(slice(d) for d in newshape)].shape
> (2, 1, 1)
> 
> but that's probably not what you are asking for...
> 

Thanks. I think that's exactly what I wanted.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


retain dimensions for numpy slice

2016-10-24 Thread duncan smith
Hello,
  I have several arrays that I need to combine elementwise in
various fashions. They are basically probability tables and there is a
mapping of axes to variables. I have code for transposing and reshaping
that aligns the variables / axes so the usual broadcasting rules achieve
the desired objective. But for a specific application I want to avoid
the transposing and reshaping. So I've specified arrays that contain the
full dimensionality (dimensions equal to the total number of variables).
e.g.

Arrays with shape,

[1,3,3] and [2,3,1]

to represent probability tables with variables

[B,C] and [A,B].

One operation that I need that is not elementwise is summing over axes,
but I can use numpy.sum with keepdims=True to retain the appropriate shape.

The problem I have is with slicing. This drops dimensions. Does anyone
know of a solution to this so that I can e.g. take an array with shape
[2,3,1] and generate a slice with shape [2,1,1]? I'm hoping to avoid
having to manually reshape it. Thanks.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to pick out the same titles.

2016-10-16 Thread duncan smith
On 16/10/16 16:16, Seymore4Head wrote:
> How to pick out the same titles.
> 
> I have a  long text file that has movie titles in it and I would like
> to find dupes.
> 
> The thing is that sometimes I have one called "The Killing Fields" and
> it also could be listed as "Killing Fields"  Sometimes the title will
> have the date a year off.
> 
> What I would like to do it output to another file that show those two
> as a match.
> 
> I don't know the best way to tackle this.  I would think you would
> have to pair the titles with the most consecutive letters in a row.
> 
> Anyone want this as a practice exercise?  I don't really use
> programming enough to remember how.
> 

Tokenize, generate (token) set similarity scores and cluster on
similarity score.


>>> import tokenization
>>> bigrams1 = tokenization.n_grams("The Killing Fields".lower(), 2,
pad=True)
>>> bigrams1
['_t', 'th', 'he', 'e ', ' k', 'ki', 'il', 'll', 'li', 'in', 'ng', 'g ',
' f', 'fi', 'ie', 'el', 'ld', 'ds', 's_']
>>> bigrams2 = tokenization.n_grams("Killing Fields".lower(), 2, pad=True)
>>> import pseudo
>>> pseudo.Jaccard(bigrams1, bigrams2)
0.7


You could probably just generate token sets, then iterate through all
title pairs and manually review those with similarity scores above a
suitable threshold. The code I used above is very simple (and pasted below).


def n_grams(s, n, pad=False):
    # n >= 1
    # returns a list of n-grams
    # or an empty list if n > len(s)
    if pad:
        s = '_' * (n-1) + s + '_' * (n-1)
    return [s[i:i+n] for i in range(len(s)-n+1)]

def Jaccard(tokens1, tokens2):
    # returns exact Jaccard
    # similarity measure for
    # two token sets
    tokens1 = set(tokens1)
    tokens2 = set(tokens2)
    return len(tokens1&tokens2) / len(tokens1|tokens2)


Duncan


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: rdflib subclass problem

2016-03-02 Thread duncan smith
On 02/03/16 08:16, dieter wrote:
> duncan smith <duncan@invalid.invalid> writes:
> 
>>   I'm just getting to grips with RDF and rdflib, and I've hit
>> something I can't figure out.
>>
>> I have a graph with information on two people. (I haven't shown the
>> imports below because they're scattered around my interactive session
>> and I might reconstruct them incorrectly. Anyone familiar with rdflib
>> will probably know what they are.)
>>
>>
>>>>> G = Graph()
>>>>> mark = BNode()
>>>>> nat = BNode()
>>>>> G.add((mark, RDF.type, FOAF.Person))
>>>>> G.add((mark, FOAF.firstName, Literal('mark')))
>>>>> G.add((nat, RDF.type, URIRef('Professor')))
>>>>> G.add((nat, FOAF.firstName, Literal('natalie')))
>>>>> G.add((URIRef('Professor'), RDFS.subClassOf, FOAF.Person))
>>>>>
>>
>>
>> So I have specified that mark is a Person, natalie is a Professor, and
>> that Professor is a subclass of Person. (I know that Professor is really
>> a FOAF.title, but I'm just tinkering ATM.)
>>
>>
>>>>> qres = G.query(
>> """SELECT DISTINCT ?aname
>>WHERE {
>>   ?a rdf:type foaf:Person .
>>   ?a foaf:firstName ?aname .
>>}""", initNs = {"rdf": RDF,"foaf": FOAF})
>>>>> for row in qres:
>>  print "%s is a person" % row
>>
>>  
>> mark is a person
>>>>> qres = G.query(
>> """SELECT DISTINCT ?aname
>>WHERE {
>>   ?a rdf:type ?prof .
>>   ?a foaf:firstName ?aname .
>>}""", initNs = {"rdf": RDF,"foaf": FOAF, "prof":
>> URIRef('Professor')})
>>>>> for row in qres:
>>  print "%s is a Prof" % row
>>
>>  
>> natalie is a Prof
>> mark is a Prof
>>>>>
>>
>>
>> But according to the above queries only mark is a Person, and each is a
>> Professor. I would have thought that both would be Persons and only
>> natalie would be a Professor. Can anyone spot where I'm going wrong
>> here? Thanks.
> 
> What you observe would be consistent with "RDFS.subClassOf" working
> in the other direction than the one you expect; i.e. that
> 
>URIRef('Professor'), RDFS.subClassOf, FOAF.Person
> 
> means that "Person" is a subclass of "Professor" (not what
> you expect, that "Professor" is a subclass of "Person").
> 
> Carefully check whether "A rel B" means that "B" is in relation "rel" to "A"
> (I think that is the case) or that "A" is in relation "rel" to "B".
> 

According to https://www.w3.org/TR/rdf-schema/#ch_class -

A triple of the form:

C1 rdfs:subClassOf C2

states that C1 is an instance of rdfs:Class, C2 is an instance of
rdfs:Class and C1 is a subclass of C2.


I might report it as a bug, but I'm not yet 100% convinced I haven't got
something wrong. Cheers.

Duncan


-- 
https://mail.python.org/mailman/listinfo/python-list


rdflib subclass problem

2016-03-01 Thread duncan smith
Hello,
  I'm just getting to grips with RDF and rdflib, and I've hit
something I can't figure out.

I have a graph with information on two people. (I haven't shown the
imports below because they're scattered around my interactive session
and I might reconstruct them incorrectly. Anyone familiar with rdflib
will probably know what they are.)


>>> G = Graph()
>>> mark = BNode()
>>> nat = BNode()
>>> G.add((mark, RDF.type, FOAF.Person))
>>> G.add((mark, FOAF.firstName, Literal('mark')))
>>> G.add((nat, RDF.type, URIRef('Professor')))
>>> G.add((nat, FOAF.firstName, Literal('natalie')))
>>> G.add((URIRef('Professor'), RDFS.subClassOf, FOAF.Person))
>>>


So I have specified that mark is a Person, natalie is a Professor, and
that Professor is a subclass of Person. (I know that Professor is really
a FOAF.title, but I'm just tinkering ATM.)


>>> qres = G.query(
"""SELECT DISTINCT ?aname
   WHERE {
  ?a rdf:type foaf:Person .
  ?a foaf:firstName ?aname .
   }""", initNs = {"rdf": RDF,"foaf": FOAF})
>>> for row in qres:
print "%s is a person" % row


mark is a person
>>> qres = G.query(
"""SELECT DISTINCT ?aname
   WHERE {
  ?a rdf:type ?prof .
  ?a foaf:firstName ?aname .
   }""", initNs = {"rdf": RDF,"foaf": FOAF, "prof":
URIRef('Professor')})
>>> for row in qres:
print "%s is a Prof" % row


natalie is a Prof
mark is a Prof
>>>


But according to the above queries only mark is a Person, and each is a
Professor. I would have thought that both would be Persons and only
natalie would be a Professor. Can anyone spot where I'm going wrong
here? Thanks.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread duncan smith
On 21/12/15 03:01, Steven D'Aprano wrote:
> I have a large number of strings (originally file names) which tend to fall
> into two groups. Some are human-meaningful, but not necessarily dictionary
> words e.g.:
> 
> 
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
> 
> 
> (note that some use underscores, others spaces, and some CamelCase) while
> others are completely meaningless (or mostly so):
> 
> 
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
> 
> 
> Let's call the second group "random" and the first "non-random", without
> getting bogged down into arguments about whether they are really random or
> not. I wish to process the strings and automatically determine whether each
> string is random or not. I need to split the strings into three groups:
> 
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
> 
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
> 
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
> 
> Note that false positives (detecting a meaningful non-random string as
> random) is worse for me than false negatives (miscategorising a random
> string as non-random).
> 
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:
> 
> - I think nltk has a "language detection" function, would that be suitable?
> 
> - If not nltk, are there are suitable language detection libraries?
> 
> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
> 
> - How about Bayesian filters, e.g. SpamBayes?
> 
> 
> 
> 

Finite state machine / transition matrix. Learn from some English text
source. Then process your strings by lower casing, replacing underscores
with spaces, removing trailing numeric characters etc. Base your score
on something like the mean transition probability. I'd expect to see two
pretty well separated groups of scores.
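
A minimal sketch of what I mean (it assumes you have some English text in a
string called corpus; the smoothing floor and any cutoffs would need tuning):

import math
from collections import defaultdict

def learn_transitions(corpus):
    # character transition probabilities estimated from some English text
    text = ''.join(c if c.isalpha() or c == ' ' else ' ' for c in corpus.lower())
    counts = defaultdict(lambda: defaultdict(float))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(row.values()) for b, n in row.items()}
            for a, row in counts.items()}

def score(s, probs, floor=1e-6):
    # mean log transition probability; higher means more English-like
    s = s.lower().replace('_', ' ').rstrip('0123456789')
    logps = [math.log(probs.get(a, {}).get(b, floor)) for a, b in zip(s, s[1:])]
    return sum(logps) / max(len(logps), 1)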

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Catogorising strings into random versus non-random

2015-12-21 Thread duncan smith
On 21/12/15 16:49, Ian Kelly wrote:
> On Mon, Dec 21, 2015 at 9:40 AM, duncan smith <duncan@invalid.invalid> wrote:
>> Finite state machine / transition matrix. Learn from some English text
>> source. Then process your strings by lower casing, replacing underscores
>> with spaces, removing trailing numeric characters etc. Base your score
>> on something like the mean transition probability. I'd expect to see two
>> pretty well separated groups of scores.
> 
> Sounds like a case for a Hidden Markov Model.
> 

Perhaps. That would allow the encoding of marginal probabilities and
distinct transition matrices for each class - if we could learn those
extra parameters.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


counting unique numpy subarrays

2015-12-04 Thread duncan smith
Hello,
  I'm trying to find a computationally efficient way of identifying
unique subarrays, counting them and returning an array containing only
the unique subarrays and a corresponding 1D array of counts. The
following code works, but is a bit slow.

###

from collections import Counter
import numpy

def bag_data(data):
    # data (a numpy array) is bagged along axis 0
    # returns concatenated array and corresponding array of counts
    vec_shape = data.shape[1:]
    counts = Counter(tuple(arr.flatten()) for arr in data)
    data_out = numpy.zeros((len(counts),) + vec_shape)
    cnts = numpy.zeros((len(counts,)))
    for i, (tup, cnt) in enumerate(counts.iteritems()):
        data_out[i] = numpy.array(tup).reshape(vec_shape)
        cnts[i] =  cnt
    return data_out, cnts

###

I've been looking through the numpy docs, but don't seem to be able to
come up with a clean solution that avoids Python loops. TIA for any
useful pointers. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: counting unique numpy subarrays

2015-12-04 Thread duncan smith
On 04/12/15 22:36, Albert-Jan Roskam wrote:
> Hi
> 
> (Sorry for topposting)
> 
> numpy.ravel is faster than numpy.flatten (no copy)
> numpy.empty is faster than numpy.zeros
> numpy.fromiter might be useful to avoid the loop (just a hunch)
> 
> Albert-Jan
> 

Thanks, I'd forgotten the difference between numpy. flatten and
numpy.ravel. I wasn't even aware of numpy.empty.

Duncan

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: counting unique numpy subarrays

2015-12-04 Thread duncan smith
On 04/12/15 23:06, Peter Otten wrote:
> duncan smith wrote:
> 
>> Hello,
>>   I'm trying to find a computationally efficient way of identifying
>> unique subarrays, counting them and returning an array containing only
>> the unique subarrays and a corresponding 1D array of counts. The
>> following code works, but is a bit slow.
>>
>> ###
>>
>> from collections import Counter
>> import numpy
>>
>> def bag_data(data):
>> # data (a numpy array) is bagged along axis 0
>> # returns concatenated array and corresponding array of counts
>> vec_shape = data.shape[1:]
>> counts = Counter(tuple(arr.flatten()) for arr in data)
>> data_out = numpy.zeros((len(counts),) + vec_shape)
>> cnts = numpy.zeros((len(counts,)))
>> for i, (tup, cnt) in enumerate(counts.iteritems()):
>> data_out[i] = numpy.array(tup).reshape(vec_shape)
>> cnts[i] =  cnt
>> return data_out, cnts
>>
>> ###
>>
>> I've been looking through the numpy docs, but don't seem to be able to
>> come up with a clean solution that avoids Python loops. 
> 
> Me neither :(
> 
>> TIA for any
>> useful pointers. Cheers.
> 
> Here's what I have so far:
> 
> def bag_data(data):
>     counts = numpy.zeros(data.shape[0])
>     seen = {}
>     for i, arr in enumerate(data):
>         sarr = arr.tostring()
>         if sarr in seen:
>             counts[seen[sarr]] += 1
>         else:
>             seen[sarr] = i
>             counts[i] = 1
>     nz = counts != 0
>     return numpy.compress(nz, data, axis=0), numpy.compress(nz, counts)
> 

Three times as fast as what I had, and a bit cleaner. Excellent. Cheers.
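
For what it's worth, newer numpy (1.13+, so more recent than the above) can
do the whole thing directly; note that, unlike the versions above, it also
sorts the unique subarrays:

import numpy

data = numpy.array([[0, 1], [0, 1], [1, 1]])
data_out, cnts = numpy.unique(data, axis=0, return_counts=True)
# data_out -> [[0 1], [1 1]], cnts -> [2 1]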

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Generating a vector from successive multiplications of another vector from an initial value

2015-10-01 Thread duncan smith
On 01/10/15 21:45, Paulo da Silva wrote:
> Hi all.
> 
> What is the fastest way to do the following:
> 
> I have an initial value V and a vector vec of (financial) indexes.
> I want to generate a new vector nvec as
> 
> V, V*vec[0], V*vec[0]*vec[1], V*vec[0]*vec[1]*vec[2], ...
> 
> A numpy vectorized solution would be better.
> 
> Thanks
> 

Maybe,

http://docs.scipy.org/doc/numpy/reference/generated/numpy.cumprod.html
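
Something along these lines (V and vec are just made-up example values):

import numpy as np

V = 100.0
vec = np.array([1.01, 0.99, 1.02])
nvec = np.concatenate(([V], V * np.cumprod(vec)))
# array([ 100.    ,  101.    ,   99.99  ,  101.9898])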

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to Calculate NPV?

2015-07-29 Thread duncan smith
On 29/07/15 15:27, ryguy7272 wrote:
 On Wednesday, July 29, 2015 at 10:21:35 AM UTC-4, ryguy7272 wrote:
 On Wednesday, July 29, 2015 at 9:59:10 AM UTC-4, ryguy7272 wrote:
 I am using Spyder Python 2.7.  I'm running this sample code.

 import scipy as sp
 cashflows=[50,40,20,10,50]
 npv=sp.npv(0.1,cashflows)
 round(npv,2)


 Now, I'm trying to get the NPV, and I don't see any obvious way to get it.  
 The author of the book that I'm reading gets 144.56.  I think that's wrong, 
 but I don't know for sure, as Python won't do any calculation at all.  It's 
 easy to enter code and run it, but I can't tell how to get Python to 
 actually DO the calculation.

 Any idea what I'm doing wrong?
  
 PERFECT!!  SO SIMPLE!!
 I don't know why the author didn't do that in the book.

 Thanks!
 
 One last thing, for Excel users, leave out the initial CF.  Do the NPV on the 
 other CFs, and then add in the initial CF at the end of the NPV function.  
 It's almost like a PV + 1stCF.  I don't know why Excel does it like that...
 

I don't know what Excel does, but the first value shouldn't require any
special treatment, 1.1**0 = 1.

>>> sum([x/(1.1)**i for i, x in enumerate([50,40,20,10,50])])
144.55638276074038


Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Bug in floating point multiplication

2015-07-02 Thread duncan smith
On 02/07/15 15:52, Steven D'Aprano wrote:
 Despite the title, this is not one of the usual "Why can't Python do
 maths?" bug reports.
 
 Can anyone reproduce this behaviour? If so, please reply with the version of
 Python and your operating system. Printing sys.version will probably do.
 
 
 x = 1 - 1/2**53
 assert x == 0.9999999999999999
 for i in range(1, 1000000):
     if int(i*x) == i:
         print(i); break
 
 
 Using Jython and IronPython, the loop runs to completion. That is the
 correct behaviour, or so I am lead to believe. Using Python 2.6, 2.7 and
 3.3 on Centos and Debian, it prints 2049 and breaks. That should not
 happen. If you can reproduce that (for any value of i, not necessarily
 2049), please reply.
 
 See also http://bugs.python.org/issue24546 for more details.
 
 
 

Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "copyright", "credits" or "license()" for more information.
>>> from __future__ import division
>>> x = 1 - 1/2**53
>>> assert x == 0.9999999999999999
>>> for i in range(1, 1000000):
        if int(i*x) == i:
                print(i); break





Python 3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2] on linux
Type "copyright", "credits" or "license()" for more information.
>>> x = 1 - 1/2**53
>>> assert x == 0.9999999999999999
>>> for i in range(1, 1000000):
        if int(i*x) == i:
                print(i); break





Ubuntu 14.04

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Algorithm for Creating Supersets of Smaller Sets Based on Common Elements

2015-02-22 Thread duncan smith
On 21/02/15 19:46, TommyVee wrote:
 Start off with sets of elements as follows:
 
 1. A,B,E,F
 2. G,H,L,P,Q
 3. C,D,E,F
 4. E,X,Z
 5. L,M,R
 6. O,M,Y
 
 Note that sets 1, 3 and 4 all have the element 'E' in common, therefore
 they are related and form the following superset:
 
 A,B,C,D,E,F,X,Z
 
 Likewise, sets 2 and 5 have the element 'L' in common, then set 5 and 6
 have element 'M' in common, therefore they form the following superset:
 
 G,H,L,M,O,P,Q,R,Y
 
 I think you get the point.  As long as sets have at least 1 common
 element, they combine to form a superset.  Also links (common
 elements) between sets may go down multiple levels, as described in the
 second case above (2-5-6).  Cycles thankfully, are not possible.
 
 BTW, the number of individual sets (and resultant supersets) will be
 very large.
 
 I don't know where to start with this.  I thought about some type of
 recursive algorithm, but I'm not sure.  I could figure out the Python
 implementation easy enough, I'm just stumped on the algorithm itself.
 
 Anybody have an idea?
 
 Thanks, Tom

Tom,
You could construct a graph G such that there exists an edge {v,w}
for each pair of items that appear in a common set. Each connected
component will contain the nodes of one of the supersets you're looking
for. That's unnecessarily expensive, so you can adopt a similar approach
using trees.

http://en.wikipedia.org/wiki/Disjoint-set_data_structure

Construct a forest of trees (initially one tree for each item) and join
pairs of trees until you have one tree for each of your supersets. For
each set of size n you only need to consider n-1 joins. That will ensure
that any pair of items that are in one of the sets are contained in a
single tree. The find and union operations are amortized constant, so
this should be efficient for your large numbers of sets.

The union_find module can be found at
https://github.com/DuncanSmith147/KVMS. Cheers.

Duncan


>>> import union_find
>>> sets = (['A','B','E','F'],
        ['G','H','L','P','Q'],
        ['C','D','E','F'],
        ['E','X','Z'],
        ['L','M','R'],
        ['O','M','Y'])
>>> uf = union_find.UnionFindTree()
>>> for a_set in sets:
        for item in a_set:
            try:
                uf.add(item)
            except ValueError:
                pass
        n = a_set[0]
        for item in a_set[1:]:
            is_merged = uf.union(n, item)


>>> from collections import defaultdict
>>> res = defaultdict(list)
>>> for item in uf:
        res[uf.find(item)].append(item)


>>> res.values()
[['G', 'H', 'M', 'L', 'O', 'Q', 'P', 'R', 'Y'], ['A', 'C', 'B', 'E',
'D', 'F', 'X', 'Z']]



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Using Python for date calculations

2014-11-21 Thread duncan smith
On 21/11/14 08:35, Steve Hayes wrote:
 I've finally found a use for Python. 
 
 When, in the course of my genealogy research, I look at census or burial
 records, I often want to work out a person's date of birth from their age.
 It's a simple matter of mental arithmetic, but I sometimes get it wrong, and
 mislead myself. There are calculators and date calculation programs, but they
 are usually too complicated and try to do too much, so by the time you've
 worked out what to do it takes much longer. 
 
 This Python script does it for me. 
 
 year = input("Year: ")
 age = input("Age: ")
 born = year-age
 print 'Year of birth:', born
 
 It's so simple, so elementary, that it's not really worth writing about,
 except for the fact that it illustrates the KISS principle. 
 

[snip]

This is keeping it too simple. Someone aged 50 (i.e. over 50 but not yet
51) today - 21st Nov 2014 - might have been born in 1963 or 1964
depending on their birthday. For me your calculation would return the
correct answer (born in March), for my sister it would be wrong (born in
December).
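
i.e. without the birthday all you can pin down is a two-year range:

year = 2014    # census / burial year
age = 50
print('born in %d or %d' % (year - age - 1, year - age))    # born in 1963 or 1964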

Duncan

-- 
https://mail.python.org/mailman/listinfo/python-list


__index__

2014-11-01 Thread duncan smith
Hello,
  I have a Bloom filter class and want to (partially) serialize
instances using hex() or oct(). Instances are mutable, so I can't
inherit from long. I thought I'd found the answer when I came across
__index__,
https://docs.python.org/2/reference/datamodel.html#object.__index__. But
it doesn't seem to work as I expected it to.


>>> class MyClass(object):
    def __init__(self):
        self.val = 7
    def __index__(self):
        return self.val


>>> x = MyClass()
>>> oct(x)

Traceback (most recent call last):
  File "<pyshell#76>", line 1, in <module>
    oct(x)
TypeError: oct() argument can't be converted to oct
>>> oct(x.__index__())
'07'



Can someone please explain why my thinking is wrong on this? TIA.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: __index__

2014-11-01 Thread duncan smith
On 01/11/14 16:56, duncan smith wrote:

[snip]

Sorry, forgot to add that I'm using Python 2.7.6 on Ubuntu 14.04. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: __index__

2014-11-01 Thread duncan smith
On 01/11/14 18:29, Ethan Furman wrote:
 On 11/01/2014 10:11 AM, Ned Batchelder wrote:
 On 11/1/14 12:56 PM, duncan smith wrote:

I have a Bloom filter class and want to (partially) serialize
 instances using hex() or oct(). Instances are mutable, so I can't
 inherit from long. I thought I'd found the answer when I came across
 __index__, it doesn't seem to work as I expected it to.

 Just above your link in the docs is __oct__ and __hex__, which are
 used to implement oct() and hex():
 https://docs.python.org/2/reference/datamodel.html#object.__oct__
 
 In Python 2 __oct__ and __hex__ are used for oct() and hex(), but in
 Python 3 __index__ is used.
 

It was the doc for hex at
https://docs.python.org/2/library/functions.html that led me to think I
needed to implement the __index__ method. The doc for bin and oct seems
to be right (I need __index__ for bin, but it doesn't work for oct).
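
So in Python 2 the fix seems to be to supply __hex__ and __oct__ as well (a
sketch, using the same toy class as before):

class MyClass(object):
    def __init__(self):
        self.val = 7
    def __index__(self):          # used by bin() (and slicing)
        return self.val
    def __hex__(self):            # used by hex() in Python 2
        return hex(self.val)
    def __oct__(self):            # used by oct() in Python 2
        return oct(self.val)

x = MyClass()
print(oct(x), hex(x), bin(x))     # ('07', '0x7', '0b111') in Python 2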


 But I agree with Net that using a separate method is probably better.
 
 -- 
 ~Ethan~

Possibly, I'm still tinkering with it. I use the Bloom filters to
generate, and act as pseudonyms for token sets. I have another class
that represents pseudonyms (of a different kind, but still bit strings)
that is derived from long - so hex etc. just work. I should probably (at
least) subclass my Bloom filter class before adding the relevant
methods. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [OT] spelling colour / color was Re: Toggle

2014-10-11 Thread duncan smith
On 11/10/14 12:45, mm0fmf wrote:
 On 11/10/2014 10:37, Christian Gollwitzer wrote:
 Being a non-native English speaker/writer, I myself stick to the
 recommendations of the Oxford dictionary.

  Christian
 
 But you do realise the Oxford dictionary is different to English usage
 and is renowned for using what is known as Oxford spelling? You wont
 find -ize used by the BBC in content for the UK nor will you find
 British newspapers using it.
 

[snip]

I don't know about the BBC etc. but it pretty much corresponds to my
usage. I tend to use 'ize', but not for words like analyse (where for
whatever reason it just doesn't look right with a 'z' - maybe because
realization is fine but analyzis is not). I suspect some non-US English
speakers, when faced with the option, assume the 'z' might be a US
spelling and opt for the 's' instead. A useful rule of thumb perhaps,
that might make 's' common usage for e.g. the BBC / newspapers. But 'z'
is still used by some of us outside the media.

The media have their own quirks when it comes to English. The BBC
regularly use "top of" / "bottom of" in the sense of "start of" / "end
of", but I don't know any British people who would (currently) use that
in conversation. (This only started a few years ago, and the first time
I heard it I had to work out what it meant from context.)

Duncan

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [OT] spelling colour / color was Re: Toggle

2014-10-11 Thread duncan smith
On 11/10/14 20:55, William Ray Wing wrote:
 On Oct 11, 2014, at 3:20 PM, Dennis Lee Bieber wlfr...@ix.netcom.com wrote:
 
 On Sat, 11 Oct 2014 16:26:43 +0100, duncan smith buzzard@invalid.invalid
 declaimed the following:


 The media have their own quirks when it comes to English. The BBC
 regularly use "top of" / "bottom of" in the sense of "start of" / "end
 of", but I don't know any British people who would (currently) use that
 in conversation. (This only started a few years ago, and the first time
 I heard it I had to work out what it meant from context.)


  That usage I think is ancient... I'm sure I've heard it back when there
 was a reasonable BBC World Service (along with VOA, Radio Netherlands, and
 etc. running on Short Wave...

  Top of the Hour…
 
 Of course, musicians have used it for years, as in “Take it from the top.”
 

[snip]

I think it must be a more recent thing with BBC (TV) presenters /
newsreaders (I rarely listen to radio); or, at least, it has become a
lot more common. I can see it makes perfect sense with e.g. sheet music.
You start again at the top. It was the top / bottom of the [TV]
programme that I didn't immediately get, because I was thinking of a
timeline running left to right (perhaps rather than the script used by
the presenters).

Duncan

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [OT] spelling colour / color was Re: Toggle

2014-10-09 Thread duncan smith
On 09/10/14 18:43, mm0fmf wrote:
 On 09/10/2014 02:29, Steven D'Aprano wrote:
 Apart from the horrible spelling of colour :-)
 
 I've always spelt colour as color when programming and as colour
 when writing language including documentation about software.
 
 colour in a programme doesn't seem right.
 

Even in British English that is usually spelt 'program' (from the US
spelling, of course). Let's not cave in on 'colour' too. It's bad enough
that we can't use 'whilst' loops :-).

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: The possibility integration in Python without an equation, just an array-like file

2014-05-16 Thread duncan smith

On 16/05/14 13:57, Enlong Liu wrote:

Sorry for this inconvenience. Since my file is a little large, I think it will 
be more difficult for fellows to check. I will now paste the attachment below.

The file for T(E) consists of two columns, the first is E and the second is 
T(E). What I want is to integrate this function in a certain range.

   E   T(E)
   -6.2578958.96247678929291
   -6.2378958.90487115888237
   -6.2178957.96907432610510


[snip]

http://en.wikipedia.org/wiki/Numerical_integration
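
e.g. the trapezoidal rule on the tabulated values (the filename and the way
the two columns are read are assumptions, not from your post):

import numpy as np

E, TE = np.loadtxt('T_of_E.dat', unpack=True)    # two columns: E and T(E)
mask = (E >= -6.0) & (E <= -5.0)                 # the integration range you want
I = np.trapz(TE[mask], E[mask])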

Duncan

--
https://mail.python.org/mailman/listinfo/python-list


Re: The possibility integration in Python without an equation, just an array-like file

2014-05-16 Thread duncan smith

On 16/05/14 16:01, Johannes Schneider wrote:

If you do not have a closed form for T(E) you cannot calculate the exact
value of I(V).

Anyway. Assuming T is integrable you can approximate I(V).

1. Way to do:
interpolate T(E) by a polynomial P and integrate P. For this you need
the equation (coefficients and exponents) of P. Integrating is easy
after that.

2. other way:
Use Stair-functions: you can approximate the Value of IV() by the sum
over T(E_i) * (E_{i+1} - E_i) s.t. E_0 = E_F-\frac{eV}{2} and E_n =
E_F+\frac{eV}{2}.



[snip]

Or piecewise polynomials (splines).
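
e.g. with scipy (again, the filename and column handling are assumptions):

import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

E, TE = np.loadtxt('T_of_E.dat', unpack=True)    # two columns: E and T(E)
spline = InterpolatedUnivariateSpline(E, TE, k=3)
I = spline.integral(-6.0, -5.0)                  # integrate T(E) over a chosen range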

Duncan

--
https://mail.python.org/mailman/listinfo/python-list


Re: random.sample with large weighted sample-sets?

2014-02-16 Thread duncan smith

On 16/02/14 05:08, Ben Finney wrote:

Tim Chase python.l...@tim.thechases.com writes:


I'm not coming up with the right keywords to find what I'm hunting.
I'd like to randomly sample a modestly compact list with weighted
distributions, so I might have

   data = (
 ("apple", 20),
 ("orange", 50),
 ("grape", 30),
 )


That's not a list, it's a tuple. I think you want a list.

When you want a sequence where each position has a semantic meaning, use
a tuple (such as ‘(apple, 20)’). Each item has a meaning *because of*
the position it's in; if the items were in a different order, they'd
mean different things.

When you want a sequence where the positions don't have a special
meaning – each item means exactly the same no matter if you change the
order – that's sometimes called a “homogeneous” sequence, and you want a
list.

So a “record” should be represented as a tuple, and a “table” of records
should be represented as a list of tuples:

 records = [
 ("apple", 20),
 ("orange", 50),
 ("grape", 30),
 ]


and I'd like to random.sample() it as if it was a 100-element list.




[snip]

That's a description of sampling without replacement. The probabilities 
change as items are sampled. e.g. The probability of the first item 
being appleis 20/100. But the probability that the second sampled item 
is apple is either 19/99 or 20/99, depending on the value of the first 
sampled item. The following (due to Knuth) will generate indices into a 
notional list of items.



import random

def indices(n, pop):
    # generates indices into a
    # population list containing
    # items with frequencies in pop
    # [("apple", 10), ("orange", 50), ...]
    N = sum(tup[1] for tup in pop)
    i = m = 0
    while m < n:
        u = random.random()
        if (N-i)*u >= n-m:
            i += 1
        else:
            yield i
            i += 1
            m += 1


>>> list(indices(3, [("apple", 20), ("orange", 50), ("grape", 30)]))
[8, 27, 78]



The indices are generated in order, so it could easily be extended to 
generate items or item count pairs.


There might be something more efficient based on the hypergeometric 
distribution (generate a number of apples, then a number of oranges 
given the number of sampled apples, then a number of grapes given the 
number of sampled apples and oranges, etc.).
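
Something like this sketch, using numpy's hypergeometric sampler (the
function and its name are mine):

import numpy.random as npr

def sample_counts(n, pop):
    # pop: [("apple", 20), ("orange", 50), ("grape", 30)]
    # returns how many of each item fall in a sample of size n, without replacement
    counts = {}
    remaining = sum(cnt for _, cnt in pop)
    for item, cnt in pop:
        remaining -= cnt    # population left after this item
        if n == 0:
            k = 0
        elif remaining == 0:
            k = n           # whatever is still needed must come from this item
        else:
            k = int(npr.hypergeometric(cnt, remaining, n))
        counts[item] = k
        n -= k
    return counts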


Duncan
--
https://mail.python.org/mailman/listinfo/python-list


Re: random.sample with large weighted sample-sets?

2014-02-16 Thread duncan smith

On 16/02/14 16:35, Charles Allen wrote:

How efficient does this thing need to be?

You can always just turn it into a two-dimensional sampling problem by
thinking of the data as a function f(x=item), generating a random x=xr
in [0,x], then generating a random y in [0,max(f(x))].  The xr is
accepted if 0 < y <= max(f(xr)), or rejected (and another attempt made) if
y > max(f(xr)).



You can avoid rejection by constructing an alias table. A list can be 
constructed such that each list element contains a pair of values and a 
cutoff. e.g.


[("apple", 20), ("orange", 50), ("grape", 30)]

would become (using one particular algorithm)

[(("apple", "orange"), 0.6),
 (("orange", "apple"), 1.0),
 (("grape", "orange"), 0.9)]

Generate a random index, then select one of the values on the basis of 
the cutoff. For short enough lists you can generate a single 0-1 random 
variate, u, and use int(n*u) for the index and compare n*u - int(n*u) to 
the cutoff, where n is the length of the list. It's still sampling with 
replacement though.
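
A sketch of the construction and the draw (this is Vose's variant of the
alias method; with the input above it reproduces the table shown, up to
float rounding):

import random

def build_alias(weighted):
    # weighted: [("apple", 20), ("orange", 50), ("grape", 30)]
    n = len(weighted)
    items = [item for item, _ in weighted]
    total = float(sum(w for _, w in weighted))
    prob = [w * n / total for _, w in weighted]    # cutoffs
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:    # mop up floating point error
        prob[i] = 1.0
    return items, prob, alias

def draw(items, prob, alias):
    # a single uniform variate supplies both the index and the cutoff test
    u = random.random() * len(items)
    i = int(u)
    return items[i] if u - i < prob[i] else items[alias[i]]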


Duncan


--
https://mail.python.org/mailman/listinfo/python-list


Re: numpy.where() and multiple comparisons

2014-01-17 Thread duncan smith

On 18/01/14 01:51, John Ladasky wrote:

Hi folks,

I am awaiting my approval to join the numpy-discussion mailing list, at 
scipy.org.  I realize that would be the best place to ask my question.  
However, numpy is so widely used, I figure that someone here would be able to 
help.

I like to use numpy.where() to select parts of arrays.  I have encountered what 
I would consider to be a bug when you try to use where() in conjunction with 
the multiple comparison syntax of Python.  Here's a minimal example:

Python 3.3.2+ (default, Oct  9 2013, 14:50:09)
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import numpy as np
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = np.where(a < 5)
>>> b
(array([0, 1, 2, 3, 4]),)
>>> c = np.where(2 < a < 7)

Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is 
ambiguous. Use a.any() or a.all()

Defining b works as I want and expect.  The array contains the indices (not the 
values) of a where a < 5.

For my definition of c, I expect (array([3, 4, 5, 6]),).  As you can see, I get a 
ValueError instead.  I have seen the error message about the truth value of an 
array with more than one element before, and generally I understand how I 
(accidentally) provoke it.  This time, I don't see it.  In defining c, I expect to be 
stepping through a, one element at a time, just as I did when defining b.

Does anyone understand why this happens?  Is there a smart work-around?  Thanks.




>>> a = np.arange(10)
>>> c = np.where((2 < a) & (a < 7))
>>> c
(array([3, 4, 5, 6]),)
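
The chained comparison fails because Python evaluates it as two 
comparisons joined by "and", which needs a single truth value from the 
intermediate boolean array. Combining the element-wise conditions 
explicitly avoids that; just as an equivalent spelling of the line above 
(the parentheses matter because & binds more tightly than the comparison 
operators):

import numpy as np

a = np.arange(10)
c1 = np.where((2 < a) & (a < 7))             # element-wise & on boolean arrays
c2 = np.where(np.logical_and(2 < a, a < 7))  # same thing, spelled as a function
assert (c1[0] == c2[0]).all()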


Duncan
--
https://mail.python.org/mailman/listinfo/python-list


Re: Optimizing list processing

2013-12-11 Thread duncan smith

On 11/12/13 23:54, Steven D'Aprano wrote:

I have some code which produces a list from an iterable using at least
one temporary list, using a Decorate-Sort-Undecorate idiom. The algorithm
looks something like this (simplified):

table = sorted([(x, i) for i,x in enumerate(iterable)])
table = [i for x,i in table]

The problem here is that for large iterables, say 10 million items or so,
this is *painfully* slow, as my system has to page memory like mad to fit
two large lists into memory at once. So I came up with an in-place
version that saves (approximately) two-thirds of the memory needed.

table = [(x, i) for i,x in enumerate(iterable)]
table.sort()
for x, i in table:
    table[i] = x

For giant iterables (ten million items), this version is a big
improvement, about three times faster than the list comp version. Since
we're talking about the difference between 4 seconds and 12 seconds (plus
an additional 40-80 seconds of general slow-down as the computer pages
memory into and out of virtual memory), this is a good, solid
optimization.

Except that for more reasonably sized iterables, it's a pessimization.
With one million items, the ratio is the other way around: the list comp
version is 2-3 times faster than the in-place version. For smaller lists,
the ratio varies, but the list comp version is typically around twice as
fast. A good example of trading memory for time.

So, ideally I'd like to write my code like this:


table = [(x, i) for i,x in enumerate(iterable)]
table.sort()
if len(table) < ?:
    table = [i for x,i in table]
else:
    for x, i in table:
        table[i] = x

where ? no doubt will depend on how much memory is available in one
contiguous chunk.

Is there any way to determine which branch I should run, apart from hard-
coding some arbitrary and constant cut-off value?



I had a slightly similar problem a while ago. I actually wanted to 
process data from a large file in sorted order. In the end I read chunks 
of data from the file, sorted them, then wrote each chunk of data to a 
temporary file. Then I used heapq.merge to merge the data in the 
temporary files. It vastly reduced memory consumption, and was 'quick 
enough'. It was based on Guido's solution for sorting a million 32-bit 
integers in 2MB of RAM 
(http://neopythonic.blogspot.co.uk/2008/10/sorting-million-32-bit-integers-in-2mb.html). 
Cheers.
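
A simplified sketch of the approach (not the actual code; it assumes the 
items are integers so they can be written one per line, and glosses over 
error handling):

import heapq
import itertools
import os
import tempfile

def external_sort(iterable, chunk_size=100000):
    # Sort fixed-size chunks in memory, write each sorted chunk to its
    # own temporary file, then lazily merge the files with heapq.merge.
    it = iter(iterable)
    paths = []
    while True:
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, 'w') as f:
            f.writelines('%d\n' % x for x in chunk)
        paths.append(path)
    files = [open(path) for path in paths]
    streams = [(int(line) for line in f) for f in files]
    try:
        for value in heapq.merge(*streams):
            yield value
    finally:
        for f in files:
            f.close()
        for path in paths:
            os.remove(path)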


Duncan

--
https://mail.python.org/mailman/listinfo/python-list


Re: many constructors in a class?

2013-08-14 Thread duncan smith

On 14/08/13 15:16, climb65 wrote:

Hello,

here is a small basic question :

Is it possible to have more than one constructor (__init__ function) in a
class? For instance, to create an object with 2 different ways? If my
memory is good, I think that with C++ it is possible.

Thanks for your answer.




http://stackoverflow.com/questions/5738470/whats-an-example-use-case-for-a-python-classmethod
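
In short, the usual idiom is a classmethod acting as an alternative 
constructor. A made-up example (not taken from the linked answer):

import math

class Point(object):

    def __init__(self, x, y):
        self.x = x
        self.y = y

    @classmethod
    def from_polar(cls, r, theta):
        # A second way of building a Point, without a second __init__.
        return cls(r * math.cos(theta), r * math.sin(theta))

p1 = Point(3.0, 4.0)
p2 = Point.from_polar(5.0, math.pi / 4)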


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: PEP 450 Adding a statistics module to Python

2013-08-11 Thread duncan smith

On 11/08/13 15:02, Roy Smith wrote:

In article mailman.479.1376221844.1251.python-l...@python.org,
  Skip Montanaro s...@pobox.com wrote:


See the Rationale of PEP 450 for more reasons why “install NumPy” is not
a feasible solution for many use cases, and why having ‘statistics’ as a
pure-Python, standard-library package is desirable.


I read that before posting but am not sure I agree. I don't see the
screaming need for this package.  Why can't it continue to live on
PyPI, where, once again, it is available as pip install ...?


My previous comments on this topic were along the lines of installing
numpy is a non-starter if all you need are simple mean/std-dev.  You
do, however, make a good point here.  Running pip install statistics
is a much lower barrier to entry than getting numpy going, especially if
statistics is pure python and thus has no dependencies on compiler tool
chains which may be missing.

Still, I see two classes of function in PEP-450.  Class 1 is the really
basic stuff:

* mean
* std-dev

Class 2 are the more complicated things like:

* linear regression
* median
* mode
* functions for calculating the probability of random variables
   from the normal, t, chi-squared, and F distributions
* inference on the mean
* anything that differentiates between population and sample

I could see leaving class 2 stuff in an optional pure-python module to
be installed by pip, but for (as the PEP phrases it), the simplest and
most obvious statistical functions (into which I lump mean and std-dev),
having them in the standard library would be a big win.



I would probably move other descriptive statistics (median, mode, 
correlation, ...) into Class 1.


I roll my own statistical tests as I need them - simply to avoid having 
a dependency on R. But I generally do end up with a dependency on scipy 
because I need scipy.stats.distributions. So I guess a distinct library 
for probability distributions would be handy - but maybe it should not 
be in the standard library.
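
For example, a hand-rolled chi-squared goodness-of-fit test in which 
scipy only supplies the reference distribution (an illustrative sketch, 
not code from the thread):

from scipy.stats import distributions

def chi_squared_test(observed, expected):
    # Test statistic computed by hand; scipy.stats.distributions
    # provides the chi-squared survival function for the p-value.
    statistic = sum((o - e) ** 2 / float(e) for o, e in zip(observed, expected))
    df = len(observed) - 1
    return statistic, distributions.chi2.sf(statistic, df)

statistic, p_value = chi_squared_test([18, 55, 27], [20, 50, 30])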


Once we move on to statistical modelling (e.g. linear regression) I 
think the case for inclusion in the standard library becomes weaker 
still. Cheers.


Duncan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple algorithm question - how to reorder a sequence economically

2013-05-24 Thread duncan smith

On 24/05/13 10:11, Chris Angelico wrote:

On Fri, May 24, 2013 at 6:47 PM, Fábio Santos fabiosantos...@gmail.com wrote:


On 24 May 2013 09:41, Chris Angelico ros...@gmail.com wrote:


On Fri, May 24, 2013 at 6:14 PM, Peter Brooks
peter.h.m.bro...@gmail.com wrote:

What is the easiest way to reorder a sequence pseudo-randomly?

That is, for a sequence 1,2,3,4 to produce an arbitrary ordering (eg
2,1,4,3) that is different each time.


...


It works, it produces a unique list for any given index provided, but
it's not the cleanest or most efficient. But I know someone'll improve
on it... or tell me I'm an idiot for not taking a more obvious
approach :)

ChrisA


I think that is pretty much itertools.permutations from the standard
library. The OP should check it out.


That works if all the permutations are wanted at once. Is there a way,
short of iterating over it N times, to request permutation #N? Or
maybe I'm misreading the OP and that's not a requirement.

ChrisA



A long time ago I wrote some code to do that.


import gmpy

def LexPermFromIndex(items, index):
    # Returns the permutation of items at the given lexicographic index.
    n = len(items)
    inds = list(range(n))
    perm = []
    for i in range(1, n+1):
        r, index = divmod(index, gmpy.fac(n-i))
        r = int(r)
        perm.append(inds[r])
        inds = inds[:r] + inds[r+1:]

    return [items[i] for i in perm]


>>> LexPermFromIndex([1,2,3,4], 0)
[1, 2, 3, 4]
>>> LexPermFromIndex([1,2,3,4], 1)
[1, 2, 4, 3]
>>> LexPermFromIndex([1,2,3,4], 10)
[2, 4, 1, 3]



I can't remember exactly why I wrote it. But I also have something for 
generating a permutation's index and similar functions for combinations.
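
A sketch of such an inverse (using math.factorial rather than gmpy, and 
assuming the items are distinct):

import math

def LexIndexFromPerm(items, perm):
    # Accumulate, for each position, the rank of the chosen item among
    # those not yet used, weighted by the number of permutations of the
    # remaining positions.
    n = len(items)
    inds = list(range(n))
    index = 0
    for i, item in enumerate(perm):
        r = inds.index(items.index(item))
        index += r * math.factorial(n - i - 1)
        del inds[r]
    return index

So LexIndexFromPerm([1,2,3,4], [2, 4, 1, 3]) recovers 10, matching the 
last example above.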


Duncan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Ordered dictionaries compared

2013-05-23 Thread duncan smith

On 23/05/13 04:31, Dan Stromberg wrote:


What kind of ordered dictionaries?  Sorted by key.

I've redone the previous comparison, this time with a better red-black
tree implementation courtesy of Duncan G. Smith.

The comparison is at
http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/just-trees/

The Red-Black tree gave a much better showing this time, but it gave
just one 2nd place on one workload-interpreter - still kinda
lackluster.  It took 1st place 0 times.




A quick test of my Red Black Tree and Treap (Python 2.7).


import random
import time

import binary_trees  # my own module providing RedBlackTree and Treap

def test_trees(data, randomize=True):
    cpy = data[:]  # for deletion
    if randomize:
        random.shuffle(data)
        random.shuffle(cpy)
    t = binary_trees.RedBlackTree()
    start = time.time()
    for datum in data:
        t.insert(datum)
    print 'Red Black Tree insertion %s' % (time.time() - start)
    start = time.time()
    for datum in data:
        t.find(datum)
    print 'Red Black Tree find %s' % (time.time() - start)
    start = time.time()
    for datum in cpy:
        t.delete(datum)
    print 'Red Black Tree deletion %s' % (time.time() - start)
    t = binary_trees.Treap()
    start = time.time()
    for datum in data:
        t.insert(datum)
    print
    print 'Treap insertion %s' % (time.time() - start)
    start = time.time()
    for datum in data:
        t.find(datum)
    print 'Treap find %s' % (time.time() - start)
    start = time.time()
    for datum in cpy:
        t.delete(datum)
    print 'Treap deletion %s' % (time.time() - start)


>>> test_trees(range(10))
Red Black Tree insertion 5.42807197571
Red Black Tree find 1.58799219131
Red Black Tree deletion 3.87580800056

Treap insertion 6.79647684097
Treap find 2.11693120003
Treap deletion 4.61243915558

>>> test_trees(range(10), False)
Red Black Tree insertion 6.29647898674
Red Black Tree find 1.157143116
Red Black Tree deletion 2.74785804749

Treap insertion 3.87288999557
Treap find 1.48776102066
Treap deletion 1.88962197304



RBT is quicker than Treap for insertion with randomized data, but slower 
with ordered data. Randomized data will tend to minimize the number of 
tree rotations needed to keep the RBT balanced, whilst the Treap will be 
performing rotations to maintain the heap property in an already 
reasonably well balanced tree. With ordered data the RBT will have to 
work harder to keep the tree balanced, whilst the Treap will be able to 
maintain the heap property with fewer rotations.


No surprise that find() is generally quicker for RBTs, they tend to be 
better balanced.


Deletion is a bit more confusing. I suppose deletion from a better 
balanced tree will tend to be quicker, but deletion from a treap 
constructed from ordered data is (for some reason) quickest of all.


All these operations require a call to find(), and that is generally 
going to be quicker for RBTs. Treaps tend to require fewer subsequent 
rotations, but they have variable worth (in terms of rebalancing).


Looks like RBTs are better than treaps if they are being populated with 
randomly ordered data, but not if they are being populated with ordered 
data. RBTs are better for use cases that are heavy on finds.


Both types of tree appear to be better balanced (on the basis of the 
find results) if populated from ordered data. Treaps appear to perform 
better on insertion, find and deletion when populated from ordered data.


Duncan

--
http://mail.python.org/mailman/listinfo/python-list

