Re: [Tutor] decomposing a problem

2018-12-29 Thread Avi Gross

Steven,

A more practical answer about splitting a data frame is to import a module
built for machine learning:

import sklearn.model_selection

Then use train_test_split() to return the four parts (training and test
splits of the features and the labels). I'm not sure exactly which answers
you need here, and why; plenty of tools exist for specifying the percentages
to partition by, among other options.
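As a minimal sketch of the suggestion above (the arrays X and y and the 75/25 split are illustrative, and scikit-learn must be installed):

```python
# Hypothetical example data: X holds features, y holds labels.
from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Returns four parts: training/test features and training/test labels,
# chosen at random (random_state fixes the shuffle for reproducibility).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(len(X_train), len(X_test))  # 6 2
```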


Sent from AOL Mobile Mail
On Saturday, December 29, 2018 Avi Gross  wrote:
Steven,

As I head out the door, I will sketch it.

Given a data.frame populated with N rows and columns you want to break it
into training and test data sets.

In a data.frame, you can refer to a row by using an index like 5 or 2019.
You can ask for the number of rows currently in existence. You can also
create an array/vector of length N consisting of instructions that can tell
which random rows of the N you want and which you don't. For the purposes of
this task, you choose random numbers in the range of N and either keep the
numbers as indices or as a way to mark True/False in the vector. You then
ask for a new data.frame made by indexing the existing one using the vector.
You can then negate the vector and ask for a second new data.frame indexing
it.

Something close to that.

Or, you can simply add the vector as a new column in the data.frame in some
form. It would then mark which rows are to be used for which purpose. Later,
when using the data, you include a CONDITION that row X is true, or
whatever.
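A minimal sketch of that idea in Python, assuming pandas and numpy are available (the column names and the 75/25 proportion are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with N rows.
df = pd.DataFrame({"x": range(100), "y": range(100, 200)})

# Boolean vector of length N marking the randomly chosen training rows.
rng = np.random.default_rng(0)
mask = rng.random(len(df)) < 0.75

train = df[mask]   # index the frame with the vector
test = df[~mask]   # negate the vector for the second frame

# Or keep the vector as a marker column and filter on it later.
df["is_train"] = mask
```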



-----Original Message-----
From: Tutor  On Behalf Of
Steven D'Aprano
Sent: Friday, December 28, 2018 11:12 PM
To: tutor@python.org
Subject: Re: [Tutor] decomposing a problem

On Fri, Dec 28, 2018 at 10:39:53PM -0500, Avi Gross wrote:
> I will answer this question then head off on vacation.

You wrote about 140 or more lines, but didn't come close to answering the
question: how to randomly split data from a dictionary into training data
and reserved data.



--
Steve
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor





Re: [Tutor] decomposing a problem

2018-12-28 Thread Steven D'Aprano
On Fri, Dec 28, 2018 at 10:39:53PM -0500, Avi Gross wrote:
> I will answer this question then head off on vacation.

You wrote about 140 or more lines, but didn't come close to answering 
the question: how to randomly split data from a dictionary into training 
data and reserved data.



-- 
Steve


Re: [Tutor] decomposing a problem

2018-12-28 Thread Avi Gross
in data and manipulate it. R has
multiple sets of tools including one in what they call the tidyverse. In
English, given such a data structure with any number of rows and columns,
you have names for the columns and optionally the rows. The tools allow you
to select any combination of rows and columns based on all kinds of search
and matching criteria. You can re-arrange them, add new ones or create new
ones using the data in existing ones, generate all kinds of statistical info
such as the standard deviation of each column or apply your own functions.
All this can be done in a pipelined fashion.

What you often do is read in a data.frame from a Comma Separated Values file
(CSV) or all kinds of data from other programs including EXCEL spreadsheets,
Stata and so on, including the Feather format python can make, and massage
the data such as removing rows with any NA (not available) values, or
interpolate new values, split it into multiple dataframes as discussed and
so on. You can do many statistical analyses by feeding entire dataframes or
selected subsets to functions to do many things like linear and other forms
of regression and it really shines when you feed these data structures to
graphics engines like ggplot2 letting you make amazing graphs. Like I said,
R is designed with vectors and data.frames as principal components.

But once python is augmented, it can do much of the same. Not quite sure how
much is ported or invented. Some data types like "formulas" seem to be done
differently. It will take me a while to study it all.

I can point to resources if anyone is interested but again, this is a python
forum. So it is of interest to me that it is possible to combine bits and
pieces of R and python in the same programming environment. I mean you can
use one to do what it does best, have the data structures silently be
translated into something the other one understands and do some more
processing where you have software that shines, then switch back and forth
as needed. This kind of duality may mean it is not necessary to keep
changing one language to be able to do what the other does, in some cases.
And, amusingly, much of the underlying functionality accessed is in C or C++
with some data structures being translated to/from the compiled C/C++
equivalents as you enter a function, then translated back at exit.

This is not very deep, just making a point since Alan asked. You can find
strengths and weaknesses in any language. I love how consistently Python
makes everything an object. R started off without object orientation and has
grafted on at least a dozen variations, which can be a tad annoying.



-----Original Message-
From: Tutor  On Behalf Of
Steven D'Aprano
Sent: Friday, December 28, 2018 8:04 PM
To: tutor@python.org
Subject: Re: [Tutor] decomposing a problem

On Fri, Dec 28, 2018 at 03:34:19PM -0500, Avi Gross wrote:

[...]
> You replied to one of my points with this about a way to partition data:
> 
> ---
> The obvious solution:
> 
> keys = list(mydict.keys())
> random.shuffle(keys)
> index = len(keys)*3//4
> training_data = keys[:index]
> reserved = keys[index:]
> ---
> 
> (In the above, "---" is not python but a separator!)
> 
> That is indeed a very reasonable way to segment the data. But it sort 
> of makes my point. If the data is stored in a dictionary, the way to 
> access it ended up being to make a list and play with that. I would 
> still need to get the values one at a time from the dictionary such as 
> in the ways you also show and I omit.

Yes? How else do you expect to get the value given a key except by looking
it up?


> For me, it seems more natural in this case to simply have the data in 
> a data frame where I have lots of tools and methods available.


I'm not sure if your understanding of a data frame is the same as my
understanding. Are you talking about this?

http://www.r-tutor.com/r-introduction/data-frame

In other words, a two-dimensional array of some sort?

Okay, you have your data frame. Now what? How do you solve the problem
being asked? I'm not interested in vague handwaving that doesn't solve
anything. You specified data in a key:value store, let's say like this:


mydict = {'spam': 25, 'ham': 2, 'eggs': 7, 'cheddar': 1, 'brie': 14,
  'aardvark': 3, 'argument': 11, 'parrot': 16}

Here it is as a data frame:

df = [['spam', 'ham', 'eggs', 'cheddar', 'brie', 'aardvark', 'argument',
'parrot'],
  [25, 2, 7, 1, 14, 3, 11, 16]]

Now what? How do you randomly split that into a randomly selected set of
training data and reserved data?

Feel free to give an answer in terms of R, provided you also give an
answer in terms of Python. Remember that unlike R, Python doesn't have a
standard data frame type, so you are responsible for building whatever
methods you need.




-- 
Steve

Re: [Tutor] decomposing a problem

2018-12-28 Thread Steven D'Aprano
On Fri, Dec 28, 2018 at 03:34:19PM -0500, Avi Gross wrote:

[...]
> You replied to one of my points with this about a way to partition data:
> 
> ---
> The obvious solution:
> 
> keys = list(mydict.keys())
> random.shuffle(keys)
> index = len(keys)*3//4
> training_data = keys[:index]
> reserved = keys[index:]
> ---
> 
> (In the above, "---" is not python but a separator!)
> 
> That is indeed a very reasonable way to segment the data. But it sort of
> makes my point. If the data is stored in a dictionary, the way to access it
> ended up being to make a list and play with that. I would still need to get
> the values one at a time from the dictionary such as in the ways you also
> show and I omit.

Yes? How else do you expect to get the value given a key except by 
looking it up?


> For me, it seems more natural in this case to simply have the data in 
> a data frame where I have lots of tools and methods available.


I'm not sure if your understanding of a data frame is the same as my
understanding. Are you talking about this?

http://www.r-tutor.com/r-introduction/data-frame

In other words, a two-dimensional array of some sort?

Okay, you have your data frame. Now what? How do you solve the problem
being asked? I'm not interested in vague handwaving that doesn't solve
anything. You specified data in a key:value store, let's say like this:


mydict = {'spam': 25, 'ham': 2, 'eggs': 7, 'cheddar': 1, 'brie': 14,
  'aardvark': 3, 'argument': 11, 'parrot': 16}

Here it is as a data frame:

df = [['spam', 'ham', 'eggs', 'cheddar', 'brie', 'aardvark', 'argument', 
'parrot'],
  [25, 2, 7, 1, 14, 3, 11, 16]]

Now what? How do you randomly split that into a randomly selected set of
training data and reserved data?

Feel free to give an answer in terms of R, provided you also give an
answer in terms of Python. Remember that unlike R, Python doesn't have a
standard data frame type, so you are responsible for building whatever
methods you need.




-- 
Steve


Re: [Tutor] decomposing a problem

2018-12-28 Thread Alan Gauld via Tutor
On 28/12/2018 20:34, Avi Gross wrote:

> So before I respond, here is a general statement. I am NOT particularly
> interested in much of what we discuss here from a specific point of view.
> Someone raises a question and I think about it. They want to know of a
> better way to get a random key from a dictionary. My thought is that if I
> needed that random key, maybe I would not have stored it in a dictionary in
> the first place. But, given that the data is in a dictionary, I wonder what
> could be done. It is an ACADEMIC discussion with a certain amount of hand
> waving.

But you need to apply real world constraints.
The choice of data type is intrinsic to the language in use.
(The same is true of control structures - loops,
decision points etc - but that is less pertinent here.)

If you program in C pretty much all you get is the array.
Everything else (including struct/union/typedef) is hand
crafted by the programmer.

If it's Lisp then you get the list and anything else
is coded (or simulated in code) by hand.

In Python we have lists, tuples, dictionaries and sets.
Anything else, including subclassing the basic types, is
down to the programmer (or finding a module already created
by somebody else).

In Smalltalk we have over a hundred basic collection
types to choose from. And the choice to subclass any
of them.

So when you address a problem in any given language the
available solutions must be constrained by whatever the
language in question offers. Wishing for non-existent
data structures that may exist elsewhere is simply to
request a new feature to be designed and programmed
in the language at hand. That may be the best solution
depending on the nature of the problem but we need to
recognise the nature of the request. It is still a new
feature.

Of course we can learn a great deal by comparing features
on one language against another but in terms of solving
a specific request we need specific answers too.


> ---
> The obvious solution:
> 
> keys = list(mydict.keys())
> random.shuffle(keys)
> index = len(keys)*3//4
> training_data = keys[:index]
> reserved = keys[index:]
> ---

> For me, it seems more natural in this case to simply have the data in a data
> frame where I have lots of tools and methods available. 

But only if such a data frame exists. In Python it does not
(at least, not in the standard library). So any reference
to such a non existent structure is in effect a work request
for someone to build one. To do so requires a specification
or design that the OP can follow or better still a prototypical
template. It also assumes a much higher level of skill than the
original request and the "obvious solution".

> but if your coding style is more comfortable with another way, why bother
> unless you are trying to learn other ways and be flexible.

If your current language does not support the structure
you desire you have three choices:
1) change your programming language or
2) build the missing feature or
3) find a workaround using the structures available.

Most people opt for #3. (Although all may be valid options
depending on the circumstance)


-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: [Tutor] decomposing a problem

2018-12-28 Thread Avi Gross
Steve,

I am going to just respond to one part of your message and will snip the
rest. I am not in disagreement with most of what you say and may simply
stress different aspects. I will say that unless I have reason to, I don't
feel a need to test speeds for an academic discussion. Had this been a real
project, sure. Even then, if it will need to run on multiple machines using
multiple incarnations of python, the results will vary, especially if the
data varies too. You suggest that discussions backed by real data are
better. Sure. But when a discussion is abstract enough, then I think it
perfectly reasonable to say "may be faster" to mean that until you try it,
there are few guarantees. Many times a method seems superior until you reach
a pathological case. One sorting algorithm is fast except when the data is
already almost fully sorted.

So why do I bother saying things like MAY? It seems to be impossible to
please everybody. There are many things with nuance and exceptions. When I
state things one way, some people (often legitimately) snipe. When I don't
insist on certainty, others have a problem with that. When I make it short, I
am clearly leaving many things out. When I go into as much detail as I am
aware of, I get feedback that it is too long or boring or it wanders too
much. None of this is a problem as much as a reality about tradeoffs.

So before I respond, here is a general statement. I am NOT particularly
interested in much of what we discuss here from a specific point of view.
Someone raises a question and I think about it. They want to know of a
better way to get a random key from a dictionary. My thought is that if I
needed that random key, maybe I would not have stored it in a dictionary in
the first place. But, given that the data is in a dictionary, I wonder what
could be done. It is an ACADEMIC discussion with a certain amount of hand
waving. Sometimes I do experiment and show what I did. Other times I say I
am speculating and if someone disagrees, fine. If they show solid arguments
or point out errors on my part or create evidence, they can change my mind. 

You (Steve) are an easy person to discuss things with but there are some who
are less. People who have some idea of my style and understand the kind of
discussion I am having at that point and who let me understand where they
are coming from, can have a reasonable discussion. The ones who act like TV
lawyers who hear that some piece of evidence has less than one in a
quadrillion chance of happening then say BUT THERE IS A CHANCE so reasonable
doubt ... are hardly worth debating.

You replied to one of my points with this about a way to partition data:

---
The obvious solution:

keys = list(mydict.keys())
random.shuffle(keys)
index = len(keys)*3//4
training_data = keys[:index]
reserved = keys[index:]
---

(In the above, "---" is not python but a separator!)

That is indeed a very reasonable way to segment the data. But it sort of
makes my point. If the data is stored in a dictionary, the way to access it
ended up being to make a list and play with that. I would still need to get
the values one at a time from the dictionary such as in the ways you also
show and I omit.

For me, it seems more natural in this case to simply have the data in a data
frame where I have lots of tools and methods available. Yes, underneath it
all providing an array of indices or True/False Booleans to index the data
frame can be slow but it feels more natural. Yes, python has additional
paradigms I may not have used in R such as list comprehensions and
dictionary comprehensions that are conceptually simple. But I did use the
R-onic (to coin a phrase nobody would ironically use) equivalents that can
also be powerful and I need not discuss here in a python list. Part of
adjusting to python includes unlearning some old habits and attitudes and
living off this new land. [[Just for amusement, the original R language was
called S so you might call its way of doing things Sonic.]]

I see a balance between various ways the data is used. Clearly it is
possible to convert it between forms and for reasonable amounts of data it
can be fast enough. But as you note, at some point you can just toss one
representation away so maybe you can not bother using that in the first
place. Keep it simple.

In many real life situations, you are storing many units of data and often
have multiple ways of indexing the data. There are representations that do
much of the work for you. Creating a dictionary where each item is a list or
other data structure can emulate such functionality and even have advantages
but if your coding style is more comfortable with another way, why bother
unless you are trying to learn other ways and be flexible.

As I have mentioned too many times, my most recent work was in R and I
sometimes delight and other times groan at the very different ways some
things are done when using specific modules or libraries. But even within a
language and 

Re: [Tutor] decomposing a problem

2018-12-28 Thread Mike Mossey

> On Dec 27, 2018, at 3:32 PM, Avi Gross  wrote:
> 
> [Mark Lawrence please press DELETE now in case the rest of this message is
> all about you.]
> [[If that is not working, if on Windows, try Control-ALT-DELETE as that will
> really get rid of my message.]]
> 


Hi Avi,

Mark doesn’t have a basis for complaining, of course, as he can simply not read 
your posts.


> Back to replying to Steven,
> 
> Of course I want to be corrected when wrong.
> 
> I think everyone here knows I tend to be quite expansive in my thoughts and
> sometimes to the point where they suggest I am free-associating. I am trying
> to get to the point faster and stay there.

Since you are expressing interest, I'll give some thoughts. 

I think it’s important not only for writing, but for economy of thinking to use 
fewer words and simpler concepts, and it can make us better programmers and 
teachers. 

Previously, when I worked alone as a programmer, I was stuck in overcomplicated 
ways of thinking. It’s “getting out there” and interacting with people that 
rejuvenated my thinking, and I’ll be forever grateful.

One form of practice at this is to edit my posts for brevity. Here’s a link 
about brevity in writing:

http://copymatter.com/embracing-brevity/

It helps me as well that I tutor students in math and computer science 
regularly, because it forces me to get more simple and concrete. A student is a 
“feedback device” — when I’m doing better, I can read the results in their 
expression and their understanding.

I think it’s important both to have something you are aiming for (a sense of 
what level of brevity you’d like to achieve) and a feedback mechanism that 
helps you to know if you are succeeding.

Take or leave these thoughts as you see fit.
Mike




Re: [Tutor] decomposing a problem

2018-12-27 Thread Avi Gross
[Mark Lawrence please press DELETE now in case the rest of this message is
all about you.]
[[If that is not working, if on Windows, try Control-ALT-DELETE as that will
really get rid of my message.]]

Back to replying to Steven,

Of course I want to be corrected when wrong.

I think everyone here knows I tend to be quite expansive in my thoughts and
sometimes to the point where they suggest I am free-associating. I am trying
to get to the point faster and stay there.

So if what I write is not wrong as a general point and you want to bring up
every exception, fine. I reserve the right not to follow you there,
especially not on the forum. I may continue a discussion with you in
private, of course.

I often have a problem in real life (not talking about you, let alone
whoever Mark is) where I think I said something clearly by using phrases
like "if" and find the other person simply acts as if I had left that out.
You know, we can go to the park IF it is not raining tomorrow. Reply is to
tell me the weather report says it will rain so why am I suggesting we go to
the park. Duh. I was not aware of the weather report directly BUT clearly
suggested it was a consideration we should look at before deciding. 

Now a more obvious error should be pointed out. EXAMPLE, I am driving to
Pennsylvania this weekend not far from a National Park and will have some
hours to kill. I suggested we might visit Valley Forge National Historic
Park and did not say only if it was open. Well, in the U.S. we happen to
have the very real possibility the Park will be closed due to it being
deemed optional during a so-called Government Shutdown so such a reply IS
reasonable. I did not consider that and stand corrected.

But Chris, you point out I reacted similarly to what you said. Indeed, you
said that sometimes we don't need to focus on efficiency as compared to
saying we should always ignore it or something like that. I think we
actually are in relative agreement in how we might approach a problem like
this. We might try to solve it in a reasonable way first and not worry at
first about efficiency especially now that some equipment runs so fast and
with so much memory that results appear faster than we can get to them. But,
with experience, and need, we may fine tune code that is causing issues. As
I have mentioned, I have applications that regularly need huge samples taken
at random so a list of millions being created millions of times and the
above being done thousands of times, adds up. Many cheaper methods might
then be considered including, especially, just switching to a better data
structure ONCE.

I will stop this message here as I suspect Mark is still reading and fuming.
Note, I do not intend to mention Mark again in future messages. I do not
actually want to annoy him and wish he would live and let live.

-----Original Message-----
From: Tutor  On Behalf Of
Steven D'Aprano
Sent: Thursday, December 27, 2018 5:38 PM
To: tutor@python.org
Subject: Re: [Tutor] decomposing a problem

On Wed, Dec 26, 2018 at 11:02:07AM -0500, Avi Gross wrote:

> I often find that I try to make a main point and people then focus on 
> something else, like an example.

I can't speak for others, but for me, that could be because of a number of
reasons:

- I agree with what you say, but don't feel like adding "I agree" 
after each paragraph of yours;

- I disagree, but can't be bothered arguing;

- I don't understand the point you intend to make, so just move on.

But when you make an obvious error, I tend to respond. This is supposed to
be a list for teaching people to use Python better, after all.


> So, do we agree on the main point that choosing a specific data structure or
> algorithm (or even computer language) too soon can lead to problems that can
> be avoided if we first map out the problem and understand it better?

Sure, why not? That's vague and generic enough that it has to be true.

But if it's meant as advice, you don't really offer anything concrete. 
How does one decide what is "too soon"? How does one avoid design 
paralysis?


> I do not concede that efficiency can be ignored because computers are fast.

That's good, but I'm not sure why you think it is relevant as I never 
suggested that efficiency can be ignored. Only that what people *guess* 
is "lots of data" and what actually *is* lots of data may not be the 
same thing.


> I do concede that it is often not worth the effort or that you can
> inadvertently make things worse and there are tradeoffs.

Okay.


> Let me be specific. The side topic was asking how to get a random key from
> an existing dictionary. If you do this ONCE, it may be no big deal to make a
> list of all keys, index it by a random number, and move on. I did supply a
> solution that might (or might not) run faster by using a generator to get one
> item at a time and stopping when found. Less space but not sure if less
> time.

Re: [Tutor] decomposing a problem

2018-12-27 Thread Steven D'Aprano
On Wed, Dec 26, 2018 at 11:02:07AM -0500, Avi Gross wrote:

> I often find that I try to make a main point and people then focus on
> something else, like an example.

I can't speak for others, but for me, that could be because of a number 
of reasons:

- I agree with what you say, but don't feel like adding "I agree" 
after each paragraph of yours;

- I disagree, but can't be bothered arguing;

- I don't understand the point you intend to make, so just move on.

But when you make an obvious error, I tend to respond. This is supposed 
to be a list for teaching people to use Python better, after all.


> So, do we agree on the main point that choosing a specific data structure or
> algorithm (or even computer language) too soon can lead to problems that can
> be avoided if we first map out the problem and understand it better?

Sure, why not? That's vague and generic enough that it has to be true.

But if it's meant as advice, you don't really offer anything concrete. 
How does one decide what is "too soon"? How does one avoid design 
paralysis?


> I do not concede that efficiency can be ignored because computers are fast.

That's good, but I'm not sure why you think it is relevant as I never 
suggested that efficiency can be ignored. Only that what people *guess* 
is "lots of data" and what actually *is* lots of data may not be the 
same thing.


> I do concede that it is often not worth the effort or that you can
> inadvertently make things worse and there are tradeoffs.

Okay.


> Let me be specific. The side topic was asking how to get a random key from
> an existing dictionary. If you do this ONCE, it may be no big deal to make a
> list of all keys, index it by a random number, and move on. I did supply a
> solution that might(or might not) run faster by using a generator to get one
> item at a time and stopping when found. Less space but not sure if less
> time.

Why don't you try it and find out?


> But what I often need to do is to segment lots of data into two piles. One
> is for training purposes using some machine learning algorithm and the
> remainder is to be used for verifications. The choice must be random or the
> entire project may become meaningless. So if your data structure was a
> dictionary with key names promptly abandoned, you cannot just call pop()
> umpteen times to get supposedly random results as they may come in a very
> specific order.

Fortunately I never suggested doing that.


> If you want to have 75% of the data in the training section,
> and 25% reserved, and you have millions of records, what is a good way to
> go? 

The obvious solution:

keys = list(mydict.keys())
random.shuffle(keys)
index = len(keys)*3//4
training_data = keys[:index]
reserved = keys[index:]

Now you have the keys split into training data and reserved data. To 
extract the value, you can just call mydict[some_key]. If you prefer, 
you can generate two distinct dicts:

training_data = {key: mydict[key] for key in training_data}

and similarly for the reserved data, and then mydict becomes redundant 
and you are free to delete it (or just ignore it).

Anything more complex than this solution should not even be attempted 
until you have tried the simple, obvious solution and discovered that it 
isn't satisfactory.

Keep it simple. Try the simplest thing that works first, and don't add 
complexity until you know that you need it.

By the way, your comments would be more credible if you had actual 
working code that demonstrates your point, rather than making vague 
comments that something "may" be faster. Sure, anything "may" be faster. 
We can say that about literally anything. Walking to Alaska from the 
southernmost tip of Chile while dragging a grand piano behind you "may" 
be faster than flying, but probably isn't. Unless you have actual code 
backing up your assertions, they're pretty meaningless.

And the advantage of working code is that people might actually learn 
some Python too.
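In that spirit, here are the snippets above assembled into one runnable sketch (the sample dictionary is the one used earlier in the thread):

```python
import random

mydict = {'spam': 25, 'ham': 2, 'eggs': 7, 'cheddar': 1, 'brie': 14,
          'aardvark': 3, 'argument': 11, 'parrot': 16}

# Shuffle the keys, then take the first 75% for training.
keys = list(mydict.keys())
random.shuffle(keys)
index = len(keys) * 3 // 4
training_keys = keys[:index]
reserved_keys = keys[index:]

# Build two independent dicts; mydict is then redundant.
training_data = {key: mydict[key] for key in training_keys}
reserved = {key: mydict[key] for key in reserved_keys}

print(len(training_data), len(reserved))  # 6 2
```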



-- 
Steve


Re: [Tutor] decomposing a problem

2018-12-27 Thread Steven D'Aprano
On Thu, Dec 27, 2018 at 07:03:18PM +, Mark Lawrence wrote:
> On 26/12/2018 00:00, Avi Gross wrote:
> >[Long enough that some should neither read nor comment on.]
> >
> 
> PLEASE GO AWAY YOU ARE REALLY IRRITATING.

People in glass houses...

Mark, you're not the arbiter of who is allowed to post here. You are 
being obnoxious. Please settle down and perhaps chill a bit. If you 
don't want to read Avi's posts, you know how to hit delete in your mail 
reader don't you?


-- 
Steve


Re: [Tutor] decomposing a problem

2018-12-27 Thread Mark Lawrence

On 26/12/2018 00:00, Avi Gross wrote:

[Long enough that some should neither read nor comment on.]



PLEASE GO AWAY YOU ARE REALLY IRRITATING.

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence



Re: [Tutor] decomposing a problem

2018-12-25 Thread Steven D'Aprano
On Tue, Dec 25, 2018 at 11:56:21PM -0500, Avi Gross wrote:

> I find that many people are fairly uncomfortable with abstraction and 
> tend to resist a pure top down approach by diving to any solutions 
> they may envision.

https://blog.codinghorror.com/it-came-from-planet-architecture/

> As someone asked on another python list, 
> is there a better way to get a random key for a dictionary. Well, not 
> easily without expanding all keys into a list of perhaps huge length. 

Define "better".

What do you value? Time, space, simplicity or something else?

One of the most harmful things to value is "cleverness" for its own 
sake. Some people tend to value a "clever" solution even when it wastes 
time, space and is over complex and therefore hard to maintain or debug.

Even when the practical response to the "clever" solution is "YAGNI".

What counts as "huge"? To me, picking a random key from a list of 100 
keys is "huge". Copy out 100 keys to a list by hand and then pick one? 
What a PITA that would be.

But to your computer, chances are that ten million keys is "small". One 
hundred million might be pushing "largish". A billion, or perhaps ten 
billion, could be "large". Fifty, a hundred, maybe even a thousand 
billion (a trillion) would be "huge".

Unless you expect to be handling at least a billion keys, there's 
probably no justification for anything more complex than:

random.choice(list(mydict.keys()))

Chances are that it will be faster *and* use less memory than any clever 
solution you come up with -- and even if it does use more memory, it 
uses it for a few milliseconds, only when needed, unlike a more complex 
solution that inflates the size of the data structure all the time, 
whether you need it or not.
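As a runnable sketch of that one-liner (the helper name random_key is mine,
not anything standard):

```python
import random

def random_key(d):
    # Materialise the keys once; O(n) time and memory, but that is
    # cheap for anything short of hundreds of millions of keys.
    return random.choice(list(d.keys()))

mydict = {ch: ord(ch) for ch in "abcdef"}
print(random_key(mydict) in mydict)  # True
```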

Of course there may be use-cases where we really do need a more complex, 
clever solution, and are willing to trade off space for time (or 
sometimes time for space). But chances are YAGNI.


> Followed by a search of much of that list to get the nth index.

That's incorrect. Despite the name, Python lists aren't linked lists[1] 
where you have to traverse N items to get to the Nth item. They're 
arrays, where indexing requires constant time.


[...]
> If they keep demanding one function to master all, you can end up with 
> fairly awful spaghetti code.

https://en.wikipedia.org/wiki/God_object




[1] Technically speaking, this is not a requirement of the language, 
only a "quality of implementation" question. A Python interpreter could 
offer built-in lists using linked lists under the hood, with O(N) 
indexing. But all the major implementations -- CPython, Stackless, PyPy, 
Jython, IronPython, Cython, Nuitka, even (I think) MicroPython -- use 
arrays as the list implementation. Given how simple arrays are, I think 
it is fair to assume that any reasonable Python interpreter will do the 
same.



-- 
Steve


Re: [Tutor] decomposing a problem

2018-12-25 Thread Avi Gross
[REAL SUBJECT: What's this?]

Steven,

I am afraid you are right. I was not selfish enough about this. I have done
object-oriented programming in many other languages and I am afraid today it
showed. Think C++ or Java. Part of me continues to think in every language I
ever used, including human languages. So since the name of this variable is
a suggestion, it was not enforced by the interpreter and I was not reminded.

Be happy I even used an English word and not something like idempotent or
eponymous.
P.S. just to confuse the issue, some in JavaScript confusingly use both this
and self near each other.
P.P.S. Please pardon my puns, especially the ones you did not notice.

-Original Message-
From: Tutor  On Behalf Of
Steven D'Aprano
Sent: Tuesday, December 25, 2018 11:39 PM
To: tutor@python.org
Subject: Re: [Tutor] decomposing a problem

On Tue, Dec 25, 2018 at 10:25:50PM -0500, Avi Gross wrote:

> class chainable_list(list):
>     """Same as list but sort() can now be chained"""
>     def chainsort(this, *args, **kwargs):
>         this.sort(*args, **kwargs)
>         return this

In Python, it is traditional to use "self" rather than "this" as the
instance parameter.

Using "this" is not an error, but you can expect a lot of strange looks. 
Like a Scotsman in a kilt wandering down the middle of Main Street,
Pleasantville USA.



-- 
Steve


Re: [Tutor] decomposing a problem

2018-12-25 Thread Avi Gross
Mike,

Excellent advice.

I find that many people are fairly uncomfortable with abstraction and tend to 
resist a pure top down approach by diving to any solutions they may envision. 
For example, if you say things like create a data structure that can hold as 
many kinds of information as will be needed. The data should be able to be 
viewed in several ways and adding a new item should be fast even if the number 
of items grows large ...

Some will have stopped reading (or creating) and will jump to deciding they
need a dictionary. Others may want a deque. Some may insist they need a new 
class. 

But wait, if you continue reading or designing, it may be clear that some 
choices are not optimal. Heck, it may turn out some design elements are 
contradictory. As someone asked on another python list, is there a better way 
to get a random key for a dictionary. Well, not easily without expanding all 
keys into a list of perhaps huge length. Followed by a search of much of that 
list to get the nth index. So maybe a plain dictionary does not make that easy 
or efficient so do you give up that need or use some other data structure that 
makes that fast? Perhaps you need a hybrid data structure. One weird idea is to 
use the dictionary but every time you generate a new key/value pair you also 
store a second pair that looks like "findkey666": key so that a random key of 
the first kind can be found in constant time by picking a random number up to 
half the number of items, concatenate it to "findkey" and look up the value 
which is a key.
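A rough sketch of that weird idea (the class name RandomKeyDict and the
"findkey" prefix are mine, deletions are deliberately not handled, and the
alias pairs pollute the dict itself, which is part of what makes it weird):

```python
import random

class RandomKeyDict(dict):
    """Hybrid dict: alongside each real key/value pair, store a
    numbered alias ("findkey0", "findkey1", ...) whose value is the
    real key, so a random real key is found in constant time."""

    def __init__(self):
        super().__init__()
        self._count = 0  # number of real pairs stored so far

    def add(self, key, value):
        self[key] = value
        self["findkey%d" % self._count] = key  # alias -> real key
        self._count += 1

    def random_key(self):
        # Pick a random alias index, then follow it to the real key.
        i = random.randrange(self._count)
        return self["findkey%d" % i]

d = RandomKeyDict()
for word in ["spam", "eggs", "ham"]:
    d.add(word, len(word))
print(d.random_key())  # one of 'spam', 'eggs', 'ham'
```

The price, of course, is doubled storage and a dict whose len() and
iteration no longer mean what a reader expects.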

When you try to work bottom up with students, some see no point as they are 
missing the big picture. I used to work during graduate school writing PASCAL 
code for a company making flexible manufacturing systems and my job often was 
to read a man page describing some function that did something minor. I often 
had no clue why it was needed or where it would be used. I was sometimes told 
it had to FIT into a certain amount of memory because of the overlay technique 
used and, if it was compiled to something larger, was asked to break the 
function down into multiple functions that were called alternately. 
Sometimes an entire section had to be redesigned because it had to fit into the 
same footprint as another. That was the limit of the big picture. A shadow!

What I found works for me is a combination. I mean teaching. You give them just 
enough of the top-down view for motivation. Then you say that we need to figure 
out what kinds of things might be needed to support the functionality. This 
includes modules to import as well as objects or functions to build. But that 
too can be hard unless you move back into the middle and explain a bit about 
the subunit you are building so you know what kind of support it needs closer 
to the bottom.

I admit that my personal style is the wrong one for most people. I do top down 
and bottom up simultaneously as well as jump into the middle to see both ways 
to try to make sure the parts will meet fairly seamlessly. Does not always work.

How often have we seen a project where some function is designed with three 
arguments? Much later, you find out some uses of the function need only two, 
while others may have additional arguments, perhaps to pass along to yet 
another function that the second will conditionally invoke. It may turn out that the 
bottom up approach starting from one corner assumed that the function would 
easily meet multiple needs when the needs elsewhere are not identical enough. 
If they keep demanding one function to master all, you can end up with fairly 
awful spaghetti code. Of course python is not a compiled language like C/C++ 
and PASCAL and many others were. It is often fairly easy in python to have a 
variable number of arguments or for the same function to do something 
reasonable with multiple types and do something reasonable for each.

One thing I warn people about is mission creep. When asked to do something, try 
not to add lots of nice features at least until you have developed and tested 
the main event. I have seen many projects that did feel the need to add every 
feature they could imagine as there remained keys on the keyboard that did not 
yet invoke some command, even if no customer ever asked for it or would ever 
use it. Amazing how often these projects took too long and came to market too 
late to catch on ...

Some of the people asking questions here do not even tell us much about what is 
needed, let alone their initial design plan. It can take multiple interactions 
back and forth and I wonder how many give up long before as they just want an 
ANSWER. 

In case you wonder, I am reliably told the answer to life, the universe and 
everything is 2*21.

-Original Message-
From: Mike Mossey  
Sent: Tuesday, December 25, 2018 9:49 PM
To: Avi Gross 
Subject: Re: [Tutor] decomposing a problem


> On Dec 25, 2018, at 4:00 PM, Avi Gross  wrote:
> 
> 

Re: [Tutor] decomposing a problem

2018-12-25 Thread Steven D'Aprano
On Tue, Dec 25, 2018 at 10:25:50PM -0500, Avi Gross wrote:

> class chainable_list(list):
>     """Same as list but sort() can now be chained"""
>     def chainsort(this, *args, **kwargs):
>         this.sort(*args, **kwargs)
>         return this

In Python, it is traditional to use "self" rather than "this" as the 
instance parameter.

Using "this" is not an error, but you can expect a lot of strange looks. 
Like a Scotsman in a kilt wandering down the middle of Main Street, 
Pleasantville USA.



-- 
Steve


Re: [Tutor] decomposing a problem

2018-12-25 Thread Avi Gross
Alan,

Your thoughts were helpful and gave me a hint.

Just an idea. What if you sub-classed an object type like list with a name
like chainable_list?

For most things it would be left alone. But if you isolated specific named
methods like sort() and reverse() you could over-ride them with the same
name or a new name.

If you override the function, you need to call list.sort() with whatever
arguments you had passed and then return this. If you choose a new name,
call this.sort() and then return this.

I tried it and it seems to work fine when I use a new name:

"""Module to create a version of list that is more chainable"""

class chainable_list(list):
    """Same as list but sort() can now be chained"""
    def chainsort(this, *args, **kwargs):
        this.sort(*args, **kwargs)
        return this

Here it is on a list of ints:

>>> testink = chainable_list([3,5,1,7])
>>> testink
[3, 5, 1, 7]
>>> testink.chainsort()
[1, 3, 5, 7]
>>> testink.chainsort(reverse=True)
[7, 5, 3, 1]

Here it is on a list of strings that sort differently unless coerced back
into an int to show keyword arguments are passed:

>>> testink = chainable_list(["3","15","1","7"])
>>> testink.chainsort()
['1', '15', '3', '7']
>>> testink.chainsort(reverse=True)
['7', '3', '15', '1']
>>> testink.chainsort(key=int,reverse=True)
['15', '7', '3', '1']

I then tested the second method using the same name but asking the original
list sort to do things:

"""Module to create a version of list that is more chainable"""

class chainable_list(list):
    """Same as list but sort() can now be chained"""
    def sort(this, *args, **kwargs):
        list.sort(this, *args, **kwargs)
        return this

>>> testink = chainable_list(["3","15","1","7"])
>>> testink.sort()
['1', '15', '3', '7']
>>> testink.sort().sort(reverse=true)
Traceback (most recent call last):
  File "", line 1, in 
testink.sort().sort(reverse=true)
NameError: name 'true' is not defined
>>> testink.sort().sort(reverse=True)
['7', '3', '15', '1']
>>> testink.sort().sort(reverse=True).sort(key=int)
['1', '3', '7', '15']

Again, it works fine. So if someone did something similar to many of the
methods that now return None, you could use the new class when needed.

This seems too simple so it must have been done. Obviously not in the
standard distribution but perhaps elsewhere. And, no, I do not expect a
method like pop() to suddenly return the list with a member dropped but it
would be nice to fix some like this one:

>>> testink.remove('7')
>>> testink
['1', '3', '15']

Meanwhile, I hear Beethoven is decomp..., well never mind! It was probably
Liszt!

-Original Message-
From: Tutor  On Behalf Of
Alan Gauld via Tutor
Sent: Tuesday, December 25, 2018 8:06 PM
To: tutor@python.org
Subject: Re: [Tutor] decomposing a problem

On 26/12/2018 00:00, Avi Gross wrote:

> great. Many things in python can be made to fit and some need work. 
> Dumb example is that sorting something internally returns None and not 
> the object itself.

This is one of my few complaints about Python.
In Smalltalk the default return value from any method is self. In Python it
is None.

self allows chaining of methods, None does not.
Introducing features like reversed() and sorted() partially addresses the
issue but leads to inconsistent and ugly syntax.

Smalltalk uses this technique so much it has its own code layout idiom
(Pythonised as
follows):

object
   .method1()
   .method2()
   .method3()
   
   .lastone()

We can do this with some methods but not all.
And of course methods that return a different type of value require careful
handling (eg. an
index() call in the middle of a set of list operations means the subsequent
methods are being called on an int not a list - which if handled correctly
can be confusing and if not handled correctly produces errors! (The
idiomatic way says don't chain with methods not returning self!)

In practice I (and the Smalltalk community) don't find that an issue in real
world usage, but it may have been why Guido chose not to do it that way.
But I still curse the decision every time I hit it!

But as I said, it's about the only thing in Python I dislike... a small
price to pay.

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: [Tutor] decomposing a problem

2018-12-25 Thread Cameron Simpson

On 26Dec2018 01:06, Alan Gauld  wrote:

On 26/12/2018 00:00, Avi Gross wrote:
great. Many things in python can be made to fit and some need work. 
Dumb example is that sorting something internally returns None and 
not the object itself.


This is one of my few complaints about Python.
In Smalltalk the default return value from
any method is self. In Python it is None.
self allows chaining of methods, None does not.

[...]

Smalltalk uses this technique so much it has
its own code layout idiom (Pythonised as
follows):

object
  .method1()
  .method2()
  .method3()
  
  .lastone()


While I see your point, the Python distinction is that methods returning 
values tend to return _independent_ values; the original object is not 
normally semantically changed. As you know.


To take the builtin sorted() example, let us suppose object is a 
collection, such as a list. I would not want:


 object.sort()

to return the list because that method has a side effect on object.

By contrast, I'd be happy with a:

 object.sorted()

method returning a new list because it hasn't changed object, and it 
returns a nice chaining-capable object for continued use.


But that way lies a suite of doubled methods for most classes: one to 
apply some operation to an object, modifying it, and its partner to 
produce a new object (normally of the same type) being a copy of the 
first object with the operation applied.


To me it is the side effect on the original object which weighs against 
modification methods returning self.


Here's a shiny counter example for chaining.

   thread1:
 print(object.sort())
   thread2:
 print(object.sort(reverse=True))

The above employs chained in-place methods (imagine sort() returning 
self), and they conflict: both threads mutate the same list. When methods 
return a copy the above operation is, loosely speaking, safe:


   thread1:
 print(sorted(object))
   thread2:
 print(sorted(object,reverse=True))

Cheers,
Cameron Simpson 


Re: [Tutor] decomposing a problem

2018-12-25 Thread Steven D'Aprano
On Wed, Dec 26, 2018 at 01:06:04AM +, Alan Gauld via Tutor wrote:

> In Smalltalk the default return value from
> any method is self. In Python it is None.
> 
> self allows chaining of methods, None does not.


You might be interested in this simple recipe for retrofitting method 
chaining onto any class:

http://code.activestate.com/recipes/578770-method-chaining-or-cascading/
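In the same spirit (a sketch of the idea, not the recipe's own code): wrap
an object so that any method returning None hands back the wrapper again,
which makes chaining possible:

```python
class Chained:
    """Wrap an object so that methods which return None (the usual
    mutate-in-place convention) hand back the wrapper, enabling chains."""

    def __init__(self, obj):
        self.obj = obj

    def __getattr__(self, name):
        attr = getattr(self.obj, name)
        if not callable(attr):
            return attr

        def method(*args, **kwargs):
            result = attr(*args, **kwargs)
            # None signals "changed in place": keep chaining on the wrapper.
            return self if result is None else result

        return method

data = Chained([3, 1, 2])
print(data.sort().append(0).obj)  # [1, 2, 3, 0]
```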


-- 
Steve


Re: [Tutor] decomposing a problem

2018-12-25 Thread Alan Gauld via Tutor
On 26/12/2018 00:00, Avi Gross wrote:

> great. Many things in python can be made to fit and some need work. Dumb
> example is that sorting something internally returns None and not the object
> itself. 

This is one of my few complaints about Python.
In Smalltalk the default return value from
any method is self. In Python it is None.

self allows chaining of methods, None does not.
Introducing features like reversed() and sorted()
partially addresses the issue but leads to
inconsistent and ugly syntax.

Smalltalk uses this technique so much it has
its own code layout idiom (Pythonised as
follows):

object
   .method1()
   .method2()
   .method3()
   
   .lastone()

We can do this with some methods but not all.
And of course methods that return a different
type of value require careful handling (eg. an
index() call in the middle of a set of list
operations means the subsequent methods are
being called on an int not a list - which if
handled correctly can be confusing and if
not handled correctly produces errors! (The
idiomatic way says don't chain with methods
not returning self!)

In practice I (and the Smalltalk community) don't
find that an issue in real world usage, but it
may have been why Guido chose not to do it that way.
But I still curse the decision every time I hit it!

But as I said, it's about the only thing in Python
I dislike... a small price to pay.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




[Tutor] decomposing a problem

2018-12-25 Thread Avi Gross
[Long enough that some should neither read nor comment on.]

Mats raised an issue that I think does relate to how to tutor people in
python.

The issue is learning how to take a PROBLEM to solve that looks massive and
find ways to look at it as a series of steps where each step can be easily
solved using available tools and techniques OR can recursively be decomposed
into smaller parts that can. Many people learn to program without learning
first how to write down several levels of requirements that spell out how
each part of the overall result needs to look and finally how each part will
be developed and tested. I worked in organizations with a division of labor
to try to get this waterfall method in place. At times I would write
higher-level architecture documents followed by Systems Engineering
documents and Developer documents and Unit Test and System Test and even
Field Support. The goal was to move from abstract to concrete so that the
actual development was mainly writing fairly small functions, often used
multiple times,  and gluing them together.

I looked back at the kind of tools used in UNIX and realize how limited they
were relative to what is easily done in languages like python especially
given a huge tool set you can import. The support for passing the output of
one program to another made it easy to build pipelines. You can do that in
python too but rarely need to.

And I claim there are many easy ways to do things even better in python.

Many UNIX tools were simple filters. One would read a file or two and pass
through some of the lines, perhaps altered, to the standard output. The next
process in the pipeline would often do the same, with a twist and sometimes
new lines might even be added. The simple tools like cat and grep and sed
and so on loosely fit the filter analogy. They worked on a line at a time,
mostly. The more flexible tools like AWK and PERL are frankly more like
Python than the simple tools.

So if you had a similar task to do in python, is there really much
difference? I claim not so much.

Python has quite a few ways to do a filter. One simple one is a list
comprehension and its relatives. Other variations are the map and filter
functions and even reduce. Among other things, they can accept a list of
lines of text and apply changes to them or just keep a subset or even
calculate a result from them.
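For instance, here is the same keep-and-transform step written as a list
comprehension and as map/filter, plus a reduce for the "calculate a result"
case (the sample lines are made up):

```python
from functools import reduce

lines = ["keep this one", "drop that", "this too", "not me"]

# The same filter-and-transform expressed two ways:
kept_comp = [line.upper() for line in lines if "this" in line]
kept_mapfil = list(map(str.upper, filter(lambda s: "this" in s, lines)))
print(kept_comp)  # ['KEEP THIS ONE', 'THIS TOO']

# reduce boils the whole list down to one value, e.g. total length:
total = reduce(lambda acc, line: acc + len(line), lines, 0)
print(total)  # 36
```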

Let me be concrete. You have a set of lines to process. You want to find all
lines that pass through a gauntlet, perhaps with changes along the way.

So assume you read an entire file (all at once at THIS point) into a list of
lines.

stuff = open(...).readlines()

Condition 1 might be to keep only lines that had some word or pattern in
them. You might have used sed or grep in the UNIX shell to specify a fixed
string or pattern to search for.

So in python, what might you do? Since stuff is a list, something like a
list comprehension can handle many such needs. For a fixed string like
"this" you can do something like this.

stuff2 = [some_function(line) for line in stuff if some_condition(line)]

The condition might be: "this" in line
Or it might be that the line ends with some phrase.
Or it might be a regular expression type search.
Or it might be the length is long enough or the number of words short
enough. Every such condition can be some of the same things used in a UNIX
pipeline or brand new ideas not available there like does a line translate
into a set of numbers that are all prime!

And, the function applied to what is kept can be to transform it to
uppercase, or replace it with something else looked up in a dictionary and
so on. You might even be able to apply multiple filters with each step.
Python allows phrases like line.strip().upper() and conditions like: this or
(that and not something_else)
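A few of those conditions spelled out on some made-up lines, each one a
candidate for the some_condition() slot in the comprehension above:

```python
import re

stuff = ["alpha this line\n", "omega ending\n", "short\n",
         "this one ends here\n"]

# Fixed substring, suffix test, regular expression, and length test:
has_word = [ln for ln in stuff if "this" in ln]
ends_with = [ln for ln in stuff if ln.rstrip().endswith("ending")]
regex_hit = [ln for ln in stuff if re.search(r"^this\b", ln)]
long_lines = [ln for ln in stuff if len(ln) > 10]

# A transformation combined with a condition, grep-and-sed in one line:
shouted = [ln.strip().upper() for ln in stuff if "this" in ln]
print(shouted)  # ['ALPHA THIS LINE', 'THIS ONE ENDS HERE']
```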

The point is a single line like the list comprehension above may already do
what a pipeline of 8 simple commands in UNIX did, and more.

Some of the other things UNIX tools did might involve taking a line and
breaking it into chunks such as at a comma or tab or space and then keeping
just the third and fifth and eighth but in reverse order. We sometimes used
commands like cut or very brief AWK scripts to do that. Again, this can be
trivial to do in python. Built in to character strings are functions that
let you split a line like the above into a list of fields on a separator and
perhaps rearrange and even rejoin them. In the above list comprehension
method, if you are expecting eight regions that are comma separated

>>> line1 = "f1,f2,f3,f4,f5,f6,f7,f8"
>>> line2 = "g1,g2,g3,g4,g5,g6,g7,g8"
>>> lines=[line1, line2]
>>> splitsville = [line.split(',') for line in lines]
>>> splitsville
[['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8'], ['g1', 'g2', 'g3', 'g4',
'g5', 'g6', 'g7', 'g8']]
>>> items8_5_3 = [(h8, h5, h3) for (h1,h2,h3,h4,h5,h6,h7,h8) in splitsville]
>>> items8_5_3
[('f8', 'f5', 'f3'), ('g8', 'g5', 'g3')]

Or if you want them back as character with an