On Sun, Oct 3, 2010 at 6:37 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> On Mon, 4 Oct 2010 08:33:07 am David Hutto wrote:
>> I'm creating an app that charts/graphs data. The mapping of the
>> graphs is the 'easy' part with matplotlib
>> and wx. My question relates to the alignment of the data to be
>> processed.
>>
>> Let's say I have three sets of 24 hr graphs with the same time steps:
>>
>> - the position of the sun
>> - the temp.
>> - local powerplant energy consumption
>>
>>
>> A human could perceive the relation that when it's wintertime, cold,
>> and the sun goes down, heaters are turned on
>> and energy consumption goes up, and the opposite in summer when
>> the sun comes up.
>> My problem is how to compare and make the program perceive the
>> relation.
>
> This is a statistics problem, not a programming problem. Or rather,
> parts of it *use* programming to solve the statistics problem.
>
> My statistics isn't good enough to tell you how to find correlations
> between three independent variables, but I can break the problem into a
> simpler one: find the correlation between two variables, temperature
> and energy consumption.
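
A minimal sketch of that two-variable starting point, assuming the
temperature and energy readings are already aligned on the same time
steps in two equal-length lists. The function name pearson_r and the
sample numbers are made up for illustration; they are not real
measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical readings on matching time steps (illustrative only).
temperature = [2.0, 1.5, 1.0, 3.0, 6.0, 9.0, 11.0, 8.0]
energy = [9.1, 9.6, 9.9, 8.4, 6.9, 5.2, 4.8, 6.0]

r = pearson_r(temperature, energy)
print("r = %.3f" % r)
```

A value near -1 here would mean energy use falls as temperature rises,
which is the winter heating pattern described above.
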
This was my initial starting point, but I thought that comparing
multiple variables should set the tone for how the data is interpreted.
You're right, though: I should start with two, and then relate one
relationship to another within the two-object comparison structure.

So if x and y are compared and found to be related, then it makes sense
that if x and b are also compared and found to be related, b and y are
related in some way, because they have x in common in terms of the
two-object comparisons.

Or (see below for the comparative, percentage-based statistical
analysis algorithm):

x and y = related
x and b = related
so:
y and b = related

But that raises the question of where the comparisons should start and
end, given what I ultimately want from the program. In other words, do
I compare the end object to all the others, or do I do random two-list
comparisons, match the data over corresponding timesteps, and then
narrow down the list of comparable pairs based on a hierarchy of
matches?

If x relates to y and x relates to b, I have (really rough pseudocode,
for lack of time):

list1 = ['+', '+', '+', '+', '-', '+', '-', '+', '-', '+']
list2 = ['-', '+', '+', '+', '-', '+', '-', '+', '-', '+']

Above I have a 90% match in timestep increments/decrements: over, say,
one-minute periods, x and y both increased or decreased together 90% of
the time (or, in the opposite case, they diverged 90% of the time).

>
> Without understanding how the data was generated, I'm not entirely sure
> how to set the data up, but here's one approach:
>
> (1) Plot the relationship between:
>     x = temperature
>     y = power consumption
>
>     where x is the independent variable and y is the dependent variable.
>
> (2) Look at the graph. Can you see any obvious pattern? If all the data
>     points are scattered randomly around the graph, then you can be
>     fairly sure that there is no correlation and you can go straight on
>     to calculating the correlation coefficient to make sure.
>
> (3) But if the graph clearly appears to be made of separate sections,
>     AND those sections correlate to physical differences due to the time
>     of day (position of the sun), then you need to break the data set
>     into multiple data sets and work on each one individually.
>
>     E.g. if the graph forms a straight line pointing DOWN for the hours
>     11pm to 5am, and a straight line pointing UP for the hours 5am
>     to 11pm, and you can think of a physical reason why this is
>     plausible, then you would be justified in separating out the data
>     into two independent sets: 5am-11pm, 11pm-5am.
>
>     If you want to have the program do this part for you, this is a VERY
>     hard problem. You're essentially wanting to write an artificial
>     intelligence system capable of picking out statistical correlations
>     from data. Such software does exist. It tends to cost hundreds of
>     thousands of dollars, or millions. Good luck writing your own!
>
> (4) Otherwise feel free to simplify the problem by just investigating
>     the relationship between temperature and power consumption during
>     (say) daylight hours.
>
> (5) However you decide to proceed, you should now have one (or more) x-y
>     graphs. First step is to decide whether there is any correlation at
>     all. If there is not, you can stop there. Calculate the correlation
>     coefficient, r. r will be a number between -1 and 1. r=1 means a
>     perfect positive correlation; r=-1 means a perfect negative
>     correlation. r=0 means no linear correlation at all.
>
> (6) Decide whether the correlation is meaningful. I don't remember how
>     to do this -- consult your statistics text books. If it's not
>     meaningful, then you are done -- there's no statistically valid
>     relationship between the variables.
>
> (7) Otherwise, you want to calculate the line of best fit (or possibly
>     some other curve, but let's stick to straight lines for now) for the
>     data.
>     The line of best fit may be complicated to calculate, and it
>     may not be justified statistically, so start off with something
>     simpler which (hopefully!) is nearly as good -- a linear regression
>     line. This calculates a line that statistically matches your data.
>
> (8) Technically, you can calculate a regression line for *any* data,
>     even if it clearly doesn't form a line. That's why you are checking
>     the correlation coefficient to decide whether it is sensible or not.
>
>
> By now any *real* statisticians reading this will be horrified :) What
> I've described is essentially the most basic, "Stats 101 for Dummies"
> level.
>
> Have fun!
>
>
>
> --
> Steven D'Aprano
> _______________________________________________
> Tutor maillist - tu...@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
>
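
For what it's worth, here is a rough sketch of the +/- comparison from
my reply, together with the straight-line fit from step (7). It assumes
the + and - marks are stored as strings and that both series share the
same timesteps. The names step_signs, match_percent and fit_line are
invented for this example, and the fit is plain least squares, which
may or may not be the right model for the real data.

```python
def step_signs(values):
    """Return '+', '-' or '0' for each change between consecutive readings."""
    signs = []
    for prev, cur in zip(values, values[1:]):
        if cur > prev:
            signs.append('+')
        elif cur < prev:
            signs.append('-')
        else:
            signs.append('0')
    return signs

def match_percent(signs_a, signs_b):
    """Percentage of timesteps on which both series moved the same way."""
    if len(signs_a) != len(signs_b):
        raise ValueError("sign lists must have the same length")
    if not signs_a:
        raise ValueError("sign lists must not be empty")
    matches = sum(1 for a, b in zip(signs_a, signs_b) if a == b)
    return 100.0 * matches / len(signs_a)

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of 2+ points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# The two sign lists from the pseudocode, written as strings.
list1 = ['+', '+', '+', '+', '-', '+', '-', '+', '-', '+']
list2 = ['-', '+', '+', '+', '-', '+', '-', '+', '-', '+']
print(match_percent(list1, list2))  # 90.0

# Made-up points lying exactly on y = 2x + 1, to check the fit.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

A match near 0% would be as interesting as one near 100%, since it
means the two series move in opposite directions almost every step.
The sign comparison ignores how large each change is, which is why the
correlation coefficient and the fitted line are still worth computing.
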