Hal,

> For one dimension, you sort, compute the average, then compute the distance
> of the first and last samples from the average. Discard the one that is
> farther from the average.
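For reference, Hal's one-dimensional procedure quoted above could be sketched like this (not from the original thread; the function name is mine):

```python
def discard_farthest_endpoint(data):
    """Sort, compute the average, then drop whichever of the first/last
    samples is farther from that average (Hal's 1-D method as quoted)."""
    s = sorted(data)
    avg = sum(s) / len(s)
    # compare the distances of the two extreme samples to the mean
    if abs(s[0] - avg) > abs(s[-1] - avg):
        return s[1:]   # smallest sample is the worse outlier
    return s[:-1]      # largest sample is the worse outlier
```

Note that this discards exactly one sample per pass, and the mean it compares against is itself pulled toward the outlier, which is one motivation for the robust measures below.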
....Well, it may work... A method of outlier search in one dimension that has worked very well for me over the last few years is the following (it is not my idea; it comes from an article covering robust statistics):

First you have to understand that the usual arithmetic average and the standard deviation are measures that are NOT robust against outliers, and that you need to substitute robust measures for them when you need their functionality.

1) Sort the data in ascending order.
2) Find the "center" of the sorted data, i.e. the data value where 50% of all values are greater than or equal to it and the other 50% are less than or equal to it. This value is called the "median" or "50% percentile". Think of it as a substitute for the average that is VERY robust.
3) Now (similar to the standard deviation) compute the absolute values of the differences between each data point and the median.
4) Again sort the resulting values in ascending order and find their median.
5) What you have now is the median deviation of the data from the original median, and it is a very robust measure of the width of the distribution. There is even a "norming" factor (which I do not remember because I do not need it) that makes this number directly comparable to the standard deviation of (outlier-free) data.

About 99.7% of all data from a Gaussian distribution lie inside +/- 3 sigma, so if a data value is outside, say, +/- 5 median deviations, then it is very likely an outlier.

However, what you really want is an outlier-free average value. The median itself is a single data value containing all the noise that you want to average out. For this purpose robust statistics offers a different (but similar) tool: the IQR (Interquartile Range). The algorithm is:

1) Sort the data in ascending order.
2) Find the median of the data (the 50% percentile), but in addition also find the 25% percentile and the 75% percentile.
3) Now you have 4 groups (quartiles) of data, divided by the 3 percentiles.
Ignore the outer quartiles (where the outliers are located) and compute the arithmetic average over the two inner quartiles, which are free of outliers as long as at least 50% of all data are NOT outliers. The IQR is a robust compromise between outlier removal and noise removal.

For the two-dimensional case I would suggest the following:

1) For all computations, keep the index of each data point with you so that a data point can be identified later.
2) Sort the data in ascending order separately for the two dimensions.
3) Identify the inner quartiles separately for the two dimensions.
4) Now search for indices that are contained in BOTH inner quartiles, i.e. data that has NOT been sorted out as an outlier in either dimension.
5) Compute the arithmetic average over the data points found in 4).

Best regards
Ulrich

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Hal Murray
> Sent: Wednesday, 9 December 2009 11:53
> To: [email protected]
> Subject: [time-nuts] Discarding outliers in two dimensions
>
> Suppose I want to average a bunch of samples. Sometimes it helps to discard
> the outliers. I think that helps when there are two noise mechanisms, say
> the typical Gaussian plus sometimes some other noise added on. If the other
> noise is rare but large, those occasional samples can have a big influence on
> the average. So discarding those outliers gives better results, for some
> value of "better".
>
> I know how to do it in one dimension. How do I do it in two dimensions?
>
> Say I have a lot of samples from a GPS system and I want to compute the best
> position to use when shifting into timing mode.
>
> For one dimension, you sort, compute the average, then compute the distance
> of the first and last samples from the average. Discard the one that is
> farther from the average.
>
> The problem with two dimensions is I don't know how to sort.
>
> Let's ignore efficiency.
> I can compute the average without sorting. I can
> scan the whole list looking for the one that is farthest (radial distance)
> from the average. Does that work (and do what I want)? (I think so, but I'm
> not sure.)
>
> Is there a way to do that efficiently?
>
> --
> These are my opinions, not necessarily my employer's. I hate spam.

_______________________________________________
time-nuts mailing list -- [email protected]
To unsubscribe, go to
https://www.febo.com/cgi-bin/mailman/listinfo/time-nuts
and follow the instructions there.
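P.S. (editor's sketch, not from the original thread): the three procedures described in the reply above — the median/MAD outlier test, the IQR average, and the two-dimensional index-intersection version — can be sketched in Python as follows. Function names are mine, the k=5 threshold is the "+/- 5 median deviations" rule of thumb from the post, and the 1.4826 figure in the comment is the commonly quoted Gaussian-consistency constant (the "norming factor" the author mentions but does not state):

```python
import statistics

def mad_outliers(data, k=5.0):
    """Return points farther than k median-deviations from the median
    (steps 1-5 of the first algorithm above)."""
    med = statistics.median(data)
    # median of absolute deviations from the median ("median deviation")
    mad = statistics.median(abs(x - med) for x in data)
    # note: 1.4826 * mad is the commonly quoted robust estimate of sigma
    # for Gaussian data (the "norming factor" mentioned in the post)
    return [x for x in data if abs(x - med) > k * mad]

def inner_quartile_indices(values):
    """Indices of the points in the two inner quartiles, i.e. between the
    25% and 75% percentiles. Keeping indices (not values) is what makes
    the 2-D intersection step possible."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    # drop roughly the lowest and highest quarter; exact quartile boundary
    # handling varies by convention, this n//4 cut is one simple choice
    q = n // 4
    return set(order[q:n - q])

def iqr_mean(values):
    """Arithmetic average over the two inner quartiles (the IQR average)."""
    keep = inner_quartile_indices(values)
    return sum(values[i] for i in keep) / len(keep)

def iqr_mean_2d(xs, ys):
    """Two-dimensional version: average only the points whose index
    survives the inner-quartile cut in BOTH dimensions."""
    keep = inner_quartile_indices(xs) & inner_quartile_indices(ys)
    n = len(keep)
    return (sum(xs[i] for i in keep) / n, sum(ys[i] for i in keep) / n)

# small demo: one gross outlier in x, none in y
xs = [1.0, 1.1, 0.9, 1.05, 0.95, 50.0]
ys = [2.0, 2.1, 1.9, 2.05, 1.95, 2.02]
print(mad_outliers(xs))   # flags the 50.0 sample
print(iqr_mean_2d(xs, ys))  # roughly (1.0, 2.0)
```

Because the quartile cut is applied per dimension and then intersected on indices, a sample rejected in either coordinate is excluded from both averages, which matches step 4) of the two-dimensional suggestion above.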
