[computer-go] OT: median of a data stream

2007-08-16 Thread Darren Cook
Apologies for the off-topic post, but I know lots of people here are
interested in statistics and algorithms.

Calculating the mean of a stream of numbers [1] is easy: just keep track
of the sum and the count, and divide at the end. But what about the
median? I think I always need to buffer at least half the numbers, but
does anyone know for sure, or know a clever algorithm [2]?

Darren

[1]: In a desperate attempt at on-topic-ness lets pretend they are
thinking times for each move in all games in a large go game collection.

[2]: If I know the number of distinct numbers is relatively small I can
make the histogram and derive the median that way. E.g. 100 distinct
numbers means I just need 100 counters. I also get the mode as a bonus.

-- 
Darren Cook
http://dcook.org/mlsn/ (English-Japanese-German-Chinese free dictionary)
http://dcook.org/work/ (About me and my work)
http://dcook.org/work/charts/  (My flash charting demos)
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] OT: median of a data stream

2007-08-16 Thread John Tromp
On 8/16/07, Darren Cook [EMAIL PROTECTED] wrote:
 Apologies for the off-topic post, but I know lots of people here are
 interested in statistics and algorithms.

 Calculating the mean of a stream of numbers [1] is easy: just keep track
 of the sum and the count, and divide at the end. But what about the
 median? I think I always need to buffer at least half the numbers, but
 does anyone know for sure, or know a clever algorithm [2]?

You'll have to remember all numbers.
If you've seen a_0  a_1  a_2  ...  a_n   so far,
then for any i, 0=i=n, you'll have to ouput a_i after an additional
(n-i) values less than a_0, and i values greater than a_n.

regards,
-John
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] OT: median of a data stream

2007-08-16 Thread Erik van der Werf
On 8/16/07, Darren Cook [EMAIL PROTECTED] wrote:
 Calculating the mean of a stream of numbers [1] is easy: just keep track
 of the sum and the count, and divide at the end. But what about the
 median?

 I think I always need to buffer at least half the numbers

A counter for each unique value should always suffice and may be less
than half the numbers in the stream. If you don't require huge
precision you can save a lot by increasing the bin-size (just maintain
a histogram).

 , but does anyone know for sure, or know a clever algorithm [2]?

Not sure, but I guess you could look for an incremental update of any
probability density approximation that has sufficient complexity for
your purposes.

An even more simplistic approach might be to compare incoming samples
to the current estimate and increment/decrement by a small constant
depending on whether it's higher or lower (but that may be too crude
for your purposes).

Best,
Erik
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] OT: median of a data stream

2007-08-16 Thread A van Kessel
You can use a binary tree. In each node you store the number
of child nodes in the left subtree. (maybe also for the right subtree)
If you want to handle duplicates, you could add yet another counter to the 
nodes.

Finding the median is easy. As a bonus, you can also find the value
for *any* rank or percentile.

Cost is 2 or3 pointers plus 2 or 3 ints per node, say 20 bytes.
If your input is reasonably random, balancing will not be needed.

Avoiding dynamic allocation will probably speed things up.
(and make deallocation easyer ;-)

HTH,
AvK
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/