David,

I think this is called the "histogram method" in the literature.
You can obviously make the 'buckets' as big or as small as you like
depending on how much memory you want to use - you could even have
1-byte-wide buckets, which would let you calculate the exact median
rather than an approximation, but an approximation would be plenty good
enough for this application.

There are also published algorithms for calculating the standard
deviation in a single pass (Welford's method is the usual one), so
combining that with your method would allow calculation of the mean,
median and SD with one pass through the file.
That should give you a pretty good idea of the 'shape' of the data.
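A minimal sketch of the single-pass mean/SD calculation (in Python for brevity rather than U2 BASIC; the function name is made up, but the recurrence is the standard textbook Welford formulation):

```python
def one_pass_stats(sizes):
    """Welford's single-pass algorithm: running mean and standard
    deviation without storing the individual record sizes."""
    n = 0
    mean = 0.0
    m2 = 0.0  # accumulated sum of squared deviations from the running mean
    for x in sizes:
        n += 1
        delta = x - mean
        mean += delta / n          # update the running mean
        m2 += delta * (x - mean)   # uses the mean both before and after the update
    sd = (m2 / (n - 1)) ** 0.5 if n > 1 else 0.0  # sample standard deviation
    return mean, sd
```

The same loop body translates directly into a READNEXT loop over the file, so mean and SD fall out of the same pass that fills the histogram buckets.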



Regards


Mike


-----Original Message-----
From: u2-users-boun...@listserver.u2ug.org On Behalf Of David Wolverton 
Sent: Friday, 7 October 2011 8:29 a.m.
To: 'U2 Users List'
Subject: Re: [U2] FAST (File Analysis and Sizing Tool)

What about this... Could you just store a MultiValue list of 'approx
size' buckets?  Then you would 'add one' to the correct MV as you
process the records.  At the end of the run, you'd have a list where
you could essentially start a loop to 'work in to the center'.

So you'd have a MV of buckets as
512]1024]2048]4096]8192]16384]32768]etc
in whatever break was most reasonable -- perhaps it would be
100]200]300]400]500]600]700 etc...

Each record would '+1' the correct size bucket - so a record of 310
bytes would drop into the '400' bucket (>300, <=400), and one of 4000
would land in the '4096' bucket (>2048, <=4096) -- the number and size
of the buckets would be whatever is 'meaningful' to the process.
At the end, you'd have an array of counts based on the 'size' MV each
record dropped into, something like

20]400]22600]52000]16000]250]1]]]1

Or the like... from there, you could get to the 'median' pretty
quickly, I think, with one pass at it.  It's not an EXACT byte count -
the 52000 'bucket' clearly holds the 'median', which would be somewhere
between 300 and 400 bytes (or 2048 and 4096, depending on how you did
your 'buckets') - but that would likely be 'close enough' for the
'horseshoes' game of file sizing, as long as you did the buckets in
narrow-ish/meaningful breaks.

(And maybe you'd use a DIM'd array instead of an MV to make
incrementing the counters faster -- but you get the idea.)

-----Original Message-----
From: u2-users-boun...@listserver.u2ug.org
[mailto:u2-users-boun...@listserver.u2ug.org] On Behalf Of Jeff
Fitzgerald
Sent: Thursday, October 06, 2011 12:44 PM
To: u2-users@listserver.u2ug.org
Subject: Re: [U2] FAST (File Analysis and Sizing Tool)

[snip]

This brings up my question - what is the best way to find the median?
Say that you have an arbitrary number of records; how do you find the
median record size?  The easy answer is to sort the size values and
pick the middle one.  However, suppose you have a billion records -
storing a billion size values presents some challenges!  Is there a way
to find the median with a single pass and without having to provide
storage for ALL the values?  I'd be interested in anyone's ideas about
finding the median, and about measures of skewness in general.
Thanks!

Jeff Fitzgerald
Fitzgerald & Long, Inc.
www.fitzlong.com
j...@fitzlong.com
_______________________________________________
U2-Users mailing list
U2-Users@listserver.u2ug.org
http://listserver.u2ug.org/mailman/listinfo/u2-users
