David, I think this is called the "Histogram Method" in the literature. You can obviously make the 'buckets' as big or as small as you like, depending on how much memory you want to use - you could even use buckets 1 byte wide, which would let you calculate the actual median rather than an approximation, though an approximation would be plenty good enough for this application.
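To make the 1-byte-wide-bucket idea concrete, here is a minimal Python sketch (not U2 BASIC) of the histogram method with one counter per possible record size, so the median it returns is exact. The cap `MAX_SIZE` is a hypothetical bound chosen for illustration:

```python
MAX_SIZE = 65536  # assumed upper bound on record size in bytes (illustrative)

def exact_median(sizes):
    """One pass to fill the histogram, then walk the buckets to the middle.

    For an even count this returns the lower of the two middle values.
    """
    counts = [0] * (MAX_SIZE + 1)
    total = 0
    for s in sizes:           # the single pass over the record sizes
        counts[s] += 1
        total += 1
    midpoint = (total + 1) // 2
    seen = 0
    for size, c in enumerate(counts):
        seen += c
        if seen >= midpoint:  # this bucket contains the middle value
            return size
```

Memory is proportional to the largest possible record size, not the record count, which is why this works even for a billion records.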

There are also published algorithms for calculating the standard deviation in a single pass, so combining one of those with your method would allow calculation of the mean, median and SD with one pass through the file. That should give you a pretty good idea of the 'shape' of the data.

Regards,
Mike

-----Original Message-----
From: u2-users-boun...@listserver.u2ug.org On Behalf Of David Wolverton
Sent: Friday, 7 October 2011 8:29 a.m.
To: 'U2 Users List'
Subject: Re: [U2] FAST (File Analysis and Sizing Tool)

What about this... Could you just store a multivalue of the 'approx size' of each record? Then you would 'add one' to the correct MV as you process the record counts. At the end of the run, you'd have a list you could loop through to 'work to the center'.

So you'd have an MV of buckets such as 512]1024]2048]4096]8192]16384]32768]etc. in whatever breaks were most reasonable -- perhaps it would be 100]200]300]400]500]600]700 etc. Each record would '+1' the correct size bucket - so a record that is 310 bytes would drop into the '400' bucket (>300, <=400), or one that is 4000 would drop into the '4096' bucket (>2048, <=4096). The number and size of the buckets would be whatever is 'meaningful' to the process.

At the end, you'd have an array of counts, indexed by the 'size' MV each record dropped into, resembling 20]400]22600]52000]16000]250]1]]]1 or the like. From there, you could get to the 'median' pretty quickly, I think, with one pass at it. And while it isn't an exact byte count - the 52000 'bucket' clearly holds the 'median', which would be somewhere between 300 and 400 bytes, or between 2048 and 4096, depending on how you did your 'buckets' - this would likely be 'close enough' for the 'horseshoes' game of file sizing, as long as you did the buckets in narrow-ish/meaningful breaks. (And maybe you'd use a DIMmed array over MVs to make incrementing the counters faster -- but you get the idea.)
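David's bucket scheme can be sketched in a few lines of Python (not U2 BASIC). The power-of-2 boundaries below mirror his example; the function name and the use of a list in place of a multivalue are illustrative choices:

```python
from bisect import bisect_left

BOUNDS = [512, 1024, 2048, 4096, 8192, 16384, 32768]  # bucket upper bounds

def bucket_median(sizes):
    """Single pass to count records per bucket, then walk to the center.

    Returns the upper bound of the bucket holding the median, or None if
    the median falls in the overflow (> 32768) bucket.
    """
    counts = [0] * (len(BOUNDS) + 1)          # last slot catches > 32768
    total = 0
    for s in sizes:
        # bisect_left finds the first bound >= s, i.e. the (>prev, <=bound)
        # bucket David describes: 310 -> the 512 bucket, 4000 -> 4096.
        counts[bisect_left(BOUNDS, s)] += 1
        total += 1
    midpoint = (total + 1) // 2
    seen = 0
    for i, c in enumerate(counts):
        seen += c
        if seen >= midpoint:
            return BOUNDS[i] if i < len(BOUNDS) else None
```

Narrower buckets (David's 100]200]300]... breaks) trade a little memory for a tighter answer; the walk to the center is unchanged.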
-----Original Message-----
From: u2-users-boun...@listserver.u2ug.org [mailto:u2-users-boun...@listserver.u2ug.org] On Behalf Of Jeff Fitzgerald
Sent: Thursday, October 06, 2011 12:44 PM
To: u2-users@listserver.u2ug.org
Subject: Re: [U2] FAST (File Analysis and Sizing Tool)

[snip]

This brings up my question - what is the best way to find the median? Say that you have an arbitrary number of records; how do you find the median record size? The easy answer is to sort the size values and then pick the middle one. However, suppose you have a billion records - storing the billion size values presents some challenges! Is there a way to find the median in a single pass, without having to provide storage for ALL the values? I'd be interested in anyone's ideas about finding the median, and about measures of skewness in general.

Thanks!

Jeff Fitzgerald
Fitzgerald & Long, Inc.
www.fitzlong.com
j...@fitzlong.com

_______________________________________________
U2-Users mailing list
U2-Users@listserver.u2ug.org
http://listserver.u2ug.org/mailman/listinfo/u2-users
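Mike refers above to published single-pass standard deviation algorithms; Welford's method is one well-known example, sketched here in Python as an assumption about which algorithm he means. Paired with the bucket histogram, it yields mean, median and SD from one pass through the file:

```python
import math

def welford(sizes):
    """One-pass running mean and sample standard deviation (Welford's method)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in sizes:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses both old and updated mean
    sd = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return mean, sd
```

Unlike the naive sum-of-squares formula, this stays numerically stable when the mean is large relative to the spread.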