I can't remember whether this came from Wally a long time ago or not, but I use it to figure out split/merge behavior. I have a development box with a copy of production that I can play with, so I do a lot of experimenting with modulo and separation, and I depend on GROUP.STAT to give me some idea of how the groups are being populated.
Sizing Dynamic Files Technote (FAQ)

Problem

Administrators sometimes want ideas and insight on how to configure dynamic files to maximize file access speed and minimize physical size. This article describes one process for making that determination.

Solution

To improve dynamic file performance, an administrator can choose a new modulo and/or block size. Other important factors, however, are the percent standard deviation of the record size, the correct hash type, and the split load percent.

The first step is to generate file statistics using the ECL command FILE.STAT (in ECLTYPE U mode). The percent standard deviation is obtained by the following formula: "Standard deviation from average" divided by "Average number of bytes in a record". Ideally this percent would be zero, meaning all records are exactly the same size; that makes the calculations most accurate for our file sizing purposes. A standard deviation percent under 15-20% means the record sizes vary, but still predictably enough for us to be confident that the problem has a satisfactory solution.

However, it is very common in the U2 world for a file design to be left in service beyond what is reasonable for today's situations: what worked well in 1980 may not be a good solution for files much larger than originally anticipated. If in the old days most records held, say, 10 multivalues, and today they hold anywhere from 20 to 3000, it is easy to see how the percent standard deviation of the record size can creep up over the years without being noticed. We will proceed on the assumption that the standard deviation percent is "good". One final point: a high standard deviation percent for record size usually leads to wasted space, either as sparsely populated primary groups or as excessive overflow, and it can create a situation where there is no "good" answer.
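The percent-standard-deviation check above is simple enough to sketch in a few lines of Python. This is just an illustration of the arithmetic, not a vendor tool; the function name is made up here, and the 15-20% threshold is the rule of thumb stated in the text.

```python
# Sketch of the percent-standard-deviation check described above.
# Both inputs come straight off a FILE.STAT report.

def pct_std_dev(std_dev_from_avg, avg_rec_bytes):
    """Record-size standard deviation as a percent of the average record size."""
    return 100.0 * std_dev_from_avg / avg_rec_bytes

# Using the example FILE.STAT numbers quoted later in this note:
pct = pct_std_dev(15.3, 100.7)
print("%.1f%%" % pct, "-> acceptable" if pct <= 20 else "-> high variance")
```

For the DYN1 example later in this note (std dev 15.3, average record 100.7 bytes), this works out to about 15.2%, which sits inside the 15-20% "good enough" band.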
Determine the Hash Type

An important factor in correct file sizing is determining the better hashing algorithm: type 0 or type 1. It is useful to keep an open mind here, because the hash type is another thing that can be set correctly once and then, as the format of the ids changes over time, the other hash type becomes the better choice. First, run ANALYZE.FILE filename and look at the "Keys" column. If the numbers are consistent as you look down the column, the algorithm currently in place is likely correct; if they vary widely, further study is warranted.

To do that analysis, select a sample of 10,000 record ids from the file. Then create two dynamic files (one type 0, the other type 1) with a block size of 1024 and a modulo of 3. Use CONFIGURE.FILE to set the MERGE.LOAD to 5 and the SPLIT.LOAD to 10; this configuration exaggerates the results of the test to make the decision a little easier. Then populate each file from the sample list of ids, writing the empty string as each record. Whichever file ends up smaller is usually the better hash type.

Determine Id Size and Record Size

Get two numbers from the FILE.STAT report: "Average number of bytes in a record" (avg rec size), rounded up to the next whole number, and "Average number of bytes in record ID" (id size), rounded up to the next whole number. Then follow these steps:

1. IDSIZE = id size from the report above + 8
2. DATASIZE = avg rec size from the report above - id size from the report above
3. TOTAL = IDSIZE + DATASIZE

Example:

File name (Dynamic File) = DYN1
Number of groups in file (modulo) = 115
Dynamic hashing, hash type = 1
Split/Merge type = KEYONLY
Block size = 1024
File has 5 groups in level one overflow.
Number of records = 575
Total number of bytes = 25708
. . .
Average number of bytes in a record = 100.7
Average number of bytes in record ID = 8.2
Standard deviation from average = 15.3

Average number of bytes in a record = 100.7 -> 101
Average number of bytes in record ID = 8.2 -> 9

IDSIZE = 9 + 8 = 17
DATASIZE = 101 - 9 = 92
TOTAL = 17 + 92 = 109

Determine Blocksize and Modulo

The first block in each group has 32 bytes of header information, so for a 1024-byte block, 992 bytes are usable for keys and data. Of this, a minimum of roughly 10 percent (124 bytes in this case) is reserved for key information. Each key uses 8 bytes of overhead plus the length of the key itself; this is the IDSIZE figure above. The data portion of the record(s) begins after the key area and continues to the end of the block.

Continuing the example, divide the 992 usable bytes by the 109 bytes in TOTAL above to get 9.1, the average number of records we can fit into a block without overflow. We must make this an integer, and we round down because rounding up would risk causing overflow, so the number of records per block is 9. The modulo is then calculated as the prospective number of records divided by the records per block. If you expect the file to be immediately subject to a lot of new records being added, you may want to inflate the modulo by some percent to avoid the temporary performance penalty of the frequent splitting that occurs while new records land in groups already filled to capacity.

One result of rounding down is some guaranteed wasted space: the fractional 0.1 record times 109 bytes, or about 11 bytes per group. Divide this by the 992 available bytes and you see that at least around 1% of the file will be wasted space. Sometimes, however, the percent of wasted space is rather high. To reduce it, double the block size and repeat the calculations; you may need to double again.
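The whole sizing calculation above can be sketched as one small Python function. This is a sketch of the arithmetic in the text, not a vendor utility; the function name is invented, and the 32-byte block header and 8-byte per-key overhead are the figures quoted above. The text leaves the modulo rounding unstated, so rounding up is an assumption here (a fractional group is not possible).

```python
import math

# Sketch of the blocksize/modulo arithmetic described above.
# Overheads per the text: 32-byte block header, 8 bytes of key overhead.

def size_file(avg_rec_bytes, avg_id_bytes, blocksize, num_records):
    idsize = math.ceil(avg_id_bytes) + 8            # key length + 8 bytes overhead
    datasize = math.ceil(avg_rec_bytes) - math.ceil(avg_id_bytes)
    total = idsize + datasize                       # bytes consumed per record
    usable = blocksize - 32                         # block minus header
    recs_per_block = usable // total                # round DOWN to avoid overflow
    modulo = math.ceil(num_records / recs_per_block)  # assumption: round up
    waste_pct = 100.0 * (usable - recs_per_block * total) / usable
    return idsize, datasize, total, recs_per_block, modulo, waste_pct

# The DYN1 example: 100.7-byte average record, 8.2-byte average id,
# 1024-byte block, 575 records.
print(size_file(100.7, 8.2, 1024, 575))
```

For DYN1 this reproduces the worked example: IDSIZE 17, DATASIZE 92, TOTAL 109, 9 records per block, a modulo of 64 for the current 575 records, and roughly 1.1% guaranteed waste.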
The block sizes I'd advise using are 1K, 2K, 4K, 8K, and 16K. The objective is to get the wasted space down to some acceptable percent. Do not choose a block size smaller than the average record size, because every record larger than the block size must start its data portion in a separate overflow block, usually wasting space. Wasted space causes performance problems as well as consuming too much disk relative to the actual data stored: operations that must process the whole file traverse all of the groups, even the sparsely populated ones.

Determine Split and Merge Loads

The next step, important and often overlooked, is to determine the split load percent. KEYONLY is the presumed split method; under it, a group splits when the key portion of the primary group occupies x% of the total block size. It is assumed you understand the purpose and repercussions of this. To determine the best split load percent, apply the following formula:

SPLIT = INT(RECORDS PER BLOCK * IDSIZE * 100 / BLOCKSIZE)
SPLIT = INT(9 * 17 * 100 / 1024) = 14

So the split percent should be around 14% to fit most of the records into the dat portion of the file. If you also want to fully populate the over portion, you can roughly double the split, to a value of about 30. The merge percent is calculated as about half of the split load figure, so around 8 for the first case and 15 for the second. These are approximations that will work as long as the standard deviation is not too high. The best way to get a feel for how these models behave is to review a dozen or so files in your production environment in the fashion described above.

Putting it All Together

First, find a time when the impact on production will be minimized. The best approach is to create a new dynamic file with the desired hash type, modulo, and block size, and then use CONFIGURE.FILE to set the appropriate SPLIT and MERGE parameters.
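The split/merge formula above can also be sketched in Python. Again this is only the arithmetic from the text with an invented function name; note that the exact formula gives 28 and 14 for the "fill the over portion" case, which the text rounds off to roughly 30 and 15.

```python
# Sketch of the KEYONLY split/merge load calculation described above.

def split_merge_loads(recs_per_block, idsize, blocksize, fill_overflow=False):
    split = (recs_per_block * idsize * 100) // blocksize  # INT(...) per the text
    if fill_overflow:
        split *= 2            # roughly double to also populate the over portion
    merge = split // 2        # merge load is about half the split load
    return split, merge

print(split_merge_loads(9, 17, 1024))        # DYN1 example, dat portion only
print(split_merge_loads(9, 17, 1024, True))  # also populating the over portion
```

With the DYN1 figures (9 records per block, IDSIZE 17, block size 1024) this yields a split load of 14 and a merge load of 7, or 28 and 14 when doubling to populate the over portion.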
Then copy the data from the old file into this new structure, and copy the dictionary items to the new file's dictionary using the overwriting option. Now you should be able to use the CNAME command to rename the old file to a backup name, and CNAME again to rename the newly populated file to the production file name. You may wish to run some performance tests to validate the improvement. Because this method of determining reconfiguration parameters is imperfect, you may have to do some further tinkering, such as repeating the process with a slightly higher or lower split load percent than was calculated.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Dave Laansma
Sent: Tuesday, June 05, 2012 12:33 PM
To: [email protected]
Subject: [U2] Learning about file sizing

Can anyone point me to a good document that will give me guidelines for 'proper' file sizing of dynamic files in particular? And when to use KEYONLY vs KEYDATA? Thanks!

Sincerely,
David Laansma
IT Manager
Hubbard Supply Co.
Direct: 810-342-7143
Office: 810-234-8681
Fax: 810-234-6142
www.hubbardsupply.com
"Delivering Products, Services and Innovative Solutions"

_______________________________________________
U2-Users mailing list
[email protected]
http://listserver.u2ug.org/mailman/listinfo/u2-users
Copyright (c) 2012 Cigna
