Can't remember if this came from Wally or not a long time ago. But I use it to 
figure out Split/Merge. I have a development box that has a copy of production 
that I can play with. So I do a lot of playing with mod and sep and depend on 
GROUP.STAT to give me some idea of how groups are being populated.


Sizing Dynamic Files

 Technote (FAQ) 
  
Problem 
Sometimes, administrators would like some ideas and insights on how to 
configure dynamic files to maximize file access speed and minimize the physical 
size. This article describes one process for making this determination.  
  
 
  
Solution 
To improve dynamic file performance an administrator can choose a new modulo 
and/or block size. Other important factors, however, are the percent standard 
deviation of the record size, the correct hash type, and the split load 
percent. 

The first step is to generate file statistics using the ECL command FILE.STAT 
(in ECLTYPE U mode). The percent standard deviation can be obtained by the 
following formula: "Standard deviation from average" divided by "Average number 
of bytes in a record". Ideally, this percent would be zero - all records are 
exactly the same size. Having all records the same size makes calculations more 
accurate for our file sizing purposes. A standard deviation percent under 
15-20% means the variation of record sizes is less than perfect but we can 
still predict well enough to be confident that the problem has a satisfactory 
solution.
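As a quick sanity check, the formula can be sketched in Python (the 15.3 and 
100.7 figures are borrowed from the FILE.STAT example later in this article; 
any pair of FILE.STAT figures will do):

```python
# Percent standard deviation of record size, from two FILE.STAT figures:
# "Standard deviation from average" divided by
# "Average number of bytes in a record", expressed as a percent.
def pct_std_dev(std_dev_from_average, avg_bytes_per_record):
    """Return the standard deviation as a percent of average record size."""
    return std_dev_from_average / avg_bytes_per_record * 100

# Example: FILE.STAT reports a standard deviation of 15.3 against an
# average record size of 100.7 bytes.
print(round(pct_std_dev(15.3, 100.7), 1))  # 15.2 - within the 15-20% comfort zone
```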

However, it is very common in the U2 world for a file design to have been left 
in service beyond what is reasonable for today's situations - i.e. what worked 
well in 1980 may not be a good solution for files much larger than were 
originally anticipated. So, if in the old days you had, say, 10 multivalues in 
most records and today you have between 20 and 3000, then it is easy to see how 
the percent standard deviation for record size can creep up over the years 
without being noticed. Anyway, we'll slog forward on the assumption that the 
standard deviation percent is "good". 

Final point: a high standard deviation percent for record size usually leads to 
wasted space, either in the form of sparsely populated primary groups and/or 
excessive overflow. A high standard deviation percent can create a situation 
where there is no "good" answer. 

An important factor in correct file sizing is determining the better hashing 
algorithm - either type 0 or type 1. It is useful to keep an open mind here, 
because the hash type may have been set correctly originally, yet as the 
format of the ids changes over time, the other hash type can become the better 
choice. First, always run ANALYZE.FILE filename and look at the "Keys" column. 
If the numbers are consistent as you look down the column, the algorithm 
currently in place is likely correct; if they vary widely, further study is 
warranted. To do this kind of analysis, select a sample of 10,000 record ids 
from the file. Then create two dynamic files (one a type 0, the other a type 
1) with a blocksize of 1024 and a modulo of 3. Then use CONFIGURE.FILE to set 
the MERGE.LOAD to 5 and the SPLIT.LOAD to 10. This configuration exaggerates 
the results of the testing to make the decision a little easier. Then populate 
each of the files using the sample list of ids and the empty string for each 
record. Whichever file is smaller is usually the better hash type. 

Determine Id Size and Record Size

Get two numbers from the FILE.STAT report: "Average number of bytes in a 
record" (avg rec size), rounded up to the next whole number; and "Average 
number of bytes in record ID" (id size), rounded up to the next whole number.

Follow these steps: 
1. IDSIZE = id size from report above + 8 
2. DATASIZE = avg rec size from report above - id size from report above 
3. TOTAL = IDSIZE + DATASIZE 

Example: 

File name(Dynamic File)               = DYN1 
Number of groups in file (modulo)     = 115 
Dynamic hashing, hash type            = 1 
Split/Merge type                      = KEYONLY 
Block size                            = 1024 
File has 5 groups in level one overflow. 
Number of records                     = 575 
Total number of bytes                 = 25708
.
.
.
Average number of bytes in a record   = 100.7
Average number of bytes in record ID  = 8.2 
Standard deviation from average       = 15.3


Average number of bytes in a record = 100.7 -> 101 
Average number of bytes in record ID = 8.2 -> 9 

IDSIZE = 9 + 8 = 17 
DATASIZE = 101 - 9 = 92
TOTAL = 17 + 92 = 109
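The three steps above, applied to the example figures, can be sketched in 
Python:

```python
import math

# Steps 1-3 above, using the figures from the example FILE.STAT report.
avg_rec_size = math.ceil(100.7)    # "Average number of bytes in a record" -> 101
id_size      = math.ceil(8.2)      # "Average number of bytes in record ID" -> 9

IDSIZE   = id_size + 8             # each key carries 8 bytes of overhead
DATASIZE = avg_rec_size - id_size  # the data portion of an average record
TOTAL    = IDSIZE + DATASIZE

print(IDSIZE, DATASIZE, TOTAL)     # 17 92 109
```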


Determine Blocksize and Modulo

The first block in each group has 32 bytes of header information. So, for a 
1024 byte block, 992 bytes are usable for keys and data. Of this, a minimum of 
roughly 10 percent (124 bytes in this case) is reserved for key information. 
Each key will use up 8 bytes of overhead plus the length of the key itself. 
This is represented by IDSIZE above. The data portion of the record(s) begins 
after the key area and continues to the end of the block. By way of example, we 
can take the 992 bytes and divide by the 109 bytes in TOTAL above and get 9.1. 
This is the average number of records we can get into a block without overflow. 
We must make this number an integer, and we will round down since we do not 
want to chance that rounding up will cause overflow. So, the number of records 
per block is 9.

The modulo is calculated as the prospective number of records divided by the 
records per block. If you expect the file to receive a lot of new records 
right away, you may want to inflate the modulo by some percent to avoid the 
temporary performance penalty of frequent splitting as records are added to 
groups already filled to capacity.

One of the results of rounding down in the example above is that there will be 
some guaranteed waste of space - 109 bytes times .1, or about 11 bytes. Divide 
this by the 992 bytes available and you see that at least around 1% of the 
file will be wasted space. 
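The block-capacity, modulo, and waste arithmetic can be sketched in Python 
(the 575-record count comes from the example FILE.STAT report; for a growing 
file you would substitute the expected future record count):

```python
# Records per block, prospective modulo, and minimum wasted space
# for a 1024-byte block, using the example's figures.
BLOCKSIZE = 1024
HEADER    = 32                       # first block of each group carries a header
TOTAL     = 109                      # IDSIZE + DATASIZE from the example above

usable = BLOCKSIZE - HEADER          # 992 bytes available for keys and data
records_per_block = usable // TOTAL  # 9 - round down so rounding never forces overflow

expected_records = 575               # "Number of records" from FILE.STAT
modulo = -(-expected_records // records_per_block)  # ceiling division -> 64

waste_per_block = usable - records_per_block * TOTAL  # 992 - 981 = 11 bytes
waste_pct = waste_per_block / usable * 100            # about 1.1%
print(records_per_block, modulo, waste_per_block, round(waste_pct, 1))
```

Since a dynamic file splits and merges on its own, the computed modulo is a 
prospective minimum, not necessarily what ANALYZE.FILE will show later.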

Sometimes, however, the percent of wasted space is rather high. To eliminate 
this, double the size of the block and repeat the calculations. You may need to 
double again. The blocksizes I'd advise using are 1k, 2k, 4k, 8k, and 16k. The 
objective is to get the wasted space down to some acceptable percent.

Do not choose a blocksize that is smaller than the average record size because 
this will force each record in the file that is greater in size than the block 
size to start its data portion in a separate overflow block, usually wasting 
space.

Wasted space causes performance problems as well as taking too much disk space 
relative to the actual data stored. Operations which must process the whole 
file must traverse all of the groups - even the sparsely populated groups.

Determine Split and Merge Loads

The next step, important and often overlooked, is to determine the split load 
percent. KEYONLY is the presumed split method. Under this method, when the key 
portion of the primary group occupies x% of the total blocksize, then the group 
splits. It is assumed you understand the purpose and repercussions of this. To 
determine the best split load percent, apply the following formula: 

SPLIT = INT(RECORDS PER BLOCK * IDSIZE *100 / BLOCKSIZE)

SPLIT = INT(9 * 17 * 100 / 1024) = 14

So, the split percent should be around 14% to fit most of the records into 
primarily the data portion. If you also want to fully populate the over 
portion, then you can roughly double the split, to a value of about 30. The 
merge percent is calculated to be about half of the split load figure, so it 
needs to be around 8 for the first case and 15 for the second. These are 
approximations that 
will work if the std dev is not too high. The best way to gain a feel for how 
these models work is to review a dozen or so files in your production 
environment in the fashion described above. 
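The split/merge formula, applied to the example numbers, looks like this in 
Python:

```python
# Split and merge load percents under KEYONLY, from the formula above:
# SPLIT = INT(RECORDS PER BLOCK * IDSIZE * 100 / BLOCKSIZE)
RECORDS_PER_BLOCK = 9
IDSIZE            = 17
BLOCKSIZE         = 1024

split = RECORDS_PER_BLOCK * IDSIZE * 100 // BLOCKSIZE  # INT(9*17*100/1024) = 14
merge = split // 2                                     # roughly half the split load
print(split, merge)  # 14 7
```

Halving the computed split of 14 gives 7; the article's "around 8" is the same 
figure within the stated tolerance of these approximations.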

Putting it All Together

First, find a time when impact on production will be minimized. The best thing 
to do is create a new dynamic file with the desired hash type, modulo, and 
block size. Then use CONFIGURE.FILE to set the appropriate SPLIT and MERGE 
parameters. Then copy the data from the old file to this new structure. Copy 
the dictionary items to the new file's dictionary using the overwriting option. 
Now, you should be able to use the CNAME command to change the name of the old 
file to a backup name, and use CNAME again to change the name of the newly 
populated file to the production file name. You may wish to conduct some 
performance tests to validate the improvement. Because this method of 
determining reconfiguration parameters is imperfect, you may have to do some 
further tinkering, such as repeating the process with a slightly higher or 
lower split load percent than was calculated.  
 
 
 
  

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Dave Laansma
Sent: Tuesday, June 05, 2012 12:33 PM
To: [email protected]
Subject: [U2] Learning about file sizing

Can anyone point me to a good document that will give me guidelines for
'proper' file sizing of dynamic files in particular?

And when to use KEYONLY vs KEYDATA?

Thanks!

Sincerely,

David Laansma
IT Manager
Hubbard Supply Co.
Direct: 810-342-7143
Office: 810-234-8681
Fax: 810-234-6142
www.hubbardsupply.com

"Delivering Products, Services and Innovative Solutions"

_______________________________________________
U2-Users mailing list
[email protected]
http://listserver.u2ug.org/mailman/listinfo/u2-users

