I need to understand the split/merge formula more clearly. Below
it states:


SPLIT = INT( 9 * 17 * 100 / 1024) = 14

How did they come up with 9 as the RECORDS PER BLOCK from the file
status outlined?

David Laansma
IT Manager
Hubbard Supply Co.
Direct: 810-342-7143
Office: 810-234-8681
Fax: 810-234-6142
"Delivering Products, Services and Innovative Solutions"

-----Original Message-----
On Behalf Of Rutherford,
Sent: Tuesday, June 05, 2012 3:50 PM
To: U2 Users List
Subject: Re: [U2] Learning about file sizing

Rod, Excellent post!  I have a file I have been wanting to convert to
dynamic.  Since it is not something I do every day I have been stalling
for a while now...


Marc Rutherford
Principal Programmer Analyst
Advanced Bionics LLC
(661) 362 1754

-----Original Message-----
On Behalf Of Baakkonen, Rodney A (Rod) 46K
Sent: Tuesday, June 05, 2012 10:53 AM
To: 'U2 Users List'
Subject: Re: [U2] Learning about file sizing

 Can't remember if this came from Wally or not a long time ago. But I
use it to figure out Split/Merge. I have a development box that has a
copy of production that I can play with. So I do a lot of playing with
mod and sep and depend on GROUP.STAT to give me some idea of how groups
are being populated.

Sizing Dynamic Files

 Technote (FAQ) 
Sometimes, administrators would like some ideas and insights on how to
configure dynamic files to maximize file access speed and minimize the
physical size. This article describes one process for making this
To improve dynamic file performance an administrator can choose a new
modulo and/or block size. Other important factors, however, are the
percent standard deviation of the record size, the correct hash type,
and the split load percent.

The first step is to generate file statistics using the ECL command
FILE.STAT (in ECLTYPE U mode). The percent standard deviation can be
obtained by the following formula: "Standard deviation from average"
divided by "Average number of bytes in a record". Ideally, this percent
would be zero - all records are exactly the same size. Having all
records the same size makes calculations more accurate for our file
sizing purposes. A standard deviation percent under 15-20% means the
variation of record sizes is less than perfect but we can still predict
well enough to be confident that the problem has a satisfactory
solution.
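
As a quick check of the percent standard deviation formula, here is a
minimal Python sketch. It uses the two figures from the sample FILE.STAT
report shown later in this note (15.3 and 100.7). The function name is
made up for illustration and is not part of UniData.

```python
def pct_std_dev(std_dev_from_avg, avg_record_bytes):
    """Percent standard deviation of record size:
    "Standard deviation from average" divided by
    "Average number of bytes in a record", expressed as a percent."""
    return std_dev_from_avg / avg_record_bytes * 100

pct = pct_std_dev(15.3, 100.7)
print(f"{pct:.1f}%")  # 15.2%
```

At roughly 15%, the sample file sits inside the 15-20% range that the
note treats as still predictable enough.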

However, it is very common in the U2 world for a file design to have
been left in service beyond what is reasonable for today's situations -
i.e. what worked well in 1980 may not be a good solution for much larger
files than was originally anticipated. So, if in the old days you had,
say, 10 multivalues in most records and today you have between 20 and
3000, then it is easy to see how the percent standard deviation for
record size can creep up over the years without being noticed. Anyway,
we'll slog forward on the assumption that the standard deviation percent
is "good". 

Final point: a high standard deviation percent for record size usually
leads to wasted space, either in the form of sparsely populated primary
groups and/or excessive overflow. A high standard deviation percent can
create a situation where there is no "good" answer. 

An important factor in correct file sizing is to determine the better
hashing algorithm - either type 0 or 1. It is useful to keep an open
mind on this because hash type is another thing that can be set
correctly and, over time, the format of the ids changes, and now the
other hash type is better. First, you should always do ANALYZE.FILE
filename and look at the "Keys" column. If the numbers are consistent
as you look down the column, then the algorithm currently in place is
likely correct. If they vary widely, further study is warranted. To do
this kind of analysis, select a sample of 10,000 record ids from the
file. Then create two dynamic files (one a type 0, the other a type 1)
of blocksize 1024 and a modulo of 3. Then, use CONFIGURE.FILE to set
the MERGE.LOAD to 5 and the SPLIT.LOAD to 10. This configuration helps
exaggerate the results of the testing to make the decision a little
easier. Then, populate each of the files using the sample list of ids
and the empty string for a record. Whichever file is the smaller
usually has the better hash type.
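
The final size comparison can be scripted from the operating system
side. The sketch below is only an illustration: it assumes each test
file is stored on disk as a directory of part files (dat001, over001
and so on) and simply adds up their sizes. The function names and the
example paths are hypothetical.

```python
import os

def dynamic_file_bytes(path):
    """Total size in bytes of the part files directly inside the
    directory that holds a dynamic file (assumed on-disk layout)."""
    return sum(
        entry.stat().st_size
        for entry in os.scandir(path)
        if entry.is_file()
    )

def smaller_hash_type(type0_path, type1_path):
    """Return (hash_type, size0, size1), where hash_type is the type
    whose test file is smaller. A tie is reported as type 0."""
    size0 = dynamic_file_bytes(type0_path)
    size1 = dynamic_file_bytes(type1_path)
    return (0 if size0 <= size1 else 1), size0, size1

# Example call, with hypothetical test file names:
# print(smaller_hash_type("TEST.TYPE0", "TEST.TYPE1"))
```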

Determine Id Size and Record Size

Get two numbers from the FILE.STAT report: "Average number of bytes in a
record"(avg rec size), rounded up to the next whole number; and,
"Average number of bytes in record ID"(id size), rounded up to the next
whole number.

Follow these steps: 
1. IDSIZE = id size from report above + 8
2. DATASIZE = avg rec size from report above - id size from report above
3. TOTAL = IDSIZE + DATASIZE


File name(Dynamic File)               = DYN1 
Number of groups in file (modulo)     = 115 
Dynamic hashing, hash type            = 1 
Split/Merge type                      = KEYONLY 
Block size                            = 1024 
File has 5 groups in level one overflow. 
Number of records                     = 575 
Total number of bytes                 = 25708
Average number of bytes in a record   = 100.7
Average number of bytes in record ID  = 8.2 
Standard deviation from average       = 15.3

Average number of bytes in a record = 100.7 -> 101
Average number of bytes in record ID = 8.2 -> 9

IDSIZE = 9 + 8 = 17
DATASIZE = 101 - 9 = 92
TOTAL = 17 + 92 = 109
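
The three steps can be sketched in Python as below, using the averages
from the sample report. The function and constant names are invented
for this illustration; the 8 is the "+ 8" from step 1.

```python
import math

KEY_OVERHEAD = 8  # the "+ 8" added to the id size in step 1

def size_record(avg_rec_bytes, avg_id_bytes):
    """Return (IDSIZE, DATASIZE, TOTAL) from the FILE.STAT averages,
    each average first rounded up to the next whole number."""
    rec = math.ceil(avg_rec_bytes)
    key = math.ceil(avg_id_bytes)
    idsize = key + KEY_OVERHEAD
    datasize = rec - key
    return idsize, datasize, idsize + datasize

idsize, datasize, total = size_record(100.7, 8.2)
print(idsize, datasize, total)  # 17 92 109
```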

Determine Blocksize and Modulo

The first block in each group has 32 bytes of header information. So,
for a 1024 byte block, 992 bytes are useable for keys and data. Of this,
a minimum of roughly 10 percent (124 bytes in this case) is reserved for
key information. Each key will use up 8 bytes of overhead plus the
length of the key itself. This is represented by IDSIZE above. The data
portion of the record(s) begins after the key area and continues to the
end of the block. By way of example, we can take the 992 bytes and
divide by the 109 bytes in TOTAL above and get 9.1. This is the average
number of records we can get into a block without overflow. We must make
this number an integer, and we will round down since we do not want to
chance that rounding up will cause overflow. So, the number of records
per block is 9.
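
The 9 in this example is just integer division of the usable bytes by
TOTAL. A minimal Python sketch, with names made up for illustration and
the 32-byte header taken from the paragraph above:

```python
BLOCK_HEADER = 32  # header bytes in the first block of each group

def records_per_block(block_size, total):
    """Whole records that fit in one block without overflow,
    rounding down as the text recommends."""
    usable = block_size - BLOCK_HEADER
    return usable // total

print(records_per_block(1024, 109))  # 9
```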

The modulo is calculated to be the prospective number of records divided
by the records per block. If you expect the file to be immediately
subject to a lot of new records being added, you may want to inflate the
modulo by some percent in an effort to avoid incurring the associated
temporary performance penalty as the file goes through a period of
frequent splitting due to the addition of records to groups that will be
filled to capacity.
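
A rough Python sketch of the modulo calculation follows. Rounding up to
a whole number of groups and the growth_pct parameter are assumptions
added for this example; the note itself only says to inflate the modulo
"by some percent". The 575 records come from the sample report.

```python
import math

def modulo_for(expected_records, recs_per_block, growth_pct=0):
    """Prospective record count divided by records per block,
    optionally inflated by growth_pct and rounded up (assumption)."""
    base = expected_records / recs_per_block
    return math.ceil(base * (1 + growth_pct / 100))

print(modulo_for(575, 9))      # 64
print(modulo_for(575, 9, 20))  # 77
```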

One of the results of rounding down in the example above is that there
will be some guaranteed waste of space - 109 bytes times .1 or 11 bytes.
Divide this by the 992 bytes available and you see that, at least,
around 1% of the file will be wasted space. 

Sometimes, however, the percent of wasted space is rather high. To
eliminate this, double the size of the block and repeat the
calculations. You may need to double again. The blocksizes I'd advise
using are 1k, 2k, 4k, 8k, and 16k. The objective is to get the wasted
space down to some acceptable percent.
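
To repeat the calculation for each candidate blocksize, something like
the following Python sketch can help. It assumes the same 32-byte
header at every blocksize and looks only at the first block of a group;
the names are made up for illustration.

```python
BLOCK_HEADER = 32  # assumed to be the same for every blocksize

def wasted_space(block_size, total):
    """Return (records per block, wasted bytes, wasted percent)
    for one block, given the TOTAL bytes per record."""
    usable = block_size - BLOCK_HEADER
    recs = usable // total
    wasted = usable - recs * total
    return recs, wasted, wasted / usable * 100

for kb in (1, 2, 4, 8, 16):
    recs, wasted, pct = wasted_space(kb * 1024, 109)
    print(f"{kb:>2}k: {recs:>3} records/block, "
          f"{wasted:>3} bytes wasted ({pct:.1f}%)")
```

For the 1k case this reports 9 records and 11 wasted bytes, which
matches the figures worked out above.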

Do not choose a blocksize that is smaller than the average record size
because this will force each record in the file that is greater in size
than the block size to start its data portion in a separate overflow
block, usually wasting space.

Wasted space causes performance problems as well as taking too much disk
space relative to the actual data stored. Operations which must process
the whole file must traverse all of the groups - even the sparsely
populated groups.

Determine Split and Merge Loads

The next step, important and often overlooked, is to determine the split
load percent. KEYONLY is the presumed split method. Under this method,
when the key portion of the primary group occupies x% of the total
blocksize, then the group splits. It is assumed you understand the
purpose and repercussions of this. To determine the best split load
percent, apply the following formula: 


SPLIT = INT(RECORDS PER BLOCK * IDSIZE * 100 / BLOCKSIZE)

SPLIT = INT(9 * 17 * 100 / 1024) = 14

So, the split percent should be around 14% to fit most of the records
into primarily the dat portion of the file. If you also want to fully
populate the over portion, then you can increase the split by a factor
of 2, to a value of around 30. The merge percent is calculated to be
about half of the split load figure, so it needs to be around 8 for the
first and 15 for the second. These are approximations that will work if
the std dev is
not too high. The best way to gain a feel for how these models work is
to review a dozen or so files in your production environment in the
fashion described above. 
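
Here is the split load arithmetic as a small Python sketch. The 9 is
the records per block worked out earlier (992 usable bytes / 109 TOTAL,
rounded down) and the 17 is IDSIZE. The function names are invented for
this example, and the merge figure simply takes half of the split load,
which the note then rounds loosely.

```python
def split_load(recs_per_block, idsize, block_size):
    """Percent of the block taken by keys when the block is full of
    average-sized records, truncated to an integer."""
    return int(recs_per_block * idsize * 100 / block_size)

def merge_load(split):
    """About half of the split load, per the rule of thumb above."""
    return split / 2

split = split_load(9, 17, 1024)
print(split)              # 14
print(merge_load(split))  # 7.0
```

The computed values land a little under the rounded figures quoted in
the text (8 for the merge load, and 30 for the doubled split load),
which are approximations.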

Putting it All Together

First, find a time when impact on production will be minimized. The best
thing to do is create a new dynamic file with the desired hash type,
modulo, and block size. Then use CONFIGURE.FILE to set the appropriate
SPLIT and MERGE parameters. Then copy the data from the old file to this
new structure. Copy the dictionary items to the new file's dictionary
using the overwriting option. Now, you should be able to use the CNAME
command to change the name of the old file to a backup name, and use
CNAME again to change the name of the newly populated file to the
production file name. You may wish to conduct some performance tests to
validate the improvement. Because this method of determining
reconfiguration parameters is imperfect, you may have to do some further
tinkering, such as repeating the process using a slightly higher or
lower split load percent than was calculated.

-----Original Message-----
On Behalf Of Dave Laansma
Sent: Tuesday, June 05, 2012 12:33 PM
Subject: [U2] Learning about file sizing

Can anyone point me to a good document that will give me guidelines for
'proper' file sizing of dynamic files in particular?


And when to use KEYONLY vs KEYDATA?

David Laansma
IT Manager
Hubbard Supply Co.
Direct: 810-342-7143
Office: 810-234-8681
Fax: 810-234-6142
"Delivering Products, Services and Innovative Solutions"


U2-Users mailing list
