We have "job" A, running on userid A, that writes (replaces) file CEA DATA in a
filecontrol directory
using COPYFILE (REPLACE. Then "job" B, running on userid B, comes along and
reads the file and
creates a DB2 table. This has been running successfully for several months.
On Friday, the DB2 table had only 15,000 rows in it, instead of the usual 204,000 rows. (And I had 300+ messages in my email!)
Console log for user B shows 204,000 records in CEA FILE -- using REXX stream('CEA FILE fm','c','query info'). Then it processes the file, but reads (and writes) only 15,000 records.
Normally, userid A is autologged at about 2:00 AM and userid B at 3:30 AM. (Both jobs take about 20 minutes.) On Friday morning, though, both were autologged within the same minute, at 3:30 AM. (We do not yet know why job A was delayed.)
I was under the impression that COPYFILE created a temporary file, and only renamed it to the true filename after the copy was completed. I was also under the impression that once the file is opened, SFS guarantees that the reader won't see a change to the file until it is closed. So user B should have either processed all 204,000 records of the old file, or all 204,000 records of the new file -- but that is not what happened.
The timestamp on the file now shows 3:50 AM, 20 minutes after the two userids were autologged.
What happened?
And how do we prevent it from happening again -- some kind of explicit SFS lock? Do our own rename? (Why didn't implicit SFS locks handle this?)
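One shape the "do our own rename" idea might take in job A -- purely a sketch, where "CEA WORK" and the source fileid are made-up names, and ERASE-then-RENAME still leaves a small window that an explicit SFS lock would be needed to close:

```rexx
/* Sketch only: build the new copy under a work name, then swap it */
/* into place, so a reader never opens a half-written CEA DATA.    */
/* "CEA WORK" is a hypothetical work-file name.                    */
'COPYFILE CEA DATA srcfm CEA WORK fm (REPLACE'  /* build new copy  */
if rc \= 0 then exit rc
'ERASE CEA DATA fm'                             /* drop old version */
'RENAME CEA WORK fm CEA DATA fm'                /* swap in new one  */
if rc \= 0 then exit rc
```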
As far as we are concerned, if the new file is not available at 3:30 AM, we just want to abort and send an email. We're a LOT better off with all 204,000 rows from yesterday than with only 15,000 rows from today.
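The abort check in job B might look something like this -- a sketch that assumes stream()'s 'QUERY DATETIME' returns the last-update date as its first token in mm/dd/yy form (verify the reply layout against HELP in your CMS level), and that leaves the actual e-mail step to a local exec:

```rexx
/* Sketch for job B: abort if CEA DATA was not refreshed today,    */
/* so yesterday's 204,000-row DB2 table is left untouched.         */
parse value stream('CEA DATA fm','c','query datetime') with fdate .
if fdate \= date('U') then do   /* date('U') gives mm/dd/yy        */
   say 'CEA DATA is stale ('fdate') -- aborting, keeping old table'
   /* send the alert here, e.g. via NOTE or a local mailer exec    */
   exit 8
end
```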
Alan Ackerman
Alan (dot) Ackerman (at) Bank of America (dot) com