I've been struggling for days to get version 3.1.4 out.  Every
time I ran the regression tests I got failures.  The failures
were not always in the same place, but there were always one
or two.

I frequently got failures in the memory-db tests where we create
a large in-memory database, make lots of changes, roll those
changes back, then verify that the database holds exactly the same
information as it did before the transaction.  In a database of
about a megabyte in size, I would sometimes see a single bit
difference after the rollback.  The bit that changed would always
be the 0x08 bit.  But the location of the change within the
database was seemingly random.
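
To illustrate the kind of check described above, here is a minimal
sketch in Python using the standard sqlite3 module.  It is not the
actual SQLite regression test; the table name t, the row layout and
the helper names are all made up for illustration.  It builds an
in-memory database of roughly a megabyte, changes it inside a
transaction, rolls back, and reports any byte that differs along
with the XOR mask of the changed bits.

```python
import sqlite3


def snapshot(conn):
    """Return every row of the hypothetical table t in a stable order."""
    return conn.execute("SELECT id, payload FROM t ORDER BY id").fetchall()


def rollback_roundtrip(rows=1000, payload_size=1024):
    """Fill an in-memory database, modify it, roll back, and return
    the contents as seen before and after the transaction."""
    # isolation_level=None puts the connection in autocommit mode,
    # so the BEGIN and ROLLBACK below are issued explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE t(id INTEGER PRIMARY KEY, payload BLOB)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)",
        ((i, bytes([i % 256]) * payload_size) for i in range(rows)),
    )
    before = snapshot(conn)

    conn.execute("BEGIN")
    conn.execute("UPDATE t SET payload = zeroblob(?)", (payload_size,))
    conn.execute("DELETE FROM t WHERE id % 3 = 0")
    conn.execute("ROLLBACK")

    after = snapshot(conn)
    conn.close()
    return before, after


def bit_differences(before, after):
    """Return (row id, byte offset, XOR mask) for each differing byte.
    Assumes both snapshots hold the same rows in the same order."""
    diffs = []
    for (row_id, blob_a), (_, blob_b) in zip(before, after):
        for offset, (x, y) in enumerate(zip(blob_a, blob_b)):
            if x != y:
                diffs.append((row_id, offset, x ^ y))
    return diffs


if __name__ == "__main__":
    before, after = rollback_roundtrip()
    for row_id, offset, mask in bit_differences(before, after):
        print(f"row {row_id}, byte {offset}: bits changed = 0x{mask:02x}")
```

On healthy hardware this should print nothing.  A failure like the
one described here would show up as a single line with a mask of
0x08.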

I was talking with Dan about this yesterday - he was unable to
reproduce the problem.  So I said "Maybe it's hardware?"
"Not likely", Dan replied.  And rightly so.  No programmer ever
wants to admit that a nasty problem might be lurking in their
own code.  It is always easier to blame something else - some
library you are linking against, the operating system, the
hardware you are running on.  But at the end of the day, the
problem usually does end up being in your own code and not
elsewhere.  So after you have been programming for a while
(decades in my case) you begin to be very suspicious when
people go blaming malfunctions on the parts they didn't write.

But last night, I was at my wit's end trying to track down the problem
in SQLite.  I figured it couldn't hurt to test the memory, so I
rebooted using the SuSE install disk, which happens to have a
nifty memory checker built in.  About 10 minutes into the test,
some errors popped up.  On a 512MB SIMM, fewer than 10 memory cells
were showing a problem, and then only if a specific bit pattern
was written into adjacent cells.  The error was always in the
0x08 bit.  I removed the offending SIMM, rebooted and all tests
passed.

I find it utterly amazing that a machine with bad memory could
run a full-blown Linux desktop and a copy of Win2K running in
VMware for days on end without showing a problem, then suddenly
begin having trouble with the SQLite regression suite.  Yet that
is what appears to have happened.

Now it is still always the best policy to blame your own code
first.  When something isn't working right, the person sitting
behind the keyboard is the most likely cause.  Sometimes you
will run into problems with the library you are using, or with
your compiler, or your OS, but those cases are rare.  Hardware
is seldom an issue.  But as this case shows, sometimes, very
rarely, it really can be the hardware's fault.

-- 
D. Richard Hipp <[EMAIL PROTECTED]>
