Jerry Van Baren wrote: > Larry Johnson wrote: >> The ECC POST reported intermittent failures running after power-up on >> the Korat PPC440EPx board. Even when the test passed, the debugging >> output occasionally reported additional unexpected ECC errors. >> >> This refactoring had two main objectives: (1) minimize the code executed >> with ECC enabled during the tests, and (2) add more checking of the >> results so any unexpected ECC errors would cause the test to fail. >> >> So far, the refactored test has not reported any intermittent failures. >> Further, synchronization instructions appear no longer to be require, so >> have been removed. If intermittent failures do occur in the future, the >> refactoring should make the causes easier to identify. > > WHOOP, WHOOP, WHOOP, red alert! "[S]ynchronization instructions appear > no longer to be require[d], so have been removed". > > Synchronization instructions either *ARE* required or *ARE NOT* > required, there is no "appear". When sync instructions appear to not be > required, but actually are required, that is when really obscure bugs > start happening in the dead of winter off the coast of Alaska / Siberia > and your boss asks you if you have warm clothes. > > I am not familiar with the 4xx family or the PowerPC core that is used > in it but... > > [snip] > >> -static int test_ecc(unsigned long ecc_addr) >> +static int test_ecc(uint32_t ecc_addr) >> { >> - unsigned long value; >> - volatile unsigned *const ecc_mem = (volatile unsigned *) ecc_addr; >> - int pret; >> + uint32_t value; >> + volatile uint32_t *const ecc_mem = (volatile uint32_t *)ecc_addr; >> int ret = 0; >> >> - sync(); >> - eieio(); >> WATCHDOG_RESET(); > > The combination of "sync" and "eieio" is a strong indication of someone > sprinkling pixie dust rather than understanding the problem. > > "Sync" forces all pending I/O (read/write) operations to be completed > all the way to memory/hardware register before the instruction > continues. Sync guarantees *WHEN* the I/O will complete: NOW. This is > a big hammer: it can cause a significant performance hit because it > stalls the processor BUT it is guaranteed effective (except for the > places that need an both an isync and a sync combination - thankfully, I > believe that is only needed in special cases when playing with the > processor's control registers). > > "Eieio" (enforce in-order execution of I/O) is a barrier that says all > I/O that goes before it must be completed before any I/O that goes after > it is started. It *DOES NOT* guarantee *WHEN* the preceding > reads/writes will be completed. Theoretically, the bus interface unit > (BIU) could hold the whole shootin' match for 10 minutes before it does > the preceding I/O followed by the succeeding I/O. Eieio is much less > draconian to the processor than sync (which is why eieios are preferred) > but an eieio may or may not cause the intended synchronizing result if > you are relying on a write or read causing the proper effect *NOW*. Note > that eieios are NOPs to processor cores that don't reorder I/O. > > Some PowerPC cores (e.g. the 74xx family) can reorder reads and writes > in the bus interface unit (some cores, such as the 603e, do *not* > reorder reads and writes). This is a performance enhancement... writes > (generally) are non-blocking to the processor core where a read causes > the processor to have to wait for the data (which cascades into pipeline > stalls and performance hits). The bus is a highly oversubscribed > resource (core speed / bus speed can be 8x or more). As a result, you > want to get reads done ASAP (if possible) and thus it is beneficial to > move a read ahead of a write. > > As you should have picked up by now, a sync (forcing all I/O to > complete) followed by eieio is silly - the eieio is superfluous. Seeing > syncs/isyncs/eieios sprinkled in code is an indication that the author > didn't understand what was going on and, as a result, kept hitting the > problem with a bigger and bigger hammer until it appeared to have gone > away. > > Besides read/write reordering problems, the bus interface unit (BIU) can > "short circuit" a read that follows a write to the same address. This > is very likely to be implemented in a given core - it offers a very good > speed up traded off against a modest increase in complexity to the BIU. > The problem is (for instance), if you configure your EDC to store an > invalid EDC flag, do a write to a test location (which gets held in the > BIU because the bus is busy), followed by a read of the test location > (expecting to see an EDC failure), the BIU could return the queued *but > unwritten* write value. > > OK, enough lecturing... > > Repeated disclaimer: What I write here is applicable for more complex > PowerPC implementations. It may not be applicable for the particular > 4xx core you are running on. I am not familiar with the 4xx core. > > The reason sync/eieio is very likely VITAL in a EDC test is that... > > 1) The EDC is being reconfigured in a way that can cause latent EDC > faults. If a write - for instance a save of a register on the stack - > gets deferred inadvertently until _after_ the EDC hardware is configured > to test an error, you could end up with the register save performed with > inadvertently screwed up EDC. As a result, when your code executes the > return postlog (popping the register off the stack) you will get a > totally unexpected EDC error. > > 2) Even if an eieio is used properly, the (EDC reconfiguration, write, > read, EDC fixup) sequence may occur in the right order, but it may occur > *way later* than you expected which could cause an EDC exception way > later in the code than you expected. This would lead to very flaky > results, unexpected EDC failures, etc. Hmmmmm. > > While the previous scenarios are worst cases, improper sync discipline > can cause test failures as well. In fact, it is actually more likely to > cause problems with the test that the worst case scenario. For > instance, if the BIU holds a write and short-circuits subsequent reads, > you may *think* you are testing EDC but, if the BIU has the write queued > and the read comes from the BIU rather than actual memory, the BIU will > inadvertently short circuit your test as well. > > By the way, if interrupts are enabled during this time.......... > (shudders) oooh, good choice to run polled, Dan/Wolfgang! > > [major snip] > > HTH, > gvb
Yes, it does help. Thanks, Jerry. When I first modified the (then) LWMON5 ECC POST to run on Korat and other 440EPx boards, Stefan urged me to replace the memory accesses using volatile pointers with accesses via "in_be32()" et al. As I understood it, this should have eliminated the need for any external synchronization. However, after checking the PPC440 documetation, I now believe that there should be a "sync" (actually, an "msync", which is the same opcode) between the memory access and access to the SDRAM- controller registers. (From what I can tell, "in_be32()" et al. do not not force completion of the storage access before returning. Is this correct?) Stefan, please hold of on this patch, as I expect to be resubmitting it soon. :-) Best regards, Larry ------------------------------------------------------------------------- Check out the new SourceForge.net Marketplace. It's the best place to buy or sell services for just about anything Open Source. http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace _______________________________________________ U-Boot-Users mailing list U-Boot-Users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/u-boot-users