Re: [Soekris] net5501 inexplicable crashes under load and using wlan - mystery solved

Attila Kinali Sun, 08 Apr 2012 06:55:33 -0700

On Wed, 4 Apr 2012 11:01:51 +0900
Alan <[email protected]> wrote:

> On Sun, Apr 1, 2012 at 7:05 PM, Attila Kinali <[email protected]> wrote:
> 
> > I finaly got the time to work on this again and got my net5501
> > working without crashes even under heavy load and using wlan at full
> > power. At least not within 24h.
> 
> Great that someone is working on this, but 24 hours is not much.  If I
> recall correctly one of my net5501 could go up to 2 weeks (of light
> use) without crashing.

Oh.. compared to 2 minutes (!) for an unmodified board or 5 minutes with
only two 1000uF electrolytic capacitors connected to J5 it's pretty impressive.
Besides, it was only last weekend that i did the modifications, so i didn't
have the time to let it run for longer. Now i have had it running for almost
a week with no crashes.

> Like others have said, I am eagerly waiting for the description,
> schematics and pictures to fix this.

I didn't take any pictures of the modified board, but i can tell you
what the cause is in more detail:

<summary>
The power supply of the net5501, and the way it is distributed to the
circuitry, disregards all common good design practices. This leads to
problems under certain load and use conditions. These problems can be bit
errors or complete crashes.
</summary>

Attached is a picture of a DDR SRAM chip mounted on the bottom of the
net5501. For your convenience, i marked the power supply pins red (for
VDD and VDDQ) and the ground pins blue (for VSS and VSSQ).
I used this chip because it's one of the best examples of what i want to
show, and it has very little circuitry around it that would distract
or make my point less clear.

The first thing that strikes the eye is that there are only 4 capacitors
around the chip, while it has 8 power supply pins. Even worse, those
4 capacitors are shared with the two adjacent SRAM chips (effectively
halving the number of capacitors "seen" by a chip).

The next thing you should notice is that there isn't a via visible for
each power or ground pin. This suggests that only one via (underneath
the chip) has been used to connect each pin to its power/ground plane.
You can also spot places where the same via is used for two pins.

Now, what does that mean?
Digital chips are beasts in terms of power supply. Most of the time
they draw hardly any power (at least nothing worth talking about), but
when the clock switches from high to low (or low to high, or both),
they draw a huge amount of power. One part of that power is used to
switch the transistors inside the chip, another part goes into the
switching of the output pins. Simplified, you can see the internal
circuitry and the output pins as a CMOS inverter [1]. When A changes
its logic level, there is first the gate capacitance that has to be
charged/discharged. Second, there is a very short period when both
transistors are conducting, leading to the so-called shoot-through
current. This current is limited by the conductance of the
transistors themselves. For internal circuits it's quite low
(they don't have to conduct huge currents), but due to the number of
transistors switching at the same time, it cannot be neglected.
For the output pins it's a different matter. They are designed to
provide large currents (at least 16mA per pin in the DDR SRAM case).
So the shoot-through current is significant for each pin, and the situation
becomes worse when multiple pins are switching at the same time.
Please keep in mind that the shoot-through current lasts only for a
very short period of time, typically less than 1ns. On one hand, this
helps, as only little energy is lost to the shoot-through. But on
the other hand, it leads to very high frequency components.

The next big power hog is the capacitance connected to the chip.
Each pin of a chip package has a capacitance on the order of 1-20pF
(DDR SRAM chips have a specified pin capacitance of <5pF).
I.e. you have two pin capacitances (the "sender" and the "receiver" chip)
plus the capacitance of the wire connecting them.
Each time an output pin switches high->low or low->high, this capacitance
has to be charged/discharged, i.e. during this short period a current of
approximately 16mA is flowing through the pin. (Again: think about
multiple pins switching at the same time.)
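
As a rough sanity check of the charging current described above, here is a
small sketch based on I = C * dV/dt. Only the "<5pF" pin capacitance and the
16mA figure come from the text; the wire capacitance, the voltage swing and
the edge duration are assumptions picked for illustration, not measurements
of the net5501.

```python
# Rough estimate of the current needed to charge the capacitance on one output.
# All component values are illustrative assumptions, not measured values.

def charge_current(c_farad, delta_v, t_switch):
    """Average current to move delta_v across c_farad within t_switch (I = C * dV / dt)."""
    return c_farad * delta_v / t_switch

C_SENDER = 5e-12    # pin capacitance of the driving chip (upper bound from the text)
C_RECEIVER = 5e-12  # pin capacitance of the receiving chip (assumed equal)
C_WIRE = 3e-12      # capacitance of the wire between them (assumed)
V_SWING = 2.5       # I/O voltage swing in volts (assumed)
T_EDGE = 2e-9       # duration of the switching edge in seconds (assumed)

c_total = C_SENDER + C_RECEIVER + C_WIRE
i_pin = charge_current(c_total, V_SWING, T_EDGE)

print(f"total load capacitance: {c_total * 1e12:.0f} pF")
print(f"average current per pin during the edge: {i_pin * 1e3:.2f} mA")

# Several outputs switching on the same clock edge add up on the supply pins.
for n_pins in (1, 8, 16):
    print(f"{n_pins:2d} pins switching together: {n_pins * i_pin * 1e3:.1f} mA")
```

With these assumed values the estimate lands at roughly 16mA per pin, and
the total scales linearly with the number of pins switching together.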

That's the theory.

Now to the practical stuff:

Because of the "spiky" current consumption of digital logic, it has become
customary in electronics to attach a 100nF capacitor to each
power supply pin, so that the chip has a low inductance
and low resistance "power source" during the switching time. This has been
done since at least the 1970s, when 74xx logic was in widespread use.
You can still see this in DIL sockets sold with integrated 100nF capacitors.
The capacitor is connected directly between a power and a ground pin if
possible, to ensure minimal resistance between the capacitor and the
chip. You cannot group those capacitors together at one pin and just
connect the other pins to the power supply and ground, because the wires
and vias have a resistance and (more importantly) an inductance
that cannot be neglected. For fast digital chips, which have very high
frequency components on the power supply pins, it has become customary to
connect a 10nF capacitor directly at the pin and a 100nF one adjacent to it.
This is because even those tiny capacitors have an inductance, and due to
their internal structure this inductance dominates above the so-called self
resonance frequency. The self resonance frequency is higher for smaller
value capacitors, making them better suited for high frequency applications.
The larger capacitor is then used to provide the energy, while the smaller
one "eats" the spikes.

Also, for high current chips like SRAM chips, you generally use a larger
capacitor (somewhere in the range of 1-10uF) adjacent to the chip, to catch
the lower frequency components (the bumps, so to speak) that the 100nF
capacitors couldn't catch. The placement of this capacitor is not as critical,
as it is "only" for the "low" frequency components. But it should still be
as near to the chip as possible, with one capacitor per chip.
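
To illustrate why capacitors of different sizes cover different frequency
ranges, here is a minimal sketch of the self resonance frequency,
f = 1 / (2 * pi * sqrt(L * C)). The parasitic inductance of 1nH per capacitor
is an assumption for illustration; real values depend on package and
mounting and should be taken from the datasheet.

```python
import math

def self_resonance_hz(c_farad, esl_henry):
    """Self resonance frequency of a capacitor with parasitic series inductance esl_henry."""
    return 1.0 / (2.0 * math.pi * math.sqrt(esl_henry * c_farad))

ESL = 1e-9  # assumed parasitic inductance, same for all three parts (a simplification)

capacitors = (("10 nF", 10e-9), ("100 nF", 100e-9), ("10 uF", 10e-6))

for label, c in capacitors:
    f_res = self_resonance_hz(c, ESL)
    # Above f_res the part behaves like an inductor and stops decoupling well.
    print(f"{label:>6}: self resonance at about {f_res / 1e6:.1f} MHz")
```

Under this assumption the 10nF part stays capacitive up to a roughly three
times higher frequency than the 100nF part, which is why the small one goes
directly at the pin.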

Additionally, each power supply and ground pin is normally connected to its
plane in the middle of the board by two vias. This is done to reduce the
inductance of the connection. Using two vias in parallel roughly halves the
inductance.
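
A small sketch of the effect of the second via, using V = L * dI/dt. The 1nH
per via and the 1ns edge are assumed values for illustration; only the 16mA
per pin comes from the text above. Mutual coupling between the vias is
ignored, so the real improvement is somewhat smaller than a clean factor of
two.

```python
# Illustrative numbers only: via inductance and edge duration are assumptions.

def parallel_inductance(l_single, n_vias):
    """Inductance of n identical vias in parallel, ignoring mutual coupling."""
    return l_single / n_vias

def inductive_drop(l_henry, delta_i, delta_t):
    """Voltage across an inductance for a current step delta_i within delta_t (V = L * dI/dt)."""
    return l_henry * delta_i / delta_t

L_VIA = 1e-9     # inductance of a single via in henry (assumed)
DELTA_I = 16e-3  # current step per pin in ampere (16mA figure from the text)
DELTA_T = 1e-9   # duration of the current step in seconds (assumed)

for n in (1, 2):
    l_eff = parallel_inductance(L_VIA, n)
    drop = inductive_drop(l_eff, DELTA_I, DELTA_T)
    print(f"{n} via(s): {l_eff * 1e9:.2f} nH, drop of {drop * 1e3:.1f} mV per switching pin")
```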

Ignoring these common engineering practices is generally a bad idea.
It leads to so-called ground bounce, where the local power supply
voltage at the chip drops due to inductance and resistance in the
wires/vias to the chip. And even worse: because the inductance/resistance
at the power supply and ground pins is not the same, the chip's voltage
levels will bounce around wildly depending on how much current is flowing
where. At best, ground bounce leads to a decreased signal to noise
ratio (higher bit error rate); in the intermediate case, to outright bit
errors. In the worst case, it leads to the chip entering an improper
operating state, where it becomes dysfunctional (either not doing anything
anymore or doing wild things it shouldn't do, potentially leading to the
destruction of itself or other chips).

You also do not share power supply and ground connections between chips
unless you can ensure that they switch at different times. In this case,
the SRAM chips switch at exactly the same time, making the ground
bounce problem even worse.
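
To make the point about shared connections concrete, here is a simplified
sketch that treats the shared via as a single inductance and adds up the
current steps of all chips switching at the same instant. The 1nH, the 8
outputs per chip and the 1ns edge are hypothetical values chosen for
illustration; the model ignores decoupling capacitors and resistance
entirely.

```python
from collections import Counter

L_SHARED = 1e-9                    # inductance of the shared via path in henry (assumed)
DI_DT_PER_CHIP = 8 * 16e-3 / 1e-9  # assumed: 8 outputs of 16mA each, switching within 1ns

def peak_bounce(switch_times_ns, l_shared=L_SHARED, di_dt=DI_DT_PER_CHIP):
    """Peak voltage across the shared inductance.

    switch_times_ns holds one switching instant per chip; chips that switch
    at the same instant add their current steps, the others do not overlap.
    """
    chips_per_instant = Counter(switch_times_ns)
    worst_case_chips = max(chips_per_instant.values())
    return l_shared * worst_case_chips * di_dt

staggered = peak_bounce([0, 5])     # two chips switching at different instants
simultaneous = peak_bounce([0, 0])  # two chips on the same clock edge

print(f"staggered:    {staggered * 1e3:.0f} mV")
print(f"simultaneous: {simultaneous * 1e3:.0f} mV")
```

In this simple model the bounce doubles as soon as two chips sharing a via
switch on the same edge.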

There is a way to mitigate this problem a little bit, at least the part
of the problem that is caused by the output pin wire capacitance. If you
put a resistor (usually 10-30 Ohm) into the wire, you "insulate" the
capacitance on the downstream side from the output pin, forming an
R-C circuit. The R limits the amount of current flowing into the capacitance
downstream of the resistor. The main disadvantage is that the
switching time is increased by the R-C time constant. A second disadvantage
is that you add two additional pin capacitances (those of the resistor) to
the system. Overall, this technique helps only if the capacitance of the wire
or of the "recipient" chip's pin is significantly higher than that of the
"sender" chip's pin and the part of the wire between the "sender" chip and
the resistor.
You can see those resistors on the net5501 as small resistor networks between
the SRAM and the Geode chips. Please note that the resistors are on a
short wire, hence the capacitance of the wire is most likely below 10pF,
probably in the range of 2-5pF. Also note that the Geode chip probably has
pin capacitance characteristics similar to those of the SRAM chips.
What is really interesting here, though, is that the top side and bottom side
SRAM chips have resistors of different values: the top side is 33 Ohm, while
the bottom side is 22 Ohm. This is very unusual, as normally you'd choose
the same value for all resistors, because the wires are usually routed to have
the same properties (same length, same capacitance, same inductance).
I can only guess that Soekris might have had problems with SRAM
ground bounce and thus increased the resistor value on the top side to
mitigate it.
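
The trade-off of the series resistor can be sketched with the two resistor
values found on the board. The 10pF downstream capacitance and the 2.5V
swing are assumptions for illustration, and the driver is treated as an
ideal voltage step with no output resistance of its own, so the peak
current is an upper bound and not what the real pin delivers.

```python
def rc_time_constant(r_ohm, c_farad):
    """Time constant of the R-C circuit formed by series resistor and downstream capacitance."""
    return r_ohm * c_farad

def peak_current(v_swing, r_ohm):
    """Upper bound of the charging current right after the edge (ideal driver assumed)."""
    return v_swing / r_ohm

C_DOWNSTREAM = 10e-12  # wire plus receiver pin capacitance in farad (assumed)
V_SWING = 2.5          # I/O voltage swing in volts (assumed)

for r in (22.0, 33.0):  # resistor values found on the bottom and top side of the board
    tau = rc_time_constant(r, C_DOWNSTREAM)
    t_rise = 2.2 * tau  # 10%-90% rise time of a first-order R-C circuit
    i_max = peak_current(V_SWING, r)
    print(f"{r:.0f} Ohm: tau {tau * 1e9:.2f} ns, rise time {t_rise * 1e9:.2f} ns, "
          f"peak current at most {i_max * 1e3:.0f} mA")
```

The larger resistor lowers the current peak but slows the edge down by the
same factor, which is the trade-off described above.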

You can see these two problems, not having enough capacitors and
sharing power supply pins between chips, throughout the board.
Actually, i found them everywhere i cared to check.

But why doesn't it lead to crashes for all users, but only for some?
Well, electronic circuits are not ideal, and no part is exactly the same
as another. Capacitors are usually rated +/-10%. More precise ones get
very expensive very quickly. For high volume production you often use +/-20%
parts because they are significantly cheaper.

The same also applies to digital logic. The voltage level at which a digital
circuit switches varies from chip to chip... it even varies from transistor
to transistor within a chip (+/-10% within a chip is kind of normal).
This also affects the noise margin a digital circuit has, meaning
that some systems will have a higher susceptibility to noise than others.
Usually this susceptibility is so low that you don't care about it (i.e. a
flipped bit every few years). But if the design is driven at its limits
(whatever the cause may be), then this susceptibility rises dramatically
and you see these "occasional", inexplicable crashes. In the case of the
net5501 and its disregard for common design principles, even normal use (like
inserting a wlan card) might drive it into this crash regime.

As i said in my previous mail, there is no real way to fix it.
You cannot add a capacitor where one is missing, because there
is no space. You cannot make lower inductance power supply and ground
connections where there are none, because you cannot access the inner planes
where these are distributed. The only thing you can do is solder a few
capacitors on top of the ones that exist and solder wires to decrease
the inductance ever so slightly. The extent to which you have to do this
depends on how exactly you use your net5501 and what part of the circuit
causes the crash. As i said, it can be anything from an hour of soldering
to a rework of the board that takes a day or two.

As you can tell, i'm quite pissed about all this, because Soren personally
made it clear a few times that he thinks his design is flawless, calling
the power supply "rock solid", and told me, sometimes in roundabout ways and
sometimes quite directly, that i'm an idiot looking for hardware problems.

And consider that the net5501 is a very expensive board: it costs 220USD,
which is twice what PC-Engines wants for their Alix boards (and you cannot
tell me that a Swiss company has lower labor costs than a US company or
that it has lower quality. If anything, the labor and production costs
in Switzerland are higher). I would have thought that at these prices
one could expect a proper design, done with all due diligence.

Also, if you check the archives of this mailing list, you will see that
crashes like the ones i had have been reported repeatedly over the years.
Soren even replied directly to some of the reports. So it is _not_ true that
they were not aware that there might be issues with the net5501.
They just ignored all reports and marked them as user errors.

                        Attila Kinali

[1] http://en.wikipedia.org/wiki/File:CMOS_Inverter.svg
[2] http://www.eetimes.com/electronics-news/4196917/Ground-Bounce-Primer
[3] http://www.fairchildsemi.com/an/AN/AN-640.pdf

-- 
Why does it take years to find the answers to
the questions one should have asked long ago?

<<attachment: sram.jpg>>

_______________________________________________
Soekris-tech mailing list
[email protected]
http://lists.soekris.com/mailman/listinfo/soekris-tech
