Re: [U-Boot] [PATCH] nvme: add more cache flushes

2019-10-17 Thread J. William Campbell

Hi All,
On 10/17/2019 12:12 AM, Patrick Wildt wrote:

On Thu, Oct 17, 2019 at 03:08:59PM +0800, Bin Meng wrote:

Hi Patrick,

On Thu, Oct 17, 2019 at 2:44 PM Patrick Wildt  wrote:

On Thu, Oct 17, 2019 at 10:55:11AM +0800, Bin Meng wrote:

Hi Patrick,

On Wed, Oct 16, 2019 at 11:35 PM Patrick Wildt  wrote:

On Wed, Oct 16, 2019 at 06:11:23PM +0800, Bin Meng wrote:

On Mon, Oct 14, 2019 at 7:11 PM Patrick Wildt  wrote:

On an i.MX8MQ our nvme driver doesn't completely work since we are
missing a few cache flushes.  One is the prp list, which is an extra
buffer that we need to flush before handing it to the hardware.  Also
the block read/write operations needs more cache flushes on this SoC.

Signed-off-by: Patrick Wildt 
---
  drivers/nvme/nvme.c | 15 +--
  1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/nvme.c b/drivers/nvme/nvme.c
index 2444e0270f..69d5e3eedc 100644
--- a/drivers/nvme/nvme.c
+++ b/drivers/nvme/nvme.c
@@ -123,6 +123,9 @@ static int nvme_setup_prps(struct nvme_dev *dev, u64 *prp2,
 }
 *prp2 = (ulong)dev->prp_pool;

+   flush_dcache_range((ulong)dev->prp_pool, (ulong)dev->prp_pool +
+  dev->prp_entry_num * sizeof(u64));
+
 return 0;
  }

@@ -717,9 +720,10 @@ static ulong nvme_blk_rw(struct udevice *udev, lbaint_t 
blknr,
 u16 lbas = 1 << (dev->max_transfer_shift - ns->lba_shift);
 u64 total_lbas = blkcnt;

-   if (!read)
-   flush_dcache_range((unsigned long)buffer,
-  (unsigned long)buffer + total_len);
+   flush_dcache_range((unsigned long)buffer,
+  (unsigned long)buffer + total_len);

Why we need this for read?

I'm no processor engineer, but I have read (and "seen") the following
on ARMs.  If I'm wrong. please correct me.

Since the buffer is usually allocated cache-aligned on the stack,
it is very possible that this buffer was previously still used
as it's supposed to be used: as stack.  Thus, the caches can still
be filled, and are not yet evicted/flushed to memory.  Now it is
possible that between the DMA access from the device and our cache
invalidation, the CPU cache writes back its contents to memory,
overwriting whatever the NVMe just gave us.

OK, this makes sense. So if we allocate the buffer from the heap, we
should only care about flush on write, right? Can you test this?

If you're talking about having a bounce buffer:  You'd need to flush
it once upon allocation, because that part of the heap could have also
be used earlier by someone else.


I was not talking about bounce buffer. I mean the buffer allocated
from help and use that directly for DMA.

Regards,
Bin

If you allocate a buffer from the heap, you still need to make sure
to flush it once, since there's still the chance that you have used
it shortly earlier.  But it's less of an issue as on the stack.


The "rules" for cache management of DMA buffers for non-cache-coherent 
CPUs are immutable. It doesn't matter where the memory came from 
(static, heap, or stack). It may be in cache, and it may be dirty. You 
can't know  it is for sure "clean". It is assumed that the DMA buffers 
are allocated to begin on a cache line boundary and are a multiple of a 
cache line in length. If this is not the case, references by the CPU to 
locations outside the buffer can make the cache state of the buffer 
dirty, which is an error. It is also required that there be no accesses 
to the DMA buffer by the CPU while DMA is in progress. This is normally 
true by default, and if it isn't true, it is an error. The rules are 
then as follows:


On write, before DMA is started. the cache corresponding to the DMA 
buffer addresses MUST be flushed to ensure the desired content is 
transferred from cache to RAM before write DMA begins.


On read, before DMA is started, the cache corresponding to the DMA 
buffer addresses MUST be either invalidated or flushed to ensure that no 
cache system initiated writes to RAM will occur while read DMA is in 
progress. Normally, invalidate is preferred because it is faster. 
However, for programming simplicity some drivers choose to flush before 
both read or write DMA is started. If that is the case, it is good 
practice to note that choice in a comment.


On write, after DMA is completed, no additional cache actions are required.

On read, after DMA is completed, the cache corresponding to the DMA 
buffer addresses MUST be invalidated. This is necessary to ensure that 
stale data in the cache will not be used instead of the new read data in 
RAM. If one elected to invalidate the cache before the read DMA started, 
one may wonder why a second invalidate is necessary.  Since the buffer 
is not allowed to be referenced by the program in the interim, the cache 
should not contain any data from the DMA buffer area. The reason is that 
modern CPUs, may load data from the DMA buffer into cache while DMA is 
in progress. This can 

Re: [U-Boot] [PATCH 1/3] spl: ram: Do not memcpy() identical buffers

2019-05-28 Thread J. William Campbell

On 5/28/2019 4:19 AM, Tom Rini wrote:

On Tue, May 28, 2019 at 05:24:34AM +0200, Marek Vasut wrote:

On 5/28/19 5:04 AM, Tom Rini wrote:

On Tue, May 28, 2019 at 04:44:52AM +0200, Marek Vasut wrote:

On 5/28/19 4:42 AM, Tom Rini wrote:

On Tue, May 28, 2019 at 04:07:44AM +0200, Marek Vasut wrote:

On 5/28/19 4:06 AM, Tom Rini wrote:

On Tue, May 28, 2019 at 03:49:13AM +0200, Marek Vasut wrote:


If the source and destination buffer address is identical, there is
no need to memcpy() the content. Skip the memcpy() in such a case.

Signed-off-by: Marek Vasut 
Cc: Michal Simek 
Cc: Tom Rini 

Shouldn't memcpy catch that itself?


memcpy(3) says
The memcpy() function copies n bytes from memory area src to
memory area dest.  The memory areas must not overlap.  Use memmove(3) if
the memory areas do overlap.

OK, and shouldn't memcpy optimize that case?  Does it usually?

As the manpage says "The memory areas must not overlap." , I would
expect it does not have to ?

I guess I'm not being clear enough, sorry.  Go look at how this is
implemented in a few places please and report back to us.  Someone else,
or many someone else, have probably already figured out if optimizing
this case in general, in memcpy, is a good idea or not.  Thanks!

If even [1] says the behavior is undefined, then what does it matter
whether some implementation added an optimization for such undefined
behavior? We cannot depend on it, can we ?

[1] https://pubs.opengroup.org/onlinepubs/009695399/functions/memcpy.html

Yes, yes it would be worth seeing if other groups that have looked into
the performance impact here have also decided to optimize this undefined
behavior or not, thanks.


I don't think this is an optimization question, really. Calling memcpy 
with overlapping areas is an error. The result is explicitly undefined. 
It may well be that all the existing implementations effectively do 
nothing, either by explicit check or by actually copying the data over 
itself. However, to rely on that behavior is asking for trouble down the 
line. Undefined behavior is exactly that. Don't do it.





___
U-Boot mailing list
U-Boot@lists.denx.de
https://lists.denx.de/listinfo/u-boot



___
U-Boot mailing list
U-Boot@lists.denx.de
https://lists.denx.de/listinfo/u-boot


Re: [U-Boot] [PATCH] libfdt: Initialize the stack variable

2017-09-04 Thread J. William Campbell

On 9/4/2017 8:41 PM, Chee, Tien Fong wrote:

On Rab, 2017-08-30 at 06:31 -0700, J. William Campbell wrote:

On 8/29/2017 10:15 PM, tien.fong.c...@intel.com wrote:

From: Tien Fong Chee <tien.fong.c...@intel.com>

Report Coverity log:
The code uses a variable that has not
been initialized, leading to unpredictable
or unintended results.

Reported-by: Coverity (CID: 60519)
Signed-off-by: Tien Fong Chee <tien.fong.c...@intel.com>
---
   lib/libfdt/fdt_wip.c |2 +-
   1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/libfdt/fdt_wip.c b/lib/libfdt/fdt_wip.c
index 45fb964..01adad0 100644
--- a/lib/libfdt/fdt_wip.c
+++ b/lib/libfdt/fdt_wip.c
@@ -115,7 +115,7 @@ int fdt_find_regions(const void *fdt, char *
const inc[], int inc_count,
     struct fdt_region region[], int max_regions,
     char *path, int path_len, int
add_string_tab)
   {
-   int stack[FDT_MAX_DEPTH];
+   int stack[FDT_MAX_DEPTH] = { 0 };

It seems to me that one of three things must be true. 1) Coverity
can't
correctly analyze the code and stack[] is not used in an un-
initialized
manner, 2) stack is used in an un-initialized manner but the result
is
not used in that case and is a "don't care" or 3) there is a bug in
the
code. It seems that just initializing the variable to 0 is a "Bad
Idea(tm)". If it is case 1 or 2, there should be a Coverity code
annotation comment added to that effect, and if it is case 3, it
should
be fixed. Initializing this variable makes the binary larger to no
purpose unless there is a bug already.

Best Regards,
J. William Campbell

Yeah, i agree with you, state machine design should ensure stack[] is
not used in a uninitialized manner. Hence, i need input from whom
familiar with this function, whether this warning fall in anyone of
these conditions. If we just direct init the stack[], and this solution
will make extra 128 bytes in binary, but having variable with default
value is also good pratice from software quality perspective.
Yes, if the default value has a rationale. On the surface, there is no 
way to know that 0 is a "good" initial value. There may be a reason that 
it is, but if we don't know for sure, it is just a "random" number. I 
hope whoever wrote this will speak up and say why the variable is never 
used before it is initialized. Thank you for being so diligent.


Best Regards,
Bill Campbell



    char *end;
    int nextoffset = 0;
    uint32_t tag;



___
U-Boot mailing list
U-Boot@lists.denx.de
https://lists.denx.de/listinfo/u-boot


Re: [U-Boot] [PATCH] libfdt: Initialize the stack variable

2017-08-30 Thread J. William Campbell

On 8/29/2017 10:15 PM, tien.fong.c...@intel.com wrote:

From: Tien Fong Chee <tien.fong.c...@intel.com>

Report Coverity log:
The code uses a variable that has not
been initialized, leading to unpredictable
or unintended results.

Reported-by: Coverity (CID: 60519)
Signed-off-by: Tien Fong Chee <tien.fong.c...@intel.com>
---
  lib/libfdt/fdt_wip.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/libfdt/fdt_wip.c b/lib/libfdt/fdt_wip.c
index 45fb964..01adad0 100644
--- a/lib/libfdt/fdt_wip.c
+++ b/lib/libfdt/fdt_wip.c
@@ -115,7 +115,7 @@ int fdt_find_regions(const void *fdt, char * const inc[], 
int inc_count,
 struct fdt_region region[], int max_regions,
 char *path, int path_len, int add_string_tab)
  {
-   int stack[FDT_MAX_DEPTH];
+   int stack[FDT_MAX_DEPTH] = { 0 };
It seems to me that one of three things must be true. 1) Coverity can't 
correctly analyze the code and stack[] is not used in an un-initialized 
manner, 2) stack is used in an un-initialized manner but the result is 
not used in that case and is a "don't care" or 3) there is a bug in the 
code. It seems that just initializing the variable to 0 is a "Bad 
Idea(tm)". If it is case 1 or 2, there should be a Coverity code 
annotation comment added to that effect, and if it is case 3, it should 
be fixed. Initializing this variable makes the binary larger to no 
purpose unless there is a bug already.


Best Regards,
J. William Campbell

char *end;
int nextoffset = 0;
uint32_t tag;



___
U-Boot mailing list
U-Boot@lists.denx.de
https://lists.denx.de/listinfo/u-boot


Re: [U-Boot] [PATCH 2/2] arm: config: enforce -fno-pic for SPL and normal U-Boot

2017-07-03 Thread J. William Campbell

On 7/3/2017 8:12 PM, Peng Fan wrote:



-Original Message-
From: Tom Rini [mailto:tr...@konsulko.com]
Sent: Tuesday, July 04, 2017 10:47 AM
To: Peng Fan 
Cc: Simon Glass ; Philipp Tomsich
; albert.u.b...@aribaud.net; u-
b...@lists.denx.de
Subject: Re: [U-Boot] [PATCH 2/2] arm: config: enforce -fno-pic for SPL and
normal U-Boot

On Tue, Jul 04, 2017 at 01:09:36AM +, Peng Fan wrote:

Hi Tom,


-Original Message-
From: Tom Rini [mailto:tr...@konsulko.com]
Sent: Tuesday, July 04, 2017 12:17 AM
To: Peng Fan ; Simon Glass ;
Philipp Tomsich 
Cc: albert.u.b...@aribaud.net; u-boot@lists.denx.de
Subject: Re: [U-Boot] [PATCH 2/2] arm: config: enforce -fno-pic for
SPL and normal U-Boot

On Mon, Jul 03, 2017 at 09:14:08PM +0800, Peng Fan wrote:


If not pass -fno-pic to toolchains, some toolchains may generate
.got and .got.plt sections, but when generate binaries, we did not
take .got and .got.plt into consideration, then SPL or normal
U-Boot boot failure because image corrupted.

Need to pass -fno-pic to disable generating .got and .got.plt
sections.

Signed-off-by: Peng Fan 
Cc: Albert Aribaud 
Cc: Tom Rini 
---
  arch/arm/config.mk | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm/config.mk b/arch/arm/config.mk index
1a9..66ae403 100644
--- a/arch/arm/config.mk
+++ b/arch/arm/config.mk
@@ -130,9 +130,10 @@ ALL-y += checkarmreloc  # instruction.
Relocation is not supported for that case, so disable  # such
usage by requiring word relocations.
  PLATFORM_CPPFLAGS += $(call cc-option, -mword-relocations)
-PLATFORM_CPPFLAGS += $(call cc-option, -fno-pic)  endif

+PLATFORM_CPPFLAGS += $(call cc-option, -fno-pic)
+
  # limit ourselves to the sections we want in the .bin.
  ifdef CONFIG_ARM64
  OBJCOPYFLAGS += -j .text -j .secure_text -j .secure_data -j
.rodata -j .data \

Something is "up" here and I need you to dig harder and perhaps see
if we're missing something in the linker scripts?  The very next line of

context here is:

 -j .u_boot_list -j .rela.dyn -j .got -j .got.plt

Meaning that we intentionally copy .got / .got.plt into the
resulting binary.  And I see that we took in 397d7d5a1be1 from you
back in 2016 saying that we needed this in SPL.  But 5a942a152776
put the got/got.plt sections (for 32bit
ARM) in intentionally as some relocations do need it.  And in
4b0d506ed3b4 Philipp seems to have seen the same problem you have,
but fixed it with adding got/got.plt to the sections list we copy in (the above

hunk of context).

If pass -fno-pic to compiler, there will be no .got and .got.plt sections.
The .got and .got.plt is usually used for dynamic link, such as linux "*.so" 
file.
We need to pass -fno-pic to compiler to remove .got and .got.plt sections.

"Usually" isn't the same as "always" or "only".  And this reminded me that we

 From https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html#Code-Gen-Options,
when -fpic is used for ARM, there will be GOT. The dynamic loader resolves the 
GOT entries
when the program starts (the dynamic loader is not part of GCC; it is part of 
the
operating system).


had to deal with e391b1e64b0b because yes, we're not making a shared library
but we do have position independent code.  So, in the case of SPL, since we

For position independent code, but not making a shared library, we could use 
-fpie.


can get away with -fno-pic (and get some space
savings) that's just not true of U-Boot itself.  We're enforcing -fpic on other
architectures, so what exactly is going on with what you're seeing?  Where

If not passing -fno-pic to gcc, the android toolchain will generate .got and 
.got.plt
section. I think these sections are for dynamic link, not for static link.


would we need to be taking .got/.got.plt into consideration and why would it
be a bad thing to do so?  Thanks!

We could keep .got and .got.plt, but personally I prefer use -fno-pic.
I think fpic is for shared library and fpie is for position independt code for 
static link.

Thanks,
Peng.
To clarify things, as I understand it, -pie produces the same code and 
segments as -pic does, plus a loader to load the code correctly into 
memory. u-boot probably doesn't want that loader. The loader is required 
to make the executable work under the OS, but stand-alone probably 
doesn't work right in general.  For u-boot, the advantage of -pic code 
is that there are no TEXTREL relocations in the binary. The code is 
generated to take care of bss (and code segment when required) 
references that require relocation. Things in the .got may need 
relocating by where bss ends up, but not the code. So relocation is 
short and simple.  Non pic code requires relocation processing that may 
be more complex than pic code, and requires a larger relocation 

Re: [U-Boot] [PATCH v2] NS16550: buffer reads

2011-10-24 Thread J. William Campbell
On 10/24/2011 3:22 PM, Graeme Russ wrote:
 Hi Wolfgang,

 On Tue, Oct 25, 2011 at 8:59 AM, Wolfgang Denkw...@denx.de  wrote:
 Dear Graeme Russ,

 In message4ea5cd39.2070...@gmail.com  you wrote:
 Assume we have a simple device with a small Rx FIFO - say, 8 bytes
 only. Guess what the chances are that we will overrun this FIFO (and
 then lose characters) before the other side stops sending?
 Should be zero (if you disable the hardware Tx buffer) unless you have an
 incredibly slow Rx routine (in which case, your baud rate is too high). And
 even with a Tx buffer in play the solution is fairly simple - After sending
 an XOFF, flush the Rx buffer (and remote Tx buffer) into an internal buffer
 until tstc() returns false - use this buffer for getc() until it is empty.
 Things are becoming more and more complicated, it seems.

 If you ask me: forget about all this, and stick with single line
 input.  Do the processing on the sender's side (always wait for a
 prompt before sending the next line).  This is MUCH easier.
 Oh, I wholehearedly agree that if we impose such a limitation, life will
 be much easier for us all ;)

 But realistically, the solution is actually very trivial and comprises
 just two parts:

 1a) When the command line interpreter (which is just the really simple
  function detecting new-line or command delimiter) dispatches a
  command string to the command interpreter (which looks up the
  command table and if it finds a command runs it) send an XOFF
 1b) When the command interpreter returns control back to the command
  line interpreter, send an XON
 1c) If an XOFF has been recently sent (i.e. no XON since) send an XON
  in getc()

 2) After sending XOFF, flush the transmitter's buffer into a small
 (16 bytes chould sufffice) buffer. Grab characters from this buffer
 before grabing from the UART when getc() is called
Hi All,
   The problem with all this is that the transmitting CPU may send 
several characters after the XOFF is transmitted, where several will 
be at least two if the transmitting CPU has no output fifo, up to 
possibly the output fifo length + 2. Furthermore, this overage can 
theoretically increase with each x-on x-off sequence, as each command 
may be, in theory,  shorter than the output fifo length. So the input 
buffering is needed in all cases, i.e. the receive fifo must be run 
dry before executing a command. If the receive fifo is shorter than the 
transmit fifo +2, you are doomed. If the receive fifo is larger than 
this, you are ok, the transmitting CPU will stop before overrunning the 
input fifo and data being lost. However, a 16 byte buffer will not be 
enough if there are lots of short commands in the input stream.

Best Regards,
Bill Campbell

 The second part should not even be necessary if you have a half-sane
 serial port with a half-decent hardware buffer.

 It really would not be that hard (maybe a few dozen lines of code at most),
 would work for all UARTS and should (hopefully) resolve all 'dropped
 character' issues.

 Now 1c) makes the bold assumption that any command which calls getc()
 'knows what it is doing' and will consume all input characters at a fast
 enough rate AND will not invoke a delay between receiving the last
 character and returning back to the command line interpreter. If the
 command knows it received characters and knows it will invoke a delay (some
 kind of post-processing) then in order to not lose characters, the command
 MUST issue it's own XOFF - nothing else aside from an interrupt driven
 serial driver will solve the problem in this case.

 Regards,

 Graeme
 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Timer API (again!)

2011-09-24 Thread J. William Campbell
On 9/16/2011 4:53 AM, Graeme Russ wrote:
  Hi All,
 
  Well, here we are again, a Timer API discussion
 
  All things considered, I don't think the Linux approach is right for 
U-Boot
  - It is designed to cater for way more use-cases than U-Boot will 
ever need
  to deal with (time queues and call-backs in particular)
Hi Graeme,
Glad you are tackling this again.
 
  To summarize, in a nutshell, what _I_ think U-Boot need from a Timer API
 
1. A reliable udelay() available as early as possible that is OK to 
delay
   longer than requested, but not shorter - Accuracy is not implied or
   assumed
I think you don't really mean this. SOME attempt at accuracy is to be 
expected. Not perfect is allowed, but to be totally bogus is not 
allowed. That is worse than nothing. Longer by 20% or so is possibly ok, 
2% isn't.
2. A method of obtaining a number of 'time intervals' which have 
elapsed
   since some random epoch (more info below)
3. A nice way of making A-B time checks simple within the code
 
  OK, some details:
 
  1. I'm starting to thing this should be a function pointer or a flag 
in gd.
  Why oh why would you do such a thing I hear you ask... udelay() is 
needed
  _EARLY_ - It may well be needed for SDRAM or other hardware 
initialisation
  prior to any reliable timer sub-system being available. But early 
udelay()
  is often inaccurate and when the timer sub-system gets fully 
initialised,
  it can be used for an accurate udelay(). x86 used to have an global data
  flag that got set when the timer-subsystem got initialised. If the 
flag was
  not set, udelay() would use one implementation, but if it was set, 
udelay()
  would use a more accurate implementation. In the case of the eNET 
board, it
  had an FPGA implementation of a microsecond timer which was even more
  accurate that the CPU, so it had it's own implementation that should 
have
  duplicated the 'if (gd-flags  TIMER_INIT)' but never did (this was OK
  because udelay() was never needed before then)
 
  I think a function pointer in gd would be a much neater way of 
handling the
  fact that more accurate (and efficient) implementations of udelay() may
  present themselves throughout the boot process
I think this won't be popular, but I am not against it on the face of it.
 
  An other option would be to have the gd flag and make udelay() a lib
  function which performs the test - If the arch timer has better than 1us
  resolution it can set the flag and udelay() will use the timer API
 
  2a (random epoch) - The timer sub-system should not rely on a particular
  epoch (1970, 1901, 0, 1, since boot, since timer init, etc...) - By 
using
  the full range of an unsigned variable, the epoch does not matter
 
  2b (time interval) - We need to pick a base resolution for the timer 
API -
  Linux uses nano-seconds and I believe this is a good choice. Big Fat 
Note:
  The underlying hardware DOES NOT need to have nano-second resolution. 
The
  only issue with nano-seconds is that it mandates a 64-bit timer - A few
  people are against this idea - lets discuss. A 32-bit microsecond timer
  provides ~4300 seconds (~1.2 hours) which _might_ be long enough, but
  stand-alone applications doing burn-in tests may need longer. Going
  milli-seconds means we cannot piggy-back udelay() on the timer API.
Well, as you probably know, I am one of those against a 64 bit timer 
API to figure out if my disk is ready, especially if my hardware 
timebase does not support nanosecond resolution anyway. I think having a 
time base in milliseconds like we have now is perfectly adequate for 
sane bootloader usage. The requirement for the timer API would then be 
to provide the current time in millisecond resolution, and the time in 
as close to microsecond resolution as possible. This would be a helper 
function for udelay.  Udelay could then always be implemented using 
microseconds to ticks and getting the subtracting delta ticks from the 
target value.
If you can provide the first function, you can provide the second. They 
can usually be the same code with different constants involved. The gd- 
call may actually be inside udelay, with the timer init replacing the 
pointer used by udelay to get the current time in microseconds. The 
helper function for udelay should not be called by other parts of 
u-boot, to prevent abuse of this capability. Note that a 32 bit 
millisecond delay is like 49 days, so that should be plenty. I think it 
must be kept in mind that u-boot is used on lots of small systems. 
Using up lots of resources for 64 bit arithmetic that we don't need 
elsewhere is , IMHO, not a good idea. If it really did something that we 
need, all well and good, I am for it. But in this case, I don't think it 
really helps us any. But it does make things harder for some CPUs, and 
that is not a good thing.
 
  My preference is a 64-bit nano-second timer with udelay() 
piggy-backed on
  the timer API (unless, like NIOS, the timer 

Re: [U-Boot] [PATCH v2] ARM926ejs: Add routines to invalidate D-Cache

2011-08-05 Thread J. William Campbell
On 8/5/2011 4:51 AM, Aneesh V wrote:
 Hi Albert,

 On Friday 05 August 2011 04:33 PM, Albert ARIBAUD wrote:
 Hi Aneesh,

 On 05/08/2011 12:47, Aneesh V wrote:
 Hi Eric,

 On Friday 05 August 2011 04:03 PM, Hong Xu wrote:
 Hi Aneesh,
 [snip ..]
 IMHO, Hong's approach is correct. If the buffer that is invalidated is
 not aligned to cache-line, one cache-line at the respective boundary
 may have to be flushed to make sure the invalidation doesn't affect
 somebody else's memory.

 The solution is for drivers to ensure that any buffer that needs to be
 invalidated is aligned to cache-line boundary at both ends. The above
 approach puts this onus on the driver. I have documented the alignment
 requirement in my recent patch series for fixing arm cache problems.
 I have not noticed the patch series. ;-)
 If we put the alignment burden to the driver, I'm afraid many drivers
 which make use of something like a DMA controller have to modify the
 code heavily. This sounds not good. :)
 We have a fundamental problem when it comes to invalidating an
 un-aligned buffer. Either you flush the boundary lines and corrupt your
 buffer at boundaries OR you invalidate without flushing and corrupt
 memory around your buffer. Both are not good! The only real solution is
 to have aligned buffers, if you want to have D-cache enabled and do DMA
 at the same time.
 Plus, there should not be *heavy* modifications; DMA engines tend to use
 essentially two types of memory-resident objects: data buffers and
 buffer descriptors. There's only a small handful of places in the driver
 code to look at to find where these objects are allocated and how.

 So I stand by my opinion: since the cache invalidation routine should
 only be called with cache-aligned objects, there is no requirement to
 flush the first (resp. last) cache line in case of unaligned start
 (resp.stop), and I don't want cache operations performed when they are
 not required.
 IMHO, flushing is better, because the person who commits the
 mistake of invalidating the un-aligned buffer is the one who is
 affected and is likely to fix the issue soon. If we didn't flush, the
 resulting corruption will cause totally random errors that will be hard
 to debug. Doing an extra flush operation for a maximum of 2 lines
 doesn't cost us anything. This is the approach followed by the kernel
 too.

Hi All,
 I am interested in this last statement in particular, that 
Linux allows non-cache aligned buffers for DMA. In a previous discussion 
series, we demonstrated why it was IMPOSSIBLE for a non-cache aligned 
DMA buffer to work in conjunction with a write back type cache. To be 
clear, by write-back, we mean a cache system that has a single dirty bit 
per cache line, and that if there are stores to any addresses in that 
line, the ENTIRE LINE will be written back into memory, not just the 
changed data. I also seem to recall that the ARM V7 caches were defined 
as write back, but I am not an ARM person so I don't know for sure what 
kind of cache we are talking about here. If it is write-back, there is 
only one possible solution that always works. Write-through will work 
with un-aligned buffers if the correct flushes and invalidates are 
present. In that case, buffer alignment is not so important. However, if 
the same driver is to be used in both cases, it must use cache-aligned 
buffers only.

Best Regards,
Bill Campbell


 br,
 Aneesh
 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] i.MX51: FEC: Cache coherency problem?

2011-07-23 Thread J. William Campbell
On 7/23/2011 6:04 AM, Albert ARIBAUD wrote:
 Le 21/07/2011 08:48, David Jander a écrit :

 However, it is still correct that copying from an non-cached area is
 slower than from cached areas, because of burst reads vs. individual
 reads. However, I doubt that the u-boot user can tell the 
 difference, as
 the network latency will far exceed the difference in copy time.

 That's assuming cache is only for networking. There can be DMA engines 
 in a lot of other peripherals which do not have the same latency as 
 network (and then, even for networking, TFTP can be done from a very 
 nearby server, possibly even on the same Ethernet segment).
Hi All,
 Yes, there are other uses of DMA. On a network, unless you have 
a Gigabit network, your memory access speed is at least an order of 
magnitude faster than the network, probably more. Plus, there is a 
latency due to sending the ack and request for the next record that 
undoubtedly swamps out any reduction in memory speed due to the single 
copy that takes place. In the case of other devices, like for example 
disks, the percentage effect is probably greater, but since these 
devices are so fast anyway that the human-perceived speed reduction is 
essentially nil. If we were talking about a CPU running Linux and doing 
all kinds of I/O all day long, the reduction in throughput  performance 
might be 10% and that might matter. In a boot loader that does I/O 
mostly to read in a program to replace itself, I would argue that nobody 
will notice the difference between cached and un-cached buffers. 
Counter-examples welcome however.

 The
 question is, which is easier to do, and that is probably a matter of
 opinion. However, it is safe to say that so far a cached solution has
 eluded us. That may be changing, but it would still be nice to know how
 to allocate a section of un-cached RAM in the ARM processor, in so far
 as the question has a single answer! That would allow easy portability
 of drivers that do not know about caches, of which there seems to be 
 many.

 That is one approach, which I think prevents cache from being used 
 beyond caching pure CPU-used DRAM.
You are certainly correct there. However, I think the pure CPU-used ram 
case is the one that matters most. Uncompressing and checksumming of 
input data are typical u-boot functions that take significant time. The 
performance increase due to cache hits in these cases is huge, and 
easily perceptible by the user.

 I agree. Unfortunately, my time is up for now, and I can't go on with 
 trying
 to fix this driver. Maybe I'll pick up after my vacation.
 As for now I settled for the ugly solution of keeping dcache disabled 
 while
 ethernet is being used :-(

 Make sure you flush before disabling. :)

 IMHO, doing cache maintenance all over the driver is not an easy or nice
 solution. Implementing a non-cached memory pool in the MMU and a 
 corresponding
 dma_malloc() sounds like much more universally applicable to any driver.

 I think cache maintenance is feasible if one makes sure the cached 
 areas used by the driver are properly aligned, which simplifies things 
 a lot: you don't have to care for flush-invalidate or just-in-time 
 invalidate, you just have to flush before sending and invalidate 
 before reading.
I do agree it can be done. However, most (I think?) of the CPUs to which 
u-boot have been ported have cache-coherent DMA. As a result, cache 
issues for  these CPUs are not addressed in the driver at all. Often 
this means that cache support is done after the fact by somebody other 
than the original author who may not totally understand the original 
driver. If DMA buffers were always allocated from cache-coherent memory, 
either because the memory is un-cached or because the CPU is DMA cache 
coherent, no changes would be necessary to get the driver working 
correctly.  If performance ever became an issue in the un-cached case, 
then more work would be required, but in most cases, I expect nobody 
will notice.

Best Regards,
Bill Campbell

 Best regards,

 Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] i.MX51: FEC: Cache coherency problem?

2011-07-20 Thread J. William Campbell
On 7/20/2011 7:35 AM, Albert ARIBAUD wrote:
 Le 20/07/2011 16:01, J. William Campbell a écrit :
 On 7/20/2011 6:02 AM, Albert ARIBAUD wrote:
 Le 19/07/2011 22:11, J. William Campbell a écrit :

 If this is true, then it means that the cache is of type write-back 
 (as
 opposed to write-thru). From a (very brief) look at the arm7 
 manuals, it
 appears that both types of cache may be present in the cpu. Do you 
 know
 how this operates?
 Usually, copyback (rather than writeback) and writethough are modes of
 operation, not cache types.
 Hi Albert,
 One some CPUs both cache modes are available. On many other CPUs (I
 would guess most), you have one fixed mode available, but not both. I
 have always seen the two modes described as write-back and
 write-through, but I am sure we are talking about the same things.

 We are. Copy-back is another name for write-back, not used by ARM but 
 by some others.

 The
 examples that have both modes that I am familiar with have the mode as a
 global setting. It is not controlled by bits in the TLB or anything
 like that. How does it work on ARM? Is it fixed, globally, globally
 controlled, or controlled by memory management?

 Well, it's a bit complicated, because it depends on the architecture 
 version *and* implementation -- ARM themselves do not mandate things, 
 and it is up to the SoC designer to specify what cache they want and 
 what mode it supports, both at L1 and L2, in their specific instance 
 of ARM cores. And yes, you can have memory areas that are write-back 
 and others that are write-through in the same system.

 If it is controlled by memory management, it looks to me like lots of
 problems could be avoided by operating with input type buffers set as
 write-through. One probably isn't going to be writing to input buffers
 much under program control anyway, so the performance loss should be
 minimal. This gets rid of the alignment restrictions on these buffers
 but not the invalidate/flush requirements.

 There's not much you can do about alignment issues except align to 
 cache line boundaries.

 However, if memory management
 is required to set the cache mode, it might be best to operate with the
 buffers and descriptors un-cached. That gets rid of the flush/invalidate
 requirement at the expense of slowing down copying from read buffers.

 That makes 'best' a subjective choice, doesn't it? :)
Hi All,
 Yes,it probably depends on the usage.

 Probably a reasonable price to pay for the associated simplicity.

 Others would say that spending some time setting up alignments and 
 flushes and invalidates is a reasonable price to pay for increased 
 performance... That's an open debate where no solution is The Right 
 One(tm).

 For instance, consider the TFTP image reading. People would like the 
 image to end up in cached memory because we'll do some checksumming on 
 it before we give it control, and having it cached makes this step 
 quite faster; but we'll lose that if we put it in non-cached memory 
 because it comes through the Ethernet controller's DMA; and it would 
 be worse to receive packets in non-cached memory only to move their 
 contents into cached memory later on.

 I think properly aligning descriptors and buffers is enough to avoid 
 the mixed flush/invalidate line issue, and wisely putting instruction 
 barriers should be enough to get the added performance of cache 
 without too much of the hassle of memory management.
I am pretty sure that all the drivers read the input data into 
intermediate buffers in all cases. There is no practical way to be sure 
the next packet received is the right one for the tftp. Plus there are 
headers involved, and furthermore there is no way to ensure that a tftp 
destination is located on a sector boundary. In short, you are going to 
copy from an input buffer to a destination.
However, it is still correct that copying from an non-cached area is 
slower than from cached areas, because of burst reads vs. individual 
reads. However, I doubt that the u-boot user can tell the difference, as 
the network latency will far exceed the difference in copy time. The 
question is, which is easier to do, and that is probably a matter of 
opinion. However, it is safe to say that so far a cached solution has 
eluded us. That may be changing, but it would still be nice to know how 
to allocate a section of un-cached RAM in the ARM processor, in so far 
as the question has a single answer! That would allow easy portability 
of drivers that do not know about caches, of which there seems to be many.

Best Regards,
Bill Campbell

 Best Regards,
 Bill Campbell

 Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] i.MX51: FEC: Cache coherency problem?

2011-07-19 Thread J. William Campbell
On 7/19/2011 2:05 AM, Albert ARIBAUD wrote:
 Le 19/07/2011 10:43, Aneesh V a écrit :

 You would have to flush (before sending packets / starting external
 memory-to-device DMA) and invalidate (before reading received packets /
 after external device-to-memory DMA is done); using MMU and mapping
 cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
 the xmit/receive buffers and descriptors.
 So, you say actually what I did while exploring the problem would have
 been a
 correct way of solving this problem?

 Like this:

 587 flush_cache(fec-tbd_base[fec-tbd_index], 4);
 This is what is needed assuming the below is initiating a memory to
 peripheral DMA. Is your buffer only 4 bytes long?
 Generally:

 - for sending data through a device that has its own, external, DMA
 engine, you'll obviously need to flush the data buffer(s) but also any
 DMA descriptors used by the engine, before you start the engine;

 - for rceiving, if you have to set up receive descriptors, you must
 flush that before telling the device to enter receive mode (so that the
 device reads the descriptors as you wrote them), and you should
 invalidate the receive buffers at the latest when the device signals
 that data has been received,
Hi All,

   or preferably long before (at the same time
 you flushed the read descriptor, so that cache-related actions are
 grouped in the same place in the code).
I think this last statement is incorrect. You should invalidate the 
cache for the receive buffers just before you intend to reference them. 
If you do it right after receive mode is entered, subsequent access to 
items NEAR the receive buffer may reload the cache with receive buffer 
data before the dma is done, re-creating the problem you are trying to 
avoid. Also, I don't know if ARM cache is write-back or write -thru, but 
if it is write-back, the only way to avoid problems is to allocate the 
receive buffers on cache line boundaries, so no nearby writes can 
cause something in the DMA buffer to be corrupted. If all receive 
buffers are allocated on cache-line boundaries (both start and end of 
each buffer), you can invalidate the cache early under the assumption 
that there will be no read accesses to the read buffers themselves until 
after DMA is complete. IMHO it is better, even in this case., to 
invalidate cache after dma is done but before referencing the read data.

Best Regards,
Bill Campbell


 Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] i.MX51: FEC: Cache coherency problem?

2011-07-19 Thread J. William Campbell
On 7/19/2011 11:14 AM, Anton Staaf wrote:
 On Tue, Jul 19, 2011 at 7:36 AM, J. William Campbell
 jwilliamcampb...@comcast.net  wrote:
 On 7/19/2011 2:05 AM, Albert ARIBAUD wrote:
 Le 19/07/2011 10:43, Aneesh V a écrit :

 You would have to flush (before sending packets / starting external
 memory-to-device DMA) and invalidate (before reading received packets /
 after external device-to-memory DMA is done); using MMU and mapping
 cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
 the xmit/receive buffers and descriptors.
 So, you say actually what I did while exploring the problem would have
 been a
 correct way of solving this problem?

 Like this:

 587 flush_cache(fec-tbd_base[fec-tbd_index], 4);
 This is what is needed assuming the below is initiating a memory to
 peripheral DMA. Is your buffer only 4 bytes long?
 Generally:

 - for sending data through a device that has its own, external, DMA
 engine, you'll obviously need to flush the data buffer(s) but also any
 DMA descriptors used by the engine, before you start the engine;

 - for rceiving, if you have to set up receive descriptors, you must
 flush that before telling the device to enter receive mode (so that the
 device reads the descriptors as you wrote them), and you should
 invalidate the receive buffers at the latest when the device signals
 that data has been received,
 Hi All,

or preferably long before (at the same time
 you flushed the read descriptor, so that cache-related actions are
 grouped in the same place in the code).
 I think this last statement is incorrect. You should invalidate the
 cache for the receive buffers just before you intend to reference them.
 If you do it right after receive mode is entered, subsequent access to
 items NEAR the receive buffer may reload the cache with receive buffer
 data before the dma is done, re-creating the problem you are trying to
 avoid. Also, I don't know if ARM cache is write-back or write -thru, but
 if it is write-back, the only way to avoid problems is to allocate the
 receive buffers on cache line boundaries, so no nearby writes can
 cause something in the DMA buffer to be corrupted. If all receive
 buffers are allocated on cache-line boundaries (both start and end of
 each buffer), you can invalidate the cache early under the assumption
 that there will be no read accesses to the read buffers themselves until
 after DMA is complete. IMHO it is better, even in this case., to
 invalidate cache after dma is done but before referencing the read data.
 This is a critical observation, and one I was going to make if I had
 made it to the end of the thread and no one had already pointed it
 out.  In fact, there is no way with the current implementation (that I
 can see) of the v7_dcache_inval_range function to correctly implement
 a cache enabled DMA driven driver without aligning buffers to cache
 line sizes.  Below is the commit message from a recent patch I made
 locally to fix the Tegra MMC driver.  I wanted to start a discussion
 on the list about forcing all buffers to be aligned to cache line
 sizes.  The problem is that many buffers are stack allocated or part
 of structs that are of unknown alignment.

  mmc: Tegra2: Enable dcache support by bouncing unaligned requests.

  Dcache support was disabled due to dcache/dma interactions with
  unaligned buffers.  When an unaligned buffer is used for DMA the
  first and last few bytes of the buffer will be clobbered by the
  dcache invalidate call that is required to make the contents of
  the buffer visible to the CPU post DMA.  The reason that these
  bytes are clobbered is that the dcache invalidate code (which is
  the same as the linux kernels) checks for unaligned invalidates
  and first flushes the cache lines that are being partially
  invalidated.  This is required because the cache lines may be
  shared with other variables, and may be dirty.
Hi Anton,
This flush however
  writes over the first and last few bytes of the unaligned buffer
  with whatever happened to be in the cache.
If this is true, then it means that the cache is of type write-back (as 
opposed to write-thru). From a (very brief) look at the arm7 manuals, it 
appears that both types of cache may be present in the cpu. Do you know 
how this operates? If all that we have is write-back, you are correct. 
There is no safe way to do DMA to non-cache aligned buffers. This means 
both ends of the buffer must be cache aligned, so you may have to 
allocate it bigger than the transfer requires. For unfamiliar readers, 
write-back cache has only a single dirty bit for each cache line. When a 
dirty line is flushed, the cpu doesn't know which data changed, so it 
writes all locations back into memory, thus write-back. Write-thru cache 
updates cache on a write, and also schedules a write to memory 
immediately of the new data.  That way, no writes are actually necessary 
to flush the cache

Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-15 Thread J. William Campbell
On 7/15/2011 12:17 AM, Wolfgang Denk wrote:
 Dear J. William Campbell,

 In message4e1f8127.8030...@comcast.net  you wrote:
 I am pretty sure that is long enough for someone to notice. . I would be
 interested in seeing an example of such code as you refer to. Could you
 point me to one, because it seems to me that the only reason to code
 such a delay is that for some reason the user didn't want to keep
 looking at the termination condition so often. I think that that
 arch/powerpc/cpu/mpc512x/i2c.c:

   80 while (timeout--  (status  I2C_BB)) {
   ...
   88 udelay (1000);
   89 status = mpc_reg_in (regs-msr);
   90 }

 103 while (timeout--  !(*status  I2C_IF)) {
 104 udelay (1000);
 105 *status = mpc_reg_in (regs-msr);
 106 }
 arch/powerpc/cpu/mpc512x/pci.c:

   87 /* We need to wait at least a 1sec based on PCI specs */
   88 for (i = 0; i  1000; i++)
   89 udelay(1000);

 arch/powerpc/cpu/mpc5xx/spi.c:

 258 for (i = 0; i  1000; i++) {
 ...
 265 udelay(1000);
 266 }

 390 for (tm=0; tm1000; ++tm) {
 ...
 394 udelay (1000);
 395 }

 arch/powerpc/cpu/mpc5xxx/i2c.c:

 102 while (timeout--  (status  I2C_BB)) {
 ...
 111 udelay(15);
 112 status = mpc_reg_in(regs-msr);
 113 }

 125 while (timeout--  !(*status  I2C_IF)) {
 126 udelay(15);
 127 *status = mpc_reg_in(regs-msr);
 128 }

 And just that you don't think this is in a single CPU only:

 arch/powerpc/cpu/mpc8260/pci.c:

 343 for (i = 0; i  1000; ++i)
 ...
 345 udelay (1000);

 Or in board code:

 board/altera/common/cfide.c:

   29 /* wait 500 ms for power to stabilize */
   30 for (i = 0; i  500; i++)
   31 udelay(1000);

 board/amcc/bamboo/bamboo.c:

 1019 for (i=0; i500; i++)
 1020 udelay(1000);

 or common code:

 common/cmd_fdc.c:

 213 while((read_fdc_reg(FDC_SRA)0x80)==0) {
 214 timeout--;
 215 udelay(10);
 216 if(timeout==0) /* timeout occured */
 217 return FALSE;
 218 }

 228 while((read_fdc_reg(FDC_MSR)0xC0)!=0xC0) {
 229 /* direction out and ready */
 230 udelay(10);
 231 timeout--;
 232 if(timeout==0) /* timeout occured */
 233 return -1;
 234 }
 etc.etc.

 375 for(val=0;val255;val++)
 376 udelay(500); /* wait some time to start motor */

 418 for(i=0;i100;i++)
 419 udelay(500); /* wait 500usec for fifo overrun */

 600 for(i=0; i255; i++) /* then we wait some time */
 601 udelay(500);


 common/usb.c:

93 inline void wait_ms(unsigned long ms)
94 {
95 while (ms--  0)
96 udelay(1000);
97 }

 etc. etc.  Note this last example which spreads the effect in a not
 so nice way.

 Note that there are even places where udelay() is part of the protocol
 timing implementation - like in the soft-I2C and soft-SPI drivers.

Hi All,
   Thanks for the pointers Wolfgang. From your examples, it seems we 
are talking exclusively about udelay. As I said in my previous post, on 
NIOS2, udelay needs to be mechanized on NIOS2 by a timed loop, not by 
using the same mechanism as the timer, as there is no way one can 
approximate microsecond resolution by the hardware timer. It is my 
opinion that there are many CPUs on which the udelay implementation must 
not be/should not be based on the same timer code as the get_time() 
operation, because in many cases the get_time() operation requires 
interrupts to be functional. As you have also pointed out, udelay needs 
to be available early, before interrupts are available. Therefore, the 
udelay and get_timer must use the hardware  at least somewhat 
differently. The must also not interfere with each other, which on some 
existing boards they did.
   If the I2C protocol must be available before interrupts are 
available, then udelay must be used. In the above examples, there are 
some loops in i2c and spi that appear to be waiting a full second. I 
assume they are using udelay because the get_timer feature is not yet 
available to them. I also assume that the example in common/usb.c uses 
udelay to wait for millisecond sized values for the same reason, i.e. 
that it may run before get_time() is available. However, if you examine 
the timer code on of the existing CPUs, you will find that udelay is NOT 
available early ( before interrupts are enabled) on quite a few of them. 
Therefore, none of the above code would work on these CPUs at present 
either. These CPUs must not have I2C or spi busses on them

Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-15 Thread J. William Campbell
On 7/15/2011 11:34 AM, Wolfgang Denk wrote:
 Dear J. William Campbell,

 In message4e208227.6010...@comcast.net  you wrote:
 If the I2C protocol must be available before interrupts are
 available, then udelay must be used. In the above examples, there are
 some loops in i2c and spi that appear to be waiting a full second. I
 assume they are using udelay because the get_timer feature is not yet
 available to them. I also assume that the example in common/usb.c uses
 No, this is usually not the case.  This long delay is the error case,
 which most probably will never happen.  For the normal case, you want
 a tight spinning loop that introduces as little additional delay as
 possible.
Hi All,
 I understand why one would want to use udelay in those cases 
where there are other conditions that should cause the loop to exit. 
However, in the 16 examples you cited,  7 of those usages are for delays 
of multiple milliseconds with no tests or escape conditions included. I 
therefore pointed out that I assume that it was being done that way (as 
opposed to using  get_timer) because this code would be executed 
early, before get_timer was available.
 True, although I expect you will find the statement on all the other
 architectures to be false. Many other architectures, yes, all, no.
 These other architectures just don't have spi or I2C yet, or if they do,
 they don't use it early.
 Are you aware of another architecture that cannot provide
 sub-microsecond timer resolution?  Which is it?

  No, but I am aware of some that do not provide udelay at all until 
after interrupts are enabled and the get_timer interface is available. 
Several ARM implementations have this property. There are one or two 
others (SPARC) that are similar as I recall. There were also some that 
called reset_timer, which would mess up nested timing loops, but this 
has already been fixed. There also quite a few udelay implementations 
that do not actually provide 1 µs resolution.  For instance, the first 
thing that happens in arm720t/interrupts.c  __udelay is that the input 
microseconds is divided by 1000 and the remainder discarded. A 
udelay(10) won't work so well here. The lh7a40x/timer.c uses a 508 kHz 
clock, so delays are accurate to at best 2 µs. There are other examples. 
For instance, I am pretty sure some u-boot based CPUs use a 32kHz clock 
as the timebase for udelay. The resolution in that case is 30 µs, which 
would be a problem for your example code. I am sure further digging will 
reveal other resolution issues.

These udelay issues would need fixing if all u-boot options are to work 
out of the box. In some cases of low resolution, the fix is not easy, 
or maybe not even possible. However, this problem has not prevented the 
use of u-boot on these CPUs. It does mean that some things do not work 
out of the box, and these non-compliant udelays should be fixed 
wherever practical. In any case, this shouldn't stop progress on 
get_timer and the delay_at_least (or whatever it ends up being called) 
function being adopted. That is independent of the udelay issues. udelay 
must be implemented in a compliant manner, which may or may not use 
the same hardware as get_timer() does. It certainly can't use the same 
code (although some ARM version in effect do, resulting in a 
non-compliant, low resolution  udelay).

It might be interesting to utilize something like jiffies for the loop 
delays that are looking for error exits. In those cases, the error case 
should be infrequent. It may not matter that the total delay is quite a 
bit longer than the user requested, as long as it isn't vastly shorter. 
If a 10 µsec delay became a 30 µsec delay inside the loop examples where 
we expect to exit due to another cause, it probably wouldn't matter 
much. It would matter a lot though if the delay were setting a baud rate 
on a bit-banged output, so jiffies may not be worth it. I put it out 
there for consideration, as it is easier to produce than a udelay that 
has a guaranteed low absolute error.

Best Regards,
Bill Campbell

 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-14 Thread J. William Campbell
On 7/14/2011 12:41 PM, Wolfgang Denk wrote:
 Dear J. William Campbell,

 In message4e1cf2e0.1030...@comcast.net  you wrote:
 Yes, this is true. However, the time_elapsed_since routine can do
 this dynamically (i.e. add twice the timer resolution) . I think you had
 That would IMHO be a very bad idea.  We have a number of places where
 we have to deal with pretty long timeouts (usually because of protocol
 specifications that require this - often in the order of several
 seconds), where the normal path is very fast.  The typical approach is
 to break the timeout into a large number of very short loops.
 Sometimes we use udelay() for this, other places use get_timer().

 So assume we have a timeout of 5 seconds, and implement this as 50,000
 loops of 100 microseconds.  If you silently turn each of these into 20
 milliseconds on NIOS, the timeout would become 1,000 seconds instead
 of 5 - users would return boards as broken and report it just
 freezes because nobody expects that it will wake up again after some
 16 minutes.
Hi All,
   If such a condition existed, that is indeed what would 
happen. However, at present, such code is not being used on NIOS. We 
know this because the current work-around of resetting the timer at 
the start of any timeout operation extends the timeout to a minimum of 
10 milliseconds. So we would be waiting 8 minutes currently, not 16, and 
I am pretty sure that is long enough for someone to notice. . I would be 
interested in seeing an example of such code as you refer to. Could you 
point me to one, because it seems to me that the only reason to code 
such a delay is that for some reason the user didn't want to keep 
looking at the termination condition so often. I think that that 
equivalent  operation can be produced by a pretty simple re-coding of 
the loop. In any case, NIOS does not have the problem at present, so the 
suggested new work-around would be no worse than the present situation.
  It is also true that the hardware timer cannot be used in a 
reasonable version of udelay, as most of the desired delays may be very 
short relative to the timebase. A calibrated timing loop would be the 
best approach to the udelay problem.
 another function name (at_least) involved, but you can define
 time_elapsed_since as always compensating for the resolution. That will
 fix any resolution questions in a processor-specific way. It is either
 that or the ifdefs. One way or another, the resolution must be
 addressed. Up to now, the implicit resolution has been 1 ms, but we now
 know that is not general enough.
 It's not as simple as this.  You have to change a lot of code to make
 this work for such slow clock systems.
In general, that is true. There may be a few cases where a delay of less 
than the resolution is essential to make something work. There are 
probably lots of other cases where we can easily remove the restriction 
on the resolution. We cannot fix the first kind, no matter what we do, 
to work on a lower resolution timer. The second kind we can and probably 
should fix, because they are coded in an overly-restrictive manner. In 
any case, we don't absolutely have to fix them until somebody decides to 
use the code on a CPU with a low resolution timer. Practically speaking, 
the suggested solution will therefore work on all extant cases.

Best Regards,
Bill Campbell

 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-12 Thread J. William Campbell
On 7/12/2011 6:10 AM, Wolfgang Denk wrote:
 Dear Graeme Russ,

snip
 Do we?  What exactly is the needed resolution of the underlying
 hardware timer?  So far, it appears sufficient to have it ticking with
 1000 Hz or more.  Are there really systems that cannot provide that?
 The only architecture I remember that seemed prolematic was NIOS - but
 then the NIOS documentation suggests that this might actually be
Hi All,
   The NIOS2 systems are a FPGA based architecture (soft core). 
They CAN have all different kinds of timers. However, there are many 
already out in the world that were built (generated) with a timer that 
has 10 ms resolution. We can't retro-actively change these. We can 
declare them not supported in future releases, but we can't fix the 
resolution problem retroactively.
   I have two comments regarding this discussion so far. First, I 
think using the time function name at all is a VERY BAD idea. People 
will confuse it with the normal c library function that returns the 
time of day since the epoch. One may say that they should RTFM, but it 
is lots easier to just use another name and avoid all such confusion. 
Second, I think forcing all systems to use 64 bit time stamps is not a 
good idea. The main purpose of the timer API is to provide hardware 
timeouts. 64 bits is vast overkill for such a purpose. Mandating 64 bit 
capability for all u-boots is I think counter-productive. A 32 bit 
timestamp in millisecond resolution plus udelay covers all practical 
cases. The 1 ms resolution is probably inadequate for performance 
measurement, but for every u-boot system that is sometimes used for 
performance measurement, there are 100 (probably even more, like 1000, 
but I don't really have the figures) systems that are just used to boot 
Linux or another stand alone program. So we may need a different API for 
performance measurement that uses higher resolution. On most larger 
CPUs, there are hardware timers that can already do this with minimal 
work. On some less-capable CPUs, it may be lots harder, but if you 
aren't doing performance measurement, you don't need it. So I suggest 
that mandating a 64 bit API for some of the processors (68K comes to 
mind) is not really a good idea. It seems counter to the elegant  
simplicity of the rest of u-boot to stick a 64 bit requirement on 
timestamps used to determine if the boot delay is expired.

Best Regards,
J. William Campbell

   Best regards,


 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-12 Thread J. William Campbell
On 7/12/2011 8:23 AM, Scott McNutt wrote:
 Dear Wolfgang

 Wolfgang Denk wrote:

 What exactly is the reason that we cannot have better timer
 resolutions in NIOS?
 You _can_ have better timer resolutions in Nios. However, there
 are legacy systems that implement timer(s) with a fixed period
 of 10 msec. The use of such implementations is very common.
 How do I know? Because my customers have such implementations.
 Why not upgrade? Because most sane engineers are loathe to
 change their rtl just so they can fix a simple software bug
 or add a simple custom feature.

 My obvious preference is to continue to use the latest u-boot in
 such systems ... without having to create a custom branch for
 customers with older Nios implementations, where I use the old
 timer interfaces ... simply because we improved the API.

 Scott, maybe you can comment here?
 I have only one comment: Out of pure frustration, I have been
 avoiding any discussions regarding this timer re-write effort.

 At the moment I'm trying to understand if we really have a problem on
 NIOS2 that cannot be fixed in a way that is compatible with our
 current plans.
 This is where my frustration begins. I've said this several times
 before, and I don't know how else to say this ... but I'll give it
 one more try:

   This is not a Nios problem.

 If I can't use a 10 msec timer interrupt to detect a simple timeout
 condition when I'm programming a flash memory, then the timer API
 design is garbage.
Hi All,
   I agree with the idea that mandating a particular timer 
resolution is an un-desired limitation. You may need it for some cases, 
but not for all. Proof of this concept is the number of legacy UNIX 
systems running on 10 ms timer ticks, or for that matter the number of 
systems running on line cycle interrupts (16.666 ms in the US). It is 
also however true that one cannot ignore the timer resolution. That is 
what produced the problem in the first place. Asking for a delay that is 
less than the resolution isn't going to work as expected. This problem 
can be easily fixed by always adding twice the timer resolution to any 
requested delay, to ensure you waited at least two ticks , or by having 
a routine that does this for you (like the suggested API does). Right 
now, there are many places in the u-boot code that wait a certain small 
number of milliseconds under the assumption that the timer resolution is 
either 1) not an issue at all, or 2) is 1 ms. Both cases are wrong, the 
first case more so than the second. I wonder how many delays in the 
program that say wait  5 ms are aware that you could exit the loop 
after 4ms plus epsilon every so often, even with 1 ms resolution? 
Perhaps the tolerance is built into the number 5, perhaps not. However, 
it is surely not if the timer resolution is 16.666 ms.
   And quite frankly, the enormous amount of
 discussion in this matter reminds me of an undergraduate science
 project.
Or any government project. But when you will effect lots of people, you 
often must kill a few trees to make sure nobody (such as yourself and 
other NIOS users) somehow gets a bad deal.

I am now returning to real work and will be offline for a while. Good 
Luck All.

Best Regards,
Bill Campbell
 We also have to deal with decrementing counters, but this is just aan
 unimportant detail.  And it appears that we actually can have this,
 even on NIOS.
 Wow! Even on Nios! Arrggghh!

 The only architecture I remember that seemed prolematic was NIOS
 Really? There have been occasional issues that have required a patch
 here and there. But I consider most of them far from problematic.

 This is exhausting. As I recall, this entire fuss was born out
 of an issue with nested calls to get_timer() ... and that begat
 removing reset_timer ... and that begat ... and so on. And we're
 still spilling lots of intellectual seed here. And now we have
 what? A jack-of-all-trades API that can do everything with
 exacting precision ... other than handling a 10 msec interrupt?

 Best Regards,
 --Scott

 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-12 Thread J. William Campbell
On 7/12/2011 5:33 PM, Graeme Russ wrote:
 Hi Reinhard,

 On 13/07/11 02:08, Reinhard Meyer wrote:
 Dear J. William Campbell, All
 [snip]

 Lets just keep the current functions udelay(us) and u32 get_timer(), the
 latter maybe without parameter. Remove all *masked() and *reset() functions
 This is happening and has Wolfgang's 100% support. Patches are done, just
 need to be rebased and resent

 and lets change the timeout values used in code that also NIOS uses to
 be 20ms more than the hardware requires. It does not really matter if a
 *timeout* of 20ms is blown up to 40ms.
 Can't do that - That would mean winding out common code (CFI, Network,
 Serial, FPGA etc, etc) timeouts to global worst-case scenario unless you
 want to add a #define and start littering the code with macro's - Urgh!
Hi All,
   Yes, this is true. However, the time_elapsed_since routine can do 
this dynamically (i.e. add twice the timer resolution) . I think you had 
another function name (at_least) involved, but you can define 
time_elapsed_since as always compensating for the resolution. That will 
fix any resolution questions in a processor-specific way. It is either 
that or the ifdefs. One way or another, the resolution must be 
addressed. Up to now, the implicit resolution has been 1 ms, but we now 
know that is not general enough.

Best Regards,
Bill Campbell


 Regards,

 Graeme
 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-28 Thread J. William Campbell
On 5/27/2011 10:53 PM, Graeme Russ wrote:
 Hi Bill,

 On 28/05/11 00:23, J. William Campbell wrote:
 On 5/27/2011 12:35 AM, Graeme Russ wrote:
 Hi Wolfgang,

 On 27/05/11 17:13, Wolfgang Denk wrote:
 Dear Graeme Russ,

 In messagebanlktinwvy9b4qzelnawf7mkt9z1zem...@mail.gmail.com   you wrote:
 I think we will need to define get_timer() weak - Nios will have to
 override the default implementation to cater for it's (Nios') limitations
 Please don't  - isn't the purpose of this whole discussion to use
 common code for this ?

 Yes, but Nios is particularly bad - It has a 10ms tick counter :(
 Hi All,
   And a hardware timer that you can't read to subdivide the 10 ms. Note
 that this is not necessarily a problem with all NIOS implementations. The
 timer characteristics can be controlled when you generate the bitstream for
 the FPGA. You can make the counter both faster and readable if you want. It
 just uses a bit more silicon. Sad to say, it probably will require  per
 board get_ticks routine. For the old nios2 timers however, overriding
 get_timer with a /board routine is probably the only way to go.

 Actually, using the logic you presented in the Remove calls to
 [get,reset]_timer outside arch/ thread, we could have a common get_timer()
 implementation which deals with all timer resolution issues:

 #if !defined(CONFIG_MIN_TIMER_RESOLUTION)
 #define CONFIG_MIN_TIMER_RESOLUTION 1
 #endif

 u32 get_timer(u32 base)
 {
   if (base != 0) {
   if (timer - base  (CONFIG_MIN_TIMER_RESOLUTION * 2))
   return 0;
   else
   return timer - base - CONFIG_MIN_TIMER_RESOLUTION;
   } else {
   return timer;
   }
 }

 This would also guarantee that timeout loops run for _at least_ the timeout
 period

Hi Graeme,
   You are almost correct. The problem is that if you are calling 
get_timer early on in start-up, the timer can actually be 0. The loop 
initializing get_timer(0) will then return 0. If you can initialize the 
timer to a number !=0, then the above code works. If the timer is 
totally hardware derived, such an initialization may not be easy. For 
nios2, it is easy since the counter is software interrupt driven.  It 
can just be initialized to 1. The 10 ms updates will then run like 
normal. It also won't work if the user is doing his own timer arithmetic 
by subtracting get_timer(0) values, but there is nothing we can do about 
that case anyway so that case is moot. Good job of spotting this approach.

Best Regards,
Bill Campbell
 Regards,

 Graeme



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/26/2011 9:33 PM, Graeme Russ wrote:
 Hi Bill,

snip
 get_ticks() does not care about the clock rate - It simply looks at the
 current value of the hardware tick counter and the value of the hardware
 tick counter the last time get_ticks() was called, calculates the difference
 and adds that to the 64-bit software tick-counter
 I don't see how it being a down counter makes that any more difficult

 This is neither simple nor maintainable. Further, it is un-necessary, as the
 sync_timer routine is just going to convert the number from whatever radix
 we converted it to into millisec.  If we leave the two numbers as split, all
 that complexity is removed from get_ticks and sent upward to the common
 routine that converts the answer into ms anyway. This makes the system more
 maintainable by placing the minimum requirement on get_ticks. The tick
 should be opaque to anybody but sync_timebase anyway.
 But how is the common routine going to know if it is a split timer, up
 timer, down timer, little endian, big endian, etc, etc.

 get_ticks() abstracts the hardware implementation of the hardware timer
 from sync_timer()
Hi All,
 I understand your point. I prefer a higher level of abstraction. 
You are correct that there are some aspects of the tick counter that are 
very hardware quirky, and these attributes are hard to abstract. If the 
timer is embedded into a bit field with other variables, it is 
reasonable to expect get_ticks to extract the bit field and right 
justify the number. If there are endian issues, the get_ticks routine 
must return the number in the natural endianness of the platform. 
However, after that point, the values are extremely regular. The fact 
that a counter is a down counter can be expressed in a data structure as 
a boolean. The high limit of the hardware counter is a number. The 
number of ticks per millsecond is obtainable from usec2ticks(1000), or 
1 if we want to avoid some roundoff. From these values, sync_timer 
can take the two part ticks value and convert it to millisec. Trust me 
on this. I have the routines to do it. This puts as much of the 
abstraction of the two numbers into ONE COMMON ROUTINE, sync_timer. Now 
it is clearly possible to move some of the abstraction down a level into 
sync_timer. For instance you could move inverting the counter down to 
that level, and then multiply the msb by the maximum value of the lsb 
counter and add in the msb. It is clearly possible to move ALL of 
sync_timer down into get_ticks, if one wanted to. It is clearly possible 
to replace general values in gd- with platform specific constant 
values. However, if you do that, you end up with a lot of duplicate, or 
almost duplicate, code running around. That has proven to be error 
prone, and it has left new ports of u-boot to sort of fend for 
themselves in figuring out how things should work. I prefer to abstract 
that all up into sync_timer. That way, all the math is in one place, and 
is table driven so it is easy to change.
 The top 32 bits are the rollover count and the bottom 32 bits are the
 current counter. If the counter is a full 32 bits, so much the better.
 Ah - Lets keep it that way

 Again, one could put them together inside the interrupt routine , but it
 is
 easier to check for a changed value if you don't do this. Otherwise, you
 have to check both words. It also makes the isr faster. It is just an
 As I said before - Simple First, Fast Later
 I am in favor of simple. That is why I want get_ticks to do as little as
 possible. It should just read the hardware register and the overflow counter
 if it is separate. Make sure the overflow didn't change while we were
 reading. This is redundant if we are not using interrupts but we can leave
 the code in. It just won't do anything.  We can also move the rollover
 detection to sync_timebase. It will be redundant if we are using interrupts,
 because time will never back up. But we can do it this way. This
 centralizes the overflow detection, which is a good thing.
 That does not sound simple to me. This, however, does:

 u64 get_ticks(void)
 {
   static u64 last_hw_tick_count;
   static u64 last_sw_tick_count;

   /* Now for the platform specific stuff - Read hardware tick counter */
   u64 current_hw_tick_count = /* Read hw registers etc */

   /* Deal with hardware weirdness - errata, stupid hw engineers etc */

   u64 elapsed_ticks = current_hw_tick_count - last_hw_tick_count;
   last_sw_tick_count += elapsed_ticks;

   return last_sw_tick_count;
 }

 The '/* Read hw registers etc */' bit will always be the same, no matter
 what way you do it
Agree.
 The '/* Deal with hardware weirdness - errata, stupid hw engineers etc */'
 bit is where we are truly abstracting the hardware away - This is the
 bit you propose to leave mangled and deal with in sync_time?
Not totally. The get_ticks routine must mask off any extra bits and 
right justify the hardware counter. If the counter is  

Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 12:28 AM, Wolfgang Denk wrote:
 Dear J. William Campbell,

 In message4ddefdbc.7050...@comcast.net  you wrote:
 I really STRONGLY disagree with this statement. If you actually needed
 64 bit variables, fine use them. But as I have already shown, you do not
 need them in general.  We are computing a 32 bit result. There is some
 entropy argument that says you shouldn't need 64 bits to do so. Another
 way to look at it is that converting the top 32 bit word and the bottom
 32 bit word to ms separately can be easier than doing them both together
 at once.  However, as we will see below, I do agree we need two 32 bit
 words to make this process go smoothly. I just don't agree that they
 should/will constitute a 64 bit binary word. See below.
 And I disagree with this.

Hi Wolfgang,
 OK, I hear you.
 Yes, that is the problem. I have come to the view that  two 32 bit words
 are the best approach. Note that the lsb may actually not fill the full
 32 bits. The top 32 bits are the rollover count and the bottom 32 bits
 are the current counter. If the counter is a full 32 bits, so much the
 better. Again, one could put them together inside the interrupt routine
 , but it is easier to check for a changed value if you don't do this.
 It's even easier if you use a single 64 bit variable, because then you
 can simply use ==.


 In general, no you can't, or at least you probably don't want to. . 
If you are reading a 64 bit performance counter, it is quite likely that 
you cannot read it twice without the clock having ticked. If the CPU 
executes 1 instruction (or fewer(if an SPR/memory reference  is 
involved?) per performance counter tick, which is the goal of the 
performance counter, == is an infinite loop A similar condition 
exists if you are combining a software counter with a fairly fast 
hardware counter. It might require flipping the hardware counter (if it 
is a down counter) and a 64 bit multiply add, which must be done in 
software/a subroutine if the cpu has no 64 by 64 multiply. By the time 
that is done, the timer LSB may have ticked. Consider the m68K case.
 Otherwise, you have to check both words. It also makes the isr faster.
 Drop any thoughts about make FOO faster for now, please.  Especially
 at this stage it is much more important to have a simple and clean
 design.  If split in two variables, even a simple read access will
 turn into code like

   do {
   upper  = timebase_upper;
   lower  = timebase_lower;
   } while (upper != timebase_upper);

 This is not exactly as simple as you claimed.

 True, but if you look at a lot of 64 bit performance counters, that 
is EXACTLY what the handbook book recommends on how to read them. There 
is no atomic way to read them both at once, and reading one half doesn't 
freeze the other half. This code is also required if timebase_upper is 
altered in the interrupt routine.  YMMV, but in a lot, dare I say most, 
cases this is required anyway.  And while the code is more complex than 
a simple assignment statement, it is not very complex.

Best Regards,
Bill Campbell

 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 12:33 AM, Wolfgang Denk wrote:
 Dear J. William Campbell,

 In message4ddf2072.5090...@comcast.net  you wrote:
 ...
 The problem is that the way we previously detected wrapping does not
 work if the interrupt rate is == to the counter wrap time, which it
 essentially always is. If get_ticks is trying to update the wrap count
 You ignore the fact that this is only ever a problem when the rollover
 cannot signal through an interrupt or similar.  Also, some processors
 allow daisy-chaning of timers, etc.

 Again, I would really like to know about how many exotic systems we
 are talking that fulfil your worst-case expectations.  I bet the
 overwhelming majority behaves absolutely harmless.
Hi Wolfgang,
 I think that in fact the opposite is true. The problems occur if 
both the main program and the interrupt routine are trying to update the 
timer msb using the same code, as we were originally talking about. 
There is no problem if only the interrupt routine detects the rollover. 
That is the correct way to go if your interrupts work. There was nothing 
particularly exotic required. It was the normal case. Take a look at 
what would happen on the PPC is the main program was reading the 
decrementer, detecting wraps and increasing the timestamp while the 
interrupt routine was also incrementing the timestamp. Every so often 
you get a double increment. Why were we doing this? Because I was trying 
to re-use exactly the same code in the interrupt case and the 
non-interrupt case. Not a good idea, in fact a bad idea as it turns out.

Best Regards,
Bill Campbell
 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 12:35 AM, Graeme Russ wrote:
 Hi Wolfgang,

 On 27/05/11 17:13, Wolfgang Denk wrote:
 Dear Graeme Russ,

 In messagebanlktinwvy9b4qzelnawf7mkt9z1zem...@mail.gmail.com  you wrote:
 I think we will need to define get_timer() weak - Nios will have to
 override the default implementation to cater for it's (Nios') limitations
 Please don't  - isn't the purpose of this whole discussion to use
 common code for this ?

 Yes, but Nios is particularly bad - It has a 10ms tick counter :(
Hi All,
  And a hardware timer that you can't read to subdivide the 10 ms. 
Note that this is not necessarily a problem with all NIOS 
implementations. The timer characteristics can be controlled when you 
generate the bitstream for the FPGA. You can make the counter both 
faster and readable if you want. It just uses a bit more silicon. Sad to 
say, it probably will require  per board get_ticks routine. For the 
old nios2 timers however, overriding get_timer with a /board routine 
is probably the only way to go.

l
 I don't see reason for hamstring other platforms when a simply weak
 function can get around it
Agree.
Best Regards,
Bill Campbel
 Regards,

 Graeme



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 6:07 AM, Scott McNutt wrote:
 Graeme Russ wrote:
 Hi Wolfgang

 On Friday, May 27, 2011, Wolfgang Denk w...@denx.de wrote:
 Dear Graeme Russ,

 In message banlktik2sum4sm8aljcrcmz+kcmgwge...@mail.gmail.com you 
 wrote:
 Besides, Nios can return an increment of 10 (presumably ms) between
 two immediately consecutive calls. This causes early timeouts in CFI
 driver
 Now this in turn is a bug in the timer implementation that needs to be
 fixed.

 And this is what reset_timer() corrected.

 Agreed, but that is not something I can achieve - I don't want to hold
 up this whole show that we have all put so much effort into for the
 sake of one weak function

 And I don't want to see something that currently works become broken
 because we improved a feature ... simply because the resolution of
 the timestamp is 10 msec rather than 1 msec.

 And just to be clear. This is not a Nios issue. Currently, if the
 timestamp is incremented via a fixed period interrupt, and the period
 of the interrupt is longer that 1 msec, calls to get_timer() may
 produce early timeouts ... regardless of platform.
Hi All,
 A more precise statement of the problem is that all timer delays 
may be shortened by the timer resolution. So this means that if you have 
a timeout of 1 ms in your get_time(0) {   } while ( ...  1), then your 
actual delay may be anywhere between 0 and 1 ms. The problem arises when 
some piece of common code uses a delay of say 8 millisec, expecting the 
actual delay to be between 7 and 8. If the resolution is 10 ms, the 
delay will be between 0 and 10 ms, 0 being particularly bad. This can be 
fixed in get_timer, making the 8 ms delay  become a minimum of 10 ms at 
the expense of it becoming up to 20 ms sometimes. Since these delays are 
used mostly for error conditions, making them longer will probably be 
ok, and doesn't require changing any of the common code. It probably 
will not make things slower either, because the error timeouts should 
not be reached. The reset of the hardware timer would cause all short 
delays to become 10 ms. This reset approach is bad in that it prevents 
proper nesting of timing loops. However, in this case it isn't so bad, 
in that the nested loops are just extended, not shortened. Note that if 
the reset is only resetting the HARDWARE interrupt generator, not the 
actual timestamp itself, we are just extending all existing timeouts by 
0 to 10 ms.. So this just lengthens all pending timeouts. The other fix 
is in my opinion nicer, because it affects the nest loops less. If the 
inner loop is executed 100 times, with the reset, the outer loop timeout 
is extended by up to 1000 ms.

Best Regards,
Bill Campbell

 --Scott




___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/26/2011 11:54 PM, Graeme Russ wrote:
 On Fri, May 27, 2011 at 4:33 PM, J. William Campbell
 jwilliamcampb...@comcast.net  wrote:
 On 5/26/2011 9:33 PM, Graeme Russ wrote:
 Hi Bill,

 snip
 [massive snip]

 OK, you have my ears pricked - Can you give me code samples for:

   - get_ticks()
   - sync_timbase() (no need to implement the whole lot if that is too
 much effort right now)
   - timer_isr()

 that handle the following hardware tick counter scenarios:

 a) 64-bit up counter
 b) 64-bit down counter
Hi Graeme,
 c) 32-bit up counter, wraps at 65000
Do you mean 32 bits or 16 bits? doesn't make much difference, but just 
checking.
 d) 16-bit microsecond up counter 0-999 which wraps and triggers a 16-bit
 millisecond up counter. Reading milliseconds latched microseconds and
 clears milliseconds (look in arch/x86/cpu/sc520/timer.c for example)
 e) 24-bit counter occupying bits 2-25 of a 32-bit word (just to be
 difficult)
 f) Any other option anyone cares to throw ;)

 All of these options must be covered using:
   - Minimal global data (we would like it to work before relocation, but
 not mandatory - GD footprint would be nice)
   - All use the same sync_timebase function
   - Demonstrate using an ISR NOT synced to the hardware tick counter and
 an ISR that is
   - Parameters to get_ticks() and sync_timer() are permitted, but not for
 timer_isr() (naturally)


 OK! Once again, I accept the challenge. One Caveat. My real work 
has been sliding due to the time I have been spending on this. I am 
flying from San Francisco to Sydney tonight , (to work, not to play),  
so I will be off-grid for 14 hours+. You will not get this code for a 
few days, like probably 3 days. I have looked at the requirements, and I 
see no real problems that I don't know how to solve.
 I don't' see any reason to push this down to a lower level. It is just one
 more thing to get messed up across implementations.
 Agreed

 detection in the non-interrupt case to sync_timebase as well.
 Sync_timebase
 can also invert the down-counting counters, removing that from get_ticks.
 The wrap detection code can be #ifdef out if one is using interrupts and
 Urghh - Why are you adding unnecessary ugliness - #ifdef in the middle of
 code is usually a sign you are doing something wrong
 As I said, this is an optional optimization. I do not agree that an #ifdef
 in the middle of code indicates you have a bad design. Lots and Lots of
 ifdefs certainly indicates a bad design. An ifdef to eliminate code if some
 option is not selected is hardly such a strange thing, especially only a
 single #ifdef. However, feel free to not have it if you so desire.
 OK, I'll let this slide for the moment - please include in above example
Will Do.
 offended by it's presence. Thanks for pointing this out and compelling me
 to
 reduce the number of cases! Making get_ticks more lightweight is a good
 idea
 in my opinion.
 Lets say you have a platform with a 32-bit tick counter running at a
 reasonably long rollover time so you decide not to use interrupts. Then
 you create a new platform with the same tick counter, but it runs much
 faster and you realise you need to implement the interrupt routine to
 make get_timer() work for long enough periods - Fine, you add an ISR
 for the new platform that calls sync_timebase - No other changes are
 required.

 The last thing we want is for the 64-bit tick counter to be conceptually
 different across platforms

 I just realised - the ISR _does not need to call the sync_timebase at
 all_
 The ISR only needs to call get_ticks(), so it will be fast anyway
 The problem is that the way we previously detected wrapping does not work
 if
 the interrupt rate is == to the counter wrap time, which it essentially
 always is. If get_ticks is trying to update the wrap count when an
 interrupt
 Is it, always, on every platform?
 Yes, pretty much. You get a terminal count/counter underflow interrupt and
 that is it.
 Not on sc520 - The micro/millisecond counter cannot be used to driver an
 interrupt - you need to setup a seperate timer. I think the x86 internal
 performance counters are the same
What is true, as you have stated, is that the micro/millisecond counter 
on the sc520 does not interrupt at all. Nor do the x86 performance 
counters. The x86 performance counters are a non-problem because they 
are 64 bits long. We don't need interrupts for them. Now, if you choose 
to use the sc520 micro/millisecond counter, then you need another source 
of interrupts. Due to the fact that reading the sc520 counter resets it, 
we must accumulate the elapsed time in software. That means the 
interrupt routine must do a bit more work, but it also allows reading 
the counters in non-interrupt code (with interrupts disabled) to not 
mess up the accumulated count. We don't detect rollover in the sc520 
counters, we just read and accumulate the value. So no problem there.
 comes in, it will do it wrong. If the interrupt

Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 8:44 AM, Scott McNutt wrote:
 J. William Campbell wrote:
 On 5/27/2011 6:07 AM, Scott McNutt wrote:
 Graeme Russ wrote:
 Hi Wolfgang

 On Friday, May 27, 2011, Wolfgang Denk w...@denx.de wrote:
 Dear Graeme Russ,

 In message banlktik2sum4sm8aljcrcmz+kcmgwge...@mail.gmail.com 
 you wrote:
 Besides, Nios can return an increment of 10 (presumably ms) between
 two immediately consecutive calls. This causes early timeouts in CFI
 driver
 Now this in turn is a bug in the timer implementation that needs 
 to be
 fixed.

 And this is what reset_timer() corrected.

 Agreed, but that is not something I can achieve - I don't want to hold
 up this whole show that we have all put so much effort into for the
 sake of one weak function

 And I don't want to see something that currently works become broken
 because we improved a feature ... simply because the resolution of
 the timestamp is 10 msec rather than 1 msec.

 And just to be clear. This is not a Nios issue. Currently, if the
 timestamp is incremented via a fixed period interrupt, and the period
 of the interrupt is longer that 1 msec, calls to get_timer() may
 produce early timeouts ... regardless of platform.
 snip
 This can be fixed in get_timer, making the 8 ms delay  become a 
 minimum of 10 ms at the expense of it becoming up to 20 ms sometimes.

 Ok. Now I get it. Thanks.

 snip
 This reset approach is bad in that it prevents proper nesting of 
 timing loops.

 In my particular case, because reset_timer() resets the timestamp
 to zero rather than simply restarting the timer. I believe leaving
 the timestamp alone would solve the nesting problem.

 snip
 The other fix is in my opinion nicer, because it affects the nest 
 loops less. If the inner loop is executed 100 times, with the reset, 
 the outer loop timeout is extended by up to 1000 ms.

 Bill, thank you for explaining -- probably for the nth time -- but it
 did finally sink in.
Hi Scott,
  Glad to help, I finally think I understand it myself in looking 
into it further! I think we have a good way ahead that should keep 
everything working. We will get you an alpha copy of whatever we do as 
soon as possible so you can verify we didn't break nios2!
Best Regards,
Bill Campbell

 Regards,
 --Scott



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 8:13 AM, Simon Glass wrote:
 On Fri, May 27, 2011 at 8:00 AM, J. William Campbell
 jwilliamcampb...@comcast.net  wrote:
 [snip]
 Hi All,
  A more precise statement of the problem is that all timer delays
 may be shortened by the timer resolution. So this means that if you have
 a timeout of 1 ms in your get_time(0) {   } while ( ...  1), then your
 actual delay may be anywhere between 0 and 1 ms. The problem arises when
 some piece of common code uses a delay of say 8 millisec, expecting the
 actual delay to be between 7 and 8. If the resolution is 10 ms, the
 delay will be between 0 and 10 ms, 0 being particularly bad. This can be
 fixed in get_timer, making the 8 ms delay  become a minimum of 10 ms at
 the expense of it becoming up to 20 ms sometimes. Since these delays are
 used mostly for error conditions, making them longer will probably be
 ok, and doesn't require changing any of the common code. It probably
 will not make things slower either, because the error timeouts should
 not be reached. The reset of the hardware timer would cause all short
 delays to become 10 ms. This reset approach is bad in that it prevents
 proper nesting of timing loops. However, in this case it isn't so bad,
 in that the nested loops are just extended, not shortened. Note that if
 the reset is only resetting the HARDWARE interrupt generator, not the
 actual timestamp itself, we are just extending all existing timeouts by
 0 to 10 ms.. So this just lengthens all pending timeouts. The other fix
 is in my opinion nicer, because it affects the nest loops less. If the
 inner loop is executed 100 times, with the reset, the outer loop timeout
 is extended by up to 1000 ms.

 Best Regards,
 Bill Campbell
 Hi Bill,

 Yes I agree that this is ugly - I didn't realize that this is what
 reset_timer() does, but I think these 10ms platforms should have to
 live with the fact that timeouts will be 0-10ms longer than hoped.
 Perhaps reset_timer() should become a non-standard board thing that is
 deprecated. Really if you have a 10ms timer and are asking for a 10ms
 timeout you are being a bit hopeful.
Hi All,
 Yes, but the person writing the driver was writing common code. 
He probably didn't even know there was a timer whose resolution was not 
1 ms.
 But perhaps this argues for a function to check timeouts - at the
 moment get_timer() returns the time since an event and it is used at
 the start of the loop and the end. Perhaps we should have:

 #define TIMEOUTMS 2000

 stop_time = get_future_time(TIMEOUT_MS);  // Returns current time +
 TIMEOUT_MS + (resolution of timer)
 while (get_timer(stop_time)  0) // (I would much prefer while
 (!timed_out(stop_time))
 wait for something
 }

 Regards,
 Simon
In the existing system, you can get the same result by running the while 
loop  with a condition of (get_timer(base)  TIMEOUTMS + TIMER_RESOLUTION).
We could just make TIMER_RESOLUTION a mandatory define for all u-boots. 
Then common code would be wrong if the TIMER_RESOLUTION were omitted. 
For all I know, there may be such a define already. Anybody know of one?

Best Regards,
Bill Campbell

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 6:27 AM, Graeme Russ wrote:
 Hello Everyone,

 OK - Starting a new thread to discuss implementation details. This is a
 heads-up for arch/platform maintainers - Once this is a bit more stable, I
 will put it on the wiki

 Assumed Capabilities of the Platform
   - Has a 'tick counter' that does not rely on software to increment
Hi All,
The nios2 with the most basic timer does not meet this 
requirement. It will not count at all without the 10 ms interrupt. I 
don't think this requirement matters anyway. We need a 'tick counter' 
that 'ticks'. If it takes software to make it tick, we don't much care. 
There may be problems with early use of udelay in that case, but that is 
a different issue.
   - tick interval may by a fixed constant which cannot be controlled
 via software, or it could be programmable (PIT)

 API Functions (/lib/timer.c)
   - u32 get_timer(u32 start)
  - Returns the number of elapsed milliseconds since 'start'

 API Functions (/arch/...)
   - void udelay(u32 delay)
  - Used for 'short' delays (generally up to several seconds)
  - Can use the tick counter if it is fast enough
  - MUST NOT RESET THE TICK COUNTER
There is a requirement that udelay be available before relocation and 
before the BSS is available. One can use the tick counter to provide 
udelay as long as sync_timebase is not called OR sync timebase does not 
use BSS. It appears many implementations ignore this requirement at 
present. We should try to fix this, but is should not be a requirement.
 'Helper' Functions (/lib/timer.c)
I think this function should be weak, so that it is possible for people 
to override it with a custom function. The fully general sync_timebase 
has lots of code in it that can be simplified in special cases. We want 
and need a fully general function to be available, but other users who 
are real tight on space may want a cut down version. We should make that 
easily possible.
   - void sync_timebase(void)
  - Updates the millisecond timer
  - Utilises HAL functions to access the platform's tick counter
  - Must be called more often than the rollover period of the
platform's tick counter
  - Does not need to be called with a regular frequency (immune
to interrupt skew)
  - Is always called by get_timer()
  - For platforms with short tick counter rollovers it should
be called by an ISR
  - Bill Campbell wrote a good example which proved this can be common
and arbitrary (and optionally free of divides and capable of
maintaining accuracy even if the tick frequency is not an even
division of 1ms)

 HAL Functions (/arch/... or /board/...)
   - u64 get_ticks(void)
For what it's worth, I would like to propose using a (gasp!) typedef 
here. It seems to me there are a whole lot of cases where the max number 
of ticks is a u32 or less. In those cases, the wrap at 32 bits helps 
things a lot. If the tick counter is really 64 bits, the function of 
sync_timebase  is simply to convert the tick value  to millisec, and 
that is it. Otherwise, if it is 32 bits or less then some other actions 
will be required. I think this is one of those times where a typedef 
would help, We could define a type called timer_tick_t to describe this 
quantity. That would allow a pure 32 bit implementation where appropriate.

Another suggestion is that perhaps we want a u32 get_ticks_lsb(void) as 
well as a regular get_ticks. The lsb version would be used for udelay 
and could possibly come from another timer if that was 
necessary/desirable. See the requirement for early udelay early 
availability.
  - Returns a tick count as an unsigned 64-bit integer
  - Abstracts the implementation of the platform tick counter
(platform may by 32-bit 3MHz counter, might be a 16-bit
0-999 microsecond plus 16-bit 0-65535 millisecond etc)
   - u64 ticks_per_millisecond()
  - Returns the number of ticks (as returned by get_ticks()) per
millisecond
I think ticks_per_sec would be a better choice. First, such a function 
already exists in almost all u-boots. Second, if one wants the best 
accuracy for things like udelay, you need better accuracy than  
millisec. Since this function is used only infrequently, when things are 
initialized, a divide to get ticks_per_millsec (if that is what you 
really want) is no big deal. Lastly, I think this function can remain 
u32. Yes, there is a 4 GHz limit on the clock rate. If this ever becomes 
an issue, we can change the type to timer_tick_t. When the CPU clock 
rate gets quite high, it is an advantage to divide it down for 
performance measurement anyway. The AMD/Intel chips already do this. If 
the hardware doesn't do it, shift the timer value right two bits. I 
doubt you will miss the 0.4 nanoseconds resolution loss from your 10 GHz 
timestamp.
   - void timer_isr()
  - Optional (particularly if tick counter rollover period is
exceptionally log 

Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 12:16 PM, Wolfgang Denk wrote:
 Dear J. William Campbell,

 In message4ddea165.9010...@comcast.net  you wrote:
 I think it is the task of get_ticks to return the hardware tick counter
 as an  increasing counter, period.  The counter may wrap at some final
 count that is not all ones. That is ok. Sync_timebase deals with the
 NO!  We want to be able to compute time differences using simple
 unsigned arithmentics, even after a rollover of the counter.  For this
 it is mandatory that the counter always gets only incremented until it
 wraps around at te end of it's number range, and never gets reset
I agree that that is what must happen, but it should happen inside
 sync_timebase. Sync_timebase does what is needed to convert the
 less-than-fully capable counters into a fully capable one. You could
 I think you also want this behaviour for get_ticks().
Hi Wolfgang,
 I understand why that might be nice. But to do that with common 
code would require get_ticks to call a generic routine (say sync_ticks) 
that would expand the counter to 64 bits. Note that this is not always 
totally trivial, as the timer may roll over at 10 ms or some other 
not-so-nice number. Then sync_timer would convert the 64 bit number to 
milliseconds. That approach will work. However, I think that is 
overkill, as we really want the result in milliseconds. If you look at 
the prototype sync_timer routine, you can see an example of how this is 
possible without resorting to 64 bit math. I think that avoiding the 64 
bit math on processors that don't have a 64 bit tick counter (and are 
therefore probably less capable) is worthwhile. I also think that the 
purpose of the get_time routine abstracting the time into milliseconds 
is to avoid dealing with ticks anywhere except in the timer routines. 
Presumably, nobody but sync_timer would ever have reason to call 
get_ticks. If that is not your thinking,  fine, we just disagree on that 
point, and having a sync_ticks to expand the tick counter certainly can 
be done.
 To date, it has been shown conclusively that this process cannot be
 relied upon, or we wouldn't be having this discussion.  If we put that
 functionality inside sync_timebase, it is in one place and debuggable
 once. All sync_timebase requires to do this is ticks per second and
 maximum tick value. I do request that counters that decrement be negated
 in the get_ticks routine, but beyond that, it should be a simple read of
 the tick register(s).
 I think using ticks per second is not a good idea. It may exceed
 ULONG_MAX, and having to use 64 bit for all calculations is probably
 overkill.  The existing ticks2usec/usec2ticks work fine so far.
I certainly agree using 64 bits for all calculations is vast overkill. 
In fact, I think using 64 bit calculations on systems that have only a 
32 bit or less timer register is probably overkill. :-) However, to 
date,AFAIK,  no processor has exceeded the u32 in ticks per second. As I 
pointed out in a previous e-mail, if they ever do this, we can just drop 
one or 2 bits off the 64 bit counter and in millisecond resolution, 
nobody will ever know.  Also as previously pointed out, usec2ticks is 
not present yet in lots of implementations. Also, if the fundamental 
clock frequency is 32 kHz  (or anything less than 1 MHz), usec2ticks is 
0! This probably rules out using it to get ticks per millisec or ticks 
per sec.

Best Regards,
Bill Campbell

 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 1:27 PM, Wolfgang Denk wrote:
 Dear J. William Campbell,

 In message4ddeafe0.8060...@comcast.net  you wrote:
 I certainly agree using 64 bits for all calculations is vast overkill.
 In fact, I think using 64 bit calculations on systems that have only a
 32 bit or less timer register is probably overkill. :-) However, to
 date,AFAIK,  no processor has exceeded the u32 in ticks per second. As I
 Not yet. But it makes no sense to start a new design with settings
 already in critical range, especially since there is zero problem
 with breaking it down by a factor of 1000 or 1e6.

 pointed out in a previous e-mail, if they ever do this, we can just drop
 one or 2 bits off the 64 bit counter and in millisecond resolution,
 nobody will ever know.  Also as previously pointed out, usec2ticks is
 No. I will not accept a design that is so close on the edge of
 breaking.

 What is your exact problem with the existing interfaces ticks2usec()
 and usec2ticks() ?

 not present yet in lots of implementations. Also, if the fundamental
 clock frequency is 32 kHz  (or anything less than 1 MHz), usec2ticks is
 0! This probably rules out using it to get ticks per millisec or ticks
 per sec.
 The statement usec2ticks is 0 makes absolutely no sense as long as
 you don't say which argument you pass in.  You get a return value of
 0 even for a tick rate in the GHz range if you pass 0 as argument.

 Hoewver, usec2ticks(1000) or maybe usec2ticks(10) will probably
 return pretty useful values.

 [Note that by passing properly scaled arguments you can also avoid a
 number of rounding errors.]
Hi Wolfgang,
   Yes, you are correct. I was thinking usec2ticks(1), which is 
certainly not the way to do it. I am happy with usec2ticks and 
ticks2usec. That works for me. Sorry for the noise.

How about the first part of my response? Are you still thinking about it 
or is it just too bad for words :-) ?

Best Regards,
Bill Campbell


 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 4:28 PM, Graeme Russ wrote:
 Hi Bill,

 On Fri, May 27, 2011 at 2:56 AM, J. William Campbell
 jwilliamcampb...@comcast.net  wrote:
 On 5/26/2011 6:27 AM, Graeme Russ wrote:
 Hello Everyone,

 OK - Starting a new thread to discuss implementation details. This is a
 heads-up for arch/platform maintainers - Once this is a bit more stable, I
 will put it on the wiki

 Assumed Capabilities of the Platform
   - Has a 'tick counter' that does not rely on software to increment
 Hi All,
The nios2 with the most basic timer does not meet this requirement. It
 will not count at all without the 10 ms interrupt. I don't think this
 requirement matters anyway. We need a 'tick counter' that 'ticks'. If it
 takes software to make it tick, we don't much care. There may be problems
 with early use of udelay in that case, but that is a different issue.
 I think we will need to define get_timer() weak - Nios will have to
 override the default implementation to cater for it's (Nios') limitations
Hi All,
 Yes, that will probably be required here.
   - tick interval may by a fixed constant which cannot be controlled
 via software, or it could be programmable (PIT)

 API Functions (/lib/timer.c)
   - u32 get_timer(u32 start)
  - Returns the number of elapsed milliseconds since 'start'

 API Functions (/arch/...)
   - void udelay(u32 delay)
  - Used for 'short' delays (generally up to several seconds)
  - Can use the tick counter if it is fast enough
  - MUST NOT RESET THE TICK COUNTER
 There is a requirement that udelay be available before relocation and before
 the BSS is available. One can use the tick counter to provide udelay as long
 as sync_timebase is not called OR sync timebase does not use BSS. It appears
 many implementations ignore this requirement at present. We should try to
 fix this, but is should not be a requirement.
 If you really wanted to, sync_timebase() could use global data (it doesn't
 have many static variables) in which case all timer functions would be
 available before relocation
Yes, my implementation of the sync_timebase routine was written that 
way, using gd- for the required variables.
 'Helper' Functions (/lib/timer.c)
 I think this function should be weak, so that it is possible for people to
 override it with a custom function. The fully general sync_timebase has
 lots of code in it that can be simplified in special cases. We want and need
 a fully general function to be available, but other users who are real tight
 on space may want a cut down version. We should make that easily possible.
 Agree

   - void sync_timebase(void)
  - Updates the millisecond timer
  - Utilises HAL functions to access the platform's tick counter
  - Must be called more often than the rollover period of the
platform's tick counter
  - Does not need to be called with a regular frequency (immune
to interrupt skew)
  - Is always called by get_timer()
  - For platforms with short tick counter rollovers it should
be called by an ISR
  - Bill Campbell wrote a good example which proved this can be common
and arbitrary (and optionally free of divides and capable of
maintaining accuracy even if the tick frequency is not an even
division of 1ms)

 HAL Functions (/arch/... or /board/...)
   - u64 get_ticks(void)
 For what it's worth, I would like to propose using a (gasp!) typedef here.
 It seems to me there are a whole lot of cases where the max number of ticks
 is a u32 or less. In those cases, the wrap at 32 bits helps things a lot. If
 the tick counter is really 64 bits, the function of sync_timebase  is simply
 to convert the tick value  to millisec, and that is it. Otherwise, if it is
 32 bits or less then some other actions will be required. I think this is
 one of those times where a typedef would help, We could define a type called
 timer_tick_t to describe this quantity. That would allow a pure 32 bit
 implementation where appropriate.

 Another suggestion is that perhaps we want a u32 get_ticks_lsb(void) as well
 as a regular get_ticks. The lsb version would be used for udelay and could
 possibly come from another timer if that was necessary/desirable. See the
 requirement for early udelay early availability.
 I think this all adds unnecessary complexity

  - Returns a tick count as an unsigned 64-bit integer
  - Abstracts the implementation of the platform tick counter
(platform may by 32-bit 3MHz counter, might be a 16-bit
0-999 microsecond plus 16-bit 0-65535 millisecond etc)
   - u64 ticks_per_millisecond()
  - Returns the number of ticks (as returned by get_ticks()) per
millisecond
 I think ticks_per_sec would be a better choice. First, such a function
 already exists in almost all u-boots. Second, if one wants the best accuracy
 for things like udelay, you need better accuracy than  millisec. Since this
 function is used only infrequently, when things

Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 6:51 PM, Graeme Russ wrote:
 Hi Bill,

 On Fri, May 27, 2011 at 11:26 AM, J. William Campbell
 jwilliamcampb...@comcast.net  wrote:
 On 5/26/2011 4:28 PM, Graeme Russ wrote:
 Why mess around with bit shifting (which you would then have to cludge
 into
 your platform code) when carting around a 64-bit value is relatively
 cheap,
 transparent and poratble (all all supported up-to-date tool chains)

 I really STRONGLY disagree with this statement. If you actually needed 64
 bit variables, fine use them. But as I have already shown, you do not need
 them in general.  We are computing a 32 bit result. There is some entropy
 argument that says you shouldn't need 64 bits to do so. Another way to look
 at it is that converting the top 32 bit word and the bottom 32 bit word to
 ms separately can be easier than doing them both together at once.  However,
 as we will see below, I do agree we need two 32 bit words to make this
 process go smoothly. I just don't agree that they should/will constitute a
 64 bit binary word. See below.
   - void timer_isr()
  - Optional (particularly if tick counter rollover period is
exceptionally log which is usually the case for native 64-bit tick
counters)
  - Simply calls sync_timebase()
  - For platforms without any tick counter, this can implement one
(but accuracy will be harmed due to usage of disable_interrupts()
 and
enable_interrupts() in U-Boot

 So to get the new API up and running, only two functions are mandatory:

 get_ticks() which reads the hardware tick counter and deals with any
 'funny
 stuff' including rollovers, short timers (12-bit for example), composite
 counters (16-bit 0-999 microsecond + 16-bit millisecond) and maintains a
 'clean' 64-bit tick counter which rolls over from all 1's to all 0's.
 The
 I think it is the task of get_ticks to return the hardware tick counter
 as
 an  increasing counter, period.  The counter may wrap at some final count
 that is not all ones. That is ok. Sync_timebase deals with the rollovers
 if
 The hardware tick counter may, the 64-bit software tick counter maintained
 by get_ticks() may not
 necessary. get_ticks is very lightweight. get_ticks should deal with
 decrementing counters by returning the complement of the counter.  The
 sc520
 case is a bit more complex if you intend to use the 0-999 and 16 bit
 millisec registers, in that you do need to add them to the previous value
 to
 As I mentioned in another post, this is a problem for the platform
 maintainer and is abstracted away throught the platform specific
 implementation of get_ticks()

 make an increasing counter. Sync_timebase likes short counters in that
 they are easy to convert to millisec and tick remainders.
 The compiler should handle using 64-bit rather than 32-bit transparently
 True enough.  But you don't need 64 bit variables at this point two 32 bit
 ones work just fine, in fact better in most cases.
 Remember, we are not dealing with a high performance OS here. The primary
 goal is portability - Performance optimisations (which do not break
 portability) can be performed later

 64-bit tick counter does not need to be reset to zero ever (even on
 startup
 - sync_timebase tacks care of all the details)
 True, but sync_timebase does have to be initialized (as does the timer
 itself in most cases, so this is not a restriction).
 This can be done in timer_init() via a call to sync_timebase() after the
 timer has been configured. This should bring everything into line

 ticks_per_millisecond() simply return the number of ticks in a
 millisecond
 - This may as simple as:

 inline u64 ticks_per_millisecond(void)
 {
 return CONFIG_SYS_TICK_PER_MS;
 }

 But it may be trickier if you have a programmable tick frequency
 You will have to call the routine that initializes sync_timebase. This
 routine should have a name, like void init_sync_timebase(void)?
 The optional timer ISR is required if the tick counter has a short roll
 over duration (short is up to you - 1 second is short, 1 hour might be,
 1
 century is not)

 Regards,

 Graeme

 It is probably true that sync_timebase should have a parameter flag. The
 reason is that if the timer isr is called only when the timer wraps, then
 the calls to sync_timebase may be slightly more than a full timer period
 apart. (due to interrupt latency). Therefore, when the timer difference
 is
 computed, if the current update is due to a wrap AND the previous update
 is
 due to a wrap, the difference should be approximately 1 wrap. If it comes
 up
 real short, you must add a wrap. This isn't necessary if the routine is
 called more often than once per wrap. Also, when sync_timebase is called
 in
 timer_isr() MUST be called more often than the rollover period of the
 underlying hardware tick counter - This is a requirement
 The equality case can be made to work.  If the extension of the counter is
 done in the interrupt routine, not in get_ticks, get_ticks just

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-25 Thread J. William Campbell

On 5/24/2011 10:17 PM, Wolfgang Denk wrote:

Dear J. William Campbell,

In message4ddc31eb.6040...@comcast.net  you wrote:
...

A tick is defined as the smallest increment of system time as measured by a
computer system (seehttp://en.wikipedia.org/wiki/System_time):

System time is measured by a system clock, which is typically
implemented as a simple count of the number of ticks that have
transpired since some arbitrary starting date, called the
epoch.

Unfortunately, this definition is obsolete, and has been for quite some

Do you have any proof for such a claim?

Hi Wolfgang,
 Well, yes, in fact the same reference you used. Note that the 
statement A tick is defined as the smallest increment of system time as 
measured by a computer system is NOT found in 
http://en.wikipedia.org/wiki/System_time. That page is defining system 
time, and what we are discussing is the definition of a tick. In fact, 
on the same wiki page you cite, there IS the statement Windows NT 
http://en.wikipedia.org/wiki/Windows_NT counts the number of 
100-nanosecond ticks since 1 January 1601 00:00:00 UT as reckoned in the 
proleptic Gregorian calendar 
http://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar, but returns 
the current time to the nearest millisecond. Here 100 nanosecond ticks 
clearly does not refer to any hardware 100 ns clock that exists on the 
pc. The 100 ns is a computed (dare I say virtual) tick value. The 
point here is that the definition of tick is yours, not wikipedia.org's. 
(Although we all are aware if it is wikipedia, it MUST be so).  Further, 
http://en.wikipedia.org/wiki/Jiffy_%28time%29#Use_in_computing contains 
the statement In computing http://en.wikipedia.org/wiki/Computing, a 
jiffy is the duration of one tick of the system timer 
http://en.wikipedia.org/wiki/System_time interrupt 
http://en.wikipedia.org/wiki/Interrupt.. If a tick is the smallest 
increment of system time as measured by the computer system, the of the 
system timer interrupt part of the statement would be unnecessary. The 
fact it IS present indicates there are other kinds of ticks present in 
the universe.
AFAIK, all timers tick, and the definition of the tick rate is 
1/timer resolution. The concept of timer ticks and clock ticks has 
been around forever, and exists independent of System Time. For example, 
there may be a watchdog timer on a system that does not measure time at 
all, yet the watchdog still ticks. When you read a value from a timer 
that was not the system timer, what do you call the value you read? I 
would call it ticks, and I bet just about everybody else would too.
The only reason I feel this point matters at all is that when one 
refers to a routine called get_ticks, it is not obvious to me which 
timer ticks are being referenced. You are saying that, by definition, it 
refers to the system clock.  My feeling is that it is not obvious why 
that is so on a system that has many clocks. The name of the function or 
an argument to the function should, IMNSHO,  specify which timer ticks 
are being returned.


Best Regards,
Bill Campbell

years.  When computers had a single timer, the above definition worked,
but it no longer does, as many (most?) computers have several hardware
timers. A tick today is the time increment of any particular timer of
a computer system. So, when one writes a function called get_ticks on a
PPC, does one mean read the decrementer, or does one read the RTC or
does one read the TB register(s) A similar situation exists on the
Blackfin BF531/2/3, that has a preformance counter, a real-time clock,
and three programmable timers. Which tick do you want? For each u-boot

Please re-read the definition.  At least as far as U-Boot and Linux
are concerned, there is only a single clock source used to implement
the _system_time_.  And I doubt that other OS do different.


implementation, we can pick one timer as the master timer, but it may
not be the one with the most rapid tick rate. It may be the one with the
most USEFUL tick rate for get_timer. If you take the above definition at
face value, only the fastest counter value has ticks, and all other
counters time increments are not ticks. If they are not ticks, what are
they?

Clocks, timers?


Best regards,

Wolfgang Denk



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-25 Thread J. William Campbell
On 5/25/2011 12:46 PM, Wolfgang Denk wrote:
 Dear Graeme Russ,

 In messagebanlktikm3lpynzcknp64kjeq5v+te7y...@mail.gmail.com  you wrote:
 I hope to get an implementation agreed upon that does not require
 interrupts at all, provided a tick counter with sufficiently long roll
 over time is available (longer than the maximum expected period
 between 'related' get_timer() calls, for example calls to get_timer()
 in a timeout testing while loop). This 'bare minimum' implementation
 can be optionally augmented with an ISR which kicks the timer
 calculation in the background (typically by just calling get_timer())

 It really is quite simple in the end.
 The length of this thread shows that it is not as simple as you want
 to make us believe.



 If all you have in mind are timeouts, then we don't need get_timer()
 at all.  We can implement all types of timeout where we wait for some
 eent to happen using udelay() alone.

 Need a 10 second timeout? Here we go:

   int cnt = 0;
   int limit = 10 * 1000;
   while (!condition) {
   usleep(1000); /* wait 1 millisec */
   if (++cnt  limit)
   break;
   }
   if (cnt  limit)
   error(timeout);

 get_timer() comes into play only if we want to calculate a time
 difference, for example if we want to run some code where we don't
 know how long it runs, and later come back and check if a certain
 amount of time has passed.

 When we don't know how long this code runs, we also cannot know (and
 espeically not in advance) wether or not this time will be longer or
 not than the rollover time of the tick counter.


 Your plan to require that get_timer() gets called often enough to
 prevent or detect tick counter overflows is putting things on their
 head.  It should be the opposite:  The implementation of get_timer()
 should be such that it becomes independent from such low level
 details.

HI all,
   We also cannot know if this code disables interrupts. If it does, 
the existing PPC design is broken. If interrupts are disabled during the 
entire code being executed, the elapsed run time will be 0. You can say 
that disabling interrupts in the code is not allowed. Fine, then that 
becomes a constraint on the code. Calling get_timer explicitly often 
enough is also a constraint on the code. One constraint is acceptable to 
you, the other is not. Ok, that's fine too. If the interrupt routine 
calls get_timer, then get_timer is called often enough and everything 
works the same as it presently does on PPC, It is not different than 
requiring the interrupt routine to be executed often enough. The two 
methods are functionally identical. The only problems arise on systems 
that don't support interrupts and don't have any timers with enough bits 
available to meet the 4294967 seconds maximum interval requirement. 
Those systems will be broken no matter what we do, as we have all agreed.

 Right now, almost all ARM cases are broken, because they have short 
timers and don't use interrupts. In some cases, there are actual bugs 
involved. We can make these cases less broken than they now are with a 
common get_timer approach as outlined previously. However, we cannot fix 
them to the standard Wolfgang is stating here, to be not broken. So, 
back to what I was asking before, is the improvement worth the effort if 
the result is still broken?

Best Regards,
Bill Campbell
 I have stated this before:  I consider any design that requires
 get_timer() to be called often enough broken.

 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-25 Thread J. William Campbell
On 5/25/2011 4:13 PM, Graeme Russ wrote:
 Hi Wolfgang

 On Thu, May 26, 2011 at 7:16 AM, Wolfgang Denkw...@denx.de  wrote:
 Dear Graeme Russ,

 In message4ddd7066.4000...@gmail.com  you wrote:
 No, not at all. And I already answered this. For example on PPC, just
 reading the timebase would be perfectly sufficient, and simpler and
 more reliable than the current interrupt based approach.
 I assume by 'timebase' you mean the 64-bit tick counter. If so, that is
 By timebase I mean the timebase register, implemented as two 32 bit
 registers tbu and tbl, holding the upper and the lower 32 bits of the
 free-running 64 bit counter, respective.
 And remember, not all platforms have this implementation. The AMD sc520
 for example has a microsecond register which counts 0-999 that ticks a
 16-bit millisecond register and resets to zero. And the millisecond
 register latches the value of the microsecond register and resets
 (the millisecond register) back to zero.

 The thing is, this can all be abstracted away via get_tick() which
 (provided it is called every 65 seconds or so) can maintain a software
 version of the timebase register. So, every 65 seconds, the prescaler
 needs to be kicked. Now, if all we want to use get_timer() for is to
 monitor a timeout (which I think might be every single use in U-Boot
 to date) then the while (get_timer(start)  timeout) loop will work. If
 get_timer() is needed to measure time between two arbitrary events (which
 I 100% agree it should be able to do) then the prescaler will need to be
 kicked (typically by an interrupt)

 _exactly_ what I am suggesting we do (and what does already happen on ARM).
 I don't think so.
Hi All,
   Just to be clear, while ARMv7 has a 64 bit performance counter, 
it is not presently used by get_time. This is a change we want to make 
correct?
 On closer inspection, some do, some don't. All ARMv7 (OMAP, S5P, Tegra2)
 do. at91 is odd - It looks like it uses interrupts, but get_timer() and
 udelay() both end up calling get_timer_raw() (with udelay only having
 millisecond resolution it seems).
I am not sure why you say at91 appears to use interrupts. There is a 
comment in arch/arm/cpu/arm930t/at91/timer.c that says timer without 
interrupts (line 73). There is the same comment in 
arch/arm/cpu/arm930t/at91rm9200/timer.c Nothing in either routine refers 
to interrupts, so I would say the timer doesn't use them. I could be 
wrong of course.
   Some others can be configured to
 increment the timer using an interrupt. ARM is, quite frankly, a complete
 mess - It has a mass of *_timer_masked() functions which the core timer
 functions are 'wafer thin' wrapper around, udelay() silently resets
 the timebase trashing get_timer() loops etc.
I sure agree with this last part. The only arm timer I found that 
clearly thought it could use interrupts was in arch/arm/cpu/ixp, and 
that was conditional, not mandatory.
 So let's wind back and distill the approach I am suggesting:

   1) A common prescaler function in /lib/ - It's purpose is to maintain
  a 1ms resolution timer (if the platform cannot otherwise do so)[1]
  The prescaler utilises a platform provided get_ticks()[2]
   2) A get_ticks() function provided by the platform - This function must
  return an unsigned counter which wraps from all 1's to all 0's - It
  DOES NOT have to be initialised to zero at system start. get_ticks()
  hides the low-level tick counter implementation - The sc520 example
  above is a classic example, so is your PPC tbu/tbl example.
   3) [Optional]An ISR which calls the prescaler[3]

 Now there is an optimisation if your tick counter has a 1ms resolution
 and is not small (i.e. 64-bits) - The prescaler is defined weak, so in
 the platform code, re-implement the prescaler to simply copy the tick
 counter to the timer variable.

 And what are the specific implementation types (in decending order of
 preference)? I think:
   1) A 64-bit micro-second tick counter[5]
- No interrupts needed
- Can be used by udelay() and get_timer() trivially
   2) A 64-bit sub-micro-second tick counter
- Interrupts most likely undeeded unless the tick frequency is
  insanely high
- Can be used by udelay() and get_timer() trivially
   3) A 64-bit milli-second tick counter
- No interrupts needed
- No prescaler needed
- Can be used by get_timer() trivially
- udelay() needs another tick source (if available) or be reduced
  to millisecond resolution
   4) A 32-bit milli-second tick counter
- No prescaler needed[6]
- Max 'glitch free' duration is ~50 days
- ISR needed to kick prescaler if events longer than 50 days need
  to be timed
- Can be used by get_timer() trivially
- udelay() needs another tick source (if available) or be reduced
  to millisecond resolution
   5) A 24-bit milli-second tick counter
- No prescaler needed[6]

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-25 Thread J. William Campbell
On 5/25/2011 9:19 PM, Reinhard Meyer wrote:
 Dear Graeme Russ,
 On closer inspection, some do, some don't. All ARMv7 (OMAP, S5P, Tegra2)
 do. at91 is odd - It looks like it uses interrupts, but get_timer() and
 udelay() both end up calling get_timer_raw() (with udelay only having
 millisecond resolution it seems). Some others can be configured to
 increment the timer using an interrupt. ARM is, quite frankly, a 
 complete
 mess - It has a mass of *_timer_masked() functions which the core timer
 functions are 'wafer thin' wrapper around, udelay() silently resets
 the timebase trashing get_timer() loops etc.

 Please look at current master for at91.

 http://git.denx.de/?p=u-boot.git;a=blob;f=arch/arm/cpu/arm926ejs/at91/timer.c;h=a0876879d3907af553d832bea187a062a22b9bd4;hb=5d1ee00b1fe1180503f6dfc10e87a6c6e74778f3
  


 AT91 uses a 32 bit hardware register that by means of a prescaler is made
 to increment at a rate in the low megahertz range.

 This results in a wrap approximately every 1000 seconds.

 Actually this would be sufficient for all known uses of udelay() and 
 get_timer()
Hi All
   Yes, you are correct. It would be sufficient for all known uses 
of udelay and get_timer(). However, Wolfgang has indicated a strong 
desire that the performance of the new API work properly over the full 
32 bit range of the millisecond delay time. That has been the basic 
issue for some time here.
 timeout loops. However, this hardware register is extended to 64 bits 
 by software
 every time it is read (by detecting rollovers).
Yes, but this extension ONLY happens if get_ticks is called via udelay 
or get_timer. It doesn't happen if you are sitting at the command prompt 
or off executing a downloaded stand alone program. You might ask who 
cares, and I would answer that Wolfgang cares, at least to some level. 
If the timer overflow triggered an interrupt, we could call get_ticks to 
update the extended time inside the interrupt routine. But, as far as I 
know, it doesn't. There are some other ARM processors that have a 32 bit 
clock derived from a 32 kHz crystal, The will work much as you example 
does up to 134217 seconds, in fact much longer than your AT91 example 
does. However, that doesn't touch the 4294967 seconds that the PPC can 
manage. Without interrupts, the 32 bit (or smaller) counters will NEVER 
get to the PPC standard if their tick rate exceeds 1 msec. It may be 
that we need a lower standard, perhaps saying 1000 seconds is enough. 
But that is not my decision to make.

 Since a wrap of that 64 bit tick would occur after the earth has ended,
 it is simple to obtain milliseconds from it by doing a 64 bit division.
True, but moot because of the above.
Best Regards,
Bill Campbell

 Best Regards,
 Reinhard



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-25 Thread J. William Campbell
On 5/25/2011 9:40 PM, Graeme Russ wrote:
 On Thu, May 26, 2011 at 2:19 PM, Reinhard Meyer
 u-b...@emk-elektronik.de  wrote:
 Dear Graeme Russ,
 On closer inspection, some do, some don't. All ARMv7 (OMAP, S5P, Tegra2)
 do. at91 is odd - It looks like it uses interrupts, but get_timer() and
 udelay() both end up calling get_timer_raw() (with udelay only having
 millisecond resolution it seems). Some others can be configured to
 increment the timer using an interrupt. ARM is, quite frankly, a complete
 mess - It has a mass of *_timer_masked() functions which the core timer
 functions are 'wafer thin' wrapper around, udelay() silently resets
 the timebase trashing get_timer() loops etc.
 Please look at current master for at91.

 http://git.denx.de/?p=u-boot.git;a=blob;f=arch/arm/cpu/arm926ejs/at91/timer.c;h=a0876879d3907af553d832bea187a062a22b9bd4;hb=5d1ee00b1fe1180503f6dfc10e87a6c6e74778f3

 AT91 uses a 32 bit hardware register that by means of a prescaler is made
 to increment at a rate in the low megahertz range.
 Yes, I see that now

 This results in a wrap approximately every 1000 seconds.
 Actually this would be sufficient for all known uses of udelay() and 
 get_timer()
 timeout loops. However, this hardware register is extended to 64 bits by 
 software
 every time it is read (by detecting rollovers).
 Which makes it 100% compatible with my proposed solution - The software
 prescaler will trigger the 64-bit extension and rollover detection

 Since a wrap of that 64 bit tick would occur after the earth has ended,
 it is simple to obtain milliseconds from it by doing a 64 bit division.
 Which would be done in the common prescaler in /lib/

 Currently, most ARM specific utilisations of get_timer() enforce a reset
 of the tick counter by calling reset_timer() - Subsequent calls to
 get_timer() then assume a start time of zero. Provided the internal timer
 rolls over currectly, the initial call of get_timer(0) will reset the ms
 timer and remove and 'glitch' present due to not calling the 'extender'
 function between 32-bit rollovers which makes the reset_timer() call
 unneccessary - I believe at91 behaves correctly in this regard.

 In any case, the underlying assumption made by the ARM timer interface
 (call reset_timer() first always) is inherently broken as not all users
 of the timer API do this - They assume a sane behaviour of:

   start = get_timer(0);
   elapsed_time = get_timer(start);

 Add to this udelay() resetting the timer make the following very broken:

   start = get_timer(0);
   while(condition) {
   udelay(delay);
   }
   elapsed_time = get_timer(start);

 NOTE: In this case, if udelay() also calls the prescaler then no interrupt
 triggered every 1000s would be required in the above example to get
 correct elapsed_time even if the loop ran for several hours (provided
 udelay() is called at least every 1000s

 However, to allow timing of independent events with no intervening
 udelay() or get_timer() calls, an 1000s interrupt to kick the prescaler is
 all that is needed to make this particular implementation behave correctly.
Hi All,
   True, if the processor supports timer interrupts. The problem is 
that the existing u-boots in many cases do not. I think that is really 
the crux of the problem. So what are we going to do? I am open to ideas 
here.

Best Regards,
Bill Campbell

 Of course disabling interruts and not calling get_timer() or udelay() will
 break the timer - But there is nothing that can be done about that)

 Regards,

 Graeme



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-24 Thread J. William Campbell
On 5/24/2011 7:12 AM, Wolfgang Denk wrote:
 Dear Albert ARIBAUD,

 In message4ddb4c1c.7030...@aribaud.net  you wrote:
 Not sure I still follow what the two options are -- a heads up is welcome.
 However, I do like the simplicity in having a single time unit (ticks)
 for the timer API -- asuming it covers all needs -- and providing other
 time units only as helper functions.
 I don't think using ticks is a good idea.  You would need to change
 all definitiuons of timeouts and delays and such.

 Why not using a scaled unit like microsecods or the currently used
 milliseconds?

 I wonder why we suddenly have to change everything that has been
 working fine for more than a decade (ignoring the large number of
 incorrect, incomplete or broken implementations).  But so far we
 really never needed anything else but  udelay()  for the typical short
 device related timeouts, and  get_time()  for longer, often protocol
 defined,  timeouts.

 Is there anything wrong with these two solutions, based on standard
 units (us and ms) ?
Hi All,
 After really looking into this, I think I agree with Wolfgang 
that using ms for a get_timer timebase is the best way to go. This 
thinking is heavily influenced (in my case anyway) by the fact that in 
the interrupt driven cases (and these are the ONLY fully compliant cases 
ATM I think), the cost of using ms is 0, because that is the native 
unit in which the timer ticks. This makes everything real simple. We 
can, right today, produce an API that supports u32 get_time_ms(void) for 
all CPUs in use. This would allow u32 get_timer(u32 base) to continue to 
exist as-is. These implementations would still be technically broken 
in the non-interrupt case, but they would work at least as well as they 
presently do. In fact, they would operate better because they would all 
use a single routine, not a bunch of different routines (some of which I 
am pretty sure have errors). Wolfgang would need to accept the fact that 
we are not yet fixing all the non-interrupt cases. This needs to be 
done, but is a different problem (I hope). In the non-interrupt case 
there is some cost in converting to ms from the native units, but we are 
in a timout loop, so a few extra instructions do not matter much. It is 
only a few 32 bit multiplies, which these days are real fast. If we can 
figure out how to use interrupts eventually, even this will go away.

 Then, we can ADD an optional performance measuring API like 
Scott suggests. This API would be something similar to the following:
struct ticks
{
 u32tickmsb;
 u32   ticklsb;
} ;
struct time_value
{
   u32 seconds;
   u 32 nano_seconds;
};


and the functions would be get_ticks(struct ticks * p) , 
get_tick_delta(struct ticks * minuend, struct ticks * subtrahend),  
and   cvt_tick_delta(struct time_value * result, struct ticks *input). I 
didn't use u64 on purpose, because I don't want any un-necessary 
functions pulled from the library. get_ticks reads a hi precision 
counter. How high the precision is depends on the hardware. 
get_tick_delta subtracts two tick values, leaving the difference in the 
first operand. Yes, this is may be simple to do in open code but it is 
better to hide the details. (What if msb is seconds and lsb is 
nanoseconds, then it is not so simple). cvt_tick_delta converts from 
ticks to seconds and nano_seconds. We also need a u32 
get_tick_resolution, which would return the tick resolution in ns. The 
user never needs to do any arithmetic on ticks, so that makes his life 
much easier. However, the user may want to know the resolution of these 
measurements. All these functions are quite fast except possibly 
cvt_tick_delta, which is only needed for printing anyway.
 If the hardware has a performance monitoring counter, 
implementing these functions is quite simple. The PPC and the x86 
certainly do (well, any x86 pentium and above anyway). This is a whole 
lot of chips off our plate. In cases where no such counter exists, we 
will use whatever counters there are available already. Some of the ARM 
counters are running at 32 kHz, so their resolution won't be great, but 
it is what it is. If we find out that some of these other CPUs have a 
performance counter, we will use it. This API will be completely 
optional in u-boot and can be removed by changing a #define.
Thoughts welcome.

Best Regards,
Bill Campbell

 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-24 Thread J. William Campbell
On 5/24/2011 9:51 AM, Graeme Russ wrote:
 On 25/05/11 00:19, Wolfgang Denk wrote:
 =
snip
Hi all,
 I have a few of questions.
First, it seems that the get_timer interface is expected to work 
properly only after relocation and only when bss is available. I say 
this because the PPC version uses an (initialized) variable, timestamp, 
to hold the time. If that is the case, there is no need to hold the 
timer static data in gd, as I have been doing up to  now. Am I correct ?
  Second, udelay is expected to be available before  bss is 
available, very early in the startup process. There a comments to that 
effect in several places in the code. Therefore, udelay cannot use 
global or static variables. Is this true?
 And third, udelay is only expected/required to work correctly for 
short, like say 10 seconds. I say this based on comments contained in 
powerpr/lib/time.c. Please ack or nak this as well.

I have a couple of small changes to the previously submitted code, 
based partly on the above
 Exposing ticks and tick_frequency to everyone via a 'tick' HAL

 In /lib/timer.c

ulong timestamp = 0;
/*
  This routine is called to initialize the timer after BSS is available
*/
void  __weak prescaler_init(void)
{
 u32 tick_frequency = get_tick_frequency();
  /* initialize prescaler variables */
}

 void __weak prescaler(void)
 {
   u32 ticks = get_ticks();
 /* Bill's algorithm */
/* result stored in timestamp; */
 }

 u32 get_timer(u32 base)
 {
#if defined(CONFIG_SYS_NEED_PRESCALER)
   prescaler();
#endif
   return timestamp - base;
 }

 In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board/board/timer.c

 u32 get_ticks(void)
 {
   u32 ticks;

   /* Get ticks from hardware counter */

   return ticks;
 }

 u32 get_tick_frequency(void)
 {
   u32 tick_frequency;

   /* Determine tick frequency - likely very trivial */

   return tick_frequency;
 }
or instead the user may override prescaler_init and prescaler if, for 
some reason, a highly optimized version is desired. Note also that if 
the configuration variable CONFIG_SYS_NEED_PRESCALER is not defined, the 
additional prescaler routines will not be called anywhere so the 
routines should not be loaded. Yes, it it a #define to manage, but it 
should allow the existing u-boots to be the same size as before, with no 
unused code. This size matters to some people a lot!
 ===
 Not exposing ticks and tick_frequency to everyone

 In /lib/timer.c

 void prescaler(u32 ticks, u32 tick_frequency)
 {
   u32 current_ms;

   /* Bill's algorithm */

   /* result stored in gd-timer_in_ms; */
 }

 In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board/board/timer.c

 static u32 get_ticks(void)
 {
   u32 ticks;

   /* Get ticks from hardware counter */

   return ticks;
 }

 static u32 get_tick_frequency(void)
 {
   u32 tick_frequency;

   /* Determine tick frequency */

   return tick_frequency;
 }

 u32 get_timer(u32 base)
 {
   u32 ticks = get_ticks();
   u32 tick_frequency = get_tick_frequency();

   prescaler(ticks, tick_frequency);

   return gd-timer_in_ms - base;
 }

 ===
 I personally prefer the first - There is only one implementation of
 get_timer() in the entire code and the platform implementer never has to
 concern themselves with what the tick counter is used for. If the API gets
 extended to include get_timer_in_seconds() there is ZERO impact on
 platforms. Using the second method, any new feature would have to be
 implemented on all platforms - and we all know how well that works ;)

 And what about those few platforms that are actually capable of generating
 a 1ms timebase (either via interrupts or natively in a hardware counter)
 without the prescaler? Well, with prescaler() declared weak, all you need
 to do in /arch/cpu/soc/timer.c or /arch/cpu/timer.c or
 /board/board/timer.c is:

 For platforms with a 1ms hardware counter:
 void prescaler(void /* or u32 ticks, u32 tick_frequency*/)
 {
   gd-timer_in_ms = get_milliseconds();
 }

 For platforms with a 1ms interrupt source:
 void timer_isr(void *unused)
 {
   gd-timer_in_ms++;
 }

 void prescaler(void /* or u32 ticks, u32 tick_frequency*/)
 {
 }


 And finally, if the platform supports interrupts but either the hardware
 counter has better accuracy than the interrupt generator or the interrupt
 generator cannot generate 1ms interrupts, configure the interrupt generator
 to fire at any rate better than the tick counter rollover listed in
 previous post and:

 void timer_isr(void *unused)
 {
   /*
* We are here to stop the tick counter rolling over. All we
* need to do is kick the prescaler - get_timer() does that :)
*/
   get_timer(0);
 }

 In summary, platform specific code reduces to:
   - For a platform that cannot generate 1ms interrupts AND the hardware
 counter is 

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-24 Thread J. William Campbell
On 5/24/2011 12:19 PM, Wolfgang Denk wrote:
 Dear Graeme Russ,

 In message4ddbe22d.6050...@gmail.com  you wrote:
 Why must get_timer() be used to perform meaningful time measurement?
 Excellent question!  It was never intended to be used as such.
 Because get_timer() as it currently stands can as it is assumed to return
 milliseconds
 Yes, but without any guarantee for accuracy or resolution.
 This is good enough for timeouts, but nothing for time measurements.

 OK, let's wind back - My original suggestion made no claim towards changing
 what the API is used for, or how it looks to those who use it (for all
 practical intents and purposes). I suggested:
   - Removing set_timer() and reset_timer()
   - Implement get_timer() as a platform independent function
 Trying to remember what I have read in this thread I believe we have
 an agreement on these.

 Exposing ticks and tick_frequency to everyone via a 'tick' HAL
 I skip this.  I don't even read it.

 ===
 Not exposing ticks and tick_frequency to everyone

 In /lib/timer.c

 void prescaler(u32 ticks, u32 tick_frequency)
 {
  u32 current_ms;

  /* Bill's algorithm */

  /* result stored in gd-timer_in_ms; */
 }

 In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board/board/timer.c

 static u32 get_ticks(void)
 Currently we have unsigned long long get_ticks(void)  which is better
 as it matches existing hardware.

 Note that we also have void wait_ticks(u32) as needed for udelay().
Hi All,
I didn't comment before on the definition of ticks, but I fear 
that was unwise. The stated definition  was:

A tick is defined as the smallest increment of system time as measured by a
computer system (seehttp://en.wikipedia.org/wiki/System_time):

System time is measured by a system clock, which is typically
implemented as a simple count of the number of ticks that have
transpired since some arbitrary starting date, called the
epoch.


Unfortunately, this definition is obsolete, and has been for quite some 
years.  When computers had a single timer, the above definition worked, 
but it no longer does, as many (most?) computers have several hardware 
timers. A tick today is the time increment of any particular timer of 
a computer system. So, when one writes a function called get_ticks on a 
PPC, does one mean read the decrementer, or does one read the RTC or 
does one read the TB register(s) A similar situation exists on the 
Blackfin BF531/2/3, that has a preformance counter, a real-time clock, 
and three programmable timers. Which tick do you want? For each u-boot 
implementation, we can pick one timer as the master timer, but it may 
not be the one with the most rapid tick rate. It may be the one with the 
most USEFUL tick rate for get_timer. If you take the above definition at 
face value, only the fastest counter value has ticks, and all other 
counters time increments are not ticks. If they are not ticks, what are 
they?

This is one of the reasons I favor the performance monitoring system be 
separate from the get_timer timing methodology, as it will often use a 
different counter and time base anyway. That is also why I prefer to 
have a conversion routine that converts timer values to seconds and 
nano-seconds without reference to tick rates, so the user never has to 
deal with these ambiguities. Yes, under the hood, somebody does, but 
that need not be specified in the external interface. )Nobody has yet 
commented on my proposed performance measuring functions, and I know 
this group well enough not to assume that silence implies consent! )

The prescaler function defined by Graeme needs to read some timer value. 
It turns out that a u32 value is the most appropriate value for the 
get_timer operation.  The value desired is usually not of hardware timer 
tick resolution. It should be the specific bits in the timer, such that 
the LSB resolution is between 1 ms and 0.5 ms. We use greater resolution 
that this only when the counter is short enough that we need lower 
significance bits to make the counter 32 bits. For that reason, the 
function should probably be call something like u32 
read_timer_value(void). I really don't much care what it is called as 
long as we understand what it does.

Best Regards,
Bill Campbell


 static u32 get_tick_frequency(void)
 {
  u32 tick_frequency;

  /* Determine tick frequency */

  return tick_frequency;
 }
 Note that we also have u32 usec2ticks(u32 usec) and u32 ticks2usec(u32 ticks).

 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-24 Thread J. William Campbell
On 5/24/2011 5:17 PM, Graeme Russ wrote:
 On Wed, May 25, 2011 at 5:19 AM, Wolfgang Denkw...@denx.de  wrote:
 Dear Graeme Russ,

 In message4ddbe22d.6050...@gmail.com  you wrote:
 Why must get_timer() be used to perform meaningful time measurement?
 Excellent question!  It was never intended to be used as such.
 Because get_timer() as it currently stands can as it is assumed to return
 milliseconds
 Yes, but without any guarantee for accuracy or resolution.
 This is good enough for timeouts, but nothing for time measurements.
 Out of curiosity, are there any platforms that do not use their most
 accurate source(*) as the timebase for get_timer()? If a platform is using
 it's most accurate, commonly available, source for get_timer() the the
 whole accuracy argument is moot - You can't get any better anyway so
 why sweat the details.
Hi All,
Well, it is not quite that simple. The accuracy of the 1 ms 
interrupt rate is controlled in all cases I know about by the resolution 
of the programmable divider used to produce it. It appears that the x86 
uses a 1.19318 MHz crystal oscillator to produce the nominal 1 ms timer 
tick. (There is a typo in line 30 of arch/x86/lib/pcat_timer.c that says 
1.9318. I couldn't make any of the numbers work until I figured this 
out). The tick is produced by dividing the 1.19318 rate999.313 by 1194, 
which produces an interrupt rate of 999.3 Hz, or about 0.068% error. 
However, the performance counter on an x86 is as exact as the crystal 
frequency of the CPU is. FWIW, you can read the performance counter with 
rdtsc on a 386/486 and the CYCLES and CYCLES2 registers on later 
Intel/AMD chips. So yes, there is at least one example of a cpu that 
does not use it's most accurate (or highest resolution) time source.
 (*)I'm actually referring to what is commonly available for that platform,
 and not where a board has a high precision/accuracy source in addition to
 the common source.

 As a followup question, how many platforms use two completely independent
 sources for udelay() and get_timer() - x86 does, but I plan to change this
 so the interrupt kicks the new prescaler which can be done at  1ms period
 and udelay() and get_timer() will use the same tick source and therefore
 have equivalent accuracy.
Are you sure of this? From what I see in arch/x86/lib/pcat_timer.c, the 
timer 0 is programmed to produce the 1 kHz rate timer tick and is also 
read repeatedly in __udelay to produce the delay value. They even 
preserve the 1194 inaccuracy, for some strange reason. I see that the 
sc520 does appear to use different timers for the interrupt source, and 
it would appear that it may be exact, but I don't know what the input 
to the prescaler is so I can't be sure. Is the input to the prescaler 
really 8.3 MHz exactly? Also, is the same crystal used for the input to 
the prescaler counter and the software timer millisecond count. If 
not, then we may have different accuracies in this case as well.

Also of note, it appears that the pcat_timer.c, udelay is not available 
intil interrupts are enabled. That is technically non-compliant, 
although it obviously seems not to matter.

Best Regards,
Bill Campbell
 OK, let's wind back - My original suggestion made no claim towards changing
 what the API is used for, or how it looks to those who use it (for all
 practical intents and purposes). I suggested:
   - Removing set_timer() and reset_timer()
   - Implement get_timer() as a platform independent function
 Trying to remember what I have read in this thread I believe we have
 an agreement on these.

 Exposing ticks and tick_frequency to everyone via a 'tick' HAL
 I skip this.  I don't even read it.
 Hmmm, I think it is worthwhile at least comparing the two - What is the
 lesser of two evils

   1. Exposing 'ticks' through a HAL for the prescaler
   2. Duplicating a function with identical code 50+ times across the source
  tree

 I personally think #2 is way worse - The massive redundant duplication and
 blind copying of code is what has get us into this (and many other) messes

 ===
 Not exposing ticks and tick_frequency to everyone

 In /lib/timer.c

 void prescaler(u32 ticks, u32 tick_frequency)
 {
u32 current_ms;

/* Bill's algorithm */

/* result stored in gd-timer_in_ms; */
 }

 In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board/board/timer.c

 static u32 get_ticks(void)
 Currently we have unsigned long long get_ticks(void)  which is better
 as it matches existing hardware.
 Matches PPC - Does it match every other platform? I know it does not match
 the sc520 which has a 16-bit millisecond and a 16-bit microsecond counter
 (which only counts to 999 before resetting to 0)

 Don't assume every platform can implement a 64-bit tick counter. But yes,
 we should cater for those platforms that can

 Note that we also have void wait_ticks(u32) as needed for udelay().

 static u32 get_tick_frequency(void)
 {
u32 

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-23 Thread J. William Campbell

On 5/22/2011 11:29 PM, Albert ARIBAUD wrote:

Hi all,

Sorry, could not follow the discussion although I find it very
interesting, so I will handle the task of coming in late and asking the
silly questions.
I am glad you are looking at our discussion. I am sure we are going to 
need all the help/oversight/questions that we can get, as this is a 
change that will affect all architectures.

Le 23/05/2011 07:25, Graeme Russ a écrit :


On Mon, May 23, 2011 at 3:02 PM, J. William Campbell
jwilliamcampb...@comcast.net   wrote:

On 5/22/2011 6:42 PM, Graeme Russ wrote:

OK, so in summary, we can (in theory) have:
   - A timer API in /lib/ with a single u32 get_timer(u32 base) function
   - A HAL with two functions
 - u32 get_raw_ticks()
 - u32 get_raw_tick_rate() which returns the tick rate in kHz (which
   max's out at just after 4GHz)
   - A helper function in /lib/ u32 get_raw_ms() which uses get_raw_ticks()
 and get_tick_rate() to correctly maintain the ms counter used by
 get_timer() - This function can be weak (so next point)
   - If the hardware supports a native 32-bit ms counter, get_raw_ms()
 can be overridden to simply return the hardware counter. In this case,
 get_raw_ticks() would return 1

Are you sure you did not mean 'get_raw_ticks_rate' here? Besides, I'd
like the name to specify the units used: 'get_raw_ticks_rate_in_khz' (or
conversively 'get_raw_ticks_per_ms', depending on which is simpler to
implement and use).
I think you are correct, it was the rate function desired here. I think 
the best way to go is use a get_raw_tick_rate_in_mhz function, because 
it is probably the easiest one to implement, and in many cases something 
like it already exists.

   - Calling of get_raw_ticks() regularly in the main loop (how ofter will
 depend on the raw tick rate, but I image it will never be necessary
 to call more often than once every few minutes)

That's to keep track of get_raw_ticks() rollovers, right? And then the
get_timer function (which, again, I would prefer to have '... in ms'
expressed in its name) would call get_raw_ticks() in confidence that at
most one rollover may have occurred since the last time the helper
function was called, so a simple difference of the current vs last tick
value will always be correct.
Exactly so. Note that this same function probably needs to be called in 
udelay for the same reason. More precisely, the get_timer function will 
call get_raw_ms, which will call get_raw_ticks. I think it may be better 
to move get_timer down a level in the hierarchy,
so we don't need a get_raw_ms. get_timer would then be part of the HAL. 
One would use a get_timer(0) in order to do what get_raw_ms alone would 
have done. If the user had a good reason, he would then override 
get_timer with his own version. What do you think Graeme? It reduces the 
nesting depth by one level. As for the name change to get_timer_in_ms, I 
would support it. Naturally, such a change would be up to Mr. Denk. 
Since by definition that is what the function does, it seems to be a 
good change from my point of view.

   - If the hardware implements a native 32-bit 1ms counter, no call in
 the main loop is required
   - An optional HAL function u32 get_raw_us() which can be used for
 performance profiling (must wrap correctly)

Hi All,
   Graeme, I think you have stated exactly what is the best approach to
this problem.  I will provide a version of get_raw_ms that is  initialized
using get_raw_tick_rate that will work for all reasonable values of
raw_tick_rate. This will be the generic solution. Both the initialization
function and the get_raw_ms function can be overridden if there is reason to
do so, like exact clock rates. I will do the same with get_raw_us. This
will be posted sometime on Monday for people to review, and to make sure I
didn't get too far off base. Thank you to both Graeme and Reinhard for
looking at/working on this.. Hopefully, this solution will put this timing
issue to rest for all future ports as well as the presently existing ones.

In Greame's description, I did not see a get_raw_ms, only a get_raw_us.
Was this last one a typo or is that a third HAL function?
get_raw_ms was referenced as a library function a few lines above.  
Right now, I think the functionality we require from the HAL is


  1. get_raw_tick_rate_in_mhz
  2. get_raw_ms
  3. get_raw_ticks
  4. (optional)get_raw_us

There is also  APIs for these functions called get_timer.

I think we need to add a call to another function called 
initialize_timer_system or similar that will initialize the data 
structures in gd by calling get_raw_tick_rate_in_mhz. Additionally, I 
think we need to provide a udelay function, simply because it can 
interact with calling get_raw_ms often enough. We are somewhat caught 
between two fires here, in that on the one hand we want to provide a 
very generic approach to the timing system that will work on any CPU 
while on the other hand we

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-23 Thread J. William Campbell

On 5/23/2011 6:19 AM, Wolfgang Denk wrote:

Dear Graeme Russ,

In message4dda5334.4060...@gmail.com  you wrote:

  - A helper function in /lib/ u32 get_raw_ms() which uses get_raw_ticks()
and get_tick_rate() to correctly maintain the ms counter used by
get_timer() - This function can be weak (so next point)

Ditto.  What would that do?  If it gets milliseconds as the name
suggest, that's already the function needed for get_timer()?

OK, there appears to be a consensus that not all hardware actually supports
a free-running timer with 1ms resolution. To overcome this, the idea is to

Indeed.  I guess most of them do not.


create a common library function which maintains the free running counter.
The library function acts as a pre-scaler using a 'raw tick counter' and a
'raw tick rate' supplied by the low level architecture. We define this weak

What are raw ticks?  And what are cooked ticks, then?

Hi all,
 FWIW,  cooked ticks would be 1 ms ticks, although we never 
really use them as such.

so that if the architecture can provide a free running 1ms counter, there
is no (code size) penalty

Why do we need a free running 1ms counter at all?  Any free running
counter of at least millisecoind resolution should be good enough.
Correct. Any free running counter whose resolution is better than one 
millisecond and which is long enough that it will not overflow between 
calls to get_timer is sufficient.

This approach eliminates all the existing per-arch code which (attempts) to
manage the time base behind get time. So we simplify each arch down to it's
bare essentials - Provide a counter which increments at a natural fixed
rate and what the rate is - Let common library code deal with the rest.

Did you have a look at the PowerPC implementation?  I'd like to see
this used as reference.
I have looked at it as a reference. However, there is one disadvantage 
in using the PPC code as a reference. It has a 64 bit timestamp. Many 
systems do not have a 64 bit timestamp, but rather a 32 bit timestamp.  
It is possible to extend the 32 bit timestamp to a 64 bit timestamp if 
get_timer is called often enough that there will only be a single 
rollover of the bottom 32 bits  between uses. However, if that condition 
is met, there is no need to extend the timer to 64 bits. Instead, just 
convert the elapsed time since the last call (which you know) to ms and 
be done with it.  As Wolfgang said above, any counter that has better 
than 1 ms resolution  is adequate to the task. The additional 
requirement that we have stated is that the counter be long enough that 
it does not overflow between calls to get_timer. If the counter is 64 
bits long, it pretty much for sure meets this requirement  (although 
bits below 0.5 ms resolution really don't help any). If the timer is 32 
bits long, it will meet any requirements using get_timer to time out 
hardware intervals. My original implementation used a 32 bit divide and 
does exactly that. This is the shortest and simplest approach, and we 
can get that working in all cases quite easily I think. We can avoid the 
no divide optimization until everybody is satisfied with what we have.

  - Calling of get_raw_ticks() regularly in the main loop (how ofter will
depend on the raw tick rate, but I image it will never be necessary
to call more often than once every few minutes)

NAK. This concept is fundamentally broken.  I will not accept it.

Some existing timers are fundamentally broken - The glitch at the
0x to 0x rollover or rollover early - The method discussed
in this thread eliminates all such glitches. Provided pre-scaler in /lib/
(triggered by get_timer() usually) is called often enough (71 minutes for a
32-bit 1MHz counter) then there is no need. Even then, it is only important

We already have this nightmare of code for triggering the watchdog on
systems that use it.

Assuming there are places in the main loop that get executed often
enough is a broken concept, and I will not accept any such code.
That is fine with me. The reason this was being done was to attempt to 
emulate, as much as possible, the power PC, where the 64 bit timestamp 
counter allows calls to get_timer separated by many minutes and several 
console commands to work properly. These get timer commands will NOT 
work properly on systems that have a 32 bit counter that overflows every 
200 seconds or so. The call in the idle loop was an attempt to make the 
32 bit systems work more like the 64 bit systems. One may then either


  1.   Define calls to get_timer to measure an elapsed interval
 separated by any returns to the command processor as broken.
  2. Require the use of interrupts to extend the 32 bit timestamp.
 (This may not be possible on all systems as the timer used for
 performance monitoring does not interrupt, etc.)
  3. Allow the call in the idle loop under the assumption that we are
 talking about timing in the minutes range, not a few seconds.


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-23 Thread J. William Campbell
On 5/23/2011 6:19 AM, Wolfgang Denk wrote:
 Dear Graeme Russ,
snip
 This is what PPC is doing. And I understand that Reinhard did the same
 in software for AT91.
Hi All,
  My apologies for being a little (perhaps more than a little) 
dense. As they say, after further review, I think the key aspect of 
the PPC timer system is that it uses the decrementer register to 
generate an interrupt at a 1 KHz rate. What I have been attempting here 
is to produce a timer system that does not use interrupts at all. This 
is a fundamental design question. Naturally, systems that can generate 
an interrupt at a 1 KHz rate (or at any (reasonable) higher rate for 
that matter) using the decrementer register can produce a 1 ms 
resolution software counter that updates by magic. If my understanding 
of this PPC  code is incorrect, somebody please stop me before I make a 
further fool of myself!  Is it then a design requirement that the timer 
system use interrupts? Is that what is meant by using the PPC system as 
a model? If so, is it possible/reasonable on all the u-boots that are 
out there to generate and process timer interrupts at some (hopefully 
but not necessarily) programmable rate?

Best Regards,
Bill Campbell
 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-23 Thread J. William Campbell

On 5/23/2011 12:33 PM, Wolfgang Denk wrote:

Dear J. William Campbell,

In message4ddaa705.1040...@comcast.net  you wrote:

   My apologies for being a little (perhaps more than a little)
dense. As they say, after further review, I think the key aspect of
the PPC timer system is that it uses the decrementer register to
generate an interrupt at a 1 KHz rate. What I have been attempting here
is to produce a timer system that does not use interrupts at all. This
is a fundamental design question. Naturally, systems that can generate

No, it is not.  It is an implementation detail which is irrelevant to
almost all users of U-Boot.

Or do you actucally care if your UART driver uses polling or
interrupts?

Hi All,
  I might care a lot if I expect typeahead to work, or if I am 
sending a command script via a terminal emulator and I don't want to 
loose characters while u-boot is off executing a command. One might 
(correctly) say that that is too much to expect of a boot loader, and 
define polling as good enough, which I advocate, or not. YMMV.

an interrupt at a 1 KHz rate (or at any (reasonable) higher rate for
that matter) using the decrementer register can produce a 1 ms
resolution software counter that updates by magic. If my understanding
of this PPC  code is incorrect, somebody please stop me before I make a
further fool of myself!  Is it then a design requirement that the timer
system use interrupts? Is that what is meant by using the PPC system as

No, it is not a design requirement.  It is just one possible
implementation.  Any other method that achieves the same or similar
results is as good.  As noted before, on PowerPC we could have
probably avoided this and just base all timer services on the timebase
register.

[The reason for this dual implementation is historical.  When I wrote
this code, I did not know if we would ever need any fancy timer-
controlled callbacks or similar.  And I needed to implement interrupt
handling for a few other purposes (for example for use in standalone
applications; this was an explicit requirement at that time).  And the
timer was something that made a good and simple example.]


a model? If so, is it possible/reasonable on all the u-boots that are
out there to generate and process timer interrupts at some (hopefully
but not necessarily) programmable rate?

I consider this an implementation detail.  On all architectures it
should be possible to use interrupts, so if the hardware supports a
timer that can generate interrupts it should be possible to use this.
But it is not a requirement that all implementations must work like
  Ok, this is very nice to understand. I will attempt to summarize 
what I think this email and the previous one means. First, the required 
properties of the get_timer routine.


  1. It must have 1 millisecond resolution and accuracy (more or less).
 For instance, the old NIOS timer that incremented the timestamp by
 10 every 10 milliseconds in response to an interrupt is not compliant.
  2. The get_timer routine must have full period accuracy without any
 assumptions regarding what is going on in u-boot. This period is
 4294967 seconds or so.

I then suggest that the minimum system requirements to support the 
u-boot timer are as follows:


   * Either there exists a free-running timer whose period is  4294967
 and whose resolution is 1 millisecond or better. This probably
 includes all 64 bit timestamp counters.
   * Or there exists a method of generating interrupts at a known rate
 of 1 millisecond or faster. This is a superset of  the current PPC
 method.
   * Or there exists a method of generating interrupts at a known fixed
 rate slower than once a millisecond AND there exists a readable
 free running counter whose period is longer or the same as the
 interrupt rate AND whose resolution is at least 1 ms. This would
 include N bit counters that generate an interrupt when they
 overflow, or some long timestamp counter with another, possibly 
 unrelated interrupt generating methodology that is faster than the

 counter overflow interval.

There are many systems that are able to do all three cases, or two of 
three, so they have a choice on how to implement get_timer(). I claim 
that these are sufficient conditions. I also claim they are necessary. 
If a hardware system does not meet these criteria, I claim it can't meet 
the get_timer requirements. Do such systems exist today that use u-boot? 
I think probably so, but maybe they are corner cases. Note that you 
cannot extend a 32 bit counter to a 64 bit counter reliably without 
using interrupts to do so when get_timer is not being called.


Systems using the first approach above have essentially all their logic 
in the get_timer routine, but MUST use at least some 64 bit arithmetic 
(actually 33 bit arithmetic if you don't count shifts) to do so because 
the delta time between calls can be very large. The routine

Re: [U-Boot] [PATCH] [Timer]Remove calls to [get, reset]_timer outside arch/

2011-05-23 Thread J. William Campbell
On 5/23/2011 1:10 PM, Graeme Russ wrote:
 On 24/05/11 04:29, Scott McNutt wrote:
 Hi Bill,

 J. William Campbell wrote:
 On 5/23/2011 6:12 AM, Scott McNutt wrote:
 Dear Graeme,

 Graeme Russ wrote:
 On 23/05/11 22:19, Scott McNutt wrote:
 Hi Graeme,

 Graeme Russ wrote:
 There is no need to use get_timer() and reset_timer() and there are
 build
 I must have missed something WRT reset_timer() -- my apologies
 if I'm covering old ground.

 When the timestamp is incremented using an interrupt that occurs with
 a period greater than 1 ms, we can get early timeouts. reset_timer()
 solved the problem. What's the recommended approach for dealing with
 this without reset_timer() ?
 Hi Scott,
Are you saying that the interrupt frequency is greater than
 1000 times per second, or as I read it, the frequency is less than 1000
 per second (period greater than 1 ms). If anything, that should make the
 timer run slow, not fast.
   I wonder if it is a resolution issue. What are the typical delays in ms
 you are using?
 Some older nios2 implementations have _fixed_ 10 msec timers.
 Basically, the timestamp is incremented asynchronous to get_timer(0).
 So a  10 msec timeout can occur, for example, almost immediately if
 the timer isn't reset just prior to calling get_timer(0). There are
 more details in the comments for the following commits:

 nios2: Reload timer count in reset_timer():
d8bc0a2889700ba063598de6d4e7d135360b537e

 cfi_flash: reset timer in flash status check:
22d6c8faac4e9fa43232b0cf4da427ec14d72ad3

 I'm totally in favor of cleaning this stuff up. It caused some
 headaches (and wasted time) about 13 months ago. My primary concern
 is to avoid breaking things that currently work for us nios2
 weenies ... at least for any length of time.

 Things are a bit tight for me until next week or so. I'll probably
 come up for air around June 1st ... and I'll be glad to help out.

 Is there any reason why we cannot silently perform a reset_timer() any time
 set_timer() is called with a parameter of 0?
Hi All,
  I assume you mean get_timer(0)?  In principle, you cannot do this 
because it could be inside another get_timer(0) loop that has already 
some time elapsed before you hit the inner get_timer(0). I think what 
needs to happen on the old NIOS with 10 ms resolution on the interrupt 
times is that all timer intervals must have 10 ms added and then rounded 
up to the nearest multiple of 10. Thus, if you wanted to wait for 1 
millisecond, you must use an argument of 20 ms to be sure you wait at 
all! If you use an argument of 10, it won't help because you could get 
an interrupt right away and exit. If these routines are nios2 specific, 
you could add a local reset_timer, but I assume they are generic. . Note 
that if these routines are not nios2 specific, is there any harm in 
waiting too long?

Best Regards,
Bill Campbell
 Regards,

 Graeme



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH] [Timer]Remove calls to [get, reset]_timer outside arch/

2011-05-23 Thread J. William Campbell
On 5/23/2011 2:02 PM, Graeme Russ wrote:
 On 24/05/11 06:49, J. William Campbell wrote:
 On 5/23/2011 1:10 PM, Graeme Russ wrote:
 On 24/05/11 04:29, Scott McNutt wrote:
 Hi Bill,

 J. William Campbell wrote:
 On 5/23/2011 6:12 AM, Scott McNutt wrote:
 Dear Graeme,

 Graeme Russ wrote:
 On 23/05/11 22:19, Scott McNutt wrote:
 Hi Graeme,

 Graeme Russ wrote:
 There is no need to use get_timer() and reset_timer() and there are
 build
 I must have missed something WRT reset_timer() -- my apologies
 if I'm covering old ground.

 When the timestamp is incremented using an interrupt that occurs with
 a period greater than 1 ms, we can get early timeouts. reset_timer()
 solved the problem. What's the recommended approach for dealing with
 this without reset_timer() ?
 Hi Scott,
 Are you saying that the interrupt frequency is greater than
 1000 times per second, or as I read it, the frequency is less than 1000
 per second (period greater than 1 ms). If anything, that should make the
 timer run slow, not fast.
I wonder if it is a resolution issue. What are the typical delays in ms
 you are using?
 Some older nios2 implementations have _fixed_ 10 msec timers.
 Basically, the timestamp is incremented asynchronous to get_timer(0).
 So a  10 msec timeout can occur, for example, almost immediately if
 the timer isn't reset just prior to calling get_timer(0). There are
 more details in the comments for the following commits:

 nios2: Reload timer count in reset_timer():
 d8bc0a2889700ba063598de6d4e7d135360b537e

 cfi_flash: reset timer in flash status check:
 22d6c8faac4e9fa43232b0cf4da427ec14d72ad3

 I'm totally in favor of cleaning this stuff up. It caused some
 headaches (and wasted time) about 13 months ago. My primary concern
 is to avoid breaking things that currently work for us nios2
 weenies ... at least for any length of time.

 Things are a bit tight for me until next week or so. I'll probably
 come up for air around June 1st ... and I'll be glad to help out.

 Is there any reason why we cannot silently perform a reset_timer() any time
 set_timer() is called with a parameter of 0?
 Hi All,
   I assume you mean get_timer(0)?  In principle, you cannot do this
 Yes - it's early, no coffee yet ;)
 because it could be inside another get_timer(0) loop that has already some
 time elapsed before you hit the inner get_timer(0). I think what needs to
 Correct, but that is what is already happening for ALL arches in cfi due to
 the reset_timer() before get_timer(0) - I am suggesting sandboxing the
 problem to NIOS until we sort out the timer API properly

 happen on the old NIOS with 10 ms resolution on the interrupt times is that
 all timer intervals must have 10 ms added and then rounded up to the
 nearest multiple of 10. Thus, if you wanted to wait for 1 millisecond, you
 must use an argument of 20 ms to be sure you wait at all! If you use an
 argument of 10, it won't help because you could get an interrupt right away
 and exit. If these routines are nios2 specific, you could add a local
 reset_timer, but I assume they are generic. . Note that if these routines
 are not nios2 specific, is there any harm in waiting too long?
 Well, we have no control over the argument in cfi driver (unless you plan
 to put #ifdef NIOS all over the place)

 Maybe we could round up the parameter inside get_timer() itself?
Hi All,
That would probably be the best way to go for now. It might slow 
things down a bit though, if these delays are all desired to be short, 
like 1 ms. We would expand the 1 ms delay to 15 ms (average) while the 
current (illegal) solution would expand a 1 ms delay to 10 ms always. It 
is worth trying I think. It is also true that any other delays in the 
program will suffer from the 10 ms resolution problem, so your idea is I 
think a good one.

Best Regards,
Bill Campbell
 Regards,

 Graeme




___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-22 Thread J. William Campbell
On 5/21/2011 9:26 PM, Reinhard Meyer wrote:
 Dear Graeme Russ,
 Hi All,

 I've just had a good look through the timer API code with the view of
 rationalising it and fixing up some of the inconsistencies. What I found
 was a little scarier than I expected! Anyway, here is a write-up of what I
 found - Please comment
 We have been at this discussion a multiple of times :) but never reached a 
 consent.

 However, at current master, I have reduced at91 architecture to only use
 get_timer(base), set_timer() never existed and reset_timer() has been removed.

 As it showed up recently, common cfi code still calls reset_timer() - which 
 certainly
 can be fixed with little effort...

 At the lowest level, the U-Boot timer API consists of a unsigned 32-bit
 free running timestamp which increments every millisecond (wraps around
 every 4294967 seconds, or about every 49.7 days). The U-Boot timer API
 allows:
- Time deltas to be measured between arbitrary code execution points
  ulong start_time;
  ulong elapsed_time;

  start_time = get_timer(0);
  ...
  elapsed_time = get_timer(start_time);

- Repetition of code for a specified duration
  ulong start_time;

  start_time = get_timer(0);

  while (get_timer(start_time)   REPEAT_TIME) {
  ...
  }

- Device timeout detection
  ulong start_time;

  send_command_to_device();
  start = get_timer (0);
  while (device_is_busy()) {
  if (get_timer(start)   TIMEOUT)
  return ERR_TIMOUT;
  udelay(1);
  }
  return ERR_OK;
 correct.

 The U-Boot timer API is not a 'callback' API and cannot 'trigger' a
 function call after a pre-determined time.
 that would be too complex to implement and of little use in a single task
 system. u-boot can do fine with polling.

 NOTE: http://lists.denx.de/pipermail/u-boot/2010-June/073024.html appears
 to imply the following implementation of get_timer() is wrong:

  ulong get_timer(ulong base)
  {
  return get_timer_masked() - base;
  }
 Is is not wrong as long as get_timer_masked() returns the full 32 bit space
 of numbers and 0x is followed by 0x. Most implementations
 probably do NOT have this property.

 U-Boot Timer API Details
 
 There are currently three functions in the U-Boot timer API:
  ulong get_timer(ulong start_time)
 As you point out in the following, this is the only function required.
 However it REQUIRES that the internal timer value must exploit the full
 32 bit range of 0x to 0x before it wraps back to 0x.

  void set_timer(ulong preset)
  void reset_timer(void)

 get_timer() returns the number of milliseconds since 'start_time'. If
 'start_time' is zero, therefore, it returns the current value of the
 free running counter which can be used as a reference for subsequent
 timing measurements.

 set_timer() sets the free running counter to the value of 'preset'
 reset_timer() sets the free running counter to the value of zero[1]. In
 theory, the following examples are all identical

  
  ulong start_time;
  ulong elapsed_time;

  start_time = get_timer(0);
  ...
  elapsed_time = get_timer(start_time);
  
  ulong elapsed_time;

  reset_timer();
  ...
  elapsed_time = get_timer(0);
  
  ulong elapsed_time;

  set_timer(0);
  ...
  elapsed_time = get_timer(0);
  

 [1] arch/arm/cpu/arm926ejs/at91/ and arch/arm/cpu/arm926ejs/davinci/ are
 exceptions, they set the free running counter to get_ticks() instead
 Not anymore on at91.
 Architecture Specific Peculiarities
 ===
 ARM
- Generally define get_timer_masked() and reset_timer_masked()
- [get,reset]_timer_masked() are exposed outside arch\arm which is a bad
  idea as no other arches define these functions - build breakages are
  possible although the external files are most likely ARM specific (for
  now!)
- Most CPUs define their own versions of the API get/set functions which
  are wrappers to the _masked equivalents. These all tend to be the same.
  The API functions could be moved into arch/arm/lib and made weak for
  the rare occasions where they need to be different
- Implementations generally look sane[2] except for the following:
  - arm_intcm - No timer code (unused CPU arch?)
  - arm1136/mx35 - set_timer() is a NOP
  - arm926ejs/at91 - reset_timer() sets counter to get_ticks()
 no implelemtation of set_timer()
 See current master for actual implementation!
  - arm926ejs/davinci - reset_timer() sets counter to get_ticks()
no 

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-22 Thread J. William Campbell
On 5/21/2011 11:23 PM, Graeme Russ wrote:
 On 22/05/11 14:26, Reinhard Meyer wrote:
 Dear Graeme Russ,
 Hi All,

 I've just had a good look through the timer API code with the view of
 rationalising it and fixing up some of the inconsistencies. What I found
 was a little scarier than I expected! Anyway, here is a write-up of what I
 found - Please comment
 We have been at this discussion a multiple of times :) but never reached a
 consent.

 However, at current master, I have reduced at91 architecture to only use
 get_timer(base), set_timer() never existed and reset_timer() has been 
 removed.
 Excellent

 As it showed up recently, common cfi code still calls reset_timer() - which
 certainly
 can be fixed with little effort...
 Yes, this is one of the easy fixes as all call sites already use the start
 = get_timer(0), elapsed = get_timer(start) convention anyway - The
 reset_timer() calls are 100% redundant (provided get_timer() behaves
 correctly at the 32-bit rollover for all arches)

 The U-Boot timer API is not a 'callback' API and cannot 'trigger' a
 function call after a pre-determined time.
 that would be too complex to implement and of little use in a single task
 system. u-boot can do fine with polling.
 I am in no way suggesting this - I just want to clarify the API for anyone
 who needs to use it

 NOTE: http://lists.denx.de/pipermail/u-boot/2010-June/073024.html appears
 to imply the following implementation of get_timer() is wrong:

  ulong get_timer(ulong base)
  {
  return get_timer_masked() - base;
  }
 Is is not wrong as long as get_timer_masked() returns the full 32 bit space
 of numbers and 0x is followed by 0x. Most implementations
 probably do NOT have this property.
 U-Boot Timer API Details
 
 There are currently three functions in the U-Boot timer API:
  ulong get_timer(ulong start_time)
 As you point out in the following, this is the only function required.
 However it REQUIRES that the internal timer value must exploit the full
 32 bit range of 0x to 0x before it wraps back to 0x.
 So this needs to be clearly spelt out in formal documentation

 Rationalising the API
 =
 Realistically, the value of the free running timer at the start of a timing
 operation is irrelevant (even if the counter wraps during the timed period).
 Moreover, using reset_timer() and set_timer() makes nested usage of the
 timer API impossible. So in theory, the entire API could be reduced to
 simply get_timer()
 Full ACK here !!!
 I don't think there will be much resistance to this

 3. Remove reset_timer_masked()
 --
 This is only implemented in arm but has propagated outside arch/arm and
 into board/ and drivers/ (bad!)

 regex [\t ]*reset_timer_masked\s*\([^)]*\); reveals 135 callers!

 A lot are in timer_init() and reset_timer(), but the list includes:
- arch/arm/cpu/arm920t/at91rm9200/spi.c:AT91F_SpiWrite()
- arch/arm/cpu/arm926ejs/omap/timer.c:__udelay()
- arch/arm/cpu/arm926ejs/versatile/timer.c:__udelay()
- arch/arm/armv7/s5p-common/timer.c:__udelay()
- arch/arm/lh7a40x/timer.c:__udelay()
- A whole bunch of board specific flash drivers
- board/mx1ads/syncflash.c:flash_erase()
- board/trab/cmd_trab.c:do_burn_in()
- board/trab/cmd_trab.c:led_blink()
- board/trab/cmd_trab.c:do_temp_log()
- drivers/mtd/spi/eeprom_m95xxx.c:spi_write()
- drivers/net/netarm_eth.c:na_mii_poll_busy()
- drivers/net/netarm_eth.c:reset_eth()
- drivers/spi/atmel_dataflash_spi.c/AT91F_SpiWrite()
 Fixed in current master.
 Excellent. I have not pulled master for a little while, guess I should

- If hardware supports microsecond resolution counters, get_timer() could
  simply use get_usec_timer() / 1000
 That is wrong. Dividing 32 bits by any number will result in a result that
 does not
 exploit the full 32 bit range, i.e. wrap from (0x/1000) to 
 0x,
 which makes time differences go wrong when they span across such a wrap!

 Yes, this has already been pointer out - 42 bits are needed as a bare
 minimum. However, we can get away with 32-bits provided get_timer() is
 called at least every 71 minutes

 P.S. Can we use the main loop to kick the timer?

- get_usec_timer_64() could offer a longer period (584942 years!)
 Correct. And a must be when one uses such division.
 Unless we can rely on get_timer() to be called at least every 71 minutes in
 which case we can handle the msb's without error in software

 But you have to realize that most hardware does not provide a simple means to
 implement a timer that runs in either exact microseconds or exact
 milliseconds.
 This is where things get interesting and we need to start pushing a
 mandated low-level HAL. For example, I believe get_timer() should be
 implemented in /lib as:

   ulong get_timer(ulong base)
   {
   return get_raw_msec() - base;
  

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-22 Thread J. William Campbell

On 5/22/2011 1:15 AM, Reinhard Meyer wrote:

Dear J. William Campbell,

please demonstrate for me (and others), by a practical example,
how _any_ arithmetic (even less with just shifts and multiplies)
can convert a free running 3.576 MHz (wild example) free running
32 bit counter (maybe software extended to 64 bits) into a ms
value that will properly wrap from 2**32-1 to 0 ?

Hi All
  I accept the challenge! I will present two ways to do this, one 
using a 32 bit by 16 bit divide, and one using only multiplies.
This first method is exact, in that there is no difference in 
performance from a hardware counter ticking at the 1 ms rate. This is 
accomplished by operating the 1 ms counter based on the delta time in 
the hardware time base. It is necessary to call this routine often 
enough that the hardware counter does not wrap more than once between 
calls. This is not really a problem, as this time is 1201 seconds or so. 
If the routine is not called for a long time, or at the first call, it 
will return a timer_in_ms value that will work for all subsequent calls 
that are within a hardware rollover interval. Since the timer in ms is a 
32 bit number anyway. The same rollover issue will exist if you 
software extend the timer to 64 bits. You must assume 1 rollover. If 
it is more than 1, the timer is wrong.



The variables in the gd are
u32 prev_timer;
u32 timer_in_ms;
u16 timer_remainder;

/* gd-timer remainder must be initialized to 0 (actually, an number 
less than 3576, but 0 is nice). Other two variables don't matter but can 
be initialized if desired  */


u32 get_raw_ms()
{
u32 delta;
   u32  t_save;

  read(t_save);   /* atomic read of the hardware 32 bit timer 
running at 3.576 MHz */

  delta_t = (t_save  - gd-prev_timer) ;

  gd-prev_timer =  t_save;
 /*
   Hopefully, the following two lines only results in one hardware 
divide when optimized. If your CPU has no hardware divide, or if it 
slow, see second method .

*/
 gd-timer_in_ms += delta_t  / 3576; /* convert elapsed time to ms */
 gd-timer_remainder += delta_t  % 3576; /* add in remaining part 
not included above */

 if (gd-timer_remainder = 3576) /* a carry has been detected */
{
   ++gd-timer_in_ms;
   gd-timer_remainder -= 3576; /* fix remainder for the carry above */
}

 return(gd-timer_in_ms)
}

This approach works well when the number of ticks per ms is an exact 
number representable as a small integer, as it is in this case. It is 
exact with a clock rate of 600 MHz, but is not exact for a clock rate of 
666 MHz. 67 is not an exact estimate of ticks per ms, It is off by 
0.5 % That should be acceptable for use as a timeout delay. The 
accumulated error in a 10 second delay should be less than 0.5 ms.


There is a way that the divide above can be approximated by multiplying 
by an appropriate fraction, taking the resulting delta t in ms, 
multiplying it by 3576, and subtracting the product from the original 
delta to get the remainder.  This is the way to go if your CPU divides 
slowly or not at all. This approach is presented below.


the vaues in gd are as follows:

u32 prev_timer;
u32 timer_in_ms;

/*
One tick of the 3.576 MHz timer corresponds to 1/3.576/1000 ms,
or 0.000279642 ms. Scale the fraction by 65536 (16 bit shift),
you get 37532.9217
*/
u32 get_raw_ms(void)
{
  u32  t_save;
  u32  temp;

  /* read the hardware 32 bit timer running at 3.576 MHz */
  read_timer_atomic(t_save);
  t_save -= gd-prev_timer; /* get delta time since last call */
  gd-prev_timer += t_save; /* assume we will use all of the counts */

  /*
   * This first while loop is entered for any delta time  about 18.3 
ms. The

   * while loop will execute twice 2.734% of the time, otherwise once.
   */
  while (t_save  65535)
  {
temp = t_save  16;   /* extract msb */
temp  = ((temp * 37532) + ((temp * 60404)  16))  11;
/* temp  = (temp * 37532)  11; */
gd-timer_in_ms += temp;
t_save  -= temp * 3576;
  }
  /*
   * This second while loop is entered for 94.837% of all possible 
delta times,
   * 0 through 0X. The second while loop will execute twice 
0.037% of

   * the time, otherwise once.
   */
  while (t_save = 3576)
  {
temp  = (t_save * 37532)  (16 + 11);
if (temp == 0)
  temp = 1; /* we know that 1 works for sure */
gd-timer_in_ms += temp;
t_save  -= temp * 3576;
  }
  gd-prev_timer -= t_save; /* restore any counts we didn't use this 
time */

  return gd-timer_in_ms;
}

I have tested this code and it seems to work fine for me. I have 
attached a more readable copy as example .c for those who wish to play 
around with it. In my original post, I had a version of this code that 
could be used with different clock rates. I can provide the same 
functionality with this code, for CPUs/systems where the clock rate is 
unknown at compile time or is variable. I can also address the error

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-22 Thread J. William Campbell
On 5/22/2011 5:02 PM, Graeme Russ wrote:
 Dear Reinhard,

 On Sun, May 22, 2011 at 6:15 PM, Reinhard Meyer
 u-b...@emk-elektronik.de  wrote:
 Dear J. William Campbell,

 please demonstrate for me (and others), by a practical example,
 how _any_ arithmetic (even less with just shifts and multiplies)
 can convert a free running 3.576 MHz (wild example) free running
 32 bit counter (maybe software extended to 64 bits) into a ms
 value that will properly wrap from 2**32-1 to 0 ?

 I fail to see how that will be possible...
 These may help:

 http://blogs.msdn.com/b/devdev/archive/2005/12/12/502980.aspx
 http://www.hackersdelight.org/divcMore.pdf

 of Google 'division of unsigned integer by a constant'

 Basically you will be dividing by a constant value of 3576 which can be
 highly optimised
Hi All,
Yes, you can do it this way. I prefer not to use properties of 
the constant because it doesn't allow multiple clock rates easily. The 
code I posted just a moment ago is easy to change for different rates 
because it does not rely much on the exact bit structure of 3576.
There are simple formulas for all the magic numbers in what I posted, 
so it is easy to change the input frequency.
Best Regards,
Bill Campbell
 Regards,

 Graeme



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-22 Thread J. William Campbell
On 5/22/2011 6:42 PM, Graeme Russ wrote:
 OK, so in summary, we can (in theory) have:
   - A timer API in /lib/ with a single u32 get_timer(u32 base) function
   - A HAL with two functions
 - u32 get_raw_ticks()
 - u32 get_raw_tick_rate() which returns the tick rate in kHz (which
   max's out at just after 4GHz)
   - A helper function in /lib/ u32 get_raw_ms() which uses get_raw_ticks()
 and get_tick_rate() to correctly maintain the ms counter used by
 get_timer() - This function can be weak (so next point)
   - If the hardware supports a native 32-bit ms counter, get_raw_ms()
 can be overridden to simply return the hardware counter. In this case,
 get_raw_ticks() would return 1
   - Calling of get_raw_ticks() regularly in the main loop (how ofter will
 depend on the raw tick rate, but I image it will never be necessary
 to call more often than once every few minutes)
   - If the hardware implements a native 32-bit 1ms counter, no call in
 the main loop is required
   - An optional HAL function u32 get_raw_us() which can be used for
 performance profiling (must wrap correctly)
Hi All,
   Graeme, I think you have stated exactly what is the best 
approach to this problem.  I will provide a version of get_raw_ms that 
is  initialized using get_raw_tick_rate that will work for all 
reasonable values of raw_tick_rate. This will be the generic 
solution. Both the initialization function and the get_raw_ms function 
can be overridden if there is reason to do so, like exact clock rates. 
I will do the same with get_raw_us. This will be posted sometime on 
Monday for people to review, and to make sure I didn't get too far off 
base. Thank you to both Graeme and Reinhard for looking at/working on 
this.. Hopefully, this solution will put this timing issue to rest for 
all future ports as well as the presently existing ones.
On a different note, the graylisting application is causing 
some (about half) of my replies to the list from showing up. The 
messages are going to the specified individuals correctly, but not 
always to the list. This is apparently because my ISP has so many 
different network address available that the probability of using the 
same one (or same subnet) is not very high. Is there anything that can 
be done about this?

Best Regards,
Bill Campbell
 Regards,

 Graeme



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-22 Thread J. William Campbell
On 5/22/2011 8:26 PM, Reinhard Meyer wrote:
 Dear J. William Campbell,
 On 5/22/2011 1:15 AM, Reinhard Meyer wrote:
 Dear J. William Campbell,

 please demonstrate for me (and others), by a practical example,
 how _any_ arithmetic (even less with just shifts and multiplies)
 can convert a free running 3.576 MHz (wild example) free running
 32 bit counter (maybe software extended to 64 bits) into a ms
 value that will properly wrap from 2**32-1 to 0 ?
 Hi All
 I accept the challenge! I will present two ways to do this, one using 
 a 32 bit by 16 bit divide, and one using only multiplies.
 This first method is exact, in that there is no difference in 
 performance from a hardware counter ticking at the 1 ms rate. This 
 is accomplished by operating the 1 ms counter based on the delta time 
 in the hardware time base. It is necessary to call this routine often 
 enough that the hardware counter does not wrap more than once between 
 calls. This is not really a problem, as this time is 1201 seconds or 
 so. If the routine is not called for a long time, or at the first 
 call, it will return a timer_in_ms value that will work for all 
 subsequent calls that are within a hardware rollover interval. Since 
 the timer in ms is a 32 bit number anyway. The same rollover issue 
 will exist if you software extend the timer to 64 bits. You must 
 assume 1 rollover. If it is more than 1, the timer is wrong.


 The variables in the gd are
 u32 prev_timer;
 u32 timer_in_ms;
 u16 timer_remainder;

 /* gd-timer remainder must be initialized to 0 (actually, an number 
 less than 3576, but 0 is nice). Other two variables don't matter but 
 can be initialized if desired */

 u32 get_raw_ms()
 {
 u32 delta;
 u32 t_save;

 read(t_save); /* atomic read of the hardware 32 bit timer running at 
 3.576 MHz */
 delta_t = (t_save - gd-prev_timer) ;

 gd-prev_timer = t_save;
 /*
 Hopefully, the following two lines only results in one hardware 
 divide when optimized. If your CPU has no hardware divide, or if it 
 slow, see second method .
 */
 gd-timer_in_ms += delta_t / 3576; /* convert elapsed time to ms */
 gd-timer_remainder += delta_t % 3576; /* add in remaining part not 
 included above */
 if (gd-timer_remainder = 3576) /* a carry has been detected */
 {
 ++gd-timer_in_ms;
 gd-timer_remainder -= 3576; /* fix remainder for the carry above */
 }

 return(gd-timer_in_ms)
 }

 Thank you! Basically this is similar to a Bresenham Algorithm.
Hi All,
   Yes, I think you are correct. I didn't know it by that name, but 
i think you are correct. It is a bit different use of the idea, but it 
is very similar.


 This approach works well when the number of ticks per ms is an exact 
 number representable as a small integer, as it is in this case. It is 
 exact with a clock rate of 600 MHz, but is not exact for a clock rate 
 of 666 MHz. 67 is not an exact estimate of ticks per ms, It is 
 off by 0.5 % That should be acceptable for use as a timeout 
 delay. The accumulated error in a 10 second delay should be less than 
 0.5 ms.

 I would think the non exact cases result in such a small error that 
 can be
 tolerated. We are using the ms tick for timeouts, not for providing a 
 clock
 or exact delays. We should just round up when calculating the divider.
Yes, we should round off the divider value, so 666.66 MHz rounds 
to 67 ticks/Ms while 333.33 MHz rounds to 33 ticks/Ms.

 Hence the hick-ups that result when this is not called frequent enough to
 prevent a multiple rollover of the raw value between calls do not matter
 either (they should be just documented).
Good, I am glad we agree on this also.


 There is a way that the divide above can be approximated by 
 multiplying by an appropriate fraction, taking the resulting delta t 
 in ms, multiplying it by 3576, and subtracting the product from the 
 original delta to get the remainder. This is the way to go if your 
 CPU divides slowly or not at all. This approach is presented below.

 [...]

 Optimizations would be up to the implementer of such a hardware and work
 only if the divider is a compile time constant. Often the divider will be
 run time determined (AT91 for example).
Correct. I will provide a generic version that computes the constants 
at run time.  If the clock rate is a constant, these routines can be 
overridden at compile/link time. This generic version should be 
available on Monday for further review.

Best Regards,
Bill Campbell

 Reinhard



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-21 Thread J. William Campbell
On 5/21/2011 5:06 PM, Graeme Russ wrote:
 On 22/05/11 01:33, J. William Campbell wrote:
 On 5/21/2011 5:38 AM, Graeme Russ wrote:
 Hi All,
 4. Implement microsecond API - get_usec_timer()
 ---
- Useful for profiling
- A 32-bit microsecond counter wraps in 71 minutes - Probably OK for most
  U-Boot usage scenarios
- By default could return get_timer() * 1000 if hardware does not support
  microsecond accuracy - Beware of results   32 bits!
 Hi All,
I think the multiply overflowing an unsigned long is ok here as long
 as the timeout value you desire is less than 71 seconds. This assumes that
 the CPU returns the correct lower 32 bits when overflow occurs, but I think
 this is the normal behavior(?)
 I think you mean 71 minutes - 2^32 = 4294967296 (usec) = 4294967.296 (msec)
 = 4294.967296 s = 71.582788267 minutes

 So provided any millisecond timer using a 32-bit microsecond timer as a
 time base is called every 71 minutes, a 32-bit microsecond  timer should
 suffice to keep the msb's accurately maintained by software

Hi Graeme,
 Yes, you are certainly correct, it is minutes, not seconds.
- If hardware supports microsecond resolution counters, get_timer() could
  simply use get_usec_timer() / 1000
 I think this actually is NOT equivalent to the current API  in
 that the full 32 bits of the timer is not available and as a result the
 wrapping properties of a 32 bit subtract for delta times will not work
 properly. If a larger  counter is available in hardware, then it is
 certainly possible to do a 64 by 32 bit divide in get_timer, but probably
 you don't want to do that  either. As previously discussed, it is possible
 to extract a 32 bit monotonic counter of given resolution (microsecond or
 millisecond resolution) from a higher resolution counter using a shift to
 approximately the desired resolution followed by a couple of multiply/add
 functions of 32 bit resolution.  To do this with a microsecond resolution,
 a 42 bit or larger timer is required. The extra bits can be provided in
 Of course, how silly of me - To downscale the microsecond timer down to
 milliseconds you need to have at least 1000 times more resolution
 (9.965784285 bits) - It was late ;)

 software as long as the get_timer/get_usec_timer routines are called more
 often than every 71/2 sec, so that a correct delta in microseconds can be
 obtained. Note that when the timer is not actively in use (not  called
 often enough), the millisecond timer msb would stop updating, but that
 wouldn't matter.
 Minutes - see above
Correct.
 If the hardware supports sub-microsecond accuracy in a longer
 register, say 64 bits, you can just convert the 64 bit hardware timer to 32
 bit microseconds or milliseconds by a shift  and 32 bit multiplies
 Yes

Good luck with this effort. I think getting the timer API and also
 the method of implementation of the interface to the hardware to be the
 same across all u-boot architectures is a very good idea, and it is
 possible. However, it is a quite a bit of work and I am glad you are brave
 enough to try!
 It's all there already - it just needs a little bit of housekeeping :)
Correct, if we do not worry too much about the low level details of 
get_timer. It looks to me like there is a lot of cruft there, depending 
on which architecture one looks at. Many implementations create a 
parallel universe of get_ticks or some similar timing system that is 
then used to support get_timer. However, the get_ticks routines are also 
used in timeouts elsewhere in the code. Searching on get_timer doesn't 
find these non-standard usages. It would be nice to fix these also, but 
every bit does help!

Best Regards,
Bill Campbell
 Regards,

 Graeme



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] ARM timing code refactoring

2011-01-24 Thread J. William Campbell
On 1/24/2011 5:02 AM, Wolfgang Denk wrote:
 Dear Albert ARIBAUD,

 In message4d3d2942.4060...@free.fr  you wrote:
 That is assuming a 64-bit timebase, isn't it? for CPUs / SoCs that don't
 have such a timebase but only a 32-bit timer, the bogo_ms/jiffy would
 not go through the full 32-bit range, which would cause issues with the
 timing loops on rollover -- and while a timeout of more than 65 sec may
 not be too likely, a timeout starting near the wraparound value of
 bogo_ms still could happen.
Hi All,
If you use my approach of shifting right by n bits  (in order to 
get a counter that has about 1 ms resolution),  zeros are shifted into 
the top. The counter would never reach more than (32-n) bits in 
magnitude. This is easily fixed by creating a virtual n msbs for the 
counter, as must be done if a 64 bit counter is to be simulated. 
Actually, I only need n bits, but in any case it requires detecting 
that the counter has backed up mod 32 bits and if so, increase the 
virtual counter by 1.

If you really, really, really want a timer that ticks at a 1 ms rate, 
instead of bogo_ms/jiffies, I propose the following:

u32 get_time()
{
 u32 delta_t_bogo_ms;
u32  t_save;
#if defined(TIMER_IS_64_BITS)
 u64 tick;
 read(tick);

  t_save  = tick  gd-timer_shift;
  delta_t_bogo_ms = t_bogo_ms - gd-prev_timer;
#else
   read(t_save);
   delta_t_bogo_ms = (t_save  - gd-prev_timer)  gd-timer_shift;
#endif
   gd-prev_timer = t_save;  /* save previous counter */
   if (delta_t_bogo_ms  gd-bogo_ms_per_65_sec)
   {
  gd-fract_timer+=  delta_t_bogo_ms * 
gd-cvt_bogo_ms_to_ms; /* accumulate fraction of ms */
  gd-timer_in_ms += gd-fract_timer  16;
 gd-fract_timer = 0X;
   }
   else
 gd-fract_timer = 0; /* start accumulating from 0 fraction */
   return(gd-timer_in_ms)
}

This routine will create a timer in ms, that will be 
accurate as long as this routine is called either once about ever 65 
seconds or once per time base rollover, whichever is less. Nested timer 
use is ok. If the timer is not called frequently enough, it will return 
the same value twice, but after that it will start timing normally. This 
shouldn't matter, as the second returned value will  be the start of a 
timing loop. Timeout values can be the full 32 bits, as long as you keep 
calling the routine frequently enough. No initialization is required.
   Note that you can (and probably should) use the bottom 32 
bits of the hardware timebase as a 32 bit timebase unless the clock 
would overflow in 65 seconds (running faster than about 66 MHz), or if 
you want to  relax the 65 seconds.
If you want to save a word in the gd data,  use 0X1 instead of 
bogo_ms_per_65_sec (or a more precise value if you know it). Note that 
gd-timer_shift and gd-cvt_bogo_ms_to_ms can also be replaced by 
constants if the clock speed is fixed.

This is more expensive than using the bogo_ms timer, but does have the 
advantage that everything is in ms. FWIW, I think converting from ms to 
some other unit for loop control is fine, as long as we have a standard 
routine to do that that is cheap. However, others may not agree. For 
sure, passing around 64 bit tick values to do this process is, IMHO, 
vast overkill and not a good general solution, as many processors really 
don't like to do 64 bit operations.

Best Regards.
Bill Campbell


 Sorry, but I don't get it.  What exactly is the problem with a 32 bit
 counter, and why would it not go through the full 32-bit range?

 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] ARM timing code refactoring

2011-01-23 Thread J. William Campbell
On 1/23/2011 2:57 PM, Wolfgang Denk wrote:
 Dear Reinhard Meyer,

 In message4d3c9bfc.1010...@emk-elektronik.de  you wrote:
 get_timer() returns a monotonous upward counting time stamp with a
 resolution of milliseconds. After reaching ULONG_MAX the timer wraps
 around to 0.
 Exactly that wrap makes the situation so complicated, since the simple code
 u32 get_timer(void)
 {
return (ticks * 1000ULL) / tickspersec;
 }
 won't do that wrap.
 Do you have a better suggestion?

 The get_timer() implementation may be interrupt based and is only
 available after relocation.
 Currently it is used before relocation in some places, I think I have
 seen it in NAND drivers... That would have to be changed then.
 Indeed.  It is unreliable or even broken now.

 This is already implemented functionally very closely (apart from factoring 
 and the
 get_timer(void) change) to this in AT91, the only (academic) hitch is that 
 it will
 burp a few billion years after each reset :)
 What bothers me is the need for 64 bit mul/div in each loop iteration, for 
 CPUs without
 hardware for that this might slow down data transfer loops of the style

 u32 start_time = get_timer();
 do {
  if (data_ready)
  /* transfer a byte */
  if (get_timer() - start_time  timeout)
  /* fail and exit loop */
 } while (--bytestodo  0);

 since get_timer() will be somewhat like:

  return (tick * 1000ULL) / tickspersec;

 As I stated before, tickspersec is a variable in, for example, AT91. So the
 expression cannot be optimized by the compiler.
 I don't think this is the only way to implement this. How does Linux
 derive time info from jiffies?

Hi All,
   In order to avoid doing 64 bit math, we can define a jiffie 
or a bogo_ms that is the 64 bit timebase shifted right such that the 
lsb of the bottom 32 bits has a resolution of between 0.5 ms and 1 ms. 
It is then possible to convert the difference between two jiffie/bogo_ms 
values to a number of ms using a 32 bit multiply and a right shift of 16 
bits, with essentially negligible error.  get_bogo_ms() would return a 
32 bit number in bogo_ms, thus the timing loop would be written.

u32 start_time = get_bogo_ms();
do {
 if (data_ready)
 /* transfer a byte */
 if (bogo_ms_to_ms(get_timer() - start_time)  TIMEOUT_IN_MS)
 /* fail and exit loop */
} while (--bytestodo  0);

u32 get_bogo_ms()
{
 u64 tick;
 read(tick);

  return (tick  gd-timer_shift);
}
u32 bogo_ms_to_ms(u32 x)
{
/* this code assumes the resulting ms will be between 0 and 65535, 
or 65 seconds */
return ((x * gd-cvt_bogo_ms_to_ms)  16); /* cvt_bogo_ms_to_ms 
is a 16 bit binary fraction */
}

All the above code assumes timeouts are 65 seconds or less, which I 
think is probably fair. Conversion of ms values up to 65 seconds to 
bogo_ms is also easy, and a 32 bit multiplied result is all that is 
required.
What is not so easy is converting a 32 bit timer value to ms.  It can be 
done if the CPU can do a 32 by 32 multiply to produce a 64 bit result, 
use the msb, and possibly correct the result by an add if  bit 32,of the 
timer is set.  You need a 33 bit counter in bogo_ms to get a monotonic, 
accurate 32 bit counter in ms. The powerpc can use a mulhw operation to 
do this, and any CPU that will produce a 64 bit product can do this. 
However, many CPUs do not produce 64 bit products easily. Using division 
to do these operations are even less appealing, as many CPUs do not 
provide hardware division at all. Since it is not necessary to do this 
conversion to easily use timeouts with 1 ms resolution and accuracy,  I 
think the idea of not using a timer in ms but rather bogo_ms/jiffies is 
possibly better?

Best Regards,
Bill Campbell

 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] TIMER cleanup RFC, was: [PATCH 4/4] arm920t/at91/timer: replace bss variables by gd

2010-11-30 Thread J. William Campbell
On 11/30/2010 1:14 AM, Reinhard Meyer wrote:
 Dear Wolfgang Denk,

 what we really need is only a 32 bit monotonous free running tick that 
 increments
 at a rate of at least 1 MHz. As someone pointed out a while ago, even at 1GHz 
 that would
 last for four seconds before it rolls over. But a 1HGz counter could be 64 
 bit internally
 and always be returned as 32 bits when it is shifted right to bring it into 
 the few MHz
 range.

 Any architecture should be able to provide such a 32 bit value. On powerpc 
 that would
 simply be tbu|tbl shifted right a few bits.

 An architecture and SoC specific timer should export 3 functions:

 int timer_init(void);
 u32 get_tick(void); /* return the current value of the internal free running 
 counter */
 u32 get_tbclk(void); /* return the rate at which that counter increments (per 
 second) */

 A generic timer function common to *most* architectures and SoCs would use 
 those two
 functions to provice udelay() and reset_timer() and get_timer().
 Any other timer functions should not be required in u-boot anymore.

 However get_timer() and reset_timer() are a bit of a functional problem:

 currently reset_timer() does either actually RESET the free running timer 
 (BAD!) or
 remember its current value in another (gd-)static variable which later is 
 subtracted
 when get_timer() is called. That precludes the use of several timers 
 concurrently.

 Also, since the 1000Hz base for that timer is usually derived from get_tick() 
 by
 dividing it by some value, the resulting 1000Hz base is not exhausting the 32 
 bits
 before it wraps to zero.

 Therefore I propose two new functions that are to replace reset_timer() and 
 get_timer():

 u32 init_timeout(u32 timeout_in_ms); /* return the 32 bit tick value when the 
 timeout will be */
 bool is_timeout(u32 reference); /* return true if reference is in the past */

 A timeout loop would therefore be like:

 u32 t_ref = timeout_init(3000);   /* init a 3 second timeout */

 do ... loop ... while (!is_timeout(t_ref));

 coarse sketches of those functions:

 u32 init_timeout(u32 ms)
 {
   return get_ticks() + ((u64)get_tbclk() * (u64)ms) / (u64)1000;
 }

 bool is_timeout(u32 reference)
 {
   return ((int)get_ticks() - (int)reference)  0;
 }

 Unfortunately this requires to fix all uses of get_timer() and friends, but 
 I see no other
 long term solution to the current incoherencies.

 Comments welcome (and I would provide patches)...
Hi All,
   The idea of changing the get_timer interface to the 
init_timeout/is_timeout pair has the advantage that it is only necessary 
to change the delay time in ms to an internal timebase once, and after 
that, only a 32-bit subtraction is required. I do not however like the 
idea of using 64 bit math to do so, as on many systems this is quite 
expensive. However, this is a feature that can be optimized for 
particular CPUs. I also REALLY don't like the idea of having a get_ticks 
function, because for sure people will use this instead of the desired 
interface to the timer because it is better. Then we get back into a 
mess. Since in most cases get_ticks is one or two instructions, please, 
let us hide them in init_timeout/is_timeout.
An alternate approach, which has the merit of being more like 
the originally intended interface, simply disallows reset_timer since it 
is  totally unnecessary. The only dis-advantage of the original approach 
using just get_timer is that the conversion to ms must be considered at 
each call to get_timer, and will require at a minimum one 32 bit integer 
to remember the hardware timer value the last time get_timer was called 
(unless the hardware time can be trivially converted to a 32 bit value 
in ms, which is quite uncommon). This is not a high price to pay, and 
matches the current usage. This is probably for Mr. Denk to decide. If 
we were just starting now, the init_timeout/is_timeout is simpler, but 
since we are not, perhaps keeping the current approach has value.
I would really like to help by providing some patches, but I am 
just way too busy at present.

Best Regards,
Bill Campbell
 Reinhard

 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] TIMER cleanup RFC, was: [PATCH 4/4] arm920t/at91/timer: replace bss variables by gd

2010-11-30 Thread J. William Campbell
On 11/30/2010 7:48 AM, Reinhard Meyer wrote:
 Dear J. William Campbell,
 On 11/30/2010 1:14 AM, Reinhard Meyer wrote:
 Dear Wolfgang Denk,

 what we really need is only a 32 bit monotonous free running tick that
 increments
 at a rate of at least 1 MHz. As someone pointed out a while ago, even
 at 1GHz that would
 last for four seconds before it rolls over. But a 1HGz counter could
 be 64 bit internally
 and always be returned as 32 bits when it is shifted right to bring it
 into the few MHz
 range.

 Any architecture should be able to provide such a 32 bit value. On
 powerpc that would
 simply be tbu|tbl shifted right a few bits.

 An architecture and SoC specific timer should export 3 functions:

 int timer_init(void);
 u32 get_tick(void); /* return the current value of the internal free
 running counter */
 u32 get_tbclk(void); /* return the rate at which that counter
 increments (per second) */

 A generic timer function common to *most* architectures and SoCs would
 use those two
 functions to provice udelay() and reset_timer() and get_timer().
 Any other timer functions should not be required in u-boot anymore.

 However get_timer() and reset_timer() are a bit of a functional problem:

 currently reset_timer() does either actually RESET the free running
 timer (BAD!) or
 remember its current value in another (gd-)static variable which later
 is subtracted
 when get_timer() is called. That precludes the use of several timers
 concurrently.

 Also, since the 1000Hz base for that timer is usually derived from
 get_tick() by
 dividing it by some value, the resulting 1000Hz base is not exhausting
 the 32 bits
 before it wraps to zero.
 (###)
Hi All,
   A correct method to provide a full 32 bits of resolution at a 1 
ms rate is simple. It requires maintaining a software timer in ms. This 
timer is updated by converting the difference in the hardware time base 
to ms and then adding the correct number of ms to the software timer.
As previously discussed, this only requires calling get_timer once a 
second or so. The conversion can be made simply in most cases, 
especially if the clock rate is a nice number. Even in the worst case, 
it is not too hard to adjust the saved hardware timer value to contain 
any remainder ticks. Yes, if this is done incorrectly, the results are bad.
 Therefore I propose two new functions that are to replace
 reset_timer() and get_timer():

 u32 init_timeout(u32 timeout_in_ms); /* return the 32 bit tick value
 when the timeout will be */
 bool is_timeout(u32 reference); /* return true if reference is in the
 past */

 A timeout loop would therefore be like:

 u32 t_ref = timeout_init(3000);/* init a 3 second timeout */

 do ... loop ... while (!is_timeout(t_ref));

 coarse sketches of those functions:

 u32 init_timeout(u32 ms)
 {
  return get_ticks() + ((u64)get_tbclk() * (u64)ms) / (u64)1000;
 }

 bool is_timeout(u32 reference)
 {
  return ((int)get_ticks() - (int)reference)   0;
 }

 Unfortunately this requires to fix all uses of get_timer() and
 friends, but I see no other
 long term solution to the current incoherencies.

 Comments welcome (and I would provide patches)...
 Hi All,
The idea of changing the get_timer interface to the
 init_timeout/is_timeout pair has the advantage that it is only necessary
 to change the delay time in ms to an internal timebase once, and after
 that, only a 32-bit subtraction is required.
 Exactly.
 I do not however like the
 idea of using 64 bit math to do so, as on many systems this is quite
 expensive. However, this is a feature that can be optimized for
 particular CPUs.
 64 Bit math is only necessary when get_tbclk() times the maximun anticipated
 timeout (in ms) is larger than 32 bits. This happens easyly when tbclk is a
 few MHz.
What I think you mean to say here is that the delay time fits easily 
into 32 bits for all reasonable clock rates. For instance, a 1 GHz clock 
rate yields a maximum delay of about 2 seconds in a 32 bit word?  For 
fast clock rates, it is probably advisable to shift the 64 bit counter 
right a few bits to ensure adequate range.
   Besides the current working(!) implementations use 64 bit math
 already.
In ARM perhaps, but I think there are lots of working 32 bit in other 
processors(?).
   You can optimize into 32 bits by factoring the timeout_in_ms into
 whole seconds, and the remainder.
True.
 I also REALLY don't like the idea of having a get_ticks
 function, because for sure people will use this instead of the desired
 interface to the timer because it is better. Then we get back into a
 mess. Since in most cases get_ticks is one or two instructions, please,
 let us hide them in init_timeout/is_timeout.
 Agreed. However I intended to split the timer into two sources (read above):
 one hardware dependant part exporting exactly those functions, and one
 generic part using them.
Yes, and that is EXACTLY what I think is a bad idea for the reason I 
mentioned. If the get_ticks interface

Re: [U-Boot] Timer implementations

2010-11-01 Thread J. William Campbell
On 10/27/2010 11:02 PM, Reinhard Meyer wrote:
 Dear J. William Campbell,
 Hi All,

 I am pretty sure the migration to 64 bits was caused by 1) people not 
 understanding that the timer operating on time DIFFERENCES would work 
 fine even if the underlying timer wrapped around (most probable 
 problem) and possibly 2) broken timer functions causing bogus 
 timeouts, improperly fixed by switching to 64 bits.

 I think u-boot could get along just fine with only 2 time related 
 functions, uint_32 get_timer(uint_32 base) and udelay(uint 32 delay). 
 udelay will only work on small values of delay, on the order of 
 milliseconds. It is to be used when a short but precise delay in 
 microsecond resolution is required. Users of get_timer must 
 understand that it is only valid if it is called often enough, i.e. 
 at least once per period of the underlying timer. This is required 
 because u-boot does not want to rely on interrupts as a timer update 
 method. Therefore, all uses of get_timer must 1) call it once 
 initially to get a start value, and 2) call get_timer at least once 
 per period of the underlying hardware counter. This underlying period 
 is guaranteed to be at least 4.29 seconds (32 bit counter at 4 GHz). 
 Note that this does NOT mean that the total wait must be less than 
 4.29 seconds, only that the rate at which the elapsed time is TESTED 
 must be adequate.

 In order to implement this functionality, at least one hardware timer 
 of some kind is required. An additional software timer in 1 ms 
 resolution may be useful in maintaining the software time. If the 
 hardware timer update rate is programmable, u-boot MAY set the update 
 rate on initialization On initialization, u-boot MAY reset the 
 hardware timer and MAY reset any associated software timer. The 
 hardware timer MAY be started on initialization. On each call to 
 get_timer(), u-boot MUST start the hardware timer if it was not 
 started already. On calls to get_timer, u-boot MUST NOT reset the 
 hardware timer if it was already started. The software timer MAY be 
 reset if u-boot can unambiguously determine that more than 4.29 
 seconds has elapsed since the last call to get_timer.

 The simplest case for implementing this scheme is if two programmable 
 timers exist that can be set to 1ms and 1us. The timers are 
 initialized at start-up, get_timer just returns the 32 bit 1 ms timer 
 and udelay just waits for the number of ticks required on the second 
 timer to elapse. The most common harder case is where there is only 
 one timer available, it is running at 1 us per tick or faster, and we 
 cannot control the rate. udelay is still easy, because we can convert 
 the (small) delay in us to a delay in ticks by a 32 bit multiply that 
 will not overflow 32 bits even if we have quite a few fractional bits 
 in the tics per microsecond value. The elapsed ticks required is the 
 (delay in us * us/per tick)  # fractional bits in us/per tick. If 
 that is not close enough for you, you can do it as (delay in us * 
 (integer part of us/tick)) + ((delay in us * (fractional 
 part)us/tick)  # fraction bits). For nice numbers, like any 
 integral number of MHz, there is no fractional

 part. Only numbers like 66 MHz, or 1.666 GHz require messing with the 
 fractional part.
 For get_timer, it is a bit harder. The program must keep two 32 bit 
 global variables, the timer reading last time and the software 
 timer in 1 ms resolution. Whenever get_timer is called, it must 
 increase the software timer by the number of ms that have elapsed 
 since the previous update and record the corresponding timer reading 
 as the new last time. Note that if the number of ms elapsed is not 
 an integer (a common case), the value recorded as the last time 
 must be decreased by the number of ticks not included in the 1 ms 
 resolution software timer. There are many different ways to 
 accomplish update, depending on what hardware math capabilities are 
 available, and whether one thinks efficiency is important here. 
 Conceptually, you convert the elapsed time in ticks into an 
 equivalent number of ms, add that number to the software timer, store 
 the current value of the hardware timer in last time, and subtract 
 any remainder ticks from that value. If the elapsed time is les
 s
 that one ms, do no update of last hardware time and return the 
 current software counter. If the elapsed time is greater than 4.29 
 seconds, reset the software counter to 0, record the current hardware 
 counter time and return the current software counter. In between, do 
 the math, which will fit into 32 bits.

 If this idea seems like a good one, I can provide more detail on the 
 conversions for various hardware capabilities is people want. 
 Comments welcome.

 To get the timer mess cleaned up three things have to happen:

Hi All,

   I am glad somebody was still interested. I was afraid I had 
scared everyone off.
 1. A consensus and documentation how it MUST

Re: [U-Boot] [PATCH v2] mmc: omap: timeout counter fix

2010-10-26 Thread J. William Campbell
On 10/25/2010 11:01 PM, Reinhard Meyer wrote:
 Dear Wolfgang Denk,
 Dear Reinhard Meyer,

 In message4cc66a67.4000...@emk-elektronik.de   you wrote:
 It fails in case the timer wraps around.

 Assume 32 bit counters, start time = 0xFFF0, delay = 0x20. It
 will compute end = 0x10, the while codition is immediately false, and
 you don't have any delay at all, which most probably generates a
 false error condition.
 I used and assumed a 64 bit counter, that will not wrap around while
 our civilization still exists...

 The code is still wrong, and as a simple correct implementation exists
 there is no excuse for using such incorrect code.

 Please fix that!
 Agreed here. People are invited to dig through u-boot and find all
 those places.

 If get_ticks() is only 32 bits worth, both methods will misbehave
 at a 32 bit wrap over.
 No.

start = time();
while ((time() - start)delay)
...

 This works much better (assuming unsigned arithmetics).
 True, provided the underlying timer is really 64 bits, otherwise
 this fails, too...
 You are wrong. Try for example this:
   
 - snip ---
 #includestdio.h
   
 int main(void)
 {
  unsigned int time = 0xFFF0;
  unsigned int delay = 0x20;
  unsigned int start;

 You are wrong here, because you take it out of context.
 My demo is using the (declared as) 64 bit function get_ticks().
 I mentioned above that this function MUST be truly returning 64
 bits worth of (incrementing) value to make any version work.
 If get_ticks() just returns a 32 bit counter value neither method will work
 reliably. Just check all implementations that this function is implemented
 correctly.
Hi All,
   I have wondered for quite some time about the rush to make 
get_ticks() return a 64 bit value. For any reasonable purpose, like 
waiting a few seconds for something to complete, a 32 bit timebase is 
plenty adequate. If the number of ticks per second is 10, i.e. a 
1 GHz clock rate, the clock wraps in a 32 bit word about every five 
seconds.
The trick is that time always moves forward, so a current get_ticks() - 
a previous get_ticks() is ALWAYS a positive number. It is necessary to 
check the clock more often than (0X1 - your_timeout) times per 
second, but unless your timeout is very near the maximum time that fits 
into 32 bits, this won't be a problem. Most CPUs have a counter that 
count at a reasonable rate. Some CPUs also have a cycle counter that 
runs at the CPU clock rate. These counters are useful to determine 
exactly how many machine cycles a certain process took, and therefore 
they have high resolution. Timers for simple delays neither need nor 
want such resolution. If the only counter available on you CPU runs at 
several GHz, and is 64 bits long,  just shift it right a few bits to 
reduce the resolution to a reasonable  resolution and return a 32 bit 
value. There is no need for a bunch of long long variables and extra 
code running around to process simple timeouts. It may be that we need a 
different routine for precision timing measurements with high 
resolution, but it needn't, and probably shouldn't IMHO be get_ticks().

Best Regards,
Bill Campbell
 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Timer implementations

2010-10-26 Thread J. William Campbell
On 10/26/2010 6:33 AM, Reinhard Meyer wrote:
 Dear Wolfgang Denk,
 Then the define CONFIG_SYS_HZ should not be in everyboard.h since that
 suggests that a board developer has some freedom there...
 Agreed - there are historical reasons this has ever been changable at
 all.

 and MOST IMPORTANT that some implementations of udelay() might
 call reset_timer() and therefore break an outer timeout loop !!!
 Such implementations are inherently broken and need to be fixed.
 Found such in arm926ejs/omap... But then, that timer is multiple-broken:
 relocation broken (uses static data), returns 32 a bit value in get_ticks(),
 returns CONFIG_SYS_HZ in get_tbclk() instead of the rate get_ticks()
 increments...

 PXA:
 void udelay_masked (unsigned long usec)
 {
   unsigned long long tmp;
   ulong tmo;

   tmo = us_to_tick(usec);
   tmp = get_ticks() + tmo;/* get current timestamp */

   while (get_ticks()  tmp)
   /* loop till event _OR FOREVER is tmp happens to be  32 bit_ */
/*NOP*/;

 }

 unsigned long long get_ticks(void)
 {
   return readl(OSCR);
 }
 - not any better :( -- its the same code that AT91 had before I fixed it.

 It is also open if reset_timer() does actually reset the hardware timer
 (e.g. tbu/tbl at PPC) - which would be messing up any time difference
 calculation using get_ticks() - or does emulate that by remembering
 the hardware value and subtracting it later in every subsequent
 get_timer() call?
 This is an implementation detail.
 IF we require that get_ticks() and get_timer() shall not interfere with
 each other and IF both are based on the same hardware timer only the
 second method is available (same if the hardware timer is not easyly
 resettable).

 2. get_ticks() and friends operate at a higher rate (tbu/tbl for PPC).
 Since they are defined as having 64 bits they MUST not wrap at 32 bits,
 i.e. if the hardware provides only 32 bits, the upper 32 bits must be
 emulated by software.
 Right.

 Otherwise we have to document that get_ticks() cannot be used to get
 64 bit time differences.
 No. Such an implementation is broken and needs fixing.
 Original AT91 timer.c was like that, and I think other ARMs where this was
 copied around should be looked at... I don't know when get_timer() became
 64 bits, but it seems that some implementations just did change the return
 type: uint64 get_timer(void) {return (uint64)timer_val_32;}
Hi All,

  I am pretty sure the migration to 64 bits was caused by 1) people 
not understanding that the timer operating on time DIFFERENCES would 
work fine even if the underlying timer wrapped around (most probable 
problem) and possibly 2) broken timer functions causing bogus timeouts, 
improperly fixed by switching to 64 bits.

I think u-boot could get along just fine with only 2 time related 
functions,  uint_32 get_timer(uint_32 base) and udelay(uint 32 delay).  
udelay will only work on small values of delay, on the order of 
milliseconds. It is to be used when a short but precise delay in 
microsecond resolution is required.  Users of get_timer must understand 
that it is only valid if it is called often enough, i.e. at least once 
per period of the underlying timer. This is required because u-boot  
does not want to rely on interrupts as a timer update method. Therefore, 
all uses of get_timer must 1) call it once initially to get a start 
value, and 2) call get_timer at least once per period  of the underlying 
hardware counter. This underlying period is guaranteed to be at least 
4.29 seconds (32 bit counter at 4 GHz). Note that this does NOT mean 
that the total wait must be less than 4.29 seconds, only that the rate 
at which the elapsed time is TESTED must be adequate.

In order to implement this functionality, at least one hardware timer of 
some kind is required. An additional software timer in 1 ms resolution 
may be useful in maintaining the software time. If the hardware timer 
update rate is programmable, u-boot MAY set the update rate on 
initialization On initialization, u-boot MAY reset the hardware timer 
and MAY reset any associated software timer. The hardware timer MAY be 
started on initialization. On each call to get_timer(), u-boot MUST 
start the hardware timer if it was not started already. On calls to 
get_timer, u-boot MUST NOT reset the hardware timer if it was already 
started. The software timer MAY be reset if u-boot can unambiguously 
determine that  more than 4.29 seconds has elapsed since the last call 
to get_timer.

The simplest case for implementing this scheme is if two programmable 
timers exist that can be set to 1ms and 1us.  The timers are initialized 
at start-up, get_timer just returns the 32 bit 1 ms timer and udelay 
just waits for the number of ticks required on the second timer to 
elapse. The most common harder case is where there is only one timer 
available, it is running at 1 us per tick or faster, and we cannot 
control the rate. udelay is still 

Re: [U-Boot] [PATCH 2/2] ARM: fix relocation support for onenand device.

2010-10-23 Thread J. William Campbell
On 10/23/2010 1:56 PM, Wolfgang Denk wrote:
 Dear Enric Balletbo i Serra,

 In message1287479602-21721-3-git-send-email-eballe...@iseebcn.com  you 
 wrote:
 We also have to relocate the onenand command table manually, otherwise
 onenand command don't work.

 Signed-off-by: Enric Balletbo i Serraeballe...@iseebcn.com
 ---
   arch/arm/lib/board.c |3 +++
   common/cmd_onenand.c |6 ++
   2 files changed, 9 insertions(+), 0 deletions(-)

Is this patch still necessary? I thought the relocation change made it OBE.

Best Regards,
Bill Campbell

 Applied, thanks.

 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH] mpc83xx: Add -fpic relocation support

2010-10-13 Thread J. William Campbell
  On 10/12/2010 11:30 PM, Albert ARIBAUD wrote:
 Le 12/10/2010 23:00, Joakim Tjernlund a écrit :

 Yes, but the difference isn't really the arch. It is the -mrelocatable
 flag that is the big difference.
 Not only: obviously, implementing GOT relocation is not done the same on
 both archs, and it simply is not beneficial on ARM wrt PPC in terms of
 instructions. I did a pretty extensive run of tests with and without
 -fPIC and -fPIE on ARM, and GOT relocation clearly makes code bigger,
 whereas it does not PPC.

 This simply implies that -fPIC is a better choice for PPC (and hence
 -mrelocatable) while -fpie is a better one for ARM.
Hi All,
  In particular, the PPC takes two 32 bit instructions to load the 
known address of a variable into a register. If the GOT is used, a 
single 32 bit instruction can load the address of a variable from the 
GOT table (pointed to by a fixed register) into a register. In both 
cases, there are two memory cycles, but in the GOT case, only one 
instruction is required. This is why the GOT based code is smaller. 
However, the GOT cannot be used to address constants and some other 
items that are not variables. I do think that -fPIC and -fpie are not 
mutually incompatible. On the PPC, the GOT references would be relocated 
in the loop that updates the GOT and the references to constants would 
be relocated by the ELF relocation code. That is how shared libraries 
are relocated.

Best Regards,
Bill Campbell

 Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH] futile c relocation attempt

2010-10-06 Thread J. William Campbell
  On 10/6/2010 2:43 AM, Graeme Russ wrote:
 On 06/10/10 01:48, Reinhard Meyer wrote:
 ---
   arch/arm/cpu/arm926ejs/start.S |8 -
   arch/arm/lib/board.c   |   57 
 +++-
   include/configs/top9000_9xe.h  |1 +
   3 files changed, 63 insertions(+), 3 deletions(-)

 I had a quick look at this and nothing is jumping out at me. Of course I am
 not familiar with ARM asm...

 I don't see any reason why this ultimately will not work eventually. You
 may be having some issues with the transition from asm-C-asm through the
 relocation - This was an especially painful thing for me involving an
 intermediate trampoline which I have only recently figured out how to remove.

 Maybe some memory barriers are needed to stop the C optimiser mangling things?

 I am sure what you have is very close to the real solution :)

 I do think the main relocation fixup loop can be moved into a common
 library in which case we can add additional case statements. The nice thing
 is that x86 as all Type 8 which is specifically allocated to x86 so my if
Hi All,
   I think that type 8 IS NOT allocated to the 386. For instance, 
R_PPC_ADDR14_BRTAKEN also has a value of 8. So does R_ARM_ABS8. I think 
that there will be a lot of #ifdefs just to keep the references 
symbolic. Many of the different platform  relocation types will come out 
to be the same code in the switch statement, but different symbolic 
names. There will also be some entries that are processor specific and 
have no equivalent on other processors. I think it would be a good idea 
to use the symbolic values of the Relocation types (as opposed to 
integer constants), as it will make the code clearer. There are sort of 
two ways to organize the code inside the switch statement. Since the 
code inside the switch statement is very short, it might be best for 
each architecture (ELF format) to be  bunched together, even at the 
expense of repeating the same executable statements that some other 
formats may use, as follows:
#ifdef PPC
case R_PPC_ADDR32:  /* S + A */
code to do the deed
   break;

case R_PPC_RELATIVE:  /* B + A */
code to do the deed
   break;
#endif
#ifdef I386
case R_386_32:  /* S + A */ /* I think this is the other 
location type Graeme used, but I could be wrong */
code to do the deed
   break;

case R_386_RELATIVE:  /* B + A */
code to do the deed
   break;

#endif
#ifdef ARM
case R_ARM_ABS32:  /* S + A */ /* I don't remember the ARM 
relocation types used  */
code to do the deed
   break;

case R_ARM_REL32:  /* B + A  - P */
code to do the deed
   break;

#endif

Or we could group the various Relocation types by what they actually do:
#ifdef PPC
case R_PPC_ADDR32:  /* S + A */
#endif
#ifdef I386
case R_386_32:  /* S + A */ /* I think this is the other 
location type Graeme used, but I could be wrong */
  #endif
  #ifdef ARM
case R_ARM_ABS32:  /* S + A */ /* I don't remember the ARM 
relocation types used  */
#endif
code to do the deed
   break;

#ifdef PPC
case R_PPC_RELATIVE:  /* B + A */
#endif
  #ifdef I386
 case R_386_RELATIVE:  /* B + A */
#endif
code to do the deed
   break;

#ifdef ARM
case R_ARM_REL32:  /* B + A  - P */
code to do the deed
   break;
#endif

Note that the ARM_REL32 is defined differently than the PPC/I386 
relative, FWIW. I also don't know what to use for the names of the 
binary formats. It would be nice if we could use something already in 
the header files? Thoughts on all this solicited!

Best Regards,
Bill Campbell
 TEXT_BASE checks can be kept. For size freaks, we could litter the code
 with #ifdefs to remove un-needed cases ;)

 Interestingly, ARM is adding gd-reloc_off while x86 is subtracting
 gd-reloc_off. If this is correct, I need to change the calculation of
 gd_reloc_off to be consistent
Adding is the way the specifications define it. add B+A.

Best Regards,
Bill Campbell
 Regards,

 Graeme
 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] ARM relocation, question to Heiko

2010-10-04 Thread J. William Campbell
  On 10/4/2010 3:13 AM, Wolfgang Denk wrote:
 Dear Albert ARIBAUD,

 In message4ca999ee.5030...@free.fr  you wrote:
 Note however that linking for base address 0 is not mandatory for
 achieving true position independence. What is required is that the code
 which runs from power-up until relocation be able to run anywhere, i.e.,
 this code should not require any relocation fixup. That can be achieved
 on ARM by using only relative branches and accessing data only relative
 to pc (e.g. literals) or truly absolute (e.g. HW registers etc).
 That means you need to build all of U-Boot that way, because
 significant parts of the code already run before relocation
 (including all clocks and timers setup, console setup, printf and all
 routines these pull in).

Yes, I think Wolfgang is correct. This is not going to be easy to do in 
general.  To run anywhere, the code must be true Position Independent 
code. If you intend to use any C code in the initialization, this will 
result in needing -fPIC for at least that code. I am not sure you can 
mix -fPIC and non -fPIC code in the same link, but I expect not. I am a 
bit surprised that it is possible to get even the initialization code to 
be Position Independent, but it appears that on at least some PPC it is 
possible/has been done.
 On a related topic, I did find some information on the 
-mrelocatable history. Take a look at
http://www.mail-archive.com/g...@gcc.gnu.org/msg02528.html.
If you read both thread entries, it explains -mrelocatable as more or 
less the post-processor that re-formats the ELF relocation information 
into a smaller format and puts it in the text as another segment. What 
Albert is doing now, and Graeme did before,  is the first option, 
creating a loader that understands ELF. This has the advantage that it 
will work on all architectures. However, once this understanding is in 
place, it would be easy to write a small post-processing program that 
would reduce the size of the relocation entries, much like -mrelocatable 
does. This may or may not be necessary, but it is certainly possible.

Best Regards,
Bill Campbell
 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] ARM relocation, question to Heiko

2010-10-04 Thread J. William Campbell

 On 10/4/2010 10:06 AM, Wolfgang Denk wrote:

Dear J. William Campbell,

In message4ca9f294.8080...@comcast.net  you wrote:

Yes, I think Wolfgang is correct. This is not going to be easy to do in
general.  To run anywhere, the code must be true Position Independent
code. If you intend to use any C code in the initialization, this will
result in needing -fPIC for at least that code. I am not sure you can
mix -fPIC and non -fPIC code in the same link, but I expect not. I am a
bit surprised that it is possible to get even the initialization code to
be Position Independent, but it appears that on at least some PPC it is
possible/has been done.

Not really. On PowerPC, only the first 20 or 30 lines of assembler
code in start.S are position independent; then we compute the link
(resp. execution) address and branch to it. From then, we run from the
very address range we were linked for (starting at TEXT_BASE).

Hi Wolfgang,
 You are of course correct. I was referring more to Jocke's 
(joakim.tjernl...@transmode.se) statements regarding:


Yes, that is there today. I am talking about linking to any TEXT_BASE(say 0)
but burn and run into another address. I impl. this quite some time
ago for PPC(search for LINK_OFF)

I understand from his comment that he had achieved total PIC for the 
initialization, that would run at any location regardless
of TEXT_BASE. I think this code was not accepted into mainline, so it is not a 
problem at present. However, any relocation code
added would have to be modified by Jocke if he wished to preserve that 
capability. I am amazed that he was able to get the
rest of u-boot to work under the constraints you pointed out. It must have been 
quite tedious.

  I also wish to support Graeme's desire that the added relocation code at 
the end of the day be written in C. The routine to do the
relocation does not require .bss and is not real long. The obvious advantage of 
this approach is that all architectures can use it. The
ELF relocation codes will have to be changed to the architecture equivalents, 
and in some casesarchitecture specific relocation code
processing added, but the theory will always be the same. This approach will 
make using relocation much easier/trivial for new
architecture  ports, thereby reducing resistance to doing it!

Best Regards,
Bill Campbell



Albert is doing now, and Graeme did before,  is the first option,
creating a loader that understands ELF. This has the advantage that it
will work on all architectures. However, once this understanding is in
place, it would be easy to write a small post-processing program that
would reduce the size of the relocation entries, much like -mrelocatable
does. This may or may not be necessary, but it is certainly possible.

Eventually we might even add -mrelocatable support for the other
architectures to the tool chain.

Best regards,

Wolfgang Denk



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] [PATCH] arm: arm926ejs: use ELF relocations

2010-10-04 Thread J. William Campbell
  On 10/4/2010 5:16 PM, Albert ARIBAUD wrote:
 Le 05/10/2010 01:21, Graeme Russ a écrit :
 On Tue, Oct 5, 2010 at 9:57 AM, Albert ARIBAUDalbert.arib...@free.fr   
 wrote:
 Le 05/10/2010 00:22, Graeme Russ a écrit :
 On Tue, Oct 5, 2010 at 9:01 AM, Albert Aribaudalbert.arib...@free.fr
wrote:
 The output from MAKEALL is curiously calculated... If I look at objdumps of
 the GOT and ELF binaries, I find that:

 - the GOT .text section is 118960 bytes and the ELF .text section only
 108112. This is due to the fact that GOT relocation requires additional
 instruction for GOT indirection whereas ELF relocations work by patching the
 code.
 It would be interesting to compare against the basline non-relocatable
 version
 I #defined CONFIG_RELOC_FIXUP_WORKS and removed -pie from the ARM
 config.mk. This puts the edminiv2 code in the non-reloc build case, and
 produces identical .text and .data, and almost identical .rodata, as the
 ELF case.

 - the .rodata section is 22416 for GOT, 22698 for ELF, whereas the .data
 section is 2908 for GOT, 2627 for ELF. Some initialized data apparently
 moved from non-const ton const for some reason, but basically, initialized
 data remains constant.

 - the .bss section remains constant too, 16640 for GOT vs. 16636 for ELF.
 I'm not going to track what causes the 4 byte difference. :)

 Many sections are output in the ELF file which do not appear in the GOT
 file, such as .interp, .dynamic, .dynstr etc. They probably pollute
 MAKEALL's figures.
 I now discard a few sections:

  /DISCARD/ : { *(.dynstr*) }
  /DISCARD/ : { *(.dynamic*) }
  /DISCARD/ : { *(.plt*) }
  /DISCARD/ : { *(.interp*) }
  /DISCARD/ : { *(.gnu*) }

 Not that it makes a huge difference - most of these are trivially small
 Thanks. I'll add this to the .lds as a measure of clarity.

 That's roughly consistent with the numbers I get: about 19 KB of .rel.dyn
 plus .dynsym, which we will be able to cut by half if we preprocess it.
 Which is not copied to RAM, so not as nasty as the .got related increase
 True also. Note that we could probably shrink the table to 1/4 of its
 current size by taking advantage from the fact that the few
 non-program-base-relative relocations it has can easily be converted to
 program-base-relative, and that two consecutive relocations are always
 less than 64 KB away from each other. Of course that moves away from
 using the ELF structures as-is, and requires additional build steps, but
 people with small FLASH devices may want it.
Hi All,
  This may be pushing it in the more general case. ARM has only a 
few relocation types. Other CPU types have more types, and therefore 
still may need a type field. You can certainly get 1/2 in all cases, and 
more if you are willing to get a bit more complex in the preprocessing. 
That said, I think this is best left to later when all CPUs are in the 
relocatable state.
 I'm also looking at moving the low-level intialisation and relocation code
 into a seperate section (outside .text) so I even less to relocate to RAM
 As Wolfgang pointed out, there might be issues in that all the code that
 runs in FLASH should be truly PI, which might not be a piece of cake.
 ARM C code, for instance, tends to generate literals which need to be
 relocated if you don't run the code where it was linked for.
True, but the code WILL be running at the address it was linked for. It 
just won't be copied and relocated to the new address, as it would 
never be run again anyway. This goal is along the lines of the two 
stage u-boot that has been/is being considered, where all execute only 
once code can be concentrated into a segment that is not moved into ram.

Bill Campbell
 Then I could even compress the relocatable section, but that is just being
 silly ;)
 :)

 Regards,

 Graeme
 Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] [PATCH] arm: arm926ejs: use ELF relocations

2010-10-04 Thread J. William Campbell
  On 10/4/2010 10:30 PM, Wolfgang Denk wrote:
 Dear Albert ARIBAUD,

 In message4caa50aa.3000...@free.fr  you wrote:
 Remember: this patch only applies to boards which boot from NOR FLASH!
 You can test it on other types of boards (NAND-based, etc) for
 regression testing, but nothing more.
 Assuming the NAND loder does not load U-Boot to it's final location
 at the upper end of RAM, but - say - somewhere in lower memory, the
 standard relocation preocess will be running, so I think there should
 be no real difference between (such) NAND booting systems and NOR
 booting ones - or am I missing something?
FWIW I think you are right. If u-boot is linked for the address where 
the NAND loader put it, everything should work fine. It can
size memory, move a copy of u-boot to the top of memory, and branch to 
the entry point that continues initialization.

Bill Campbell
 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] ARM relocation, question to Heiko

2010-10-03 Thread J. William Campbell
  On 10/3/2010 1:58 AM, Albert ARIBAUD wrote:
 Le 03/10/2010 10:44, Graeme Russ a écrit :

 Bill just said that -pic (or, for ARM, -fPIC or -fPIE) was unnecessary
 for relocation. You seem to imply it actually is... In my experience,
 -fPIC and-fPIE do increase code by adding GOT relocation to symbols 
 that
 need fixing, so they would indeed be redundant to any other relocation
 mechanism -- I just did some test with basic code and this seems to
 confirm, no -fPIx is needed to get relocation the way you do on ARM.

 Just to clarify -fpic is a compiler option, -pic is a linker option. x86
 has no compile time relocation options (therefore no referencing .got 
 etc).
 Using the link time pic option produces the relocation data table
 (.rel.dyn) which must be pre-processed before execution can begin at the
 relocated address

 Thanks for clarifying, Graeme.

 This is consistent with the ARM compile-time options -fPIC/-fPIE vs 
 link-time option -pie. So there may be at least an interest in 
 investigating ELF-style relocation on ARM and comparing it to 
 GOT-based relocation in terms of FLASH and RAM sizes and code speed.

Hi All,
   It is for sure that -fPIC/-fPIE programs will contain more 
executable instructions than programs compiled without these options.
The program will also contain more data space for the got. If -fPIC 
actually produced a fully position-independent executable, the extra 
overhead would perhaps be tolerable. However, since it does not do this, 
(problems with initialized data etc.) there is really no advantage in 
using these compile-time options. The executable code and required data 
space for the program without these switches will always be smaller 
and faster than with them. In order to fix the remaining issues even 
when using -fPIC, a relocation loop must exist in the u-boot code, 
either one global one or a bunch of user written specific ones. Also, 
the -pie switch will be needed anyway at link time to build the 
relocation table for the remaining relocation requirements.
   Programs compiled without -fPIC will have a larger .rel.dyn table 
than those compiled with -fPIC. However, the table entries in the 
relocation table occupy about the same storage as the code generated by 
the compiler to relocate a reference to the symbol at run time.  So this 
is probably a almost a wash. Also, the dynamic relocation data need not 
be copied into the run-time object, as it is no longer needed. So the 
likely outcome is that the flash image is about the same size/slightly 
larger than the one compiled by -fPIC, and that the ram footprint after 
relocation is slightly smaller.
   If one is REALLY pressed for space, the size of the dynamic 
relocation area can be reduced by a post-processor program that would 
re-format the relocation entries. This re-formatting is possible because 
1) ELF is a very general format and we only need a small subset of it, 
and 2) u-boot code will never occupy say 16 MB of space, so each 
relocation can probably be compressed into a 32 bit word. I doubt anyone 
is that desperate, but it IS possible.
   It will be interesting to see what the results of this comparison 
are. For me, the no user awareness of relocation is worth a lot, and the 
fact that the difference/overhead of relocation will all be in exactly 
one place is very appealing.

Best Regards,
Bill Campbell
 Cheers,

 Graeme

 Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] ARM relocation, question to Heiko

2010-10-03 Thread J. William Campbell
  On 10/3/2010 11:29 AM, Wolfgang Denk wrote:
 Dear Reinhard Meyer,

 In message4ca79896.2010...@emk-elektronik.de  you wrote:
 I agree here. _If_ relocation, it should work without hand-adding
 fixup stuff to all functions using initialized data with pointers.
 Even Wolfgang forgot to fixup his 2nd level command table in
 cmd_nvedit.c ;)
 I didn't forget it - at least not in the sensse that I think this is
 something that needs to be done.

 This works fine on PPC with relocation, and we should make it work
 the same on other arches.

 And, for space concerns in flash, relocation should always be an
 option on a board by board basis...
 NAK.

 And as an idea, if position independent code is used, only pointers
 in initialized data need adjustment. Cannot the linker emit a table
 of addresses that need fixing?
 It does. That's the GOT.
I think this is actually a misunderstanding. The purpose of the GOT, at 
least from GCC's point of view, is to hold the absolute addresses of 
private data referenced by shared library code. That is what it was 
invented to do. This is similar to, but not identical with, relocating 
all data references. Initialized data in the library must have a copy 
created (and relocated as necessary if it contains pointers) by the 
runtime linker when the library is initialized in the address space of 
the process using the library. The code in the shared library is -fPIC, 
but it still needs the runtime linker to allocate a copy of the GOT for 
the current user AND to allocate and relocate any data that is required 
for the library that is private to the user. It is that second step 
where we have trouble.

Best Regards,
Bill Campbell
 Best regards,

 Wolfgang Denk


___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] ARM relocation, question to Heiko

2010-10-02 Thread J. William Campbell

 On 10/2/2010 3:17 AM, Joakim Tjernlund wrote:

Hello Reinhard,

Reinhard Meyer wrote:

Dear Albert ARIBAUD,

I try to understand how the relocation process could handle pointers (to
functions or other data) in const or data sections.
Your code cannot know what is data and what is a pointer that needs
adjustment?

Best Regards,
Reinhard

Hi Reinhart,

Short answer - the relocation process does not handle pointers inside
data structures.

And yes, this means the content arrays of pointers such as init_sequence
is not relocated. Been there, done that, can give you one of the

The init_sequence should not called anymore after relocation, as it is
the init_sequence ... or?


tee-shirts I got :)

ATM I have not found a way to fix this, except making the code which
uses the pointers aware that the are location-sensitive and fix them
when using them.

That means that things like this cannot work (with relocation),
unless adding the relocation offset before using the pointer:

Yep, you have to fix these pointers after relocation ...


const struct {
const u8 shift;
const u8 idcode;
struct spi_flash *(*probe) (struct spi_slave *spi, u8 *idcode);
} flashes[] = {
#ifdef CONFIG_SPI_FLASH_SPANSION
{ 0, 0x01, spi_flash_probe_spansion, },
#endif

[...]

#ifdef CONFIG_SPI_FRAM_RAMTRON
{ 6, 0xc2, spi_fram_probe_ramtron, },
# ifdef CONFIG_SPI_FRAM_RAMTRON_NON_JEDEC
{ 0, 0xff, spi_fram_probe_ramtron, },
# endif
# undef IDBUF_LEN
# define IDBUF_LEN 9 /* we need to read 6+3 bytes */
#endif
};

And I think there are more places of this type in u-boot...

Yes, maybe. But relocation as I did for arm, also works
on m68k, sparc, mips, avr32 and they must do also this
fixups, so for common functions (except the new env handling,
which I think got never tested on this architectures?) should
work ...

This pointer problem is solved with the fixup relocs on ppc and
should work without manual relocation. I think this is a ppc
only extension but I might be wrong.


Hi All,
  You are correct that this is a ppc only extension. As such, it is 
not a good candidate for general use.



I believe that the other alternative is to do it as x86 does
which I think is the general way which should work on any arch.
Graem Russ would know better.

Almost exactly a year ago, this was all pretty much presented by Graeme 
in the threads

Relocation size penalty calculation (October 14, 2009)
i386 Relocation (November 24, 2009)

Using the full relocation scheme eliminates the need for all these 
fixups in u-boot C code. I think this is a very desirable result.
It is also not clear to me that hard coding in the relocation as several 
C routines will produce a  u-boot that is smaller than the one 
produced by using normal ELF relocation. However, using full relocation 
creates an environment that is true C and does not rely on people 
remembering that they may have to fix up some parts of their code. It is 
hard to see much downside in using the full relocation capability 
provided by Graeme's code.
FWIW, the relocation code and data does not have to be moved into ram if 
space is at a premium.


Best Regards,
Bill Campbell


  Jocke

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot




___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH] flash.h: pull in common.h for types

2009-11-17 Thread J. William Campbell
Mike Frysinger wrote:
 On Tuesday 17 November 2009 16:56:58 Wolfgang Denk wrote:
   
 Scott Wood wrote:
 
 My question: is there a definitive position  somewhere  (for  example
 for  the  Linux kernel; I'm sure we don't have one for U-Boot [yet]),
 whether system headers should be self-sufficient?
 
 I'd say they should be self-sufficient, in that the inclusion of the
 header itself should not fail if I haven't included some arbitrary other
 header.  I don't see what the argument would be for not doing this.
   
 Well, Theo de Raadt says for example ... people would be able to
 include less files; indeed, almost be careless about what they
 include. But this would not increase portability in any way. And
 'make build' would probably, if it was taken the nth degree, take
 twice as long. Therefore there is no benefit for the crazy rule you
 suggest... - see
 http://www.mail-archive.com/t...@openbsd.org/msg00425.html
 

 i disagree with this using, ironically, the same base logic, but a different 
 conclusion:
 http://sourceware.org/ml/libc-alpha/2006-08/msg00064.html

 also, i think a self contained system like u-boot which has full control at 
 the api level can be better at this than a user interface which really sits 
 on 
 top of an abi and has to deal with a lot of crap from user code.

 while i'm not asking for you or anyone else to audit header paths here as i 
 think that level of enforcement will bog things down, small patches from 
 people who choose to fix things should be merged.

   
FWIW, I think one needs to be very careful with this reasoning. It is 
clear that experienced and capable programmers disagree on the correct 
approach to this problem. It is also true that the logical structure of 
the include chain is important. A crap interface is going to be hard 
to maintain no matter how you do it. The problem is, crap has a way of 
sneaking into well-designed interfaces in the form of small patches to 
fix things. (I am not saying this is one of them). I have observed 
that in any large program, as time goes on, more and more things get 
included in more and more places. Hiding this fact by doing the 
inclusion in other header files often obscures the drift towards a very 
polluted name space, where editing just about any include file requires 
the entire system to be rebuilt. Some would say, so what, it is easy to 
do. However, these situations often result in hard to find bugs. The 
question of which modules can access this variable becomes all of 
them because of (badly designed?)  interfaces being included in other 
interfaces without the user being explicitly aware of this. Often I see 
in patch critiques the statement This belongs in a header file, which 
may be true, but can lead to a bunch of un-related things being stuck 
together in a header file just to meet the requirement. Those bad 
choices then get included in other header files and we are off to the 
races. Since the users are not explicitly aware they are pulling in a 
bunch of stuff that has nothing to do with the main purpose of the 
module being written, there is often no incentive to clean up bad header 
designs until things get so bad it is almost impossible to untangle 
things. So when adding an include to an include file, think hard whether 
you are bringing along a lot of things that are not required and that 
one would not expect to be exposed in all files using the modified 
include file. If you are bringing along a lot of baggage, perhaps the 
interface you are including could use some re-work.

Best Regards,
Bill Campbell
 I don't know whether Linux has a specific policy on this, but I haven't
 noticed many problems in this regard, and when I did find one in the
 kernel a few years back I didn't get any argument when I submitted a
 patch to fix it.
   

 ive semi-frequently post fixes to linux headers so that you can include just 
 that header and have it work.  i have yet to hear anyone complain; rather 
 every one has been merged (ignoring issues unrelated to the original purpose).

   
 Which man pages are you looking at?
   
 Well, for example:

 open(2):
  SYNOPSIS
 #include sys/types.h
 #include sys/stat.h
 #include fcntl.h

 mknod(2):
  SYNOPSIS
 #include sys/types.h
 #include sys/stat.h
 #include fcntl.h
 #include unistd.h

 stat(2):
  SYNOPSIS
 #include sys/types.h
 #include sys/stat.h
 #include unistd.h

 Why do we need these lists of #includes? WHy doe - for example -
 sys/stat.h not auto-include anything it might need?

 To me this seems to be an indication that there is no intention to
 make headers self-sufficent, but I am absolutely not sure.
 

 i'm pretty sure your man page example is an unrelated issue.  the include 
 list 
 does not imply that any one of those headers cannot be included by itself 
 first.  if you read the full text 

Re: [U-Boot] Relocation size penalty calculation

2009-10-17 Thread J. William Campbell
Graeme Russ wrote:
 On Thu, Oct 15, 2009 at 3:45 AM, J. William Campbell
 jwilliamcampb...@comcast.net wrote:
   
 Joakim Tjernlund wrote:
 
   
megasnip

 Apologies if this is getting way off-topic for a simple boot loader, but
 this is information I have gathered from far and wide over the net. I am
 surprised that there isn't a web site out there on 'How to create a
 relocatable boot loader'...

 OK, its all starting to come together now - It helps when you look at the
 right files ;)

 Firstly, u-boot.map

 0x380589a0__rel_dyn_start = .

 .rel.dyn0x380589a0 0x42b0
  *(.rel.dyn)
  .rel.got   0x0x0 cpu/i386/start.o
  .rel.plt   0x0x0 cpu/i386/start.o
  .rel.text  0x380589a0 0x2e28 cpu/i386/start.o
  .rel.start16   0x3805b7c8   0x10 cpu/i386/start.o
  .rel.data  0x3805b7d8  0xc18 cpu/i386/start.o
  .rel.rodata0x3805c3f0  0x360 cpu/i386/start.o
  .rel.u_boot_cmd
 0x3805c750  0x500 cpu/i386/start.o
 0x3805cc50__rel_dyn_end = .


 And the output of readelf...

 Section Headers:
   [Nr] Name  TypeAddr OffSize   ES Flg Lk Inf 
 Al
   [ 0]   NULL 00 00 00  0   0 
  0
   [ 1] .text PROGBITS3804 001000 0118a4 00  AX  0   0 
  4
   [ 2] .rel.text REL  066c68 005d00 08 40   1 
  4
   [ 3] .rodata   PROGBITS380518a4 0128a4 005da5 00   A  0   0 
  4
   [ 4] .rel.rodata   REL  06c968 000360 08 40   3 
  4
   [ 5] .interp   PROGBITS38057649 018649 13 00   A  0   0 
  1
   [ 6] .dynstr   STRTAB  3805765c 01865c 0001ee 00   A  0   0 
  1
   [ 7] .hash HASH3805784c 01884c cc 04   A 11   0 
  4
   [ 8] .data PROGBITS38057918 018918 000a3c 00  WA  0   0 
  4
   [ 9] .rel.data REL  06ccc8 000c18 08 40   8 
  4
   [10] .got.plt  PROGBITS38058354 019354 0c 04  WA  0   0 
  4
   [11] .dynsym   DYNSYM  38058360 019360 000200 10   A  6   1 
  4
   [12] .dynamic  DYNAMIC 38058560 019560 80 08  WA  6   0 
  4
   [13] .u_boot_cmd   PROGBITS380585e0 0195e0 0003c0 00  WA  0   0 
  4
   [14] .rel.u_boot_cmd   REL  06d8e0 000500 08 40  13 
  4
   [15] .bss  NOBITS  3805cc50 01ec50 001a34 00  WA  0   0 
  4
   [16] .bios PROGBITS 01e000 00053e 00  AX  0   0 
  1
   [17] .rel.bios REL  06dde0 c0 08 40  16 
  4
   [18] .rel.dyn  REL 380589a0 0199a0 0042b0 08   A 11   0 
  4
   [19] .start16  PROGBITSf800 01e800 000110 00  AX  0   0 
  1
   [20] .rel.start16  REL  06dea0 38 08 40  19 
  4
   [21] .resetvec PROGBITSfff0 01eff0 10 00  AX  0   0 
  1
   [22] .rel.resetvec REL  06ded8 08 08 40  21 
  4

 ...

 Relocation section '.rel.text' at offset 0x66c68 contains 2976 entries:
  Offset InfoTypeSym.Value  Sym. Name
 38040010  0101 R_386_32  3804   .text
 3804001e  0101 R_386_32  3804   .text
 38040028  0101 R_386_32  3804   .text
 3804003f  0101 R_386_32  3804   .text
 38040051  0101 R_386_32  3804   .text
 38040075  0101 R_386_32  3804   .text
 38040085  0101 R_386_32  3804   .text
 3804009d  0003e602 R_386_PC32380403fa   load_uboot
 380400a6  0101 R_386_32  3804   .text
 38040015  00029f02 R_386_PC323804bdd8   early_board_init
 38040023  0003f702 R_386_PC323804bdda   show_boot_progress_asm

 ...

 Relocation section '.rel.rodata' at offset 0x6c968 contains 108 entries:
  Offset InfoTypeSym.Value  Sym. Name
 38051908  0201 R_386_32  380518a4   .rodata
 38051938  0201 R_386_32  380518a4   .rodata
 38051968  0201 R_386_32  380518a4   .rodata
 38051998  0201 R_386_32  380518a4   .rodata
 380519c8  0201 R_386_32  380518a4   .rodata
 380519f8  0201 R_386_32  380518a4   .rodata

 ...

 Relocation section '.rel.dyn' at offset 0x199a0 contains 2134 entries:
  Offset InfoTypeSym.Value  Sym. Name
 f838  0008 R_386_RELATIVE
 f846  0008 R_386_RELATIVE
 38040010  0008 R_386_RELATIVE
 3804001e  0008 R_386_RELATIVE
 38040028  0008 R_386_RELATIVE
 3804003f  0008 R_386_RELATIVE
 38040051  0008 R_386_RELATIVE
 38040075  0008 R_386_RELATIVE
 38040085  0008 R_386_RELATIVE

 Notice that, apart from .rel.dyn, non of the .rel.* sections have the
 A (Allocated) flag set - They do not end

Re: [U-Boot] Relocation size penalty calculation

2009-10-14 Thread J. William Campbell
Joakim Tjernlund wrote:
 J. William Campbell jwilliamcampb...@comcast.net wrote on 14/10/2009 
 01:48:52:
   
 Joakim Tjernlund wrote:
 
 Graeme Russ graeme.r...@gmail.com wrote on 13/10/2009 22:06:56:


   
 On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund
 joakim.tjernl...@transmode.se wrote:

 
 Graeme Russ graeme.r...@gmail.com wrote on 13/10/2009 13:21:05:

   
 On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund
 joakim.tjernl...@transmode.se wrote:

 
 Graeme Russ graeme.r...@gmail.com wrote on 11/10/2009 12:47:19:

   
 [Massive Snip :)]


 
 So, all that is left are .dynsym and .dynamic ...
   .dynsym
 - Contains 70 entries (16 bytes each, 1120 bytes)
 - 44 entries mimic those entries in .got which are not relocated
 - 21 entries are the remaining symbols exported from the linker
   script
 - 4 entries are labels defined in inline asm and used in C

 
 Try adding proper asm declarations. Look at what gcc
 generates for a function/variable and mimic these.

   
 Thanks - Now .dynsym contains only exports from the linker script

 
 :)

   
 - 1 entry is a NULL entry

   .dynamic
 - 88 bytes
 - Array of Elf32_Dyn
 - typedef struct {
   Elf32_Sword d_tag;
   union {
   Elf32_Word  d_val;
   Elf32_Addr  d_ptr;
   } d_un;
   } Elf32_Dyn;
 - 0x11 entries
   [00] 0x0010, 0x DT_SYMBOLIC, (ignored)
   [01] 0x0004, 0x38059994 DT_HASH, points to .hash
   [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr
   [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym
   [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr
   [05] 0x000B, 0x0010 DT_SYMENT, ???
   [06] 0x0015, 0x DT_DEBUG, ???
   [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text
   [08] 0x0012, 0x14D8 DT_RELSZ, ???

 
 How big DT_REL is

   
   [09] 0x0013, 0x0008 DT_RELENT, ???

 
 hmm, cannot remeber :)

   
 How big an entry in DT_REL is

 
 Right, how could I forget :)

   
   [0a] 0x0016, 0x DT_TEXTREL, ???

 
 Oops, you got text relocations. This is generally a bad thing.
 TEXTREL is commonly caused by asm code that arent truly pic so it needs
 to modify the .text segment to adjust for relocation.
 You should get rid of this one. Look for DT_TEXTREL in .o files to find
 the culprit.


   
 Alas I cannot - The relocations are a result of loading a register with a
 return address when calling show_boot_progress in the very early stages 
 of
 initialisation prior to the stack becoming available. The x86 does not
 allow direct access to the IP so the only way to find the 'current
 execution address' is to 'call' to the next instruction and pop the 
 return
 address off the stack

 
 hmm, same as ppc but that in it self should not cause a TEXREL, should it?
 Ahh, the 'call' is absolute, not relative? I guess there is some way 
 around it
 but it is not important ATM I guess.

 Evil idea, skip -fpic et. all and add the full reloc procedure
 to relocate by rewriting directly in TEXT segment. Then you save space
 but you need more relocation code. Something like dl_do_reloc from
 uClibc. Wonder how much extra code that would be? Not too much I think.


   
 With the following flags

 PLATFORM_RELFLAGS += -fvisibility=hidden
 PLATFORM_CPPFLAGS += -fno-dwarf2-cfi-asm
 PLATFORM_LDFLAGS += -pic --emit-relocs -Bsymbolic -Bsymbolic-functions

 I get no .got, but a lot of R_386_PC32 and R_386_32 relocations. I think
 this might mean I need the symbol table in the binary in order to resolve
 them

 

 BTW, how many relocs do you get compared with -fPIC? I suspect you more
 now but hopefully not that many more.

   
 Possibly, but I think you only need to add an offset to all those
 relocs.

   
 Almost right. The relocations specify a symbol value that needs to be
 added to the data in memory to relocate the reference. The symbol values
 involved should be the start of the text section for program references,
 the start of the uninitialized data section for bss references, and the
 start of the data section for initialized data and constants. So there
 are about four symbols whose value you need to keep. Take a look at
 http://refspecs.freestandards.org/elf/elf.pdf (which you have probably
 already looked at) and it tells you what to do with R_386_PC32 ad
 R_386_32 relocations. Hopefully the objcopy with the --strip-unneeded
 will remove all the symbols you don't actually need, but I don't know
 that for sure. Note also that you can change the section flags of a
 section marked noload  to load.
 

 Still think you can get away with just ADDING an offset. The image is linked 
 to a
 specific address

Re: [U-Boot] Relocation size penalty calculation

2009-10-14 Thread J. William Campbell
Joakim Tjernlund wrote:
 Graeme Russ graeme.r...@gmail.com wrote on 14/10/2009 13:48:27:
   
 On Wed, Oct 14, 2009 at 6:25 PM, Joakim Tjernlund
 joakim.tjernl...@transmode.se wrote:
 
 J. William Campbell jwilliamcampb...@comcast.net wrote on 14/10/2009 
 01:48:52:
   
 Joakim Tjernlund wrote:
 
 Graeme Russ graeme.r...@gmail.com wrote on 13/10/2009 22:06:56:


   
 On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund
 joakim.tjernl...@transmode.se wrote:

 
 Graeme Russ graeme.r...@gmail.com wrote on 13/10/2009 13:21:05:

   
 On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund
 joakim.tjernl...@transmode.se wrote:

 
 Graeme Russ graeme.r...@gmail.com wrote on 11/10/2009 12:47:19:

   
 [Massive Snip :)]


 
 So, all that is left are .dynsym and .dynamic ...
   .dynsym
 - Contains 70 entries (16 bytes each, 1120 bytes)
 - 44 entries mimic those entries in .got which are not relocated
 - 21 entries are the remaining symbols exported from the linker
   script
 - 4 entries are labels defined in inline asm and used in C

 
 Try adding proper asm declarations. Look at what gcc
 generates for a function/variable and mimic these.

   
 Thanks - Now .dynsym contains only exports from the linker script

 
 :)

   
 - 1 entry is a NULL entry

   .dynamic
 - 88 bytes
 - Array of Elf32_Dyn
 - typedef struct {
   Elf32_Sword d_tag;
   union {
   Elf32_Word  d_val;
   Elf32_Addr  d_ptr;
   } d_un;
   } Elf32_Dyn;
 - 0x11 entries
   [00] 0x0010, 0x DT_SYMBOLIC, (ignored)
   [01] 0x0004, 0x38059994 DT_HASH, points to .hash
   [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr
   [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym
   [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr
   [05] 0x000B, 0x0010 DT_SYMENT, ???
   [06] 0x0015, 0x DT_DEBUG, ???
   [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text
   [08] 0x0012, 0x14D8 DT_RELSZ, ???

 
 How big DT_REL is

   
   [09] 0x0013, 0x0008 DT_RELENT, ???

 
 hmm, cannot remeber :)

   
 How big an entry in DT_REL is

 
 Right, how could I forget :)

   
   [0a] 0x0016, 0x DT_TEXTREL, ???

 
 Oops, you got text relocations. This is generally a bad thing.
 TEXTREL is commonly caused by asm code that arent truly pic so it 
 needs
 to modify the .text segment to adjust for relocation.
 You should get rid of this one. Look for DT_TEXTREL in .o files to 
 find
 the culprit.


   
 Alas I cannot - The relocations are a result of loading a register 
 with a
 return address when calling show_boot_progress in the very early 
 stages of
 initialisation prior to the stack becoming available. The x86 does not
 allow direct access to the IP so the only way to find the 'current
 execution address' is to 'call' to the next instruction and pop the 
 return
 address off the stack

 
 hmm, same as ppc but that in it self should not cause a TEXREL, should 
 it?
 Ahh, the 'call' is absolute, not relative? I guess there is some way 
 around it
 but it is not important ATM I guess.

 Evil idea, skip -fpic et. all and add the full reloc procedure
 to relocate by rewriting directly in TEXT segment. Then you save space
 but you need more relocation code. Something like dl_do_reloc from
 uClibc. Wonder how much extra code that would be? Not too much I think.


   
 With the following flags

 PLATFORM_RELFLAGS += -fvisibility=hidden
 PLATFORM_CPPFLAGS += -fno-dwarf2-cfi-asm
 PLATFORM_LDFLAGS += -pic --emit-relocs -Bsymbolic -Bsymbolic-functions

 I get no .got, but a lot of R_386_PC32 and R_386_32 relocations. I think
 this might mean I need the symbol table in the binary in order to resolve
 them

 
 BTW, how many relocs do you get compared with -fPIC? I suspect you more
 now but hopefully not that many more.

   
 Possibly, but I think you only need to add an offset to all those
 relocs.

   
 Almost right. The relocations specify a symbol value that needs to be
 added to the data in memory to relocate the reference. The symbol values
 involved should be the start of the text section for program references,
 the start of the uninitialized data section for bss references, and the
 start of the data section for initialized data and constants. So there
 are about four symbols whose value you need to keep. Take a look at
 http://refspecs.freestandards.org/elf/elf.pdf (which you have probably
 already looked at) and it tells you what to do with R_386_PC32 ad
 R_386_32 relocations. Hopefully the objcopy with the --strip-unneeded
 will remove

Re: [U-Boot] Relocation size penalty calculation

2009-10-14 Thread J. William Campbell
Joakim Tjernlund wrote:
 J. William Campbell jwilliamcampb...@comcast.net wrote on 14/10/2009 
 17:35:44:
   
 Joakim Tjernlund wrote:
 
 J. William Campbell jwilliamcampb...@comcast.net wrote on 14/10/2009 
 01:48:52:

   
 Joakim Tjernlund wrote:

 
 Graeme Russ graeme.r...@gmail.com wrote on 13/10/2009 22:06:56:



   
 On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund
 joakim.tjernl...@transmode.se wrote:


 
 Graeme Russ graeme.r...@gmail.com wrote on 13/10/2009 13:21:05:


   
 On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund
 joakim.tjernl...@transmode.se wrote:


 
 Graeme Russ graeme.r...@gmail.com wrote on 11/10/2009 12:47:19:


   
 [Massive Snip :)]
 

 [Yet another SNIP :)]

   
 Evil idea, skip -fpic et. all and add the full reloc procedure
 to relocate by rewriting directly in TEXT segment. Then you save space
 but you need more relocation code. Something like dl_do_reloc from
 uClibc. Wonder how much extra code that would be? Not too much I think.



   
 With the following flags

 PLATFORM_RELFLAGS += -fvisibility=hidden
 PLATFORM_CPPFLAGS += -fno-dwarf2-cfi-asm
 PLATFORM_LDFLAGS += -pic --emit-relocs -Bsymbolic -Bsymbolic-functions

 I get no .got, but a lot of R_386_PC32 and R_386_32 relocations. I think
 this might mean I need the symbol table in the binary in order to resolve
 them


 
 BTW, how many relocs do you get compared with -fPIC? I suspect you more
 now but hopefully not that many more.


   
 Possibly, but I think you only need to add an offset to all those
 relocs.


   
 Almost right. The relocations specify a symbol value that needs to be
 added to the data in memory to relocate the reference. The symbol values
 involved should be the start of the text section for program references,
 the start of the uninitialized data section for bss references, and the
 start of the data section for initialized data and constants. So there
 are about four symbols whose value you need to keep. Take a look at
 http://refspecs.freestandards.org/elf/elf.pdf (which you have probably
 already looked at) and it tells you what to do with R_386_PC32 ad
 R_386_32 relocations. Hopefully the objcopy with the --strip-unneeded
 will remove all the symbols you don't actually need, but I don't know
 that for sure. Note also that you can change the section flags of a
 section marked noload  to load.

 
 Still think you can get away with just ADDING an offset. The image is 
 linked to a
 specific address and then you move the whole image to a new address. 
 Therefore
 you should be able to read the current address, add offset, write back the
   
 new address.
 
 Normally one do what you describe but here we know that the whole img has 
 moved so
 we don't have to do calculate the new address from scratch.

   
 If the addresses of the bss, text, and data segments change by the same
 value, I think you are correct. However, if the text and data/bss
 segments are moved by different offsets, naturally the relocations would
 be different. One reason to retain this capability would be to allow the
 u-boot copy to execute in place in NOR flash while re-locating the
 read-write storage once memory has been sized. Having different
 relocation factors is not much worse than just one, and it may be just
 as easy to get working initially as a single relocation constant.
 

 How do figure that? You need to rewrite the insn to access the moved
 data/bss and they are in flash, did I miss something?
   
No, I did. You are quite correct, there would be references in flash 
that couldn't be fixed. Sorry about that.

Best Regards,
Bill Campbell
   
 FWIW, the ultimate solution to minimum relocation size is a
 post-processing step that creates several arrays of relocation offsets
 as two byte quantities. This reduces the cost of each relocation entry
 to just a bit more than two bytes (there is a small overhead for array
 size, MSB values and relocation offset selection.) Naturally, this is
 much less than the ELF version of the same relocations, because we do
 not need to retain as much information and ELF doesn't worry about size
 that much.. This may pacify users for which the flash size of the image
 is critical, at the expense of an extra link step. Naturally, getting
 things to work with standard ELF is the most important step, and
 probably enough for most people.
 

 That would save 2+4 bytes/reloc on REL arches and
 2+4+4 on RELA(ppc) (provided one can ignore r_addend)

 But yes, this is probably too fancy for the moment.

   Jocke



   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Relocation size penalty calculation

2009-10-13 Thread J. William Campbell
Joakim Tjernlund wrote:
 Graeme Russ graeme.r...@gmail.com wrote on 13/10/2009 13:21:05:
   
 On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund
 joakim.tjernl...@transmode.se wrote:
 
 Graeme Russ graeme.r...@gmail.com wrote on 11/10/2009 12:47:19:
   
 [Massive Snip :)]

 
 So, all that is left are .dynsym and .dynamic ...
   .dynsym
 - Contains 70 entries (16 bytes each, 1120 bytes)
 - 44 entries mimic those entries in .got which are not relocated
 - 21 entries are the remaining symbols exported from the linker
   script
 - 4 entries are labels defined in inline asm and used in C
 
 Try adding proper asm declarations. Look at what gcc
 generates for a function/variable and mimic these.
   
 Thanks - Now .dynsym contains only exports from the linker script
 
 :)
   
 - 1 entry is a NULL entry

   .dynamic
 - 88 bytes
 - Array of Elf32_Dyn
 - typedef struct {
   Elf32_Sword d_tag;
   union {
   Elf32_Word  d_val;
   Elf32_Addr  d_ptr;
   } d_un;
   } Elf32_Dyn;
 - 0x11 entries
   [00] 0x0010, 0x DT_SYMBOLIC, (ignored)
   [01] 0x0004, 0x38059994 DT_HASH, points to .hash
   [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr
   [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym
   [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr
   [05] 0x000B, 0x0010 DT_SYMENT, ???
   [06] 0x0015, 0x DT_DEBUG, ???
   [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text
   [08] 0x0012, 0x14D8 DT_RELSZ, ???
 
 How big DT_REL is
   
   [09] 0x0013, 0x0008 DT_RELENT, ???
 
 hmm, cannot remeber :)
   
 How big an entry in DT_REL is
 

 Right, how could I forget :)
   
   [0a] 0x0016, 0x DT_TEXTREL, ???
 
 Oops, you got text relocations. This is generally a bad thing.
 TEXTREL is commonly caused by asm code that arent truly pic so it needs
 to modify the .text segment to adjust for relocation.
 You should get rid of this one. Look for DT_TEXTREL in .o files to find
 the culprit.

   
 Alas I cannot - The relocations are a result of loading a register with a
 return address when calling show_boot_progress in the very early stages of
 initialisation prior to the stack becoming available. The x86 does not
 allow direct access to the IP so the only way to find the 'current
 execution address' is to 'call' to the next instruction and pop the return
 address off the stack
 

 hmm, same as ppc but that in it self should not cause a TEXREL, should it?
 Ahh, the 'call' is absolute, not relative? I guess there is some way around it
 but it is not important ATM I guess.

 Evil idea, skip -fpic et. all and add the full reloc procedure
 to relocate by rewriting directly in TEXT segment. Then you save space
 but you need more relocation code. Something like dl_do_reloc from
 uClibc. Wonder how much extra code that would be? Not too much I think.
   
I think this approach will turn out to be a big win. At present, the 
problem with just using the relocs is that objcopy is stripping them out 
when u-boot.bin is created, as I understand it. It seems this can be 
solved by changing the command switches appropriately, like using 
--strip-unneeded. In any case, there is some combination of switches 
that will preserve the relocation data. The executable code will get 
smaller, there will be no .got, and the relocation data will be larger 
(than with -fpic). In total size, it probably will be slightly smaller, 
but that is a guess. The most important benefit of this approach is that 
it will work for all architectures, thereby solving the problem once and 
forever! Even if the result is a bit larger, the RAM footprint will be 
reduced by the smaller object code size (since the relocation data need 
not be copied into ram).Having this approach as an option would be real 
nice, since it would always just work.

Best Regards,
Bill Campbell
   
 This is not a problem because this is very low-level init that is not
 called once relocated into RAM - These relocations can be safely ignored
 

   
   [0b] 0x6FFA, 0x0236 ???, Entries in .rel.dyn
   [0c] 0x, 0x DT_NULL, End of Array
   [0d] 0x, 0x DT_NULL, End of Array
   [0e] 0x, 0x DT_NULL, End of Array
   [0f] 0x, 0x DT_NULL, End of Array
   [10] 0x, 0x DT_NULL, End of Array

 I think some more investigation into the need for .dynsym and .dynamic is
 still required...
 
 .dynsym may still be required if only for accessing the __u_boot_cmd
 structure. However, I may be able to hack that a little and not create a
 __u_boot_cmd symbol in the linker script (create some other temporary
 symbol) and populate __u_boot_cmd with a valid value after relocation. It
 will look a little weird, but may mean 

Re: [U-Boot] Relocation size penalty calculation

2009-10-13 Thread J. William Campbell
Joakim Tjernlund wrote:
 Graeme Russ graeme.r...@gmail.com wrote on 13/10/2009 22:06:56:

   
 On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund
 joakim.tjernl...@transmode.se wrote:
 
 Graeme Russ graeme.r...@gmail.com wrote on 13/10/2009 13:21:05:
   
 On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund
 joakim.tjernl...@transmode.se wrote:
 
 Graeme Russ graeme.r...@gmail.com wrote on 11/10/2009 12:47:19:
   
 [Massive Snip :)]

 
 So, all that is left are .dynsym and .dynamic ...
   .dynsym
 - Contains 70 entries (16 bytes each, 1120 bytes)
 - 44 entries mimic those entries in .got which are not relocated
 - 21 entries are the remaining symbols exported from the linker
   script
 - 4 entries are labels defined in inline asm and used in C
 
 Try adding proper asm declarations. Look at what gcc
 generates for a function/variable and mimic these.
   
 Thanks - Now .dynsym contains only exports from the linker script
 
 :)
   
 - 1 entry is a NULL entry

   .dynamic
 - 88 bytes
 - Array of Elf32_Dyn
 - typedef struct {
   Elf32_Sword d_tag;
   union {
   Elf32_Word  d_val;
   Elf32_Addr  d_ptr;
   } d_un;
   } Elf32_Dyn;
 - 0x11 entries
   [00] 0x0010, 0x DT_SYMBOLIC, (ignored)
   [01] 0x0004, 0x38059994 DT_HASH, points to .hash
   [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr
   [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym
   [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr
   [05] 0x000B, 0x0010 DT_SYMENT, ???
   [06] 0x0015, 0x DT_DEBUG, ???
   [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text
   [08] 0x0012, 0x14D8 DT_RELSZ, ???
 
 How big DT_REL is
   
   [09] 0x0013, 0x0008 DT_RELENT, ???
 
 hmm, cannot remeber :)
   
 How big an entry in DT_REL is
 
 Right, how could I forget :)
   
   [0a] 0x0016, 0x DT_TEXTREL, ???
 
 Oops, you got text relocations. This is generally a bad thing.
 TEXTREL is commonly caused by asm code that arent truly pic so it needs
 to modify the .text segment to adjust for relocation.
 You should get rid of this one. Look for DT_TEXTREL in .o files to find
 the culprit.

   
 Alas I cannot - The relocations are a result of loading a register with a
 return address when calling show_boot_progress in the very early stages of
 initialisation prior to the stack becoming available. The x86 does not
 allow direct access to the IP so the only way to find the 'current
 execution address' is to 'call' to the next instruction and pop the return
 address off the stack
 
 hmm, same as ppc but that in it self should not cause a TEXREL, should it?
 Ahh, the 'call' is absolute, not relative? I guess there is some way around 
 it
 but it is not important ATM I guess.

 Evil idea, skip -fpic et. all and add the full reloc procedure
 to relocate by rewriting directly in TEXT segment. Then you save space
 but you need more relocation code. Something like dl_do_reloc from
 uClibc. Wonder how much extra code that would be? Not too much I think.

   
 With the following flags

 PLATFORM_RELFLAGS += -fvisibility=hidden
 PLATFORM_CPPFLAGS += -fno-dwarf2-cfi-asm
 PLATFORM_LDFLAGS += -pic --emit-relocs -Bsymbolic -Bsymbolic-functions

 I get no .got, but a lot of R_386_PC32 and R_386_32 relocations. I think
 this might mean I need the symbol table in the binary in order to resolve
 them
 

 Possibly, but I think you only need to add an offset to all those
 relocs.
   
Almost right. The relocations specify a symbol value that needs to be 
added to the data in memory to relocate the reference. The symbol values 
involved should be the start of the text section for program references, 
the start of the uninitialized data section for bss references, and the 
start of the data section for initialized data and constants. So there 
are about four symbols whose value you need to keep. Take a look at 
http://refspecs.freestandards.org/elf/elf.pdf (which you have probably 
already looked at) and it tells you what to do with R_386_PC32 ad 
R_386_32 relocations. Hopefully the objcopy with the --strip-unneeded 
will remove all the symbols you don't actually need, but I don't know 
that for sure. Note also that you can change the section flags of a 
section marked noload  to load.

Best Regards,
Bill Campbell
   Jokce

 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot


   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Relocation size penalty calculation

2009-10-08 Thread J. William Campbell
Peter Tyser wrote:
 On Thu, 2009-10-08 at 22:54 +1100, Graeme Russ wrote:
   
 Out of curiosity, I wanted to see just how much of a size penalty I am
 incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are
 the results (fixed width font will help - its space, not tab, formatted):

 Section non-reloc reloc
 ---
 .text000118c4  000137fc - 0x1f38 bytes (~8kB) bigger
 .rodata  5bad  59d0
 .interp  n/a   0013
 .dynstr  n/a   0648
 .hashn/a   0428
 .eh_frame3268  34fc
 .data0a6c  01dc
 .data.reln/a   0098
 .data.rel.ro.local   n/a   0178
 .data.rel.local  n/a   07e4
 .got   01f0
 .got.plt n/a   000c
 .rel.got n/a   03e0
 .rel.dyn n/a   1228
 .dynsym  n/a   0850
 .dynamic n/a   0080
 .u_boot_cmd  03c0  03c0
 .bss 1a34  1a34
 .realmode0166  0166
 .bios053e  053e
 ===
 Total0001d5dd  00022287 - 0x4caa bytes (~19kB) bigger

 Its more than a 16% increase in size!!!

 .text accounts for a little under half of the total bloat, and of that,
 the crude dynamic loader accounts for only 341 bytes

 Have any metrics been done for PPC?
 

 Things actually improve a little bit when we use -mrelocatable and get
 rid of all the manual += gd-reloc_off fixups:

 1) Top of mainline on XPedite5370:
text  data bss dec hex filename
  308612 24488   33172  366272   596c0 u-boot

 2) Top of reloc branch on XPedite5370 (ie -mrelocatable):
text  data bss dec hex filename
  303704 28644   33156  365504   593c0 u-boot

   
Hi Peter,
 Just to be clear, the total text+data length of u-boot with the 
manual relocations (#1)  is LARGER than the text+data length of u-boot 
with the manual relocations removed and the necessary centralized 
relocation code added, along with any additional data sections required 
by -mrelocateable (#2), by 768 (dec) bytes? And both cases (1 and 2) 
work equivalently?

Best Regards,
Bill Campbell.
 For fun:
 3) #2 but with s/-mrelocatable/-fpic/ (probably doesn't boot):
text  data bss dec hex filename
  303704 24472   33156  361332   58374 u-boot


 There may be some other changes that affect the size between mainline
 and reloc, but their sizes are in the same general ballpark.

 Best,
 Peter

 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot


   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Relocation size penalty calculation

2009-10-08 Thread J. William Campbell
Graeme Russ wrote:
 Out of curiosity, I wanted to see just how much of a size penalty I am
 incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are
 the results (fixed width font will help - its space, not tab, formatted):

 Section non-reloc reloc
 ---
 .text000118c4  000137fc - 0x1f38 bytes (~8kB) bigger
 .rodata  5bad  59d0
 .interp  n/a   0013
 .dynstr  n/a   0648
 .hashn/a   0428
 .eh_frame3268  34fc
 .data0a6c  01dc
 .data.reln/a   0098
 .data.rel.ro.local   n/a   0178
 .data.rel.local  n/a   07e4
 .got   01f0
 .got.plt n/a   000c
 .rel.got n/a   03e0
 .rel.dyn n/a   1228
 .dynsym  n/a   0850
 .dynamic n/a   0080
 .u_boot_cmd  03c0  03c0
 .bss 1a34  1a34
 .realmode0166  0166
 .bios053e  053e
 ===
 Total0001d5dd  00022287 - 0x4caa bytes (~19kB) bigger

 Its more than a 16% increase in size!!!

 .text accounts for a little under half of the total bloat, and of that,
 the crude dynamic loader accounts for only 341 bytes
   
Hi Graeme,
   I would be interested in a third option (column), the x86 build 
with just -mrelocateable but NOT -fpic. It will not be definitive 
because there will be extra code that references the GOT and missing 
code to do some of the relocation, but it would still be interesting.

Best Regards,
Bill Campbell
 Have any metrics been done for PPC?

 Regards,

 Graeme
 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot


   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Relocation size penalty calculation

2009-10-08 Thread J. William Campbell
Peter Tyser wrote:
 On Thu, 2009-10-08 at 08:53 -0700, J. William Campbell wrote:
   
 Peter Tyser wrote:
 
 On Thu, 2009-10-08 at 22:54 +1100, Graeme Russ wrote:
   
   
 Out of curiosity, I wanted to see just how much of a size penalty I am
 incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are
 the results (fixed width font will help - its space, not tab, formatted):

 Section non-reloc reloc
 ---
 .text000118c4  000137fc - 0x1f38 bytes (~8kB) bigger
 .rodata  5bad  59d0
 .interp  n/a   0013
 .dynstr  n/a   0648
 .hashn/a   0428
 .eh_frame3268  34fc
 .data0a6c  01dc
 .data.reln/a   0098
 .data.rel.ro.local   n/a   0178
 .data.rel.local  n/a   07e4
 .got   01f0
 .got.plt n/a   000c
 .rel.got n/a   03e0
 .rel.dyn n/a   1228
 .dynsym  n/a   0850
 .dynamic n/a   0080
 .u_boot_cmd  03c0  03c0
 .bss 1a34  1a34
 .realmode0166  0166
 .bios053e  053e
 ===
 Total0001d5dd  00022287 - 0x4caa bytes (~19kB) bigger

 Its more than a 16% increase in size!!!

 .text accounts for a little under half of the total bloat, and of that,
 the crude dynamic loader accounts for only 341 bytes

 Have any metrics been done for PPC?
 
 
 Things actually improve a little bit when we use -mrelocatable and get
 rid of all the manual += gd-reloc_off fixups:

 1) Top of mainline on XPedite5370:
textdata bss dec hex filename
  308612   24488   33172  366272   596c0 u-boot

 2) Top of reloc branch on XPedite5370 (ie -mrelocatable):
textdata bss dec hex filename
  303704   28644   33156  365504   593c0 u-boot

   
   
 Hi Peter,
  Just to be clear, the total text+data length of u-boot with the 
 manual relocations (#1)  is LARGER than the text+data length of u-boot 
 with the manual relocations removed and the necessary centralized 
 relocation code added, along with any additional data sections required 
 by -mrelocateable (#2), by 768 (dec) bytes?
 

 Hi Bill,
 Doah, looks like I chose a bad board as an example.  The XPedite5370
 already had -mrelocatable defined in its own
 board/xes/xpedite5370/config.mk in mainline, so the above comparison
 should be ignored as both builds used -mrelocatable.

 Here's some *real* results from the MPC8548CDS:
 1) Top of mainline:
text  data bss dec hex filename
  219968 17052   22992  260012   3f7ac u-boot

 2) Top of reloc branch (ie -mrelocatable)
text  data bss dec hex filename
  219192 20640   22980  262812   4029c u-boot

 So the reloc branch is 2.7K bigger for the MPC8548CDS.
   
Hi Peter,
 OK, that's more like it! A 1.2 % size increase in ROM seems like a 
very small price to pay for a truly relocatable u-boot image that will 
run on any size memory without the programmer having to actively worry 
about what may need relocating as code is written. . Also, it should be 
noted that the size increase in 2)  is mostly in relocation segments 
that do not need to be copied into ram, so the ram footprint should be 
smaller for 2) than 1). The relocation code itself could also be placed 
is a segment that is not copied into ram, although that may be more 
trouble than it is worth.
   I am looking forward to Graeme's results with the 386. I expect 
that it will not be quite so favorable, perhaps a 4 or 5% size increase 
for -mrelocatable over an absolute build. However, -mrelocatable vs. 
-fpic may be comparable, with -mrelocatable actually winning. But then 
again, I could be totally wrong!

Best Regards,
Bill Campbell
 Best,
 Peter



   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Relocation size penalty calculation

2009-10-08 Thread J. William Campbell
Graeme Russ wrote:
 On Fri, Oct 9, 2009 at 2:58 AM, J. William Campbell
 jwilliamcampb...@comcast.net wrote:
   
 Graeme Russ wrote:
 
 Out of curiosity, I wanted to see just how much of a size penalty I am
 incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are
 the results (fixed width font will help - its space, not tab, formatted):

 Section non-reloc reloc
 ---
 .text000118c4  000137fc - 0x1f38 bytes (~8kB) bigger
 .rodata  5bad  59d0
 .interp  n/a   0013
 .dynstr  n/a   0648
 .hashn/a   0428
 .eh_frame3268  34fc
 .data0a6c  01dc
 .data.reln/a   0098
 .data.rel.ro.local   n/a   0178
 .data.rel.local  n/a   07e4
 .got   01f0
 .got.plt n/a   000c
 .rel.got n/a   03e0
 .rel.dyn n/a   1228
 .dynsym  n/a   0850
 .dynamic n/a   0080
 .u_boot_cmd  03c0  03c0
 .bss 1a34  1a34
 .realmode0166  0166
 .bios053e  053e
 ===
 Total0001d5dd  00022287 - 0x4caa bytes (~19kB) bigger

 Its more than a 16% increase in size!!!

 .text accounts for a little under half of the total bloat, and of that,
 the crude dynamic loader accounts for only 341 bytes

   
 Hi Graeme,
  I would be interested in a third option (column), the x86 build with
 just -mrelocateable but NOT -fpic. It will not be definitive because there
 will be extra code that references the GOT and missing code to do some of
 the relocation, but it would still be interesting.
 

 x86 does not have -mrelocatable. This is a PPC only option :(
   
Hi Graeme,
   You are unfortunately correct. However, I wonder if we can 
get essentially the same result by executing the final ld step with the 
--emit-relocs switch included. This may also include some extra 
sections that we would want to strip out, but if it works, it could give 
all ELF-based systems a way to a relocatable u-boot.

Best Regards,
Bill Campbell
**

   
 Best Regards,
 Bill Campbell
 
 Have any metrics been done for PPC?

 Regards,

 Graeme
   

 Once the reloc branch has been merged, how many arches are left which do
 not support relocation?

 Regards,

 Graeme


   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Relocation size penalty calculation

2009-10-08 Thread J. William Campbell
Joakim Tjernlund wrote:
 On Fri, Oct 9, 2009 at 9:27 AM, J. William Campbell
 jwilliamcampb...@comcast.net wrote:
 
 Graeme Russ wrote:
   
 On Fri, Oct 9, 2009 at 2:58 AM, J. William Campbell
 jwilliamcampb...@comcast.net wrote:

 
 Graeme Russ wrote:

   
 Out of curiosity, I wanted to see just how much of a size penalty I am
 incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are
 the results (fixed width font will help - its space, not tab,
 formatted):

 Section non-reloc reloc
 ---
 .text000118c4  000137fc - 0x1f38 bytes (~8kB) bigger
 .rodata  5bad  59d0
 .interp  n/a   0013
 .dynstr  n/a   0648
 .hashn/a   0428
 .eh_frame3268  34fc
 .data0a6c  01dc
 .data.reln/a   0098
 .data.rel.ro.local   n/a   0178
 .data.rel.local  n/a   07e4
 .got   01f0
 .got.plt n/a   000c
 .rel.got n/a   03e0
 .rel.dyn n/a   1228
 .dynsym  n/a   0850
 .dynamic n/a   0080
 .u_boot_cmd  03c0  03c0
 .bss 1a34  1a34
 .realmode0166  0166
 .bios053e  053e
 ===
 Total0001d5dd  00022287 - 0x4caa bytes (~19kB) bigger

 Its more than a 16% increase in size!!!

 .text accounts for a little under half of the total bloat, and of that,
 the crude dynamic loader accounts for only 341 bytes


 
 Hi Graeme,
 I would be interested in a third option (column), the x86 build with
 just -mrelocateable but NOT -fpic. It will not be definitive because
 there
 will be extra code that references the GOT and missing code to do some of
 the relocation, but it would still be interesting.

   
 x86 does not have -mrelocatable. This is a PPC only option :(

 
 Hi Graeme,
  You are unfortunately correct. However, I wonder if we can get
 essentially the same result by executing the final ld step with the
 --emit-relocs switch included. This may also include some extra sections
 that we would want to strip out, but if it works, it could give all
 ELF-based systems a way to a relocatable u-boot.

   
 I don't think --emit-relocs is necessary with -pic. I haven't gone through
 all the permutations to see if there is a smaller option, but gcc -fpic and
 ld -pie creates enough information to perform relocation on the x86
 platform
 

   
It is true that --emit-relocs is not required when -pic and -pie are 
used instead. However, pic and pie are designed to allow shared code 
(libraries)  to appear at different logical addresses in several 
programs without altering the text. This is grand overkill for what we 
need, which is the ability to relocate the code. The -pic and -pie code 
will be larger than the code without pic and pie. How much larger is a 
good question. On the PPC, it is larger but not much larger, because 
there are lots of registers available and one is almost for sure got (no 
pun intended) the magic relocation constant(s) in it. On the 386 with 
many fewer registers, pic and pie will cause the code to be 
percentage-wise larger than on the PPC. Thus avoiding pic and pie is a 
Good Thing in most cases.
 Try -fvisibility=hidden
   
I assume the -fvisibility=hidden is suggested in order to reduce 
(eliminate) the symbol table from the output, which we don't need 
because there are assumed to be no undefined symbols in our final ld. If 
that works, great! I was assuming we might need a custom strip program 
to delete any sections that we don't need, but this sounds easier if it 
gets them all.

Best Regards,
Bill Campbell
  Jocke



   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH 0/2] Make sure 85xx bss doesn't start at 0x0

2009-10-07 Thread J. William Campbell
Joakim Tjernlund wrote:
 On Wed, Oct 7, 2009 at 5:55 PM, Wolfgang Denk w...@denx.de wrote:
 
 Dear Graeme Russ,

 In message d66caabb0910061824s4165d33bu5d5213f6783c0...@mail.gmail.com 
 you wrote:
   
 I think that even the -mrelocatable / .fixup method may not be needed at
 all. -pie / -pic used by themselves creates enough information for an OS
 dynamic loader to relocate an executable, so why not U-Boot? Given that
 the type and location of each section is easily determined, a striped
 down dynamic loader can provide a platform-independent relocation scheme.
 
 One reason for not using ELF images for the boot loader is size. The
 ELF header alone is often more than we would be willing to accept, not
 to mention the additional code.

   
 But the headers get stripped from the final binary. All we are left with in
 order to locate the ELF section data are the symbols exported from the
 linker script

 The extra code is only three very tight for-loops. I had them wrapped in
 functions to improve readability, but they are good inline candidates (only
 called once each) and I doubt they use much code space at all (I'll send
 through actual numbers soon)
 

 But how much space in the extra sections you link in?

 if size is comparable with fixup ptrs we should probably
 consider using the same for ppc. Then we can use -fpic/-fpie and
 that is significantly smaller then -fPIC on PPC.

   
 Question is, does -mrelocatable result in smaller .got (et al) are is the
 .fixup section adding extra size for the sake of ease of implementation?
 

 fixup section expands with lots of ptrs, then fixup is placed just after .got


   
There is also a trade-off in using the -mrelocatable / .fixup method 
that should be considered in the general case. -fPIC or even -fpic code 
is almost always larger than the code generated for -mrelocateable, so 
it it the final size of the object blob in prom that matters, not just 
the length of the fixup segment. This effect is architecture dependent, 
where PIC is much cheaper on a PPC than on an Intel 386. So the best 
combination may depend on the chip.
I would also strongly encourage the -mrelocateable method be always 
available as an option. Historically, -fPIC has been buggy in initial 
releases of GCC for new architectures, probably because you can do 
useful work with just relocateable code and PIC can come later. So it 
may be that u-boot would need an -mrelocateable approach for some period 
until -fPIC/-fpic worked. In any case, it would be a good fallback if 
one suspected a bug in PIC.

Best Regards,
Bill Campbell
___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH 0/2] Make sure 85xx bss doesn't start at 0x0

2009-10-06 Thread J. William Campbell
Peter Tyser wrote:
 On Tue, 2009-10-06 at 19:51 +0200, Wolfgang Denk wrote:
   
 Dear Peter Tyser,

 In message 1254843932.24664.2083.ca...@localhost.localdomain you wrote:
 
 I personally like the current implementation of putting the bss after
 the entire U-Boot image.  It keeps U-Boot's code, malloc pool, stack,
 bss, etc all in the same general area which is nice, and has the side
 benefit that the bootpg won't be overwritten.
   
 OK, if you think so...

 
 I know ORing in 0x10 is a bit ugly, but what's the real downside of
 doing it?
   
 Nothing. I just hate to allocate the bss at 0x0, because this is
 actually incorrect - it's the result of an address overflow /
 truncation, and pretty much misleading to someone trying to read and
 understand the code. For the linked image, it does not _look_ as if
 the bss was located _after_ the U-Boot image, it looks detached and
 allocated in low RAM.
 

 Do you have a preference Kumar?  You're probably going to be the first
 in line to have to deal with any resulting confusion:)

 I personally would rank the options:
 1. OR in an offset to the bss address and leave some good comments in
 the linker script and commit message

 2. Make the bss the last section like other PPC boards which would
 result in the bootpg sometimes being overwritten

 3. Put the bss at an arbitrary address
   
FWIW, I think an arbitrary address disjoint from the u-boot addresses is 
best. While u-boot is in ROM, you can't use the bss anyway. The bss will 
actually be located at an address selected by the u-boot code itself 
after memory is sized. All references to the bss will be re-located by 
subtracting the arbitrary start address and adding the run-time chosen 
start address. So the linked start address is not important, except that 
is cannot be NULL or it may confuse the relocation code that doesn't 
want to re-locate NULL pointers. Some of the confusion in this 
discussion probably stems from the fact that the linker scripts make the 
bss look like part of u-boot, when it is really not. It is just a 
chunk of zero'ed ram, located anywhere the u-boot code decides to put 
it. An arbitrary strange address would make this more apparent.

Best Regards,
Bill Campbell
 Best,
 Peter

 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot


   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH 0/2] Make sure 85xx bss doesn't start at 0x0

2009-10-06 Thread J. William Campbell
Peter Tyser wrote:
 On Tue, 2009-10-06 at 13:34 -0700, J. William Campbell wrote:
   
 Peter Tyser wrote:
 
 On Tue, 2009-10-06 at 19:51 +0200, Wolfgang Denk wrote:
   
   
 Dear Peter Tyser,

 In message 1254843932.24664.2083.ca...@localhost.localdomain you wrote:
 
 
 I personally like the current implementation of putting the bss after
 the entire U-Boot image.  It keeps U-Boot's code, malloc pool, stack,
 bss, etc all in the same general area which is nice, and has the side
 benefit that the bootpg won't be overwritten.
   
   
 OK, if you think so...

 
 
 I know ORing in 0x10 is a bit ugly, but what's the real downside of
 doing it?
   
   
 Nothing. I just hate to allocate the bss at 0x0, because this is
 actually incorrect - it's the result of an address overflow /
 truncation, and pretty much misleading to someone trying to read and
 understand the code. For the linked image, it does not _look_ as if
 the bss was located _after_ the U-Boot image, it looks detached and
 allocated in low RAM.
 
 
 Do you have a preference Kumar?  You're probably going to be the first
 in line to have to deal with any resulting confusion:)

 I personally would rank the options:
 1. OR in an offset to the bss address and leave some good comments in
 the linker script and commit message

 2. Make the bss the last section like other PPC boards which would
 result in the bootpg sometimes being overwritten

 3. Put the bss at an arbitrary address
   
   
 FWIW, I think an arbitrary address disjoint from the u-boot addresses is 
 best. While u-boot is in ROM, you can't use the bss anyway. The bss will 
 actually be located at an address selected by the u-boot code itself 
 after memory is sized. All references to the bss will be re-located by 
 subtracting the arbitrary start address and adding the run-time chosen 
 start address. So the linked start address is not important, except that 
 is cannot be NULL or it may confuse the relocation code that doesn't 
 want to re-locate NULL pointers. Some of the confusion in this 
 discussion probably stems from the fact that the linker scripts make the 
 bss look like part of u-boot, when it is really not. It is just a 
 chunk of zero'ed ram, located anywhere the u-boot code decides to put 
 it. An arbitrary strange address would make this more apparent.
 

 Hi Bill,
 What's the advantage of having the bss not be located next to U-Boot?
 The big disadvantage of picking an arbitrary address for the bss is that
 there's now 1 more magical section of SDRAM that the user needs to know
 shouldn't be used.  I already field enough question from people that
 corrupt their exception vectors or stack/malloc pool/u-boot code, I
 don't want to add more bss questions:)
   
Hi Peter,
  The point is that the address chosen for the ld step is NOT the 
address in ram where the bss will reside anyway. This address can 
overlap the exception vectors, stack, or even the u-boot code itself and 
it wouldn't matter (other than possible confusion). The actual physical 
address where the bss and u-boot itself resides is COMPUTED by u-boot 
after it sizes memory. u-boot only needs to know how big the section is 
in order to allow enough room. All references to the bss will then be 
re-located correctly. Where the bss actually ends up is a function of 
u-boot code. It may be on some processors that the computation of bss 
start is done assuming the bss is adjacent to u-boot in the original 
memory map, but if so, it is an un-necessary restriction. All that is 
required is a safe chunk of ram, which is also what is needed for 
stack and malloc area and should be chosen in a similar manner. So at 
run time, bss probably ends up adjacent to u-boot in ram because that's 
how it was coded, but at ld time it shouldn't matter.

Best Regards,
Bill Campbell
 Best,
 Peter

 PS. please keep the original email recipients on CC



   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH] CFI Driver: Reset watchdog timer after each flash operation

2009-10-02 Thread J. William Campbell
Mike Frysinger wrote:
 On Friday 02 October 2009 08:30:51 Wolfgang Denk wrote:
   
 Ingo van Lil wrote:
 
 The CFI driver does not reset the device's watchdog, so long-running
 flash operations will cause the watchdog timer to expire. A comment in
 flash_status_check() suggests that udelay() is expected to reset the
 watchdog, but I can't find any architecture where it does.
   
 Please have a closer look, then. On PowerPC, udelay()
 [lib_ppc/time.c] calls wait_ticks(), which in turn
 [lib_ppc/ticks.S] calls WATCHDOG_RESET

 If this is missing in other architectures, it should be fixed at the
 root cause, i. e. in udelay() or in the respective support routines.
 

 Blackfin is missing it as well as i really had no idea it was supposed to be 
 there.  certainly no doc states this requirement.  perhaps it'd make sense to 
 break apart the common stuff to a common udelay() that does things like call 
 the watchdog and then call the implementation __udelay().  there should be at 
 least a doc/README.arch that includes these kind of details ...
 -mike
   
   
At this point it might be appropriate to ask if including such a reset 
in udelay() is a good idea. The way it is, no infinite loop in u-boot 
that contains a udelay() will ever allow the watchdog to time out. That 
restriction is somewhat non-obvious. I would argue that there are 
basically two classes of commands in u-boot, those that should execute 
so fast that there is no way the watchdog should be triggered, and 
commands that may take a long time during  which the watchdog may 
erroneously be triggered if there is no provision made to reset it. In 
the second case, the resetting of the watchdog must be placed where 
measurable progress towards command completion is being made, i.e. 
there is a certainty that we are getting closer to a command complete. 
Just because the program invokes udelay, there is no assurance that 
measurable progress is being made. The udelay may be part of a process 
that will never complete. Therefore, having a watchdog reset in udelay 
seems like a less than optimal idea in general. Maybe now would be a 
good time to look at removing it?

Bill Campbell
 

 ___
 U-Boot mailing list
 U-Boot@lists.denx.de
 http://lists.denx.de/mailman/listinfo/u-boot
   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Virtual addresses, u-boot, and the MMU

2009-09-03 Thread J. William Campbell
Becky Bruce wrote:

 On Sep 2, 2009, at 2:59 AM, Wolfgang Denk wrote:

 Dear J. William Campbell,

 In message 4a9d99b1.1010...@comcast.net you wrote:

 ...
 Becky then posted the summary of this discussion here:

 http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/50705
 ...
 In quick summary, for the next few years, we will require that all
 important physical addresses have corresponding virtual addresses.

 ...unless there are good reasons to deviate from that rule, but these
 cases are expected to be rare exceptions from the rule.


 In any case, I think we should be careful not to mix things: what  we
 are  discussing here are address mappings. What we are not discussing
 is specific memory properties like  being  cached/uncached,  guarded/
 non-guarded, etc.

 Such properties are important, too, but  they  need  to  get  handled
 through a separate interface.

 Here is where I am quite sure you are going to have a problem. In very
 many CPUs, cache control and memory management are joined at the hip.
 Some systems have no easy way to enable and disable (D,I) cache
 globally, it is only doable on a page or segment basis. The PPC 
 hardware
 has a relatively low cost way to do so, but not all architectures do.

 I am well aware of this problem, which is one of the reasons that the
 majority of systems is running  with  data  cache  turned  off.  Even
 PowerPC  has  DC off (at least after relocation to RAM), ARM does not
 implement cache support yet, etc.

 When we state that U-Boot is a boot loader and thus  should  be  kept
 simple,  this  more or less logically results in the consequence that
 if it's difficult to enable the DC (on some systems), then just don't
 do it, then.

 Nobody enforces you to enable caches when you find it hard to do.


 Very True. I did forget about the read being just a memory 
 reference. So
 if we desire the flash to be cached, it would have to normally be
 cached for reads to take advantage of the operation.

 ACK.

 Thanks for looking at this. It therefore seems to me that adding an
 uncache(virtual address) operation (that may return a substitute
 address for the actual write to the flash) followed by a
 restore_cache() operation inside the flash driver write routine 
 should
 work. The uncache routine would do nothing if the flash is not 
 cached to

 This is where Detlev's comment about using the chance to define a
 cache API comes into play.

 Yes, we probably should create a set of functions like

 enable_data_cache(address, size);
 and
 disable_data_cache(address, size);

 which would turn on resp. off the caching attributes on the given
 memory range.

 begin with, would globally turn off data cache if that is easy to 
 do, or
 would provide an alternate virtual address to be used in the write. 
 That

 This is where I disagree.

 I'm not really deep enough in the implementation details and thus
 would appreciate comments for example from Becky and Stefan. In my
 opinion, turning on or off the cache on an address range should be
 implemented as outlined above, i. e. as an operation changing the
 caching properties of the address range.

 This makes sense to me.  The disable function would need to flush the 
 range from the cache, but that's the only difficulty I forsee.  
 However, I dug up some AVR32 docs, and it looks like the whole dual 
 cacheable/CI mapping thing may be architectural.  The architecture 
 seems to specify a virtual memory map for privileged state, and part 
 of the VA range is not translated by the MMU, but has a default 
 translation.  There appear to be segments in the untranslated VA space 
 that map to the same PA, one cacheable and the other not.
Unfortunately, more than one architecture that has  problems turning off 
cache in an arbitrary region of memory. If a system has a global cache 
disable, that would be the easiest option. Since u-boot is single 
threaded, we don't have to worry about the cache not being turned back 
on because we lost control of the CPU. Alternatively, one could change 
the cache enable bits in an appropriate BAT if such a thing exists, 
and then change them back. However, I think you have to be open to the 
idea of an alternate address, as that may be the easiest way to go on 
many systems, and possibly the only way to go on others. In any case, 
Becky is quite correct in pointing out that flushing the cache for the 
appropriate VAs is necessary if you alter the contents of memory region 
that were previously cached.


 Using a completely different address range instead is a different
 thing, and not what I have in mind. I really dislike the idea of
 supporting alternate addresses in this context - even if this is
 what would be easiest to implement on some architectures.

I understand what you are saying here Wolfgang. However, I would plead 
the following possible mitigating circumstances make this not so bad. 
First, the rule will be that the usage will be very local, inside at 
most say 100

Re: [U-Boot] Virtual addresses, u-boot, and the MMU

2009-09-01 Thread J. William Campbell
  I have followed the recent discussions about problems in the CFI 
driver caused by the need to change the attributes of the address at 
which the flash is mapped. This discussion has raised some questions in 
my mind regarding the assumptions u-boot makes regarding the behavior of 
the addresses used in the code. I am writing this message for comments 
that may correct any mis-understandings I have. First, the addresses 
used by the u-boot program to reference memory and code are all 
virtual addresses (VA) because they go through the MMU  to become a 
physical address(PA). If there is no MMU, the virtual address and the 
physical address are identical.
The normal, or legacy,policy for u-boot is to arrange a memory 
map such that all physical addresses are mapped to some virtual address. 
Originally, the mapping were such that the VA was actually == the PA, 
but today on some CPUs, this is not possible. When the size of the 
physical address space exceeds the size of the virtual address space, 
the VA may not =- the PA numerically, but there is a one-to-one 
correspondence. It MAY also acceptable to map the same PA so that it 
appears more than once in the address space (VA), but if this is done, 
any references inside u-boot (such as in drivers) must be consistent, 
i.e. the flash chip must exist at exactly one virtual address from 
u-boots viewpoint. This is required because the u-boot drivers maintain 
pointers into the flash, and it is important to be able to compare these 
pointers to input virtual addresses. It is also necessary that the VA of 
the flash chip not change after the driver has been set up, or else any 
saved pointers will be wrong.
  Becky Bruce re-wrote the driver to use VAs instead of PAs. I am 
not exactly sure what this means, but I assume it meant allowing the VA 
referencing the flash
to be distinct from the PA where the flash lives (and may require 36 
bits on a PPC system to represent the PA). Does the driver re-map 
portions of the flash in order to access them? If the flash is really 
large, I can certainly see a need to do so. However, I assume on medium 
size flashes, it does not need to remap. In that case, don't all 
references just go through the MMU and get translated? The VA != PA, but 
from the point of view of u-boot, the VA is the only address that 
matters. The AVR32 certainly does not map flash dynamically, so it would 
not matter on that CPU.
  The issue with the CFI driver on the AVR32 is that it needs to 
disable cache on the address space of the flash memory when it is 
writing to the flash. This apparently is not trivial to do, but there is 
a second mapping that does have the cache off. Wolfgang has recommended 
the creation of a function to turn off the cache for the flash area, and 
also (presumably) one to turn the flash back on when the write is 
complete. Haavard has at present a function that returns an alternate 
address with the cache already off that addresses the same memory. This 
wouldn't cause a problem if the mapping happened immediately before the 
actual copy operation took place and was used for nothing else. However, 
if it happens early on in the driver, the address will not match the 
structure set up by the rest of the flash code using the non-translated 
address.
Therefore, I have the following questions: If the map_physmem() 
macro is removed from the driver on the AVR32, does the driver work if 
it is told that the flash PA=VA = the un-cached address? If not, why 
not? Shouldn't this be just like any CFI on an un-cached PPC address? 
The driver will be somewhat slower reading but otherwise it should work. 
If/when it does work, couldn't a map_in_cache() macro be placed directly 
in front of the read code that copies data from flash to other buffers. 
The macro would return an address of the same data referenced through a 
cached address if it exists. This address would go nowhere else and 
never be stored anywhere. This would speed up the copy operation for 
situations where it matters, and is applicable to all platforms that can 
do such a thing. The most general solution would be a call to 
map_in_cache/map_in_not_cache for both reads and writes in the CFI 
driver. These routines would return a substitute address (or the same 
input one), and may actually add another mapping dynamically, use a 
pre-existing appropriate mapping, just turn on/turn off data cache 
globally, or do nothing). At the end of the copy, map_restore() would 
put the map back  By default, the assumption would be that the flash is 
not cached and the macros do nothing. Sounds simple to me, what have I 
overlooked?

Bill Campbell
 Haavard Skinnemoen wrote:

 snip

   
 Right...I'm beginning to doubt that anyone is familiar enough with the
 u-boot code, since everyone seems to have their own opinion about how
 things are supposed to work.

 To summarize, here are the possible ways to fix the problem as I see it:
   - Use virtual 

Re: [U-Boot] Virtual addresses, u-boot, and the MMU

2009-09-01 Thread J. William Campbell
Wolfgang Denk wrote:
 Dear J. William Campbell,

 In message 4a9d5ef2.4030...@comcast.net you wrote:
   
   I have followed the recent discussions about problems in the CFI 
 driver caused by the need to change the attributes of the address at 
 which the flash is mapped. This discussion has raised some questions in 
 my mind regarding the assumptions u-boot makes regarding the behavior of 
 the addresses used in the code. I am writing this message for comments 
 that may correct any mis-understandings I have. First, the addresses 
 used by the u-boot program to reference memory and code are all 
 virtual addresses (VA) because they go through the MMU  to become a 
 physical address(PA). If there is no MMU, the virtual address and the 
 physical address are identical.
 

 Even if there is a MMU, we keep it switched off on most systems, or
 otherwise set it up in such a way that there is still a 1:1  mapping,
 i. e. the virtual address and physical address are identical, too.

 There have been several discussions concerning this topic on IRC
 (#u-boot at freenode); I'll try to summarize these here - not sure
 though if I don't miss anything: please feel free to complement what
 might be missing.

   
snip
 Becky then posted the summary of this discussion here:

 http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/50705


 Note that there was a general agreement among those who raised their
 voices.

   
In quick summary, for the next few years, we will require that all 
important physical addresses have corresponding virtual addresses. 
Some limited support for mapping in other resources may be provided at 
an operator interface level, but it will be quite limited. OK, seems 
reasonable to me.
   
 The normal, or legacy,policy for u-boot is to arrange a memory 
 map such that all physical addresses are mapped to some virtual address. 
 Originally, the mapping were such that the VA was actually == the PA, 
 but today on some CPUs, this is not possible. When the size of the 
 physical address space exceeds the size of the virtual address space, 
 the VA may not =- the PA numerically, but there is a one-to-one 
 correspondence. It MAY also acceptable to map the same PA so that it 
 appears more than once in the address space (VA), but if this is done, 
 

 This may or may not be possible. It may even make sense or be needed
 on some systems, and it may be impossible to do on others.

 In any case, I think we should be careful not to mix things: what  we
 are  discussing here are address mappings. What we are not discussing
 is specific memory properties like  being  cached/uncached,  guarded/
 non-guarded, etc.

 Such properties are important, too, but  they  need  to  get  handled
 through a separate interface.
   
Here is where I am quite sure you are going to have a problem. In very 
many CPUs, cache control and memory management are joined at the hip. 
Some systems have no easy way to enable and disable (D,I) cache 
globally, it is only doable on a page or segment basis. The PPC hardware 
has a relatively low cost way to do so, but not all architectures do.
   
   Becky Bruce re-wrote the driver to use VAs instead of PAs. I am 
 not exactly sure what this means, but I assume it meant allowing the VA 
 referencing the flash
 to be distinct from the PA where the flash lives (and may require 36 
 bits on a PPC system to represent the PA). Does the driver re-map 
 

 I think the information provided above sheds more light on this.
   
Yes, it did.
   
 portions of the flash in order to access them? If the flash is really 
 large, I can certainly see a need to do so. However, I assume on medium 
 size flashes, it does not need to remap. In that case, don't all 
 references just go through the MMU and get translated? The VA != PA, but 
 from the point of view of u-boot, the VA is the only address that 
 matters. The AVR32 certainly does not map flash dynamically, so it would 
 not matter on that CPU.
 

 OK.

   
   The issue with the CFI driver on the AVR32 is that it needs to 
 disable cache on the address space of the flash memory when it is 
 writing to the flash. This apparently is not trivial to do, but there is 
 

 Actually this is not specific to the AVR32, and so far  most  systems
 simply  do  not  enable  caches at all on the flash memory regions. I
 understand why the AVR32 solution is interesting, and I  think,  when
 we  try to find a solution for this we should use this chance to find
 a solution that also allows other systems to turn on  caches  on  the
 flash memory - things like loading the Linux kernel or ramdisk images
 etc. will benefit from that.
   
Full ACK.

snip

 While there are specific routines to  write  to  the  flash  (init,
 erase,  write),  there  is  no  specific  code  to read from flash.
 Reading is allowed everywhere by just  performing  load  instructions
 from this memory area. The CFI driver (nor any other flash driver