Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-15 Thread Uwe Bonnes
> "Matthias" == Matthias Welwarsky  writes:


Matthias> You will always have the best performance if you carefully
Matthias> tweak your write speed so that you reach optimum performance
Matthias> without ever producing a WAIT response. HLA adapters or
Matthias> CMSIS-DAP with SWD have a clear advantage here, because they
Matthias> can do a proper synchronous WAIT at the JTAG or SWD link layer
Matthias> and not up in the ADI protocol.

Hello,

Tweaking a non-HLA adapter sounds very heuristic. Provide wait times between
flash writes that are long enough to satisfy the maximum flash write delay in
the datasheet. This is the deterministic way. It is not as fast as tweaking,
but OpenOCD already has far too many parameters to tweak...
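
As a rough illustration of the deterministic approach, the sketch below estimates how many idle clock cycles would be needed for one access to last at least the maximum write time. The 100 microsecond maximum and the 35-bit DRSCAN length are the figures quoted elsewhere in this thread. Adding the idle cycles directly to the scan length is a simplifying assumption, and whether `dap memaccess` accepts values this large has not been checked.

```python
import math

T_WRITE_MAX_US = 100.0  # maximum word programming time quoted in this thread
SCAN_CYCLES = 35        # length of one DRSCAN quoted in this thread


def idle_cycles_for_delay(adapter_khz, t_write_max_us=T_WRITE_MAX_US,
                          scan_cycles=SCAN_CYCLES):
    """Idle cycles needed so that scan + idle lasts at least t_write_max_us.

    Simplified model (assumption): one access costs scan_cycles plus the
    idle cycles, all clocked at adapter_khz.
    """
    cycles_needed = math.ceil(t_write_max_us * adapter_khz / 1000.0)
    return max(0, cycles_needed - scan_cycles)


for khz in (1500, 2000, 3000, 8000):
    print(f"{khz} kHz: {idle_cycles_for_delay(khz)} idle cycles")
```

The required count grows linearly with the adapter clock, which is why this approach trades speed for determinism.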

Bye
-- 
Uwe Bonnes                b...@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik  Schlossgartenstrasse 9  64289 Darmstadt
- Tel. 06151 1623569 --- Fax. 06151 1623305 -

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
OpenOCD-devel mailing list
OpenOCD-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openocd-devel


Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-13 Thread Christopher Head
Yes, I used the default setting of 8000. However I think I have now figured out 
where the WAITs came from, and it’s not because the clock frequency was 
inherently too high. Rather, it is because the reset-init handler for the chip 
does the adapter_khz change. So on a two-chip chain, after the first chip runs 
its reset-init handler, the adapter goes up to 8000, then the second chip, 
which is not running on PLL yet, runs its reset-init handler with adapter 
already at 8000 and issues WAITs in there.
On March 13, 2018 2:02:37 PM PDT, Tomas Vanek via OpenOCD-devel wrote:
>Did you use 'adapter_khz 8000'  for the last test?
>I'm afraid that 8 MHz is too much (WAITs during reset-init).
-- 
Christopher Head



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-13 Thread Tomas Vanek via OpenOCD-devel

On 13.03.2018 21:14, Christopher Head wrote:
> On March 13, 2018 12:21:47 PM PDT, Tomas Vanek via OpenOCD-devel wrote:
>> Obviously faster DAP WAIT handling on USB HS.
>> The question remains: why are you getting DAP WAITs with algo, is the
>> reason
>> different adapter FT2232H vs FT2232C (should not be different except
>> faster turnaround but...) or is it a difference between STM32F722 and
>> F745 ?
>
> This is indeed an interesting question. I don’t have an F722 to test with. I
> also don’t have a 2232C.

Damn, I neglected that I had to decrease adapter_khz.

> I tried some experiments. First I tried mucking about with CM7_AHBSCR to
> increase the priority of AHBS accesses to the DTCM over CPU accesses; no
> change. Then I tried eliminating the AHBS altogether by putting the work area
> at 0x2001 (system SRAM); also no change. Finally I tried changing dap
> memaccess; this did help. When I changed from the default of 8 up to 44 (43 was
> not enough), I got no more WAITs and 150 kiB/s.

Did you use 'adapter_khz 8000' for the last test?
I'm afraid that 8 MHz is too much (WAITs during reset-init). I copied
the value from the STM32F4 config to be consistent.
ST-Link limits the clock to 4 MHz, so who knows if 8 MHz was ever really
tested. Anyway, the F4 does not suffer from the WAIT problem.
Try commenting out "adapter_khz 8000" in the -event reset-init definition.
2000 kHz should work with the default memaccess of 8
(if not, please find the memaccess limit).

> So, what now? Is that setting something that belongs in the F7 target file?

Especially if the value depends on internal flash timing over such a broad
range as 16 usec typ to 100 usec max ...





Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-13 Thread Christopher Head
On March 13, 2018 12:21:47 PM PDT, Tomas Vanek via OpenOCD-devel wrote:
>Obviously faster DAP WAIT handling on USB HS.
>The question remains: why are you getting DAP WAITs with algo, is the
>reason
>different adapter FT2232H vs FT2232C (should not be different except
>faster turnaround but...) or is it a difference between STM32F722 and
>F745 ?

This is indeed an interesting question. I don’t have an F722 to test with. I 
also don’t have a 2232C.

I tried some experiments. First I tried mucking about with CM7_AHBSCR to 
increase the priority of AHBS accesses to the DTCM over CPU accesses; no 
change. Then I tried eliminating the AHBS altogether by putting the work area 
at 0x2001 (system SRAM); also no change. Finally I tried changing dap 
memaccess; this did help. When I changed from the default of 8 up to 44 (43 was 
not enough), I got no more WAITs and 150 kiB/s.

So, what now? Is that setting something that belongs in the F7 target file?

-- 
Christopher Head



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-13 Thread Tomas Vanek via OpenOCD-devel

On 13.03.2018 19:04, Christopher Head wrote:

> OK, here are my test results. All are taken on a JTAG chain with two STM32F745
> chips, using an Olimex ARM-USB-TINY-H. In all cases the data is written to the
> second chip in the chain and is 473 kilobytes using the command “flash
> write_bank 1 filename.bin”. In all cases the default clock is used, so 2000 on
> tests without your patch and 8000 with it. In all cases where the write was
> successful, I also did a verify and it passed. If I didn’t mention how many DAP
> WAITs I saw in a particular case, it means there were none. In no case did I
> muck with the DAP memaccess setting.
>
> === Commit b8c7232b ===
> Reset init: no DAP WAITs.
>
> With algorithm: 2 DAP WAITs followed by debug regions are unpowered.
>
> Without algorithm: 2.900 kiB/s.
>
> === With only your patch from 4464 ===
> Reset init: 8 DAP WAITs but it seems happy.
>
> With algorithm: 120 DAP WAITs, 53.283 kiB/s.
>
> Without algorithm: 4.135 kiB/s.
>
> === With only my patch from 4463 ===
> Reset init: no DAP WAITs.
>
> With algorithm: 2 DAP WAITs followed by debug regions are unpowered.
>
> Without algorithm: 1 DAP WAIT, 35.239 kiB/s.
>
> === With both patches ===
> Reset init: 8 DAP WAITs but it seems happy.
>
> With algorithm: 122 DAP WAITs, 52.752 kiB/s.
>
> Without algorithm: 1 DAP WAIT, 55.177 kiB/s.

Obviously DAP WAIT handling is faster on USB HS.
The question remains: why are you getting DAP WAITs with the algo? Is the
reason the different adapter, FT2232H vs FT2232C (it should make no difference
except faster turnaround, but...), or is it a difference between STM32F722 and
F745?

Tom



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-13 Thread Christopher Head
OK, here are my test results. All are taken on a JTAG chain with two STM32F745 
chips, using an Olimex ARM-USB-TINY-H. In all cases the data is written to the 
second chip in the chain and is 473 kilobytes using the command “flash 
write_bank 1 filename.bin”. In all cases the default clock is used, so 2000 on 
tests without your patch and 8000 with it. In all cases where the write was 
successful, I also did a verify and it passed. If I didn’t mention how many DAP 
WAITs I saw in a particular case, it means there were none. In no case did I 
muck with the DAP memaccess setting.

=== Commit b8c7232b ===
Reset init: no DAP WAITs.

With algorithm: 2 DAP WAITs followed by debug regions are unpowered.

Without algorithm: 2.900 kiB/s.

=== With only your patch from 4464 ===
Reset init: 8 DAP WAITs but it seems happy.

With algorithm: 120 DAP WAITs, 53.283 kiB/s.

Without algorithm: 4.135 kiB/s.

=== With only my patch from 4463 ===
Reset init: no DAP WAITs.

With algorithm: 2 DAP WAITs followed by debug regions are unpowered.

Without algorithm: 1 DAP WAIT, 35.239 kiB/s.

=== With both patches ===
Reset init: 8 DAP WAITs but it seems happy.

With algorithm: 122 DAP WAITs, 52.752 kiB/s.

Without algorithm: 1 DAP WAIT, 55.177 kiB/s.
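
To put these rates in perspective, the sketch below converts each reported rate into an approximate write time for the 473 kilobyte image. It assumes the image size can be treated as 473 KiB and that the reported rate holds for the whole transfer. The two "with algorithm" runs that ended in "debug regions are unpowered" reported no rate and are left out.

```python
IMAGE_KIB = 473  # image size from the message, treated as KiB (assumption)

# (build, mode) -> reported rate in KiB/s, copied from the results above
rates_kib_s = {
    ("b8c7232b", "without algorithm"): 2.900,
    ("4464 only", "with algorithm"): 53.283,
    ("4464 only", "without algorithm"): 4.135,
    ("4463 only", "without algorithm"): 35.239,
    ("both", "with algorithm"): 52.752,
    ("both", "without algorithm"): 55.177,
}


def write_seconds(size_kib, rate_kib_s):
    """Approximate transfer time, assuming a constant rate."""
    return size_kib / rate_kib_s


for (build, mode), rate in rates_kib_s.items():
    print(f"{build:10s} {mode:18s} {write_seconds(IMAGE_KIB, rate):7.1f} s")
```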
-- 
Christopher Head



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-13 Thread Tomas Vanek via OpenOCD-devel

On 13.03.2018 10:40, Uwe Bonnes wrote:
> Just for reference: Can anybody test what number gdb reports for the
> "load" command in these circumstances? Thanks


Good point, Uwe. Here you go:

FT2232, JTAG, STM32F722-nucleo:
> reset init
...
> adapter_khz
adapter speed: 3000 kHz
> dap memaccess
memory bus access delay set to 8 tck
> load_image 64kib.bin 0x2000
65536 bytes written at address 0x2000
downloaded 65536 bytes in 0.637129s (100.451 KiB/s)


FT2232, SWD, STM32F722-nucleo, now with 'dap memaccess':

> reset init
...
> adapter_khz
adapter speed: 3000 kHz
> dap memaccess
memory bus access delay set to 8 tck
> load_image 64kib.bin 0x2000
65536 bytes written at address 0x2000
downloaded 65536 bytes in 0.552768s (115.781 KiB/s)

> flash write_image 64kib.bin 0x0803
not enough working area available(requested 76)
...
error writing to flash at address 0x0800 at offset 0x0003

> dap memaccess 9
memory bus access delay set to 9 tck
> flash write_image 64kib.bin 0x0807
not enough working area available(requested 76)
no working area available, can't do block memory writes
couldn't use block writes, falling back to single memory accesses
wrote 65536 bytes from file 64kib.bin in 0.629446s (101.677 KiB/s)




Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-13 Thread Matthias Welwarsky
On Tuesday, 13 March 2018 01:32:24 CET Tomas Vanek via OpenOCD-devel wrote:
> On 12.03.2018 21:53, Christopher Head wrote:
> > On March 10, 2018 11:25:15 PM PST, Tomas Vanek via OpenOCD-devel  wrote:
> I got much worse results with DAP WAIT on a slow old Intel Atom single
> core industrial PC:
> speed as slow as 3.618 KiB/s (without wait 48.723 KiB/s).

Just a general remark on WAIT with cheap USB shift register drivers: Even with
a proper implementation of WAIT handling, so that it is no longer treated as an
error, the performance will be abysmal. Especially for large block writes,
once you hit a WAIT condition, all following writes will be discarded. Since we
only check the result at the end of a block to save time on the fast path, a
lot of discarded writes then have to be replayed slowly and synchronously, and
that takes a lot of time.
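
The cost described above can be sketched with a toy model. This is not OpenOCD's actual code path: the function name and the per-write costs `fast_us` and `slow_us` are hypothetical, chosen only to show how one early WAIT in a queued block can dominate the total time.

```python
def block_write_time_us(n_writes, wait_index, fast_us, slow_us):
    """Toy model of one queued block of n_writes writes.

    wait_index is the position of the first write that receives WAIT, or
    None if no WAIT occurs. The whole block is always shifted out on the
    fast path; the stalled write and everything after it is then replayed
    one at a time at the slow, synchronous cost.
    """
    fast_pass = n_writes * fast_us
    if wait_index is None:
        return fast_pass
    replayed = n_writes - wait_index
    return fast_pass + replayed * slow_us


# Hypothetical numbers: 20 us per queued write, 1000 us per synchronous replay.
print(block_write_time_us(1000, None, 20, 1000))  # no WAIT in the block
print(block_write_time_us(1000, 10, 20, 1000))    # WAIT on the 11th write
```

With these made-up costs, a single early WAIT makes the block roughly fifty times slower, which matches the qualitative point that avoiding WAIT entirely is better than handling it.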

You will always have the best performance if you carefully tweak your write 
speed so that you reach optimum performance without ever producing a WAIT 
response. HLA adapters or CMSIS-DAP with SWD have a clear advantage here, 
because they can do a proper synchronous WAIT at the JTAG or SWD link layer 
and not up in the ADI protocol.

BR,
Matthias


-- 
Mit freundlichen Grüßen/Best regards,

Matthias Welwarsky
Project Engineer

SYSGO AG
Office Mainz
Am Pfaffenstein 14 / D-55270 Klein-Winternheim / Germany
Phone: +49-6136-9948-0 / Fax: +49-6136-9948-10
VoIP: SIP:m...@sysgo.com
E-mail: matthias.welwar...@sysgo.com / Web: http://www.sysgo.com


Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-13 Thread Uwe Bonnes
> "Tomas" == Tomas Vanek via OpenOCD-devel 
>  writes:

...
>> reset init
Tomas> ...
>> adapter_khz
Tomas> adapter speed: 3000 kHz
>> dap memaccess
Tomas> memory bus access delay set to 8 tck
>> targets f7.cpu flash write_image 64kib.bin 0x0802
Tomas> wrote 65536 bytes from file 64kib.bin in 0.743568s (86.071 KiB/s)
>> targets f4.cpu flash write_image 64kib.bin 0x0802
Tomas> wrote 65536 bytes from file 64kib.bin in 0.763147s (83.863 KiB/s)

Just for reference: Can anybody test what number gdb reports for the "load"
command in these circumstances?

Thanks
-- 
Uwe Bonnes                b...@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik  Schlossgartenstrasse 9  64289 Darmstadt
- Tel. 06151 1623569 --- Fax. 06151 1623305 -



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-12 Thread Tomas Vanek via OpenOCD-devel

On 12.03.2018 21:53, Christopher Head wrote:
> On March 10, 2018 11:25:15 PM PST, Tomas Vanek via OpenOCD-devel wrote:
>> I wouldn't call this case as an obscure one. The reason could be
>> insufficient device clock rate,
>> not very high adapter_khz. Anyway all these cases could be solved by
>> configuring
>> the device properly.
>
> I don’t think this is related to device clock speed. You will get a WAIT if
> there is a bus stall, and Flash programming is self timed. So I think it
> depends on the 16 (typical) to 100 (maximum) microsecond word programming time.
> If you can deliver 35 bits (length of a DRSCAN) plus the TMS transitions within
> the word programming time, then you will get a WAIT reply.

You're right that the self-timed flash write takes most of the required time.
Bus transport should take just one or two bus cycles (the second for clock
sync). It is quite different when using the algo.

>> One more concern: If programming by algo is usable on SWD only, JTAG
>> users should
>> set WORKAREASIZE to zero. But algos are used for verify, blank check
>> and
>> external memories as well.
>> This may impose a big penalty...
>
> Yes, this is unfortunate. The verify algorithm works fine for me, but of course
> it is a synchronous, rather than asynchronous, algorithm, so any silicon
> erratum exposed by bus arbitration or other weirdness would not apply there.
>
> In any case, 4463 makes this change. I get one DAP WAIT, but no more, with my
> FTDI at 2M, and programming works fine and verifies properly

Have you noticed the programming speed?

For testing I connected an STM32F722-nucleo and an STM32F413-nucleo to one
JTAG chain from an FT2232.
I configured a 128 MHz clock in the F7 reset-init
(http://openocd.zylin.com/4464) and lowered the max adapter_khz to 3000,
as my old FT2232C does not work well at 6000 kHz.

With algo:

> reset init
...
> adapter_khz
adapter speed: 3000 kHz
> dap memaccess
memory bus access delay set to 8 tck
> targets f7.cpu
> flash write_image 64kib.bin 0x0802
wrote 65536 bytes from file 64kib.bin in 0.743568s (*86*.071 KiB/s)
> targets f4.cpu
> flash write_image 64kib.bin 0x0802
wrote 65536 bytes from file 64kib.bin in 0.763147s (83.863 KiB/s)

Now with your patch and WORKAREASIZE 0 (both devices perform very
similarly, so I list just the F722):


> reset init
...
> adapter_khz
adapter speed: 3000 kHz
> dap memaccess
memory bus access delay set to 8 tck
> flash write_image 64kib.bin 0x0804
device id = 0x10006452
flash size = 512kbytes
not enough working area available(requested 76)
no working area available, can't do block memory writes
couldn't use block writes, falling back to single memory accesses
DAP transaction stalled (WAIT) - slowing down
wrote 65536 bytes from file 64kib.bin in 2.583704s (*24*.771 KiB/s)

However if you set longer memory access delay manually:

> dap memaccess 31
memory bus access delay set to 31 tck
> flash write_image 64kib.bin 0x0807
not enough working area available(requested 76)
no working area available, can't do block memory writes
couldn't use block writes, falling back to single memory accesses
wrote 65536 bytes from file 64kib.bin in 0.962217s (*66*.513 KiB/s)

I got much worse results with DAP WAIT on a slow old Intel Atom single 
core industrial PC:

speed as slow as 3.618 KiB/s (without wait 48.723 KiB/s).

FT2232 SWD transport (F7 only):

> reset init
...
> adapter_khz
adapter speed: 3000 kHz
> flash write_image 64kib.bin 0x0806
device id = 0x10006452
flash size = 512kbytes
not enough working area available(requested 76)
no working area available, can't do block memory writes
couldn't use block writes, falling back to single memory accesses
SWD DPIDR 0x5ba02477
Failed to write memory at 0x0806006c
error writing to flash at address 0x0800 at offset 0x0006

Too fast, DAP WAIT is an error on FTDI/SWD.

> adapter_khz 1500
adapter speed: 1500 kHz
> flash write_image 64kib.bin 0x0807
not enough working area available(requested 76)
no working area available, can't do block memory writes
couldn't use block writes, falling back to single memory accesses
wrote 65536 bytes from file 64kib.bin in 0.734222s (87.167 KiB/s)

Works. 46 SWCLK cycles / 1.5 MHz = 30.7 usec > t_flash
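
The arithmetic behind this observation can be checked with a short sketch. It takes the 46 SWCLK cycles per transaction from the line above and the 16 (typical) and 100 (maximum) microsecond word programming times quoted elsewhere in the thread. Treating the transaction period as simply cycles divided by clock is a simplification that ignores USB and host latency.

```python
T_PROG_TYP_US = 16.0   # typical word programming time quoted in the thread
T_PROG_MAX_US = 100.0  # maximum word programming time quoted in the thread


def transaction_us(cycles, adapter_khz):
    """Duration of one transaction of `cycles` clock cycles, in microseconds."""
    return cycles * 1000.0 / adapter_khz


for khz in (3000, 1500):
    t = transaction_us(46, khz)
    print(f"{khz} kHz: {t:.2f} us, "
          f"longer than typical: {t > T_PROG_TYP_US}, "
          f"longer than maximum: {t > T_PROG_MAX_US}")
```

Under this model the 3000 kHz case is slightly shorter than the typical programming time, which is consistent with the failure seen there, while 1500 kHz clears the typical time but not the datasheet maximum.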

And finally ST-Link with original adapter_khz values:

> reset init
...
adapter speed: 4000 kHz
> flash write_image 64kib.bin 0x0803
not enough working area available(requested 76)
no working area available, can't do block memory writes
couldn't use block writes, falling back to single memory accesses
wrote 65536 bytes from file 64kib.bin in 9.900800s (6.464 KiB/s)

Really slow in comparison to 110.128 KiB/s with algo.

Your change really speeds up non-algo flashing. Unfortunately, WAIT handling
on dumb adapters is far from effective, and a manual setting of extra
memaccess cycles depends heavily on the flash timing, which may vary with
flash wear-out, temperature, or whatever.



> (at least, it does once I work around the fact that my nasty multi target
> hacks have gone from

Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-12 Thread Christopher Head
On March 10, 2018 11:25:15 PM PST, Tomas Vanek via OpenOCD-devel wrote:
>I wouldn't call this case as an obscure one. The reason could be 
>insufficient device clock rate,
>not very high adapter_khz. Anyway all these cases could be solved by 
>configuring
>the device properly.

I don’t think this is related to device clock speed. You will get a WAIT if 
there is a bus stall, and Flash programming is self timed. So I think it 
depends on the 16 (typical) to 100 (maximum) microsecond word programming time. 
If you can deliver 35 bits (length of a DRSCAN) plus the TMS transitions within 
the word programming time, then you will get a WAIT reply.

>One more concern: If programming by algo is usable on SWD only, JTAG 
>users should
>set WORKAREASIZE to zero. But algos are used for verify, blank check
>and 
>external memories as well.
>This may impose a big penalty...

Yes, this is unfortunate. The verify algorithm works fine for me, but of course 
it is a synchronous, rather than asynchronous, algorithm, so any silicon 
erratum exposed by bus arbitration or other weirdness would not apply there.

In any case, 4463 makes this change. I get one DAP WAIT, but no more, with my 
FTDI at 2M, and programming works fine and verifies properly (at least, it does 
once I work around the fact that my nasty multi target hacks have gone from 
necessary to counterproductive).

-- 
Christopher Head



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-01 Thread Christopher Head
On March 1, 2018 2:55:37 AM PST, Tomas Vanek via OpenOCD-devel wrote:
>I meant possible errors induced from sticky state to target handling. 
>Maybe nonsense, ok.
>Christopher, did you get algo timeouts and unpowered dbg regions on
>FTDI

Yes, I got those messages when using the original, unmodified algorithm on 
FTDI. I also got those messages when using ByteBlaster, but only after 
modifying the algorithm to remove the CR/SR accesses from the loop; with those 
in the loop, ByteBlaster gave me lots of WAITs but generally no errors.

>OMG, we all forget about the important thing: JTAG clock to bus clock 
>ratio!!!
>I run the example app on the F722nucleo (which sets up some faster 
>clock), halt and *without* reset init
>test flashing - no WAITs this time!
>And it explains why STM32F4 worked - reset init sets up 64 MHz clock
>(unlike STM32F7 where the out-of-reset HSI 16 MHz clock is used).
>Seems like in this particular case the rule "adapter_khz <= F_CPU/6" is
>
>not sufficient.
>Not surprisingly if we want fast algo programming we also need 
>reasonably fast CPU clock

Interesting! I had used F4 in the past and I think it didn’t print WAIT 
messages, whereas WAIT messages showed up for F7. I never reported them because 
I thought WAIT was just normal flow control.

I tried, on the F7, switching to 64 MHz clock, as the F4 does, just now as an 
experiment, using the FTDI. I got a lot of WAITs, but it did program 
successfully. I only got 36 kilobytes per second, a poor comparison to the 135 
I managed earlier in direct mode, but at least it worked. The other target in 
the multitarget chain doesn’t seem to be working so well, but I will leave 
further investigation there until the big event handler multitarget brokenness 
stuff is fixed (I intend to try out the pending patches but have not had time 
yet).

By the way, where did the clock/6 come from? I don’t think I saw it in the ADI 
spec, the Cortex-M7 user guide or reference manual, or the F7 reference manual 
or datasheet. Just curious.

Side note, I saw that the F4 config file sets the upper four bits of 
RCC_PLLCFGR to zero, but the reference manual says they should be kept at their 
reset value and that the reset value of the register is 0x24003010. Maybe it 
doesn’t matter, but what if ST put something important but not user-pokable 
there?

-- 
Christopher Head



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-01 Thread Tomas Vanek via OpenOCD-devel

On 01.03.2018 10:25, Matthias Welwarsky wrote:
>>> WAITs are very strange. It looks like the stalled access to flash
>>> blocks also JTAG access to RAM.
>>> And SWD access doesn't suffer this silicon bug... who knows... maybe
>>> some NOPs in algo busy wait loop would fix it.
>>> BTW The programming algo should avoid bus stalling, shouldn't it?
>>
>> I was wondering about this.
>
> It's not really all that wondrous, since it's the DAP that stalls, and the
> DAP is single-issue. So of course once you run into a WAIT condition, it will
> only clear when the DAP has completed the access that caused it, and
> to complete it, you have to, well, wait. Note that when you receive the WAIT,
> it is the _previous_ access that is still pending, not the one you received
> the WAIT for.

But the write algo is designed to avoid the WAIT condition. And why does WAIT
not appear when the SWD transport is used?
To complicate things further: the same test on the STM32F413-nucleo runs
*without any problems*.

>>> What OpenOCD version do you use? It looks like your version misses
>>> Matthias' WAIT handling
>>> if you get such errors like algo timeout.
>
> I don't think algo-timeout has to do with the DAP stalling. Isn't algo-timeout
> just that the algorithm running on the core is not reporting back to the
> debugger? The debugger is waiting for the target to execute a breakpoint so
> that it gets back control, right?

I meant possible errors induced from sticky state to target handling.
Maybe nonsense, ok.

Christopher, did you get algo timeouts and unpowered dbg regions on FTDI?

OMG, we all forgot about the important thing: the JTAG clock to bus clock
ratio!!!
I ran the example app on the F722-nucleo (which sets up a faster clock),
halted, and tested flashing *without* reset init: no WAITs this time!
And it explains why the STM32F4 worked: its reset init sets up a 64 MHz clock
(unlike the STM32F7, where the out-of-reset 16 MHz HSI clock is used).
It seems that in this particular case the rule "adapter_khz <= F_CPU/6" is
not sufficient.
Not surprisingly, if we want fast algo programming we also need a
reasonably fast CPU clock.
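
For reference, the "adapter_khz <= F_CPU/6" rule of thumb works out as follows for the clock rates mentioned in this thread. The helper name is made up for illustration. As noted above, staying under this bound was not enough to avoid WAITs in this case, so the result is an upper limit at best, not a recommended setting.

```python
def max_adapter_khz(f_cpu_hz, divisor=6):
    """Upper bound on adapter_khz from the F_CPU/divisor rule of thumb."""
    return f_cpu_hz // (divisor * 1000)


for name, f_cpu_hz in (("HSI out of reset", 16_000_000),
                       ("F4 reset init", 64_000_000),
                       ("F7 with patch 4464", 128_000_000)):
    print(f"{name}: F_CPU = {f_cpu_hz // 1_000_000} MHz, "
          f"adapter_khz <= {max_adapter_khz(f_cpu_hz)}")
```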


Tom



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-03-01 Thread Matthias Welwarsky
On Thursday, 1 March 2018 07:27:49 CET Christopher Head wrote:
> On Thu, 1 Mar 2018 00:12:12 +0100
> 
> Tomas Vanek  wrote:
> > We should also focus to a question why algo flashing is broken on
> > FTDI. Some non STM devices (e.g. Kinetis) work with very similar algo
> > just perfectly on FTDI or any other adapter.
> 
> Sure. If someone could fix algorithm-based flashing, I would love to
> use it! I’m not convinced it will make things any faster for my
> specific set of hardware, but as long as it’s not broken and not
> slower, I don’t really care, and I understand it does make things
> faster for some people.
> 
> > WAITs are very strange. It looks like the stalled access to flash
> > blocks also JTAG access to RAM.
> > And SWD access doesn't suffer this silicon bug... who knows... maybe
> > some NOPs in algo busy wait loop would fix it.
> > BTW The programming algo should avoid bus stalling, shouldn't it?
> 
> I was wondering about this. 

It's not really all that wondrous, since it's the DAP that stalls, and the 
DAP is single-issue. So of course once you run into a WAIT condition, it will 
only clear when the DAP has completed the access that caused it, and
to complete it, you have to, well, wait. Note that when you receive the WAIT, 
it is the _previous_ access that is still pending, not the one you received 
the WAIT for. 

> The second data point is this: when using the algorithm-based approach,
> I attached an oscilloscope to TDO coming out of the F7. I was very
> surprised to see it *tristate* from time to time (at least, I’m pretty
> sure it tristated—it had a very slow rise time and settled to a voltage
> somewhat below VDD). I didn’t manage to correlate the time of the
> tristate to any particular higher level activity, but it definitely
> happened quite frequently during a programming operation and looked
> very weird. I’m pretty sure it didn’t happen during direct programming,
> only algorithm-driven programming. I found this suspicious, but again,
> didn’t look into it too much as the direct approach was very fast.

TDO just gets tristated when the target is not driving it, i.e. if you're not
in the SHIFT-IR/DR state. I see the same behaviour on the i.MX8MQ I'm currently
working with.

> > What OpenOCD version do you use? It looks like your version misses
> > Matthias' WAIT handling
> > if you get such errors like algo timeout.

I don't think algo-timeout has to do with the DAP stalling. Isn't algo-timeout 
just that the algorithm running on the core is not reporting back to the 
debugger? The debugger is waiting for the target to execute a breakpoint so 
that it gets back control, right?

> It was head of master from somewhere in the last week or two. I can
> look up the exact commit ID tomorrow if you want.

WAIT is included in 0.10.0

BR,
Matthias





Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-28 Thread Christopher Head
On Thu, 1 Mar 2018 00:12:12 +0100
Tomas Vanek  wrote:

> We should also focus to a question why algo flashing is broken on
> FTDI. Some non STM devices (e.g. Kinetis) work with very similar algo
> just perfectly on FTDI or any other adapter.

Sure. If someone could fix algorithm-based flashing, I would love to
use it! I’m not convinced it will make things any faster for my
specific set of hardware, but as long as it’s not broken and not
slower, I don’t really care, and I understand it does make things
faster for some people.

> WAITs are very strange. It looks like the stalled access to flash
> blocks also JTAG access to RAM.
> And SWD access doesn't suffer this silicon bug... who knows... maybe 
> some NOPs in algo busy wait loop would fix it.
> BTW The programming algo should avoid bus stalling, shouldn't it?

I was wondering about this. I have two weird data points I can add to
this discussion.

The first data point is this: remember early on in the thread where I
said I wasn’t able to successfully modify the algorithm to move the CR
write and SR read out of the loop? When I tried, it ran for a while and
then gave either debug regions unpowered or, more commonly, timeout
waiting for algorithm—the same messages I got using the FTDI adapter
with the original, unmodified algorithm, only with the modified
algorithm, it gave those messages using the ByteBlaster as well, which
had formerly been very robust. This seemed suspicious, and I’m
reasonably certain I got the modifications to the algorithm correct
(I’ve done plenty of Thumb assembly in the past), but I didn’t pay too
much attention as the direct approach was so fast.

The second data point is this: when using the algorithm-based approach,
I attached an oscilloscope to TDO coming out of the F7. I was very
surprised to see it *tristate* from time to time (at least, I’m pretty
sure it tristated—it had a very slow rise time and settled to a voltage
somewhat below VDD). I didn’t manage to correlate the time of the
tristate to any particular higher level activity, but it definitely
happened quite frequently during a programming operation and looked
very weird. I’m pretty sure it didn’t happen during direct programming,
only algorithm-driven programming. I found this suspicious, but again,
didn’t look into it too much as the direct approach was very fast.

The reference manual seems a little unclear on whether the algorithm as
written should stall the CPU or not. It says, “Any attempt to read the
Flash memory while it is being written or erased, causes the bus to
stall. Read operations are processed correctly once the program
operation has completed. This means that code or data fetches cannot be
performed while a write/erase operation is ongoing.” The obvious way to
interpret that sentence is that the STRH places the halfword into the
CPU write buffer, the DSB pushes it out as an AXI write cycle to the
Flash interface, the Flash interface immediately completes the bus
cycle and internally buffers the data while starting the burn, and then
the AXI is free to proceed to SR polling, while the *next* AXI cycle,
if any, accessing the Flash interface while still busy will be stalled.
However, an alternative way to interpret that is that the AXI write
cycle that delivers the halfword is stalled by the Flash interface, but
the CPU can continue execution because the data is in the CPU write
buffer, and the CPU can proceed before the bus cycle completes. In this
case the DSB would stall the CPU. BSY seems rather pointless in that
case, but I have learned not to assume anything when reading silicon
documentation (and it could be useful to avoid stalls if used without
DSB, I suppose). My guess would be the first interpretation is correct,
though, which means the algorithm as written should indeed not stall
the CPU ever, since code execution is from DTCM which is unrelated to
Flash. It should be possible to test which interpretation is correct by
performing a STRH followed by a DSB and then checking whether BSY is
set immediately afterwards; if yes, then interpretation 1 is correct,
while if no, then interpretation 2 is correct.
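
The experiment above reduces to one check on the FLASH_SR value read
right after the DSB. Here is a minimal Python sketch of that decision. It
is not OpenOCD code: the function name is made up for illustration, and
the BSY position (bit 16 of FLASH_SR) is an assumption taken from my
reading of the F4/F7 reference manuals, so please verify it for your part.

```python
# Hypothetical helper: classify the outcome of "STRH, DSB, read FLASH_SR".
FLASH_SR_BSY = 1 << 16  # assumed BSY bit position; check the reference manual


def interpretation_from_sr(sr_after_dsb: int) -> int:
    """Return 1 if BSY is still set right after the DSB, else 2."""
    return 1 if sr_after_dsb & FLASH_SR_BSY else 2


if __name__ == "__main__":
    print(interpretation_from_sr(0x0001_0000))  # BSY still set
    print(interpretation_from_sr(0x0000_0000))  # BSY already clear
```

The SR read would have to come from code running on the target. A read
issued over JTAG would most likely arrive long after a single
programming operation has finished and so would say nothing either way.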

Assuming interpretation 1 is correct, though, I don’t see anything
wrong with the algorithm code.

> What OpenOCD version do you use? It looks like your version misses 
> Matthias' WAIT handling
> if you get such errors like algo timeout.

It was head of master from somewhere in the last week or two. I can
look up the exact commit ID tomorrow if you want.
-- 
Christopher Head




Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-28 Thread Matthias Welwarsky
On Wednesday, 28 February 2018 21:51:29 CET Christopher Head wrote:
> I’ll be completely honest here: the reason I tried doing this is because the
> algorithm approach *broke* with the FTDI adapter, not because I wanted to
> improve speed. It kept issuing messages either timeout waiting for
> algorithm or debug regions unpowered. So I tried bypassing the algorithm
> and noticed that it was really slow, *then* tried speeding it up by moving
> the CR and SR accesses out of the loop and noticed that it became really
> really fast.

Not surprising, since the turn-around penalty over USB is quite substantial.
This is the main reason why OpenOCD uses all the "STICKY" features, which is
OK in principle for throughput but makes error recovery really cumbersome.

And the throughput can be quite substantial. I've timed uploads to DDR memory
at 30 MHz JTAG clock to reach close to 1 MB/s, as long as you just push data
out and don't need to flush the queue to turn around and read back status
registers.

BR,
Matthias



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-28 Thread Christopher Head
On February 28, 2018 8:10:09 AM PST, Andreas Bolsch wrote:
>To be sure I just did the following tests (openOCD, current head, 
>integrated ST-Link v2-1, 4 MHz SWD clock):
>nucleo-f767zi, 2 MByte random data: prog: 140 kBytes/s, read: 150 
>kBytes/s
>disco-f412g, 1 MByte random data: prog: 134 kBytes/s, read: 158
>kBytes/s
>
>Then STM32CubeProgrammer (defaults, Linux host, integrated ST-Link 
>v2-1):
>disco-f412g, 1 MByte random data: prog. 133 kBytes/s, read: 150
>kBytes/s
>
>And finally openOCD with algorithm disabled, anything else as before:
>disco-f412g, 1 MByte random data: prog. 1 kByte/s (yes, no kidding, 
>ONE!)
>
>All tests above with SWD, not JTAG.

They were also done with an HLA. Not all of us have the option to use such an
adapter, for various reasons, nor do all of us have the option of using SWD
instead of JTAG.

>That the direct register approach is quite slow isn't surprising.
>That's like playing ping-pong over USB for every single bit.

Word, actually. Not bit. At least for the ByteBlaster and FTDI adapters. And 
with the CR and SR accesses pulled out of the loop, it turns into a single 
giant call to target_write_memory to write the entire image, which AFAIK just 
shovels words (or halfwords, at 16× parallelism) into the DRW in TAR 
autoincrement mode.

Anyway, your test results with algorithm on an HLA seem to give roughly the 
same performance (135 kilobytes per second if you don’t include erase time—did 
you?) as I get without algorithm on an FTDI.

>The main benefit of the algorithm approach is that data transport and
>programming ("real" programming with CPU stall) run simultaneously. Of
>course, this can only work smoothly if the programming adapter does
>support this "streaming" approach, so it won't work reasonably well with
>a low-level adapter.

I’ll be completely honest here: the reason I tried doing this is because the 
algorithm approach *broke* with the FTDI adapter, not because I wanted to 
improve speed. It kept issuing messages either timeout waiting for algorithm or 
debug regions unpowered. So I tried bypassing the algorithm and noticed that it 
was really slow, *then* tried speeding it up by moving the CR and SR accesses 
out of the loop and noticed that it became really really fast.

So while the algorithm approach seems really nice conceptually, in practice, 
for me, it doesn’t work, so I took the shortest path to something that *would* 
work, then discovered it could be fast anyway.

>Regarding the parallelism I'd suggest to leave the parallelism by
>default as it currently is, i. e. 16.
>Anything else would be a pitfall for the unaware user. The assumption
>that most users will use 2.4V to 3.3V supply is still valid, I guess.
>If it were configurable, 32 wouldn't give substantially higher speed
>(well, at least if a "good" programming adapter is used) anyway.

Fair enough. I never wanted to change the default anyway. I just wanted to 
provide the user with the ability to change it should they wish. Does this seem 
reasonable to you?

-- 
Christopher Head



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-28 Thread Andreas Bolsch
These figures are quite surprising. I've made a lot of benchmarks with a 
pile of discovery boards, mainly F4 and F7, some L4. Since my focus was 
on external spi flash, I did not record the results for the internal 
flash, but as far as I recall, both programming *AND* read (via
write_bank, read_bank) with the algorithm approach gave approx. 120-160
kBytes/s.


To be sure I just did the following tests (openOCD, current head, 
integrated ST-Link v2-1, 4 MHz SWD clock):
nucleo-f767zi, 2 MByte random data: prog: 140 kBytes/s, read: 150 
kBytes/s

disco-f412g, 1 MByte random data: prog: 134 kBytes/s, read: 158 kBytes/s

Then STM32CubeProgrammer (defaults, Linux host, integrated ST-Link 
v2-1):

disco-f412g, 1 MByte random data: prog. 133 kBytes/s, read: 150 kBytes/s

And finally openOCD with algorithm disabled, anything else as before:
disco-f412g, 1 MByte random data: prog. 1 kByte/s (yes, no kidding, 
ONE!)


All tests above were done with SWD, not JTAG. Some weeks ago I did tests
with ST-Link reflashed to JLink and with JLink v8 and JLink v9, but the
results were rather disappointing. Some gave about the same speed, some
were slightly to moderately slower than the ST-Link. I also ran some
tests with ST-Link v2 clones; they gave roughly the same speed as the
integrated ST-Link.


The surprising fact is that I got the very same limit (almost precisely 
150 kBytes/s) for external spi flash programming and reading (both with 
the QSPI interface and bitbanging SPI). This apparently indicates that 
the "real" programming time has almost no impact on the observed speed. 
What matters is the transport via USB, the ST-Link adapter and SWD 
clock.


The datasheet for f767 says typ. 16 us per programming operation, so 8
us per byte for parallelism 16, or 125 kBytes/s. I.e. openOCD already
operates at the hardware-imposed limit, and the programming time is
almost completely absorbed by the data transfer. Quite excellent, I'd
say.
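
The arithmetic behind the 125 kBytes/s figure can be written out as a
small Python sketch. The function name is illustrative only, 1 kByte is
taken as 1000 bytes, and applying the same 16 us typical time to other
widths is an assumption rather than something stated above.

```python
def max_prog_rate_kbytes_per_s(t_prog_us: float, parallelism_bits: int) -> float:
    """Upper bound on programming rate in kBytes/s (1 kByte = 1000 bytes)."""
    bytes_per_op = parallelism_bits // 8
    # bytes per microsecond * 1e6 us/s / 1000 bytes per kByte
    return bytes_per_op * 1000.0 / t_prog_us


if __name__ == "__main__":
    # 16 us per operation is the typical figure quoted above for the f767.
    for bits in (8, 16, 32):
        print(bits, max_prog_rate_kbytes_per_s(16.0, bits))
```

With 16 us per operation and 16-bit parallelism this gives 2 bytes per
16 us, i.e. 8 us per byte, i.e. 125 kBytes/s.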


That the direct register approach is quite slow isn't surprising. That's 
like playing ping-pong over USB for every single bit. The main benefit 
of the algorithm approach is that data transport and  programming 
("real" programming with CPU stall) run simultaneously. Of course, this 
can only work smoothly if the programming adapter does support this 
"streaming" approach, so it won't work reasonably well with a low-level 
adapter.


Regarding the parallelism, I'd suggest leaving the default as it
currently is, i.e. 16.
Anything else would be a pitfall for the unaware user. The assumption
that most users will use 2.4V to 3.3V supply is still valid, I guess. If
it were configurable, 32 wouldn't give substantially higher speed (well,
at least if a "good" programming adapter is used) anyway.


BTW: "parallelism" apparently means ***maximum*** parallelism, cf. 
rm0081, 1.5.2:


"Parallelism is the maximum number of bits that may be programmed to 0 
in one step during
a program or erase operation. The maximum program/erase parallelism is 
limited by the
supply voltage and by whether the external VPP supply is used or not.
..."


Hence "limited by" actually means "limited ***above*** by", and the 
table indicates the maximum allowed value, not the exact value to use.


On 2018-02-27 21:50, Christopher Head wrote:

> As for performance, I have two data points so far.
>
> First, using a ByteBlaster clone, I was able to achieve about 6
> kilobytes per second using the algorithm and about 10 using optimized
> direct programming (the original direct code got about 3).
>
> Second, using an Olimex ARM-USB-TINY-H (FTDI-based), I had to reduce
> the JTAG clock *massively* in order to get the algorithm approach to
> even work at all (otherwise it would see a mix of timeout waiting for
> algorithm and debug regions unpowered), but optimized direct
> programming at the default 2 MHz JTAG clock got me 30 kilobytes per
> second, much more than the algorithm approach at the reduced clock
> speed.
>
> Both of the above tests were made at 16× parallelism. Repeating the
> Olimex test with the optimized direct code at 32× parallelism yielded
> 84 kilobytes per second.




Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-28 Thread Uwe Bonnes
> "Freddie" == Freddie Chopin  writes:

...
Freddie> Reference Manual is misleading here, but datasheet is even more
Freddie> confusing. If you look at any STM32F7 datasheet, it says that
Freddie> max Programming voltage Vprog for 16x and 8x parallelism is 3.6
Freddie> V, but for 32x parallelism it is 3 V. I suspect that this is a
Freddie> typo and in fact for all parallelism values max programming
Freddie> voltage is the same - 3.6 V.

Freddie> 5.3.13 Memory characteristics Table 48. Flash memory
Freddie> programming (numbers don't have to match your version of
Freddie> datasheet exactly)

I have brought up that problem on the ST forum.

Let's see whether ST comments or reacts.

Bye
-- 
Uwe Bonnes b...@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik  Schlossgartenstrasse 9  64289 Darmstadt
- Tel. 06151 1623569 --- Fax. 06151 1623305 -



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-28 Thread Matthias Welwarsky
On Tuesday, 27 February 2018 23:23:47 CET Christopher Head wrote:
> On February 27, 2018 2:42:25 AM PST, Matthias Welwarsky wrote:
> >I'm guessing that the BUSY check was done explicitly to avoid a JTAG
> >WAIT, which was an error condition not long ago. It might still break
> >with SWD.
> 
> Ah, I didn’t know that WAIT was ever considered an error. From reading the
> ARM debug infrastructure spec, it looked more like a flow control
> mechanism. I see now that OpenOCD appears to enable ORUNDETECT (at least I
> think it does, based on dap_dp_init in arm_adi_v5.c).

Before 0.10.0, WAIT was indeed an error condition. I wrote the support for
WAIT for JTAG transport. It's a bit tricky: for performance reasons, we
have to use deep queues and ORUNDETECT, which considerably changes the DAP
behaviour due to its 'stickiness' and requires complex machinery to detect
and replay the transactions following and including the one causing the WAIT.
With an active probe that has knowledge of the DAP protocol it's dead simple,
but for USB-connected shift registers - not so.

> According to the ADI specification, SWD also has a WAIT response, which it
> issues in case a previous transaction is outstanding. It says just the same
> as JTAG: if WAIT is received, normally a debugger just resends the same
> transaction. Although using sticky overrun mode changes the format a bit so
> that WAIT is followed by a data packet, which it would not be with
> ORUNDETECT cleared, and you have to clear the sticky status bit.

I never got the opportunity to extend the WAIT code to also cover SWD. I 
simply don't have a platform using SWD transport that is not using CMSIS-DAP 
or another high-level adapter. But if you have the motivation - just go ahead, 
I'll be happy to assist.

BR,
Matthias

-- 
Mit freundlichen Grüßen/Best regards,

Matthias Welwarsky
Project Engineer

SYSGO AG
Office Mainz
Am Pfaffenstein 14 / D-55270 Klein-Winternheim / Germany
Phone: +49-6136-9948-0 / Fax: +49-6136-9948-10
VoIP: SIP:m...@sysgo.com
E-mail: matthias.welwar...@sysgo.com / Web: http://www.sysgo.com
_
 
Web: https://www.sysgo.com
Blog: https://www.sysgo.com/blog
Events: https://www.sysgo.com/events
Newsletter: https://www.sysgo.com/newsletter 
_
 
Handelsregister/Commercial Registry: HRB Mainz 90 HRB 8066
Vorstand/Executive Board: Etienne Butery (CEO), Kai Sablotny (COO)
Aufsichtsratsvorsitzender/Supervisory Board Chairman: Marc Darmon
USt-Id-Nr./VAT-Id-No.: DE 149062328



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-27 Thread Christopher Head
On February 27, 2018 2:42:25 AM PST, Matthias Welwarsky wrote:
>I'm guessing that the BUSY check was done explicitly to avoid a JTAG
>WAIT, which was an error condition not long ago. It might still break
>with SWD.

Ah, I didn’t know that WAIT was ever considered an error. From reading the ARM 
debug infrastructure spec, it looked more like a flow control mechanism. I see 
now that OpenOCD appears to enable ORUNDETECT (at least I think it does, based 
on dap_dp_init in arm_adi_v5.c).

According to the ADI specification, SWD also has a WAIT response, which it 
issues in case a previous transaction is outstanding. It says just the same as 
JTAG: if WAIT is received, normally a debugger just resends the same 
transaction. Although using sticky overrun mode changes the format a bit so 
that WAIT is followed by a data packet, which it would not be with ORUNDETECT 
cleared, and you have to clear the sticky status bit.
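
The simple case described above (ORUNDETECT cleared, resend on WAIT) can
be sketched in Python as follows. This is a toy model, not OpenOCD code:
`Ack`, `transact_with_retry` and `make_fake_dap` are hypothetical names,
the retry limit is arbitrary, and the sticky-overrun case (clearing the
sticky flag, replaying queued transactions) is deliberately not modelled.

```python
from enum import Enum


class Ack(Enum):
    OK = "OK"
    WAIT = "WAIT"
    FAULT = "FAULT"


def transact_with_retry(send, max_retries=8):
    """Call send() until it is acknowledged OK; resend unchanged on WAIT."""
    for _ in range(max_retries + 1):
        ack, data = send()
        if ack is Ack.OK:
            return data
        if ack is Ack.FAULT:
            raise RuntimeError("DAP returned FAULT")
        # Ack.WAIT: previous transaction still outstanding, so try again.
    raise TimeoutError("DAP still answering WAIT after retries")


def make_fake_dap(waits, value):
    """Hypothetical stand-in for a link: answers WAIT `waits` times, then OK."""
    state = {"left": waits}

    def send():
        if state["left"] > 0:
            state["left"] -= 1
            return Ack.WAIT, None
        return Ack.OK, value

    return send


if __name__ == "__main__":
    print(hex(transact_with_retry(make_fake_dap(2, 0x1234))))
```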

-- 
Christopher Head



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-27 Thread Christopher Head
On February 27, 2018 1:10:19 AM PST, Antonio Borneo wrote:
>Regarding your proposal to get rid of the write algorithm, I'm a
>little sceptical it is not needed.
>I would like to see the optimized direct programming code and get some
>performance measurement before going for it.

Sure. Do you want to see this on Zylin, or somewhere else?

As for performance, I have two data points so far.

First, using a ByteBlaster clone, I was able to achieve about 6 kilobytes per 
second using the algorithm and about 10 using optimized direct programming (the 
original direct code got about 3).

Second, using an Olimex ARM-USB-TINY-H (FTDI-based), I had to reduce the JTAG 
clock *massively* in order to get the algorithm approach to even work at all 
(otherwise it would see a mix of timeout waiting for algorithm and debug 
regions unpowered), but optimized direct programming at the default 2 MHz JTAG 
clock got me 30 kilobytes per second, much more than the algorithm approach at 
the reduced clock speed.

Both of the above tests were made at 16× parallelism. Repeating the Olimex test 
with the optimized direct code at 32× parallelism yielded 84 kilobytes per 
second.

All of the above numbers came from program commands, which means they are lower
than the raw programming speed because they include the time taken for erasing
in the total time and throughput numbers. In all cases I used program with the
verify option to make sure the data was correct. My sample was about half a meg
of data.

-- 
Christopher Head



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-27 Thread Christopher Head
On February 27, 2018 1:28:01 AM PST, Freddie Chopin wrote:
>5.3.13 Memory characteristics
>Table 48. Flash memory programming
>(numbers don't have to match your version of datasheet exactly)

Oh, that is very interesting. I missed that table the first time. When I looked 
at the reference manual, the table in there makes it look like it could be 
unsafe to use too small a parallelism setting (e.g. 16× at 3.3V might damage 
the Flash), but the datasheet suggests it’s fine. And yes, it seems they 
contradict in that the RM says 32×@3.3 is optimal while the DS says it is 
prohibited.

Regardless, I think we should just let the board file choose. Any objections to 
using the bus width number for this purpose? I was thinking we could use the 
chip width parameter in future to support STM32H7, where writes have to be 128 
bits wide in order to prevent ECC errors—we could set chip width to 1 for 
F2/F4/F7 and 16 for H7, and the Flash code could recognize that difference and 
act accordingly; meanwhile bus width could be 1, 2, 4, or 8 to set the 
parallelism.
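
To illustrate the bus width idea, here is a minimal Python sketch of how
a bus width of 1, 2, 4 or 8 bytes could map onto the PSIZE field. It is
not the stm32f2x.c code: the names are hypothetical, and the placement of
PSIZE in FLASH_CR bits [9:8] with encodings 0..3 for x8..x64 is my
reading of RM0090 and should be checked against the manual for your part.

```python
FLASH_CR_PSIZE_SHIFT = 8  # assumed: PSIZE occupies FLASH_CR bits [9:8]

# bus width in bytes -> PSIZE encoding (x8, x16, x32, x64)
PSIZE_BY_BUS_WIDTH = {1: 0b00, 2: 0b01, 4: 0b10, 8: 0b11}


def psize_field(bus_width_bytes: int) -> int:
    """Return the PSIZE bits, already shifted into place, for a bus width."""
    try:
        return PSIZE_BY_BUS_WIDTH[bus_width_bytes] << FLASH_CR_PSIZE_SHIFT
    except KeyError:
        raise ValueError(
            f"unsupported bus width {bus_width_bytes}; expected 1, 2, 4 or 8"
        ) from None


if __name__ == "__main__":
    for width in (1, 2, 4, 8):
        print(width, hex(psize_field(width)))
```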
-- 
Christopher Head



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-27 Thread Antonio Borneo
On Tue, Feb 27, 2018 at 10:28 AM, Freddie Chopin wrote:
> On Mon, 2018-02-26 at 16:23 -0800, Christopher Head wrote:
>> Hi all,
>> I was looking into an issue with Flash programming on the STM32F7. I
>> discovered some quite odd results.
>>
>> First, I discovered that OpenOCD always uses 16-bit parallelism.
>> There is a comment at the top of stm32f2x.c stating that this was
>> chosen for compatibility with the widest possible range of VDD
>> values, but I simply can’t see how this is true. The
>> STM32F205/215/207/217 Flash programming manual PM0059 rev 5, the
>> STM32F405/415/407/417/427/437/429/439 reference manual RM0090 rev 15,
>> and the STM32F75/74 reference manual RM0385 rev 6 all contain exactly
>> the same table, which says 64× shall be used with external VPP, 32×
>> shall be used with VDD in [2.7,3.6], 16× shall be used with VDD in
>> [2.1,2.7], and 8× shall be used with VDD in [1.8,2.1]. I imagine an
>> awful lot of STM32s are probably operated at 3.3 volts, and that is
>> *not* in the legal VDD range for 16× parallelism. Am I
>> misunderstanding something here?
>
> Reference Manual is misleading here, but datasheet is even more
> confusing. If you look at any STM32F7 datasheet, it says that max
> Programming voltage Vprog for 16x and 8x parallelism is 3.6 V, but for
> 32x parallelism it is 3 V. I suspect that this is a typo and in fact
> for all parallelism values max programming voltage is the same - 3.6 V.
>
> 5.3.13 Memory characteristics
> Table 48. Flash memory programming
> (numbers don't have to match your version of datasheet exactly)

Agreed, quite confusing.
One more reason not to use the voltage value in scripts for the
selection, and to stick to the programming width.

By the way, the flash erase algorithm is impacted by voltage<=>width too!

Antonio



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-27 Thread Matthias Welwarsky
On Tuesday, 27 February 2018 01:23:08 CET Christopher Head wrote:
> Second, I discovered that, in both algorithm-driven mode and direct
> programming mode, the loop writes to CR, then writes one halfword of data
> to the target address, then checks BSY and the error flags in SR. However,
> this seems unnecessary. CR doesn’t magically change on its own; PG and
> PSIZE can be set once and then many writes performed in a block, increasing
> efficiency. Also, it is not necessary to check BSY after each write. Step 3
> of the Flash programming sequence is to “perform the data write
> operation(s)”, which can be plural. If you manage to deliver data too fast,
> the Flash hardware stalls the AHB or AXI bus cycles doing the subsequent
> writes, which eventually translates into a WAIT JTAG response (in the
> direct programming case) or a CPU execution stall (in the algorithm-driven
> case), which is a reasonable flow control mechanism. The error bits in SR
> are also cumulative. Taken together, all this means that one can simply
> write CR once, write all the data, and then check SR afterwards, waiting
> for the last write to finish and examining the error flags. Once modifying
> the code to do this, I then discovered that direct-mode programming with
> these changes is actually faster than algorithm-based programming without
> them (I was not able to successfully modify the algorithm to omit these
> extra operations, but I can’t see it making a whole lot of difference to
> the execution time in algorithm mode).

I'm guessing that the BUSY check was done explicitly to avoid a JTAG WAIT,
which was an error condition not long ago. It might still break with SWD.
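
To put rough numbers on the restructuring described in the quoted
paragraph, here is a small Python sketch that only counts debug accesses.
It is a counting model, not OpenOCD code: the function names are made up,
and it assumes a single SR read suffices each time BSY is polled.

```python
def per_halfword_loop(n_halfwords: int) -> tuple:
    """CR write, data write and SR read for every halfword."""
    accesses = 3 * n_halfwords
    turnarounds = n_halfwords  # every SR read needs a read-back from the target
    return accesses, turnarounds


def hoisted_loop(n_halfwords: int) -> tuple:
    """One CR write, all data writes, then one SR read at the end."""
    accesses = 1 + n_halfwords + 1
    turnarounds = 1
    return accesses, turnarounds


if __name__ == "__main__":
    n = 512 * 1024 // 2  # half a megabyte written as halfwords
    print(per_halfword_loop(n))
    print(hoisted_loop(n))
```

The access count drops by about a factor of three, but the larger effect
on a USB adapter is probably the number of read-backs, since each one
forces the queue to be flushed.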

-- 
Mit freundlichen Grüßen/Best regards,

Matthias Welwarsky
Project Engineer




Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-27 Thread Freddie Chopin
On Mon, 2018-02-26 at 16:23 -0800, Christopher Head wrote:
> Hi all,
> I was looking into an issue with Flash programming on the STM32F7. I
> discovered some quite odd results.
> 
> First, I discovered that OpenOCD always uses 16-bit parallelism.
> There is a comment at the top of stm32f2x.c stating that this was
> chosen for compatibility with the widest possible range of VDD
> values, but I simply can’t see how this is true. The
> STM32F205/215/207/217 Flash programming manual PM0059 rev 5, the
> STM32F405/415/407/417/427/437/429/439 reference manual RM0090 rev 15,
> and the STM32F75/74 reference manual RM0385 rev 6 all contain exactly
> the same table, which says 64× shall be used with external VPP, 32×
> shall be used with VDD in [2.7,3.6], 16× shall be used with VDD in
> [2.1,2.7], and 8× shall be used with VDD in [1.8,2.1]. I imagine an
> awful lot of STM32s are probably operated at 3.3 volts, and that is
> *not* in the legal VDD range for 16× parallelism. Am I
> misunderstanding something here?

Reference Manual is misleading here, but datasheet is even more
confusing. If you look at any STM32F7 datasheet, it says that max
Programming voltage Vprog for 16x and 8x parallelism is 3.6 V, but for
32x parallelism it is 3 V. I suspect that this is a typo and in fact
for all parallelism values max programming voltage is the same - 3.6 V.

5.3.13 Memory characteristics
Table 48. Flash memory programming
(numbers don't have to match your version of datasheet exactly)
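
For reference, the reference-manual table quoted above can be written as
a lookup. This Python sketch is illustrative only: the function name is
hypothetical, it treats the table entries as the widest setting allowed
in each range (an assumption about how to read the table), it resolves
the shared boundary values in favour of the wider setting, and it ignores
any VDD requirement that comes with external VPP.

```python
def max_parallelism_bits(vdd: float, external_vpp: bool = False) -> int:
    """Widest program/erase parallelism for a supply voltage, per the RM table."""
    if external_vpp:
        return 64
    if 2.7 <= vdd <= 3.6:
        return 32
    if 2.1 <= vdd < 2.7:
        return 16
    if 1.8 <= vdd < 2.1:
        return 8
    raise ValueError(f"VDD {vdd} V is outside the 1.8 V to 3.6 V range")


if __name__ == "__main__":
    for vdd in (1.8, 2.4, 3.3):
        print(vdd, max_parallelism_bits(vdd))
```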

Regards,
FCh



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-27 Thread Antonio Borneo
On Tue, Feb 27, 2018 at 7:03 AM, Christopher Head wrote:
> On Mon, 26 Feb 2018 16:23:08 -0800
> Christopher Head wrote:
>
>> 2. Allow the user to set the parallelism level with a new stm32f2x
>> subcommand, since only the board config knows what VDD is being
>> supplied.

Hi Christopher,
I had something similar in my TODO list, but never got time to code it.
I agree that 16-bit parallelism is NOT the safest case. It is supposed
to fail writing to flash if the device is powered at 1.8 V.
The right setup should use 8 bits as default, compatible with all the
possible voltage ranges. But this would increase the programming time
in 3.3 V systems.
The first step in fixing this is to have the write width selectable as 8,
16, 32 or 64 bits.
I prefer to let the user select the width (not the voltage) because the
relationship width<=>voltage could be different in future devices.

Actually, some JTAG dongles are able to read the voltage of the target,
but this feature is not always present and, depending on the setup,
the wire that senses the voltage could be unconnected. I would not
rely on the JTAG adapter to know the target voltage.

> Having thought it over a little more, perhaps we could use the bus
> width parameter to the flash bank command for this purpose instead of
> a new stm32f2x subcommand? Then add a settable variable which gets
> passed through the shipped target config files?

The user should set the width in the board file (depending on board
voltage) using a variable that is then passed to the target file.
In the target file the variable should be checked and given a reasonable
default if it is not set. Should we keep 16 bits for backward
compatibility or 8 bits for safety reasons?

Then, how to pass the value: sub-command or "flash bank" parameter or
even "target" parameter?
I personally prefer adding it to "flash bank". We could easily handle
the case of banks that require different values (I do not see this case
today).
But I would not reject other proposals, if well motivated.

Regarding your proposal to get rid of the write algorithm, I'm a
little sceptical that it is not needed.
I would like to see the optimized direct programming code and get some
performance measurements before going for it.

Best Regards,
Antonio

> --
> Christopher Head



Re: [OpenOCD-devel] STM32F2/4/7 Flash programming

2018-02-26 Thread Christopher Head
On Mon, 26 Feb 2018 16:23:08 -0800
Christopher Head wrote:

> 2. Allow the user to set the parallelism level with a new stm32f2x
> subcommand, since only the board config knows what VDD is being
> supplied.

Having thought it over a little more, perhaps we could use the bus
width parameter to the flash bank command for this purpose instead of
a new stm32f2x subcommand? Then add a settable variable which gets
passed through the shipped target config files?
-- 
Christopher Head

