Re: STM32MP1 boot slow

2020-03-30 Thread Marek Vasut
On 3/30/20 10:05 AM, Patrick DELAUNAY wrote:
> Hi,

Hi,

>> From: Marek Vasut 
>> Sent: vendredi 27 mars 2020 19:21
>>
>> On 3/27/20 4:35 PM, Patrick DELAUNAY wrote:
>>> Hi Marek,
>>
>> Hi,
>>
>> [...]
>>
>>>>> I would like to see patches to enable the cache. We did this some
>>>>> years ago in a Chromebook and it made a big difference. It is not
>>>>> that hard.
>>>>
>>>> ACK. Why did the chromebook patches never make it upstream ?
>>>
>>> Work in progress
>>> https://gitlab.denx.de/u-boot/custodians/u-boot-stm/-/commit/3399fb37c3b7db6e99118766c4d1cd5e742ecc8f
>>>
>>> with improvements
>>>
>>> For example, the result on STM32MP157C-DK2 board is:
>>>    1,6s gain for trusted boot chain with TF-A
>>>    2,2s gain for basic boot chain with SPL
>>>
>>> I will push this patch after sanity checks (ARM requirement on TLB /
>>> cache update with MMU activated).
>>
>> Nice, thanks. CC me when submitting it.
> 
> I checked the dcache and MMU sequence with my patch, and it is correctly
> handled in SPL and in U-Boot pre-reloc.
>
> So I sent my patch today:
> "arm: stm32mp1: activate data cache in SPL and before relocation"
> 
> http://patchwork.ozlabs.org/patch/1263815/

I added some review comments, thanks.


RE: STM32MP1 boot slow

2020-03-30 Thread Patrick DELAUNAY
Hi,

> From: Marek Vasut 
> Sent: vendredi 27 mars 2020 19:21
> 
> On 3/27/20 4:35 PM, Patrick DELAUNAY wrote:
> > Hi Marek,
> 
> Hi,
> 
> [...]
> 
> >>> I would like to see patches to enable the cache. We did this some
> >>> years ago in a Chromebook and it made a big difference. It is not
> >>> that hard.
> >>
> >> ACK. Why did the chromebook patches never make it upstream ?
> >
> > Work in progress
> > https://gitlab.denx.de/u-boot/custodians/u-boot-stm/-/commit/3399fb37c3b7db6e99118766c4d1cd5e742ecc8f
> >
> > with improvements
> >
> > For example, the result on STM32MP157C-DK2 board is:
> >    1,6s gain for trusted boot chain with TF-A
> >    2,2s gain for basic boot chain with SPL
> >
> > I will push this patch after sanity checks (ARM requirement on TLB /
> > cache update with MMU activated).
> 
> Nice, thanks. CC me when submitting it.

I checked the dcache and MMU sequence with my patch, and it is correctly
handled in SPL and in U-Boot pre-reloc.

So I sent my patch today:
"arm: stm32mp1: activate data cache in SPL and before relocation"

http://patchwork.ozlabs.org/patch/1263815/

> [...]
> 
> --
> Best regards,
> Marek Vasut

Regards
Patrick


Re: STM32MP1 boot slow

2020-03-27 Thread Marek Vasut
On 3/27/20 4:35 PM, Patrick DELAUNAY wrote:
> Hi Marek,

Hi,

[...]

>> Or reuse the DWC I2C driver timing calculation, which is real simple,
>> fast, and should be accurate enough.
> 
> Yes I checked
> ./drivers/i2c/designware_i2c.c:: __dw_i2c_set_bus_speed
> 
> I agree that something simple should be possible to found 'good enough' 
> setting. 
> But I don't ding in the ST I2C specification I waiting internal feedback

OK, that's fine. I believe a much simpler math should be possible here
to figure out the timing values.

[...]

>>> I would like to see patches to enable the cache. We did this some
>>> years ago in a Chromebook and it made a big difference. It is not that
>>> hard.
>>
>> ACK. Why did the chromebook patches never make it upstream ?
> 
> Work in progress
> https://gitlab.denx.de/u-boot/custodians/u-boot-stm/-/commit/3399fb37c3b7db6e99118766c4d1cd5e742ecc8f
> 
> with improvements
> 
> For example, the result on STM32MP157C-DK2 board is:
>    1,6s gain for trusted boot chain with TF-A
>    2,2s gain for basic boot chain with SPL
> 
> I will push this patch after sanity checks
> (ARM requirement on TLB / cache update with MMU activated).

Nice, thanks. CC me when submitting it.

[...]

-- 
Best regards,
Marek Vasut


RE: STM32MP1 boot slow

2020-03-27 Thread Patrick DELAUNAY
Hi Marek,

> From: Marek Vasut 
> Sent: jeudi 26 mars 2020 17:28
> 
> On 3/26/20 5:19 PM, Simon Glass wrote:
> > Hi Patrick,
> 
> Hi,
> 
> > On Wed, 25 Mar 2020 at 09:57, Patrick DELAUNAY wrote:
> >>
> >> Hi,
> >>
> >>> From: Marek Vasut 
> >>> Sent: mercredi 25 mars 2020 00:39
> >>>
> >>> Hi,
> >>>
> >>> I was looking at the STM32MP1 boot time and I noticed it takes about
> >>> 2 seconds to get to U-Boot.
> >>
> >> Thanks for the feedback.
> >>
> >> To be clear, the SPL is not the ST priority as we have many
> >> limitation (mainly on power management) for the SPL boot chain
> >> (stm32mp15_basic_defconfig):
> >> Rom code => SPL => U-Boot
> >>
> >> The preconized boot chain for STM32MP1 is Rom code => TF-A => U-Boot
> >> (stm32mp15_trusted_defconfg).
> >>
> >>> One problem is the insane I2C timing calculation in stm32f7 i2c
> >>> driver, which is almost a mallocator and CPU stress test and takes
> >>> about 1 second to complete in SPL -- we need some simpler
> >>> replacement for that, possibly the one in DWC I2C driver might do?
> >>
> >> Our first idea to manage this I2C settings (prescaler/timings
> >> setting) was to set this values in device tree, but this binding was
> >> refused so this function stm32_i2c_choose_solution()
> >
> > Was the binding refused in linux? Could we add something
> > U-Boot-specific then? I think having 'early' timings, etc. is very
> > handy. We are doing this on x86.
> >
> > Of course it has traditionally been impossible to convince Linux
> > people to add this sort of thing. Still, I think we should do it. Our
> > U-Boot-specific files allow this.
> 
> Or reuse the DWC I2C driver timing calculation, which is real simple,
> fast, and should be accurate enough.

Yes, I checked __dw_i2c_set_bus_speed() in
./drivers/i2c/designware_i2c.c.

I agree that it should be possible to find a 'good enough' setting with
something simple.
But I have not dug into the ST I2C specification yet; I am waiting for
internal feedback.

> >> provided the better settings for any input clock and I2C frequency
> >> (called for each probe).
> >>
> >> But it is brutal and not optimum solution: try all the solution to
> >> found the better one.
> >> And the performance problem of this loop (shared code between Linux /
> >> U-Boot/TF-A drivers) had be already see/checked on ST side in TF-A context.
> >
> > We should be able to calculate it, like with dw-i2c.
> 
> Yes
> 
> >> We try to improve the solution, without success, but finally the
> >> performance issue was solved by dcache activation in TF-A before to
> >> execute this loop.
> >
> > I would like to see patches to enable the cache. We did this some
> > years ago in a Chromebook and it made a big difference. It is not that
> > hard.
> 
> ACK. Why did the chromebook patches never make it upstream ?

Work in progress
https://gitlab.denx.de/u-boot/custodians/u-boot-stm/-/commit/3399fb37c3b7db6e99118766c4d1cd5e742ecc8f

with improvements

For example, the result on STM32MP157C-DK2 board is:
   1,6s gain for trusted boot chain with TF-A
   2,2s gain for basic boot chain with SPL

I will push this patch after sanity checks
(ARM requirement on TLB / cache update with MMU activated).

> >> But as in SPL the data cache is not activated, this loop has terrible
> >> performance.
> >>
> >> We need to ding again of this topic for U-Boot point of view (SPL &
> >> also in U-Boot, before relocation and after relocation) .
> >>
> >> And I had shared this issue with the ST owner of this code.
> >>
> >> For information, I add some trace and I get for same code execution on
> >> DK2 board.
> >> - 440ms in SPL (dcache OFF)
> >> - 36ms in U-Boot (dcache ON)
> >>
> >>> Another item I found is that, in U-Boot, initf_dm() takes about half
> >>> a second and so does serial_init(). I didn't dig into it to find out
> >>> why, but I suspect it has to do with the massive amount of UCLASSes
> >>> the DM has to traverse OR with the CPU being slow at that point, as
> >>> the clock driver didn't get probed just yet.
> >>>
> >>> Thoughts ?
> >>
> >> Yes, it is the first parsing of device tree, and it is really slow...
> >> directly linked to device tree size and libfdt.
> >
> > I wonder if we can improve this. There was a change to how the drivers
> > were bound (changing the ordering). We could perhaps revert that for
> > SPL.
> 
> Link ?
> 
> [...]
> 
> --
> Best regards,
> Marek Vasut


RE: STM32MP1 boot slow

2020-03-27 Thread Patrick DELAUNAY
Hi,

> From: Simon Glass 
> Sent: jeudi 26 mars 2020 17:20
> 
> Hi Patrick,
> 
> On Wed, 25 Mar 2020 at 09:57, Patrick DELAUNAY wrote:
> >
> > Hi,
> >
> > > From: Marek Vasut 
> > > Sent: mercredi 25 mars 2020 00:39
> > >
> > > Hi,
> > >
> > > I was looking at the STM32MP1 boot time and I noticed it takes about
> > > 2 seconds to get to U-Boot.
> >
> > Thanks for the feedback.
> >
> > To be clear, the SPL is not the ST priority as we have many limitation
> > (mainly on power management) for the SPL boot chain
> > (stm32mp15_basic_defconfig):
> > Rom code => SPL => U-Boot
> >
> > The preconized boot chain for STM32MP1 is Rom code => TF-A => U-Boot
> > (stm32mp15_trusted_defconfg).
> >
> > > One problem is the insane I2C timing calculation in stm32f7 i2c
> > > driver, which is almost a mallocator and CPU stress test and takes
> > > about 1 second to complete in SPL -- we need some simpler
> > > replacement for that, possibly the one in DWC I2C driver might do?
> >
> > Our first idea to manage this I2C settings (prescaler/timings setting)
> > was to set this values in device tree, but this binding was refused so
> > this function stm32_i2c_choose_solution()
> 
> Was the binding refused in linux? Could we add something U-Boot-specific
> then? I think having 'early' timings, etc. is very handy. We are doing
> this on x86.

https://patchwork.ozlabs.org/patch/740214/
st,i2c-timing : A 32-bit I2C timing register value


> Of course it has traditionally been impossible to convince Linux people
> to add this sort of thing. Still, I think we should do it. Our
> U-Boot-specific files allow this.
>
> > provided the better settings for any input clock and I2C frequency
> > (called for each probe).

Yes, it is one possible solution.
I have already proposed it internally.

> > But it is brutal and not optimum solution: try all the solution to
> > found the better one.
> > And the performance problem of this loop (shared code between Linux /
> > U-Boot/TF-A drivers) had be already see/checked on ST side in TF-A context.
> 
> We should be able to calculate it, like with dw-i2c.

I checked drivers/i2c/designware_i2c.c...
Nothing there is obviously applicable to the ST IP.

In fact, today I also challenged the I2C owner on the need for this loop
to find the optimum parameters in the bootloader.

I think that a 'good enough' register value could be found with a few
operations (as in designware_i2c.c).

Moreover, it seems tuning isn't really needed if we limit the I2C speed to
400kHz.

I am still waiting for internal feedback, but with COVID-19 it is more
difficult here.
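
To illustrate what a "few operations" calculation might look like, here is a
minimal sketch in C. It is neither the designware_i2c.c code nor the stm32f7
driver: the struct and function names are made up for this example, the
field positions are my understanding of the I2C_TIMINGR layout and should be
checked against the reference manual, and the sketch ignores SCLDEL/SDADEL,
the synchronisation delays, the filters and the rise/fall times, so the real
bus would run somewhat slower than requested.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three fields computed below. */
struct i2c_timing_sketch {
	uint8_t presc;	/* PRESC: 4-bit prescaler, 0..15 */
	uint8_t scll;	/* SCLL: SCL low time in prescaled ticks, minus 1 */
	uint8_t sclh;	/* SCLH: SCL high time in prescaled ticks, minus 1 */
};

/*
 * Use the smallest prescaler for which the SCL low time fits in the 8-bit
 * SCLL counter. Rounding the tick count up keeps the bus at or below the
 * requested speed. Returns 0 on success, -1 if no prescaler fits.
 */
int i2c_timing_compute_sketch(uint32_t clk_hz, uint32_t speed_hz,
			      struct i2c_timing_sketch *t)
{
	assert(t);
	if (!clk_hz || !speed_hz)
		return -1;

	for (uint32_t presc = 0; presc < 16; presc++) {
		uint64_t div = (uint64_t)(presc + 1) * speed_hz;
		uint32_t ticks = (uint32_t)((clk_hz + div - 1) / div);
		/* Above 100 kHz, give the low phase about 2/3 of the period. */
		uint32_t high = speed_hz > 100000 ? ticks / 3 : ticks / 2;
		uint32_t low = ticks - high;

		if (high < 1 || low > 256)
			continue;

		t->presc = (uint8_t)presc;
		t->scll = (uint8_t)(low - 1);
		t->sclh = (uint8_t)(high - 1);
		return 0;
	}
	return -1;
}

/* Pack the fields at the bit positions assumed for I2C_TIMINGR. */
uint32_t i2c_timing_pack_sketch(const struct i2c_timing_sketch *t)
{
	return ((uint32_t)t->presc << 28) | ((uint32_t)t->sclh << 8) | t->scll;
}
```

With a hypothetical 64 MHz kernel clock and a 400kHz target, this gives
PRESC=0, SCLL=106, SCLH=52, that is about 1.67us low and 0.83us high, which
is above the Fast-mode minimums of 1.3us and 0.6us. The whole search is at
most 16 iterations with no memory allocation.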

> >
> > We try to improve the solution, without success, but finally the
> > performance issue was solved by dcache activation in TF-A before to
> > execute this loop.
>
> I would like to see patches to enable the cache. We did this some years
> ago in a Chromebook and it made a big difference. It is not that hard.

Yes, I am working on this patch, and today it is already functional.

https://gitlab.denx.de/u-boot/custodians/u-boot-stm/-/commit/3399fb37c3b7db6e99118766c4d1cd5e742ecc8f

Updated bootstage reports are available in the commit message.

I just need to cross-check that the TLB and the cache are correctly managed
if I only activate/deactivate the cache with the CP15 functions.

I don't want to miss something on this sensitive point.

> >
> > But as in SPL the data cache is not activated, this loop has terrible
> > performance.
> >
> > We need to ding again of this topic for U-Boot point of view (SPL &
> > also in U-Boot, before relocation and after relocation) .
> >
> > And I had shared this issue with the ST owner of this code.
> >
> > For information, I add some trace and I get for same code execution on
> > DK2 board.
> > - 440ms in SPL (dcache OFF)
> > - 36ms in U-Boot (dcache ON)
> >
> > > Another item I found is that, in U-Boot, initf_dm() takes about half
> > > a second and so does serial_init(). I didn't dig into it to find out
> > > why, but I suspect it has to do with the massive amount of UCLASSes
> > > the DM has to traverse OR with the CPU being slow at that point, as
> > > the clock driver didn't get probed just yet.
> > >
> > > Thoughts ?
> >
> > Yes, it is the first parsing of device tree, and it is really slow...
> > directly linked to device tree size and libfdt.
> 
> I wonder if we can improve this. There was a change to how the drivers were
> bound (changing the ordering). We could perhaps revert that for SPL.

I have no issue in SPL, as the reduced device tree is short.

The issue is in the U-Boot pre-reloc stage (with the full device tree but
without cache).

> >
> > And because it is done before relocation (before dache enable).
> >
> > Measurement on DK2 = 649ms
> >
> > It is a other topic in my TODO list.
> >
> > I want to explore livetree activation to reduce the DT parsing time.
> 
> Not in SPL though I suspect.

No, in U-Boot proper (it is on my TODO list).

Patrick


Re: STM32MP1 boot slow

2020-03-26 Thread Marek Vasut
On 3/26/20 5:19 PM, Simon Glass wrote:
> Hi Patrick,

Hi,

> On Wed, 25 Mar 2020 at 09:57, Patrick DELAUNAY  
> wrote:
>>
>> Hi,
>>
>>> From: Marek Vasut 
>>> Sent: mercredi 25 mars 2020 00:39
>>>
>>> Hi,
>>>
>>> I was looking at the STM32MP1 boot time and I noticed it takes about 2 
>>> seconds
>>> to get to U-Boot.
>>
>> Thanks for the feedback.
>>
>> To be clear, the SPL is not the ST priority as we have many limitation 
>> (mainly on
>> power management) for the SPL boot chain (stm32mp15_basic_defconfig):
>> Rom code => SPL => U-Boot
>>
>> The preconized boot chain for STM32MP1 is Rom code => TF-A => U-Boot
>> (stm32mp15_trusted_defconfg).
>>
>>> One problem is the insane I2C timing calculation in stm32f7 i2c driver, 
>>> which is
>>> almost a mallocator and CPU stress test and takes about 1 second to 
>>> complete in
>>> SPL -- we need some simpler replacement for that, possibly the one in DWC 
>>> I2C
>>> driver might do?
>>
>> Our first idea to manage this I2C settings (prescaler/timings setting) was 
>> to set this values
>> in device tree, but this binding was refused so this function 
>> stm32_i2c_choose_solution()
> 
> Was the binding refused in linux? Could we add something
> U-Boot-specific then? I think having 'early' timings, etc. is very
> handy. We are doing this on x86.
> 
> Of course it has traditionally been impossible to convince Linux
> people to add this sort of thing. Still, I think we should do it. Our
> U-Boot-specific files allow this.

Or reuse the DWC I2C driver timing calculation, which is real simple,
fast, and should be accurate enough.

>> provided the better settings for any input clock and I2C frequency (called 
>> for each probe).
>>
>> But it is brutal and not optimum solution: try all the solution to found the 
>> better one.
>> And the performance problem of this loop (shared code between Linux / 
>> U-Boot/TF-A drivers)
>> had be already see/checked on ST side in TF-A context.
> 
> We should be able to calculate it, like with dw-i2c.

Yes

>> We try to improve the solution, without success, but finally the performance 
>> issue
>> was solved by dcache activation in TF-A before to execute this loop.
> 
> I would like to see patches to enable the cache. We did this some
> years ago in a Chromebook and it made a big difference. It is not that
> hard.

ACK. Why did the chromebook patches never make it upstream ?

>> But as in SPL the data cache is not activated, this loop has terrible 
>> performance.
>>
>> We need to ding again of this topic for U-Boot point of view
>> (SPL & also in U-Boot, before relocation and after relocation) .
>>
>> And I had shared this issue with the ST owner of this code.
>>
>> For information, I add some trace and I get for same code execution on DK2 
>> board.
>> - 440ms in SPL (dcache OFF)
>> - 36ms in U-Boot (dcache ON)
>>
>>> Another item I found is that, in U-Boot, initf_dm() takes about half a 
>>> second and so
>>> does serial_init(). I didn't dig into it to find out why, but I suspect it 
>>> has to do with
>>> the massive amount of UCLASSes the DM has to traverse OR with the CPU being
>>> slow at that point, as the clock driver didn't get probed just yet.
>>>
>>> Thoughts ?
>>
>> Yes, it is the first parsing of device tree, and it is really slow... 
>> directly linked to device
>> tree size and libfdt.
> 
> I wonder if we can improve this. There was a change to how the drivers
> were bound (changing the ordering). We could perhaps revert that for
> SPL.

Link ?

[...]

-- 
Best regards,
Marek Vasut


Re: STM32MP1 boot slow

2020-03-26 Thread Simon Glass
Hi Patrick,

On Wed, 25 Mar 2020 at 09:57, Patrick DELAUNAY  wrote:
>
> Hi,
>
> > From: Marek Vasut 
> > Sent: mercredi 25 mars 2020 00:39
> >
> > Hi,
> >
> > I was looking at the STM32MP1 boot time and I noticed it takes about 2 
> > seconds
> > to get to U-Boot.
>
> Thanks for the feedback.
>
> To be clear, the SPL is not the ST priority as we have many limitation 
> (mainly on
> power management) for the SPL boot chain (stm32mp15_basic_defconfig):
> Rom code => SPL => U-Boot
>
> The preconized boot chain for STM32MP1 is Rom code => TF-A => U-Boot
> (stm32mp15_trusted_defconfg).
>
> > One problem is the insane I2C timing calculation in stm32f7 i2c driver, 
> > which is
> > almost a mallocator and CPU stress test and takes about 1 second to 
> > complete in
> > SPL -- we need some simpler replacement for that, possibly the one in DWC 
> > I2C
> > driver might do?
>
> Our first idea to manage this I2C settings (prescaler/timings setting) was to 
> set this values
> in device tree, but this binding was refused so this function 
> stm32_i2c_choose_solution()

Was the binding refused in linux? Could we add something
U-Boot-specific then? I think having 'early' timings, etc. is very
handy. We are doing this on x86.

Of course it has traditionally been impossible to convince Linux
people to add this sort of thing. Still, I think we should do it. Our
U-Boot-specific files allow this.

> provided the better settings for any input clock and I2C frequency (called 
> for each probe).
>
> But it is brutal and not optimum solution: try all the solution to found the 
> better one.
> And the performance problem of this loop (shared code between Linux / 
> U-Boot/TF-A drivers)
> had be already see/checked on ST side in TF-A context.

We should be able to calculate it, like with dw-i2c.

>
> We try to improve the solution, without success, but finally the performance 
> issue
> was solved by dcache activation in TF-A before to execute this loop.

I would like to see patches to enable the cache. We did this some
years ago in a Chromebook and it made a big difference. It is not that
hard.

>
> But as in SPL the data cache is not activated, this loop has terrible 
> performance.
>
> We need to ding again of this topic for U-Boot point of view
> (SPL & also in U-Boot, before relocation and after relocation) .
>
> And I had shared this issue with the ST owner of this code.
>
> For information, I add some trace and I get for same code execution on DK2 
> board.
> - 440ms in SPL (dcache OFF)
> - 36ms in U-Boot (dcache ON)
>
> > Another item I found is that, in U-Boot, initf_dm() takes about half a 
> > second and so
> > does serial_init(). I didn't dig into it to find out why, but I suspect it 
> > has to do with
> > the massive amount of UCLASSes the DM has to traverse OR with the CPU being
> > slow at that point, as the clock driver didn't get probed just yet.
> >
> > Thoughts ?
>
> Yes, it is the first parsing of device tree, and it is really slow... 
> directly linked to device
> tree size and libfdt.

I wonder if we can improve this. There was a change to how the drivers
were bound (changing the ordering). We could perhaps revert that for
SPL.

>
> And because it is done before relocation (before dache enable).
>
> Measurement on DK2 = 649ms
>
> It is a other topic in my TODO list.
>
> I want to explore livetree activation to reduce the DT parsing time.

Not in SPL though I suspect.


>
> And also activate dcache in pre-location stage
> (and potentially also in SPL as it was done in 
> http://patchwork.ozlabs.org/patch/699899/)
>
> A other solution (workaround ?) is to reduced the U-Boot device-tree (remove 
> all the nodes not used in
> U-Boot in soc file stm32mp157.dtsi or use /omit-if-no-ref/ for pincontrol 
> nodes).
>
> See bootsage report on DK2, we have dm_f = 648ms
>
> STM32MP> bootstage report
> Timer summary in microseconds (12 records):
>        Mark    Elapsed  Stage
>           0          0  reset
>     195,613    195,613  SPL
>     837,867    642,254  end SPL
>     840,117      2,250  board_init_f
>   2,739,639  1,899,522  board_init_r
>   3,066,815    327,176  id=64
>   3,103,377     36,562  id=65
>   3,104,078        701  main_loop
>   3,142,171     38,093  id=175
>
> Accumulated time:
>                 38,124  dm_spl
>                 41,956  dm_r
>                648,861  dm_f
>
> For information the time in spent in
> dm_extended_scan_fdt
> => dm_scan_fdt(blob, pre_reloc_only);
>
> This time is reduce d (few millisecond)
> with http://patchwork.ozlabs.org/patch/1240117/
>
> But only the data cache activation before relocation should improve this part.
>
> >
> > --
> > Best regards,
> > Marek Vasut
>
> Regards
> Patrick


Re: STM32MP1 boot slow

2020-03-25 Thread Marek Vasut
On 3/25/20 4:57 PM, Patrick DELAUNAY wrote:
> Hi,

Hi,

>> From: Marek Vasut 
>> Sent: mercredi 25 mars 2020 00:39
>>
>> Hi,
>>
>> I was looking at the STM32MP1 boot time and I noticed it takes about 2 
>> seconds
>> to get to U-Boot.
> 
> Thanks for the feedback.
> 
> To be clear, the SPL is not the ST priority as we have many limitation 
> (mainly on
> power management) for the SPL boot chain (stm32mp15_basic_defconfig):
> Rom code => SPL => U-Boot
> 
> The preconized boot chain for STM32MP1 is Rom code => TF-A => U-Boot
> (stm32mp15_trusted_defconfg).

I don't want to use TF-A because it's problematic at best.

However, these issues I listed here are present also in U-Boot, so this
comment is irrelevant anyway.

>> One problem is the insane I2C timing calculation in stm32f7 i2c driver, 
>> which is
>> almost a mallocator and CPU stress test and takes about 1 second to complete 
>> in
>> SPL -- we need some simpler replacement for that, possibly the one in DWC I2C
>> driver might do?
> 
> Our first idea to manage this I2C settings (prescaler/timings setting) was to 
> set this values 
> in device tree, but this binding was refused so this function 
> stm32_i2c_choose_solution()
> provided the better settings for any input clock and I2C frequency (called 
> for each probe).
> 
> But it is brutal and not optimum solution: try all the solution to found the 
> better one.
> And the performance problem of this loop (shared code between Linux / 
> U-Boot/TF-A drivers)
> had be already see/checked on ST side in TF-A context.
> 
> We try to improve the solution, without success, but finally the performance 
> issue
> was solved by dcache activation in TF-A before to execute this loop.

That's not a solution but a workaround.

> But as in SPL the data cache is not activated, this loop has terrible 
> performance.
> 
> We need to ding again of this topic for U-Boot point of view
> (SPL & also in U-Boot, before relocation and after relocation) .
> 
> And I had shared this issue with the ST owner of this code.
> 
> For information, I add some trace and I get for same code execution on DK2 
> board.
> - 440ms in SPL (dcache OFF)
> - 36ms in U-Boot (dcache ON)

Still, this is a workaround.

The calculation should be simplified. And why do you even need all those
memory allocations in there ?

>> Another item I found is that, in U-Boot, initf_dm() takes about half a 
>> second and so
>> does serial_init(). I didn't dig into it to find out why, but I suspect it 
>> has to do with
>> the massive amount of UCLASSes the DM has to traverse OR with the CPU being
>> slow at that point, as the clock driver didn't get probed just yet.
>>
>> Thoughts ?
> 
> Yes, it is the first parsing of device tree, and it is really slow... 
> directly linked to device
> tree size and libfdt.
> 
> And because it is done before relocation (before dache enable).
> 
> Measurement on DK2 = 649ms
> 
> It is a other topic in my TODO list.
> 
> I want to explore livetree activation to reduce the DT parsing time.
>  
> And also activate dcache in pre-location stage
> (and potentially also in SPL as it was done in 
> http://patchwork.ozlabs.org/patch/699899/)
> 
> A other solution (workaround ?) is to reduced the U-Boot device-tree (remove 
> all the nodes not used in
> U-Boot in soc file stm32mp157.dtsi or use /omit-if-no-ref/ for pincontrol 
> nodes).
> 
> See bootsage report on DK2, we have dm_f = 648ms
> 
> STM32MP> bootstage report
> Timer summary in microseconds (12 records):
>        Mark    Elapsed  Stage
>           0          0  reset
>     195,613    195,613  SPL
>     837,867    642,254  end SPL
>     840,117      2,250  board_init_f
>   2,739,639  1,899,522  board_init_r
>   3,066,815    327,176  id=64
>   3,103,377     36,562  id=65
>   3,104,078        701  main_loop
>   3,142,171     38,093  id=175
>
> Accumulated time:
>                 38,124  dm_spl
>                 41,956  dm_r
>                648,861  dm_f
> 
> For information the time in spent in 
>   dm_extended_scan_fdt
>   => dm_scan_fdt(blob, pre_reloc_only);
> 
> This time is reduce d (few millisecond) 
> with http://patchwork.ozlabs.org/patch/1240117/
> 
> But only the data cache activation before relocation should improve this part.

For this one, I think we have no better options than the Dcache indeed.
Thanks


RE: STM32MP1 boot slow

2020-03-25 Thread Patrick DELAUNAY
Hi,

> From: Marek Vasut 
> Sent: mercredi 25 mars 2020 00:39
> 
> Hi,
> 
> I was looking at the STM32MP1 boot time and I noticed it takes about 2 seconds
> to get to U-Boot.

Thanks for the feedback.

To be clear, the SPL is not the ST priority, as we have many limitations
(mainly on power management) for the SPL boot chain
(stm32mp15_basic_defconfig):
ROM code => SPL => U-Boot

The recommended boot chain for STM32MP1 is ROM code => TF-A => U-Boot
(stm32mp15_trusted_defconfig).

> One problem is the insane I2C timing calculation in stm32f7 i2c driver,
> which is almost a mallocator and CPU stress test and takes about 1 second
> to complete in SPL -- we need some simpler replacement for that, possibly
> the one in DWC I2C driver might do?

Our first idea to manage these I2C settings (prescaler/timings) was to set
the values in the device tree, but this binding was refused, so the function
stm32_i2c_choose_solution() provides the best settings for any input clock
and I2C frequency (it is called for each probe).

But it is a brute-force, non-optimal solution: it tries all the solutions to
find the best one.
The performance problem of this loop (code shared between the Linux, U-Boot
and TF-A drivers) had already been seen and checked on the ST side in the
TF-A context.

We tried to improve the algorithm, without success; in the end the
performance issue was solved by activating the dcache in TF-A before
executing this loop.

But as the data cache is not activated in SPL, this loop has terrible
performance there.

We need to dig into this topic again from the U-Boot point of view
(SPL, and also U-Boot before and after relocation).

I have shared this issue with the ST owner of this code.

For information, I added some traces, and for the same code execution on the
DK2 board I get:
- 440ms in SPL (dcache OFF)
- 36ms in U-Boot (dcache ON)

> Another item I found is that, in U-Boot, initf_dm() takes about half a
> second and so does serial_init(). I didn't dig into it to find out why,
> but I suspect it has to do with the massive amount of UCLASSes the DM has
> to traverse OR with the CPU being slow at that point, as the clock driver
> didn't get probed just yet.
>
> Thoughts ?

Yes, it is the first parsing of the device tree, and it is really slow...
directly linked to the device tree size and libfdt.

And because it is done before relocation (before dcache enable).

Measurement on DK2 = 649ms

It is another topic on my TODO list.

I want to explore livetree activation to reduce the DT parsing time.

And also activate the dcache in the pre-relocation stage
(and potentially also in SPL, as was done in
http://patchwork.ozlabs.org/patch/699899/)

Another solution (workaround?) is to reduce the U-Boot device tree (remove
all the nodes not used in U-Boot from the SoC file stm32mp157.dtsi, or use
/omit-if-no-ref/ for pincontrol nodes).

See the bootstage report on DK2, where we have dm_f = 649ms

STM32MP> bootstage report
Timer summary in microseconds (12 records):
       Mark    Elapsed  Stage
          0          0  reset
    195,613    195,613  SPL
    837,867    642,254  end SPL
    840,117      2,250  board_init_f
  2,739,639  1,899,522  board_init_r
  3,066,815    327,176  id=64
  3,103,377     36,562  id=65
  3,104,078        701  main_loop
  3,142,171     38,093  id=175

Accumulated time:
                38,124  dm_spl
                41,956  dm_r
               648,861  dm_f
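
To read the report: each Elapsed value is simply the Mark of that record
minus the Mark of the previous one. The small C check below reproduces that
arithmetic with the marks copied from the report; marks_us and elapsed_us()
are names made up for this illustration and are not part of U-Boot's
bootstage code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marks copied from the report above, in microseconds since reset. */
static const uint32_t marks_us[] = {
	0, 195613, 837867, 840117, 2739639,
	3066815, 3103377, 3104078, 3142171,
};

/* Elapsed time of record i: its mark minus the previous record's mark. */
uint32_t elapsed_us(const uint32_t *marks, size_t i)
{
	assert(marks);
	return i ? marks[i] - marks[i - 1] : marks[i];
}
```

For example, elapsed_us(marks_us, 4) gives the 1,899,522 us between the
board_init_f and board_init_r marks, the largest single step in this report.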

For information, the time is spent in
dm_extended_scan_fdt
=> dm_scan_fdt(blob, pre_reloc_only);

This time is reduced (by a few milliseconds)
with http://patchwork.ozlabs.org/patch/1240117/

But only data cache activation before relocation should really improve this part.

> 
> --
> Best regards,
> Marek Vasut

Regards
Patrick