Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-18 Thread Michael S. Tsirkin
On Mon, Oct 18, 2021 at 02:15:47PM +0200, Greg KH wrote:
> On Mon, Oct 11, 2021 at 07:59:17AM -0400, Michael S. Tsirkin wrote:
> > On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote:
> > > 
> > > > To which Andi replied
> > > > One problem with removing the ioremap opt-in is that
> > > > it's still possible for drivers to get at devices without going 
> > > > through probe.
> > > > 
> > > > To which Greg replied:
> > > > https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
> > > > If there are in-kernel PCI drivers that do not do this, they 
> > > > need to be
> > > > fixed today.
> > > > 
> > > > Can you guys resolve the differences here?
> > > 
> > > 
> > > I addressed this in my other mail, but we may need more discussion.
> > 
> > Hopefully Greg will reply to that one.
> 
> Note, when wanting Greg to reply, someone should at the very least cc:
> him.

"that one" being "Andi's other mail". Which I don't remember what it was,
by now. Sorry.

> {sigh}
> 
> greg k-h

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-18 Thread Greg KH
On Mon, Oct 11, 2021 at 07:59:17AM -0400, Michael S. Tsirkin wrote:
> On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote:
> > 
> > > To which Andi replied
> > >   One problem with removing the ioremap opt-in is that
> > >   it's still possible for drivers to get at devices without going through 
> > > probe.
> > > 
> > > To which Greg replied:
> > > https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
> > >   If there are in-kernel PCI drivers that do not do this, they need to be
> > >   fixed today.
> > > 
> > > Can you guys resolve the differences here?
> > 
> > 
> > I addressed this in my other mail, but we may need more discussion.
> 
> Hopefully Greg will reply to that one.

Note, when wanting Greg to reply, someone should at the very least cc:
him.

{sigh}

greg k-h
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-18 Thread Greg KH
On Tue, Oct 12, 2021 at 11:35:04AM -0700, Andi Kleen wrote:
> > I'd rather see more concerted efforts focused/limited core changes
> > rather than leaf driver changes until there is a clearer definition of
> > hardened.
> 
> A hardened driver is a driver that

Ah, you do define this, thank you!

> - Had similar security (not API) oriented review of its IO operations
> (mainly MMIO access, but also PCI config space) as a non privileged user
> interface (like a ioctl). That review should be focused on memory safety.

Where is this review done?  Where is is documented?  Who is responsible
for keeping it up to date with every code change to the driver, and to
the code that the driver calls and the code that calls the driver?

> - Had some fuzzing on these IO interfaces using to be released tools.

"some"?  What tools?  What is the input, and where is that defined?  How
much fuzzing do you claim is "good enough"?

> Right now it's only three virtio drivers (console, net, block)

Where was this work done and published?  And why only 3?

> Really it's no different than what we do for every new unprivileged user
> interface.

Really?  I have seen loads of new drivers from Intel submitted in the
past months that would fail any of the above things just based on
obvious code reviews that I end up having to do...

If you want to start a "hardened driver" effort, there's a lot of real
work that needs to be done here and documented, and explained why it can
not just be done for the whole kernel...

greg k-h
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-18 Thread Greg KH
On Sun, Oct 10, 2021 at 03:11:23PM -0700, Andi Kleen wrote:
> 
> On 10/9/2021 1:39 PM, Dan Williams wrote:
> > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin  wrote:
> > > On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan 
> > > wrote:
> > > > From: Andi Kleen 
> > > > 
> > > > For Confidential VM guests like TDX, the host is untrusted and hence
> > > > the devices emulated by the host or any data coming from the host
> > > > cannot be trusted. So the drivers that interact with the outside world
> > > > have to be hardened by sharing memory with host on need basis
> > > > with proper hardening fixes.
> > > > 
> > > > For the PCI driver case, to share the memory with the host add
> > > > pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> > > > 
> > > > Signed-off-by: Andi Kleen 
> > > > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > > > 
> > > So I proposed to make all pci mappings shared, eliminating the need
> > > to patch drivers.
> > > 
> > > To which Andi replied
> > >  One problem with removing the ioremap opt-in is that
> > >  it's still possible for drivers to get at devices without going 
> > > through probe.
> > > 
> > > To which Greg replied:
> > > https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
> > >  If there are in-kernel PCI drivers that do not do this, they 
> > > need to be
> > >  fixed today.
> > > 
> > > Can you guys resolve the differences here?
> > I agree with you and Greg here. If a driver is accessing hardware
> > resources outside of the bind lifetime of one of the devices it
> > supports, and in a way that neither modrobe-policy nor
> > device-authorization -policy infrastructure can block, that sounds
> > like a bug report.
> 
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> others) in init functions that also register drivers (thanks Elena for the
> number)
> 
> Some are probably old drivers that could be fixed, but it's quite a few
> legitimate cases. For example for platform or ISA drivers that's the only
> way they can be implemented because they often have no other enumeration
> mechanism. For PCI drivers it's rarer, but also still can happen. One
> example that comes to mind here is the x86 Intel uncore drivers, which
> support a mix of MSR, ioremap and PCI config space accesses all from the
> same driver. This particular example can (and should be) fixed in other
> ways, but similar things also happen in other drivers, and they're not all
> broken. Even for the broken ones they're usually for some crufty old devices
> that has very few users, so it's likely untestable in practice.
> 
> My point is just that the ecosystem of devices that Linux supports is messy
> enough that there are legitimate exceptions from the "First IO only in probe
> call only" rule.

No, there should not be for PCI drivers.  If there is, that is a bug
that you can, and should, fix.

> And we can't just fix them all. Even if we could it would be hard to
> maintain.

Not true at all, you can fix them, and write a simple coccinelle rule to
prevent them from ever coming back in.

> Using a "firewall model" hooking into a few strategic points like we're
> proposing here is much saner for everyone.

No it is not.  It is "easier" for you because you all do not want to fix
up all of the drivers and want to add additional code complexity on top
of the current mess that we have and then you can claim that you have
"hardened" the drivers you care about.

Despite no one ever explaining exactly what "hardened" means to me.

Again, fix the existing drivers, you have the whole source, if this is
something that you all care about, it should not be hard to do.

Stop making excuses.

greg k-h
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-17 Thread Thomas Gleixner
On Mon, Oct 18 2021 at 02:55, Thomas Gleixner wrote:
> On Sun, Oct 10 2021 at 15:11, Andi Kleen wrote:
>> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and 
>> others) in init functions that also register drivers (thanks Elena for 
>> the number)
>
> These numbers are completely useless simply because they are based on
> nonsensical criteria. See:
>
>   https://lore.kernel.org/r/87r1cj2uad.ffs@tglx
>
>> My point is just that the ecosystem of devices that Linux supports is 
>> messy enough that there are legitimate exceptions from the "First IO 
>> only in probe call only" rule.
>
> Your point is based on your outright refusal to actualy do a proper
> analysis and your outright refusal to help fixing the real problems.
>
> All you have provided so far is handwaving based on a completely useless
> analysis.
>
> Sure, your goal is to get this TDX problem solved, but it's not going to
> be solved by:
>
>   1) Providing a nonsensical analysis
>
>   2) Using #1 as an argument to hack some half baken interfaces into the
>  kernel which allow you to tick off your checkbox and then leave the
>  resulting mess for others to clean up.
>  
> Try again when you have factual data to back up your claims and factual
> arguments which prove that the problem can't be fixed otherwise.
>
> I might be repeating myself, but kernel development works this way:
>
>   1) Hack your private POC - Yay!
>
>   2) Sit down and think hard about the problems you identified in step
>  #1. Do a thorough analysis.
>   
>   3) Come up with a sensible integration plan.
>
>   4) Do the necessary grump work of cleanups all over the place
>
>   5) Add sensible infrastructure which is understandable for the bulk
>  of kernel/driver developers
>
>   6) Let your feature fall in place
>
> and not in the way you are insisting on:
>
>   1) Hack your private POC - Yay!
>
>   2) Define that this is the only way to do it and try to shove it down
>  the throat of everyone.
>
>   3) Getting told that this is not the way it works
>
>   4) Insist on it forever and blame the grumpy maintainers who are just
>  not understanding the great value of your approach.
>
>   5) Go back to #2
>
> You should know that already, but I have no problem to give that lecture
> to you over and over again. I probably should create a form letter.
>
> And no, you can bitch about me as much as you want. These are not my
> personal rules and personal pet pieves. These are rules Linus cares
> about very much and aside of that they just reflect common sense.
>
>   The kernel is a common good and not the dump ground for your personal
>   brain waste.
>
>   The kernel does not serve Intel. Quite the contrary Intel depends on
>   the kernel to work nicely with it's hardware. Ergo, Intel should have
>   a vested interest to serve the kernel and take responsibility for it
>   as a whole. And so should you as an Intel employee.
>
> Just dumping your next half baken workaround does not cut it especially
> not when it is not backed up by sensible arguments.
>
> Please try again, but not before you have something substantial to back
> up your claims.

That said, I can't resist the urge to say a few words to the responsible
senior and management people at Intel in this context:

I surely know that a lot of Intel people claim that their lack of
progress is _only_ because Thomas is hard to work with and Thomas wants
unreasonable changes to their code, which I could perceive as an abuse of
myself for the purpose of self-deception. TBH, I don't give a damn.

Let me ask a few questions instead:

  - Is it unreasonable to expect that argumentations are based on facts
and proper analysis?

  - Is it unreasonable to expect a proper integration of a new feature?

  - Does it take unreasonable effort to do a proper design?

  - Is it unreasonable to ask that he necessary cleanups are done
upfront?

If anyone of the responsible people at Intel thinks so, then they should
speak up now and tell me in public and into my face what's so
unreasonable about that.

Thanks,

Thomas
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-17 Thread Thomas Gleixner
Andi,

On Sun, Oct 10 2021 at 15:11, Andi Kleen wrote:
> On 10/9/2021 1:39 PM, Dan Williams wrote:
>> I agree with you and Greg here. If a driver is accessing hardware
>> resources outside of the bind lifetime of one of the devices it
>> supports, and in a way that neither modrobe-policy nor
>> device-authorization -policy infrastructure can block, that sounds
>> like a bug report.
>
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and 
> others) in init functions that also register drivers (thanks Elena for 
> the number)

These numbers are completely useless simply because they are based on
nonsensical criteria. See:

  https://lore.kernel.org/r/87r1cj2uad.ffs@tglx

> My point is just that the ecosystem of devices that Linux supports is 
> messy enough that there are legitimate exceptions from the "First IO 
> only in probe call only" rule.

Your point is based on your outright refusal to actualy do a proper
analysis and your outright refusal to help fixing the real problems.

All you have provided so far is handwaving based on a completely useless
analysis.

Sure, your goal is to get this TDX problem solved, but it's not going to
be solved by:

  1) Providing a nonsensical analysis

  2) Using #1 as an argument to hack some half baken interfaces into the
 kernel which allow you to tick off your checkbox and then leave the
 resulting mess for others to clean up.
 
Try again when you have factual data to back up your claims and factual
arguments which prove that the problem can't be fixed otherwise.

I might be repeating myself, but kernel development works this way:

  1) Hack your private POC - Yay!

  2) Sit down and think hard about the problems you identified in step
 #1. Do a thorough analysis.
  
  3) Come up with a sensible integration plan.

  4) Do the necessary grump work of cleanups all over the place

  5) Add sensible infrastructure which is understandable for the bulk
 of kernel/driver developers

  6) Let your feature fall in place

and not in the way you are insisting on:

  1) Hack your private POC - Yay!

  2) Define that this is the only way to do it and try to shove it down
 the throat of everyone.

  3) Getting told that this is not the way it works

  4) Insist on it forever and blame the grumpy maintainers who are just
 not understanding the great value of your approach.

  5) Go back to #2

You should know that already, but I have no problem to give that lecture
to you over and over again. I probably should create a form letter.

And no, you can bitch about me as much as you want. These are not my
personal rules and personal pet pieves. These are rules Linus cares
about very much and aside of that they just reflect common sense.

  The kernel is a common good and not the dump ground for your personal
  brain waste.

  The kernel does not serve Intel. Quite the contrary Intel depends on
  the kernel to work nicely with it's hardware. Ergo, Intel should have
  a vested interest to serve the kernel and take responsibility for it
  as a whole. And so should you as an Intel employee.

Just dumping your next half baken workaround does not cut it especially
not when it is not backed up by sensible arguments.

Please try again, but not before you have something substantial to back
up your claims.

Thanks,

Thomas
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-17 Thread Michael S. Tsirkin
On Thu, Oct 14, 2021 at 12:33:49PM +, Reshetova, Elena wrote:
> > On Thu, Oct 14, 2021 at 07:27:42AM +, Reshetova, Elena wrote:
> > > > On Thu, Oct 14, 2021 at 06:32:32AM +, Reshetova, Elena wrote:
> > > > > > On Tue, Oct 12, 2021 at 06:36:16PM +, Reshetova, Elena wrote:
> > > > > > > > The 5.15 tree has something like ~2.4k IO accesses (including 
> > > > > > > > MMIO and
> > > > > > > > others) in init functions that also register drivers (thanks 
> > > > > > > > Elena for
> > > > > > > > the number)
> > > > > > >
> > > > > > > To provide more numbers on this. What I can see so far from a 
> > > > > > > smatch-
> > based
> > > > > > > analysis, we have 409 __init style functions (.probe & 
> > > > > > > builtin/module_
> > > > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> > > > > >
> > > > > > I don't think we care about allyesconfig at all though.
> > > > > > Just don't do that.
> > > > > > How about allmodconfig? This is closer to what distros actually do.
> > > > >
> > > > > It does not make any difference really for the content of the 
> > > > > /drivers/*:
> > > > > gives 408 __init style functions doing IO (.probe & builtin/module_
> > > > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig:
> > > > >
> > > > > ['doc200x_ident_chip',
> > > > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > > > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > > > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > > > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > > > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > > > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > > > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> > > >
> > > > Um. ARM? Which architecture is this build for?
> > >
> > > The list of smatch IO findings is built for x86, but the smatch cross 
> > > function
> > > database covers all archs, so when queried for all potential function 
> > > callers,
> > > it would show non x86 arch call chains also.
> > >
> > > Here is the original x86 finding and call chain for the 
> > > 'arm_v7s_do_selftests':
> > >
> > >   Detected low-level IO from arm_v7s_do_selftests in fun
> > > __iommu_queue_command_sync
> > >
> > > drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
> > > {15002074744551330002}
> > > 'check_host_input' read from the host using function 'readl' to a
> > > member of the structure 'iommu->cmd_buf_head';
> > >
> > > __iommu_queue_command_sync()
> > >   iommu_completion_wait()
> > > amd_iommu_domain_flush_complete()
> > >   iommu_v1_map_page()
> > > arm_v7s_do_selftests()
> > >
> > > So, the results can be further filtered if you want a specified arch.
> > 
> > So what is it just for x86? Could you tell?
> 
> I can probably figure out how to do additional filtering here, but does
> it really matter for the case that is being discussed here? Andi's point was
> that there quite many existing places in the kernel when low-level IO
> happens before the probe stage. So I brought these numbers here.
> What do you plan to do with the pure x86 results? 

If the list is short - just suggest securing that ;)


> And here are the full results for allyesconfig, if anyone is interested (just 
> got permission to create
> the repository today):
> https://github.com/intel/ccc-linux-guest-hardening/tree/master/audit/sample_output/5.15-rc1
> We will be pushing to this repo all the scripts and fuzzing setups we use as 
> part of
> our Linux guest hardening effort for confidential cloud computing, but it is 
> going to take
> some time to get all the approvals collected.  
> 
> Best Regards,
> Elena.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-17 Thread Thomas Gleixner
Elena,

On Thu, Oct 14 2021 at 06:32, Elena Reshetova wrote:
>> On Tue, Oct 12, 2021 at 06:36:16PM +, Reshetova, Elena wrote:
> It does not make any difference really for the content of the /drivers/*:
> gives 408 __init style functions doing IO (.probe & builtin/module_
>> > _platform_driver_probe excluded) for 5.15 with allmodconfig:
>
> ['doc200x_ident_chip',
> 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> 'ubi_gluebi_init', 'ubiblock_init'
> 'ubi_init', 'mtd_stresstest_init',

All of this is MTD and can just be disabled wholesale.

Aside of that, most of these depend on either platform devices or device
tree enumerations which are not ever available on X86.

> 'probe_acpi_namespace_devices',

> 'amd_iommu_init_pci', 'state_next',
> 'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init',
> 'late_iommu_features_init', 'detect_ivrs',
> 'intel_prepare_irq_remapping', 'intel_enable_irq_remapping',
> 'intel_cleanup_irq_remapping', 'detect_intel_iommu',
> 'parse_ioapics_under_ir', 'si_domain_init',
> 'intel_iommu_init', 'dmar_table_init',
> 'enable_drhd_fault_handling',
> 'check_tylersburg_isoch', 

None of this is reachable because the initial detection which is ACPI
table based will fail for TDX. If not, it's a guest firmware problem.

> 'fb_console_init', 'xenbus_probe_backend_init',
> 'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event',
> 'balloon_init',

XEN, that's relevant because magically the TDX guest will assume that it
is a XEN instance?

> 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
> 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
> 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
> 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
> 'sun5i_setup_clockevent',
> 'mt7621_clk_init',
> 'samsung_clk_register_mux', 'samsung_clk_register_gate',
> 'samsung_clk_register_fixed_rate', 'clk_boston_setup',
> 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
> 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
> 'meson6_timer_init', 'atcpit100_timer_init',
> 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init',
> 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init',
> 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set',
> 'a72_b53_rac_enable_all', 'mcp_a72_b53_set',
> 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision',
> 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init',
> 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom',
> 'create_one_cmux', 'create_one_pll', 'p2041_init_periph',
> 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph',
> 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init',
> 'sh73a0_cpg_clocks_init', 'cpg_div6_register',
> 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk',
> 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register',
> 'cpg_sd_clk_register', 'r7s9210_update_clk_table',
> 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init',
> 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register',
> 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init',
> 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup',
> 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup',
> 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init',
> 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init',
> 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init',
> 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init',
> 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init',
> 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init',
> 'of_fixed_mmio_clk_setup',
> 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',

ARM based drivers are initialized on x86 in which way?

> 'hv_init_tsc_clocksource', 'hv_init_clocksource',

HyperV. See XEN

> 'skx_init',
> 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
> 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
> 'pnd2_init', 'edac_init', 'adummy_init',

EDAC has already hypervisor checks

> 'init_acpi_pm_clocksource',

Requires ACPI table entry or command line override

> 'intel_rng_mod_init',

Has an old style PCI table which is searched via pci_get_device(). Could
do with a cleanup which converts it to proper PCI probing.



So I stop here, because it would be way simpler to have the file names
but so far I could identify all of it from the top of my head.

So what are you trying to tell me? That you found tons of ioremaps in
__init functions which are completely irrelevant.

Please stop making arguments based on completely nonsensical data. It
took me less than 5 minutes to eliminate more than 50% of that list and
I'm pretty sure that I could have eliminated the bulk of the rest as
well.

The fact that a large part of this is ARM only, the 

Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-14 Thread Michael S. Tsirkin
On Thu, Oct 14, 2021 at 07:27:42AM +, Reshetova, Elena wrote:
> > On Thu, Oct 14, 2021 at 06:32:32AM +, Reshetova, Elena wrote:
> > > > On Tue, Oct 12, 2021 at 06:36:16PM +, Reshetova, Elena wrote:
> > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO 
> > > > > > and
> > > > > > others) in init functions that also register drivers (thanks Elena 
> > > > > > for
> > > > > > the number)
> > > > >
> > > > > To provide more numbers on this. What I can see so far from a 
> > > > > smatch-based
> > > > > analysis, we have 409 __init style functions (.probe & builtin/module_
> > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> > > >
> > > > I don't think we care about allyesconfig at all though.
> > > > Just don't do that.
> > > > How about allmodconfig? This is closer to what distros actually do.
> > >
> > > It does not make any difference really for the content of the /drivers/*:
> > > gives 408 __init style functions doing IO (.probe & builtin/module_
> > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig:
> > >
> > > ['doc200x_ident_chip',
> > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> > 
> > Um. ARM? Which architecture is this build for?
> 
> The list of smatch IO findings is built for x86, but the smatch cross function
> database covers all archs, so when queried for all potential function callers,
> it would show non x86 arch call chains also. 
> 
> Here is the original x86 finding and call chain for the 
> 'arm_v7s_do_selftests':
> 
>   Detected low-level IO from arm_v7s_do_selftests in fun
> __iommu_queue_command_sync
> 
> drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
> {15002074744551330002}
> 'check_host_input' read from the host using function 'readl' to a
> member of the structure 'iommu->cmd_buf_head';
> 
> __iommu_queue_command_sync()
>   iommu_completion_wait()
> amd_iommu_domain_flush_complete()
>   iommu_v1_map_page()
> arm_v7s_do_selftests()
> 
> So, the results can be further filtered if you want a specified arch. 

Even better would be a typical distro build.

-- 
MST

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-14 Thread Michael S. Tsirkin
On Thu, Oct 14, 2021 at 07:27:42AM +, Reshetova, Elena wrote:
> > On Thu, Oct 14, 2021 at 06:32:32AM +, Reshetova, Elena wrote:
> > > > On Tue, Oct 12, 2021 at 06:36:16PM +, Reshetova, Elena wrote:
> > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO 
> > > > > > and
> > > > > > others) in init functions that also register drivers (thanks Elena 
> > > > > > for
> > > > > > the number)
> > > > >
> > > > > To provide more numbers on this. What I can see so far from a 
> > > > > smatch-based
> > > > > analysis, we have 409 __init style functions (.probe & builtin/module_
> > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> > > >
> > > > I don't think we care about allyesconfig at all though.
> > > > Just don't do that.
> > > > How about allmodconfig? This is closer to what distros actually do.
> > >
> > > It does not make any difference really for the content of the /drivers/*:
> > > gives 408 __init style functions doing IO (.probe & builtin/module_
> > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig:
> > >
> > > ['doc200x_ident_chip',
> > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> > 
> > Um. ARM? Which architecture is this build for?
> 
> The list of smatch IO findings is built for x86, but the smatch cross function
> database covers all archs, so when queried for all potential function callers,
> it would show non x86 arch call chains also. 
> 
> Here is the original x86 finding and call chain for the 
> 'arm_v7s_do_selftests':
> 
>   Detected low-level IO from arm_v7s_do_selftests in fun
> __iommu_queue_command_sync
> 
> drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
> {15002074744551330002}
> 'check_host_input' read from the host using function 'readl' to a
> member of the structure 'iommu->cmd_buf_head';
> 
> __iommu_queue_command_sync()
>   iommu_completion_wait()
> amd_iommu_domain_flush_complete()
>   iommu_v1_map_page()
> arm_v7s_do_selftests()
> 
> So, the results can be further filtered if you want a specified arch. 

So what is it just for x86? Could you tell?

-- 
MST

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-14 Thread Michael S. Tsirkin
On Thu, Oct 14, 2021 at 06:32:32AM +, Reshetova, Elena wrote:
> > On Tue, Oct 12, 2021 at 06:36:16PM +, Reshetova, Elena wrote:
> > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > others) in init functions that also register drivers (thanks Elena for
> > > > the number)
> > >
> > > To provide more numbers on this. What I can see so far from a smatch-based
> > > analysis, we have 409 __init style functions (.probe & builtin/module_
> > > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> > 
> > I don't think we care about allyesconfig at all though.
> > Just don't do that.
> > How about allmodconfig? This is closer to what distros actually do.
> 
> It does not make any difference really for the content of the /drivers/*:
> gives 408 __init style functions doing IO (.probe & builtin/module_
> > > _platform_driver_probe excluded) for 5.15 with allmodconfig:
> 
> ['doc200x_ident_chip',
> 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',

Um. ARM? Which architecture is this build for?


> 'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init',
> 'late_iommu_features_init', 'detect_ivrs',
> 'intel_prepare_irq_remapping', 'intel_enable_irq_remapping',
> 'intel_cleanup_irq_remapping', 'detect_intel_iommu',
> 'parse_ioapics_under_ir', 'si_domain_init', 'ubi_init',
> 'fb_console_init', 'xenbus_probe_backend_init',
> 'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event',
> 'balloon_init', 'intel_iommu_init', 'intel_rng_mod_init',
> 'check_tylersburg_isoch', 'dmar_table_init',
> 'enable_drhd_fault_handling', 'init_acpi_pm_clocksource',
> 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
> 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
> 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
> 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
> 'sun5i_setup_clockevent', 'ubi_gluebi_init', 'ubiblock_init',
> 'hv_init_tsc_clocksource', 'hv_init_clocksource', 'mt7621_clk_init',
> 'samsung_clk_register_mux', 'samsung_clk_register_gate',
> 'samsung_clk_register_fixed_rate', 'clk_boston_setup',
> 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
> 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
> 'meson6_timer_init', 'atcpit100_timer_init',
> 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init', 'skx_init',
> 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
> 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
> 'pnd2_init', 'edac_init', 'adummy_init', 'mtd_stresstest_init',
> 'bxt_idle_state_table_update', 'sklh_idle_state_table_update',
> 'skx_idle_state_table_update',
> 'acpi_gpio_handle_deferred_request_irqs', 'smc_findirq', 'ltpc_probe',
> 'com90io_probe', 'com90xx_probe', 'pcnet32_init_module',
> 'it87_gpio_init', 'f7188x_find', 'it8712f_wdt_find', 'f71808e_find',
> 'it87_wdt_init', 'f71882fg_find', 'it87_find', 'f71805f_find',
> 'parport_pc_init', 'asic3_irq_probe', 'sch311x_detect',
> 'amd_gpio_init', 'dvb_init', 'dvb_register', 'em28xx_alsa_register',
> 'em28xx_dvb_register', 'em28xx_rc_register', 'em28xx_video_register',
> 'blackbird_init', 'bttv_check_chipset', 'ivtvfb_callback_init',
> 'init_control', 'con_init', 'cr_pll_init',
> 'clk_disable_unused_subtree', 'fmi_init', 'cadet_init', 'pcm20_init',
> 'airo_init_module', 'bnx2i_mod_init', 'bnx2fc_mod_init',
> 'timer_of_irq_exit', 'init', 'kempld_init', 'ivtvfb_init',
> 'brcmf_core_init', 'comedi_test_init', 'tlan_eisa_probe',
> 'timer_probe', 'of_clk_init', '__reserved_mem_init_node',
> 'of_irq_init', 'mace_init', 'vortex_eisa_init', 'reset_chip',
> 'atp_init', 'atp_probe1', 'smc_probe', 'osi_setup', 'led_init',
> 'el3_init_module', 'clk_sp810_of_setup', 'ltpc_probe_dma',
> 'com90io_found', 'check_mirror', 'arcrimi_found', 'com90xx_found',
> 'intel_soc_thermal_init', 'thermal_register_governors',
> 'thermal_unregister_governors', 'therm_lvt_init', 'tcc_cooling_init',
> 'powerclamp_probe', 'intel_init', 'qcom_geni_serial_earlycon_setup',
> 'kgdboc_early_init', 'lpuart_console_setup', 'speakup_init',
> 'early_console_setup', 'init_port', 'early_serial8250_setup',
> 'linflex_console_setup', 'register_earlycon', 'of_setup_earlycon',
> 'slgt_init', 'moxa_init', 'parport_pc_init_superio',
> 'parport_pc_find_ports', 'mousedev_init', 'ses_init', 'riocm_init',
> 'efi_rci2_sysfs_init', 'blogic_probe', 'blogic_init',
> 'blogic_init_mm_probeinfo', 'blogic_init_probeinfo_list',
> 'blogic_checkadapter', 'blogic_rdconfig', 'blogic_inquiry',
> 'adpt_init', 'clk_unprepare_unused_subtree', 'aspeed_socinfo_init',
> 'rcar_sysc_pd_setup', 

Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Tue, Oct 12, 2021 at 2:28 PM Andi Kleen  wrote:
[..]
> >> But how do you debug the kernel then? Making early undebuggable seems
> >> just bad policy to me.
> > I am not proposing making the early undebuggable.
>
>
> That's the implication of moving the policy into initrd.
>
>
> If only initrd can authorize then it won't be possible to authorize
> before initrd, thus the early console won't work.

Again, the proposal is that the allow-list is limited to just enough
devices to startup and debug the initramfs and no more. Everything
else can be dynamic, and this allows for a powerful custom override
interface without needing to debate additional ABI like command line
overrides, and minimizes future changes to this kernel-internal
allow-list.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen




But that was due to performance problems in hot paths. Nothing of this
applies here.

It applies because a new API that individual driver authors is being
proposed and that's an ongoing maintenance burden that might be
mitigated by hiding that implementation detail from leaf drivers.


Right now we're only talking about 2 places to change, and none of those 
are actually in individual drivers, but in the virtio generic code and 
in the MSI code.


While there might be drivers in the future that do it directly it will 
be always the exception, normal drivers don't have to deal with this.





For me it both seems very straight forward and simple (but then I'm biased)

You seem to be having a difficult time iterating this proposal toward
consensus. I don't think the base principles are being contested as
much as the semantics, scope, and need for the proposed API that is in
the purview of all leaf driver developers.

Right now there is no leaf driver changed at all.



I'd rather see more concerted efforts focused/limited core changes
rather than leaf driver changes until there is a clearer definition of
hardened.

A hardened driver is a driver that

- Had similar security (not API) oriented review of its IO operations
(mainly MMIO access, but also PCI config space) as a non privileged user
interface (like a ioctl). That review should be focused on memory safety.

- Had some fuzzing on these IO interfaces using to be released tools.

What is the intersection of ioremap() users that are outside of the
proposed probe authorization regime AND want confidential computing
support?



Right now it's zero I believe.

That is there is other low level code that sets memory shared, but it's 
not using ioremap, but some other mechanisms.




are needed

for booting. For example in TDX we can't print something to the console
with this mechanism, so you would never get any output before the
initrd. Just seems like a nightmare for debugging anything. There really
needs to be an authorization mechanism that works reasonably early.

I can see a point of having user space overrides though, but we need to
have a sane kernel default that works early.

Right, as I suggested [1], just enough early authorization to
bootstrap/debug initramfs and then that can authorize the remainder.

But how do you debug the kernel then? Making early undebuggable seems
just bad policy to me.

I am not proposing making the early undebuggable.



That's the implication of moving the policy into initrd.


If only initrd can authorize then it won't be possible to authorize 
before initrd, thus the early console won't work.


-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/12/2021 2:18 PM, Michael S. Tsirkin wrote:

On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote:

Especially in this case where the virtio use case being
opted-in is *already* in a path that has been authorized by the
device-filter policy engine.

That's a good point. Andi, how about setting a per-device flag
if its ID has been allowed and then making pci_iomap create
a shared mapping transparently?


Yes for pci_iomap we could do that.

If someone uses raw ioremap without a device it won't work, but I don't 
think that's the case for virtio at least.


I suppose we could solve that problem if it actually happens.

-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Michael S. Tsirkin
On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote:
> Especially in this case where the virtio use case being
> opted-in is *already* in a path that has been authorized by the
> device-filter policy engine.

That's a good point. Andi, how about setting a per-device flag
if its ID has been allowed and then making pci_iomap create
a shared mapping transparently?

-- 
MST

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Tue, Oct 12, 2021 at 11:35 AM Andi Kleen  wrote:
>
>
> > The "better safe-than-sorry" argument is hard to build consensus
> > around. The spectre mitigations ran into similar problems where the
> > community rightly wanted to see the details and instrument the
> > problematic paths rather than blanket sprinkle lfence "just to be
> > safe".
>
> But that was due to performance problems in hot paths. Nothing of this
> applies here.

It applies because a new API that individual driver authors is being
proposed and that's an ongoing maintenance burden that might be
mitigated by hiding that implementation detail from leaf drivers.

>
> > In this case the rules about when a driver is suitably
> > "hardened" are vague and the overlapping policy engines are confusing.
>
> What is confusing exactly?

Multiple places to go to enable functionality. The device-filter
firewall policy can collide with the ioremap access control policy.

> For me it both seems very straight forward and simple (but then I'm biased)

You seem to be having a difficult time iterating this proposal toward
consensus. I don't think the base principles are being contested as
much as the semantics, scope, and need for the proposed API that is in
the purview of all leaf driver developers.

> The policy is:
>
> - Have an allow list at driver registration.
>
> - Have an additional opt-in for MMIO mappings (and potentially config
> space, but that's not currently there) to cover init calls completely.

The proliferation of policy engines and driver special casing is the
issue. Especially in this case where the virtio use case being
opted-in is *already* in a path that has been authorized by the
device-filter policy engine. I.e. why special case the ioremap() in
virtio to be additionally authorized when the device has already been
authorized to probe? Put another way, the easiest driver API change to
merge would be no additional changes in leaf drivers.

>
> >
> > I'd rather see more concerted efforts focused/limited core changes
> > rather than leaf driver changes until there is a clearer definition of
> > hardened.
>
> A hardened driver is a driver that
>
> - Had similar security (not API) oriented review of its IO operations
> (mainly MMIO access, but also PCI config space) as a non privileged user
> interface (like a ioctl). That review should be focused on memory safety.
>
> - Had some fuzzing on these IO interfaces using to be released tools.

What is the intersection of ioremap() users that are outside of the
proposed probe authorization regime AND want confidential computing
support?

> Right now it's only three virtio drivers (console, net, block)
>
> Really it's no different than what we do for every new unprivileged user
> interface.
>
>
> > I.e. instead of jumping to the assertion that fixing up
> > these init-path vulnerabilities are too big to fix, dig to the next
> > level to provide more evidence that per-driver opt-in is the only
> > viable option.
> >
> > For example, how many of these problematic paths are built-in to the
> > average kernel config?
>
> I don't think arguments from "the average kernel config" (if such a
> thing even exists) are useful. That's would be just hand waving.

I'm trying to bridge to your contention that this enabling can not
rely on custom kernel configs and must offer protection on the same
kernel image that might ship in the host, but lets set this aside and
focus on when and where leaf drivers need to adopt a new API.

> > A strawman might be to add a sprinkling error
> > exits in the module_init() of the problematic drivers, and only fail
> > if the module is built-in, and let modprobe policy handle the rest.
>
>
> That would be already hundreds of changes. I have no idea how such a
> thing could be maintained or sustained either.
>
> Really I don't even see how these alternatives can be considered. Tree
> sweeps should always be last resort. They're a pain for everyone. But
> here they're casually thrown around as alternatives to straight forward
> one or two line changes.

If it looked straightforward I'm not sure we would be having this
discussion, I think it's reasonable to ask if this is a per-driver
opt-in responsibility that must be added in addition to probe
authorization.

> >> Default policy in user space just seems to be a bad idea here. Who
> >> should know if a driver is hardened other than the kernel? Maintaining
> >> the list somewhere else just doesn't make sense to me.
> > I do not understand the maintenance burden correlation of where the
> > policy is driven vs where the list is maintained?
>
> All the hardening and auditing happens in the kernel tree. So it seems
> the natural place to store the result is in the kernel tree.
>
> But there's no single package for initrd, so you would need custom
> configurations for all the supported distros.
>
> Also we're really arguing about a list that currently has three entries.
>
>
> >   Even if I agreed
> > with the contention that 

Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Michael S. Tsirkin
On Tue, Oct 12, 2021 at 06:36:16PM +, Reshetova, Elena wrote:
> > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > others) in init functions that also register drivers (thanks Elena for
> > the number)
> 
> To provide more numbers on this. What I can see so far from a smatch-based
> analysis, we have 409 __init style functions (.probe & builtin/module_
> _platform_driver_probe excluded) for 5.15 with allyesconfig.

I don't think we care about allyesconfig at all though.
Just don't do that.
How about allmodconfig? This is closer to what distros actually do.

-- 
MST

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/12/2021 12:13 PM, Dan Williams wrote:

On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena
 wrote:



I suspect the true number is even higher because that doesn't include IO
inside calls to other modules and indirect pointers, correct?

Actually everything should be included. Smatch has cross-function db and
I am using it for getting the call chains and it follows function pointers.
Also since I am starting from a list of individual read IOs, every single
base read IO in drivers/* should be covered as far as I can see. But if it uses
some weird IO wrappers then the actual list might be higher.

Why analyze individual IO calls? I thought the goal here was to
disable entire classes of ioremap() users?


This is everything that would need to be moved somewhere else if we 
didn't disable the entire classes of ioremap users.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena
 wrote:
>
>
> > I suspect the true number is even higher because that doesn't include IO
> > inside calls to other modules and indirect pointers, correct?
>
> Actually everything should be included. Smatch has cross-function db and
> I am using it for getting the call chains and it follows function pointers.
> Also since I am starting from a list of individual read IOs, every single
> base read IO in drivers/* should be covered as far as I can see. But if it 
> uses
> some weird IO wrappers then the actual list might be higher.

Why analyze individual IO calls? I thought the goal here was to
disable entire classes of ioremap() users?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/12/2021 11:36 AM, Reshetova, Elena wrote:

The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
others) in init functions that also register drivers (thanks Elena for
the number)

To provide more numbers on this. What I can see so far from a smatch-based
analysis, we have 409 __init style functions (.probe & builtin/module_
_platform_driver_probe excluded) for 5.15 with allyesconfig. The number of
distinct individual IO reads (MSRs included) is much higher than 2.4k and on the
range of 30k because quite often there are more than a single IO read in the
same source function. The full list of accesses and the possible call paths is 
huge
for manually looking at, but here is the list of the 409 functions if anyone 
wants
to take a look:



Thanks Elena.


I suspect the true number is even higher because that doesn't include IO 
inside calls to other modules and indirect pointers, correct?



-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen



On 10/11/2021 10:31 PM, Christoph Hellwig wrote:

On Mon, Oct 11, 2021 at 03:09:09PM -0400, Michael S. Tsirkin wrote:

The reason we have trouble is that it's not clear what does the API mean
outside the realm of TDX.
If we really, truly want an API that says "ioremap and it's a hardened
driver" then I guess ioremap_hardened_driver is what you want.

Yes.  And why would be we ioremap the BIOS anyway?  It is not I/O memory
in any of the senses we generally use ioremap for.


I/O memory is anything outside the kernel memory map.

-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Andi Kleen




The "better safe-than-sorry" argument is hard to build consensus
around. The spectre mitigations ran into similar problems where the
community rightly wanted to see the details and instrument the
problematic paths rather than blanket sprinkle lfence "just to be
safe".


But that was due to performance problems in hot paths. Nothing of this 
applies here.



In this case the rules about when a driver is suitably
"hardened" are vague and the overlapping policy engines are confusing.


What is confusing exactly?

For me it both seems very straight forward and simple (but then I'm biased)

The policy is:

- Have an allow list at driver registration.

- Have an additional opt-in for MMIO mappings (and potentially config 
space, but that's not currently there) to cover init calls completely.




I'd rather see more concerted efforts focused/limited core changes
rather than leaf driver changes until there is a clearer definition of
hardened.


A hardened driver is a driver that

- Had similar security (not API) oriented review of its IO operations 
(mainly MMIO access, but also PCI config space) as a non privileged user 
interface (like a ioctl). That review should be focused on memory safety.


- Had some fuzzing on these IO interfaces using to be released tools.

Right now it's only three virtio drivers (console, net, block)

Really it's no different than what we do for every new unprivileged user 
interface.




I.e. instead of jumping to the assertion that fixing up
these init-path vulnerabilities are too big to fix, dig to the next
level to provide more evidence that per-driver opt-in is the only
viable option.

For example, how many of these problematic paths are built-in to the
average kernel config?


I don't think arguments from "the average kernel config" (if such a 
thing even exists) are useful. That's would be just hand waving.




A strawman might be to add a sprinkling error
exits in the module_init() of the problematic drivers, and only fail
if the module is built-in, and let modprobe policy handle the rest.



That would be already hundreds of changes. I have no idea how such a 
thing could be maintained or sustained either.


Really I don't even see how these alternatives can be considered. Tree 
sweeps should always be last resort. They're a pain for everyone. But 
here they're casually thrown around as alternatives to straight forward 
one or two line changes.








Default policy in user space just seems to be a bad idea here. Who
should know if a driver is hardened other than the kernel? Maintaining
the list somewhere else just doesn't make sense to me.

I do not understand the maintenance burden correlation of where the
policy is driven vs where the list is maintained?


All the hardening and auditing happens in the kernel tree. So it seems 
the natural place to store the result is in the kernel tree.


But there's no single package for initrd, so you would need custom 
configurations for all the supported distros.


Also we're really arguing about a list that currently has three entries.



  Even if I agreed
with the contention that out-of-tree userspace would have a hard time
tracking the "hardened" driver list there is still an in-tree
userspace path to explore. E.g. perf maintains lists of things tightly
coupled to the kernel, this authorized device list seems to be in the
same category of data.


You mean the event list? perf is in the kernel tree, so it's maintained 
together with the kernel.


But we don't have a kernel initrd.






Also there is the more practical problem that some devices are needed
for booting. For example in TDX we can't print something to the console
with this mechanism, so you would never get any output before the
initrd. Just seems like a nightmare for debugging anything. There really
needs to be an authorization mechanism that works reasonably early.

I can see a point of having user space overrides though, but we need to
have a sane kernel default that works early.

Right, as I suggested [1], just enough early authorization to
bootstrap/debug initramfs and then that can authorize the remainder.


But how do you debug the kernel then? Making early undebuggable seems 
just bad policy to me.


And if you fix if for the console why not add the two more entries for 
virtio net and block too?



-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Sun, Oct 10, 2021 at 3:11 PM Andi Kleen  wrote:
>
>
> On 10/9/2021 1:39 PM, Dan Williams wrote:
> > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin  wrote:
> >> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> >>> From: Andi Kleen 
> >>>
> >>> For Confidential VM guests like TDX, the host is untrusted and hence
> >>> the devices emulated by the host or any data coming from the host
> >>> cannot be trusted. So the drivers that interact with the outside world
> >>> have to be hardened by sharing memory with host on need basis
> >>> with proper hardening fixes.
> >>>
> >>> For the PCI driver case, to share the memory with the host add
> >>> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> >>>
> >>> Signed-off-by: Andi Kleen 
> >>> Signed-off-by: Kuppuswamy Sathyanarayanan 
> >>> 
> >> So I proposed to make all pci mappings shared, eliminating the need
> >> to patch drivers.
> >>
> >> To which Andi replied
> >>  One problem with removing the ioremap opt-in is that
> >>  it's still possible for drivers to get at devices without going 
> >> through probe.
> >>
> >> To which Greg replied:
> >> https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
> >>  If there are in-kernel PCI drivers that do not do this, they need 
> >> to be
> >>  fixed today.
> >>
> >> Can you guys resolve the differences here?
> > I agree with you and Greg here. If a driver is accessing hardware
> > resources outside of the bind lifetime of one of the devices it
> > supports, and in a way that neither modrobe-policy nor
> > device-authorization -policy infrastructure can block, that sounds
> > like a bug report.
>
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> others) in init functions that also register drivers (thanks Elena for
> the number)
>
> Some are probably old drivers that could be fixed, but it's quite a few
> legitimate cases. For example for platform or ISA drivers that's the
> only way they can be implemented because they often have no other
> enumeration mechanism. For PCI drivers it's rarer, but also still can
> happen. One example that comes to mind here is the x86 Intel uncore
> drivers, which support a mix of MSR, ioremap and PCI config space
> accesses all from the same driver. This particular example can (and
> should be) fixed in other ways, but similar things also happen in other
> drivers, and they're not all broken. Even for the broken ones they're
> usually for some crufty old devices that has very few users, so it's
> likely untestable in practice.
>
> My point is just that the ecosystem of devices that Linux supports is
> messy enough that there are legitimate exceptions from the "First IO
> only in probe call only" rule.
>
> And we can't just fix them all. Even if we could it would be hard to
> maintain.
>
> Using a "firewall model" hooking into a few strategic points like we're
> proposing here is much saner for everyone.
>
> Now we can argue about the details. Right now what we're proposing has
> some redundancies: it has both a device model filter and low level
> filter for ioremap (this patch and some others). The low level filter is
> for catching issues that don't clearly fit into the
> "enumeration<->probe" model. You could call that redundant, but I would
> call it defense in depth or better safe than sorry. In theory it would
> be enough to have the low level opt-in only, but that would have the
> drawback that is something gets enumerated after all you would have all
> kind of weird device driver failures and in some cases even killed
> guests. So I think it makes sense to have

The "better safe-than-sorry" argument is hard to build consensus
around. The spectre mitigations ran into similar problems where the
community rightly wanted to see the details and instrument the
problematic paths rather than blanket sprinkle lfence "just to be
safe". In this case the rules about when a driver is suitably
"hardened" are vague and the overlapping policy engines are confusing.

I'd rather see more concerted efforts focused/limited core changes
rather than leaf driver changes until there is a clearer definition of
hardened. I.e. instead of jumping to the assertion that fixing up
these init-path vulnerabilities are too big to fix, dig to the next
level to provide more evidence that per-driver opt-in is the only
viable option.

For example, how many of these problematic paths are built-in to the
average kernel config? A strawman might be to add a sprinkling error
exits in the module_init() of the problematic drivers, and only fail
if the module is built-in, and let modprobe policy handle the rest.

>
>
> > Fix those drivers instead of sprinkling
> > ioremap_shared in select places and with unclear rules about when a
> > driver is allowed to do "shared" mappings.
>
> Only add it when the driver has been audited and hardened.
>
> But I agree we need on a documented process for this. I will work on
> some 

Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-11 Thread Christoph Hellwig
On Mon, Oct 11, 2021 at 03:09:09PM -0400, Michael S. Tsirkin wrote:
> The reason we have trouble is that it's not clear what does the API mean
> outside the realm of TDX.
> If we really, truly want an API that says "ioremap and it's a hardened
> driver" then I guess ioremap_hardened_driver is what you want.

Yes.  And why would be we ioremap the BIOS anyway?  It is not I/O memory
in any of the senses we generally use ioremap for.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-11 Thread Michael S. Tsirkin
On Mon, Oct 11, 2021 at 10:23:00AM -0700, Andi Kleen wrote:
> 
> On 10/11/2021 12:58 AM, Christoph Hellwig wrote:
> > Just as last time:  This does not make any sense.  ioremap is shared
> > by definition.
> 
> It's not necessarily shared with the host for confidential computing: for
> example BIOS mappings definitely should not be shared, but they're using
> ioremap today.

That just needs to be fixed.

> But if you have a better term please propose something. I tried to clarify
> it with "shared_host", but I don't know a better term.
> 
> 
> -Andi
> 


The reason we have trouble is that it's not clear what does the API mean
outside the realm of TDX.
If we really, truly want an API that says "ioremap and it's a hardened
driver" then I guess ioremap_hardened_driver is what you want.

-- 
MST

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-11 Thread Michael S. Tsirkin
On Mon, Oct 11, 2021 at 10:32:23AM -0700, Andi Kleen wrote:
> 
> > Because it does not end with I/O operations, that's a trivial example.
> > module unloading is famous for being racy: I just re-read that part of
> > virtio drivers and sure enough we have bugs there, this is after
> > they have presumably been audited, so a TDX guest is better off
> > just disabling hot-unplug completely, and hotplug isn't far behind.
> 
> These all shouldn't matter for a confidential guest. The only way it can be
> attacked is through IO, everything else is protected by hardware.
> 
> 
> Also it would all require doing something at the guest level, which we
> assume is not malicious.
> 
> 
> > Malicious filesystems can exploit many linux systems unless
> > you take pains to limit what is mounted and how.
> 
> That's expected to be handled by authenticated dmcrypt and similar.
> Hardening at this level has been done for many years.

It's possible to do it like this, sure. But that's not the
only configuration, userspace needs to be smart about setting things up.
Which is my point really.

> 
> > Networking devices tend to get into the default namespaces and can
> > do more or less whatever CAP_NET_ADMIN can.
> > Etc.
> 
> 
> Networking should be already hardened, otherwise you would have much worse
> problems today.

Same thing. NFS is pretty common, you are saying don't do it then. Fair
enough but again, arbitrary configs just aren't going to be secure.

> 
> 
> > hange in your subsystem here.
> > Well I commented on the API patch, not the virtio patch.
> > If it's a way for a driver to say "I am hardened
> > and audited" then I guess it should at least say so.
> 
> 
> This is handled by the central allow list. We intentionally didn't want each
> driver to declare itself, but have a central list where changes will get
> more scrutiny than random driver code.

Makes sense. Additionally, distros can tweak that to their heart's
content, selecting the functionality/security balance that makes sense
for them.

> But then there are the additional opt-ins for the low level firewall. These
> are in the API. I don't see how it could be done at the driver level, unless
> you want to pass in a struct device everywhere?

I am just saying don't do it then. Don't build drivers that distro does
not want to support into kernel. And don't load them when they are
modules.

> > > > How about creating a defconfig that makes sense for TDX then?
> > > TDX can be used in many different ways, I don't think a defconfig is
> > > practical.
> > > 
> > > In theory you could do some Kconfig dependency (at the pain point of 
> > > having
> > > separate kernel binariees), but why not just do it at run time then if you
> > > maintain the list anyways. That's much easier and saner for everyone. In 
> > > the
> > > past we usually always ended up with runtime mechanism for similar things
> > > anyways.
> > > 
> > > Also it turns out that the filter mechanisms are needed for some arch
> > > drivers which are not even configurable, so alone it's probably not 
> > > enough,
> > 
> > I guess they aren't really needed though right, or you won't try to
> > filter them?
> 
> We're addressing most of them with the device filter for platform drivers.
> But since we cannot stop them doing ioremap IO in their init code they also
> need the low level firewall.
> 
> Some others that cannot be addressed have explicit disables.
> 
> 
> > So make them configurable?
> 
> Why not just fix the runtime? It's much saner for everyone. Proposing to do
> things at build time sounds like we're in Linux 0.99 days.
> 
> -Andi

Um. Tweaking driver code is not just build time, it's development time.
At least with kconfig you don't need to patch your kernel.

-- 
MST

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-11 Thread Andi Kleen




Because it does not end with I/O operations, that's a trivial example.
module unloading is famous for being racy: I just re-read that part of
virtio drivers and sure enough we have bugs there, this is after
they have presumably been audited, so a TDX guest is better off
just disabling hot-unplug completely, and hotplug isn't far behind.


These all shouldn't matter for a confidential guest. The only way it can 
be attacked is through IO, everything else is protected by hardware.



Also it would all require doing something at the guest level, which we 
assume is not malicious.




Malicious filesystems can exploit many linux systems unless
you take pains to limit what is mounted and how.


That's expected to be handled by authenticated dmcrypt and similar. 
Hardening at this level has been done for many years.




Networking devices tend to get into the default namespaces and can
do more or less whatever CAP_NET_ADMIN can.
Etc.



Networking should be already hardened, otherwise you would have much 
worse problems today.





hange in your subsystem here.
Well I commented on the API patch, not the virtio patch.
If it's a way for a driver to say "I am hardened
and audited" then I guess it should at least say so.



This is handled by the central allow list. We intentionally didn't want 
each driver to declare itself, but have a central list where changes 
will get more scrutiny than random driver code.


But then there are the additional opt-ins for the low level firewall. 
These are in the API. I don't see how it could be done at the driver 
level, unless you want to pass in a struct device everywhere?



How about creating a defconfig that makes sense for TDX then?

TDX can be used in many different ways, I don't think a defconfig is
practical.

In theory you could do some Kconfig dependency (at the pain point of having
separate kernel binariees), but why not just do it at run time then if you
maintain the list anyways. That's much easier and saner for everyone. In the
past we usually always ended up with runtime mechanism for similar things
anyways.

Also it turns out that the filter mechanisms are needed for some arch
drivers which are not even configurable, so alone it's probably not enough,


I guess they aren't really needed though right, or you won't try to
filter them?


We're addressing most of them with the device filter for platform 
drivers. But since we cannot stop them doing ioremap IO in their init 
code they also need the low level firewall.


Some others that cannot be addressed have explicit disables.



So make them configurable?


Why not just fix the runtime? It's much saner for everyone. Proposing to 
do things at build time sounds like we're in Linux 0.99 days.


-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-11 Thread Andi Kleen



On 10/11/2021 12:58 AM, Christoph Hellwig wrote:

Just as last time:  This does not make any sense.  ioremap is shared
by definition.


It's not necessarily shared with the host for confidential computing: 
for example BIOS mappings definitely should not be shared, but they're 
using ioremap today.


But if you have a better term please propose something. I tried to 
clarify it with "shared_host", but I don't know a better term.



-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-11 Thread Michael S. Tsirkin
On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote:
> 
> > To which Andi replied
> > One problem with removing the ioremap opt-in is that
> > it's still possible for drivers to get at devices without going through 
> > probe.
> > 
> > To which Greg replied:
> > https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
> > If there are in-kernel PCI drivers that do not do this, they need to be
> > fixed today.
> > 
> > Can you guys resolve the differences here?
> 
> 
> I addressed this in my other mail, but we may need more discussion.

Hopefully Greg will reply to that one.

> 
> > 
> > And once they are resolved, mention this in the commit log so
> > I don't get to re-read the series just to find out nothing
> > changed in this respect?
> > 
> > I frankly do not believe we are anywhere near being able to harden
> > an arbitrary kernel config against attack.
> 
> Why not? Device filter and the opt-ins together are a fairly strong
> mechanism.

Because it does not end with I/O operations, that's a trivial example.
module unloading is famous for being racy: I just re-read that part of
virtio drivers and sure enough we have bugs there, this is after
they have presumably been audited, so a TDX guest is better off
just disabling hot-unplug completely, and hotplug isn't far behind.
Malicious filesystems can exploit many linux systems unless
you take pains to limit what is mounted and how.
Networking devices tend to get into the default namespaces and can
do more or less whatever CAP_NET_ADMIN can.
Etc.
I am not saying this makes the effort worthless, I am saying userspace
better know very well what it's doing, and kernel better be
configured in a very specific way.

> And it's not that they're a lot of code or super complicated either.
> 
> You're essentially objecting to a single line change in your subsystem here.

Well I commented on the API patch, not the virtio patch.
If it's a way for a driver to say "I am hardened
and audited" then I guess it should at least say so. It has nothing
to do with host or sharing, that's an implementation detail,
and it obscures the actual limitations of the approach,
in that eventually in an ideal world all drivers would be secure
and use this API.

Yes, if that's the API that PCI gains then virtio will use it.


> > How about creating a defconfig that makes sense for TDX then?
> 
> TDX can be used in many different ways, I don't think a defconfig is
> practical.
> 
> In theory you could do some Kconfig dependency (at the pain point of having
> separate kernel binariees), but why not just do it at run time then if you
> maintain the list anyways. That's much easier and saner for everyone. In the
> past we usually always ended up with runtime mechanism for similar things
> anyways.
> 
> Also it turns out that the filter mechanisms are needed for some arch
> drivers which are not even configurable, so alone it's probably not enough,


I guess they aren't really needed though right, or you won't try to
filter them? So make them configurable?

> 
> > Anyone deviating from that better know what they are doing,
> > this API tweaking is just putting policy into the kernel  ...
> 
> Hardening drivers is kernel policy. It cannot be done anywhere else.
> 
> 
> -Andi

To clarify, the policy is which drivers to load into the kernel.

-- 
MST

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-11 Thread Christoph Hellwig
Just as last time:  This does not make any sense.  ioremap is shared
by definition.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-10 Thread Andi Kleen




To which Andi replied
One problem with removing the ioremap opt-in is that
it's still possible for drivers to get at devices without going through 
probe.

To which Greg replied:
https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
If there are in-kernel PCI drivers that do not do this, they need to be
fixed today.

Can you guys resolve the differences here?



I addressed this in my other mail, but we may need more discussion.




And once they are resolved, mention this in the commit log so
I don't get to re-read the series just to find out nothing
changed in this respect?

I frankly do not believe we are anywhere near being able to harden
an arbitrary kernel config against attack.


Why not? Device filter and the opt-ins together are a fairly strong 
mechanism.


And it's not that they're a lot of code or super complicated either.

You're essentially objecting to a single line change in your subsystem here.



How about creating a defconfig that makes sense for TDX then?


TDX can be used in many different ways, I don't think a defconfig is 
practical.


In theory you could do some Kconfig dependency (at the pain point of 
having separate kernel binariees), but why not just do it at run time 
then if you maintain the list anyways. That's much easier and saner for 
everyone. In the past we usually always ended up with runtime mechanism 
for similar things anyways.


Also it turns out that the filter mechanisms are needed for some arch 
drivers which are not even configurable, so alone it's probably not enough,




Anyone deviating from that better know what they are doing,
this API tweaking is just putting policy into the kernel  ...


Hardening drivers is kernel policy. It cannot be done anywhere else.


-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-10 Thread Andi Kleen



On 10/9/2021 1:39 PM, Dan Williams wrote:

On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin  wrote:

On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:

From: Andi Kleen 

For Confidential VM guests like TDX, the host is untrusted and hence
the devices emulated by the host or any data coming from the host
cannot be trusted. So the drivers that interact with the outside world
have to be hardened by sharing memory with host on need basis
with proper hardening fixes.

For the PCI driver case, to share the memory with the host add
pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.

Signed-off-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 


So I proposed to make all pci mappings shared, eliminating the need
to patch drivers.

To which Andi replied
 One problem with removing the ioremap opt-in is that
 it's still possible for drivers to get at devices without going 
through probe.

To which Greg replied:
https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
 If there are in-kernel PCI drivers that do not do this, they need to be
 fixed today.

Can you guys resolve the differences here?

I agree with you and Greg here. If a driver is accessing hardware
resources outside of the bind lifetime of one of the devices it
supports, and in a way that neither modrobe-policy nor
device-authorization -policy infrastructure can block, that sounds
like a bug report.


The 5.15 tree has something like ~2.4k IO accesses (including MMIO and 
others) in init functions that also register drivers (thanks Elena for 
the number)


Some are probably old drivers that could be fixed, but it's quite a few 
legitimate cases. For example for platform or ISA drivers that's the 
only way they can be implemented because they often have no other 
enumeration mechanism. For PCI drivers it's rarer, but also still can 
happen. One example that comes to mind here is the x86 Intel uncore 
drivers, which support a mix of MSR, ioremap and PCI config space 
accesses all from the same driver. This particular example can (and 
should be) fixed in other ways, but similar things also happen in other 
drivers, and they're not all broken. Even for the broken ones they're 
usually for some crufty old devices that has very few users, so it's 
likely untestable in practice.


My point is just that the ecosystem of devices that Linux supports is 
messy enough that there are legitimate exceptions from the "First IO 
only in probe call only" rule.


And we can't just fix them all. Even if we could it would be hard to 
maintain.


Using a "firewall model" hooking into a few strategic points like we're 
proposing here is much saner for everyone.


Now we can argue about the details. Right now what we're proposing has 
some redundancies: it has both a device model filter and low level 
filter for ioremap (this patch and some others). The low level filter is 
for catching issues that don't clearly fit into the 
"enumeration<->probe" model. You could call that redundant, but I would 
call it defense in depth or better safe than sorry. In theory it would 
be enough to have the low level opt-in only, but that would have the 
drawback that is something gets enumerated after all you would have all 
kind of weird device driver failures and in some cases even killed 
guests. So I think it makes sense to have




Fix those drivers instead of sprinkling
ioremap_shared in select places and with unclear rules about when a
driver is allowed to do "shared" mappings.


Only add it when the driver has been audited and hardened.

But I agree we need on a documented process for this. I will work on 
some documentation for a proposal. But essentially I think it should be 
some variant of what Elena has outlined in her talk at Security Summit.


https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf

That is using extra auditing/scrutiny at review time, supported with 
some static code analysis that points to the interaction points, and 
code needs to be fuzzed explicitly.


However short term it's only three virtio drivers, so this is not a 
urgent problem.



Let the new
device-authorization mechanism (with policy in userspace)



Default policy in user space just seems to be a bad idea here. Who 
should know if a driver is hardened other than the kernel? Maintaining 
the list somewhere else just doesn't make sense to me.


Also there is the more practical problem that some devices are needed 
for booting. For example in TDX we can't print something to the console 
with this mechanism, so you would never get any output before the 
initrd. Just seems like a nightmare for debugging anything. There really 
needs to be an authorization mechanism that works reasonably early.


I can see a point of having user space overrides though, but we need to 
have a sane kernel default that works early.


-Andi
___
Virtualization mailing 

Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-09 Thread Dan Williams
On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin  wrote:
>
> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> > From: Andi Kleen 
> >
> > For Confidential VM guests like TDX, the host is untrusted and hence
> > the devices emulated by the host or any data coming from the host
> > cannot be trusted. So the drivers that interact with the outside world
> > have to be hardened by sharing memory with host on need basis
> > with proper hardening fixes.
> >
> > For the PCI driver case, to share the memory with the host add
> > pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> >
> > Signed-off-by: Andi Kleen 
> > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > 
>
> So I proposed to make all pci mappings shared, eliminating the need
> to patch drivers.
>
> To which Andi replied
> One problem with removing the ioremap opt-in is that
> it's still possible for drivers to get at devices without going 
> through probe.
>
> To which Greg replied:
> https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
> If there are in-kernel PCI drivers that do not do this, they need to 
> be
> fixed today.
>
> Can you guys resolve the differences here?

I agree with you and Greg here. If a driver is accessing hardware
resources outside of the bind lifetime of one of the devices it
supports, and in a way that neither modrobe-policy nor
device-authorization -policy infrastructure can block, that sounds
like a bug report. Fix those drivers instead of sprinkling
ioremap_shared in select places and with unclear rules about when a
driver is allowed to do "shared" mappings. Let the new
device-authorization mechanism (with policy in userspace) be the
central place where all of these driver "trust" issues are managed.

> And once they are resolved, mention this in the commit log so
> I don't get to re-read the series just to find out nothing
> changed in this respect?
>
> I frankly do not believe we are anywhere near being able to harden
> an arbitrary kernel config against attack.
> How about creating a defconfig that makes sense for TDX then?
> Anyone deviating from that better know what they are doing,
> this API tweaking is just putting policy into the kernel  ...

Right, userspace authorization policy and select driver fixups seems
to be the answer to the raised concerns.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-09 Thread Michael S. Tsirkin
On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> From: Andi Kleen 
> 
> For Confidential VM guests like TDX, the host is untrusted and hence
> the devices emulated by the host or any data coming from the host
> cannot be trusted. So the drivers that interact with the outside world
> have to be hardened by sharing memory with host on need basis
> with proper hardening fixes.
> 
> For the PCI driver case, to share the memory with the host add
> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> 
> Signed-off-by: Andi Kleen 
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 

So I proposed to make all pci mappings shared, eliminating the need
to patch drivers.

To which Andi replied
One problem with removing the ioremap opt-in is that
it's still possible for drivers to get at devices without going through 
probe.

To which Greg replied:
https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
If there are in-kernel PCI drivers that do not do this, they need to be
fixed today.

Can you guys resolve the differences here?

And once they are resolved, mention this in the commit log so
I don't get to re-read the series just to find out nothing
changed in this respect?

I frankly do not believe we are anywhere near being able to harden
an arbitrary kernel config against attack.
How about creating a defconfig that makes sense for TDX then?
Anyone deviating from that better know what they are doing,
this API tweaking is just putting policy into the kernel  ...

> ---
>  Changes since v4:
>  * Replaced "_shared" with "_host_shared" in pci_iomap* APIs
>  * Fixed commit log as per review comments.
> 
>  include/asm-generic/pci_iomap.h |  6 +
>  lib/pci_iomap.c | 47 +
>  2 files changed, 53 insertions(+)
> 
> diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
> index df636c6d8e6c..a4a83c8ab3cf 100644
> --- a/include/asm-generic/pci_iomap.h
> +++ b/include/asm-generic/pci_iomap.h
> @@ -18,6 +18,12 @@ extern void __iomem *pci_iomap_range(struct pci_dev *dev, 
> int bar,
>  extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
>   unsigned long offset,
>   unsigned long maxlen);
> +extern void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar,
> +unsigned long max);
> +extern void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int 
> bar,
> +  unsigned long offset,
> +  unsigned long maxlen);
> +
>  /* Create a virtual mapping cookie for a port on a given PCI device.
>   * Do not call this directly, it exists to make it easier for architectures
>   * to override */
> diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
> index 57bd92f599ee..2816dc8715da 100644
> --- a/lib/pci_iomap.c
> +++ b/lib/pci_iomap.c
> @@ -25,6 +25,11 @@ static void __iomem *map_ioremap_wc(phys_addr_t addr, 
> size_t size)
>   return ioremap_wc(addr, size);
>  }
>  
> +static void __iomem *map_ioremap_host_shared(phys_addr_t addr, size_t size)
> +{
> + return ioremap_host_shared(addr, size);
> +}
> +
>  static void __iomem *pci_iomap_range_map(struct pci_dev *dev,
>int bar,
>unsigned long offset,
> @@ -106,6 +111,48 @@ void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
>  }
>  EXPORT_SYMBOL_GPL(pci_iomap_wc_range);
>  
> +/**
> + * pci_iomap_host_shared_range - create a virtual shared mapping cookie
> + *for a PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @offset: map memory at the given offset in BAR
> + * @maxlen: max length of the memory to map
> + *
> + * Remap a pci device's resources shared in a confidential guest.
> + * For more details see pci_iomap_range's documentation.

So how does a driver author know when to use this function, and when to
use the regular pci_iomap_range?  Drivers have no idea whether they are
used in a confidential guest, and which ranges are shared, it's a TDX
thing ...

This documentation should really address it.

> + *
> + * @maxlen specifies the maximum length to map. To get access to
> + * the complete BAR from offset to the end, pass %0 here.
> + */
> +void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar,
> +   unsigned long offset,
> +   unsigned long maxlen)
> +{
> + return pci_iomap_range_map(dev, bar, offset, maxlen,
> +map_ioremap_host_shared, true);
> +}
> +EXPORT_SYMBOL_GPL(pci_iomap_host_shared_range);
> +
> +/**
> + * pci_iomap_host_shared - create a virtual shared mapping cookie for a PCI 
> BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR