Re: [Xen-devel] kernel BUG at nvme/host/pci.c

2017-07-15 Thread Andreas Pflug
On 15.07.17 at 10:51, Christoph Hellwig wrote:
> On Fri, Jul 14, 2017 at 01:08:47PM -0400, Keith Busch wrote:
>>> So LVM2 backed by md raid1 isn't compatible with newer hardware... Any
>>> suggestions?
>> It's not that LVM2 or RAID isn't compatible. Either the IOMMU isn't
>> compatible if it can use different page offsets for DMA addresses than the
>> physical addresses, or the driver for it is broken. The DMA addresses
>> in this mapped SGL look completely broken, at least, since the last 4
>> entries are all the same address. That'll corrupt data.
> Given that this is a Xen system I wonder if swiotlb-xen is involved
> here, which does some odd chunking of dma translations?
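(For illustration of the constraint Keith describes: an NVMe PRP list can only
express a request whose mapped segments, after the first, start on a page
boundary and, except for the last, end on one. If a DMA mapping layer such as
swiotlb-xen hands back bus addresses with page offsets that differ from the
physical pages, that rule is violated, which is roughly the situation behind
the "Invalid SGL for payload" warning in the log below. The following is a
minimal stand-alone sketch; the names and values are assumptions, not the
kernel's code in drivers/nvme/host/pci.c.)

/*
 * Minimal sketch of the NVMe PRP alignment rule. Struct and function
 * names are assumptions for illustration only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u

struct dma_seg {
        uint64_t dma_addr;      /* bus address from the DMA API */
        uint32_t len;           /* length of the mapped segment */
};

/* A request fits a PRP list only if its segments tile whole pages:
 * every segment after the first starts page-aligned, and every
 * segment except the last ends on a page boundary. */
static bool prp_compatible(const struct dma_seg *sg, int nents)
{
        for (int i = 0; i < nents; i++) {
                if (i > 0 && (sg[i].dma_addr & (PAGE_SIZE - 1)))
                        return false;
                if (i < nents - 1 &&
                    ((sg[i].dma_addr + sg[i].len) & (PAGE_SIZE - 1)))
                        return false;
        }
        return true;
}

int main(void)
{
        /* A middle segment mapped at a non-zero page offset, as can happen
         * when the DMA address does not keep the physical page offset
         * (the situation Keith describes above). */
        struct dma_seg segs[3] = {
                { 0x10000, 4096 },
                { 0x20200, 4096 },      /* starts 0x200 into a page */
                { 0x30000, 4096 },
        };

        if (!prp_compatible(segs, 3))
                printf("request cannot be expressed as a PRP list\n");
        return 0;
}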

I did some more testing now.

With data stored on SATA disks with md1 and lvm2 (i.e. just replacing
NVMe by SATA), the problem does not occur.
With data stored directly on /dev/nvme1n1p1, i.e. without any device-mapper
layer in between, I get the same problem.
Log attached.

Regards,
Andreas
Jul 15 15:25:06 xen2 [ 4376.149215] Invalid SGL for payload:20992 nents:5
Jul 15 15:25:06 xen2 [ 4376.150382] [ cut here ]
Jul 15 15:25:06 xen2 [ 4376.151261] WARNING: CPU: 0 PID: 29095 at 
drivers/nvme/host/pci.c:623 nvme_queue_rq+0x81b/0x840 [nvme]
Jul 15 15:25:06 xen2 [ 4376.152194] Modules linked in: xt_physdev br_netfilter 
iptable_filter xen_netback xen_blkback netconsole configfs bridge xen_gntdev 
xen_evtchn xenfs xen_privcmd iTCO_wdt intel_rapl iTCO_vendor_support mxm_wmi 
x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd 
intel_rapl_perf snd_pcm snd_timer snd soundcore pcspkr i2c_i801 joydev ast ttm 
drm_kms_helper drm sg i2c_algo_bit lpc_ich ehci_pci mfd_core ehci_hcd mei_me 
mei e1000e ixgbe ptp nvme pps_core mdio nvme_core ioatdma shpchp dca wmi 
acpi_power_meter 8021q garp mrp stp llc button ipmi_si ipmi_devintf 
ipmi_msghandler sunrpc drbd lru_cache ip_tables x_tables autofs4 ext4 crc16 
mbcache jbd2 fscrypto raid10 raid456 libcrc32c crc32c_generic async_raid6_recov
Jul 15 15:25:06 xen2 [ 4376.158582]  async_memcpy async_pq async_xor xor 
async_tx raid6_pq raid0 multipath linear evdev hid_generic usbhid hid bcache 
dm_mod raid1 md_mod sd_mod crc32c_intel ahci libahci xhci_pci xhci_hcd libata 
usbcore scsi_mod
Jul 15 15:25:06 xen2 [ 4376.160593] CPU: 0 PID: 29095 Comm: 8.hda-0 Tainted: G  
D W   4.12.0-20170713+ #1
Jul 15 15:25:06 xen2 [ 4376.161678] Hardware name: Supermicro X10DRi/X10DRI-T, 
BIOS 2.1 09/13/2016
Jul 15 15:25:06 xen2 [ 4376.162649] task: 88015fdc5000 task.stack: 
c90048134000
Jul 15 15:25:06 xen2 [ 4376.163676] RIP: e030:nvme_queue_rq+0x81b/0x840 [nvme]
Jul 15 15:25:06 xen2 [ 4376.164804] RSP: e02b:c90048137a00 EFLAGS: 00010286
Jul 15 15:25:06 xen2 [ 4376.165890] RAX: 0025 RBX: f200 
RCX: 
Jul 15 15:25:06 xen2 [ 4376.166982] RDX:  RSI: 880186a0de98 
RDI: 880186a0de98
Jul 15 15:25:06 xen2 [ 4376.168099] RBP: 8801732ff000 R08: 0001 
R09: 0a57
Jul 15 15:25:06 xen2 [ 4376.169081] R10: 1000 R11: 0001 
R12: 0200
Jul 15 15:25:06 xen2 [ 4376.170198] R13: 1000 R14: 88015f9d7800 
R15: 88016fce1800
Jul 15 15:25:06 xen2 [ 4376.171330] FS:  () 
GS:880186a0() knlGS:880186a0
Jul 15 15:25:06 xen2 [ 4376.172474] CS:  e033 DS:  ES:  CR0: 
80050033
Jul 15 15:25:06 xen2 [ 4376.173600] CR2: 00b0f98d1970 CR3: 000175d4f000 
CR4: 00042660
Jul 15 15:25:06 xen2 [ 4376.174643] Call Trace:
Jul 15 15:25:06 xen2 [ 4376.175743]  ? __sbitmap_get_word+0x2a/0x80
Jul 15 15:25:06 xen2 [ 4376.176814]  ? blk_mq_dispatch_rq_list+0x200/0x3d0
Jul 15 15:25:06 xen2 [ 4376.177932]  ? blk_mq_flush_busy_ctxs+0xd1/0x120
Jul 15 15:25:06 xen2 [ 4376.178961]  ? 
blk_mq_sched_dispatch_requests+0x1c0/0x1f0
Jul 15 15:25:06 xen2 [ 4376.179942]  ? __blk_mq_delay_run_hw_queue+0x8f/0xa0
Jul 15 15:25:06 xen2 [ 4376.180941]  ? blk_mq_flush_plug_list+0x184/0x260
Jul 15 15:25:06 xen2 [ 4376.181935]  ? blk_flush_plug_list+0xf2/0x280
Jul 15 15:25:06 xen2 [ 4376.182952]  ? blk_finish_plug+0x27/0x40
Jul 15 15:25:06 xen2 [ 4376.183985]  ? dispatch_rw_block_io+0x732/0x9c0 
[xen_blkback]
Jul 15 15:25:06 xen2 [ 4376.185059]  ? _raw_spin_lock_irqsave+0x17/0x39
Jul 15 15:25:06 xen2 [ 4376.186103]  ? __do_block_io_op+0x362/0x690 
[xen_blkback]
Jul 15 15:25:06 xen2 [ 4376.187167]  ? _raw_spin_unlock_irqrestore+0x16/0x20
Jul 15 15:25:06 xen2 [ 4376.188216]  ? __do_block_io_op+0x362/0x690 
[xen_blkback]
Jul 15 15:25:06 xen2 [ 4376.189294]  ? xen_blkif_schedule+0x116/0x7f0 
[xen_blkback]
Jul 15 15:25:06 xen2 [ 4376.190247]  ? __schedule+0x3cd/0x850
Jul 15 15:25:06 xen2 [ 4376.191152]  ? remove_wait_queue+0x60/0x60
Jul 15 15:25:06 xen2 [ 4376.192112]  ? kthread+0xfc/0x130
Jul 15 15:25:06 xen2 [ 4376.193169]  ? xen_blkif_be_int+0x30/0x30 

Re: [Xen-devel] [Xen-users] 4.8.1 migration fails over 1st interface, works over 2nd

2017-06-30 Thread Andreas Pflug
My problem still persists, but the thread seems to have stalled.
Apparently, my reply didn't hit the list.



On 05.06.17 at 11:33, Andrew Cooper wrote:
> On 05/06/17 10:17, George Dunlap wrote:
>> On Mon, May 29, 2017 at 10:04 AM, Andreas Pflug
>> <pgad...@pse-consulting.de> wrote:
>>> I've setup a fresh Debian stretch with xen 4.8.1 and shared storage via
>>> custom block scripts on two machines.
>>>
>>> Both machines have one main interface with some VLAN stuff, the VM
>>> bridges and the SAN interface connected to a switch, and another
>>> interface directly interconnecting both machines. To ensure packets
>>> don't take weird routes, arp_announce=2/arp_ignore=1 is configured.
>>> Everything on the primary interface seems to work flawlessly, e.g.
>>> ssh-ing from one machine to the other (no firewall or other filter
>>> involved).
>>>
>>> With xl migrate  , migration
>>> works as expected, bringing the test domain back up fully functional.
>>>
>>> With xl migrate --debug  , I get
>>> xc: info: Saving domain 17, type x86 PV
>>> xc: info: Found x86 PV domain from Xen 4.8
>>> xc: info: Restoring domain
>>>
>>> and migration will stop here. The target machine will show the incoming
>>> VM, but nothing more happens. I have to kill xl on the target, Ctrl-C xl
>>> on the source machine, and destroy the target VM--incoming
>> Are you saying that migration works fine for you *unless* you add the
>> `--debug` option?
>>
>> Andy / Wei, any ideas?
> --debug adds an extra full memory copy, using memcmp() on the destination
> side to spot if any memory got missed during the live phase.
>
> It is only intended for development purposes, but I'd also expect it to
> function normally in the way you've used it.
>
> What does `xl -vvv migrate ...` say?
>
> ~Andrew
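
(For illustration of the --debug behaviour Andrew describes above:
conceptually, after the restore the receiving side gets every page re-sent
once more and memcmp()s it against what it already wrote, flagging pages the
live phase missed. The sketch below only illustrates that idea; the function
names and data layout are assumptions, not libxc's actual code.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Compare one page re-sent for verification against the copy that the
 * restore already placed in the guest; report any mismatch. */
static int verify_page(uint64_t pfn, const uint8_t *sent,
                       const uint8_t *restored)
{
        if (memcmp(sent, restored, PAGE_SIZE) == 0)
                return 0;
        fprintf(stderr, "verify: pfn 0x%llx differs from sender copy\n",
                (unsigned long long)pfn);
        return 1;
}

/* Walk a batch of re-sent pages; a non-zero result means some memory was
 * missed (or corrupted) during the live migration phase. */
static int verify_batch(const uint64_t *pfns,
                        uint8_t sent[][PAGE_SIZE],
                        uint8_t restored[][PAGE_SIZE],
                        size_t count)
{
        int mismatches = 0;
        for (size_t i = 0; i < count; i++)
                mismatches += verify_page(pfns[i], sent[i], restored[i]);
        return mismatches;
}

int main(void)
{
        static uint8_t sent[1][PAGE_SIZE], restored[1][PAGE_SIZE];
        uint64_t pfns[1] = { 0x1234 };

        restored[0][0] = 0xff;          /* simulate a page missed live */
        return verify_batch(pfns, sent, restored, 1) ? 1 : 0;
}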

xl -vvv gives

libxl: debug: libxl.c:6895:libxl_retrieve_domain_configuration: no vtpm from 
xenstore for domain 21
libxl: debug: libxl.c:6895:libxl_retrieve_domain_configuration: no usbctrl from 
xenstore for domain 21
libxl: debug: libxl.c:6895:libxl_retrieve_domain_configuration: no usbdev from 
xenstore for domain 21
libxl: debug: libxl.c:6895:libxl_retrieve_domain_configuration: no pci from 
xenstore for domain 21
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x3/0x0/1773)
libxl: debug: libxl.c:932:libxl_domain_suspend: ao 0x55efd7b089d0: create: 
how=(nil) callback=(nil) poller=0x55efd7b08810
libxl: debug: libxl.c:6627:libxl__fd_flags_modify_save: fnctl F_GETFL flags for 
fd 9 are 0x1
libxl: debug: libxl.c:6635:libxl__fd_flags_modify_save: fnctl F_SETFL of fd 9 
to 0x1
libxl: debug: libxl.c:960:libxl_domain_suspend: ao 0x55efd7b089d0: inprogress: 
poller=0x55efd7b08810, flags=i
Loading new save file  (new xl fmt info 0x3/0x0/1773)
 Savefile contains xl domain config in JSON format
Parsing config from 

libxl: debug: libxl_create.c:1614:do_domain_create: ao 0x55dc55cea670: create: 
how=(nil) callback=(nil) poller=0x55dc55cea410
libxl: debug: libxl.c:6627:libxl__fd_flags_modify_save: fnctl F_GETFL flags for 
fd 0 are 0x0
libxl: debug: libxl.c:6635:libxl__fd_flags_modify_save: fnctl F_SETFL of fd 0 
to 0x0
libxl-save-helper: debug: starting save: Success
xc: detail: fd 9, dom 21, max_iters 0, max_factor 0, flags 1, hvm 0
xc: info: Saving domain 21, type x86 PV
xc: detail: 64 bits, 4 levels
xc: detail: max_mfn 0xc4
xc: detail: p2m list from 0xc900 to 0xc91f, root at 
0xc3e407
xc: detail: max_pfn 0x3, p2m_frames 512
libxl: debug: libxl_device.c:361:libxl__device_disk_set_backend: Disk 
vdev=xvda1 spec.backend=unknown
libxl: debug: libxl_device.c:276:disk_try_backend: Disk vdev=xvda1, uses 
script=... assuming phy backend
libxl: debug: libxl_device.c:396:libxl__device_disk_set_backend: Disk 
vdev=xvda1, using backend phy
libxl: debug: libxl_create.c:967:initiate_domain_create: restoring, not running 
bootloader
libxl: debug: libxl.c:4983:libxl__set_vcpuaffinity: New hard affinity for vcpu 
0 has unreachable cpus
libxl: debug: libxl_create.c:1640:do_domain_create: ao 0x55dc55cea670: 
inprogress: poller=0x55dc55cea410, flags=i
libxl: debug: libxl_stream_read.c:358:stream_header_done: Stream v2
libxl: debug: libxl_stream_read.c:574:process_record: Record: 1, length 0
libxl-save-helper: debug: starting restore: Success
xc: detail: fd 7, dom 15, hvm 0, pae 0, superpages 0, stream_type 0
xc: info: Found x86 PV domain from Xen 4.8
xc: info: Restoring domain
xc: detail: 64 bits, 4 levels
xc: detail: max_mfn 0xc4
xc: detail: Changed max_pfn from 0 to 0x3

And it stalls here; I need to Ctrl-C on the sender, destroy the incoming VM
on the receiver, and killall xl.

When using the working interface, st

[Xen-devel] SOLVED/no bug 4.8.1 migration fails over 1st interface, works over 2nd

2017-06-30 Thread Andreas Pflug
Ok, turns out to be an MTU-related communication problem: the ethernet
interface and the switch were both configured for mtu=9216, but didn't
interpret this the same way. I needed to reduce the eth interface MTU by 18
bytes (presumably because the switch counts the 14-byte Ethernet header plus
the 4-byte 802.1Q tag in its figure, while the Linux mtu setting covers the
payload only).

Sorry for the noise!

Regards,
Andreas



Re: [Xen-devel] [BUG] EDAC information partially missing

2016-01-22 Thread Andreas Pflug
On 21.01.16 at 17:41, Jan Beulich wrote:
 On 20.01.16 at 16:01,  wrote:
>> Initially reported to debian
>> (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=810964), redirected here:
>>
>> With AMD Opteron 6xxx processors, half of the memory controllers are
>> missing from /sys/devices/system/edac/mc
>> Checked with single 6120 (dual memory controller) and twin 6344 (2x dual
>> MC), other dual-module CPUs might be affected too.
>>
>> Booting plain Linux (3.2, 3.16, 4.1, 4.3), all memory controllers are
>> listed under /sys/devices/system/edac/mc as expected. Same happens, when
>> Xen 4.1 is used: all MCs present.
>>
>> Starting with Xen 4.4 (Debian Jessie), only mc1 (on the single CPU
>> machine) or mc2/mc3 (dual CPU machine) are present, although the full
>> system memory is accessible. Checked versions were 4.1.4 (Debian
>> Wheezy), 4.4.1 (Jessie) and 4.6.0 (Sid)
> As already indicated by Ian in that bug, you should supply us with
> full kernel and hypervisor logs for both the good and bad cases
> (ideally with the same kernel version used in both runs, so that we
> can exclude kernel behavior differences).
Here are some dmesg excerpts, all taken from boots of Linux 4.1.3.

When booting with Xen 4.1.4:

AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC0: Giving out device to module amd64_edac controller F10h: DEV
:00:18.2 (INTERRUPT)
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 1).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC1: Giving out device to module amd64_edac controller F10h: DEV
:00:19.2 (INTERRUPT)

When booting with Xen 4.4.1:

AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will
not load.
 Either enable ECC checking or force module loading by setting
'ecc_enable_override'.
 (Note that use of the override may cause unknown side effects.)
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 1).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC1: Giving out device to module amd64_edac controller F10h: DEV
:00:19.2 (INTERRUPT)

Apparently Xen 4.4 doesn't report the BIOS flag correctly. I added
ecc_enable_override=1 to amd64_edac_mod, and then I get:

EDAC MC: Ver: 3.0.0
AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will
not load.
EDAC amd64: Forcing ECC on!
EDAC amd64: F10h detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC0: Giving out device to module amd64_edac controller F10h: DEV
:00:18.2 (INTERRUPT)
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 1).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC1: Giving out device to module amd64_edac controller F10h: DEV
:00:19.2 (INTERRUPT)

This restored both MCs, so the BIOS flag seems 

Re: [Xen-devel] [BUG] EDAC information partially missing

2016-01-22 Thread Andreas Pflug
On 22.01.16 at 11:40, Jan Beulich wrote:
 On 22.01.16 at 10:09,  wrote:
>> When booting with Xen 4.4.1:
>>
>> AMD64 EDAC driver v3.4.0
>> EDAC amd64: DRAM ECC enabled.
>> EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
> I wonder how valid this message is. We actually write this MSR with
> all ones during boot.
>
> However, considering involved functions like
> nb_mce_bank_enabled_on_node() or node_to_amd_nb() taking
> node IDs as inputs, and considering that PV guests (including
> Dom0) don't have a topology matching that of the host, I doubt
> very much that this driver is even remotely prepared to run
> under Xen. It working on Xen 4.1.x would then be by pure
> accident.
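
(For context, a simplified sketch of the kind of per-node probe Jan refers
to, based on the MSR named in the log above; the names and the topology
lookup are assumptions for illustration, not the amd64_edac source. The
driver asks which CPUs sit on a node, reads MSR 0x017b on them, and requires
bit 4, the NB MCE bank enable, to be set on each of them. Under a PV Dom0
both the node-to-CPU mapping and the MSR read may not reflect the host, so
the probe can fail for some nodes even though ECC is fine in hardware.)

/*
 * Self-contained sketch of a per-node "NB MCE bank enabled" probe.
 * The stubs simulate what a PV Dom0 might observe; this is not the
 * amd64_edac implementation.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MSR_IA32_MCG_CTL  0x017b
#define MCG_CTL_NBE       (1ull << 4)   /* NB MCE bank enable bit, per the log */

/* Stub MSR read: pretend node 0's cores report the bank disabled (as in
 * the Xen 4.4 boot above) while node 1's cores report it enabled. */
static uint64_t rdmsr_on_cpu(int cpu, uint32_t msr)
{
        (void)msr;
        return cpu < 4 ? 0 : MCG_CTL_NBE;
}

/* Stub topology lookup: four contiguous CPUs per node. Under a PV Dom0
 * the kernel's idea of this mapping need not match the host. */
static int cpus_on_node(int node, int *cpus, int max)
{
        int n = 0;
        for (int c = node * 4; c < node * 4 + 4 && n < max; c++)
                cpus[n++] = c;
        return n;
}

/* The probe: a node only counts as ECC-capable if every one of its CPUs
 * reports the NB MCE bank as enabled. */
static bool nb_mce_bank_enabled_on_node(int node)
{
        int cpus[64];
        int n = cpus_on_node(node, cpus, 64);

        if (n <= 0)
                return false;
        for (int i = 0; i < n; i++)
                if (!(rdmsr_on_cpu(cpus[i], MSR_IA32_MCG_CTL) & MCG_CTL_NBE))
                        return false;
        return true;
}

int main(void)
{
        for (int node = 0; node < 2; node++)
                printf("node %d: NB MCE bank %s\n", node,
                       nb_mce_bank_enabled_on_node(node)
                               ? "enabled" : "reported disabled");
        return 0;
}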
The dmesg is identical with or without Xen 4.1, so I'd guess it does work
if flags are detected correctly.

Regards
Andreas



[Xen-devel] [BUG] EDAC information partially missing

2016-01-20 Thread Andreas Pflug
Initially reported to debian
(http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=810964), redirected here:

With AMD Opteron 6xxx processors, half of the memory controllers are
missing from /sys/devices/system/edac/mc
Checked with single 6120 (dual memory controller) and twin 6344 (2x dual
MC), other dual-module CPUs might be affected too.

Booting plain Linux (3.2, 3.16, 4.1, 4.3), all memory controllers are
listed under /sys/devices/system/edac/mc as expected. Same happens, when
Xen 4.1 is used: all MCs present.

Starting with Xen 4.4 (Debian Jessie), only mc1 (on the single CPU
machine) or mc2/mc3 (dual CPU machine) are present, although the full
system memory is accessible. Checked versions were 4.1.4 (Debian
Wheezy), 4.4.1 (Jessie) and 4.6.0 (Sid)
