Re: [Xen-devel] [BUG] Error applying XSA240 update 5 on 4.8 and 4.9 (patch 3 references CONFIG_PV_LINEAR_PT, 3285e75dea89, x86/mm: Make PV linear pagetables optional)

2017-11-16 Thread Steven Haigh
On Friday, 17 November 2017 2:09:09 AM AEDT Ian Jackson wrote:
> George Dunlap writes ("Re: [BUG] Error applying XSA240 update 5 on 4.8
> and 4.9 (patch 3 references CONFIG_PV_LINEAR_PT, 3285e75dea89, x86/mm:
> Make PV linear pagetables optional)"):
> > These are two different things.  Steve's reluctance to backport a
> > potentially arbitrary number of non-security-related patches is
> > completely reasonable.
> 
> I think the right thing to do is this:
> 
> If the patch(es) in an XSA require commits from staging-N which are
> not contained in previous XSAs, the prerequisite commits should be
> listed in the advisory.
> 
> That way someone who is following the XSAs (and by implication does
> not want to take the other stuff from staging-N/stable-N or even our
> point releases) will be able to take the minimum set necessary.

Hi Ian,

I think that would be a great idea. That way, if a non-XSA, non-release 
commit is required, at least it is documented as such - and therefore 
correctable.

On a theoretical note though, what would be the chances of this opening 
up other vulnerabilities? I would think they're somewhat minimal, but 
worthy of thought - even in passing...


-- 
Steven Haigh

 net...@crc.id.au    http://www.crc.id.au
 +61 (3) 9001 6090 0412 935 897



Re: [Xen-devel] [BUG] Error applying XSA240 update 5 on 4.8 and 4.9 (patch 3 references CONFIG_PV_LINEAR_PT, 3285e75dea89, x86/mm: Make PV linear pagetables optional)

2017-11-16 Thread Steven Haigh
On Thursday, 16 November 2017 8:30:39 PM AEDT Jan Beulich wrote:
> >>> On 15.11.17 at 23:48, <li...@johnthomson.fastmail.com.au> wrote:
> > Hi,
> > 
> > I am having trouble applying the patch 3 from XSA240 update 5 for xen
> > stable 4.8 and 4.9
> > xsa240 0003 contains:
> > 
> > CONFIG_PV_LINEAR_PT
> > 
> > from:
> > 
> > x86/mm: Make PV linear pagetables optional
> > https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=3285e75dea89afb0ef5b3ee39bd15194bd7cc110
> > 
> > I cannot find this string in an XSA, nor is an XSA referenced in the
> > commit.
> > Am I missing a patch, or doing something wrong?
> 
> Well, you're expected to apply all patches which haven't been
> applied so far. In particular, in the stable version trees, the 2nd
> patch hasn't gone in yet (I'm intending to do this later today),
> largely because it (a) wasn't ready at the time the first patch
> went in and (b) it is more a courtesy patch than an actual part of
> the security fix.

I'm not quite sure this is a great idea... XSAs should work on the 
released versions - hence the XSA 240 patch set should apply to the base 
tarball plus the current XSA patches. If there is something in git that 
*isn't* in the latest release, it should be included in the XSA patch 
set - otherwise the set is incomplete.
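
To make this concrete, the workflow I'd expect the advisory to support 
looks roughly like this (a sketch only - the tarball URL and patch 
directory are illustrative, and it assumes the advisory listed 
3285e75dea89 as a prerequisite):

# Sketch: release tarball + listed prerequisite commit + XSA patches.
set -e
curl -LO https://downloads.xenproject.org/release/xen/4.9.0/xen-4.9.0.tar.gz
tar xf xen-4.9.0.tar.gz && cd xen-4.9.0

# The prerequisite commit, exported straight from xenbits gitweb.
curl -L 'https://xenbits.xen.org/gitweb/?p=xen.git;a=patch;h=3285e75dea89' \
    | patch -p1

# Then the current XSA patches, in order.
for p in ../xsa240/*.patch; do
    patch -p1 < "$p"
done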

I don't see anything in the written XSA that mentions a separate patch 
being required outside of the patches included with the XSA.

Could I suggest that we re-do v6 of these patches with the complete required 
set?

These should be included in 4.9.1 - which would make most of this moot - 
but I'm not aware of what the release window is for 4.9.1.

-- 
Steven Haigh

 net...@crc.id.au    http://www.crc.id.au
 +61 (3) 9001 6090 0412 935 897



Re: [Xen-devel] [PATCH v3 0/2] XSA-226

2017-08-15 Thread Steven Haigh
On Tuesday, 15 August 2017 11:43:59 PM AEST Jan Beulich wrote:
> XSA-226 went out with just a workaround patch. The pair of patches
> here became ready too late to be reasonably included in the XSA.
> Nevertheless they aim at fixing the underlying issues, ideally making
> the workaround unnecessary.
> 
> 1: gnttab: don't use possibly unbounded tail calls
> 2: gnttab: fix transitive grant handling
> 
> Signed-off-by: Jan Beulich <jbeul...@suse.com>

If this turns out to be all good and accepted, is it possible to reissue 
xsa226 with the proper fixes?

-- 
Steven Haigh

 net...@crc.id.au    http://www.crc.id.au
 +61 (3) 9001 6090 0412 935 897



Re: [Xen-devel] [PATCH v2 0/3] xen/blkback: several fixes of resource management

2017-06-07 Thread Steven Haigh
On Wednesday, 7 June 2017 11:52:34 PM AEST Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 07, 2017 at 10:36:58PM +1000, Steven Haigh wrote:
> > On Friday, 19 May 2017 1:28:46 AM AEST Juergen Gross wrote:
> > > Destroying a Xen guest domain while it was doing I/Os via xen-blkback
> > > leaked several resources, including references of the guest's memory
> > > pages.
> > > 
> > > This patch series addresses those leaks by correcting usage of
> > > reference counts and the sequence when to free which resource.
> > > 
> > > The series applies on top of commit 2d4456c73a487abe ("block:
> > > xen-blkback: add null check to avoid null pointer dereference") in
> > > Jens Axboe's tree kernel/git/axboe/linux-block.git
> > > 
> > > V2: changed flag to type bool in patch 1 (Dietmar Hahn)
> > > 
> > > Juergen Gross (3):
> > >   xen/blkback: fix disconnect while I/Os in flight
> > >   xen/blkback: don't free be structure too early
> > >   xen/blkback: don't use xen_blkif_get() in xen-blkback kthread
> > >  
> > >  drivers/block/xen-blkback/blkback.c |  3 ---
> > >  drivers/block/xen-blkback/common.h  |  1 +
> > >  drivers/block/xen-blkback/xenbus.c  | 15 ++++++++-------
> > >  3 files changed, 9 insertions(+), 10 deletions(-)
> > 
> > Just wanted to give this a bit of a prod.
> 
>  Ouch!
> 
> > Are there any plans to have this hit the kernel.org kernels?
> 
> Yes.
> 
> > My testing was purely on the kernel 4.9 branch - but it doesn't look like
> > this has shown up there yet?
> 
> Correct. I am thinking to send these to Jens around June 20th or so.

Ok, all understood. Thanks for the clarifications.

At the moment, I'm just including them in my kernel builds - expecting 
them to stop applying at some point in the future once the changes land 
upstream. I'll just keep doing this.

-- 
Steven Haigh

 net...@crc.id.au   http://www.crc.id.au
 +61 (3) 9001 6090  0412 935 897



Re: [Xen-devel] [PATCH v2 0/3] xen/blkback: several fixes of resource management

2017-06-07 Thread Steven Haigh
On Friday, 19 May 2017 1:28:46 AM AEST Juergen Gross wrote:
> Destroying a Xen guest domain while it was doing I/Os via xen-blkback
> leaked several resources, including references of the guest's memory
> pages.
> 
> This patch series addresses those leaks by correcting usage of
> reference counts and the sequence when to free which resource.
> 
> The series applies on top of commit 2d4456c73a487abe ("block:
> xen-blkback: add null check to avoid null pointer dereference") in
> Jens Axboe's tree kernel/git/axboe/linux-block.git
> 
> V2: changed flag to type bool in patch 1 (Dietmar Hahn)
> 
> Juergen Gross (3):
>   xen/blkback: fix disconnect while I/Os in flight
>   xen/blkback: don't free be structure too early
>   xen/blkback: don't use xen_blkif_get() in xen-blkback kthread
> 
>  drivers/block/xen-blkback/blkback.c |  3 ---
>  drivers/block/xen-blkback/common.h  |  1 +
>  drivers/block/xen-blkback/xenbus.c  | 15 ++++++++-------
>  3 files changed, 9 insertions(+), 10 deletions(-)

Just wanted to give this a bit of a prod.

Are there any plans to have this hit the kernel.org kernels?

My testing was purely on the kernel 4.9 branch - but it doesn't look like
this has shown up there yet?

-- 
Steven Haigh

 net...@crc.id.au   http://www.crc.id.au
 +61 (3) 9001 6090  0412 935 897



Re: [Xen-devel] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x217/0x220

2017-06-01 Thread Steven Haigh
On Thursday, 1 June 2017 11:56:28 PM AEST Boris Ostrovsky wrote:
> On 05/31/2017 10:25 PM, Steven Haigh wrote:
> > On 2017-05-31 00:37, Steven Haigh wrote:
> >> On 31/05/17 00:18, Boris Ostrovsky wrote:
> >>> On 05/30/2017 06:27 AM, Steven Haigh wrote:
> >>>> Just wanted to give this a nudge to try and get some suggestions on
> >>>> where to go / what to do about this.
> >>>> 
> >>>> On 28/05/17 09:44, Steven Haigh wrote:
> >>>>> The last couple of days running on kernel 4.9.29 and 4.9.30 with Xen
> >>>>> 4.9.0-rc6 I've had a number of ethernet lock ups that have taken my
> >>>>> system off the network.
> >>>>> 
> >>>>> This is a new development - but I'm not sure if its kernel or xen
> >>>>> related.
> >>> 
> >>> Since no one seems to have seen this it would be useful to narrow it
> >>> down a bit.
> >>> 
> >>> Do you observe this on rc5? Or with the 4.9.28 kernel? Any particular
> >>> load that you are using? Do you see this on a specific NIC?
> >> 
> >> This install is currently using xen 4.9-rc7 and kernel 4.9.30. I would
> >> say that there may be a connection between disk activity and the
> >> ethernet adapter locking up - but I haven't been able to prove this in
> >> any valid way yet.
> >> 
> >> I am currently running this script on the server in question to try and
> >> get a log of how often the adapter locks up. I only added the logger
> >> line tonight - so I don't have a great deal of historical data to add as
> >> yet.
> >> 
> >> #!/bin/bash
> >> while true; do
> >>     ping -c1 10.1.1.2 >& /dev/null
> >>     if [ $? != 0 ]; then
> >>         logger 'No response. Resetting enp5s0'
> >>         mii-tool -R enp5s0
> >>     fi
> >>     sleep 5
> >> done
> > 
> > Just to keep kicking this along a little bit, my logs so far have shown:
> > messages:May 31 00:20:10 No response. Resetting enp5s0
> > messages:May 31 04:20:08 No response. Resetting enp5s0
> > messages:May 31 12:21:37 No response. Resetting enp5s0
> > 
> > It's almost spooky that it's nearly 20 minutes past the hour on each reset.
> > 
> > I've checked against the cron logs, but I can't find anything that
> > would be scheduled on the Dom0 at that time.
> > 
> > The logs also show that after running mii-tool to reset the ethernet
> > adapter, connectivity has returned straight away.
> > 
> > The network adapter uses the r8169 kernel module, and shows as:
> > 05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> > RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
> > 
> > I have a DomU backup script that runs *in* a DomU at 01:00 each night
> > - that causes a lot of disk activity - but alas, that time hasn't
> > lined up with anything as yet...
> > 
> > Still seem to be fidgeting in the dark :(
> 
> Since you've already observed this problem with rc6 and 4.9.29, wouldn't
> it be more useful to go backwards to narrow down where the problem first
> occurred? I am not sure how moving to rc7 and 4.9.30 is going to help
> unless you think this is a temporary regression.

I'm not 100% sure of the cause at the moment. I moved to kernel 4.9 from 
4.4 a few weeks before I started to test Xen 4.9. My only thought was 
that bringing everything up to the latest version would at least mean 
testing against the other fixes known to be going into the Xen 4.9 rc 
releases.

I have also been updating to the latest 4.9 kernel in case I come across a fix 
- or at least a version of kernel where this no longer occurs.

At this stage, I don't have any information to give a major hint on 
whether this is Xen or kernel related, other than that I had never 
observed this using:
* Xen 4.7 + kernel 4.4
* Xen 4.7 + kernel 4.9

I am assuming, however, that because the network stays dead until manual 
intervention when it fails in this manner, I would have noticed it with 
a different combination of Xen / kernel.

One observation I have made since putting in the extra logging via the 
ethernet reset script posted earlier: the WARNING is not printed for 
every ethernet controller hang. As such, it may actually be a 
side-effect of the controller staying dead - rather than a cause.

A second observation is that I don't seem to see as many hangs of the 
ethernet controller.

Re: [Xen-devel] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x217/0x220

2017-05-31 Thread Steven Haigh

On 2017-05-31 00:37, Steven Haigh wrote:
> On 31/05/17 00:18, Boris Ostrovsky wrote:
>> On 05/30/2017 06:27 AM, Steven Haigh wrote:
>>> Just wanted to give this a nudge to try and get some suggestions on
>>> where to go / what to do about this.
>>>
>>> On 28/05/17 09:44, Steven Haigh wrote:
>>>> The last couple of days running on kernel 4.9.29 and 4.9.30 with Xen
>>>> 4.9.0-rc6 I've had a number of ethernet lock ups that have taken my
>>>> system off the network.
>>>>
>>>> This is a new development - but I'm not sure if it's kernel or xen
>>>> related.
>>
>> Since no one seems to have seen this it would be useful to narrow it
>> down a bit.
>>
>> Do you observe this on rc5? Or with the 4.9.28 kernel? Any particular
>> load that you are using? Do you see this on a specific NIC?
>
> This install is currently using xen 4.9-rc7 and kernel 4.9.30. I would
> say that there may be a connection between disk activity and the
> ethernet adapter locking up - but I haven't been able to prove this in
> any valid way yet.
>
> I am currently running this script on the server in question to try
> and get a log of how often the adapter locks up. I only added the
> logger line tonight - so I don't have a great deal of historical data
> to add as yet.
>
> #!/bin/bash
> while true; do
>     ping -c1 10.1.1.2 >& /dev/null
>     if [ $? != 0 ]; then
>         logger 'No response. Resetting enp5s0'
>         mii-tool -R enp5s0
>     fi
>     sleep 5
> done

Just to keep kicking this along a little bit, my logs so far have shown:
messages:May 31 00:20:10 No response. Resetting enp5s0
messages:May 31 04:20:08 No response. Resetting enp5s0
messages:May 31 12:21:37 No response. Resetting enp5s0

It's almost spooky that it's nearly 20 minutes past the hour on each
reset.

I've checked against the cron logs, but I can't find anything that
would be scheduled on the Dom0 at that time.

The logs also show that after running mii-tool to reset the ethernet
adapter, connectivity has returned straight away.

The network adapter uses the r8169 kernel module, and shows as:
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)

I have a DomU backup script that runs *in* a DomU at 01:00 each night -
that causes a lot of disk activity - but alas, that time hasn't lined
up with anything as yet...

Still seem to be fidgeting in the dark :(
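
For reference, this is roughly how I've been looking for a scheduled
culprit (a sketch - standard EL7 locations assumed, and systemd timers
checked as well since they don't show up in cron):

# Sketch: list everything that could fire around 20 minutes past the hour.
cat /etc/crontab /etc/cron.d/* 2>/dev/null | grep -v '^\s*#'
crontab -l 2>/dev/null
systemctl list-timers --all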

--
Steven Haigh

 net...@crc.id.au    http://www.crc.id.au
 +61 (3) 9001 6090   0412 935 897



Re: [Xen-devel] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x217/0x220

2017-05-30 Thread Steven Haigh
On 31/05/17 00:18, Boris Ostrovsky wrote:
> On 05/30/2017 06:27 AM, Steven Haigh wrote:
>> Just wanted to give this a nudge to try and get some suggestions on
>> where to go / what to do about this.
>>
>> On 28/05/17 09:44, Steven Haigh wrote:
>>> The last couple of days running on kernel 4.9.29 and 4.9.30 with Xen
>>> 4.9.0-rc6 I've had a number of ethernet lock ups that have taken my
>>> system off the network.
>>>
>>> This is a new development - but I'm not sure if it's kernel or xen related.
> 
> Since no one seems to have seen this it would be useful to narrow it down
> a bit.
> 
> Do you observe this on rc5? Or with the 4.9.28 kernel? Any particular
> load that you are using? Do you see this on a specific NIC?

This install is currently using xen 4.9-rc7 and kernel 4.9.30. I would
say that there may be a connection between disk activity and the
ethernet adapter locking up - but I haven't been able to prove this in
any valid way yet.

I am currently running this script on the server in question to try and
get a log of how often the adapter locks up. I only added the logger
line tonight - so I don't have a great deal of historical data to add as
yet.

#!/bin/bash
while true; do
    ping -c1 10.1.1.2 >& /dev/null
    if [ $? != 0 ]; then
        logger 'No response. Resetting enp5s0'
        mii-tool -R enp5s0
    fi
    sleep 5
done
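
An obvious refinement I may make (an untested sketch) is to require a
few consecutive failures before resetting, so one lost ping doesn't
cycle the link:

#!/bin/bash
# Sketch: only reset after three consecutive failed pings.
fails=0
while true; do
    if ping -c1 -W2 10.1.1.2 >& /dev/null; then
        fails=0
    else
        fails=$((fails + 1))
    fi
    if [ "$fails" -ge 3 ]; then
        logger "No response x${fails}. Resetting enp5s0"
        mii-tool -R enp5s0
        fails=0
    fi
    sleep 5
done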

What I have right now in dmesg + journalctl is:
# dmesg
[221834.898685] r8169 :05:00.0 enp5s0: link down
[221834.898768] br10: port 1(vlan10) entered disabled state
[221834.898827] br203: port 1(vlan203) entered disabled state
[221834.905380] r8169 :05:00.0 enp5s0: link up
[221834.905748] br10: port 1(vlan10) entered blocking state
[221834.905749] br10: port 1(vlan10) entered forwarding state
[221834.906162] br203: port 1(vlan203) entered blocking state
[221834.906162] br203: port 1(vlan203) entered forwarding state
[221834.906176] r8169 :05:00.0 enp5s0: link down
[221835.949483] br10: port 1(vlan10) entered disabled state
[221835.949515] br203: port 1(vlan203) entered disabled state
[221838.069998] r8169 :05:00.0 enp5s0: link up
[221838.070538] br10: port 1(vlan10) entered blocking state
[221838.070540] br10: port 1(vlan10) entered forwarding state
[221838.071055] br203: port 1(vlan203) entered blocking state
[221838.071057] br203: port 1(vlan203) entered forwarding state

# journalctl | grep Resetting
May 31 00:20:10 xenhost: No response. Resetting enp5s0

> Have you checked hypervisor log (xl dmesg)?

The last lines I see in 'xl dmesg' are:
(XEN) Scrubbing Free RAM on 1 nodes using 4 CPUs
(XEN) .done.
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Std. Loglevel: Errors and warnings
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch
input to Xen)
(XEN) Freed 456kB init memory

This would indicate that nothing additional is being logged here.
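
One thing I could try from here is raising the log thresholds, since the
output above shows only errors and warnings are being kept (a sketch -
loglvl and guest_loglvl are the standard Xen command line options, added
to the existing hypervisor line in grub.cfg):

# Sketch: make 'xl dmesg' capture everything, not just errors/warnings,
# by extending the Xen command line and rebooting, e.g.:
#   ... dom0_mem=2048M cpufreq=xen ... com1=115200,8n1 loglvl=all guest_loglvl=all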

If it matters, the xl info follows:
# xl info
host   : xenhost
release: 4.9.30-1.el7xen.x86_64
version: #1 SMP Fri May 26 06:16:37 AEST 2017
machine: x86_64
nr_cpus: 4
max_cpu_id : 3
nr_nodes   : 1
cores_per_socket   : 4
threads_per_core   : 1
cpu_mhz: 3303
hw_caps:
bfebfbff:179ae3bf:28100800:0001:0001:::0100
virt_caps  : hvm
total_memory   : 16308
free_memory: 1785
sharing_freed_memory   : 0
sharing_used_memory: 0
outstanding_claims : 0
free_cpus  : 0
xen_major  : 4
xen_minor  : 9
xen_extra  : -rc
xen_version: 4.9-rc
xen_caps   : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler  : credit2
xen_pagesize   : 4096
platform_params: virt_start=0x8000
xen_changeset  :
xen_commandline: placeholder dom0_mem=2048M cpufreq=xen
dom0_max_vcpus=1 dom0_vcpus_pin sched=credit2 console=tty0 console=com1
com1=115200,8n1
cc_compiler: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
cc_compile_by  : mockbuild
cc_compile_domain  : crc.id.au
cc_compile_date: Sun May 28 10:08:40 AEST 2017
build_id   : 0848a8631a9064b3de53cdfe71c996e929ce2539
xend_config_format : 4

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x217/0x220

2017-05-30 Thread Steven Haigh
Just wanted to give this a nudge to try and get some suggestions on
where to go / what to do about this.

On 28/05/17 09:44, Steven Haigh wrote:
> The last couple of days running on kernel 4.9.29 and 4.9.30 with Xen
> 4.9.0-rc6 I've had a number of ethernet lock ups that have taken my
> system off the network.
> 
> This is a new development - but I'm not sure if it's kernel or xen related.
> 
> in dmesg, I see the following:
> WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316
> dev_watchdog+0x217/0x220
> NETDEV WATCHDOG: enp5s0 (r8169): transmit queue 0 timed out
> Modules linked in: bridge 8021q garp stp llc btrfs dm_mod
> crct10dif_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul
> glue_helper ablk_helper cryptd raid456 async_raid6_recov async_memcpy
> async_pq ppdev iTCO_wdt async_xor xor iTCO_vendor_support async_tx
> raid6_pq pcspkr i2c_i801 i2c_smbus pl2303 usbserial sg lpc_ich mfd_core
> tpm_infineon parport_pc parport shpchp mei_me mei xenfs xen_privcmd
> ip_tables xfs libcrc32c raid1 sd_mod i915 i2c_algo_bit drm_kms_helper
> drm crc32c_intel serio_raw ahci libahci i2c_core r8169 mii sata_mv video
> xen_acpi_processor xen_pciback xen_netback xen_gntalloc xen_gntdev
> xen_evtchn ipv6 crc_ccitt autofs4
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.30-1.el7xen.x86_64 #1
> Hardware name: Gigabyte Technology Co., Ltd. To be filled by
> O.E.M./Z68M-D2H, BIOS U1f 06/13/2012
> 880080203dd8 81348dc5 880080203e28 
> 880080203e18 81081711 013c80203e10 
> 88000526a000 0001  88000526a000
> Call Trace:
> 
> [] dump_stack+0x63/0x8e
> [] __warn+0xd1/0xf0
> [] warn_slowpath_fmt+0x4f/0x60
> [] dev_watchdog+0x217/0x220
> [] ? dev_deactivate_queue.constprop.27+0x60/0x60
> [] call_timer_fn+0x3a/0x130
> [] run_timer_softirq+0x191/0x420
> [] ? handle_percpu_irq+0x3a/0x50
> [] ? generic_handle_irq+0x22/0x30
> [] __do_softirq+0xd6/0x287
> [] irq_exit+0xa5/0xb0
> [] xen_evtchn_do_upcall+0x35/0x50
> [] xen_do_hypervisor_callback+0x1e/0x40
> 
> [] ? xen_hypercall_sched_op+0xa/0x20
> [] ? xen_hypercall_sched_op+0xa/0x20
> [] ? __tick_nohz_idle_enter+0x2c9/0x3c0
> [] ? xen_safe_halt+0x10/0x20
> [] ? default_idle+0x23/0xd0
> [] ? arch_cpu_idle+0xf/0x20
> [] ? default_idle_call+0x2c/0x40
> [] ? cpu_startup_entry+0x17a/0x210
> [] ? rest_init+0x77/0x80
> [] ? start_kernel+0x435/0x442
> [] ? set_init_arg+0x55/0x55
> [] ? x86_64_start_reservations+0x2a/0x2c
> [] ? xen_start_kernel+0x547/0x553
> ---[ end trace 2f33c440640c78e5 ]---
> 
> All network activity out that ethernet port dies until either:
> 1) The ethernet cable is unplugged & replugged, or
> 2) I run: mii-tool -R enp5s0
> 
> Either causes the ethernet adapter to be reset.
> 
> Any suggestions if this is Xen or kernel related?
> 
> 
> 
> 

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





[Xen-devel] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x217/0x220

2017-05-27 Thread Steven Haigh
The last couple of days running on kernel 4.9.29 and 4.9.30 with Xen
4.9.0-rc6 I've had a number of ethernet lock ups that have taken my
system off the network.

This is a new development - but I'm not sure if it's kernel or xen related.

in dmesg, I see the following:
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316
dev_watchdog+0x217/0x220
NETDEV WATCHDOG: enp5s0 (r8169): transmit queue 0 timed out
Modules linked in: bridge 8021q garp stp llc btrfs dm_mod
crct10dif_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul
glue_helper ablk_helper cryptd raid456 async_raid6_recov async_memcpy
async_pq ppdev iTCO_wdt async_xor xor iTCO_vendor_support async_tx
raid6_pq pcspkr i2c_i801 i2c_smbus pl2303 usbserial sg lpc_ich mfd_core
tpm_infineon parport_pc parport shpchp mei_me mei xenfs xen_privcmd
ip_tables xfs libcrc32c raid1 sd_mod i915 i2c_algo_bit drm_kms_helper
drm crc32c_intel serio_raw ahci libahci i2c_core r8169 mii sata_mv video
xen_acpi_processor xen_pciback xen_netback xen_gntalloc xen_gntdev
xen_evtchn ipv6 crc_ccitt autofs4
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.30-1.el7xen.x86_64 #1
Hardware name: Gigabyte Technology Co., Ltd. To be filled by
O.E.M./Z68M-D2H, BIOS U1f 06/13/2012
880080203dd8 81348dc5 880080203e28 
880080203e18 81081711 013c80203e10 
88000526a000 0001  88000526a000
Call Trace:

[] dump_stack+0x63/0x8e
[] __warn+0xd1/0xf0
[] warn_slowpath_fmt+0x4f/0x60
[] dev_watchdog+0x217/0x220
[] ? dev_deactivate_queue.constprop.27+0x60/0x60
[] call_timer_fn+0x3a/0x130
[] run_timer_softirq+0x191/0x420
[] ? handle_percpu_irq+0x3a/0x50
[] ? generic_handle_irq+0x22/0x30
[] __do_softirq+0xd6/0x287
[] irq_exit+0xa5/0xb0
[] xen_evtchn_do_upcall+0x35/0x50
[] xen_do_hypervisor_callback+0x1e/0x40

[] ? xen_hypercall_sched_op+0xa/0x20
[] ? xen_hypercall_sched_op+0xa/0x20
[] ? __tick_nohz_idle_enter+0x2c9/0x3c0
[] ? xen_safe_halt+0x10/0x20
[] ? default_idle+0x23/0xd0
[] ? arch_cpu_idle+0xf/0x20
[] ? default_idle_call+0x2c/0x40
[] ? cpu_startup_entry+0x17a/0x210
[] ? rest_init+0x77/0x80
[] ? start_kernel+0x435/0x442
[] ? set_init_arg+0x55/0x55
[] ? x86_64_start_reservations+0x2a/0x2c
[] ? xen_start_kernel+0x547/0x553
---[ end trace 2f33c440640c78e5 ]---

All network activity out that ethernet port dies until either:
1) The ethernet cable is unplugged & replugged, or
2) I run: mii-tool -R enp5s0

Either causes the ethernet adapter to be reset.
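
Next time it wedges, I'll try to capture some state before resetting it,
along these lines (a sketch - the interface name is specific to this box):

# Sketch: grab NIC and kernel state while the port is dead.
ip -s link show enp5s0   # interface-level counters
ethtool -S enp5s0        # NIC-level statistics from the driver
dmesg | tail -n 50       # recent kernel messages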

Any suggestions if this is Xen or kernel related?

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] [PATCH 0/3] xen/blkback: several fixes of resource management

2017-05-17 Thread Steven Haigh

On 2017-05-16 16:23, Juergen Gross wrote:
> Destroying a Xen guest domain while it was doing I/Os via xen-blkback
> leaked several resources, including references of the guest's memory
> pages.
> 
> This patch series addresses those leaks by correcting usage of
> reference counts and the sequence when to free which resource.
> 
> The series applies on top of commit 2d4456c73a487abe ("block:
> xen-blkback: add null check to avoid null pointer dereference") in
> Jens Axboe's tree kernel/git/axboe/linux-block.git
> 
> Juergen Gross (3):
>   xen/blkback: fix disconnect while I/Os in flight
>   xen/blkback: don't free be structure too early
>   xen/blkback: don't use xen_blkif_get() in xen-blkback kthread
> 
>  drivers/block/xen-blkback/blkback.c |  3 ---
>  drivers/block/xen-blkback/common.h  |  1 +
>  drivers/block/xen-blkback/xenbus.c  | 15 ++++++++-------
>  3 files changed, 9 insertions(+), 10 deletions(-)

Tested-by: Steven Haigh <net...@crc.id.au>

I've had a report that a new message is logged on destroy sometimes:
vif vif-1-0 vif1.0: Guest Rx stalled

This may be a different issue - however the main fix of this patch set 
is fully functional.


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



Re: [Xen-devel] [PATCH for-4.9 v2] build: stubdom and tools should depend on public header target

2017-05-17 Thread Steven Haigh
Can confirm this fixes the problems I was seeing. Tested with multiple
builds on both RHEL6 and RHEL7. No further issues found.

Tested-by: Steven Haigh <net...@crc.id.au>

On 18/05/17 00:26, Wei Liu wrote:
> Build can fail if stubdom build is run before tools build because:
> 
> 1. tools/include build uses relative path and depends on XEN_OS
> 2. stubdom needs tools/include to be built, at which time XEN_OS is
>mini-os and corresponding symlinks are created
> 3. libraries inside tools needs tools/include to be built, at which
>time XEN_OS is the host os name, but symlinks won't be created
>because they are already there
> 4. libraries get the wrong headers and fail to build
> 
> Since both tools and stubdom build need the public headers, we build
> tools/include before stubdom and tools. Remove runes in stubdom and
> tools to avoid building tools/include more than once.
> 
> Provide a new dist target for tools/include.  Hook up the install,
> clean, dist and distclean targets for tools/include.
> 
> The new arrangement ensures tools build gets the correct headers
> because XEN_OS is set to host os when building tools/include. As for
> stubdom, it explicitly links to the mini-os directory without relying
> on XEN_OS so it should be fine.
> 
> Reported-by: Steven Haigh <net...@crc.id.au>
> Signed-off-by: Wei Liu <wei.l...@citrix.com>
> ---
> Cc: Steven Haigh <net...@crc.id.au>
> Cc: Ian Jackson <ian.jack...@eu.citrix.com>
> Cc: Samuel Thibault <samuel.thiba...@ens-lyon.org>
> Cc: Julien Grall <julien.gr...@arm.com>
> ---
>  Makefile   | 14 +++++++++++---
>  stubdom/Makefile   |  1 -
>  tools/Makefile |  3 +--
>  tools/include/Makefile |  2 ++
>  4 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index 084588e11e..3e1e065537 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -38,9 +38,14 @@ mini-os-dir-force-update: mini-os-dir
>  export XEN_TARGET_ARCH
>  export DESTDIR
>  
> +.PHONY: build-tools-public-headers
> +build-tools-public-headers:
> + $(MAKE) -C tools/include
> +
>  # build and install everything into the standard system directories
>  .PHONY: install
>  install: $(TARGS_INSTALL)
> + $(MAKE) -C tools/include install
>  
>  .PHONY: build
>  build: $(TARGS_BUILD)
> @@ -50,11 +55,11 @@ build-xen:
>   $(MAKE) -C xen build
>  
>  .PHONY: build-tools
> -build-tools:
> +build-tools: build-tools-public-headers
>   $(MAKE) -C tools build
>  
>  .PHONY: build-stubdom
> -build-stubdom: mini-os-dir
> +build-stubdom: mini-os-dir build-tools-public-headers
>   $(MAKE) -C stubdom build
>  ifeq (x86_64,$(XEN_TARGET_ARCH))
>   XEN_TARGET_ARCH=x86_32 $(MAKE) -C stubdom pv-grub
> @@ -75,6 +80,7 @@ test:
>  .PHONY: dist
>  dist: DESTDIR=$(DISTDIR)/install
>  dist: $(TARGS_DIST) dist-misc
> + make -C tools/include dist
>  
>  dist-misc:
>   $(INSTALL_DIR) $(DISTDIR)/
> @@ -101,7 +107,7 @@ install-tools:
>   $(MAKE) -C tools install
>  
>  .PHONY: install-stubdom
> -install-stubdom: mini-os-dir
> +install-stubdom: mini-os-dir build-tools-public-headers
>   $(MAKE) -C stubdom install
>  ifeq (x86_64,$(XEN_TARGET_ARCH))
>   XEN_TARGET_ARCH=x86_32 $(MAKE) -C stubdom install-grub
> @@ -168,6 +174,7 @@ src-tarball: subtree-force-update-all
>  
>  .PHONY: clean
>  clean: $(TARGS_CLEAN)
> + $(MAKE) -C tools/include clean
>  
>  .PHONY: clean-xen
>  clean-xen:
> @@ -191,6 +198,7 @@ clean-docs:
>  # clean, but blow away tarballs
>  .PHONY: distclean
>  distclean: $(TARGS_DISTCLEAN)
> + $(MAKE) -C tools/include distclean
>   rm -f config/Toplevel.mk
>   rm -rf dist
>   rm -rf config.log config.status config.cache autom4te.cache
> diff --git a/stubdom/Makefile b/stubdom/Makefile
> index aef705dd1e..db01827070 100644
> --- a/stubdom/Makefile
> +++ b/stubdom/Makefile
> @@ -355,7 +355,6 @@ LINK_DIRS := libxc-$(XEN_TARGET_ARCH) xenstore $(foreach 
> dir,$(LINK_LIBS_DIRS),l
>  LINK_STAMPS := $(foreach dir,$(LINK_DIRS),$(dir)/stamp)
>  
>  mk-headers-$(XEN_TARGET_ARCH): $(IOEMU_LINKFARM_TARGET) $(LINK_STAMPS)
> - $(MAKE) -C $(XEN_ROOT)/tools/include
>   mkdir -p include/xen && \
>ln -sf $(wildcard $(XEN_ROOT)/xen/include/public/*.h) include/xen 
> && \
>ln -sf $(addprefix $(XEN_ROOT)/xen/include/public/,arch-x86 hvm io 
> xsm) include/xen && \
> diff --git a/tools/Makefile b/tools/Makefile
> index 1396d95b50..496428e3a9 100644
> --- a/tools/Makefile
> +++ b/tools/Makefile
> @@ -5,7 +5,6 @@ export PKG_CONFIG_DIR = $(CURDIR)/pkg-config
>  i

Re: [Xen-devel] [PATCH for-4.9 0/2] build: fix tools and stubdom build

2017-05-16 Thread Steven Haigh
On 16/05/17 20:47, Wei Liu wrote:
> Wei Liu (2):
>   tools/Rules.mk: honour CPPFLAGS in header check
>   build: fix tools/include and stubdom build
> 
>  stubdom/Makefile   | 13 +++--
>  tools/Rules.mk |  2 +-
>  tools/include/Makefile | 34 ++
>  3 files changed, 22 insertions(+), 27 deletions(-)

I have been seeing mixed results with these patches.

I can confirm that they seem to fix the problem with building on RHEL7 -
however on RHEL6, the packages still fail to build.

I have copied the build log to:
https://cloud.crc.id.au/index.php/s/iTWJE3A1TQBhgDq

So far:
EL7 - Successful builds: 4/4
EL6 - Successful builds: 0/4

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] null domains after xl destroy

2017-05-15 Thread Steven Haigh

On 2017-05-16 10:49, Glenn Enright wrote:
> On 15/05/17 21:57, Juergen Gross wrote:
>> On 13/05/17 06:02, Glenn Enright wrote:
>>> On 09/05/17 21:24, Roger Pau Monné wrote:
>>>> On Mon, May 08, 2017 at 11:10:24AM +0200, Juergen Gross wrote:
>>>>> On 04/05/17 00:17, Glenn Enright wrote:
>>>>>> On 04/05/17 04:58, Steven Haigh wrote:
>>>>>>> On 04/05/17 01:53, Juergen Gross wrote:
>>>>>>>> On 03/05/17 12:45, Steven Haigh wrote:
>>>>>>>>> Just wanted to give this a little nudge now people seem to be
>>>>>>>>> back on deck...
>>>>>>>>
>>>>>>>> Glenn, could you please give the attached patch a try?
>>>>>>>>
>>>>>>>> It should be applied on top of the other correction, the old
>>>>>>>> debug patch should not be applied.
>>>>>>>>
>>>>>>>> I have added some debug output to make sure we see what is
>>>>>>>> happening.
>>>>>>>
>>>>>>> This patch is included in kernel-xen-4.9.26-1
>>>>>>>
>>>>>>> It should be in the repos now.
>>>>>>
>>>>>> Still seeing the same issue. Without the extra debug patch all I
>>>>>> see in the logs after destroy is this...
>>>>>>
>>>>>> xen-blkback: xen_blkif_disconnect: busy
>>>>>> xen-blkback: xen_blkif_free: delayed = 0
>>>>>
>>>>> Hmm, to me it seems as if some grant isn't being unmapped.
>>>>>
>>>>> Looking at gnttab_unmap_refs_async() I wonder how this is supposed
>>>>> to work:
>>>>>
>>>>> I don't see how a grant would ever be unmapped in case of
>>>>> page_count(item->pages[pc]) > 1 in __gnttab_unmap_refs_async(). All
>>>>> it does is deferring the call to the unmap operation again and
>>>>> again. Or am I missing something here?
>>>>
>>>> No, I don't think you are missing anything, but I cannot see how
>>>> this can be solved in a better way, unmapping a page that's still
>>>> referenced is certainly not the best option, or else we risk
>>>> triggering a page-fault elsewhere.
>>>>
>>>> IMHO, gnttab_unmap_refs_async should have a timeout, and return an
>>>> error at some point. Also, I'm wondering whether there's a way to
>>>> keep track of who has references on a specific page, but so far I
>>>> haven't been able to figure out how to get this information from
>>>> Linux.
>>>>
>>>> Also, I've noticed that __gnttab_unmap_refs_async uses page_count,
>>>> shouldn't it use page_ref_count instead?
>>>>
>>>> Roger.
>>>
>>> In case it helps, I have continued to work on this. I noticed
>>> processes left behind (under 4.9.27). The same issue is ongoing.
>>>
>>> # ps auxf | grep [x]vda
>>> root  2983  0.0  0.0  0  0  ?  S  01:44  0:00  \_ [1.xvda1-1]
>>> root  5457  0.0  0.0  0  0  ?  S  02:06  0:00  \_ [3.xvda1-1]
>>> root  7382  0.0  0.0  0  0  ?  S  02:36  0:00  \_ [4.xvda1-1]
>>> root  9668  0.0  0.0  0  0  ?  S  02:51  0:00  \_ [6.xvda1-1]
>>> root 11080  0.0  0.0  0  0  ?  S  02:57  0:00  \_ [7.xvda1-1]
>>>
>>> # xl list
>>> Name       ID   Mem  VCPUs  State   Time(s)
>>> Domain-0    0  1512      2  r-----    118.5
>>> (null)      1     8      4  --p--d     43.8
>>> (null)      3     8      4  --p--d      6.3
>>> (null)      4     8      4  --p--d     73.4
>>> (null)      6     8      4  --p--d     14.7
>>> (null)      7     8      4  --p--d     30.0
>>>
>>> Those all have...
>>>
>>> [root 11080]# cat wchan
>>> xen_blkif_schedule
>>>
>>> [root 11080]# cat stack
>>> [] xen_blkif_schedule+0x418/0xb40
>>> [] kthread+0xe5/0x100
>>> [] ret_from_fork+0x25/0x30
>>> [] 0x
>>
>> And found another reference count bug. Would you like to give the
>> attached patch (to be applied additionally to the previous ones) a
>> try?
>>
>> Juergen
>
> This seems to have solved the issue in 4.9.28, with all three patches
> applied. Awesome!
>
> On my main test machine I can no longer replicate what I was
> originally seeing, and in dmesg I now see this flow...
>
> xen-blkback: xen_blkif_disconnect: busy
> xen-blkback: xen_blkif_free: delayed = 1
> xen-blkback: xen_blkif_free: delayed = 0
>
> xl list is clean, xenstore looks right. No extraneous processes left
> over.
>
> Thank you Juergen, so much. Really appreciate your persistence with
> this. Anything I can do to help push this upstream please let me know.
> Feel free to add a reported-by line with my name if you think it
> appropriate.

This is good news.

Juergen, can I request a full patch set posted to the list (plz CC me) - 
and I'll ensure we can build the kernel with all 3 (?) patches applied 
and test properly.

I'll build up a complete kernel with those patches and give a tested-by 
if all goes well.
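
For what it's worth, the way I apply such a set for test builds is
roughly the following (a sketch - the patch file names are placeholders
for whatever gets posted, not the real names):

# Sketch: apply the three blkback fixes on top of a vanilla 4.9.x tree.
cd linux-4.9.28
for p in 0001-xen-blkback-fix-disconnect.patch \
         0002-xen-blkback-dont-free-be-too-early.patch \
         0003-xen-blkback-refcount-fix.patch; do
    patch -p1 < "$p"
done
make olddefconfig
make -j"$(nproc)" bzImage modules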


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



Re: [Xen-devel] 4.9rc4: Cannot build with higher than -j4 - was: linux.c:27:28: fatal error: xen/sys/evtchn.h: No such file or directory

2017-05-14 Thread Steven Haigh
On 10/05/17 23:02, Steven Haigh wrote:
> On 10/05/17 01:20, M A Young wrote:
>> On Tue, 9 May 2017, Steven Haigh wrote:
>>
>>> I'm trying to use the same build procedure I had for working correctly
>>> for Xen 4.7 & 4.8.1 - but am coming across this error:
>>>
>>> gcc  -DPIC -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall
>>> -Wstrict-prototypes -Wdeclaration-after-statement
>>> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -g3 -O0
>>> -fno-omit-frame-pointer -D__XEN_INTERFACE_VERSION__=__XE
>>> N_LATEST_INTERFACE_VERSION__ -MMD -MF .linux.opic.d -D_LARGEFILE_SOURCE
>>> -D_LARGEFILE64_SOURCE   -Werror -Wmissing-prototypes -I./include
>>> -I/builddir/build/BUILD/xen-4.9.0-rc4/tools/libs/evtchn/../../../tools/include
>>> -I/builddir/build/BUI
>>> LD/xen-4.9.0-rc4/tools/libs/evtchn/../../../tools/libs/toollog/include
>>> -I/builddir/build/BUILD/xen-4.9.0-rc4/tools/libs/evtchn/../../../tools/include
>>>  -fPIC -c -o linux.opic linux.c
>>> mv headers.chk.new headers.chk
>>> linux.c:27:28: fatal error: xen/sys/evtchn.h: No such file or directory
>>>  #include <xen/sys/evtchn.h>
>>> ^
>>> compilation terminated.
>>> linux.c:27:28: fatal error: xen/sys/evtchn.h: No such file or directory
>>>  #include <xen/sys/evtchn.h>
>>> ^
>>> compilation terminated.
>>>
>>> Any clues as to what to start pulling apart that changed between 4.8.1
>>> and 4.9.0-rc4 that could cause this?
>>
>> It worked for me in a test build, eg. see one of the builds at
>> https://copr.fedorainfracloud.org/coprs/myoung/xentest/build/549124/
> 
> Ok, after lots of debugging, when I run 'make dist', I usually use the
> macro for smp building, so I end up with:
>   make %{?_smp_mflags} dist
> 
> It seems this is hit and miss as to it actually working.
> 
> I have had a 100% success rate (but slow builds) with:
>   make dist
> 
> Trying with 'make -j4 dist' seems to work the couple of times I've tried it.
> 
> This seems to be a new problem that I haven't come across before in 4.4,
> 4.5, 4.6, 4.7 or my initial 4.8.1 builds - so its new to 4.9.0 rc's.
> 
> The consensus on #xen seems to be that there is a race between libs &
> include - and that these are supposed to be built in sequence and not
> parallel.
> 
> I'm a little over my depth now - as I assume this heads into Makefile land.
> 
> If it helps, there is a full build log available at:
>   https://cloud.crc.id.au/index.php/s/iTWJE3A1TQBhgDq
> 
> I've committed my current progress in my git tree:
>   https://xen.crc.id.au/git/?p=xen49;a=tree
> 
> Right now, we're looking at lines 304 / 305 of SPECS/xen49.spec

Just wanted to give this a nudge. It seems that if you build with more
than -j4 (on a machine with a suitable number of cores), the build will
fail. This is a regression from all versions prior to 4.9.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] 4.9rc4: linux.c:27:28: fatal error: xen/sys/evtchn.h: No such file or directory

2017-05-10 Thread Steven Haigh
On 10/05/17 01:20, M A Young wrote:
> On Tue, 9 May 2017, Steven Haigh wrote:
> 
>> I'm trying to use the same build procedure I had for working correctly
>> for Xen 4.7 & 4.8.1 - but am coming across this error:
>>
>> gcc  -DPIC -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall
>> -Wstrict-prototypes -Wdeclaration-after-statement
>> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -g3 -O0
>> -fno-omit-frame-pointer -D__XEN_INTERFACE_VERSION__=__XE
>> N_LATEST_INTERFACE_VERSION__ -MMD -MF .linux.opic.d -D_LARGEFILE_SOURCE
>> -D_LARGEFILE64_SOURCE   -Werror -Wmissing-prototypes -I./include
>> -I/builddir/build/BUILD/xen-4.9.0-rc4/tools/libs/evtchn/../../../tools/include
>> -I/builddir/build/BUI
>> LD/xen-4.9.0-rc4/tools/libs/evtchn/../../../tools/libs/toollog/include
>> -I/builddir/build/BUILD/xen-4.9.0-rc4/tools/libs/evtchn/../../../tools/include
>>  -fPIC -c -o linux.opic linux.c
>> mv headers.chk.new headers.chk
>> linux.c:27:28: fatal error: xen/sys/evtchn.h: No such file or directory
>>  #include <xen/sys/evtchn.h>
>> ^
>> compilation terminated.
>> linux.c:27:28: fatal error: xen/sys/evtchn.h: No such file or directory
>>  #include <xen/sys/evtchn.h>
>> ^
>> compilation terminated.
>>
>> Any clues as to what to start pulling apart that changed between 4.8.1
>> and 4.9.0-rc4 that could cause this?
> 
> It worked for me in a test build, eg. see one of the builds at
> https://copr.fedorainfracloud.org/coprs/myoung/xentest/build/549124/

Ok, after lots of debugging, when I run 'make dist', I usually use the
macro for smp building, so I end up with:
make %{?_smp_mflags} dist

It seems this is hit and miss as to it actually working.

I have had a 100% success rate (but slow builds) with:
make dist

Trying with 'make -j4 dist' seems to work the couple of times I've tried it.

This seems to be a new problem that I haven't come across before in 4.4,
4.5, 4.6, 4.7 or my initial 4.8.1 builds - so its new to 4.9.0 rc's.

The consensus on #xen seems to be that there is a race between libs &
include - and that these are supposed to be built in sequence and not
parallel.
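
If that's right, a workaround in my spec file might be to build the
headers serially before kicking off the parallel build - something like
this (a sketch against my packaging, not a proper fix):

# Sketch: force tools/include to be built first, on its own, then run
# the parallel build as before.
make -C tools/include
make %{?_smp_mflags} dist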

I'm a little over my depth now - as I assume this heads into Makefile land.

If it helps, there is a full build log available at:
https://cloud.crc.id.au/index.php/s/iTWJE3A1TQBhgDq

I've committed my current progress in my git tree:
https://xen.crc.id.au/git/?p=xen49;a=tree

Right now, we're looking at lines 304 / 305 of SPECS/xen49.spec

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





[Xen-devel] 4.9rc4: linux.c:27:28: fatal error: xen/sys/evtchn.h: No such file or directory

2017-05-09 Thread Steven Haigh
I'm trying to use the same build procedure I had for working correctly
for Xen 4.7 & 4.8.1 - but am coming across this error:

gcc  -DPIC -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall
-Wstrict-prototypes -Wdeclaration-after-statement
-Wno-unused-but-set-variable -Wno-unused-local-typedefs   -g3 -O0
-fno-omit-frame-pointer -D__XEN_INTERFACE_VERSION__=__XE
N_LATEST_INTERFACE_VERSION__ -MMD -MF .linux.opic.d -D_LARGEFILE_SOURCE
-D_LARGEFILE64_SOURCE   -Werror -Wmissing-prototypes -I./include
-I/builddir/build/BUILD/xen-4.9.0-rc4/tools/libs/evtchn/../../../tools/include
-I/builddir/build/BUI
LD/xen-4.9.0-rc4/tools/libs/evtchn/../../../tools/libs/toollog/include
-I/builddir/build/BUILD/xen-4.9.0-rc4/tools/libs/evtchn/../../../tools/include
 -fPIC -c -o linux.opic linux.c
mv headers.chk.new headers.chk
linux.c:27:28: fatal error: xen/sys/evtchn.h: No such file or directory
 #include <xen/sys/evtchn.h>
^
compilation terminated.
linux.c:27:28: fatal error: xen/sys/evtchn.h: No such file or directory
 #include <xen/sys/evtchn.h>
^
compilation terminated.

Any clues as to what to start pulling apart that changed between 4.8.1
and 4.9.0-rc4 that could cause this?

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897






Re: [Xen-devel] null domains after xl destroy

2017-05-03 Thread Steven Haigh
On 04/05/17 01:53, Juergen Gross wrote:
> On 03/05/17 12:45, Steven Haigh wrote:
>> Just wanted to give this a little nudge now people seem to be back on
>> deck...
> 
> Glenn, could you please give the attached patch a try?
> 
> It should be applied on top of the other correction, the old debug
> patch should not be applied.
> 
> I have added some debug output to make sure we see what is happening.

This patch is included in kernel-xen-4.9.26-1

It should be in the repos now.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] null domains after xl destroy

2017-05-03 Thread Steven Haigh
>>> Thanks for debugging this, the easiest solution seems to be to replace
>>> the ring->inflight atomic_read check in xen_blkif_disconnect with a
>>> call to xen_blk_drain_io instead, and making xen_blkif_disconnect
>>> return void (to prevent further issues like this one).
>>
>> Glenn,
>>
>> can you please try the attached patch (in dom0)?
>>
>>
>> Juergen
>>
> 
> (resending with full CC list)
> 
> I'm back. After testing unfortunately I'm still seeing the leak. The
> below trace is with the debug patch applied as well under 4.9.25. It
> looks very similar to me. I am still able to replicate this reliably.
> 
> Regards, Glenn
> http://rimuhosting.com
> 
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 19 at drivers/block/xen-blkback/xenbus.c:511
> xen_blkbk_remove+0x138/0x140
> Modules linked in: ebt_ip xen_pciback xen_netback xen_gntalloc
> xen_gntdev xen_evtchn xenfs xen_privcmd xt_CT ipt_REJECT nf_reject_ipv4
> ebtable_filter ebtables xt_hashlimit xt_recent xt_state iptable_security
> iptable_raw iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
> nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables bridge stp llc
> ipv6 crc_ccitt ppdev parport_pc parport serio_raw i2c_i801 i2c_smbus
> i2c_core sg e1000e ptp pps_core i3000_edac edac_core raid1 sd_mod ahci
> libahci floppy dm_mirror dm_region_hash dm_log dm_mod
> CPU: 0 PID: 19 Comm: xenwatch Not tainted 4.9.25-1.el6xen.x86_64 #1
> Hardware name: Supermicro PDSML/PDSML+, BIOS 6.00 08/27/2007
>  c90040cfbb98 8136b76f 0013 
>    c90040cfbbe8 8108007d
>  ea141720 01ff41334434 8801 88004d3aedc0
> Call Trace:
>  [] dump_stack+0x67/0x98
>  [] __warn+0xfd/0x120
>  [] warn_slowpath_null+0x1d/0x20
>  [] xen_blkbk_remove+0x138/0x140
>  [] xenbus_dev_remove+0x47/0xa0
>  [] __device_release_driver+0xb4/0x160
>  [] device_release_driver+0x2d/0x40
>  [] bus_remove_device+0x124/0x190
>  [] device_del+0x112/0x210
>  [] ? xenbus_read+0x53/0x70
>  [] device_unregister+0x22/0x60
>  [] frontend_changed+0xad/0x4c0
>  [] xenbus_otherend_changed+0xc7/0x140
>  [] ? _raw_spin_unlock_irqrestore+0x16/0x20
>  [] frontend_changed+0x10/0x20
>  [] xenwatch_thread+0x9c/0x140
>  [] ? woken_wake_function+0x20/0x20
>  [] ? schedule+0x3a/0xa0
>  [] ? _raw_spin_unlock_irqrestore+0x16/0x20
>  [] ? complete+0x4d/0x60
>  [] ? split+0xf0/0xf0
>  [] kthread+0xe5/0x100
>  [] ? kthread+0xcd/0x100
>  [] ? __kthread_init_worker+0x40/0x40
>  [] ? __kthread_init_worker+0x40/0x40
>  [] ? __kthread_init_worker+0x40/0x40
>  [] ret_from_fork+0x25/0x30
> ---[ end trace ea3a48c80e4ad79d ]---
> 

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] null domains after xl destroy

2017-04-21 Thread Steven Haigh
On 20/04/17 02:22, Steven Haigh wrote:
> On 19/04/17 20:09, Juergen Gross wrote:
>> On 19/04/17 09:16, Roger Pau Monné wrote:
>>> On Wed, Apr 19, 2017 at 06:39:41AM +0200, Juergen Gross wrote:
>>>> On 19/04/17 03:02, Glenn Enright wrote:
>>>>> Thanks Juergen. I applied that, to our 4.9.23 dom0 kernel, which still
>>>>> shows the issue. When replicating the leak I now see this trace (via
>>>>> dmesg). Hopefully that is useful.
>>>>>
>>>>> Please note, I'm going to be offline next week, but am keen to keep on
>>>>> with this, it may just be a while before I followup is all.
>>>>>
>>>>> Regards, Glenn
>>>>> http://rimuhosting.com
>>>>>
>>>>>
>>>>> ------------[ cut here ]------------
>>>>> WARNING: CPU: 0 PID: 19 at drivers/block/xen-blkback/xenbus.c:508
>>>>> xen_blkbk_remove+0x138/0x140
>>>>> Modules linked in: xen_pciback xen_netback xen_gntalloc xen_gntdev
>>>>> xen_evtchn xenfs xen_privcmd xt_CT ipt_REJECT nf_reject_ipv4
>>>>> ebtable_filter ebtables xt_hashlimit xt_recent xt_state iptable_security
>>>>> iptable_raw igle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
>>>>> nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables bridge stp llc
>>>>> ipv6 crc_ccitt ppdev parport_pc parport serio_raw sg i2c_i801 i2c_smbus
>>>>> i2c_core e1000e ptp p000_edac edac_core raid1 sd_mod ahci libahci floppy
>>>>> dm_mirror dm_region_hash dm_log dm_mod
>>>>> CPU: 0 PID: 19 Comm: xenwatch Not tainted 4.9.23-1.el6xen.x86_64 #1
>>>>> Hardware name: Supermicro PDSML/PDSML+, BIOS 6.00 08/27/2007
>>>>>  c90040cfbba8 8136b61f 0013 
>>>>>    c90040cfbbf8 8108007d
>>>>>  ea0001373fe0 01fc33394434 8801 88004d93fac0
>>>>> Call Trace:
>>>>>  [] dump_stack+0x67/0x98
>>>>>  [] __warn+0xfd/0x120
>>>>>  [] warn_slowpath_null+0x1d/0x20
>>>>>  [] xen_blkbk_remove+0x138/0x140
>>>>>  [] xenbus_dev_remove+0x47/0xa0
>>>>>  [] __device_release_driver+0xb4/0x160
>>>>>  [] device_release_driver+0x2d/0x40
>>>>>  [] bus_remove_device+0x124/0x190
>>>>>  [] device_del+0x112/0x210
>>>>>  [] ? xenbus_read+0x53/0x70
>>>>>  [] device_unregister+0x22/0x60
>>>>>  [] frontend_changed+0xad/0x4c0
>>>>>  [] ? schedule_tail+0x1e/0xc0
>>>>>  [] xenbus_otherend_changed+0xc7/0x140
>>>>>  [] ? _raw_spin_unlock_irqrestore+0x16/0x20
>>>>>  [] ? schedule_tail+0x1e/0xc0
>>>>>  [] frontend_changed+0x10/0x20
>>>>>  [] xenwatch_thread+0x9c/0x140
>>>>>  [] ? woken_wake_function+0x20/0x20
>>>>>  [] ? schedule+0x3a/0xa0
>>>>>  [] ? _raw_spin_unlock_irqrestore+0x16/0x20
>>>>>  [] ? complete+0x4d/0x60
>>>>>  [] ? split+0xf0/0xf0
>>>>>  [] kthread+0xcd/0xf0
>>>>>  [] ? schedule_tail+0x1e/0xc0
>>>>>  [] ? __kthread_init_worker+0x40/0x40
>>>>>  [] ? __kthread_init_worker+0x40/0x40
>>>>>  [] ret_from_fork+0x25/0x30
>>>>> ---[ end trace ee097287c9865a62 ]---
>>>>
>>>> Konrad, Roger,
>>>>
>>>> this was triggered by a debug patch in xen_blkbk_remove():
>>>>
>>>>if (be->blkif)
>>>> -  xen_blkif_disconnect(be->blkif);
>>>> +  WARN_ON(xen_blkif_disconnect(be->blkif));
>>>>
>>>> So I guess we need something like xen_blk_drain_io() in case of calls to
>>>> xen_blkif_disconnect() which are not allowed to fail (either at the call
>>>> sites of xen_blkif_disconnect() or in this function depending on a new
>>>> boolean parameter indicating it should wait for outstanding I/Os).
>>>>
>>>> I can try a patch, but I'd appreciate if you could confirm this wouldn't
>>>> add further problems...
>>>
>>> Hello,
>>>
>>> Thanks for debugging this, the easiest solution seems to be to replace the
>>> ring->inflight atomic_read check in xen_blkif_disconnect with a call to
>>> xen_blk_drain_io instead, and making xen_blkif_disconnect return void (to
>>> prevent further issues like this one).
>>
>> Glenn,
>>
>> can you please try the attached patch (in dom0)?

Tested-by: Steven Haigh <net...@crc.id.au>

I've tried specifically with 4.9.23 and can no longer make this occur in
my scenario. Also built with 4.9.24 and expecting similar results.

I'm aware Glenn has a much wider test schedule and number of systems
than me, however my testing is successful.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] Xen Project Security Team considering becoming a CNA

2017-04-20 Thread Steven Haigh
On 21/04/17 01:57, Ian Jackson wrote:
> (Resending with the correct CC (!))
> 
> We are in discussions with MITRE with a view to potentially becoming a
> CVE Numbering Authority.  This would probably smooth the process of
> getting CVE numbers for XSAs.
> 
> If anyone has any opinions/representations/concerns/whatever about
> this, please do share them (here in this thread, or privately to
> security@).

YES.

YES. YES. YES. YES. YES. YES. YES. YES. YES. YES. YES. YES. YES. YES.
YES. YES. YES. YES. YES. YES. YES. YES. YES. YES. YES. YES. YES. YES.

Yes.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] null domains after xl destroy

2017-04-19 Thread Steven Haigh
On 19/04/17 20:09, Juergen Gross wrote:
> On 19/04/17 09:16, Roger Pau Monné wrote:
>> On Wed, Apr 19, 2017 at 06:39:41AM +0200, Juergen Gross wrote:
>>> On 19/04/17 03:02, Glenn Enright wrote:
>>>> Thanks Juergen. I applied that, to our 4.9.23 dom0 kernel, which still
>>>> shows the issue. When replicating the leak I now see this trace (via
>>>> dmesg). Hopefully that is useful.
>>>>
>>>> Please note, I'm going to be offline next week, but am keen to keep on
>>>> with this, it may just be a while before I followup is all.
>>>>
>>>> Regards, Glenn
>>>> http://rimuhosting.com
>>>>
>>>>
>>>> ------------[ cut here ]------------
>>>> WARNING: CPU: 0 PID: 19 at drivers/block/xen-blkback/xenbus.c:508
>>>> xen_blkbk_remove+0x138/0x140
>>>> Modules linked in: xen_pciback xen_netback xen_gntalloc xen_gntdev
>>>> xen_evtchn xenfs xen_privcmd xt_CT ipt_REJECT nf_reject_ipv4
>>>> ebtable_filter ebtables xt_hashlimit xt_recent xt_state iptable_security
>>>> iptable_raw igle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
>>>> nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables bridge stp llc
>>>> ipv6 crc_ccitt ppdev parport_pc parport serio_raw sg i2c_i801 i2c_smbus
>>>> i2c_core e1000e ptp p000_edac edac_core raid1 sd_mod ahci libahci floppy
>>>> dm_mirror dm_region_hash dm_log dm_mod
>>>> CPU: 0 PID: 19 Comm: xenwatch Not tainted 4.9.23-1.el6xen.x86_64 #1
>>>> Hardware name: Supermicro PDSML/PDSML+, BIOS 6.00 08/27/2007
>>>>  c90040cfbba8 8136b61f 0013 
>>>>    c90040cfbbf8 8108007d
>>>>  ea0001373fe0 01fc33394434 8801 88004d93fac0
>>>> Call Trace:
>>>>  [] dump_stack+0x67/0x98
>>>>  [] __warn+0xfd/0x120
>>>>  [] warn_slowpath_null+0x1d/0x20
>>>>  [] xen_blkbk_remove+0x138/0x140
>>>>  [] xenbus_dev_remove+0x47/0xa0
>>>>  [] __device_release_driver+0xb4/0x160
>>>>  [] device_release_driver+0x2d/0x40
>>>>  [] bus_remove_device+0x124/0x190
>>>>  [] device_del+0x112/0x210
>>>>  [] ? xenbus_read+0x53/0x70
>>>>  [] device_unregister+0x22/0x60
>>>>  [] frontend_changed+0xad/0x4c0
>>>>  [] ? schedule_tail+0x1e/0xc0
>>>>  [] xenbus_otherend_changed+0xc7/0x140
>>>>  [] ? _raw_spin_unlock_irqrestore+0x16/0x20
>>>>  [] ? schedule_tail+0x1e/0xc0
>>>>  [] frontend_changed+0x10/0x20
>>>>  [] xenwatch_thread+0x9c/0x140
>>>>  [] ? woken_wake_function+0x20/0x20
>>>>  [] ? schedule+0x3a/0xa0
>>>>  [] ? _raw_spin_unlock_irqrestore+0x16/0x20
>>>>  [] ? complete+0x4d/0x60
>>>>  [] ? split+0xf0/0xf0
>>>>  [] kthread+0xcd/0xf0
>>>>  [] ? schedule_tail+0x1e/0xc0
>>>>  [] ? __kthread_init_worker+0x40/0x40
>>>>  [] ? __kthread_init_worker+0x40/0x40
>>>>  [] ret_from_fork+0x25/0x30
>>>> ---[ end trace ee097287c9865a62 ]---
>>>
>>> Konrad, Roger,
>>>
>>> this was triggered by a debug patch in xen_blkbk_remove():
>>>
>>> if (be->blkif)
>>> -   xen_blkif_disconnect(be->blkif);
>>> +   WARN_ON(xen_blkif_disconnect(be->blkif));
>>>
>>> So I guess we need something like xen_blk_drain_io() in case of calls to
>>> xen_blkif_disconnect() which are not allowed to fail (either at the call
>>> sites of xen_blkif_disconnect() or in this function depending on a new
>>> boolean parameter indicating it should wait for outstanding I/Os).
>>>
>>> I can try a patch, but I'd appreciate if you could confirm this wouldn't
>>> add further problems...
>>
>> Hello,
>>
>> Thanks for debugging this, the easiest solution seems to be to replace the
>> ring->inflight atomic_read check in xen_blkif_disconnect with a call to
>> xen_blk_drain_io instead, and making xen_blkif_disconnect return void (to
>> prevent further issues like this one).
> 
> Glenn,
> 
> can you please try the attached patch (in dom0)?

For what it's worth, I have applied this in kernel package 4.9.23-2 as
follows:

* Wed Apr 19 2017 Steven Haigh <net...@crc.id.au> - 4.9.23-2
- xen/blkback: fix disconnect while I/Os in flight

It's available from any 'in sync' mirror:
https://xen.crc.id.au/downloads/

Feedback welcome, for both my sake and Juergen's.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.8.0 dom0pvh=1 causes ext4 filesystem corruption

2017-03-09 Thread Steven Haigh
On 10/03/17 02:13, Valtteri Kiviniemi wrote:
> Hi,
> 
> Yesterday I decided to upgrade my Xen version from 4.6.0 to 4.8.0. I
> compiled it from source and at the same time I compiled the latest Linux
> kernel (4.10.1).
> 
> When rebooting I decided to try if dom0 PVH would work (with previous
> Xen version it just caused kernel panic). Seemed to boot fine until
> systemd started mounting the root filesystem and then the console was
> filled with ext4 errors. Couldn't even log in.
> 
> Booting with a systemrescuecd and running fsck just caused the whole
> filesystem to be re-attached in thousands of small pieces under
> lost+found. I was sure that this was a some kind of hardware failure, so
> I switched my hard drives and did a clean reinstall for dom0 and tried
> again. Again, after a reboot the whole rootfs was completely corrupted.
> 
> Second reinstall and this time I disabled dom0 PVH and the system booted
> just fine, and no ext4 errors. My root filesystem is just a simple Linux
> software raid1 with ext4 on top of it.
> 
> Now that I started thinking I have also had strange ext4 errors
> happening inside my guests, so I also disabled PVH from all the guests.
> With guests the ext4 error is always the same: "EXT4-fs error (device
> xvda1): ext4_iget:4665: inode #317: comm find: bogus i_mode (135206)"
> 
> Unfortunately I don't have any logs from the dom0 corruption as I can't
> even log in to the system when dom0 PVH is enabled. The corruption
> happens instantly during system bootup.

I have had this happen a lot using pvh mode in previous Xen versions. Is it
supposed to be 'working' yet, or is it still not recommended for use?

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen Security Advisory 209 (CVE-2017-2620) - cirrus_bitblt_cputovideo does not check if memory region is safe

2017-02-23 Thread Steven Haigh
On 23/02/17 20:43, Roger Pau Monné wrote:
> On Tue, Feb 21, 2017 at 12:00:03PM +, Xen.org security team wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>>
>> Xen Security Advisory CVE-2017-2620 / XSA-209
>>   version 3
>>
>>cirrus_bitblt_cputovideo does not check if memory region is safe
>>
>> UPDATES IN VERSION 3
>> 
>>
>> Public release.
>>
>> ISSUE DESCRIPTION
>> =
>>
>> In CIRRUS_BLTMODE_MEMSYSSRC mode the bitblit copy routine
>> cirrus_bitblt_cputovideo fails to check whether the specified memory
>> region is safe.
>>
>> IMPACT
>> ==
>>
>> A malicious guest administrator can cause an out of bounds memory
>> write, very likely exploitable as a privilege escalation.
>>
>> VULNERABLE SYSTEMS
>> ==
>>
>> Versions of qemu shipped with all Xen versions are vulnerable.
>>
>> Xen systems running on x86 with HVM guests, with the qemu process
>> running in dom0 are vulnerable.
>>
>> Only guests provided with the "cirrus" emulated video card can exploit
>> the vulnerability.  The non-default "stdvga" emulated video card is
>> not vulnerable.  (With xl the emulated video card is controlled by the
>> "stdvga=" and "vga=" domain configuration options.)
>>
>> ARM systems are not vulnerable.  Systems using only PV guests are not
>> vulnerable.
>>
>> For VMs whose qemu process is running in a stub domain, a successful
>> attacker will only gain the privileges of that stubdom, which should
>> be only over the guest itself.
>>
>> Both upstream-based versions of qemu (device_model_version="qemu-xen")
>> and `traditional' qemu (device_model_version="qemu-xen-traditional")
>> are vulnerable.
>>
>> MITIGATION
>> ==
>>
>> Running only PV guests will avoid the issue.
>>
>> Running HVM guests with the device model in a stubdomain will mitigate
>> the issue.
>>
>> Changing the video card emulation to stdvga (stdvga=1, vga="stdvga",
>> in the xl domain configuration) will avoid the vulnerability.
>>
>> CREDITS
>> ===
>>
>> This issue was discovered by Gerd Hoffmann of Red Hat.
>>
>> RESOLUTION
>> ==
>>
>> Applying the appropriate attached patch resolves this issue.
>>
>> xsa209-qemuu.patch   qemu-xen, qemu upstream
>> (no backport yet)    qemu-xen-traditional
> 
> It would be nice to mention that (at least on QEMU shipped with 4.7) the
> following patch is also needed for the XSA-209 fix to build correctly:
> 
> 52b7f43c8fa185ab856bcaacda7abc9a6fc07f84
> display: cirrus: ignore source pitch value as needed in blit_is_unsafe

I did request that an updated XSA be issued with this patch - as at the
moment, nobody will be able to apply the XSA-only patch to any other
version of Xen.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] dom0pvh issue with XEN 4.8.0

2017-02-05 Thread Steven Haigh
On Sunday, 5 February 2017 4:05:32 PM AEDT G.R. wrote:
> Hi all,
> dom0pvh=1 is not working well for me with XEN 4.8.0 + linux kernel 4.9.2.
> 
> The system boots with no obvious issue.
> But many user mode applications are suffering from segfaults, which
> makes the dom0 unusable: the segfaults always come from libc-2.24.so
> while it works just fine in PV dom0.
> I have no idea why, but those segfault would kill my ssh connection
> while sshd is not showing up in the victim list.
> 
> Some examples:
> Feb  5 14:25:28 gaia kernel: [  123.446346] getty[3044]: segfault at 0
> ip 7f5e769e6c60 sp 7ffc57bc0a98 error 6 in
> libc-2.24.so[7f5e769b7000+195000]
> Feb  5 14:29:04 gaia kernel: [  339.671742] grep[4195]: segfault at 0
> ip 7f5d3b95ac60 sp 7ffcc1620bb8 error 6 in
> libc-2.24.so[7f5d3b92b000+195000]
> Feb  5 14:29:23 gaia kernel: [  358.495888] tail[4203]: segfault at 0
> ip 7f751314bc60 sp 7fffe5ce5e48 error 6 in
> libc-2.24.so[7f751311c000+195000]
> Feb  5 14:35:06 gaia kernel: [  701.314247] bash[4323]: segfault at 0
> ip 7f3fef30ec60 sp 7ffd48cc2058 error 6 in
> libc-2.24.so[7f3fef2df000+195000]
> Feb  5 14:48:43 gaia kernel: [ 1518.809924] ls[4910]: segfault at 0 ip
> 7f29e9bc1c60 sp 7ffd712752b8 error 6 in
> libc-2.24.so[7f29e9b92000+195000]
> 
> Any suggestion on how to get this fixed?
> I don't think I can do live debug since the userspace is quite unstable.
> On the other hand, dmesg from both dom0 && XEN looks just fine.
> 
> PS: I'm using a custom compiled dom0 kernel. Is there any specific
> kernel config required to get dom0pvh=1 to work?

I've been down this path before - and the only thing that gave me stability 
back was to disable the pvh options. I had everything from disk corruption to 
what you mention with apps while trying this option.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897


signature.asc
Description: This is a digitally signed message part.
___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen and Shared Clipboard.

2016-12-31 Thread Steven Haigh
Save yourself a *ton* of hassle and use GNU Screen.

On 31/12/16 22:12, Jason Long wrote:
> I write it? I'm a noob at developing and just starting to learn it :(
> 
> 
> On Thursday, December 29, 2016 6:43 AM, Konrad Rzeszutek Wilk
> <konrad.w...@oracle.com> wrote:
> 
> 
> On Thu, Dec 29, 2016 at 10:11:42AM +, Jason Long wrote:
>> I guess it is a good feature for work and Xen must have it.
> 
> Looking forward for the patch from you on that!
> 
> Thanks!
> 
>> 
>> On Tue, 12/20/16, Jason Long <hack3r...@yahoo.com> wrote:
>>
>>  Subject: Xen and Shared Clipboard.
>>  To: "Xen-devel" <xen-de...@lists.xenproject.org
> <mailto:xen-de...@lists.xenproject.org>>
>>  Date: Tuesday, December 20, 2016, 10:58 PM
>> 
>>  Hello. How can I enable Shared
>>  Clipboard in Xen? I'd like to copy and paste text from Host to
>>  Guest or vice versa.
>>  Thank you.
> 
>>
>> ___
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> https://lists.xen.org/xen-devel
> 
> 
> 
> 
> 
> ___
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
> 

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Proposed plan and URL name for new VM to download xen tarballs (ftp.xenproject.org)

2016-10-12 Thread Steven Haigh
On 12/10/16 21:38, Ian Jackson wrote:
> Ian Jackson writes ("Re: Proposed plan and URL name for new VM to download 
> xen tarballs (ftp.xenproject.org)"):
>> Sure, I don't have an opinion.  I have changed this, so it's now
>> under:
>>   https://downloads.xenproject.org/release/xen/
> 
> No-one has objected, so we are now committing to this.  The new URLs
> will be primary for the forthcoming RC (Wei will send an announcement
> when it's ready).

I missed this previously in the rest of the list happenings.

I'm actually glad this is happening. Having predictable naming / pathing
of the xen tarballs is fantastic.

I loathe going via the web site to download the file.html, which ends up
being the filename on the system. It would be much nicer to have a direct,
automatically generated download link that works.

As such, +1 from me :)

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] Resend: [PATCH] v3 - Add exclusive locking option to block-iscsi

2016-05-18 Thread Steven Haigh

On 2016-05-09 14:22, Steven Haigh wrote:

On 2016-05-05 15:52, Steven Haigh wrote:

On 2016-05-05 12:32, Steven Haigh wrote:

Overview

If you're using iSCSI, you can mount a target by multiple Dom0
machines on the same target. For non-cluster aware filesystems, this
can lead to disk corruption and general bad times by all. The iSCSI
protocol allows the use of persistent reservations as per the SCSI
disk spec. Low level SCSI commands for locking are handled by the
sg_persist program (bundled with sg3_utils package in EL).

The aim of this patch is to create a 'locktarget=y' option specified
within the disk 'target' command for iSCSI to lock the target in
exclusive mode on VM start with a key generated from the local system's
IP, and release this lock on the shutdown of the DomU.

Example Config:
disk=
['script=block-iscsi,vdev=xvda,target=iqn=iqn.1986-03.com.sun:02:mytarget,portal=iscsi.example.com,locktarget=y']
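
For reference, a condensed sketch of the lock sequence the patches below
implement - not the patch itself, and assuming $dev has already been
resolved to the underlying /dev/sdX node by find_device:

    # Derive a registration key from the local system's IP, then register
    # it with the target and take an Exclusive Access (type 6) reservation.
    key=$(gethostip -x "$(uname -n)")
    sg_persist -d "${dev}" -o -G -S "${key}" > /dev/null
    sg_persist -d "${dev}" -o -R -K "${key}" -T 6 > /dev/null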

In writing this, I have also re-factored parts of the script to put
some things in what I believe to be a better place to make expansion
easier. This is mainly in removing functions that purely call other
functions with no actual code execution.

Signed-off-by: Steven Haigh <net...@crc.id.au>

(on a side note, first time I've submitted a patch to the list and I'm
currently stuck on a webmail client, so apologies in advance if this
all goes wrong ;)


Changes in v2:
Bugfix: Call find_device to locate the /dev/sdX component of the iSCSI
target before trying to run unlock_device().

Apologies for this oversight.



Changes in v3:
* Split the block-iscsi cleanup into a separate patch
(block-iscsi-locking-v3_01_simplify_block-iscsi.patch).
* Add locking in second patch file 
(block-iscsi-locking-v3_02_add_locking.patch)


Resend of patches.

There was a mention of having to add further documentation to
xl-disk-configuration.txt - however there is no mention of the block-iscsi
script within that documentation to extend. As such, it would probably be
out of place to add things there.


The locktarget option is presented directly to the block-iscsi script 
and not evaluated anywhere outside this script.


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
--- block-iscsi.orig	2016-05-09 15:12:02.489495212 +1000
+++ block-iscsi 2016-05-09 15:16:35.447480532 +1000
@@ -31,16 +31,6 @@
 echo $1 | sed "s/^\("$2"\)//"
 }
 
-check_tools()
-{
-if ! command -v iscsiadm > /dev/null 2>&1; then
-fatal "Unable to find iscsiadm tool"
-fi
-if [ "$multipath" = "y" ] && ! command -v multipath > /dev/null 2>&1; then
-fatal "Unable to find multipath"
-fi
-}
-
 # Sets the following global variables based on the params field passed in as
 # a parameter: iqn, portal, auth_method, user, multipath, password
 parse_target()
@@ -52,12 +42,18 @@
 case $param in
 iqn=*)
 iqn=$(remove_label $param "iqn=")
+if ! command -v iscsiadm > /dev/null 2>&1; then
+fatal "Could not find iscsiadm tool."
+fi
 ;;
 portal=*)
 portal=$(remove_label $param "portal=")
 ;;
 multipath=*)
 multipath=$(remove_label $param "multipath=")
+if ! command -v multipath > /dev/null 2>&1; then
+fatal "Multipath selected, but no multipath tools found"
+fi
 ;;
 esac
 done
@@ -96,40 +92,6 @@
 fi
 }

-# Attaches the target $iqn in $portal and sets $dev to point to the
-# multipath device
-attach()
-{
-do_or_die iscsiadm -m node --targetname "$iqn" -p "$portal" --login > /dev/null
-find_device
-}
-
-# Discovers targets in $portal and checks that $iqn is one of those targets
-# Also sets the auth parameters to attach the device
-prepare()
-{
-# Check if target is already opened
-iscsiadm -m session 2>&1 | grep -q "$iqn" && fatal "Device already opened"
-# Discover portal targets
-iscsiadm -m discovery -t st -p $portal 2>&1 | grep -q "$iqn" || \
-fatal "No matching target iqn found"
-}
-
-# Attaches the device and writes xenstore backend entries to connect
-# the device
-add()
-{
-attach
-write_dev $dev
-}
-
-# Disconnects the device
-remove()
-{
-find_device
-do_or_die iscsiadm -m node --targetname "$iqn" -p "$portal" --logout > /dev/null
-}
-
 command=$1
 target=$(xenstore-read $XENBUS_PATH/params || true)
 if [ -z "$target" ]; then
@@ -138,15 +100,21 @@

 parse_target "$target"

-check_tools || exit 1
-
 case $command in
 add)
-prepare
-add
+# Check if target is already opened
+iscsiadm -m session 2>&1 | grep -

Re: [Xen-devel] [PATCH] v3 - Add exclusive locking option to block-iscsi

2016-05-15 Thread Steven Haigh

On 2016-05-12 21:02, Wei Liu wrote:

Hi Steven

On Mon, May 09, 2016 at 02:22:48PM +1000, Steven Haigh wrote:

On 2016-05-05 15:52, Steven Haigh wrote:
>On 2016-05-05 12:32, Steven Haigh wrote:
>>Overview
>>
>>If you're using iSCSI, you can mount a target by multiple Dom0
>>machines on the same target. For non-cluster aware filesystems, this
>>can lead to disk corruption and general bad times by all. The iSCSI
>>protocol allows the use of persistent reservations as per the SCSI
>>disk spec. Low level SCSI commands for locking are handled by the
>>sg_persist program (bundled with sg3_utils package in EL).
>>
>>The aim of this patch is to create a 'locktarget=y' option specified
>>within the disk 'target' command for iSCSI to lock the target in
>>exclusive mode on VM start with a key generated from the local system's
>>IP, and release this lock on the shutdown of the DomU.
>>
>>Example Config:
>>disk=
>>['script=block-iscsi,vdev=xvda,target=iqn=iqn.1986-03.com.sun:02:mytarget,portal=iscsi.example.com,locktarget=y']


You seem to suggest an extension (locktarget) to the disk spec as well, but
your patch doesn't contain a modification to
docs/txt/misc/xl-disk-configuration.txt.


Correct. There is no documentation for the existing block-iscsi script 
within xl-disk-configuration.txt.


In fact, there is no mention of block-iscsi at all in any of the
documentation that I can see.


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH] v3 - Add exclusive locking option to block-iscsi

2016-05-08 Thread Steven Haigh

On 2016-05-05 15:52, Steven Haigh wrote:

On 2016-05-05 12:32, Steven Haigh wrote:

Overview

If you're using iSCSI, you can mount a target by multiple Dom0
machines on the same target. For non-cluster aware filesystems, this
can lead to disk corruption and general bad times by all. The iSCSI
protocol allows the use of persistent reservations as per the SCSI
disk spec. Low level SCSI commands for locking are handled by the
sg_persist program (bundled with sg3_utils package in EL).

The aim of this patch is to create a 'locktarget=y' option specified
within the disk 'target' command for iSCSI to lock the target in
exclusive mode on VM start with a key generated from the local system's
IP, and release this lock on the shutdown of the DomU.

Example Config:
disk=
['script=block-iscsi,vdev=xvda,target=iqn=iqn.1986-03.com.sun:02:mytarget,portal=iscsi.example.com,locktarget=y']

In writing this, I have also re-factored parts of the script to put
some things in what I believe to be a better place to make expansion
easier. This is mainly in removing functions that purely call other
functions with no actual code execution.

Signed-off-by: Steven Haigh <net...@crc.id.au>

(on a side note, first time I've submitted a patch to the list and I'm
currently stuck on a webmail client, so apologies in advance if this
all goes wrong ;)


Changes in v2:
Bugfix: Call find_device to locate the /dev/sdX component of the iSCSI
target before trying to run unlock_device().

Apologies for this oversight.



Changes in v3:
* Split the block-iscsi cleanup into a separate patch
(block-iscsi-locking-v3_01_simplify_block-iscsi.patch).
* Add locking in second patch file 
(block-iscsi-locking-v3_02_add_locking.patch)


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
--- block-iscsi.orig	2016-05-09 15:12:02.489495212 +1000
+++ block-iscsi 2016-05-09 15:16:35.447480532 +1000
@@ -31,16 +31,6 @@
 echo $1 | sed "s/^\("$2"\)//"
 }
 
-check_tools()
-{
-if ! command -v iscsiadm > /dev/null 2>&1; then
-fatal "Unable to find iscsiadm tool"
-fi
-if [ "$multipath" = "y" ] && ! command -v multipath > /dev/null 2>&1; then
-fatal "Unable to find multipath"
-fi
-}
-
 # Sets the following global variables based on the params field passed in as
 # a parameter: iqn, portal, auth_method, user, multipath, password
 parse_target()
@@ -52,12 +42,18 @@
 case $param in
 iqn=*)
 iqn=$(remove_label $param "iqn=")
+if ! command -v iscsiadm > /dev/null 2>&1; then
+fatal "Could not find iscsiadm tool."
+fi
 ;;
 portal=*)
 portal=$(remove_label $param "portal=")
 ;;
 multipath=*)
 multipath=$(remove_label $param "multipath=")
+if ! command -v multipath > /dev/null 2>&1; then
+fatal "Multipath selected, but no multipath tools found"
+fi
 ;;
 esac
 done
@@ -96,40 +92,6 @@
 fi
 }

-# Attaches the target $iqn in $portal and sets $dev to point to the
-# multipath device
-attach()
-{
-do_or_die iscsiadm -m node --targetname "$iqn" -p "$portal" --login > /dev/null
-find_device
-}
-
-# Discovers targets in $portal and checks that $iqn is one of those targets
-# Also sets the auth parameters to attach the device
-prepare()
-{
-# Check if target is already opened
-iscsiadm -m session 2>&1 | grep -q "$iqn" && fatal "Device already opened"
-# Discover portal targets
-iscsiadm -m discovery -t st -p $portal 2>&1 | grep -q "$iqn" || \
-fatal "No matching target iqn found"
-}
-
-# Attaches the device and writes xenstore backend entries to connect
-# the device
-add()
-{
-attach
-write_dev $dev
-}
-
-# Disconnects the device
-remove()
-{
-find_device
-do_or_die iscsiadm -m node --targetname "$iqn" -p "$portal" --logout > /dev/null
-}
-
 command=$1
 target=$(xenstore-read $XENBUS_PATH/params || true)
 if [ -z "$target" ]; then
@@ -138,15 +100,21 @@

 parse_target "$target"

-check_tools || exit 1
-
 case $command in
 add)
-prepare
-add
+# Check if target is already opened
+iscsiadm -m session 2>&1 | grep -q "$iqn" && fatal "Device already opened"
+# Discover portal targets
+iscsiadm -m discovery -t st -p $portal 2>&1 | grep -q "$iqn" || \
+fatal "No matching target iqn found"
+
+## Login to the iSCSI target.
+do_or_die iscsiadm -m node --targetname "$iqn" -p "$portal" --login > /dev/null
+
+write_dev $dev
 ;;
 remove)

Re: [Xen-devel] [PATCH] v2 - Add exclusive locking option to block-iscsi

2016-05-06 Thread Steven Haigh
On 6/05/2016 7:09 PM, Roger Pau Monné wrote:
> On Thu, May 05, 2016 at 03:52:30PM +1000, Steven Haigh wrote:
>> On 2016-05-05 12:32, Steven Haigh wrote:
>>> Overview
>>>
>>> If you're using iSCSI, you can mount a target by multiple Dom0
>>> machines on the same target. For non-cluster aware filesystems, this
>>> can lead to disk corruption and general bad times by all. The iSCSI
>>> protocol allows the use of persistent reservations as per the SCSI
>>> disk spec. Low level SCSI commands for locking are handled by the
>>> sg_persist program (bundled with sg3_utils package in EL).
>>>
>>> The aim of this patch is to create a 'locktarget=y' option specified
>>> within the disk 'target' command for iSCSI to lock the target in
>>> exclusive mode on VM start with a key generated from the local system's
>>> IP, and release this lock on the shutdown of the DomU.
>>>
>>> Example Config:
>>> disk=
>>> ['script=block-iscsi,vdev=xvda,target=iqn=iqn.1986-03.com.sun:02:mytarget,portal=iscsi.example.com,locktarget=y']
>>>
>>> In writing this, I have also re-factored parts of the script to put
>>> some things in what I believe to be a better place to make expansion
>>> easier. This is mainly in removing functions that purely call other
>>> functions with no actual code execution.
>>>
>>> Signed-off-by: Steven Haigh <net...@crc.id.au>
>>>
>>> (on a side note, first time I've submitted a patch to the list and I'm
>>> currently stuck on a webmail client, so apologies in advance if this
>>> all goes wrong ;)
>>
>> Changes in v2:
>> Bugfix: Call find_device to locate the /dev/sdX component of the iSCSI
>> target before trying to run unlock_device().
>>
>> Apologies for this oversight.
> 
> Thanks for the patch! A couple of comments below.
> 
>> -- 
>> Steven Haigh
>>
>> Email: net...@crc.id.au
>> Web: https://www.crc.id.au
>> Phone: (03) 9001 6090 - 0412 935 897
> 
>> --- block-iscsi 2016-02-10 01:44:19.0 +1100
>> +++ block-iscsi-lock	2016-05-05 15:42:09.557191235 +1000
>> @@ -31,33 +31,37 @@
>>  echo $1 | sed "s/^\("$2"\)//"
>>  }
>>  
>> -check_tools()
>> -{
>> -if ! command -v iscsiadm > /dev/null 2>&1; then
>> -fatal "Unable to find iscsiadm tool"
>> -fi
>> -if [ "$multipath" = "y" ] && ! command -v multipath > /dev/null 2>&1; 
>> then
>> -fatal "Unable to find multipath"
>> -fi
>> -}
>> -
>>  # Sets the following global variables based on the params field passed in as
>>  # a parameter: iqn, portal, auth_method, user, multipath, password
>>  parse_target()
>>  {
>>  # set multipath default value
>>  multipath="n"
>> -for param in $(echo "$1" | tr "," "\n")
>> -do
>> +for param in $(echo "$1" | tr "," "\n"); do
>>  case $param in
>>  iqn=*)
>>  iqn=$(remove_label $param "iqn=")
>> +if ! command -v iscsiadm > /dev/null 2>&1; then
>> +fatal "Could not find iscsiadm tool."
>> +fi
>>  ;;
>>  portal=*)
>>  portal=$(remove_label $param "portal=")
>>  ;;
>>  multipath=*)
>>  multipath=$(remove_label $param "multipath=")
>> +if ! command -v multipath > /dev/null 2>&1; then
>> +fatal "Multipath selected, but no multipath tools found"
>> +fi
>> +;;
>> +locktarget=*)
>> +locktarget=$(remove_label $param "locktarget=")
>> +if ! command -v sg_persist > /dev/null 2>&1; then
>> +fatal "Locking requested but no sg_persist found"
>> +fi
>> +if ! command -v gethostip > /dev/null 2>&1; then
>> +fatal "Locking requested but no gethostip found for key 
>> generation"
>> +fi
> 
> Why don't you just add this to check_tools? In any case, if you want to fold 
> check_tools functionality into parse_target I think it should be done in a 
> separate patch in order for it to be easier to review.
> 
> IMHO, I prefer to have both functions separated, because it's

[Xen-devel] [PATCH] v2 - Add exclusive locking option to block-iscsi

2016-05-04 Thread Steven Haigh

On 2016-05-05 12:32, Steven Haigh wrote:

Overview

If you're using iSCSI, you can mount a target by multiple Dom0
machines on the same target. For non-cluster aware filesystems, this
can lead to disk corruption and general bad times by all. The iSCSI
protocol allows the use of persistent reservations as per the SCSI
disk spec. Low level SCSI commands for locking are handled by the
sg_persist program (bundled with sg3_utils package in EL).

The aim of this patch is to create a 'locktarget=y' option specified
within the disk 'target' command for iSCSI to lock the target in
exclusive mode on VM start with a key generated from the local system's
IP, and release this lock on the shutdown of the DomU.

Example Config:
disk=
['script=block-iscsi,vdev=xvda,target=iqn=iqn.1986-03.com.sun:02:mytarget,portal=iscsi.example.com,locktarget=y']

In writing this, I have also re-factored parts of the script to put
some things in what I believe to be a better place to make expansion
easier. This is mainly in removing functions that purely call other
functions with no actual code execution.

Signed-off-by: Steven Haigh <net...@crc.id.au>

(on a side note, first time I've submitted a patch to the list and I'm
currently stuck on a webmail client, so apologies in advance if this
all goes wrong ;)


Changes in v2:
Bugfix: Call find_device to locate the /dev/sdX component of the iSCSI 
target before trying to run unlock_device().


Apologies for this oversight.

--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
--- block-iscsi 2016-02-10 01:44:19.0 +1100
+++ block-iscsi-lock	2016-05-05 15:42:09.557191235 +1000
@@ -31,33 +31,37 @@
 echo $1 | sed "s/^\("$2"\)//"
 }
 
-check_tools()
-{
-if ! command -v iscsiadm > /dev/null 2>&1; then
-fatal "Unable to find iscsiadm tool"
-fi
-if [ "$multipath" = "y" ] && ! command -v multipath > /dev/null 2>&1; then
-fatal "Unable to find multipath"
-fi
-}
-
 # Sets the following global variables based on the params field passed in as
 # a parameter: iqn, portal, auth_method, user, multipath, password
 parse_target()
 {
 # set multipath default value
 multipath="n"
-for param in $(echo "$1" | tr "," "\n")
-do
+for param in $(echo "$1" | tr "," "\n"); do
 case $param in
 iqn=*)
 iqn=$(remove_label $param "iqn=")
+if ! command -v iscsiadm > /dev/null 2>&1; then
+fatal "Could not find iscsiadm tool."
+fi
 ;;
 portal=*)
 portal=$(remove_label $param "portal=")
 ;;
 multipath=*)
 multipath=$(remove_label $param "multipath=")
+if ! command -v multipath > /dev/null 2>&1; then
+fatal "Multipath selected, but no multipath tools found"
+fi
+;;
+locktarget=*)
+locktarget=$(remove_label $param "locktarget=")
+if ! command -v sg_persist > /dev/null 2>&1; then
+fatal "Locking requested but no sg_persist found"
+fi
+if ! command -v gethostip > /dev/null 2>&1; then
+fatal "Locking requested but no gethostip found for key generation"
+fi
 ;;
 esac
 done
@@ -96,38 +100,29 @@
 fi
 }
 
-# Attaches the target $iqn in $portal and sets $dev to point to the
-# multipath device
-attach()
-{
-do_or_die iscsiadm -m node --targetname "$iqn" -p "$portal" --login > /dev/null
-find_device
-}
 
-# Discovers targets in $portal and checks that $iqn is one of those targets
-# Also sets the auth parameters to attach the device
-prepare()
+lock_device()
 {
-# Check if target is already opened
-iscsiadm -m session 2>&1 | grep -q "$iqn" && fatal "Device already opened"
-# Discover portal targets
-iscsiadm -m discovery -t st -p $portal 2>&1 | grep -q "$iqn" || \
-fatal "No matching target iqn found"
-}
-
-# Attaches the device and writes xenstore backend entries to connect
-# the device
-add()
-{
-attach
-write_dev $dev
+## Lock the iSCSI target as Exclusive Access.
+key=$(gethostip -x $(uname -n))
+if ! sg_persist -d ${dev} -o -G -S ${key} > /dev/null; then
+unlock_device
+iscsiadm -m node --targetname "$iqn" -p "$portal" --logout > /dev/null
+fatal "iSCSI LOCK: Failed to register with target"
+fi
+if ! sg_persist -d ${dev} -o -R -K ${key} -T 6 > /dev/null; then
+unlock_device
+iscsiadm -m node 

[Xen-devel] [PATCH] v1 - Add exclusive locking option to block-iscsi

2016-05-04 Thread Steven Haigh

Overview

If you're using iSCSI, you can mount a target by multiple Dom0 machines 
on the same target. For non-cluster aware filesystems, this can lead to 
disk corruption and general bad times by all. The iSCSI protocol allows 
the use of persistent reservations as per the SCSI disk spec. Low level 
SCSI commands for locking are handled by the sg_persist program (bundled 
with sg3_utils package in EL).


The aim of this patch is to create a 'locktarget=y' option specified 
within the disk 'target' command for iSCSI to lock the target in 
exclusive mode on VM start with a key generated from the local system's
IP, and release this lock on the shutdown of the DomU.


Example Config:
disk= 
['script=block-iscsi,vdev=xvda,target=iqn=iqn.1986-03.com.sun:02:mytarget,portal=iscsi.example.com,locktarget=y']


In writing this, I have also re-factored parts of the script to put some 
things in what I believe to be a better place to make expansion easier. 
This is mainly in removing functions that purely call other functions 
with no actual code execution.


Signed-off-by: Steven Haigh <net...@crc.id.au>

(on a side note, first time I've submitted a patch to the list and I'm 
currently stuck on a webmail client, so apologies in advance if this all 
goes wrong ;)


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
--- block-iscsi	2016-02-10 01:44:19.0 +1100
+++ block-iscsi-lock	2016-05-05 12:30:24.831903983 +1000
@@ -31,33 +31,37 @@
 echo $1 | sed "s/^\("$2"\)//"
 }
 
-check_tools()
-{
-if ! command -v iscsiadm > /dev/null 2>&1; then
-fatal "Unable to find iscsiadm tool"
-fi
-if [ "$multipath" = "y" ] && ! command -v multipath > /dev/null 2>&1; then
-fatal "Unable to find multipath"
-fi
-}
-
 # Sets the following global variables based on the params field passed in as
 # a parameter: iqn, portal, auth_method, user, multipath, password
 parse_target()
 {
 # set multipath default value
 multipath="n"
-for param in $(echo "$1" | tr "," "\n")
-do
+for param in $(echo "$1" | tr "," "\n"); do
 case $param in
 iqn=*)
 iqn=$(remove_label $param "iqn=")
+if ! command -v iscsiadm > /dev/null 2>&1; then
+fatal "Could not find iscsiadm tool."
+fi
 ;;
 portal=*)
 portal=$(remove_label $param "portal=")
 ;;
 multipath=*)
 multipath=$(remove_label $param "multipath=")
+if ! command -v multipath > /dev/null 2>&1; then
+fatal "Multipath selected, but no multipath tools found"
+fi
+;;
+locktarget=*)
+locktarget=$(remove_label $param "locktarget=")
+if ! command -v sg_persist > /dev/null 2>&1; then
+fatal "Locking requested but no sg_persist found"
+fi
+if ! command -v gethostip > /dev/null 2>&1; then
+fatal "Locking requested but no gethostip found for key generation"
+fi
 ;;
 esac
 done
@@ -96,38 +100,29 @@
 fi
 }
 
-# Attaches the target $iqn in $portal and sets $dev to point to the
-# multipath device
-attach()
-{
-do_or_die iscsiadm -m node --targetname "$iqn" -p "$portal" --login > /dev/null
-find_device
-}
-
-# Discovers targets in $portal and checks that $iqn is one of those targets
-# Also sets the auth parameters to attach the device
-prepare()
-{
-# Check if target is already opened
-iscsiadm -m session 2>&1 | grep -q "$iqn" && fatal "Device already opened"
-# Discover portal targets
-iscsiadm -m discovery -t st -p $portal 2>&1 | grep -q "$iqn" || \
-fatal "No matching target iqn found"
-}

-# Attaches the device and writes xenstore backend entries to connect
-# the device
-add()
+lock_device()
 {
-attach
-write_dev $dev
+## Lock the iSCSI target as Exclusive Access.
+key=$(gethostip -x $(uname -n))
+if ! sg_persist -d ${dev} -o -G -S ${key} > /dev/null; then
+unlock_device
+iscsiadm -m node --targetname "$iqn" -p "$portal" --logout > /dev/null
+fatal "iSCSI LOCK: Failed to register with target"
+fi
+if ! sg_persist -d ${dev} -o -R -K ${key} -T 6 > /dev/null; then
+unlock_device
+iscsiadm -m node --targetname "$iqn" -p "$portal" --logout > /dev/null
+fatal "iSCSI LOCK: Failed to set persistent reservation"
+fi
 }

-# Disconnects the device

Re: [Xen-devel] block-iscsi with Xen 4.5 / 4.6

2016-05-04 Thread Steven Haigh
On 4/05/2016 7:37 PM, Roger Pau Monné wrote:
> On Wed, May 04, 2016 at 06:41:26PM +1000, Steven Haigh wrote:
>> On 4/05/2016 5:34 PM, Roger Pau Monné wrote:
>>> On Wed, May 04, 2016 at 03:06:23PM +1000, Steven Haigh wrote:
>>> It is important for us to use the '-e' in order to make sure all the 
>>> failure 
>>> points are correctly handled, without the '-e' some command might fail and 
>>> the script wouldn't realize.
>>
>> I honestly think this is pretty nasty. While it may not be true of all
>> scripts, the block-iscsi script can only really fail in a couple of
>> places - yet we have this set of procedures called:
>>
>> parse_target -> check_tools -> prepare -> add -> attach -> find_device
>> -> write_dev.
>>
>> At least check_tools, prepare, add, attach, find_device could all be
>> rolled into a single function - as the majority of the rest is 1-4 lines
>> of code.
> 
> No, check_tools is used by both the attach and the detach path, so it cannot 
> be rolled into a single function together with the other ones, and the same 
> applies to mostly all other functions (find_device is also shared between 
> the add and remove functions).
> 
> IMHO, I think the current code is fine because each function has a small 
> logical task to accomplish, so it's easy to make sure each function does 
> what it's supposed to do, nothing more and nothing less. Batching everything 
> into one big function would make this harder.
> 
> That doesn't mean that I'm not open to improving it, so if you think it 
> would be better/easier using some other logical organization patches are 
> welcome :).

Right now, my changes are here:
http://paste.fedoraproject.org/362462/62356799/

It works perfectly well if you're the ONLY device connecting to the
specified iSCSI target, but falls apart when something else has the lock
and doesn't clean up after itself.
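
When that happens, a small diagnostic sketch like the following - assuming
$dev points at the attached /dev/sdX node, as block-iscsi sets it - shows
which keys are registered and who still holds the reservation:

    # List all registered keys; a stale key left behind by another
    # Dom0 will show up here.
    sg_persist -d "${dev}" -i -k
    # Show the current reservation (type and holder), if any.
    sg_persist -d "${dev}" -i -r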

From using set -x in there, I can see this when it runs:
++ dirname /etc/xen/scripts/block-iscsi-lock
+ dir=/etc/xen/scripts
+ . /etc/xen/scripts/block-common.sh
+++ dirname /etc/xen/scripts/block-iscsi-lock
++ dir=/etc/xen/scripts
++ . /etc/xen/scripts/xen-hotplug-common.sh
 dirname /etc/xen/scripts/block-iscsi-lock
+++ dir=/etc/xen/scripts
+++ . /etc/xen/scripts/hotplugpath.sh
 sbindir=/usr/sbin
 bindir=/usr/bin
 LIBEXEC=/usr/lib/xen
 LIBEXEC_BIN=/usr/lib/xen/bin
 libdir=/usr/lib64
 SHAREDIR=/usr/share
 XENFIRMWAREDIR=/usr/lib/xen/boot
 XEN_CONFIG_DIR=/etc/xen
 XEN_SCRIPT_DIR=/etc/xen/scripts
 XEN_LOCK_DIR=/var/lock
 XEN_RUN_DIR=/var/run/xen
 XEN_PAGING_DIR=/var/lib/xen/xenpaging
 XEN_DUMP_DIR=/var/lib/xen/dump
+++ . /etc/xen/scripts/logging.sh
+++ . /etc/xen/scripts/xen-script-common.sh
 set -e
+++ . /etc/xen/scripts/locking.sh
 LOCK_BASEDIR=/var/run/xen-hotplug
+++ exec
Entered lock_device
  SUN   COMSTAR   1.0
  Peripheral device type: disk
libxl: error: libxl_exec.c:118:libxl_report_child_exitstatus:
/etc/xen/scripts/block-iscsi-lock add [22016] exited with error status 99
libxl: error: libxl.c:3127:local_device_attach_cb: unable to add vbd
with id 268446976: No such file or directory
libxl: error: libxl_bootloader.c:408:bootloader_disk_attached_cb: failed
to attach local disk for bootloader execution
libxl: error: libxl_bootloader.c:279:bootloader_local_detached_cb:
unable to detach locally attached disk
libxl: error: libxl_create.c:1142:domcreate_rebuild_done: cannot
(re-)build domain: -3
libxl: error: libxl.c:1591:libxl__destroy_domid: non-existant domain 57
libxl: error: libxl.c:1549:domain_destroy_callback: unable to destroy
guest with domid 57
libxl: error: libxl.c:1476:domain_destroy_cb: destruction of domain 57
failed

On a side note, it looks like the -e is not very uniformly used:
block:#!/bin/bash
block-enbd:#!/bin/bash
block-iscsi:#!/bin/bash -e
block-iscsi-lock:#!/bin/bash -x
block-nbd:#!/bin/bash
block-tap:#!/bin/bash -e
external-device-migrate:#!/bin/bash
vif2:#!/bin/bash
vif-bridge:#!/bin/bash
vif-nat:#!/bin/bash
vif-openvswitch:#!/bin/bash
vif-route:#!/bin/bash
vif-setup:#!/bin/bash

but probably gets set everywhere via xen-script-common.sh - hence things
dying easily.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] block-iscsi with Xen 4.5 / 4.6

2016-05-04 Thread Steven Haigh
On 4/05/2016 6:25 PM, George Dunlap wrote:
> On Wed, May 4, 2016 at 8:34 AM, Roger Pau Monné <roger@citrix.com> wrote:
>> Hello,
>>
>> I'm re-adding xen-devel in case someone else also wants to provide feedback.
>>
>> On Wed, May 04, 2016 at 03:06:23PM +1000, Steven Haigh wrote:
>>> Hi Roger,
>>>
>>> I've been getting some good progress with iSCSI thanks to your insights.
>>>
>>> I'm now trying to add support for locking via Persistent Reservations to
>>> ensure that only one Dom0 can attach / use a single iSCSI target at once.
>>
>> This might be problematic with migrations. IIRC there's a point during the
>> migration where both the sending and the receiving side have the disk open
>> at the same time. However Xen always makes sure that only one guest is
>> actually accessing the disk, either the one on the receiving side (if
>> everything has gone OK) or the one on the senders side (if migration has
>> failed).
>>
>>> In a nutshell, my thoughts are to use the following to 'lock' a device:
>>>   ## Create a hex key for the lock from the system's IP.
>>>   key=$(gethostip -x $(uname -n))
>>>   sg_persist -d ${dev} -o -G -S ${key}
>>>   sg_persist -d ${dev} -o -R -K ${key} -T 6
>>>
>>> This registers the device, and sets an Exclusive Access (-T 6) flag on
>>> the iSCSI device which means nothing else will be able to open the
>>> device until the lock is removed.
>>>
>>> To unlock the device, on remove, we should do something like:
>>>   key=$(gethostip -x $(uname -n))
>>>   sg_persist -d ${dev} -o -L -K ${key} -T 6
>>>   sg_persist -d ${dev} -o -G -K ${key} -S 0
>>>
>>> This releases the device for other things to use.
>>>
>>> I've tried putting these in block-iscsi - by using a lock_device and
>>> unlock_device function and calling it after find_device in both attach()
>>> and remove().
>>>
>>> My problems:
>>> 1) -e is set on the script - and maybe elsewhere - so any time something
>>> returns non-zero, you can't clean up. For example, if you can't get a
>>> lock, you should make sure all locks are removed from the host in
>>> question and then detach the iSCSI target.
>>
>> You can avoid this by adding something like:
>>
>> sg_persist ... || true
>>
>> Of course you can replace the "true" command with something else, like a
>> fatal message or some cleanup code. You can also place the command inside of
>> a conditional if you know it might fail:
>>
>> if ! sg_persist ...; then
>> fatal ...
>> fi
>>
>> It is important for us to use the '-e' in order to make sure all the failure
>> points are correctly handled, without the '-e' some command might fail and
>> the script wouldn't realize.
> 
> I realize I'm a bit in the minority here, but I've always thought this
> was rather a strange habit of bash scripts.  In every other language,
> you check the error codes of things that can fail and you handle them
> appropriately.  If you're just hacking something together, then "set
> -e" is probably OK, but wouldn't it make more sense in an
> infrastructure script like this to actually go through and handle all
> the errors?   Worst case you could just if ! [command] ; then exit 1 ;
> fi.

Then I'm in the minority too ;)

> Regarding the "maybe elsewhere" -- AFAIK the block-scsi script itself
> is run directly from libxl, so nothing "above" it should be setting
> -e; and it only includes block-common, which does not seem to be
> setting it.

Right now, even after removing -e from line 1, something like this still
exits the script:

if [ $? != 0 ]; then

So, if I run the following, the script will always fail:

key=$(gethostip -x $(uname -n))
sg_persist -d ${dev} -o -G -S ${key} > /dev/null
if [ $? != 0 ]; then
iscsiadm -m node -T ${iqn} --logout > /dev/null
fatal "Could not obtain lock on $iqn"
fi

man 8 sg3_utils (http://linux.die.net/man/8/sg3_utils) shows the possible
exit codes - and it would be useful to consult at least *some* of them
and report useful errors - such as:

Exit status 3:
the DEVICE reports that it is not ready for the operation requested. The
device may be in the process of becoming ready (e.g. spinning up but not
at speed) so the utility may work after a wait.

Also possibly a case for locking not being supported.
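
One way to do that under the enforced set -e is to capture the exit status
explicitly and map the documented values - a rough sketch, assuming the
$dev, $key and fatal() context of block-iscsi:

    # "|| rc=$?" stops set -e aborting before we can inspect the status.
    rc=0
    sg_persist -d "${dev}" -o -R -K "${key}" -T 6 > /dev/null || rc=$?
    case ${rc} in
        0) ;;  # reservation taken
        3) fatal "iSCSI LOCK: device not ready - may still be spinning up" ;;
        *) fatal "iSCSI LOCK: sg_persist failed with exit status ${rc}" ;;
    esac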

> That said, in theory as Roger said, if you actually are checking the
> error code of all your commands (which you should be if you need to do
> clean-up on failure), then 'set -e' shouldn't actually be causing an
> exit.
> 
>  -George
> 

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] block-iscsi with Xen 4.5 / 4.6

2016-05-04 Thread Steven Haigh
On 4/05/2016 5:34 PM, Roger Pau Monné wrote:
> Hello,
> 
> I'm re-adding xen-devel in case someone else also wants to provide feedback.
> 
> On Wed, May 04, 2016 at 03:06:23PM +1000, Steven Haigh wrote:
>> Hi Roger,
>>
>> I've been getting some good progress with iSCSI thanks to your insights.
>>
>> I'm now trying to add support for locking via Persistent Reservations to
>> ensure that only one Dom0 can attach / use a single iSCSI target at once.
> 
> This might be problematic with migrations. IIRC there's a point during the 
> migration where both the sending and the receiving side have the disk open 
> at the same time. However Xen always makes sure that only one guest is 
> actually accessing the disk, either the one on the receiving side (if 
> everything has gone OK) or the one on the senders side (if migration has 
> failed).

True - however I'd like to eventually attempt to commit changes to the
project and allow locking to be done as an option - just like iqn /
portal / multipath.

In my specific use case, it's to stop someone accidentally starting the
same VM on multiple Dom0's at the same time - which from what I've seen
causes disk corruption and all kinds of issues. It leads to people not
having a good time.

The iSCSI system has a limit on the maximum number of connections - however
it seems that only applies *per host*, meaning max connections = 1 will
still allow one connection from each Dom0.

>> In a nutshell, my thoughts are to use the following to 'lock' a device:
>>  ## Create a hex key for the lock from the system's IP.
>>  key=$(gethostip -x $(uname -n))
>>  sg_persist -d ${dev} -o -G -S ${key}
>>  sg_persist -d ${dev} -o -R -K ${key} -T 6
>>
>> This registers the device, and sets an Exclusive Access (-T 6) flag on
>> the iSCSI device which means nothing else will be able to open the
>> device until the lock is removed.
>>
>> To unlock the device, on remove, we should do something like:
>>  key=$(gethostip -x $(uname -n))
>>  sg_persist -d ${dev} -o -L -K ${key} -T 6
>>  sg_persist -d ${dev} -o -G -K ${key} -S 0
>>
>> This releases the device for other things to use.
>>
>> I've tried putting these in block-iscsi - by using a lock_device and
>> unlock_device function and calling it after find_device in both attach()
>> and remove().
>>
>> My problems:
>> 1) -e is set on the script - and maybe elsewhere - so any time something
>> returns non-zero, you can't clean up. For example, if you can't get a
>> lock, you should make sure all locks are removed from the host in
>> question and then detach the iSCSI target.
> 
> You can avoid this by adding something like:
> 
> sg_persist ... || true
> 
> Of course you can replace the "true" command with something else, like a 
> fatal message or some cleanup code. You can also place the command inside of 
> a conditional if you know it might fail:
> 
> if ! sg_persist ...; then
>   fatal ...
> fi
> 
> It is important for us to use the '-e' in order to make sure all the failure 
> points are correctly handled, without the '-e' some command might fail and 
> the script wouldn't realize.

I honestly think this is pretty nasty. While it may not be true of all
scripts, the block-iscsi script can only really fail in a couple of
places - yet we have this set of procedures called:

parse_target -> check_tools -> prepare -> add -> attach -> find_device
-> write_dev.

At least check_tools, prepare, add, attach, find_device could all be
rolled into a single function - as the majority of the rest is 1-4 lines
of code.

There are situations where you may want to evaluate the result of
sg_persist beyond a simple "worked or failed" - and that seems to be the
idea of fatal "The reason that I died is X".

>> 2) I can't find an easy way to clean up by doing an iscsiadm --logout if
>> the locking fails.
> 
> I'm not really following here, maybe because I don't know that much about 
> iSCSI. Can you just put whatever code is needed in order to unlock before 
> doing the logout? Or that's not how it works?

Yes, but if one of the two unlocks fails, the script terminates. That
makes differentiated error handling *VERY* difficult. If I remove the -e from
line #1, the script still acts as if -e is still set - so something else
is enforcing that.
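
Given that, the unlock path probably has to tolerate failures explicitly -
a minimal sketch using the "|| true" approach suggested earlier, assuming
$dev and $key are set as in the patch:

    # Release the Exclusive Access reservation, then drop our registration;
    # neither failure should abort the script under set -e.
    sg_persist -d "${dev}" -o -L -K "${key}" -T 6 > /dev/null || true
    sg_persist -d "${dev}" -o -G -K "${key}" -S 0 > /dev/null || true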

>> I'm wondering if there is a reason that the script is currently in the
>> stucture that it is - or if it just evolved like this? It may be a good
>> candidate for a complete re-write :\
> 
> TBH, I thought this was one of the most clean and well structured block 
> scripts that Xen has ;).

Please don't scare me ;)

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] block-iscsi with Xen 4.5 / 4.6

2016-04-15 Thread Steven Haigh
On 16/04/2016 12:30 AM, George Dunlap wrote:
> On Fri, Apr 15, 2016 at 7:59 AM, Steven Haigh <net...@crc.id.au> wrote:
>> Hi all,
>>
>> I'm wading through the somewhat confusing world of documentation regarding
>> storing DomU disk images on an iSCSI target.
>>
>> I'm getting an error when using pygrub of:
>> OSError: [Errno 2] No such file or directory:
>> 'iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250'
>>
>> After much hunting, I came across this post:
>> http://lists.xen.org/archives/html/xen-devel/2013-04/msg02796.html
>>
>> As such, I'm wondering if it is still required to *NOT* use pygrub for
>> booting iSCSI DomUs?
>>
>> If so, what are the alternatives? Using pv-grub / pv-grub2? Something else?
>>
>> As I'm running EL7.2, I figure if I have to use a pv-grub based solution, it
>> would have to be pv-grub2?
> 
> I see you've got a fix already.  But even so, if using pv-grub2 is a
> possibility for you, it might be worth pursuing anyway:
> - it will be much more secure than pygrub
> - it will probably more reliable, since it's actually a native grub
> binary ported to Xen, while pygrub is just a python script that
> attempts to duplicate some of grub's functionality.

Hi George,

I kind of agree - it's on my todo list, but I haven't yet worked out how
to include it in a .spec build.

If you've done any work on this so far, I'd be happy to discuss off-list.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] block-iscsi with Xen 4.5 / 4.6

2016-04-15 Thread Steven Haigh
On 15/04/2016 7:01 PM, Roger Pau Monné wrote:
> On Fri, Apr 15, 2016 at 06:20:56PM +1000, Steven Haigh wrote:
> [...]
>> I might have spoken too soon here... I updated this system to 4.6.1 and
>> created the DomU again - still seems to fail - although it does actually
>> call the block-iscsi script this time:
>>
>> # xl -vvv create /etc/xen/test1.vm
>> Parsing config from /etc/xen/test1.vm
>> libxl: debug: libxl_create.c:1560:do_domain_create: ao 0x24ad330: create:
>> how=(nil) callback=(nil) poller=0x24b7070
>> libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk
>> vdev=xvda spec.backend=unknown
>> libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=xvda, uses
>> script=... assuming phy backend
>> libxl: debug: libxl_device.c:298:libxl__device_disk_set_backend: Disk
>> vdev=xvda, using backend phy
>> libxl: debug: libxl_create.c:945:initiate_domain_create: running bootloader
>> libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk
>> vdev=(null) spec.backend=phy
>> libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=(null), uses
>> script=... assuming phy backend
>> libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk
>> vdev=xvde spec.backend=phy
>> libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=xvde, uses
>> script=... assuming phy backend
>> libxl: debug: libxl_event.c:639:libxl__ev_xswatch_register: watch
>> w=0x24ada00 wpath=/local/domain/0/backend/vbd/0/51776/state token=3/0:
>> register slotnum=3
>> libxl: debug: libxl_create.c:1583:do_domain_create: ao 0x24ad330:
>> inprogress: poller=0x24b7070, flags=i
>> libxl: debug: libxl_event.c:576:watchfd_callback: watch w=0x24ada00
>> wpath=/local/domain/0/backend/vbd/0/51776/state token=3/0: event
>> epath=/local/domain/0/backend/vbd/0/51776/state
>> libxl: debug: libxl_event.c:884:devstate_callback: backend
>> /local/domain/0/backend/vbd/0/51776/state wanted state 2 still waiting state
>> 1
>> libxl: debug: libxl_event.c:576:watchfd_callback: watch w=0x24ada00
>> wpath=/local/domain/0/backend/vbd/0/51776/state token=3/0: event
>> epath=/local/domain/0/backend/vbd/0/51776/state
>> libxl: debug: libxl_event.c:880:devstate_callback: backend
>> /local/domain/0/backend/vbd/0/51776/state wanted state 2 ok
>> libxl: debug: libxl_event.c:677:libxl__ev_xswatch_deregister: watch
>> w=0x24ada00 wpath=/local/domain/0/backend/vbd/0/51776/state token=3/0:
>> deregister slotnum=3
>> libxl: debug: libxl_device.c:937:device_backend_callback: calling
>> device_backend_cleanup
>> libxl: debug: libxl_event.c:691:libxl__ev_xswatch_deregister: watch
>> w=0x24ada00: deregister unregistered
>> libxl: debug: libxl_linux.c:229:libxl__hotplug_disk: Args and environment
>> ready
>> libxl: debug: libxl_device.c:1034:device_hotplug: calling hotplug script:
>> /etc/xen/scripts/block-iscsi add
>> libxl: debug: libxl_aoutils.c:593:libxl__async_exec_start: forking to
>> execute: /etc/xen/scripts/block-iscsi add
>> libxl: error: libxl_exec.c:118:libxl_report_child_exitstatus:
>> /etc/xen/scripts/block-iscsi add [2126] exited with error status 1
>> libxl: debug: libxl_event.c:691:libxl__ev_xswatch_deregister: watch
>> w=0x24adb00: deregister unregistered
>> libxl: error: libxl_device.c:1084:device_hotplug_child_death_cb: script:
>> Device already opened
> 
> The message indicates that you have this device already opened in this 
> system, this is detected by the following check, that you can also run from 
> a shell:
> 
> # iscsiadm -m session 2>&1 | grep -q "$iqn" && fatal "Device already opened"
> 
> You will have to perform a logout in order for the hotplug script to 
> correctly attach it.

How right you are :)

# iscsiadm -m session
tcp: [1] 192.168.133.250:3260,1
iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5 (non-flash)

# iscsiadm -m node --targetname
iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5 -p
192.168.133.250 --logout
Logging out of session [sid: 1, target:
iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5, portal:
192.168.133.250,3260]
Logout of [sid: 1, target:
iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5, portal:
192.168.133.250,3260] successful.

# iscsiadm -m session
iscsiadm: No active sessions.

The DomU then started successfully. Thanks for your help.

I'll try the previously mentioned patch on 4.5 and see how I go with
that next week.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] block-iscsi with Xen 4.5 / 4.6

2016-04-15 Thread Steven Haigh

On 2016-04-15 18:11, Steven Haigh wrote:

On 2016-04-15 18:03, Roger Pau Monné wrote:

On Fri, Apr 15, 2016 at 05:48:24PM +1000, Steven Haigh wrote:

On 2016-04-15 17:46, Roger Pau Monné wrote:
> On Fri, Apr 15, 2016 at 05:28:12PM +1000, Steven Haigh wrote:
> > On 2016-04-15 17:23, Roger Pau Monné wrote:
> > > On Fri, Apr 15, 2016 at 04:59:11PM +1000, Steven Haigh wrote:
> > > > Hi all,
> > > >
> > > > I'm wading through the somewhat confusing world of documentation
> > > > regarding
> > > > storing DomU disk images on an iSCSI target.
> > > >
> > > > I'm getting an error when using pygrub of:
> > > > OSError: [Errno 2] No such file or directory: 
'iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250'
> > >
> > > Hello,
> > >
> > > It should work. Can you please paste your guest configuration file and
> > > the
> > > output of the create command with "-vvv"?
> >
> > DomU config file:
> > bootloader  = "pygrub"
> > name= "test1.vm"
> > memory  = 2048
> > vcpus   = 2
> > cpus= "1-7"
> > vif = ['bridge=br-151, vifname=vm.test1']
> > disk= 
['script=block-iscsi,vdev=xvda,target=iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250']
> > boot= "c"
> >
> > # xl create /etc/xen/test1.vm -d -c
>
> Please post the output of xl -vvv create /etc/xen/test1.vm.

Whoops - apologies:

# xl -vvv create /etc/xen/test1.vm
Parsing config from /etc/xen/test1.vm
libxl: debug: libxl_create.c:1507:do_domain_create: ao 0x20b7260: create:
how=(nil) callback=(nil) poller=0x20b6b30
libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk
vdev=xvda spec.backend=unknown
libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=xvda, uses
script=... assuming phy backend
libxl: debug: libxl_device.c:298:libxl__device_disk_set_backend: Disk
vdev=xvda, using backend phy
libxl: debug: libxl_create.c:907:initiate_domain_create: running bootloader
libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk
vdev=(null) spec.backend=phy
libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=(null), uses
script=... assuming phy backend
libxl: debug: libxl.c:3064:libxl__device_disk_local_initiate_attach: locally
attaching PHY disk
iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250


Now I remember, this was fixed not long ago, you will need to apply
b1882a424ae098d722b19086b16e64b9aeccc7ca to your source tree/package in
order to get pygrub working with hotplug scripts [0].

I guess you are using Xen 4.5, because this commit is already present in
Xen 4.6, and it should fix your issue.

[0] http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=b1882a424ae098d722b19086b16e64b9aeccc7ca


Ahhh - thanks for the pointer. As this is a dev system, it's probably
easier for me to upgrade it to Xen 4.6 - however I'll take that commit
and look at adding it to my Xen 4.5 packages for public consumption.


I might have spoken too soon here... I updated this system to 4.6.1 and 
created the DomU again - still seems to fail - although it does actually 
call the block-iscsi script this time:


# xl -vvv create /etc/xen/test1.vm
Parsing config from /etc/xen/test1.vm
libxl: debug: libxl_create.c:1560:do_domain_create: ao 0x24ad330: 
create: how=(nil) callback=(nil) poller=0x24b7070
libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk 
vdev=xvda spec.backend=unknown
libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=xvda, uses 
script=... assuming phy backend
libxl: debug: libxl_device.c:298:libxl__device_disk_set_backend: Disk 
vdev=xvda, using backend phy
libxl: debug: libxl_create.c:945:initiate_domain_create: running 
bootloader
libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk 
vdev=(null) spec.backend=phy
libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=(null), 
uses script=... assuming phy backend
libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk 
vdev=xvde spec.backend=phy
libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=xvde, uses 
script=... assuming phy backend
libxl: debug: libxl_event.c:639:libxl__ev_xswatch_register: watch 
w=0x24ada00 wpath=/local/domain/0/backend/vbd/0/51776/state token=3/0: 
register slotnum=3
libxl: debug: libxl_create.c:1583:do_domain_create: ao 0x24ad330: 
inprogress: poller=0x24b7070, flags=i
libxl: debug: libxl_event.c:576:watchfd_callback: watch w=0x24ada00 
wpath=/local/domain/0/backend/vbd/0/51776/state token=3/0: event 
epath=/local/domain/0/backend/vbd/0/51776/state
libxl: debug: libxl_event.c:884:devstate_callback: bac

Re: [Xen-devel] block-iscsi with Xen 4.5 / 4.6

2016-04-15 Thread Steven Haigh

On 2016-04-15 18:03, Roger Pau Monné wrote:

On Fri, Apr 15, 2016 at 05:48:24PM +1000, Steven Haigh wrote:

On 2016-04-15 17:46, Roger Pau Monné wrote:
> On Fri, Apr 15, 2016 at 05:28:12PM +1000, Steven Haigh wrote:
> > On 2016-04-15 17:23, Roger Pau Monné wrote:
> > > On Fri, Apr 15, 2016 at 04:59:11PM +1000, Steven Haigh wrote:
> > > > Hi all,
> > > >
> > > > I'm wading through the somewhat confusing world of documentation
> > > > regarding
> > > > storing DomU disk images on an iSCSI target.
> > > >
> > > > I'm getting an error when using pygrub of:
> > > > OSError: [Errno 2] No such file or directory: 
'iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250'
> > >
> > > Hello,
> > >
> > > It should work. Can you please paste your guest configuration file and
> > > the
> > > output of the create command with "-vvv"?
> >
> > DomU config file:
> > bootloader  = "pygrub"
> > name        = "test1.vm"
> > memory      = 2048
> > vcpus       = 2
> > cpus        = "1-7"
> > vif         = ['bridge=br-151, vifname=vm.test1']
> > disk        = ['script=block-iscsi,vdev=xvda,target=iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250']
> > boot        = "c"
> >
> > # xl create /etc/xen/test1.vm -d -c
>
> Please post the output of xl -vvv create /etc/xen/test1.vm.

Whoops - apologies:

# xl -vvv create /etc/xen/test1.vm
Parsing config from /etc/xen/test1.vm
libxl: debug: libxl_create.c:1507:do_domain_create: ao 0x20b7260: create: how=(nil) callback=(nil) poller=0x20b6b30
libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk vdev=xvda spec.backend=unknown
libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=xvda, uses script=... assuming phy backend
libxl: debug: libxl_device.c:298:libxl__device_disk_set_backend: Disk vdev=xvda, using backend phy
libxl: debug: libxl_create.c:907:initiate_domain_create: running bootloader
libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk vdev=(null) spec.backend=phy
libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=(null), uses script=... assuming phy backend
libxl: debug: libxl.c:3064:libxl__device_disk_local_initiate_attach: locally attaching PHY disk iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250


Now I remember, this was fixed not long ago, you will need to apply
b1882a424ae098d722b19086b16e64b9aeccc7ca to your source tree/package in
order to get pygrub working with hotplug scripts [0].

I guess you are using Xen 4.5, because this commit is already present in Xen
4.6, and it should fix your issue.

[0]
http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=b1882a424ae098d722b19086b16e64b9aeccc7ca


Ahhh - thanks for the pointer. As this is a dev system, it's probably
easier for me to upgrade it to Xen 4.6 - however I'll take that commit 
and look at adding it to my Xen 4.5 packages for public consumption.


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



Re: [Xen-devel] block-iscsi with Xen 4.5 / 4.6

2016-04-15 Thread Steven Haigh

On 2016-04-15 17:46, Roger Pau Monné wrote:

On Fri, Apr 15, 2016 at 05:28:12PM +1000, Steven Haigh wrote:

On 2016-04-15 17:23, Roger Pau Monné wrote:
> On Fri, Apr 15, 2016 at 04:59:11PM +1000, Steven Haigh wrote:
> > Hi all,
> >
> > I'm wading through the somewhat confusing world of documentation
> > regarding
> > storing DomU disk images on an iSCSI target.
> >
> > I'm getting an error when using pygrub of:
> > OSError: [Errno 2] No such file or directory: 
'iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250'
>
> Hello,
>
> It should work. Can you please paste your guest configuration file and
> the
> output of the create command with "-vvv"?

DomU config file:
bootloader  = "pygrub"
name        = "test1.vm"
memory      = 2048
vcpus       = 2
cpus        = "1-7"
vif         = ['bridge=br-151, vifname=vm.test1']
disk        = ['script=block-iscsi,vdev=xvda,target=iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250']
boot        = "c"

# xl create /etc/xen/test1.vm -d -c


Please post the output of xl -vvv create /etc/xen/test1.vm.


Whoops - apologies:

# xl -vvv create /etc/xen/test1.vm
Parsing config from /etc/xen/test1.vm
libxl: debug: libxl_create.c:1507:do_domain_create: ao 0x20b7260: 
create: how=(nil) callback=(nil) poller=0x20b6b30
libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk 
vdev=xvda spec.backend=unknown
libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=xvda, uses 
script=... assuming phy backend
libxl: debug: libxl_device.c:298:libxl__device_disk_set_backend: Disk 
vdev=xvda, using backend phy
libxl: debug: libxl_create.c:907:initiate_domain_create: running 
bootloader
libxl: debug: libxl_device.c:269:libxl__device_disk_set_backend: Disk 
vdev=(null) spec.backend=phy
libxl: debug: libxl_device.c:207:disk_try_backend: Disk vdev=(null), 
uses script=... assuming phy backend
libxl: debug: libxl.c:3064:libxl__device_disk_local_initiate_attach: 
locally attaching PHY disk 
iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250
libxl: debug: libxl_bootloader.c:411:bootloader_disk_attached_cb: Config 
bootloader value: pygrub
libxl: debug: libxl_bootloader.c:427:bootloader_disk_attached_cb: 
Checking for bootloader in libexec path: /usr/lib/xen/bin/pygrub
libxl: debug: libxl_create.c:1523:do_domain_create: ao 0x20b7260: 
inprogress: poller=0x20b6b30, flags=i
libxl: debug: libxl_event.c:581:libxl__ev_xswatch_register: watch 
w=0x20b7a30 wpath=/local/domain/12 token=3/0: register slotnum=3
libxl: debug: libxl_event.c:1950:libxl__ao_progress_report: ao 
0x20b7260: progress report: ignored
libxl: debug: libxl_bootloader.c:537:bootloader_gotptys: executing 
bootloader: /usr/lib/xen/bin/pygrub
libxl: debug: libxl_bootloader.c:541:bootloader_gotptys:   bootloader 
arg: /usr/lib/xen/bin/pygrub
libxl: debug: libxl_bootloader.c:541:bootloader_gotptys:   bootloader 
arg: --output=/var/run/xen/bootloader.12.out
libxl: debug: libxl_bootloader.c:541:bootloader_gotptys:   bootloader 
arg: --output-format=simple0
libxl: debug: libxl_bootloader.c:541:bootloader_gotptys:   bootloader 
arg: --output-directory=/var/run/xen/bootloader.12.d
libxl: debug: libxl_bootloader.c:541:bootloader_gotptys:   bootloader 
arg: 
iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250
libxl: debug: libxl_event.c:518:watchfd_callback: watch w=0x20b7a30 
wpath=/local/domain/12 token=3/0: event epath=/local/domain/12
libxl: error: libxl_bootloader.c:630:bootloader_finished: bootloader 
failed - consult logfile /var/log/xen/bootloader.12.log
libxl: error: libxl_exec.c:118:libxl_report_child_exitstatus: bootloader 
[-1] exited with error status 1
libxl: debug: libxl_event.c:619:libxl__ev_xswatch_deregister: watch 
w=0x20b7a30 wpath=/local/domain/12 token=3/0: deregister slotnum=3
libxl: error: libxl_create.c:1121:domcreate_rebuild_done: cannot 
(re-)build domain: -3
libxl: info: libxl.c:1698:devices_destroy_cb: forked pid 3003 for 
destroy of domain 12
libxl: debug: libxl_event.c:1774:libxl__ao_complete: ao 0x20b7260: 
complete, rc=-3
libxl: debug: libxl_event.c:1746:libxl__ao__destroy: ao 0x20b7260: 
destroy

xc: debug: hypercall buffer: total allocations:41 total releases:41
xc: debug: hypercall buffer: current allocations:0 maximum allocations:2
xc: debug: hypercall buffer: cache current size:2
xc: debug: hypercall buffer: cache hits:30 misses:2 toobig:9

--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



Re: [Xen-devel] block-iscsi with Xen 4.5 / 4.6

2016-04-15 Thread Steven Haigh

On 2016-04-15 17:23, Roger Pau Monné wrote:

On Fri, Apr 15, 2016 at 04:59:11PM +1000, Steven Haigh wrote:

Hi all,

I'm wading through the somewhat confusing world of documentation regarding
storing DomU disk images on an iSCSI target.

I'm getting an error when using pygrub of:
OSError: [Errno 2] No such file or directory: 
'iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250'


Hello,

It should work. Can you please paste your guest configuration file and the
output of the create command with "-vvv"?


DomU config file:
bootloader  = "pygrub"
name        = "test1.vm"
memory      = 2048
vcpus       = 2
cpus        = "1-7"
vif         = ['bridge=br-151, vifname=vm.test1']
disk        = ['script=block-iscsi,vdev=xvda,target=iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250']
boot        = "c"

# xl create /etc/xen/test1.vm -d -c
Parsing config from /etc/xen/test1.vm
{
    "domid": null,
    "config": {
        "c_info": {
            "type": "pv",
            "name": "test1.vm",
            "uuid": "a7134f81-4616-4cf6-99db-3d2bc90b2d58",
            "run_hotplug_scripts": "True"
        },
        "b_info": {
            "max_vcpus": 2,
            "avail_vcpus": [ 0, 1 ],
            "vcpu_hard_affinity": [
                [ 1, 2, 3, 4, 5, 6, 7 ],
                [ 1, 2, 3, 4, 5, 6, 7 ]
            ],
            "numa_placement": "False",
            "max_memkb": 2097152,
            "target_memkb": 2097152,
            "shadow_memkb": 18432,
            "sched_params": { },
            "claim_mode": "True",
            "type.pv": {
                "bootloader": "pygrub"
            }
        },
        "disks": [
            {
                "pdev_path": "iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250",
                "vdev": "xvda",
                "format": "raw",
                "script": "block-iscsi",
                "readwrite": 1
            }
        ],
        "nics": [
            {
                "devid": 0,
                "bridge": "br-151",
                "ifname": "vm.test1"
            }
        ],
        "on_reboot": "restart"
    }
}
libxl: error: libxl_bootloader.c:630:bootloader_finished: bootloader 
failed - consult logfile /var/log/xen/bootloader.11.log
libxl: error: libxl_exec.c:118:libxl_report_child_exitstatus: bootloader 
[-1] exited with error status 1
libxl: error: libxl_create.c:1121:domcreate_rebuild_done: cannot 
(re-)build domain: -3
libxl: info: libxl.c:1698:devices_destroy_cb: forked pid 2982 for 
destroy of domain 11
libxl: error: libxl_dom.c:36:libxl__domain_type: unable to get domain 
type for domid=11

xl: unable to exec console client: No such file or directory
libxl: error: libxl_exec.c:118:libxl_report_child_exitstatus: console 
child [2981] exited with error status 1


# cat /var/log/xen/bootloader.11.log
Traceback (most recent call last):
  File "/usr/lib/xen/bin/pygrub", line 894, in 
part_offs = get_partition_offsets(file)
  File "/usr/lib/xen/bin/pygrub", line 114, in get_partition_offsets
image_type = identify_disk_image(file)
  File "/usr/lib/xen/bin/pygrub", line 57, in identify_disk_image
fd = os.open(file, os.O_RDONLY)
OSError: [Errno 2] No such file or directory: 
'iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250'
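
For what it's worth, a minimal sketch of checking the target by hand in
dom0 (assuming open-iscsi's iscsiadm is installed; the by-path output is
illustrative). It also shows why pygrub fails here: without the libxl
fix, pygrub is handed the raw 'target=...' string rather than a block
device, and os.open() on that string can only return ENOENT:

# log in to the target so the LUN appears as a real block device
iscsiadm -m node \
    --targetname iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5 \
    --portal 192.168.133.250 --login
# the attached LUN should now show up under /dev/disk/by-path/
ls -l /dev/disk/by-path/ | grep iscsi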


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



[Xen-devel] block-iscsi with Xen 4.5 / 4.6

2016-04-15 Thread Steven Haigh

Hi all,

I'm wading through the somewhat confusing world of documentation 
regarding storing DomU disk images on an iSCSI target.


I'm getting an error when using pygrub of:
OSError: [Errno 2] No such file or directory: 
'iqn=iqn.1986-03.com.sun:02:ff2d12c0-b709-4ec0-999d-976506c666f5,portal=192.168.133.250'


After much hunting, I came across this post:
http://lists.xen.org/archives/html/xen-devel/2013-04/msg02796.html

As such, I'm wondering if it is still required to *NOT* use pygrub for 
booting iSCSI DomUs?


If so, what are the alternatives? Using pv-grub / pv-grub2? Something 
else?


As I'm running EL7.2, I figure if I have to use a pv-grub based 
solution, it would have to be pv-grub2?


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU

2016-04-01 Thread Steven Haigh
On 30/03/2016 1:14 AM, Boris Ostrovsky wrote:
> On 03/29/2016 04:56 AM, Steven Haigh wrote:
>>
>> Interestingly enough, this just happened again - but on a different
>> virtual machine. I'm starting to wonder if this may have something to do
>> with the uptime of the machine - as the system that this seems to happen
>> to is always different.
>>
>> Destroying it and monitoring it again has so far come up blank.
>>
>> I've thrown the latest lot of kernel messages here:
>>  http://paste.fedoraproject.org/346802/59241532
> 
> Would be good to see full console log. The one that you posted starts
> with an error so I wonder what was before that.

Ok, so I had a virtual machine do this again today. Both vcpus went to
100% usage and essentially hung. I attached to the screen console that
was connected via 'xl console' and copied the entire buffer to paste below:

yum-cron[30740]: segfault at 1781ab8 ip 7f2a7fcd282f sp
7ffe8655fe90 error 5 in libpython2.7.so.1.0[7f2a7fbf5000+178000]
swap_free: Bad swap file entry 2a2b7d5bb69515d8
BUG: Bad page map in process yum-cron  pte:56fab76d2a2bb06a pmd:0309e067
addr:0178 vm_flags:00100073 anon_vma:88007b974c08
mapping:  (null) index:1780
file:  (null) fault:  (null) mmap:  (null)
readpage:  (null)
CPU: 0 PID: 30740 Comm: yum-cron Tainted: GB
4.4.6-4.el7xen.x86_64 #1
  88004176bac0 81323d17 0178
 3000 88004176bb08 8117e574 81193d6e
 1780 88000309ec00 0178 56fab76d2a2bb06a
Call Trace:
 [] dump_stack+0x63/0x8c
 [] print_bad_pte+0x1e4/0x290
 [] ? swap_info_get+0x7e/0xe0
 [] unmap_single_vma+0x4ff/0x840
 [] unmap_vmas+0x47/0x90
 [] exit_mmap+0x98/0x150
 [] mmput+0x47/0x100
 [] do_exit+0x24e/0xad0
 [] do_group_exit+0x3f/0xa0
 [] get_signal+0x1c3/0x5e0
 [] do_signal+0x28/0x630
 [] ? printk+0x4d/0x4f
 [] ? vprintk_default+0x1f/0x30
 [] ? bad_area_access_error+0x43/0x4a
 [] ? __do_page_fault+0x22c/0x3f0
 [] exit_to_usermode_loop+0x4c/0x95
 [] prepare_exit_to_usermode+0x18/0x20
 [] retint_user+0x8/0x13
BUG: Bad page state in process yum-cron  pfn:0f3bf
page:ea3cefc0 count:0 mapcount:7 mapping:88000f3bf008
index:0x88000f3bf000
flags: 0x100094(referenced|dirty|slab)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x80(slab)
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache
x86_pkg_temp_thermal coretemp crct10dif_pclmul aesni_intel aes_x86_64
lrw gf128mul glue_helper ablk_helper cryptd pcspkr nfsd auth_rpcgss
nfs_acl lockd grace sunrpc ip_tables xen_netfront crc32c_intel
xen_gntalloc xen_evtchn ipv6 autofs4
CPU: 0 PID: 30740 Comm: yum-cron Tainted: GB
4.4.6-4.el7xen.x86_64 #1
  88004176b958 81323d17 ea3cefc0
 817ab348 88004176b980 811c1ab5 ea3cefc0
 0001  88004176b9d0 81159584
Call Trace:
 [] dump_stack+0x63/0x8c
 [] bad_page.part.69+0xdf/0xfc
 [] free_pages_prepare+0x294/0x2a0
 [] free_hot_cold_page+0x31/0x160
 [] free_hot_cold_page_list+0x49/0xb0
 [] release_pages+0xc5/0x260
 [] free_pages_and_swap_cache+0x7d/0x90
 [] tlb_flush_mmu_free+0x36/0x60
 [] unmap_single_vma+0x664/0x840
 [] unmap_vmas+0x47/0x90
 [] exit_mmap+0x98/0x150
 [] mmput+0x47/0x100
 [] do_exit+0x24e/0xad0
 [] do_group_exit+0x3f/0xa0
 [] get_signal+0x1c3/0x5e0
 [] do_signal+0x28/0x630
 [] ? printk+0x4d/0x4f
 [] ? vprintk_default+0x1f/0x30
 [] ? bad_area_access_error+0x43/0x4a
 [] ? __do_page_fault+0x22c/0x3f0
 [] exit_to_usermode_loop+0x4c/0x95
 [] prepare_exit_to_usermode+0x18/0x20
 [] retint_user+0x8/0x13
BUG: Bad rss-counter state mm:88007b99e4c0 idx:0 val:-1
BUG: Bad rss-counter state mm:88007b99e4c0 idx:1 val:2
BUG: Bad rss-counter state mm:88007b99e4c0 idx:2 val:-1
yum-cron[4197]: segfault at 32947fcb ip 7ff0fa1bf8bd sp
7ffdb1c54990 error 4 in libpython2.7.so.1.0[7ff0fa13a000+178000]
BUG: unable to handle kernel paging request at 88010f3beffe
IP: [] free_block+0x119/0x190
PGD 188b063 PUD 0
Oops: 0002 [#1] SMP
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache
x86_pkg_temp_thermal coretemp crct10dif_pclmul aesni_intel aes_x86_64
lrw gf128mul glue_helper ablk_helper cryptd pcspkr nfsd auth_rpcgss
nfs_acl lockd grace sunrpc ip_tables xen_netfront crc32c_intel
xen_gntalloc xen_evtchn ipv6 autofs4
CPU: 1 PID: 8519 Comm: kworker/1:2 Tainted: GB
4.4.6-4.el7xen.x86_64 #1
Workqueue: events cache_reap
task: 8800346bf1c0 ti: 88005170 task.ti: 88005170
RIP: 0010:[]  [] free_block+0x119/0x190
RSP: 0018:880051703d40  EFLAGS: 00010082
RAX: ea3cefc0 RBX: ea00 RCX: 88000f3bf000
RDX: fffe RSI: 88007fd19c40 RDI: 88007d012100
RBP: 880051703d68 R08: 880051703d88 R09: 0006
R10: 88007d01a9

Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU

2016-03-29 Thread Steven Haigh
Greg, please see below - this is probably more for you...

On 03/29/2016 04:56 AM, Steven Haigh wrote:
>
> Interestingly enough, this just happened again - but on a different
> virtual machine. I'm starting to wonder if this may have something to do
> with the uptime of the machine - as the system that this seems to happen
> to is always different.
>
> Destroying it and monitoring it again has so far come up blank.
>
> I've thrown the latest lot of kernel messages here:
>  http://paste.fedoraproject.org/346802/59241532

So I just did a bit of digging via the almighty Google.

I started hunting for these lines, as they happen just before the stall:
BUG: Bad rss-counter state mm:88007b7db480 idx:2 val:-1
BUG: Bad rss-counter state mm:880079c638c0 idx:0 val:-1
BUG: Bad rss-counter state mm:880079c638c0 idx:2 val:-1

I stumbled across this post on the lkml:
http://marc.info/?l=linux-kernel&m=145141546409607

The patch attached seems to reference the following change in
unmap_mapping_range in mm/memory.c:
> - struct zap_details details;
> + struct zap_details details = { };

When I browse the GIT tree for 4.4.6:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/mm/memory.c?id=refs/tags/v4.4.6

I see at line 2411:
struct zap_details details;

Is this something that has been missed being merged into the 4.4 tree?
I'll admit my kernel knowledge is not enough to understand what the code
actually does - but the similarities here seem uncanny.
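
One way to check is against the stable tree itself - a minimal sketch,
assuming a local clone of linux-stable (the declaration with no
initialiser leaves stale stack data in any zap_details fields the
caller no longer sets explicitly):

# look for the zero-initialisation fix in the v4.4.6 tag
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
git grep -n 'struct zap_details details' v4.4.6 -- mm/memory.c
git log --oneline v4.4.6 -- mm/memory.c | head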

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU

2016-03-29 Thread Steven Haigh
On 30/03/2016 1:14 AM, Boris Ostrovsky wrote:
> On 03/29/2016 04:56 AM, Steven Haigh wrote:
>>
>> Interestingly enough, this just happened again - but on a different
>> virtual machine. I'm starting to wonder if this may have something to do
>> with the uptime of the machine - as the system that this seems to happen
>> to is always different.
>>
>> Destroying it and monitoring it again has so far come up blank.
>>
>> I've thrown the latest lot of kernel messages here:
>>  http://paste.fedoraproject.org/346802/59241532
> 
> Would be good to see full console log. The one that you posted starts
> with an error so I wonder what was before that.

Agreed. It started off with me observing this on one VM - but since
trying to get details on that VM - others have started showing issues as
well. It's frustrating as it seems I've been playing whack-a-mole to get
more debug on what is going on.

So, I've changed the kernel command line to the following on ALL VMs on
this system:
enforcemodulesig=1 selinux=0 fsck.repair=yes loglevel=7 console=tty0
console=ttyS0,38400n8

In the Dom0 (which runs the same kernel package), I've started a screen
session with a screen for each of the DomUs running attached to the
console via 'xl console blah' - so hopefully the next one that goes down
(whichever one that is) will get caught in the console.
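
For reference, a minimal sketch of that capture setup (the awk filter
and session names are illustrative):

# one detached, logging screen session per running DomU, so the console
# of whichever guest stalls next is captured to a screenlog file
for dom in $(xl list | awk 'NR > 1 && $1 != "Domain-0" {print $1}'); do
    screen -L -dmS "console-$dom" xl console "$dom"
done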

> Have you tried this on bare metal, BTW? And you said this is only
> observed on 4.4, not 4.5, right?

I use the same kernel package as the Dom0 kernel - and so far haven't
seen any issues running this as the Dom0. I haven't used it on baremetal
as a non-xen kernel as yet.

The kernel package I'm currently running is for CentOS / Scientific
Linux / RHEL at:
http://au1.mirror.crc.id.au/repo/el7-testing/x86_64/

I'm using 4.4.6-3 at the moment - which has CONFIG_PREEMPT_VOLUNTARY set
- which *MAY* have increased the time between this happening - or may
have no effect at all. I'm not convinced either way as yet.

With respect to 4.5, I have had reports from another user of my packages
that they haven't seen the same crash using the same Xen packages but
with kernel 4.5. I have not verified this myself as yet as I haven't
gone down the path of making 4.5 packages for testing. As such, I
wouldn't treat this as a conclusive test case as yet.

I'm hoping that the steps I've taken above may give some more
information in which we can drill down into exactly what is going on -
or at least give more pointers into the root cause.

>>
>> Interestingly, around the same time, /var/log/messages on the remote
>> syslog server shows:
>> Mar 29 17:00:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Started Session 1567 of user root.
>> Mar 29 17:00:01 zeus systemd: Starting Session 1567 of user root.
>> Mar 29 17:00:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Started Session 1568 of user root.
>> Mar 29 17:01:01 zeus systemd: Starting Session 1568 of user root.
>> Mar 29 17:08:34 zeus ntpdate[18569]: adjust time server 203.56.246.94
>> offset -0.002247 sec
>> Mar 29 17:08:34 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:08:34 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Started Session 1569 of user root.
>> Mar 29 17:10:01 zeus systemd: Starting Session 1569 of user root.
>> Mar 29 17:10:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Started Session 1570 of user root.
>> Mar 29 17:20:01 zeus systemd: Starting Session 1570 of user root.
>> Mar 29 17:20:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:30:55 zeus systemd: systemd-logind.service watchdog timeout
>> (limit 1min)!
>> Mar 29 17:32:25 zeus systemd: systemd-logind.service stop-sigabrt timed
>> out. Terminating.
>> Mar 29 17:33:56 zeus systemd: systemd-logind.service stop-sigterm timed
>> out. Killing.
>> Mar 29 17:35:26 zeus systemd: systemd-logind.service still around after
>> SIGKILL. Ignoring.
>> Mar 29 17:36:56 zeus systemd: systemd-logind.service stop-final-sigterm
>> timed out. Killing.
>> Mar 29 17:38:26 zeus systemd: 

Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU

2016-03-29 Thread Steven Haigh
On 26/03/2016 8:07 AM, Steven Haigh wrote:
> On 26/03/2016 3:20 AM, Boris Ostrovsky wrote:
>> On 03/25/2016 12:04 PM, Steven Haigh wrote:
>>> It may not actually be the full logs. Once the system gets really upset,
>>> you can't run anything - as such, grabbing anything from dmesg is not
>>> possible.
>>>
>>> The logs provided above is all that gets spat out to the syslog server.
>>>
>>> I'll try tinkering with a few things to see if I can get more output -
>>> but right now, that's all I've been able to achieve. So far, my only
>>> ideas are to remove the 'quiet' options from the kernel command line -
>>> but I'm not sure how much that would help.
>>>
>>> Suggestions gladly accepted on this front.
>>
>> You probably want to run connected to guest serial console ("
>> serial='pty' " in guest config file and something like 'loglevel=7
>> console=tty0 console=ttyS0,38400n8' on guest kernel commandline). And
>> start the guest with 'xl create -c ' or connect later with 'xl
>> console '.
> 
> Ok thanks, I've booted the DomU with:
> 
> $ cat /proc/cmdline
> root=UUID=63ade949-ee67-4afb-8fe7-ecd96faa15e2 ro enforcemodulesig=1
> selinux=0 fsck.repair=yes loglevel=7 console=tty0 console=ttyS0,38400n8
> 
> I've left a screen session attached to the console (via xl console) and
> I'll see if that turns anything up. As this seems to be rather
> unpredictable when it happens, it may take a day or two to get anything.
> I just hope it's more than the syslog output :)

Interestingly enough, this just happened again - but on a different
virtual machine. I'm starting to wonder if this may have something to do
with the uptime of the machine - as the system that this seems to happen
to is always different.

Destroying it and monitoring it again has so far come up blank.

I've thrown the latest lot of kernel messages here:
http://paste.fedoraproject.org/346802/59241532

Interestingly, around the same time, /var/log/messages on the remote
syslog server shows:
Mar 29 17:00:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:00:01 zeus systemd: Starting user-0.slice.
Mar 29 17:00:01 zeus systemd: Started Session 1567 of user root.
Mar 29 17:00:01 zeus systemd: Starting Session 1567 of user root.
Mar 29 17:00:01 zeus systemd: Removed slice user-0.slice.
Mar 29 17:00:01 zeus systemd: Stopping user-0.slice.
Mar 29 17:01:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:01:01 zeus systemd: Starting user-0.slice.
Mar 29 17:01:01 zeus systemd: Started Session 1568 of user root.
Mar 29 17:01:01 zeus systemd: Starting Session 1568 of user root.
Mar 29 17:08:34 zeus ntpdate[18569]: adjust time server 203.56.246.94
offset -0.002247 sec
Mar 29 17:08:34 zeus systemd: Removed slice user-0.slice.
Mar 29 17:08:34 zeus systemd: Stopping user-0.slice.
Mar 29 17:10:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:10:01 zeus systemd: Starting user-0.slice.
Mar 29 17:10:01 zeus systemd: Started Session 1569 of user root.
Mar 29 17:10:01 zeus systemd: Starting Session 1569 of user root.
Mar 29 17:10:01 zeus systemd: Removed slice user-0.slice.
Mar 29 17:10:01 zeus systemd: Stopping user-0.slice.
Mar 29 17:20:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:20:01 zeus systemd: Starting user-0.slice.
Mar 29 17:20:01 zeus systemd: Started Session 1570 of user root.
Mar 29 17:20:01 zeus systemd: Starting Session 1570 of user root.
Mar 29 17:20:01 zeus systemd: Removed slice user-0.slice.
Mar 29 17:20:01 zeus systemd: Stopping user-0.slice.
Mar 29 17:30:55 zeus systemd: systemd-logind.service watchdog timeout
(limit 1min)!
Mar 29 17:32:25 zeus systemd: systemd-logind.service stop-sigabrt timed
out. Terminating.
Mar 29 17:33:56 zeus systemd: systemd-logind.service stop-sigterm timed
out. Killing.
Mar 29 17:35:26 zeus systemd: systemd-logind.service still around after
SIGKILL. Ignoring.
Mar 29 17:36:56 zeus systemd: systemd-logind.service stop-final-sigterm
timed out. Killing.
Mar 29 17:38:26 zeus systemd: systemd-logind.service still around after
final SIGKILL. Entering failed mode.
Mar 29 17:38:26 zeus systemd: Unit systemd-logind.service entered failed
state.
Mar 29 17:38:26 zeus systemd: systemd-logind.service failed.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU

2016-03-25 Thread Steven Haigh
On 26/03/2016 3:20 AM, Boris Ostrovsky wrote:
> On 03/25/2016 12:04 PM, Steven Haigh wrote:
>> It may not actually be the full logs. Once the system gets really upset,
>> you can't run anything - as such, grabbing anything from dmesg is not
>> possible.
>>
>> The logs provided above is all that gets spat out to the syslog server.
>>
>> I'll try tinkering with a few things to see if I can get more output -
>> but right now, that's all I've been able to achieve. So far, my only
>> ideas are to remove the 'quiet' options from the kernel command line -
>> but I'm not sure how much that would help.
>>
>> Suggestions gladly accepted on this front.
> 
> You probably want to run connected to guest serial console ("
> serial='pty' " in guest config file and something like 'loglevel=7
> console=tty0 console=ttyS0,38400n8' on guest kernel commandline). And
> start the guest with 'xl create -c ' or connect later with 'xl
> console '.

Ok thanks, I've booted the DomU with:

$ cat /proc/cmdline
root=UUID=63ade949-ee67-4afb-8fe7-ecd96faa15e2 ro enforcemodulesig=1
selinux=0 fsck.repair=yes loglevel=7 console=tty0 console=ttyS0,38400n8

I've left a screen session attached to the console (via xl console) and
I'll see if that turns anything up. As this seems to be rather
unpredictable when it happens, it may take a day or two to get anything.
I just hope it's more than the syslog output :)

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU

2016-03-25 Thread Steven Haigh
On 26/03/2016 1:44 AM, Boris Ostrovsky wrote:
> On 03/25/2016 10:05 AM, Steven Haigh wrote:
>> On 25/03/2016 11:23 PM, Boris Ostrovsky wrote:
>>> On 03/24/2016 10:53 PM, Steven Haigh wrote:
>>>> Hi all,
>>>>
>>>> Firstly, I've cross-posted this to xen-devel and the lkml - as this
>>>> problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
>>>> I have also CC'ed Greg KH for his awesome insight as maintainer.
>>>>
>>>> Please CC myself into replies - as I'm not a member of the kernel
>>>> mailing list - I may miss replies from monitoring the archives.
>>>>
>>>> I've noticed recently that heavy disk IO is causing rcu_sched to detect
>>>> stalls. The process mentioned usually goes to 100% CPU usage, and
>>>> eventually processes start segfaulting and dying. The only fix to
>>>> recover the system is to use 'xl destroy' to force-kill the VM and to
>>>> start it again.
>>>>
>>>> The majority of these issues seem to mention ext4 in the trace. This
>>>> may
>>>> indicate an issue there - or may be a red herring.
>>>>
>>>> The gritty details:
>>>> INFO: rcu_sched self-detected stall on CPU
>>>> #0110-...: (20999 ticks this GP) idle=327/141/0
>>>> softirq=1101493/1101493 fqs=6973
>>>> #011 (t=21000 jiffies g=827095 c=827094 q=524)
>>>> Task dump for CPU 0:
>>>> rsync   R  running task0  2446   2444 0x0088
>>>> 818d0c00 88007fc03c58 810a625f 
>>>> 818d0c00 88007fc03c70 810a8699 0001
>>>> 88007fc03ca0 810d0e5a 88007fc170c0 818d0c00
>>>> Call Trace:
>>>>   [] sched_show_task+0xaf/0x110
>>>> [] dump_cpu_task+0x39/0x40
>>>> [] rcu_dump_cpu_stacks+0x8a/0xc0
>>>> [] rcu_check_callbacks+0x424/0x7a0
>>>> [] ? account_system_time+0x81/0x110
>>>> [] ? account_process_tick+0x61/0x160
>>>> [] ? tick_sched_do_timer+0x30/0x30
>>>> [] update_process_times+0x39/0x60
>>>> [] tick_sched_handle.isra.15+0x36/0x50
>>>> [] tick_sched_timer+0x3d/0x70
>>>> [] __hrtimer_run_queues+0xf2/0x250
>>>> [] hrtimer_interrupt+0xa8/0x190
>>>> [] xen_timer_interrupt+0x2e/0x140
>>>> [] handle_irq_event_percpu+0x55/0x1e0
>>>> [] handle_percpu_irq+0x3a/0x50
>>>> [] generic_handle_irq+0x22/0x30
>>>> [] __evtchn_fifo_handle_events+0x15f/0x180
>>>> [] evtchn_fifo_handle_events+0x10/0x20
>>>> [] __xen_evtchn_do_upcall+0x43/0x80
>>>> [] xen_evtchn_do_upcall+0x30/0x50
>>>> [] xen_hvm_callback_vector+0x82/0x90
>>>>   [] ? queued_write_lock_slowpath+0x3d/0x80
>>>> [] _raw_write_lock+0x1e/0x30
>>> This looks to me like ext4 failing to grab a lock. Everything above it
>>> (in Xen code) is regular tick interrupt handling which detects the
>>> stall.
>>>
>>> Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates
>>> any possible issues with pv locks.
>>>
>>> Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg,
>>> /var/log/xen/) Are you oversubscribing your guest (CPU-wise)?
>> There is nothing special being logged anywhere that I can see. dmesg /
>> xl dmesg on the Dom0 show nothing unusual.
>>
>> I do share CPUs - but I don't give any DomU more than 2 vcpus. The
>> physical host has 4 cores - 1 pinned to the Dom0.
>>
>> I log to a remote syslog on this system - and I've uploaded the entire
>> log to a pastebin (don't want to do a 45Kb attachment here):
>>  http://paste.fedoraproject.org/345095/58914452
> 
> That doesn't look like a full log. In any case, the RCU stall may be a
> secondary problem --- there is a bunch of splats before the stall.

It may not actually be the full logs. Once the system gets really upset,
you can't run anything - as such, grabbing anything from dmesg is not
possible.

The logs provided above is all that gets spat out to the syslog server.

I'll try tinkering with a few things to see if I can get more output -
but right now, that's all I've been able to achieve. So far, my only
ideas are to remove the 'quiet' options from the kernel command line -
but I'm not sure how much that would help.

Suggestions gladly accepted on this front.

>>
>> Not sure if it makes any difference at all, but my DomU config is:
>> # cat /etc/xen/backup.vm

Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU

2016-03-25 Thread Steven Haigh
On 25/03/2016 11:23 PM, Boris Ostrovsky wrote:
> On 03/24/2016 10:53 PM, Steven Haigh wrote:
>> Hi all,
>>
>> Firstly, I've cross-posted this to xen-devel and the lkml - as this
>> problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
>> I have also CC'ed Greg KH for his awesome insight as maintainer.
>>
>> Please CC myself into replies - as I'm not a member of the kernel
>> mailing list - I may miss replies from monitoring the archives.
>>
>> I've noticed recently that heavy disk IO is causing rcu_sched to detect
>> stalls. The process mentioned usually goes to 100% CPU usage, and
>> eventually processes start segfaulting and dying. The only fix to
>> recover the system is to use 'xl destroy' to force-kill the VM and to
>> start it again.
>>
>> The majority of these issues seem to mention ext4 in the trace. This may
>> indicate an issue there - or may be a red herring.
>>
>> The gritty details:
>> INFO: rcu_sched self-detected stall on CPU
>> #0110-...: (20999 ticks this GP) idle=327/141/0
>> softirq=1101493/1101493 fqs=6973
>> #011 (t=21000 jiffies g=827095 c=827094 q=524)
>> Task dump for CPU 0:
>> rsync   R  running task0  2446   2444 0x0088
>> 818d0c00 88007fc03c58 810a625f 
>> 818d0c00 88007fc03c70 810a8699 0001
>> 88007fc03ca0 810d0e5a 88007fc170c0 818d0c00
>> Call Trace:
>>   [] sched_show_task+0xaf/0x110
>> [] dump_cpu_task+0x39/0x40
>> [] rcu_dump_cpu_stacks+0x8a/0xc0
>> [] rcu_check_callbacks+0x424/0x7a0
>> [] ? account_system_time+0x81/0x110
>> [] ? account_process_tick+0x61/0x160
>> [] ? tick_sched_do_timer+0x30/0x30
>> [] update_process_times+0x39/0x60
>> [] tick_sched_handle.isra.15+0x36/0x50
>> [] tick_sched_timer+0x3d/0x70
>> [] __hrtimer_run_queues+0xf2/0x250
>> [] hrtimer_interrupt+0xa8/0x190
>> [] xen_timer_interrupt+0x2e/0x140
>> [] handle_irq_event_percpu+0x55/0x1e0
>> [] handle_percpu_irq+0x3a/0x50
>> [] generic_handle_irq+0x22/0x30
>> [] __evtchn_fifo_handle_events+0x15f/0x180
>> [] evtchn_fifo_handle_events+0x10/0x20
>> [] __xen_evtchn_do_upcall+0x43/0x80
>> [] xen_evtchn_do_upcall+0x30/0x50
>> [] xen_hvm_callback_vector+0x82/0x90
>>   [] ? queued_write_lock_slowpath+0x3d/0x80
>> [] _raw_write_lock+0x1e/0x30
> 
> This looks to me like ext4 failing to grab a lock. Everything above it
> (in Xen code) is regular tick interrupt handling which detects the stall.
> 
> Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates
> any possible issues with pv locks.
> 
> Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg,
> /var/log/xen/) Are you oversubscribing your guest (CPU-wise)?

There is nothing special being logged anywhere that I can see. dmesg /
xl dmesg on the Dom0 show nothing unusual.

I do share CPUs - but I don't give any DomU more than 2 vcpus. The
physical host has 4 cores - 1 pinned to the Dom0.

I log to a remote syslog on this system - and I've uploaded the entire
log to a pastebin (don't want to do a 45Kb attachment here):
http://paste.fedoraproject.org/345095/58914452

Not sure if it makes any difference at all, but my DomU config is:
# cat /etc/xen/backup.vm
name= "backup.vm"
memory  = 2048
vcpus   = 2
cpus= "1-3"
disk= [ 'phy:/dev/vg_raid1_new/backup.vm,xvda,w' ]
vif = [ "mac=00:11:36:35:35:09, bridge=br203,
vifname=vm.backup, script=vif-bridge" ]
bootloader  = 'pygrub'
pvh = 1

on_poweroff = 'destroy'
on_reboot   = 'restart'
on_crash= 'restart'
cpu_weight  = 64

I never had this problem when running kernel 4.1.x - it only started
when I upgraded everything to 4.4 - not exactly a great help - but may
help narrow things down?

>> [] ext4_es_remove_extent+0x43/0xc0
>> [] ext4_clear_inode+0x39/0x80
>> [] ext4_evict_inode+0x8d/0x4e0
>> [] evict+0xb7/0x180
>> [] dispose_list+0x36/0x50
>> [] prune_icache_sb+0x4b/0x60
>> [] super_cache_scan+0x141/0x190
>> [] shrink_slab.part.37+0x1ee/0x390
>> [] shrink_zone+0x26c/0x280
>> [] do_try_to_free_pages+0x15c/0x410
>> [] try_to_free_pages+0xba/0x170
>> [] __alloc_pages_nodemask+0x525/0xa60
>> [] ? kmem_cache_free+0xcc/0x2c0
>> [] alloc_pages_current+0x8d/0x120
>> [] __page_cache_alloc+0x91/0xc0
>> [] pagecache_get_page+0x56/0x1e0
>> [] grab_cache_page_write_begin+0x26/0x40
>> [] ext4_da_write_begin+0xa1/0x300
>> [] ? ext4_da_write_end+0x124/0x2b0
[] generic_perform_write+0xc0/0x1a0

[Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU

2016-03-24 Thread Steven Haigh
Hi all,

Firstly, I've cross-posted this to xen-devel and the lkml - as this
problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
I have also CC'ed Greg KH for his awesome insight as maintainer.

Please CC myself into replies - as I'm not a member of the kernel
mailing list - I may miss replies from monitoring the archives.

I've noticed recently that heavy disk IO is causing rcu_sched to detect
stalls. The process mentioned usually goes to 100% CPU usage, and
eventually processes start segfaulting and dying. The only fix to
recover the system is to use 'xl destroy' to force-kill the VM and to
start it again.

The majority of these issues seem to mention ext4 in the trace. This may
indicate an issue there - or may be a red herring.

The gritty details:
INFO: rcu_sched self-detected stall on CPU
#0110-...: (20999 ticks this GP) idle=327/141/0
softirq=1101493/1101493 fqs=6973
#011 (t=21000 jiffies g=827095 c=827094 q=524)
Task dump for CPU 0:
rsync   R  running task0  2446   2444 0x0088
818d0c00 88007fc03c58 810a625f 
818d0c00 88007fc03c70 810a8699 0001
88007fc03ca0 810d0e5a 88007fc170c0 818d0c00
Call Trace:
  [] sched_show_task+0xaf/0x110
[] dump_cpu_task+0x39/0x40
[] rcu_dump_cpu_stacks+0x8a/0xc0
[] rcu_check_callbacks+0x424/0x7a0
[] ? account_system_time+0x81/0x110
[] ? account_process_tick+0x61/0x160
[] ? tick_sched_do_timer+0x30/0x30
[] update_process_times+0x39/0x60
[] tick_sched_handle.isra.15+0x36/0x50
[] tick_sched_timer+0x3d/0x70
[] __hrtimer_run_queues+0xf2/0x250
[] hrtimer_interrupt+0xa8/0x190
[] xen_timer_interrupt+0x2e/0x140
[] handle_irq_event_percpu+0x55/0x1e0
[] handle_percpu_irq+0x3a/0x50
[] generic_handle_irq+0x22/0x30
[] __evtchn_fifo_handle_events+0x15f/0x180
[] evtchn_fifo_handle_events+0x10/0x20
[] __xen_evtchn_do_upcall+0x43/0x80
[] xen_evtchn_do_upcall+0x30/0x50
[] xen_hvm_callback_vector+0x82/0x90
  [] ? queued_write_lock_slowpath+0x3d/0x80
[] _raw_write_lock+0x1e/0x30
[] ext4_es_remove_extent+0x43/0xc0
[] ext4_clear_inode+0x39/0x80
[] ext4_evict_inode+0x8d/0x4e0
[] evict+0xb7/0x180
[] dispose_list+0x36/0x50
[] prune_icache_sb+0x4b/0x60
[] super_cache_scan+0x141/0x190
[] shrink_slab.part.37+0x1ee/0x390
[] shrink_zone+0x26c/0x280
[] do_try_to_free_pages+0x15c/0x410
[] try_to_free_pages+0xba/0x170
[] __alloc_pages_nodemask+0x525/0xa60
[] ? kmem_cache_free+0xcc/0x2c0
[] alloc_pages_current+0x8d/0x120
[] __page_cache_alloc+0x91/0xc0
[] pagecache_get_page+0x56/0x1e0
[] grab_cache_page_write_begin+0x26/0x40
[] ext4_da_write_begin+0xa1/0x300
[] ? ext4_da_write_end+0x124/0x2b0
[] generic_perform_write+0xc0/0x1a0
[] __generic_file_write_iter+0x188/0x1e0
[] ext4_file_write_iter+0xf0/0x340
[] __vfs_write+0xaa/0xe0
[] vfs_write+0xa2/0x1a0
[] SyS_write+0x46/0xa0
[] entry_SYSCALL_64_fastpath+0x12/0x71

Some 11 hours later:
sshd[785]: segfault at 1f0 ip 7f03bb94ae5c sp 7ffe9eb54470 error
4 in ld-2.17.so[7f03bb94+21000]
sh[787]: segfault at 1f0 ip 7f6b4a0dfe5c sp 7ffe3d4a71e0 error 4
in ld-2.17.so[7f6b4a0d5000+21000]
systemd-cgroups[788]: segfault at 1f0 ip 7f4baa82ce5c sp
7ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
sshd[791]: segfault at 1f0 ip 7ff8c8a8ce5c sp 7ffede9e1c20 error
4 in ld-2.17.so[7ff8c8a82000+21000]
sshd[792]: segfault at 1f0 ip 7f183cf75e5c sp 7ffc81ab7160 error
4 in ld-2.17.so[7f183cf6b000+21000]
sshd[793]: segfault at 1f0 ip 7f3c665ece5c sp 7ffd9a13c850 error
4 in ld-2.17.so[7f3c665e2000+21000]

From isolated testing, this does not occur on kernel 4.5.x - however I
have not verified this myself.

The kernel config used can be found in the kernel-xen git repo if it
assists in debugging:
http://xen.crc.id.au/git/?p=kernel-xen;a=summary

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] rcu_sched self-detected stall on CPU on kernel 4.4.5/6 in PV DomU

2016-03-23 Thread Steven Haigh

Just wanted to give a bit of a poke about this.

Currently running kernel 4.4.6 in a PV DomU and still occasionally 
getting hangs.


Also stumbled across this that may be related:
https://lkml.org/lkml/2016/2/4/724

My latest hang shows:
[339844.594001] INFO: rcu_sched self-detected stall on CPU
[339844.594001] 1-...: (287557828 ticks this GP) 
idle=4cb/141/0 softirq=1340383/1340384 fqs=95372371
[339844.594001]  (t=287566692 jiffies g=999283 c=999282 
q=1725381)

[339844.594001] Task dump for CPU 1:
[339844.594001] find    R  running task    0  2840   2834 
0x0088
[339844.594001]  818d0c00 88007fd03c58 810a625f 
0001
[339844.594001]  818d0c00 88007fd03c70 810a8699 
0002
[339844.594001]  88007fd03ca0 810d0e5a 88007fd170c0 
818d0c00

[339844.594001] Call Trace:
[339844.594001]    [] sched_show_task+0xaf/0x110
[339844.594001]  [] dump_cpu_task+0x39/0x40
[339844.594001]  [] rcu_dump_cpu_stacks+0x8a/0xc0
[339844.594001]  [] rcu_check_callbacks+0x424/0x7a0
[339844.594001]  [] ? account_system_time+0x81/0x110
[339844.594001]  [] ? account_process_tick+0x61/0x160
[339844.594001]  [] ? tick_sched_do_timer+0x30/0x30
[339844.594001]  [] update_process_times+0x39/0x60
[339844.594001]  [] 
tick_sched_handle.isra.15+0x36/0x50

[339844.594001]  [] tick_sched_timer+0x3d/0x70
[339844.594001]  [] __hrtimer_run_queues+0xf2/0x250
[339844.594001]  [] hrtimer_interrupt+0xa8/0x190
[339844.594001]  [] xen_timer_interrupt+0x2e/0x140
[339844.594001]  [] handle_irq_event_percpu+0x55/0x1e0
[339844.594001]  [] handle_percpu_irq+0x3a/0x50
[339844.594001]  [] generic_handle_irq+0x22/0x30
[339844.594001]  [] 
__evtchn_fifo_handle_events+0x15f/0x180
[339844.594001]  [] 
evtchn_fifo_handle_events+0x10/0x20

[339844.594001]  [] __xen_evtchn_do_upcall+0x43/0x80
[339844.594001]  [] xen_evtchn_do_upcall+0x30/0x50
[339844.594001]  [] xen_hvm_callback_vector+0x82/0x90
[339844.594001]    [] ? _raw_spin_lock+0x10/0x30


On 2016-03-19 08:46, Steven Haigh wrote:

On 19/03/2016 8:40 AM, Steven Haigh wrote:

Hi all,

So I'd just like to give this a prod. I'm still getting DomU's randomly
go to 100% CPU usage using kernel 4.4.6 now. It seems running 4.4.6 as
the DomU does not induce these problems.


Sorry - slight correction. Running 4.4.6 as the Dom0 kernel doesn't show
these errors. Only in the DomU.



Latest crash message from today:
INFO: rcu_sched self-detected stall on CPU
0-...: (20869552 ticks this GP) idle=9c9/141/0
softirq=1440865/1440865 fqs=15068
 (t=20874993 jiffies g=1354899 c=1354898 q=798)
rcu_sched kthread starved for 20829030 jiffies! g1354899 c1354898 f0x0
s3 ->state=0x0
Task dump for CPU 0:
kworker/u4:1R  running task0  5853  2 0x0088
Workqueue: writeback wb_workfn (flush-202:0)
 818d0c00 88007fc03c58 810a625f 
 818d0c00 88007fc03c70 810a8699 0001
 88007fc03ca0 810d0e5a 88007fc170c0 818d0c00
Call Trace:
   [] sched_show_task+0xaf/0x110
 [] dump_cpu_task+0x39/0x40
 [] rcu_dump_cpu_stacks+0x8a/0xc0
 [] rcu_check_callbacks+0x424/0x7a0
 [] ? account_system_time+0x81/0x110
 [] ? account_process_tick+0x61/0x160
 [] ? tick_sched_do_timer+0x30/0x30
 [] update_process_times+0x39/0x60
 [] tick_sched_handle.isra.15+0x36/0x50
 [] tick_sched_timer+0x3d/0x70
 [] __hrtimer_run_queues+0xf2/0x250
 [] hrtimer_interrupt+0xa8/0x190
 [] xen_timer_interrupt+0x2e/0x140
 [] handle_irq_event_percpu+0x55/0x1e0
 [] handle_percpu_irq+0x3a/0x50
 [] generic_handle_irq+0x22/0x30
 [] __evtchn_fifo_handle_events+0x15f/0x180
 [] evtchn_fifo_handle_events+0x10/0x20
 [] __xen_evtchn_do_upcall+0x43/0x80
 [] xen_evtchn_do_upcall+0x30/0x50
 [] xen_hvm_callback_vector+0x82/0x90
   [] ? queued_spin_lock_slowpath+0x22/0x170
 [] _raw_spin_lock+0x20/0x30
 [] writeback_sb_inodes+0x124/0x560
 [] ? _raw_spin_unlock_irqrestore+0x16/0x20
 [] __writeback_inodes_wb+0x86/0xc0
 [] wb_writeback+0x1d6/0x2d0
 [] wb_workfn+0x284/0x3e0
 [] process_one_work+0x151/0x400
 [] worker_thread+0x11a/0x460
 [] ? __schedule+0x2bf/0x880
 [] ? rescuer_thread+0x2f0/0x2f0
 [] kthread+0xc9/0xe0
 [] ? kthread_park+0x60/0x60
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_park+0x60/0x60

This repeats over and over causing 100% CPU usage - eventually on all
vcpus assigned to the DomU and the only recovery is 'xl destroy'.

I'm currently running Xen 4.6.1 on this system - with kernel 4.4.6 in
both the DomU and Dom0.

On 17/03/2016 8:39 AM, Steven Haigh wrote:

Hi all,

I've noticed the following problem that ends up with a non-responsive PV
DomU using kernel 4.4.5 under heavy disk IO:

INFO: rcu_sched self-detected stall on CPU
0-...: (6759098 ticks this GP) idle=cb3/141/0
softirq=3244615/3244615 fqs=4
 (t=6762321 jiffies g=2275626 c=2275625 q=54)
rcu_sched kthread starved for 6762309 jiffies! g2275626 c2275625 f0x0 s3
->state=0x0

[Xen-devel] rcu_sched self-detected stall on CPU on kernel 4.4.5 in PV DomU

2016-03-19 Thread Steven Haigh
Hi all,

I've noticed the following problem that ends up with a non-responsive PV
DomU using kernel 4.4.5 under heavy disk IO:

INFO: rcu_sched self-detected stall on CPU
0-...: (6759098 ticks this GP) idle=cb3/141/0
softirq=3244615/3244615 fqs=4
 (t=6762321 jiffies g=2275626 c=2275625 q=54)
rcu_sched kthread starved for 6762309 jiffies! g2275626 c2275625 f0x0 s3
->state=0x0
Task dump for CPU 0:
updatedbR  running task0  6027   6021 0x0088
 818d0c00 88007fc03c58 810a625f 
 818d0c00 88007fc03c70 810a8699 0001
 88007fc03ca0 810d0e5a 88007fc170c0 818d0c00
Call Trace:
   [] sched_show_task+0xaf/0x110
 [] dump_cpu_task+0x39/0x40
 [] rcu_dump_cpu_stacks+0x8a/0xc0
 [] rcu_check_callbacks+0x424/0x7a0
 [] ? account_system_time+0x81/0x110
 [] ? account_process_tick+0x61/0x160
 [] ? tick_sched_do_timer+0x30/0x30
 [] update_process_times+0x39/0x60
 [] tick_sched_handle.isra.15+0x36/0x50
 [] tick_sched_timer+0x3d/0x70
 [] __hrtimer_run_queues+0xf2/0x250
 [] hrtimer_interrupt+0xa8/0x190
 [] xen_timer_interrupt+0x2e/0x140
 [] handle_irq_event_percpu+0x55/0x1e0
 [] handle_percpu_irq+0x3a/0x50
 [] generic_handle_irq+0x22/0x30
 [] __evtchn_fifo_handle_events+0x15f/0x180
 [] evtchn_fifo_handle_events+0x10/0x20
 [] __xen_evtchn_do_upcall+0x43/0x80
 [] xen_evtchn_do_upcall+0x30/0x50
 [] xen_hvm_callback_vector+0x82/0x90
   [] ? queued_spin_lock_slowpath+0x10/0x170
 [] _raw_spin_lock+0x20/0x30
 [] find_inode_fast+0x61/0xa0
 [] iget_locked+0x6e/0x170
 [] ext4_iget+0x33/0xae0
 [] ? out_of_line_wait_on_bit+0x72/0x80
 [] ext4_iget_normal+0x30/0x40
 [] ext4_lookup+0xd5/0x140
 [] lookup_real+0x1d/0x50
 [] __lookup_hash+0x33/0x40
 [] walk_component+0x177/0x280
 [] path_lookupat+0x60/0x110
 [] filename_lookup+0x9c/0x150
 [] ? kfree+0x10d/0x290
 [] ? call_filldir+0x9c/0x130
 [] ? getname_flags+0x4f/0x1f0
 [] user_path_at_empty+0x36/0x40
 [] vfs_fstatat+0x53/0xa0
 [] ? __fput+0x169/0x1d0
 [] SYSC_newlstat+0x22/0x40
 [] ? __audit_syscall_exit+0x1f0/0x270
 [] ? syscall_slow_exit_work+0x3f/0xc0
 [] ? __audit_syscall_entry+0xaf/0x100
 [] SyS_newlstat+0xe/0x10
 [] entry_SYSCALL_64_fastpath+0x12/0x71

This ends up with the system not responding at 100% CPU usage.

Has anyone else seen this using kernel 4.4.5 in a DomU?
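
For anyone trying to reproduce this, the stall detector's timeout is
tunable - a minimal sketch, assuming a kernel with RCU stall detection
built in (the value of 10 seconds is illustrative):

# read, then lower, the stall timeout so a wedged guest reports sooner
cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
echo 10 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
# or set it at boot on the DomU kernel command line:
#   rcupdate.rcu_cpu_stall_timeout=10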

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [Xen-devel] rcu_sched self-detected stall on CPU on kernel 4.4.5 in PV DomU

2016-03-19 Thread Steven Haigh
Hi all,

So I'd just like to give this a prod. I'm still getting DomU's randomly
go to 100% CPU usage using kernel 4.4.6 now. It seems running 4.4.6 as
the DomU does not induce these problems.

Latest crash message from today:
INFO: rcu_sched self-detected stall on CPU
0-...: (20869552 ticks this GP) idle=9c9/141/0
softirq=1440865/1440865 fqs=15068
 (t=20874993 jiffies g=1354899 c=1354898 q=798)
rcu_sched kthread starved for 20829030 jiffies! g1354899 c1354898 f0x0
s3 ->state=0x0
Task dump for CPU 0:
kworker/u4:1R  running task0  5853  2 0x0088
Workqueue: writeback wb_workfn (flush-202:0)
 818d0c00 88007fc03c58 810a625f 
 818d0c00 88007fc03c70 810a8699 0001
 88007fc03ca0 810d0e5a 88007fc170c0 818d0c00
Call Trace:
   [] sched_show_task+0xaf/0x110
 [] dump_cpu_task+0x39/0x40
 [] rcu_dump_cpu_stacks+0x8a/0xc0
 [] rcu_check_callbacks+0x424/0x7a0
 [] ? account_system_time+0x81/0x110
 [] ? account_process_tick+0x61/0x160
 [] ? tick_sched_do_timer+0x30/0x30
 [] update_process_times+0x39/0x60
 [] tick_sched_handle.isra.15+0x36/0x50
 [] tick_sched_timer+0x3d/0x70
 [] __hrtimer_run_queues+0xf2/0x250
 [] hrtimer_interrupt+0xa8/0x190
 [] xen_timer_interrupt+0x2e/0x140
 [] handle_irq_event_percpu+0x55/0x1e0
 [] handle_percpu_irq+0x3a/0x50
 [] generic_handle_irq+0x22/0x30
 [] __evtchn_fifo_handle_events+0x15f/0x180
 [] evtchn_fifo_handle_events+0x10/0x20
 [] __xen_evtchn_do_upcall+0x43/0x80
 [] xen_evtchn_do_upcall+0x30/0x50
 [] xen_hvm_callback_vector+0x82/0x90
   [] ? queued_spin_lock_slowpath+0x22/0x170
 [] _raw_spin_lock+0x20/0x30
 [] writeback_sb_inodes+0x124/0x560
 [] ? _raw_spin_unlock_irqrestore+0x16/0x20
 [] __writeback_inodes_wb+0x86/0xc0
 [] wb_writeback+0x1d6/0x2d0
 [] wb_workfn+0x284/0x3e0
 [] process_one_work+0x151/0x400
 [] worker_thread+0x11a/0x460
 [] ? __schedule+0x2bf/0x880
 [] ? rescuer_thread+0x2f0/0x2f0
 [] kthread+0xc9/0xe0
 [] ? kthread_park+0x60/0x60
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_park+0x60/0x60

This repeats over and over causing 100% CPU usage - eventually on all
vcpus assigned to the DomU and the only recovery is 'xl destroy'.

I'm currently running Xen 4.6.1 on this system - with kernel 4.4.6 in
both the DomU and Dom0.

On 17/03/2016 8:39 AM, Steven Haigh wrote:
> Hi all,
> 
> I've noticed the following problem that ends up with a non-responsive PV
> DomU using kernel 4.4.5 under heavy disk IO:
> 
> INFO: rcu_sched self-detected stall on CPU
> 0-...: (6759098 ticks this GP) idle=cb3/141/0
> softirq=3244615/3244615 fqs=4
>  (t=6762321 jiffies g=2275626 c=2275625 q=54)
> rcu_sched kthread starved for 6762309 jiffies! g2275626 c2275625 f0x0 s3
> ->state=0x0
> Task dump for CPU 0:
> updatedbR  running task0  6027   6021 0x0088
>  818d0c00 88007fc03c58 810a625f 
>  818d0c00 88007fc03c70 810a8699 0001
>  88007fc03ca0 810d0e5a 88007fc170c0 818d0c00
> Call Trace:
>[] sched_show_task+0xaf/0x110
>  [] dump_cpu_task+0x39/0x40
>  [] rcu_dump_cpu_stacks+0x8a/0xc0
>  [] rcu_check_callbacks+0x424/0x7a0
>  [] ? account_system_time+0x81/0x110
>  [] ? account_process_tick+0x61/0x160
>  [] ? tick_sched_do_timer+0x30/0x30
>  [] update_process_times+0x39/0x60
>  [] tick_sched_handle.isra.15+0x36/0x50
>  [] tick_sched_timer+0x3d/0x70
>  [] __hrtimer_run_queues+0xf2/0x250
>  [] hrtimer_interrupt+0xa8/0x190
>  [] xen_timer_interrupt+0x2e/0x140
>  [] handle_irq_event_percpu+0x55/0x1e0
>  [] handle_percpu_irq+0x3a/0x50
>  [] generic_handle_irq+0x22/0x30
>  [] __evtchn_fifo_handle_events+0x15f/0x180
>  [] evtchn_fifo_handle_events+0x10/0x20
>  [] __xen_evtchn_do_upcall+0x43/0x80
>  [] xen_evtchn_do_upcall+0x30/0x50
>  [] xen_hvm_callback_vector+0x82/0x90
>[] ? queued_spin_lock_slowpath+0x10/0x170
>  [] _raw_spin_lock+0x20/0x30
>  [] find_inode_fast+0x61/0xa0
>  [] iget_locked+0x6e/0x170
>  [] ext4_iget+0x33/0xae0
>  [] ? out_of_line_wait_on_bit+0x72/0x80
>  [] ext4_iget_normal+0x30/0x40
>  [] ext4_lookup+0xd5/0x140
>  [] lookup_real+0x1d/0x50
>  [] __lookup_hash+0x33/0x40
>  [] walk_component+0x177/0x280
>  [] path_lookupat+0x60/0x110
>  [] filename_lookup+0x9c/0x150
>  [] ? kfree+0x10d/0x290
>  [] ? call_filldir+0x9c/0x130
>  [] ? getname_flags+0x4f/0x1f0
>  [] user_path_at_empty+0x36/0x40
>  [] vfs_fstatat+0x53/0xa0
>  [] ? __fput+0x169/0x1d0
>  [] SYSC_newlstat+0x22/0x40
>  [] ? __audit_syscall_exit+0x1f0/0x270
>  [] ? syscall_slow_exit_work+0x3f/0xc0
>  [] ? __audit_syscall_entry+0xaf/0x100
>  [] SyS_newlstat+0xe/0x10
>  [] entry_SYSCALL_64_fastpath+0x12/0x71
> 
> This ends up with the system not responding at 100% CPU usage.

Re: [Xen-devel] rcu_sched self-detected stall on CPU on kernel 4.4.5 in PV DomU

2016-03-19 Thread Steven Haigh
On 19/03/2016 8:40 AM, Steven Haigh wrote:
> Hi all,
> 
> So I'd just like to give this a prod. I'm still getting DomU's randomly
> go to 100% CPU usage using kernel 4.4.6 now. It seems running 4.4.6 as
> the DomU does not induce these problems.

Sorry - slight correction. Running 4.4.6 as the Dom0 kernel doesn't show
these errors. Only in the DomU.

> 
> Latest crash message from today:
> INFO: rcu_sched self-detected stall on CPU
> 0-...: (20869552 ticks this GP) idle=9c9/141/0
> softirq=1440865/1440865 fqs=15068
>  (t=20874993 jiffies g=1354899 c=1354898 q=798)
> rcu_sched kthread starved for 20829030 jiffies! g1354899 c1354898 f0x0
> s3 ->state=0x0
> Task dump for CPU 0:
> kworker/u4:1R  running task0  5853  2 0x0088
> Workqueue: writeback wb_workfn (flush-202:0)
>  818d0c00 88007fc03c58 810a625f 
>  818d0c00 88007fc03c70 810a8699 0001
>  88007fc03ca0 810d0e5a 88007fc170c0 818d0c00
> Call Trace:
>[] sched_show_task+0xaf/0x110
>  [] dump_cpu_task+0x39/0x40
>  [] rcu_dump_cpu_stacks+0x8a/0xc0
>  [] rcu_check_callbacks+0x424/0x7a0
>  [] ? account_system_time+0x81/0x110
>  [] ? account_process_tick+0x61/0x160
>  [] ? tick_sched_do_timer+0x30/0x30
>  [] update_process_times+0x39/0x60
>  [] tick_sched_handle.isra.15+0x36/0x50
>  [] tick_sched_timer+0x3d/0x70
>  [] __hrtimer_run_queues+0xf2/0x250
>  [] hrtimer_interrupt+0xa8/0x190
>  [] xen_timer_interrupt+0x2e/0x140
>  [] handle_irq_event_percpu+0x55/0x1e0
>  [] handle_percpu_irq+0x3a/0x50
>  [] generic_handle_irq+0x22/0x30
>  [] __evtchn_fifo_handle_events+0x15f/0x180
>  [] evtchn_fifo_handle_events+0x10/0x20
>  [] __xen_evtchn_do_upcall+0x43/0x80
>  [] xen_evtchn_do_upcall+0x30/0x50
>  [] xen_hvm_callback_vector+0x82/0x90
> <EOI>  [] ? queued_spin_lock_slowpath+0x22/0x170
>  [] _raw_spin_lock+0x20/0x30
>  [] writeback_sb_inodes+0x124/0x560
>  [] ? _raw_spin_unlock_irqrestore+0x16/0x20
>  [] __writeback_inodes_wb+0x86/0xc0
>  [] wb_writeback+0x1d6/0x2d0
>  [] wb_workfn+0x284/0x3e0
>  [] process_one_work+0x151/0x400
>  [] worker_thread+0x11a/0x460
>  [] ? __schedule+0x2bf/0x880
>  [] ? rescuer_thread+0x2f0/0x2f0
>  [] kthread+0xc9/0xe0
>  [] ? kthread_park+0x60/0x60
>  [] ret_from_fork+0x3f/0x70
>  [] ? kthread_park+0x60/0x60
> 
> This repeats over and over causing 100% CPU usage - eventually on all
> vcpus assigned to the DomU and the only recovery is 'xl destroy'.
> 
> I'm currently running Xen 4.6.1 on this system - with kernel 4.4.6 in
> both the DomU and Dom0.
> 
> On 17/03/2016 8:39 AM, Steven Haigh wrote:
>> Hi all,
>>
>> I've noticed the following problem that ends up with a non-responsive PV
>> DomU using kernel 4.4.5 under heavy disk IO:
>>
>> INFO: rcu_sched self-detected stall on CPU
>> 0-...: (6759098 ticks this GP) idle=cb3/141/0
>> softirq=3244615/3244615 fqs=4
>>  (t=6762321 jiffies g=2275626 c=2275625 q=54)
>> rcu_sched kthread starved for 6762309 jiffies! g2275626 c2275625 f0x0 s3
>> ->state=0x0
>> Task dump for CPU 0:
>> updatedbR  running task0  6027   6021 0x0088
>>  818d0c00 88007fc03c58 810a625f 
>>  818d0c00 88007fc03c70 810a8699 0001
>>  88007fc03ca0 810d0e5a 88007fc170c0 818d0c00
>> Call Trace:
>> <IRQ>  [] sched_show_task+0xaf/0x110
>>  [] dump_cpu_task+0x39/0x40
>>  [] rcu_dump_cpu_stacks+0x8a/0xc0
>>  [] rcu_check_callbacks+0x424/0x7a0
>>  [] ? account_system_time+0x81/0x110
>>  [] ? account_process_tick+0x61/0x160
>>  [] ? tick_sched_do_timer+0x30/0x30
>>  [] update_process_times+0x39/0x60
>>  [] tick_sched_handle.isra.15+0x36/0x50
>>  [] tick_sched_timer+0x3d/0x70
>>  [] __hrtimer_run_queues+0xf2/0x250
>>  [] hrtimer_interrupt+0xa8/0x190
>>  [] xen_timer_interrupt+0x2e/0x140
>>  [] handle_irq_event_percpu+0x55/0x1e0
>>  [] handle_percpu_irq+0x3a/0x50
>>  [] generic_handle_irq+0x22/0x30
>>  [] __evtchn_fifo_handle_events+0x15f/0x180
>>  [] evtchn_fifo_handle_events+0x10/0x20
>>  [] __xen_evtchn_do_upcall+0x43/0x80
>>  [] xen_evtchn_do_upcall+0x30/0x50
>>  [] xen_hvm_callback_vector+0x82/0x90
>> <EOI>  [] ? queued_spin_lock_slowpath+0x10/0x170
>>  [] _raw_spin_lock+0x20/0x30
>>  [] find_inode_fast+0x61/0xa0
>>  [] iget_locked+0x6e/0x170
>>  [] ext4_iget+0x33/0xae0
>>  [] ? out_of_line_wait_on_bit+0x72/0x80
>>  [] ext4_iget_normal+0x30/0x40
>>  [] ext4_lookup+0xd5/0x140
>>  [] lookup_real+0x1d/0x50

[Xen-devel] pygrub detects false keypress on console

2016-03-11 Thread Steven Haigh
Hi all,

Testing on Xen 4.6.1 - and I've noticed that if I start an EL7 instance
and auto-attach to the console, that pygrub seems to see a key press and
waits.

Sometimes, I have to press ^]   to make it actually boot.

The guest is EL7 with grub2 config file.

Started with:
# xl create /etc/xen/configfile.vm -c

I believe this behaviour was seen in the past - but had been fixed in
maybe 4.5.x?

Has anyone else noticed this?

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] Fixation on polarssl 1.1.4 - EOL was 2013-10-01

2016-02-14 Thread Steven Haigh
Hi all,

Just been looking at the polarssl parts in Xen 4.6 and others - it seems
we're hard-coded to version 1.1.4, which was released on 31st May 2012.

The 1.1.x branch has been EOL for a number of years, and 1.2.x has been
EOL since January.

It's now called mbedtls, and the current version is 2.2.1, released in
January this year.

I'm not exactly clear on what polarssl is used for (and why not
openssl?) - but is it time this was shown some loving?

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [BUG?] qemuu only built with i386-softmmu

2016-02-05 Thread Steven Haigh
On 05/02/16 20:51, Ian Campbell wrote:
> On Fri, 2016-02-05 at 08:09 +1100, Steven Haigh wrote:
>> In building my Xen 4.6.0 packages, I disable qemu-traditional and ONLY
>> build qemu-upstream - however as the value for i386-softmmu is not based
>> on variables, I'm not sure this makes a difference.
> 
> QEMU in a Xen system only provides device model (DM) emulation and not any
> CPU instruction emulation, so the nominal arch doesn't actually matter and
> Xen builds i386 everywhere as a basically arbitrary choice.
> 
> It happens that the Xen DM part of QEMU is quite closely tied to the x86
> scaffolding for various historical reasons, so we end up using qemu-system-
> i386 even e.g. on ARM!
> 
> This comes up a lot, so I've also pasted the two paras above into a new
> section in http://wiki.xenproject.org/wiki/QEMU_Upstream . If anyone thinks
> the above is inaccurate then please edit the wiki (and post here too if you
> like).

I think this is a great addition that explains the situation well.
Documenting these is always a good thing.

> 
> One thing I wasn't sure on (so didn't write) is whether the second paragraph
> could have an extra sentence:
> 
> If you are using a distro supplied QEMU then the qemu-system-x86_64
> could also be used, but it makes no practical difference to the
> functionality of the system.
> 
> I wasn't sure if that was true (I suspect it is), and in any case I think
> various bits of libxl etc will look for qemu-system-i386 in various paths,
> so a user would need to try reasonably hard to use the x86_64 binary by
> giving an explicit path. Since there is no real reason to do so, maybe it's
> better not to muddy the waters?

Maybe go along the lines of:

"There is no practical difference between qemu-system-i386 and
qemu-system-x86_64 therefore both can be interchanged freely."

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [QUESTION] x86_64 -> i386/i686 CPU translation between xl and qemu binary?

2016-02-04 Thread Steven Haigh

On 2016-02-05 09:22, Andrew Cooper wrote:

> On 04/02/2016 22:06, Alex Braunegg wrote:
>> root 30511 46.4  0.1 398728  1860 ? RLsl 08:47 0:27
>> /usr/lib/xen/bin/qemu-system-i386 -xen-domid 6
>> -chardev socket,id=libxl-cmd,path=/var/run/xen/qmp-libxl-6,server,nowait
>> -no-shutdown -mon chardev=libxl-cmd,mode=control
>> -chardev socket,id=libxenstat-cmd,path=/var/run/xen/qmp-libxenstat-6,server,nowait
>> -mon chardev=libxenstat-cmd,mode=control -nodefaults -name test2
>> -vnc 0.0.0.0:0,websocket,x509=/etc/pki/xen,password,to=99 -display none
>> -serial pty -device VGA,vgamem_mb=16 -boot order=cd -usb -usbdevice tablet
>> -soundhw ac97 -device rtl8139,id=nic0,netdev=net0,mac=00:16:3e:f1:48:8c
>> -netdev type=tap,id=net0,ifname=vif6.0-emu,script=no,downscript=no
>> -machine xenfv -m 496
>> -drive file=/dev/zvol/storage0/xen/test2/disk_sda,if=ide,index=0,media=disk,format=raw,cache=writeback
>> -drive file=/storage0/data-shares/iso/CentOS-6.5-x86_64-minimal.iso,if=ide,index=2,readonly=on,media=cdrom,format=raw,cache=writeback,id=ide-5632
>>
>> --
>>
>> So - to me it appears that xl is performing some sort of x86_64 ->
>> i386/i686 instruction translation to make things work.
>>
>> Would this not be introducing a performance impediment by having some
>> sort of extra translation processing going on between xl and the qemu
>> binary?
>
> Qemu is only used for device emulation when used with Xen, not CPU
> emulation.
>
> The "-machine xenfv" tells this to Qemu, and "-xen-domid 6" tells it
> which Xen domain to connect to.
>
> All HVM domains run with hardware virtualisation extensions, which are
> managed by Xen itself.


Hi Andrew,

Thanks for this response - to ensure I have this correct, there is no 
need for qemu-upstream to build qemu-system-x86_64 as the CPU is 
directly handled by Xen and not by qemu - thereby passing through the 
capabilities of the CPU directly to the guest. As such, as long as qemu 
starts on a 64 bit machine it will be able to run 64 bit OS/kernel etc.


I ask this as I see a number of qemu packages that do include 
qemu-system-x86_64 as well as qemu-system-i386 - which makes me seek 
clarification. I would assume that these are just not built to use Xen 
as the hypervisor for hardware acceleration?


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [BUG?] qemuu only built with i386-softmmu

2016-02-04 Thread Steven Haigh
Hi all,

Looking specifically at 4.6.0.

It seems that the Makefile for qemuu uses the following:
$$source/configure --enable-xen --target-list=i386-softmmu \
$(QEMU_XEN_ENABLE_DEBUG) \
--prefix=$(LIBEXEC) \
--libdir=$(LIBEXEC_LIB) \
--includedir=$(LIBEXEC_INC) \
--source-path=$$source \
--extra-cflags="-I$(XEN_ROOT)/tools/include \
-I$(XEN_ROOT)/tools/libxc/include \
-I$(XEN_ROOT)/tools/xenstore/include \
-I$(XEN_ROOT)/tools/xenstore/compat/include \
$(EXTRA_CFLAGS_QEMU_XEN)" \
--extra-ldflags="-L$(XEN_ROOT)/tools/libxc \
-L$(XEN_ROOT)/tools/xenstore \
$(QEMU_UPSTREAM_RPATH)" \
--bindir=$(LIBEXEC_BIN) \
--datadir=$(SHAREDIR)/qemu-xen \
--localstatedir=$(localstatedir) \
--disable-kvm \
--disable-docs \
--disable-guest-agent \
--python=$(PYTHON) \
$(CONFIG_QEMUU_EXTRA_ARGS) \
--cpu=$(IOEMU_CPU_ARCH) \
$(IOEMU_CONFIGURE_CROSS); \
$(MAKE) all

As such, this only builds the 32 bit version of qemuu. It seems that
starting an HVM with more than 4GB of RAM fails:

libxl: debug: libxl_event.c:691:libxl__ev_xswatch_deregister: watch
w=0xa4d188: deregister unregistered
xc: detail: elf_parse_binary: phdr: paddr=0x10 memsz=0x5b3a4
xc: detail: elf_parse_binary: memory: 0x10 -> 0x15b3a4
xc: detail: VIRTUAL MEMORY ARRANGEMENT:
xc: detail:   Loader:   0010->0015b3a4
xc: detail:   Modules:  ->
xc: detail:   TOTAL:->00017f00
xc: detail:   ENTRY:00100600
xc: detail: PHYSICAL MEMORY ALLOCATION:
xc: detail:   4KB PAGES: 0x0200
xc: detail:   2MB PAGES: 0x05f7
xc: detail:   1GB PAGES: 0x0003
xc: detail: elf_load_binary: phdr 0 at 0x7ff67320f000 -> 0x7ff673260910
xc: error: Could not clear special pages (22 = Invalid argument):
Internal error
libxl: error: libxl_dom.c:1003:libxl__build_hvm: hvm building failed

In building my Xen 4.6.0 packages, I disable qemu-traditional and ONLY
build qemu-upstream - however as the i386-softmmu value is hard-coded
rather than set from a variable, I'm not sure this makes a difference.

My question is, should this also build x86_64-softmmu as well as
i386-softmmu at a bare minimum?
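
For illustration only - this is an untested sketch, not a change I've
verified: QEMU's configure takes a comma-separated --target-list, so in
principle the hunk above would only need its first line changed to build
both emulators from the one tree:

$$source/configure --enable-xen --target-list=i386-softmmu,x86_64-softmmu \
	...remaining options as above...

Even then, libxl would presumably still look for qemu-system-i386 in its
usual paths, so the extra binary would go unused unless pointed at
explicitly (e.g. via device_model_override in the guest config).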

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] bridge call iptables being forced

2015-11-19 Thread Steven Haigh

On 2015-11-19 12:46, Juan Rossi wrote:

> Hi
>
> I am sending this due the change of behaviour in some parts, and
> perhaps it needs some code amendments, unsure if the devel list is the
> best place, feel free to point me to the right place for this. Let me
> know if I should load a bug instead.

I'm tracking this at:
http://xen.crc.id.au/bugs/view.php?id=62

> diff --git a/tools/hotplug/Linux/vif-bridge b/tools/hotplug/Linux/vif-bridge
> index 3d72ca4..7fc6650 100644
> --- a/tools/hotplug/Linux/vif-bridge
> +++ b/tools/hotplug/Linux/vif-bridge
> @@ -93,7 +93,16 @@ case "$command" in
>  ;;
>  esac
>
> -handle_iptable
> +brcalliptables=$(sysctl -n net.bridge.bridge-nf-call-iptables 2>/dev/null)
> +brcalliptables=${brcalliptables:-0}
> +
> +brcallip6tables=$(sysctl -n net.bridge.bridge-nf-call-ip6tables 2>/dev/null)
> +brcallip6tables=${brcallip6tables:-0}
> +
> +if [ "$brcalliptables" -eq "1" -a "$brcallip6tables" -eq "1" ];
> +then
> +   handle_iptable
> +fi
>
>  call_hooks vif post


I'm not a fan of this as it will also enable the call to
handle_iptable() if people create their own firewall rules - ie those
sysctls will be true - hence the rule will get loaded anyway.

My comment on the bug report is included below to hopefully get further
input from people:

Thinking about this further - as it is a change in behaviour for a point
release, I believe we should do the following:

1) Create a new option in /etc/xen/xl.conf - and default it to False.
2) Name the option "autocreate_firewall_rules".
3) Evaluate autocreate_firewall_rules in the vif-common.sh function
handle_iptable() - see the fuller sketch after the pseudo-code below.

I suggest something like the following pseudo-code:
if [ $autocreate_firewall_rules == 0 ]; then
    return
fi
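
To flesh that out - a hypothetical sketch only, reusing the option name
from the pseudo-code above and assuming (this part is not written) that
xl exports the xl.conf value into the hotplug script environment:

# In vif-common.sh - skip rule creation unless explicitly enabled.
handle_iptable()
{
    # autocreate_firewall_rules is assumed to arrive via the environment;
    # default to 0 (off) so existing firewalls are left alone.
    if [ "${autocreate_firewall_rules:-0}" -eq 0 ]; then
        return 0
    fi

    # ... existing iptables handling would continue unchanged here ...
}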

Happy to start debate on the correct way of handling this :)

Hopefully this can lead to some further debate.

--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] PV random device

2015-10-05 Thread Steven Haigh

On 2015-10-06 15:29, Andy Smith wrote:

> - Your typical EntropyKey or OneRNG can generate quite a bit of
>   entropy. Maybe 32 kilobytes per second for ~$50 each.

If you can get one... :)

> - You can access them over the network so no USB passthrough needed.

Care to give details on this? I've got a HWRNG on a system that I'd like
to 'share' the entropy source out - but haven't found anything to do
this.
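
The crudest approach I can think of - an untested sketch, with made-up
host name, port and paths, and the obvious caveat that an unauthenticated
TCP stream should at best be one mixed-in source, never your only one -
would be to ship the raw device over the wire and let rngd feed it in:

# On the box with the HWRNG: serve the device to each connection.
socat -u OPEN:/dev/hwrng TCP-LISTEN:7777,reuseaddr,fork

# On the consumer: pull the stream into a FIFO and point rngd at it.
mkfifo /run/net-hwrng
socat -u TCP:hwrng-host:7777 PIPE:/run/net-hwrng &
rngd -r /run/net-hwrng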


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] RFC: change to 6 months release cycle

2015-10-05 Thread Steven Haigh
On 5/10/2015 10:23 PM, Wei Liu wrote:
> On Mon, Oct 05, 2015 at 05:04:19AM -0600, Jan Beulich wrote:
>>>>> On 02.10.15 at 19:43, <wei.l...@citrix.com> wrote:
>>> The main objection from previous discussion seems to be that "shorter
>>> release cycle creates burdens for downstream projects". I couldn't
>>> quite get the idea, but I think we can figure out a way to sort that
>>> out once we know what exactly the burdens are.
>>
>> I don't recall it that way. My main objection remains the resulting
>> higher burden of maintaining stable trees. Right now, most of the
>> time we have two trees to maintain. A 6-month release cycle means
>> three of them (shortening the time we maintain those trees doesn't
>> seem a viable option to me).
>>
>> Similar considerations apply to security maintenance of older trees.

> Just to throw around some ideas: we can have more stable tree
> maintainers, we can pick a stable tree every X releases etc etc.

So everyone else in the industry is increasing their support periods for
stable things, and we're wanting to go the opposite way?

Sorry - but this is nuts. Have a stable branch that is actually
supported properly with backports of security fixes etc - then have a
'bleeding edge' branch that rolls with the punches.

Remember that folks are still running Xen 3.4 on EL5 - and will be at
least until 2017. I still run the occasional patch for 4.2, and most
people are on either 4.4 or testing with 4.5 when running with EL6.

EL6 is supported until November 30, 2020. EL7 until 2024. People are not
exactly thrilled with EL7 in the virt area - but will eventually move to
it (or directly to EL8 or EL9).

The 6 month release cycle is exactly why people don't run Fedora on
their production environments. Why are we suddenly wanting the same
release schedule for Xen?

Sorry - but I'm VERY much against this proposal. Focus on stable and
complete, not Ooo Shiny!

-- 
Steven Haigh

Email: net...@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] RFC: change to 6 months release cycle

2015-10-05 Thread Steven Haigh
On 5/10/2015 10:44 PM, Ian Campbell wrote:
> On Mon, 2015-10-05 at 12:23 +0100, Wei Liu wrote:
>> we can pick a stable tree every X releases etc etc.
> 
> I think switching to an LTS style model, i.e. only supporting 1/N for
> longer than it takes to release the next major version might be interesting
> to consider. I'm thinking e.g. of N=4 with a 6 month cycle.

^^ This.

> I think some of our downstreams (i.e. distros) would like this, since it
> gives them releases which are supported for a length of time more like
> their own release cycles.

^^ This as well :)

-- 
Steven Haigh

Email: net...@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] RFC: change to 6 months release cycle

2015-10-05 Thread Steven Haigh
On 6/10/2015 12:05 AM, George Dunlap wrote:
> On Mon, Oct 5, 2015 at 12:44 PM, Steven Haigh <net...@crc.id.au> wrote:
>> On 5/10/2015 10:23 PM, Wei Liu wrote:
>>> On Mon, Oct 05, 2015 at 05:04:19AM -0600, Jan Beulich wrote:
>>>>>>> On 02.10.15 at 19:43, <wei.l...@citrix.com> wrote:
>>>>> The main objection from previous discussion seems to be that "shorter
>>>>> release cycle creates burdens for downstream projects". I couldn't
>>>>> quite get the idea, but I think we can figure out a way to sort that
>>>>> out once we know what exactly the burdens are.
>>>>
>>>> I don't recall it that way. My main objection remains the resulting
>>>> higher burden of maintaining stable trees. Right now, most of the
>>>> time we have two trees to maintain. A 6-month release cycle means
>>>> three of them (shortening the time we maintain those trees doesn't
>>>> seem a viable option to me).
>>>>
>>>> Similar considerations apply to security maintenance of older trees.
>> 
>>> Just to throw around some ideas: we can have more stable tree
>>> maintainers, we can pick a stable tree every X releases etc etc.
>>
>> So everyone else in the industry is increasing their support periods for
>> stable things, and we're wanting to go the opposite way?
>>
>> Sorry - but this is nuts. Have a stable branch that is actually
>> supported properly with backports of security fixes etc - then have a
>> 'bleeding edge' branch that rolls with the punches.
>>
>> Remember that folks are still running Xen 3.4 on EL5 - and will be at
>> least until 2017. I still run the occasional patch for 4.2, and most
>> people are on either 4.4 or testing with 4.5 when running with EL6.
>>
>> EL6 is supported until November 30, 2020. EL7 until 2024. People are not
>> exactly thrilled with EL7 in the virt area - but will eventually move to
>> it (or directly to EL8 or EL9).
>>
>> The 6 month release cycle is exactly why people don't run Fedora on
>> their production environments. Why are we suddenly wanting the same
>> release schedule for Xen?
>>
>> Sorry - but I'm VERY much against this proposal. Focus on stable and
>> complete, not Ooo Shiny!
> 
> I think you're talking about something completely different.
> 
> Wei is talking about releasing *more often*; you're talking about
> having *longer support windows*.

I think we are both along the same lines - however we both have
different points. The problem is, the more releases you have in a
support window, the more you have to maintain.

I did like Ian's idea of a new stable / lts / whatever you want to call
it every 4 x normal releases at 6 month timing. This would mean an LTS
release would be supported for 2 years.

I would really like to see:
LTS = 4 year full support + 1 year security fixes only
Rolling Release = 6 - 12 months between releases.

Is this possible? Not really sure - but the bigger end users don't want
to have to retest everything every year. Honestly, even an LTS of
*longer* than 4 years would be good - but I'm not sure that is even in
the realm of consideration.

> Nobody is suggesting that we shouldn't have releases that are
> supported for long periods of time.  What Wei is proposing is that
> instead of releasing every 0.75 years and supporting every release for
> N years, we release every 0.5 years, but every 1.0 (or 1.5) years make
> a release that we support for N years.  Many projects do this,
> including the Linux kernel.

True, but the kernel has several orders of magnitude more resources
contributed. I still do my best to keep a security-patched package of
4.2 for EL6 users - some of whom don't want to move to XL due to
reworking all their management tools.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.5.1 released

2015-06-23 Thread Steven Haigh
On 23/06/2015 8:45 PM, Ian Jackson wrote:
> Ian Jackson writes (Re: [Xen-devel] Xen 4.5.1 released):
>> M A Young writes (Re: [Xen-devel] Xen 4.5.1 released):
>>> I don't believe this release has the qemu-xen-traditional half of XSA-135.
>>> If this wasn't deliberate it might be worth noting it somewhere.
>>
>> You're right.  It appears that the patch for XSA-135 was never applied
>> to qemu-traditional, due to an oversight.
>
> The XSA-135 fix was missing everywhere.  I have now applied it (to all
> trees 4.1 and onward).

Out of interest, is the plan now to re-release a fixed 4.5.1 archive or
document the lack of the XSA135 patches and allow people to patch manually?

-- 
Steven Haigh

Email: net...@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Inplace upgrading 4.4.x - 4.5.0

2015-02-18 Thread Steven Haigh
On 18/02/2015 11:38 PM, Ian Campbell wrote:
> On Mon, 2015-02-09 at 20:36 +1100, Steven Haigh wrote:
>>> This sounds like a packaging issue -- Debian's packages for example jump
>>> through some hoops to make sure multiple tools packages can be installed
>>> in parallel and the correct ones selected for the currently running
>>> hypervisor.
>>
>> Hmmm - that sounds very hacky :\
>
> I've been slowly unpicking the Debian patches and upstreaming bits of
> them, I'm not sure if I'll manage to get this stuff upstream though
> since it is a bit more invasive than the other stuff.

Anything that helps out here is a good thing.

>> Hmmm Andrew is correct, the errors are all:
>>
>> = xl info ==
>> libxl: error: libxl.c:5044:libxl_get_physinfo: getting physinfo:
>> Permission denied
>
> EPERM is essentially tools/hypervisor version mismatch in most contexts.
>
> [...]
>> So, this leads me to wonder - as I'm sure MANY people get bitten by this
>> - how to control (at least to shutdown) DomUs after an in-place upgrade?
>
> You should evacuate the host before upgrading it, which is what I
> suppose most people do as the first step in their maintenance window.
> Evacuation might involve migrating VMs to another host (perhaps as part
> of a pool rolling upgrade type manoeuvre) or just shutting them down.
>
>> Even if no other functions are implemented other than shutdown, I would
>> call that an acceptable functionality. At least this way, you're not
>> hard killing running VMs on reboot.
>
> I'd expect that it might be possible to arrange to connect to the VM
> console and shut it down from within, or possibly to use the xenstore
> CLI tools to initiate the shutdown externally.
>
> After that then you would still end up with some zombie domains since
> after they have shutdown actually reaping them would require toolstack
> actions to talk to the hypervisor and you'd hit the version mismatch.

In large scale organisations (10+ systems), then yes, I'd say you're
probably right. This problem hits those who are smaller than that
however. I could list countless people who have been bitten by this in
the past - the majority of which are small businesses / hobbyists that
don't have the kind of equipment to do any of the above.

The expected upgrade path for those types is 'yum -y update && reboot'.

As we know, this falls over in a heap - however I don't think it is
beyond the realms of expectation for this to work.
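
For what it's worth, the xenstore route Ian mentioned above can be
sketched roughly like this (untested against a mismatched toolstack,
which is exactly the case in question, and the domid is a made-up
example - with 'xl list' returning EPERM you'd have to dig the IDs out
of xenstore-ls /local/domain instead):

domid=12    # hypothetical - the ID of the guest to shut down
xenstore-write /local/domain/${domid}/control/shutdown poweroff

This leans on the guest's PV drivers watching control/shutdown and
acting on it - the same mechanism 'xl shutdown' uses - so it still needs
working PV drivers in the guest.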

-- 
Steven Haigh

Email: net...@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Inplace upgrading 4.4.x - 4.5.0

2015-02-09 Thread Steven Haigh
On 9/02/2015 7:59 PM, Sander Eikelenboom wrote:
> Monday, February 9, 2015, 9:35:33 AM, you wrote:
>
>> Hello Steven,
>> upgrades from Xen 4.4 to 4.5 are supposed to work out of the box.
>>
>> Please post more details and we'll try to help you figure out what's
>> wrong.
>>
>> Cheers,
>>
>> Stefano
>>
>> On Sun, 8 Feb 2015, Steven Haigh wrote:
>>> Hi all,
>>>
>>> I was under the impression that you should be able to do in-place
>>> upgrades from Xen 4.4 to 4.5 on a system without losing the ability to
>>> manage DomUs...
>>>
>>> This would support upgrades from running systems from Xen 4.4.x to 4.5.0
>>> - only requiring a reboot to boot into the 4.5.0 hypervisor.
>>>
>>> When I try this in practice, I get a whole heap of permission denied
>>> errors and lose control of any running DomUs.
>>>
>>> Is there some secret sauce that will allow this to work?
>
> You are probably running into a mismatch between the running hypervisor
> (4.4) and the now installed toolstack (4.5) .. for instance when trying
> to shut down the VMs to do the reboot.
> (Since the newly installed hypervisor parts are only loaded and run on
> the next boot).

Correct - It is the 4.4 Hypervisor with 4.5 toolstack. After a reboot,
all is good. However this causes the problem - once you update the
packages from 4.4 to 4.5, you lose the ability to manage any running DomUs.

This is problematic - if only for the fact that you can't shut down
running DomUs for the Dom0 reboot.

I understand that large jumps in versions aren't supported - but I
believe that point versions should be supported using the same toolset,
ie 4.2 -> 4.3, 4.4 -> 4.5 etc.

I'm just about to gather some data for it - and I'll make a new thread
with what I can gather.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Inplace upgrading 4.4.x - 4.5.0

2015-02-09 Thread Steven Haigh
On 9/02/2015 8:16 PM, Ian Campbell wrote:
> On Mon, 2015-02-09 at 20:09 +1100, Steven Haigh wrote:
>> On 9/02/2015 7:59 PM, Sander Eikelenboom wrote:
>>> Monday, February 9, 2015, 9:35:33 AM, you wrote:
>>>
>>>> Hello Steven,
>>>> upgrades from Xen 4.4 to 4.5 are supposed to work out of the box.
>>>>
>>>> Please post more details and we'll try to help you figure out what's
>>>> wrong.
>>>>
>>>> Cheers,
>>>>
>>>> Stefano
>>>>
>>>> On Sun, 8 Feb 2015, Steven Haigh wrote:
>>>>> Hi all,
>>>>>
>>>>> I was under the impression that you should be able to do in-place
>>>>> upgrades from Xen 4.4 to 4.5 on a system without losing the ability to
>>>>> manage DomUs...
>>>>>
>>>>> This would support upgrades from running systems from Xen 4.4.x to 4.5.0
>>>>> - only requiring a reboot to boot into the 4.5.0 hypervisor.
>>>>>
>>>>> When I try this in practice, I get a whole heap of permission denied
>>>>> errors and lose control of any running DomUs.
>>>>>
>>>>> Is there some secret sauce that will allow this to work?
>>>
>>> You are probably running into a mismatch between the running hypervisor
>>> (4.4) and the now installed toolstack (4.5) .. for instance when trying
>>> to shut down the VMs to do the reboot.
>>> (Since the newly installed hypervisor parts are only loaded and run on
>>> the next boot).
>>
>> Correct - It is the 4.4 Hypervisor with 4.5 toolstack. After a reboot,
>> all is good. However this causes the problem - once you update the
>> packages from 4.4 to 4.5, you lose the ability to manage any running
>> DomUs.
>
> This sounds like a packaging issue -- Debian's packages for example jump
> through some hoops to make sure multiple tools packages can be installed
> in parallel and the correct ones selected for the currently running
> hypervisor.

Hmmm - that sounds very hacky :\

> Otherwise I think the upgrade path is:
>   * shutdown all VMs (or migrate them away)
>   * install new Xen + tools
>   * reboot
>   * restart domains with new tools.
>
> I'm afraid that using old tools on a new Xen is not something which is
> supported, even in the midst of an upgrade and AFAIK never has been. The
> N->N+1->N+2 statement is normally with reference to live migration (i.e.
> you can live migrate from a 4.4 system to a 4.5 one).

Hmmm Andrew is correct, the errors are all:

= xl info ==
libxl: error: libxl.c:5044:libxl_get_physinfo: getting physinfo:
Permission denied
libxl_physinfo failed.
libxl: error: libxl.c:5534:libxl_get_scheduler: getting domain info
list: Permission denied
host   : xenhost
release: 3.14.32-1.el6xen.x86_64
version: #1 SMP Sun Feb 8 15:41:07 AEDT 2015
machine: x86_64
xen_major  : 4
xen_minor  : 4
xen_extra  : .1
xen_version: 4.4.1
xen_caps   : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler  : (null)
xen_pagesize   : 4096
platform_params: virt_start=0x8000
xen_changeset  :
xen_commandline: dom0_mem=1024M cpufreq=xen dom0_max_vcpus=1
dom0_vcpus_pin console=tty0 console=com1 com1=115200,8n1
cc_compiler: gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-11)
cc_compile_by  : mockbuild
cc_compile_domain  : crc.id.au
cc_compile_date: Thu Jan  1 18:19:30 AEDT 2015
xend_config_format : 4

= xl list ==
libxl: error: libxl.c:669:libxl_list_domain: getting domain info list:
Permission denied
libxl_list_domain failed.

= xl dmesg ==
libxl: error: libxl.c:6061:libxl_xen_console_read_line: reading console
ring buffer: Permission denied

So, this leads me to wonder - as I'm sure MANY people get bitten by this
- how to control (at least to shutdown) DomUs after an in-place upgrade?

Even if no other functions are implemented other than shutdown, I would
call that an acceptable functionality. At least this way, you're not
hard killing running VMs on reboot.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Inplace upgrading 4.4.x - 4.5.0

2015-02-09 Thread Steven Haigh
Thanks Stefano,

I wanted to make sure I was correct with this belief before I posted all
the info.

I'll gather as much info as I can in the next day or so and see what we
can do.

On 9/02/2015 7:35 PM, Stefano Stabellini wrote:
> Hello Steven,
> upgrades from Xen 4.4 to 4.5 are supposed to work out of the box.
>
> Please post more details and we'll try to help you figure out what's
> wrong.
>
> Cheers,
>
> Stefano
>
> On Sun, 8 Feb 2015, Steven Haigh wrote:
>> Hi all,
>>
>> I was under the impression that you should be able to do in-place
>> upgrades from Xen 4.4 to 4.5 on a system without losing the ability to
>> manage DomUs...
>>
>> This would support upgrades from running systems from Xen 4.4.x to 4.5.0
>> - only requiring a reboot to boot into the 4.5.0 hypervisor.
>>
>> When I try this in practice, I get a whole heap of permission denied
>> errors and lose control of any running DomUs.
>>
>> Is there some secret sauce that will allow this to work?
>>
>> --
>> Steven Haigh
>>
>> Email: net...@crc.id.au
>> Web: http://www.crc.id.au
>> Phone: (03) 9001 6090 - 0412 935 897




-- 
Steven Haigh

Email: net...@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] Inplace upgrading 4.4.x - 4.5.0

2015-02-08 Thread Steven Haigh
Hi all,

I was under the impression that you should be able to do in-place
upgrades from Xen 4.4 to 4.5 on a system without losing the ability to
manage DomUs...

This would support upgrades from running systems from Xen 4.4.x to 4.5.0
- only requiring a reboot to boot into the 4.5.0 hypervisor.

When I try this in practice, I get a whole heap of permission denied
errors and lose control of any running DomUs.

Is there some secret sauce that will allow this to work?

-- 
Steven Haigh

Email: net...@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel