Re: [Qemu-devel] [PATCH v2 5/6] e1000: Choose which set of props to migrate

2018-03-28 Thread Ed Swierk
On Wed, Mar 28, 2018 at 9:36 AM, Dr. David Alan Gilbert (git)
<dgilb...@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilb...@redhat.com>
>
> When we're using the subsection we migrate both
> the 'props' and 'tso_props' data; when we're not using
> the subsection (to migrate to 2.11 or old machine types) we've
> got to choose what to migrate in the main structure.
>
> If we're using the subsection migrate 'props' in the main structure.
> If we're not using the subsection then migrate the last one
> that changed, which gives behaviour similar to the old behaviour.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilb...@redhat.com>

Acked-by: Ed Swierk <eswi...@gmail.com>

> ---
>  hw/net/e1000.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/hw/net/e1000.c b/hw/net/e1000.c
> index 4e606d4b2a..13a9494a8d 100644
> --- a/hw/net/e1000.c
> +++ b/hw/net/e1000.c
> @@ -130,6 +130,7 @@ typedef struct E1000State_st {
>  #define E1000_FLAG_TSO (1 << E1000_FLAG_TSO_BIT)
>      uint32_t compat_flags;
>      bool received_tx_tso;
> +    bool use_tso_for_migration;
>      e1000x_txd_props mig_props;
>  } E1000State;
>
> @@ -622,9 +623,11 @@ process_tx_desc(E1000State *s, struct e1000_tx_desc *dp)
>      if (dtype == E1000_TXD_CMD_DEXT) {    /* context descriptor */
>          if (le32_to_cpu(xp->cmd_and_length) & E1000_TXD_CMD_TSE) {
>              e1000x_read_tx_ctx_descr(xp, &tp->tso_props);
> +            s->use_tso_for_migration = 1;
>              tp->tso_frames = 0;
>          } else {
>              e1000x_read_tx_ctx_descr(xp, &tp->props);
> +            s->use_tso_for_migration = 0;
>          }
>          return;
>  } else if (dtype == (E1000_TXD_CMD_DEXT | E1000_TXD_DTYP_D)) {
> @@ -1366,7 +1369,20 @@ static int e1000_pre_save(void *opaque)
>          s->phy_reg[PHY_STATUS] |= MII_SR_AUTONEG_COMPLETE;
>      }
>
> -    s->mig_props = s->tx.props;
> +    /* Decide which set of props to migrate in the main structure */
> +    if (chkflag(TSO) || !s->use_tso_for_migration) {
> +        /* Either we're migrating with the extra subsection, in which
> +         * case the mig_props is always 'props' OR
> +         * we've not got the subsection, but 'props' was the last
> +         * updated.
> +         */
> +        s->mig_props = s->tx.props;
> +    } else {
> +        /* We're not using the subsection, and 'tso_props' was
> +         * the last updated.
> +         */
> +        s->mig_props = s->tx.tso_props;
> +    }
>      return 0;
>  }
>
> --
> 2.14.3
>



Re: [Qemu-devel] [PATCH 3/3] e1000: Old machine types, turn new subsection off

2018-03-27 Thread Ed Swierk
On Tue, Mar 27, 2018 at 10:28 AM, Paolo Bonzini  wrote:
> On 27/03/2018 18:47, Dr. David Alan Gilbert wrote:
>>> So if the subsection is absent you
>>> have to migrate either tx.tso_props or tx.props, depending on s->tx.cptse.
>> Do you mean when sending you have to decide which set to send in the
>> non-subsection data?  And with cptse true that means use tso_props?
>
> Yes.
>
>>> Likewise if you migrate from older versions: if s->tx.props.tse &&
>>> s->tx.cptse, you have to copy s->tx.props to s->tx.tso_props and clear
>>> s->tx.props.
>>
>> I don't see any equivalent code in the existing non-subsection postload to
>> do this; so I'm guessing there are some cases of 2.11->2.12 that will
>> break at the moment?
>
> Yes, I think so.
>
>>> My understanding is that s->tx.tso_props.tse will be 1 if
>>> and only if the source sent s->tx.tso_props.
>> I don't see anything in the current code that migrates tso_props.tse -
>> where does it come from?
>
> Ouch... The tse field is more or less dead in current code AFAICS, but
> it was used in the previous version.  What's the best way then to find
> if the subsection was transmitted?  Do we have anything like a post_load
> callback in the subsection itself?

The TSE flag in the cmd_and_length field of the context descriptor is
useful only as an indication of which context is being updated: TSO
(tso_props) or non-TSO (props). There is no reason to store it or
migrate it, and all prior uses of the stored field were based on an
incorrect understanding of its meaning. Now props.tse is always 0, and
tso_props.tse is always 1 after the first TSO context is processed.

> To find out which "props" to transmit to older QEMU, you can add a
> tp->use_tso_for_migration = tp->cptse just before "if (!(txd_lower &
> E1000_TXD_CMD_EOP))" in process_tx_desc...

tp->cptse only indicates whether the current tx data descriptor should
be segmented using parameters from the last TSO context descriptor.
It's perfectly legal for the guest to set up a TSO context and then
use it for some but not all subsequent data descriptors. tp->cptse
doesn't help in deciding what to migrate.

Whether to migrate props or tso_props back to 2.11 should be instead
based on which was updated last by a context descriptor. Something
like

if (dtype == E1000_TXD_CMD_DEXT) {    /* context descriptor */
    if (le32_to_cpu(xp->cmd_and_length) & E1000_TXD_CMD_TSE) {
        e1000x_read_tx_ctx_descr(xp, &tp->tso_props);
        tp->use_tso_for_migration = 1;
        tp->tso_frames = 0;
    } else {
        e1000x_read_tx_ctx_descr(xp, &tp->props);
        tp->use_tso_for_migration = 0;
    }
    return;

--Ed



Re: [Qemu-devel] [PATCH 0/3] e1000 migration changes for 2.12

2018-03-27 Thread Ed Swierk
On Tue, Mar 27, 2018 at 7:26 AM, Dr. David Alan Gilbert
 wrote:
> If I understand correctly the d6244b description, there's two pieces of
> state (TSO and non-TSO) where there used to be only one.
> Now, with the new format we migrate both pieces of state, but lets think
> about what happens if we have to migrate only one piece.
>
> a) 2.11->2.11 migrated the only piece it knew about; so I guess the only
> problem was really the UDP corruption mentioned in the commit.

Right.

> b) 2.11->2.12 - it receives the wrongly merged piece of state and puts it
> in - well which of the two states does it load it into?  What's the
> effect of that?

The best we can do is copy the old props into both props and
tso_props. Copying it into just props means that e1000 would suddenly
find all zero offsets in tso_props, screwing up an ongoing TCP session
using TSO in a Windows guest. Conversely copying into just tso_props
would screw up an ongoing UDP session using non-TSO offload in a
Windows guest. Copying means we do no worse than the buggy 2.11
behavior, until both TSO and non-TSO contexts are refreshed and
everything is fine. (For Linux guests it doesn't matter since Linux
always seems to refresh both TSO and non-TSO contexts before every tx
data descriptor.)

> c) 2.12(+my mod)->2.11 ok, so 2.12 will have filled in both sets of state
> internally; but now it's only going to send one of them over to 2.11 -
> which one gets sent to 2.11? Is it the one that 2.11 is expecting?

This case is trickier. We want to copy into the old props whichever
one of props and tso_props was updated most recently by a non-TSO or
TSO context descriptor. Always copying one or the other risks screwing
up an ongoing session in a Windows guest by using outdated offsets.
(Again there's no problem for Linux guests.)

Probably the easiest solution is to add yet another flag (which
doesn't itself get migrated) that's set when tso_props is updated and
cleared when props is updated.

> d) 2.12(+my mod)->2.12(+my mod) with an old machine type, again we're only
> going to send one set of data (same as c) - but what's 2.12 going to
> make of the one piece of state received?

This is the same as (b) I think.

--Ed



Re: [Qemu-devel] [PULL 00/18] Net patches

2018-01-08 Thread Ed Swierk via Qemu-devel
On Mon, Jan 8, 2018 at 7:10 AM, Eric Blake <ebl...@redhat.com> wrote:
>
> On 01/08/2018 07:30 AM, Ed Swierk via Qemu-devel wrote:
>
> > Applied, thanks.
> >
> > PS: just noticed, but "Ed Swierk via Qemu-devel <qemu-devel@nongnu.org>"
> > is a bit of an odd Author string to end up in git commit logs, so you
> > and/or Ed might like to fix that up for any future patches.
> >
> >
> > Thanks for pointing that out. I wonder if it's happening because I let
> > git-send-email automatically fill in the From: line.
>
> It's also a factor of how strict your ISP is about DMARC handling; the
> list automatically rewrites the 'From:' header to insert the 'via
> Qemu-devel' tag if it detects DMARC settings at your ISP that won't
> allow your email through as originally written.  Sadly, mailman doesn't
> know to insert a manual 'From:' line in the body when it rewrites the
> original From: header; but if you know that DMARC settings are going to
> munge your original header, you can probably convince git to always
> insert an explicit From: line in the message body to override whatever
> munging the list does.

I'm trying to figure out what I need to fix on my end. I went back and
looked at the email headers. Here are the two that ended up with the
wrong author:

Return-Path: <eswi...@skyportsystems.com>
Received: from eswierk-sc.localdomain
(67-207-112-138.static.wiline.com. [67.207.112.138])
by smtp.gmail.com with ESMTPSA id
d9sm20150979pfk.117.2017.11.14.15.23.43
    (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128);
Tue, 14 Nov 2017 15:23:44 -0800 (PST)
From: Ed Swierk <eswi...@skyportsystems.com>
To: Jason Wang <jasow...@redhat.com>,
    "Daniel P . Berrange" <berra...@redhat.com>,
Stefan Hajnoczi <stefa...@gmail.com>
Cc: qemu-devel@nongnu.org,
Ed Swierk <eswi...@skyportsystems.com>
Subject: [PATCH 1/2] e1000, e1000e: Move per-packet TX offload flags
out of context state
Date: Tue, 14 Nov 2017 15:23:33 -0800
Message-Id: <1510701814-52973-2-git-send-email-eswi...@skyportsystems.com>
X-Mailer: git-send-email 1.9.1
In-Reply-To: <1510701814-52973-1-git-send-email-eswi...@skyportsystems.com>
References: <1510701814-52973-1-git-send-email-eswi...@skyportsystems.com>

Return-Path: <eswi...@skyportsystems.com>
Received: from eswierk-sc.localdomain
(67-207-112-138.static.wiline.com. [67.207.112.138])
by smtp.gmail.com with ESMTPSA id
d9sm20150979pfk.117.2017.11.14.15.23.45
(version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128);
Tue, 14 Nov 2017 15:23:45 -0800 (PST)
From: Ed Swierk <eswi...@skyportsystems.com>
To: Jason Wang <jasow...@redhat.com>,
"Daniel P . Berrange" <berra...@redhat.com>,
Stefan Hajnoczi <stefa...@gmail.com>
Cc: qemu-devel@nongnu.org,
Ed Swierk <eswi...@skyportsystems.com>
Subject: [PATCH 2/2] e1000: Separate TSO and non-TSO contexts, fixing
UDP TX corruption
Date: Tue, 14 Nov 2017 15:23:34 -0800
Message-Id: <1510701814-52973-3-git-send-email-eswi...@skyportsystems.com>
X-Mailer: git-send-email 1.9.1
In-Reply-To: <1510701814-52973-1-git-send-email-eswi...@skyportsystems.com>
References: <1510701814-52973-1-git-send-email-eswi...@skyportsystems.com>


This one had the correct author:

Return-Path: <eswi...@skyportsystems.com>
Received: from eswierk-sc.localdomain
(67-207-112-138.static.wiline.com. [67.207.112.138])
by smtp.gmail.com with ESMTPSA id s3sm4082810pfk.7.2017.11.16.06.06.36
(version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128);
Thu, 16 Nov 2017 06:06:37 -0800 (PST)
From: Ed Swierk <eswi...@skyportsystems.com>
To: Jason Wang <jasow...@redhat.com>,
"Daniel P . Berrange" <berra...@redhat.com>,
Stefan Hajnoczi <stefa...@gmail.com>,
Dmitry Fleytman <dmi...@daynix.com>
Cc: qemu-devel@nongnu.org,
Ed Swierk <eswi...@skyportsystems.com>
Subject: [PATCH v3 REPOST] net: Transmit zero UDP checksum as 0xFFFF
Date: Thu, 16 Nov 2017 06:06:06 -0800
Message-Id: <1510841166-40615-1-git-send-email-eswi...@skyportsystems.com>
X-Mailer: git-send-email 1.9.1


Is there any way to tell exactly what mailman didn't like about the
first two, causing it to change the author?

--Ed



Re: [Qemu-devel] [PULL 00/18] Net patches

2018-01-08 Thread Ed Swierk via Qemu-devel
On Jan 8, 2018 02:16, "Peter Maydell" <peter.mayd...@linaro.org> wrote:

On 22 December 2017 at 02:15, Jason Wang <jasow...@redhat.com> wrote:
> The following changes since commit 43ab9a5376c95c61ae898a222c4d04bdf60e239b:
>
>   hw/i386/vmport: fix missing definitions with non-log trace backends
>   (2017-12-21 22:52:28 +0000)
>
> are available in the git repository at:
>
>   https://github.com/jasowang/qemu.git tags/net-pull-request
>
> for you to fetch changes up to 0065e915192cdf83c2700bb377e5323c2649476e:
>
>   qemu-doc: Update the deprecation information of -tftp, -bootp, -redir
and -smb (2017-12-22 10:06:05 +0800)
>
> 
>
> --------

Applied, thanks.

PS: just noticed, but "Ed Swierk via Qemu-devel <qemu-devel@nongnu.org>"
is a bit of an odd Author string to end up in git commit logs, so you
and/or Ed might like to fix that up for any future patches.


Thanks for pointing that out. I wonder if it's happening because I let
git-send-email automatically fill in the From: line.

Anyway if it's not too late to fix, the correct address is Ed Swierk <
eswi...@skyportsystems.com> .


thanks
-- PMM


[Qemu-devel] [PATCH v3 REPOST] net: Transmit zero UDP checksum as 0xFFFF

2017-11-16 Thread Ed Swierk via Qemu-devel
The checksum algorithm used by IPv4, TCP and UDP allows a zero value
to be represented by either 0x0000 or 0xFFFF. But per RFC 768, a zero
UDP checksum must be transmitted as 0xFFFF, because 0x0000 is a special
value meaning no checksum.

Substitute 0xFFFF whenever a checksum is computed as zero when
modifying a UDP datagram header. Doing this on IPv4 and TCP checksums
is unnecessary but legal. Add a wrapper for net_checksum_finish() that
makes the substitution.

(We can't just change net_checksum_finish(), as that function is also
used by receivers to verify checksums, and in that case the expected
value is always 0x0000.)

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
v3:

(Reposted to fix patch format)

Leave net_tx_pkt_update_ip_checksums() alone since it's only computing
a partial sum of the IP pseudo-header.

Rename wrapper to net_checksum_finish_nozero() for clarity.

v2:

Add a wrapper net_checksum_finish_hdr() rather than duplicating the
logic at every caller.
---
 hw/net/e1000.c | 2 +-
 hw/net/net_rx_pkt.c| 2 +-
 hw/net/net_tx_pkt.c| 2 +-
 hw/net/vmxnet3.c   | 3 ++-
 include/net/checksum.h | 6 ++
 5 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index 72a92be..804ec08 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -506,7 +506,7 @@ putsum(uint8_t *data, uint32_t n, uint32_t sloc, uint32_t css, uint32_t cse)
     n = cse + 1;
     if (sloc < n-1) {
         sum = net_checksum_add(n-css, data+css);
-        stw_be_p(data + sloc, net_checksum_finish(sum));
+        stw_be_p(data + sloc, net_checksum_finish_nozero(sum));
     }
 }
 
diff --git a/hw/net/net_rx_pkt.c b/hw/net/net_rx_pkt.c
index cef1c2e..98a5030 100644
--- a/hw/net/net_rx_pkt.c
+++ b/hw/net/net_rx_pkt.c
@@ -518,7 +518,7 @@ _net_rx_pkt_calc_l4_csum(struct NetRxPkt *pkt)
     cntr += net_checksum_add_iov(pkt->vec, pkt->vec_len,
                                  pkt->l4hdr_off, csl, cso);
 
-    csum = net_checksum_finish(cntr);
+    csum = net_checksum_finish_nozero(cntr);
 
 trace_net_rx_pkt_l4_csum_calc_csum(pkt->l4hdr_off, csl, cntr, csum);
 
diff --git a/hw/net/net_tx_pkt.c b/hw/net/net_tx_pkt.c
index 20b2549..e29c881 100644
--- a/hw/net/net_tx_pkt.c
+++ b/hw/net/net_tx_pkt.c
@@ -486,7 +486,7 @@ static void net_tx_pkt_do_sw_csum(struct NetTxPkt *pkt)
     net_checksum_add_iov(iov, iov_len, pkt->virt_hdr.csum_start, csl, cso);
 
     /* Put the checksum obtained into the packet */
-    csum = cpu_to_be16(net_checksum_finish(csum_cntr));
+    csum = cpu_to_be16(net_checksum_finish_nozero(csum_cntr));
     iov_from_buf(iov, iov_len, csum_offset, &csum, sizeof csum);
 }
 
diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 8c4bae5..cdc307d 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -972,7 +972,8 @@ static void vmxnet3_rx_need_csum_calculate(struct NetRxPkt *pkt,
     data = (uint8_t *)pkt_data + vhdr->csum_start;
     len = pkt_len - vhdr->csum_start;
     /* Put the checksum obtained into the packet */
-    stw_be_p(data + vhdr->csum_offset, net_raw_checksum(data, len));
+    stw_be_p(data + vhdr->csum_offset,
+             net_checksum_finish_nozero(net_checksum_add(len, data)));
 
 vhdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM;
 vhdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
diff --git a/include/net/checksum.h b/include/net/checksum.h
index 7df472c..05a0d27 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -34,6 +34,12 @@ net_checksum_add(int len, uint8_t *buf)
 }
 
 static inline uint16_t
+net_checksum_finish_nozero(uint32_t sum)
+{
+    return net_checksum_finish(sum) ?: 0xFFFF;
+}
+
+static inline uint16_t
 net_raw_checksum(uint8_t *data, int length)
 {
 return net_checksum_finish(net_checksum_add(length, data));
-- 
1.9.1




Re: [Qemu-devel] [PATCH v3] net: Transmit zero UDP checksum as 0xFFFF

2017-11-16 Thread Ed Swierk via Qemu-devel
On Wed, Nov 15, 2017 at 7:55 PM, Eric Blake <ebl...@redhat.com> wrote:
>
> On 11/15/2017 09:07 PM, Ed Swierk via Qemu-devel wrote:
>
> This part below...
>
> >
> > v3:
> >
> > Leave net_tx_pkt_update_ip_checksums() alone since it's only computing
> > a partial sum of the IP pseudo-header.
> >
> > Rename wrapper to net_checksum_finish_nozero() for clarity.
> >
> > v2:
> >
> > Add a wrapper net_checksum_finish_hdr() rather than duplicating the
> > logic at every caller.
>
> ...through here, belongs...
>
> >
> > Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
> > ---
>
> ...here, after the --- separator (you can have more than one ---; 'git
> am' stops parsing at the first).  It is data that is useful to
> reviewers, but will be meaningless a year from now when browsing git
> history (at which point we won't care how many revisions the patch went
> through on the list, but only what was checked in).

Thanks for the clue. I'll repost a corrected version.

Maybe checkpatch.pl should squawk if it sees ^v[0-9]\+: before the first ---?

--Ed



[Qemu-devel] [PATCH v3] net: Transmit zero UDP checksum as 0xFFFF

2017-11-15 Thread Ed Swierk via Qemu-devel
The checksum algorithm used by IPv4, TCP and UDP allows a zero value
to be represented by either 0x0000 or 0xFFFF. But per RFC 768, a zero
UDP checksum must be transmitted as 0xFFFF, as 0x0000 is a special
value meaning no checksum.

Substitute 0xFFFF whenever a checksum is computed as zero when
modifying a UDP datagram header. Doing this on IPv4 packets and TCP
segments is unnecessary but legal. Add a wrapper for
net_checksum_finish() that makes the substitution.

(We can't just change net_checksum_finish(), as that function is also
used by receivers to verify checksums, and in that case the expected
value is always 0x0000.)

v3:

Leave net_tx_pkt_update_ip_checksums() alone since it's only computing
a partial sum of the IP pseudo-header.

Rename wrapper to net_checksum_finish_nozero() for clarity.

v2:

Add a wrapper net_checksum_finish_hdr() rather than duplicating the
logic at every caller.

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 hw/net/e1000.c | 2 +-
 hw/net/net_rx_pkt.c| 2 +-
 hw/net/net_tx_pkt.c| 2 +-
 hw/net/vmxnet3.c   | 3 ++-
 include/net/checksum.h | 6 ++
 5 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index 72a92be..804ec08 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -506,7 +506,7 @@ putsum(uint8_t *data, uint32_t n, uint32_t sloc, uint32_t css, uint32_t cse)
     n = cse + 1;
     if (sloc < n-1) {
         sum = net_checksum_add(n-css, data+css);
-        stw_be_p(data + sloc, net_checksum_finish(sum));
+        stw_be_p(data + sloc, net_checksum_finish_nozero(sum));
     }
 }
 
diff --git a/hw/net/net_rx_pkt.c b/hw/net/net_rx_pkt.c
index cef1c2e..98a5030 100644
--- a/hw/net/net_rx_pkt.c
+++ b/hw/net/net_rx_pkt.c
@@ -518,7 +518,7 @@ _net_rx_pkt_calc_l4_csum(struct NetRxPkt *pkt)
     cntr += net_checksum_add_iov(pkt->vec, pkt->vec_len,
                                  pkt->l4hdr_off, csl, cso);
 
-    csum = net_checksum_finish(cntr);
+    csum = net_checksum_finish_nozero(cntr);
 
 trace_net_rx_pkt_l4_csum_calc_csum(pkt->l4hdr_off, csl, cntr, csum);
 
diff --git a/hw/net/net_tx_pkt.c b/hw/net/net_tx_pkt.c
index 20b2549..e29c881 100644
--- a/hw/net/net_tx_pkt.c
+++ b/hw/net/net_tx_pkt.c
@@ -486,7 +486,7 @@ static void net_tx_pkt_do_sw_csum(struct NetTxPkt *pkt)
     net_checksum_add_iov(iov, iov_len, pkt->virt_hdr.csum_start, csl, cso);
 
     /* Put the checksum obtained into the packet */
-    csum = cpu_to_be16(net_checksum_finish(csum_cntr));
+    csum = cpu_to_be16(net_checksum_finish_nozero(csum_cntr));
     iov_from_buf(iov, iov_len, csum_offset, &csum, sizeof csum);
 }
 
diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 8c4bae5..cdc307d 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -972,7 +972,8 @@ static void vmxnet3_rx_need_csum_calculate(struct NetRxPkt *pkt,
     data = (uint8_t *)pkt_data + vhdr->csum_start;
     len = pkt_len - vhdr->csum_start;
     /* Put the checksum obtained into the packet */
-    stw_be_p(data + vhdr->csum_offset, net_raw_checksum(data, len));
+    stw_be_p(data + vhdr->csum_offset,
+             net_checksum_finish_nozero(net_checksum_add(len, data)));
 
 vhdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM;
 vhdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
diff --git a/include/net/checksum.h b/include/net/checksum.h
index 7df472c..05a0d27 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -34,6 +34,12 @@ net_checksum_add(int len, uint8_t *buf)
 }
 
 static inline uint16_t
+net_checksum_finish_nozero(uint32_t sum)
+{
+    return net_checksum_finish(sum) ?: 0xFFFF;
+}
+
+static inline uint16_t
 net_raw_checksum(uint8_t *data, int length)
 {
 return net_checksum_finish(net_checksum_add(length, data));
-- 
1.9.1




Re: [Qemu-devel] [PATCH v2] net: Transmit zero UDP checksum as 0xFFFF

2017-11-15 Thread Ed Swierk via Qemu-devel
On Tue, Nov 14, 2017 at 6:57 PM, Ed Swierk <eswi...@skyportsystems.com> wrote:
> diff --git a/hw/net/net_tx_pkt.c b/hw/net/net_tx_pkt.c
> index 20b2549..dc95f12 100644
> --- a/hw/net/net_tx_pkt.c
> +++ b/hw/net/net_tx_pkt.c
> @@ -126,12 +126,12 @@ void net_tx_pkt_update_ip_checksums(struct NetTxPkt 
> *pkt)
>
>  /* Calculate IP pseudo header checksum */
>  cntr = eth_calc_ip4_pseudo_hdr_csum(ip_hdr, pkt->payload_len, );
> -csum = cpu_to_be16(~net_checksum_finish(cntr));
> +csum = cpu_to_be16(~net_checksum_finish_hdr(cntr));
>  } else if (gso_type == VIRTIO_NET_HDR_GSO_TCPV6) {
>  /* Calculate IP pseudo header checksum */
>  cntr = eth_calc_ip6_pseudo_hdr_csum(ip_hdr, pkt->payload_len,
>  IP_PROTO_TCP, );
> -csum = cpu_to_be16(~net_checksum_finish(cntr));
> +csum = cpu_to_be16(~net_checksum_finish_hdr(cntr));
>  } else {
>  return;
>  }

Actually this change looks wrong. The checksum here is just the
partial sum of the IP pseudo-header fields, to be incorporated into
the actual UDP or TCP checksum (including the packet payload) at some
later stage. The partial sum should never be zero (if
net_rx_pkt_fix_l4_csum() is to be believed), but changing 0x0000 to
0xFFFF and then storing the complement ensures just that.

Unless someone with more of a clue says otherwise, I'll drop this change in v3.

--Ed



[Qemu-devel] [PATCH v2] net: Transmit zero UDP checksum as 0xFFFF

2017-11-14 Thread Ed Swierk via Qemu-devel
The checksum algorithm used by IPv4, TCP and UDP allows a zero value
to be represented by either 0x0000 or 0xFFFF. But per RFC 768, a zero
UDP checksum must be transmitted as 0xFFFF, as 0x0000 is a special
value meaning no checksum.

Substitute 0xFFFF whenever a checksum is computed as zero when
modifying a UDP datagram header. Doing this on IPv4 packets and TCP
segments is unnecessary but legal. Add a wrapper for
net_checksum_finish() that makes the substitution.

(We can't just change net_checksum_finish(), as that function is also
used by receivers to verify checksums, and in that case the expected
value is always 0x0000.)

v2:

Add a wrapper net_checksum_finish_hdr() rather than duplicating the
logic at every caller.

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 hw/net/e1000.c | 2 +-
 hw/net/net_rx_pkt.c| 2 +-
 hw/net/net_tx_pkt.c| 6 +++---
 hw/net/vmxnet3.c   | 3 ++-
 include/net/checksum.h | 7 +++
 5 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index d642314..4e33667 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -506,7 +506,7 @@ putsum(uint8_t *data, uint32_t n, uint32_t sloc, uint32_t css, uint32_t cse)
     n = cse + 1;
     if (sloc < n-1) {
         sum = net_checksum_add(n-css, data+css);
-        stw_be_p(data + sloc, net_checksum_finish(sum));
+        stw_be_p(data + sloc, net_checksum_finish_hdr(sum));
     }
 }
 
diff --git a/hw/net/net_rx_pkt.c b/hw/net/net_rx_pkt.c
index 1019b50..a2b2c2d 100644
--- a/hw/net/net_rx_pkt.c
+++ b/hw/net/net_rx_pkt.c
@@ -517,7 +517,7 @@ _net_rx_pkt_calc_l4_csum(struct NetRxPkt *pkt)
     cntr += net_checksum_add_iov(pkt->vec, pkt->vec_len,
                                  pkt->l4hdr_off, csl, cso);
 
-    csum = net_checksum_finish(cntr);
+    csum = net_checksum_finish_hdr(cntr);
 
 trace_net_rx_pkt_l4_csum_calc_csum(pkt->l4hdr_off, csl, cntr, csum);
 
diff --git a/hw/net/net_tx_pkt.c b/hw/net/net_tx_pkt.c
index 20b2549..dc95f12 100644
--- a/hw/net/net_tx_pkt.c
+++ b/hw/net/net_tx_pkt.c
@@ -126,12 +126,12 @@ void net_tx_pkt_update_ip_checksums(struct NetTxPkt *pkt)
 
 /* Calculate IP pseudo header checksum */
 cntr = eth_calc_ip4_pseudo_hdr_csum(ip_hdr, pkt->payload_len, );
-csum = cpu_to_be16(~net_checksum_finish(cntr));
+csum = cpu_to_be16(~net_checksum_finish_hdr(cntr));
 } else if (gso_type == VIRTIO_NET_HDR_GSO_TCPV6) {
 /* Calculate IP pseudo header checksum */
 cntr = eth_calc_ip6_pseudo_hdr_csum(ip_hdr, pkt->payload_len,
 IP_PROTO_TCP, );
-csum = cpu_to_be16(~net_checksum_finish(cntr));
+csum = cpu_to_be16(~net_checksum_finish_hdr(cntr));
 } else {
 return;
 }
@@ -486,7 +486,7 @@ static void net_tx_pkt_do_sw_csum(struct NetTxPkt *pkt)
     net_checksum_add_iov(iov, iov_len, pkt->virt_hdr.csum_start, csl, cso);
 
     /* Put the checksum obtained into the packet */
-    csum = cpu_to_be16(net_checksum_finish(csum_cntr));
+    csum = cpu_to_be16(net_checksum_finish_hdr(csum_cntr));
     iov_from_buf(iov, iov_len, csum_offset, &csum, sizeof csum);
 }
 
diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 90f6943..ee7cdeb 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -970,7 +971,8 @@ static void vmxnet3_rx_need_csum_calculate(struct NetRxPkt *pkt,
     data = (uint8_t *)pkt_data + vhdr->csum_start;
     len = pkt_len - vhdr->csum_start;
     /* Put the checksum obtained into the packet */
-    stw_be_p(data + vhdr->csum_offset, net_raw_checksum(data, len));
+    stw_be_p(data + vhdr->csum_offset,
+             net_checksum_finish_hdr(net_checksum_add(len, data)));
 
 vhdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM;
 vhdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
diff --git a/include/net/checksum.h b/include/net/checksum.h
index 7df472c..9878c8b 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -34,6 +34,13 @@ net_checksum_add(int len, uint8_t *buf)
 }
 
 static inline uint16_t
+net_checksum_finish_hdr(uint32_t sum)
+{
+    uint16_t s = net_checksum_finish(sum);
+    return s ? s : 0xFFFF;
+}
+
+static inline uint16_t
 net_raw_checksum(uint8_t *data, int length)
 {
 return net_checksum_finish(net_checksum_add(length, data));
-- 
1.9.1




Re: [Qemu-devel] [PATCH] net: Transmit zero UDP checksum as 0xFFFF

2017-11-14 Thread Ed Swierk via Qemu-devel
On Tue, Nov 14, 2017 at 6:10 PM, Jason Wang <jasow...@redhat.com> wrote:
>
>
> On 15 Nov 2017 07:25, Ed Swierk wrote:
>>
>> The checksum algorithm used by IPv4, TCP and UDP allows a zero value
>> to be represented by either 0x0000 or 0xFFFF. But per RFC 768, a zero
>> UDP checksum must be transmitted as 0xFFFF, as 0x0000 is a special
>> value meaning no checksum.
>>
>> Substitute 0xFFFF whenever a checksum is computed as zero on a UDP
>> datagram. Doing this on IPv4 packets and TCP segments is unnecessary
>> but legal.
>>
>> (While it is tempting to make the substitution in
>> net_checksum_finish(), that function is also used by receivers to
>> verify checksums, and in that case the expected value is always
>> 0x0000.)
>
>
> Then looks like you'd better have a wrapper for net_checksum_finish() and
> do things there.

I'll do that in v2.

>> index 1019b50..e820132 100644
>> --- a/hw/net/net_rx_pkt.c
>> +++ b/hw/net/net_rx_pkt.c
>> @@ -588,6 +588,9 @@ bool net_rx_pkt_fix_l4_csum(struct NetRxPkt *pkt)
>>      /* Calculate L4 checksum */
>>      csum = cpu_to_be16(_net_rx_pkt_calc_l4_csum(pkt));
>> +    if (!csum) {
>> +        csum = 0xFFFF; /* For UDP, zero checksum must be sent as 0xFFFF */
>> +    }
>
>
> I thought we should only do this for tx?

We need to do this any time we modify the checksum field in a UDP
datagram header for someone else to verify. Normally this happens on
the tx path, and that someone is a remote system. But here
net_rx_pkt_fix_l4_csum() is used to fill in the checksum on packets
received with the NEEDS_CSUM vhdr flag before passing them along to
the guest.

--Ed



Re: [Qemu-devel] [PATCH v2 0/2] e1000e: Reimplement e1000 as a variant of e1000e

2017-11-14 Thread Ed Swierk via Qemu-devel
On Thu, Nov 9, 2017 at 5:53 AM, Daniel P. Berrange  wrote:
> My fear is that this approach of building a new e1000-ng device in
> parallel with having the existing e1000 device is going to cause
> long term pain, possibly never getting to a state where the e1000-ng
> device can replace the e1000 device. Any time there needs to be a
> "big bang" to switch from one impl to another completely different
> impl always causes trouble IME. With need for migration wire format
> & state compatibility, this is even more difficult. From a code review
> POV it will be essentially impossible to have confidence that the new
> impl can be a viable drop-in replacement for the old impl.
>
> Is there really no way that you can change the approach to do a more
> incremental conversion of the existing code base, but still end up in
> the same place at the very end ?
>
> eg just copy all the e1000.c code into the e1000e.c file to start with.
> Then gradually merge functional areas over a longish series of patches
> to eliminate the duplication. This would make it far more practical to
> identify where any regressions come from, and will give reviewers more
> confidence that we're not breaking migration compat.

I agree an incremental conversion is the only realistic way to unearth
potential regressions; testing won't be enough. In the past couple of
days I've run into several cases where e1000 works but e1000-ng
doesn't. Apparently e1000e guest drivers just happen not to tickle
these bugs. So e1000-ng isn't ready for prime time anyway. I'll shelve
this patch series for the moment.

Meanwhile I'll post separate patches for the bugs I've encountered
with e1000 UDP checksum offload.

--Ed



[Qemu-devel] [PATCH] net: Transmit zero UDP checksum as 0xFFFF

2017-11-14 Thread Ed Swierk via Qemu-devel
The checksum algorithm used by IPv4, TCP and UDP allows a zero value
to be represented by either 0x0000 or 0xFFFF. But per RFC 768, a zero
UDP checksum must be transmitted as 0xFFFF, as 0x0000 is a special
value meaning no checksum.

Substitute 0xFFFF whenever a checksum is computed as zero on a UDP
datagram. Doing this on IPv4 packets and TCP segments is unnecessary
but legal.

(While it is tempting to make the substitution in
net_checksum_finish(), that function is also used by receivers to
verify checksums, and in that case the expected value is always
0x0000.)

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 hw/net/e1000.c  | 5 +++--
 hw/net/net_rx_pkt.c | 3 +++
 hw/net/net_tx_pkt.c | 6 ++
 hw/net/vmxnet3.c| 7 +--
 4 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index d642314..97242a1 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -505,8 +505,9 @@ putsum(uint8_t *data, uint32_t n, uint32_t sloc, uint32_t css, uint32_t cse)
     if (cse && cse < n)
         n = cse + 1;
     if (sloc < n-1) {
-        sum = net_checksum_add(n-css, data+css);
-        stw_be_p(data + sloc, net_checksum_finish(sum));
+        sum = net_raw_checksum(data + css, n - css);
+        /* For UDP, zero checksum must be sent as 0xFFFF */
+        stw_be_p(data + sloc, sum ? sum : 0xFFFF);
     }
 }
 
diff --git a/hw/net/net_rx_pkt.c b/hw/net/net_rx_pkt.c
index 1019b50..e820132 100644
--- a/hw/net/net_rx_pkt.c
+++ b/hw/net/net_rx_pkt.c
@@ -588,6 +588,9 @@ bool net_rx_pkt_fix_l4_csum(struct NetRxPkt *pkt)
 
     /* Calculate L4 checksum */
     csum = cpu_to_be16(_net_rx_pkt_calc_l4_csum(pkt));
+    if (!csum) {
+        csum = 0xFFFF; /* For UDP, zero checksum must be sent as 0xFFFF */
+    }
 
 /* Set calculated checksum to checksum word */
 iov_from_buf(pkt->vec, pkt->vec_len,
diff --git a/hw/net/net_tx_pkt.c b/hw/net/net_tx_pkt.c
index 20b2549..21194e7 100644
--- a/hw/net/net_tx_pkt.c
+++ b/hw/net/net_tx_pkt.c
@@ -136,6 +136,9 @@ void net_tx_pkt_update_ip_checksums(struct NetTxPkt *pkt)
 return;
 }
 
+if (!csum) {
+csum = 0xFFFF; /* For UDP, zero checksum must be sent as 0xFFFF */
+}
 iov_from_buf(&pkt->vec[NET_TX_PKT_PL_START_FRAG], pkt->payload_frags,
  pkt->virt_hdr.csum_offset, &csum, sizeof(csum));
 }
@@ -487,6 +490,9 @@ static void net_tx_pkt_do_sw_csum(struct NetTxPkt *pkt)
 
 /* Put the checksum obtained into the packet */
 csum = cpu_to_be16(net_checksum_finish(csum_cntr));
+if (!csum) {
+csum = 0xFFFF; /* For UDP, zero checksum must be sent as 0xFFFF */
+}
 iov_from_buf(iov, iov_len, csum_offset, &csum, sizeof csum);
 }
 
diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 90f6943..de9c40e 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -942,6 +942,7 @@ static void vmxnet3_rx_need_csum_calculate(struct NetRxPkt *pkt,
 bool isip4, isip6, istcp, isudp;
 uint8_t *data;
 int len;
+uint16_t sum;
 
 if (!net_rx_pkt_has_virt_hdr(pkt)) {
 return;
@@ -969,8 +970,10 @@ static void vmxnet3_rx_need_csum_calculate(struct NetRxPkt *pkt,
 
 data = (uint8_t *)pkt_data + vhdr->csum_start;
 len = pkt_len - vhdr->csum_start;
-/* Put the checksum obtained into the packet */
-stw_be_p(data + vhdr->csum_offset, net_raw_checksum(data, len));
+sum = net_raw_checksum(data, len);
+/* Put the checksum obtained into the packet; for UDP, zero checksum */
+/* must be sent as 0xFFFF */
+stw_be_p(data + vhdr->csum_offset, sum ? sum : 0xFFFF);
 
 vhdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM;
 vhdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
-- 
1.9.1




[Qemu-devel] [PATCH 2/2] e1000: Separate TSO and non-TSO contexts, fixing UDP TX corruption

2017-11-14 Thread Ed Swierk via Qemu-devel
The device is supposed to maintain two distinct contexts for transmit
offloads: one has parameters for both segmentation and checksum
offload, the other only for checksum offload. The guest driver can
send two context descriptors, one for each context (the TSE flag
specifies which). Then the guest can refer to one or the other context
in subsequent transmit data descriptors, depending on what offloads it
wants applied to each packet.

Currently the e1000 device stores just one context, and misinterprets
the TSE flags in the context and data descriptors. This is often okay:
Linux happens to send a fresh context descriptor before every data
descriptor, so forgetting the other context doesn't matter. Windows
does rely on separate contexts for TSO vs. non-TSO packets, but for
mostly-TCP traffic the two contexts have identical TCP-specific
offload parameters so confusing them doesn't matter.

One case where this confusion matters is when a Windows guest sets up
a TSO context for TCP and a non-TSO context for UDP, and then
transmits both TCP and UDP traffic in parallel. The e1000 device
sometimes ends up using TCP-specific parameters while doing checksum
offload on a UDP datagram: it writes the checksum to offset 16 (the
correct location for a TCP checksum), stomping on two bytes of UDP
data, and leaving the wrong value in the actual UDP checksum field at
offset 6. (Even worse, the host network stack may then recompute the
UDP checksum, "correcting" it to match the corrupt data before sending
it out a physical interface.)

Correct this by tracking the TSO context independently of the non-TSO
context, and selecting the appropriate context based on the TSE flag
in each transmit data descriptor.
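The two-context scheme can be sketched independently of the device model. This is a simplified illustration with invented struct names; the real code uses e1000x_txd_props and keys both the context-descriptor write and the data-descriptor read off the TSE flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for e1000x_txd_props: just the L4 checksum offset. */
typedef struct {
    uint8_t tucso; /* offset where the L4 checksum is written */
} txd_props;

typedef struct {
    txd_props props;     /* non-TSO (checksum-only) context */
    txd_props tso_props; /* TSO context */
} tx_state;

/* A context descriptor updates one of the two contexts based on TSE. */
static void rx_context_descr(tx_state *tx, bool tse, uint8_t tucso)
{
    txd_props *p = tse ? &tx->tso_props : &tx->props;
    p->tucso = tucso;
}

/* A data descriptor selects the matching context via its own TSE bit. */
static const txd_props *select_context(const tx_state *tx, bool cptse)
{
    return cptse ? &tx->tso_props : &tx->props;
}
```

With the single-context bug, a TSO context arriving after the UDP context would clobber it, so a non-TSO UDP packet could be checksummed at the TCP offset (16) instead of the UDP offset (6); keeping both contexts avoids that.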

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 hw/net/e1000.c | 70 +-
 1 file changed, 40 insertions(+), 30 deletions(-)

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index 471cdd9..d642314 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -101,6 +101,7 @@ typedef struct E1000State_st {
 unsigned char sum_needed;
 bool cptse;
 e1000x_txd_props props;
+e1000x_txd_props tso_props;
 uint16_t tso_frames;
 } tx;
 
@@ -541,35 +542,37 @@ xmit_seg(E1000State *s)
 uint16_t len;
 unsigned int frames = s->tx.tso_frames, css, sofar;
 struct e1000_tx *tp = &s->tx;
+struct e1000x_txd_props *props = tp->cptse ? &tp->tso_props : &tp->props;
 
-if (tp->props.tse && tp->cptse) {
-css = tp->props.ipcss;
+if (tp->cptse) {
+css = props->ipcss;
 DBGOUT(TXSUM, "frames %d size %d ipcss %d\n",
frames, tp->size, css);
-if (tp->props.ip) {/* IPv4 */
+if (props->ip) {/* IPv4 */
 stw_be_p(tp->data+css+2, tp->size - css);
 stw_be_p(tp->data+css+4,
  lduw_be_p(tp->data + css + 4) + frames);
 } else { /* IPv6 */
 stw_be_p(tp->data+css+4, tp->size - css);
 }
-css = tp->props.tucss;
+css = props->tucss;
 len = tp->size - css;
-DBGOUT(TXSUM, "tcp %d tucss %d len %d\n", tp->props.tcp, css, len);
-if (tp->props.tcp) {
-sofar = frames * tp->props.mss;
+DBGOUT(TXSUM, "tcp %d tucss %d len %d\n", props->tcp, css, len);
+if (props->tcp) {
+sofar = frames * props->mss;
 stl_be_p(tp->data+css+4, ldl_be_p(tp->data+css+4)+sofar); /* seq */
-if (tp->props.paylen - sofar > tp->props.mss) {
+if (props->paylen - sofar > props->mss) {
 tp->data[css + 13] &= ~9;/* PSH, FIN */
 } else if (frames) {
 e1000x_inc_reg_if_not_full(s->mac_reg, TSCTC);
 }
-} else/* UDP */
+} else {/* UDP */
 stw_be_p(tp->data+css+4, len);
+}
 if (tp->sum_needed & E1000_TXD_POPTS_TXSM) {
 unsigned int phsum;
 // add pseudo-header length before checksum calculation
-void *sp = tp->data + tp->props.tucso;
+void *sp = tp->data + props->tucso;
 
 phsum = lduw_be_p(sp) + len;
 phsum = (phsum >> 16) + (phsum & 0xffff);
@@ -579,12 +582,10 @@ xmit_seg(E1000State *s)
 }
 
 if (tp->sum_needed & E1000_TXD_POPTS_TXSM) {
-putsum(tp->data, tp->size, tp->props.tucso,
-   tp->props.tucss, tp->props.tucse);
+putsum(tp->data, tp->size, props->tucso, props->tucss, props->tucse);
 }
 if (tp->sum_needed & E1000_TXD_POPTS_IXSM) {
-putsum(tp->data, tp->size, tp->props.ipcso,
-   tp->props.ipcss, tp->props.ipcse);
+putsum(tp->

[Qemu-devel] [PATCH 1/2] e1000, e1000e: Move per-packet TX offload flags out of context state

2017-11-14 Thread Ed Swierk via Qemu-devel
sum_needed and cptse flags are received from the guest within each
transmit data descriptor. They are not part of the offload context;
instead, they determine how to apply a previously received context to
the packet being transmitted:

- If cptse is set, perform both segmentation and checksum offload
  using the parameters in the TSO context; otherwise just do checksum
  offload. (Currently the e1000 device incorrectly stores only one
  context, which will be fixed in a subsequent patch.)

- Depending on the bits set in sum_needed, possibly perform L4
  checksum offload and/or IP checksum offload, using the parameters in
  the appropriate context.

Move these flags out of struct e1000x_txd_props, which is otherwise
dedicated to storing values from a context descriptor, and into the
per-packet TX struct.
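As a sketch of where these per-packet flags come from: bit values below follow the 8254x software developer's manual, but the descriptor struct is simplified for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Descriptor bits per the 8254x spec. */
#define E1000_TXD_CMD_TSE    0x04000000 /* TCP segmentation enable */
#define E1000_TXD_POPTS_IXSM 0x01       /* insert IP checksum */
#define E1000_TXD_POPTS_TXSM 0x02       /* insert TCP/UDP checksum */

/* Simplified transmit data descriptor. */
typedef struct {
    uint32_t lower; /* command field, including TSE */
    uint32_t upper; /* POPTS live in bits 15:8 */
} tx_data_desc;

/* sum_needed: which checksums to insert, from the POPTS byte. */
static uint8_t desc_sum_needed(const tx_data_desc *d)
{
    return (uint8_t)(d->upper >> 8);
}

/* cptse: whether this packet uses the TSO context. */
static bool desc_cptse(const tx_data_desc *d)
{
    return (d->lower & E1000_TXD_CMD_TSE) != 0;
}
```

Both values describe the packet in flight rather than a stored context, which is why they belong in the per-packet TX struct.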

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 hw/net/e1000.c | 30 --
 hw/net/e1000e.c|  4 ++--
 hw/net/e1000e_core.c   | 16 
 hw/net/e1000e_core.h   |  2 ++
 hw/net/e1000x_common.h |  2 --
 5 files changed, 28 insertions(+), 26 deletions(-)

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index 9324949..471cdd9 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -98,6 +98,8 @@ typedef struct E1000State_st {
 unsigned char data[0x10000];
 uint16_t size;
 unsigned char vlan_needed;
+unsigned char sum_needed;
+bool cptse;
 e1000x_txd_props props;
 uint16_t tso_frames;
 } tx;
@@ -540,7 +542,7 @@ xmit_seg(E1000State *s)
 unsigned int frames = s->tx.tso_frames, css, sofar;
 struct e1000_tx *tp = &s->tx;
 
-if (tp->props.tse && tp->props.cptse) {
+if (tp->props.tse && tp->cptse) {
 css = tp->props.ipcss;
 DBGOUT(TXSUM, "frames %d size %d ipcss %d\n",
frames, tp->size, css);
@@ -564,7 +566,7 @@ xmit_seg(E1000State *s)
 }
 } else/* UDP */
 stw_be_p(tp->data+css+4, len);
-if (tp->props.sum_needed & E1000_TXD_POPTS_TXSM) {
+if (tp->sum_needed & E1000_TXD_POPTS_TXSM) {
 unsigned int phsum;
 // add pseudo-header length before checksum calculation
 void *sp = tp->data + tp->props.tucso;
@@ -576,11 +578,11 @@ xmit_seg(E1000State *s)
 tp->tso_frames++;
 }
 
-if (tp->props.sum_needed & E1000_TXD_POPTS_TXSM) {
+if (tp->sum_needed & E1000_TXD_POPTS_TXSM) {
 putsum(tp->data, tp->size, tp->props.tucso,
tp->props.tucss, tp->props.tucse);
 }
-if (tp->props.sum_needed & E1000_TXD_POPTS_IXSM) {
+if (tp->sum_needed & E1000_TXD_POPTS_IXSM) {
 putsum(tp->data, tp->size, tp->props.ipcso,
tp->props.ipcss, tp->props.ipcse);
 }
@@ -624,17 +626,17 @@ process_tx_desc(E1000State *s, struct e1000_tx_desc *dp)
 } else if (dtype == (E1000_TXD_CMD_DEXT | E1000_TXD_DTYP_D)) {
 // data descriptor
 if (tp->size == 0) {
-tp->props.sum_needed = le32_to_cpu(dp->upper.data) >> 8;
+tp->sum_needed = le32_to_cpu(dp->upper.data) >> 8;
 }
-tp->props.cptse = (txd_lower & E1000_TXD_CMD_TSE) ? 1 : 0;
+tp->cptse = (txd_lower & E1000_TXD_CMD_TSE) ? 1 : 0;
 } else {
 // legacy descriptor
-tp->props.cptse = 0;
+tp->cptse = 0;
 }
 
 if (e1000x_vlan_enabled(s->mac_reg) &&
 e1000x_is_vlan_txd(txd_lower) &&
-(tp->props.cptse || txd_lower & E1000_TXD_CMD_EOP)) {
+(tp->cptse || txd_lower & E1000_TXD_CMD_EOP)) {
 tp->vlan_needed = 1;
 stw_be_p(tp->vlan_header,
   le16_to_cpu(s->mac_reg[VET]));
@@ -643,7 +645,7 @@ process_tx_desc(E1000State *s, struct e1000_tx_desc *dp)
 }
 
 addr = le64_to_cpu(dp->buffer_addr);
-if (tp->props.tse && tp->props.cptse) {
+if (tp->props.tse && tp->cptse) {
 msh = tp->props.hdr_len + tp->props.mss;
 do {
 bytes = split_size;
@@ -665,7 +667,7 @@ process_tx_desc(E1000State *s, struct e1000_tx_desc *dp)
 }
 split_size -= bytes;
 } while (bytes && split_size);
-} else if (!tp->props.tse && tp->props.cptse) {
+} else if (!tp->props.tse && tp->cptse) {
 // context descriptor TSE is not set, while data descriptor TSE is set
 DBGOUT(TXERR, "TCP segmentation error\n");
 } else {
@@ -676,14 +678,14 @@ process_tx_desc(E1000State *s, struct e1000_tx_desc *dp)
 
 if (!(txd_lower & E1000_TXD_CMD_EOP))
 return;
-if (!(tp->props.tse && tp->props.cptse && tp->size < tp->props.hdr_len)) {
+  

[Qemu-devel] [PATCH 0/2] e1000: Correct TX offload context handling

2017-11-14 Thread Ed Swierk via Qemu-devel


This patch series first moves per-TX packet flags out of the context
struct, which is used by both e1000 and e1000e. ("Used" is generous,
as e1000e ignores most of the context parameters supplied by the guest
and does its own thing, which is only sometimes correct. Taming e1000e
is a project for another day. The change to e1000e here is a baby step
in that direction.)

Then we separate the e1000 TSO and non-TSO contexts, which fixes the
UDP TX corruption bug.

Ed Swierk (2):
  e1000, e1000e: Move per-packet TX offload flags out of context state
  e1000: Separate TSO and non-TSO contexts, fixing UDP TX corruption

 hw/net/e1000.c | 92 --
 hw/net/e1000e.c|  4 +--
 hw/net/e1000e_core.c   | 16 -
 hw/net/e1000e_core.h   |  2 ++
 hw/net/e1000x_common.h |  2 --
 5 files changed, 64 insertions(+), 52 deletions(-)

-- 
1.9.1




[Qemu-devel] [PATCH v2 2/2] e1000e: Add e1000-ng devices

2017-11-08 Thread Ed Swierk via Qemu-devel
Implement e1000-compatible devices using the reworked e1000e code:

- e1000-ng: Intel 82540EM
- e1000-82544gc-ng: Intel 82544GC
- e1000-82545em-ng: Intel 82545EM

From a guest's perspective, these should be drop-in replacements for
the existing e1000 devices.

This version has undergone minimal testing, but should work well
enough to start experimenting with e1000-ng devices with a variety of
guest OSes.

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 hw/net/e1000e.c  | 111 +--
 hw/net/e1000e_core.c | 120 ++-
 hw/net/e1000e_core.h |  12 ++
 3 files changed, 229 insertions(+), 14 deletions(-)

diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
index 7685ab0..9ae5105 100644
--- a/hw/net/e1000e.c
+++ b/hw/net/e1000e.c
@@ -281,6 +281,17 @@ static const uint16_t e1000e_eeprom_template[64] = {
 0x, 0x, 0x, 0x, 0x, 0x0120, 0x, 0x,
 };
 
+static const uint16_t e1000_eeprom_template[64] = {
+0x, 0x, 0x, 0x,  0x, 0x,  0x, 0x,
+0x3000, 0x1000, 0x6403, 0 /*DevId*/, 0x8086, 0 /*DevId*/, 0x8086, 0x3040,
+0x0008, 0x2000, 0x7e14, 0x0048,  0x1000, 0x00d8,  0x, 0x2700,
+0x6cc9, 0x3150, 0x0722, 0x040b,  0x0984, 0x,  0xc000, 0x0706,
+0x1008, 0x, 0x0f04, 0x7fff,  0x4d01, 0x,  0x, 0x,
+0x, 0x, 0x, 0x,  0x, 0x,  0x, 0x,
+0x0100, 0x4000, 0x121c, 0x,  0x, 0x,  0x, 0x,
+0x, 0x, 0x, 0x,  0x, 0x,  0x, 0x,
+};
+
 static void e1000e_core_realize(E1000EState *s)
 {
 s->core.owner = >parent_obj;
@@ -644,8 +655,8 @@ static const VMStateDescription e1000e_vmstate_intr_timer = {
 
 static const VMStateDescription e1000e_vmstate = {
 .name = "e1000e",
-.version_id = 1,
-.minimum_version_id = 1,
+.version_id = 2,
+.minimum_version_id = 2,
 .pre_save = e1000e_pre_save,
 .post_load = e1000e_post_load,
 .fields = (VMStateField[]) {
@@ -691,6 +702,61 @@ static const VMStateDescription e1000e_vmstate = {
 
 VMSTATE_STRUCT_ARRAY(core.tx, E1000EState, E1000E_NUM_QUEUES, 0,
  e1000e_vmstate_tx, struct e1000e_tx),
+
+VMSTATE_UINT32(core.eecd_state.val_in, E1000EState),
+VMSTATE_UINT16(core.eecd_state.bitnum_in, E1000EState),
+VMSTATE_UINT16(core.eecd_state.bitnum_out, E1000EState),
+VMSTATE_UINT16(core.eecd_state.reading, E1000EState),
+VMSTATE_UINT32(core.eecd_state.old_eecd, E1000EState),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static const VMStateDescription e1000_vmstate = {
+.name = "e1000-ng",
+.version_id = 1,
+.minimum_version_id = 1,
+.pre_save = e1000e_pre_save,
+.post_load = e1000e_post_load,
+.fields = (VMStateField[]) {
+VMSTATE_PCI_DEVICE(parent_obj, E1000EState),
+
+VMSTATE_UINT32(ioaddr, E1000EState),
+VMSTATE_UINT32(core.rxbuf_min_shift, E1000EState),
+VMSTATE_UINT8(core.rx_desc_len, E1000EState),
+VMSTATE_UINT32_ARRAY(core.rxbuf_sizes, E1000EState,
+ E1000_PSRCTL_BUFFS_PER_DESC),
+VMSTATE_UINT32(core.rx_desc_buf_size, E1000EState),
+VMSTATE_UINT16_ARRAY(core.eeprom, E1000EState, E1000E_EEPROM_SIZE),
+VMSTATE_UINT16_2DARRAY(core.phy, E1000EState,
+   E1000E_PHY_PAGES, E1000E_PHY_PAGE_SIZE),
+VMSTATE_UINT32_ARRAY(core.mac, E1000EState, E1000E_MAC_SIZE),
+VMSTATE_UINT8_ARRAY(core.permanent_mac, E1000EState, ETH_ALEN),
+
+VMSTATE_UINT32(core.delayed_causes, E1000EState),
+
+VMSTATE_UINT16(subsys, E1000EState),
+VMSTATE_UINT16(subsys_ven, E1000EState),
+
+VMSTATE_E1000E_INTR_DELAY_TIMER(core.rdtr, E1000EState),
+VMSTATE_E1000E_INTR_DELAY_TIMER(core.radv, E1000EState),
+VMSTATE_E1000E_INTR_DELAY_TIMER(core.raid, E1000EState),
+VMSTATE_E1000E_INTR_DELAY_TIMER(core.tadv, E1000EState),
+VMSTATE_E1000E_INTR_DELAY_TIMER(core.tidv, E1000EState),
+
+VMSTATE_E1000E_INTR_DELAY_TIMER(core.itr, E1000EState),
+VMSTATE_BOOL(core.itr_intr_pending, E1000EState),
+
+VMSTATE_UINT16(core.vet, E1000EState),
+
+VMSTATE_STRUCT_ARRAY(core.tx, E1000EState, E1000E_NUM_QUEUES, 0,
+ e1000e_vmstate_tx, struct e1000e_tx),
+
+VMSTATE_UINT32(core.eecd_state.val_in, E1000EState),
+VMSTATE_UINT16(core.eecd_state.bitnum_in, E1000EState),
+VMSTATE_UINT16(core.eecd_state.bitnum_out, E1000EState),
+VMSTATE_UINT16(core.eecd_state.reading, E1000EState),
+VMSTATE_UINT32(core.eecd_state.old_eecd, E1000EState),
 VMSTATE_END_OF_LIST()
 }
 };
@@ -730,7 +796,7 @@ static void e1000e_class_init(ObjectClass *class, void *data)
 
 dc

[Qemu-devel] [PATCH v2 0/2] e1000e: Reimplement e1000 as a variant of e1000e

2017-11-08 Thread Ed Swierk via Qemu-devel
From a hardware functionality and register perspective, the Intel
8254x Gigabit Ethernet Controller[1] is roughly a subset of the Intel
82574[2], making it practical to implement e1000 device emulation as
an extension of e1000e.

That would be a step towards eliminating the existing e1000
implementation--a bunch of semi-redundant code with its own bugs (like
the faulty tx checksum offload I reported recently[3]).

[1] 
https://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf
[2] 
https://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf
[3] https://lists.nongnu.org/archive/html/qemu-devel/2017-10/msg03008.html

This patch series adds a new e1000-ng device but leaves the existing
e1000 device alone. Linux 4.1 and Windows 10 e1000 guest drivers
recognize the e1000-ng device and successfully pass traffic on a Linux
host.

Preliminary performance measurements are encouraging: with e1000-ng,
single-threaded TCP iperf shows about a 2x speedup in tx and rx
throughput vs e1000, and CPU usage is slightly lower.

A ton of work is needed before e1000-ng will be ready to supplant
e1000: testing with every random guest OS, fixing bugs, figuring out
migration and other missing functionality. There's no way I can do
this myself.

I'd like to propose a more modest and achievable short-term goal: find
and fix any regressions that might affect users of e1000e and e1000,
so that the patches can be applied and folks can easily start trying
out e1000-ng with their favorite guest OS. That means accepting (for
now) known deficiencies in the new e1000-ng devices, which can be
addressed by myself and (hopefully) others in follow-up patches. Does
that sound reasonable?

v2:
- added new eecd_state fields to e1000e_vmstate and e1000_vmstate
- got savevm/loadvm working with e1000-ng
- leave MSI enabled for e1000-ng (though guest drivers seem to ignore it)

Ed Swierk (2):
  e1000e: Infrastructure for e1000-ng
  e1000e: Add e1000-ng devices

 hw/net/e1000e.c  | 277 +++
 hw/net/e1000e_core.c | 249 +
 hw/net/e1000e_core.h |  25 -
 3 files changed, 437 insertions(+), 114 deletions(-)

-- 
1.9.1




[Qemu-devel] [PATCH v2 1/2] e1000e: Infrastructure for e1000-ng

2017-11-08 Thread Ed Swierk via Qemu-devel
Generalize e1000e to support e1000 and other devices with a similar
register spec:

- plain PCI instead of PCIe, skipping setup of MSI-X, AER, etc.
- model-specific PHY ID2, register values and access permissions
- model-specific EEPROM template and read/write methods

This is just infrastructure, no functional changes.
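The shape of the resulting table-driven registration can be sketched as follows. This is a toy version with invented names; the real code hangs an E1000EInfo pointer off E1000EBaseClass and registers one QOM type per entry:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One info struct per device model, consumed by shared realize/init
 * code. Device IDs here match the Intel parts named in the series. */
typedef struct {
    const char *name;
    unsigned short device_id;
    int is_express; /* PCIe (MSI-X, AER, PM caps) vs plain PCI */
} model_info;

static const model_info models[] = {
    { "e1000e",   0x10D3, 1 }, /* 82574L: PCIe */
    { "e1000-ng", 0x100E, 0 }, /* 82540EM: plain PCI */
};

/* Look up a model by type name, as a registration loop would. */
static const model_info *find_model(const char *name)
{
    for (size_t i = 0; i < sizeof(models) / sizeof(models[0]); i++) {
        if (strcmp(models[i].name, name) == 0) {
            return &models[i];
        }
    }
    return NULL;
}
```

The realize path then branches on is_express to decide whether to set up MSI-X, AER, and the other PCIe-only capabilities.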

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 hw/net/e1000e.c  | 166 ++-
 hw/net/e1000e_core.c | 131 +++-
 hw/net/e1000e_core.h |  13 ++--
 3 files changed, 209 insertions(+), 101 deletions(-)

diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
index f1af279..7685ab0 100644
--- a/hw/net/e1000e.c
+++ b/hw/net/e1000e.c
@@ -49,8 +49,13 @@
 #include "trace.h"
 #include "qapi/error.h"
 
-#define TYPE_E1000E "e1000e"
-#define E1000E(obj) OBJECT_CHECK(E1000EState, (obj), TYPE_E1000E)
+#define TYPE_E1000E_BASE "e1000e-base"
+#define E1000E(obj) \
+OBJECT_CHECK(E1000EState, (obj), TYPE_E1000E_BASE)
+#define E1000E_DEVICE_CLASS(klass) \
+OBJECT_CLASS_CHECK(E1000EBaseClass, (klass), TYPE_E1000E_BASE)
+#define E1000E_DEVICE_GET_CLASS(obj) \
+OBJECT_GET_CLASS(E1000EBaseClass, (obj), TYPE_E1000E_BASE)
 
 typedef struct E1000EState {
 PCIDevice parent_obj;
@@ -76,6 +81,28 @@ typedef struct E1000EState {
 
 } E1000EState;
 
+typedef struct E1000EInfo {
+const char *name;
+const char *desc;
+
+uint16_t device_id;
+uint8_t revision;
+uint16_t subsystem_vendor_id;
+uint16_t subsystem_id;
+int is_express;
+const char *romfile;
+
+const uint16_t *eeprom_templ;
+uint32_t eeprom_size;
+
+uint16_t phy_id2;
+} E1000EInfo;
+
+typedef struct E1000EBaseClass {
+PCIDeviceClass parent_class;
+const E1000EInfo *info;
+} E1000EBaseClass;
+
 #define E1000E_MMIO_IDX 0
 #define E1000E_FLASH_IDX1
 #define E1000E_IO_IDX   2
@@ -416,7 +443,9 @@ static void e1000e_pci_realize(PCIDevice *pci_dev, Error **errp)
 static const uint16_t e1000e_pcie_offset = 0x0E0;
 static const uint16_t e1000e_aer_offset =  0x100;
 static const uint16_t e1000e_dsn_offset =  0x140;
+PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
 E1000EState *s = E1000E(pci_dev);
+E1000EBaseClass *edc = E1000E_DEVICE_GET_CLASS(s);
 uint8_t *macaddr;
 int ret;
 
@@ -427,6 +456,13 @@ static void e1000e_pci_realize(PCIDevice *pci_dev, Error **errp)
 pci_dev->config[PCI_CACHE_LINE_SIZE] = 0x10;
 pci_dev->config[PCI_INTERRUPT_PIN] = 1;
 
+if (s->subsys_ven == (uint16_t)-1) {
+s->subsys_ven = pci_get_word(pci_dev->config + 
PCI_SUBSYSTEM_VENDOR_ID);
+}
+if (s->subsys == (uint16_t)-1) {
+s->subsys = pci_get_word(pci_dev->config + PCI_SUBSYSTEM_ID);
+}
+
 pci_set_word(pci_dev->config + PCI_SUBSYSTEM_VENDOR_ID, s->subsys_ven);
 pci_set_word(pci_dev->config + PCI_SUBSYSTEM_ID, s->subsys);
 
@@ -453,19 +489,23 @@ static void e1000e_pci_realize(PCIDevice *pci_dev, Error **errp)
 pci_register_bar(pci_dev, E1000E_IO_IDX,
  PCI_BASE_ADDRESS_SPACE_IO, >io);
 
-memory_region_init(>msix, OBJECT(s), "e1000e-msix",
-   E1000E_MSIX_SIZE);
-pci_register_bar(pci_dev, E1000E_MSIX_IDX,
- PCI_BASE_ADDRESS_SPACE_MEMORY, >msix);
+if (pc->is_express) {
+memory_region_init(>msix, OBJECT(s), "e1000e-msix",
+   E1000E_MSIX_SIZE);
+pci_register_bar(pci_dev, E1000E_MSIX_IDX,
+ PCI_BASE_ADDRESS_SPACE_MEMORY, >msix);
+}
 
 /* Create networking backend */
 qemu_macaddr_default_if_unset(>conf.macaddr);
 macaddr = s->conf.macaddr.a;
 
-e1000e_init_msix(s);
+if (pc->is_express) {
+e1000e_init_msix(s);
 
-if (pcie_endpoint_cap_v1_init(pci_dev, e1000e_pcie_offset) < 0) {
-hw_error("Failed to initialize PCIe capability");
+if (pcie_endpoint_cap_v1_init(pci_dev, e1000e_pcie_offset) < 0) {
+hw_error("Failed to initialize PCIe capability");
+}
 }
 
 ret = msi_init(PCI_DEVICE(s), 0xD0, 1, true, false, NULL);
@@ -473,18 +513,20 @@ static void e1000e_pci_realize(PCIDevice *pci_dev, Error **errp)
 trace_e1000e_msi_init_fail(ret);
 }
 
-if (e1000e_add_pm_capability(pci_dev, e1000e_pmrb_offset,
-  PCI_PM_CAP_DSI) < 0) {
-hw_error("Failed to initialize PM capability");
-}
+if (pc->is_express) {
+if (e1000e_add_pm_capability(pci_dev, e1000e_pmrb_offset,
+ PCI_PM_CAP_DSI) < 0) {
+hw_error("Failed to initialize PM capability");
+}
 
-if (pcie_aer_init(pci_dev, PCI_ERR_VER, e1000e_aer_offset,

[Qemu-devel] [PATCH 1/2] e1000e: Infrastructure for e1000-ng

2017-10-26 Thread Ed Swierk via Qemu-devel
Generalize e1000e to support e1000 and other devices with a similar
register spec:

- plain PCI instead of PCIe, skipping setup of MSI-X, AER, etc.
- model-specific PHY ID2, register values and access permissions
- model-specific EEPROM template and read/write methods

This is just infrastructure, no functional changes.

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 hw/net/e1000e.c  | 186 ---
 hw/net/e1000e_core.c | 131 
 hw/net/e1000e_core.h |  13 ++--
 3 files changed, 217 insertions(+), 113 deletions(-)

diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
index f1af279..100979c 100644
--- a/hw/net/e1000e.c
+++ b/hw/net/e1000e.c
@@ -49,8 +49,13 @@
 #include "trace.h"
 #include "qapi/error.h"
 
-#define TYPE_E1000E "e1000e"
-#define E1000E(obj) OBJECT_CHECK(E1000EState, (obj), TYPE_E1000E)
+#define TYPE_E1000E_BASE "e1000e-base"
+#define E1000E(obj) \
+OBJECT_CHECK(E1000EState, (obj), TYPE_E1000E_BASE)
+#define E1000E_DEVICE_CLASS(klass) \
+OBJECT_CLASS_CHECK(E1000EBaseClass, (klass), TYPE_E1000E_BASE)
+#define E1000E_DEVICE_GET_CLASS(obj) \
+OBJECT_GET_CLASS(E1000EBaseClass, (obj), TYPE_E1000E_BASE)
 
 typedef struct E1000EState {
 PCIDevice parent_obj;
@@ -76,6 +81,28 @@ typedef struct E1000EState {
 
 } E1000EState;
 
+typedef struct E1000EInfo {
+const char *name;
+const char *desc;
+
+uint16_t device_id;
+uint8_t revision;
+uint16_t subsystem_vendor_id;
+uint16_t subsystem_id;
+int is_express;
+const char *romfile;
+
+const uint16_t *eeprom_templ;
+uint32_t eeprom_size;
+
+uint16_t phy_id2;
+} E1000EInfo;
+
+typedef struct E1000EBaseClass {
+PCIDeviceClass parent_class;
+const E1000EInfo *info;
+} E1000EBaseClass;
+
 #define E1000E_MMIO_IDX 0
 #define E1000E_FLASH_IDX1
 #define E1000E_IO_IDX   2
@@ -416,7 +443,9 @@ static void e1000e_pci_realize(PCIDevice *pci_dev, Error **errp)
 static const uint16_t e1000e_pcie_offset = 0x0E0;
 static const uint16_t e1000e_aer_offset =  0x100;
 static const uint16_t e1000e_dsn_offset =  0x140;
+PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
 E1000EState *s = E1000E(pci_dev);
+E1000EBaseClass *edc = E1000E_DEVICE_GET_CLASS(s);
 uint8_t *macaddr;
 int ret;
 
@@ -427,11 +456,16 @@ static void e1000e_pci_realize(PCIDevice *pci_dev, Error **errp)
 pci_dev->config[PCI_CACHE_LINE_SIZE] = 0x10;
 pci_dev->config[PCI_INTERRUPT_PIN] = 1;
 
-pci_set_word(pci_dev->config + PCI_SUBSYSTEM_VENDOR_ID, s->subsys_ven);
-pci_set_word(pci_dev->config + PCI_SUBSYSTEM_ID, s->subsys);
+if (s->subsys_ven != (uint16_t)-1) {
+pci_set_word(pci_dev->config + PCI_SUBSYSTEM_VENDOR_ID, s->subsys_ven);
+}
+if (s->subsys != (uint16_t)-1) {
+pci_set_word(pci_dev->config + PCI_SUBSYSTEM_ID, s->subsys);
+}
 
-s->subsys_ven_used = s->subsys_ven;
-s->subsys_used = s->subsys;
+s->subsys_ven_used = pci_get_word(pci_dev->config
+  + PCI_SUBSYSTEM_VENDOR_ID);
+s->subsys_used = pci_get_word(pci_dev->config + PCI_SUBSYSTEM_ID);
 
 /* Define IO/MMIO regions */
 memory_region_init_io(>mmio, OBJECT(s), _ops, s,
@@ -453,65 +487,75 @@ static void e1000e_pci_realize(PCIDevice *pci_dev, Error **errp)
 pci_register_bar(pci_dev, E1000E_IO_IDX,
  PCI_BASE_ADDRESS_SPACE_IO, >io);
 
-memory_region_init(>msix, OBJECT(s), "e1000e-msix",
-   E1000E_MSIX_SIZE);
-pci_register_bar(pci_dev, E1000E_MSIX_IDX,
- PCI_BASE_ADDRESS_SPACE_MEMORY, >msix);
+if (pc->is_express) {
+memory_region_init(>msix, OBJECT(s), "e1000e-msix",
+   E1000E_MSIX_SIZE);
+pci_register_bar(pci_dev, E1000E_MSIX_IDX,
+ PCI_BASE_ADDRESS_SPACE_MEMORY, >msix);
+}
 
 /* Create networking backend */
 qemu_macaddr_default_if_unset(>conf.macaddr);
 macaddr = s->conf.macaddr.a;
 
-e1000e_init_msix(s);
+if (pc->is_express) {
+e1000e_init_msix(s);
 
-if (pcie_endpoint_cap_v1_init(pci_dev, e1000e_pcie_offset) < 0) {
-hw_error("Failed to initialize PCIe capability");
+if (pcie_endpoint_cap_v1_init(pci_dev, e1000e_pcie_offset) < 0) {
+hw_error("Failed to initialize PCIe capability");
+}
+
+ret = msi_init(PCI_DEVICE(s), 0xD0, 1, true, false, NULL);
+if (ret) {
+trace_e1000e_msi_init_fail(ret);
+}
+
+if (e1000e_add_pm_capability(pci_dev, e1000e_pmrb_offset,
+ PCI_PM_CAP_DSI) < 0) {
+hw_error("Failed to initialize PM capability");

[Qemu-devel] [PATCH 0/2] e1000e: Reimplement e1000 as a variant of e1000e

2017-10-26 Thread Ed Swierk via Qemu-devel
From a hardware functionality and register perspective, the Intel
8254x Gigabit Ethernet Controller[1] is roughly a subset of the Intel
82574[2], making it practical to implement e1000 device emulation as
an extension of e1000e.

That would be a step towards eliminating the existing e1000
implementation--a bunch of semi-redundant code with its own bugs (like
the faulty tx checksum offload I reported recently[3]).

[1] 
https://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf
[2] 
https://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf
[3] https://lists.nongnu.org/archive/html/qemu-devel/2017-10/msg03008.html

This patch series adds a new e1000-ng device but leaves the existing
e1000 device alone. Linux 4.1 and Windows 10 e1000 guest drivers
recognize the e1000-ng device and successfully pass traffic on a
Linux host.

Preliminary performance measurements are encouraging: with e1000-ng,
single-threaded TCP iperf shows about a 2x speedup in tx and rx
throughput vs e1000, and CPU usage is slightly lower.

A ton of work is needed before e1000-ng will be ready to supplant
e1000: testing with every random guest OS, fixing bugs, figuring out
migration and other missing functionality. There's no way I can do
this myself.

I'd like to propose a more modest and achievable short-term goal: find
and fix any regressions that might affect users of e1000e and e1000,
so that the patches can be applied and folks can easily start trying
out e1000-ng with their favorite guest OS. That means accepting (for
now) known deficiencies in the new e1000-ng devices, which can be
addressed by myself and (hopefully) others in follow-up patches. Does
that sound reasonable?

Ed Swierk (2):
  e1000e: Infrastructure for e1000 devices
  e1000e: Add e1000 devices

 hw/net/e1000e.c  | 236 +---
 hw/net/e1000e_core.c | 249 +--
 hw/net/e1000e_core.h |  25 +-
 3 files changed, 387 insertions(+), 123 deletions(-)

-- 
1.9.1




[Qemu-devel] [PATCH 2/2] e1000e: Add e1000-ng devices

2017-10-26 Thread Ed Swierk via Qemu-devel
Implement e1000-compatible devices using the reworked e1000e code:

- e1000-ng: Intel 82540EM
- e1000-82544gc-ng: Intel 82544GC
- e1000-82545em-ng: Intel 82545EM

From a guest's perspective, these should be drop-in replacements for
the existing e1000 devices.

Among other deficiencies, this version does not handle migration. But
it should work well enough to start testing e1000-ng devices with a
variety of guest OSes.

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 hw/net/e1000e.c  |  50 +
 hw/net/e1000e_core.c | 120 ++-
 hw/net/e1000e_core.h |  12 ++
 3 files changed, 171 insertions(+), 11 deletions(-)

diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
index 100979c..5f75a41 100644
--- a/hw/net/e1000e.c
+++ b/hw/net/e1000e.c
@@ -281,6 +281,17 @@ static const uint16_t e1000e_eeprom_template[64] = {
 0x, 0x, 0x, 0x, 0x, 0x0120, 0x, 0x,
 };
 
+static const uint16_t e1000_eeprom_template[64] = {
+0x, 0x, 0x, 0x,  0x, 0x,  0x, 0x,
+0x3000, 0x1000, 0x6403, 0 /*DevId*/, 0x8086, 0 /*DevId*/, 0x8086, 0x3040,
+0x0008, 0x2000, 0x7e14, 0x0048,  0x1000, 0x00d8,  0x, 0x2700,
+0x6cc9, 0x3150, 0x0722, 0x040b,  0x0984, 0x,  0xc000, 0x0706,
+0x1008, 0x, 0x0f04, 0x7fff,  0x4d01, 0x,  0x, 0x,
+0x, 0x, 0x, 0x,  0x, 0x,  0x, 0x,
+0x0100, 0x4000, 0x121c, 0x,  0x, 0x,  0x, 0x,
+0x, 0x, 0x, 0x,  0x, 0x,  0x, 0x,
+};
+
 static void e1000e_core_realize(E1000EState *s)
 {
 s->core.owner = >parent_obj;
@@ -776,6 +787,45 @@ static const E1000EInfo e1000e_devices[] = {
 .eeprom_size = sizeof(e1000e_eeprom_template),
 .phy_id2 = E1000_PHY_ID2_82574x,
 },
+{
+.name= "e1000-ng",
+.desc= "Intel 82540EM Gigabit Ethernet Controller",
+.device_id   = E1000_DEV_ID_82540EM,
+.revision= 3,
+.subsystem_vendor_id = PCI_SUBVENDOR_ID_REDHAT_QUMRANET,
+.subsystem_id= PCI_SUBDEVICE_ID_QEMU,
+.is_express  = 0,
+.romfile = "efi-e1000.rom",
+.eeprom_templ= e1000_eeprom_template,
+.eeprom_size = sizeof(e1000_eeprom_template),
+.phy_id2 = E1000_PHY_ID2_8254xx_DEFAULT,
+},
+{
+.name= "e1000-82544gc-ng",
+.desc= "Intel 82544GC Gigabit Ethernet Controller",
+.device_id   = E1000_DEV_ID_82544GC_COPPER,
+.revision= 3,
+.subsystem_vendor_id = PCI_SUBVENDOR_ID_REDHAT_QUMRANET,
+.subsystem_id= PCI_SUBDEVICE_ID_QEMU,
+.is_express  = 0,
+.romfile = "efi-e1000.rom",
+.eeprom_templ= e1000_eeprom_template,
+.eeprom_size = sizeof(e1000_eeprom_template),
+.phy_id2 = E1000_PHY_ID2_82544x,
+},
+{
+.name= "e1000-82545em-ng",
+.desc= "Intel 82545EM Gigabit Ethernet Controller",
+.device_id   = E1000_DEV_ID_82545EM_COPPER,
+.revision= 3,
+.subsystem_vendor_id = PCI_SUBVENDOR_ID_REDHAT_QUMRANET,
+.subsystem_id= PCI_SUBDEVICE_ID_QEMU,
+.is_express  = 0,
+.romfile = "efi-e1000.rom",
+.eeprom_templ= e1000_eeprom_template,
+.eeprom_size = sizeof(e1000_eeprom_template),
+.phy_id2 = E1000_PHY_ID2_8254xx_DEFAULT,
+},
 };
 
 static void e1000e_register_types(void)
diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
index 959c697..02a60a1 100644
--- a/hw/net/e1000e_core.c
+++ b/hw/net/e1000e_core.c
@@ -2261,6 +2261,24 @@ static const char e1000e_phy_regcap[E1000E_PHY_PAGES][E1000E_PHY_PAGE_SIZE] = {
 }
 };
 
+static const char e1000_phy_regcap[E1000E_PHY_PAGES][E1000E_PHY_PAGE_SIZE] = {
+[0] = {
+[PHY_CTRL]   = PHY_RW,
+[PHY_STATUS] = PHY_R,
+[PHY_ID1]= PHY_R,
+[PHY_ID2]= PHY_R,
+[PHY_AUTONEG_ADV]= PHY_RW,
+[PHY_LP_ABILITY] = PHY_R,
+[PHY_AUTONEG_EXP]= PHY_R,
+[PHY_1000T_CTRL] = PHY_RW,
+[PHY_1000T_STATUS]   = PHY_R,
+[M88E1000_PHY_SPEC_CTRL] = PHY_RW,
+[M88E1000_PHY_SPEC_STATUS]   = PHY_R,
+[M88E1000_EXT_PHY_SPEC_CTRL] = PHY_RW,
+[M88E1000_RX_ERR_CNTR]   = PHY_R,
+}
+};
+
 static bool
 e1000e_phy_reg_check_cap(E1000ECore *core, uint32_t addr,

Re: [Qemu-devel] [RFC v2] e1000: Faulty tx checksum offload corrupts packets

2017-10-26 Thread Ed Swierk via Qemu-devel
On Mon, Oct 23, 2017 at 8:28 PM, Jason Wang <jasow...@redhat.com> wrote:
> 
> On 2017年10月24日 08:22, Ed Swierk wrote:
> > (Another layer of icing on the cake is that QEMU ignores the
> > requirement that a UDP checksum computed as zero be sent as 0xffff,
> > since zero is a special value meaning no checksum. So even when QEMU
> > doesn't corrupt the packet data, the packet sometimes leaves the box
> > with no checksum at all.)
> 
> Please submit another patch for this.

Will do.

> > I have instrumented QEMU and reproduced this behavior with a Windows
> > 10 guest, rather easily with a TCP iperf and a UDP iperf running in
> > parallel. I have also attempted a fix, which is below in very rough
> > form.
> 
> How do you instrument qemu? Can this be reproduced without this?

I can reproduce the bug with just the patchlet below. It would be even
better to devise a test that detects the corruption without modifying
QEMU, as that could be used as a regression test after the bug itself
is fixed. I'll have to ponder that.

> > One puzzle is what to do about e1000e: it shares some data
> > structures and a bit of code with e1000, but little else, which is
> > surprising given how similar they are (or should be). The e1000e's
> > handling of TCP segmentation offload and checksum offload is totally
> > different, and problematic for other reasons (it totally ignores most
> > of the context parameters provided by the driver and basically does
> > what it thinks is best by digging into the packet data). Is this
> > divergence intentional?
> 
> Somehow, and if we can find a way to unify the codes, it would be better.
> 
> > Is there a reason not to change e1000e as long
> > as I'm trying to make e1000 more datasheet-conformant?
> 
> Please fix them individually.

I went ahead and reimplemented e1000 as a variant of e1000e. It's just
a proof of concept, but hopefully a step towards eliminating the
redundancy and maintaining just one codebase rather than two. Please
see the patch series I'm sending separately.

Anyway, here's how I catch the tx checksum offload bug, with a Windows
guest running one TCP iperf and one UDP iperf simultaneously through
the same e1000 interface.

 --- a/hw/net/e1000.c
 +++ b/hw/net/e1000.c
 @@ -534,6 +534,30 @@ e1000_send_packet(E1000State *s, const uint8_t *buf, int size)
  }
  
  static void
 +debug_csum(struct e1000_tx *tp, uint16_t oldsum)
 +{
 +e1000x_txd_props *props = &tp->props;
 +uint8_t proto = tp->data[14 + 9];
 +uint32_t sumoff = props->tucso - props->tucss;
 +
 +if ((proto == 17 && sumoff != 6) ||
 +(proto == 6 && sumoff != 16)) {
 +DBGOUT(TXERR, "txsum bug! ver %d src %08x dst %08x len %d proto %d "
 +   "cptse %d sum_needed %x oldsum %x newsum %x realsum %x\n",
 +   tp->data[14] >> 4,
 +   ldl_be_p(tp->data + 14 + 12),
 +   ldl_be_p(tp->data + 14 + 16),
 +   lduw_be_p(tp->data + 14 + 2),
 +   proto,
 +   props->cptse,
 +   props->sum_needed,
 +   oldsum,
 +   lduw_be_p(tp->data + props->tucso),
 +   lduw_be_p(tp->data + props->tucss + (proto == 6 ? 16 : 6)));
 +}
 +}
 +
 +static void
  xmit_seg(E1000State *s)
  {
  uint16_t len;
 @@ -577,8 +601,10 @@ xmit_seg(E1000State *s)
  }
  
  if (tp->props.sum_needed & E1000_TXD_POPTS_TXSM) {
 +uint16_t oldsum = lduw_be_p(tp->data + tp->props.tucso);
  putsum(tp->data, tp->size, tp->props.tucso,
 tp->props.tucss, tp->props.tucse);
 +debug_csum(tp, oldsum);
  }
  if (tp->props.sum_needed & E1000_TXD_POPTS_IXSM) {
  putsum(tp->data, tp->size, tp->props.ipcso,
 



[Qemu-devel] [RFC v2] e1000: Faulty tx checksum offload corrupts packets

2017-10-23 Thread Ed Swierk via Qemu-devel
[Resending to full set of maintainers]

v2: Cosmetic fixes for checkpatch/buildbot errors

The transmit checksum offload implementation in QEMU's e1000 device is
deficient and causes packet data corruption in some situations.

According to the Intel 8254x software developer's manual[1], the
hardware device maintains two separate contexts: the TCP segmentation
offload (TSO) context includes parameters for both segmentation
offload and checksum offload, and the normal (SUM,
i.e. checksum-offload-only) context includes only checksum offload
parameters. These parameters specify over which packet data to compute
the checksum, and where in the packet to store the computed
checksum(s).

[1] https://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf

The e1000 driver can update either of these contexts by sending a
transmit context descriptor. The TSE bit in the TUCMD field controls
which context is modified by the descriptor. Crucially, a transmit
context descriptor with TSE=1 changes only the TSO context, leaving
the SUM context unchanged; with TSE=0 the opposite is true.

Fields in the transmit data descriptor determine which (if either) of
these two contexts the device uses when actually transmitting some
data:

- If the TSE bit in the DCMD field is set, then the device performs
  TCP segmentation offload using the parameters previously set in the
  TSO context. In addition, if TXSM and/or IXSM is set in the POPTS
  field, the device performs the appropriate checksum offloads using
  the parameters in the same (TSO) context.

- Otherwise, if the TSE bit in the DCMD field is clear, then there is
  no TCP segmentation offload. If TXSM and/or IXSM is set in the POPTS
  field, the device performs the appropriate checksum offloads using
  the parameters in the SUM context.
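As an aside for readers following along, the selection rule in the two
bullets above can be modeled in a few lines of C. This is an
illustrative sketch only -- the type and field names below are made up,
not QEMU's -- but it captures the invariant the datasheet requires:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the 8254x rule described above (names are not
 * QEMU's). The device keeps two independent contexts; TUCMD.TSE in a
 * context descriptor selects which one is *written*, and DCMD.TSE in a
 * data descriptor selects which one is *read*. */
typedef struct {
    uint8_t tucss;   /* checksum start */
    uint8_t tucso;   /* checksum offset: where the result is stored */
    uint8_t tucse;   /* checksum end */
} txd_props;

typedef struct {
    txd_props sum_ctx;   /* normal (checksum-offload-only) context */
    txd_props tso_ctx;   /* TCP segmentation-offload context */
} tx_ctx_state;

/* Transmit context descriptor: updates exactly one context. */
static void tx_context_descr(tx_ctx_state *s, bool tucmd_tse, txd_props p)
{
    if (tucmd_tse) {
        s->tso_ctx = p;   /* TSE=1: only the TSO context changes */
    } else {
        s->sum_ctx = p;   /* TSE=0: only the SUM context changes */
    }
}

/* Transmit data descriptor: offload using the context DCMD.TSE selects. */
static const txd_props *tx_pick_context(const tx_ctx_state *s, bool dcmd_tse)
{
    return dcmd_tse ? &s->tso_ctx : &s->sum_ctx;
}
```

The buggy behavior is equivalent to collapsing sum_ctx and tso_ctx into
a single field that every context descriptor overwrites, regardless of
TUCMD.TSE.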

The e1000 driver is free to set up the TSO and SUM contexts and then
transmit a mixture of data, with each data descriptor using a
different (or neither) context. This is what the e1000 driver for
Windows (Intel(R) PRO/1000 MT Network Connection, aka E1G6023E.sys)
does in certain cases. Sometimes with quite undesirable results, since
the QEMU e1000 device doesn't work as described above.

Instead, the QEMU e1000 device maintains only one context in its state
structure. When it receives a transmit context descriptor from the
driver, it overwrites the context parameters regardless of the TSE bit
in the TUCMD field.

To see why this is wrong, suppose the driver first sets up a SUM
context with UDP checksum offload parameters (say, TUCSO pointing to
the appropriate offset for a UDP checksum, 6 bytes into the header),
and then sets up a TSO context with TCP checksum offload parameters
(TUCSO pointing to the appropriate offset for a TCP checksum, 16 bytes
into the header). The driver then sends a transmit data descriptor
with TSO=0 and TXSM=1 along with a UDP datagram. The QEMU e1000 device
computes the checksum using the last set of checksum offload
parameters, and writes the checksum to offset 16, stomping on two
bytes of UDP data, and leaving the wrong checksum in the UDP checksum
field.

To make matters worse, if the host network stack treats data
transmitted from a VM as locally originated, it may do its own UDP
checksum computation, "correcting" it to match the corrupt data before
sending it on the wire. Now the corrupt UDP packet makes its way all
the way to the destination.
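Why does the host's recomputation make the corruption undetectable? The
Internet checksum is just a folded ones'-complement sum over whatever
bytes are present, so a checksum recomputed over corrupted data
verifies correctly at the receiver. A minimal sketch of the arithmetic
(standard RFC 1071 folding, plus the UDP zero-means-no-checksum rule;
illustrative code, not QEMU's):

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071-style Internet checksum over a byte buffer. */
static uint16_t inet_csum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2) {
        sum += (uint32_t)buf[i] << 8 | buf[i + 1];
    }
    if (len & 1) {
        sum += (uint32_t)buf[len - 1] << 8;   /* pad the odd byte */
    }
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);   /* fold carries */
    }
    return (uint16_t)~sum;
}

/* UDP must transmit a computed-zero checksum as 0xffff, since zero on
 * the wire means "no checksum" (RFC 768). */
static uint16_t udp_wire_csum(uint16_t c)
{
    return c == 0 ? 0xffff : c;
}
```

A buffer that contains its own correct checksum sums to zero on
verification, which is exactly why the receiver accepts the
host-"corrected" corrupt packet.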

(Another layer of icing on the cake is that QEMU ignores the
requirement that a UDP checksum computed as zero be sent as 0xffff,
since zero is a special value meaning no checksum. So even when QEMU
doesn't corrupt the packet data, the packet sometimes leaves the box
with no checksum at all.)

I have instrumented QEMU and reproduced this behavior with a Windows
10 guest, rather easily with a TCP iperf and a UDP iperf running in
parallel. I have also attempted a fix, which is below in very rough
form.

Before I spend too much time refining a patch, I'd like to get
feedback on my approach.

One puzzle is what to do about e1000e: it shares some data
structures and a bit of code with e1000, but little else, which is
surprising given how similar they are (or should be). The e1000e's
handling of TCP segmentation offload and checksum offload is totally
different, and problematic for other reasons (it totally ignores most
of the context parameters provided by the driver and basically does
what it thinks is best by digging into the packet data). Is this
divergence intentional? Is there a reason not to change e1000e as long
as I'm trying to make e1000 more datasheet-conformant?

Not ready for prime time, but nonetheless
Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>

---
 hw/net/e1000.c | 183 +++--
 hw/net/e1000x_common.h |   4 +-
 2 files changed, 132 insertions(+), 55 deletions(-)

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index 9324949..66ac7d3 100644

[Qemu-devel] [RFC v2] e1000: Faulty tx checksum offload corrupts packets

2017-10-23 Thread Ed Swierk via Qemu-devel
v2: Cosmetic fixes for checkpatch/buildbot errors

The transmit checksum offload implementation in QEMU's e1000 device is
deficient and causes packet data corruption in some situations.

According to the Intel 8254x software developer's manual[1], the
hardware device maintains two separate contexts: the TCP segmentation
offload (TSO) context includes parameters for both segmentation
offload and checksum offload, and the normal (SUM,
i.e. checksum-offload-only) context includes only checksum offload
parameters. These parameters specify over which packet data to compute
the checksum, and where in the packet to store the computed
checksum(s).

[1] https://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf

The e1000 driver can update either of these contexts by sending a
transmit context descriptor. The TSE bit in the TUCMD field controls
which context is modified by the descriptor. Crucially, a transmit
context descriptor with TSE=1 changes only the TSO context, leaving
the SUM context unchanged; with TSE=0 the opposite is true.

Fields in the transmit data descriptor determine which (if either) of
these two contexts the device uses when actually transmitting some
data:

- If the TSE bit in the DCMD field is set, then the device performs
  TCP segmentation offload using the parameters previously set in the
  TSO context. In addition, if TXSM and/or IXSM is set in the POPTS
  field, the device performs the appropriate checksum offloads using
  the parameters in the same (TSO) context.

- Otherwise, if the TSE bit in the DCMD field is clear, then there is
  no TCP segmentation offload. If TXSM and/or IXSM is set in the POPTS
  field, the device performs the appropriate checksum offloads using
  the parameters in the SUM context.

The e1000 driver is free to set up the TSO and SUM contexts and then
transmit a mixture of data, with each data descriptor using a
different (or neither) context. This is what the e1000 driver for
Windows (Intel(R) PRO/1000 MT Network Connection, aka E1G6023E.sys)
does in certain cases. Sometimes with quite undesirable results, since
the QEMU e1000 device doesn't work as described above.

Instead, the QEMU e1000 device maintains only one context in its state
structure. When it receives a transmit context descriptor from the
driver, it overwrites the context parameters regardless of the TSE bit
in the TUCMD field.

To see why this is wrong, suppose the driver first sets up a SUM
context with UDP checksum offload parameters (say, TUCSO pointing to
the appropriate offset for a UDP checksum, 6 bytes into the header),
and then sets up a TSO context with TCP checksum offload parameters
(TUCSO pointing to the appropriate offset for a TCP checksum, 16 bytes
into the header). The driver then sends a transmit data descriptor
with TSO=0 and TXSM=1 along with a UDP datagram. The QEMU e1000 device
computes the checksum using the last set of checksum offload
parameters, and writes the checksum to offset 16, stomping on two
bytes of UDP data, and leaving the wrong checksum in the UDP checksum
field.

To make matters worse, if the host network stack treats data
transmitted from a VM as locally originated, it may do its own UDP
checksum computation, "correcting" it to match the corrupt data before
sending it on the wire. Now the corrupt UDP packet makes its way all
the way to the destination.

(Another layer of icing on the cake is that QEMU ignores the
requirement that a UDP checksum computed as zero be sent as 0xffff,
since zero is a special value meaning no checksum. So even when QEMU
doesn't corrupt the packet data, the packet sometimes leaves the box
with no checksum at all.)

I have instrumented QEMU and reproduced this behavior with a Windows
10 guest, rather easily with a TCP iperf and a UDP iperf running in
parallel. I have also attempted a fix, which is below in very rough
form.

Before I spend too much time refining a patch, I'd like to get
feedback on my approach.

One puzzle is what to do about e1000e: it shares some data
structures and a bit of code with e1000, but little else, which is
surprising given how similar they are (or should be). The e1000e's
handling of TCP segmentation offload and checksum offload is totally
different, and problematic for other reasons (it totally ignores most
of the context parameters provided by the driver and basically does
what it thinks is best by digging into the packet data). Is this
divergence intentional? Is there a reason not to change e1000e as long
as I'm trying to make e1000 more datasheet-conformant?

Not ready for prime time, but nonetheless
Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>

---
 hw/net/e1000.c | 183 +++--
 hw/net/e1000x_common.h |   4 +-
 2 files changed, 132 insertions(+), 55 deletions(-)

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index 9324949..66ac7d3 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c

[Qemu-devel] [RFC] e1000: Faulty tx checksum offload corrupts packets

2017-10-12 Thread Ed Swierk via Qemu-devel
The transmit checksum offload implementation in QEMU's e1000 device is
deficient and causes packet data corruption in some situations.

According to the Intel 8254x software developer's manual[1], the
device maintains two separate contexts: the TCP segmentation offload
context includes parameters for both segmentation offload and checksum
offload, and the normal (checksum-offload-only) context includes only
checksum offload parameters. These parameters specify over which
packet data to compute the checksum, and where in the packet to store
the computed checksum(s).

[1] https://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf

The e1000 driver can update either of these contexts by sending a
transmit context descriptor. The TSE bit in the TUCMD field controls
which context is modified by the descriptor. Crucially, a transmit
context descriptor with TSE=1 changes only the TSO context, leaving
the SUM context unchanged; with TSE=0 the opposite is true.

Fields in the transmit data descriptor determine which (if either) of
these two contexts the device uses when actually transmitting some
data:

- If the TSE bit in the DCMD field is set, then the device performs
  TCP segmentation offload using the parameters previously set in the
  TSO context. In addition, if TXSM and/or IXSM is set in the POPTS
  field, the device performs the appropriate checksum offloads using
  the parameters in the same (TSO) context.

- Otherwise, if the TSE bit in the DCMD field is clear, then there is
  no TCP segmentation offload. If TXSM and/or IXSM is set in the POPTS
  field, the device performs the appropriate checksum offloads using
  the parameters in the SUM context.

The e1000 driver is free to set up the TSO and SUM contexts and then
transmit a mixture of data, with each data descriptor using a
different (or neither) context. This is what the e1000 driver for
Windows (Intel(R) PRO/1000 MT Network Connection, aka E1G6023E.sys)
does in certain cases. Sometimes with quite undesirable results, since
the QEMU e1000 device doesn't work as described above.

Instead, the QEMU e1000 device maintains only one context in its state
structure. When it receives a transmit context descriptor from the
driver, it overwrites the context parameters regardless of the TSE bit
in the TUCMD field.

To see why this is wrong, suppose the driver first sets up a SUM
context with UDP checksum offload parameters (say, TUCSO pointing to
the appropriate offset for a UDP checksum, 6 bytes into the header),
and then sets up a TSO context with TCP checksum offload parameters
(TUCSO pointing to the appropriate offset for a TCP checksum, 16 bytes
into the header). The driver then sends a transmit data descriptor
with TSO=0 and TXSM=1 along with a UDP datagram. The QEMU e1000 device
computes the checksum using the last set of checksum offload
parameters, and writes the checksum to offset 16, stomping on two
bytes of UDP data, and leaving the wrong checksum in the UDP checksum
field.

To make matters worse, if the network stack on the host running QEMU
treats data transmitted from a VM as locally originated, it may do its
own UDP checksum computation, "correcting" it to match the corrupt
data before sending it on the wire. Now the corrupt UDP packet makes
its way all the way to the destination.

(A separate layer of icing on the cake is that QEMU ignores the
requirement that a UDP checksum computed as zero be sent as 0xffff,
since zero is a special value meaning no checksum. So even when QEMU
doesn't corrupt the packet data, the packet sometimes leaves the box
with no checksum at all.)

I have instrumented QEMU and reproduced this behavior with a Windows
10 guest (rather easily with a TCP iperf and a UDP iperf running in
parallel). I have also attempted a fix, which is below in very rough
form.

Before I spend too much time refining a patch, I'd like to get
feedback on my approach.

One puzzle is what to do about e1000e: it shares some data
structures and a bit of code with e1000, but little else, which is
surprising given how similar they are (or should be). The e1000e's
handling of TCP segmentation offload and checksum offload is totally
different, and problematic for other reasons (it totally ignores most
of the context parameters provided by the driver and basically does
what it thinks is best by digging into the packet data). Is this
divergence intentional? Is there a reason not to change e1000e as long
as I'm trying to make e1000 more datasheet-conformant?

--Ed

---
 hw/net/e1000.c | 176 ++---
 hw/net/e1000x_common.h |   4 +-
 2 files changed, 126 insertions(+), 54 deletions(-)

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index 9324949..e45746f 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -98,7 +98,10 @@ typedef struct E1000State_st {
 unsigned char data[0x10000];
 uint16_t size;
 unsigned char vlan_needed;
-

Re: [Qemu-devel] New iotest repros failures on virtio external snapshot with iothread

2017-03-30 Thread Ed Swierk via Qemu-devel
On Thu, Mar 30, 2017 at 4:06 PM, John Snow <js...@redhat.com> wrote:
> On 03/29/2017 10:01 PM, Ed Swierk via Qemu-devel wrote:
>> * 2.9.0-rc2
>> - guest, virtio-blk, iothread, single snapshot create+commit:
>> "include/block/aio.h:457: aio_enable_external: Assertion
>> `ctx->external_disable_cnt > 0' failed." after snapshot create
>> - guest, virtio-blk, iothread, repeated snapshot create+commit: same as above
>> - guest, virtio-scsi, iothread, single snapshot create+commit: same as above
>> - guest, virtio-scsi, iothread, repeated snapshot create+commit: same as above
>> - no guest, virtio-blk, iothread, repeated snapshot create+commit: same as above
>> - no guest, virtio-scsi, iothread, single snapshot create+commit: same as above
>> - no guest, virtio-scsi, iothread, repeated snapshot create+commit:
>> same as above
>
> Do you mean to say that all of these 2.9.0-rc2 cases produce the same
> aio.h assertion?

Correct.

--Ed



[Qemu-devel] New iotest repros failures on virtio external snapshot with iothread

2017-03-29 Thread Ed Swierk via Qemu-devel
Parts of qemu's block code have changed a lot in recent months but are
not well exercised by current tests.

Subtle bugs have crept in causing assertion failures, hangs and other
crashes in a variety of situations: immediately on start, on first
guest activity, on external snapshot create or commit, on qmp quit
command.

Reproducing these bugs has proved tricky, as each may occur only with
a specific combination of qemu version, block device type (virtio-blk
or virtio-scsi) and iothread enabled or not. In some cases the bug
occurs only after several external snapshot operations. And in some
cases the bug only manifests when a guest is accessing the block
device simultaneously.

I've written an iotest (number 176, for now) that attempts to cover
many of these configurations. Currently it only exercises the external
snapshot create and commit lifted from iotest 118. The new iotest does
this repeatedly in each of 16 combinations:
- no guest / guest
- virtio-blk / virtio-scsi
- no iothread / iothread
- single / repeated external snapshot create+commit
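Since the matrix is a plain cross product of the four axes above, the
16 cases can be enumerated mechanically. The sketch below (names
invented for illustration, not the iotest's own identifiers) just
spells that product out:

```c
#include <stdio.h>

/* Enumerate the 2x2x2x2 iotest case matrix (illustrative names only). */
static const char *guests[]    = { "none", "linux" };
static const char *devs[]      = { "virtio-blk", "virtio-scsi" };
static const char *iothreads[] = { "off", "on" };
static const char *modes[]     = { "single", "repeated" };

/* Print every combination and return how many there were. */
static int enumerate_cases(void)
{
    int n = 0;
    for (int g = 0; g < 2; g++)
        for (int d = 0; d < 2; d++)
            for (int t = 0; t < 2; t++)
                for (int m = 0; m < 2; m++) {
                    printf("guest=%s dev=%s iothread=%s mode=%s\n",
                           guests[g], devs[d], iothreads[t], modes[m]);
                    n++;
                }
    return n;
}
```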

I made some minor changes to the test infrastructure so the new iotest
can deal gracefully with qemu hanging--the test script itself
shouldn't hang. And in all failure modes the test needs to expose
enough console output and other information to diagnose the problem.

The main departure from existing iotests is running a real guest. I
used buildroot to generate a small (~4 MB) Linux kernel with built-in
initrd containing a busybox-based userland. After the iotest launches
qemu, the guest loops writing to the block device, while the test
performs snapshot operations.

I ran the new iotest on 3 qemu versions: 2.7.1, stable-2.8-staging and
2.9.0-rc2. The latter two fail several test cases, all
iothread-enabled. Only 2.7.1 passes all the cases.

Here is the code for the new iotest (I didn't dare email patches with
a 4 MB blob):
https://github.com/skyportsystems/qemu-1/commits/eswierk-iotests-2.7
https://github.com/skyportsystems/qemu-1/commits/eswierk-iotests-2.8
https://github.com/skyportsystems/qemu-1/commits/eswierk-iotests-2.9

And here is the buildroot I used to generate the guest Linux kernel+initrd:
https://github.com/skyportsystems/buildroot-1/commits/qemu-iotests

Please check out the code and try the new test--particularly anyone
who can also help figure out these failures. (Note that since half the
test cases use an iothread, /dev/kvm must be readable and writable.)

* stable-2.8-staging
- guest, virtio-blk, iothread, single snapshot create+commit: hang on
quit (intermittent)
- guest, virtio-blk, iothread, repeated snapshot create+commit: hang
after 1 iteration
- guest, virtio-scsi, iothread, single snapshot create+commit: hang on
quit (intermittent)
- guest, virtio-scsi, iothread, repeated snapshot create+commit: hang
after 1 iteration

* 2.9.0-rc2
- guest, virtio-blk, iothread, single snapshot create+commit:
"include/block/aio.h:457: aio_enable_external: Assertion
`ctx->external_disable_cnt > 0' failed." after snapshot create
- guest, virtio-blk, iothread, repeated snapshot create+commit: same as above
- guest, virtio-scsi, iothread, single snapshot create+commit: same as above
- guest, virtio-scsi, iothread, repeated snapshot create+commit: same as above
- no guest, virtio-blk, iothread, repeated snapshot create+commit: same as above
- no guest, virtio-scsi, iothread, single snapshot create+commit: same as above
- no guest, virtio-scsi, iothread, repeated snapshot create+commit:
same as above

--Ed



Re: [Qemu-devel] Assertion failure taking external snapshot with virtio drive + iothread

2017-03-22 Thread Ed Swierk
On Wed, Mar 22, 2017 at 2:19 AM, Fam Zheng <f...@redhat.com> wrote:
> On Tue, 03/21 06:05, Ed Swierk wrote:
>> Actually running snapshot_blkdev command in the text monitor doesn't
>> trigger this assertion (I mixed up my notes). Instead it's triggered
>> by the following sequence in qmp-shell:
>>
>> (QEMU) blockdev-snapshot-sync device=drive0 format=qcow2
>> snapshot-file=/x/snap1.qcow2
>> {"return": {}}
>> (QEMU) block-commit device=drive0
>> {"return": {}}
>> (QEMU) block-job-complete device=drive0
>> {"return": {}}
>>
>> > Is there a backtrace?
>>
>> #0  0x73757067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
>> #1  0x73758448 in abort () from /lib/x86_64-linux-gnu/libc.so.6
>> #2  0x73750266 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #3  0x73750312 in __assert_fail () from 
>> /lib/x86_64-linux-gnu/libc.so.6
>> #4  0x55b4b0bb in bdrv_drain_recurse
>> (bs=bs@entry=0x57bd6010)  at /x/qemu/block/io.c:164
>> #5  0x55b4b7ad in bdrv_drained_begin (bs=0x57bd6010)  at
>> /x/qemu/block/io.c:231
>> #6  0x55b4b802 in bdrv_parent_drained_begin
>> (bs=0x568c1a00)  at /x/qemu/block/io.c:53
>> #7  bdrv_drained_begin (bs=bs@entry=0x568c1a00)  at 
>> /x/qemu/block/io.c:228
>> #8  0x55b4be1e in bdrv_co_drain_bh_cb (opaque=0x7fff9aaece40)
>> at /x/qemu/block/io.c:190
>> #9  0x55bb431e in aio_bh_call (bh=0x5750e5f0)  at
>> /x/qemu/util/async.c:90
>> #10 aio_bh_poll (ctx=ctx@entry=0x56718090)  at /x/qemu/util/async.c:118
>> #11 0x55bb72eb in aio_poll (ctx=0x56718090,
>> blocking=blocking@entry=true)  at /x/qemu/util/aio-posix.c:682
>> #12 0x559443ce in iothread_run (opaque=0x56717b80)  at
>> /x/qemu/iothread.c:59
>> #13 0x73ad50a4 in start_thread () from
>> /lib/x86_64-linux-gnu/libpthread.so.0
>> #14 0x7380a87d in clone () from /lib/x86_64-linux-gnu/libc.so.6
>
> Hmm, looks like a separate bug to me. In addition please apply this (the
> assertion here is correct I think, but all callers are not audited yet):
>
> diff --git a/block.c b/block.c
> index 6e906ec..447d908 100644
> --- a/block.c
> +++ b/block.c
> @@ -1737,6 +1737,9 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>  {
>  BlockDriverState *old_bs = child->bs;
>
> +if (old_bs && new_bs) {
> +assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
> +}
>  if (old_bs) {
>  if (old_bs->quiesce_counter && child->role->drained_end) {
>  child->role->drained_end(child);
> diff --git a/block/mirror.c b/block/mirror.c
> index ca4baa5..a23ca9e 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -1147,6 +1147,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
>  return;
>  }
>  mirror_top_bs->total_sectors = bs->total_sectors;
> +bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
>
>  /* bdrv_append takes ownership of the mirror_top_bs reference, need to keep
>   * it alive until block_job_create() even if bs has no parent. */

With this patch, I'm seeing either assertions or hangs when I run
blockdev-snapshot-sync, block-commit and block-job-complete
repeatedly. The exact assertion seems to depend on timing and/or what
combination of your other patches I apply. They include:

/x/qemu/hw/virtio/virtio.c:212: vring_get_region_caches: Assertion
`caches != ((void *)0)' failed.

/x/qemu/block/mirror.c:350: mirror_iteration: Assertion `sector_num >=
0' failed.

/x/qemu/block/mirror.c:865: mirror_run: Assertion
`((>tracked_requests)->lh_first == ((void *)0))' failed.

We don't appear to be converging on a solution here. Perhaps I should
instead focus on implementing automated tests so that you or anyone
else can easily reproduce these problems. The only tricky part is
extending qemu-iotests to include booting a guest to generate block IO
and trigger race conditions, but I have some ideas about how to do
this with a minimal (< 5 MB) Linux kernel+rootfs.

--Ed



Re: [Qemu-devel] Assertion failure taking external snapshot with virtio drive + iothread

2017-03-21 Thread Ed Swierk
On Tue, Mar 21, 2017 at 5:50 AM, Fam Zheng <f...@redhat.com> wrote:
> On Tue, 03/21 05:20, Ed Swierk wrote:
>> On Mon, Mar 20, 2017 at 10:26 PM, Fam Zheng <f...@redhat.com> wrote:
>> > On Fri, 03/17 09:55, Ed Swierk wrote:
>> >> I'm running into the same problem taking an external snapshot with a
>> >> virtio-blk drive with iothread, so it's not specific to virtio-scsi.
>> >> Run a Linux guest on qemu master
>> >>
>> >>   qemu-system-x86_64 -nographic -enable-kvm -monitor
>> >> telnet:0.0.0.0:1234,server,nowait -m 1024 -object
>> >> iothread,id=iothread1 -drive file=/x/drive.qcow2,if=none,id=drive0
>> >> -device virtio-blk-pci,iothread=iothread1,drive=drive0
>> >>
>> >> Then in the monitor
>> >>
>> >>   snapshot_blkdev drive0 /x/snap1.qcow2
>> >>
>> >> qemu bombs with
>> >>
>> >>   qemu-system-x86_64: /x/qemu/include/block/aio.h:457:
>> >> aio_enable_external: Assertion `ctx->external_disable_cnt > 0' failed.
>> >>
>> >> whereas without the iothread the assertion failure does not occur.
>> >
>> >
>> > Can you test this one?
>> >
>> > ---
>> >
>> >
>> > diff --git a/blockdev.c b/blockdev.c
>> > index c5b2c2c..4c217d5 100644
>> > --- a/blockdev.c
>> > +++ b/blockdev.c
>> > @@ -1772,6 +1772,8 @@ static void external_snapshot_prepare(BlkActionState *common,
>> >  return;
>> >  }
>> >
>> > +bdrv_set_aio_context(state->new_bs, state->aio_context);
>> > +
>> >  /* This removes our old bs and adds the new bs. This is an operation 
>> > that
>> >   * can fail, so we need to do it in .prepare; undoing it for abort is
>> >   * always possible. */
>> > @@ -1789,8 +1791,6 @@ static void external_snapshot_commit(BlkActionState *common)
>> >  ExternalSnapshotState *state =
>> >   DO_UPCAST(ExternalSnapshotState, common, common);
>> >
>> > -bdrv_set_aio_context(state->new_bs, state->aio_context);
>> > -
>> >  /* We don't need (or want) to use the transactional
>> >   * bdrv_reopen_multiple() across all the entries at once, because we
>> >   * don't want to abort all of them if one of them fails the reopen */
>>
>> With this change, a different assertion fails on running snapshot_blkdev:
>>
>>   qemu-system-x86_64: /x/qemu/block/io.c:164: bdrv_drain_recurse:
>> Assertion `qemu_get_current_aio_context() == qemu_get_aio_context()'
>> failed.

Actually running snapshot_blkdev command in the text monitor doesn't
trigger this assertion (I mixed up my notes). Instead it's triggered
by the following sequence in qmp-shell:

(QEMU) blockdev-snapshot-sync device=drive0 format=qcow2
snapshot-file=/x/snap1.qcow2
{"return": {}}
(QEMU) block-commit device=drive0
{"return": {}}
(QEMU) block-job-complete device=drive0
{"return": {}}

> Is there a backtrace?

#0  0x73757067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x73758448 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x73750266 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x73750312 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x55b4b0bb in bdrv_drain_recurse
(bs=bs@entry=0x57bd6010)  at /x/qemu/block/io.c:164
#5  0x55b4b7ad in bdrv_drained_begin (bs=0x57bd6010)  at
/x/qemu/block/io.c:231
#6  0x55b4b802 in bdrv_parent_drained_begin
(bs=0x568c1a00)  at /x/qemu/block/io.c:53
#7  bdrv_drained_begin (bs=bs@entry=0x568c1a00)  at /x/qemu/block/io.c:228
#8  0x55b4be1e in bdrv_co_drain_bh_cb (opaque=0x7fff9aaece40)
at /x/qemu/block/io.c:190
#9  0x55bb431e in aio_bh_call (bh=0x5750e5f0)  at
/x/qemu/util/async.c:90
#10 aio_bh_poll (ctx=ctx@entry=0x56718090)  at /x/qemu/util/async.c:118
#11 0x55bb72eb in aio_poll (ctx=0x56718090,
blocking=blocking@entry=true)  at /x/qemu/util/aio-posix.c:682
#12 0x559443ce in iothread_run (opaque=0x56717b80)  at
/x/qemu/iothread.c:59
#13 0x73ad50a4 in start_thread () from
/lib/x86_64-linux-gnu/libpthread.so.0
#14 0x7380a87d in clone () from /lib/x86_64-linux-gnu/libc.so.6

--Ed



Re: [Qemu-devel] Assertion failure taking external snapshot with virtio drive + iothread

2017-03-21 Thread Ed Swierk
On Mon, Mar 20, 2017 at 10:26 PM, Fam Zheng <f...@redhat.com> wrote:
> On Fri, 03/17 09:55, Ed Swierk wrote:
>> I'm running into the same problem taking an external snapshot with a
>> virtio-blk drive with iothread, so it's not specific to virtio-scsi.
>> Run a Linux guest on qemu master
>>
>>   qemu-system-x86_64 -nographic -enable-kvm -monitor
>> telnet:0.0.0.0:1234,server,nowait -m 1024 -object
>> iothread,id=iothread1 -drive file=/x/drive.qcow2,if=none,id=drive0
>> -device virtio-blk-pci,iothread=iothread1,drive=drive0
>>
>> Then in the monitor
>>
>>   snapshot_blkdev drive0 /x/snap1.qcow2
>>
>> qemu bombs with
>>
>>   qemu-system-x86_64: /x/qemu/include/block/aio.h:457:
>> aio_enable_external: Assertion `ctx->external_disable_cnt > 0' failed.
>>
>> whereas without the iothread the assertion failure does not occur.
>
>
> Can you test this one?
>
> ---
>
>
> diff --git a/blockdev.c b/blockdev.c
> index c5b2c2c..4c217d5 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -1772,6 +1772,8 @@ static void external_snapshot_prepare(BlkActionState 
> *common,
>  return;
>  }
>
> +bdrv_set_aio_context(state->new_bs, state->aio_context);
> +
>  /* This removes our old bs and adds the new bs. This is an operation that
>   * can fail, so we need to do it in .prepare; undoing it for abort is
>   * always possible. */
> @@ -1789,8 +1791,6 @@ static void external_snapshot_commit(BlkActionState 
> *common)
>  ExternalSnapshotState *state =
>   DO_UPCAST(ExternalSnapshotState, common, 
> common);
>
> -bdrv_set_aio_context(state->new_bs, state->aio_context);
> -
>  /* We don't need (or want) to use the transactional
>   * bdrv_reopen_multiple() across all the entries at once, because we
>   * don't want to abort all of them if one of them fails the reopen */

With this change, a different assertion fails on running snapshot_blkdev:

  qemu-system-x86_64: /x/qemu/block/io.c:164: bdrv_drain_recurse:
Assertion `qemu_get_current_aio_context() == qemu_get_aio_context()'
failed.

--Ed



Re: [Qemu-devel] Assertion failure taking external snapshot with virtio drive + iothread

2017-03-20 Thread Ed Swierk
On Fri, Mar 17, 2017 at 12:27 PM, Paolo Bonzini  wrote:
> And this is a fix, but I have no idea why/how it works and what else it
> may break.
>
> Patches 1 and 2 are pretty obvious and would be the first step towards
> eliminating aio_disable/enable_external altogether.
>
> However I got patch 3 more or less by trial and error, and when I
> thought I had the reasoning right I noticed this:
>
> bdrv_drained_end(state->old_bs);
>
> in external_snapshot_clean which makes no sense given the
>
> bdrv_drained_begin(bs_new);
>
> that I added to bdrv_append.  So take this with a ton of salt.
>
> The basic idea is that calling child->role->drained_begin and
> child->role->drained_end is not necessary and in fact actively wrong
> when both the old and the new child should be in a drained section.
> But maybe instead it should be asserted that they are, except for the
> special case of adding or removing a child.  i.e. after
>
> int drain = !!(old_bs && old_bs->quiesce_counter) - !!(new_bs && 
> new_bs->quiesce_counter);
>
> add
>
> assert(!(drain && old_bs && new_bs));
>
> Throwing this out because it's Friday evening... Maybe Fam can pick
> it up on Monday.

I just tested this patch on top of today's master. It does make the
ctx->external_disable_cnt > 0 assertion failure on snapshot_blkdev go
away. But it seems to cause a different assertion failure when running
without an iothread, e.g.

  qemu-system-x86_64 -nographic -enable-kvm -monitor
telnet:0.0.0.0:1234,server,nowait -m 1024 -object
iothread,id=iothread1 -drive file=/x/drive.qcow2,if=none,id=drive0
-device virtio-blk-pci,drive=drive0

and with the guest constantly writing to the disk with something like

  while true; do echo 12345 >blah; done

Running snapshot_blkdev in the monitor repeatedly (with a new backing
file each time) triggers the following after a few tries:

  qemu-system-x86_64: /x/qemu/block.c:2965: bdrv_replace_node:
Assertion `!({ typedef struct { int:(sizeof(*>in_flight) >
sizeof(void *)) ? -1 : 1; } qemu_build_bug_on__4
__attribute__((unused)); __atomic_load_n(>in_flight, 0); })'
failed.

This does not occur on today's master without this patch.

--Ed



Re: [Qemu-devel] Assertion failure taking external snapshot with virtio drive + iothread

2017-03-20 Thread Ed Swierk
On Fri, Mar 17, 2017 at 12:27 PM, Paolo Bonzini  wrote:
> And this is a fix, but I have no idea why/how it works and what else it
> may break.
>
> Patches 1 and 2 are pretty obvious and would be the first step towards
> eliminating aio_disable/enable_external altogether.
>
> However I got patch 3 more or less by trial and error, and when I
> thought I had the reasoning right I noticed this:
>
> bdrv_drained_end(state->old_bs);
>
> in external_snapshot_clean which makes no sense given the
>
> bdrv_drained_begin(bs_new);
>
> that I added to bdrv_append.  So take this with a ton of salt.
>
> The basic idea is that calling child->role->drained_begin and
> child->role->drained_end is not necessary and in fact actively wrong
> when both the old and the new child should be in a drained section.
> But maybe instead it should be asserted that they are, except for the
> special case of adding or removing a child.  i.e. after
>
> int drain = !!(old_bs && old_bs->quiesce_counter) - !!(new_bs && 
> new_bs->quiesce_counter);
>
> add
>
> assert(!(drain && old_bs && new_bs));
>
> Throwing this out because it's Friday evening... Maybe Fam can pick
> it up on Monday.

OK, thanks. It would be good to figure this out for 2.9, since the
workaround of disabling iothreads will affect performance. Let me know
if there's anything I can do to help.

Meanwhile I'm also looking into an intermittent crash running
block-commit on an external snapshot. This is with an earlier QEMU
snapshot (from around 20 Jan):

  /x/qemu/deb/debuild/block.c:2433: bdrv_append: Assertion
`!bdrv_requests_pending(bs_top)' failed.

That assertion no longer exists in the current master, but I'm trying
to reproduce it reliably and see whether the bug itself has
disappeared.

--Ed



Re: [Qemu-devel] Assertion failure taking external snapshot with virtio drive + iothread

2017-03-17 Thread Ed Swierk
On Fri, Mar 17, 2017 at 10:15 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 17/03/2017 18:11, Paolo Bonzini wrote:
>>
>>
>> On 17/03/2017 17:55, Ed Swierk wrote:
>>> I'm running into the same problem taking an external snapshot with a
>>> virtio-blk drive with iothread, so it's not specific to virtio-scsi.
>>> Run a Linux guest on qemu master
>>>
>>>   qemu-system-x86_64 -nographic -enable-kvm -monitor
>>> telnet:0.0.0.0:1234,server,nowait -m 1024 -object
>>> iothread,id=iothread1 -drive file=/x/drive.qcow2,if=none,id=drive0
>>> -device virtio-blk-pci,iothread=iothread1,drive=drive0
>>>
>>> Then in the monitor
>>>
>>>   snapshot_blkdev drive0 /x/snap1.qcow2
>>>
>>> qemu bombs with
>>>
>>>   qemu-system-x86_64: /x/qemu/include/block/aio.h:457:
>>> aio_enable_external: Assertion `ctx->external_disable_cnt > 0' failed.
>>>
>>> whereas without the iothread the assertion failure does not occur.
>>
>> Please try this patch:
>
> Hmm, no.  I'll post the full fix on top of John Snow's patches.

OK. Incidentally, testing with virtio-blk I bisected the assertion
failure to b2c2832c6140cfe3ddc0de2d77eeb0b77dea8fd3 ("block: Add Error
parameter to bdrv_append()").

--Ed



[Qemu-devel] Assertion failure taking external snapshot with virtio drive + iothread

2017-03-17 Thread Ed Swierk
I'm running into the same problem taking an external snapshot with a
virtio-blk drive with iothread, so it's not specific to virtio-scsi.
Run a Linux guest on qemu master

  qemu-system-x86_64 -nographic -enable-kvm -monitor
telnet:0.0.0.0:1234,server,nowait -m 1024 -object
iothread,id=iothread1 -drive file=/x/drive.qcow2,if=none,id=drive0
-device virtio-blk-pci,iothread=iothread1,drive=drive0

Then in the monitor

  snapshot_blkdev drive0 /x/snap1.qcow2

qemu bombs with

  qemu-system-x86_64: /x/qemu/include/block/aio.h:457:
aio_enable_external: Assertion `ctx->external_disable_cnt > 0' failed.

whereas without the iothread the assertion failure does not occur.

--Ed

On Thu, Mar 16, 2017 at 5:26 PM, Ed Swierk <eswi...@skyportsystems.com> wrote:
> With this change on top of 2.9.0-rc0, I am able to boot a Linux guest
> from a virtio-scsi drive with an iothread, e.g.
>
>   qemu-system-x86_64 -nographic -enable-kvm -monitor
> telnet:0.0.0.0:1234,server,nowait -m 1024 -object
> iothread,id=iothread1 -device
> virtio-scsi-pci,iothread=iothread1,id=scsi0 -drive
> file=/x/drive.qcow2,format=qcow2,if=none,id=drive0,cache=directsync,aio=native
> -device scsi-hd,drive=drive0,bootindex=1
>
> But when I try to take a snapshot by running this in the monitor
>
>   snapshot_blkdev drive0 /x/snap1.qcow2
>
> qemu bombs with
>
>   qemu-system-x86_64: /x/qemu/include/block/aio.h:457:
> aio_enable_external: Assertion `ctx->external_disable_cnt > 0' failed.
>
> This does not occur if I don't use the iothread.
>
> I instrumented the code a bit, printing the value of bs,
> bdrv_get_aio_context(bs), and
> bdrv_get_aio_context(bs)->external_disable_cnt before and after
> aio_{disable,enable}_external() in bdrv_drained_{begin,end}().
>
> Without the iothread, nested calls to these functions cause the
> counter to increase and decrease as you'd expect, and the context is
> the same in each call.
>
> bdrv_drained_begin 0 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=0
> bdrv_drained_begin 1 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_begin 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_begin 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
> bdrv_drained_end 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
> bdrv_drained_end 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_begin 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_begin 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
> bdrv_drained_end 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
> bdrv_drained_end 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_end 0 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_end 1 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=0
>
> But with the iothread, there are at least two different context
> pointers, and there is one extra call to bdrv_drained_end() without a
> matching bdrv_drained_begin(). That last call comes from
> external_snapshot_clean().
>
> bdrv_drained_begin 0 bs=0x7fe4437545c0 ctx=0x7fe443749a00 cnt=0
> bdrv_drained_begin 1 bs=0x7fe4437545c0 ctx=0x7fe443749a00 cnt=1
> bdrv_drained_begin 0 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=0
> bdrv_drained_begin 1 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=1
> bdrv_drained_end 0 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=1
> bdrv_drained_end 1 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=0
> bdrv_drained_begin 0 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=0
> bdrv_drained_begin 1 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=1
> bdrv_drained_begin 0 bs=0x7fe4de20 ctx=0x7fe44373a7b0 cnt=1
> bdrv_drained_begin 1 bs=0x7fe4de20 ctx=0x7fe44373a7b0 cnt=2
> bdrv_drained_end 0 bs=0x7fe4de20 ctx=0x7fe44373a7b0 cnt=2
> bdrv_drained_end 1 bs=0x7fe4de20 ctx=0x7fe44373a7b0 cnt=1
> bdrv_

Re: [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Fix acquire/release in dataplane handlers

2017-03-17 Thread Ed Swierk
On Mar 16, 2017 23:02, "Fam Zheng" <f...@redhat.com> wrote:

On Thu, 03/16 17:26, Ed Swierk wrote:
> On Tue, Mar 14, 2017 at 8:36 AM, Fam Zheng <f...@redhat.com> wrote:
> > After the AioContext lock push down, there is a race between
> > virtio_scsi_dataplane_start and those "assert(s->ctx &&
> > s->dataplane_started)", because the latter isn't wrapped in
> > aio_context_acquire.
> >
> > Reproducer is simply booting a Fedora guest with an empty
> > virtio-scsi-dataplane controller:
> >
> > qemu-system-x86_64 \
> >   -drive 
> > if=none,id=root,format=raw,file=Fedora-Cloud-Base-25-1.3.x86_64.raw
\
> >   -device virtio-scsi \
> >   -device scsi-disk,drive=root,bootindex=1 \
> >   -object iothread,id=io \
> >   -device virtio-scsi-pci,iothread=io \
> >   -net user,hostfwd=tcp::10022-:22 -net nic,model=virtio -m 2048 \
> >   --enable-kvm
> >
> > Fix this by moving acquire/release pairs from virtio_scsi_handle_*_vq to
> > their callers - and wrap the broken assertions in.
> >
> > Signed-off-by: Fam Zheng <f...@redhat.com>
>
> With this change on top of 2.9.0-rc0, I am able to boot a Linux guest
> from a virtio-scsi drive with an iothread, e.g.
>
>   qemu-system-x86_64 -nographic -enable-kvm -monitor
> telnet:0.0.0.0:1234,server,nowait -m 1024 -object
> iothread,id=iothread1 -device
> virtio-scsi-pci,iothread=iothread1,id=scsi0 -drive
> file=/x/drive.qcow2,format=qcow2,if=none,id=drive0,cache=
directsync,aio=native
> -device scsi-hd,drive=drive0,bootindex=1
>
> But when I try to take a snapshot by running this in the monitor
>
>   snapshot_blkdev drive0 /x/snap1.qcow2
>
> qemu bombs with
>
>   qemu-system-x86_64: /x/qemu/include/block/aio.h:457:
> aio_enable_external: Assertion `ctx->external_disable_cnt > 0' failed.
>
> This does not occur if I don't use the iothread.
>
> I instrumented the code a bit, printing the value of bs,
> bdrv_get_aio_context(bs), and
> bdrv_get_aio_context(bs)->external_disable_cnt before and after
> aio_{disable,enable}_external() in bdrv_drained_{begin,end}().
>
> Without the iothread, nested calls to these functions cause the
> counter to increase and decrease as you'd expect, and the context is
> the same in each call.
>
> bdrv_drained_begin 0 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=0
> bdrv_drained_begin 1 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_begin 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_begin 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
> bdrv_drained_end 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
> bdrv_drained_end 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_begin 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_begin 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
> bdrv_drained_end 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
> bdrv_drained_end 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
> bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_end 0 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=1
> bdrv_drained_end 1 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=0
>
> But with the iothread, there are at least two different context
> pointers, and there is one extra call to bdrv_drained_end() without a
> matching bdrv_drained_begin(). That last call comes from
> external_snapshot_clean().
>
> bdrv_drained_begin 0 bs=0x7fe4437545c0 ctx=0x7fe443749a00 cnt=0
> bdrv_drained_begin 1 bs=0x7fe4437545c0 ctx=0x7fe443749a00 cnt=1
> bdrv_drained_begin 0 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=0
> bdrv_drained_begin 1 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=1
> bdrv_drained_end 0 bs=0x7fe443990a00 ctx=0x7fe4

Re: [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Fix acquire/release in dataplane handlers

2017-03-16 Thread Ed Swierk
On Tue, Mar 14, 2017 at 8:36 AM, Fam Zheng  wrote:
> After the AioContext lock push down, there is a race between
> virtio_scsi_dataplane_start and those "assert(s->ctx &&
> s->dataplane_started)", because the latter isn't wrapped in
> aio_context_acquire.
>
> Reproducer is simply booting a Fedora guest with an empty
> virtio-scsi-dataplane controller:
>
> qemu-system-x86_64 \
>   -drive 
> if=none,id=root,format=raw,file=Fedora-Cloud-Base-25-1.3.x86_64.raw \
>   -device virtio-scsi \
>   -device scsi-disk,drive=root,bootindex=1 \
>   -object iothread,id=io \
>   -device virtio-scsi-pci,iothread=io \
>   -net user,hostfwd=tcp::10022-:22 -net nic,model=virtio -m 2048 \
>   --enable-kvm
>
> Fix this by moving acquire/release pairs from virtio_scsi_handle_*_vq to
> their callers - and wrap the broken assertions in.
>
> Signed-off-by: Fam Zheng 

With this change on top of 2.9.0-rc0, I am able to boot a Linux guest
from a virtio-scsi drive with an iothread, e.g.

  qemu-system-x86_64 -nographic -enable-kvm -monitor
telnet:0.0.0.0:1234,server,nowait -m 1024 -object
iothread,id=iothread1 -device
virtio-scsi-pci,iothread=iothread1,id=scsi0 -drive
file=/x/drive.qcow2,format=qcow2,if=none,id=drive0,cache=directsync,aio=native
-device scsi-hd,drive=drive0,bootindex=1

But when I try to take a snapshot by running this in the monitor

  snapshot_blkdev drive0 /x/snap1.qcow2

qemu bombs with

  qemu-system-x86_64: /x/qemu/include/block/aio.h:457:
aio_enable_external: Assertion `ctx->external_disable_cnt > 0' failed.

This does not occur if I don't use the iothread.

I instrumented the code a bit, printing the value of bs,
bdrv_get_aio_context(bs), and
bdrv_get_aio_context(bs)->external_disable_cnt before and after
aio_{disable,enable}_external() in bdrv_drained_{begin,end}().

Without the iothread, nested calls to these functions cause the
counter to increase and decrease as you'd expect, and the context is
the same in each call.

bdrv_drained_begin 0 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=0
bdrv_drained_begin 1 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=1
bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_begin 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_begin 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
bdrv_drained_end 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
bdrv_drained_end 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_begin 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_begin 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
bdrv_drained_end 0 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=3
bdrv_drained_end 1 bs=0x7fe9f67cfde0 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
bdrv_drained_begin 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
bdrv_drained_begin 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_end 0 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=2
bdrv_drained_end 1 bs=0x7fe9f5d12a00 ctx=0x7fe9f5abc7b0 cnt=1
bdrv_drained_end 0 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=1
bdrv_drained_end 1 bs=0x7fe9f5ad65a0 ctx=0x7fe9f5abc7b0 cnt=0

But with the iothread, there are at least two different context
pointers, and there is one extra call to bdrv_drained_end() without a
matching bdrv_drained_begin(). That last call comes from
external_snapshot_clean().

bdrv_drained_begin 0 bs=0x7fe4437545c0 ctx=0x7fe443749a00 cnt=0
bdrv_drained_begin 1 bs=0x7fe4437545c0 ctx=0x7fe443749a00 cnt=1
bdrv_drained_begin 0 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=0
bdrv_drained_begin 1 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=1
bdrv_drained_end 0 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=1
bdrv_drained_end 1 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=0
bdrv_drained_begin 0 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=0
bdrv_drained_begin 1 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=1
bdrv_drained_begin 0 bs=0x7fe4de20 ctx=0x7fe44373a7b0 cnt=1
bdrv_drained_begin 1 bs=0x7fe4de20 ctx=0x7fe44373a7b0 cnt=2
bdrv_drained_end 0 bs=0x7fe4de20 ctx=0x7fe44373a7b0 cnt=2
bdrv_drained_end 1 bs=0x7fe4de20 ctx=0x7fe44373a7b0 cnt=1
bdrv_drained_end 0 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=1
bdrv_drained_end 1 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=0
bdrv_drained_begin 0 bs=0x7fe443990a00 ctx=0x7fe44373a7b0 cnt=0
bdrv_drained_begin 1 bs=0x7fe443990a00 

Re: [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Fix acquire/release in dataplane handlers

2017-03-16 Thread Ed Swierk
On Tue, Mar 14, 2017 at 8:36 AM, Fam Zheng  wrote:
> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index e7466d3..4939f1f 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> ...
>  bool virtio_scsi_handle_event_vq(VirtIOSCSI *s, VirtQueue *vq)
>  {
> -virtio_scsi_acquire(s);
>  if (s->events_dropped) {
>  virtio_scsi_push_event(s, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
>  virtio_scsi_release(s);

Did you intend to leave this virtio_scsi_release() call?

>  return true;
>  }
> -virtio_scsi_release(s);
>  return false;
>  }



Re: [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Fix acquire/release in dataplane handlers

2017-03-15 Thread Ed Swierk
On Tue, Mar 14, 2017 at 8:36 AM, Fam Zheng <f...@redhat.com> wrote:
> After the AioContext lock push down, there is a race between
> virtio_scsi_dataplane_start and those "assert(s->ctx &&
> s->dataplane_started)", because the latter isn't wrapped in
> aio_context_acquire.
>
> Reproducer is simply booting a Fedora guest with an empty
> virtio-scsi-dataplane controller:
>
> qemu-system-x86_64 \
>   -drive 
> if=none,id=root,format=raw,file=Fedora-Cloud-Base-25-1.3.x86_64.raw \
>   -device virtio-scsi \
>   -device scsi-disk,drive=root,bootindex=1 \
>   -object iothread,id=io \
>   -device virtio-scsi-pci,iothread=io \
>   -net user,hostfwd=tcp::10022-:22 -net nic,model=virtio -m 2048 \
>   --enable-kvm
>
> Fix this by moving acquire/release pairs from virtio_scsi_handle_*_vq to
> their callers - and wrap the broken assertions in.
>
> Signed-off-by: Fam Zheng <f...@redhat.com>

Verified this fixes the assertion failure on 2.9.0-rc0.

Tested-by: Ed Swierk <eswi...@skyportsystems.com>



Re: [Qemu-devel] [PATCH] virtio: Report real progress in VQ aio poll handler

2017-02-09 Thread Ed Swierk
On Thu, Feb 9, 2017 at 12:40 AM, Fam Zheng <f...@redhat.com> wrote:
> In virtio_queue_host_notifier_aio_poll, not all "!virtio_queue_empty()"
> cases are making true progress.
>
> Currently the offending one is virtio-scsi event queue, whose handler
> does nothing if no event is pending. As a result aio_poll() will spin on
> the "non-empty" VQ and take 100% host CPU.
>
> Fix this by reporting actual progress from virtio queue aio handlers.
>
> Reported-by: Ed Swierk <eswi...@skyportsystems.com>
> Signed-off-by: Fam Zheng <f...@redhat.com>

Tested-by: Ed Swierk <eswi...@skyportsystems.com>

> ---
>  hw/block/dataplane/virtio-blk.c |  4 ++--
>  hw/block/virtio-blk.c   | 12 ++--
>  hw/scsi/virtio-scsi-dataplane.c | 14 +++---
>  hw/scsi/virtio-scsi.c   | 14 +++---
>  hw/virtio/virtio.c  | 15 +--
>  include/hw/virtio/virtio-blk.h  |  2 +-
>  include/hw/virtio/virtio-scsi.h |  6 +++---
>  include/hw/virtio/virtio.h  |  4 ++--
>  8 files changed, 45 insertions(+), 26 deletions(-)
>
> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
> index d1f9f63..5556f0e 100644
> --- a/hw/block/dataplane/virtio-blk.c
> +++ b/hw/block/dataplane/virtio-blk.c
> @@ -147,7 +147,7 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane 
> *s)
>  g_free(s);
>  }
>
> -static void virtio_blk_data_plane_handle_output(VirtIODevice *vdev,
> +static bool virtio_blk_data_plane_handle_output(VirtIODevice *vdev,
>  VirtQueue *vq)
>  {
>  VirtIOBlock *s = (VirtIOBlock *)vdev;
> @@ -155,7 +155,7 @@ static void 
> virtio_blk_data_plane_handle_output(VirtIODevice *vdev,
>  assert(s->dataplane);
>  assert(s->dataplane_started);
>
> -virtio_blk_handle_vq(s, vq);
> +return virtio_blk_handle_vq(s, vq);
>  }
>
>  /* Context: QEMU global mutex held */
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index 702eda8..baaa195 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -581,10 +581,11 @@ static int virtio_blk_handle_request(VirtIOBlockReq 
> *req, MultiReqBuffer *mrb)
>  return 0;
>  }
>
> -void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
> +bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
>  {
>  VirtIOBlockReq *req;
>  MultiReqBuffer mrb = {};
> +bool progress = false;
>
>  blk_io_plug(s->blk);
>
> @@ -592,6 +593,7 @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
>  virtio_queue_set_notification(vq, 0);
>
>  while ((req = virtio_blk_get_request(s, vq))) {
> +progress = true;
>  if (virtio_blk_handle_request(req, &mrb)) {
>  virtqueue_detach_element(req->vq, >elem, 0);
>  virtio_blk_free_request(req);
> @@ -607,6 +609,12 @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
>  }
>
>  blk_io_unplug(s->blk);
> +return progress;
> +}
> +
> +static void virtio_blk_handle_output_do(VirtIOBlock *s, VirtQueue *vq)
> +{
> +virtio_blk_handle_vq(s, vq);
>  }
>
>  static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> @@ -622,7 +630,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, 
> VirtQueue *vq)
>  return;
>  }
>  }
> -virtio_blk_handle_vq(s, vq);
> +virtio_blk_handle_output_do(s, vq);
>  }
>
>  static void virtio_blk_dma_restart_bh(void *opaque)
> diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
> index 6b8d0f0..74c95e0 100644
> --- a/hw/scsi/virtio-scsi-dataplane.c
> +++ b/hw/scsi/virtio-scsi-dataplane.c
> @@ -49,35 +49,35 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error 
> **errp)
>  }
>  }
>
> -static void virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev,
> +static bool virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev,
>VirtQueue *vq)
>  {
>  VirtIOSCSI *s = (VirtIOSCSI *)vdev;
>
>  assert(s->ctx && s->dataplane_started);
> -virtio_scsi_handle_cmd_vq(s, vq);
> +return virtio_scsi_handle_cmd_vq(s, vq);
>  }
>
> -static void virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev,
> +static bool virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev,
> VirtQueue *vq)
>  {
>  VirtIOSCSI *s = VIRTIO_SCSI(vdev);
>
>  assert(s->ctx && s->dataplane_started);
> -virtio_scsi_handle_ctrl_vq(s, vq);
> +return virtio_scsi_handle_c

Re: [Qemu-devel] virtio-scsi-pci iothread spins at 100%

2017-02-09 Thread Ed Swierk
On Thu, Feb 9, 2017 at 2:12 AM, Fam Zheng  wrote:
> This should fix it:
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg01874.html

I tested this patch and found it fixes the problem. Thanks for the
speedy response!

--Ed



Re: [Qemu-devel] virtio-scsi-pci iothread spins at 100%

2017-02-08 Thread Ed Swierk
On Wed, Feb 8, 2017 at 6:52 PM, Fam Zheng  wrote:
> This means virtio-scsi event vq handler is returning true but actually no
> progress is made. Can you try the following patch to see if it's because a
> stalled cache of VQ index?
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 6365706..7f7ab57 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -2126,7 +2126,7 @@ static bool virtio_queue_host_notifier_aio_poll(void 
> *opaque)
>  EventNotifier *n = opaque;
>  VirtQueue *vq = container_of(n, VirtQueue, host_notifier);
>
> -if (virtio_queue_empty(vq)) {
> +if (vring_avail_idx(vq) == vq->last_avail_idx) {
>  return false;
>  }

I tried this change but the behavior is the same, unfortunately.

--Ed



Re: [Qemu-devel] virtio-scsi-pci iothread spins at 100%

2017-02-08 Thread Ed Swierk
On Wed, Feb 8, 2017 at 5:47 PM, Fam Zheng  wrote:
> No, something is wrong. The polling shouldn't keep running when there is no 
> I/O
> activity.
>
> Can you try "perf top" to see what poll handlers are spinning?

Samples: 288K of event 'cycles', Event count (approx.): 57149970643
Overhead  Shared ObjectSymbol
  16.96%  qemu-system-x86_64   [.] lduw_le_phys
  15.77%  [vdso]   [.] __vdso_clock_gettime
   7.25%  qemu-system-x86_64   [.] qemu_lockcnt_cmpxchg_or_wait
   7.16%  qemu-system-x86_64   [.] aio_poll
   3.94%  qemu-system-x86_64   [.] address_space_translate
   3.69%  qemu-system-x86_64   [.] qemu_lockcnt_dec
   3.46%  qemu-system-x86_64   [.] virtio_queue_host_notifier_aio_poll
   3.32%  qemu-system-x86_64   [.] address_space_translate_internal
   2.54%  qemu-system-x86_64   [.] run_poll_handlers_once
   2.54%  libpthread-2.19.so   [.] pthread_mutex_lock
   2.53%  libpthread-2.19.so   [.] __pthread_mutex_unlock_usercnt
   2.53%  qemu-system-x86_64   [.] aio_notify_accept
   2.40%  qemu-system-x86_64   [.] timerlist_deadline_ns
   2.38%  qemu-system-x86_64   [.] address_space_lookup_region
   2.23%  qemu-system-x86_64   [.] timerlistgroup_deadline_ns
   2.21%  qemu-system-x86_64   [.] qemu_lockcnt_inc
   2.08%  qemu-system-x86_64   [.] qemu_clock_get_ns
   1.91%  qemu-system-x86_64   [.] object_dynamic_cast_assert
   1.54%  qemu-system-x86_64   [.] virtio_queue_notify_aio_vq.part.16
   1.21%  qemu-system-x86_64   [.] qemu_map_ram_ptr
   1.16%  libc-2.19.so [.] __clock_gettime
   1.02%  qemu-system-x86_64   [.] event_notifier_poll
   1.02%  qemu-system-x86_64   [.] timerlistgroup_run_timers
   1.02%  qemu-system-x86_64   [.] virtio_queue_set_notification
   0.82%  qemu-system-x86_64   [.] aio_bh_poll
   0.81%  qemu-system-x86_64   [.] timerlist_run_timers
   0.70%  qemu-system-x86_64   [.] aio_dispatch
   0.66%  qemu-system-x86_64   [.] virtio_queue_empty.part.32
   0.62%  qemu-system-x86_64   [.] aio_compute_timeout
   0.50%  qemu-system-x86_64   [.] virtio_scsi_data_plane_handle_event
   0.32%  qemu-system-x86_64   [.] clock_gettime@plt
   0.26%  qemu-system-x86_64   [.] memory_region_is_ram_device
   0.21%  qemu-system-x86_64   [.] qemu_mutex_unlock
   0.12%  qemu-system-x86_64   [.] aio_context_acquire
   0.12%  qemu-system-x86_64   [.] iothread_run
   0.12%  qemu-system-x86_64   [.] qemu_lockcnt_count
   0.11%  qemu-system-x86_64   [.] pthread_mutex_lock@plt
   0.11%  qemu-system-x86_64   [.] qemu_mutex_lock
   0.11%  [kernel] [k] vmx_vcpu_run
   0.10%  libpthread-2.19.so   [.] pthread_mutex_unlock
   0.09%  qemu-system-x86_64   [.] aio_context_release
   0.09%  qemu-system-x86_64   [.] pthread_mutex_unlock@plt
   0.09%  [kernel] [k] native_write_msr_safe
   0.08%  [kernel] [k] ksm_do_scan
   0.07%  qemu-system-x86_64   [.] virtio_scsi_handle_event_vq

--Ed



Re: [Qemu-devel] virtio-scsi-pci iothread spins at 100%

2017-02-08 Thread Ed Swierk
On Wed, Feb 8, 2017 at 8:33 AM, Ed Swierk <eswi...@skyportsystems.com> wrote:
> Recently I noticed that when I configure a virtio-scsi-pci device
> using an iothread, as soon as the guest virtio-scsi driver loads, the
> iothread spins at 100%:
>
>   -object iothread,id=iothread1 -device virtio-scsi-pci,iothread=iothread1
>
> This occurs whether or not a disk is attached, with either
> poll-max-ns=0 or poll-max-ns=32768, and with Linux 3.13, 4.1 and 4.4
> guests. The iothread stops spinning as soon as the guest driver is
> unloaded.
>
> I bisected the issue to commit 684e508c23d28af8d6ed2c62738a0f60447c8274:
>
>   aio: add .io_poll_begin/end() callbacks
>
> It doesn't seem to affect performance, but obviously consuming CPU
> cycles when there's no disk attached is undesirable. Is this an
> expected side effect of implementing iothread polling? Is there a way
> to revert to the old non-polling behavior?

FWIW, changing

  return run_poll_handlers_once(ctx);

to

  return false;

in try_poll_mode() in aio-posix.c makes the iothread stop spinning,
but I don't know what damage this will cause.

--Ed



[Qemu-devel] virtio-scsi-pci iothread spins at 100%

2017-02-08 Thread Ed Swierk
Recently I noticed that when I configure a virtio-scsi-pci device
using an iothread, as soon as the guest virtio-scsi driver loads, the
iothread spins at 100%:

  -object iothread,id=iothread1 -device virtio-scsi-pci,iothread=iothread1

This occurs whether or not a disk is attached, with either
poll-max-ns=0 or poll-max-ns=32768, and with Linux 3.13, 4.1 and 4.4
guests. The iothread stops spinning as soon as the guest driver is
unloaded.

I bisected the issue to commit 684e508c23d28af8d6ed2c62738a0f60447c8274:

  aio: add .io_poll_begin/end() callbacks

It doesn't seem to affect performance, but obviously consuming CPU
cycles when there's no disk attached is undesirable. Is this an
expected side effect of implementing iothread polling? Is there a way
to revert to the old non-polling behavior?

--Ed



Re: [Qemu-devel] [PATCH] char: drop data written to a disconnected pty

2017-01-31 Thread Ed Swierk
On Tue, Jan 31, 2017 at 6:06 AM, Marc-André Lureau  wrote:
> I think this can be confusing if some backends silently drop the data (under 
> disconnected state), while other don't. Perhaps we should have instead a new 
> common chardev property "hup-drop" ? (suggestions for better name welcome)

Either way, when a pty is disconnected data will get dropped, whether
by the pty backend (as proposed) or by the serial port device or other
frontend (as currently).

The only difference from a user's perspective is whether the dropped
data gets written to the log file, which is the main motivation for
this change. The backward compatibility concern would be an existing
application that relies on data not being logged when the pty is
disconnected.

--Ed



Re: [Qemu-devel] [libvirt] char: Logging serial pty output when disconnected

2017-01-31 Thread Ed Swierk
On Tue, Jan 31, 2017 at 7:34 AM, Paolo Bonzini  wrote:
>
>
> On 31/01/2017 04:39, Daniel P. Berrange wrote:
>> I don't think so - retrying in this way is pointless IMHO - it is just
>> going to get the same result on every retry on 99% of occasions.
>
> Just to provide the full context, the retry happens even if you get
> EAGAIN, and in that case it does makes sense.
>
> But if the pty is disconnected I agree it should discard the data.

Thanks for the feedback. Please refer to this proposed patch:
http://patchwork.ozlabs.org/patch/721974/ .

--Ed



[Qemu-devel] [PATCH] char: drop data written to a disconnected pty

2017-01-31 Thread Ed Swierk
When a serial port writes data to a pty that's disconnected, drop the
data and return the length dropped. This avoids triggering pointless
retries in callers like the 16550A serial_xmit(), and causes
qemu_chr_fe_write() to write all data to the log file, rather than
logging only while a pty client like virsh console happens to be
connected.

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 qemu-char.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qemu-char.c b/qemu-char.c
index 676944a..ccb6923 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -1528,7 +1528,7 @@ static int pty_chr_write(CharDriverState *chr, const uint8_t *buf, int len)
 /* guest sends data, check for (re-)connect */
 pty_chr_update_read_handler_locked(chr);
 if (!s->connected) {
-return 0;
+return len;
 }
 }
 return io_channel_send(s->ioc, buf, len);
-- 
1.9.1




Re: [Qemu-devel] [libvirt] char: Logging serial pty output when disconnected

2017-01-30 Thread Ed Swierk
On Fri, Jan 27, 2017 at 1:40 AM, Daniel P. Berrange <berra...@redhat.com> wrote:
> On Thu, Jan 26, 2017 at 05:07:16PM -0800, Ed Swierk wrote:
>> Currently qemu_chr_fe_write() calls qemu_chr_fe_write_log() only for
>> data consumed by the backend chr_write function. With the pty backend,
>> pty_chr_write() returns 0 indicating that the data was not consumed
>> when the pty is disconnected. Simply changing it to return len instead
>> of 0 tricks the caller into logging the data even when the pty is
>> disconnected. I don't know what problems this might cause, but one
>> data point is that tcp_chr_write() already happens to work this way.
>>
>> Alternatively, qemu_chr_fe_write() could be modified to log everything
>> passed to it, regardless of how much data chr_write claims to have
>> consumed. The trouble is that the serial device retries writing
>> unconsumed data, so when the pty is disconnected you'd see every
>> character duplicated 4 times in the log file.
>>
>> Any opinions on either approach, or other suggestions? If there are no
>> objections to the first one, I'll prepare a patch.
>
> If the pty backend intends to just drop data into a blackhole when
> no client is connected, then its chr_write() impl should return
> the length of the data discarded, not zero.

That's exactly the question: when no client is connected, should the
pty backend just drop the data into a black hole, returning the length
of the data discarded? Or should it return 0, letting the frontend
device decide what to do with it?

I can't discern a consistent pattern across all the char backends. The
closest analog is the tcp backend, which does discard the data and
return len. In contrast, several backends call
io_channel_send{,_full}(), which returns -1 if the write would block
or fails for any other reason.

It's not clear there's much the frontend can do to recover from an
error, but there's no consistent pattern across serial devices either.
Most just ignore the return value. But the 16550A serial device
retries 4 times after an error. Changing the pty backend to discard
the data on the first attempt would bypass this retry mechanism. Is
that a problem?

--Ed



[Qemu-devel] char: Logging serial pty output when disconnected

2017-01-26 Thread Ed Swierk
Interactive access to a guest serial console can be enabled by hooking
the serial device to a pty backend, e.g. -device
isa-serial,chardev=cs0 -chardev pty,id=cs0. With libvirt this can be
configured via a <serial type='pty'/> element in the domain XML.

Output from the same serial device can also be logged to a file by
adding logfile=/somefile to the -chardev option (a <log file='/somefile'/>
element in libvirt).

Unfortunately output gets logged only when a client like virsh console
is connected to the pty; otherwise qemu drops it on the floor. This
makes chardev logging much less useful than it could be for debugging
guest problems after the fact.

Currently qemu_chr_fe_write() calls qemu_chr_fe_write_log() only for
data consumed by the backend chr_write function. With the pty backend,
pty_chr_write() returns 0 indicating that the data was not consumed
when the pty is disconnected. Simply changing it to return len instead
of 0 tricks the caller into logging the data even when the pty is
disconnected. I don't know what problems this might cause, but one
data point is that tcp_chr_write() already happens to work this way.

Alternatively, qemu_chr_fe_write() could be modified to log everything
passed to it, regardless of how much data chr_write claims to have
consumed. The trouble is that the serial device retries writing
unconsumed data, so when the pty is disconnected you'd see every
character duplicated 4 times in the log file.

Any opinions on either approach, or other suggestions? If there are no
objections to the first one, I'll prepare a patch.

--Ed



Re: [Qemu-devel] Virtual Machine Generation ID

2017-01-17 Thread Ed Swierk
You mean what causes the guest to re-read the vmgenid guid? The
vmgenid ACPI table defines a notify method, and when the guest
receives the corresponding event, it re-reads the guid. (Also it
appears that with Windows Server 2012 at least, if no notify method is
defined, as is the case with Xen, the guest just re-reads it on
demand.)

Wouldn't it be sufficient for the qmp set-vmgenid command to call
acpi_ram_update() with the new guid, and acpi_send_event() to notify
the guest?

--Ed


On Tue, Jan 17, 2017 at 6:33 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> Because this relies on guest to run code to read out the new fw cfg.
> Thus guest can not reliably detect that host didn't update the gen id -
> new one might be there in fw cfg but not yet read.
>
> RSDP never changes during guest lifetime so the issue does
> not occur.
>
> On Tue, Jan 17, 2017 at 06:26:57AM -0800, Ed Swierk wrote:
>> Why is using fw_cfg for vmgenid preferable to the approach used for
>> RSDP: call acpi_add_rom_blob() to allocate a MemoryRegion with the
>> initial vmgenid guid, and call acpi_ram_update() with the new guid
>> whenever the host needs to change it?
>>
>> --Ed
>>
>>
>> On Tue, Jan 17, 2017 at 5:26 AM, Igor Mammedov <imamm...@redhat.com> wrote:
>> > On Mon, 16 Jan 2017 10:57:42 -0800
>> > Ben Warren <b...@skyportsystems.com> wrote:
>> >
>> >> > On Jan 16, 2017, at 6:21 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> >> >
>> >> > On Sat, Jan 14, 2017 at 10:17:53PM -0800, Ben Warren wrote:
>> >> >> Hi Michael,
>> >> >>
>> >> >>
>> >> >>On Dec 10, 2016, at 7:28 PM, Michael S. Tsirkin <m...@redhat.com> 
>> >> >> wrote:
>> >> >>
>> >> >>On Tue, Dec 06, 2016 at 06:15:34PM -0800, Ben Warren wrote:
>> >> >>
>> >> >>Hi Michael,
>> >> >>
>> >> >>I’m well on my way to implementing this, but I am really new to 
>> >> >> the
>> >> >>QEMU code base and am struggling with some concepts.  Please 
>> >> >> see below:
>> >> >>
>> >> >>On Oct 5, 2016, at 6:29 PM, Michael S. Tsirkin 
>> >> >> <m...@redhat.com>
>> >> >>wrote:
>> >> >>
>> >> >>On Tue, Oct 04, 2016 at 03:51:40PM -0700, Ed Swierk wrote:
>> >> >>
>> >> >>On Thu, Sep 15, 2016 at 5:36 PM, Michael S. Tsirkin <
>> >> >>m...@redhat.com> wrote:
>> >> >>
>> >> >>On Thu, Sep 15, 2016 at 05:23:28PM -0700, Ed Swierk 
>> >> >> wrote:
>> >> >>
>> >> >>I'm wondering what it will take to finish up 
>> >> >> work on
>> >> >>vmgenid.
>> >> >>
>> >> >>
>> >> >> https://lists.gnu.org/archive/html/qemu-devel/2016-01/
>> >> >>msg05599.html
>> >> >>
>> >> >>
>> >> >>We have ACPI_BUILD_TPMLOG_FILE in tree now and I 
>> >> >> think it
>> >> >>could be
>> >> >>allocated in a similar way.
>> >> >>Integrate patch "fw-cfg: support writeable blobs" to
>> >> >>communicate the
>> >> >>allocated address back to QEMU.
>> >> >>
>> >> >>
>> >> >>Starting with Igor's last version at
>> >> >>https://github.com/imammedo/qemu/commits/vmgen_wip , 
>> >> >> it's not
>> >> >>clear to
>> >> >>me which changes need to be ported, which changes are 
>> >> >> obsoleted
>> >> >>by
>> >> >>your new fw-cfg stuff and/or upstream churn in ACPI, 
>> >> >> device
>> >> >>properties, etc. In particular ACPI is still a total 
>> >> >> mystery to
>> >> >>me,
>> >> >>though passing a single address from guest to host 
>> >>

Re: [Qemu-devel] Virtual Machine Generation ID

2017-01-17 Thread Ed Swierk
Why is using fw_cfg for vmgenid preferable to the approach used for
RSDP: call acpi_add_rom_blob() to allocate a MemoryRegion with the
initial vmgenid guid, and call acpi_ram_update() with the new guid
whenever the host needs to change it?

--Ed


On Tue, Jan 17, 2017 at 5:26 AM, Igor Mammedov <imamm...@redhat.com> wrote:
> On Mon, 16 Jan 2017 10:57:42 -0800
> Ben Warren <b...@skyportsystems.com> wrote:
>
>> > On Jan 16, 2017, at 6:21 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> >
>> > On Sat, Jan 14, 2017 at 10:17:53PM -0800, Ben Warren wrote:
>> >> Hi Michael,
>> >>
>> >>
>> >>On Dec 10, 2016, at 7:28 PM, Michael S. Tsirkin <m...@redhat.com> 
>> >> wrote:
>> >>
>> >>On Tue, Dec 06, 2016 at 06:15:34PM -0800, Ben Warren wrote:
>> >>
>> >>Hi Michael,
>> >>
>> >>I’m well on my way to implementing this, but I am really new to the
>> >>QEMU code base and am struggling with some concepts.  Please see 
>> >> below:
>> >>
>> >>On Oct 5, 2016, at 6:29 PM, Michael S. Tsirkin 
>> >> <m...@redhat.com>
>> >>wrote:
>> >>
>> >>On Tue, Oct 04, 2016 at 03:51:40PM -0700, Ed Swierk wrote:
>> >>
>> >>On Thu, Sep 15, 2016 at 5:36 PM, Michael S. Tsirkin <
>> >>m...@redhat.com> wrote:
>> >>
>> >>On Thu, Sep 15, 2016 at 05:23:28PM -0700, Ed Swierk 
>> >> wrote:
>> >>
>> >>I'm wondering what it will take to finish up work 
>> >> on
>> >>vmgenid.
>> >>
>> >>
>> >> https://lists.gnu.org/archive/html/qemu-devel/2016-01/
>> >>msg05599.html
>> >>
>> >>
>> >>We have ACPI_BUILD_TPMLOG_FILE in tree now and I think 
>> >> it
>> >>could be
>> >>allocated in a similar way.
>> >>Integrate patch "fw-cfg: support writeable blobs" to
>> >>communicate the
>> >>allocated address back to QEMU.
>> >>
>> >>
>> >>Starting with Igor's last version at
>> >>https://github.com/imammedo/qemu/commits/vmgen_wip , it's 
>> >> not
>> >>clear to
>> >>me which changes need to be ported, which changes are 
>> >> obsoleted
>> >>by
>> >>your new fw-cfg stuff and/or upstream churn in ACPI, device
>> >>properties, etc. In particular ACPI is still a total 
>> >> mystery to
>> >>me,
>> >>though passing a single address from guest to host can't be
>> >>that hard,
>> >>can it?
>> >>
>> >>Any clues would be appreciated.
>> >>
>> >>--Ed
>> >>
>> >>
>> >>It might be best to just re-start from the beginning.
>> >>So the idea is that ACPI should be about supplying the address
>> >>to guest. To supply address to host we'll use fw cfg.
>> >>This would be new I think:
>> >>
>> >>- add support for writeable fw cfg blobs
>> >>
>> >>patch applied
>> >>
>> >>- add linker/loader command to write address of a blob into
>> >>such a fw cfg file
>> >>- add a new file used for vm gen id, use loader command above
>> >>to pass the address of a blob allocated for it to host
>> >>
>> >>I don’t really understand the meaning of “file” in this context.  
>> >> It
>> >>seems to be a way of specifying individual fw_cfg entries without
>> >>explicitly giving an index, but is not something that is visible in
>> >>either the host or guest file system.  Is this about right?  In my 
>> >> code
>> >>I’m using “/etc/vmgenid”
>> >>
>> >>
>> >>yes
>> >>
>> >

Re: [Qemu-devel] Assertion failure on qcow2 disk with cluster_size != 64k

2016-10-24 Thread Ed Swierk
On Mon, Oct 24, 2016 at 2:21 PM, Eric Blake  wrote:
> How are you getting max_transfer == 65536?  I can't reproduce it with
> the following setup:
>
> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
> $ qemu-io -f qcow2 -c 'w 7m 1k' file
> $ qemu-io -f qcow2 -c 'w -z 8003584 2093056' file
>
> although I did confirm that the above sequence was enough to get the
> -ENOTSUP failure and fall into the code calculating max_transfer.
>
> I'm guessing that you are using something other than a file system as
> the backing protocol for your qcow2 image.  But do you really have a
> protocol that takes AT MOST 64k per transaction, while still trying to use a
> cluster size of 1M in the qcow2 format?  That's rather awkward, as it
> means that you are required to do 16 transactions per cluster (the whole
> point of using larger clusters is usually to get fewer transactions).  I
> think we need to get to a root cause of why you are seeing such a small
> max_transfer, before I can propose the right patch, since I haven't been
> able to reproduce it locally yet (although I admit I haven't tried to
> see if blkdebug could reliably introduce artificial limits to simulate
> your setup).  And it may turn out that I just have to fix the
> bdrv_co_do_pwrite_zeroes() code to loop multiple times if the size of
> the unaligned head really does exceed the max_transfer size that the
> underlying protocol is able to support, rather than assuming that the
> unaligned head/tail always fit in a single fallback write.

In this case I'm using a qcow2 image that's stored directly in a raw
dm-crypt/LUKS container, which is itself a loop device on an ext4
filesystem.

It appears loop devices (with or without dm-crypt/LUKS) report a
255-sector maximum per request via the BLKSECTGET ioctl, which qemu
rounds down to 64k in raw_refresh_limits(). However this maximum
appears to be just a hint: bdrv_driver_pwritev() succeeds even with a
385024-byte buffer of zeroes.
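The rounding described above can be checked numerically. This is an illustrative sketch (not qemu code), assuming the byte limit derived from the BLKSECTGET hint is floored to a power of two as raw_refresh_limits() does:

```python
SECTOR_SIZE = 512

def pow2floor(n: int) -> int:
    """Largest power of two <= n."""
    return 1 << (n.bit_length() - 1)

# Loop devices report at most 255 sectors per request via BLKSECTGET.
blksectget_hint = 255
byte_limit = blksectget_hint * SECTOR_SIZE   # 130560 bytes
max_transfer = pow2floor(byte_limit)         # floored to 65536 (64 KiB)

print(byte_limit, max_transfer)
```

So a 255-sector hint becomes exactly the 64k max_transfer seen in the assertion failure.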

As for the 1M cluster size, this is a temporary workaround for another
qemu issue (the default qcow2 L2 table cache size performs well with
random reads covering only up to 8 GB of image data with 64k clusters;
beyond that the L2 table cache thrashes). I agree this is not an
optimal configuration for writes.

> Can you also try this patch? If I'm right, you'll still fail, but the
> assertion will be slightly different.  (Again, I'm passing locally, but
> that's because I'm using the file protocol, and my file system does not
> impose a puny 64k max transfer).
>
> diff --git i/block/io.c w/block/io.c
> index b136c89..8757063 100644
> --- i/block/io.c
> +++ w/block/io.c
> @@ -1179,6 +1179,8 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>  int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>  int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
>  bs->bl.request_alignment);
> +int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> +MAX_WRITE_ZEROES_BOUNCE_BUFFER);
>
>  assert(alignment % bs->bl.request_alignment == 0);
>  head = offset % alignment;
> @@ -1197,6 +1199,8 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>  /* Make a small request up to the first aligned sector.  */
>  num = MIN(count, alignment - head);
>  head = 0;
> +assert(num < max_write_zeroes);
> +assert(num < max_transfer);
>  } else if (tail && num > alignment) {
>  /* Shorten the request to the last aligned sector.  */
>  num -= tail;
> @@ -1222,8 +1226,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>
>  if (ret == -ENOTSUP) {
>  /* Fall back to bounce buffer if write zeroes is unsupported */
> -int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> -MAX_WRITE_ZEROES_BOUNCE_BUFFER);
>  BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>
>  if ((flags & BDRV_REQ_FUA) &&

With this change, the num < max_transfer assertion fails on the first
iteration (with num=385024 and max_transfer=65536).

--Ed



Re: [Qemu-devel] Assertion failure on qcow2 disk with cluster_size != 64k

2016-10-21 Thread Ed Swierk
On Thu, Oct 20, 2016 at 6:38 PM, Eric Blake <ebl...@redhat.com> wrote:
> On 10/20/2016 07:24 PM, Ed Swierk wrote:
>> Changing max_transfer in the normal write case to
>> MIN_NON_ZERO(alignment, MAX_WRITE_ZEROES_BOUNCE_BUFFER) appears to fix
>> the problem, but I don't pretend to understand all the subtleties
>> here.
>
> That actually sounds like the right fix.  But since the bug was probably
> caused by my code, I'll formalize it into a patch and see if I can
> modify the testsuite to give it coverage.

If alignment > MAX_WRITE_ZEROES_BOUNCE_BUFFER (however unlikely) we
have the same problem, so maybe this would be better?

max_transfer = alignment > 0 ? alignment : MAX_WRITE_ZEROES_BOUNCE_BUFFER

--Ed



[Qemu-devel] Assertion failure on qcow2 disk with cluster_size != 64k

2016-10-20 Thread Ed Swierk
Shortly after I start qemu 2.7.0 with a qcow2 disk image created with
-o cluster_size=1048576, it prints the following and dies:

block/qcow2.c:2451: qcow2_co_pwrite_zeroes: Assertion `head + count <= s->cluster_size' failed.

I narrowed the problem to bdrv_co_do_pwrite_zeroes(), called by
bdrv_aligned_pwritev() with flags & BDRV_REQ_ZERO_WRITE set.

On the first loop iteration, offset=8003584, count=2093056,
head=663552, tail=659456 and num=2093056. qcow2_co_pwrite_zeroes() is
called with offset=8003584 and count=385024 and finds that the head
portion is not already zero, so it returns -ENOTSUP.
bdrv_co_do_pwrite_zeroes() falls back to a normal write, with
max_transfer=65536.

The next iteration starts with offset incremented by 65536, count and
num decremented by 65536, and head=0, violating the assumption that
the entire 385024 bytes of the head remainder had been zeroed the
first time around. Then it calls qcow2_co_pwrite_zeroes() with an
unaligned offset=8069120 and a count=1368064 greater than the cluster
size, triggering the assertion failure.
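The head/tail arithmetic above can be reproduced with the reported values. A Python sketch (not qemu code) of the alignment math in bdrv_co_do_pwrite_zeroes():

```python
alignment = 1048576                  # 1 MiB cluster size
offset, count = 8003584, 2093056     # values from the failing request

head = offset % alignment            # unaligned bytes before the first boundary
tail = (offset + count) % alignment  # unaligned bytes after the last boundary

# The first iteration shrinks the request to the unaligned head...
num = min(count, alignment - head)

# ...but the ENOTSUP fallback writes only max_transfer (64 KiB) of it, so the
# loop resumes mid-head at an offset that is no longer cluster-aligned.
next_offset = offset + 65536

print(head, tail, num, next_offset)
```

This matches the numbers in the report: head=663552, tail=659456, num=385024, and an unaligned next offset of 8069120.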

Changing max_transfer in the normal write case to
MIN_NON_ZERO(alignment, MAX_WRITE_ZEROES_BOUNCE_BUFFER) appears to fix
the problem, but I don't pretend to understand all the subtleties
here.

--Ed



Re: [Qemu-devel] [Qemu-block] [PATCH] qcow2: Optimize L2 table cache size based on image and cluster sizes

2016-10-07 Thread Ed Swierk
On Fri, Oct 7, 2016 at 6:46 AM, Frank Myhr  wrote:
> On 10/07/2016 06:56 AM, Alberto Garcia wrote:
>>
>> I would like to know what's the use case you (Frank, Ed)
>> are thinking about:
>
>
>> - Are we talking about command-line options, QMP, or both?
>
>
> Command-line options alone are sufficient for my use case, which is 
> configuring and starting VMs using libvirt.
>
>
>> - Do you know the size of the disk image or you want to be able to set
>>   the cache size without having access to that information? In other
>>   words, what's preventing you from calculating the cache size that you
>>   need?
>
>
> I know the disk image size, and can set cache size in bytes. 
> l2-cache-size=max would be a convenience feature, *especially if it could 
> become the default*. Then I could forget thinking about whether the image is 
> larger than the current 8GB fully-cached threshold, and only have to 
> calculate cache size in bytes in case of very large images, multiple backing 
> files, or very tight memory.

Same here, using libvirt. l2-cache-size=max would be ideal. Or if
there were a cache-coverage-size option that takes an absolute number,
libvirt could set it to the image size.

--Ed



Re: [Qemu-devel] Virtual Machine Generation ID

2016-10-04 Thread Ed Swierk
On Thu, Sep 15, 2016 at 5:36 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Thu, Sep 15, 2016 at 05:23:28PM -0700, Ed Swierk wrote:
>> I'm wondering what it will take to finish up work on vmgenid.
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg05599.html
>
> We have ACPI_BUILD_TPMLOG_FILE in tree now and I think it could be
> allocated in a similar way.
> Integrate patch "fw-cfg: support writeable blobs" to communicate the
> allocated address back to QEMU.

Starting with Igor's last version at
https://github.com/imammedo/qemu/commits/vmgen_wip , it's not clear to
me which changes need to be ported, which changes are obsoleted by
your new fw-cfg stuff and/or upstream churn in ACPI, device
properties, etc. In particular ACPI is still a total mystery to me,
though passing a single address from guest to host can't be that hard,
can it?

Any clues would be appreciated.

--Ed



[Qemu-devel] [PATCH v2] qcow2: Add cache-size=max option to optimize caches for performance

2016-10-04 Thread Ed Swierk
The optimal size of the qcow2 L2 cache depends on the working set size
and the cluster size of the virtual disk. If the cache is too small,
L2 tables are re-read from disk on every IO operation in the worst
case. The host's buffer cache can paper over this inefficiency, but
with cache=none or cache=directsync, L2 cache thrashing dramatically
reduces IO operations per second (e.g. from 50k/sec to 5k/sec for a
qcow2 stored on a modern SSD).

The default L2 cache size of 1 MB is optimal for an 8 GB working set
in a virtual disk with a 64 KB cluster size. The user can control the
cache size manually via the l2-cache-size and cache-size options, but
still has to compute the correct size based on working set and cluster
sizes.
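The sizing rule can be sketched as follows: each 8-byte L2 entry maps one cluster, so covering a whole image needs roughly image_size / cluster_size * 8 bytes of L2 cache. (An illustrative back-of-envelope check, not qemu's read_cache_sizes() code.)

```python
L2_ENTRY_SIZE = 8  # bytes per L2 table entry, one entry per cluster

def l2_cache_bytes(image_size: int, cluster_size: int) -> int:
    """L2 cache size needed to cover the whole image, rounded up."""
    clusters = -(-image_size // cluster_size)  # ceiling division
    return clusters * L2_ENTRY_SIZE

# The default 1 MiB cache covers exactly 8 GiB at the default 64 KiB clusters.
default_covered = l2_cache_bytes(8 << 30, 64 << 10)
print(default_covered)
```

For a 16 GB image with 64 KB clusters the same formula gives a 2 MiB L2 cache, consistent with the example in docs/qcow2-cache.txt below.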

This change adds a cache-size=max option that tells qemu to optimize
for qcow2 performance, making the L2 and refcount caches as large as
necessary to cover the entire virtual disk.

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 block/qcow2.c| 16 +---
 docs/qcow2-cache.txt | 10 --
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 91ef4df..6f3a6ed 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -442,7 +442,7 @@ static QemuOptsList qcow2_runtime_opts = {
 },
 {
 .name = QCOW2_OPT_CACHE_SIZE,
-.type = QEMU_OPT_SIZE,
+.type = QEMU_OPT_STRING,
.help = "Maximum combined metadata (L2 tables and refcount blocks) "
 "cache size",
 },
@@ -525,18 +525,28 @@ static void read_cache_sizes(BlockDriverState *bs, QemuOpts *opts,
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t combined_cache_size;
-bool l2_cache_size_set, refcount_cache_size_set, combined_cache_size_set;
+bool l2_cache_size_set, refcount_cache_size_set;
+const char *combined_cache_size_set;
 
 combined_cache_size_set = qemu_opt_get(opts, QCOW2_OPT_CACHE_SIZE);
 l2_cache_size_set = qemu_opt_get(opts, QCOW2_OPT_L2_CACHE_SIZE);
refcount_cache_size_set = qemu_opt_get(opts, QCOW2_OPT_REFCOUNT_CACHE_SIZE);
 
-combined_cache_size = qemu_opt_get_size(opts, QCOW2_OPT_CACHE_SIZE, 0);
 *l2_cache_size = qemu_opt_get_size(opts, QCOW2_OPT_L2_CACHE_SIZE, 0);
 *refcount_cache_size = qemu_opt_get_size(opts,
  QCOW2_OPT_REFCOUNT_CACHE_SIZE, 0);
 
 if (combined_cache_size_set) {
+if (strcasecmp(combined_cache_size_set, "max"))
+parse_option_size(QCOW2_OPT_CACHE_SIZE, combined_cache_size_set,
+  &combined_cache_size, &error_abort);
+else
+combined_cache_size =
+(bs->total_sectors * BDRV_SECTOR_SIZE * 8
+ + s->cluster_size - 1) / s->cluster_size
+* (DEFAULT_L2_REFCOUNT_SIZE_RATIO + 1)
+/ DEFAULT_L2_REFCOUNT_SIZE_RATIO;
+
 if (l2_cache_size_set && refcount_cache_size_set) {
 error_setg(errp, QCOW2_OPT_CACHE_SIZE ", " QCOW2_OPT_L2_CACHE_SIZE
" and " QCOW2_OPT_REFCOUNT_CACHE_SIZE " may not be set "
diff --git a/docs/qcow2-cache.txt b/docs/qcow2-cache.txt
index 5bb0607..c7ea9f3 100644
--- a/docs/qcow2-cache.txt
+++ b/docs/qcow2-cache.txt
@@ -114,9 +114,10 @@ There are three options available, and all of them take bytes:
 
 "l2-cache-size": maximum size of the L2 table cache
 "refcount-cache-size":   maximum size of the refcount block cache
-"cache-size":maximum size of both caches combined
+"cache-size":maximum size of both caches combined, or
+ "max" to cover entire disk
 
-There are two things that need to be taken into account:
+There are several things that need to be taken into account:
 
  - Both caches must have a size that is a multiple of the cluster
size.
@@ -125,11 +126,16 @@ There are two things that need to be taken into account:
adjust the others so that the L2 cache is 4 times bigger than the
refcount cache.
 
+ - If you set cache-size to the special value "max", QEMU will
+   automatically size both caches to cover the entire virtual disk.
+
 This means that these options are equivalent:
 
-drive file=hd.qcow2,l2-cache-size=2097152
-drive file=hd.qcow2,refcount-cache-size=524288
-drive file=hd.qcow2,cache-size=2621440
+   -drive file=hd.qcow2,cache-size=max [assuming hd.qcow2 is 16 GB
+with 64 KB clusters]
 
 The reason for this 1/4 ratio is to ensure that both caches cover the
 same amount of disk space. Note however that this is only valid with
-- 
1.9.1




Re: [Qemu-devel] [Qemu-block] [PATCH] qcow2: Optimize L2 table cache size based on image and cluster sizes

2016-10-04 Thread Ed Swierk
On Tue, Oct 4, 2016 at 7:31 AM, Alberto Garcia  wrote:
> That might be a lot of memory if the image is big. 1 TB qcow2 image ==
> 128 MB L2 cache.
>
> See https://bugzilla.redhat.com/show_bug.cgi?id=1377735#c2
>
> If we want to add this feature, I think I'd rather make it explicit.

I agree explicit is generally better than automagical, but unlike say
the VM RAM size, the qcow L2 table cache is an implementation detail
no user should be expected to know exists, let alone needs tweaking
according to a specific formula to avoid an arbitrary performance
cliff.

At least giving users a way to skip the math would be an improvement.
Would you be okay with an explicitly-set option like
l2_cache_size=auto or =max that optimizes for performance at the
expense of memory?

--Ed



[Qemu-devel] [PATCH] qcow2: Optimize L2 table cache size based on image and cluster sizes

2016-10-04 Thread Ed Swierk
The optimal size of the qcow2 L2 table cache depends on the working
set size and the cluster size of the disk image. If the cache is too
small, L2 tables are re-read from disk on every IO operation in the
worst case. The host's buffer cache can paper over this inefficiency,
but with cache=none or cache=directsync, L2 table cache thrashing
dramatically reduces IO operations per second (e.g. from 50k/sec to
5k/sec for a qcow2 stored on a modern SSD).

The L2 table cache size can be set manually via the l2-cache-size
runtime option, but such a performance-critical value needs a sane
default. Conservatively assuming the entire image is the working
set, the existing hardcoded default value of 1 MiB is appropriate
for images up to 8 GiB with the default 64 KiB cluster size.

This change replaces the hardcoded default L2 table cache size with a
value computed from the image size and cluster size. The resulting
default cache size is just large enough to service random accesses in
the entire image.

Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
---
 block/qcow2.c | 5 ++---
 block/qcow2.h | 4 
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 91ef4df..259e377 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -563,9 +563,8 @@ static void read_cache_sizes(BlockDriverState *bs, QemuOpts *opts,
 }
 } else {
 if (!l2_cache_size_set && !refcount_cache_size_set) {
-*l2_cache_size = MAX(DEFAULT_L2_CACHE_BYTE_SIZE,
- (uint64_t)DEFAULT_L2_CACHE_CLUSTERS
- * s->cluster_size);
+*l2_cache_size = (bs->total_sectors * BDRV_SECTOR_SIZE * 8
+  + s->cluster_size - 1) / s->cluster_size;
 *refcount_cache_size = *l2_cache_size
  / DEFAULT_L2_REFCOUNT_SIZE_RATIO;
 } else if (!l2_cache_size_set) {
diff --git a/block/qcow2.h b/block/qcow2.h
index b36a7bf..f148c4d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -68,10 +68,6 @@
 /* Must be at least 4 to cover all cases of refcount table growth */
 #define MIN_REFCOUNT_CACHE_SIZE 4 /* clusters */
 
-/* Whichever is more */
-#define DEFAULT_L2_CACHE_CLUSTERS 8 /* clusters */
-#define DEFAULT_L2_CACHE_BYTE_SIZE 1048576 /* bytes */
-
 /* The refblock cache needs only a fourth of the L2 cache size to cover as many
  * clusters */
 #define DEFAULT_L2_REFCOUNT_SIZE_RATIO 4
-- 
1.9.1




[Qemu-devel] Virtual Machine Generation ID

2016-09-15 Thread Ed Swierk
I'm wondering what it will take to finish up work on vmgenid.

https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg05599.html

It appears all of the designs explored through the 19 iterations were
problematic in some way. Is any of them vaguely acceptable to all
involved in the discussions? Or do we need to start from square one?

--Ed



Re: [Qemu-devel] [Xen-devel] [RFC] e1000: Don't save writes to ICS/ICR masked by IMS

2016-09-15 Thread Ed Swierk
On Thu, Sep 15, 2016 at 2:15 AM, Denis V. Lunev <d...@openvz.org> wrote:
> On 09/13/2016 11:59 PM, Konrad Rzeszutek Wilk wrote:
> > On Thu, Sep 01, 2016 at 10:57:48AM -0700, Ed Swierk wrote:
> >> Windows 8, 10 and Server 2012 guests hang intermittently while booting
> >> on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows
> >> logo appears and the little dots start spinning.
> >>
> >> Running strace on qemu shows its main thread doing the following every
> >> couple of milliseconds:
> >>
> >>  ppoll([..., {fd=30, events=POLLIN|POLLERR|POLLHUP},
> >> ...], ...) = 1 ([{fd=30, revents=POLLIN}], ...)
> >>  read(30, "^\0\0\0", 4) = 4
> >>  write(30, "^\0\0\0", 4) = 4
> >>  ioctl(30, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0
> >>  clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0
> >>  clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0
> >>  gettimeofday({1472251376, 673434}, NULL) = 0
> >>  clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0
> >>  gettimeofday({1472251376, 673679}, NULL) = 0
> >>
> >> The event channel (identified by '^' or 94 in this example) is always
> >> the third of the domain's four channels.
> >>
> >> Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and
> >> http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar
> >> issues, but don't help in this case.
> >>
> >> The proposed fix from
> >> https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang
> >> go away. It's not clear to me why it works, or if it's just papering
> >> over a bug elsewhere, or if there are any possible side effects.
> > CC-ing Denis.
> >
> >
> > Is the fix below based on reading the spec or more of instrumenting?
> >
> > Thanks.
>
> hmm. I have looked in our older code (completely separate
> from QEMU). It does not have this trick.
>
> 2012r2 (as far as I remember) has a bug in the driver
> when LSC interrupt was raised unexpectedly during driver
> initialization. The original bug was about TXQE. Can you
> pls confirm which interrupt causes the storm?

How can I tell which interrupt is firing? Instrument the QEMU e1000 code?

--Ed



[Qemu-devel] [RFC] e1000: Don't save writes to ICS/ICR masked by IMS

2016-09-01 Thread Ed Swierk
Windows 8, 10 and Server 2012 guests hang intermittently while booting
on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows
logo appears and the little dots start spinning.

Running strace on qemu shows its main thread doing the following every
couple of milliseconds:

 ppoll([..., {fd=30, events=POLLIN|POLLERR|POLLHUP},
...], ...) = 1 ([{fd=30, revents=POLLIN}], ...)
 read(30, "^\0\0\0", 4) = 4
 write(30, "^\0\0\0", 4) = 4
 ioctl(30, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0
 clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0
 clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0
 gettimeofday({1472251376, 673434}, NULL) = 0
 clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0
 gettimeofday({1472251376, 673679}, NULL) = 0

The event channel (identified by '^' or 94 in this example) is always
the third of the domain's four channels.

Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and
http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar
issues, but don't help in this case.

The proposed fix from
https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang
go away. It's not clear to me why it works, or if it's just papering
over a bug elsewhere, or if there are any possible side effects.

Suggested-by: Andrew Jones <drjo...@redhat.com>
Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index 6eac66d..c891b67 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -293,6 +293,8 @@ set_interrupt_cause(E1000State *s, int index, uint32_t val)
 uint32_t pending_ints;
 uint32_t mit_delay;
 
+val &= s->mac_reg[IMS];
+
 s->mac_reg[ICR] = val;
 
 /*
@@ -305,7 +307,7 @@ set_interrupt_cause(E1000State *s, int index, uint32_t val)
  */
 s->mac_reg[ICS] = val;
 
-pending_ints = (s->mac_reg[IMS] & s->mac_reg[ICR]);
+pending_ints = s->mac_reg[ICR];
 if (!s->mit_irq_level && pending_ints) {
 /*
  * Here we detect a potential raising edge. We postpone raising the



Re: [Qemu-devel] [PATCH for-2.5] piix: Document coreboot-specific RAM size config register

2015-08-09 Thread Ed Swierk
That original coreboot code certainly looks like a mistake. Thanks for
helping close the decade-long loop.



On Fri, Aug 7, 2015 at 12:15 PM, Eduardo Habkost ehabk...@redhat.com
wrote:

 The existing i440fx initialization code sets a PCI config register that
 isn't documented anywhere in the Intel 440FX datasheet. Register 0x57 is
 DRAMC (DRAM Control) and has nothing to do with the RAM size.

 This was implemented in commit ec5f92ce6ac8ec09056be77e03c941be188648fa
 because old coreboot code tried to read registers 0x5a-0x5f,0x56,0x57 to
 get the RAM size from QEMU, but I couldn't find out why coreboot did
 that. I assume it was a mistake, and the original code was supposed to
 be reading the DRB[0-7] registers (offsets 0x60-0x67).

 Document that coreboot-specific register offset in a macro and a
 comment, for future reference.

 Cc: Ed Swierk eswi...@skyportsystems.com
 Cc: Richard Smith smithb...@gmail.com
 Signed-off-by: Eduardo Habkost ehabk...@redhat.com
 ---
 References to coreboot commits:
 * Original commit adding code reading register offsets
   0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f, 0x56, 0x57 to Intel 440bx code in
   coreboot:
   cb8eab482ff09ec256456312ef2d6e7710123551
 * Commit adding the same register offsets to QEMU-specific
   coreboot code:
   9cf642bad3fdd2205ffdd83a3222a39855b1ceff
 * coreboot commit replacing the weird register offsets with
   code that actually reads the DRB7 register, in the I440BX code:
   1a9c892d58c746aef0cb530481c214e63a6a6871
 * coreboot commit replacing the weird register offets with
   code reading the CMOS in QEMU-specific code:
   7339f36961917814ed12d5a4e6f1fe19418b8aca
 ---
  hw/pci-host/piix.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

 diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c
 index ad55f99..1cb25f3 100644
 --- a/hw/pci-host/piix.c
 +++ b/hw/pci-host/piix.c
 @@ -117,6 +117,11 @@ struct PCII440FXState {
  #define I440FX_PAM_SIZE 7
  #define I440FX_SMRAM0x72

 +/* Older coreboot versions (4.0 and older) read a config register that
 doesn't
 + * exist in real hardware, to get the RAM size from QEMU.
 + */
 +#define I440FX_COREBOOT_RAM_SIZE 0x57
 +
  static void piix3_set_irq(void *opaque, int pirq, int level);
  static PCIINTxRoute piix3_route_intx_pin_to_irq(void *opaque, int
 pci_intx);
  static void piix3_write_config_xen(PCIDevice *dev,
 @@ -394,7 +399,7 @@ PCIBus *i440fx_init(PCII440FXState **pi440fx_state,
  if (ram_size > 255) {
  ram_size = 255;
  }
 -d->config[0x57] = ram_size;
 +d->config[I440FX_COREBOOT_RAM_SIZE] = ram_size;

  i440fx_update_memory_mappings(f);

 --
 2.1.0




[Qemu-devel] [PATCH] linux-user: Fix ioctl cmd type mismatch on 64-bit targets

2014-12-16 Thread Ed Swierk
linux-user passes the cmd argument of the ioctl syscall as a signed long,
but compares it to an unsigned int when iterating through the ioctl_entries
list.  When the cmd is a large value like 0x80047476 (TARGET_TIOCSWINSZ on
mips64) it gets sign-extended to 0xffffffff80047476, causing the comparison
to fail and resulting in lots of spurious "Unsupported ioctl" errors.
Changing the target_cmd field in the ioctl_entries list to a signed int
causes those values to be sign-extended as well during the comparison.

Signed-off-by: Ed Swierk eswi...@skyportsystems.com
---
 linux-user/syscall.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index aaac6a2..d636c81 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -3278,7 +3278,7 @@ typedef abi_long do_ioctl_fn(const IOCTLEntry *ie, 
uint8_t *buf_temp,
  int fd, abi_long cmd, abi_long arg);
 
 struct IOCTLEntry {
-unsigned int target_cmd;
+int target_cmd;
 unsigned int host_cmd;
 const char *name;
 int access;
-- 
1.9.1




[Qemu-devel] [PATCH] mips64-linux-user: Fix definition of struct sigaltstack

2014-12-16 Thread Ed Swierk
Without this fix, qemu segfaults when emulating the sigaltstack syscall,
because it incorrectly treats the ss_flags field as 64 bits rather than 32
bits.

Signed-off-by: Ed Swierk eswi...@skyportsystems.com
---
 linux-user/mips64/target_signal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/linux-user/mips64/target_signal.h 
b/linux-user/mips64/target_signal.h
index 6e1dc8b..5fb6a2c 100644
--- a/linux-user/mips64/target_signal.h
+++ b/linux-user/mips64/target_signal.h
@@ -8,7 +8,7 @@
 typedef struct target_sigaltstack {
abi_long ss_sp;
abi_ulong ss_size;
-   abi_long ss_flags;
+   abi_int ss_flags;
 } target_stack_t;
 
 
-- 
1.9.1




[Qemu-devel] [PATCH] linux-user: Return correct errno for unsupported netlink socket

2014-05-05 Thread Ed Swierk
This fixes "Cannot open audit interface - aborting." when the
EAFNOSUPPORT errno differs between the target and host
architectures (e.g. mips target and x86_64 host).

Signed-off-by: Ed Swierk eswi...@skyportsystems.com
---
 linux-user/syscall.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 9864813..271d28a 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -1861,7 +1861,7 @@ static abi_long do_socket(int domain, int type, int 
protocol)
 }
 
 if (domain == PF_NETLINK)
-return -EAFNOSUPPORT; /* do not NETLINK socket connections possible */
+return -TARGET_EAFNOSUPPORT;
 ret = get_errno(socket(domain, type, protocol));
 if (ret = 0) {
 ret = sock_flags_fixup(ret, target_type);
-- 
1.8.3.2




[Qemu-devel] Invalid data memory access on qemu-ppc

2013-02-08 Thread Ed Swierk
I'm using the ppc-linux-user target to run processes in a Debian
Wheezy filesystem I built using multistrap. I built qemu from
yesterday's head of the git tree.

When I try to run a Python JSON database called jsonstore, qemu-ppc
barfs with an "Invalid data memory access", with the invalid address
always being the lower 32 bits of guest_base (which I take to mean 0).

Here is an excerpt of qemu.log with in_asm enabled. I've included the
top and bottom, as well as a chunk from the middle where the guest
tries to execute the fcfid instruction (with no ill effects, at least
until the crash later on).

real_start=0x7f0468307000 real_size=0xf700
Reserved 0xf700 bytes of guest address space
host mmap_min_addr=0x1
guest_base  0x7f0468307000
startend  size prot
1000-1022e000 0022e000 r-x
1023d000-1029a000 0005d000 rw-
f678b000-f67ab000 0002 r-x
f67ab000-f67ac000 1000 ---
f67ac000-f67ae000 2000 rw-
f67ae000-f67af000 1000 ---
f67af000-f700 00851000 rw-
start_brk   0x
end_code0x1022d238
start_code  0x1000
start_data  0x1023ded4
end_data0x1028d99c
start_stack 0xf6fffa68
brk 0x10299da0
entry   0xf67a2f94
IN:
0xf67a2f94:  mr  r3,r1
0xf67a2f98:  li  r4,0
0xf67a2f9c:  addir1,r1,-16
0xf67a2fa0:  stw r4,0(r1)
0xf67a2fa4:  bl  0xf678d2e0

...

IN:
0x0fa89a54:  cmpwi   cr7,r3,0
0x0fa89a58:  beq-cr7,0xfa89a94

IN:
0x0fa89a94:  bl  0xfa89690

invalid/unsupported opcode: 3f - 0e - 1a (fc200e9c) 0fa89690 0
IN:
0x0fa89690:  fcfid   f1,f1

Invalid instruction
NIP 0fa89694   LR 0fa89a98 CTR 0fd28508 XER 2000
MSR 02006040 HID0   HF 02006000 idx 0
TB  
GPR00 0fa89a54 f6ffe3c0 f678b4a0 
GPR04  0fc0ae34 0008 0020
GPR08 ffc0  1028d968 0fd28508
GPR12 22282464 10295894 1033acf0 
GPR16 102f6fc0 1033acf0 0063 1028d968
GPR20 1033c780 1029 f65077b8 1029da90
GPR24 00a4 0029 f66fd0c0 103578f0
GPR28 0fcaaa64 0fc0aec4 0fc08b08 0fc0ac70
CR 22282462  [ E  E  E  L  E  G  G  E  ] RES 
FPR00 4018 4008 43308002 43308003
FPR04 fff80233 41d44526a340 4330d1149a8d 4330
FPR08 40c0e380 433021c7 fff821c7 40c0
FPR12 43308006 4018  
FPR16    
FPR20    
FPR24    
FPR28    
FPSCR 2000
IN:
0x0fa89810:  stwur1,-32(r1)
0x0fa89814:  mflrr0
0x0fa89818:  bcl-20,4*cr7+so,0xfa8981c

...

IN:
0x0f6de328:  li  r0,0
0x0f6de32c:  stw r0,12(r27)
0x0f6de330:  lwz r0,36(r1)
0x0f6de334:  lwz r26,8(r1)
0x0f6de338:  mtlrr0
0x0f6de33c:  lwz r27,12(r1)
0x0f6de340:  lwz r28,16(r1)
0x0f6de344:  lwz r29,20(r1)
0x0f6de348:  lwz r30,24(r1)
0x0f6de34c:  lwz r31,28(r1)
0x0f6de350:  addir1,r1,32
0x0f6de354:  blr

IN:
0x0f6dcda8:  lwz r3,16(r29)
0x0f6dcdac:  bl  0xf6de860

IN:
0x0f6de860:  lwz r11,-32304(r30)
0x0f6de864:  mtctr   r11
0x0f6de868:  bctr

Invalid data memory access: 0x68307000
NIP    LR 0f6dcdb0 CTR  XER 2000
MSR 02006040 HID0   HF 02006000 idx 0
TB  
GPR00 0f6dcda8 f6ffe210 f678b4a0 103cb648
GPR04 0016 0002 100eb470 0f6de0c4
GPR08 0001 0002 0002 
GPR12 4848 10295894 1029a0b8 
GPR16 107b1190 f66e6110  1028d968
GPR20 10774548 1029 10782b60 107a2b40
GPR24  1029a0b8 1031cc58 0790
GPR28  10779830 0f6f77c4 0f6ef9b4
CR 24482422  [ E  G  G  L  E  G  E  E  ] RES 
FPR00   43308000 4330d1158120
FPR04 4330800f18b7 41d44526a5c0 4330d1149a97 4330
FPR08  4330 3f320b5c2700 43308000
FPR12 43308001   
FPR16    
FPR20    
FPR24    
FPR28    
FPSCR 82002000

Any 

[Qemu-devel] [Qemu] [PATCH] Allow booting large Coreboot image as BIOS

2010-05-19 Thread Ed Swierk
When I build a 1024-kB Coreboot image (emulation/qemu-x86 mainboard),
qemu goes into a loop, resetting just after the "Jumping to boot code"
message.  This can be avoided by removing the | IO_MEM_ROM from the
cpu_register_physical_memory(0x100000 - isa_bios_size, ...)
call in pc_memory_init().

The next problem I hit is the complaint "ERROR: No valid CBFS header
found!  Maybe the ROM isn't entirely mapped yet?".  I can work around
this by allocating a separate chunk of RAM for the top 128 kB of the
BIOS image in ISA space and copying it rather than mapping it from the
full ROM image.

Both of these changes are the result of random hacking rather than any
real understanding of the issues involved.  I'd appreciate any better
ideas.

Signed-off-by: Ed Swierk eswi...@aristanetworks.com
---
 hw/pc.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/hw/pc.c b/hw/pc.c
index 20dc7fd..88478e7 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -880,7 +880,10 @@ void pc_memory_init(ram_addr_t ram_size,
 isa_bios_size = 128 * 1024;
 cpu_register_physical_memory(0x100000 - isa_bios_size,
  isa_bios_size,
- (bios_offset + bios_size - isa_bios_size) | IO_MEM_ROM);
+ qemu_ram_alloc(isa_bios_size));
+cpu_physical_memory_write_rom(0x100000 - isa_bios_size,
+  qemu_get_ram_ptr(bios_offset + bios_size - isa_bios_size),
+  isa_bios_size);
 
 option_rom_offset = qemu_ram_alloc(PC_ROM_SIZE);
 cpu_register_physical_memory(PC_ROM_MIN_VGA, PC_ROM_SIZE, option_rom_offset);


Re: [Qemu-devel] getfd monitor command broken

2010-02-22 Thread Ed Swierk
On Mon, Feb 22, 2010 at 12:51 PM, Luiz Capitulino
lcapitul...@redhat.com wrote:
 How do you reproduce it?

Here's a test program that reproduces the problem. Start qemu with

  -chardev socket,id=monitor,path=/tmp/qemu-monitor,server,nowait -mon
chardev=monitor,mode=readline

and run check_getfd /tmp/qemu-monitor. It will print an error and
return nonzero if the monitor output indicates getfd or closefd
failed.

--Ed
/*
 * check_getfd
 *
 * Tests the qemu getfd monitor command
 *
 * Copyright (c) 2010 Arista Networks, Inc.
 * 
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

int receive_output(int s, char *m) {
   unsigned int i = 0;
   char buf[10240];

   buf[0] = '\0';
   while (1) {
  if (recv(s, &buf[i], 1, 0) < 0) {
 perror("Failed to receive");
 return -1;
  }
  buf[++i] = '\0';
  if ((i > 7) && !strcmp(&buf[i-7], "(qemu) "))
 break;
   }

   if (m &&
   ((i < strlen(m) + 7) || strncmp(&buf[i-7-strlen(m)], m, strlen(m)))) {
  fprintf(stderr, "%s\n", buf);
  return -1;
   }

   return 0;
}

int main(int argc, char *argv[]) {
   struct sockaddr_un addr;
   int s;
   int fd;
   char fdbuf[CMSG_SPACE(sizeof(fd))];
   struct msghdr msg;
   struct cmsghdr *cmsg;
   struct iovec mvec;
   char *cmd = "getfd MYFD\nclosefd MYFD\n";

   if (argc != 2) {
  printf("Usage: %s QEMU_MONITOR\n\n", argv[0]);
  printf("  (start qemu with -chardev socket,id=monitor,path=QEMU_MONITOR"
 ",server,nowait -mon chardev=monitor,mode=readline)\n");
  return 1;
   }

   fd = open("/dev/null", O_RDWR);
   if (fd < 0) {
  perror("Failed to open /dev/null");
  return 1;
   }

   memset(&addr, 0, sizeof(addr));
   addr.sun_family = AF_UNIX;
   strncpy(addr.sun_path, argv[1], sizeof(addr.sun_path));

   s = socket(PF_UNIX, SOCK_STREAM, 0);
   if (s < 0) {
  perror("No socket");
  return 1;
   }

   if (connect(s, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
  perror("Failed to connect");
  return 1;
   }

   if (receive_output(s, NULL) < 0)
  return 1;

   mvec.iov_base = cmd;
   mvec.iov_len = strlen(cmd) + 1;
   msg.msg_name = NULL;
   msg.msg_namelen = 0;
   msg.msg_iov = &mvec;
   msg.msg_iovlen = 1;
   msg.msg_control = fdbuf;
   msg.msg_controllen = CMSG_LEN(sizeof(fd));
   msg.msg_flags = 0;

   cmsg = CMSG_FIRSTHDR(&msg);
   cmsg->cmsg_level = SOL_SOCKET;
   cmsg->cmsg_type = SCM_RIGHTS;
   cmsg->cmsg_len = msg.msg_controllen;
   memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

   if (sendmsg(s, &msg, 0) < 0) {
  perror("Failed to send");
  return 1;
   }

   if (receive_output(s, "\033[K\r\n") < 0)
  return 1;
   if (receive_output(s, "\033[K\r\n") < 0)
  return 1;

   return 0;
}


[Qemu-devel] getfd monitor command broken

2010-02-19 Thread Ed Swierk
Commit c62313bbdc48f72e93fa8196f2fff96ba35e4e9d seems to have broken
the getfd monitor command in qemu 0.12.

tcp_chr_read() calls tcp_chr_recv(), which checks whether the received
message includes an SCM_RIGHTS header, and if so, stores the received
fd in the CharDriverState struct. tcp_chr_read() passes the received
data to the monitor via qemu_chr_read(), and then closes the stored
fd.

Previously, tcp_chr_read() would receive the entire getfd command in
one message (perhaps by sheer luck), allowing the monitor to call
qemu_chr_get_msgfd() to obtain the stored fd before it disappeared.
Now that tcp_chr_read() receives only 1 byte at a time, the stored fd
vanishes when it receives byte 2.

I'm too confused by this control flow to suggest a proper solution,
but commenting out the "if (s->msgfd != -1)" block in tcp_chr_read()
at least makes the problem go away.

--Ed




[Qemu-devel] Re: [PATCH 1/2] macvtap: rework object lifetime rules

2010-02-15 Thread Ed Swierk
On Sat, Feb 13, 2010 at 2:33 AM, Arnd Bergmann a...@arndb.de wrote:
 The original macvtap code has a number of problems
 resulting from the use of RCU for protecting the
 access to struct macvtap_queue from open files.

 This includes
 - need for GFP_ATOMIC allocations for skbs
 - potential deadlocks when copy_*_user sleeps
 - inability to work with vhost-net

 Changing the lifetime of macvtap_queue to always
 depend on the open file solves all these. The
 RCU reference simply moves one step down to
 the reference on the macvlan_dev, which we
 only need for nonblocking operations.

 Signed-off-by: Arnd Bergmann a...@arndb.de

This works for me.

Acked-by: Ed Swierk eswi...@aristanetworks.com




Re: [Qemu-devel] [PATCH 0/7] KVM SMP support, early version

2009-11-30 Thread Ed Swierk
On Thu, Nov 26, 2009 at 9:24 AM, Glauber Costa glom...@redhat.com wrote:
 This is an early version of smp support in kvm that kinda works.

Can you please elaborate on smp support? The KVM FAQ makes some
rather broad claims about smp support already:

  Does KVM support SMP hosts?
  Yes.

  Does KVM support SMP guests?
  Yes. Up to 16 CPUs can be specified using the -smp option.

Thanks,
--Ed




Re: [Qemu-devel] [PATCH 0/7] KVM SMP support, early version

2009-11-30 Thread Ed Swierk
On Mon, Nov 30, 2009 at 4:31 PM, Ed Swierk eswi...@aristanetworks.com wrote:
 On Thu, Nov 26, 2009 at 9:24 AM, Glauber Costa glom...@redhat.com wrote:
 This is an early version of smp support in kvm that kinda works.

 Can you please elaborate on smp support? The KVM FAQ makes some
 rather broad claims about smp support already:

I see now that you're integrating smp support into qemu proper from
the kvm fork of qemu. Sorry for the noise.

--Ed




Re: [Qemu-devel] PreP kernels boot using Qemu

2007-10-30 Thread Ed Swierk
On 10/28/07, Rob Landley [EMAIL PROTECTED] wrote:
 Hmmm...  All the kernels I've built for this project are static.  In theory I
 can add "ne.irq=9" to the kernel command line, but in practice it doesn't
 seem to work.  Nor does "ne.0.irq=9" or "irq=9".

 However, when I hardwire "dev->irq = 9;" into the source code, it does seem to
 work.  (Or at least I can ping qemu's virtual gateway.)

 Off to read the kernel command line parsing logic...

It seems that the driver.parameter syntax works only for parameters
that the driver exports in /sys/module/driver/parameters. The ne
driver doesn't define parameters at all when not compiled as a module,
let alone export them via sysfs.

The attached patch is my feeble attempt to remedy the situation. With
the patch, adding "ne.irq=9,10" to the kernel command line sets the
correct irqs for eth0 and eth1.

--Ed


linux-2.6.23-ne-module-params.patch
Description: Binary data


Re: [Qemu-devel] merging kqemu into mainline kernel?

2007-08-16 Thread Ed Swierk
On 8/16/07, Christian MICHON [EMAIL PROTECTED] wrote:
 blame it on gmail. this is the default behaviour.
 it's obviously not intended...

At the risk of drifting further off-topic, I'd just point out that you
can press Ctrl-End in the Gmail message window before you type your
reply.

--Ed




[Qemu-devel] [PATCH] Memory corruption after OHCI reset

2007-07-25 Thread Ed Swierk

When the USB OHCI controller starts, a periodic end-of-frame routine
writes to a chunk of memory set aside by the device driver. If the
machine reboots or the OS kexecs, the controller continues writing
even though the memory is no longer owned by the device driver,
causing random, mysterious corruption, until the new OS reinitializes
the OHCI controller.

The attached patch fixes this by resetting the controller whenever the
machine reboots or the device driver issues a reset command, and
disabling the timer when the controller resets.

--Ed
Index: qemu-snapshot-2007-02-09_05/hw/usb-ohci.c
===
--- qemu-snapshot-2007-02-09_05.orig/hw/usb-ohci.c
+++ qemu-snapshot-2007-02-09_05/hw/usb-ohci.c
@@ -106,6 +106,8 @@ struct ohci_hcca {
 uint32_t done;
 };
 
+static void ohci_bus_stop(OHCIState *ohci);
+
 /* Bitfields for the first word of an Endpoint Desciptor.  */
 #define OHCI_ED_FA_SHIFT  0
 #define OHCI_ED_FA_MASK   (0x7f << OHCI_ED_FA_SHIFT)
@@ -323,11 +325,13 @@ static void ohci_attach(USBPort *port1, 
 }
 
 /* Reset the controller */
-static void ohci_reset(OHCIState *ohci)
+static void ohci_reset(void *opaque)
 {
+OHCIState *ohci = opaque;
 OHCIPort *port;
 int i;
 
+ohci_bus_stop(ohci);
 ohci->ctl = 0;
 ohci->old_ctl = 0;
 ohci->status = 0;
@@ -813,6 +817,7 @@ static void ohci_bus_stop(OHCIState *ohci)
 {
 if (ohci->eof_timer)
 qemu_del_timer(ohci->eof_timer);
+ohci->eof_timer = NULL;
 }
 
 /* Sets a flag in a port status register but only set it if the port is
@@ -898,6 +903,7 @@ static void ohci_set_ctl(OHCIState *ohci
 dprintf("usb-ohci: %s: USB Resume\n", ohci->pci_dev.name);
 break;
 case OHCI_USB_RESET:
+ohci_reset(ohci);
 dprintf("usb-ohci: %s: USB Reset\n", ohci->pci_dev.name);
 break;
 }
@@ -1266,5 +1272,6 @@ void usb_ohci_init(struct PCIBus *bus, i
 }
 
 ohci-async_td = 0;
+qemu_register_reset(ohci_reset, ohci);
 ohci_reset(ohci);
 }


Re: [Qemu-devel] Classic Mac OS

2007-05-21 Thread Ed Swierk
On Monday 21 May 2007 03:40:12 Markus Hitter wrote:
 Somewhere in the FAQs, the state of Classic Mac OS is mentioned as
 "it is being worked on". Is this still true, is it possible to join
 the effort?

I'm not aware of anyone currently working on classic Mac OS support. PowerPC 
system emulation has been broken for a while--at least I haven't had any luck 
booting a recent Linux distribution with a 2.6 kernel--so even with the 
proper firmware support, getting classic Mac OS to boot might be difficult 
until it's fixed.

--Ed




Re: [Qemu-devel] Rewritten Linux kernel loader

2007-05-17 Thread Ed Swierk
On Wednesday 16 May 2007 14:31:38 H. Peter Anvin wrote:
 As rewritten, it should follow the current version of the Linux boot
 protocol specification and recommendations.  As a side benefit, it no
 longer relies on load_linux.S; instead I have a small code generator
 which can be used to set up an arbitrary state -- thus usable for other
 startup scenarios as well.

It also eliminates the annoying hardcoded load addresses for the kernel and 
initrd. Thanks for doing this!

--Ed




[Qemu-devel] OpenHackWare hacking

2007-04-20 Thread Ed Swierk
The attached patch addresses a few problems in OpenHackWare:

- The return value from the OpenFirmware read function should not exceed the 
actual file size by more than one block; otherwise the Linux kernel's 
initramfs routines get confused by the extra junk and reject the initramfs.

- The OpenFirmware nextprop function should return 1 when a property is found, 
not the length of the property's name. Otherwise Linux fails to find any 
properties when unflattening the device tree.

- If the boot file's checksum is unknown, OpenHackWare should assume it's a 
Linux or OpenBSD boot script rather than barfing.

- The linker script requires Daniel Jacobowitz's fix to build on Fedora 6.

These changes get me a few steps closer to booting an unmodified Fedora 6 
PowerPC boot.iso. Outstanding issues include:

- The Fedora 6 version of yaboot looks for a conf= parameter in the bootargs 
to tell it where to find yaboot.conf; OHW needs to read this parameter from 
the boot script and pass it along.

- After the kernel finally boots, it complains about unhandled interrupts for 
the CD device. I assume this is a Qemu issue.

Comments welcome.

--Ed
diff -BurN OpenHackWare-release-0.4/src/libexec/chrp.c OpenHackWare-release-0.4.mine/src/libexec/chrp.c
--- OpenHackWare-release-0.4/src/libexec/chrp.c	2005-03-30 23:23:33.0 -0800
+++ OpenHackWare-release-0.4.mine/src/libexec/chrp.c	2007-04-19 13:06:53.0 -0700
@@ -243,9 +243,8 @@
 DPRINTF("Boot file embedded at the end of boot script\n");
 break;
 default:
-ERROR("XML error: unknown Forth script: %08x\n%s\n",
-  crc, (char *)tag->data);
-goto out;
+script_type = CHRP_SCRIPT_LOAD_BOOT;
+goto do_script;
 }
 break;
 
diff -BurN OpenHackWare-release-0.4/src/libfs/core.c OpenHackWare-release-0.4.mine/src/libfs/core.c
--- OpenHackWare-release-0.4/src/libfs/core.c	2005-03-30 23:23:33.0 -0800
+++ OpenHackWare-release-0.4.mine/src/libfs/core.c	2007-04-20 15:50:02.0 -0700
@@ -421,13 +421,15 @@
 
 int fs_read (inode_t *inode, void *buffer, int len)
 {
-uint32_t bsize, total;
+uint32_t bsize, total, max;
 int done, tmp;
 
 bsize = part_blocsize(inode->fs->part);
 total = 0;
 if (fs_seek(inode, inode->vbloc, inode->vpos) < 0)
 return -1;
+max = inode->size.bloc * bsize + inode->size.offset
+- inode->vbloc * bsize + inode->vpos;
 for (; len != 0; len -= done) {
 tmp = bsize - inode->vpos;
 if (len > tmp)
@@ -444,6 +446,8 @@
 total += done;
 }
 
+if (total > max)
+return max;
 return total;
 }
 
diff -BurN OpenHackWare-release-0.4/src/main.ld OpenHackWare-release-0.4.mine/src/main.ld
--- OpenHackWare-release-0.4/src/main.ld	2005-03-30 23:23:33.0 -0800
+++ OpenHackWare-release-0.4.mine/src/main.ld	2007-04-19 13:01:25.0 -0700
@@ -49,7 +49,7 @@
 _sdata_end = . ;
 . = ALIGN(4) ;
 _ro_start = . ;
-.rodata: { *(.rodata) } > bios
+.rodata: { *(.rodata*) } > bios
 _ro_end = . ;
 . = ALIGN(4) ;
 _RTAS_start = .;
diff -BurN OpenHackWare-release-0.4/src/of.c OpenHackWare-release-0.4.mine/src/of.c
--- OpenHackWare-release-0.4/src/of.c	2007-04-20 15:46:23.0 -0700
+++ OpenHackWare-release-0.4.mine/src/of.c	2007-04-20 14:56:56.0 -0700
@@ -4058,7 +4058,7 @@
 OF_DPRINTF("Return property name [%s]\n", next->name);
 OF_sts(next_name, (void *)(next->name));
 OF_DUMP_STRING(OF_env, next_name);
-pushd(OF_env, strlen(next->name) + 1);
+pushd(OF_env, 1);
 }
 }
 }


[Qemu-devel] Building OpenHackWare (qemu-system-ppc firmware)

2007-04-19 Thread Ed Swierk
I'm having trouble building OpenHackWare (the firmware for qemu-system-ppc) 
from source on a PPC machine running Fedora 6, using either gcc 4.1.1 or gcc 
3.4.6. I'm using the source tarball 
http://perso.magic.fr/l_indien/OpenHackWare/0.4/OpenHackWare-0.4.1.tar.bz2, 
with the patch pc-bios/ohw.diff from qemu-snapshot-2007-02-09_05.tar.bz2.

The code compiles just fine (although gcc 4.1.1 generates a lot of warnings), 
but fails to link:

ld -O2 -g -nostdlib -T 
src/main.ld -o .objs/main.out .objs/main.o .objs/bootinfos.o .objs/bloc.o 
.objs/pci.o .objs/of.
\
o .objs/start.o .objs/nvram.o .objs/vga.o .objs/mm.o .objs/char.o 
.objs/malloc.o .objs/errno.o .objs/_vprintf.o .objs/\
printf.o .objs/sprintf.o .objs/snprintf.o .objs/vprintf.o .objs/vsprintf.o 
.objs/vsnprintf.o .objs/dprintf.o .objs/vdp\
rintf.o .objs/memcpy.o .objs/memccpy.o .objs/mempcpy.o .objs/memmove.o 
.objs/memcmove.o .objs/mempmove.o .objs/memset.
\
o .objs/memcmp.o .objs/memchr.o .objs/rawmemchr.o .objs/memrchr.o 
.objs/memmem.o .objs/strcpy.o .objs/strdup.o .objs/s\
trndup.o .objs/stpcpy.o .objs/stpncpy.o .objs/strcat.o .objs/strncat.o 
.objs/strcmp.o .objs/strcasecmp.o .objs/strncmp\
.o .objs/strncasecmp.o .objs/strchr.o .objs/strchrnul.o .objs/strrchr.o 
.objs/basename.o .objs/dirname.o .objs/strlen.
\
o .objs/strnlen.o .objs/exec_core.o .objs/exec_elf.o .objs/exec_xcoff.o 
.objs/exec_macho.o .objs/exec_chrp.o .objs/exe\
c_prep.o .objs/exec_pef.o .objs/fs_core.o .objs/fs_raw.o .objs/fs_ext2.o 
.objs/fs_isofs.o .objs/fs_hfs.o .objs/part_co\
re.o .objs/part_apple.o .objs/part_isofs.o .objs/part_prep.o 
.objs/dev_char_pckbd.o .objs/dev_char_kbdadb.o .objs/dev_\
char_kbd.o
ld: region start is full (.objs/main.out section .rodata.str1.4)
ld: region start is full (.objs/main.out section .rodata.str1.4)
ld: section .text [05800400 - 05813767] overlaps 
section .rodata.str1.4 [05800054 - \
05804683]
ld: .objs/main.out: section .text lma 0x5800400 overlaps previous sections
ld: .objs/main.out: section .OpenFirmware lma 0x5813768 overlaps previous 
sections
ld: .objs/main.out: section .data lma 0x581946c overlaps previous sections
ld: .objs/main.out: section .OpenFirmware_vars lma 0x582053c overlaps previous 
sections
ld: .objs/main.out: section .sdata lma 0x5821a4c overlaps previous sections
ld: .objs/main.out: section .rodata lma 0x5821a5c overlaps previous sections
ld: .objs/main.out: section .RTAS lma 0x5822654 overlaps previous sections
ld: .objs/main.out: section .RTAS_vars lma 0x5823c10 overlaps previous 
sections
make: *** [.objs/main.out] Error 1

I thought debug symbols might be making the code too big, but the same error 
occurs when compiling without -g. The only thing that seems to work is 
increasing the length of the start section in main.ld, and moving the bios 
section's origin to compensate. Unfortunately the resulting binary hangs; the 
OS presumably relies on the bios origin being 0x05800400.

Any hints would be appreciated.

--Ed




Re: [Qemu-devel] Building OpenHackWare (qemu-system-ppc firmware)

2007-04-19 Thread Ed Swierk
On Thursday 19 April 2007 09:58:30 Daniel Jacobowitz wrote:
 I posted some other patches for OpenHackWare when I was experimenting
 with OpenBSD; I think this is the one you need, to the linker script:

  -.rodata: { *(.rodata) } > bios
  +.rodata: { *(.rodata*) } > bios

Works perfectly--thanks!

--Ed




[Qemu-devel] Linux 2.6 kernel on PPC

2007-03-22 Thread Ed Swierk

Has anyone succeeded in booting a Linux 2.6 kernel on Qemu PPC? If so,
what distribution did you use?

Thanks,
--Ed


___
Qemu-devel mailing list
Qemu-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/qemu-devel


Re: [Qemu-devel] [PATCH] don't require a disk image for network boot

2007-02-15 Thread Ed Swierk

On 2/15/07, Thomas Petazzoni [EMAIL PROTECTED] wrote:

BTW, is there a reason why a disk image is required when using the
-kernel option ?

In the following case: -kernel vmlinuz -append nfsroot=blabla, we
could boot over the network, without the need for any disk image, but
Qemu wants to have a disk image. Is it mandatory ?


On Linux, -hda /dev/zero works fine, so I'd say there's no reason a
disk image is necessary.

--Ed




[Qemu-devel] [PATCH] Fix crash after char device read returns 0

2007-02-14 Thread Ed Swierk

qemu 0.9.0 on Linux crashes with SIGSEGV after read() on a char device
returns 0, which occurs if the char device is a fifo and the writer
closes the file.

In this case, fd_chr_read() and stdio_read() react by removing the IO
handler and freeing it. Unfortunately main_loop_wait() is unprepared
to deal with this (as the comment "XXX: better handling of removal"
suggests) and attempts to access the freed handler.

Even if main_loop_wait() were improved, it is not correct to remove
the IO handler just because read() returns 0: if the char device is a
fifo, a process may well reopen the fifo for writing at a later point.
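Both facts--read() hitting EOF once the writer closes, and a later
writer being able to reopen the same fifo--are easy to check in
isolation. A minimal Linux sketch (the path and function name below are
mine, not qemu's):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the behavior described above is observed, 0 otherwise. */
static int fifo_reopen_demo(const char *path)
{
    char buf[16];
    int ok = 0;

    unlink(path);
    if (mkfifo(path, 0600) != 0)
        return 0;

    /* Non-blocking open so the reader does not wait for a writer. */
    int rfd = open(path, O_RDONLY | O_NONBLOCK);
    if (rfd < 0)
        goto out;

    int wfd = open(path, O_WRONLY);
    if (wfd < 0 || write(wfd, "hi", 2) != 2)
        goto out;
    close(wfd);                               /* first writer goes away */

    if (read(rfd, buf, sizeof buf) != 2)      /* buffered data */
        goto out;
    if (read(rfd, buf, sizeof buf) != 0)      /* EOF: no writer left */
        goto out;

    /* A later writer can reopen the very same fifo... */
    wfd = open(path, O_WRONLY);
    if (wfd < 0 || write(wfd, "again", 5) != 5)
        goto out;
    close(wfd);

    /* ...and the original read fd sees the new data. */
    ok = (read(rfd, buf, sizeof buf) == 5);
    close(rfd);
out:
    unlink(path);
    return ok;
}
```

So EOF on a fifo is a transient condition, not grounds for tearing the
handler down for good.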

The attached patch is a naive fix; feedback is welcome.

--Ed
Index: qemu-snapshot-2007-02-09_05/vl.c
===
--- qemu-snapshot-2007-02-09_05.orig/vl.c
+++ qemu-snapshot-2007-02-09_05/vl.c
@@ -1346,11 +1346,13 @@ static void fd_chr_read(void *opaque)
     if (len == 0)
         return;
     size = read(s->fd_in, buf, len);
+#if 0
     if (size == 0) {
         /* FD has been closed. Remove it from the active list.  */
         qemu_set_fd_handler2(s->fd_in, NULL, NULL, NULL, NULL);
         return;
     }
+#endif
     if (size > 0) {
         qemu_chr_read(chr, buf, size);
     }
@@ -1546,11 +1548,13 @@ static void stdio_read(void *opaque)
     uint8_t buf[1];
 
     size = read(0, buf, 1);
+#if 0
     if (size == 0) {
         /* stdin has been closed. Remove it from the active list.  */
         qemu_set_fd_handler2(0, NULL, NULL, NULL, NULL);
         return;
     }
+#endif
     if (size > 0)
         stdio_received_byte(buf[0]);
 }


[Qemu-devel] Re: [PATCH] LinuxBIOS support: map BIOS into ISA address space as RAM instead of ROM

2007-02-14 Thread Ed Swierk

On 2/13/07, Ed Swierk [EMAIL PROTECTED] wrote:

This patch changes qemu to map the BIOS into ISA address space as RAM
instead of ROM, allowing LinuxBIOS to run on qemu with no further
modifications (although the DRAM size is still not detected properly).


Unfortunately this isn't the right answer, as modifying the BIOS area
messes things up on soft reboot. What we really want is shadow RAM
support, which Bochs seems to support but qemu doesn't.

--Ed




[Qemu-devel] [PATCH] LinuxBIOS support: map BIOS into ISA address space as RAM instead of ROM

2007-02-13 Thread Ed Swierk

LinuxBIOS writes the IRQ routing table (PIRQ) to 0xf0000 and then reads
it back to verify the write. Currently qemu maps the top 128 KB of the
BIOS into ISA address space (0xe0000 - 0xfffff) as ROM, which causes the
write to fail, preventing Linux from finding interrupt routing info.

This patch changes qemu to map the BIOS into ISA address space as RAM
instead of ROM, allowing LinuxBIOS to run on qemu with no further
modifications (although the DRAM size is still not detected properly).
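The verify step LinuxBIOS performs amounts to a write followed by a
compare. A tiny sketch of that pattern against an ordinary RAM buffer
(names are illustrative, not LinuxBIOS code): against RAM the compare
succeeds; against a region emulated as ROM the write is silently dropped
and the compare fails.

```c
#include <string.h>

/* Copy the table into its slot, then read it back to verify. */
static int install_table(unsigned char *dest,
                         const unsigned char *table, size_t len)
{
    memcpy(dest, table, len);             /* the write LinuxBIOS performs */
    return memcmp(dest, table, len) == 0; /* the read-back verification */
}
```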

I've verified that Windows 2000 and Linux still boot with the Bochs
BIOS with this patch applied.

--Ed
Index: qemu-snapshot-2007-02-09_05/hw/pc.c
===
--- qemu-snapshot-2007-02-09_05.orig/hw/pc.c
+++ qemu-snapshot-2007-02-09_05/hw/pc.c
@@ -530,7 +530,7 @@ static void pc_init1(int ram_size, int v
                                  IO_MEM_UNASSIGNED);
     cpu_register_physical_memory(0x100000 - isa_bios_size,
                                  isa_bios_size,
-                                 (bios_offset + bios_size - isa_bios_size) | IO_MEM_ROM);
+                                 bios_offset + bios_size - isa_bios_size);
 
 {
 ram_addr_t option_rom_offset;


[Qemu-devel] [PATCH] Updated bochs bios.bin fixes SYSLINUX hanging

2007-02-06 Thread Ed Swierk

The latest bochs bios (bochs-20070105 snapshot) seems to fix the
SYSLINUX hanging issue. The bios.bin is attached.

--Ed


bios.bin.bz2
Description: BZip2 compressed data


[Qemu-devel] Re: Recent BIOS breaks SYSLINUX with -nographic

2007-02-05 Thread Ed Swierk

On 2/5/07, Ed Swierk [EMAIL PROTECTED] wrote:

I'm attempting to boot a Linux disk image with the SYSLINUX boot
loader using the -nographic option. On the latest qemu CVS, the boot
loader hangs before printing "Ready" and booting the kernel. But if I
use the same version of qemu with the BIOS from 0.8.2, it boots just
fine.


More specifically, bios.bin rev 1.14 works, rev 1.15 doesn't.

--Ed




[Qemu-devel] Re: Recent BIOS breaks SYSLINUX

2007-02-05 Thread Ed Swierk

On 2/5/07, Ed Swierk [EMAIL PROTECTED] wrote:

I'm attempting to boot a Linux disk image with the SYSLINUX boot
loader using the -nographic option. On the latest qemu CVS, the boot
loader hangs before printing "Ready" and booting the kernel. But if I
use the same version of qemu with the BIOS from 0.8.2, it boots just
fine.


More testing reveals that -nographic has nothing to do with the
problem. The FC6 diskboot.img (which uses SYSLINUX) doesn't boot, even
without -nographic:

 http://mirrors.kernel.org/fedora/core/6/i386/os/images/diskboot.img

The FC6 boot.iso (which uses ISOLINUX) boots fine:

 http://mirrors.kernel.org/fedora/core/6/i386/os/images/boot.iso

as does diskboot.img with bios.bin 1.14 or earlier.

--Ed




Re: [Qemu-devel] [PATCH] PIIX4 SMBus host, EEPROM device emulation

2007-01-23 Thread Ed Swierk

Here's a revised patch with the mmap stuff removed. I'll refine the
persistence support, but in the meantime the EEPROM device is usable
even if it forgets its contents when qemu exits.

--Ed
Index: qemu-0.8.2/hw/acpi.c
===
--- qemu-0.8.2.orig/hw/acpi.c
+++ qemu-0.8.2/hw/acpi.c
@@ -27,6 +27,7 @@
 #define PM_IO_BASE0xb000
 #define SMI_CMD_IO_ADDR   0xb040
 #define ACPI_DBG_IO_ADDR  0xb044
+#define SMB_IO_BASE 0xb100
 
 typedef struct PIIX4PMState {
 PCIDevice dev;
@@ -35,6 +36,15 @@ typedef struct PIIX4PMState {
 uint16_t pmcntrl;
 QEMUTimer *tmr_timer;
 int64_t tmr_overflow_time;
+SMBusDevice *smb_dev[128];
+uint8_t smb_stat;
+uint8_t smb_ctl;
+uint8_t smb_cmd;
+uint8_t smb_addr;
+uint8_t smb_data0;
+uint8_t smb_data1;
+uint8_t smb_data[32];
+uint8_t smb_index;
 } PIIX4PMState;
 
 #define RTC_EN (1 << 10)
@@ -46,8 +56,6 @@ typedef struct PIIX4PMState {
 
 #define SUS_EN (1 << 13)
 
-/* Note: only used for ACPI bios init. Could be deleted when ACPI init
-   is integrated in Bochs BIOS */
 static PIIX4PMState *piix4_pm_state;
 
 static uint32_t get_pmtmr(PIIX4PMState *s)
@@ -218,13 +226,163 @@ static void acpi_dbg_writel(void *opaque
 #endif
 }
 
+static void smb_transaction(PIIX4PMState *s)
+{
+    uint8_t prot = (s->smb_ctl >> 2) & 0x07;
+    uint8_t read = s->smb_addr & 0x01;
+    uint8_t cmd = s->smb_cmd;
+    uint8_t addr = s->smb_addr >> 1;
+    SMBusDevice *dev = s->smb_dev[addr];
+
+#ifdef DEBUG
+    printf("SMBus trans addr=0x%02x prot=0x%02x\n", addr, prot);
+#endif
+    if (!dev) goto error;
+
+    switch(prot) {
+    case 0x0:
+        if (!dev->quick_cmd) goto error;
+        (*dev->quick_cmd)(dev, read);
+        break;
+    case 0x1:
+        if (read) {
+            if (!dev->receive_byte) goto error;
+            s->smb_data0 = (*dev->receive_byte)(dev);
+        }
+        else {
+            if (!dev->send_byte) goto error;
+            (*dev->send_byte)(dev, cmd);
+        }
+        break;
+    case 0x2:
+        if (read) {
+            if (!dev->read_byte) goto error;
+            s->smb_data0 = (*dev->read_byte)(dev, cmd);
+        }
+        else {
+            if (!dev->write_byte) goto error;
+            (*dev->write_byte)(dev, cmd, s->smb_data0);
+        }
+        break;
+    case 0x3:
+        if (read) {
+            uint16_t val;
+            if (!dev->read_word) goto error;
+            val = (*dev->read_word)(dev, cmd);
+            s->smb_data0 = val;
+            s->smb_data1 = val >> 8;
+        }
+        else {
+            if (!dev->write_word) goto error;
+            (*dev->write_word)(dev, cmd, (s->smb_data1 << 8) | s->smb_data0);
+        }
+        break;
+    case 0x5:
+        if (read) {
+            if (!dev->read_block) goto error;
+            s->smb_data0 = (*dev->read_block)(dev, cmd, s->smb_data);
+        }
+        else {
+            if (!dev->write_block) goto error;
+            (*dev->write_block)(dev, cmd, s->smb_data0, s->smb_data);
+        }
+        break;
+    default:
+        goto error;
+    }
+    return;
+
+  error:
+    s->smb_stat |= 0x04;
+}
+
+static void smb_ioport_writeb(void *opaque, uint32_t addr, uint32_t val)
+{
+    PIIX4PMState *s = opaque;
+    addr &= 0x3f;
+#ifdef DEBUG
+    printf("SMB writeb port=0x%04x val=0x%02x\n", addr, val);
+#endif
+    switch(addr) {
+    case 0x00: /* SMBHSTSTS */
+        s->smb_stat = 0;
+        s->smb_index = 0;
+        break;
+    case 0x02: /* SMBHSTCNT */
+        s->smb_ctl = val;
+        if (val & 0x40)
+            smb_transaction(s);
+        break;
+    case 0x03: /* SMBHSTCMD */
+        s->smb_cmd = val;
+        break;
+    case 0x04: /* SMBHSTADD */
+        s->smb_addr = val;
+        break;
+    case 0x05: /* SMBHSTDAT0 */
+        s->smb_data0 = val;
+        break;
+    case 0x06: /* SMBHSTDAT1 */
+        s->smb_data1 = val;
+        break;
+    case 0x07: /* SMBBLKDAT */
+        s->smb_data[s->smb_index++] = val;
+        if (s->smb_index > 31)
+            s->smb_index = 0;
+        break;
+    default:
+        break;
+    }
+}
+
+static uint32_t smb_ioport_readb(void *opaque, uint32_t addr)
+{
+    PIIX4PMState *s = opaque;
+    uint32_t val;
+
+    addr &= 0x3f;
+    switch(addr) {
+    case 0x00: /* SMBHSTSTS */
+        val = s->smb_stat;
+        break;
+    case 0x02: /* SMBHSTCNT */
+        s->smb_index = 0;
+        val = s->smb_ctl & 0x1f;
+        break;
+    case 0x03: /* SMBHSTCMD */
+        val = s->smb_cmd;
+        break;
+    case 0x04: /* SMBHSTADD */
+        val = s->smb_addr;
+        break;
+    case 0x05: /* SMBHSTDAT0 */
+        val = s->smb_data0;
+        break;
+    case 0x06: /* SMBHSTDAT1 */
+        val = s->smb_data1;
+        break;
+    case 0x07: /* SMBBLKDAT */
+        val = s->smb_data[s->smb_index++];
+        if (s->smb_index > 31)
+            s->smb_index = 0;
+        break;
+    default:
+        val = 0;
+        break;
+    }
+#ifdef DEBUG
+    printf("SMB readb port=0x%04x 

[Qemu-devel] Unidirectional pipe support

2006-12-13 Thread Ed Swierk

qemu allows redirecting the monitor to a named pipe (fifo): if you
specify -monitor pipe:/my/fifo, it opens /my/fifo and uses it for
communication in both directions.

Unfortunately pipes are unidirectional on Linux. The pipe(7) man page
says: "Portability notes: On some systems (but not Linux), pipes are
bidirectional: data can be transmitted in both directions between the
pipe ends. According to POSIX.1-2001, pipes only need to be
unidirectional. Portable applications should avoid reliance on
bidirectional pipe semantics."

When qemu writes into the pipe, it immediately reads back what it just
wrote and treats it as a monitor command, endlessly breathing its own
exhaust.

The attached patch changes qemu to first try opening a pair of pipes,
/my/fifo.in and /my/fifo.out,  and uses each to communicate in a
single direction. If either file cannot be opened, it reverts to the
current behavior, using /my/fifo bidirectionally.
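The self-readback can be demonstrated without qemu at all. A minimal
Linux-only sketch (path and function name arbitrary) of a single O_RDWR
descriptor breathing its own exhaust:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if the process reads back its own write, 0 otherwise. */
static int fifo_selfread_demo(const char *path)
{
    char buf[8] = {0};

    unlink(path);
    if (mkfifo(path, 0600) != 0)
        return 0;
    /* One O_RDWR descriptor is both ends of the same pipe buffer. */
    int fd = open(path, O_RDWR | O_NONBLOCK);
    if (fd < 0) {
        unlink(path);
        return 0;
    }
    if (write(fd, "cmd", 3) != 3) {
        close(fd);
        unlink(path);
        return 0;
    }
    ssize_t n = read(fd, buf, sizeof buf);  /* our own bytes come back */
    close(fd);
    unlink(path);
    return n == 3 && memcmp(buf, "cmd", 3) == 0;
}
```

With the .in/.out pair each descriptor carries data in only one
direction, so the monitor never sees its own output.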

--Ed
Index: qemu-0.8.2/vl.c
===
--- qemu-0.8.2.orig/vl.c
+++ qemu-0.8.2/vl.c
@@ -1285,12 +1285,19 @@ CharDriverState *qemu_chr_open_file_out(
 
 CharDriverState *qemu_chr_open_pipe(const char *filename)
 {
-    int fd;
+    int fd_in, fd_out;
+    char filename_in[256], filename_out[256];
 
-    fd = open(filename, O_RDWR | O_BINARY);
-    if (fd < 0)
-        return NULL;
-    return qemu_chr_open_fd(fd, fd);
+    snprintf(filename_in, 256, "%s.in", filename);
+    snprintf(filename_out, 256, "%s.out", filename);
+    fd_in = open(filename_in, O_RDWR | O_BINARY);
+    fd_out = open(filename_out, O_RDWR | O_BINARY);
+    if (fd_in < 0 || fd_out < 0) {
+        fd_in = fd_out = open(filename, O_RDWR | O_BINARY);
+        if (fd_in < 0)
+            return NULL;
+    }
+    return qemu_chr_open_fd(fd_in, fd_out);
 }
 
 


[Qemu-devel] OHCI for i386

2006-11-30 Thread Ed Swierk

After spending some time trying to figure out why the emulated UHCI
USB controller is so slow, I switched uhci_usb_init() in hw/pc.c to
ohci_usb_init(). To my delight, Linux booted up and detected the
controller on the first try, and accessing an emulated block device is
2 to 3 times faster.

It seems that UHCI is used only on i386, while the other architectures
use OHCI. Would switching i386 to OHCI cause problems for other guest
OSes?

--Ed




Re: [Qemu-devel] Fastest TCP/IP Stack

2006-09-14 Thread Ed Swierk

On 9/14/06, Joseph Miller [EMAIL PROTECTED] wrote:

I'm running a terminal server under qemu with kqemu compiled into my kernel
under the -kernel-kqemu for fastest performance.  What is the most efficient
method of -net ?  I was using -net user with OpenVPN to connect to my
internal LAN, but I have switched to -net tap to see if that is faster.  Does
anyone have any knowledge of which would take the least overhead?  I noticed
on my top stats that my % sys was particularly high when using networking
under -net user.  Thanks


-net tap is considerably more efficient than -net user in most cases.
-net user is a TCP/UDP proxy, so (a) your TCP connections are no
longer end-to-end, and (b) performance is at the mercy of qemu's
internal TCP/IP stack. With -net tap and a bridge, IP packets are
passed along unmolested, and all the magic occurs within the host
kernel.

(Of course where performance is not a pressing issue, -net user is
still awfully convenient.)

--Ed




[Qemu-devel] PCI IRQ routing problems

2006-09-13 Thread Ed Swierk

Linux 2.6.17 running on the latest qemu snapshot is unable to route
IRQs to more than 4 network interfaces when running without ACPI, and
is limited to 2 network interfaces with ACPI enabled.

With 8 network interfaces and ACPI disabled, I get the following
kernel output during boot:

ne2k-pci.c:v1.03 9/22/2003 D. Becker/P. Gortmaker
 http://www.scyld.com/network/ne2k-pci.html
PCI: Found IRQ 10 for device :00:03.0
eth0: RealTek RTL-8029 found at 0xc100, IRQ 10, 52:54:00:12:34:56.
PCI: Found IRQ 11 for device :00:04.0
PCI: Sharing IRQ 11 with :00:01.2
eth1: RealTek RTL-8029 found at 0xc200, IRQ 11, 52:54:00:12:34:57.
PCI: Found IRQ 9 for device :00:05.0
eth2: RealTek RTL-8029 found at 0xc300, IRQ 9, 52:54:00:12:34:58.
PCI: Found IRQ 5 for device :00:06.0
eth3: RealTek RTL-8029 found at 0xc400, IRQ 5, 52:54:00:12:34:59.
eth4: RealTek RTL-8029 found at 0xc500, IRQ 11, 52:54:00:12:34:5A.
eth5: RealTek RTL-8029 found at 0xc600, IRQ 9, 52:54:00:12:34:5B.
eth6: RealTek RTL-8029 found at 0xc700, IRQ 11, 52:54:00:12:34:5C.
eth7: RealTek RTL-8029 found at 0xc800, IRQ 9, 52:54:00:12:34:5D.

The missing "Found IRQ" messages for eth4 through eth7 indicate that
IRQs won't get routed properly. Indeed, later on the kernel emits
complaints like: irq 11: nobody cared ... Disabling IRQ #11.

With 3 network interfaces and ACPI enabled, I get slightly different output:

ne2k-pci.c:v1.03 9/22/2003 D. Becker/P. Gortmaker
 http://www.scyld.com/network/ne2k-pci.html
PCI: Found IRQ 9 for device :00:03.0
eth0: RealTek RTL-8029 found at 0xc100, IRQ 9, 52:54:00:12:34:56.
PCI: Found IRQ 11 for device :00:04.0
PCI: Sharing IRQ 11 with :00:01.2
eth1: RealTek RTL-8029 found at 0xc200, IRQ 11, 52:54:00:12:34:57.
PCI: Found IRQ 10 for device :00:05.0
IRQ routing conflict for :00:01.3, have irq 9, want irq 10
eth2: RealTek RTL-8029 found at 0xc300, IRQ 10, 52:54:00:12:34:58.
piix4_smbus :00:01.3: Found :00:01.3 device
piix4_smbus :00:01.3: SMB base address uninitialized - upgrade
BIOS or use force_addr=0xaddr

Only eth0 and eth1 end up working.

I suspect the problem in the non-ACPI case is caused by a limitation
in the PCI IRQ routing table in the Bochs BIOS, but I haven't a clue
how to fix it. Any ideas would be appreciated.

--Ed




[Qemu-devel] [PATCH] Support for large LinuxBIOS images

2006-09-12 Thread Ed Swierk

The attached patch makes two changes needed to boot Linux on qemu with
a large (256KB) LinuxBIOS image instead of the built-in BIOS:

- Increases the space set aside for the BIOS from 256KB to 2MB; this
could of course be increased further, but 2MB seems to be the largest
EEPROM hardware currently available.

- Refrains from mapping the upper 128KB of the BIOS into ISA space
when using a non-default BIOS (size != 64KB). LinuxBIOS writes the PCI
IRQ routing table to 0xf0000, so it has to be left as normal RAM.

With these changes to qemu, plus the following patches to LinuxBIOS,
Linux 2.6.17.11 successfully communicates with qemu's Ethernet and USB
devices:

http://www.openbios.org/pipermail/linuxbios/2006-September/015687.html
http://www.openbios.org/pipermail/linuxbios/2006-September/015688.html

--Ed
diff -BurN qemu-0.8.2.orig/hw/pc.c qemu-0.8.2/hw/pc.c
--- qemu-0.8.2.orig/hw/pc.c	2006-09-12 23:33:03.0 +
+++ qemu-0.8.2/hw/pc.c	2006-09-12 23:35:47.0 +
@@ -646,13 +646,13 @@
 
 /* BIOS load */
 bios_offset = ram_size + vga_ram_size;
-vga_bios_offset = bios_offset + 256 * 1024;
+vga_bios_offset = bios_offset + 2048 * 1024;
 
     snprintf(buf, sizeof(buf), "%s/%s", bios_dir, BIOS_FILENAME);
     bios_size = get_image_size(buf);
     if (bios_size <= 0 ||
         (bios_size % 65536) != 0 ||
-        bios_size > (256 * 1024)) {
+        bios_size > (2048 * 1024)) {
         goto bios_error;
     }
     ret = load_image(buf, phys_ram_base + bios_offset);
@@ -677,15 +677,17 @@
     cpu_register_physical_memory(0xc0000, 0x10000,
                                  vga_bios_offset | IO_MEM_ROM);
 
-    /* map the last 128KB of the BIOS in ISA space */
-    isa_bios_size = bios_size;
-    if (isa_bios_size > (128 * 1024))
-        isa_bios_size = 128 * 1024;
-    cpu_register_physical_memory(0xd0000, (192 * 1024) - isa_bios_size,
-                                 IO_MEM_UNASSIGNED);
-    cpu_register_physical_memory(0x100000 - isa_bios_size,
-                                 isa_bios_size,
-                                 (bios_offset + bios_size - isa_bios_size) | IO_MEM_ROM);
+    if (bios_size == 65536) {
+        /* map the last 128KB of the BIOS in ISA space */
+        isa_bios_size = bios_size;
+        if (isa_bios_size > (128 * 1024))
+            isa_bios_size = 128 * 1024;
+        cpu_register_physical_memory(0xd0000, (192 * 1024) - isa_bios_size,
+                                     IO_MEM_UNASSIGNED);
+        cpu_register_physical_memory(0x100000 - isa_bios_size,
+                                     isa_bios_size,
+                                     (bios_offset + bios_size - isa_bios_size) | IO_MEM_ROM);
+    }
     /* map all the bios at the top of memory */
     cpu_register_physical_memory((uint32_t)(-bios_size),
                                  bios_size, bios_offset | IO_MEM_ROM);
diff -BurN qemu-0.8.2.orig/vl.h qemu-0.8.2/vl.h
--- qemu-0.8.2.orig/vl.h	2006-09-12 23:33:03.0 +
+++ qemu-0.8.2/vl.h	2006-09-12 23:34:32.0 +
@@ -160,7 +160,7 @@
 #elif defined(TARGET_MIPS)
 #define BIOS_SIZE (128 * 1024)
 #else
-#define BIOS_SIZE ((256 + 64) * 1024)
+#define BIOS_SIZE ((2048 + 64) * 1024)
 #endif
 
 /* keyboard/mouse support */


[Qemu-devel] [PATCH] Set RAM size in i440FX PCI configuration

2006-09-12 Thread Ed Swierk

Currently qemu's i440FX PCI bridge emulation code does not set the
registers indicating the amount of RAM installed in each DIMM slot.
LinuxBIOS relies on this information to properly set up RAM before
booting Linux.

The attached patch sets a i440FX configuration register indicating the
highest RAM address in all slots, in units of 8MB. Since this is an
8-bit value, the maximum is capped at 2040MB.
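The encoding is simple enough to sketch; this helper (the name is mine,
not qemu's) mirrors the arithmetic described above:

```c
#include <stdint.h>

/* Highest RAM address in units of 8MB, clamped to the 8-bit register,
 * so the ceiling is 255 * 8MB = 2040MB. */
static uint8_t drb_top_of_ram(uint64_t ram_bytes)
{
    uint64_t units = ram_bytes / (8 * 1024 * 1024);
    return units > 255 ? 255 : (uint8_t)units;
}
```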

--Ed
--- qemu-0.8.2.orig/hw/piix_pci.c	2006-07-22 17:23:34.0 +
+++ qemu-0.8.2/hw/piix_pci.c	2006-09-13 03:05:53.0 +
@@ -42,7 +42,7 @@
 
 static void piix3_set_irq(PCIDevice *pci_dev, void *pic, int irq_num, int level);
 
-PCIBus *i440fx_init(void)
+PCIBus *i440fx_init(int ram_size)
 {
 PCIBus *b;
 PCIDevice *d;
@@ -73,6 +73,10 @@
     d->config[0x0a] = 0x00; // class_sub = host2pci
     d->config[0x0b] = 0x06; // class_base = PCI_bridge
     d->config[0x0e] = 0x00; // header_type
+    ram_size = ram_size / 8 / 1024 / 1024;
+    if (ram_size > 255)
+        ram_size = 255;
+    d->config[0x57] = ram_size;
 return b;
 }
 
--- qemu-0.8.2.orig/hw/pc.c	2006-09-12 23:33:03.0 +
+++ qemu-0.8.2/hw/pc.c	2006-09-13 02:38:33.0 +
@@ -747,7 +749,7 @@
 }
 
 if (pci_enabled) {
-pci_bus = i440fx_init();
+pci_bus = i440fx_init(ram_size);
 piix3_devfn = piix3_init(pci_bus);
 } else {
 pci_bus = NULL;
--- qemu-0.8.2.orig/vl.h	2006-09-12 23:33:03.0 +
+++ qemu-0.8.2/vl.h	2006-09-13 02:39:48.0 +
@@ -695,7 +695,7 @@
 PCIBus *pci_vpb_init(void *pic);
 
 /* piix_pci.c */
-PCIBus *i440fx_init(void);
+PCIBus *i440fx_init(int ram_size);
 int piix3_init(PCIBus *bus);
 void pci_bios_init(void);
 


Re: [Qemu-devel] pronunciation of Qemu

2006-06-28 Thread Ed Swierk

On 6/28/06, Paul Robinson [EMAIL PROTECTED] wrote:

How should you pronounce Qemu?

FYI, my best guess is Q (as in the letter Q) followed by the first 2
syllables of emulator.


That's how I've always pronounced it, but I've also heard people say
kee-moo, which I have to admit is kind of cute.

--Ed




Re: [Qemu-devel] [PATCH] -redir revisited

2006-06-14 Thread Ed Swierk

On 6/5/06, René Korthaus [EMAIL PROTECTED] wrote:

I researched current qemu cvs and found out that the patch was
integrated not complete but to 95%. There are two lines in slirp/
misc.c that need a change (patch attached). Tested to work with Mac
OS X host and debian guest i could reach the Apache running inside
the guest via localhost:8080 with the appropriate -redir command.


It looks like kju: is based on qemu 0.8.0, which definitely had the
bug you found. However, this was fixed in 0.8.1. Did you try a current
version of qemu (0.8.1 or cvs) without your patch?

I'm having trouble seeing how your patch could help solve the redir
problem, since as far as I can tell, our_addr is no longer used for
anything at all. Grepping for our_addr comes up with a bunch of hits
that are either #if 0'ed or irrelevant (e.g. get_dns_addr, which
references our_addr before our_addr has been set).

--Ed


[Qemu-devel] [PATCH] Allow user to configure -net user subnet

2006-05-03 Thread Ed Swierk

In environments where the 10.0.2.0/24 subnet is already used for
another purpose, it's useful to be able to configure a different -net
user (slirp) subnet, such as 192.168.100.0/24.

The attached patch adds a subnet option to -net user. Currently only
/24 subnets (mask 255.255.255.0) are accepted. An error is generated
if the specified subnet is not of the form XX.YY.ZZ.0/24. This
restriction could be relaxed in the future with more extensive changes
to slirp.
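The accepted format can be sketched as a small standalone validator
(an illustration with a name of my choosing, not the code from the
patch below, which does the equivalent check in place):

```c
#include <stdio.h>
#include <string.h>

/* Accept only subnets of the form XX.YY.ZZ.0/24. */
static int subnet24_ok(const char *spec)
{
    unsigned a, b, c, d;
    char mask[8];

    if (sscanf(spec, "%u.%u.%u.%u/%7s", &a, &b, &c, &d, mask) != 5)
        return 0;
    if (strcmp(mask, "24") != 0)        /* only /24 is supported */
        return 0;
    return a <= 255 && b <= 255 && c <= 255 && d == 0;
}
```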

One quirk is that the default guest-host for the -redir option remains
10.0.2.15 even if the user passes a different subnet to -net user.
Obviously this can be worked around by explicitly specifying the
correct guest-host. The next patch (qemu-slirp-options.patch) fixes
this issue, among others.

--Ed
diff -BurN qemu.orig/slirp/libslirp.h qemu/slirp/libslirp.h
--- qemu.orig/slirp/libslirp.h	2006-05-01 16:05:27.0 +
+++ qemu/slirp/libslirp.h	2006-05-02 17:27:59.0 +
@@ -13,6 +13,8 @@
 extern "C" {
 #endif
 
+#include "ctl.h"
+
 void slirp_init(void);
 
 void slirp_select_fill(int *pnfds, 
@@ -33,6 +35,8 @@
 
 extern const char *tftp_prefix;
 extern char slirp_hostname[33];
+extern struct in_addr special_addr;
+extern struct in_addr alias_addr;
 
 #ifdef __cplusplus
 }
diff -BurN qemu.orig/slirp/main.h qemu/slirp/main.h
--- qemu.orig/slirp/main.h	2006-05-02 17:27:34.0 +
+++ qemu/slirp/main.h	2006-05-02 17:27:59.0 +
@@ -33,8 +33,6 @@
 extern u_int curtime;
 extern fd_set *global_readfds, *global_writefds, *global_xfds;
 extern struct in_addr ctl_addr;
-extern struct in_addr special_addr;
-extern struct in_addr alias_addr;
 extern struct in_addr our_addr;
 extern struct in_addr loopback_addr;
 extern struct in_addr dns_addr;
diff -BurN qemu.orig/slirp/slirp.c qemu/slirp/slirp.c
--- qemu.orig/slirp/slirp.c	2006-05-02 17:27:34.0 +
+++ qemu/slirp/slirp.c	2006-05-02 17:27:59.0 +
@@ -155,8 +155,6 @@
 fprintf (stderr, "Warning: No DNS servers found\n");
 }
 
-    inet_aton(CTL_SPECIAL, &special_addr);
-alias_addr.s_addr = special_addr.s_addr | htonl(CTL_ALIAS);
 getouraddr();
 }
 
diff -BurN qemu.orig/vl.c qemu/vl.c
--- qemu.orig/vl.c	2006-05-01 16:05:27.0 +
+++ qemu/vl.c	2006-05-02 17:28:44.0 +
@@ -2400,7 +2400,7 @@
         if (get_str_sep(buf, sizeof(buf), p, ':') < 0)
             goto fail;
         if (buf[0] == '\0') {
-            pstrcpy(buf, sizeof(buf), "10.0.2.15");
+            pstrcpy(buf, sizeof(buf), CTL_LOCAL);
         }
         if (!inet_aton(buf, &guest_addr))
 goto fail;
@@ -3170,6 +3170,22 @@
 } else
 #ifdef CONFIG_SLIRP
 if (!strcmp(device, user)) {
+        if (get_param_value(buf, sizeof(buf), "subnet", p)) {
+            char *len;
+            len = strchr(buf, 0) - 3;
+            if (len < buf || strcmp(len, "/24")) {
+                fprintf(stderr, "qemu: invalid subnet=%s (mask must be /24)\n", buf);
+                return -1;
+            }
+            *len = 0;
+        } else {
+            pstrcpy(buf, sizeof(buf), CTL_SPECIAL);
+        }
+        if (!inet_aton(buf, &special_addr) || (ntohl(special_addr.s_addr) & 0xff)) {
+            fprintf(stderr, "qemu: invalid subnet=%s\n", buf);
+            return -1;
+        }
+        alias_addr.s_addr = special_addr.s_addr | htonl(CTL_ALIAS);
         if (get_param_value(buf, sizeof(buf), "hostname", p)) {
 pstrcpy(slirp_hostname, sizeof(slirp_hostname), buf);
 }
@@ -4628,9 +4644,9 @@
    "-net nic[,vlan=n][,macaddr=addr][,model=type]\n"
    "                create a new Network Interface Card and connect it to VLAN 'n'\n"
 #ifdef CONFIG_SLIRP
-   "-net user[,vlan=n][,hostname=host]\n"
-   "                connect the user mode network stack to VLAN 'n' and send\n"
-   "                hostname 'host' to DHCP clients\n"
+   "-net user[,vlan=n][,hostname=host][,subnet=addr/24]\n"
+   "                connect the user mode network stack to VLAN 'n'; send\n"
+   "                hostname 'host' to DHCP clients; use subnet addr\n"
 #endif
 #ifdef _WIN32
    "-net tap[,vlan=n],ifname=name\n"





