On Thu, 28 Sep 2023, Andrew Cooper wrote:
> On 28/09/2023 12:04 am, Stefano Stabellini wrote:
> > On Mon, 25 Sep 2023, Henry Wang wrote:
> >> 3. dom0less vs xenstored setup race Was: xen | Failed pipeline for staging 
> >> | 6a47ba2f
> >>
> >> https://marc.info/?l=xen-devel&m=168312468808977
> >>
> >> https://marc.info/?l=xen-devel&m=168312687610283
> > For this issue I suggest to go with this fix unless someone can produce
> > a better fix before RC2. I don't think we should cut RC2 with this issue
> > unsolved.
> >
> > ---
> > [PATCH] xenstored: reset domain connection before XENSTORE_CONNECTED
> >
> > From: Julien Grall <jul...@xen.org>
> >
> > xenstored will set interface->connection to XENSTORE_CONNECTED before
> > finalizing the connection which can cause initialization errors on the
> > guest side. Make sure to call domain_conn_reset(domain) before setting
> > XENSTORE_CONNECTED.
> >
> > Signed-off-by: Julien Grall <jul...@xen.org>
> > [stefano: rebase, commit message]
> > Signed-off-by: Stefano Stabellini <stefano.stabell...@amd.com>
> > Acked-by: Stefano Stabellini <sstabell...@kernel.org>
> 
> No - this hasn't got any better at fixing the problem than the last time
> it failed to fix the problem.

It does solve one of the issues: the sporadic failure of the gitlab-ci
job. Even if the fix is not complete, if nothing else, it does that.

Of course, if someone comes up with a better fix all the better!

There hasn't been a lot of movement on this issue so I am being
pessimistic and I am offering the (maybe partial) solution we have
today. I don't mean to take anything away from the value of doing a
better fix.


> You cannot have 3 entities in parallel fight for control in a 2-way
> communication channel.
> 
> Failure to understand this is what created the problem to begin with.
> 
> You took an existing ABI from oxenstored, and implemented it
> incompatibly in other entities, had init-dom0less corrupt a shared comms
> buffer that it isn't the producer or consumer of, and added bug in Linux
> because you didn't write down the behaviour you wanted, let alone the
> behaviour you actually provided.

I think I could write a document covering the intended behavior at the
time of contributing the original feature. That might be good to have
regardless of the bug.


> Stop tinkering in the hope that it hides the problem.  You're only
> making it harder to fix properly.
 
Making it harder to fix properly would be a valid reason not to commit
the (maybe partial) fix. But looking at the fix again:

diff --git a/tools/xenstored/domain.c b/tools/xenstored/domain.c
index a6cd199fdc..9cd6678015 100644
--- a/tools/xenstored/domain.c
+++ b/tools/xenstored/domain.c
@@ -989,6 +989,7 @@ static struct domain *introduce_domain(const void *ctx,
                talloc_steal(domain->conn, domain);
 
                if (!restore) {
+                       domain_conn_reset(domain);
                        /* Notify the domain that xenstore is available */
                        interface->connection = XENSTORE_CONNECTED;
                        xenevtchn_notify(xce_handle, domain->port);
@@ -1031,8 +1032,6 @@ int do_introduce(const void *ctx, struct connection *conn,
        if (!domain)
                return errno;
 
-       domain_conn_reset(domain);
-
        send_ack(conn, XS_INTRODUCE);

It is a 1-line movement. Textually small. Easy to understand and to
revert. It doesn't seem to be making things harder to fix? We could
revert it any time if a better fix is offered.

Maybe we could have a XXX note in the commit message or in-code
comment?


> Tell me, when was the last time this failed...

I went through all the gitlab reports and this is the last one I found
which is from 1 month ago:
https://gitlab.com/xen-project/xen/-/pipelines/953569140

Even if the heinsenbug manifests only once a month I think it is bad.
Of course it is up to Henry but I don't think we should go far into the
release process without this problem being (at least partially) fixed.

Reply via email to