On 2022-01-13 11:42, Pekka Paalanen wrote:
> On Thu, 13 Jan 2022 10:05:56 +0200
> Maksim Sisov <msi...@igalia.com> wrote:
>
>> + wayland-devel ML.
>>
>> Hi Pekka,
>>
>> Thanks for your answers! Please see my questions inlined.
>>
>> On 2022-01-12 16:16, Pekka Paalanen wrote:
>>
>> > On Mon, 10 Jan 2022 08:53:50 +0200
>> > Maksim Sisov <msi...@igalia.com> wrote:
>> >
>> >> Hi Pekka,
>> >>
>> >> Thanks for answering all my previous questions before.
>> >
>> > Hi Maksim,
>> >
>> > looks like that was a year ago. I had forgotten. :-)
>>
>> Time flies!
>>
>> >> I came up with a new question and wondered what your opinion was.
>> >>
>> >> In Chromium, we have two modes - one is the normal mode, where the
>> >> GPU service runs in a separate sandboxed process and uses the
>> >> surfaceless path with libgbm (we pass a dmabuf to the browser
>> >> process, where the Wayland connection is...), and the other is
>> >> --in-process-gpu, where the GPU service runs in the same browser
>> >> process, but on a different thread. That also uses the surfaceless
>> >> path by default.
>> >>
>> >> However, when either libgbm is not available or the system in
>> >> question doesn't support DRM render nodes, the browser may fall
>> >> back to Wayland EGL. It is only used if the browser runs with the
>> >> --in-process-gpu flag, as we need to access a wl_surface somehow.
>> >> That means the GPU service on a separate thread uses Wayland EGL,
>> >> while the main UI thread does all the UI work, manages the
>> >> connection, etc.
>> >>
>> >> A long time ago, we figured out there was a problem with
>> >> --in-process-gpu + Wayland EGL. Because both Chromium, as a Wayland
>> >> client, and Wayland EGL, which is another client, prepare/read/
>> >> dispatch events (each with its own event queue, of course), we
>> >> started to experience deadlocks.
>> >
>> > Surely they are the same client (connection)?
>>
>> Yes, they are.
>>
>> >> The deadlock happened when Chromium was closing a native window:
>> >> our event loop had already prepared to read from the display, but
>> >> it then didn't read, because the thread was tearing down the GPU
>> >> thread, which was calling eglSwapBuffers. That internally also
>> >> called prepare and then read, but the read didn't complete - it
>> >> blocked, because another thread had already prepared to read but
>> >> never read either. So the UI thread made a blocking wait for the
>> >> GPU thread to finish its work before tearing it down.
>> >>
>> >> As you can see from my description, the UI thread prepared to read
>> >> the Wayland display, then switched to tearing down the GPU thread,
>> >> which was busy processing our GL commands, and it had to wait until
>> >> that was done. The GPU thread used Wayland EGL, which also prepared
>> >> to read, and its read call blocked because other threads had
>> >> announced their intention to read.
>> >
>> > I see. Yes, there is the assumption that if a thread prepares to read,
>> > then it will either read or cancel the read in finite time (ASAP).
>> >
>> >> To overcome this problem, it was decided to use a separate polling
>> >> thread. What it does now is that the thread prepares/polls/reads
>> >> (we use libevent or libglib, btw; the choice depends on how
>> >> Chromium is configured), while the UI thread dispatches events
>> >> whenever our "polling" thread asks. This way, the polling thread
>> >> can always call wl_display_read_events and the deadlock never
>> >> happens.
>> >
>> > Right, I can see that working.
>> >
>> >> I wonder if this is the right way to do it. I also wonder whether
>> >> doing this on a separate thread gives any performance benefits?
>> >
>> > It works, but it's not a "proposed" design in my opinion. Doing
>> > all of prepare/read/dispatch in each thread is the "intended" usage.
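[Editor's note: the contract Pekka describes - a thread that calls prepare must read or cancel in finite time, or every other reader blocks - can be illustrated with a small self-contained mock. The `mock_*` and `run_demo` names are invented for this sketch; real code would call `wl_display_prepare_read()`, `wl_display_read_events()` and `wl_display_cancel_read()`, whose internals this only approximates.]

```c
#include <pthread.h>
#include <unistd.h>

/* Mock of libwayland's reader bookkeeping: a reader count plus a
 * condition variable, roughly like display->reader_count internally. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int prepared;    /* threads that announced intent to read */
static int generation;  /* bumped whenever a read (or last cancel) happens */

/* wl_display_prepare_read() stand-in: announce intent to read. */
static void mock_prepare_read(void)
{
    pthread_mutex_lock(&lock);
    prepared++;
    pthread_mutex_unlock(&lock);
}

/* wl_display_read_events() stand-in: the last announced reader performs
 * the read and wakes everyone; earlier callers block until then. */
static void mock_read_events(void)
{
    pthread_mutex_lock(&lock);
    if (--prepared == 0) {
        generation++;
        pthread_cond_broadcast(&cond);
    } else {
        int gen = generation;
        while (gen == generation)  /* blocked: some thread prepared but
                                      hasn't read or cancelled yet */
            pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}

/* wl_display_cancel_read() stand-in: back out of the announcement; if we
 * were the last announced reader, unblock the waiting readers. */
static void mock_cancel_read(void)
{
    pthread_mutex_lock(&lock);
    if (--prepared == 0) {
        generation++;
        pthread_cond_broadcast(&cond);
    }
    pthread_mutex_unlock(&lock);
}

/* A thread that prepares, goes off to do "something else", then cancels.
 * Remove the mock_cancel_read() call and run_demo() deadlocks exactly
 * like the Chromium teardown described above. */
static void *prepare_then_cancel(void *arg)
{
    (void)arg;
    mock_prepare_read();
    usleep(100 * 1000);
    mock_cancel_read();
    return NULL;
}

int run_demo(void)
{
    pthread_t t;
    mock_prepare_read();
    pthread_create(&t, NULL, prepare_then_cancel, NULL);
    mock_read_events();   /* blocks until the other thread cancels */
    pthread_join(t, NULL);
    return 1;
}
```

If `run_demo()` returns, the contract was honoured; the deadlock in the thread above is precisely the case where the second thread neither reads nor cancels.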
>> >
>> > I think the core of the problem is the thread that prepares to read and
>> > then doesn't read but goes to do something else. That is basically a
>> > violation of the API contract of libwayland-client. Unfortunately in
>> > this case it leads to a solid deadlock. The documentation is very clear
>> > about it:
>> >
>> >  * This function (or wl_display_prepare_read()) must be called before
>> >  * reading from the file descriptor using wl_display_read_events().
>> >  * Calling wl_display_prepare_read_queue() announces the calling
>> >  * thread's intention to read and ensures that until the thread is
>> >  * ready to read and calls wl_display_read_events(), no other thread
>> >  * will read from the file descriptor.
>> >
>> > You could also call wl_display_cancel_read() instead:
>> >
>> >  * After a thread successfully called wl_display_prepare_read() it must
>> >  * either call wl_display_read_events() or wl_display_cancel_read().
>> >  * If the threads do not follow this rule it will lead to deadlock.
>> >
>> > But if you cancel, then you must not read without preparing again first.
>> >
>> > The separate polling thread does not sound bad though. It is a correct
>> > design. Now that I think of it, wl_display_read_events() has a
>> > pthread_cond_wait() in it that blocks all other reading threads until
>> > the last reading thread actually calls read (or cancel). The separate
>> > polling thread, like you described it, might avoid some of that
>> > blocking.
>> >
>> > Perhaps the polling thread design is better when it is possible. You
>> > can use it in your own code, but it's not really possible for things
>> > like EGL implementations, since EGL does not define an API for
>> > cross-thread wake-ups.
>>
>> I see. The only problem here is that we don't use poll, but rather
>> libevent, which is wrapped in Chromium's MessagePumpLibevent. It
>> constantly notifies us that an fd is ready to be read.
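[Editor's note: the dedicated-polling-thread design discussed above - one thread owns the fd, the UI thread dispatches on request - can be sketched with a plain pipe standing in for the Wayland socket. The `poller`, `ui_wait_and_dispatch` and `run_polling_demo` names are invented for this sketch; in real code the read would be `wl_display_read_events()` and the dispatch `wl_display_dispatch_pending()`.]

```c
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

static int pipe_fds[2];  /* stand-in for the Wayland display fd */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int dispatch_pending;

/* Polling thread: the only thread that ever touches the fd, so its
 * prepare -> poll -> read sequence can never be stalled by another
 * reader. After reading it asks the UI thread to dispatch. */
static void *poller(void *arg)
{
    (void)arg;
    struct pollfd pfd = { .fd = pipe_fds[0], .events = POLLIN };
    poll(&pfd, 1, -1);
    char buf[16];
    read(pipe_fds[0], buf, sizeof buf);  /* "wl_display_read_events()" */
    pthread_mutex_lock(&mtx);
    dispatch_pending = 1;                /* wake the UI thread */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* UI thread: sleeps until the poller asks, then dispatches the queued
 * events ("wl_display_dispatch_pending()" in real code). */
static int ui_wait_and_dispatch(void)
{
    pthread_mutex_lock(&mtx);
    while (!dispatch_pending)
        pthread_cond_wait(&cv, &mtx);
    dispatch_pending = 0;
    pthread_mutex_unlock(&mtx);
    return 1;
}

int run_polling_demo(void)
{
    pthread_t t;
    if (pipe(pipe_fds) != 0)
        return 0;
    pthread_create(&t, NULL, poller, NULL);
    write(pipe_fds[1], "x", 1);  /* "the compositor sent an event" */
    int ok = ui_wait_and_dispatch();
    pthread_join(t, NULL);
    close(pipe_fds[0]);
    close(pipe_fds[1]);
    return ok;
}
```

Because the poller never blocks on another thread's prepare, the deadlock from the earlier teardown scenario cannot occur, which matches Pekka's observation that this design may even avoid some of the `pthread_cond_wait()` blocking inside `wl_display_read_events()`.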
>
> You say "constantly", do you mean it never waits?
>
>> Here is a diagram that shows the flow without a dedicated thread -
>> https://drive.google.com/file/d/1Jq8aX4cWszwL5fleE8ayGOeond70xChS/view?usp=sharing
>>
>> As you can see, it returns from OnCanRead and waits until libevent
>> notifies us the next time that we can actually read events (if we
>> prepared to read previously). And if the main thread blocks because of
>> the above-mentioned reason (libevent didn't have a chance to notify
>> us), the events will never be read.
>
> Are you describing the old approach or the new approach with the
> separate polling thread?
I'm describing the old one, as having a dedicated thread doesn't really
bring much benefit, as I said in my first message.

> Doing prepare_read from the OnCanRead handler seems a bit backwards.
> The prepare_read dance is there to make sure that when you are going
> to wait for new events (block), you have processed all incoming events
> so far and flushed out all requests that might have resulted from
> those, so that if you are waiting for replies to those requests, they
> will actually arrive. So the prepare_read/dispatch/flush dance should
> be done from a "the event loop is going to sleep now" hook, not from a
> "the fd has bytes to read" hook.

I checked libevent - they added prepare/check watchers in
https://github.com/libevent/libevent/pull/793 in spring 2019. Chromium
still uses libevent 1.4.15
(https://source.chromium.org/chromium/chromium/src/+/main:base/third_party/libevent/README.chromium),
which looks to be old. It seems we need an update that hasn't been done
for ages. Moreover, it seems to be patched :\ I'll need to check whether
we can upgrade.

> Hmm, yeah... I suppose when your event loop decides to do something
> else than actually read the Wayland socket after it was prepared for
> reading, and that something else is a blocking thing, you would have to
> cancel the read first just in case, and then prepare_read again to be
> able to read afterwards.
>
> Maybe others here have better ideas, or know libevent or glib better?
> I recall people having written integrations with those event loop
> libraries, but I can't remember who and where.
>
> Thanks,
> pq

-- 
Best Regards,
Maksim Sisov

* Usual work time - 08:00-16:00 (EEST).
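[Editor's note: Pekka's last suggestion - cancel the read before blocking on something else, then prepare again before reading - is a small state machine, sketched here with a toy single-threaded mock. The `toy_*` and `teardown_safely` names are invented for illustration; the real calls are `wl_display_prepare_read()`, `wl_display_cancel_read()` and `wl_display_read_events()`, and in real code the cancel is what prevents other readers from deadlocking while this thread blocks.]

```c
/* Toy model of the prepare/cancel/read contract. prepared_flag mirrors
 * "this thread has announced an intention to read". */
static int prepared_flag;

static int toy_prepare_read(void)
{
    if (prepared_flag)
        return -1;        /* already prepared: misuse */
    prepared_flag = 1;
    return 0;
}

static void toy_cancel_read(void)
{
    prepared_flag = 0;
}

static int toy_read_events(void)
{
    if (!prepared_flag)
        return -1;        /* reading without preparing: misuse */
    prepared_flag = 0;
    return 0;
}

/* The sequence Pekka describes: prepared to read, but about to block on
 * something else (e.g. joining the GPU thread), so cancel first, do the
 * blocking work, then prepare again before reading. */
int teardown_safely(void)
{
    if (toy_prepare_read() != 0)
        return 0;
    toy_cancel_read();    /* about to block: back out of the read first */
    /* ... blocking work here, e.g. joining the GPU thread ... */
    if (toy_prepare_read() != 0)  /* must prepare again before reading */
        return 0;
    return toy_read_events() == 0;
}
```

Skipping the `toy_cancel_read()` step and blocking while still prepared is exactly the API-contract violation that produced the original Chromium deadlock.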