On 2022-01-13 11:42, Pekka Paalanen wrote:
> On Thu, 13 Jan 2022 10:05:56 +0200
> Maksim Sisov <msi...@igalia.com> wrote:
>
>> + wayland-devel ML.
>>
>> Hi Pekka,
>>
>> Thanks for your answers! Please see my questions inlined.
>>
>> On 2022-01-12 16:16, Pekka Paalanen wrote:
>>
>> > On Mon, 10 Jan 2022 08:53:50 +0200
>> > Maksim Sisov <msi...@igalia.com> wrote:
>> >
>> >> Hi Pekka,
>> >>
>> >> Thanks for answering all my previous questions before.
>> >
>> > Hi Maksim,
>> >
>> > looks like that was a year ago. I had forgotten. :-)
>>
>> Time flies!
>>
>> >> I came up with a new question and wondered what your opinion was.
>> >>
>> >> In Chromium, we have two modes - one is the normal mode, where the
>> >> GPU service runs in a separate sandboxed process and uses the
>> >> surfaceless path with libgbm (we pass a dmabuf to the browser
>> >> process, where the Wayland connection is...), and the other is
>> >> --in-process-gpu, where the GPU service runs in the same browser
>> >> process, but on a different thread. That also uses the surfaceless
>> >> path by default.
>> >>
>> >> However, when either libgbm is not available or the system in
>> >> question doesn't support DRM render nodes, the browser may fall
>> >> back to Wayland EGL. It is only used if the browser runs with the
>> >> --in-process-gpu flag, as we need to access a wl_surface somehow.
>> >> That means the GPU service on a separate thread uses Wayland EGL,
>> >> while the main UI thread does all the UI work, manages the
>> >> connection, etc.
>> >>
>> >> A long time ago, we figured out there was a problem with
>> >> --in-process-gpu + Wayland EGL. Because both Chromium, as a Wayland
>> >> client, and Wayland EGL, which is another client, prepare/read/
>> >> dispatch events (each with its own event queue, of course), we
>> >> started to experience deadlocks.
>> >
>> > Surely they are the same client (connection)?
>>
>> Yes, they are.
>>
>> >> The deadlock happened when Chromium was closing a native window:
>> >> our event loop had already prepared to read from the display, but
>> >> it then didn't read, because the thread was tearing down the GPU
>> >> thread, which was calling eglSwapBuffers. That internally also
>> >> called prepare and then read, but the read didn't complete - it
>> >> blocked, because another thread had already prepared to read but
>> >> never read either. So the UI thread made a blocking wait for the
>> >> GPU thread to finish its work before tearing it down.
>> >>
>> >> As you can see from my description, the UI thread prepared to read
>> >> the Wayland display, then switched to tearing down the GPU thread,
>> >> which was busy processing our GL commands, and it had to wait until
>> >> that was done. The GPU thread used Wayland EGL, which also prepared
>> >> to read, and its read call blocked because other threads had
>> >> announced their intention to read.
>> >
>> > I see. Yes, there is the assumption that if a thread prepares to read,
>> > then it will either read or cancel the read in finite time (ASAP).
>> >
>> >> To overcome this problem, it was decided to use a separate polling
>> >> thread. What it does now is that the thread prepares/polls/reads
>> >> (we use libevent or libglib, btw; the choice depends on how
>> >> Chromium is configured), while the UI thread dispatches events
>> >> whenever our "polling" thread asks. This way, the polling thread
>> >> can always call wl_display_read_events and the deadlock never
>> >> happens.
>> >
>> > Right, I can see that working.
>> >
>> >> I wonder if this is the right way to do it. I also wonder whether
>> >> doing this on a separate thread gives any performance benefits?
>> >
>> > It works, but it's not a "proposed" design in my opinion. Doing
>> > all of prepare/read/dispatch in each thread is the "intended" usage.
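[Editor's note: the contract Pekka describes - a thread that calls prepare must read or cancel in finite time, or every other reader blocks - can be illustrated with a small self-contained mock. The `mock_*` and `run_demo` names are invented for this sketch; real code would call `wl_display_prepare_read()`, `wl_display_read_events()` and `wl_display_cancel_read()`, whose internals this only approximates.]

```c
#include <pthread.h>
#include <unistd.h>

/* Mock of libwayland's reader bookkeeping: a reader count plus a
 * condition variable, roughly like display->reader_count internally. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int prepared;    /* threads that announced intent to read */
static int generation;  /* bumped whenever a read (or last cancel) happens */

/* wl_display_prepare_read() stand-in: announce intent to read. */
static void mock_prepare_read(void)
{
    pthread_mutex_lock(&lock);
    prepared++;
    pthread_mutex_unlock(&lock);
}

/* wl_display_read_events() stand-in: the last announced reader performs
 * the read and wakes everyone; earlier callers block until then. */
static void mock_read_events(void)
{
    pthread_mutex_lock(&lock);
    if (--prepared == 0) {
        generation++;
        pthread_cond_broadcast(&cond);
    } else {
        int gen = generation;
        while (gen == generation)  /* blocked: some thread prepared but
                                      hasn't read or cancelled yet */
            pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}

/* wl_display_cancel_read() stand-in: back out of the announcement; if we
 * were the last announced reader, unblock the waiting readers. */
static void mock_cancel_read(void)
{
    pthread_mutex_lock(&lock);
    if (--prepared == 0) {
        generation++;
        pthread_cond_broadcast(&cond);
    }
    pthread_mutex_unlock(&lock);
}

/* A thread that prepares, goes off to do "something else", then cancels.
 * Remove the mock_cancel_read() call and run_demo() deadlocks exactly
 * like the Chromium teardown described above. */
static void *prepare_then_cancel(void *arg)
{
    (void)arg;
    mock_prepare_read();
    usleep(100 * 1000);
    mock_cancel_read();
    return NULL;
}

int run_demo(void)
{
    pthread_t t;
    mock_prepare_read();
    pthread_create(&t, NULL, prepare_then_cancel, NULL);
    mock_read_events();   /* blocks until the other thread cancels */
    pthread_join(t, NULL);
    return 1;
}
```

If `run_demo()` returns, the contract was honoured; the deadlock in the thread above is precisely the case where the second thread neither reads nor cancels.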
>> >
>> > I think the core of the problem is the thread that prepares to read and
>> > then doesn't read but goes to do something else. That is basically a
>> > violation of the API contract of libwayland-client. Unfortunately in
>> > this case it leads to a solid deadlock. The documentation is very clear
>> > about it:
>> >
>> >  * This function (or wl_display_prepare_read()) must be called before
>> >  * reading from the file descriptor using wl_display_read_events().
>> >  * Calling wl_display_prepare_read_queue() announces the calling
>> >  * thread's intention to read and ensures that until the thread is
>> >  * ready to read and calls wl_display_read_events(), no other thread
>> >  * will read from the file descriptor.
>> >
>> > You could also call wl_display_cancel_read() instead:
>> >
>> >  * After a thread successfully called wl_display_prepare_read() it must
>> >  * either call wl_display_read_events() or wl_display_cancel_read().
>> >  * If the threads do not follow this rule it will lead to deadlock.
>> >
>> > But if you cancel, then you must not read without preparing again first.
>> >
>> > The separate polling thread does not sound bad though. It is a correct
>> > design. Now that I think of it, wl_display_read_events() has a
>> > pthread_cond_wait() in it that blocks all other reading threads until
>> > the last reading thread actually calls read (or cancel). The separate
>> > polling thread, like you described it, might avoid some of that
>> > blocking.
>> >
>> > Perhaps the polling thread design is better when it is possible. You
>> > can use it in your own code, but it's not really possible for things
>> > like EGL implementations, since EGL does not define an API for
>> > cross-thread wake-ups.
>>
>> I see. The only problem here is that we don't use poll, but rather
>> libevent, which is wrapped in Chromium's MessagePumpLibevent. It
>> constantly notifies us that an fd is ready to be read.
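[Editor's note: the dedicated-polling-thread design discussed above - one thread owns the fd, the UI thread dispatches on request - can be sketched with a plain pipe standing in for the Wayland socket. The `poller`, `ui_wait_and_dispatch` and `run_polling_demo` names are invented for this sketch; in real code the read would be `wl_display_read_events()` and the dispatch `wl_display_dispatch_pending()`.]

```c
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

static int pipe_fds[2];  /* stand-in for the Wayland display fd */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int dispatch_pending;

/* Polling thread: the only thread that ever touches the fd, so its
 * prepare -> poll -> read sequence can never be stalled by another
 * reader. After reading it asks the UI thread to dispatch. */
static void *poller(void *arg)
{
    (void)arg;
    struct pollfd pfd = { .fd = pipe_fds[0], .events = POLLIN };
    poll(&pfd, 1, -1);
    char buf[16];
    read(pipe_fds[0], buf, sizeof buf);  /* "wl_display_read_events()" */
    pthread_mutex_lock(&mtx);
    dispatch_pending = 1;                /* wake the UI thread */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* UI thread: sleeps until the poller asks, then dispatches the queued
 * events ("wl_display_dispatch_pending()" in real code). */
static int ui_wait_and_dispatch(void)
{
    pthread_mutex_lock(&mtx);
    while (!dispatch_pending)
        pthread_cond_wait(&cv, &mtx);
    dispatch_pending = 0;
    pthread_mutex_unlock(&mtx);
    return 1;
}

int run_polling_demo(void)
{
    pthread_t t;
    if (pipe(pipe_fds) != 0)
        return 0;
    pthread_create(&t, NULL, poller, NULL);
    write(pipe_fds[1], "x", 1);  /* "the compositor sent an event" */
    int ok = ui_wait_and_dispatch();
    pthread_join(t, NULL);
    close(pipe_fds[0]);
    close(pipe_fds[1]);
    return ok;
}
```

Because the poller never blocks on another thread's prepare, the deadlock from the earlier teardown scenario cannot occur, which matches Pekka's observation that this design may even avoid some of the `pthread_cond_wait()` blocking inside `wl_display_read_events()`.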
>
> You say "constantly", do you mean it never waits?
>
>> Here is a diagram that shows the flow without a dedicated thread -
>> https://drive.google.com/file/d/1Jq8aX4cWszwL5fleE8ayGOeond70xChS/view?usp=sharing
>>
>> As you can see, it returns from OnCanRead and waits until libevent
>> notifies us the next time that we can actually read events (if we
>> prepared to read previously). And if the main thread blocks because of
>> the above-mentioned reason (libevent didn't have a chance to notify
>> us), the events will never be read.
>
> Are you describing the old approach or the new approach with the
> separate polling thread?
I'm describing the old one, as having a dedicated thread doesn't really
bring much benefit, as I said in my first message.

> Doing prepare_read from the OnCanRead handler seems a bit backwards.
> The prepare_read dance is there to make sure that when you are going
> to wait for new events (block), you have processed all incoming events
> so far and flushed out all requests that might have resulted from
> those, so that if you are waiting for replies to those requests, they
> will actually arrive. So the prepare_read/dispatch/flush dance should
> be done from a "the event loop is going to sleep now" hook, not from a
> "the fd has bytes to read" hook.

I checked libevent - they added prepare/check watchers in
https://github.com/libevent/libevent/pull/793 in spring 2019. Chromium
still uses libevent 1.4.15
(https://source.chromium.org/chromium/chromium/src/+/main:base/third_party/libevent/README.chromium),
which looks to be old. It seems we need an update that hasn't been done
for ages. Moreover, it seems to be patched :\ I'll need to check whether
we can upgrade.

> Hmm, yeah... I suppose when your event loop decides to do something
> else than actually read the Wayland socket after it was prepared for
> reading, and that something else is a blocking thing, you would have to
> cancel the read first just in case, and then prepare_read again to be
> able to read afterwards.
>
> Maybe others here have better ideas, or know libevent or glib better?
> I recall people having written integrations with those event loop
> libraries, but I can't remember who and where.
>
> Thanks,
> pq

-- 
Best Regards,
Maksim Sisov

* Usual work time - 08:00-16:00 (EEST).
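[Editor's note: Pekka's last suggestion - cancel the read before blocking on something else, then prepare again before reading - is a small state machine, sketched here with a toy single-threaded mock. The `toy_*` and `teardown_safely` names are invented for illustration; the real calls are `wl_display_prepare_read()`, `wl_display_cancel_read()` and `wl_display_read_events()`, and in real code the cancel is what prevents other readers from deadlocking while this thread blocks.]

```c
/* Toy model of the prepare/cancel/read contract. prepared_flag mirrors
 * "this thread has announced an intention to read". */
static int prepared_flag;

static int toy_prepare_read(void)
{
    if (prepared_flag)
        return -1;        /* already prepared: misuse */
    prepared_flag = 1;
    return 0;
}

static void toy_cancel_read(void)
{
    prepared_flag = 0;
}

static int toy_read_events(void)
{
    if (!prepared_flag)
        return -1;        /* reading without preparing: misuse */
    prepared_flag = 0;
    return 0;
}

/* The sequence Pekka describes: prepared to read, but about to block on
 * something else (e.g. joining the GPU thread), so cancel first, do the
 * blocking work, then prepare again before reading. */
int teardown_safely(void)
{
    if (toy_prepare_read() != 0)
        return 0;
    toy_cancel_read();    /* about to block: back out of the read first */
    /* ... blocking work here, e.g. joining the GPU thread ... */
    if (toy_prepare_read() != 0)  /* must prepare again before reading */
        return 0;
    return toy_read_events() == 0;
}
```

Skipping the `toy_cancel_read()` step and blocking while still prepared is exactly the API-contract violation that produced the original Chromium deadlock.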