Hey all!

While working on some maintenance scripts for TimedMediaHandler, I've been
trying to make it easier to write scripts that use multiple parallel
processes to run through a large input set faster.

My proposal is a ForkableMaintenance class, with an underlying
QueueingForkController which is a refactoring of the
OrderedStreamingForkController used by (at least) some CirrusSearch
maintenance scripts.

Patch in progress:
https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/451099/

The expected use case is a relatively long loop of reading input data
interleaved with running CPU-intensive or DB-intensive jobs, where the
individual jobs are independent and don't depend strongly on input order.
(The order in which items _run_ on the child processes is not guaranteed,
but results are processed in the same order as the input, making output
across runs predictable for idempotent processing.)

A simple ForkableMaintenance script might look like:

class Foo extends ForkableMaintenance {
  // Handle any input on the parent thread, and
  // pass any data as JSON-serializable form into
  // the queue() method, where it gets funneled into
  // a child process.
  public function loop() {
     for ( $i = 0; $i < 1000; $i++ ) {
       $this->queue( $i );
     }
  }

  // On the child process, receives the queued value
  // via JSON encode/decode. Here it's a number.
  public function work( $count ) {
    return str_repeat( '*', $count );
  }

  // On the parent thread, receives the work() return value
  // via JSON encode/decode. Here it's a string.
  public function result( $data ) {
    $this->output( $data . "\n" );
  }
}

Because data is serialized as JSON and sent over a pipe, you can't send
live objects like Titles or Pages or Files, but you can send arrays or
associative arrays with fairly complex data.
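For instance, instead of queueing a Title object you'd pass a scalar
identifier through the pipe and rehydrate the object on the child side.
A hypothetical sketch against the proposed API (the getPageIds() helper
is made up for illustration):

```php
// Sketch against the proposed ForkableMaintenance API: pass scalar
// identifiers through the JSON pipe, and rebuild heavyweight objects
// (here a Title) in the child process.
class ConvertSubtitles extends ForkableMaintenance {
	public function loop() {
		// getPageIds() is a hypothetical helper for this example.
		foreach ( $this->getPageIds() as $pageId ) {
			// Plain integers and arrays survive json_encode()/json_decode().
			$this->queue( [ 'pageId' => $pageId ] );
		}
	}

	public function work( $data ) {
		// Rebuild the live object on the child process.
		$title = Title::newFromID( $data['pageId'] );
		// ... do the expensive per-page work here ...
		return [ 'pageId' => $data['pageId'], 'ok' => $title !== null ];
	}

	public function result( $data ) {
		$this->output( "Page {$data['pageId']}: " .
			( $data['ok'] ? "done" : "missing" ) . "\n" );
	}
}
```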

There is a little per-job overhead, and multiple processes can cause more
contention on the database server etc, but it scales well on the subtitle
format conversion script I'm testing with, which does a little DB loading
and some CPU work. On an 8-core/16-thread test server:

threads   runtime (s)   speedup
      0           320       n/a
      1           324      0.99
      2           183      1.75
      4           105      3.05
      8            66      4.85
     16            58      5.52
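As a rough sanity check (my own back-of-the-envelope math, not from
profiling), fitting the 8-thread number to Amdahl's law suggests that
around 9% of the work is effectively serialized (input reading, result
handling, pipe overhead):

```php
<?php
// Back-of-the-envelope Amdahl's-law fit to the measured numbers above.
// The serial fraction s solves: speedup = 1 / ( s + ( 1 - s ) / n )
$n = 8;
$speedup = 320 / 66;  // ~4.85x at 8 threads
$serial = ( $n / $speedup - 1 ) / ( $n - 1 );
printf( "serial fraction ~ %.0f%%\n", $serial * 100 );  // ~9%
```

That fit would predict about 6.7x at 16 threads versus the observed
5.5x, which is consistent with extra contention kicking in past the
physical core count.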

I've added a PHPUnit test case for OrderedStreamingForkController to make
sure I don't regress something used by other teams, but I noticed a couple
of problems with using this fork stuff in the test runners.

First, doing pcntl_fork() inside a PHPUnit test case has some potential
side effects, since there's a lot of tear-down work done in destructors
even if you call exit() after processing completes. As a workaround, I'm
having the child processes send a SIGKILL to themselves to terminate
immediately without running test-case destructors.
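The workaround boils down to something like this in the child once its
work is done (a sketch; see the Gerrit change above for the actual code):

```php
// In the child process, after the work loop finishes: bypass normal
// shutdown (and PHPUnit's destructors) by delivering SIGKILL to
// ourselves. Requires the posix and pcntl extensions.
posix_kill( posix_getpid(), SIGKILL );
```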

Second, probably related to that, I'm seeing a failure in the code
coverage calculations -- it sees some increased coverage on the parent
process at least, but seems to think something is returning a non-zero
exit code somewhere, which marks the whole operation as a failure:

https://integration.wikimedia.org/ci/job/mediawiki-phpunit-coverage-patch-docker/460/console

Worst case, we can probably exclude these from some automated tests, but
that always seems like a bad idea. :D

If anybody else is using, or thinking about using, ForkController and its
friends and wants to help polish this up, give a shout!

-- brion
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
