On Tue, Jan 3, 2023 at 6:07 AM Ilya Maximets via discuss <
ovs-discuss@openvswitch.org> wrote:
>
> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> > Hello,
> >
> > When performing an online schema conversion for a clustered DB, the
> > `ovsdb-client` connects to the current leader of the cluster and
> > requests that it convert the DB to a new schema.
> >
> > The main thread of the leader ovsdb-server will then parse the new
> > schema and copy the entire database into a new in-memory copy using
> > the new schema. For a moderately sized database, let's say 650 MB
> > on disk, this process can take north of 24 seconds on a modern,
> > adequately performant system.
> >
> > While this is happening, the ovsdb-server process will not process any
> > raft election events or inactivity probes, so by the time the
> > conversion is done and the now-former leader wants to write the
> > converted database to the cluster, its connection to the cluster is
> > dead.
> >
> > The former leader will keep repeating this process indefinitely, until
> > the client requesting the conversion disconnects. No message is passed
> > back to the client.
> >
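
To make the failure mode concrete for anyone following along: the leader's
main loop is single-threaded, so during the conversion no raft heartbeats
or inactivity-probe replies go out at all. A toy model (not OVS code, just
an illustration using the numbers from the example above):

    /* Toy model: one long blocking call starves every raft deadline. */
    #include <stdio.h>

    #define ELECTION_TIMER_MS  1000   /* default election timer (1 s)  */
    #define CONVERSION_MS     24000   /* ~24 s conversion from above   */

    int
    main(void)
    {
        int missed = 0;

        /* Every deadline that falls inside the blocking window is
         * silently missed; nothing else gets to run. */
        for (int t = ELECTION_TIMER_MS; t <= CONVERSION_MS;
             t += ELECTION_TIMER_MS) {
            missed++;
        }

        printf("blocked for %d ms -> %d missed deadlines; the followers\n"
               "start a new election after the first one\n",
               CONVERSION_MS, missed);
        return 0;
    }

The followers only need to miss one election-timer interval to start a new
election, which is why the old leader can never commit the converted data.
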
> > Meanwhile, the other nodes in the cluster have moved on and elected a
> > new leader.
> >
> > A workaround for this scenario would be to increase the election timer
> > to a value large enough that the conversion can succeed within a
> > single election window.
> >
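
(A side note for anyone hitting this on a live cluster: if I remember
correctly, the election timer can be raised at runtime on the leader with
the `cluster/change-election-timer` appctl command, e.g. something like
`ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer
OVN_Southbound 16000`, where the socket path and value here are just
examples. IIRC each call can only roughly double the current value, so a
large increase takes several invocations.)
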
> > I don't view this as a permanent solution though, as it would be
> > unfair to leave the end user guessing the correct election timer
> > just to make their upgrades succeed.
> >
> > Maybe we need to hand the conversion off to a thread and have the main
> > loop process only raft requests until it is done, similar to the
> > recent addition of preparing the snapshot JSON in a separate thread [0].
> >
> > Any other thoughts or ideas?
> >
> > 0:
> > https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> >
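
If we did go the thread route, I imagine the shape would be roughly like
the sketch below (my own illustration of the hand-off pattern, not the
code from [0]; names and numbers are made up):

    /* Worker does the long job; the main loop keeps servicing raft. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_bool conversion_done;

    static void *
    convert_worker(void *arg)
    {
        (void) arg;
        sleep(2);   /* stands in for the multi-second schema conversion */
        atomic_store(&conversion_done, true);
        return NULL;
    }

    int
    main(void)
    {
        pthread_t worker;
        pthread_create(&worker, NULL, convert_worker, NULL);

        /* Main loop: only raft housekeeping until the worker is done. */
        while (!atomic_load(&conversion_done)) {
            printf("main loop: heartbeats and probes still answered\n");
            usleep(500 * 1000);
        }

        pthread_join(worker, NULL);
        printf("main loop: conversion ready, append result to the log\n");
        return 0;
    }

The tricky part would be deciding which requests the main loop is still
allowed to process while the worker holds a snapshot of the database.
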
>
> Hi, Frode.  Thanks for starting this conversation.
>
> First of all, I'd still respectfully disagree that 650 MB is a
> moderately sized database. :)  ovsdb-server on its own doesn't limit
> how much data users can put in, but that doesn't mean there is no
> size beyond which it becomes difficult or even impossible for it
> to handle the database.  From my experience, 650 MB is far beyond
> the threshold for smooth operation.
>
> Allowing the database to grow to such a size might be considered a
> user error, or a CMS error.  In any case, setups should be tested at
> the desired (at least simulated) scale, including upgrades, before
> deploying to a production environment, so as not to run into such
> issues unexpectedly.
>
> Another way out of the situation, besides bumping the election
> timer, might be to pin ovn-controllers, destroy the database (maybe
> keeping port bindings, etc.) and let northd re-create it after the
> conversion.  Not sure if that will actually work though, as I
> haven't tried it.
>
>
> For the threads, I'll reiterate my view that throwing more cores
> at the problem is absolutely the last thing we should do, and only
> if there is no other choice, simply because many parts of
> ovsdb-server were never optimized for performance and there are
> likely many things we can improve without blindly using more
> resources and increasing the code complexity by adding threads.
>
>
> Speaking of suboptimal code, the conversion process seems very
> inefficient.  Let's deconstruct it.  (I'll skip the standalone
> case and focus on the clustered mode.)
>
> There are a few main steps:
>
> 1. ovsdb_convert() - Creates a copy of the database, converting
>    each column along the way, and checks the constraints.
>
> 2. ovsdb_to_txn_json() - Converts the new database into a
>    transaction JSON object.
>
> 3. ovsdb_txn_propose_schema_change() - Writes the new schema
>    and the transaction JSON to the storage (RAFT).
>
> 4. ovsdb_destroy() - The copy of the database is destroyed.
>
>    -----
>
> 5. read_db()/parse_txn() - Reads the new schema and the
>    transaction JSON from the storage, replaces the current
>    database with an empty one and replays the transaction
>    that creates a new converted database.
>
> There is always a storage run between steps 4 and 5, so we generally
> only care that steps 1-4 and step 5 each stay below the election
> timer threshold separately.
>
>
> Now let's look closer at step 1, which is the most time-consuming
> step.  It has two stages - data conversion and the transaction
> check.  The data conversion part makes sure that we're creating all
> the rows in the new database with all the new columns and without
> the removed columns.  It also makes sure that all the datum objects
> are converted from the old column type to the new column type by
> calling ovsdb_datum_convert() for every one of them.
>
> Datum conversion is a very heavy operation, because it involves
> converting the datum to JSON and back.  However, in the vast majority
> of cases column types do not change at all, and even if they do, it
> only happens for a few columns, not for all of them.
>
> This part can be optimized by not converting columns whose type
> didn't change, and instead just creating a shallow copy of the datum
> with ovsdb_datum_clone().
>
> In my preliminary testing this saves about 70% of the time spent
> in ovsdb_convert() and reduces the overall conversion time by
> about 30%.  Creation of a shallow copy also speeds up the database
> destruction at step 4 and saves some memory.
>
> I'll post the patches I have for this a bit later, after cleaning
> them up.

Thanks Ilya. This is great, and I have acked your patches.
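
Just to make sure I read the optimization right, the win is basically
this (a self-contained toy of my own, not the actual ovsdb datum code;
the real helpers are ovsdb_datum_clone()/ovsdb_datum_convert(), but the
structures and call shapes below are made up):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct datum {
        char *payload;      /* stands in for the real datum contents   */
        unsigned refcount;  /* shared between the old and new database */
    };

    static struct datum *
    datum_deep_convert(const struct datum *d)
    {
        /* Stand-in for the expensive path: JSON round trip + checks. */
        struct datum *copy = malloc(sizeof *copy);
        copy->payload = strdup(d->payload);
        copy->refcount = 1;
        return copy;
    }

    static struct datum *
    datum_shallow_clone(struct datum *d)
    {
        d->refcount++;      /* O(1): no parsing, no new allocation */
        return d;
    }

    static void
    datum_unref(struct datum *d)
    {
        if (!--d->refcount) {   /* destroy is cheaper for clones too */
            free(d->payload);
            free(d);
        }
    }

    int
    main(void)
    {
        struct datum seed = { .payload = "old value", .refcount = 1 };
        struct datum *old = datum_deep_convert(&seed);
        int type_changed = 0;   /* the common case during conversion */

        struct datum *new_copy = type_changed
                                 ? datum_deep_convert(old)
                                 : datum_shallow_clone(old);

        printf("shared refcount: %u\n", new_copy->refcount);
        datum_unref(new_copy);
        datum_unref(old);
        return 0;
    }

If that's the idea, it also explains why step 4 gets faster: destroying
the converted copy mostly just drops references instead of freeing real
data.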

>
> The next part of ovsdb_convert() is ovsdb_txn_replay_commit(),
> which we can't really get rid of, because we have to perform all
> the transaction checks, including the referential integrity checks
> that take up most of the time.  We might look at further optimizations
> for the weak reference re-assessment process in the future though,
> as it seems to be the heaviest part.
>
>
>
> Now to steps 2 and 3.
> Creating a transaction JSON and writing it to the storage seems
> completely redundant.  I understand it was probably done this way
> because conversion to JSON and back was cheaper than
> ovsdb_convert().  However, with the change for step 1, that will
> no longer be the case.
>
> Database conversion is a deterministic process.  All we need to
> perform it is the database content, which is already in the
> storage, and the new schema.  So, we should be able to drop
> step 2 and write only the new schema to the storage in step 3.
> In step 5 we'll need to read the schema and call ovsdb_convert()
> again, but with the above optimizations that should not be heavier
> than parsing the transaction from JSON and replaying it.  RAFT will
> ensure that we're converting the same data as in the first
> ovsdb_convert().
>
> Writing only the new schema to the storage may mean a backward
> incompatible database file format change, but that should not
> be very hard to handle, taking into account that it only happens
> during conversion and that the next compaction will get rid of
> the incompatibility.
>
> I'll sketch up some patches for this as well.  I haven't tried it
> yet, so I don't have any performance data.
>

For this part, I think your proposal is more efficient, but it is not clear
to me how you would implement it for RAFT. I think steps 2, 3 and 5 are
currently done this way probably because it is the easiest way to reuse
the RAFT mechanism. In your proposal, are you thinking of introducing
something new in the RAFT command to trigger ovsdb_convert() on each node?
Could you explain more about it?
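
To illustrate what I'm imagining (purely my own guess at the shape, not
something from your patches; the struct and the logic below are made up):

    #include <stdio.h>

    struct log_entry {
        const char *schema;   /* new schema JSON, or NULL  */
        const char *data;     /* transaction JSON, or NULL */
    };

    static void
    apply_entry(const struct log_entry *e)
    {
        if (e->schema && !e->data) {
            /* New-style conversion record: each node re-runs the same
             * deterministic ovsdb_convert() on the data it already has. */
            printf("convert local database to the new schema\n");
        } else if (e->schema && e->data) {
            /* Old-style conversion record: replay pre-converted data. */
            printf("replace database with converted data from the log\n");
        } else {
            printf("ordinary transaction: replay it\n");
        }
    }

    int
    main(void)
    {
        struct log_entry old_style = { "\"new schema\"", "{...}" };
        struct log_entry new_style = { "\"new schema\"", NULL };

        apply_entry(&old_style);
        apply_entry(&new_style);
        return 0;
    }

Is that roughly the direction, i.e. a schema-only record in the log that
every node interprets as "convert your own copy"?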

Thanks,
Han

>
> All in all, the conversion process will be condensed to just two
> calls to the optimized ovsdb_convert() and two to ovsdb_destroy():
> one before writing to the storage and one after reading from it.
> This should be enough to cover most of the use cases, I hope.
>
> Thoughts?
>
> Best regards, Ilya Maximets.
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
