Hi Mark,

Writing to the DB in bulk would be the first step. Have you looked into writing to the DB with a larger batch size? I believe mysql-beam-connector also supports this.
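Roughly something like the sketch below. This is only an illustration: the host, credentials, and table are placeholders, and the exact parameter names (in particular batch_size on WriteToMySQL) depend on the connector version you have installed, so please verify against its README.

    # Sketch of a batched write with beam-mysql-connector; values are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from beam_mysql.connector.io import WriteToMySQL

    def run():
        options = PipelineOptions()  # add your Dataflow options here
        with beam.Pipeline(options=options) as p:
            (
                p
                | "ReadDatapoints" >> beam.Create([{"id": 1, "value": 42.0}])  # placeholder source
                | "WriteToMySQL" >> WriteToMySQL(
                    host="your-cloud-sql-host",   # hypothetical connection details
                    database="your_db",
                    table="datapoints",
                    user="your_user",
                    password="your_password",
                    port=3306,
                    batch_size=1000,  # batch many rows per INSERT instead of writing one at a time
                )
            )

    if __name__ == "__main__":
        run()

Batching at the connector level usually buys you a lot more than adding workers, because each single-row INSERT pays a full round trip and commit.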
On Wed, May 18, 2022 at 2:13 AM Mark Striebeck <[email protected]> wrote:
> Hi,
>
> We have a data pipeline that produces ~400M datapoints each day. If we run
> it without storing, it finishes in a little over an hour. If we run it and
> store the datapoints in a MySQL database, it takes several hours.
>
> We are running on GCP Dataflow; the MySQL instances are hosted GCP
> instances. We are using mysql-beam-connector
> <https://github.com/esakik/beam-mysql-connector>.
>
> The pipeline writes ~5000 datapoints per second.
>
> A couple of questions:
>
> - Does this throughput sound reasonable, or could it be significantly
>   improved by optimizing the database?
> - The pipeline runs several workers to write this out - and because
>   it's a write operation they contend for write access. Is it better to
>   write out through just one worker and one connection?
> - Is it actually faster to write from the pipeline to Pub/Sub or Kafka
>   or such and have a client on the other side which then writes in bulk?
>
> Thanks for any ideas or pointers (no, I'm by no means an
> experienced DBA!!!)
>
> Mark
