There is typically no trivial way to partition a SQL query so that the
read parallelizes automatically, since you generally can't have multiple
"readers" connect to and consume the same query results. There are a
couple of options you could use:
1) Run a single query, use the database's ability to dump the contents to a
file, and then perform a parallel read over the file's contents
2) Partition your query over a primary key and submit N queries, where N is
the number of parallel reads you want (alternatively, dump to a temp table
and partition your query) -- see the sketch after this list
3) Write your own splittable DoFn that can partition a query arbitrarily
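
To make option 2 concrete, here is a minimal sketch, assuming an integer
primary key; the connection string, table/column names, and the
key_ranges/read_range helpers are placeholders of mine, and the min/max ids
would normally come from a small up-front query:

import pyodbc
import apache_beam as beam

CONN_STR = ('DRIVER={ODBC Driver 17 for SQL Server};'
            'SERVER=...;DATABASE=...;UID=...;PWD=...')
NUM_PARTITIONS = 8

def key_ranges(min_id, max_id, n):
  # Split [min_id, max_id] into n roughly equal (lo, hi) ranges.
  step = (max_id - min_id) // n + 1
  return [(min_id + i * step, min(min_id + (i + 1) * step - 1, max_id))
          for i in range(n)]

def read_range(lo_hi):
  # Run one bounded query; each range can be read on a different worker.
  lo, hi = lo_hi
  conn = pyodbc.connect(CONN_STR)
  try:
    cursor = conn.cursor()
    cursor.execute(
        'SELECT id, payload FROM my_table WHERE id BETWEEN ? AND ?', lo, hi)
    columns = [c[0] for c in cursor.description]
    for row in cursor:
      yield dict(zip(columns, row))
  finally:
    conn.close()

with beam.Pipeline() as p:
  rows = (
      p
      | beam.Create(key_ranges(1, 1000000, NUM_PARTITIONS))  # N range specs
      | beam.Reshuffle()            # let the runner distribute the ranges
      | beam.FlatMap(read_range))   # each range is read independently

Each range becomes its own query, so the runner can fan the reads out across
workers; the trade-off is that you need a reasonably uniform key to avoid skew.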

Do you see any meaningful errors in the logs when running your pipeline on
Dataflow? If the issue is that pyodbc isn't installed on the remote workers,
you should check out how to manage pipeline dependencies[1].
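
For reference, a hedged sketch of what shipping pyodbc to the workers can
look like per [1] (the file name, version pin, and option values below are
placeholders; note that pyodbc also needs the system ODBC driver libraries
on the worker, which a plain requirements file won't install, so a setup.py
or custom container may be needed for that part):

# requirements.txt (placeholder pin):
#   pyodbc==4.0.30
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder
    region='us-central1',                # placeholder
    temp_location='gs://my-bucket/tmp')  # placeholder
options.view_as(SetupOptions).requirements_file = 'requirements.txt'
# pass `options` to beam.Pipeline(options=options)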

Option 3 has the best parallelism potential, followed by 1, while 2 is the
easiest to get working, followed by 1 (assuming the dump file format is
already supported by Beam). On the writing side, you can use
GroupIntoBatches[2] to get meaningfully large transactions, or try the JdbcIO
write transform once it becomes available in a cross-language way.
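
A minimal sketch of the batched-write side (again, the connection string,
table, columns, key function, and batch size are placeholders;
GroupIntoBatches works on (key, value) pairs, hence the keying step):

import pyodbc
import apache_beam as beam

CONN_STR = ('DRIVER={ODBC Driver 17 for SQL Server};'
            'SERVER=...;DATABASE=...;UID=...;PWD=...')

class WriteBatch(beam.DoFn):
  def process(self, keyed_batch):
    _, rows = keyed_batch
    conn = pyodbc.connect(CONN_STR)
    try:
      cursor = conn.cursor()
      cursor.fast_executemany = True   # pyodbc bulk parameter binding
      cursor.executemany(
          'INSERT INTO my_table (id, payload) VALUES (?, ?)',
          [(r['id'], r['payload']) for r in rows])
      conn.commit()                    # one transaction per batch
    finally:
      conn.close()

def write_rows(rows):
  return (
      rows
      | 'Key' >> beam.Map(lambda row: (row['id'] % 10, row))  # spread load
      | 'Batch' >> beam.GroupIntoBatches(500)
      | 'Write' >> beam.ParDo(WriteBatch()))

For upsert semantics on MSSQL, the per-batch statement could be a MERGE
instead of the plain INSERT above.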

There is ongoing work[3, 4, 5] to make the existing JdbcIO connector in Beam
Java available to Python as a cross-language transform.

I don't know of another way to get on the mailing list other than sending an
e-mail to user-subscr...@beam.apache.org

1: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
2: https://github.com/apache/beam/blob/62118fa66788ad45032a60abc30799cd6f0e4658/sdks/python/apache_beam/transforms/util.py#L744
3: https://github.com/apache/beam/pull/12145
4: https://issues.apache.org/jira/browse/BEAM-10135
5: https://issues.apache.org/jira/browse/BEAM-10136

On Fri, Jul 17, 2020 at 9:21 AM Dennis <denniszvigel...@gmail.com> wrote:

> Hello,
>
>
>
> I'm writing in order to inquire about developing a pipeline (using the
> Python SDK) with multiple PTransforms that can read from, write to, and
> alter data from an MSSQL server.
>
>
>
> I've been using beam-nuggets (https://pypi.org/project/beam-nuggets/), a
> community I/O Connector for dealing with these kinds of PTransforms for a
> MySQL server, and was looking to see if there's an option to do this for
> MSSQL.
>
>
>
> So far, I've been able to run a pipeline with DirectRunner that reads data
> from MSSQL using pyodbc. While this is a good starting point, it's not
> running with DataflowRunner (even after configuring Private IP), and it's
> not parallelized.
>
>
>
> I tried to look into SQLAlchemy, but it seems that there isn't as much
> support as there is for MySQL, especially for the insertion method. It is
> expected that the default insertion method is upsert. For MySQL, this was
> implemented using:
>
>
>
> from sqlalchemy.dialects.mysql import insert as mysql_insert
>
>
>
> There is no such package available for MSSQL...
>
>
>
> How would one go about doing this? I've looked at several stack overflow
> articles, but there wasn't any solution there that had any similar
> functionality to that of beam-nuggets. Perhaps I missed a solution?
>
>
>
> I realize that this is a loaded question, so I greatly appreciate any help
> in advance.
>
>
>
> Thanks,
>
> Dennis
>
>
>
> P.S. I had trouble adding my work email address, dzvigel...@questrade.com
> to the mailing list (even though I went through the same steps to subscribe
> as with this one), could you please add it? Thanks.
>
