I have a table of about 100 million rows, let's call it tbl_large.  I am
exposing this data as a Kafka stream using a Confluent connector to poll
the table every second or so.  For (no) good reason, I need to do a clean
load of this data each weekend in our lower environments (dev and QA) from
data in our production environment.
I accomplish this by 'cleaning' the topics and wiping the offsets, which
works great for the most part.

My question is one of optimization.  I am using a query to select about
70 million rows from tbl_large.  In the connect properties for this
stream I have the following:

query=select * from tbl_large where asset_type='eqty'
batch.max.rows=100000
mode=timestamp+incrementing
timestamp.column.name=mod_date
incrementing.column.name=id



My understanding of the 'batch.max.rows' setting is that although the
query will return 70 million rows from the database table, after producing
records to the topic for 100,000 of those rows the connector will exit the
poll loop and then subsequently poll the database again, appending a where
clause based on the current time, the last processed incrementing id, and
mod_date.  Is that correct?
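To put my assumption concretely, I picture each subsequent poll
effectively executing something like the query below.  This is just my
mental model; I have not checked the connector source, so the actual SQL
it generates (and how it combines its clause with my query) may differ:

  select * from tbl_large
  where asset_type='eqty'
    and mod_date < ?                                  -- current time at poll
    and ((mod_date = ? and id > ?) or mod_date > ?)   -- last stored offsets
  order by mod_date, id asc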

If that is the case, would it be more optimal if I limited my query by
using the TOP keyword, with the value equal to batch.max.rows:

query = select TOP 100000 * from tbl_large where asset_type='eqty'


This would reduce the amount of data being sent over the network, as well
as possibly the memory required on the server my connector is running on
(depending on the JDBC driver).  Is that correct?
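In other words, what I am hoping each poll would effectively run is
something like the statement below.  Again, this is only a sketch of my
intent; I realize the connector may not combine TOP with its appended
clause this way:

  select TOP 100000 * from tbl_large
  where asset_type='eqty'
    and mod_date < ?
    and ((mod_date = ? and id > ?) or mod_date > ?)
  order by mod_date, id asc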

The other question I have is: does connect always append a where clause to
your query that includes the timestamp and incrementing id?  I am
occasionally seeing misses for large batches with the same timestamp, and
am wondering if perhaps it is because the sort on the incrementing column
is off.

Thanks much for any clarification/education you can provide on this.
