I have a table of about 100 million rows, let's call it `tbl_large`. I am exposing this data as a Kafka stream using a Confluent JDBC source connector that polls the table every second or so. For (no) good reason, I need to do a clean load of this data every weekend into our lower environments (dev and QA) from data in our production environment. I accomplish this by 'cleaning' the topics and wiping the offsets, which works great for the most part.
My question is one of optimization. I am using a query to select about 70 million rows from `tbl_large`. The connect properties for this stream include the following:

```
query = select * from tbl_large where asset_type='eqty'
batch.max.rows = 100000
mode = timestamp+incrementing
timestamp.column.name = mod_date
incrementing.column.name = id
```

My understanding of the `batch.max.rows` setting is that although the query will return 70 million rows from the database table, after producing records for 100,000 of those rows the connector exits the poll loop and then polls the database again, appending a WHERE clause built from the current time, the last processed incrementing id, and `mod_date`. Is that correct?

If that is the case, would it be more optimal to limit my query using the TOP keyword, with the value equal to `batch.max.rows`:

```
query = select TOP 100000 * from tbl_large where asset_type='eqty'
```

This would reduce the amount of data being sent over the network, as well as possibly the memory required on the server my connector is running on (depending on the JDBC driver). Is that correct?

The other question I have is: does Connect always append a WHERE clause to your query that includes the timestamp and incrementing id? I am occasionally seeing missed rows in large batches that share the same timestamp, and I wonder if it is because the sort on the incrementing column is off. Thanks much for any clarification/education you can provide on this.
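For context, here is my mental model of what each subsequent poll runs in `timestamp+incrementing` mode. This is only a sketch of what I assume the connector generates; the exact SQL, column quoting, and how the appended criteria compose with the WHERE clause already in my query will depend on the connector version and dialect:

```sql
select * from tbl_large where asset_type='eqty'
  AND mod_date < ?                   -- current DB time (minus any configured delay)
  AND ((mod_date = ? AND id > ?)     -- same timestamp as the stored offset: advance by id
       OR mod_date > ?)              -- or any strictly newer timestamp
ORDER BY mod_date, id ASC
```

If that is roughly the shape of the generated query, I would also like to confirm that adding TOP to my query text would not interfere with this appended criteria or its ordering.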