I have a table of about 100 million rows, let's call it tbl_large. I am exposing this data as a Kafka stream using a Confluent JDBC source connector that polls the table every second or so. For (no) good reason, I need to do a clean load of this data each weekend in our lower environments (dev and QA) from data in our production environment. I accomplish this by 'cleaning' the topics and wiping the offsets, which works great for the most part.
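For reference, the weekend reset looks roughly like this (a sketch from memory; the broker address, topic name, and connector name are placeholders, and the offsets-topic key format is my recollection of how the JDBC source connector stores its source partition in query mode):

    # delete the topic so the connector republishes everything from scratch
    kafka-topics --bootstrap-server broker:9092 --delete --topic tbl_large_topic

    # wipe the connector's stored offset by producing a tombstone (null value)
    # keyed by ["<connector-name>", <source-partition>] to the Connect offsets topic;
    # kcat's -Z flag turns the empty value after the '|' separator into a NULL
    echo '["tbl-large-source",{"query":"query"}]|' | \
      kcat -P -b broker:9092 -t connect-offsets -K '|' -Z

    # caveat: the tombstone has to land on the same partition of connect-offsets
    # that Connect originally wrote the key to, and the connector needs a restart
    # afterwards to pick up the cleared offset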
My question is one of optimization. I am using a query to select about 70 million rows from tbl_large. In the connector properties for this stream I have the following:

    query=select * from tbl_large where asset_type='eqty'
    batch.max.rows=100000
    mode=timestamp+incrementing
    timestamp.column.name=mod_date
    incrementing.column.name=id

My understanding of the batch.max.rows setting is that, although the query will return 70 million rows from the database table, after producing records for 100,000 of those rows the connector will exit the poll loop and then poll the database again, appending a WHERE clause built from the current time, the last processed incrementing id, and mod_date (my mental model of that generated query is sketched at the end of this post). Is that correct?

If that is the case, would it be more optimal to limit my query using the TOP keyword, with a value equal to batch.max.rows?

    query=select TOP 100000 * from tbl_large where asset_type='eqty'

This would reduce the amount of data being sent over the network, as well as possibly the memory required on the server my connector runs on (depending on the JDBC driver). Is that correct? Thanks for any clarification/education you can provide.
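Here is the sketch mentioned above: my mental model of what each subsequent poll effectively executes in timestamp+incrementing mode (pseudo-parameters prefixed with ':', not the literal SQL the connector generates, and the exact quoting/wrapping may differ by connector version):

    select * from tbl_large
    where asset_type = 'eqty'
      and mod_date < :now                                     -- upper bound for this poll
      and (mod_date > :last_mod_date                          -- strictly newer rows, or...
           or (mod_date = :last_mod_date and id > :last_id))  -- same timestamp, higher id
    order by mod_date, id asc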