RE: [EXT] Re: Architecting the MS SQL CDC Processor

Peter Wicks (pwicks) Thu, 26 Oct 2017 23:17:36 -0700

I've submitted the first pass at this processor: 
https://github.com/apache/nifi/pull/2230

I did use RecordSetWriter, and ResultSetRecordSet for reading, which has been 
working well.
I found I didn't need to worry about the rate that state gets updated. In the 
MySQL case it checks to see if it should update state after every single row, 
in mine it updates after all changes for the table since the last run have been 
processed, so more like a QueryDatabaseTable. This is possible because I'm not 
reading changes in anything remotely resembling the way MySQL works :)

I created unit tests for this processor, the unit tests run on Apache DB tables 
that match in schema, but not necessarily in type, to those in MS SQL.

The only quirky thing was in order to get my generated SQL to work for both 
Apache DB and MS SQL, I had to use quotes a lot more than usual in my SQL 
statements. So please no comments along the lines of, "Why are there so many 
quoted identifiers in your SQL statements" :)

Thanks,
  Peter

-----Original Message-----
From: Matt Burgess [mailto:mattyb...@apache.org] 
Sent: Tuesday, October 17, 2017 10:59 AM
To: dev@nifi.apache.org
Subject: [EXT] Re: Architecting the MS SQL CDC Processor

Peter,

This is great to hear, I'm sure the community is looking forward to such a 
solution!  I worked on the first offering of the CaptureChangeMySQL processor, 
so here are some notes, comments, and
(hopefully!) answers to your questions:

* If you support a RecordSetWriter controller service as your output, then you 
won't need JdbcCommon per se; instead you would create Records and pass those 
to the user-selected RecordSetWriter. In that sense you can support Avro, CSV, 
JSON, or anything else for which there is a RecordSetWriter implementation.

* Depending on how often you'll be updating state, you may want to implement 
something similar to the State Update Interval property in CaptureChangeMySQL, 
which came about due to similar concerns about the overhead of state updates vs 
the amount of processing beforehand.
This allows to user to tune the tradeoff, based on their own requirements and 
performance and such.

* I have no concerns with having a different output format from 
CaptureChangeMySQL; in fact the only reason it doesn't have a RecordSetWriter 
output interface is that those capabilities were being developed in parallel, 
so rather than have to wait for the record-aware API stuff, I chose to output 
JSON. I have written
NIFI-4491 to improve/augment CDC processor(s) with RecordSetWriter support. 
This would be very helpful by supporting various output formats as well as 
generating the accompanying schema. If your processor were the first to support 
this, it could be the exemplar for past and future CDC processors :)

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-4491

On Mon, Oct 16, 2017 at 10:36 PM, Peter Wicks (pwicks) <pwi...@micron.com> 
wrote:
> I've been working on a new processor that does Change Data Capture with 
> Microsoft SQL Server. I followed Microsoft's documentation on how CDC works, 
> and I've got some code that gets me the changes and is testing well. Right 
> now, I don't actually have a processor, but a number of scripts that generate 
> SQL and I put it into ExecuteSQL and QueryDatabaseTable processors; with QDB 
> using my as-yet incomplete 
> NIFI-1706<https://github.com/apache/nifi/pull/2162>.
>
> One of the reasons I don't have a processor yet is because I don't want to 
> use the same output format as the MySQL CDC Processor, but I didn't want to 
> put in the time if it was not going to get merged. The MySQL CDC processor 
> uses JSON messages as the output format, but in MS SQL the CDC messages are 
> rows in a table; and it's much more convenient to output them as records. 
> Currently, I'm using Avro.
>
> Questions:
>
>   *   My output format doesn't have to be Avro, but given the source is rows 
> in a table being returned by a ResultSet, using the JdbcCommon class makes a 
> lot of sense to me. Can I move JdbcCommon to somewhere useful like 
> nifi-avro-record-utils?
>   *   I'll be looping through a list of tables and plan on committing the 
> files immediately to the success relationship as that table's CDC records are 
> pulled. I want to make sure that the max value tracking gets updated 
> immediately too. Does calling setState on the State Manager cause an 
> immediate state save? Is this safe to call repeatedly, assuming single 
> threaded, during the execution of the processor?
>   *   Concerns with using a different output format than the MySQL CDC 
> Processor?
>
> Thanks,
>   Peter

RE: [EXT] Re: Architecting the MS SQL CDC Processor

Reply via email to