Alberto,

You probably just need to try out the options and see what works best (Avro or 
ORC, etc…).

With the Avro option, you wouldn't need to change the type of your main Hive 
table; keep that as ORC. Only the staging table would use Avro. Then call 
HiveQL to merge the data from your staging table into your main table, and let 
your cluster's CPU power crunch through the data to do the merge.
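
Roughly, that merge could be a single statement along these lines. This is 
just a sketch, not a drop-in: it assumes your main table is a transactional 
ORC table on a Hive version with MERGE support (2.2+), that its columns are 
just the Column_A/Column_B/Column_C from your example below with Column_A as 
the key, that a NULL in the staging data means "unchanged", and that 5 is a 
made-up delete code. Swap in your real table names and _Operation codes:

    MERGE INTO main_table AS t
    USING staging_table AS s
      ON t.Column_A = s.Column_A
    WHEN MATCHED AND s._Operation = 5 THEN DELETE
    WHEN MATCHED THEN UPDATE SET
      Column_B = COALESCE(s.Column_B, t.Column_B),
      Column_C = COALESCE(s.Column_C, t.Column_C)
    WHEN NOT MATCHED THEN INSERT VALUES (s.Column_A, s.Column_B, s.Column_C);

You'd also want to dedupe the staging data down to one row per key first, 
since MERGE will complain if multiple source rows match the same target row.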

If you split the data into individual rows using SplitRecord, then you could 
probably route on the transaction type. But working with individual rows in 
NiFi adds a lot of overhead, and just imagine executing 10k HiveQL statements 
instead of 1 big one… If you have ACID enabled I guess it would all get 
recombined, but the overhead of calling that many statements would be really 
high.
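
To put that in concrete terms, per-row routing means NiFi generating thousands 
of tiny statements like

    UPDATE main_table SET Column_B = 1234 WHERE Column_A = 7501;

one per changelog row, each compiled and executed as its own Hive query, 
versus the single MERGE above chewing through the whole batch in one pass 
(main_table and the columns are the same placeholders as above).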

--Peter

From: Alberto Bengoa [mailto:albe...@propus.com.br]
Sent: Thursday, December 07, 2017 02:27
To: users@nifi.apache.org
Subject: Re: [EXT] CDC like updates on Nifi

On Tue, Dec 5, 2017 at 11:55 PM, Peter Wicks (pwicks) 
<pwi...@micron.com> wrote:

Alberto,

Hello Peter,

Thanks for your answer.



Since it sounds like you have control over the structure of the tables, this 
should be doable.



If you have a changelog table for each table this will probably be easier, and 
in your changelog table you'll need to make sure you have a good transaction 
timestamp column and a change type column (I/U/D). Then use QueryDatabaseTable 
to tail your changelog tables, one copy of QueryDatabaseTable per changelog 
table.

Yes, this is the way I'm trying to do it. I have the TimeStamp and Operation 
type columns as "metadata columns", plus all the other "data columns" of each 
table.



Now your changes are in easy-to-ingest Avro files. For Hive I'd probably use an 
external table with the Avro schema; this makes it easy to use PutHDFS to load 
the file and make it accessible from Hive. I haven't used Phoenix, sorry.
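
For example, the external staging table definition can be as simple as this 
(the table name, HDFS location, and schema file below are placeholders; 
PutHDFS just drops the Avro files into that LOCATION directory):

    CREATE EXTERNAL TABLE staging_table
    STORED AS AVRO
    LOCATION '/staging/my_table'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/my_table.avsc');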

Hmm. Sounds interesting.

I was planning to use ORC because it allows transactions (to make updates / 
deletes). Avro does not allow transactions, but changing data through HDFS 
instead of HiveQL could be an option.

Would it be possible to update fields of specific records using PutHDFS?

In my changelog table I do not have the entire row's data when the trigger 
fires for an update. I just have the values of the changed fields (unchanged 
fields have <null> values in the changelog tables).

_TimeStamp                        _Operation  Column_A  Column_B  Column_C
2017-12-01 14:35:56:204 - 02:00   3           7501      <null>    <null>
2017-12-01 14:35:56:211 - 02:00   4           7501      1234      <null>
2017-12-01 15:25:35:945 - 02:00   3           7503      <null>    <null>
2017-12-01 15:25:35:945 - 02:00   4           7503      5678      <null>

In the example above, we had two update operations (_Operation = 4). Column_B 
was changed, Column_C was not; Column_C could be holding any prior value, which 
is not reflected in the changelog.


If you have a single change table for all tables, then you can still use the 
above pattern, but you'll need a middle step where you extract and rebuild the 
changes. Maybe if you store the changes in JSON you could extract them using 
one of the Record parsers and then rebuild the data row. Much harder though.

I have one changelog table for each table.

Considering that I would use HiveQL to update tables in the Data Lake, could I 
use a RouteOnContent processor to create SQL queries according to the 
_Operation type?
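
I mean generating one statement per changelog row according to the operation 
code, something like this (assuming _Operation = 4 is an update, Column_A is 
the key, and skipping the columns that are <null> in the changelog; main_table 
and the insert/delete codes are only placeholders):

    -- _Operation = 4 (update)
    UPDATE main_table SET Column_B = 1234 WHERE Column_A = 7501;
    -- whatever code marks an insert
    INSERT INTO main_table VALUES (7501, 1234, 5678);
    -- whatever code marks a delete
    DELETE FROM main_table WHERE Column_A = 7501;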






Thanks,

  Peter



Thank you!

Alberto


From: Alberto Bengoa [mailto:albe...@propus.com.br]
Sent: Wednesday, December 06, 2017 06:24
To: users@nifi.apache.org
Subject: [EXT] CDC like updates on Nifi



Hey folks,



I read about the NiFi CDC processor for MySQL and other CDC "solutions" with 
NiFi that I found on Google, like these:



https://community.hortonworks.com/idea/53420/apache-nifi-processor-to-address-cdc-use-cases-for.html

https://community.hortonworks.com/questions/88686/change-data-capture-using-nifi-1.html

https://community.hortonworks.com/articles/113941/change-data-capture-cdc-with-apache-nifi-version-1-1.html



I'm trying a different approach to acquire fresh information from the tables, 
using triggers on the source database's tables to write changes to a 
"changelog table".



This is done, but my questions are:



Would NiFi be capable of reading these tables and transforming the data to 
generate equivalent SQL queries (insert/update/delete) to send to Hive and/or 
Phoenix with the currently available processors?



What would be the best / suggested flow?



The objective is to keep tables on the Data Lake as up-to-date as possible for 
real-time analyses.



Cheers,

Alberto
