RE: ELT on Nifi

Carlos Manuel Fernandes (DSI) Fri, 07 Oct 2016 10:31:22 -0700

Andy,

Good suggestion, I will do that. I have created several ExecuteScript
processors (in Groovy) before.


Thanks

Carlos





From: Andy LoPresto [mailto:alopre...@apache.org]
Sent: Friday, 7 October 2016 18:21
To: users@nifi.apache.org
Subject: Re: ELT on Nifi

Carlos,

If you are comfortable with Groovy I would suggest you look at using 
ExecuteScript [1] processor to prototype what you want the processor to do. 
That processor will take an (inline or read from file) Groovy script and 
execute it within the processor lifecycle. Matt Burgess has written some 
excellent blog posts on getting started with it [2][3].

Once you have that behaving the way you like (and feel free to continue to ask 
questions here), another developer would probably be able to help you convert 
it to a "real" custom processor.

[1] 
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.ExecuteScript/index.html
[2] 
https://funnifi.blogspot.com/2016/02/executescript-processor-hello-world.html
[3] 
https://funnifi.blogspot.com/2016/02/writing-reusable-scripted-processors-in.html


Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

On Oct 7, 2016, at 7:20 AM, João Henrique Freitas <joa...@gmail.com> wrote:

Hi.
Maybe a linkedin/databus client processor could be created to handle ETL.

On 06/10/2016 10:39, "Carlos Manuel Fernandes (DSI)"
<carlos.antonio.fernan...@cgd.pt> wrote:
Hi Uwe,

I see you have developed an approach similar to mine. Joe Witt launched a
challenge to build a processor based on the JSON structure I proposed.

I think we can use the code of the ConvertJSONToSQL processor as a template for
this new processor. The new processor would belong to the JSON-to-SQL category
(ConvertJSONToSQL being the first one).

We can work together to reach this goal, but first we must agree on the JSON
structure for the input.

What do you think? You can contact me directly.

Thanks

Carlos

From: Uwe Geercken [mailto:uwe.geerc...@web.de]
Sent: Tuesday, 4 October 2016 14:42
To: users@nifi.apache.org
Subject: Aw: Re: ELT on Nifi

Carlos,

I think that is a good point.

But I would like to offer a slightly different view:

I have developed an open-source business rule engine written in Java, and it is
now in production at at least two larger companies; both use the Pentaho ETL
tool together with the rule engine. You can use the rules to filter/evaluate
conditions, and there are also actions which execute or transform data. The
advantage is that within Pentaho it is just a plugin, and the business logic
(or, if you will, also IT logic) is managed externally (through a web
interface, possibly by users or superusers themselves rather than by IT). This
keeps a proper separation of responsibilities between business logic and IT
logic, and the ETL process itself is much, much cleaner.

Likewise, one could think of creating a plugin for NiFi which takes a similar
approach: you have a processor that calls the rule engine in the background. It
runs and delivers the results back to the process. Instead of having complex
connections between transformation processors, which clutter the NiFi desktop,
there would be one processor for the rule engine (or, of course, several).
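
To make the idea concrete, here is a minimal sketch in Python of rules that live outside the flow and are applied by a single generic step. It is not the API of the rule engine mentioned above; the rule format, the field names and the `apply_rules` function are all assumptions made up for illustration.

```python
import json
import operator

# Hypothetical rule definitions; in the approach described above they would be
# maintained outside the flow (for example by business users) and loaded at run time.
RULES_JSON = """
[
  {"field": "amount", "op": "gt", "value": 1000, "action": {"set": {"review": true}}},
  {"field": "country", "op": "eq", "value": "PT", "action": {"set": {"vat_rate": 0.23}}}
]
"""

# Comparison operators a rule may name (an assumed, deliberately tiny vocabulary).
OPERATORS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}


def apply_rules(record, rules):
    """Return a copy of `record` with the actions of all matching rules applied."""
    result = dict(record)
    for rule in rules:
        check = OPERATORS[rule["op"]]
        # Conditions are evaluated against the original record, not the partial result.
        if rule["field"] in record and check(record[rule["field"]], rule["value"]):
            result.update(rule["action"]["set"])
    return result


rules = json.loads(RULES_JSON)
invoice = {"id": 42, "country": "PT", "amount": 1500}
print(apply_rules(invoice, rules))
```

The point of the sketch is only the division of labour: the flow step stays the same while the business logic changes by editing the rule data.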

In one of my more recent projects I implemented the complete invoicing process
for the company I work for using the rule engine. The ETL is very clean and
contains only IT logic (formatting of fields, splitting of fields, renaming,
etc.), and the rest is in external rule projects which contain the business
logic.

My thinking is that the division of responsibilities for the logic, together
with a clean ETL (or, in the NiFi case, a clean flow diagram), is a very strong
argument for this approach.

Of course there is nothing to be said against a mixed approach (custom
processors and rule engine); I just wanted to explain my point a little.
Everything is available on github.com/uwegeercken.

I could write the NiFi code for the processor, I guess, but I will need some
help with testing, documentation and also packaging the NAR file (I am not used
to Maven and have struggled in the past to create a proper NAR archive).

Greetings,

Uwe

Sent: Tuesday, 4 October 2016 at 04:48
From: "Matt Burgess" <mattyb...@apache.org>
To: users@nifi.apache.org
Subject: Re: ELT on Nifi
Carlos,

The extensible nature of NiFi, whether the overall architecture was intended 
for ETL/ELT and/or RDBMS/DW concepts or not, means that many of these kinds of 
operations are welcome (but possibly not yet present) in NiFi. Some might 
warrant framework changes, but for a good portion, many RDBMS/DW processors are 
possible but just haven't been added/contributed yet. In my experience, ETL/ELT 
tools have focused mainly on this kind of "processor" and in contrast can't 
handle the level of throughput, data formats, provenance/lineage, security, 
and/or data integrity that NiFi can. In exchange, NiFi doesn't have as many of 
the RDBMS/DW-specific processors available at this time. I see a few categories 
(please feel free to add/change/delete/discuss), mostly having to do with 
tabular (row-oriented, character-delimited) data:

1) Row-level operations. This includes projections (select fields from row), 
alter fields (change timestamp of column 'last_updated', e.g.), add column(s), 
replace-with-lookup, etc.
2) Table-level operations. This includes joins, grouping/aggregates, 
transposition, etc.
3) Composition/Application of the other two. This includes normalization & 
denormalization (star/snowflake schemas, e.g.), dimension updates (Kimball's 
SCD Type 2, e.g.), etc.
4) Bulk Loading. These usually involve custom code (although in many cases for 
NiFi you can deploy a command-line tool for bulk loading to a DB and use 
ExecuteProcess or ExecuteStreamCommand to make it happen). These are usually 
native processes for getting lots of data into the DB using an end-run around 
their own interfaces, possibly bypassing mechanisms that NiFi embraces, such as 
provenance. But they are often faster than their SQL interface counterparts for 
large data ingest.
5) Transactions. This involves executing a number of SQL statements as an 
atomic group (i.e. BEGIN, a bunch of INSERTs, COMMIT). Not all DBs support this 
(and many have their own dialects for such things).
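
As a rough illustration of the difference between categories 1 and 2, here is a small Python sketch on in-memory rows. The sample data, the lookup table and the function names are all hypothetical; nothing here is NiFi API.

```python
from collections import defaultdict

# Hypothetical sample rows; in NiFi these would come from flow file content.
rows = [
    {"id": 1, "country_code": "PT", "amount": 10.0},
    {"id": 2, "country_code": "DE", "amount": 4.5},
    {"id": 3, "country_code": "PT", "amount": 2.5},
]
countries = {"PT": "Portugal", "DE": "Germany"}  # hypothetical lookup table


def project(row, fields):
    """Row-level: keep only the selected fields."""
    return {f: row[f] for f in fields}


def replace_with_lookup(row, field, lookup, new_field):
    """Row-level: add a column whose value is looked up from another column."""
    out = dict(row)
    out[new_field] = lookup.get(row[field])
    return out


def sum_by(rows, key, value):
    """Table-level: group rows by `key` and sum `value` per group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


enriched = [replace_with_lookup(r, "country_code", countries, "country") for r in rows]
projected = [project(r, ["id", "country", "amount"]) for r in enriched]
totals = sum_by(projected, "country", "amount")
print(totals)  # {'Portugal': 12.5, 'Germany': 4.5}
```

The row-level functions only ever need one row at a time, so they can run in a streaming fashion, whereas `sum_by` has to see every row before it can produce a result.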

That's a lot of feature surface to cover! Luckily we have an ever-growing 
community filled with folks representing a whole spectrum of experience and a 
shared passion for data :)  I am very interested in your thoughts on where NiFi
could improve on these (or other) fronts with respect to ETL/ELT; I think we
can get some good discussions (and code contributions!) going on this.
Alternatively, if you'd like to pursue a discussion on how to offload data 
transformations, I'm sure the community has thoughts on that as well.

Regards,
Matt

P.S. I didn't include push-down optimization on the list because of its
complexity; in NiFi terms it involves things like dynamic flow rewrites and
other magic that IMHO goes against the design principles of NiFi itself
(e.g. simplicity and accountability).

On Mon, Oct 3, 2016 at 2:25 PM, Carlos Manuel Fernandes (DSI)
<carlos.antonio.fernan...@cgd.pt> wrote:
Hi all,

When I saw NiFi for the first time, I tried to build a classical ETL/ELT flow,
and this question comes up repeatedly for new users.

NiFi has very good processors for Extract and Load; the problem arises with
Transform, because ETL/ELT tools have specific "processors" (e.g. map, SCD)
bound to DW concepts and sometimes bound to a specific database
(e.g. SCDNetezza). The Transform processors in NiFi are general purpose and
not tied to these concepts. The immediate solution is to create a lot of
custom script processors, but then the ELT metadata (SQL) ends up as attributes
or code of processors, which is not an ideal solution.

But if we put the Transform logic outside of NiFi, for example in some JSON
structure, then it is relatively easy to construct an ELT NiFi template capable
of running generic ELT flows.

Example of an ELT JSON structure (the "steps" inside the "flow" are to be
executed by PutSQL in the same transaction):
{
       "Transformer": [{
             "name": "foo1",
             "type": "Map",
             "description": "Summarize the table foo from table bar",
             "flow": [{
                    "step": 1,
                    "description": "delete all data",
                    "stmt": "delete from  foo"
             }, {
                    "step": 2,
                    "description": "Sum c2 by c1",
                    "stmt": "insert into foo(c1, c2) select c1, sum(c2) from bar group by c1"
             }]
       }, {
             "name": "foo2",
             "type": "SCD - Slowly Changing Dimensions type 1",
             "description": "Update a prod table based on stage table",
             "flow": [{
                    "step": 1,
                    "description": "Process type 1",
                    "stmt": "Update Prod Set Prod.columns = Stage.Columns From Stage Inner Join Prod on Stage.key = Prod.key Where Stage.IsType1 = 1"
             }]
       }]
}
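
For anyone who wants to try the idea outside NiFi first, here is a minimal Python sketch that reads a structure of this shape and runs the steps of each transformer, in step order, inside a single transaction. It uses an in-memory SQLite database purely as a stand-in for the target database; the tables, sample data and function names are assumptions, and only the "foo1" transformer is included because its SQL runs unchanged on SQLite.

```python
import json
import sqlite3

# Hypothetical flow definition in the shape proposed above (only "foo1").
FLOW_JSON = """
{
  "Transformer": [{
    "name": "foo1",
    "type": "Map",
    "description": "Summarize the table foo from table bar",
    "flow": [
      {"step": 1, "description": "delete all data", "stmt": "delete from foo"},
      {"step": 2, "description": "Sum c2 by c1",
       "stmt": "insert into foo(c1, c2) select c1, sum(c2) from bar group by c1"}
    ]
  }]
}
"""


def run_transformer(conn, transformer):
    """Run every step of one transformer, in step order, as one transaction."""
    steps = sorted(transformer["flow"], key=lambda s: s["step"])
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        for step in steps:
            conn.execute(step["stmt"])


def run_all(conn, flow_json):
    """Run each transformer found in the JSON document."""
    for transformer in json.loads(flow_json)["Transformer"]:
        run_transformer(conn, transformer)


# Stand-in database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("create table bar (c1 text, c2 integer)")
conn.execute("create table foo (c1 text, c2 integer)")
conn.executemany("insert into bar values (?, ?)", [("a", 1), ("a", 2), ("b", 5)])
conn.execute("insert into foo values ('stale', 0)")
conn.commit()

run_all(conn, FLOW_JSON)
result = conn.execute("select c1, c2 from foo order by c1").fetchall()
print(result)  # [('a', 3), ('b', 5)]
```

If any step fails, the earlier steps of the same transformer are rolled back, which is the behaviour the "same transaction" requirement asks for.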

Example of a NiFi template that executes that JSON structure:

<image001.png>


Does this make sense? Please give me feedback.

Carlos
