Many thanks. I'm now more confident NiFi could be a good fit for us.
On Wednesday, October 12, 2016 9:06 PM, Jeff <jtsw...@gmail.com> wrote:
You're asking on the right list!
Based on the scenario you described, I think NiFi would suit your needs. To
address the 3 major steps of your workflow:
1) Processors can run on a timer-based or cron-based schedule.
GenerateTableFetch is a processor that can be used to create SQL SELECT
statements from a table based on increasing values in one or more columns, and
can be partitioned depending on your batching needs. These SQL SELECT
statements can then be executed against the source database by use of the
ExecuteSQL processor.
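As a rough illustration of the incremental-fetch pattern GenerateTableFetch automates (the table, column names, and the in-memory "state" variable below are hypothetical; NiFi persists this maximum-value state for you between scheduled runs):

```python
import sqlite3

# Hypothetical source table; GenerateTableFetch tracks the maximum value
# seen in a column so that only new rows are fetched on each run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)",
                 [(1, 9.5), (2, 3.0), (3, 7.25)])

last_seen_id = 0  # state kept between scheduled runs

def fetch_new_rows(batch_size=2):
    """Emit a SELECT for rows newer than the stored state, in batches."""
    global last_seen_id
    sql = ("SELECT id, amount FROM orders "
           "WHERE id > ? ORDER BY id LIMIT ?")
    rows = conn.execute(sql, (last_seen_id, batch_size)).fetchall()
    if rows:
        last_seen_id = rows[-1][0]  # advance the high-water mark
    return rows

print(fetch_new_rows())  # first batch: rows 1 and 2
print(fetch_new_rows())  # next batch: row 3
```

Each scheduled run picks up where the previous one left off, which is essentially what the generated SELECT statements do for you.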
2) With the more recent data, which I'm assuming is queried from the
destination database, you can use QueryDatabaseTable to retrieve the new rows
in Avro format and then transform them as needed, which may include processors
that encapsulate any custom logic you might have written for your homemade ETL
solution.
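For a feel of the kind of per-record custom logic you might port over, here is a minimal sketch (field names are invented; shown on plain dicts rather than Avro records to stay dependency-free, but the shape is the same once the Avro is deserialized):

```python
def transform(record):
    """Hypothetical custom ETL logic: normalize one field, derive another."""
    out = dict(record)
    # Normalize the email address.
    out["email"] = out["email"].strip().lower()
    # Derive an integer cents total from two decimal fields.
    out["total_cents"] = round(out["net"] * 100) + round(out["tax"] * 100)
    return out

rows = [{"email": "  User@Example.COM ", "net": 10.0, "tax": 0.8}]
print([transform(r) for r in rows])
```

In NiFi this sort of logic typically lives in a scripted processor or a small custom processor applied to each record.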
3) The PostHTTP processor can be used to send files over HTTPS to the external
server.
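Under the hood this is an ordinary HTTPS POST of the flow file's bytes. A stdlib sketch (the URL is a placeholder; the request is only built here, not actually sent):

```python
import urllib.request

# Placeholder endpoint; PostHTTP would be configured with the real URL.
url = "https://example.com/upload"
payload = b"col1,col2\n1,2\n"

req = urllib.request.Request(
    url,
    data=payload,
    method="POST",
    headers={"Content-Type": "text/csv"},
)
# urllib.request.urlopen(req) would perform the actual HTTPS POST.
print(req.method, req.full_url, len(req.data))
```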
Processors have failure relationships for when processing of a flow file
fails, and these can be routed as appropriate, such as retrying failed flow
files. For errors that require human intervention, there are a number of
options. Most likely, the way your homemade solution currently handles errors
that require human intervention can be done by NiFi as well.
Personally, I have used NiFi in ways similar to what you have described. There
are some examples on the Apache NiFi site that you can check out. The stopping
and restarting of processing when errors occur is possible, though much of
that depends on how you design your flow.
Feel free to ask any questions! Much of the information above is fairly
high-level, and NiFi offers a lot of processors to meet your data flow needs.
On Tue, Oct 11, 2016 at 5:18 PM Márcio Faria <faria.mar...@ymail.com> wrote:
Potential NiFi user here.
I'm trying to figure out if NiFi could be a good choice to replace our
existing homemade ETL system, which roughly works like this:
1) Either on demand or at periodic instants, fetch fresh rows from one or more
tables in the source database and insert or update them into the destination
database.
2) Run the jobs which depend on the more recent data, and generate files based
on the results.
3) Upload the generated files to an external server using HTTPS.
Since our use cases are more of a "pull" style (Ex: It's time to run the report
-> get the required data updated -> run the processing job and submit the
results) than a "push" style (Ex: Get the latest data available -> when some
condition is met, run the processing job and submit the results), I'm wondering
whether NiFi, or any other flow-based toolset for that matter, would be a good
option for us to try. Your opinion? Suggestions?
Besides, what is the recommended way to handle errors in an ETL scenario like
that? For example, we submit a "page" of rows to a remote server and its
response tells us which of those rows were accepted and which ones had a
validation error. What would be the recommended approach to handle such errors
if the fix requires some human intervention? Is there a way of stopping the
whole flow until the correction is done? How to restart it when part of the
data were already processed by some of the processors? The server won't accept
a transaction B if it depends on a transaction A that wasn't successfully
processed.
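One common way to handle that partial-acceptance response is to split the submitted page by the server's verdict, pass the accepted rows on, and hold the rejected ones (and anything depending on them) until a person fixes and resubmits them. A sketch, with an entirely hypothetical response shape:

```python
# Hypothetical server response for a submitted "page" of rows.
response = {
    "accepted": [101, 103],
    "rejected": [{"id": 102, "error": "missing field"}],
}

page = [
    {"id": 101, "value": "a"},
    {"id": 102, "value": None},
    {"id": 103, "value": "c"},
]

rejected_ids = {r["id"] for r in response["rejected"]}
held_for_review = [row for row in page if row["id"] in rejected_ids]
done = [row for row in page if row["id"] not in rejected_ids]

# Transactions that depend on a held row would stay queued until the
# held row is fixed and resubmitted; here we just show the split.
print(len(done), len(held_for_review))
```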
As you see, our processing is very batch-oriented. I know NiFi can fetch data
in chunks from a relational database, but I'm not sure how to approach the
conversion from our current style to a more "stream"-oriented one. I'm afraid
I could be trying to use the "right tool for the wrong problem", if you know
what I mean.
Apologies if this is not the proper venue to ask. I checked all the posts in
this mailing list and also tried to search for information elsewhere, but I
wasn't able to find the answers myself.
Any guidance, like examples or links to further reading, would be very much
appreciated. I'm just starting to learn the ropes.