Boris,

Thank you for your feedback.
I think I can now answer my own question after digging a little deeper into 
Kylo’s documentation. While you probably don’t have to deploy your 
infrastructure this way, the Kylo documentation implies that NiFi is deployed 
on the edge node. Given the recommended reliance on Kylo’s own GetTableData 
custom processor for data ingestion, instead of a combination of 
GenerateTableFetch and QueryDatabaseTable processors deployed on a stand-alone 
NiFi node and a NiFi cluster, this guarantees inferior performance relative to 
Spark-driven, Sqoop-based ingestion. Why the documentation doesn’t discuss the 
relative advantages of the different possible topologies is another matter.

From: Boris Tyukin <[email protected]>
Sent: Saturday, August 04, 2018 5:06 PM
To: [email protected]
Subject: Re: Ingestion from databases: pure NiFi vs Kylo with Sqoop

Vitaly,

The best way is to try yourself and build a simple process to prove your case.

I got excited about Kylo at first, but quickly realized I could do everything I 
needed with NiFi. I did not really care about Kylo's fancy UI, but I did love a 
lot of things - integration with Spark and Sqoop, templates for pipelines, 
centralized monitoring, etc. But at the same time, it is someone else's 
product, lagging behind NiFi, with tons of other dependencies and packages 
built by that company.

I do believe you don't have to use Sqoop if you don't want it - you can build 
your own templates in Kylo, which would be just a NiFi flow with parameters, 
and use JDBC SQL processors instead.

Now, you will be missing a lot of cool Sqoop features. One example is direct 
database connectors (for Oracle, for example), which give much better 
performance. Handling of time zone changes, etc.

Until recently, NiFi could not ingest a table concurrently - with Sqoop I can 
run 32 mappers, and it will break a table into 32 pieces and ingest them into 
HDFS.
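For reference, a parallel Sqoop import along these lines might look like the following sketch; the JDBC connection string, username, table, split column, and target directory are placeholders, not details from this thread:

```shell
# Hypothetical example: import one table with 32 parallel mappers.
# --split-by names the column Sqoop uses to break the table into
# 32 ranges (one per mapper); each mapper writes its slice of the
# table into the HDFS target directory.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username etl_user -P \
  --table SALES.ORDERS \
  --split-by ORDER_ID \
  --num-mappers 32 \
  --target-dir /data/raw/orders
```

The split column should be indexed and reasonably evenly distributed, otherwise the 32 ranges (and mappers) end up unbalanced.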

NiFi has a similar ability now, but I think until NiFi 1.6 you had to use 
primary keys or something like that. I think this has been improved recently, 
and the GenerateTableFetch processor can do a lot, like breaking a table into 
pieces, and it also supports incremental loads.
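As a sketch of the NiFi-side equivalent (assuming a recent NiFi; the table and column names are made up for illustration), GenerateTableFetch can be configured to emit partitioned SQL, with Maximum-value Columns driving the incremental part:

```
GenerateTableFetch
  Database Connection Pooling Service : <DBCPConnectionPool>
  Table Name                          : orders
  Columns to Return                   : *
  Maximum-value Columns               : order_id   <- incremental high-water mark
  Partition Size                      : 10000      <- rows per generated query
```

Each FlowFile it emits carries one paged SELECT; routing those to ExecuteSQL on a NiFi cluster (with load-balanced connections) runs the pages in parallel, which is the rough analogue of Sqoop's mappers.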

Speaking of incremental loads, I also wanted to build my own framework around 
them, with my own control table, auditing, and logging. I did not use Sqoop's 
incremental load feature, but some devs love it.

So if you do not care about all the cool Sqoop features and its high 
performance, and just need to ingest data, you will be fine using NiFi 
processors.


Boris

On Fri, Aug 3, 2018, 15:28 Vitaly Krivoy 
<[email protected]<mailto:[email protected]>> wrote:
We are considering using Kylo on top of NiFi. It is my understanding that while 
Kylo manages both NiFi and Spark, its designers decided to utilize Sqoop from 
Spark in order to ingest data from relational databases. I am also aware that 
it is possible to drive Sqoop from NiFi using one of the processors which can 
run scripts. Why would Kylo's designers rely on Sqoop rather than on NiFi? It’s 
possible to set up a stand-alone NiFi instance and a NiFi cluster to do 
parallel database access. Sqoop achieves parallelization for extraction from 
databases by relying on the power of MapReduce. We are a Hortonworks-on-Azure 
shop, so we already have infrastructure for both approaches. Does anyone have 
feedback on why one approach would be preferable to the other?

STATEMENT OF CONFIDENTIALITY The information contained in this email message 
and any attachments may be confidential and legally privileged and is intended 
for the use of the addressee(s) only. If you are not an intended recipient, 
please: (1) notify me immediately by replying to this message; (2) do not use, 
disseminate, distribute or reproduce any part of the message or any attachment; 
and (3) destroy all copies of this message and any attachments.

