I still think Sqoop is the way to go for handling large volumes of data. I
wish NiFi had a handy Sqoop processor (like in Kylo), but it is easy to do
it with Groovy (I blogged about it).
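For reference, the idea from that blog post can be sketched in a few lines: assemble a `sqoop import` command and shell out to it. This is a hedged illustration (in Python rather than Groovy, and the JDBC URL, table, and target directory are placeholders, not values from the post):

```python
import subprocess

def build_sqoop_import(jdbc_url, table, target_dir, num_mappers=4):
    """Assemble a 'sqoop import' command line for one table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,        # e.g. jdbc:mysql://db-host/sales
        "--table", table,
        "--target-dir", target_dir,   # HDFS output directory
        "--num-mappers", str(num_mappers),
    ]

cmd = build_sqoop_import("jdbc:mysql://db-host/sales", "orders", "/data/raw/orders")
print(" ".join(cmd))
# On a node with Sqoop installed you would launch it with:
# subprocess.run(cmd, check=True)
```

The heavy lifting then happens in the Sqoop mappers on the Hadoop cluster, not inside NiFi's JVM.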

We have had issues with NiFi and very large flow files. Flows that use
Sqoop are not limited by NiFi's JVM heap, flow file sizes, or
backpressure.

You put less stress on NiFi and let your Hadoop cluster do the heavy
lifting, using all of its power.

We built incremental processes as well, using a small control table on the
side, and let NiFi coordinate all of that.
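A rough sketch of that incremental pattern (the table and column names are made up, and an in-memory SQLite database stands in for the real one; in practice the control table lives in your source database and NiFi processors perform the reads and updates):

```python
import sqlite3

# In-memory stand-in for the real control table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE load_control (table_name TEXT PRIMARY KEY, last_value INTEGER)")
conn.execute("INSERT INTO load_control VALUES ('orders', 1000)")

def incremental_query(conn, table, check_column="id"):
    """Build the next incremental extract from the stored high-watermark."""
    (last,) = conn.execute(
        "SELECT last_value FROM load_control WHERE table_name = ?", (table,)
    ).fetchone()
    return f"SELECT * FROM {table} WHERE {check_column} > {last}"

def update_watermark(conn, table, new_last):
    """Advance the watermark after a successful load."""
    conn.execute(
        "UPDATE load_control SET last_value = ? WHERE table_name = ?", (new_last, table)
    )

print(incremental_query(conn, "orders"))  # SELECT * FROM orders WHERE id > 1000
update_watermark(conn, "orders", 2000)
print(incremental_query(conn, "orders"))  # SELECT * FROM orders WHERE id > 2000
```

Each run extracts only rows past the watermark, then bumps it, so reruns never re-pull old data.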

On Thu, Aug 16, 2018 at 9:55 AM Matt Burgess <[email protected]> wrote:

> Walter,
>
> If you're looking to distribute database fetching among a cluster,
> then GenerateTableFetch is the right choice (over QueryDatabaseTable).
> As of NiFi 1.2.0 (via NIFI-2881 [1]), GenerateTableFetch accepts
> incoming flow files; that capability was added in response to exactly
> the use case you outlined: distributed fetch of multiple tables via
> ListDatabaseTables. You still want your source processor to run on the
> Primary Node only, otherwise all nodes get the same source data and,
> as you said, you end up with duplicate data.
>
> QueryDatabaseTable does not accept incoming connections, but you can
> use ExecuteSQL to actually do the fetching. To distribute the fetching
> of tables among the cluster, I recommend the following flow:
>
> ListDatabaseTables (on Primary Node only) -> RPG -> Input Port ->
> GenerateTableFetch -> ExecuteSQL
>
> Each node on the cluster will get flow files for the various tables in
> the database, then GenerateTableFetch will generate the SQL to fetch
> "pages" based on the Partition Size property, then ExecuteSQL will
> execute the statements. You can use multiple concurrent tasks for
> ExecuteSQL to perform the fetching concurrently for the SQL
> statements, and the RPG->Input port part will let you do the tables in
> parallel.  If instead you want to fully distribute the SQL execution
> (fetching various pages from various tables), you could move the
> RPG->Input Port after GTF:
>
> ListDatabaseTables (on Primary Node only) -> GenerateTableFetch -> RPG
> -> Input Port -> ExecuteSQL
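To make the paging concrete: GenerateTableFetch splits a table into Partition Size-sized pages, one SQL statement each. The exact SQL depends on the configured database adapter; for a database that supports LIMIT/OFFSET, the generated statements look roughly like this (a sketch of the idea, not NiFi's actual code):

```python
import math

def page_statements(table, row_count, partition_size, order_col="id"):
    """Generate one paged SELECT per partition, LIMIT/OFFSET style."""
    pages = math.ceil(row_count / partition_size)
    return [
        f"SELECT * FROM {table} ORDER BY {order_col} "
        f"LIMIT {partition_size} OFFSET {i * partition_size}"
        for i in range(pages)
    ]

for stmt in page_statements("orders", row_count=250, partition_size=100):
    print(stmt)
# SELECT * FROM orders ORDER BY id LIMIT 100 OFFSET 0
# SELECT * FROM orders ORDER BY id LIMIT 100 OFFSET 100
# SELECT * FROM orders ORDER BY id LIMIT 100 OFFSET 200
```

Each statement becomes one flow file, which is why distributing them after GenerateTableFetch spreads pages of the same table across nodes.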
>
> In this flow, the Primary Node will do all the work of generating all
> the SQL for all the pages, then distribute the SQL among the
> cluster. So each node may be grabbing different pages from the same
> table, etc. Depending on how much work it takes to generate the SQL,
> this may not be as performant as the first flow. Alternatively, you
> can distribute the SQL generation and the execution:
>
> ListDatabaseTables (on Primary Node only) -> RPG -> Input Port ->
> GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL
>
> This might be overkill but does "fully" parallelize the work. In
> addition, as mentioned, you can set multiple concurrent tasks for
> ExecuteSQL (but not GenerateTableFetch) to achieve concurrency for
> fetching. One thing to watch out for in all cases is the Max
> Connections property for the DBCPConnectionPool. Each node will get
> its own pool, but depending on how much is going through GTF and
> ExecuteSQL, you may run out of connections (which will slow your
> throughput) or, if Max Connections is high, you may exhaust all
> connections on the server; just something to keep in mind when
> configuring the flow.
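A quick back-of-the-envelope check for that last point (the numbers are illustrative): since each node gets its own DBCPConnectionPool, the database server can see up to nodes × Max Connections from the flow, and that product has to fit under the server's connection limit:

```python
def total_db_connections(nodes, max_connections_per_pool):
    """Worst-case connections the database server sees from the whole cluster."""
    return nodes * max_connections_per_pool

# e.g. a 5-node cluster with Max Connections = 20 per node
print(total_db_connections(5, 20))  # 100 connections at the server, worst case
```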
>
> Regards,
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-2881
> On Thu, Aug 16, 2018 at 7:55 AM Vos, Walter <[email protected]> wrote:
> >
> > Hi,
> >
> >
> >
> > I’m trying to find a good strategy for distributing work among a cluster
> when we’re fetching data from a database. My developers are currently doing
> GenerateTableFetch and executing it only on the primary node because
> “otherwise we end up with duplicate data”. A little googling on my end and
> I found out about the List/Fetch pattern. All the examples are for SFTP
> though.
> >
> >
> >
> > I’m wondering what a good configuration might be if you’re looking to
> use this pattern for fetching from a database. I’ve found
> GenerateTableFetch, and I can certainly use this, but since we’re querying
> multiple tables (but not all tables in the DB!) I’m hoping to use something
> like ListDatabaseTables before that, so that GenerateTableFetch can be done
> on the whole cluster and then QueryDatabaseTable as well.
> >
> >
> >
> > So one option is Multiple GenerateTableFetch processors > Funnel > RPG
> // Input port > QueryDatabaseTable. I’m wondering if there’s also a good
> way to go this route: ListDatabaseTables > RPG // Input port >
> GenerateTableFetch > QueryDatabaseTable. I want to distribute as much work
> as possible within the cluster.
> >
> >
> >
> > Kind regards,
> >
> >
> >
> > Walter
> >
> >
>
