Walter,

If you're looking to distribute database fetching among a cluster,
then GenerateTableFetch is the right choice (over QueryDatabaseTable).
As of NiFi 1.2.0 (via NIFI-2881 [1]), GenerateTableFetch accepts
incoming flow files, the capability was added in response to exactly
the use case you outlined, distributed fetch of multiple tables via
ListDatabaseTables. You still want your source processor to run on the
Primary Node Only, otherwise all nodes get the same source data and as
you said, you end up with duplicate data.

QueryDatabaseTable does not accept incoming connections, but you can
use ExecuteSQL to actually do the fetching.  To distributed fetching
of tables among the cluster, I recommend the following flow:

ListDatabaseTables (on Primary Node only) -> RPG -> Input Port ->
GenerateTableFetch -> ExecuteSQL

Each node on the cluster will get flow files for the various tables in
the database, then GenerateTableFetch will generate the SQL to fetch
"pages" based on the Partition Size property, then ExecuteSQL will
execute the statements. You can use multiple concurrent tasks for
ExecuteSQL to perform the fetching concurrently for the SQL
statements, and the RPG->Input port part will let you do the tables in
parallel.  If instead you want to fully distribute the SQL execution
(fetching various pages from various tables), you could move the
RPG->Input Port after GTF:

ListDatabaseTables (on Primary Node only) -> GenerateTableFetch -> RPG
-> Input Port -> ExecuteSQL

In this flow, the Primary Node will do all the work of generating all
the SQL for all the pages, then will distributed the SQL among the
cluster. So each node may be grabbing different pages from the same
table, etc.  Depending on how much work it takes to generate the SQL,
this may not be as performant as the first flow.  Alternatively you
can distribute the SQL generation and the execution:

ListDatabaseTables (on Primary Node only) -> RPG -> Input Port ->
GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL

This might be overkill but does "fully" parallelize the work. In
addition, as mentioned, you can set multiple concurrent tasks for
ExecuteSQL (but not GenerateTableFetch) to achieve concurrency for
fetching. One thing to watch out for in all cases is the Max
Connections property for the DBCPConnectionPool. Each node will get
its own pool, but depending on how much is going through GTF and
ExecuteSQL, you may run out of connections (which will slow your
throughput) or if Max Connections is high, you may exhaust all
connections at the server, just something to keep in mind when
configuring the flow.

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-2881
On Thu, Aug 16, 2018 at 7:55 AM Vos, Walter <[email protected]> wrote:
>
> Hi,
>
>
>
> I’m trying to find a good strategy for distributing work among a cluster when 
> we’re fetching data from a database. My developers are currently doing 
> GenerateTableFetch and executing it only on the primary node because 
> “otherwise we end up with duplicate data”. A little googling on my end and I 
> found out about the List/Fetch pattern. All the examples are for SFTP though.
>
>
>
> I’m wondering what a good configuration might be if you’re looking to use 
> this pattern for fetching from a database. I’ve found GenerateTableFetch, and 
> I can certainly use this, but since we’re querying multiple tables (but not 
> all tables in the DB!) I’m hoping to use something like ListDatabaseTables 
> before that, so that GenerateTableFetch can be done on the whole cluster and 
> then QueryDatabaseTable as well.
>
>
>
> So one option is Multiple GenerateTableFetch processors > Funnel > RPG // 
> Input port > QueryDatabaseTable. I’m wondering if there’s also a good way to 
> go this route: ListDatabaseTables > RPG // Input port > GenerateTableFetch > 
> QueryDatabaseTable. I want to distribute as much work as possible within the 
> cluster.
>
>
>
> Kind regards,
>
>
>
> Walter
>
>
> ________________________________
>
> Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor 
> (gebruik door) de geadresseerde. De e-mail kan persoonlijke of vertrouwelijke 
> informatie bevatten. Openbaarmaking, vermenigvuldiging, verspreiding en/of 
> verstrekking van (de inhoud van) deze e-mail (en eventuele bijlagen) aan 
> derden is uitdrukkelijk niet toegestaan. Indien u niet de bedoelde 
> geadresseerde bent, wordt u vriendelijk verzocht degene die de e-mail verzond 
> hiervan direct op de hoogte te brengen en de e-mail (en eventuele bijlagen) 
> te vernietigen.
>
> Informatie vennootschap

Reply via email to