Merging Ed's suggestion with mine:
GenerateFlowFile -> SplitText -> RPG -> Input Port -> ExtractText ->
GTF -> ExecuteSQL
If you want to fully distribute the fetch, you could have another RPG
-> Input Port between GTF and ExecuteSQL. The ExtractText is there to
pull the line's content into a table.name attribute for use in GTF.
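
For the ExtractText step, a minimal sketch is a single dynamic property with
one capture group (the regex is an assumption, adjust it to your data):

  table.name  =>  (.+)

Since each split is a single line, its content lands in the table.name
attribute, and GTF's Table Name property can then be set to ${table.name}.
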
In the upcoming 1.8.0 release (via NIFI-5454 [1]), you'll be able to have
a single GenerateFlowFile with an attribute (perhaps "table.list")
containing a comma-delimited list of table names, then
DuplicateFlowFile (to make <num_tables - 1> copies), then
UpdateAttribute setting table.name to
${table.list:getDelimitedField(${copy.index})}.
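
As a rough illustration (the table names are hypothetical), the config could
look like:

  GenerateFlowFile:  table.list = customers,orders,invoices
  DuplicateFlowFile: Number of Copies = 2
  UpdateAttribute:   table.name = ${table.list:getDelimitedField(${copy.index})}

Keep in mind getDelimitedField() is 1-based, so if copy.index ends up starting
at 0 you'd use ${copy.index:plus(1)} instead.
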
Regards,
Matt
[1] https://issues.apache.org/jira/browse/NIFI-5454
On Fri, Aug 17, 2018 at 7:38 AM Ed B <[email protected]> wrote:
>
> There is a good reason to use ListDatabaseTables in cases where the tables
> aren't known in advance, or can be added to the source system dynamically.
> If you already know the list of tables you will be pulling data from, you can
> use GenerateFlowFile. In GFF, in the Custom Text property, specify the list of
> tables (one table name per line), then use SplitText (with a Line Split Count
> of 1) to get a flow file for each table, then RPG and then GTF.
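>
> For example (hypothetical table names), GFF's Custom Text would just be:
>
>   customers
>   orders
>   invoices
>
> with SplitText's Line Split Count set to 1, giving one single-line flow file
> per table.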
> Regards,
> Ed.
>
> On Fri, Aug 17, 2018 at 4:21 AM Vos, Walter <[email protected]> wrote:
>>
>> Hi Matt,
>>
>> Thanks for that very thorough explanation. I'll be sure to present this to
>> my devs so we can agree on a way of working. When it comes to
>> ListDatabaseTables, how can I use this when I don't want to query all tables
>> from the database? Let's say there are some 30 tables in the database and I
>> only want to query 8 of them. I realize there's a parameter called 'Table
>> Name Pattern' in the processor, but how do I put together a pattern when the
>> tables I'm targeting have very different names? Also, it seems a bit silly
>> to open a database connection just to list the specific tables I'm looking
>> to query (which I already know the names of). Is there a processor I can use
>> where I can just fill out this list myself?
>>
>> Perhaps these are very basic questions, by the way. Everybody here is
>> learning how to work with NiFi and from an admin perspective I'm trying to
>> provide some guidance on best practices for common flows.
>>
>> Kind regards,
>>
>> Walter
>>
>> -----Original Message-----
>> From: Matt Burgess [mailto:[email protected]]
>> Sent: Thursday, August 16, 2018 15:55
>> To: [email protected]
>> Subject: Re: List/Fetch pattern for QueryDatabaseTable
>>
>> Walter,
>>
>> If you're looking to distribute database fetching among a cluster, then
>> GenerateTableFetch is the right choice (over QueryDatabaseTable).
>> As of NiFi 1.2.0 (via NIFI-2881 [1]), GenerateTableFetch accepts incoming
>> flow files; that capability was added in response to exactly the use case
>> you outlined: distributed fetch of multiple tables via ListDatabaseTables. You
>> still want your source processor to run on the Primary Node Only, otherwise
>> all nodes get the same source data and as you said, you end up with
>> duplicate data.
>>
>> QueryDatabaseTable does not accept incoming connections, but you can use
>> ExecuteSQL to actually do the fetching. To distribute fetching of tables
>> among the cluster, I recommend the following flow:
>>
>> ListDatabaseTables (on Primary Node only) -> RPG -> Input Port ->
>> GenerateTableFetch -> ExecuteSQL
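>>
>> As a minimal config sketch for this flow (the values below are assumptions
>> about your setup, not requirements), the key properties would be roughly:
>>
>>   ListDatabaseTables:  run on Primary Node only; Table Name Pattern = %
>>   GenerateTableFetch:  Table Name = ${db.table.name}; Partition Size = 10000
>>   ExecuteSQL:          leave "SQL select query" unset so it executes the
>>                        statement in each incoming flow file's content
>>
>> ListDatabaseTables puts the table name in the db.table.name attribute, and
>> the flow files GTF emits contain paged statements along the lines of
>> "SELECT * FROM mytable LIMIT 10000 OFFSET 0", "... OFFSET 10000", etc. (the
>> exact syntax depends on the Database Type/adapter you choose).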
>>
>> Each node on the cluster will get flow files for the various tables in the
>> database, then GenerateTableFetch will generate the SQL to fetch "pages"
>> based on the Partition Size property, then ExecuteSQL will execute the
>> statements. You can use multiple concurrent tasks for ExecuteSQL to perform
>> the fetching concurrently for the SQL statements, and the RPG->Input port
>> part will let you do the tables in parallel. If instead you want to fully
>> distribute the SQL execution (fetching various pages from various tables),
>> you could move the
>> RPG->Input Port after GTF:
>>
>> ListDatabaseTables (on Primary Node only) -> GenerateTableFetch -> RPG
>> -> Input Port -> ExecuteSQL
>>
>> In this flow, the Primary Node will do all the work of generating all the
>> SQL for all the pages, then will distribute the SQL among the cluster. So
>> each node may be grabbing different pages from the same table, etc.
>> Depending on how much work it takes to generate the SQL, this may not be as
>> performant as the first flow. Alternatively you can distribute the SQL
>> generation and the execution:
>>
>> ListDatabaseTables (on Primary Node only) -> RPG -> Input Port ->
>> GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL
>>
>> This might be overkill but does "fully" parallelize the work. In addition,
>> as mentioned, you can set multiple concurrent tasks for ExecuteSQL (but not
>> GenerateTableFetch) to achieve concurrency for fetching. One thing to watch
>> out for in all cases is the Max Connections property for the
>> DBCPConnectionPool. Each node will get its own pool, but depending on how
>> much is going through GTF and ExecuteSQL, you may run out of connections
>> (which will slow your throughput), or if Max Connections is high, you may
>> exhaust all connections at the server; just something to keep in mind when
>> configuring the flow.
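>>
>> For example (illustrative numbers only): on a 3-node cluster with Max
>> Connections set to 8 in each node's DBCPConnectionPool, the database could
>> see up to 24 NiFi connections at once; conversely, 4 concurrent ExecuteSQL
>> tasks on a node whose pool allows only 2 connections will spend time waiting
>> on the pool.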
>>
>> Regards,
>> Matt
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-2881
>> On Thu, Aug 16, 2018 at 7:55 AM Vos, Walter <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> >
>> >
>> > I’m trying to find a good strategy for distributing work among a cluster
>> > when we’re fetching data from a database. My developers are currently
>> > doing GenerateTableFetch and executing it only on the primary node because
>> > “otherwise we end up with duplicate data”. A little googling on my end and
>> > I found out about the List/Fetch pattern. All the examples are for SFTP
>> > though.
>> >
>> >
>> >
>> > I’m wondering what a good configuration might be if you’re looking to use
>> > this pattern for fetching from a database. I’ve found GenerateTableFetch,
>> > and I can certainly use this, but since we’re querying multiple tables
>> > (but not all tables in the DB!) I’m hoping to use something like
>> > ListDatabaseTables before that, so that GenerateTableFetch can be done on
>> > the whole cluster and then QueryDatabaseTable as well.
>> >
>> >
>> >
>> > So one option is multiple GenerateTableFetch processors > Funnel > RPG //
>> > Input port > QueryDatabaseTable. I’m wondering if there’s also a good way
>> > to go this route: ListDatabaseTables > RPG // Input port >
>> > GenerateTableFetch > QueryDatabaseTable. I want to distribute as much work
>> > as possible within the cluster.
>> >
>> >
>> >
>> > Kind regards,
>> >
>> >
>> >
>> > Walter
>> >
>> >