RE: List/Fetch pattern for QueryDatabaseTable

Vos, Walter Sun, 19 Aug 2018 23:54:22 -0700

Thanks Ed and Matt,

That'll certainly work, so I'll get the team here up to speed!


-Walter

-----Oorspronkelijk bericht-----
Van: Matt Burgess [mailto:[email protected]]
Verzonden: vrijdag 17 augustus 2018 15:21
Aan: [email protected]
Onderwerp: Re: List/Fetch pattern for QueryDatabaseTable

Merging Ed's suggestion with mine:

GenerateFlowFile -> SplitText -> RPG -> Input Port -> ExtractText -> GTF -> 
ExecuteSQL

If you want to fully distribute the fetch, you could have another RPG
-> Input Port between GTF and ExecuteQL. The ExtractText is there to
pull the line into an attribute table.name for use in GTF.

In the upcoming 1.8.0 release (via NIFI-5454), you'll be able to have a single 
GenerateFlowFile with an attribute (perhaps "table.list") containing a 
comma-delimited list of table names, then DuplicateFlowFile (to make 
<num_tables - 1> copies), then UpdateAttribute setting table.name to 
${table.list:getDelimitedField(${copy.index))}.

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-5454
On Fri, Aug 17, 2018 at 7:38 AM Ed B <[email protected]> wrote:
>
> There is a good reason to use ListDatabaseTables in cases when tables aren't 
> known, or can be dynamically added to source system.
> If you already know the list of tables you will be pulling data from, you can 
> use Generate Flow File. In GFF, in custom text property, specify a list of 
> tables (one table name per line), then use SplitText (1 line) to have flow 
> files for each table, then RPG and then GTF.
> Regards,
> Ed.
>
> On Fri, Aug 17, 2018 at 4:21 AM Vos, Walter <[email protected]> wrote:
>>
>> Hi Matt,
>>
>> Thanks for that very thorough explanation. I'll be sure to present this to 
>> my dev's in order to come to a way of working. When it comes to 
>> ListDatabaseTables, how can I use this when I don't want to query all tables 
>> from the database? Let's say there's some 30 tables in the database and I 
>> only want to query 8 of them. I realize there's a parameter called 'Table 
>> Name Pattern' in the processor, but how do I put together a pattern when the 
>> tables I'm targeting have very different names? Also, it seems a bit silly 
>> to open a database connections just to list the specific tables I'm looking 
>> to query (which I already know the names of). Is there a processor I can use 
>> where I can just fill out this list myself?
>>
>> Perhaps these are very basic questions, by the way. Everybody here is 
>> learning how to work with NiFi and from an admin perspective I'm trying to 
>> provide some guiding on best practices for common flows.
>>
>> Kind regards,
>>
>> Walter
>>
>> -----Oorspronkelijk bericht-----
>> Van: Matt Burgess [mailto:[email protected]]
>> Verzonden: donderdag 16 augustus 2018 15:55
>> Aan: [email protected]
>> Onderwerp: Re: List/Fetch pattern for QueryDatabaseTable
>>
>> Walter,
>>
>> If you're looking to distribute database fetching among a cluster, then 
>> GenerateTableFetch is the right choice (over QueryDatabaseTable).
>> As of NiFi 1.2.0 (via NIFI-2881 [1]), GenerateTableFetch accepts incoming 
>> flow files, the capability was added in response to exactly the use case you 
>> outlined, distributed fetch of multiple tables via ListDatabaseTables. You 
>> still want your source processor to run on the Primary Node Only, otherwise 
>> all nodes get the same source data and as you said, you end up with 
>> duplicate data.
>>
>> QueryDatabaseTable does not accept incoming connections, but you can use 
>> ExecuteSQL to actually do the fetching.  To distributed fetching of tables 
>> among the cluster, I recommend the following flow:
>>
>> ListDatabaseTables (on Primary Node only) -> RPG -> Input Port ->
>> GenerateTableFetch -> ExecuteSQL
>>
>> Each node on the cluster will get flow files for the various tables
>> in the database, then GenerateTableFetch will generate the SQL to
>> fetch "pages" based on the Partition Size property, then ExecuteSQL
>> will execute the statements. You can use multiple concurrent tasks
>> for ExecuteSQL to perform the fetching concurrently for the SQL
>> statements, and the RPG->Input port part will let you do the tables
>> in parallel.  If instead you want to fully distribute the SQL
>> execution (fetching various pages from various tables), you could
>> move the
>> RPG->Input Port after GTF:
>>
>> ListDatabaseTables (on Primary Node only) -> GenerateTableFetch ->
>> RPG
>> -> Input Port -> ExecuteSQL
>>
>> In this flow, the Primary Node will do all the work of generating all the 
>> SQL for all the pages, then will distributed the SQL among the cluster. So 
>> each node may be grabbing different pages from the same table, etc.  
>> Depending on how much work it takes to generate the SQL, this may not be as 
>> performant as the first flow.  Alternatively you can distribute the SQL 
>> generation and the execution:
>>
>> ListDatabaseTables (on Primary Node only) -> RPG -> Input Port ->
>> GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL
>>
>> This might be overkill but does "fully" parallelize the work. In
>> addition, as mentioned, you can set multiple concurrent tasks for
>> ExecuteSQL (but not GenerateTableFetch) to achieve concurrency for
>> fetching. One thing to watch out for in all cases is the Max
>> Connections property for the DBCPConnectionPool. Each node will get
>> its own pool, but depending on how much is going through GTF and
>> ExecuteSQL, you may run out of connections (which will slow your
>> throughput) or if Max Connections is high, you may exhaust all connections 
>> at the server, just something to keep in mind when configuring the flow.
>>
>> Regards,
>> Matt
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-2881
>> On Thu, Aug 16, 2018 at 7:55 AM Vos, Walter <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> >
>> >
>> > I’m trying to find a good strategy for distributing work among a cluster 
>> > when we’re fetching data from a database. My developers are currently 
>> > doing GenerateTableFetch and executing it only on the primary node because 
>> > “otherwise we end up with duplicate data”. A little googling on my end and 
>> > I found out about the List/Fetch pattern. All the examples are for SFTP 
>> > though.
>> >
>> >
>> >
>> > I’m wondering what a good configuration might be if you’re looking to use 
>> > this pattern for fetching from a database. I’ve found GenerateTableFetch, 
>> > and I can certainly use this, but since we’re querying multiple tables 
>> > (but not all tables in the DB!) I’m hoping to use something like 
>> > ListDatabaseTables before that, so that GenerateTableFetch can be done on 
>> > the whole cluster and then QueryDatabaseTable as well.
>> >
>> >
>> >
>> > So one option is Multiple GenerateTableFetch processors > Funnel > RPG // 
>> > Input port > QueryDatabaseTable. I’m wondering if there’s also a good way 
>> > to go this route: ListDatabaseTables > RPG // Input port > 
>> > GenerateTableFetch > QueryDatabaseTable. I want to distribute as much work 
>> > as possible within the cluster.
>> >
>> >
>> >
>> > Kind regards,
>> >
>> >
>> >
>> > Walter
>> >
>> >
>> > ________________________________
>> >
>> > Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor 
>> > (gebruik door) de geadresseerde. De e-mail kan persoonlijke of 
>> > vertrouwelijke informatie bevatten. Openbaarmaking, vermenigvuldiging, 
>> > verspreiding en/of verstrekking van (de inhoud van) deze e-mail (en 
>> > eventuele bijlagen) aan derden is uitdrukkelijk niet toegestaan. Indien u 
>> > niet de bedoelde geadresseerde bent, wordt u vriendelijk verzocht degene 
>> > die de e-mail verzond hiervan direct op de hoogte te brengen en de e-mail 
>> > (en eventuele bijlagen) te vernietigen.
>> >
>> > Informatie vennootschap
>>
>> ________________________________
>>
>> Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor 
>> (gebruik door) de geadresseerde. De e-mail kan persoonlijke of 
>> vertrouwelijke informatie bevatten. Openbaarmaking, vermenigvuldiging, 
>> verspreiding en/of verstrekking van (de inhoud van) deze e-mail (en 
>> eventuele bijlagen) aan derden is uitdrukkelijk niet toegestaan. Indien u 
>> niet de bedoelde geadresseerde bent, wordt u vriendelijk verzocht degene die 
>> de e-mail verzond hiervan direct op de hoogte te brengen en de e-mail (en 
>> eventuele bijlagen) te vernietigen.
>>
>> Informatie vennootschap<http://www.ns.nl/emaildisclaimer>

________________________________

Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor (gebruik 
door) de geadresseerde. De e-mail kan persoonlijke of vertrouwelijke informatie 
bevatten. Openbaarmaking, vermenigvuldiging, verspreiding en/of verstrekking 
van (de inhoud van) deze e-mail (en eventuele bijlagen) aan derden is 
uitdrukkelijk niet toegestaan. Indien u niet de bedoelde geadresseerde bent, 
wordt u vriendelijk verzocht degene die de e-mail verzond hiervan direct op de 
hoogte te brengen en de e-mail (en eventuele bijlagen) te vernietigen.

Informatie vennootschap<http://www.ns.nl/emaildisclaimer>

RE: List/Fetch pattern for QueryDatabaseTable

Reply via email to