Thanks Ed and Matt, That'll certainly work, so I'll get the team here up to speed!
-Walter -----Oorspronkelijk bericht----- Van: Matt Burgess [mailto:[email protected]] Verzonden: vrijdag 17 augustus 2018 15:21 Aan: [email protected] Onderwerp: Re: List/Fetch pattern for QueryDatabaseTable Merging Ed's suggestion with mine: GenerateFlowFile -> SplitText -> RPG -> Input Port -> ExtractText -> GTF -> ExecuteSQL If you want to fully distribute the fetch, you could have another RPG -> Input Port between GTF and ExecuteQL. The ExtractText is there to pull the line into an attribute table.name for use in GTF. In the upcoming 1.8.0 release (via NIFI-5454), you'll be able to have a single GenerateFlowFile with an attribute (perhaps "table.list") containing a comma-delimited list of table names, then DuplicateFlowFile (to make <num_tables - 1> copies), then UpdateAttribute setting table.name to ${table.list:getDelimitedField(${copy.index))}. Regards, Matt [1] https://issues.apache.org/jira/browse/NIFI-5454 On Fri, Aug 17, 2018 at 7:38 AM Ed B <[email protected]> wrote: > > There is a good reason to use ListDatabaseTables in cases when tables aren't > known, or can be dynamically added to source system. > If you already know the list of tables you will be pulling data from, you can > use Generate Flow File. In GFF, in custom text property, specify a list of > tables (one table name per line), then use SplitText (1 line) to have flow > files for each table, then RPG and then GTF. > Regards, > Ed. > > On Fri, Aug 17, 2018 at 4:21 AM Vos, Walter <[email protected]> wrote: >> >> Hi Matt, >> >> Thanks for that very thorough explanation. I'll be sure to present this to >> my dev's in order to come to a way of working. When it comes to >> ListDatabaseTables, how can I use this when I don't want to query all tables >> from the database? Let's say there's some 30 tables in the database and I >> only want to query 8 of them. I realize there's a parameter called 'Table >> Name Pattern' in the processor, but how do I put together a pattern when the >> tables I'm targeting have very different names? Also, it seems a bit silly >> to open a database connections just to list the specific tables I'm looking >> to query (which I already know the names of). Is there a processor I can use >> where I can just fill out this list myself? >> >> Perhaps these are very basic questions, by the way. Everybody here is >> learning how to work with NiFi and from an admin perspective I'm trying to >> provide some guiding on best practices for common flows. >> >> Kind regards, >> >> Walter >> >> -----Oorspronkelijk bericht----- >> Van: Matt Burgess [mailto:[email protected]] >> Verzonden: donderdag 16 augustus 2018 15:55 >> Aan: [email protected] >> Onderwerp: Re: List/Fetch pattern for QueryDatabaseTable >> >> Walter, >> >> If you're looking to distribute database fetching among a cluster, then >> GenerateTableFetch is the right choice (over QueryDatabaseTable). >> As of NiFi 1.2.0 (via NIFI-2881 [1]), GenerateTableFetch accepts incoming >> flow files, the capability was added in response to exactly the use case you >> outlined, distributed fetch of multiple tables via ListDatabaseTables. You >> still want your source processor to run on the Primary Node Only, otherwise >> all nodes get the same source data and as you said, you end up with >> duplicate data. >> >> QueryDatabaseTable does not accept incoming connections, but you can use >> ExecuteSQL to actually do the fetching. To distributed fetching of tables >> among the cluster, I recommend the following flow: >> >> ListDatabaseTables (on Primary Node only) -> RPG -> Input Port -> >> GenerateTableFetch -> ExecuteSQL >> >> Each node on the cluster will get flow files for the various tables >> in the database, then GenerateTableFetch will generate the SQL to >> fetch "pages" based on the Partition Size property, then ExecuteSQL >> will execute the statements. You can use multiple concurrent tasks >> for ExecuteSQL to perform the fetching concurrently for the SQL >> statements, and the RPG->Input port part will let you do the tables >> in parallel. If instead you want to fully distribute the SQL >> execution (fetching various pages from various tables), you could >> move the >> RPG->Input Port after GTF: >> >> ListDatabaseTables (on Primary Node only) -> GenerateTableFetch -> >> RPG >> -> Input Port -> ExecuteSQL >> >> In this flow, the Primary Node will do all the work of generating all the >> SQL for all the pages, then will distributed the SQL among the cluster. So >> each node may be grabbing different pages from the same table, etc. >> Depending on how much work it takes to generate the SQL, this may not be as >> performant as the first flow. Alternatively you can distribute the SQL >> generation and the execution: >> >> ListDatabaseTables (on Primary Node only) -> RPG -> Input Port -> >> GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL >> >> This might be overkill but does "fully" parallelize the work. In >> addition, as mentioned, you can set multiple concurrent tasks for >> ExecuteSQL (but not GenerateTableFetch) to achieve concurrency for >> fetching. One thing to watch out for in all cases is the Max >> Connections property for the DBCPConnectionPool. Each node will get >> its own pool, but depending on how much is going through GTF and >> ExecuteSQL, you may run out of connections (which will slow your >> throughput) or if Max Connections is high, you may exhaust all connections >> at the server, just something to keep in mind when configuring the flow. >> >> Regards, >> Matt >> >> [1] https://issues.apache.org/jira/browse/NIFI-2881 >> On Thu, Aug 16, 2018 at 7:55 AM Vos, Walter <[email protected]> wrote: >> > >> > Hi, >> > >> > >> > >> > I’m trying to find a good strategy for distributing work among a cluster >> > when we’re fetching data from a database. My developers are currently >> > doing GenerateTableFetch and executing it only on the primary node because >> > “otherwise we end up with duplicate data”. A little googling on my end and >> > I found out about the List/Fetch pattern. All the examples are for SFTP >> > though. >> > >> > >> > >> > I’m wondering what a good configuration might be if you’re looking to use >> > this pattern for fetching from a database. I’ve found GenerateTableFetch, >> > and I can certainly use this, but since we’re querying multiple tables >> > (but not all tables in the DB!) I’m hoping to use something like >> > ListDatabaseTables before that, so that GenerateTableFetch can be done on >> > the whole cluster and then QueryDatabaseTable as well. >> > >> > >> > >> > So one option is Multiple GenerateTableFetch processors > Funnel > RPG // >> > Input port > QueryDatabaseTable. I’m wondering if there’s also a good way >> > to go this route: ListDatabaseTables > RPG // Input port > >> > GenerateTableFetch > QueryDatabaseTable. I want to distribute as much work >> > as possible within the cluster. >> > >> > >> > >> > Kind regards, >> > >> > >> > >> > Walter >> > >> > >> > ________________________________ >> > >> > Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor >> > (gebruik door) de geadresseerde. De e-mail kan persoonlijke of >> > vertrouwelijke informatie bevatten. Openbaarmaking, vermenigvuldiging, >> > verspreiding en/of verstrekking van (de inhoud van) deze e-mail (en >> > eventuele bijlagen) aan derden is uitdrukkelijk niet toegestaan. Indien u >> > niet de bedoelde geadresseerde bent, wordt u vriendelijk verzocht degene >> > die de e-mail verzond hiervan direct op de hoogte te brengen en de e-mail >> > (en eventuele bijlagen) te vernietigen. >> > >> > Informatie vennootschap >> >> ________________________________ >> >> Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor >> (gebruik door) de geadresseerde. De e-mail kan persoonlijke of >> vertrouwelijke informatie bevatten. Openbaarmaking, vermenigvuldiging, >> verspreiding en/of verstrekking van (de inhoud van) deze e-mail (en >> eventuele bijlagen) aan derden is uitdrukkelijk niet toegestaan. Indien u >> niet de bedoelde geadresseerde bent, wordt u vriendelijk verzocht degene die >> de e-mail verzond hiervan direct op de hoogte te brengen en de e-mail (en >> eventuele bijlagen) te vernietigen. >> >> Informatie vennootschap<http://www.ns.nl/emaildisclaimer> ________________________________ Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor (gebruik door) de geadresseerde. De e-mail kan persoonlijke of vertrouwelijke informatie bevatten. Openbaarmaking, vermenigvuldiging, verspreiding en/of verstrekking van (de inhoud van) deze e-mail (en eventuele bijlagen) aan derden is uitdrukkelijk niet toegestaan. Indien u niet de bedoelde geadresseerde bent, wordt u vriendelijk verzocht degene die de e-mail verzond hiervan direct op de hoogte te brengen en de e-mail (en eventuele bijlagen) te vernietigen. Informatie vennootschap<http://www.ns.nl/emaildisclaimer>
