Hi Walter,

For the SQL reading, you could set the Maximum-Value Column to an auto-increment field such as an ID. That would allow you to process your database incrementally.
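To make the idea concrete, here is a minimal sketch of the high-water-mark pattern in plain Python with an in-memory SQLite table. It is not NiFi code and not how QueryDatabaseTable is implemented; the table `events`, its columns, and the helper `fetch_new_rows` are made up for illustration.

```python
import sqlite3


def fetch_new_rows(conn, last_seen_id):
    """Return rows whose id is above last_seen_id, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    # If nothing new arrived, the mark stays where it was.
    new_mark = rows[-1][0] if rows else last_seen_id
    return rows, new_mark


# Hypothetical source table with an auto-increment id column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])

# First run picks up everything; later runs only pick up rows added since.
first_batch, mark = fetch_new_rows(conn, 0)
conn.execute("INSERT INTO events (payload) VALUES (?)", ("c",))
second_batch, mark = fetch_new_rows(conn, mark)
third_batch, mark = fetch_new_rows(conn, mark)
```

Each run only needs the previous mark, so the fetch can run as often as you like without re-reading old rows.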
But if your processing really relies on an event window of 1 day, then you might need the scheduling (or need to design the processing differently, but that's not my place to say).

Does that help?

-Alexandre

-----Original Message-----
From: Vos, Walter [mailto:[email protected]]
Sent: September 25, 2018 2:25 AM
To: '[email protected]' <[email protected]>
Subject: RE: A sensible approach to scheduling via the API?

Hi Alexandre, Nathan,

Our sources are sometimes folders (FTP), but very often they're SQL databases/tables. We have a NiFi instance on premises, where our source systems (160 sources!) are, and our developers are building flows that generally pull data from one source, once a day. These flows all end with a site-to-site connection to our cloud-hosted NiFi instance. That NiFi instance does some processing and then stores the data on HDFS or in a SQL database.

So:

NiFi (on premises): [QueryDatabaseTable] > S2S > NiFi (cloud): [Some processing] > [PutHDFS/PutSQL]

A major scenario is the one where we're pulling data from a SQL database once per day. The way I see it, I can start the flow by enabling the trigger processor, but I have no way of knowing when all of our data has gone through and therefore have no idea when to turn it off again.

Does this make sense and clarify our architecture a bit? I've heard it said that NiFi is a weird choice for this use case, but changing that is beyond my influence...

-Walter

-----Original Message-----
From: Cardinal, Alexandre [mailto:[email protected]]
Sent: Monday, September 24, 2018 19:01
To: [email protected]
Subject: RE: A sensible approach to scheduling via the API?

You could do it with an external scheduler, but my gut tells me that there is probably a way to structure your flow so that it satisfies your batch requirement without having to manage a scheduler.

-Alexandre

-----Original Message-----
From: Nathan Gough [mailto:[email protected]]
Sent: September 24, 2018 12:53 PM
To: [email protected]
Subject: Re: A sensible approach to scheduling via the API?

Typically I would not expect to schedule dataflows in NiFi, as it's not the ideal place for data to sit. For running scheduled batch jobs as you describe, I would expect the data to be constantly flowing to date/time-based directories on HDFS. This allows data to be stored in a place meant for storing data, and allows jobs to run for specified time periods with any data that arrived during that period.

In the past I have used a directory structure of year/month/day/hour, e.g. 2018/09/24/12. Any data arriving during that time will be placed in those directories. Depending on your requirements, you can bucket files into these directories based on collected date/time or arrival time (when it's received by NiFi). The scheduled batch jobs can then be configured to use the directory structure.

Let us know if this helps at all.

Nathan

On 9/24/18, 6:13 AM, "Vos, Walter" <[email protected]> wrote:

Hi,

I don't know what the etiquette on a mailing list is for this, but I'd like to bump my original question. Perhaps it's good to add that many of our flows are batch loads and therefore depend on a schedule to run, once. Does anyone have experience with remote scheduling in NiFi, or do you think you have a smart take on this? Please let me know :)

Cheers,
Walter

-----Original Message-----
From: Vos, Walter
Sent: Wednesday, September 5, 2018 10:02
To: [email protected]
Subject: A sensible approach to scheduling via the API?

Hi,

In our big data environment one of the architectural principles is to schedule jobs with Azure Automation (runbooks). A scheduling database is used to decide when to start which jobs. NiFi flows, however, are currently being scheduled in NiFi itself. We're looking for a good approach to move this over to runbooks.

I see a couple of options:

* Have each flow start with a timer-driven processor, where the run schedule is an hour or so. This processor will be stopped by default, and can be turned on via the API. It is then stopped at some point before the run schedule ends, preventing the processor from running twice.
* Use a ListenHTTP processor that we can POST a message to that specifies which flow to start. Do something like RouteOnAttribute to choose the right flow. I imagine this as being one ListenHTTP processor that is connected to all flows.
* Translate the schedule from the scheduling database to a CronTrigger expression. Check whether the CRON schedule on the processor is indeed set to that schedule. If not, stop the processor, change the schedule and start it again. If it is, do nothing and assume it'll run. This one seems convoluted on the one hand, but I imagine it requires the least architecture within NiFi itself.

What do you think? Has anyone had to deal with something like this? How did you solve it? I can't find much information about this on the web, although I could be using the wrong terms.

Kind regards,

Walter Vos

________________________________

This e-mail, including any attachments, is intended solely for (use by) the addressee. The e-mail may contain personal or confidential information. Disclosure, reproduction, distribution and/or provision of (the contents of) this e-mail (and any attachments) to third parties is expressly prohibited. If you are not the intended addressee, you are kindly requested to notify the sender of the e-mail immediately and to destroy the e-mail (and any attachments). Company information<http://www.ns.nl/emaildisclaimer>

________________________________

CONFIDENTIALITY: This document is intended solely for the individual or entity to whom it is addressed. The information contained in this document is legally privileged and confidential. If you are not the intended recipient or the person responsible for delivering it to the intended recipient, you are hereby advised that you are strictly prohibited from reading, using, copying or disseminating the contents of this document. Please inform the sender immediately or write to [email protected] and delete this document immediately.
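
For reference, the year/month/day/hour layout Nathan describes in the thread could be sketched roughly as follows. This is plain Python rather than NiFi configuration, and the helper name `bucket_path` and the base directory `/data/incoming` are assumptions made up for the example; in NiFi the directory would normally be configured on the processor rather than coded.

```python
from datetime import datetime, timezone


def bucket_path(base_dir, ts):
    """Build a year/month/day/hour directory path for a given timestamp."""
    return f"{base_dir.rstrip('/')}/{ts.strftime('%Y/%m/%d/%H')}"


# Either the collected time or the arrival time could be passed in,
# depending on which one the downstream batch job should key on.
arrival = datetime(2018, 9, 24, 12, 53, tzinfo=timezone.utc)
path = bucket_path("/data/incoming", arrival)
print(path)
```

A batch job scheduled for a given hour or day then only has to read the matching directory, whatever arrived in it.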
