RE: A sensible approach to scheduling via the API?

Vos, Walter Mon, 24 Sep 2018 23:26:18 -0700

Hi Alexandre, Nathan,

Our sources are sometimes folders (FTP), but very often they're SQL
databases/tables. We have an on-premises NiFi instance, located where our
source systems are (160 sources!), and our developers are building flows that
generally pull data from one source, once a day. These flows all end with a
site-to-site connection to our cloud-hosted NiFi instance, which does some
processing and then stores the data on HDFS or in a SQL database.


So:

NiFi (on premises): [QueryDatabaseTable] > S2S >
NiFi (cloud): [Some processing] > [PutHDFS/PutSQL]

A major scenario is the one where we're pulling data from a SQL database once 
per day. The way I see it, I can start the flow by enabling the trigger 
processor, but I have no way of knowing when all of our data has gone through 
and therefore have no idea when to turn it off again.
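
One possible way to detect "all data has gone through" from an external scheduler is to poll the status of the process group and wait until nothing is queued and no threads are active. The sketch below is only an illustration under assumptions: it presumes the status JSON has the shape returned by NiFi's process-group status endpoint (`processGroupStatus.aggregateSnapshot` with `flowFilesQueued` and `activeThreadCount`), which should be verified against the NiFi version in use. The names `is_drained`, `wait_until_drained` and `fetch_status` are hypothetical; `fetch_status` stands in for whatever HTTP call the runbook makes.

```python
import time
from typing import Any, Callable, Dict


def is_drained(status: Dict[str, Any]) -> bool:
    """True when the group reports no queued FlowFiles and no active threads.

    Assumes the JSON layout of a process-group status response; check the
    field names against your NiFi version.
    """
    snapshot = status["processGroupStatus"]["aggregateSnapshot"]
    return snapshot["flowFilesQueued"] == 0 and snapshot["activeThreadCount"] == 0


def wait_until_drained(
    fetch_status: Callable[[], Dict[str, Any]],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the group is drained; return False if max_polls is exhausted."""
    for attempt in range(max_polls):
        if is_drained(fetch_status()):
            return True
        # Do not sleep after the final, unsuccessful poll.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return False
```

A caveat with this approach: immediately after the trigger processor is started, the group may briefly look drained before any data has been produced, so a caller would probably want to wait for the first queued FlowFile (or a short grace period) before starting to poll.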

Does this make sense and clarify our architecture a bit? I've heard it said
that NiFi is a weird choice for this use case, but changing that is beyond
my influence...

-Walter

-----Original Message-----
From: Cardinal, Alexandre [mailto:[email protected]]
Sent: Monday, 24 September 2018 19:01
To: [email protected]
Subject: RE: A sensible approach to scheduling via the API?

You could do it with an external scheduler, but my gut tells me there is
probably a way to structure your flow so that it satisfies your batch
requirement without having to manage a scheduler.

-Alexandre

-----Original Message-----
From: Nathan Gough [mailto:[email protected]]
Sent: September 24, 2018 12:53 PM
To: [email protected]
Subject: Re: A sensible approach to scheduling via the API?

Typically I would not expect to schedule dataflows in NiFi as it's not the 
ideal place for data to stay sitting. For running scheduled batch jobs as you 
describe I would expect the data to be constantly flowing to date/time based 
directories on HDFS. This allows data to be stored in a place meant for storing 
data and allows jobs to run for specified time periods with any data that 
arrived during that period.

In the past I have used a directory structure of year/month/day/hour, e.g.
2018/09/24/12. Any data arriving during that time will be placed in those
directories. Depending on your requirements you can bucket files into these 
directories based on collected date/time or arrival time (when it's received by 
NiFi). The scheduled batch jobs can then be configured to use the directory 
structure.
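
As a rough illustration of the bucketing described above, the following sketch maps a timestamp to a year/month/day/hour directory. The function name `hourly_bucket` and the base directory `/data/landing` are made up for the example; whether the timestamp is the collection time or the arrival time is up to the caller.

```python
from datetime import datetime
from pathlib import PurePosixPath


def hourly_bucket(base_dir: str, ts: datetime) -> str:
    """Return the year/month/day/hour directory under base_dir for ts."""
    return str(PurePosixPath(base_dir) / ts.strftime("%Y/%m/%d/%H"))


# Data timestamped 2018-09-24 12:30 lands in the 2018/09/24/12 bucket.
print(hourly_bucket("/data/landing", datetime(2018, 9, 24, 12, 30)))
```

A batch job scheduled for a given hour can compute the same path and read only the files that were placed there.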

Let us know if this helps at all.
Nathan


On 9/24/18, 6:13 AM, "Vos, Walter" <[email protected]> wrote:

    Hi,

    I don't know what the etiquette on a mailing list is for this, but I'd
    like to bump my original question.

    Perhaps it's good to add that many of our flows are batch loads and
    therefore depend on a schedule to run once.

    Does anyone have experience with remote scheduling in NiFi, or do you
    think you have a smart take on this? Please let me know :)

    Cheers,

    Walter

    -----Original Message-----
    From: Vos, Walter
    Sent: Wednesday, 5 September 2018 10:02
    To: [email protected]
    Subject: A sensible approach to scheduling via the API?

    Hi,

    In our big data environment one of the architectural principles is to
    schedule jobs with Azure Automation (runbooks). A scheduling database is
    used to decide when to start which jobs. NiFi flows however are currently
    being scheduled in NiFi itself. We're looking for a good approach to move
    this over to runbooks. I see a couple of options:

    * Have each flow start with a timer-driven processor whose run schedule
      is an hour or so. This processor is stopped by default and can be
      turned on via the API. It is then stopped again at some point before
      the run schedule elapses, preventing the processor from running twice.
    * Use a ListenHTTP processor that we can POST a message to, specifying
      which flow to start, then use something like RouteOnAttribute to choose
      the right flow. I imagine this as one ListenHTTP processor that is
      connected to all flows.
    * Translate the schedule from the scheduling database to a CronTrigger
      expression. Check whether the CRON schedule on the processor is already
      set to that schedule. If not, stop the processor, change the schedule
      and start it again. If it is, do nothing and assume it'll run. This one
      seems convoluted on the one hand, but I imagine it requires the least
      architecture within NiFi itself.
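
The third option above could be sketched roughly as follows. This is only an illustration under assumptions: it presumes a once-a-day schedule stored as an hour and minute, and a Quartz-style expression with a leading seconds field (the format NiFi's CRON-driven scheduling uses, e.g. `0 0 13 * * ?`). The function names `daily_cron` and `plan_update` are hypothetical, and the actual stop/update/start calls against the NiFi API are left out.

```python
from typing import Optional


def daily_cron(hour: int, minute: int = 0) -> str:
    """Build a Quartz-style expression that fires once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"invalid time of day: {hour}:{minute}")
    # Fields: seconds minutes hours day-of-month month day-of-week
    return f"0 {minute} {hour} * * ?"


def plan_update(current: str, desired: str) -> Optional[str]:
    """Return the expression to apply, or None if the processor is already
    on the desired schedule (compared with whitespace normalized)."""
    if " ".join(current.split()) == " ".join(desired.split()):
        return None
    return desired


# Only stop, reschedule and restart the processor when plan_update
# returns an expression; otherwise leave it running untouched.
print(plan_update("0 0 1 * * ?", daily_cron(2, 30)))
```

Keeping the comparison separate from the API calls means the runbook only touches a processor when its schedule has actually changed.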

    What do you think? Has anyone had to deal with something like this? How
    did you solve it? I can't find much information about this on the web,
    although I could be using the wrong terms.

    Kind regards,

    Walter Vos


    ________________________________

    This e-mail, including any attachments, is intended solely for (use by)
    the addressee. The e-mail may contain personal or confidential
    information. Disclosure, reproduction, distribution and/or provision of
    (the contents of) this e-mail (and any attachments) to third parties is
    expressly prohibited. If you are not the intended addressee, you are
    kindly requested to notify the sender of the e-mail immediately and to
    destroy the e-mail (and any attachments).

    Company information<http://www.ns.nl/emaildisclaimer>





CONFIDENTIALITY: This document is intended solely for the individual or entity 
to whom it is addressed. The information contained in this document is legally 
privileged and confidential. If you are not the intended recipient or the 
person responsible for delivering it to the intended recipient, you are hereby 
advised that you are strictly prohibited from reading, using, copying or 
disseminating the contents of this document. Please inform the sender 
immediately or write to [email protected] and delete this document 
immediately.


