Re: NiFi Data Usage via Rest API
Ryan,

That is correct. I would just clarify that when you say "SEND events are when they are leaving the system": the data is being sent to an external system, but it is not being dropped from NiFi. So you could send the data to 10 different places. A "DROP" event indicates that NiFi is now finished processing it.

Thanks
-Mark
Re: NiFi Data Usage via Rest API
Hi Mark,

Thanks for the explanation on this; this is what I was looking for. So it sounds like Provenance info is the way to go (as mentioned by Mike [thanks Mike]). I will have to do a little more research on the Provenance events, but it sounds like RECEIVE events are for when something is coming into the system from an external source and SEND events are for when it is leaving the system (I hope this is a correct assumption). I think I will mull this over for a bit.

-Ryan
Re: NiFi Data Usage via Rest API
Hey Ryan,

The stats that you are seeing here are a rolling 5-minute window. The "bytesReceived" value indicates the number of bytes that were received from external systems (i.e., the number of bytes reported as Provenance RECEIVE events). The "bytesSent" value indicates the number of bytes that were sent to external systems (i.e., the number of bytes reported as Provenance SEND events). If you were to receive a file (or a datagram, a packet, a message, or whatever) that is, say, 10 MB, then send it to two destinations, you'd see a "bytesReceived" of 10,000,000 and a "bytesSent" of 20,000,000 (because the 10 MB were sent twice).

Because these are rolling windows, not tumbling windows, it would be difficult to use them to get exact counts over long periods of time. I do think the Provenance data is the right place to look for that kind of information, but it may not be as trivial as you'd hope to get by hitting a REST API.

At first thought, it may make sense to have another endpoint that returns counts for all data received/sent/etc. since the instance was started. This would make it much easier to query every 4 hours, for instance, and then subtract the two numbers to get the difference. This, however, would still be problematic if a node is restarted: if you get the information at time t = 4 hours, a node restarts at t = 7 hours, and you get the next snapshot at t = 8 hours, you'll only have 1 hour's worth of data from the node that restarted. This is one of the benefits of gleaning this info from Provenance data.

Alternatively, I suppose, some sort of persistent reporting mechanism could be built within NiFi itself to track this sort of thing, but nothing like that exists at the moment.

Thanks
-Mark
Re: NiFi Data Usage via Rest API
Hi Matt,

The use case that I am investigating is fairly simplistic (and I may be naive about it). I am only looking for the amount of data that has come into the cluster (across all PGs) and out of the cluster for a given time period (or a way to derive it for a time period). I do not want the requirement of adding reporting tasks, S2S, or anything on the canvas to achieve this. I want to be able to just query the REST API to get this information. I don't care about what the data is, only the amount of data that has come in and gone out.

For example: I would like to call the REST API every 4 hours to see how much data has come into the cluster and how much data has gone out of the cluster (500 GB came in via external sources such as data lakes, message queues, other APIs, etc., and 400 GB went out, such as to other data lakes, other clusters, applications, etc.).

I was thinking that information was already available, but it is unclear to me based on the REST API documentation (see original email). Is this something you can speak to (rest-api: nifi-api/flow/process-groups/root/status?recursive=true) and the property values that are returned? With the data model, it is unclear what the values represent (only the last 5 minutes? a running counter since the start of time? if it is an aggregate of forever, when will it reset? cluster restart?).

Data model in question:
ProcessGroupStatusEntity

{
  "processGroupStatus": {
    ...
    "aggregateSnapshot": {
      ...
      "flowFilesIn": 0,
      "bytesIn": 0,
      "input": "value",
      "flowFilesQueued": 0,
      "bytesQueued": 0,
      "queued": "value",
      "queuedCount": "value",
      "queuedSize": "value",
      "bytesRead": 0,
      "read": "value",
      "bytesWritten": 0,
      "written": "value",
      "flowFilesOut": 0,
      "bytesOut": 0,
      "output": "value",
      "flowFilesTransferred": 0,
      "bytesTransferred": 0,
      "transferred": "value",
      "bytesReceived": 0,      // I think this is the one, but not sure
      "flowFilesReceived": 0,
      "received": "value",
      "bytesSent": 0,          // I think this is the other one, but not sure
      "flowFilesSent": 0,
      "sent": "value",
      "activeThreadCount": 0,
      "terminatedThreadCount": 0
    },
    "nodeSnapshots": [{…}]
  },
  "canRead": true
}

Thanks,
Ryan H
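As a rough sketch of consuming that endpoint: the field names below follow the data model quoted above, while the host, port, and lack of authentication are placeholder assumptions.

```python
def received_and_sent(entity):
    """Pull the rolling 5-minute I/O totals out of a ProcessGroupStatusEntity dict."""
    snap = entity["processGroupStatus"]["aggregateSnapshot"]
    return snap["bytesReceived"], snap["bytesSent"]

# Against a live (placeholder, unauthenticated) instance it might be fetched with:
#   import json, urllib.request
#   url = "http://localhost:8080/nifi-api/flow/process-groups/root/status?recursive=true"
#   entity = json.load(urllib.request.urlopen(url))

# Offline example shaped like the documented data model:
entity = {"processGroupStatus": {"aggregateSnapshot": {
    "bytesReceived": 10_000_000, "bytesSent": 20_000_000}}}
print(received_and_sent(entity))  # (10000000, 20000000)
```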
Re: NiFi Data Usage via Rest API
Matt,

Our main use, which provenance data handles well, is figuring out *what* data was handled. We drop everything but DROP out of convenience, because we have no known scenarios where data will be removed before it reaches the end of the flow.

FWIW, this is what inspired the record stats processor we added about a release ago. You can use it to drill down into a record set with path operations and add attributes to a flowfile describing what the record set therein contains. It should be particularly helpful to anyone who wants to do partial or full table fetches and then say "count this batch of records using this path."

Mike
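The stripped-down, DROP-only pipeline described in this part of the thread (cutting most fields and keeping only DROP events for dashboarding) can be sketched as a simple filter. The field names here are illustrative, not the exact provenance record schema.

```python
# Hypothetical subset of fields a dashboard might keep.
KEEP_FIELDS = {"eventId", "eventType", "timestampMillis", "componentId", "bytes"}

def strip_for_dashboard(events):
    """Keep only DROP events, and only the handful of fields the dashboard needs."""
    return [
        {k: v for k, v in e.items() if k in KEEP_FIELDS}
        for e in events
        if e.get("eventType") == "DROP"
    ]

events = [
    {"eventId": 1, "eventType": "RECEIVE", "timestampMillis": 100, "bytes": 10, "transitUri": "x"},
    {"eventId": 2, "eventType": "DROP", "timestampMillis": 200, "bytes": 10, "transitUri": "x"},
]
print(strip_for_dashboard(events))
# [{'eventId': 2, 'eventType': 'DROP', 'timestampMillis': 200, 'bytes': 10}]
```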
Re: NiFi Data Usage via Rest API
Mike, Ryan, Boris et al,

I'd like to wrap my head around the kinds of use cases y'all have for provenance data in NiFi: what's good, what's bad, what we need to do to make things better. Are there questions you want to ask of provenance that you can't today? Do the DROP events give you what you need for your reporting, or would you benefit from some sort of "lifetime record" that might be generated from a NiFi subflow based on provenance events?

I've been bouncing around the following concepts/improvements:

1) Let the ProcessSession keep track of the duration in a processor (for those that don't explicitly report it) [1]
2) Add a "durationNanos" field to provenance events, possibly replacing "durationMillis" in NiFi 2.0, to give better precision [2]
3) A processor to generate lineage when a DROP event is received (likely via the SiteToSiteProvenanceReportingTask), dependent on the persistence settings of the provenance repository
4) A "Query Filter" property on the SiteToSiteProvenanceReportingTask (or a separate reporting task if need be), perhaps leveraging Calcite for SQL filters or supporting the Lucene query language (since the prov events are indexed by Lucene)

I still haven't come up with the New Feature Wiki page for graph tech (from a previous discussion on the list), but #3 above lends itself to also generating a lineage graph for a FlowFile, in some well-known format perhaps (Kryo, GraphML, etc.). I'll try to get that Wiki (and the discussion) going soon...

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-5420
[2] https://issues.apache.org/jira/browse/NIFI-5463
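A minimal sketch of the kind of provenance aggregation discussed in this thread: given provenance events in some already-fetched form (the dict shape below is a simplified stand-in, not the actual provenance schema), total the RECEIVE and SEND bytes over a time window, which is exactly the received/sent accounting the rolling 5-minute stats cannot provide over arbitrary periods.

```python
def bytes_by_direction(events, start_ms, end_ms):
    """Sum event bytes for RECEIVE and SEND events in [start_ms, end_ms).

    `events` is assumed to be an iterable of dicts with "eventType",
    "timestampMillis", and "bytes" keys -- a simplified stand-in for
    whatever a real provenance query would return.
    """
    totals = {"RECEIVE": 0, "SEND": 0}
    for e in events:
        if start_ms <= e["timestampMillis"] < end_ms and e["eventType"] in totals:
            totals[e["eventType"]] += e["bytes"]
    return totals

events = [
    {"eventType": "RECEIVE", "timestampMillis": 100, "bytes": 10},
    {"eventType": "SEND", "timestampMillis": 150, "bytes": 10},
    {"eventType": "SEND", "timestampMillis": 160, "bytes": 10},  # sent twice: counted twice
    {"eventType": "DROP", "timestampMillis": 170, "bytes": 10},  # neither in nor out
    {"eventType": "SEND", "timestampMillis": 500, "bytes": 99},  # outside the window
]
print(bytes_by_direction(events, 0, 200))  # {'RECEIVE': 10, 'SEND': 20}
```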
Re: NiFi Data Usage via Rest API
Mike, would love to see a post about that solution you've mentioned. Thinking about doing something similar for my company and using Elastic/Kibana or Grafana On Wed, Jul 25, 2018 at 9:17 PM Mike Thomsen wrote: > Ryan, > > Understandable. We haven't found a need for Beats or Forwarders here > either because S2S gives everything you need to reliably ship the data. > > FWIW, if your need changes, I would recommend stripping down the > provenance data. We cut out about 66-75% of the fields and dropped the > intermediate records in favor of keeping DROP events for our simple > dashboarding needs because we figured if a record never made it there > something very bad happened. > > On Wed, Jul 25, 2018 at 8:54 PM Ryan H > wrote: > >> Thanks Mike for the suggestion on it. I'm looking for a solution that >> doesn't involve the additional components such as any >> Beats/Forwarders/Elasticsearch/etc. >> >> Boris, thanks for the link for the Monitoring introduction--I've checked >> it out multiple times. What I want to avoid is having the need for anything >> to be set on the Canvas and have the metrics collection via the rest api. >> I'm thinking that the api in the original question may be the way to go, >> but unsure of it without a little more information on the data model and >> how that data is collected/aggregated (such as what the data returned >> actually represents). I may just dig into the source if this email goes >> stale. >> >> -Ryan >> >> >> On Wed, Jul 25, 2018 at 9:17 AM, Boris Tyukin >> wrote: >> >>> Ryan, if you have not seen these posts from Pierre, I suggest >>> starting there. He does a good job explaining different options >>> https://pierrevillard.com/2017/05/11/monitoring-nifi-introduction/ >>> >>> I do agree that 5 minute thing is super confusing and pretty useless and >>> you cannot change that interval. I think it is only useful to check quickly >>> on your real-time pipelines at the moment. 
>>> I wish NiFi provided nicer out-of-the-box logging/monitoring
>>> capabilities, but on the bright side, it seems to me that you can
>>> build your own and customize it as you want.
>>>
>>> On Tue, Jul 24, 2018 at 10:55 PM Ryan H <
>>> ryan.howell.developm...@gmail.com> wrote:
>>>
>>> Hi All,
>>>
>>> I am looking for a way to obtain the total amount of data that has
>>> been processed by a running cluster for a period of time, ideally via
>>> the REST API.
>>>
>>> Example of my use case: I have, say, 50 different process groups, each
>>> with a connection to some data source. Each one is continuously
>>> pulling data in, doing something to it, then sending it out to some
>>> other external place. I'd like to programmatically gather metrics
>>> about the amount of data flowing through the cluster as a whole
>>> (everything that is running across the cluster).
>>>
>>> It looks like the following API may be the solution, but I am curious
>>> about some of the properties:
>>> "nifi-api/flow/process-groups/root/status?recursive=true".
>>>
>>> Looking at the data model (as defined in the REST API documentation)
>>> and the actual data that is returned, my questions are:
>>> 1. Would this be the correct way to obtain this information?
>>> 2. If so, which properties should I look at? It isn't immediately
>>> clear to me what the difference is between some of them, e.g.
>>> "bytesSent" vs. "bytesOut".
>>> 3. How is this data updated? It looks like a lot of these metrics are
>>> supposed to be updated every 5 minutes. Would the info I get now be
>>> what was collected over the last 5-minute interval, staying the same
>>> until the next interval? And does the data aggregate, or is it only
>>> representative of a single 5-minute period? Something else?
>>>
>>> {
>>>   "processGroupStatus": {
>>>     ...
>>>     "aggregateSnapshot": {
>>>       ...
>>>       "flowFilesIn": 0,
>>>       "bytesIn": 0,
>>>       "input": "value",
>>>       "flowFilesQueued": 0,
>>>       "bytesQueued": 0,
>>>       "queued": "value",
>>>       "queuedCount": "value",
>>>       "queuedSize": "value",
>>>       "bytesRead": 0,
>>>       "read": "value",
>>>       "bytesWritten": 0,
>>>       "written": "value",
>>>       "flowFilesOut": 0,
>>>       "bytesOut": 0,
>>>       "output": "value",
>>>       "flowFilesTransferred": 0,
>>>       "bytesTransferred": 0,
>>>       "transferred": "value",
>>>       "bytesReceived": 0,   // I think this is the one, but not sure
>>>       "flowFilesReceived": 0,
>>>       "received": "value",
>>>       "bytesSent": 0,   // I think this is the other one, but not sure
>>>       "flowFilesSent": 0,
>>>       "sent": "value",
>>>       "activeThreadCount": 0,
>>>       "terminatedThreadCount": 0
>>>     },
>>>     "nodeSnapshots": [{…}]
>>>   },
>>>   "canRead": true
>>> }
>>>
>>> Any help or insight is always appreciated!
>>>
>>> Cheers,
>>> Ryan H.
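[Editor's sketch] As the thread concludes, "bytesReceived"/"bytesSent" in the aggregateSnapshot count data exchanged with external systems (Provenance RECEIVE/SEND events) over a rolling 5-minute window, while "bytesIn"/"bytesOut" count data crossing the group's own ports within NiFi. A minimal, hedged sketch of reading those counters from the status payload above; the field names come from the JSON shown, but the host, port, and lack of auth in the usage comment are assumptions about the deployment:

```python
def extract_throughput(status):
    """Pull the rolling 5-minute byte counters out of a process-group
    status response (GET /nifi-api/flow/process-groups/root/status).

    bytesReceived/bytesSent: data exchanged with *external* systems
    (reported as Provenance RECEIVE/SEND events).
    bytesIn/bytesOut: data crossing this group's ports inside NiFi.
    """
    snapshot = status["processGroupStatus"]["aggregateSnapshot"]
    return {key: snapshot[key]
            for key in ("bytesReceived", "bytesSent", "bytesIn", "bytesOut")}

# Hypothetical usage against an unsecured node (adjust host/port/TLS/auth):
#   import json, urllib.request
#   url = ("http://localhost:8080/nifi-api/flow/process-groups/root"
#          "/status?recursive=true")
#   counters = extract_throughput(json.load(urllib.request.urlopen(url)))
```

Because the window is rolling rather than tumbling, these numbers cannot simply be polled and summed to get exact totals over long periods, which is why the thread steers toward provenance data for that.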
Re: NiFi Data Usage via Rest API
I have a client with a similar use case. They wanted to be able to figure
out when they processed a particular data set (they're using batch
processing with NiFi). The solution I gave them was based on using Metrics
and Provenance reporting to the ELK stack. I know that doesn't directly
answer your particular question, but we found it to be a good approach to
the problem of finding out how much data, how many records, etc. went from
A to Z in the pipeline at any given execution.

On Tue, Jul 24, 2018 at 10:55 PM Ryan H wrote:

> Hi All,
>
> I am looking for a way to obtain the total amount of data that has been
> processed by a running cluster for a period of time, ideally via the
> REST API. [...]