Execute script - python example
Hi, I am looking for an example in Python to create a new field based on an attribute value. Say syslog.facility holds the value 23; based on that value I want to create a new field with a text label, like syslog.facility_label=LOCAL7. If this transformation is possible with existing processors, please provide an example or direct me to the right processor. Thanks in advance,
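For the scripting route, the lookup itself is a small piece of Python that could sit inside an ExecuteScript processor. This is a minimal sketch assuming the standard RFC 3164 facility numbering; the NiFi session calls are shown only as comments, since they rely on the Jython bindings ExecuteScript provides:

```python
# Map a numeric syslog facility to its text label (RFC 3164 numbering),
# e.g. 23 -> "LOCAL7".
SYSLOG_FACILITIES = {
    0: "KERN", 1: "USER", 2: "MAIL", 3: "DAEMON", 4: "AUTH",
    5: "SYSLOG", 6: "LPR", 7: "NEWS", 8: "UUCP", 9: "CRON",
    10: "AUTHPRIV", 11: "FTP", 16: "LOCAL0", 17: "LOCAL1",
    18: "LOCAL2", 19: "LOCAL3", 20: "LOCAL4", 21: "LOCAL5",
    22: "LOCAL6", 23: "LOCAL7",
}

def facility_label(value):
    """Return the label for a syslog facility number, or None if unknown."""
    try:
        return SYSLOG_FACILITIES.get(int(value))
    except (TypeError, ValueError):
        return None

# Inside ExecuteScript (Jython engine) the same lookup would be applied
# to the flow file, roughly:
#   flowFile = session.get()
#   if flowFile is not None:
#       label = facility_label(flowFile.getAttribute('syslog.facility'))
#       if label is not None:
#           flowFile = session.putAttribute(flowFile, 'syslog.facility_label', label)
#       session.transfer(flowFile, REL_SUCCESS)
```

Without scripting, the same table could also be expressed as a set of rules in UpdateAttribute's advanced UI (one rule per facility number), at the cost of some clicking.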
Re: Maximum attribute size
Thanks a lot for confirming my suspicions. One last clarification: the WAL is different from the swapping concept, correct? I guess it's way faster to swap in a dedicated "dump" than replaying a WAL. On Wed, Feb 17, 2016 at 7:53 PM, Joe Witt wrote: > Lars, > > You are right about the thought process. We've never provided solid > guidance here but we should. It is definitely the case that flow file > content is streamed to and from the underlying repository and the only > way to access it is through that API. Thus well-behaved extensions > and the framework itself can basically handle data as large as the > underlying repository has space for. The flow file attributes, > though, are held in memory in a map with each flowfile object. > So it is important to avoid having vast (undefined) quantities of > attributes or attributes with really large (undefined) values. > > There are things we can and should do to make even this relatively > transparent to the users, and it is actually why we support swapping > flowfiles to disk when there are large queues, because even those in-memory > attributes can really add up. > > Thanks > Joe > > On Wed, Feb 17, 2016 at 11:06 AM, Lars Francke > wrote: > > Hi and sorry for all these questions. > > > > I know that FlowFile content is persisted to the content_repository and > can > > handle reasonably large amounts of data. Is the same true for attributes? > > > > I download JSON files (up to 200kb I'd say) and I want to insert them as > > they are into a PostgreSQL JSONB column. I'd love to use the PutSQL > > processor for that but it requires parameters in attributes. > > > > I have a feeling that putting large objects in attributes is a bad idea? >
Re: Version Control on NiFi flow.xml
Vincent, Yeah, you're hitting the nail on the head from what we're hearing more and more. We have a couple really nice roadmap items to make these work more like you're doing now. Thanks Joe On Wed, Feb 17, 2016 at 5:27 PM, Vincent Russell wrote: > My team has played around with version control with NiFi in the > following way (we have yet to use this for deployments, though): > > We version control the flow.xml file and all of the config files that need > to be changed > We build a distribution of nifi, gzipping the flow.xml and string-replacing > properties in the config files with maven > We then can install this "version" of our nifi app. > > We want to be able to use this to test our flows and processes on our test > system before making it live in production. But like I said, we have yet to > actually use this for production deployments. > > On Wed, Feb 17, 2016 at 7:21 PM, Jeff - Data Bean Australia > wrote: >> >> Thanks Matt for describing the feature in such an intuitive way, and >> pointing out the location for the archive. >> >> This looks good. Just wondering whether we also want to archive the >> templates along with flow.xml.gz. >> >> Thanks, >> Jeff >> >> On Thu, Feb 18, 2016 at 11:08 AM, Matthew Clarke >> wrote: >>> >>> Jeff, >>> NiFi gives users the ability to create snapshot backups of their >>> flow.xml through the "back-up flow" link found under the "controller >>> settings" (icon looks like a wrench and screwdriver in the upper right corner). >>> The default nifi.properties configuration will write these back-ups to a >>> directory called archive inside the /conf directory, but >>> you can of course change where they are written. >>> >>> Matt >>> >>> On Wed, Feb 17, 2016 at 4:52 PM, Jeff - Data Bean Australia >>> wrote: Thanks Oleg for sharing this. They are definitely useful. 
But my question focused more on keeping the data flow definition files' versions, so that Data Flow Developers, or NiFi Cluster Manager in NiFi's terms, can keep track of our work. Currently I am using the following command line to generate a formatted XML to put into our Git repository: cat conf/flow.xml.gz | gzip -dc | xmllint --format - On Thu, Feb 18, 2016 at 10:01 AM, Oleg Zhurakousky wrote: > > Jeff, what you are describing is in the works and actively discussed > https://cwiki.apache.org/confluence/display/NIFI/Extension+Registry > and > > https://cwiki.apache.org/confluence/display/NIFI/Component+documentation+improvements > > The last one may not directly speak to the “ExtensionRegistry”, but if > you look through the comments there is a whole lot about it since it is > dependent. > Feel free to participate, but I can say for now that it is slated for > the 1.0 release. > > Cheers > Oleg > > On Feb 17, 2016, at 3:08 PM, Jeff - Data Bean Australia > wrote: > > Hi, > > As my NiFi data flow becomes more and more serious, I need to put it under > version control. Since flow.xml.gz is generated automatically and it is > saved in a compressed file, I am wondering what would be the best practice > regarding version control? > > Thanks, > Jeff > > -- > Data Bean - A Big Data Solution Provider in Australia. > > -- Data Bean - A Big Data Solution Provider in Australia. >>> >>> >> >> >> >> -- >> Data Bean - A Big Data Solution Provider in Australia. > >
Re: Version Control on NiFi flow.xml
My team has played around with version control with NiFi in the following way (we have yet to use this for deployments, though): - We version control the flow.xml file and all of the config files that need to be changed - We build a distribution of nifi, gzipping the flow.xml and string-replacing properties in the config files with maven - We then can install this "version" of our nifi app. We want to be able to use this to test our flows and processes on our test system before making it live in production. But like I said, we have yet to actually use this for production deployments. On Wed, Feb 17, 2016 at 7:21 PM, Jeff - Data Bean Australia < databean...@gmail.com> wrote: > Thanks Matt for describing the feature in such an intuitive way, and > pointing out the location for the archive. > > This looks good. Just wondering whether we also want to archive the > templates along with flow.xml.gz. > > Thanks, > Jeff > > On Thu, Feb 18, 2016 at 11:08 AM, Matthew Clarke < > matt.clarke@gmail.com> wrote: > >> Jeff, >> NiFi gives users the ability to create snapshot backups of their >> flow.xml through the "back-up flow" link found under the "controller >> settings" (icon looks like a wrench and screwdriver in the upper right corner). >> The default nifi.properties configuration will write these back-ups to a >> directory called archive inside the /conf directory, but >> you can of course change where they are written. >> >> Matt >> >> On Wed, Feb 17, 2016 at 4:52 PM, Jeff - Data Bean Australia < >> databean...@gmail.com> wrote: >> >>> Thanks Oleg for sharing this. They are definitely useful. >>> >>> But my question focused more on keeping the data flow definition files' >>> versions, so that Data Flow Developers, or NiFi Cluster Manager in NiFi's >>> terms, can keep track of our work. 
>>> >>> Currently I am using the following command line to generate a formatted >>> XML to put into our Git repository: >>> >>> cat conf/flow.xml.gz | gzip -dc | xmllint --format - >>> >>> >>> >>> >>> On Thu, Feb 18, 2016 at 10:01 AM, Oleg Zhurakousky < >>> ozhurakou...@hortonworks.com> wrote: >>> Jeff, what you are describing is in the works and actively discussed https://cwiki.apache.org/confluence/display/NIFI/Extension+Registry and https://cwiki.apache.org/confluence/display/NIFI/Component+documentation+improvements The last one may not directly speak to the “ExtensionRegistry”, but if you look through the comments there is a whole lot about it since it is dependent. Feel free to participate, but I can say for now that it is slated for the 1.0 release. Cheers Oleg On Feb 17, 2016, at 3:08 PM, Jeff - Data Bean Australia < databean...@gmail.com> wrote: Hi, As my NiFi data flow becomes more and more serious, I need to put it under version control. Since flow.xml.gz is generated automatically and it is saved in a compressed file, I am wondering what would be the best practice regarding version control? Thanks, Jeff -- Data Bean - A Big Data Solution Provider in Australia. >>> >>> >>> -- >>> Data Bean - A Big Data Solution Provider in Australia. >>> >> >> > > > -- > Data Bean - A Big Data Solution Provider in Australia. >
Re: Version Control on NiFi flow.xml
Jeff, "do we have some tool to compare two flow.xml.gz for some subtle changes?" Unfortunately no. That is what Oleg was referring to. We're finding an increasing number of people interested in this sort of Git/diff capability, so we definitely need to get some momentum on it. Making ordering deterministic for the flow and templates should be pretty doable. We already have a feature proposal/JIRA to go after this. Thanks Joe On Wed, Feb 17, 2016 at 5:21 PM, Jeff - Data Bean Australia wrote: > Thanks Matt for describing the feature in such an intuitive way, and > pointing out the location for the archive. > > This looks good. Just wondering whether we also want to archive the > templates along with flow.xml.gz. > > Thanks, > Jeff > > On Thu, Feb 18, 2016 at 11:08 AM, Matthew Clarke > wrote: >> >> Jeff, >> NiFi gives users the ability to create snapshot backups of their >> flow.xml through the "back-up flow" link found under the "controller >> settings" (icon looks like a wrench and screwdriver in the upper right corner). >> The default nifi.properties configuration will write these back-ups to a >> directory called archive inside the /conf directory, but >> you can of course change where they are written. >> >> Matt >> >> On Wed, Feb 17, 2016 at 4:52 PM, Jeff - Data Bean Australia >> wrote: >>> >>> Thanks Oleg for sharing this. They are definitely useful. >>> >>> But my question focused more on keeping the data flow definition files' >>> versions, so that Data Flow Developers, or NiFi Cluster Manager in NiFi's >>> terms, can keep track of our work. 
>>> >>> Currently I am using the following command line to generate a formatted >>> XML to put into our Git repository: >>> >>> cat conf/flow.xml.gz | gzip -dc | xmllint --format - >>> >>> >>> >>> >>> On Thu, Feb 18, 2016 at 10:01 AM, Oleg Zhurakousky >>> wrote: Jeff, what you are describing is in the works and actively discussed https://cwiki.apache.org/confluence/display/NIFI/Extension+Registry and https://cwiki.apache.org/confluence/display/NIFI/Component+documentation+improvements The last one may not directly speak to the “ExtensionRegistry”, but if you look through the comments there is a whole lot about it since it is dependent. Feel free to participate, but I can say for now that it is slated for the 1.0 release. Cheers Oleg On Feb 17, 2016, at 3:08 PM, Jeff - Data Bean Australia wrote: Hi, As my NiFi data flow becomes more and more serious, I need to put it under version control. Since flow.xml.gz is generated automatically and it is saved in a compressed file, I am wondering what would be the best practice regarding version control? Thanks, Jeff -- Data Bean - A Big Data Solution Provider in Australia. >>> >>> >>> >>> -- >>> Data Bean - A Big Data Solution Provider in Australia. >> >> > > > > -- > Data Bean - A Big Data Solution Provider in Australia.
Re: Version Control on NiFi flow.xml
Thanks Joe for pointing out the order issue. Given that, I need to reconsider my approach, because the original thought was to help facilitate existing version control tools, such as Git, and compare different versions on the fly. Given the order issue, this approach doesn't make more sense than simply storing the gz file. In this case, do we have some tool to compare two flow.xml.gz files for subtle changes? I am sure the UI-based auditing is helpful though. On Thu, Feb 18, 2016 at 11:07 AM, Joe Witt wrote: > Jeff > > I think what you're doing is just fine for now. To Oleg's point we > should make it better. > > We do also have a database where each flow change is being written to > from an audit perspective and so we can show in the UI who made what > changes last. That is less about true CM and more about providing a > meaningful user experience. > > The biggest knock for CM of our current flow.xml.gz and for the > templates is that the order in which their components are serialized > is not presently guaranteed, so it means diff won't be meaningful. But > as far as capturing at specific intervals and storing the flow you > should be in good shape with your approach. > > Thanks > Joe > > On Wed, Feb 17, 2016 at 4:52 PM, Jeff - Data Bean Australia > wrote: > > Thanks Oleg for sharing this. They are definitely useful. > > > > But my question focused more on keeping the data flow definition files' > > versions, so that Data Flow Developers, or NiFi Cluster Manager in NiFi's > > terms, can keep track of our work. 
> > > > Currently I am using the following command line to generate a formatted > XML > > to put into our Git repository: > > > > cat conf/flow.xml.gz | gzip -dc | xmllint --format - > > > > > > > > > > On Thu, Feb 18, 2016 at 10:01 AM, Oleg Zhurakousky > > wrote: > >> > >> Jeff, what you are describing is in the works and actively discussed > >> https://cwiki.apache.org/confluence/display/NIFI/Extension+Registry > >> and > >> > >> > https://cwiki.apache.org/confluence/display/NIFI/Component+documentation+improvements > >> > >> The last one may not directly speak to the “ExtensionRegistry”, but if > >> you look through the comments there is a whole lot about it since it is > >> dependent. > >> Feel free to participate, but I can say for now that it is slated for > the 1.0 > >> release. > >> > >> Cheers > >> Oleg > >> > >> On Feb 17, 2016, at 3:08 PM, Jeff - Data Bean Australia > >> wrote: > >> > >> Hi, > >> > >> As my NiFi data flow becomes more and more serious, I need to put it under > >> version control. Since flow.xml.gz is generated automatically and it is > >> saved in a compressed file, I am wondering what would be the best > practice > >> regarding version control? > >> > >> Thanks, > >> Jeff > >> > >> -- > >> Data Bean - A Big Data Solution Provider in Australia. > >> > >> > > > > > > > > -- > > Data Bean - A Big Data Solution Provider in Australia. > -- Data Bean - A Big Data Solution Provider in Australia.
Re: Version Control on NiFi flow.xml
Jeff I think what you're doing is just fine for now. To Oleg's point we should make it better. We do also have a database where each flow change is being written to from an audit perspective and so we can show in the UI who made what changes last. That is less about true CM and more about providing a meaningful user experience. The biggest knock for CM of our current flow.xml.gz and for the templates is that the order in which their components are serialized is not presently guaranteed, so it means diff won't be meaningful. But as far as capturing at specific intervals and storing the flow, you should be in good shape with your approach. Thanks Joe On Wed, Feb 17, 2016 at 4:52 PM, Jeff - Data Bean Australia wrote: > Thanks Oleg for sharing this. They are definitely useful. > > But my question focused more on keeping the data flow definition files' > versions, so that Data Flow Developers, or NiFi Cluster Manager in NiFi's > terms, can keep track of our work. > > Currently I am using the following command line to generate a formatted XML > to put into our Git repository: > > cat conf/flow.xml.gz | gzip -dc | xmllint --format - > > > > > On Thu, Feb 18, 2016 at 10:01 AM, Oleg Zhurakousky > wrote: >> >> Jeff, what you are describing is in the works and actively discussed >> https://cwiki.apache.org/confluence/display/NIFI/Extension+Registry >> and >> >> https://cwiki.apache.org/confluence/display/NIFI/Component+documentation+improvements >> >> The last one may not directly speak to the “ExtensionRegistry”, but if >> you look through the comments there is a whole lot about it since it is >> dependent. >> Feel free to participate, but I can say for now that it is slated for the 1.0 >> release. >> >> Cheers >> Oleg >> >> On Feb 17, 2016, at 3:08 PM, Jeff - Data Bean Australia >> wrote: >> >> Hi, >> >> As my NiFi data flow becomes more and more serious, I need to put it under >> version control. 
Since flow.xml.gz is generated automatically and it is >> saved in a compressed file, I am wondering what would be the best practice >> regarding version control? >> >> Thanks, >> Jeff >> >> -- >> Data Bean - A Big Data Solution Provider in Australia. >> >> > > > > -- > Data Bean - A Big Data Solution Provider in Australia.
Re: Version Control on NiFi flow.xml
Thanks Oleg for sharing this. They are definitely useful. But my question focused more on keeping the data flow definition files' versions, so that Data Flow Developers, or NiFi Cluster Manager in NiFi's terms, can keep track of our work. Currently I am using the following command line to generate a formatted XML to put into our Git repository: cat conf/flow.xml.gz | gzip -dc | xmllint --format - On Thu, Feb 18, 2016 at 10:01 AM, Oleg Zhurakousky < ozhurakou...@hortonworks.com> wrote: > Jeff, what you are describing is in the works and actively discussed > https://cwiki.apache.org/confluence/display/NIFI/Extension+Registry > and > > https://cwiki.apache.org/confluence/display/NIFI/Component+documentation+improvements > > The last one may not directly speak to the “ExtensionRegistry”, but if > you look through the comments there is a whole lot about it since it is > dependent. > Feel free to participate, but I can say for now that it is slated for the 1.0 > release. > > Cheers > Oleg > > On Feb 17, 2016, at 3:08 PM, Jeff - Data Bean Australia < > databean...@gmail.com> wrote: > > Hi, > > As my NiFi data flow becomes more and more serious, I need to put it under > version control. Since flow.xml.gz is generated automatically and it is > saved in a compressed file, I am wondering what would be the best practice > regarding version control? > > Thanks, > Jeff > > -- > Data Bean - A Big Data Solution Provider in Australia. > > > -- Data Bean - A Big Data Solution Provider in Australia.
Re: Version Control on NiFi flow.xml
Jeff, what you are describing is in the works and actively discussed: https://cwiki.apache.org/confluence/display/NIFI/Extension+Registry and https://cwiki.apache.org/confluence/display/NIFI/Component+documentation+improvements The last one may not directly speak to the “ExtensionRegistry”, but if you look through the comments there is a whole lot about it since it is dependent. Feel free to participate, but I can say for now that it is slated for the 1.0 release. Cheers Oleg On Feb 17, 2016, at 3:08 PM, Jeff - Data Bean Australia wrote: Hi, As my NiFi data flow becomes more and more serious, I need to put it under version control. Since flow.xml.gz is generated automatically and it is saved in a compressed file, I am wondering what would be the best practice regarding version control? Thanks, Jeff -- Data Bean - A Big Data Solution Provider in Australia.
Version Control on NiFi flow.xml
Hi, As my NiFi data flow becomes more and more serious, I need to put it under version control. Since flow.xml.gz is generated automatically and it is saved in a compressed file, I am wondering what would be the best practice regarding version control? Thanks, Jeff -- Data Bean - A Big Data Solution Provider in Australia.
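The decompress-and-pretty-print step that the thread converges on (cat conf/flow.xml.gz | gzip -dc | xmllint --format -) can also be done portably with the Python standard library, e.g. as a pre-commit step. A minimal sketch, not tied to any NiFi API:

```python
# Decompress flow.xml.gz and pretty-print it so that Git diffs are
# line-oriented; a stdlib equivalent of `gzip -dc | xmllint --format -`.
import gzip
import xml.dom.minidom

def pretty_flow_xml(gz_path):
    """Return the indented XML text of a gzipped flow definition."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        return xml.dom.minidom.parseString(f.read()).toprettyxml(indent="  ")
```

As Joe points out earlier in the thread, component serialization order is not guaranteed, so even a pretty-printed flow.xml can produce noisy diffs between saves; this only makes the stored snapshots readable.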
Re: Maximum attribute size
Lars, You are right about the thought process. We've never provided solid guidance here but we should. It is definitely the case that flow file content is streamed to and from the underlying repository and the only way to access it is through that API. Thus well-behaved extensions and the framework itself can basically handle data as large as the underlying repository has space for. The flow file attributes, though, are held in memory in a map with each flowfile object. So it is important to avoid having vast (undefined) quantities of attributes or attributes with really large (undefined) values. There are things we can and should do to make even this relatively transparent to the users, and it is actually why we support swapping flowfiles to disk when there are large queues, because even those in-memory attributes can really add up. Thanks Joe On Wed, Feb 17, 2016 at 11:06 AM, Lars Francke wrote: > Hi and sorry for all these questions. > > I know that FlowFile content is persisted to the content_repository and can > handle reasonably large amounts of data. Is the same true for attributes? > > I download JSON files (up to 200kb I'd say) and I want to insert them as > they are into a PostgreSQL JSONB column. I'd love to use the PutSQL > processor for that but it requires parameters in attributes. > > I have a feeling that putting large objects in attributes is a bad idea?
Re: Nifi 'as a service'?
Keaton, You can definitely build a REST service in NiFi! I would take a look at HandleHttpRequest [1] and HandleHttpResponse [2]. HandleHttpRequest would be the entry point of your service; the FlowFiles coming out of this processor represent the requests being made. You can then perform whatever logic you need and send a response back with HandleHttpResponse. Let us know if that doesn't make sense. Thanks, Bryan [1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.HandleHttpRequest/index.html [2] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.HandleHttpResponse/index.html On Wed, Feb 17, 2016 at 12:58 PM, Keaton Cleve wrote: > Hi, > > Would it be possible to use NiFi 'as a service'? If yes, what would be the > best pattern? > > Here is what I have in mind: > > I would like to set up a template with different possible predetermined > destinations. But instead of having predefined sources that I would query > with a cron-like GetFile or GetHDFS, I would like to have a REST API as an > entry point so users can request to copy a file from any directory into > one or several of the predetermined destinations (this would involve some > routing to the correct processor I guess). The REST API would only allow > specifying sources that the template allows, but the specific directory / > file would be dynamic. > > Does that make any sense?
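For intuition, the shape Bryan describes (request in, routing/validation, response out) is the same as any small HTTP handler. Below is a standalone sketch of that pattern outside NiFi; the /copy endpoint, its query parameters, and the destination whitelist are all hypothetical stand-ins for the "predetermined destinations" idea and do not come from any NiFi API:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

ALLOWED_DESTINATIONS = {"hdfs", "archive"}  # hypothetical predetermined targets

class CopyRequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The query string plays the role of the incoming request's
        # attributes: which destination, which source path.
        query = parse_qs(urlparse(self.path).query)
        dest = query.get("dest", [""])[0]
        ok = dest in ALLOWED_DESTINATIONS  # the "routing to the correct processor" step
        body = json.dumps({"accepted": ok, "dest": dest}).encode()
        self.send_response(200 if ok else 400)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

def start_server():
    """Bind to an ephemeral port and serve in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), CopyRequestHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In the NiFi version, HandleHttpRequest plays the role of `do_GET`'s input side, RouteOnAttribute does the whitelist check, and HandleHttpResponse sends the status code back.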
Re: Generate URL based on different conditions
Thank you Matt and Joe for your help. On Wed, Feb 17, 2016 at 4:22 PM, Matt Burgess wrote: > Here's a Gist template that uses Joe's approach of RouteOnAttribute then > UpdateAttribute to generate URLs with the use case you described: > https://gist.github.com/mattyb149/8fd87efa1338a70c > > On Tue, Feb 16, 2016 at 9:51 PM, Joe Witt wrote: > >> Jeff, >> >> For each of the input files could it be that you would pull data from >> multiple URLs? >> >> Have you had a chance to learn about the NiFi Expression Language? >> That will come in quite handy for constructing the URL used in >> InvokeHTTP. >> >> The general pattern I think makes sense here is: >> - Gather Data >> - Extract Features from data to construct URL >> - Fetch document/response from URL >> >> During 'Gather Data' you acquire the files. >> >> During 'Extract Features' you pull out elements of the content of the >> file into flow file attributes. You can use RouteOnAttribute to send >> to an UpdateAttribute processor which constructs a new attribute of >> URL pattern A or URL pattern B respectively. You can also collapse >> that into a single UpdateAttribute, possibly using the advanced UI, and >> set specific URLs based on patterns of attributes. Lots of ways to >> slice that. >> >> During 'Fetch document' you should be able to just have a single >> InvokeHTTP, potentially, which looks at some attribute you've defined, >> say 'the-url', and specify in InvokeHTTP the remote URL value to be >> "${the-url}" >> >> We should publish a template for this pattern/approach if we've not >> already, but let's see how you progress and decide what would be most >> useful for others. >> >> Thanks >> Joe >> >> On Tue, Feb 16, 2016 at 9:36 PM, Jeff - Data Bean Australia >> wrote: >> > Hi, >> > >> > I got a use case like this: >> > >> > There are two files, say fileA and fileB; both of them contain multiple >> > lines of items and are used to generate URLs. However, the algorithms for >> > generating URLs are different. 
If items come from fileA, the URL >> template > looks like this: >> > >> > foo--foo >> > >> > If items come from fileB, the template looks like this: >> > >> > bar--foo--whatever >> > >> > I am going to create a NiFi template for the Data Flow from reading >> the >> > list file up to downloading data using InvokeHTTP, and place an >> > UpdateAttribute processor in front of the template to feed in different >> file >> > names (I have only two files). >> > >> > The problem I have so far is how to generate the URLs based on different >> > input, so that I can make a general NiFi template for reusability. >> > >> > Thanks, >> > Jeff >> > >> > >> > >> > -- >> > Data Bean - A Big Data Solution Provider in Australia. >> > > -- Data Bean - A Big Data Solution Provider in Australia.
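Joe's RouteOnAttribute + UpdateAttribute pattern boils down to: look at an attribute (here, the source file name), pick a template, fill it in, and let InvokeHTTP read the result via ${the-url}. A sketch of that logic in plain Python; the URL patterns in the thread were mangled by the archive, so the templates below are purely hypothetical stand-ins:

```python
# Pick a URL template by source file, then substitute the extracted item.
# Mirrors RouteOnAttribute (the dict lookup) + UpdateAttribute (the format).
URL_TEMPLATES = {
    "fileA": "http://example.com/a/{item}",        # hypothetical pattern A
    "fileB": "http://example.com/b/{item}/extra",  # hypothetical pattern B
}

def build_url(source_file, item):
    """Return the fetch URL for one line item, based on its source file."""
    try:
        template = URL_TEMPLATES[source_file]
    except KeyError:
        raise ValueError("no URL template for source: %s" % source_file)
    return template.format(item=item)
```

In the flow itself, the chosen URL would land in an attribute (say the-url) and InvokeHTTP's Remote URL property would be set to ${the-url}, exactly as Joe describes.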
Nifi 'as a service'?
Hi, Would it be possible to use NiFi 'as a service'? If yes, what would be the best pattern? Here is what I have in mind: I would like to set up a template with different possible predetermined destinations. But instead of having predefined sources that I would query with a cron-like GetFile or GetHDFS, I would like to have a REST API as an entry point so users can request to copy a file from any directory into one or several of the predetermined destinations (this would involve some routing to the correct processor, I guess). The REST API would only allow specifying sources that the template allows, but the specific directory / file would be dynamic. Does that make any sense?