Re: Nifi provenance indexing throughput if it is being used as an event store
Thanks, Joe. Given that we would like to add a few attributes and have them indexed for provenance, the mentioned rate should be alright?

Cheers,
Ali

On Sat, Feb 16, 2019 at 2:56 PM Joe Witt wrote:
> Ali
>
> You certainly can, and at the rates you mention you should be able to keep
> it for a good while.
>
> Just set the properties you need for your system and measure the rate at
> which prov storage fills.
>
> Thanks
>
> On Fri, Feb 15, 2019 at 10:29 PM Ali Nazemian wrote:
>
>> I didn't mean to use NiFi provenance search for an external provenance
>> search. I meant to use it for internal provenance search, but keep the
>> provenance for longer than usual. That is, instead of expecting it to keep
>> provenance data for only a few days, use it as an event store, since it
>> also provides search capability.
>>
>> Regards,
>> Ali
>>
>> On Sat, Feb 16, 2019 at 5:29 AM Andrew Grande wrote:
>>
>>> NiFi provenance searches are not a good integration pattern for external
>>> systems. I.e. using them to periodically fetch history burdens the cluster
>>> (those searches can be heavy) and disrupts normal processing SLAs.
>>>
>>> Pushing provenance events out to an external system (potentially even
>>> filtered down to components of interest) is a much more predictable
>>> pattern and provides lots of flexibility in how to interpret the events.
>>>
>>> Andrew
>>>
>>> On Thu, Feb 14, 2019, 11:26 PM Ali Nazemian wrote:
>>>
>>>> Can I expect the NiFi provenance search part to do the job for me?
>>>>
>>>> On Fri, 15 Feb. 2019, 13:21 Mike Thomsen wrote:
>>>>
>>>>> Ali,
>>>>>
>>>>> There is a site-to-site publishing task for provenance that you can
>>>>> add as a controller-level reporting task, which would be great here.
>>>>> It'll just take all of your provenance data periodically and ship it
>>>>> off to another NiFi server or cluster that can process all of the
>>>>> provenance data as blocks of JSON. A common pattern there is to filter
>>>>> down to the events you want and publish to Elasticsearch.
>>>>>
>>>>> On Thu, Feb 14, 2019 at 7:05 PM Ali Nazemian wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am investigating how NiFi provenance can be used as an event store
>>>>>> over a long period of time. Our use case is very bursty: sometimes we
>>>>>> may not receive any events for a while, and sometimes we get burst
>>>>>> traffic. On average, around 1000 eps is the expected throughput at
>>>>>> this stage. NiFi has a powerful provenance repository that also lets
>>>>>> you index on selected attributes. I am investigating how reliable it
>>>>>> is to use the NiFi provenance store for a long period of time with
>>>>>> indexing enabled for a few extra attributes. Has anybody used NiFi
>>>>>> provenance at this scale? Can lots of Lucene indices create other
>>>>>> issues within NiFi, given that provenance uses Lucene for indexing?
>>>>>>
>>>>>> P.S.: Our use case is pretty light for NiFi, as we are not going to
>>>>>> have any ETL and NiFi is being used mostly as an orchestrator of
>>>>>> multiple microservices.
>>>>>>
>>>>>> Regards,
>>>>>> Ali
>>>>>>
>>>>>> --
>>>>>> A.Nazemian

--
A.Nazemian
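[Editor's note] The pattern Mike describes above — ship provenance events as blocks of JSON, filter down to the events of interest, and publish to Elasticsearch — can be sketched roughly as below. This is a minimal illustration, not NiFi code; the field names (`eventType`, `componentId`, etc.) follow the site-to-site provenance JSON but should be verified against your NiFi version, and the index name is an assumption.

```python
import json

# Assumption: only these event types are worth keeping long-term.
WANTED_EVENT_TYPES = {"RECEIVE", "SEND", "DROP"}

def filter_events(raw_json, wanted_types=WANTED_EVENT_TYPES):
    """Parse one block of provenance JSON and keep only interesting events."""
    events = json.loads(raw_json)
    docs = []
    for event in events:
        if event.get("eventType") not in wanted_types:
            continue
        # Trim each event down to the fields worth indexing.
        docs.append({
            "eventId": event.get("eventId"),
            "eventType": event.get("eventType"),
            "componentId": event.get("componentId"),
            "timestampMillis": event.get("timestampMillis"),
        })
    return docs

def to_bulk_body(docs, index="nifi-provenance"):
    """Render the filtered documents as an Elasticsearch _bulk request body."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["eventId"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

In a real deployment this filtering could live in the receiving NiFi cluster itself (e.g. a record processor in front of the Elasticsearch publisher) rather than in external code; the point is that the heavy querying happens outside the production cluster.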
Re: Record-oriented DetectDuplicate?
Andrew, Mark, etc.

A new contributor alerted me on Jira that he did his own take on this
processor. I encouraged him to join the dev list so we can discuss the use
case in more depth and sort out the best way forward. See
https://issues.apache.org/jira/browse/NIFI-6047

I'll give him a little while to join and announce he's ready to go over it
before I move forward with a discussion on this.

On Sat, Feb 9, 2019 at 12:34 PM Mike Thomsen wrote:
> PR if anyone is interested:
>
> https://github.com/apache/nifi/pull/3298
>
> On Fri, Feb 8, 2019 at 5:34 PM Mike Thomsen wrote:
>
>> With Redis and HBase you can set a TTL on the data itself in the lookup
>> table. Were you thinking something more than that?
>>
>> On Fri, Feb 8, 2019 at 4:42 PM Andrew Grande wrote:
>>
>>> Can I suggest a time-based option for specifying the window? I think we
>>> only mentioned the number of records.
>>>
>>> Andrew
>>>
>>> On Fri, Feb 8, 2019, 8:22 AM Mike Thomsen wrote:
>>>
>>>> Thanks. That answers it succinctly for me. I'll build out a
>>>> DetectDuplicateRecord processor to handle this.
>>>>
>>>> On Fri, Feb 8, 2019 at 11:17 AM Mark Payne wrote:
>>>>
>>>>> Matt,
>>>>>
>>>>> That would work if you want to select distinct records in a given
>>>>> FlowFile but not across FlowFiles.
>>>>> PartitionRecord -> UpdateAttribute (optionally to combine multiple
>>>>> attributes into one) -> DetectDuplicate
>>>>> would work, but given that you expect the records to be unique
>>>>> generally, this would have the effect of splitting each FlowFile into
>>>>> record-per-FlowFile, which is certainly not ideal.
>>>>>
>>>>> Thanks
>>>>> -Mark
>>>>>
>>>>>> On Feb 8, 2019, at 11:14 AM, Matt Burgess wrote:
>>>>>>
>>>>>> Mike,
>>>>>>
>>>>>> I don't think so, but you could try a SELECT DISTINCT in QueryRecord;
>>>>>> it might be a bit of a pain if you want to select all columns and
>>>>>> there are lots of them.
>>>>>>
>>>>>> Alternatively you could try PartitionRecord -> QueryRecord (select *
>>>>>> limit 1). Neither PartitionRecord nor QueryRecord keeps state, so
>>>>>> you'd likely need to use a distributed cache or UpdateAttribute.
>>>>>>
>>>>>> Regards,
>>>>>> Matt
>>>>>>
>>>>>> On Fri, Feb 8, 2019 at 11:08 AM Mike Thomsen wrote:
>>>>>>>
>>>>>>> Do we have anything like DetectDuplicate for the Record API already?
>>>>>>> Didn't see anything, but wanted to ask before reinventing the wheel.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Mike
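[Editor's note] The idea discussed in this thread — detect duplicate records keyed on a chosen set of fields, with the time-based window Andrew asks about — can be sketched as below. This is a minimal illustration, not the actual DetectDuplicateRecord implementation; the in-process dict stands in for the distributed cache (Redis/HBase with TTL) that the NiFi processor would use.

```python
import time

class RecordDeduplicator:
    """Flags a record as duplicate if its key was seen within a TTL window."""

    def __init__(self, key_fields, ttl_seconds, clock=time.monotonic):
        self.key_fields = key_fields
        self.ttl = ttl_seconds
        self.clock = clock
        self.seen = {}  # composite key -> time the key was last recorded

    def _key(self, record):
        # Composite key from the configured fields, like combining
        # attributes via UpdateAttribute before DetectDuplicate.
        return tuple(record.get(f) for f in self.key_fields)

    def is_duplicate(self, record):
        now = self.clock()
        key = self._key(record)
        last_seen = self.seen.get(key)
        if last_seen is not None and now - last_seen < self.ttl:
            return True
        self.seen[key] = now  # new key, or an expired entry being refreshed
        return False

def split_records(records, dedup):
    """Route records into (unique, duplicates), like a processor's relationships."""
    unique, dupes = [], []
    for r in records:
        (dupes if dedup.is_duplicate(r) else unique).append(r)
    return unique, dupes
```

Using an external cache keyed this way avoids the record-per-FlowFile splitting Mark warns about: the whole FlowFile stays intact and only the keys travel to the cache.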
Re: 1.9 release date?
Wondering the same thing.

Boris

On Sat, Feb 16, 2019 at 10:41 AM dan young wrote:
> Heya folks,
>
> Any insight on 1.9 release date? Looks like a lot of goodies and fixes
> included...
>
> Regards,
>
> Dano
Re: 1.9 release date?
Dan,

We did RC1 this week and will have RC2 up today or tomorrow, ideally.

Thanks

On Sat, Feb 16, 2019, 10:42 AM dan young wrote:
> Heya folks,
>
> Any insight on 1.9 release date? Looks like a lot of goodies and fixes
> included...
>
> Regards,
>
> Dano
1.9 release date?
Heya folks,

Any insight on 1.9 release date? Looks like a lot of goodies and fixes
included...

Regards,

Dano