Boris,

We also use them for data cleanup. A common pattern I established on my team is to script out a service with ScriptedLookupService and use it either to regenerate a missing field from other fields or to rewrite a field containing bad data.
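To make that concrete, here is a minimal plain-Python sketch of the idea (not the actual NiFi ScriptedLookupService API, which is implemented in Groovy/Jython against NiFi interfaces): a lookup keyed on the record's other fields either regenerates a missing field or overwrites one holding bad data. The field names and the rebuild rule are made up for illustration.

```python
def lookup(coordinates):
    """Stand-in for a scripted lookup: derive full_name from
    first/last name fields when it is missing or malformed."""
    first = coordinates.get("first_name", "").strip()
    last = coordinates.get("last_name", "").strip()
    if first and last:
        return f"{first} {last}"
    return None  # no result -> record would go to the 'unmatched' route

def clean(record):
    """Mimic what LookupRecord does with the service's result:
    replace the target field with the looked-up value."""
    result = lookup(record)
    if result is not None:
        record["full_name"] = result
    return record

rec = {"first_name": " Boris", "last_name": "Tyukin ", "full_name": "???"}
print(clean(rec)["full_name"])  # Boris Tyukin
```

In the real flow, LookupRecord hands the configured fields to the service as the lookup coordinates and writes the returned value back into the record, so the script only has to implement the lookup itself.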
On Fri, Feb 22, 2019 at 2:38 PM Boris Tyukin <[email protected]> wrote:

> awesome, thanks, guys! I will try both options but lookup makes a lot of
> sense and probably will be easier to support and understand.
>
> We are planning to get NiFi 1.9 soon too, really excited with all the new
> features especially load balancing connections and Hive 1.1 processor.
> Which is funny because we just created ours to work with Hive on CDH.
>
> Kudu based lookup also sounds great - we love Kudu and started using it
> recently for real-time replication of Oracle databases into our cluster.
>
> Boris
>
> On Fri, Feb 22, 2019 at 1:14 PM Mike Thomsen <[email protected]> wrote:
>
>> @Boris
>>
>> Mark's approach will work for a lot of scenarios. I've used it
>> extensively with different clients.
>>
>> On Fri, Feb 22, 2019 at 1:10 PM Mark Payne <[email protected]> wrote:
>>
>>> This is certainly a better route to go than my previous suggestion :)
>>> Have one flow that grabs one of the datasets and stores it somewhere.
>>> In a CSV or XML file, even. Then, have a second flow that pulls the
>>> other dataset and uses LookupRecord to perform the enrichment. The
>>> CSVLookupService and XMLLookupService would automatically reload when
>>> the data is updated.
>>> We should probably have a JDBCLookupService as well, which would allow
>>> for dynamic lookups against a database. I thought that existed already
>>> but does not appear to. Point is, you can look at DataSet A as the
>>> 'reference dataset' and DataSet B as the 'streaming dataset' and then
>>> use LookupRecord in order to do the enrichment/join.
>>>
>>> Unfortunately, I don't seem to be able to find any blogs that describe
>>> this pattern, but it would certainly make for a good blog.
>>> Generally, you'd have two flows set up, though:
>>>
>>> Flow A (get the enrichment dataset):
>>> ExecuteSQLRecord (write as CSV) -> PutFile
>>>
>>> Flow B (enrich the other dataset):
>>> ExecuteSQLRecord -> LookupRecord (uses a CSVLookupService that loads
>>> the file written by the other flow) -> PublishKafkaRecord_2_0
>>>
>>> Thanks
>>> -Mark
>>>
>>> On Feb 22, 2019, at 12:30 PM, Joe Witt <[email protected]> wrote:
>>>
>>> I should add you can use NiFi to update the reference dataset in a
>>> database/backing store in one flow, and have another flow that handles
>>> the live stream/lookup, etc. Mark Payne/others: I think there are blogs
>>> that describe this pattern. Anyone have links?
>>>
>>> On Fri, Feb 22, 2019 at 12:27 PM Joe Witt <[email protected]> wrote:
>>>
>>>> Boris,
>>>>
>>>> Great. So have a process to load the periodic dataset into a lookup
>>>> service. Could be backed by a simple file, a database, Hive, whatever.
>>>> Then have the live flow run against that.
>>>>
>>>> This reminds me - we should make a Kudu based lookup service, I think.
>>>> I'll chat with some of our new Kudu friends on this.
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Feb 22, 2019 at 12:25 PM Boris Tyukin <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks Joe and Bryan. In this case I don't need to do it in real-time,
>>>>> probably once a day only.
>>>>>
>>>>> I am thinking to trigger both pulls with a GenerateFlowFile processor,
>>>>> then merge the datasets somehow, since the flowfile id will be the
>>>>> same for both sets. And then I need to join somehow.
>>>>>
>>>>> Would like to use NiFi still :)
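For anyone finding this thread later, here is a plain-Python sketch of the two-flow pattern Mark describes (not NiFi code): "Flow A" persists the reference dataset as CSV, and "Flow B" loads it into a keyed table the way a CSVLookupService would and enriches each streaming record before publishing. The file layout, key column, and field names are hypothetical.

```python
import csv, io

# "Flow A": the reference dataset written out as CSV
# (an in-memory file here, standing in for PutFile's output).
reference_csv = io.StringIO()
writer = csv.DictWriter(reference_csv, fieldnames=["dept_id", "dept_name"])
writer.writeheader()
writer.writerows([{"dept_id": "10", "dept_name": "Radiology"},
                  {"dept_id": "20", "dept_name": "Cardiology"}])

# "Flow B": build the lookup table (the CSVLookupService analogue)...
reference_csv.seek(0)
dept_lookup = {row["dept_id"]: row["dept_name"]
               for row in csv.DictReader(reference_csv)}

# ...then enrich each streaming record (the LookupRecord analogue).
def enrich(record):
    record["dept_name"] = dept_lookup.get(record["dept_id"], "UNKNOWN")
    return record

stream = [{"patient": "A", "dept_id": "10"},
          {"patient": "B", "dept_id": "99"}]
enriched = [enrich(r) for r in stream]
print(enriched[0]["dept_name"])  # Radiology
```

The nice property of the NiFi version is that the two flows are decoupled: Flow A can refresh the CSV on whatever schedule fits the reference data, and the lookup service picks up the new file without the streaming flow having to stop.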
