This is certainly a better route to go than my previous suggestion :) Have one flow that grabs one of the datasets and stores it somewhere, even in a CSV or XML file. Then have a second flow that pulls the other dataset and uses LookupRecord to perform the enrichment. The CSVLookupService and XMLLookupService would automatically reload when the data is updated. We should probably have a JDBCLookupService as well, which would allow for dynamic lookups against a database. I thought that existed already, but it does not appear to. The point is, you can treat DataSet A as the 'reference dataset' and DataSet B as the 'streaming dataset' and then use LookupRecord to do the enrichment/join.
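For readers new to the pattern, the reference-vs-streaming enrichment Mark describes can be sketched outside NiFi in a few lines of Python. The field names (`user_id`, `region`, `amount`) are illustrative only, not from the thread:

```python
# Reference dataset (DataSet A): a keyed lookup table, refreshed periodically.
reference = {
    "u1": {"region": "EMEA"},
    "u2": {"region": "APAC"},
}

def enrich(record, lookup):
    """Join one streaming record (DataSet B) against the reference dataset,
    roughly the way LookupRecord merges a looked-up value into each record."""
    match = lookup.get(record["user_id"])
    if match:
        record = {**record, **match}
    return record

stream = [{"user_id": "u1", "amount": 10}, {"user_id": "u3", "amount": 5}]
enriched = [enrich(r, reference) for r in stream]
# Records with no match pass through unenriched.
```

In NiFi the lookup table would live behind a LookupService and the merge strategy would be configured on the LookupRecord processor, but the join itself is this simple key lookup.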
Unfortunately, I haven't been able to find any blogs that describe this pattern, but it would certainly make for a good one. Generally, though, you'd set up two flows:

Flow A (get the enrichment dataset): ExecuteSQLRecord (write as CSV) -> PutFile
Flow B (enrich the other dataset): ExecuteSQLRecord -> LookupRecord (uses a CSVLookupService that loads the file written by the other flow) -> PublishKafkaRecord_2_0

Thanks
-Mark

On Feb 22, 2019, at 12:30 PM, Joe Witt <[email protected]> wrote:

> I should add that you can use NiFi to update the reference dataset in a database/backing store in one flow, and have another flow that handles the live stream/lookup, etc.
>
> Mark Payne/others: I think there are blogs that describe this pattern. Anyone have links?

On Fri, Feb 22, 2019 at 12:27 PM Joe Witt <[email protected]> wrote:

> Boris,
>
> Great. So have a process to load the periodic dataset into a lookup service. It could be backed by a simple file, a database, Hive, whatever. Then have the live flow run against that.
>
> This reminds me - we should make a Kudu-based lookup service, I think. I'll chat with some of our new Kudu friends about this.
>
> Thanks

On Fri, Feb 22, 2019 at 12:25 PM Boris Tyukin <[email protected]> wrote:

> Thanks Joe and Bryan. In this case I don't need to do it in real time, probably only once a day. I am thinking of triggering both pulls with a GenerateFlowFile processor, then merging the datasets somehow, since the flowfile ID will be the same for both sets. And then I need to join them somehow. Would like to use NiFi still :)
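The reason the two-flow setup works is the auto-reload behavior mentioned above: Flow A keeps rewriting the CSV file, and the lookup service picks up the new contents without restarting Flow B. A minimal Python sketch of that reload-on-change idea (the class name, file path, and column names are made up for illustration; the real CSVLookupService has its own configuration):

```python
import csv
import os

class CsvLookup:
    """Hypothetical stand-in for a CSV-backed lookup service: loads a
    key -> row map from a CSV file and reloads it whenever the file's
    modification time changes, so a writer flow can rewrite the file
    while a reader flow keeps looking up fresh values."""

    def __init__(self, path, key_column):
        self.path = path
        self.key_column = key_column
        self._mtime = None
        self._table = {}

    def _reload_if_changed(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:
            with open(self.path, newline="") as f:
                self._table = {row[self.key_column]: row
                               for row in csv.DictReader(f)}
            self._mtime = mtime

    def lookup(self, key):
        self._reload_if_changed()
        return self._table.get(key)  # None when the key is absent
```

Checking the mtime on every lookup is the simplest policy; a production service would more likely poll on an interval or watch the file, but the contract is the same: readers always see the latest reference dataset.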
