This is certainly a better route to go than my previous suggestion :) Have one flow that grabs one of the datasets and stores it somewhere, even in a CSV or XML file. Then have a second flow that pulls the other dataset and uses LookupRecord to perform the enrichment. The CSVLookupService and XMLLookupService would automatically reload when the data is updated. We should probably have a JDBCLookupService as well, which would allow for dynamic lookups against a database. I thought that existed already, but it does not appear to. The point is, you can treat DataSet A as the 'reference dataset' and DataSet B as the 'streaming dataset' and then use LookupRecord to do the enrichment/join.
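For readers new to the pattern, the reference-vs-streaming enrichment Mark describes can be sketched outside NiFi in a few lines of Python. The field names (`user_id`, `region`, `amount`) are illustrative only, not from the thread:

```python
# Reference dataset (DataSet A): a keyed lookup table, refreshed periodically.
reference = {
    "u1": {"region": "EMEA"},
    "u2": {"region": "APAC"},
}

def enrich(record, lookup):
    """Join one streaming record (DataSet B) against the reference dataset,
    roughly the way LookupRecord merges a looked-up value into each record."""
    match = lookup.get(record["user_id"])
    if match:
        record = {**record, **match}
    return record

stream = [{"user_id": "u1", "amount": 10}, {"user_id": "u3", "amount": 5}]
enriched = [enrich(r, reference) for r in stream]
# Records with no match pass through unenriched.
```

In NiFi the lookup table would live behind a LookupService and the merge strategy would be configured on the LookupRecord processor, but the join itself is this simple key lookup.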
Unfortunately, I haven't been able to find any blogs that describe this pattern, but it would certainly make for a good one. Generally, though, you'd set up two flows:

Flow A (get the enrichment dataset): ExecuteSQLRecord (write as CSV) -> PutFile
Flow B (enrich the other dataset): ExecuteSQLRecord -> LookupRecord (uses a CSVLookupService that loads the file written by the other flow) -> PublishKafkaRecord_2_0

Thanks
-Mark

On Feb 22, 2019, at 12:30 PM, Joe Witt <[email protected]> wrote:

> I should add that you can use NiFi to update the reference dataset in a database/backing store in one flow, and have another flow that handles the live stream/lookup, etc.
>
> Mark Payne/others: I think there are blogs that describe this pattern. Anyone have links?

On Fri, Feb 22, 2019 at 12:27 PM Joe Witt <[email protected]> wrote:

> Boris,
>
> Great. So have a process to load the periodic dataset into a lookup service. It could be backed by a simple file, a database, Hive, whatever. Then have the live flow run against that.
>
> This reminds me - we should make a Kudu-based lookup service, I think. I'll chat with some of our new Kudu friends about this.
>
> Thanks

On Fri, Feb 22, 2019 at 12:25 PM Boris Tyukin <[email protected]> wrote:

> Thanks Joe and Bryan. In this case I don't need to do it in real time, probably only once a day. I am thinking of triggering both pulls with a GenerateFlowFile processor, then merging the datasets somehow, since the flowfile ID will be the same for both sets. And then I need to join them somehow. Would like to use NiFi still :)
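The reason the two-flow setup works is the auto-reload behavior mentioned above: Flow A keeps rewriting the CSV file, and the lookup service picks up the new contents without restarting Flow B. A minimal Python sketch of that reload-on-change idea (the class name, file path, and column names are made up for illustration; the real CSVLookupService has its own configuration):

```python
import csv
import os

class CsvLookup:
    """Hypothetical stand-in for a CSV-backed lookup service: loads a
    key -> row map from a CSV file and reloads it whenever the file's
    modification time changes, so a writer flow can rewrite the file
    while a reader flow keeps looking up fresh values."""

    def __init__(self, path, key_column):
        self.path = path
        self.key_column = key_column
        self._mtime = None
        self._table = {}

    def _reload_if_changed(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:
            with open(self.path, newline="") as f:
                self._table = {row[self.key_column]: row
                               for row in csv.DictReader(f)}
            self._mtime = mtime

    def lookup(self, key):
        self._reload_if_changed()
        return self._table.get(key)  # None when the key is absent
```

Checking the mtime on every lookup is the simplest policy; a production service would more likely poll on an interval or watch the file, but the contract is the same: readers always see the latest reference dataset.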
