@Boris: Mark's approach will work for a lot of scenarios. I've used it extensively with different clients.
On Fri, Feb 22, 2019 at 1:10 PM Mark Payne <[email protected]> wrote:

> This is certainly a better route to go than my previous suggestion :) Have
> one flow that grabs one of the datasets and stores it somewhere. In a CSV
> or XML file, even. Then have a second flow that pulls the other dataset and
> uses LookupRecord to perform the enrichment. The CSVLookupService and
> XMLLookupService would automatically reload when the data is updated.
> We should probably have a JDBCLookupService as well, which would allow for
> dynamic lookups against a database. I thought that existed already, but it
> does not appear to. The point is, you can treat DataSet A as the 'reference
> dataset' and DataSet B as the 'streaming dataset', and then use LookupRecord
> in order to do the enrichment/join.
>
> Unfortunately, I don't seem to be able to find any blogs that describe this
> pattern, but it would certainly make for a good blog. Generally, you'd have
> two flows set up, though:
>
> Flow A (get the enrichment dataset):
> ExecuteSQLRecord (write as CSV) -> PutFile
>
> Flow B (enrich the other dataset):
> ExecuteSQLRecord -> LookupRecord (uses a CSVLookupService that loads the
> file written by the other flow) -> PublishKafkaRecord_2_0
>
> Thanks
> -Mark
>
>
> On Feb 22, 2019, at 12:30 PM, Joe Witt <[email protected]> wrote:
>
> I should add that you can use NiFi to update the reference dataset in a
> database/backing store in one flow, and have another flow that handles the
> live stream/lookup, etc. Mark Payne/others: I think there are blogs that
> describe this pattern. Anyone have links?
>
> On Fri, Feb 22, 2019 at 12:27 PM Joe Witt <[email protected]> wrote:
>
>> Boris,
>>
>> Great. So have a process to load the periodic dataset into a lookup
>> service. Could be backed by a simple file, a database, Hive, whatever.
>> Then have the live flow run against that.
>>
>> This reminds me - we should make a Kudu-based lookup service, I think.
>> I'll chat with some of our new Kudu friends about this.
>>
>> Thanks
>>
>> On Fri, Feb 22, 2019 at 12:25 PM Boris Tyukin <[email protected]>
>> wrote:
>>
>>> Thanks Joe and Bryan. In this case I don't need to do it in real time,
>>> probably only once a day.
>>>
>>> I am thinking of triggering both pulls with a GenerateFlowFile
>>> processor, then merging the datasets somehow, since the flowfile ID
>>> will be the same for both sets. And then I need to join somehow.
>>>
>>> Would like to use NiFi still :)
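For anyone skimming the thread, here is a minimal Python sketch of the pattern Mark describes: Flow A materializes the reference dataset as a CSV, and Flow B joins each streaming record against it, the way LookupRecord with a CSVLookupService would. The field names (`dept_id`, `dept_name`, `patient`) and the inlined CSV are purely hypothetical illustration, not anything from the actual flows discussed.

```python
import csv
import io

# Hypothetical reference dataset. In NiFi terms, Flow A
# (ExecuteSQLRecord -> PutFile) would write this CSV to disk;
# it is inlined here so the sketch is self-contained.
REFERENCE_CSV = """dept_id,dept_name
10,Cardiology
20,Radiology
"""

def load_lookup(csv_text, key_field):
    """Build an in-memory lookup table keyed on key_field,
    roughly what a CSVLookupService does when it loads the file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_field]: row for row in reader}

def enrich(records, lookup, key_field):
    """Join each streaming record against the reference dataset,
    analogous to LookupRecord in Flow B."""
    for rec in records:
        match = lookup.get(rec.get(key_field))
        if match:
            # Copy the enrichment fields onto the record.
            rec.update({k: v for k, v in match.items() if k != key_field})
        yield rec

lookup = load_lookup(REFERENCE_CSV, "dept_id")
stream = [{"patient": "A", "dept_id": "10"},
          {"patient": "B", "dept_id": "20"}]
enriched = list(enrich(stream, lookup, "dept_id"))
# enriched[0] -> {'patient': 'A', 'dept_id': '10', 'dept_name': 'Cardiology'}
```

Reloading the lookup when Flow A rewrites the file (which the CSVLookupService handles automatically in NiFi) is omitted here for brevity.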
