Awesome, thanks, guys! I will try both options, but the lookup makes a lot of
sense and will probably be easier to support and understand.

We are planning to get NiFi 1.9 soon too, and are really excited about all the
new features, especially load-balanced connections and the Hive 1.1 processor.
Which is funny, because we just created our own to work with Hive on CDH.

A Kudu-based lookup also sounds great - we love Kudu and started using it
recently for real-time replication of Oracle databases into our cluster.

Boris



On Fri, Feb 22, 2019 at 1:14 PM Mike Thomsen <[email protected]> wrote:

> @Boris
>
> Mark's approach will work for a lot of scenarios. I've used it extensively
> with different clients.
>
> On Fri, Feb 22, 2019 at 1:10 PM Mark Payne <[email protected]> wrote:
>
>> This is certainly a better route to go than my previous suggestion :)
>> Have one flow that grabs one of the datasets and stores it somewhere.
>> In a CSV or XML file, even. Then, have a second flow that pulls the other
>> dataset and uses LookupRecord to perform
>> the enrichment. The CSVLookupService and XMLLookupService would
>> automatically reload when the data is updated.
>> We should probably have a JDBCLookupService as well, which would allow
>> for dynamic lookups against a database. I
>> thought one existed already, but it does not appear to. The point is, you can
>> look at DataSet A as the 'reference dataset' and
>> DataSet B as the 'streaming dataset' and then use LookupRecord in order
>> to do the enrichment/join.
>>
>> Unfortunately, I don't seem to be able to find any blogs that describe
>> this pattern, but it would certainly make for a good
>> blog. Generally, you'd have two flows set up, though:
>>
>> Flow A (get the enrichment dataset):
>> ExecuteSQLRecord (write as CSV) -> PutFile
>>
>> Flow B (enrich the other dataset):
>> ExecuteSQLRecord -> LookupRecord (uses a CSVLookupService that loads the
>> file written by the other flow) -> PublishKafkaRecord_2_0
>>
>> Thanks
>> -Mark
>>
>>
>> On Feb 22, 2019, at 12:30 PM, Joe Witt <[email protected]> wrote:
>>
>> I should add that you can use NiFi to update the reference dataset in a
>> database/backing store in one flow, and have another flow that handles the
>> live stream/lookup, etc.  Mark Payne/others: I think there are blogs that
>> describe this pattern.  Anyone have links?
>>
>> On Fri, Feb 22, 2019 at 12:27 PM Joe Witt <[email protected]> wrote:
>>
>>> Boris,
>>>
>>> Great.  So have a process to load the periodic dataset into a lookup
>>> service.  It could be backed by a simple file, a database, Hive, whatever.
>>> Then have the live flow run against that.
>>>
>>> This reminds me - we should make a Kudu-based lookup service, I think.
>>> I'll chat with some of our new Kudu friends on this.
>>>
>>> Thanks
>>>
>>> On Fri, Feb 22, 2019 at 12:25 PM Boris Tyukin <[email protected]>
>>> wrote:
>>>
>>>> Thanks Joe and Bryan. In this case I don't need to do it in real-time,
>>>> probably once a day only.
>>>>
>>>> I am thinking of triggering both pulls with a GenerateFlowFile processor,
>>>> then merging the datasets somehow, since the flowfile ID will be the same
>>>> for both sets, and then joining them somehow.
>>>>
>>>> I would still like to use NiFi :)
>>>>
>>>
>>
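For reference, the enrichment pattern Mark describes above (load a reference
dataset into a lookup service, then join the streaming dataset against it with
LookupRecord) can be sketched conceptually in plain Python. This is only a
minimal illustration of the idea, not NiFi code; the field names and sample
data are assumptions:

```python
import csv
import io

# Flow A equivalent: the "reference dataset" written out as CSV
# (in NiFi: ExecuteSQLRecord -> PutFile). Fields are illustrative.
reference_csv = """dept_id,dept_name
10,Cardiology
20,Radiology
"""

def load_lookup(csv_text, key_field):
    """Load the reference CSV into a dict keyed on key_field,
    roughly what a CSV-backed lookup service does with the file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_field]: row for row in reader}

def enrich(records, lookup, key_field):
    """Flow B equivalent: for each streaming record, look up the
    matching reference row and merge its fields in (LookupRecord)."""
    enriched = []
    for rec in records:
        match = lookup.get(rec.get(key_field))
        merged = dict(rec)
        if match is not None:
            merged.update(match)
        enriched.append(merged)
    return enriched

lookup = load_lookup(reference_csv, "dept_id")
stream = [{"emp": "alice", "dept_id": "10"},
          {"emp": "bob", "dept_id": "20"}]
result = enrich(stream, lookup, "dept_id")
print(result[0]["dept_name"])  # Cardiology
```

Records with no match in the reference dataset simply pass through unenriched,
which mirrors how you would typically route unmatched records in LookupRecord.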
