Boris,

We also use them for data cleanup. A common pattern I established on my
team is to script out a service with ScriptedLookupService and use it
either to regenerate a missing field from other fields or to rewrite a
field that contains bad data.
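To make that concrete, here is a minimal standalone sketch of the kind of lookup logic such a scripted service can implement. The function name and field names below are illustrative only, not NiFi's actual scripting API; in NiFi the script would implement the lookup service interface and receive the record's fields as lookup coordinates.

```python
# Illustrative sketch (not NiFi's real API): regenerate a missing or bad
# "full_name" field from "first_name" and "last_name", the way a
# ScriptedLookupService might when handed a record's fields as coordinates.

def lookup(coordinates):
    """Return the cleaned or regenerated value for the record, or None."""
    first = (coordinates.get("first_name") or "").strip()
    last = (coordinates.get("last_name") or "").strip()
    full = coordinates.get("full_name")

    # Regenerate the field if it is missing or contains a known-bad value.
    if not full or full.strip().lower() in ("", "n/a", "unknown"):
        if first and last:
            return first + " " + last
        return None

    # Otherwise just normalize the existing value.
    return full.strip()
```

The flow would then use LookupRecord with this service to overwrite the target field on each record, so the cleanup rule lives in one place instead of being scattered across processors.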

On Fri, Feb 22, 2019 at 2:38 PM Boris Tyukin <[email protected]> wrote:

> awesome, thanks, guys! I will try both options, but lookup makes a lot of
> sense and will probably be easier to support and understand.
>
> We are planning to get NiFi 1.9 soon too, and we're really excited about
> all the new features, especially load-balanced connections and the Hive
> 1.1 processor. Which is funny, because we just created our own to work
> with Hive on CDH.
>
> Kudu based lookup also sounds great - we love Kudu and started using it
> recently for real-time replication of Oracle databases into our cluster.
>
> Boris
>
>
>
> On Fri, Feb 22, 2019 at 1:14 PM Mike Thomsen <[email protected]>
> wrote:
>
>> @Boris
>>
>> Mark's approach will work for a lot of scenarios. I've used it
>> extensively with different clients.
>>
>> On Fri, Feb 22, 2019 at 1:10 PM Mark Payne <[email protected]> wrote:
>>
>>> This is certainly a better route to go than my previous suggestion :)
>>> Have one flow that grabs one of the datasets and stores it somewhere.
>>> In a CSV or XML file, even. Then, have a second flow that pulls the
>>> other dataset and uses LookupRecord to perform
>>> the enrichment. The CSVLookupService and XMLLookupService would
>>> automatically reload when the data is updated.
>>> We should probably have a JDBCLookupService as well, which would allow
>>> for dynamic lookups against a database. I thought that existed already,
>>> but it does not appear to. The point is, you can treat DataSet A as the
>>> 'reference dataset' and DataSet B as the 'streaming dataset', and then
>>> use LookupRecord to do the enrichment/join.
>>>
>>> Unfortunately, I don't seem to be able to find any blogs that describe
>>> this pattern, but it would certainly make for a good
>>> blog. Generally, you'd have two flows setup, though:
>>>
>>> Flow A (get the enrichment dataset):
>>> ExecuteSQLRecord (write as CSV) -> PutFile
>>>
>>> Flow B (enrich the other dataset):
>>> ExecuteSQLRecord -> LookupRecord (uses a CSVLookupService that loads the
>>> file written by the other flow) -> PublishKafkaRecord_2_0
>>>
>>> Thanks
>>> -Mark
>>>
>>>
>>> On Feb 22, 2019, at 12:30 PM, Joe Witt <[email protected]> wrote:
>>>
>>> I should add that you can use NiFi to update the reference dataset in a
>>> database/backing store in one flow, and have another flow that handles
>>> the live stream/lookup, etc. Mark Payne / others: I think there are
>>> blogs that describe this pattern. Anyone have links?
>>>
>>> On Fri, Feb 22, 2019 at 12:27 PM Joe Witt <[email protected]> wrote:
>>>
>>>> Boris,
>>>>
>>>> Great.  So have a process to load the periodic dataset into a lookup
>>>> service.  Could be backed by a simple file, a database, Hive, whatever.
>>>> Then have the live flow run against that.
>>>>
>>>> This reminds me - we should make a Kudu-based lookup service, I think.
>>>> I'll chat with some of our new Kudu friends about this.
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Feb 22, 2019 at 12:25 PM Boris Tyukin <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks Joe and Bryan. In this case I don't need to do it in real-time,
>>>>> probably once a day only.
>>>>>
>>>>> I am thinking of triggering both pulls with a GenerateFlowFile
>>>>> processor, then merging the datasets somehow, since the flowfile id
>>>>> will be the same for both sets, and then joining them somehow.
>>>>>
>>>>> Would still like to use NiFi :)
>>>>>
>>>>
>>>