Re: PutDistributedMapCache

sudeep mishra Tue, 12 Jan 2016 06:47:59 -0800

Thanks Matt.

In my data flow I need to perform certain validations on the data. I am
loading some SQL Server data into HDFS using Sqoop (not part of the NiFi flow).
For each record in the HDFS file I have to query another database and then save
the validated record back to HDFS, where it will be processed by some Spark
jobs.


Since I have to run a query for each record, I was planning to cache the
database records against which I have to validate the HDFS data. That is why I
was evaluating the DistributedCacheServer, but it looks like its purpose is
different. Alternatively, can we integrate Redis or another distributed
cache with NiFi? I do not see any processor for it.
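
To make the idea concrete, here is a rough sketch of the lookup I have in
mind. It is plain Java and does not use any NiFi or Redis API; the class
name, the method names, and the "value must match" validation rule are all
hypothetical, and an in-memory HashMap simply stands in for whatever cache
ends up being used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: a HashMap stands in for the distributed cache.
public class ReferenceCacheSketch {
    private final Map<String, String> cache = new HashMap<>();

    // Load one reference row from the validation database into the cache.
    public void put(String key, String value) {
        cache.put(key, value);
    }

    // Look up a cached reference row; empty if the key was never loaded.
    public Optional<String> get(String key) {
        return Optional.ofNullable(cache.get(key));
    }

    // Assumed rule: a record is valid only if its key is cached and the
    // cached value equals the value carried by the HDFS record.
    public boolean isValid(String recordKey, String recordValue) {
        return get(recordKey)
                .map(expected -> expected.equals(recordValue))
                .orElse(false);
    }

    public static void main(String[] args) {
        ReferenceCacheSketch cache = new ReferenceCacheSketch();
        cache.put("42", "ACTIVE");                 // loaded once, up front
        System.out.println(cache.isValid("42", "ACTIVE"));
        System.out.println(cache.isValid("7", "ACTIVE"));
    }
}
```

The point is that the reference data is loaded once and each HDFS record
then costs only a key lookup instead of a database query.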

Appreciate your help.

Thanks & Regards,

Sudeep


On Tue, Jan 12, 2016 at 6:59 PM, Matthew Clarke <[email protected]>
wrote:

> Sudeep,
>        I was a little off on my second scenario.  The DetectDuplicate
> processor uses the distributed cache service all on its own.  Files that are
> routed through it are loaded into the cache if they do not already exist in
> the cache.  If they do already exist, they are routed to duplicate.  The
> PutDistributedMapCache processor was a community contribution, and there
> are currently no processors that make use of the info that it caches.
>
>        We should probably build a processor that would make use of the
> data that can be loaded by the PutDistributedMapCache processor.  Is there a
> particular use case you are trying to solve where this would be applicable?
>
> Thanks,
> Matt
>
> On Tue, Jan 12, 2016 at 8:11 AM, Matthew Clarke <[email protected]>
> wrote:
>
>> Sudeep,
>>     The DistributedMapCache is typically used to prevent the consumption
>> of duplicate data by some of the ingest-type processors (GetHBase,
>> ListHDFS, and ListSFTP).  NiFi uses the service to keep a listing of what
>> has been consumed so the same files are not consumed multiple times. The
>> service can also be used to detect whether duplicate data already exists
>> within a NiFi instance or cluster. This would be the scenario where some
>> source is pushing data to your NiFi and perhaps pushes the same data more
>> than once. You want to catch these duplicates so you can kick them out
>> of your flow. For this you would use the PutDistributedMapCache processor to
>> cache all incoming data and then use the DetectDuplicate processor to find
>> those duplicates.
>>
>>     Was there a different use case you were looking to solve using the
>> Distributed cache service?
>>
>> Thanks,
>> Matt
>>
>> On Tue, Jan 12, 2016 at 4:36 AM, sudeep mishra <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I can cache some data to be used in a NiFi flow. I can see the
>>> processor PutDistributedMapCache in the documentation, which saves key-value
>>> pairs in the DistributedMapCache for NiFi, but I do not see any processor to
>>> read this data. How can I read data from the DistributedMapCache in my data
>>> flow?
>>>
>>>
>>> Thanks & Regards,
>>>
>>> Sudeep Shekhar Mishra
>>>
>>>
>>
>


-- 
Thanks & Regards,

Sudeep Shekhar Mishra

+91-9167519029
[email protected]
