Thanks Matt. In my data flow I am expected to perform certain validations on data. I am loading some SQL Server data into HDFS using Sqoop (not part of the NiFi flow). For each record in the HDFS file I have to query another database and then save the validated record back to HDFS, where it will be processed by some Spark jobs.
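To make the validation step concrete, here is a minimal sketch, not the actual flow. It assumes the reference data is small enough to read once into an in-memory map so that each record is checked without a database round trip. The table name `reference`, the `status` column, the rule "status must be active", and the in-memory SQLite database standing in for the second database are all hypothetical.

```python
import sqlite3


def load_reference_cache(conn):
    """Read the (hypothetical) reference table once into a dict keyed by id."""
    rows = conn.execute("SELECT id, status FROM reference").fetchall()
    return {row[0]: row[1] for row in rows}


def validate_records(records, cache):
    """Yield only the records whose id maps to an 'active' reference row."""
    for record in records:
        if cache.get(record["id"]) == "active":
            yield record


# Stand-in for the database used for validation (assumption: SQLite in memory).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reference (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO reference VALUES (?, ?)",
    [(1, "active"), (2, "inactive")],
)

# One query fills the cache; validation then needs no further queries.
cache = load_reference_cache(conn)

# Stand-in for the records read from the HDFS file.
records = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
valid = list(validate_records(records, cache))
```

In this sketch only the record with id 1 passes, since id 2 is inactive and id 3 is missing from the reference table. A shared cache such as Redis would replace the local dict if several nodes need the same lookup data.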
Since I have to query for each record, I was planning to cache the database records against which I have to validate the HDFS records. That is why I was evaluating the DistributedCacheServer, but it looks like its purpose is different. Alternatively, can we integrate Redis or another distributed cache with NiFi? I do not see any processor for it.

Appreciate your help.

Thanks & Regards,

Sudeep

On Tue, Jan 12, 2016 at 6:59 PM, Matthew Clarke <[email protected]> wrote:

> Sudeep,
>     I was a little off on my second scenario. The DetectDuplicate
> processor uses the distributed cache service all on its own. Files that are
> routed through it are loaded into the cache if they do not already exist in
> the cache. If they do already exist, they are routed to duplicate. The
> PutDistributedCache processor was a community contribution, and there
> are no processors that make use of the info that it caches.
>
>     We should probably build a processor that would make use of the
> data that can be loaded by the PutDistributedCache processor. Is there a
> particular use case you are trying to solve where this would be applicable?
>
> Thanks,
> Matt
>
> On Tue, Jan 12, 2016 at 8:11 AM, Matthew Clarke <[email protected]>
> wrote:
>
>> Sudeep,
>>     The DistributedMapCache is typically used to prevent the consumption
>> of duplicate data by some of the ingest-type processors (GetHBase,
>> ListHDFS, and ListSFTP). NiFi uses the service to keep a listing of what
>> has been consumed so the same files are not consumed multiple times. The
>> service can also be used to detect if duplicate data already exists within
>> a NiFi instance or cluster. This would be the scenario where some source is
>> pushing data to your NiFi and perhaps they push the same data more than
>> once. You want to catch these duplicates so you can perhaps kick them out
>> of your flow. For this you would use the PutDistributedCache processor to
>> cache all incoming data and then use the DetectDuplicate processor to find
>> those duplicates.
>>
>> Was there a different use case you were looking to solve using the
>> distributed cache service?
>>
>> Thanks,
>> Matt
>>
>> On Tue, Jan 12, 2016 at 4:36 AM, sudeep mishra <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I can cache some data to be used in a NiFi flow. I can see the
>>> processor PutDistributedMapCache in the documentation, which saves key-value
>>> pairs in DistributedMapCache for NiFi, but I do not see any processor to read
>>> this data. How can I read data from DistributedMapCache in my data flow?
>>>
>>>
>>> Thanks & Regards,
>>>
>>> Sudeep Shekhar Mishra
>>>
>>>
>>
>

--
Thanks & Regards,

Sudeep Shekhar Mishra

+91-9167519029
[email protected]
