Hello Vikram,

Can you please clarify -- is the question about how to run the job only on data in some time interval (for example, from one hour ago up to now), or about how to guarantee that the job reads only records that are new since the last DQ job run (even if the job ran twice in the same hour)?

If it is the first, you can refer to the Partition Configuration section of the user guide <https://github.com/apache/griffin/blob/master/griffin-doc/ui/user-guide.md>, or to the "where" field example in the API guide <https://github.com/apache/griffin/blob/master/griffin-doc/service/api-guide.md#add-measure>.

If it is the second -- there is no mechanism in Griffin to track which records have or have not been processed (at least in batch mode), and doing something like that would require custom code.
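For the interval case, the relevant pieces are the connector's "data.unit" and "where" settings in the measure JSON (the UI's Partition Configuration generates the same thing). Below is a rough sketch of one data source entry, assuming your table is partitioned by dt/hour as you describe -- the database, table, and connector names are placeholders, so please adapt them to your setup and double-check the placeholder syntax against the api-guide link above for your Griffin version:

  "data.sources": [
    {
      "name": "source",
      "connectors": [
        {
          "name": "src_connector",
          "type": "HIVE",
          "version": "1.2",
          "data.unit": "1hour",
          "config": {
            "database": "default",
            "table.name": "your_table",
            "where": "dt=#YYYYMMdd# AND hour=#HH#"
          }
        }
      ]
    }
  ]

With "data.unit" set to "1hour" and the #YYYYMMdd#/#HH# placeholders in the "where" clause, each scheduled run should read only the partition for the hour it was triggered for, rather than scanning the whole table.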
On Thu, Feb 14, 2019 at 4:31 AM Vikram Jain <[email protected]> wrote:

> Hi,
>
> I have a Hive table partitioned date-wise and hour-wise. Data is coming in
> every hour or 2 hours in the table. I am using Griffin to perform certain
> profiling and accuracy checks on the data.
>
> However, I want only the new data that was accumulated after the last job
> run to be processed. The job is scheduled to run every hour.
>
> Right now, Griffin is picking up all the data present in the Hive table
> (new data accumulated in the past hour + past data already processed by the
> Griffin job previously). I believe there should be some configuration
> while creating a measure and job to avoid this scenario and process only
> the data acquired in the last hour. I have tried various permutations and
> combinations but have not been successful.
>
> Can someone please tell me the list of steps and configurations in the UI
> that I need to follow in order to achieve the desired result?
>
> Any help is much appreciated.
>
> Regards,
>
> Vikram
