Elasticsearch doesn't have a CDC-like (change data capture) capability - it doesn't maintain a transaction log or similar - so that approach isn't possible.
What I've done previously is to maintain an audit log in a separate index within Elasticsearch to track what data has already been processed, e.g. the last "updated_date" value read from the data index in the previous run of the NiFi processor.

So your NiFi flow would be something like:

1. Query the audit index for the latest processed updated_date
2. Run a paginated query for all data newer than that value
3. Determine the new latest updated_date (e.g. using QueryRecord)
4. Put the new latest updated_date back into the audit index, ready for the next run

(A rough sketch of the queries for each step is at the end of this message.)

On 2023/08/16 23:15:19 Richard Beare wrote:
> One further question - what is the recommended way of checking for updates
> in an index and fetching new records in a similar manner to
> GenerateTableFetch for an SQL DB?
> Thanks
>
> On Thu, Aug 17, 2023 at 7:21 AM Richard Beare <richard.be...@gmail.com> wrote:
>
> > Sounds perfect. Thanks
> >
> > On Thu, Aug 17, 2023 at 5:11 AM Chris Sampson <chr...@apache.org> wrote:
> >
> >> What you describe sounds like the processor is working as designed &
> >> documented, i.e. it will restart the same query once it has reached the
> >> end of the paginated scroll (or search_after, or point-in-time) query.
> >>
> >> Instead, it sounds like you want to try the
> >> PaginatedJsonQueryElasticsearch [1] processor. This will execute the
> >> query given to it, either as the Query property or the body of an
> >> incoming FlowFile, output the results, and then stop.
> >>
> >> [1]
> >> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-elasticsearch-restapi-nar/1.23.0/org.apache.nifi.processors.elasticsearch.PaginatedJsonQueryElasticsearch/index.html
> >>
> >> On 2023/08/16 07:57:43 Richard Beare wrote:
> >> > Hi,
> >> > I am using the SearchElasticsearch (1.20.0) processor to retrieve all
> >> > documents (~20M) from an index, process them and eventually return the
> >> > results to a new index, although for this test I'm retrieving and
> >> > processing, then discarding. I'm using OpenSearch.
> >> >
> >> > My problem is that the process restarts after completion - I discovered
> >> > this, and the docs confirm it, after seeing warnings from my processing
> >> > code (which reformats JSON ready for other work) repeated for the same
> >> > document ID.
> >> >
> >> > How do I configure the processor to stop after completing the first
> >> > query?
> >> >
> >> > I've tried the following:
> >> >
> >> > Query: {"query" : {"match_all" : {}}}
> >> >
> >> > with pagination_type SCROLL.
> >> >
> >> > I haven't found a combination of the properties that doesn't lead to
> >> > repeated cycles through the index.
> >> >
> >> > I've also tried {"query" : {"match_all" : {}}, "sort" :
> >> > [{"Visit_DateTime" : "asc"}]}
> >> >
> >> > and the SEARCH_AFTER pagination type, with the same problem.
> >> >
> >> > What am I missing?
> >> > Thanks
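
In case a concrete sketch of the above helps - the index name "nifi_audit", the document id "last_run" and the field "updated_date" below are just placeholder names, substitute whatever fits your data and NiFi version:

Step 1 - read the watermark from the audit index (e.g. with GetElasticsearch or JsonQueryElasticsearch, if those are available in your version):

    GET nifi_audit/_doc/last_run

which returns something like:

    {"_id": "last_run", "_source": {"last_updated_date": "2023-08-01T00:00:00Z"}}

Step 2 - use that value in a range query over the data index, e.g. as the Query property (or incoming FlowFile body) of PaginatedJsonQueryElasticsearch:

    {
      "query": {
        "range": {
          "updated_date": { "gt": "2023-08-01T00:00:00Z" }
        }
      },
      "sort": [ { "updated_date": "asc" } ]
    }

Steps 3/4 - once QueryRecord (or similar) has worked out the new maximum updated_date from the results, write it back over the same audit document (e.g. with PutElasticsearchJson), ready for the next run:

    PUT nifi_audit/_doc/last_run
    { "last_updated_date": "2023-08-16T12:34:56Z" }

This is only a rough outline - you'd pass the watermark value between processors via FlowFile attributes or content, and the exact processors and properties depend on your setup.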