Re: opensearch/elasticsearch with pagination

Chris Sampson Wed, 16 Aug 2023 21:24:00 -0700

Elasticsearch doesn't expose a CDC-like capability (its internal translog 
isn't something a client can subscribe to for changes), so that approach isn't 
possible.


What I've done previously is to maintain an audit log in a separate index 
within Elasticsearch to track what data I've already processed, e.g. this 
might be the last "updated_date" value read from the data index in a previous 
run of the NiFi processor. So your NiFi flow would be something like:

1. Query the audit index for the latest processed updated_date
2. Run a paginated query for all data newer than that value
3. Determine the new latest updated_date (e.g. using QueryRecord)
4. Put the new latest updated_date into Elasticsearch, ready for the next run
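
As a rough illustration of the flow above (outside NiFi), here is a minimal 
Python sketch of the two pieces of logic involved: building the query for 
"everything newer than the last run" and working out the next value to store. 
The function names and the page size are hypothetical, and it assumes 
"updated_date" holds ISO-8601 strings in a single timezone so that plain 
string comparison orders them correctly. It only builds request bodies and 
makes no calls to Elasticsearch.

```python
from typing import Iterable, Optional

# Field name taken from the description above; adjust to match your index.
WATERMARK_FIELD = "updated_date"


def build_incremental_query(last_seen: Optional[str], page_size: int = 1000) -> dict:
    """Build a search body for documents updated after `last_seen`.

    With no stored value yet (first run), fall back to match_all.
    """
    if last_seen is None:
        query = {"match_all": {}}
    else:
        query = {"range": {WATERMARK_FIELD: {"gt": last_seen}}}
    return {
        "size": page_size,
        "query": query,
        # A stable sort is needed for search_after style pagination.
        "sort": [{WATERMARK_FIELD: "asc"}],
    }


def next_watermark(hits: Iterable[dict], previous: Optional[str]) -> Optional[str]:
    """Return the largest updated_date among `hits`, or `previous` if none is newer."""
    latest = previous
    for hit in hits:
        value = hit["_source"][WATERMARK_FIELD]
        if latest is None or value > latest:
            latest = value
    return latest
```

One thing to watch with a strict "gt" comparison: documents that share the 
stored timestamp but were indexed after the previous run would be skipped, so 
you may prefer "gte" plus de-duplication on the document ID.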

On 2023/08/16 23:15:19 Richard Beare wrote:
> One further question - what is the recommended way of checking for updates
> in an index and fetching new records in a similar manner to
> GenerateTableFetch for an sql DB?
> Thanks
> 
> On Thu, Aug 17, 2023 at 7:21 AM Richard Beare <richard.be...@gmail.com>
> wrote:
> 
> > Sounds perfect. Thanks
> >
> > On Thu, Aug 17, 2023 at 5:11 AM Chris Sampson <chr...@apache.org> wrote:
> >
> >> What you describe sounds like the processor is working as designed &
> >> documented, i.e. it will restart the same query once it has reached the end
> >> of the paginated scroll (or search_after, or point-in-time) query.
> >>
> >> Instead, it sounds like you want to try using the
> >> PaginatedJsonQueryElasticsearch [1] processor. This will execute
> >> the query given to it, either as the query property or the body of an
> >> incoming FlowFile, output the results, and then stop.
> >>
> >>
> >> [1]
> >> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-elasticsearch-restapi-nar/1.23.0/org.apache.nifi.processors.elasticsearch.PaginatedJsonQueryElasticsearch/index.html
> >>
> >> On 2023/08/16 07:57:43 Richard Beare wrote:
> >> > Hi,
> >> > I am using the SearchElasticSearch (1.20.0) processor to retrieve all
> >> > documents (~20M) from an index, process and eventually return results
> >> > to a new index, although for this test I'm retrieving and processing then
> >> > discarding. I'm using opensearch.
> >> >
> >> > My problem is that the process restarts after completion - I discovered
> >> > this, and docs confirm, after seeing warnings from my processing code
> >> > (which reformats json ready for other work) being repeated for the same
> >> > document ID.
> >> >
> >> > How do I configure the processor to stop after completing the first
> >> > query?
> >> >
> >> > I've tried the following:
> >> >
> >> > Query: {"query" : {"match_all" :{}}}
> >> >
> >> > with pagination_type SCROLL
> >> >
> >> > I haven't found a combination of the properties that doesn't lead to
> >> > repeated cycles through the index.
> >> >
> >> > I've also tried
> >> > {"query" : {"match_all" :{}}, "sort" : [{"Visit_DateTime" : "asc"}]}
> >> >
> >> > and SEARCH_AFTER pagination type, with the same problem.
> >> >
> >> > What am I missing?
> >> > Thanks
> >> >
> >>
> >
>
