Just picking up on a couple of comments/threads within this chain (for clarity).
Environment: * NiFi 2.0.0-SNAPSHOT (i.e. current “main” latest, but should be pretty much the same for the Elasticsearch processors as the current 1.23.2 release) * Elasticsearch 8.9.1 If I run the SearchElasticsearch with an “aggs”-only query and a “size” of 0, I cannot get the processor to work due to the following Elasticsearch validation responses: * Pagination Type = SCROLL: “Could not query documents. java.lang.IllegalArgumentException: Query using pit/search_after must contain a "sort" field” * Pagination Type = SEARCH_AFTER: “{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: [size] cannot be [0] in a scroll context;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: [size] cannot be [0] in a scroll context;"},"status":400}" So, basically, it doesn’t appear possible to have an eggs-only setup with the Pagination-based processors, but that kind of makes sense really - they’re aimed at paginating through hits (documents), not aggregations. If I run the same “eggs”-only query with “size” set to 0 through the JsonQueryElasticsearch processor, this succeeds with a single output to the “aggregations” relationship, with nothing output to the “hits” relationship (unless I set the “Output No Hits” property to true - this is false by default). I’ve also raised NIFI-11985 [1] as a feature request to add a ConsumeElasticsearch (or similar) processor in a future version of NiFi, which might help with this kind of use case in future. [1] https://issues.apache.org/jira/browse/NIFI-11985 Cheers, --- Chris Sampson IT Consultant chris.samp...@naimuri.com > On 21 Aug 2023, at 10:17, Richard Beare <richard.be...@gmail.com> wrote: > > I discovered the jsonqueryelasticsearch just before your message arrived, and > it looks to be returning aggregates only if I set size: 0, so I think that is > the place to start for this problem. > > On Mon, Aug 21, 2023 at 5:41 PM Chris Sampson <chr...@apache.org > <mailto:chr...@apache.org>> wrote: >> Using SearchElasticsearch for just an aggregation feels like it might not be >> the right choice (maybe look at JsonQueryElasticsearch instead), or are the >> dates constantly changing, i.e. new data is always appearing, so you want to >> keep triggering the flow, and you want to use this as the starting processor >> of your flow? You could, if course, set this processor to only run on a >> slower schedule e.g. once per hour/day, etc. N.B. JsonQueryElasticsearch >> allows, but doesn't require, an incoming connection, so you can use it as an >> initial processor within a flow. >> >> Do you have "Output No Hits" set to true? That would explain the empty >> flowfile behaviour. If not, I know that there have been a couple of >> changes/fixes in that area in recent versions (you mention your on 1.20.0, >> latest now is 1.23.1), so it could be something that had now been fixed, or >> a bug with the processor. If you have "No Hits" set to false, please raise a >> jira with a much detail about your processor settings as you can provide, >> and it could be something for the community to look at fixing in a new >> version (or checking whether it's still an issue in the latest versions, if >> you're not in a position to try that yourself). >> >> On 2023/08/21 06:59:58 Richard Beare wrote: >> > I'm repeatedly selecting the min and max date stamp using a >> > SearchElasticSearch processor to begin creating the query generator. >> > >> > The query looks like: >> > { >> > "size" : 0, >> > "aggs" : { >> > "newest" : { "max" : { "field" : "Visit_DateTime"}}, >> > "oldest" : { "min" : { "field" : "Visit_DateTime"}} >> > } >> > } >> > >> > This seems to work, but I always end up with a document in the "hit" >> > relationship, rather than just the aggregation. I can terminate that >> > relationship, but it seems strange. >> > >> > On Sun, Aug 20, 2023 at 3:33 PM Chris Sampson <chr...@apache.org >> > <mailto:chr...@apache.org>> wrote: >> > >> > > To retrieve large quantities of data from Elasticsearch into nifi, yes, >> > > it's probably the best way we have. >> > > >> > > The processor's don't currently use slicing (parallelism) internally for >> > > the Elasticsearch queries, but as you're writing a query for every month, >> > > you could increase the processor's Concurrency, and therefore run >> > > multiple >> > > queries in parallel at that level - bear in mind the impact this will >> > > have >> > > on your system resources and bandwidth, so test it and increase the >> > > Concurrency incrementally. >> > > >> > > If/when you get to the point of having less data to pull, e.g. just the >> > > most recent data, you could switch to one of the other Elasticsearch >> > > processors of you wanted, but sticking with the Paginated processors >> > > would >> > > give some safety for occasionally having large amounts of data to pull - >> > > the point of pagination primarily being to reduce the impact on your >> > > Elasticsearch instance/cluster and network. >> > > >> > > In terms of flowfile size in nifi, there's nothing wrong having multiple >> > > GB or even TB of content in a single file, but ideally you'd want to >> > > stick >> > > to Record-based processors if you need to make any changes to the content >> > > once it's in nifi. >> > > >> > > The flowfile size might be important for your data destination, but >> > > again, >> > > you can always split flowfiles up in nifi, e.g. using SplitRecord or >> > > PartitionRecord, etc. >> > > >> > > On 2023/08/19 23:06:50 Richard Beare wrote: >> > > > Good points - I've done some testing. >> > > > >> > > > About 1-2 minutes for 1 month's data with 1k page sizes and about half >> > > that >> > > > for 10k. About 8-10 minutes for 1 years worth of data at 10k pages. >> > > > >> > > > Per month looks like the sweet spot in terms of size - that's about >> > > > 500-750MB. >> > > > >> > > > In terms of building the upstream tools to generate the queries, is the >> > > > paginatedjsonquery the way to go to retrieve the oldest and most recent >> > > > date from an index? >> > > > >> > > > >> > > > ~ >> > > > >> > > > On Sun, Aug 20, 2023 at 1:53 AM Chris Sampson <chr...@apache.org >> > > > <mailto:chr...@apache.org>> wrote: >> > > > >> > > > > I'd guess it depends on what you want to achieve downstream, e.g. >> > > > > would >> > > > > setting the query processor to output per_query and return everything >> > > in 1 >> > > > > to be useful? Internally, the processor is so fetching everything in >> > > pages >> > > > > from Elasticsearch, setting the size higher will reduce the number of >> > > > > network round-trips, but note that nifi will hold the entire response >> > > from >> > > > > Elasticsearch in memory until it is written to a flowfile - this is >> > > fine >> > > > > before the next loop within the processor, even if the prices session >> > > isn't >> > > > > committed and you don't see the output for a while. >> > > > > >> > > > > You've a choice to make between number of network calls (page >> > > > > fetches), >> > > > > number of queries (which kind of amounts to the same thing really), >> > > page >> > > > > size in memory (will impact both nifi and elasticsearch, as well as >> > > network >> > > > > performance), and number of flowfiles you want to deal with >> > > > > downstream >> > > - >> > > > > having all your data in a single flowfile might be useful, if you can >> > > use >> > > > > Record-based processors for everything you want to do later - the >> > > > > fewer >> > > > > flowfiles you have, the more performance your flow is likely to be >> > > (general >> > > > > oversimplification). >> > > > > >> > > > > How long did it take for you to fetch a day of data using 1k page >> > > sizes? >> > > > > Did it work if you up page size to 10k? How about 10k page for a >> > > > > month >> > > or a >> > > > > whole year? >> > > > > >> > > > > If you decide to break up the query by time range, e.g. years or >> > > months, >> > > > > then a python or groovy script is certainly an option in order to >> > > generate >> > > > > the parameters (e.g. attributes on a flowfile) to feed into the >> > > > > query. >> > > > > >> > > > > On 2023/08/19 05:05:39 Richard Beare wrote: >> > > > > > A bit of progress. >> > > > > > First up, firing a match_all at my index with 20M documents doesn't >> > > work, >> > > > > > as you probably expected. Or more precisely, is unlikely to be >> > > useful - I >> > > > > > left it overnight and nothing appeared to have happened, so I guess >> > > it >> > > > > was >> > > > > > madly fetching pages and filling up available storage. >> > > > > > >> > > > > > So I tested with a query of the form >> > > > > > { >> > > > > > query": { >> > > > > > "range" : { >> > > > > > "Visit_DateTime": { >> > > > > > "gte" : "01/07/2020", >> > > > > > "lte" : "02/07/2020", >> > > > > > "format" : "dd/MM/yyyy" >> > > > > > } >> > > > > > } >> > > > > > } >> > > > > > } >> > > > > > >> > > > > > i.e a single days worth of documents (38998 according to a curl >> > > _count >> > > > > > version of the query). This did indeed produce 3900 flowfiles in >> > > > > > the >> > > hits >> > > > > > queue and consume the input. >> > > > > > >> > > > > > Including a size parameter as follows: >> > > > > > >> > > > > > { >> > > > > > "size" : 1000, >> > > > > > query": { >> > > > > > "range" : { >> > > > > > "Visit_DateTime": { >> > > > > > "gte" : "01/07/2020", >> > > > > > "lte" : "02/07/2020", >> > > > > > "format" : "dd/MM/yyyy" >> > > > > > } >> > > > > > } >> > > > > > } >> > > > > > } >> > > > > > >> > > > > > Leads to 39 flowfiles in the hits queue. >> > > > > > >> > > > > > So it looks like my best way forward processing many years worth of >> > > data >> > > > > is >> > > > > > to generate a set of day-based queries. Is a python script the best >> > > > > option? >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > On Fri, Aug 18, 2023 at 4:03 PM Chris Sampson <chr...@apache.org >> > > > > > <mailto:chr...@apache.org>> >> > > wrote: >> > > > > > >> > > > > > > Ah, so these processors have all been written for Elasticsearch, >> > > and >> > > > > use >> > > > > > > the Elasticsearch low-level REST API library to form connections. >> > > > > They've >> > > > > > > not been tested against OpenSearch, although hopefully should >> > > > > > > work >> > > for >> > > > > any >> > > > > > > interactions where the API is the same, but the two products >> > > continue >> > > > > to >> > > > > > > diverge, so there's increasing chance that some things won't >> > > > > > > work. >> > > > > > > >> > > > > > > Any details of things that aren't working would be good to know >> > > about >> > > > > > > (e.g. raised as Jira tickets, containing a much detail as >> > > > > > > possible, >> > > > > like >> > > > > > > the query used and any log details of errors), so that the >> > > community >> > > > > could >> > > > > > > look into providing OpenSearch compatibility in the future. >> > > > > > > >> > > > > > > I've known a few people try with OpenSearch and things either >> > > work, or >> > > > > we >> > > > > > > don't hear about the errors that are received, so we don't know >> > > what >> > > > > needs >> > > > > > > looking at from a NiFi perspective. >> > > > > > > >> > > > > > > On 2023/08/18 04:37:10 Richard Beare wrote: >> > > > > > > > I did use the example and got errors. I'll revisit that >> > > > > > > > (perhaps >> > > it >> > > > > is an >> > > > > > > > opensearch idiosyncrasy). The per response option is probably >> > > > > > > > my >> > > > > issue. >> > > > > > > > I'll check that out and get back to you. >> > > > > > > > Thanks again >> > > > > > > > >> > > > > > > > On Fri, Aug 18, 2023 at 2:30 PM Chris Sampson >> > > > > > > > <chr...@apache.org <mailto:chr...@apache.org> >> > > > >> > > > > wrote: >> > > > > > > > >> > > > > > > > > Check the example in the processor's additional details docs >> > > [1] >> > > > > for >> > > > > > > how >> > > > > > > > > you could set size and sort fields for the query - size is >> > > used to >> > > > > > > > > determine the number of documents returned per page, sorry is >> > > > > required >> > > > > > > if >> > > > > > > > > using a "search after" or "point in time" query type. >> > > > > > > > > >> > > > > > > > > If the Query property is set, the incoming FlowFile content >> > > should >> > > > > be >> > > > > > > > > ignored, i.e. it doesn't need to be empty. >> > > > > > > > > >> > > > > > > > > Use the "Search Results Split" property to determine how the >> > > > > results >> > > > > > > are >> > > > > > > > > output. This defaults to "per response", which outputs a >> > > flowfile >> > > > > for >> > > > > > > every >> > > > > > > > > page of results. As PaginatedJsonQueryElasticsearch takes an >> > > input >> > > > > > > > > flowfile, its internal "process session" remains active until >> > > the >> > > > > > > processor >> > > > > > > > > completes and commits is session - this happens when there >> > > > > > > > > are >> > > no >> > > > > more >> > > > > > > > > results to retrieve from Elasticsearch, at which point the >> > > input >> > > > > > > flowfile >> > > > > > > > > disappears from the input queue and all output flowfiles >> > > appear in >> > > > > the >> > > > > > > > > output queues. This is how the nifi framework handles >> > > sessions, but >> > > > > > > can be >> > > > > > > > > confusing if you're not aware of that beforehand. >> > > > > > > > > >> > > > > > > > > SearchElasticsearch is different in this regard because its >> > > session >> > > > > > > ends >> > > > > > > > > after every iteration (determined by the "Search Results >> > > Split", >> > > > > e.g. >> > > > > > > this >> > > > > > > > > could be per page or per entire query), then uses nifi state >> > > > > > > > > to >> > > > > setup >> > > > > > > the >> > > > > > > > > next iteration. This means you could start to see output >> > > flowfiles >> > > > > > > sooner. >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > [1] >> > > > > > > > > >> > > > > > > >> > > > > >> > > https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-elasticsearch-restapi-nar/1.23.0/org.apache.nifi.processors.elasticsearch.PaginatedJsonQueryElasticsearch/additionalDetails.html >> > > > > > > > > >> > > > > > > > > On 2023/08/17 22:13:22 Richard Beare wrote: >> > > > > > > > > > Thanks, that makes sense. I've had trouble getting a size >> > > > > parameter >> > > > > > > > > > accepted, but will work on that later. >> > > > > > > > > > >> > > > > > > > > > However, I'm unsure what I should expect to see in the >> > > following >> > > > > test >> > > > > > > > > > scenario. >> > > > > > > > > > >> > > > > > > > > > A fixed query in the Query parameter - a match all. i.e. >> > > nothing >> > > > > > > dynamic >> > > > > > > > > > set by upstream processing >> > > > > > > > > > >> > > > > > > > > > An empty input flowfile to trigger activity. >> > > > > > > > > > >> > > > > > > > > > The test index is large. (20M docs) >> > > > > > > > > > >> > > > > > > > > > Do I expect the processor to begin filling the output queue >> > > as >> > > > > fast >> > > > > > > as it >> > > > > > > > > > can, with one flowfile per received page, pausing as the >> > > queue >> > > > > fills? >> > > > > > > > > > That was what I was anticipating, but at the moment I'm >> > > getting >> > > > > no >> > > > > > > output >> > > > > > > > > > and the input flowfile isn't being consumed. I suspect one >> > > flag >> > > > > is >> > > > > > > wrong, >> > > > > > > > > > but can't see it. >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > On Fri, Aug 18, 2023 at 12:06 AM Chris Sampson < >> > > > > chr...@apache.org <mailto:chr...@apache.org>> >> > > > > > > > > wrote: >> > > > > > > > > > >> > > > > > > > > > > Again, sounds like it's working as documented [1] - an >> > > input is >> > > > > > > > > required >> > > > > > > > > > > to trigger the PaginatedJsonQueryElasticsearch processor, >> > > so >> > > > > > > something >> > > > > > > > > like >> > > > > > > > > > > GenerateFlowFile is a way to achieve that if you want to >> > > > > > > periodically >> > > > > > > > > > > execute a paginated query, e.g. by setting the Generate >> > > > > processor's >> > > > > > > > > > > schedule to run every hour, or use cron syntax, etc. The >> > > > > advantage >> > > > > > > with >> > > > > > > > > > > this processor is that you can use the output of another >> > > > > processor >> > > > > > > > > (e.g. >> > > > > > > > > > > build a query using the results of another processor, >> > > > > > > > > > > such >> > > as >> > > > > an >> > > > > > > > > initial >> > > > > > > > > > > query of Elasticsearch) to trigger the paginated query of >> > > > > > > > > Elasticsearch, >> > > > > > > > > > > but once the query is finished, the processor won't keep >> > > > > firing. >> > > > > > > > > > > >> > > > > > > > > > > Conversely, SearchElasticsearch does not allow incoming >> > > > > > > connections, >> > > > > > > > > but >> > > > > > > > > > > only triggers the same query on the defined schedule. If >> > > the >> > > > > query >> > > > > > > > > needs to >> > > > > > > > > > > use parameters (or some sort of variable), you need to >> > > figure >> > > > > out >> > > > > > > how >> > > > > > > > > to >> > > > > > > > > > > apply that in the Query parameter of the processor - it >> > > could >> > > > > be by >> > > > > > > > > > > Elasticsearch notation (e.g. "now/d" for the start of the >> > > > > current >> > > > > > > day >> > > > > > > > > in a >> > > > > > > > > > > date range filter), or something that can be achieved >> > > > > > > > > > > using >> > > > > NiFi >> > > > > > > > > Expression >> > > > > > > > > > > Language [2], but without the flexibility of providing >> > > inputs >> > > > > in >> > > > > > > > > FlowFile >> > > > > > > > > > > content, which could be the output of a previous query, >> > > > > > > > > > > or >> > > > > > > > > > > GenerateFlowFile, etc. >> > > > > > > > > > > >> > > > > > > > > > > You need to figure out what query you want to run, what >> > > > > input(s) >> > > > > > > are >> > > > > > > > > > > appropriate, and the schedule to which you want to >> > > > > > > > > > > execute. >> > > > > > > > > > > >> > > > > > > > > > > The Search processor is aimed more at a use case of "I >> > > want to >> > > > > > > > > continually >> > > > > > > > > > > retrieve the contents of an Elasticsearch index/query as >> > > it is >> > > > > > > > > populated >> > > > > > > > > > > from an extremal source", PaginatedQuery is more for "I >> > > want to >> > > > > > > > > retrieve >> > > > > > > > > > > data from Elasticsearch that match a query"; both >> > > processors >> > > > > are >> > > > > > > meant >> > > > > > > > > to >> > > > > > > > > > > "allow for the possibility of many documents to be >> > > retrieved". >> > > > > > > > > > > >> > > > > > > > > > > For various reasons, neither processor was designed to >> > > > > > > > > > > hold >> > > > > state >> > > > > > > > > between >> > > > > > > > > > > initiation of paginated queries, e.g. they don't follow >> > > > > > > > > > > the >> > > > > > > pattern of >> > > > > > > > > a >> > > > > > > > > > > "Consume" or "List" processor that attempts to retain the >> > > > > > > knowledge of >> > > > > > > > > the >> > > > > > > > > > > "last timestamp" within NiFi itself. That's something >> > > > > > > > > > > that >> > > > > could be >> > > > > > > > > > > considered, but would need a code change (feel free to >> > > raise a >> > > > > jira >> > > > > > > > > ticket >> > > > > > > > > > > for the future [3] if you think that would be helpful). >> > > One of >> > > > > the >> > > > > > > > > reasons >> > > > > > > > > > > for this is that, unlike an S3 Bucket (for example), >> > > documents >> > > > > are >> > > > > > > not >> > > > > > > > > > > guaranteed to always be indexed within Elasticsearch in >> > > > > order/with >> > > > > > > > > such an >> > > > > > > > > > > "updated at" field, although one could design their >> > > > > > > > > > > system >> > > that >> > > > > > > way, of >> > > > > > > > > > > course. >> > > > > > > > > > > >> > > > > > > > > > > [1] >> > > > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > >> > > https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-elasticsearch-restapi-nar/1.23.0/org.apache.nifi.processors.elasticsearch.PaginatedJsonQueryElasticsearch/index.html >> > > > > > > > > > > >> > > > > > > > > > > [2] >> > > > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > >> > > https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html >> > > > > > > > > > > >> > > > > > > > > > > [3] https://issues.apache.org/jira/browse/NIFI >> > > > > > > > > > > >> > > > > > > > > > > On 2023/08/17 12:43:31 Richard Beare wrote: >> > > > > > > > > > > > I must be missing something simple. I've copied the >> > > > > parameters >> > > > > > > and >> > > > > > > > > query >> > > > > > > > > > > > from the SearchElasticSearch processor and I'm not >> > > getting >> > > > > > > errors, >> > > > > > > > > but no >> > > > > > > > > > > > flowfiles are produced. >> > > > > > > > > > > > >> > > > > > > > > > > > I'm forced to add an input connection, despite coding >> > > > > > > > > > > > the >> > > > > query >> > > > > > > in >> > > > > > > > > the >> > > > > > > > > > > > Query property. I have a GenerateFlowFile processor >> > > > > connected. >> > > > > > > I'm >> > > > > > > > > > > using.a >> > > > > > > > > > > > basic match all as a starting point >> > > > > > > > > > > > { >> > > > > > > > > > > > "query" : >> > > > > > > > > > > > { >> > > > > > > > > > > > "match_all" : {} >> > > > > > > > > > > > } >> > > > > > > > > > > > } >> > > > > > > > > > > > >> > > > > > > > > > > > Sending the query via curl appears to work OK - I get a >> > > page >> > > > > of >> > > > > > > stuff >> > > > > > > > > > > back. >> > > > > > > > > > > > I'm using nifi 1.20. >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > On Thu, Aug 17, 2023 at 2:24 PM Chris Sampson < >> > > > > chr...@apache.org <mailto:chr...@apache.org> >> > > > > > > > >> > > > > > > > > wrote: >> > > > > > > > > > > > >> > > > > > > > > > > > > Elasticsearch doesn't have a CDC-like capability (it >> > > > > doesn't >> > > > > > > > > maintain a >> > > > > > > > > > > > > transaction log or such), so that approach isn't >> > > possible. >> > > > > > > > > > > > > >> > > > > > > > > > > > > What I've done previously is to maintain an audit log >> > > in a >> > > > > > > separate >> > > > > > > > > > > index >> > > > > > > > > > > > > within elasticsearch to track what data I've >> > > > > > > > > > > > > previously >> > > > > posted, >> > > > > > > > > e.g. >> > > > > > > > > > > this >> > > > > > > > > > > > > might be the last "updated_date" value read from the >> > > data >> > > > > > > index in >> > > > > > > > > a >> > > > > > > > > > > > > previous run of the nifi processor. So your nifi Flow >> > > > > would be >> > > > > > > > > > > something >> > > > > > > > > > > > > like: >> > > > > > > > > > > > > >> > > > > > > > > > > > > Query for latest processed updated_date > paginated >> > > query >> > > > > for >> > > > > > > all >> > > > > > > > > new >> > > > > > > > > > > data >> > > > > > > > > > > > > > determine new latest updated_date (e.g. using >> > > > > QueryRecord) > >> > > > > > > put >> > > > > > > > > new >> > > > > > > > > > > > > latest updated_date into elasticsearch, ready for the >> > > next >> > > > > run >> > > > > > > > > > > > > >> > > > > > > > > > > > > On 2023/08/16 23:15:19 Richard Beare wrote: >> > > > > > > > > > > > > > One further question - what is the recommended way >> > > > > > > > > > > > > > of >> > > > > > > checking >> > > > > > > > > for >> > > > > > > > > > > > > updates >> > > > > > > > > > > > > > in an index and fetching new records in a similar >> > > manner >> > > > > to >> > > > > > > > > > > > > > GenerateTableFetch for an sql DB? >> > > > > > > > > > > > > > Thanks >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > On Thu, Aug 17, 2023 at 7:21 AM Richard Beare < >> > > > > > > > > > > richard.be...@gmail.com <mailto:richard.be...@gmail.com>> >> > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Sounds perfect. Thanks >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > On Thu, Aug 17, 2023 at 5:11 AM Chris Sampson < >> > > > > > > > > chr...@apache.org <mailto:chr...@apache.org>> >> > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> What you describe sounds like the processor is >> > > > > working as >> > > > > > > > > > > designed & >> > > > > > > > > > > > > > >> documented, i.e. it will restart the same query >> > > once >> > > > > it >> > > > > > > has >> > > > > > > > > > > reached >> > > > > > > > > > > > > the end >> > > > > > > > > > > > > > >> of the paginated scroll (or search_after, or >> > > > > > > point-in-time) >> > > > > > > > > query. >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> Instead, it sounds like you want to try using >> > > > > > > > > > > > > > >> the >> > > > > > > > > > > > > > >> PaginatedJsonQueryElasticsearch [1] processor >> > > instead. >> > > > > > > This >> > > > > > > > > will >> > > > > > > > > > > > > execute >> > > > > > > > > > > > > > >> the query given to it, either as the query >> > > property >> > > > > or the >> > > > > > > > > body >> > > > > > > > > > > of an >> > > > > > > > > > > > > > >> incoming FlowFile, output the results, and then >> > > stop. >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> [1] >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > >> > > https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-elasticsearch-restapi-nar/1.23.0/org.apache.nifi.processors.elasticsearch.PaginatedJsonQueryElasticsearch/index.html >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> On 2023/08/16 07:57:43 Richard Beare wrote: >> > > > > > > > > > > > > > >> > Hi, >> > > > > > > > > > > > > > >> > I am using the SearchElasticSearch (1.20.0) >> > > > > processor to >> > > > > > > > > > > retrieve >> > > > > > > > > > > > > all >> > > > > > > > > > > > > > >> > documents (~20M) from an index, process and >> > > > > eventually >> > > > > > > > > return >> > > > > > > > > > > > > results >> > > > > > > > > > > > > > >> to a >> > > > > > > > > > > > > > >> > new index, although for this test I'm >> > > retrieving and >> > > > > > > > > processing >> > > > > > > > > > > then >> > > > > > > > > > > > > > >> > discarding. I'm using opensearch. >> > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > >> > My problem is that the process restarts after >> > > > > > > completion - I >> > > > > > > > > > > > > discovered >> > > > > > > > > > > > > > >> > this, and docs confirm, after seeing warnings >> > > from >> > > > > my >> > > > > > > > > processing >> > > > > > > > > > > > > code >> > > > > > > > > > > > > > >> > (which reformats json ready for other work) >> > > being >> > > > > > > repeated >> > > > > > > > > for >> > > > > > > > > > > the >> > > > > > > > > > > > > same >> > > > > > > > > > > > > > >> > document ID. >> > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > >> > How do I configure the processor to stop after >> > > the >> > > > > > > > > completing >> > > > > > > > > > > the >> > > > > > > > > > > > > first >> > > > > > > > > > > > > > >> > query. >> > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > >> > I've tried the following: >> > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > >> > Query: {"query" : {"match_all" :{}}} >> > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > >> > with pagination_type SCROLL >> > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > >> > I haven't found a combination of the >> > > > > > > > > > > > > > >> > properties >> > > that >> > > > > > > doesn't >> > > > > > > > > > > lead to >> > > > > > > > > > > > > > >> > repeated cycles through the index. >> > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > >> > I've also tried {"query" : {"match_all" :{}}, >> > > > > "sort" : >> > > > > > > > > > > > > > >> [{"Visit_DateTime" : >> > > > > > > > > > > > > > >> > "asc"]}} >> > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > >> > and SEARCH_AFTER pagination type, with the >> > > > > > > > > > > > > > >> > same >> > > > > problem. >> > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > >> > What am I missing? >> > > > > > > > > > > > > > >> > Thanks >> > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> >