Just picking up on a couple of comments/threads within this chain (for clarity).
Environment:
* NiFi 2.0.0-SNAPSHOT (i.e. current “main” latest, but should be pretty much
the same for the Elasticsearch processors as the current 1.23.2 release)
* Elasticsearch 8.9.1
If I run the SearchElasticsear
I discovered the JsonQueryElasticsearch processor just before your message arrived,
and it looks to be returning aggregates only if I set size: 0, so I think
that is the place to start for this problem.
On Mon, Aug 21, 2023 at 5:41 PM Chris Sampson wrote:
Using SearchElasticsearch for just an aggregation feels like it might not be
the right choice (maybe look at JsonQueryElasticsearch instead), or are the
dates constantly changing, i.e. new data is always appearing, so you want to
keep triggering the flow, and you want to use this as the starting
I'm repeatedly selecting the min and max date stamp using a
SearchElasticsearch processor to begin creating the query generator.
The query looks like:
{
"size" : 0,
"aggs" : {
"newest" : { "max" : { "field" : "Visit_DateTime"}},
"oldest" : { "min" : { "field" : "Visit_DateTime"}}
}
}
This seems t
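For reference, a minimal Python sketch of the same min/max aggregation, assuming the standard Elasticsearch aggregation response shape (an "aggregations" object keyed by aggregation name). The helper names and the sample response values are made up for illustration; they are not output from the index discussed here.

```python
import json


def build_min_max_query(field):
    """Build an aggregation-only query; size 0 suppresses document hits."""
    return {
        "size": 0,
        "aggs": {
            "newest": {"max": {"field": field}},
            "oldest": {"min": {"field": field}},
        },
    }


def extract_bounds(response):
    """Return (oldest, newest) as strings from an aggregation response."""
    aggs = response["aggregations"]
    return (
        aggs["oldest"].get("value_as_string"),
        aggs["newest"].get("value_as_string"),
    )


# Hypothetical response, shaped like an Elasticsearch reply for a date field.
sample_response = {
    "hits": {"hits": []},
    "aggregations": {
        "newest": {
            "value": 1692576000000.0,
            "value_as_string": "2023-08-21T00:00:00.000Z",
        },
        "oldest": {
            "value": 1577836800000.0,
            "value_as_string": "2020-01-01T00:00:00.000Z",
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(build_min_max_query("Visit_DateTime"), indent=2))
    print(extract_bounds(sample_response))
```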
To retrieve large quantities of data from Elasticsearch into NiFi, yes, it's
probably the best way we have.
The processors don't currently use slicing (parallelism) internally for the
Elasticsearch queries, but as you're writing a query for every month, you could
increase the processor's Concu
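As a rough illustration of the "one query per month" idea, here is a Python sketch that turns an oldest/newest date pair into one range query per calendar month. The function names and the default page size are assumptions for the example; in a NiFi flow the queries would be generated by upstream processors rather than by a script like this.

```python
from datetime import date


def next_month(d):
    """Return the first day of the month following d."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def monthly_queries(oldest, newest, field="Visit_DateTime", size=10000):
    """Yield one range query per calendar month covering oldest..newest."""
    start = date(oldest.year, oldest.month, 1)
    while start <= newest:
        end = next_month(start)
        yield {
            "size": size,
            "query": {
                "range": {
                    # Half-open interval so months never overlap.
                    field: {"gte": start.isoformat(), "lt": end.isoformat()}
                }
            },
        }
        start = end


if __name__ == "__main__":
    for q in monthly_queries(date(2022, 11, 15), date(2023, 1, 3)):
        print(q["query"]["range"]["Visit_DateTime"])
```

Each generated query covers a half-open interval [first of month, first of next month), so a document on a month boundary lands in exactly one query.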
Good points - I've done some testing.
About 1-2 minutes for 1 month's data with 1k page sizes and about half that
for 10k. About 8-10 minutes for 1 year's worth of data at 10k pages.
Per month looks like the sweet spot in terms of size - that's about
500-750MB.
In terms of building the upstream t
I'd guess it depends on what you want to achieve downstream, e.g. would setting
the query processor to output per_query and return everything in 1 be
useful? Internally, the processor is still fetching everything in pages from
Elasticsearch, setting the size higher will reduce the number of netw
A bit of progress.
First up, firing a match_all at my index with 20M documents doesn't work,
as you probably expected. Or more precisely, is unlikely to be useful - I
left it overnight and nothing appeared to have happened, so I guess it was
madly fetching pages and filling up available storage.
S
Ah, so these processors have all been written for Elasticsearch, and use the
Elasticsearch low-level REST API library to form connections. They've not been
tested against OpenSearch, although hopefully should work for any interactions
where the API is the same, but the two products continue to d
I did use the example and got errors. I'll revisit that (perhaps it is an
OpenSearch idiosyncrasy). The per response option is probably my issue.
I'll check that out and get back to you.
Thanks again
On Fri, Aug 18, 2023 at 2:30 PM Chris Sampson wrote:
Check the example in the processor's additional details docs [1] for how you
could set size and sort fields for the query - size is used to determine the
number of documents returned per page, sort is required if using a "search
after" or "point in time" query type.
If the Query property is se
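A small Python sketch of what a query body with both size and sort might look like, for illustration only. The helper name is hypothetical, and the sort field is borrowed from the aggregation query elsewhere in this thread; the processor's additional details docs remain the reference for what it actually accepts.

```python
import json


def paged_query(query, size, sort):
    """Wrap a query clause with the page size and sort order used for paging."""
    if not sort:
        # search_after and point-in-time paging need a sort order.
        raise ValueError("a sort is required for search_after / point in time")
    return {"query": query, "size": size, "sort": sort}


if __name__ == "__main__":
    body = paged_query(
        query={"match_all": {}},
        size=1000,
        sort=[{"Visit_DateTime": "asc"}],
    )
    print(json.dumps(body, indent=2))
```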
Thanks, that makes sense. I've had trouble getting a size parameter
accepted, but will work on that later.
However, I'm unsure what I should expect to see in the following test
scenario.
A fixed query in the Query parameter - a match all. i.e. nothing dynamic
set by upstream processing
An empty
Again, sounds like it's working as documented [1] - an input is required to
trigger the PaginatedJsonQueryElasticsearch processor, so something like
GenerateFlowFile is a way to achieve that if you want to periodically execute a
paginated query, e.g. by setting the Generate processor's schedule
I must be missing something simple. I've copied the parameters and query
from the SearchElasticsearch processor and I'm not getting errors, but no
flowfiles are produced.
I'm forced to add an input connection, despite coding the query in the
Query property. I have a GenerateFlowFile processor conn
Elasticsearch doesn't have a CDC-like capability (it doesn't maintain a
transaction log or such), so that approach isn't possible.
What I've done previously is to maintain an audit log in a separate index
within Elasticsearch to track what data I've previously posted, e.g. this might
be the las
One further question - what is the recommended way of checking for updates
in an index and fetching new records in a similar manner to
GenerateTableFetch for an SQL DB?
Thanks
On Thu, Aug 17, 2023 at 7:21 AM Richard Beare
wrote:
Sounds perfect. Thanks
On Thu, Aug 17, 2023 at 5:11 AM Chris Sampson wrote:
What you describe sounds like the processor is working as designed &
documented, i.e. it will restart the same query once it has reached the end of
the paginated scroll (or search_after, or point-in-time) query.
Instead, it sounds like you want to try using the
PaginatedJsonQueryElasticsearch [