RE: Caused by: java.lang.OutOfMemoryError: Java heap space

stephen.hindmarch.bt.com via users Fri, 08 Nov 2024 07:45:10 -0800

Hi Minh,

The two settings are independent. You set them for different purposes.


The run schedule says how often the processor should run. A value of 0 means
run whenever there are resources free and it is the processor's turn. Any other
value means run at set intervals. A reason to use this is when you are waiting
on an external resource and there is no point polling it as fast as possible.
For example, you might be running a database query and want to run it at
regular intervals to reduce stress on the database, rather than querying as
fast as possible.

The run duration says, once the processor is running, how long it should run
for. A value of 0 means process only one item from the queue (per concurrent
task of the processor, per node in the cluster). A value of 25 ms means keep
processing items until the 25 ms is up, or until the input queue is empty,
whichever comes first. This tends to be used when you are fetching data from an
external system such as a message queue and you want to make sure you have
retrieved all of the data each time you poll it. I also tend to use this when a
processor has more work items flowing to it than other parts of the flow. For
example, if the upstream processor splits a big file into many lines, then the
downstream processor might have many times more items to process than the
upstream. If they just kept taking it in turns to process one item at a time,
the second processor would never catch up. So setting a run duration is one
tool you could use to ensure that the processor gets to process multiple items
per tick.
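
The catch-up effect described above can be sketched with a small simulation.
This is a simplified model, not NiFi's scheduler: the function names, the
1 ms cost per item and the 20 lines per file are all assumptions chosen for
illustration.

```python
def items_in_one_trigger(queue_len, run_duration_ms, cost_per_item_ms):
    """Count the items handled in one trigger of a processor (simplified model)."""
    processed = 0
    elapsed_ms = 0
    # Always handle one item if available; continue only while time remains.
    while processed < queue_len and (processed == 0 or elapsed_ms < run_duration_ms):
        processed += 1
        elapsed_ms += cost_per_item_ms
    return processed


def backlog_after(rounds, lines_per_file, run_duration_ms, cost_per_item_ms=1):
    """Upstream and downstream take turns; return the downstream queue length."""
    queue_len = 0
    for _ in range(rounds):
        # Hypothetical upstream: splits one file into `lines_per_file` items.
        queue_len += lines_per_file
        # Downstream gets one trigger per round.
        queue_len -= items_in_one_trigger(queue_len, run_duration_ms, cost_per_item_ms)
    return queue_len


# Run duration 0: one item per trigger, so the backlog keeps growing.
print(backlog_after(rounds=100, lines_per_file=20, run_duration_ms=0))
# Run duration 25 ms: up to 25 items per trigger, so the queue is drained.
print(backlog_after(rounds=100, lines_per_file=20, run_duration_ms=25))
```

In this model the downstream queue holds 1900 items after 100 rounds with a
run duration of 0, and is empty with a run duration of 25 ms.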

I would say in your case you need to set both settings to 0.

Regards

Steve Hindmarch

From: e-soci...@gmx.fr <e-soci...@gmx.fr>
Sent: 08 November 2024 09:32
To: users@nifi.apache.org
Cc: users@nifi.apache.org
Subject: Re: Caused by: java.lang.OutOfMemoryError: Java heap space

Thanks a lot!

You made my day with these videos.

Regarding [3]: if the "Run Schedule" = 0 seconds, we don't need to change the
"Run Duration" value, right?

Thanks

Minh


Sent: Thursday, 7 November 2024 at 18:12
From: "Mark Payne" <marka...@hotmail.com>
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: Caused by: java.lang.OutOfMemoryError: Java heap space
OK so given that, the issue is almost certainly because you’re promoting huge
chunks of JSON into attributes using EvaluateJsonPath.
You’ll want to avoid putting anything larger than a few hundred characters into
attributes. Instead, lean into using Record-based processors in order to
manipulate the contents of the FlowFiles as they are, without creating
attributes from content. EvaluateJsonPath is helpful for creating attributes
from small JSON fields so that you can perform routing, etc., but should not be
used to create large attributes. [1]
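
A back-of-the-envelope calculation shows why large attributes exhaust the
heap. The data volume and flowfile count come from the original question in
this thread; everything else is an assumption made for illustration: that the
flowfiles are of roughly equal size, that the whole content ends up
hex-encoded in one attribute (two characters per byte), that each character
costs about one byte of heap, and a hypothetical 4 GB heap. Object overhead
and temporary copies are ignored, so the real cost would be higher.

```python
def attribute_heap_bytes(content_bytes, chars_per_byte=2, heap_bytes_per_char=1):
    """Rough heap cost of one attribute holding hex-encoded content."""
    return content_bytes * chars_per_byte * heap_bytes_per_char


TOTAL_DATA_BYTES = 160 * 10**9  # "160 GB" from the thread, read as decimal GB
FLOWFILE_COUNT = 100_000        # "around 100,000 flowfiles" from the thread
HEAP_BYTES = 4 * 10**9          # hypothetical heap size, for illustration only

# Average content size per flowfile, assuming a uniform distribution.
avg_content_bytes = TOTAL_DATA_BYTES // FLOWFILE_COUNT
# Heap held by one such attribute under the assumptions above.
per_flowfile = attribute_heap_bytes(avg_content_bytes)
# Number of such flowfiles whose attributes alone would fill the heap.
flowfiles_to_fill_heap = HEAP_BYTES // per_flowfile

print(avg_content_bytes)       # 1600000
print(per_flowfile)            # 3200000
print(flowfiles_to_fill_heap)  # 1250
```

Under these assumptions, the attributes of about 1,250 flowfiles would fill
the heap on their own, a small fraction of the roughly 100,000 flowfiles in
the flow.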

I also see in your canvas that you have several load-balanced connections, 
which you should avoid [2].

Re: the relationship between “Run Schedule” and “Run Duration” - Run Schedule
indicates how long to wait between triggering the Processor. Run Duration says
how long to run the Processor each time it’s scheduled to run. So if Run
Schedule = 5 seconds and Run Duration = 2 seconds, then the Processor will run
for up to 2 seconds. Then it will not run again for 5 seconds. Then it will run
for 2 seconds. Then it will do nothing for 5 seconds. In practice, Processors
should almost always have a Run Schedule of 0 seconds except for source
processors. See [3] for more details.
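
The timeline described above can be written out as a short sketch. It models
only the description in this paragraph, and assumes the processor uses its
full run duration every time (in practice it stops early if the queue
empties). The function name is hypothetical.

```python
def run_windows(run_schedule_s, run_duration_s, count):
    """Return (start, end) times in seconds of the first `count` run windows."""
    windows = []
    start = 0.0
    for _ in range(count):
        end = start + run_duration_s
        windows.append((start, end))
        # Wait for the run schedule to elapse before the next trigger.
        start = end + run_schedule_s
    return windows


# Run Schedule = 5 s, Run Duration = 2 s: run 2 s, idle 5 s, repeat.
for start, end in run_windows(5, 2, 3):
    print(f"run from t={start:g}s to t={end:g}s")

# Run Schedule = 0 s, Run Duration = 2 s: windows follow each other directly.
print(run_windows(0, 2, 3))
```

With a Run Schedule of 0 the windows are back to back, so in this model the
processor runs continuously for as long as it has work to do.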

Thanks
-Mark

[1] https://www.youtube.com/watch?v=RjWstt7nRVY&t=187
[2] https://www.youtube.com/watch?v=by9P0Zi8Dk8
[3] https://www.youtube.com/watch?v=pZq0EbfDBy4



On Nov 7, 2024, at 3:49 AM, e-soci...@gmx.fr wrote:

Here is the configuration for EvaluateJsonPath and ReplaceText.

Another question, about "Run Schedule" and "Run Duration":
I know how each of them works separately, but how do they work together?

I mean, if "Run Schedule" is set to 0s and "Run Duration" is set to 2s,
does that mean the processor is always running?
How does one affect the other?

Thanks a lot

Minh

Sent: Wednesday, 6 November 2024 at 16:13
From: "Mark Payne" <marka...@hotmail.com>
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: Caused by: java.lang.OutOfMemoryError: Java heap space
OK so the decompress should be CPU intensive but not heap/memory intensive.
EvaluateJsonPath will potentially consume large amounts of heap as well, 
depending on how it’s configured.
The ExecuteGroovyScript sounds like it would use very little.
ReplaceText may well consume huge amounts of heap, depending on how it’s 
configured.

Can you share how EvaluateJsonPath and ReplaceText are configured?

The idea that 16 GB of RAM is the max recommended for a JVM was true a while
ago, but with modern JVMs you can go much higher. That said, given the flow
described, 4 GB should be more than sufficient if properly configured.

Thanks
-Mark


On Nov 6, 2024, at 9:51 AM, e-soci...@gmx.fr wrote:

Thanks for the reply, Mark,

The Groovy script is very simple:

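        // Read the hex-encoded payload from the 'hexContent' attribute
        hexContent = flowFile.getAttribute('hexContent')
        // decodeHex() turns the hex string into a byte[]
        hexContent = hexContent.decodeHex()
        // Write the decoded bytes to the output stream
        outputStream.write(hexContent)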
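
One way to avoid holding the whole hex string in memory, in line with the
advice in this thread to work on FlowFile content rather than attributes, is
to decode it in fixed-size chunks from an input stream to an output stream.
The sketch below shows only the chunking technique, in Python and on
in-memory streams. It is not NiFi or Groovy API code, and it assumes the hex
text is available as a byte stream (for example as content) rather than as an
attribute. The function name and chunk size are hypothetical.

```python
import binascii
import io


def decode_hex_stream(src, dst, chunk_size=8192):
    """Decode hex digits read from `src` into raw bytes written to `dst`."""
    pending = b""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        data = pending + chunk
        # Only decode an even number of digits; carry a stray digit over.
        usable = len(data) - (len(data) % 2)
        dst.write(binascii.unhexlify(data[:usable]))
        pending = data[usable:]
    if pending:
        raise ValueError("input has an odd number of hex digits")


# Usage with in-memory streams; the tiny chunk size exercises the carry-over.
source = io.BytesIO(b"48656c6c6f")
target = io.BytesIO()
decode_hex_stream(source, target, chunk_size=3)
print(target.getvalue())
```

Memory use is then bounded by the chunk size instead of the size of the whole
payload.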

The question is how to process flowfiles as quickly as possible.
If I upgrade to 8 CPUs per node, is it possible to process fewer flowfiles at
the same time but more flowfiles overall?

The main NiFi dataflow is:

  *   Uncompress incoming flowfiles (CPU- and heap-intensive, I suppose)
  *   ReplaceText (heap-intensive)
  *   EvaluateJsonPath (heap-intensive)
  *   ExecuteGroovyScript (heap-intensive)

I read that 16GB of RAM is the maximum recommended for a JVM and that adding 
more isn’t beneficial.
Is that true, or can I increase it to 32GB?

Regards

Minh
Sent: Wednesday, 6 November 2024 at 15:24
From: "Mark Payne" <marka...@hotmail.com>
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: Caused by: java.lang.OutOfMemoryError: Java heap space
Hi Minh,

It is possible that the heap is being exhausted by EvaluateJsonPath if you are 
using it to add large JSON chunks as attributes. For example, if you’re 
creating an attribute from `$.` to put the entire JSON contents into 
attributes. Generally, attributes should be kept pretty small.

Otherwise, based on the flow described, the issue is almost certainly within 
the ExecuteGroovyScript. There, there’s not much guidance we can provide, as 
it’s running your own script. You’d need to understand what in your own script 
is using up all of the heap.

Thanks
-Mark


On Nov 6, 2024, at 4:26 AM, e-soci...@gmx.fr wrote:

Hello all,

We have a cluster with 10 nodes (4 CPUs / 16 GB each) - NiFi 1.25 - jdk-11.0.19

We use this cluster to send data to a GCP bucket. The data is sent by other
clusters, so we use S2S between them.

I can't determine where the issue is. This error could be raised by
EvaluateJsonPath, ExecuteGroovyScript, or UpdateAttribute.
We have around 100,000 flowfiles (160 GB of data).
We need to configure more than one task for each processor to run faster, but
we always get this error.

Attachments: evaluateJsonPath.png, evaluateJsonPath2.png, out_of_memory.png,
replaceText.png, replaceText2.png
