RE: [EXT] Re: FlowFile Repository can't checkpoint, out of heap space.

2019-08-15 Thread Peter Wicks (pwicks)
 serde.deserializeRecord(dataInputStream, serdeVersion);
}

outStream.close();
dataInputStream.close();

System.out.println(file + " - " + saved + " / " + total);
}
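Below is a rough, self-contained sketch of the kind of scan-and-filter loop the fragment above appears to belong to, written against the org.wali.SerDe interface from NiFi's write-ahead-log module. The serde instance, the serdeVersion, and the keep/drop test (for example, matching a Queue ID, as mentioned further down) are assumptions left to the caller, and the header framing that real journal/checkpoint files carry is not handled here.

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.function.Predicate;

import org.wali.SerDe;

public class RepoScanSketch {

    // Reads records from one repository partition file with the supplied SerDe and
    // rewrites only the records the caller wants to keep (e.g. FlowFiles that do NOT
    // belong to a given queue). Returns the number of records kept.
    public static <T> long filterRecords(final File inFile, final File outFile,
                                         final SerDe<T> serde, final int serdeVersion,
                                         final Predicate<T> keep) throws IOException {
        long total = 0;
        long saved = 0;

        try (final DataInputStream dataInputStream =
                     new DataInputStream(new BufferedInputStream(new FileInputStream(inFile)));
             final DataOutputStream outStream =
                     new DataOutputStream(new FileOutputStream(outFile))) {

            while (true) {
                final T record;
                try {
                    record = serde.deserializeRecord(dataInputStream, serdeVersion);
                } catch (final EOFException eof) {
                    break;  // ran out of data in this file
                }
                if (record == null) {
                    break;  // some serdes signal end-of-stream with null instead
                }
                total++;

                if (keep.test(record)) {
                    serde.serializeRecord(record, outStream);
                    saved++;
                }
            }
        }

        System.out.println(inFile + " - " + saved + " / " + total);
        return saved;
    }
}

Obtaining a usable SerDe (e.g. NiFi's schema-based repository record serde) and mapping a record to its queue identifier are the NiFi-internal pieces this sketch deliberately leaves out.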

From: Joe Witt 
Sent: Thursday, August 15, 2019 10:58 AM
To: users@nifi.apache.org
Subject: Re: [EXT] Re: FlowFile Repository can't checkpoint, out of heap space.

Peter

All the details you can share on this would be good.  First, we should be 
resilient to any sort of repo corruption in the event of heap issues.  While 
the flow obviously isn't in a good state at that point, the saved state should 
be reliable/recoverable.  Second, how the repo/journals got that large in the 
first place should be evaluated.  A full JIRA with a description of the 
situation, logs, and known state would help drive further resolution.

Thanks

On Thu, Aug 15, 2019 at 12:50 PM Peter Wicks (pwicks) 
<pwi...@micron.com> wrote:
We were able to recover this morning. In the end we deleted the problem queues 
from the Flow, and when the troubled node came back online it dropped those 
FlowFiles on its own, since their queues no longer existed. Because this happens 
while the FlowFile Repository is being loaded into memory, it didn't run out 
of heap.

But before we got to that point we maxed out the heap at 500 GB, all our server 
had to offer. I also tried scripting a cleanup of the journal overflow files, 
which failed because the journal keeps track of those files and won't restore 
if any are missing.  I'm thinking of building some nifi-utility functions for 
emergency cleanup of the FlowFile repository, where you can specify a 
Queue ID and it removes the corresponding records, or maybe an offline compaction.

Thanks,
  Peter


From: Brandon DeVries <b...@jhu.edu>
Sent: Thursday, August 15, 2019 9:53 AM
To: users@nifi.apache.org
Subject: [EXT] Re: FlowFile Repository can't checkpoint, out of heap space.


Peter,

Unfortunately, I don't have a perfect solution for your current problem.  I 
would try starting with autoResume=false, just to try to limit what's going on 
in the system.  If possible, you can also try temporarily giving the JVM more 
heap.
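
For reference, a minimal sketch of the settings involved, assuming a standard NiFi conf/ layout; the heap values are only placeholders:

# conf/nifi.properties - don't automatically start components after a restart
nifi.flowcontroller.autoResumeState=false

# conf/bootstrap.conf - temporarily give the JVM more heap (example values)
java.arg.2=-Xms32g
java.arg.3=-Xmx64g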

This is, however, the use case that led to the idea of "recovery mode" in the 
new RocksDBFlowFileRepository[1] that should be in nifi 1.10.0 (the 
documentation[2] is attached to the ticket):

"[Recovery mode] limits the number of FlowFiles loaded into the graph at a 
time, while not actually removing any FlowFiles (or content) from the system. 
This allows for the recovery of a system that is encountering OutOfMemory 
errors or similar on startup..."

[1] https://issues.apache.org/jira/browse/NIFI-4775
[2] https://issues.apache.org/jira/secure/attachment/12976954/RocksDBFlowFileRepo.html
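
For reference, switching a node to that repository and turning on recovery mode would look roughly like the following in nifi.properties. The recovery-mode property names below are recalled from that documentation and may not be exact, so verify them against [2] before relying on them:

nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.RocksDBFlowFileRepository
# cap how many FlowFiles are loaded into the flow at a time while recovering
nifi.flowfile.repository.rocksdb.enable.recovery.mode=true
nifi.flowfile.repository.rocksdb.recovery.mode.flowfile.count=5000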

On Wed, Aug 14, 2019 at 12:12 PM Peter Wicks (pwicks) 
<pwi...@micron.com> wrote:
I have a node in a cluster whose FlowFile repository grew so fast that it 
exceeded the amount of available heap space and now can't checkpoint. Or that 
is my interpretation of the error.

"Cannot update journal file flowfile_repository/journals/.journal because 
this journal  has already encountered a failure when attempting to write to the 
file."
Additionally, NiFi then fails to restart because it runs out of heap space 
while doing a SchemaRecordReader.readFieldValue.  Feeling a bit stuck on where 
to go from here.

Based on metrics we collect, we saw a large increase in FlowFiles on that node 
right before it crashed, and on Linux we see the following:
94G  ./journals/overflow-569618072
356G ./journals/overflow-569892338

Oh, and a 280 GB checkpoint file

There are a few queues/known FlowFiles that are probably the problem, and I'm 
OK with dropping them, but there is plenty of other data in there too that I 
don't want to lose…

Thanks,
  Peter

