Thanks, Darrel.

I will try out your suggestion and get back with results.

Regards
Kapil

From: Darrel Schneider <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, October 6, 2016 at 4:12 PM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Re: OplogCompactor race issue?

The ticket is GEODE-1969.

On Thu, Oct 6, 2016 at 4:01 PM, Darrel Schneider 
<[email protected]> wrote:
The call stack shows that Oplog.flush is calling "force." It only does this 
when the system property "gemfire.syncWrites" is set to "true." I think that is 
why you found this problem. I think all/most of our testing runs with this sys 
prop at the default of "false."
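
For reference, a minimal sketch of turning it on (standard Java system-property mechanics, not a Geode-specific API; set it before the cache and disk stores are created):

    // enable synchronous oplog writes; the default is "false"
    System.setProperty("gemfire.syncWrites", "true");
    // equivalently, on the JVM command line: -Dgemfire.syncWrites=true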

Currently the code in Oplog.flush that calls "force" does it outside of the 
sync on "lock."
If you move the "force" call up into this synchronized block, it should fix your problem.
For example, the old code:
      synchronized (this.lock/* olf */) {
        if (olf.RAFClosed) {
          return;
        }
        ByteBuffer bb = olf.writeBuf;
        if (bb != null && bb.position() != 0) {
          bb.flip();
          int flushed = 0;
          do {
            flushed += olf.channel.write(bb);
          } while (bb.hasRemaining());
          // update bytesFlushed after entire writeBuffer is flushed to fix bug
          // 41201
          olf.bytesFlushed += flushed;
          bb.clear();
        }
      }
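      // NOTE: "lock" has been released at this point, so another thread
      // (e.g. the compactor closing this oplog) can close the channel
      // before the force(true) below executes.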
      if (doSync) {
        if (SYNC_WRITES) {
          // Synch Meta Data as well as content
          olf.channel.force(true);
        }
      }
And the new code with the fix:
      synchronized (this.lock/* olf */) {
        if (olf.RAFClosed) {
          return;
        }
        ByteBuffer bb = olf.writeBuf;
        if (bb != null && bb.position() != 0) {
          bb.flip();
          int flushed = 0;
          do {
            flushed += olf.channel.write(bb);
          } while (bb.hasRemaining());
          // update bytesFlushed after entire writeBuffer is flushed to fix bug
          // 41201
          olf.bytesFlushed += flushed;
          bb.clear();
        }
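        // The force(true) below now runs while "lock" is still held, so the
        // channel cannot be closed concurrently between the flush and the sync.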
        if (doSync) {
          if (SYNC_WRITES) {
            // Synch Meta Data as well as content
            olf.channel.force(true);
          }
        }
      }

I will file a ticket on this. Let me know if this fix works for you. Also be 
aware that setting "gemfire.syncWrites" to "true" can really slow down the rate 
at which you can write to disk.
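
To make the race concrete, here is a minimal standalone sketch of the pattern (plain JDK code; the class and field names are invented for illustration and are not Geode's). The flusher writes under "lock" but calls force() after releasing it, while another thread closes the channel under the same lock, mirroring the compactor destroying the oplog:

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ForceOutsideLockRace {
  private static final Object lock = new Object();

  public static void main(String[] args) throws Exception {
    File f = File.createTempFile("oplog", ".crf");
    f.deleteOnExit();
    RandomAccessFile raf = new RandomAccessFile(f, "rw");
    FileChannel channel = raf.getChannel();

    Thread flusher = new Thread(() -> {
      try {
        synchronized (lock) {
          // flush the write buffer while holding the lock
          channel.write(ByteBuffer.wrap(new byte[] { 1, 2, 3 }));
        } // lock released: the race window opens here
        Thread.sleep(100); // widen the window so the demo fails reliably
        channel.force(true); // throws ClosedChannelException if closed meanwhile
      } catch (Exception e) {
        System.out.println("flusher failed: " + e);
      }
    });

    Thread closer = new Thread(() -> {
      try {
        Thread.sleep(50); // land inside the window
        synchronized (lock) {
          channel.close(); // e.g. the compactor destroying the old oplog
        }
      } catch (Exception e) {
        System.out.println("closer failed: " + e);
      }
    });

    flusher.start();
    closer.start();
    flusher.join();
    closer.join();
  }
}

Moving force(true) inside the synchronized block, as in the fix above, closes this window: the closer cannot acquire "lock" until both the flush and the sync have completed.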


On Thu, Oct 6, 2016 at 2:47 PM, Kapil Goyal 
<[email protected]> wrote:
Hi Dan,

We are doing put operations within transactions.

Here is the stack trace:

[warn 2016/09/29 06:17:16.297 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher@55bded67: Exception during processing batch 448
com.gemstone.gemfire.internal.cache.wan.GatewaySenderException: com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher@55bded67: Exception during processing batch 448, caused by com.gemstone.gemfire.cache.CacheClosedException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
        at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:173)
        at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:83)
        at com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySenderEventProcessor.processQueue(AbstractGatewaySenderEventProcessor.java:579)
        at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.run(SerialGatewaySenderEventProcessor.java:219)
Caused by: com.gemstone.gemfire.cache.CacheClosedException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
        at com.gemstone.gemfire.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1299)
        at com.gemstone.gemfire.CancelCriterion.checkCancelInProgress(CancelCriterion.java:82)
        at com.gemstone.gemfire.internal.cache.TXManagerImpl.checkClosed(TXManagerImpl.java:606)
        at com.gemstone.gemfire.internal.cache.TXManagerImpl.begin(TXManagerImpl.java:279)
        at com.vmware.nsx.management.container.dao.gemfire.GemFireTxLogDao.processTxLog(GemFireTxLogDao.java:119)
        at com.vmware.nsx.management.container.dao.gemfire.TxLogAsyncEventListener.processEvents(TxLogAsyncEventListener.java:93)
        at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:164)
        ... 3 more
Caused by: com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5223)
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5215)
        at com.gemstone.gemfire.internal.cache.Oplog.basicClose(Oplog.java:3312)
        at com.gemstone.gemfire.internal.cache.Oplog.destroy(Oplog.java:3375)
        at com.gemstone.gemfire.internal.cache.Oplog.handleEmptyAndOldest(Oplog.java:5877)
        at com.gemstone.gemfire.internal.cache.Oplog.handleNoLiveValues(Oplog.java:5924)
        at com.gemstone.gemfire.internal.cache.Oplog.basicModify(Oplog.java:4751)
        at com.gemstone.gemfire.internal.cache.Oplog.copyForwardModifyForCompact(Oplog.java:4594)
        at com.gemstone.gemfire.internal.cache.Oplog.compact(Oplog.java:6087)
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl$OplogCompactor.compact(DiskStoreImpl.java:2970)
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl$OplogCompactor.run(DiskStoreImpl.java:3034)
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl$6.run(DiskStoreImpl.java:4701)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Stream Closed
        at java.io.RandomAccessFile.getFilePointer(Native Method)
        at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile.getFilePointer(UninterruptibleRandomAccessFile.java:59)
        at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl.doUninterruptibly(UninterruptibleRandomAccessFile.java:102)
        at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl.force(UninterruptibleRandomAccessFile.java:226)
        at com.gemstone.gemfire.internal.cache.Oplog.flush(Oplog.java:5202)
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5232)
        ... 17 more

Thanks
Kapil

-----Original Message-----
From: Dan Smith [mailto:[email protected]]
Sent: Tuesday, October 04, 2016 10:08 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: OplogCompactor race issue?

Hi Kapil,

I haven't seen this particular issue before. Do you have any stack traces of 
the exception from the OplogCompactor? What sort of operations are you doing? 
puts?

-Dan

On Tue, Oct 4, 2016 at 1:16 AM, Kapil Goyal 
<[email protected]> wrote:
> Hi All,
>
> We have been testing a single cache node with a lot of data recently
> and frequently run into this error:
>
> [info 2016/09/29 06:16:06.823 UTC <OplogCompactor nsxDiskStore for oplog oplog#6> tid=0x19] OplogCompactor for nsxDiskStore compaction oplog id(s): oplog#6
> [info 2016/09/29 06:16:08.232 UTC <OplogCompactor nsxDiskStore for oplog oplog#6> tid=0x19] compaction did 6,310 creates and updates in 1,408 ms
> [info 2016/09/29 06:16:08.248 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 crf for disk store nsxDiskStore.
> [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 krf for disk store nsxDiskStore.
> [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 drf for disk store nsxDiskStore.
> [info 2016/09/29 06:17:03.887 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created oplog#8 drf for disk store nsxDiskStore.
> [info 2016/09/29 06:17:03.911 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created oplog#8 crf for disk store nsxDiskStore.
> [info 2016/09/29 06:17:04.031 UTC <Idle OplogCompactor> tid=0x19] Created oplog#7 krf for disk store nsxDiskStore.
> [info 2016/09/29 06:17:04.314 UTC <OplogCompactor nsxDiskStore for oplog oplog#7> tid=0x19] OplogCompactor for nsxDiskStore compaction oplog id(s): oplog#7
> [error 2016/09/29 06:17:16.075 UTC <OplogCompactor nsxDiskStore for oplog oplog#7> tid=0x19] A DiskAccessException has occurred while writing to the disk for disk store nsxDiskStore. The cache will be closed.
> com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
>         at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)
>
> From the logs it appears there may be a race between threads "Idle
> OplogCompactor" and "OplogCompactor nsxDiskStore for oplog oplog#7". I
> see that both are doing operations related to oplog#7. The former logs
> creation of a KRF file, while the latter is trying to access either
> the DRF or the CRF file. Now, is it possible that "Idle
> OplogCompactor" closed the DRF/CRF files for oplog#7 as part of
> creating the KRF for the same? This is what GemFire docs say about it:
>
> "After the oplog is closed, GemFire also attempts to create a krf
> file, which contains the key names as well as the offset for the value
> within the crf file."
>
> Based on the above, it's possible that oplog#7 was already closed and
> its KRF was already created, when the compactor tried to access the files.
>
> Have any of you run into this error before? Any suggestions?
>
> Thanks
> Kapil

