The call stack shows that Oplog.flush is calling "force", which it only does
when the system property "gemfire.syncWrites" is set to "true". I think that
is why you are the one hitting this problem: all or most of our testing runs
with this property at its default of "false".

Currently the code in Oplog.flush calls "force" outside of the synchronized
block on "lock". Moving the "force" call up into that synchronized block
should fix your problem.
For example, the old code:
      synchronized (this.lock/* olf */) {
        if (olf.RAFClosed) {
          return;
        }
        ByteBuffer bb = olf.writeBuf;
        if (bb != null && bb.position() != 0) {
          bb.flip();
          int flushed = 0;
          do {
            flushed += olf.channel.write(bb);
          } while (bb.hasRemaining());
          // update bytesFlushed after entire writeBuffer is flushed to fix
          // bug 41201
          olf.bytesFlushed += flushed;
          bb.clear();
        }
      }
      if (doSync) {
        if (SYNC_WRITES) {
          // Synch Meta Data as well as content
          olf.channel.force(true);
        }
      }
New code with the fix:
      synchronized (this.lock/* olf */) {
        if (olf.RAFClosed) {
          return;
        }
        ByteBuffer bb = olf.writeBuf;
        if (bb != null && bb.position() != 0) {
          bb.flip();
          int flushed = 0;
          do {
            flushed += olf.channel.write(bb);
          } while (bb.hasRemaining());
          // update bytesFlushed after entire writeBuffer is flushed to fix
          // bug 41201
          olf.bytesFlushed += flushed;
          bb.clear();
        }
        if (doSync) {
          if (SYNC_WRITES) {
            // Synch Meta Data as well as content
            olf.channel.force(true);
          }
        }
      }

I will file a ticket on this. Let me know if this fix works for you. Also
be aware that setting "gemfire.syncWrites" to "true" can really slow down
the rate at which you can write to disk.
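As an aside, if you do want to reproduce this, "gemfire.syncWrites" is a
standard JVM system property, so it can be enabled either on the command line
(-Dgemfire.syncWrites=true) or programmatically before the disk store is
created. A minimal sketch (the class name here is just for illustration):

```java
public class SyncWritesDemo {
    public static void main(String[] args) {
        // Equivalent to passing -Dgemfire.syncWrites=true on the JVM
        // command line. Must be set before the cache/disk store starts,
        // since the flag is read when the Oplog code initializes.
        System.setProperty("gemfire.syncWrites", "true");
        System.out.println(System.getProperty("gemfire.syncWrites"));
    }
}
```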


On Thu, Oct 6, 2016 at 2:47 PM, Kapil Goyal <[email protected]> wrote:

> Hi Dan,
>
> We are doing put operations within transactions.
>
> Here is the stack trace:
>
> [warn 2016/09/29 06:17:16.297 UTC <Event Processor for
> GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19]
> com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDisp
> atcher@55bded67: Exception during processing batch 448
> com.gemstone.gemfire.internal.cache.wan.GatewaySenderException:
> com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDisp
> atcher@55bded67: Exception during processing batch 448, caused by
> com.gemstone.gemfire.cache.CacheClosedException: For DiskStore:
> nsxDiskStore: Failed writing key to 
> "/common/nsxapi/data/self/BACKUPnsxDiskStore_7",
> caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore:
> nsxDiskStore: Failed writing key to 
> "/common/nsxapi/data/self/BACKUPnsxDiskStore_7",
> caused by java.io.IOException: Stream Closed
>         at com.gemstone.gemfire.internal.cache.wan.
> GatewaySenderEventCallbackDispatcher.dispatchBatch(
> GatewaySenderEventCallbackDispatcher.java:173)
>         at com.gemstone.gemfire.internal.cache.wan.
> GatewaySenderEventCallbackDispatcher.dispatchBatch(
> GatewaySenderEventCallbackDispatcher.java:83)
>         at com.gemstone.gemfire.internal.cache.wan.
> AbstractGatewaySenderEventProcessor.processQueue(
> AbstractGatewaySenderEventProcessor.java:579)
>         at com.gemstone.gemfire.internal.cache.wan.serial.
> SerialGatewaySenderEventProcessor.run(SerialGatewaySenderEventProces
> sor.java:219)
> Caused by: com.gemstone.gemfire.cache.CacheClosedException: For
> DiskStore: nsxDiskStore: Failed writing key to 
> "/common/nsxapi/data/self/BACKUPnsxDiskStore_7",
> caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore:
> nsxDiskStore: Failed writing key to 
> "/common/nsxapi/data/self/BACKUPnsxDiskStore_7",
> caused by java.io.IOException: Stream Closed
>         at com.gemstone.gemfire.internal.cache.GemFireCacheImpl$Stopper.
> generateCancelledException(GemFireCacheImpl.java:1299)
>         at com.gemstone.gemfire.CancelCriterion.checkCancelInProgress(
> CancelCriterion.java:82)
>         at com.gemstone.gemfire.internal.cache.TXManagerImpl.
> checkClosed(TXManagerImpl.java:606)
>         at com.gemstone.gemfire.internal.cache.TXManagerImpl.begin(
> TXManagerImpl.java:279)
>         at com.vmware.nsx.management.container.dao.gemfire.
> GemFireTxLogDao.processTxLog(GemFireTxLogDao.java:119)
>         at com.vmware.nsx.management.container.dao.gemfire.
> TxLogAsyncEventListener.processEvents(TxLogAsyncEventListener.java:93)
>         at com.gemstone.gemfire.internal.cache.wan.
> GatewaySenderEventCallbackDispatcher.dispatchBatch(
> GatewaySenderEventCallbackDispatcher.java:164)
>         ... 3 more
> Caused by: com.gemstone.gemfire.cache.DiskAccessException: For DiskStore:
> nsxDiskStore: Failed writing key to 
> "/common/nsxapi/data/self/BACKUPnsxDiskStore_7",
> caused by java.io.IOException: Stream Closed
>         at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.
> java:5235)
>         at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.
> java:5223)
>         at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.
> java:5215)
>         at com.gemstone.gemfire.internal.cache.Oplog.basicClose(Oplog.
> java:3312)
>         at com.gemstone.gemfire.internal.cache.Oplog.destroy(Oplog.
> java:3375)
>         at com.gemstone.gemfire.internal.cache.Oplog.
> handleEmptyAndOldest(Oplog.java:5877)
>         at com.gemstone.gemfire.internal.cache.Oplog.
> handleNoLiveValues(Oplog.java:5924)
>         at com.gemstone.gemfire.internal.cache.Oplog.basicModify(Oplog.
> java:4751)
>         at com.gemstone.gemfire.internal.cache.Oplog.
> copyForwardModifyForCompact(Oplog.java:4594)
>         at com.gemstone.gemfire.internal.cache.Oplog.compact(Oplog.
> java:6087)
>         at com.gemstone.gemfire.internal.cache.DiskStoreImpl$
> OplogCompactor.compact(DiskStoreImpl.java:2970)
>         at com.gemstone.gemfire.internal.cache.DiskStoreImpl$
> OplogCompactor.run(DiskStoreImpl.java:3034)
>         at com.gemstone.gemfire.internal.cache.DiskStoreImpl$6.run(
> DiskStoreImpl.java:4701)
>         at java.util.concurrent.Executors$RunnableAdapter.
> call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Stream Closed
>         at java.io.RandomAccessFile.getFilePointer(Native Method)
>         at com.gemstone.gemfire.internal.cache.persistence.
> UninterruptibleRandomAccessFile.getFilePointer(
> UninterruptibleRandomAccessFile.java:59)
>         at com.gemstone.gemfire.internal.cache.persistence.
> UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl
> .doUninterruptibly(UninterruptibleRandomAccessFile.java:102)
>         at com.gemstone.gemfire.internal.cache.persistence.
> UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl.force(
> UninterruptibleRandomAccessFile.java:226)
>         at com.gemstone.gemfire.internal.cache.Oplog.flush(Oplog.java:
> 5202)
>         at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.
> java:5232)
>         ... 17 more
>
> Thanks
> Kapil
>
> -----Original Message-----
> From: Dan Smith [mailto:[email protected]]
> Sent: Tuesday, October 04, 2016 10:08 AM
> To: [email protected]
> Subject: Re: OplogCompactor race issue?
>
> Hi Kapil,
>
> I haven't seen this particular issue before. Do you have any stack traces
> of the exception from the OplogCompactor? What sort of operations are you
> doing? puts?
>
> -Dan
>
> On Tue, Oct 4, 2016 at 1:16 AM, Kapil Goyal <[email protected]> wrote:
> > Hi All,
> >
> > We have been testing a single cache node with a lot of data recently
> > and frequently run into this error:
> >
> > [info 2016/09/29 06:16:06.823 UTC <OplogCompactor nsxDiskStore for
> > oplog oplog#6> tid=0x19] OplogCompactor for nsxDiskStore compaction
> oplog id(s):
> > oplog#6
> > [info 2016/09/29 06:16:08.232 UTC <OplogCompactor nsxDiskStore for
> > oplog oplog#6> tid=0x19] compaction did 6,310 creates and updates in
> > 1,408 ms [info 2016/09/29 06:16:08.248 UTC <Oplog Delete Task>
> > tid=0x19] Deleted
> > oplog#6 crf for disk store nsxDiskStore.
> > [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19]
> > Deleted
> > oplog#6 krf for disk store nsxDiskStore.
> > [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19]
> > Deleted
> > oplog#6 drf for disk store nsxDiskStore.
> > [info 2016/09/29 06:17:03.887 UTC <Event Processor for
> > GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created
> > oplog#8 drf for disk store nsxDiskStore.
> > [info 2016/09/29 06:17:03.911 UTC <Event Processor for
> > GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created
> > oplog#8 crf for disk store nsxDiskStore.
> > [info 2016/09/29 06:17:04.031 UTC <Idle OplogCompactor> tid=0x19]
> > Created
> > oplog#7 krf for disk store nsxDiskStore.
> > [info 2016/09/29 06:17:04.314 UTC <OplogCompactor nsxDiskStore for
> > oplog oplog#7> tid=0x19] OplogCompactor for nsxDiskStore compaction
> oplog id(s):
> > oplog#7
> > [error 2016/09/29 06:17:16.075 UTC <OplogCompactor nsxDiskStore for
> > oplog oplog#7> tid=0x19] A DiskAccessException has occurred while
> > writing to the disk for disk store nsxDiskStore. The cache will be
> closed.
> > ?com.gemstone.gemfire.cache.DiskAccessException: For DiskStore:
> > nsxDiskStore: Failed writing key to
> > "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by
> > java.io.IOException: Stream Closed ?at
> > com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)
> >
> > From the logs it appears there may be a race between threads "Idle
> > OplogCompactor" and "OplogCompactor nsxDiskStore for oplog oplog#7". I
> > see that both are doing operations related to oplog#7. The former logs
> > creation of a KRF file, while the latter is trying to access either
> > the DRF or the CRF file. Now, is it possible that "Idle
> > OplogCompactor" closed the DRF/CRF files for oplog#7 as part of
> > creating the KRF for the same? This is what GemFire docs say about it:
> >
> > "After the oplog is closed, GemFire also attempts to create a krf
> > file, which contains the key names as well as the offset for the value
> > within the crf file."
> >
> > Based on the above, it's possible that oplog#7 was already closed and
> > its KRF was already created, when the compactor tried to access the
> files.
> >
> > Have any of you run into this error before? Any suggestions?
> >
> > Thanks
> > Kapil
>