[jira] [Commented] (IGNITE-12127) WAL writer may close file IO with unflushed changes when MMAP is disabled

2019-09-05 Thread Dmitriy Govorukhin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923423#comment-16923423
 ] 

Dmitriy Govorukhin commented on IGNITE-12127:
-

Cherry-picked to ignite-2.7.6 402c9450dafbb201708f66d8bdab0ade0b87bd4f

> WAL writer may close file IO with unflushed changes when MMAP is disabled
> -
>
> Key: IGNITE-12127
> URL: https://issues.apache.org/jira/browse/IGNITE-12127
> Project: Ignite
>  Issue Type: Bug
>Reporter: Dmitriy Govorukhin
>Assignee: Dmitriy Govorukhin
>Priority: Critical
> Fix For: 2.7.6
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Most likely the issue manifests itself as the following critical error:
> {code}
> 2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT : 
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.StorageException: Failed to write 
> buffer.]]
> org.apache.ignite.internal.processors.cache.persistence.StorageException: 
> Failed to write buffer.
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) 
> [ignite-core-2.5.7.jar!/:2.5.7]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
> Caused by: java.nio.channels.ClosedChannelException: null
> at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) 
> ~[na:1.8.0_201]
> at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) 
> ~[na:1.8.0_201]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> ... 3 common frames omitted
> {code}
> It appears that there following sequence is possible:
>  * Thread A attempts to log a large record which does not fit segment, 
> {{addRecord}} fails and the thread A starts segment rollover. I successfully 
> runs {{flushOrWait(null)}} and gets de-scheduled before adding switch segment 
> record
>  * Thread B attempts to log another record, which fits exactly till the end 
> of the current segment. The record is added to the buffer
>  * Thread A resumes and fails to add the switch segment record. No flush is 
> performed and the thread immediately proceeds for wal-writer close
>  * WAL writer thread wakes up, sees that there is a CLOSE request, closes the 
> file IO and immediately proceeds to write unflushed changes causing the 
> exception.
> Unconditional flush after switch segment record write should fix the issue.
> Besides the bug itself, I suggest the following changes to the 
> {{FileWriteHandleImpl}} ({{FileWriteAheadLogManager}} in earlier versions):
>  * There is an {{fsync(filePtr)}} call inside {{close()}}; however, 
> {{fsync()}} checks the {{stop}} flag (which is set inside {{close}}) and 
> returns immediately after {{flushOrWait()}} if the flag is set - this is very 
> confusing. After all, the {{close()}} itself explicitly calls {{force}} after 
> flush
>  * There is an ignored IO exception in mmap mode - this should be propagated 
> to the failure handler
>  * In WAL writer, we check for file CLOSE and then attemp to write to 
> (possibly) the same write handle - write should be always before close
>  * In WAL writer, there are racy reads of current handle - it would be better 
> if we read the current handle once and then operate on it during the whole 
> loop iteration



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (IGNITE-12127) WAL writer may close file IO with unflushed changes when MMAP is disabled

2019-09-05 Thread Dmitriy Govorukhin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923412#comment-16923412
 ] 

Dmitriy Govorukhin commented on IGNITE-12127:
-

Merged to master a13337d94755d7e1cc097c6f00311552fea25ae6

> WAL writer may close file IO with unflushed changes when MMAP is disabled
> -
>
> Key: IGNITE-12127
> URL: https://issues.apache.org/jira/browse/IGNITE-12127
> Project: Ignite
>  Issue Type: Bug
>Reporter: Dmitriy Govorukhin
>Assignee: Dmitriy Govorukhin
>Priority: Critical
> Fix For: 2.7.6
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Most likely the issue manifests itself as the following critical error:
> {code}
> 2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT : 
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.StorageException: Failed to write 
> buffer.]]
> org.apache.ignite.internal.processors.cache.persistence.StorageException: 
> Failed to write buffer.
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) 
> [ignite-core-2.5.7.jar!/:2.5.7]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
> Caused by: java.nio.channels.ClosedChannelException: null
> at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) 
> ~[na:1.8.0_201]
> at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) 
> ~[na:1.8.0_201]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> ... 3 common frames omitted
> {code}
> It appears that there following sequence is possible:
>  * Thread A attempts to log a large record which does not fit segment, 
> {{addRecord}} fails and the thread A starts segment rollover. I successfully 
> runs {{flushOrWait(null)}} and gets de-scheduled before adding switch segment 
> record
>  * Thread B attempts to log another record, which fits exactly till the end 
> of the current segment. The record is added to the buffer
>  * Thread A resumes and fails to add the switch segment record. No flush is 
> performed and the thread immediately proceeds for wal-writer close
>  * WAL writer thread wakes up, sees that there is a CLOSE request, closes the 
> file IO and immediately proceeds to write unflushed changes causing the 
> exception.
> Unconditional flush after switch segment record write should fix the issue.
> Besides the bug itself, I suggest the following changes to the 
> {{FileWriteHandleImpl}} ({{FileWriteAheadLogManager}} in earlier versions):
>  * There is an {{fsync(filePtr)}} call inside {{close()}}; however, 
> {{fsync()}} checks the {{stop}} flag (which is set inside {{close}}) and 
> returns immediately after {{flushOrWait()}} if the flag is set - this is very 
> confusing. After all, the {{close()}} itself explicitly calls {{force}} after 
> flush
>  * There is an ignored IO exception in mmap mode - this should be propagated 
> to the failure handler
>  * In WAL writer, we check for file CLOSE and then attemp to write to 
> (possibly) the same write handle - write should be always before close
>  * In WAL writer, there are racy reads of current handle - it would be better 
> if we read the current handle once and then operate on it during the whole 
> loop iteration



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (IGNITE-12127) WAL writer may close file IO with unflushed changes when MMAP is disabled

2019-09-05 Thread Andrey Gura (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923405#comment-16923405
 ] 

Andrey Gura commented on IGNITE-12127:
--

[~DmitriyGovorukhin] LGTM. Thanks for contribution! Please merge to master 
branch.

> WAL writer may close file IO with unflushed changes when MMAP is disabled
> -
>
> Key: IGNITE-12127
> URL: https://issues.apache.org/jira/browse/IGNITE-12127
> Project: Ignite
>  Issue Type: Bug
>Reporter: Dmitriy Govorukhin
>Assignee: Dmitriy Govorukhin
>Priority: Critical
> Fix For: 2.7.6
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Most likely the issue manifests itself as the following critical error:
> {code}
> 2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT : 
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.StorageException: Failed to write 
> buffer.]]
> org.apache.ignite.internal.processors.cache.persistence.StorageException: 
> Failed to write buffer.
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) 
> [ignite-core-2.5.7.jar!/:2.5.7]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
> Caused by: java.nio.channels.ClosedChannelException: null
> at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) 
> ~[na:1.8.0_201]
> at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) 
> ~[na:1.8.0_201]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> ... 3 common frames omitted
> {code}
> It appears that there following sequence is possible:
>  * Thread A attempts to log a large record which does not fit segment, 
> {{addRecord}} fails and the thread A starts segment rollover. I successfully 
> runs {{flushOrWait(null)}} and gets de-scheduled before adding switch segment 
> record
>  * Thread B attempts to log another record, which fits exactly till the end 
> of the current segment. The record is added to the buffer
>  * Thread A resumes and fails to add the switch segment record. No flush is 
> performed and the thread immediately proceeds for wal-writer close
>  * WAL writer thread wakes up, sees that there is a CLOSE request, closes the 
> file IO and immediately proceeds to write unflushed changes causing the 
> exception.
> Unconditional flush after switch segment record write should fix the issue.
> Besides the bug itself, I suggest the following changes to the 
> {{FileWriteHandleImpl}} ({{FileWriteAheadLogManager}} in earlier versions):
>  * There is an {{fsync(filePtr)}} call inside {{close()}}; however, 
> {{fsync()}} checks the {{stop}} flag (which is set inside {{close}}) and 
> returns immediately after {{flushOrWait()}} if the flag is set - this is very 
> confusing. After all, the {{close()}} itself explicitly calls {{force}} after 
> flush
>  * There is an ignored IO exception in mmap mode - this should be propagated 
> to the failure handler
>  * In WAL writer, we check for file CLOSE and then attemp to write to 
> (possibly) the same write handle - write should be always before close
>  * In WAL writer, there are racy reads of current handle - it would be better 
> if we read the current handle once and then operate on it during the whole 
> loop iteration



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (IGNITE-12127) WAL writer may close file IO with unflushed changes when MMAP is disabled

2019-09-05 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923392#comment-16923392
 ] 

Ignite TC Bot commented on IGNITE-12127:


{panel:title=Branch: [pull/6840/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *-- Run :: All* 
Results|https://ci.ignite.apache.org/viewLog.html?buildId=4567088buildTypeId=IgniteTests24Java8_RunAll]

> WAL writer may close file IO with unflushed changes when MMAP is disabled
> -
>
> Key: IGNITE-12127
> URL: https://issues.apache.org/jira/browse/IGNITE-12127
> Project: Ignite
>  Issue Type: Bug
>Reporter: Dmitriy Govorukhin
>Assignee: Dmitriy Govorukhin
>Priority: Critical
> Fix For: 2.7.6
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Most likely the issue manifests itself as the following critical error:
> {code}
> 2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT : 
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.StorageException: Failed to write 
> buffer.]]
> org.apache.ignite.internal.processors.cache.persistence.StorageException: 
> Failed to write buffer.
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) 
> [ignite-core-2.5.7.jar!/:2.5.7]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
> Caused by: java.nio.channels.ClosedChannelException: null
> at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) 
> ~[na:1.8.0_201]
> at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) 
> ~[na:1.8.0_201]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
> at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437)
>  [ignite-core-2.5.7.jar!/:2.5.7]
> ... 3 common frames omitted
> {code}
> It appears that there following sequence is possible:
>  * Thread A attempts to log a large record which does not fit segment, 
> {{addRecord}} fails and the thread A starts segment rollover. I successfully 
> runs {{flushOrWait(null)}} and gets de-scheduled before adding switch segment 
> record
>  * Thread B attempts to log another record, which fits exactly till the end 
> of the current segment. The record is added to the buffer
>  * Thread A resumes and fails to add the switch segment record. No flush is 
> performed and the thread immediately proceeds for wal-writer close
>  * WAL writer thread wakes up, sees that there is a CLOSE request, closes the 
> file IO and immediately proceeds to write unflushed changes causing the 
> exception.
> Unconditional flush after switch segment record write should fix the issue.
> Besides the bug itself, I suggest the following changes to the 
> {{FileWriteHandleImpl}} ({{FileWriteAheadLogManager}} in earlier versions):
>  * There is an {{fsync(filePtr)}} call inside {{close()}}; however, 
> {{fsync()}} checks the {{stop}} flag (which is set inside {{close}}) and 
> returns immediately after {{flushOrWait()}} if the flag is set - this is very 
> confusing. After all, the {{close()}} itself explicitly calls {{force}} after 
> flush
>  * There is an ignored IO exception in mmap mode - this should be propagated 
> to the failure handler
>  * In WAL writer, we check for file CLOSE and then attemp to write to 
> (possibly) the same write handle - write should be always before close
>  * In WAL writer, there are racy reads of current handle - it would be better 
> if we read the current handle once and then operate on it during the whole 
> loop iteration



--
This message was sent by Atlassian Jira
(v8.3.2#803003)