[jira] [Assigned] (IGNITE-17673) Extend MV partition storage API with methods to help cleaning up SQL indices

2022-09-23 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17673:
--

Assignee: Ivan Bessonov

> Extend MV partition storage API with methods to help cleaning up SQL indices
> 
>
> Key: IGNITE-17673
> URL: https://issues.apache.org/jira/browse/IGNITE-17673
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> h3. TLDR;
> The following method should be added to MvPartitionStorage:
> {code:java}
> Cursor<BinaryRow> scanVersions(RowId rowId); {code}
> h3. Details
> In order to allow indices to be cleaned up, we need an extra API in the
> partition storage.
> In pseudo-code, the cleanup should look like the following:
> {code:java}
> BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);
> if (oldRow != null) {
>     Set<Index> allIndexes = getAllIndexes();
>     for (BinaryRow version : partition.scanVersions(rowId)) {
>         for (Index index : allIndexes) {
>             if (index.rowsMatch(oldRow, version)) {
>                 allIndexes.remove(index);
>             }
>         }
>         if (allIndexes.isEmpty()) {
>             break;
>         }
>     }
>     for (Index index : allIndexes) {
>         index.remove(oldRow);
>     }
> }{code}
> Now, I guess I need to explain this a little bit.
> First of all, the real implementation will probably look a bit different: the
> cursor has to be closed, oldRow must be converted to a binary tuple, and
> tombstones are not handled properly. The row-matching algorithm shouldn't be in
> the index itself, because it depends on versioned row schemas and indexes don't
> know about them. Having a set and removing from it doesn't look optimal either.
> Etc. This is just a sketch.
> Second, from the API standpoint, the method for getting the versions of a
> single key is pretty close to what I imagine:
> {code:java}
> Cursor<BinaryRow> scanVersions(RowId rowId);{code}
> Versions should be returned from newest to oldest. The timestamp itself doesn't
> seem to be necessary.
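> A minimal sketch of the same cleanup with some of those issues addressed,
> assuming {{Cursor}} is {{AutoCloseable}} and {{Iterable}} (the helper names are
> carried over from the pseudo-code above, not an existing API):
> {code:java}
> BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);
> 
> if (oldRow != null) {
>     Set<Index> remaining = getAllIndexes();
> 
>     // The cursor has to be closed, hence try-with-resources.
>     try (Cursor<BinaryRow> versions = partition.scanVersions(rowId)) {
>         for (BinaryRow version : versions) {
>             // An index entry still produced by some other version must be kept;
>             // removeIf avoids modifying the set while it is being iterated.
>             remaining.removeIf(index -> index.rowsMatch(oldRow, version));
> 
>             if (remaining.isEmpty()) {
>                 break;
>             }
>         }
>     }
> 
>     // Whatever is left is referenced only by the replaced value.
>     for (Index index : remaining) {
>         index.remove(oldRow);
>     }
> }{code}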
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17579) TxStateStorage management

2022-09-23 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608606#comment-17608606
 ] 

Ivan Bessonov commented on IGNITE-17579:


[~Denis Chudov] Looks good to me, you can proceed with merge

> TxStateStorage management
> -
>
> Key: IGNITE-17579
> URL: https://issues.apache.org/jira/browse/IGNITE-17579
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3, transaction3_rw
>
> h3. Motivation
> Currently, TxStateStorage is instantiated every time a PartitionListener is
> instantiated, probably with an incorrect storage path. This means that separate
> RocksDB instances are also instantiated, which leads to inefficient cursor
> usage and a whole set of other excessive resource costs.
> {code:java}
> new PartitionListener(
>     partitionStorage,
>     // TODO: https://issues.apache.org/jira/browse/IGNITE-17579 TxStateStorage management.
>     new TxStateRocksDbStorage(Paths.get("tx_state_storage" + tblId + partId)),
>     txManager,
>     new ConcurrentHashMap<>()
> ){code}
> All in all, instead of new TxStateRocksDbStorage(), a proper storage factory
> should be used, as is done for MvPartitionStorage.
> h3. Definition of Done
>  * A storage factory is properly used for TxStateStorage, meaning that
> TxnStateStorage uses the same amount of resources as MvPartitionStorage.
> h3. Implementation Notes
>  * It's required to add a new
> {code:java}
> TxnStateTableStorage txnStateStorage();{code}
>  method to InternalTable.
>  * Add a new interface TxnStateTableStorage with the following methods
> {code:java}
> public interface TxnStateTableStorage {
>     TxnStateStorage getOrCreateTxnStateStorage(int partitionId) throws StorageException;
> 
>     @Nullable
>     TxnStateStorage getTxnStateStorage(int partitionId);
> 
>     CompletableFuture<Void> destroyTxnStateStorage(int partitionId) throws StorageException;
> 
>     TableConfiguration configuration();
> 
>     void start() throws StorageException;
> 
>     void stop() throws StorageException;
> 
>     void destroy() throws StorageException;
> }{code}
> Not sure whether we need the TableConfiguration configuration(); method, let's
> check it during implementation.
>  * Add a RocksDB implementation of the aforementioned interface, similar to
> RocksDbTableStorage.
>  * Implement a new RocksDbTxnStateStorageEngine, similar to
> RocksDbStorageEngine, that will be used to create TxnStateTableStorage (see the
> sketch after this list).
>  * Replace direct TxnStateStorage instantiations with the proper storage
> instantiation pipeline.
>  * It's not clear what DataStorageConfiguration to use, let's check this during
> implementation.
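> A sketch of what the wiring might look like afterwards (method names such as
> {{createTable}} are assumptions here, by analogy with how MvPartitionStorage is
> obtained from its storage engine):
> {code:java}
> // Per-partition storages are obtained through the table-level storage
> // instead of being instantiated inline with a hand-built path.
> TxnStateTableStorage txnStateTableStorage = txnStateStorageEngine.createTable(tableCfg);
> 
> new PartitionListener(
>     partitionStorage,
>     txnStateTableStorage.getOrCreateTxnStateStorage(partId),
>     txManager,
>     new ConcurrentHashMap<>()
> );{code}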
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17720) Extend MvPartitionStorage scan API with write intent resolution capabilities

2022-09-22 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608576#comment-17608576
 ] 

Ivan Bessonov commented on IGNITE-17720:


My humble opinion: we probably shouldn't. I don't really see how we could use
it. Let me tell you why.

The public SQL API expects tuples as part of the result set. They are not
expected to have a RowId in them; I guess SQL just doesn't need it.

The only use for a full scan is a situation where a good index just doesn't
exist. On top of that, there's a chance it will be implemented using the PK, so
an explicit scan method may not be required at all. I'm not sure.

I heard index rebuild being mentioned. A partition scan of this design is
inapplicable to such operations: it returns only a single version of each row,
while an index rebuild requires all versions of each row. These two approaches
are incompatible.

Let's wait for [~Sergey Uttsel]'s reply though.

> Extend MvPartitionStorage scan API with write intent resolution capabilities
> 
>
> Key: IGNITE-17720
> URL: https://issues.apache.org/jira/browse/IGNITE-17720
> Project: Ignite
>  Issue Type: Bug
>Reporter: Semyon Danilov
>Priority: Major
>  Labels: ignite-3
>
> Commit of an RW transaction is not instantaneous. An RO transaction might
> require reads of data that's in the process of being committed. The current API
> doesn't support such a scenario.
> The RO API in the partition storage has only two methods: read and scan.
> For read see https://issues.apache.org/jira/browse/IGNITE-17627
> h3. Scan
> This one is tricky: we can't just return a cursor. A special type of cursor is
> required, and it must allow the same read capabilities on each individual
> element.
> API is to be defined.
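> Purely as a sketch of the shape such a cursor might take (all names here are
> assumptions, since the API is explicitly not defined yet):
> {code:java}
> // A cursor whose elements carry enough information to resolve write
> // intents individually, mirroring what the single-row read API exposes.
> interface ScanCursor extends Cursor<BinaryRow> {
>     /** Id of the transaction that wrote the current row, or null if it is committed. */
>     @Nullable UUID txId();
> }{code}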



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17627) Extend MvPartitionStorage read API with write intent resolution capabilities

2022-09-22 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608238#comment-17608238
 ] 

Ivan Bessonov commented on IGNITE-17627:


[~Sergey Uttsel] for performance reasons, the scan API is going to be more
complicated (and will be implemented in another issue). Everything else is
reflected in the PR, thank you!

[~sdanilov] code looks good to me, thank you! Please proceed with merge

> Extend MvPartitionStorage read API with write intent resolution capabilities
> 
>
> Key: IGNITE-17627
> URL: https://issues.apache.org/jira/browse/IGNITE-17627
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Semyon Danilov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Commit of an RW transaction is not instantaneous. An RO transaction might
> require reads of data that's in the process of being committed. The current API
> doesn't support such a scenario.
> The RO API in the partition storage has only two methods: {{read}} and {{scan}}.
> h3. Read
> This one is pretty simple. It should return a pair of {{binaryRow}} and
> {{txId}}. After that, the caller can check the state of the transaction and
> either return the value or repeat the call.
> There must be a way to hint to the read method that uncommitted data must be
> skipped.
> An interesting way of reading data might be required: if there's a write
> intent, but we see a commit done after the timestamp, we can safely proceed
> with reading.
> Unfortunately, such an optimization may be heavy on storage read operations,
> because it requires a "deep" look-ahead request. So, whether or not we
> implement this depends on one thing: how often do we have write intent
> resolution in real RO transactions?
> API is to be defined.
> For scans see https://issues.apache.org/jira/browse/IGNITE-17720
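> A sketch of the caller-side loop described above (the {{ReadResult}} shape and
> the helper names are assumptions, since the API is to be defined):
> {code:java}
> // read() returns the row together with the id of the transaction that wrote it.
> ReadResult res = storage.read(rowId, timestamp);
> 
> // A non-null txId marks a write intent that has to be resolved first.
> while (res.txId() != null && !isCommitted(res.txId())) {
>     resolveWriteIntent(res.txId()); // Wait for the commit or abort.
> 
>     res = storage.read(rowId, timestamp); // Repeat the call.
> }
> 
> BinaryRow row = res.binaryRow();{code}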



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16835) Add data storage support to TableDefinition (public API)

2022-09-22 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16835.

Resolution: Won't Fix

That API is obsolete and has to be removed; there's no point in modifying it

> Add data storage support to TableDefinition (public API)
> 
>
> Key: IGNITE-16835
> URL: https://issues.apache.org/jira/browse/IGNITE-16835
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Tkalenko
>Priority: Minor
>  Labels: ignite-3
>
> We need to add support for exposing the data storage
> (*org.apache.ignite.configuration.schemas.table.TableConfigurationSchema#dataStorage*)
>  to the public API (*org.apache.ignite.schema.definition.TableDefinition*).
> After that, we need to fix the tests that use the public API and delete the
> code marked *TODO: IGNITE-16835*.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17719) IgnitePdsThreadInterruptionTest#testInterruptsOnWALWrite hangs

2022-09-19 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17719:
---
Component/s: persistence

> IgnitePdsThreadInterruptionTest#testInterruptsOnWALWrite hangs
> --
>
> Key: IGNITE-17719
> URL: https://issues.apache.org/jira/browse/IGNITE-17719
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Affects Versions: 2.13
>Reporter: Ivan Bessonov
>Priority: Major
>
> {code:java}
> org.apache.ignite.internal.IgniteInterruptedCheckedException: null
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentCurrentStateStorage.nextAbsoluteSegmentIndex(SegmentCurrentStateStorage.java:100)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentAware.nextAbsoluteSegmentIndex(SegmentAware.java:105)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileArchiver.nextAbsoluteSegmentIndex(FileWriteAheadLogManager.java:2017)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileArchiver.access$1100(FileWriteAheadLogManager.java:1785)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.pollNextFile(FileWriteAheadLogManager.java:1698)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.initNextWriteHandle(FileWriteAheadLogManager.java:1512)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.rollOver(FileWriteAheadLogManager.java:1358)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:966)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:879)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry.logUpdate(GridCacheMapEntry.java:4006)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6172)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5918)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5603)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:4253)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$5700(BPlusTree.java:4147)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2225)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2206)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:2115)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1698)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1681)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:2805)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:425)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:1975)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2552)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2012)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1831)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1704)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.

[jira] [Updated] (IGNITE-17719) IgnitePdsThreadInterruptionTest#testInterruptsOnWALWrite hangs

2022-09-19 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17719:
---
Affects Version/s: 2.13

> IgnitePdsThreadInterruptionTest#testInterruptsOnWALWrite hangs
> --
>
> Key: IGNITE-17719
> URL: https://issues.apache.org/jira/browse/IGNITE-17719
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.13
>Reporter: Ivan Bessonov
>Priority: Major
>
> {code:java}
> org.apache.ignite.internal.IgniteInterruptedCheckedException: null
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentCurrentStateStorage.nextAbsoluteSegmentIndex(SegmentCurrentStateStorage.java:100)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentAware.nextAbsoluteSegmentIndex(SegmentAware.java:105)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileArchiver.nextAbsoluteSegmentIndex(FileWriteAheadLogManager.java:2017)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileArchiver.access$1100(FileWriteAheadLogManager.java:1785)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.pollNextFile(FileWriteAheadLogManager.java:1698)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.initNextWriteHandle(FileWriteAheadLogManager.java:1512)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.rollOver(FileWriteAheadLogManager.java:1358)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:966)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:879)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry.logUpdate(GridCacheMapEntry.java:4006)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6172)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5918)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5603)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:4253)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$5700(BPlusTree.java:4147)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2225)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2206)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:2115)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1698)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1681)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:2805)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:425)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:1975)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2552)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2012)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1831)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1704)
>  ~[classes/:?]
>     at 
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomi

[jira] [Created] (IGNITE-17719) IgnitePdsThreadInterruptionTest#testInterruptsOnWALWrite hangs

2022-09-19 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17719:
--

 Summary: IgnitePdsThreadInterruptionTest#testInterruptsOnWALWrite 
hangs
 Key: IGNITE-17719
 URL: https://issues.apache.org/jira/browse/IGNITE-17719
 Project: Ignite
  Issue Type: Bug
Reporter: Ivan Bessonov


{code:java}
org.apache.ignite.internal.IgniteInterruptedCheckedException: null
    at 
org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentCurrentStateStorage.nextAbsoluteSegmentIndex(SegmentCurrentStateStorage.java:100)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.wal.aware.SegmentAware.nextAbsoluteSegmentIndex(SegmentAware.java:105)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileArchiver.nextAbsoluteSegmentIndex(FileWriteAheadLogManager.java:2017)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileArchiver.access$1100(FileWriteAheadLogManager.java:1785)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.pollNextFile(FileWriteAheadLogManager.java:1698)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.initNextWriteHandle(FileWriteAheadLogManager.java:1512)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.rollOver(FileWriteAheadLogManager.java:1358)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:966)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:879)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry.logUpdate(GridCacheMapEntry.java:4006)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6172)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5918)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5603)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:4253)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$5700(BPlusTree.java:4147)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2225)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2206)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:2115)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1698)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1681)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:2805)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:425)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:1975)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2552)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2012)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1831)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1704)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicAbstractUpdateFuture.sendSingleRequest(GridNearAtomicAbstractUpdateFuture.java:300)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.map(GridNearAtomicSingleUpdateFuture.java:481)
 ~[classes/:?]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.mapOnTopology(GridNearAtomicSingleUpdateFuture.java:441)

[jira] [Updated] (IGNITE-17697) Some RocksDB-based classes leak threadpools

2022-09-16 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17697:
---
Reviewer: Kirill Tkalenko

> Some RocksDB-based classes leak threadpools
> ---
>
> Key: IGNITE-17697
> URL: https://issues.apache.org/jira/browse/IGNITE-17697
> Project: Ignite
>  Issue Type: Bug
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> RocksDbStorageEngine and TxStateRocksDbStorage don't shut their thread pools
> down, because I forgot to write that code
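> What's missing is, roughly, a standard executor shutdown on stop (a sketch;
> the field name and pool setup are assumptions):
> {code:java}
> private final ExecutorService threadPool = Executors.newFixedThreadPool(4);
> 
> @Override public void stop() throws StorageException {
>     // Shut the pool down and wait, so its threads don't outlive the engine.
>     threadPool.shutdown();
> 
>     try {
>         if (!threadPool.awaitTermination(10, TimeUnit.SECONDS)) {
>             threadPool.shutdownNow();
>         }
>     } catch (InterruptedException e) {
>         threadPool.shutdownNow();
> 
>         Thread.currentThread().interrupt();
>     }
> }{code}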



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17697) Some RocksDB-based classes leak threadpools

2022-09-16 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17697:
--

 Summary: Some RocksDB-based classes leak threadpools
 Key: IGNITE-17697
 URL: https://issues.apache.org/jira/browse/IGNITE-17697
 Project: Ignite
  Issue Type: Bug
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov
 Fix For: 3.0.0-alpha6


RocksDbStorageEngine and TxStateRocksDbStorage don't shut their thread pools
down, because I forgot to write that code



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17693) Unify all copyrights

2022-09-15 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17693:
---
Fix Version/s: 3.0.0-alpha6

> Unify all copyrights
> 
>
> Key: IGNITE-17693
> URL: https://issues.apache.org/jira/browse/IGNITE-17693
> Project: Ignite
>  Issue Type: Improvement
>  Components: ignite-3
>Reporter: Mikhail Pochatkin
>Priority: Major
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Change all copyright headers to one format:
> {code:java}
> /*
>  * Licensed to the Apache Software Foundation (ASF) under one or more
>  * contributor license agreements. See the NOTICE file distributed with
>  * this work for additional information regarding copyright ownership.
>  * The ASF licenses this file to You under the Apache License, Version 2.0
>  * (the "License"); you may not use this file except in compliance with
>  * the License. You may obtain a copy of the License at
>  *
>  *  http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17693) Unify all copyrights

2022-09-15 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17693:
---
Reviewer: Ivan Bessonov

> Unify all copyrights
> 
>
> Key: IGNITE-17693
> URL: https://issues.apache.org/jira/browse/IGNITE-17693
> Project: Ignite
>  Issue Type: Improvement
>  Components: ignite-3
>Reporter: Mikhail Pochatkin
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Change all copyright headers to one format:
> {code:java}
> /*
>  * Licensed to the Apache Software Foundation (ASF) under one or more
>  * contributor license agreements. See the NOTICE file distributed with
>  * this work for additional information regarding copyright ownership.
>  * The ASF licenses this file to You under the Apache License, Version 2.0
>  * (the "License"); you may not use this file except in compliance with
>  * the License. You may obtain a copy of the License at
>  *
>  *  http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-17652) Improve B+Tree implementation documentation in Ignite 3

2022-09-15 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-17652.

Resolution: Fixed

Looks good, thank you!

> Improve B+Tree implementation documentation in Ignite 3
> ---
>
> Key: IGNITE-17652
> URL: https://issues.apache.org/jira/browse/IGNITE-17652
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Roman Puchkovskiy
>Assignee: Roman Puchkovskiy
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The current B+Tree implementation documentation lacks a description of some
> invariants maintained by the structure. For newcomers, it will be much easier
> to understand the code if they read about the invariants first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17652) Improve B+Tree implementation documentation in Ignite 3

2022-09-15 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17652:
---
Reviewer: Ivan Bessonov

> Improve B+Tree implementation documentation in Ignite 3
> ---
>
> Key: IGNITE-17652
> URL: https://issues.apache.org/jira/browse/IGNITE-17652
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Roman Puchkovskiy
>Assignee: Roman Puchkovskiy
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The current B+Tree implementation documentation lacks a description of some
> invariants maintained by the structure. For newcomers, it will be much easier
> to understand the code if they read about the invariants first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16136) System Thread pool starvation and out of memory

2022-09-15 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16136:
---
Release Note: Fixed thread pool starvation during the binary metadata and 
marshaller mapping propagation for the client node  (was: Fixed treahd pool 
starvation during the binary metadata and marshaller mapping propagation for 
the client node)

> System Thread pool starvation and out of memory
> ---
>
> Key: IGNITE-16136
> URL: https://issues.apache.org/jira/browse/IGNITE-16136
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.7.6
>Reporter: David Albrecht
>Assignee: Maxim Muzafarov
>Priority: Critical
>  Labels: ise
> Fix For: 2.14
>
> Attachments: configuration.zip, image-2021-12-15-21-13-43-775.png, 
> image-2021-12-15-21-17-47-652.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We are experiencing thread pool starvation and, after some time,
> out-of-memory exceptions in some of our Ignite client nodes, while the server
> node seems to be running without any problems. It seems like all sys threads
> are stuck when calling MarshallerContextImpl.getClassName, which in turn leads
> to a growing worker queue.
>  
> First warnings regarding the thread pool starvation:
> {code:java}
> 10.12.21 11:22:34.603 [WARN ] 
> IgniteKernal.warning(127): Possible thread pool starvation detected (no task 
> completed in last 3ms, is system thread pool size large enough?)
> 10.12.21 11:27:34.654 [WARN ] 
> IgniteKernal.warning(127): Possible thread pool starvation detected (no task 
> completed in last 3ms, is system thread pool size large enough?)
> 10.12.21 11:32:34.713 [WARN ] 
> IgniteKernal.warning(127): Possible thread pool starvation detected (no task 
> completed in last 3ms, is system thread pool size large enough?)
> 10.12.21 11:37:34.764 [WARN ] 
> IgniteKernal.warning(127): Possible thread pool starvation detected (no task 
> completed in last 3ms, is system thread pool size large enough?)
> 10.12.21 11:42:34.796 [WARN ] 
> IgniteKernal.warning(127): Possible thread pool starvation detected (no task 
> completed in last 3ms, is system thread pool size large enough?)
> 10.12.21 11:47:34.839 [WARN ] 
> IgniteKernal.warning(127): Possible thread pool starvation detected (no task 
> completed in last 3ms, is system thread pool size large enough?)
> {code}
> Out of memory error leading to a crash of the application:
> {code}
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "https-openssl-nio-16443-ClientPoller"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "ajp-nio-16009-ClientPoller"
> 11-Dec-2021 03:07:24.446 SEVERE [Catalina-utility-1] 
> org.apache.coyote.AbstractProtocol.startAsyncTimeout Error processing async 
> timeouts
>   java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: 
> Java heap space
> {code}
> The queue full of messages:
>  !image-2021-12-15-21-17-47-652.png! 
> It seems like all sys threads are stuck while waiting at:
> {code}
> sys-#170
>   at jdk.internal.misc.Unsafe.park(ZJ)V (Native Method)
>   at java.util.concurrent.locks.LockSupport.park()V (LockSupport.java:323)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.get0(Z)Ljava/lang/Object;
>  (GridFutureAdapter.java:178)
>   at 
> org.apache.ignite.internal.util.future.GridFutureAdapter.get()Ljava/lang/Object;
>  (GridFutureAdapter.java:141)
>   at 
> org.apache.ignite.internal.MarshallerContextImpl.getClassName(BI)Ljava/lang/String;
>  (MarshallerContextImpl.java:379)
>   at 
> org.apache.ignite.internal.MarshallerContextImpl.getClass(ILjava/lang/ClassLoader;)Ljava/lang/Class;
>  (MarshallerContextImpl.java:344)
>   at 
> org.apache.ignite.internal.marshaller.optimized.OptimizedMarshallerUtils.classDescriptor(Ljava/util/concurrent/ConcurrentMap;ILjava/lang/ClassLoader;Lorg/apache/ignite/marshaller/MarshallerContext;Lorg/apache/ignite/internal/marshaller/optimized/OptimizedMarshallerIdMapper;)Lorg/apache/ignite/internal/marshaller/optimized/OptimizedClassDescriptor;
>  (OptimizedMarshallerUtils.java:264)
>   at 
> org.apache.ignite.internal.marshaller.optimized.OptimizedObjectInputStream.readObject0()Ljava/lang/Object;
>  (OptimizedObjectInputStream.java:341)
>   at 
> org.apache.ignite.internal.marshaller.optimized.OptimizedObjectInputStream.readObjectOverride()Ljava/lang/Object;
>  (OptimizedObjectInputStream.java:198)
>   at 
> java.io.ObjectInputStream.

[jira] [Commented] (IGNITE-16926) Interrupted compute job may fail a node

2022-09-14 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17604032#comment-17604032
 ] 

Ivan Bessonov commented on IGNITE-16926:


[~NSAmelchev] I see now, thank you!

That was clearly a mistake from my side. In your PR I would recommend hiding
*walWriter.close();* in the *else* branch of the mmap check.
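
That is, something like this (a sketch of the recommended shape, mirroring the
existing mmap check in resumeLogging()):
{code:java}
if (!mmap)
    walWriter.close();{code}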

An alternative way to fix it would be like this, I believe, but it seems more
dangerous to me; we would need to look into the code more carefully:
{code:java}
Index: 
modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/wal/filehandle/FileHandleManagerImpl.java
diff --git 
a/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/wal/filehandle/FileHandleManagerImpl.java
 
b/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/wal/filehandle/FileHandleManagerImpl.java
--- 
a/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/wal/filehandle/FileHandleManagerImpl.java
    (revision 18ff1592f9c7f78abad2b62b9c7a2034bb72796e)
+++ 
b/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/wal/filehandle/FileHandleManagerImpl.java
    (date 1663154842011)
@@ -216,8 +216,7 @@
 
     /** {@inheritDoc} */
     @Override public void resumeLogging() {
-        if (!mmap)
-            walWriter.restart();
+        walWriter.restart();
 
         if (cctx.kernalContext().clientNode())
             return;
@@ -475,7 +474,7 @@
          * @param expPos Expected position.
          */
         void flushBuffer(long expPos) throws IgniteCheckedException {
-            if (mmap)
+            if (mmap && expPos >= 0)
                 return;
 
             Throwable err = walWriter.err;
 {code}

> Interrupted compute job may fail a node
> ---
>
> Key: IGNITE-16926
> URL: https://issues.apache.org/jira/browse/IGNITE-16926
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ise.lts
> Fix For: 2.14
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code:java}
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
> corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden 
> ","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>  B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag.

[jira] [Created] (IGNITE-17676) Race in AwaitTasksCompletionExecutor

2022-09-14 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17676:
--

 Summary: Race in AwaitTasksCompletionExecutor
 Key: IGNITE-17676
 URL: https://issues.apache.org/jira/browse/IGNITE-17676
 Project: Ignite
  Issue Type: Bug
Reporter: Ivan Bessonov


"execute" method registers a future that doesn't wait for heartbeat event, 
making tests flaky. The easiest way to fix it is changing class itself. Writing 
tests that consider this race condition is close to impossible



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17676) Race in AwaitTasksCompletionExecutor

2022-09-14 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17676:
--

Assignee: Ivan Bessonov

> Race in AwaitTasksCompletionExecutor
> 
>
> Key: IGNITE-17676
> URL: https://issues.apache.org/jira/browse/IGNITE-17676
> Project: Ignite
>  Issue Type: Bug
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> "execute" method registers a future that doesn't wait for heartbeat event, 
> making tests flaky. The easiest way to fix it is changing class itself. 
> Writing tests that consider this race condition is close to impossible



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-16926) Interrupted compute job may fail a node

2022-09-14 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17603926#comment-17603926
 ] 

Ivan Bessonov commented on IGNITE-16926:


Hi [~NSAmelchev],

as far as I see, {{FileHandleManagerImpl.WALWriter#body}} does close a
{{fileIO}} instance when we call {{walWriter.close()}}.

Can you explain your scenario in more detail? Is this a different {{fileIO}}
instance?

The idea of the fix was to move all remaining {{fileIO}} manipulations to the
WAL writer thread, thus preventing channel interruption if a user thread is
interrupted.

> Interrupted compute job may fail a node
> ---
>
> Key: IGNITE-16926
> URL: https://issues.apache.org/jira/browse/IGNITE-16926
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ise.lts
> Fix For: 2.14
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code:java}
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
> corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden 
> ","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>  B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden ]] at 

[jira] [Updated] (IGNITE-17673) Extend MV partition storage API with methods to help cleaning up SQL indices

2022-09-13 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17673:
---
Description: 
h3. TLDR;

The following method should be added to MvPartitionStorage:
{code:java}
Cursor<BinaryRow> scanVersions(RowId rowId); {code}
h3. Details

In order to allow indices to be cleaned up, we need an extra API in the
partition storage.

In pseudo-code, the cleanup should look like the following:
{code:java}
BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);

if (oldRow != null) {
    Set<Index> allIndexes = getAllIndexes();

    for (BinaryRow version : partition.scanVersions(rowId)) {
        for (Index index : allIndexes) {
            if (index.rowsMatch(oldRow, version)) {
                allIndexes.remove(index);
            }
        }

        if (allIndexes.isEmpty()) {
            break;
        }
    }

    for (Index index : allIndexes) {
        index.remove(oldRow);
    }
}{code}
Now, I guess I need to explain this a little bit.

First of all, the real implementation will probably look a bit different: the
cursor has to be closed, oldRow must be converted to a binary tuple, and
tombstones are not handled properly. The row-matching algorithm shouldn't be in
the index itself, because it depends on versioned row schemas and indexes don't
know about them. Having a set and removing from it doesn't look optimal either.
Etc. This is just a sketch.

Second, from the API standpoint, the method for getting the versions of a
single key is pretty close to what I imagine:
{code:java}
Cursor<BinaryRow> scanVersions(RowId rowId);{code}
Versions should be returned from newest to oldest. The timestamp itself doesn't
seem to be necessary.

 

  was:
h3. TLDR;

The following method should be added to MvPartitionStorage:
{code:java}
Cursor<BinaryRow> scanVersions(RowId rowId); {code}
h3. Details

In order to allow indices to be cleaned up, we need an extra API in the
partition storage.

In pseudo-code, the cleanup should look like the following:
{code:java}
BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);

if (oldRow != null) {
    Set<Index> allIndexes = getAllIndexes();

    for (BinaryRow version : partition.scanVersions(rowId)) {
        for (Index index : allIndexes) {
            if (index.rowsMatch(oldRow, version)) {
                allIndexes.remove(index);
            }
        }

        if (allIndexes.isEmpty()) {
            break;
        }
    }

    for (Index index : allIndexes) {
        index.remove(oldRow);
    }
}{code}
Now, I guess I need to explain this a little bit.

First of all, the real implementation will probably look a bit different: the
cursor has to be closed, and oldRow must be converted to a binary tuple. The
row-matching algorithm shouldn't be in the index itself, because it depends on
versioned row schemas and indexes don't know about them. Having a set and
removing from it doesn't look optimal either. Etc. This is just a sketch.

Second, from the API standpoint, the method for getting the versions of a
single key is pretty close to what I imagine:
{code:java}
Cursor<BinaryRow> scanVersions(RowId rowId);{code}
Versions should be returned from newest to oldest. The timestamp itself doesn't
seem to be necessary.

 


> Extend MV partition storage API with methods to help cleaning up SQL indices
> 
>
> Key: IGNITE-17673
> URL: https://issues.apache.org/jira/browse/IGNITE-17673
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> h3. TLDR;
> The following method should be added to MvPartitionStorage:
> {code:java}
> Cursor<BinaryRow> scanVersions(RowId rowId); {code}
> h3. Details
> In order to allow indices to be cleaned up, we need an extra API in the
> partition storage.
> In pseudo-code, the cleanup should look like the following:
> {code:java}
> BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);
> if (oldRow != null) {
>     Set<Index> allIndexes = getAllIndexes();
>     for (BinaryRow version : partition.scanVersions(rowId)) {
>         for (Index index : allIndexes) {
>             if (index.rowsMatch(oldRow, version)) {
>                 allIndexes.remove(index);
>             }
>         }
>         if (allIndexes.isEmpty()) {
>             break;
>         }
>     }
>     for (Index index : allIndexes) {
>         index.remove(oldRow);
>     }
> }{code}
> Now, I guess I need to explain this a little bit.
> First of all, the real implementation will probably look a bit different: the
> cursor has to be closed, oldRow must be converted to a binary tuple, and
> tombstones are not handled properly. The row-matching algorithm shouldn't be in
> the index itself, because it depends on versioned row schemas and indexes don't
> know about them. Having a set and removing from it doesn't look optimal either.
> Etc. This is just a sketch.
> Second, from 

[jira] [Updated] (IGNITE-17673) Extend MV partition storage API with methods to help cleaning up SQL indices

2022-09-13 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17673:
---
Description: 
h3. TLDR;

The following method should be added to MvPartitionStorage:
{code:java}
Cursor<BinaryRow> scanVersions(RowId rowId); {code}
h3. Details

In order to allow indices to be cleaned up, we need an extra API in the
partition storage.

In pseudo-code, the cleanup should look like the following:
{code:java}
BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);

if (oldRow != null) {
    Set<Index> allIndexes = getAllIndexes();

    for (BinaryRow version : partition.scanVersions(rowId)) {
        for (Index index : allIndexes) {
            if (index.rowsMatch(oldRow, version)) {
                allIndexes.remove(index);
            }
        }

        if (allIndexes.isEmpty()) {
            break;
        }
    }

    for (Index index : allIndexes) {
        index.remove(oldRow);
    }
}{code}
Now, I guess I need to explain this a little bit.

First of all, the real implementation will probably look a bit different: the
cursor has to be closed, and oldRow must be converted to a binary tuple. The
row-matching algorithm shouldn't be in the index itself, because it depends on
versioned row schemas and indexes don't know about them. Having a set and
removing from it doesn't look optimal either. Etc. This is just a sketch.

Second, from the API standpoint, the method for getting the versions of a
single key is pretty close to what I imagine:
{code:java}
Cursor<BinaryRow> scanVersions(RowId rowId);{code}
Versions should be returned from newest to oldest. The timestamp itself doesn't
seem to be necessary.

 

  was:
In order to allow indices to be cleaned up, we need an extra API in the
partition storage.

In pseudo-code, the cleanup should look like the following:
{code:java}
BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);

if (oldRow != null) {
    Set<Index> allIndexes = getAllIndexes();

    for (BinaryRow version : partition.scanVersions(rowId)) {
        for (Index index : allIndexes) {
            if (index.rowsMatch(oldRow, version)) {
                allIndexes.remove(index);
            }
        }

        if (allIndexes.isEmpty()) {
            break;
        }
    }

    for (Index index : allIndexes) {
        index.remove(oldRow);
    }
}{code}
Now, I guess I need to explain this a little bit.

First of all, the real implementation will probably look a bit different: the
cursor has to be closed, and oldRow must be converted to a binary tuple. The
row-matching algorithm shouldn't be in the index itself, because it depends on
versioned row schemas and indexes don't know about them. Having a set and
removing from it doesn't look optimal either. Etc. This is just a sketch.

Second, from the API standpoint, the method for getting the versions of a
single key is pretty close to what I imagine:
{code:java}
Cursor<BinaryRow> scanVersions(RowId rowId);{code}
Versions should be returned from newest to oldest. The timestamp itself doesn't
seem to be necessary.

 


> Extend MV partition storage API with methods to help cleaning up SQL indices
> 
>
> Key: IGNITE-17673
> URL: https://issues.apache.org/jira/browse/IGNITE-17673
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> h3. TLDR;
> The following method should be added to MvPartitionStorage:
> {code:java}
> Cursor<BinaryRow> scanVersions(RowId rowId); {code}
> h3. Details
> In order to allow indices to be cleaned up, we need an extra API in the
> partition storage.
> In pseudo-code, the cleanup should look like the following:
> {code:java}
> BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);
> if (oldRow != null) {
>     Set<Index> allIndexes = getAllIndexes();
>     for (BinaryRow version : partition.scanVersions(rowId)) {
>         for (Index index : allIndexes) {
>             if (index.rowsMatch(oldRow, version)) {
>                 allIndexes.remove(index);
>             }
>         }
>         if (allIndexes.isEmpty()) {
>             break;
>         }
>     }
>     for (Index index : allIndexes) {
>         index.remove(oldRow);
>     }
> }{code}
> Now, I guess I need to explain this a little bit.
> First of all, the real implementation will probably look a bit different: the
> cursor has to be closed, and oldRow must be converted to a binary tuple. The
> row-matching algorithm shouldn't be in the index itself, because it depends on
> versioned row schemas and indexes don't know about them. Having a set and
> removing from it doesn't look optimal either. Etc. This is just a sketch.
> Second, from the API standpoint, the method for getting the versions of a
> single key is pretty close to what I imagine:
> {code:java}
> Cursor<BinaryRow> scanVersions(RowId rowId);{code}
> Versions should be returned from newest to oldest. Times

[jira] [Updated] (IGNITE-17673) Extend MV partition storage API with methods to help cleaning up SQL indices

2022-09-13 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17673:
---
Description: 
In order to allow indices to be cleaned, we need extra API in partition storage.

In pseudo-code, cleanup should look like the following:
{code:java}
BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);

if (oldRow != null) {
    Set<Index> allIndexes = getAllIndexes();

    try (Cursor<BinaryRow> versions = partition.scanVersions(rowId)) {
        for (BinaryRow version : versions) {
            // removeIf avoids modifying the set while iterating over it.
            allIndexes.removeIf(index -> index.rowsMatch(oldRow, version));

            if (allIndexes.isEmpty()) {
                break;
            }
        }
    }

    for (Index index : allIndexes) {
        index.remove(oldRow);
    }
}{code}
Now, I guess I need to explain this a little bit.

First of all, the real implementation will probably look a bit different. The 
cursor has to be closed (the sketch above uses try-with-resources for that), and 
oldRow must be converted to a binary tuple. The row-matching algorithm shouldn't 
live in the index itself, because it depends on versioned row schemas, and 
indexes don't know about them. Having a set and removing from it doesn't look 
optimal either. Etc. This is just a sketch.

Second, from the API standpoint for getting versions for a single key, it's 
pretty close to what I imagine:
{code:java}
Cursor<BinaryRow> scanVersions(RowId rowId);{code}
Versions should be returned from newest to oldest. Timestamp itself doesn't 
seem to be necessary.

 

  was:
In order to allow indices to be cleaned, we need extra API in partition storage.

In pseudo-code, cleanup should look like following:

 
{code:java}
BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);

if (oldRow != null) {
Set allIndexes = getAllIndexes();

for (BinaryRow version : partition.scanVersions(rowId)) {
for (Index index : allIndexes) {
if (index.rowsMatch(oldRow, version)) {
allIndexes.remove(index);
}
}

if (allIndexes.isEmpty()) {
break;
}
}

for (Index index : allIndexes) {
index.remove(oldRow);
}
}{code}
Now, I guess I need to explain this a little bit.

First of all, the real implementation will probably look a bit different. 
Cursor has to be closed, oldRow must be converted to a binary tuple. Rows 
matching algorithm shouldn't be in the index itself, because it depends on 
versioned row schemas and indexes don't know about them. Having a set and 
removing from it doesn't look optimal either. Etc. This is just a sketch.

Second, from the API standpoint for getting versions for a single key, it's 
pretty accurate to what I imagine:
{code:java}
Cursor scanVersions(RowId rowId);{code}
Versions should be returned from newest to oldest. Timestamp itself doesn't 
seem to be necessary.

 


> Extend MV partition storage API with methods to help cleaning up SQL indices
> 
>
> Key: IGNITE-17673
> URL: https://issues.apache.org/jira/browse/IGNITE-17673
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> In order to allow indices to be cleaned, we need extra API in partition 
> storage.
> In pseudo-code, cleanup should look like following:
> {code:java}
> BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);
> if (oldRow != null) {
> Set allIndexes = getAllIndexes();
> for (BinaryRow version : partition.scanVersions(rowId)) {
> for (Index index : allIndexes) {
> if (index.rowsMatch(oldRow, version)) {
> allIndexes.remove(index);
> }
> }
> if (allIndexes.isEmpty()) {
> break;
> }
> }
> for (Index index : allIndexes) {
> index.remove(oldRow);
> }
> }{code}
> Now, I guess I need to explain this a little bit.
> First of all, the real implementation will probably look a bit different. 
> Cursor has to be closed, oldRow must be converted to a binary tuple. Rows 
> matching algorithm shouldn't be in the index itself, because it depends on 
> versioned row schemas and indexes don't know about them. Having a set and 
> removing from it doesn't look optimal either. Etc. This is just a sketch.
> Second, from the API standpoint for getting versions for a single key, it's 
> pretty close to what I imagine:
> {code:java}
> Cursor<BinaryRow> scanVersions(RowId rowId);{code}
> Versions should be returned from newest to oldest. Timestamp itself doesn't 
> seem to be necessary.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17673) Extend MV partition storage API with methods to help cleaning up SQL indices

2022-09-13 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17673:
--

 Summary: Extend MV partition storage API with methods to help 
cleaning up SQL indices
 Key: IGNITE-17673
 URL: https://issues.apache.org/jira/browse/IGNITE-17673
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


In order to allow indices to be cleaned, we need extra API in partition storage.

In pseudo-code, cleanup should look like the following:
{code:java}
BinaryRow oldRow = partition.addWrite(rowId, txId, partitionId, newRow);

if (oldRow != null) {
    Set<Index> allIndexes = getAllIndexes();

    try (Cursor<BinaryRow> versions = partition.scanVersions(rowId)) {
        for (BinaryRow version : versions) {
            // removeIf avoids modifying the set while iterating over it.
            allIndexes.removeIf(index -> index.rowsMatch(oldRow, version));

            if (allIndexes.isEmpty()) {
                break;
            }
        }
    }

    for (Index index : allIndexes) {
        index.remove(oldRow);
    }
}{code}
Now, I guess I need to explain this a little bit.

First of all, the real implementation will probably look a bit different. The 
cursor has to be closed (the sketch above uses try-with-resources for that), and 
oldRow must be converted to a binary tuple. The row-matching algorithm shouldn't 
live in the index itself, because it depends on versioned row schemas, and 
indexes don't know about them. Having a set and removing from it doesn't look 
optimal either. Etc. This is just a sketch.

Second, from the API standpoint for getting versions for a single key, it's 
pretty close to what I imagine:
{code:java}
Cursor<BinaryRow> scanVersions(RowId rowId);{code}
Versions should be returned from newest to oldest. Timestamp itself doesn't 
seem to be necessary.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17654) Set executable flags to gradlew and gradlew.bat

2022-09-08 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17654:
--

Assignee: Ivan Bessonov

> Set executable flags to gradlew and gradlew.bat
> ---
>
> Key: IGNITE-17654
> URL: https://issues.apache.org/jira/browse/IGNITE-17654
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Title says it all



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17654) Set executable flags to gradlew and gradlew.bat

2022-09-08 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17654:
--

 Summary: Set executable flags to gradlew and gradlew.bat
 Key: IGNITE-17654
 URL: https://issues.apache.org/jira/browse/IGNITE-17654
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Title says it all



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17653) Fix gradle build

2022-09-08 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17653:
---
Fix Version/s: 3.0.0-alpha6

> Fix gradle build 
> -
>
> Key: IGNITE-17653
> URL: https://issues.apache.org/jira/browse/IGNITE-17653
> Project: Ignite
>  Issue Type: Bug
>  Components: build, ignite-3
>Reporter: Mikhail Pochatkin
>Assignee: Mikhail Pochatkin
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> After IGNITE-15931 and IGNITE-16040 some dependencies were changed. Need to 
> sync them in Gradle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17653) Fix gradle build

2022-09-08 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17653:
---
Labels: ignite-3  (was: )

> Fix gradle build 
> -
>
> Key: IGNITE-17653
> URL: https://issues.apache.org/jira/browse/IGNITE-17653
> Project: Ignite
>  Issue Type: Bug
>  Components: build, ignite-3
>Reporter: Mikhail Pochatkin
>Assignee: Mikhail Pochatkin
>Priority: Major
>  Labels: ignite-3
>
> After IGNITE-15931 and IGNITE-16040 some dependencies were changed. Need to 
> sync them in Gradle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17579) TxStateStorage management

2022-09-07 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601209#comment-17601209
 ] 

Ivan Bessonov commented on IGNITE-17579:


We can go even further and make it shared between all tables, like 
RocksDB-based RAFT log storage.

> TxStateStorage management
> -
>
> Key: IGNITE-17579
> URL: https://issues.apache.org/jira/browse/IGNITE-17579
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Denis Chudov
>Priority: Major
>  Labels: ignite-3, transaction3_rw
>
> h3. Motivation
> Currently TxStateStorage is instantiated every time a PartitionListener is 
> instantiated, probably with an incorrect storage path. This actually means 
> that separate RocksDB instances will also be instantiated, which leads to 
> inefficient cursor usage and a whole set of excessive resource utilization.
> {code:java}
> new PartitionListener(
> partitionStorage,
> // TODO: https://issues.apache.org/jira/browse/IGNITE-17579 
> TxStateStorage management.
> new TxStateRocksDbStorage(Paths.get("tx_state_storage" + tblId + 
> partId)),
> txManager,
> new ConcurrentHashMap<>()
> ){code}
> All in all, instead of new TxStateRocksDbStorage(), a proper storage factory 
> should be used, like it's done for MVPartitionStorage.
> h3. Definition of Done
>  * Proper usage of a storage factory for TxStateStorage is expected, meaning 
> that the same amount of resources is used for TxnStateStorage as for 
> MVPartitionStorage.
> h3. Implementation Notes
>  * It's required to add new 
> {code:java}
> TxnStateTableStorage txnStateStorage();{code}
>  method to InternalTable.
>  * Add new interface TxnStateTableStorage with following methods
> {code:java}
> public interface TxnStateTableStorage {
> TxnStateStorage getOrCreateTxnStateStorage(int partitionId) throws 
> StorageException;
> @Nullable
> TxnStateStorage getTxnStateStorage(int partitionId);
> CompletableFuture<Void> destroyTxnStateStorage(int partitionId) throws 
> StorageException;
> TableConfiguration configuration(); 
> void start() throws StorageException;
> void stop() throws StorageException;
> void destroy() throws StorageException;
> }{code}
> Not sure whether we need the TableConfiguration configuration(); method, let's 
> check it during implementation.
>  * Add RocksDb implementation of aforementioned interface similar to 
> RocksDbTableStorage
>  * Implement new RocksDbTxnStateStorageEngine similar to RocksDbStorageEngine 
> that'll be used in order to create TxnStateTableStorage
>  * Update direct TxnStateStorage instantiations with proper storage 
> instantiation pipeline.
>  * It's not clear what DataStorageConfiguration to use, let's check this 
> during implementation.
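
For illustration, the resulting wiring could look like the following sketch (it 
reuses the names from the notes above; everything else, such as how {{partId}} 
is obtained, is assumed):
{code:java}
// Hypothetical sketch: obtain the per-partition tx state storage from the
// table-level storage instead of creating a new RocksDB instance per listener.
TxnStateTableStorage txnStateTableStorage = internalTable.txnStateStorage();
TxnStateStorage txnStateStorage = txnStateTableStorage.getOrCreateTxnStateStorage(partId);

new PartitionListener(
    partitionStorage,
    txnStateStorage, // shared storage engine underneath, no per-listener RocksDB
    txManager,
    new ConcurrentHashMap<>()
);{code}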
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17611) Implement proper local storage recovery for transaction state store

2022-09-07 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17611:
--

Assignee: Ivan Bessonov

> Implement proper local storage recovery for transaction state store
> ---
>
> Key: IGNITE-17611
> URL: https://issues.apache.org/jira/browse/IGNITE-17611
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> h3. Preliminaries
> Current design expects transaction states to be replicated using the same 
> RAFT groups that process partition transactional data. In code this means 
> that there are two physical storages associated with a single state machine. 
> This design is easy to achieve when the system is stable, but fault tolerance 
> and basic node restart might introduce some complications.
> h3. Partition storage design
> By itself, partition storage works this way:
>  * every write command writes value of the RAFT log index, associated with 
> the command;
>  * this index value is written atomically with the data from the command;
>  * updates are accumulated in the memory buffer before being written to disk.
>  * upon restart, we read the value of the last applied index and proceed with 
> the recovery process from it. It's done with the RAFT snapshots infrastructure.
> h3. Changes to tx state store
> Basically, everything has to be repeated:
>  * applied index value must be introduced to tx state storage;
>  * updates must be atomic;
>  * on restart, we should use the minimal value of the last applied index from 
> both TX State and MvPartition storages ({{{}PartitionSnapshotStorage{}}} has 
> to be changed).
> h3. Other necessary changes
>  * atomic flush must be set up for the tx state storage. WAL should be 
> disabled;
>  * snapshot command must trigger the flush. Please refer to 
> {{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for 
> implementation reference. Listener class can be generified and reused;
>  * assertion in {{PartitionListener#onWrite}} should be removed or 
> drastically improved;
>  * read operations on storages must be prohibited until local recovery is 
> completed - we should apply all commands up to the "commitIndex" value that's 
> been read at the start of the node, otherwise storages may contain data that 
> is inconsistent with each other.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17339) Implement B+Tree based hash index storage

2022-09-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17339:
---
Reviewer: Semyon Danilov  (was: Roman Puchkovskiy)

> Implement B+Tree based hash index storage
> -
>
> Key: IGNITE-17339
> URL: https://issues.apache.org/jira/browse/IGNITE-17339
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Please refer to IGNITE-17320 and issues from the epic for the gist. It's 
> basically the same thing, but with hash slapped inside of the tree pages and 
> a simplified comparison algorithm.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17339) Implement B+Tree based hash index storage

2022-09-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17339:
---
Reviewer: Roman Puchkovskiy

> Implement B+Tree based hash index storage
> -
>
> Key: IGNITE-17339
> URL: https://issues.apache.org/jira/browse/IGNITE-17339
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Please refer to IGNITE-17320 and issues from the epic for the gist. It's 
> basically the same thing, but with hash slapped inside of the tree pages and 
> a simplified comparison algorithm.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17627) Extend MvPartitionStorage API with write intent resolution capabilities

2022-09-06 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17627:
--

 Summary: Extend MvPartitionStorage API with write intent 
resolution capabilities
 Key: IGNITE-17627
 URL: https://issues.apache.org/jira/browse/IGNITE-17627
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Commit of an RW transaction is not instantaneous. An RO transaction might 
require reads of data that's in the process of being committed. The current API 
doesn't support such a scenario.

RO API in partition storage has only two methods: {{read}} and {{{}scan{}}}.
h3. Read

This one is pretty simple. It should return a pair of {{binaryRow}} and 
{{{}txId{}}}. After that, the caller can check the state of the transaction and 
either return the value or repeat the call.

There must be a way to hint the read method that uncommitted data must be 
skipped.

An interesting way of reading data might be required: if there's a write 
intent, but we see a commit done after the timestamp, we can safely proceed 
with reading.

Unfortunately, such an optimization may be heavy on storage read operations, 
because it requires a "deep" look-ahead request. So, whether or not we 
implement this depends on one thing - how often does write intent resolution 
happen in real RO transactions?

API is to be defined.
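
For illustration, a caller-side sketch of such a read (all names here - 
{{ReadResult}}, {{TxState}}, the skip-uncommitted overload - are assumptions, 
since the API is yet to be defined):
{code:java}
// Hypothetical write-intent-resolving read; the final API is to be defined.
ReadResult result = partition.read(rowId, readTimestamp);

if (result.isWriteIntent()) {
    TxState state = txManager.state(result.txId());

    if (state == TxState.COMMITTED) {
        return result.binaryRow(); // The intent belongs to a committed transaction.
    }

    // Repeat the call, hinting that uncommitted data must be skipped.
    return partition.read(rowId, readTimestamp, ReadMode.SKIP_UNCOMMITTED).binaryRow();
}

return result.binaryRow();{code}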
h3. Scan

This one is tricky: we can't just return a cursor. A special type of cursor is 
required, and it must allow the same read capabilities on each individual element.

API is to be defined.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17626) Implement drop index in page-memory based storages

2022-09-06 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17626:
--

 Summary: Implement drop index in page-memory based storages
 Key: IGNITE-17626
 URL: https://issues.apache.org/jira/browse/IGNITE-17626
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Both sync and async versions of the operation should be implemented. The schema 
may not be known at the moment of destruction.

Integrate it into tests.

Maybe implement destruction on start for indexes whose destruction had not 
completed before the restart.
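
A possible shape for this pair of operations (a sketch only; the interface and 
method names are assumptions):
{code:java}
// Hypothetical API sketch: destroy an index without knowing its schema.
interface IndexStorageDestruction {
    // Synchronous variant: blocks until the index data is gone.
    void destroyIndex(UUID indexId) throws StorageException;

    // Asynchronous variant: completes when the destruction is durable.
    CompletableFuture<Void> destroyIndexAsync(UUID indexId);
}{code}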



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17611) Implement proper local storage recovery for transaction state store

2022-09-01 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17611:
---
Description: 
h3. Preliminaries

Current design expects transaction states to be replicated using the same RAFT 
groups that process partition transactional data. In code this means that there 
are two physical storages associated with a single state machine. This design 
is easy to achieve when the system is stable, but fault tolerance and basic 
node restart might introduce some complications.
h3. Partition storage design

By itself, partition storage works this way:
 * every write command writes the value of the RAFT log index associated with 
the command;
 * this index value is written atomically with the data from the command;
 * updates are accumulated in the memory buffer before being written to disk;
 * upon restart, we read the value of the last applied index and proceed with 
the recovery process from it. It's done with the RAFT snapshots infrastructure.

h3. Changes to tx state store

Basically, everything has to be repeated:
 * applied index value must be introduced to tx state storage;
 * updates must be atomic;
 * on restart, we should use the minimal value of the last applied index from 
both TX State and MvPartition storages ({{{}PartitionSnapshotStorage{}}} has to 
be changed).

h3. Other necessary changes
 * atomic flush must be set up for the tx state storage. WAL should be disabled;
 * snapshot command must trigger the flush. Please refer to 
{{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for 
implementation reference. Listener class can be generified and reused;
 * assertion in {{PartitionListener#onWrite}} should be removed or drastically 
improved;
 * read operations on storages must be prohibited until local recovery is 
completed - we should apply all commands up to the "commitIndex" value that's 
been read at the start of the node, otherwise storages may contain data that is 
inconsistent with each other.
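
To illustrate the recovery rule above, a minimal sketch (method names like 
{{lastAppliedIndex()}} and {{replayFrom()}} are assumed here, not the final API):
{code:java}
// Hypothetical recovery sketch: one state machine, two storages - replay from
// the smaller applied index so both storages catch up consistently.
long recoveryStartIndex = Math.min(
        mvPartitionStorage.lastAppliedIndex(),
        txStateStorage.lastAppliedIndex());

// Reads stay prohibited until every command up to "commitIndex" is re-applied;
// otherwise the two storages may contain mutually inconsistent data.
raftGroup.replayFrom(recoveryStartIndex + 1);{code}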

  was:
h3. Preliminaries

Current design expects transaction states to be replicated using the same RAFT 
groups that process partition transactional data. In code this means that there 
are two physical storages associated with a single state machine. This design 
is easy to achieve when the system is stable, but fault tolerance and basic 
node restart might introduce some complications.
h3. Partition storage design

By itself, partition storage works this way:
 * every write command writes value of the RAFT log index, associated with the 
command;
 * this index value is written atomically with the data from the comment;
 * updates are accumulated in the memory buffer before being written to disk.
 * upon restart, we read the value of the last applied index and proceed the 
recovery process from it. It's done with RAFT snapshots infrastructure.

h3. Changes to tx state store

Basically, everything has to be repeated:
 * applied index value must be introduced to tx state storage;
 * updates must be atomic;
 * on restart, we should use the minimal value of last applied index from both 
TX State and MvPartinion storages ({{{}PartitionSnapshotStorage{}}} has to be 
changed).

h3. Other necessary changes
 * atomic flush must be set up for the tx state storage. WAL should be disabled;
 * snapshot command must trigger the flush. Please refer to 
{{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for 
implementation reference. Listener class can be generified and reused;
 * assertion in {{PartitionListener#onWrite}} should be removed or drastically 
improved;
 * read operation on storages must be prohibited until local recovery is 
completed - we should apply all command up to "commitIndex" value that's been 
read at the start of the node, otherwise storages may have data, inconsistent 
with each other.


> Implement proper local storage recovery for transaction state store
> ---
>
> Key: IGNITE-17611
> URL: https://issues.apache.org/jira/browse/IGNITE-17611
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> h3. Preliminaries
> Current design expects transaction states to be replicated using the same 
> RAFT groups that process partition transactional data. In code this means 
> that there are two physical storages associated with a single state machine. 
> This design is easy to achieve when the system is stable, but fault tolerance 
> and basic node restart might introduce some complications.
> h3. Partition storage design
> By itself, partition storage works this way:
>  * every write command writes value of the RAFT log index, associated with 
> the command;
>  * this index value is written atomically with the data from the command;
>  * updates are accumulated in the memory buffer before being written to disk.

[jira] [Created] (IGNITE-17611) Implement proper local storage recovery for transaction state store

2022-09-01 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17611:
--

 Summary: Implement proper local storage recovery for transaction 
state store
 Key: IGNITE-17611
 URL: https://issues.apache.org/jira/browse/IGNITE-17611
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


h3. Preliminaries

Current design expects transaction states to be replicated using the same RAFT 
groups that process partition transactional data. In code this means that there 
are two physical storages associated with a single state machine. This design 
is easy to achieve when the system is stable, but fault tolerance and basic 
node restart might introduce some complications.
h3. Partition storage design

By itself, partition storage works this way:
 * every write command writes value of the RAFT log index, associated with the 
command;
 * this index value is written atomically with the data from the comment;
 * updates are accumulated in the memory buffer before being written to disk.
 * upon restart, we read the value of the last applied index and proceed with 
the recovery process from it. It's done with the RAFT snapshots infrastructure.

h3. Changes to tx state store

Basically, everything has to be repeated:
 * applied index value must be introduced to tx state storage;
 * updates must be atomic;
 * on restart, we should use the minimal value of the last applied index from 
both TX State and MvPartition storages ({{{}PartitionSnapshotStorage{}}} has to 
be changed).

h3. Other necessary changes
 * atomic flush must be set up for the tx state storage. WAL should be disabled;
 * snapshot command must trigger the flush. Please refer to 
{{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for 
implementation reference. Listener class can be generified and reused;
 * assertion in {{PartitionListener#onWrite}} should be removed or drastically 
improved;
 * read operations on storages must be prohibited until local recovery is 
completed - we should apply all commands up to the "commitIndex" value that's 
been read at the start of the node, otherwise storages may contain data that is 
inconsistent with each other.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17339) Implement B+Tree based hash index storage

2022-08-31 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17339:
--

Assignee: Ivan Bessonov  (was: Kirill Tkalenko)

> Implement B+Tree based hash index storage
> -
>
> Key: IGNITE-17339
> URL: https://issues.apache.org/jira/browse/IGNITE-17339
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> Please refer to IGNITE-17320 and issues from the epic for the gist. It's 
> basically the same thing, but with hash slapped inside of the tree pages and 
> a simplified comparison algorithm.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17196) Implement in-memory raft group reconfiguration on node failure

2022-08-30 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597731#comment-17597731
 ] 

Ivan Bessonov commented on IGNITE-17196:


Looks good to me!

> Implement in-memory raft group reconfiguration on node failure
> --
>
> Key: IGNITE-17196
> URL: https://issues.apache.org/jira/browse/IGNITE-17196
> Project: Ignite
>  Issue Type: Improvement
>  Components: persistence
>Reporter: Roman Puchkovskiy
>Assignee: Semyon Danilov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 12h 20m
>  Remaining Estimate: 0h
>
> We need to implement design described in IGNITE-16668



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17535) Implementing a hash index B+Tree

2022-08-30 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17535:
--

Assignee: Ivan Bessonov  (was: Kirill Tkalenko)

> Implementing a hash index B+Tree
> 
>
> Key: IGNITE-17535
> URL: https://issues.apache.org/jira/browse/IGNITE-17535
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Tkalenko
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It is necessary to implement a hash index B+Tree - for simplicity, without 
> inlining a *BinaryTuple*, but simply storing a link to it.
> The key will be: the hash and the link of the *BinaryTuple*.
> The value will be: *RowId*.
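
A sketch of the tree item this implies (class and field names are assumptions):
{code:java}
// Hypothetical B+Tree item for the hash index: no BinaryTuple inlining,
// only a link to the tuple is stored.
class HashIndexRow {
    int hash;    // hash of the indexed BinaryTuple (key, part 1)
    long link;   // link to the stored BinaryTuple (key, part 2)
    RowId rowId; // the value
}{code}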



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17577) Refactor MVPartitionStorage together with corresponding implementation in order to use HybridTimestamp instead of Timestamp

2022-08-26 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17577:
---
Epic Link: IGNITE-16923

> Refactor MVPartitionStorage together with corresponding implementation in 
> order to use HybridTimestamp instead of Timestamp
> ---
>
> Key: IGNITE-17577
> URL: https://issues.apache.org/jira/browse/IGNITE-17577
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> As a starting point in order to satisfy tx protocol it's required to rework 
> MVPartitionStorage and corresponding implementation with HybridTimestamp 
> instead of Timestamp.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17577) Refactor MVPartitionStorage together with corresponding implementation in order to use HybridTimestamp instead of Timestamp

2022-08-26 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17577:
--

Assignee: Ivan Bessonov

> Refactor MVPartitionStorage together with corresponding implementation in 
> order to use HybridTimestamp instead of Timestamp
> ---
>
> Key: IGNITE-17577
> URL: https://issues.apache.org/jira/browse/IGNITE-17577
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> As a starting point in order to satisfy tx protocol it's required to rework 
> MVPartitionStorage and corresponding implementation with HybridTimestamp 
> instead of Timestamp.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17481) Ignite shutdown sequence throws a ClassCastException from inside GridManagerAdapter on latest Java 11.0.16 and 17.0.4 point releases

2022-08-25 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17481:
--

Assignee: Ivan Bessonov

> Ignite shutdown sequence throws a ClassCastException from inside 
> GridManagerAdapter on latest Java 11.0.16 and 17.0.4 point releases
> 
>
> Key: IGNITE-17481
> URL: https://issues.apache.org/jira/browse/IGNITE-17481
> Project: Ignite
>  Issue Type: Bug
>  Components: general
>Affects Versions: 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10, 
> 2.12, 2.13
>Reporter: Paolo de Dios
>Assignee: Ivan Bessonov
>Priority: Major
> Fix For: 2.15
>
>
>  
> When ClassLoaders are undeployed, the 
> `GridDeploymentStoreAdapter.clearSerializationCache()` method attempts to 
> clear serialization caches to avoid PermGen memory leaks. The implementation 
> of this optimization seems to no longer work, as the underlying JVM 
> implementation of `java.io.ObjectInputStream$Caches` and 
> `java.io.ObjectOutputStream$Caches` no longer maintains a private cache of 
> subclass security audit results as a java.util.Map, which Ignite expects 
> inside 
> `[GridDeploymentStoreAdapter.clearSerializationCache()|https://github.com/apache/ignite/blob/da8a6bb4756c998aa99494d395752be96d841ec8/modules/core/src/main/java/org/apache/ignite/internal/managers/deployment/GridDeploymentStoreAdapter.java#L151]`.
>  
> *Stacktrace*
>  
> {code:java}
> [INFO ] 2022-08-06T20:28:04,778+ T=[vert.x-eventloop-thread-4] 
> L=[GridDeploymentLocalStore] - Removed undeployed class: GridDeployment 
> [ts=1659817673460, depMode=SHARED, 
> clsLdr=jdk.internal.loader.ClassLoaders$AppClassLoader@277050dc, 
> clsLdrId=b2497d47281-7ff6d972-ec5d-4d9c-bc60-95463b5e10b6, userVer=0, 
> loc=true, 
> sampleClsName=org.apache.ignite.internal.processors.cache.distributed.dht.preloader.IgniteDhtPartitionHistorySuppliersMap,
>  pendingUndeploy=false, undeployed=true, usage=0]
> [ERROR] 2022-08-06T20:28:04,778+ T=[vert.x-eventloop-thread-4] L=[local] 
> - Failed to stop component (ignoring): GridManagerAdapter [enabled=true, 
> name=o.a.i.i.managers.deployment.GridDeploymentManager]
> java.lang.ClassCastException: class java.io.ObjectInputStream$Caches$1 cannot 
> be cast to class java.util.Map (java.io.ObjectInputStream$Caches$1 and 
> java.util.Map are in module java.base of loader 'bootstrap')
>         at 
> org.apache.ignite.internal.managers.deployment.GridDeploymentStoreAdapter.clearSerializationCache(GridDeploymentStoreAdapter.java:151)
>  ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at 
> org.apache.ignite.internal.managers.deployment.GridDeploymentStoreAdapter.clearSerializationCaches(GridDeploymentStoreAdapter.java:120)
>  ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at 
> org.apache.ignite.internal.managers.deployment.GridDeploymentLocalStore.undeploy(GridDeploymentLocalStore.java:565)
>  ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at 
> org.apache.ignite.internal.managers.deployment.GridDeploymentLocalStore.stop(GridDeploymentLocalStore.java:101)
>  ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at 
> org.apache.ignite.internal.managers.deployment.GridDeploymentManager.storesStop(GridDeploymentManager.java:630)
>  ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at 
> org.apache.ignite.internal.managers.deployment.GridDeploymentManager.stop(GridDeploymentManager.java:137)
>  ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at 
> org.apache.ignite.internal.IgniteKernal.stop0(IgniteKernal.java:1928) 
> ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at 
> org.apache.ignite.internal.IgniteKernal.stop(IgniteKernal.java:1806) 
> ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop0(IgnitionEx.java:2382)
>  ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at 
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop(IgnitionEx.java:2205)
>  ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at org.apache.ignite.internal.IgnitionEx.stop(IgnitionEx.java:350) 
> ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at org.apache.ignite.Ignition.stop(Ignition.java:230) 
> ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at 
> io.appliedtheory.disco.services.IgniteClusterBootstrap.stop(IgniteClusterBootstrap.java:1148)
>  ~[proof-web-gateway-1.0.0+4dbe618b49758c4e-aio.jar:1.0.0]
>         at 
> io.appliedtheory.disco.services.IgniteClusterService.doStop(IgniteClusterService

[jira] [Updated] (IGNITE-17306) Speedup runtime classes compilation speed for configuration

2022-08-24 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17306:
---
Labels: iep-55 ignite-3  (was: ignite-3)

> Speedup runtime classes compilation speed for configuration
> ---
>
> Key: IGNITE-17306
> URL: https://issues.apache.org/jira/browse/IGNITE-17306
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: iep-55, ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are a few places in presto that are too slow; we can easily optimize 
> them.
> (Nothing will be committed if there's no visible difference in tests duration)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17306) Speedup runtime classes compilation speed for configuration

2022-08-24 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17306:
---
Epic Link: IGNITE-14904

> Speedup runtime classes compilation speed for configuration
> ---
>
> Key: IGNITE-17306
> URL: https://issues.apache.org/jira/browse/IGNITE-17306
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are a few places in presto that are too slow; we can easily optimize 
> them.
> (Nothing will be committed if there's no visible difference in tests duration)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17338) Implement RocksDB based hash index storage

2022-08-24 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17338:
---
Ignite Flags:   (was: Docs Required,Release Notes Required)

> Implement RocksDB based hash index storage
> --
>
> Key: IGNITE-17338
> URL: https://issues.apache.org/jira/browse/IGNITE-17338
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Please see IGNITE-17318 for partial description of what needs to be achieved.
> I expect that hash index records will have the following structure:
> {code:java}
> [ indexId | partitionId | hash | tuple | rowId ] -> []{code}
> Fixed-length prefix should cover indexId, partitionId and hash value.
> Searching rows effectively becomes a scan, but this is fine.
> Hashing must be performed internally; a hash function is already present 
> somewhere in the code.
> As far as I understand, PK is going to be implemented as a secondary hash 
> index.
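
For illustration, composing such a key might look like this (a sketch under 
assumptions: 4-byte fields, and the {{hash()}} helper and {{EMPTY_VALUE}} are 
hypothetical):
{code:java}
// Hypothetical key layout: [ indexId | partitionId | hash | tuple | rowId ] -> [].
// The fixed-length prefix (indexId, partitionId, hash) is what prefix scans use.
ByteBuffer key = ByteBuffer.allocate(4 + 4 + 4 + tupleBytes.length + rowIdBytes.length)
        .putInt(indexId)
        .putInt(partitionId)
        .putInt(hash(tupleBytes)) // end of the fixed-length scan prefix
        .put(tupleBytes)
        .put(rowIdBytes);

db.put(key.array(), EMPTY_VALUE); // the value part is intentionally empty{code}
Searching rows then effectively becomes a scan over that prefix, as the 
description notes.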



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17571) Implement GC for MV storages

2022-08-23 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17571:
--

 Summary: Implement GC for MV storages
 Key: IGNITE-17571
 URL: https://issues.apache.org/jira/browse/IGNITE-17571
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


h3. Basics

Current implementation can only work with infinite storage space. This is 
because the storage works in append-only mode (except for transaction 
rollbacks).

There's a value in the system called "low watermark". It's guaranteed that no 
new RO transaction will be started at a time earlier than the LW. The existence 
of such a value allows us to safely delete all versions before it. We will not 
discuss the mechanics of acquiring this value, assuming that it is simply given.
h3. API

Original intended design looks like following:

 
{code:java}
cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive, 
@Nullable byte[] upper,  boolean toInclusive, Timestamp timestamp){code}
Now, I don't think that this is necessarily a good design. The main problem 
with it is the existence of bounds:
 * First of all, why not just have an inclusive lower bound and an exclusive 
upper bound, like almost all methods with bounds in existence.
 * Second, I believe that this API has been proposed under the assumption that 
row ids have a uniform distribution and every "range" cleanup results in a 
somewhat equal amount of work. This is simply not true in the current 
implementation.
RowId is a timestamp-based value that exists in a very narrow time slice, making 
most ranges empty and meaningless.
Then, the way they're stored is actually up to the storage. There are no 
restrictions on byte order when physically storing RowId objects.

Given that "cleanup" is a background process, a simple update of the low 
watermark value would be enough. The underlying machinery will do its job.
h3. Problems

There's one thing that worries me: indexes.

Because storages are unaware of table schemas, it's impossible to clean up 
indexes. This gets me thinking: the API should be flexible enough that index 
cleanup can be performed as an external operation over the storage. This will 
also reduce the amount of work we need to do for the implementation.

To be specific, it feels like the method would look like this:
{code:java}
RowId cleanup(Timestamp threshold, RowId startId, int numRows, 
BiConsumer indexCleaner);{code}
Explanation is required.
 * timestamp represents the same thing as before - the low watermark.
 * startId - the first row to iterate over in the current batch.
 * numRows - the number of rows that should be cleaned in the current batch. By 
this I mean individual versions, not chains.
 * the cleaner closure must be used to clean indexes after every individual 
version removal. Right now it doesn't look optimal to me, but I struggle to 
find a good solution for efficient index cleanup.
 * the next rowId is returned, which should be used as startId in the next 
call. "null" means cleanup is complete; in this case it can be started from the 
beginning or simply postponed until a new low watermark value is available.

How to operate it (a driver sketch follows this list):
 * numRows has two strategic purposes:
 ** to control the size of batches in "runConsistently" closures.
 ** to be able to run many cleanups in parallel while avoiding pool starvation: 
every task is split into small chunks.
 * cleanup should start from the smallest possible row id. Unfortunately, we 
can't just use (0L, 0L), that's wrong. Maybe we should add something like 
"smallestRowId()" to the storage engine.
 * the low watermark value can be changed in-between calls. This is fine and 
even preferable.
h3. Random notes

The passed closure should only be executed when the specific version is already 
deleted. Otherwise the data in the row store would still match the indexes and 
the indexes would not be cleaned, which is wrong. There must be a test for it, 
of course.

Unsurprisingly, integration of this method into the RAFT state machine is out 
of scope. I don't think that LW propagation is implemented at this point. I may 
be wrong. Anyway, we'd better discuss it with the folks who are implementing 
transactions right now.

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17571) Implement GC for MV storages

2022-08-23 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17571:
---
Description: 
h3. Basics

Current implementation can only work with infinite storage space. This is 
because the storage works in append-only mode (except for transaction 
rollbacks).

There's a value in the system called "low watermark". It's guaranteed that no 
new RO transaction will be started at a time earlier than the LW. The existence 
of such a value allows us to safely delete all versions before it. We will not 
discuss the mechanics of acquiring this value, assuming that it is simply given.
h3. API

Original intended design looks like following:
{code:java}
cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive, 
@Nullable byte[] upper,  boolean toInclusive, Timestamp timestamp){code}
Now, I don't think that this is necessarily a good design. The main problem 
with it is the existence of bounds:
 * First of all, why not just have an inclusive lower bound and an exclusive 
upper bound, like almost all methods with bounds in existence.
 * Second, I believe that this API has been proposed under the assumption that 
row ids have a uniform distribution and every "range" cleanup results in a 
somewhat equal amount of work. This is simply not true in the current 
implementation.
RowId is a timestamp-based value that exists in a very narrow time slice, making 
most ranges empty and meaningless.
Then, the way they're stored is actually up to the storage. There are no 
restrictions on byte order when physically storing RowId objects.

Given that "cleanup" is a background process, a simple update of the low 
watermark value would be enough. The underlying machinery will do its job.
h3. Problems

There's one thing that worries me: indexes.

Because storages are unaware of table schemas, it's impossible to clean up 
indexes. This gets me thinking: the API should be flexible enough that index 
cleanup can be performed as an external operation over the storage. This will 
also reduce the amount of work we need to do for the implementation.

To be specific, it feels like the method would look like this:
{code:java}
RowId cleanup(Timestamp threshold, RowId startId, int numRows, 
BiConsumer indexCleaner);{code}
Explanation is required.
 * timestamp represents the same thing as before - the low watermark.
 * startId - the first row to iterate over in the current batch.
 * numRows - the number of rows that should be cleaned in the current batch. By 
this I mean individual versions, not chains.
 * the cleaner closure must be used to clean indexes after every individual 
version removal. Right now it doesn't look optimal to me, but I struggle to 
find a good solution for efficient index cleanup.
 * the next rowId is returned, which should be used as startId in the next 
call. "null" means cleanup is complete; in this case it can be started from the 
beginning or simply postponed until a new low watermark value is available.

How to operate it:
 * numRows has two strategic purposes:
 ** to control the size of batches in "runConsistently" closures.
 ** to be able to run many cleanups in parallel while avoiding pool starvation: 
every task is split into small chunks.
 * cleanup should start from the smallest possible row id. Unfortunately, we 
can't just use (0L, 0L), that's wrong. Maybe we should add something like 
"smallestRowId()" to the storage engine.
 * the low watermark value can be changed in-between calls. This is fine and 
even preferable.

h3. Random notes

The passed closure should only be executed when the specific version is already 
deleted. Otherwise the data in the row store would still match the indexes and 
the indexes would not be cleaned, which is wrong. There must be a test for it, 
of course.

Unsurprisingly, integration of this method into the RAFT state machine is out 
of scope. I don't think that LW propagation is implemented at this point. I may 
be wrong. Anyway, we'd better discuss it with the folks who are implementing 
transactions right now.

 

 

 

  was:
h3. Basics

Current implementation can only work with infinite storage space. This is 
because the storage works in append-only mode (except for transactions 
rollbacks).

There's a value in the system called "low watermark". It's guaranteed, that no 
new RO transactions will be started at a time earlier than the LW. Existence of 
such value allows us to safely delete all versions before it. We will not 
discuss the mechanics of acquiring such value, assuming that it is simply given.
h3. API

Original intended design looks like following:

 
{code:java}
cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive, 
@Nullable byte[] upper,  boolean toInclusive, Timestamp timestamp){code}
Now, I don't think that this is necessarily a good design. Main problem with it 
is the existence of bounds:
 * First of all, why not just have inclusive lower bound and exclusive upper 
bound, like almost all methods with bounds in existence.

[jira] [Resolved] (IGNITE-17466) Remove TableStorage and PartitionStorage implementations

2022-08-16 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-17466.

Resolution: Fixed

> Remove TableStorage and PartitionStorage implementations
> 
>
> Key: IGNITE-17466
> URL: https://issues.apache.org/jira/browse/IGNITE-17466
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Tkalenko
>Assignee: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> All implementations of *TableStorage* and *PartitionStorage* should be 
> removed, as well as the code associated with them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-17532) RocksDB partition destruction

2022-08-16 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-17532.

Resolution: Fixed

> RocksDB partition destruction
> -
>
> Key: IGNITE-17532
> URL: https://issues.apache.org/jira/browse/IGNITE-17532
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Current implementation has WAL disabled. This means that a deleted partition 
> can resurrect by itself after restart and no one will delete it.
> Partition should be treated as deleted only when it's deleted on disk. For 
> this reason the "destroyPartition" method should return a future.
> EDIT: current implementation does return a future, but with explicit flush, 
> which is technically unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17532) RocksDB partition destruction

2022-08-16 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17532:
---
Description: 
Current implementation has WAL disabled. This means that a deleted partition 
can resurrect by itself after restart and no one will delete it.

Partition should be treated as deleted only when it's deleted on disk. For this 
reason the "destroyPartition" method should return a future.

EDIT: current implementation does return a future, but with explicit flush, 
which is technically unnecessary.

  was:
Current implementation has WAL disabled. This means that a deleted partition 
can resurrect by itself after restart and no one will delete it.

Partition should be treated as deleted only when it's deleted on disk. For this 
reason the "destroyPartition" method should return a future.


> RocksDB partition destruction
> -
>
> Key: IGNITE-17532
> URL: https://issues.apache.org/jira/browse/IGNITE-17532
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
> Fix For: 3.0.0-alpha6
>
>
> Current implementation has WAL disabled. This means that a deleted partition 
> can resurrect by itself after restart and no one will delete it.
> Partition should be treated as deleted only when it's deleted on disk. For 
> this reason the "destroyPartition" method should return a future.
> EDIT: current implementation does return a future, but with explicit flush, 
> which is technically unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17532) RocksDB partition destruction

2022-08-16 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17532:
--

 Summary: RocksDB partition destruction
 Key: IGNITE-17532
 URL: https://issues.apache.org/jira/browse/IGNITE-17532
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov
 Fix For: 3.0.0-alpha6


Current implementation has WAL disabled. This means that a deleted partition 
can resurrect by itself after restart and no one will delete it.

Partition should be treated as deleted only when it's deleted on disk. For this 
reason the "destroyPartition" method should return a future.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17224) Support eviction for volatile (in-memory) data region

2022-08-12 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17224:
--

Assignee: (was: Ivan Bessonov)

> Support eviction for volatile (in-memory) data region
> -
>
> Key: IGNITE-17224
> URL: https://issues.apache.org/jira/browse/IGNITE-17224
> Project: Ignite
>  Issue Type: Bug
>Reporter: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> I found that the volatile (in-memory) data region contains a configuration 
> for eviction, but does not implement it. It needs to be implemented by 
> analogy with 2.0, with tests written for it. We also need to consider 
> validation for the eviction-related configuration.
> See in 3.0:
> * 
> *org.apache.ignite.internal.pagememory.configuration.schema.VolatilePageMemoryDataRegionConfigurationSchema*
> * *org.apache.ignite.internal.storage.pagememory.VolatilePageMemoryDataRegion*
> * *org.apache.ignite.internal.pagememory.inmemory.VolatilePageMemory*
> See in 2.0:
> * 
> *org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager#ensureFreeSpace*
> * 
> *org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager#checkRegionEvictionProperties*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17224) Support eviction for volatile (in-memory) data region

2022-08-12 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17224:
--

Assignee: Ivan Bessonov

> Support eviction for volatile (in-memory) data region
> -
>
> Key: IGNITE-17224
> URL: https://issues.apache.org/jira/browse/IGNITE-17224
> Project: Ignite
>  Issue Type: Bug
>Reporter: Kirill Tkalenko
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> I found that the volatile (in-memory) data region contains a configuration 
> for eviction, but does not implement it. It needs to be implemented by 
> analogy with 2.0, with tests written for it. We also need to consider 
> validation for the eviction-related configuration.
> See in 3.0:
> * 
> *org.apache.ignite.internal.pagememory.configuration.schema.VolatilePageMemoryDataRegionConfigurationSchema*
> * *org.apache.ignite.internal.storage.pagememory.VolatilePageMemoryDataRegion*
> * *org.apache.ignite.internal.pagememory.inmemory.VolatilePageMemory*
> See in 2.0:
> * 
> *org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager#ensureFreeSpace*
> * 
> *org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager#checkRegionEvictionProperties*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17076) Unify RowId format for different storages

2022-08-09 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17076:
---
Description: 
Current MV store bridge API has a fatal flaw, born from a misunderstanding. 
There's a method called "insert" that generates a RowId by itself. This is 
wrong, because it can lead to different ids for the same row on the replica 
storage. This completely breaks everything.

Every replicated write command that inserts a new value should produce the same 
row ids. There are several ways to achieve this:
 * Use timestamps as identifiers. This is not very convenient, because we would 
have to attach the partition id on top of it. It's mandatory to know the 
partition of the row.
 * Use a more complicated structure, for example a tuple of (raftCommitIndex, 
partitionId, batchCounter), where
 ** raftCommitIndex is the index of the write command that performs the 
insertion.
 ** partitionId is an integer identifier of the partition. Could be 4 bytes, 
considering that there are plans to support more than 65000 partitions per 
table.
 ** batchCounter is used to differentiate insertions made in a single write 
command. We can limit it to 2 bytes to save a little bit of space, if necessary.

I prefer the second option, but maybe it could be revised during the 
implementation.

Of course, the "insert" method should be removed from the bridge API. Tests 
have to be updated. Given the lack of a RAFT group in storage tests, we can 
generate row ids artificially; it's not a big deal.

EDIT: the second option makes it difficult to use row ids in the action request 
processor in cases when data is inserted. So, hybrid clock + partition id is a 
better option.

EDIT 2: removing the "insert" method from the API is out of scope for now.
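
For illustration, the "hybrid clock + partition id" option could shape RowId 
roughly like this (a sketch; the exact field set, including the extra counter, 
is an assumption):
{code:java}
// Hypothetical RowId: identical on every replica because it is derived from
// replicated inputs rather than generated locally by "insert".
class RowId {
    final int partitionId;      // partition the row belongs to
    final long hybridTimestamp; // hybrid clock value at insertion
    final short counter;        // disambiguates rows created at the same clock tick

    RowId(int partitionId, long hybridTimestamp, short counter) {
        this.partitionId = partitionId;
        this.hybridTimestamp = hybridTimestamp;
        this.counter = counter;
    }
}{code}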

  was:
Current MV store bridge API has a fatal flaw, born from a misunderstanding. 
There's a method called "insert" that generates RowId by itself. This is wrong, 
because it can lead to different id for the same row on the replica storage. 
This completely breaks everything.

Every replicated write command that inserts a new value should produce the same 
row ids on every replica. There are several ways to achieve this:
 * Use timestamps as identifiers. This is not very convenient, because we would 
have to attach the partition id on top of it. It's mandatory to know the 
partition of the row.
 * Use a more complicated structure, for example a tuple of (raftCommitIndex, 
partitionId, batchCounter), where
 ** raftCommitIndex is the index of the write command that performs the 
insertion.
 ** partitionId is an integer identifier of the partition. Could be 4 bytes, 
considering that there are plans to support more than 65000 partitions per 
table.
 ** batchCounter is used to differentiate insertions made in a single write 
command. We can limit it to 2 bytes to save a little bit of space, if it's 
necessary.

I prefer the second option, but maybe it could be revised during the 
implementation.

Of course, the method "insert" should be removed from the bridge API. Tests have 
to be updated. With the lack of a RAFT group in storage tests, we can generate 
row ids artificially, it's not a big deal.

EDIT: the second option makes it difficult to use row ids in the action request 
processor in cases when data is inserted. So, hybrid clock + partition id is a 
better option.


> Unify RowId format for different storages
> -
>
> Key: IGNITE-17076
> URL: https://issues.apache.org/jira/browse/IGNITE-17076
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Current MV store bridge API has a fatal flaw, born from a misunderstanding. 
> There's a method called "insert" that generates RowId by itself. This is 
> wrong, because it can lead to different ids for the same row on replica 
> storages. This completely breaks everything.
> Every replicated write command that inserts a new value should produce the 
> same row ids on every replica. There are several ways to achieve this:
>  * Use timestamps as identifiers. This is not very convenient, because we 
> would have to attach the partition id on top of it. It's mandatory to know 
> the partition of the row.
>  * Use a more complicated structure, for example a tuple of (raftCommitIndex, 
> partitionId, batchCounter), where
>  ** raftCommitIndex is the index of the write command that performs the 
> insertion.
>  ** partitionId is an integer identifier of the partition. Could be 4 bytes, 
> considering that there are plans to support more than 65000 partitions per 
> table.
>  ** batchCounter is used to differentiate insertions made in a single write 
> command. We can limit it to 2 bytes to save a little bit of space, if it's 
> necessary.
> I prefer the second option, but maybe it could be revised during the 
> implementation.

[jira] [Assigned] (IGNITE-17076) Unify RowId format for different storages

2022-08-09 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17076:
--

Assignee: Ivan Bessonov

> Unify RowId format for different storages
> -
>
> Key: IGNITE-17076
> URL: https://issues.apache.org/jira/browse/IGNITE-17076
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Current MV store bridge API has a fatal flaw, born from a misunderstanding. 
> There's a method called "insert" that generates RowId by itself. This is 
> wrong, because it can lead to different ids for the same row on replica 
> storages. This completely breaks everything.
> Every replicated write command that inserts a new value should produce the 
> same row ids on every replica. There are several ways to achieve this:
>  * Use timestamps as identifiers. This is not very convenient, because we 
> would have to attach the partition id on top of it. It's mandatory to know 
> the partition of the row.
>  * Use a more complicated structure, for example a tuple of (raftCommitIndex, 
> partitionId, batchCounter), where
>  ** raftCommitIndex is the index of the write command that performs the 
> insertion.
>  ** partitionId is an integer identifier of the partition. Could be 4 bytes, 
> considering that there are plans to support more than 65000 partitions per 
> table.
>  ** batchCounter is used to differentiate insertions made in a single write 
> command. We can limit it to 2 bytes to save a little bit of space, if it's 
> necessary.
> I prefer the second option, but maybe it could be revised during the 
> implementation.
> Of course, the method "insert" should be removed from the bridge API. Tests 
> have to be updated. With the lack of a RAFT group in storage tests, we can 
> generate row ids artificially, it's not a big deal.
> EDIT: the second option makes it difficult to use row ids in the action 
> request processor in cases when data is inserted. So, hybrid clock + 
> partition id is a better option.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17077) Implement checkpointIndex for PDS

2022-08-03 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17077:
--

Assignee: Ivan Bessonov

> Implement checkpointIndex for PDS
> -
>
> Key: IGNITE-17077
> URL: https://issues.apache.org/jira/browse/IGNITE-17077
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
> prerequisites.
> h2. General idea
> The idea doesn't seem complicated. There will be "setUpdateIndex" and 
> "getUpdateIndex" methods (names might be different).
>  * The first one is invoked at the end of every write command, with the RAFT 
> commit index being passed as a parameter. This is done right before releasing 
> the checkpoint read lock (or whatever name we will come up with). More on 
> that later.
>  * The second one is invoked at the beginning of every write command to 
> validate that updates don't come out of order or with gaps. This is the way 
> to guarantee that IndexMismatchException can be thrown at the right time.
> So, the write command flow will look like this. All names here are completely 
> random.
>  
> {code:java}
> try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
> long updateIndex = partition.getUpdateIndex();
> long raftIndex = writeCommand.raftIndex();
> if (raftIndex != updateIndex + 1) {
> throw new IndexMismatchException(updateIndex);
> }
> partition.write(writeCommand.row());
> for (Index index : table.indexes(partition)) {
> index.index(writeCommand.row());
> }
> partition.setUpdateIndex(raftIndex);
> }{code}
>  
> Some nuances:
>  * The mismatch exception must be thrown before any data modifications. 
> Storage content must be intact, otherwise we'll just break it.
>  * The case above is the simplest one - there's a single "atomic" storage 
> update. Generally speaking, we can't or sometimes don't want to work this 
> way. Examples of operations where such strict atomicity is not required:
>  ** Batch insert/update from the transaction.
>  ** Transaction commit might have a huge number of row ids, we can exhaust 
> the memory while committing.
>  * If we split a write operation into several operations, we should 
> externally guarantee their idempotence. "setUpdateIndex" should be at the end 
> of the last "atomic" operation, so that the last command can be safely 
> reapplied.
> h2. Implementation
> "set" method could write a value directly into partitions meta page. This 
> *will* work. But it's not quite optimal.
> Optimal solution is tightly coupled with the way checkpoint should work. This 
> may not be the right place to describe the issue, but I do it nonetheless. 
> It'll probably get split into another issue one day.
> There's a simple way to touch every meta page only once per checkpoint. We 
> just do it while holding the checkpoint write lock. This way data is 
> consistent. But this solution is equally *bad*, it forces us to perform page 
> manipulation under the write lock. Flushing freelists is enough already. 
> (NOTE: we should test the performance without onheap-cache, it'll speed up 
> the checkpoint start process, thus reducing latency spikes)
> A better way to do this is not having meta pages in page memory whatsoever. 
> Maybe during the start, but that's it. It's a common practice to have a 
> pageSize equal to 16KB. The effective payload of a partition meta page in 
> Ignite 2.x is just above 100 bytes. I expect it to be way lower in Ignite 
> 3.0. Having a loaded page for every partition is just a waste of resources; 
> all required data can be stored on-heap.
> Then, let's rely on two simple facts:
>  * If meta page data is cached on-heap, no one would need to read it from 
> disk. I should also mention that it will mostly be immutable.
>  * We can write the partition meta page into every delta file even if the 
> meta has not changed. In actuality, this will be a very rare situation.
> Considering both of these facts, the checkpointer may unconditionally write 
> the meta page from heap to disk at the beginning of writing the delta file. 
> This page will become a write-only page, which is basically what we need. 
> h2. Callbacks and RAFT snapshots
> I argue against scheduled RAFT snapshots. They will produce a lot of junk 
> checkpoints. This is because a checkpoint is a *global operation*. Imagine 
> RAFT triggering snapshots for 100 partitions in a row. This will result in 
> 100 minuscule checkpoints, no one needs it. So, I'd say, we need two 
> operations:
>  * partition.getCheckpointerUpdateIndex();
>  * partition.registerCheckpointedUpdateIndexListener(closure);
> Both of these methods could be used by RAFT to determine whether it needs to 
> take a snapshot.
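> A minimal sketch of how RAFT could consume these two operations (all names 
> are as random as the ones above):
> {code:java}
> import java.util.function.LongConsumer;
>
> // Hypothetical interfaces mirroring the two operations described above.
> interface Partition {
>     long getCheckpointerUpdateIndex();
>
>     void registerCheckpointedUpdateIndexListener(LongConsumer listener);
> }
>
> interface RaftLog {
>     void truncateUpTo(long index);
> }
>
> class SnapshotHandler {
>     void onSnapshotRequest(Partition partition, RaftLog raftLog) {
>         // No forced checkpoint: reuse whatever is already persisted.
>         raftLog.truncateUpTo(partition.getCheckpointerUpdateIndex());
>
>         // Truncate lazily as future checkpoints complete.
>         partition.registerCheckpointedUpdateIndexListener(raftLog::truncateUpTo);
>     }
> }
> {code}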

[jira] [Commented] (IGNITE-17082) Validate hocon output in REST endpoints

2022-08-01 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573794#comment-17573794
 ] 

Ivan Bessonov commented on IGNITE-17082:


Looks good to me!

> Validate hocon output in REST endpoints
> ---
>
> Key: IGNITE-17082
> URL: https://issues.apache.org/jira/browse/IGNITE-17082
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr
>Assignee: Aleksandr
>Priority: Major
>  Labels: ignite-3, rest
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now all endpoints return text/plain. That looks strange when the endpoint 
> returns a hocon string. It's not just _any_ plain text. Here 
> [https://www.w3schools.io/file/hocon-introduction/] they suggest 
> {{application/hocon}}; it does not seem to be registered with IANA, but 
> it looks logical.
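> A hedged illustration of the suggested media type (assuming a Micronaut-style 
> controller; the endpoint path and class name are made up):
> {code:java}
> import io.micronaut.http.annotation.Controller;
> import io.micronaut.http.annotation.Get;
> import io.micronaut.http.annotation.Produces;
>
> @Controller("/management/v1/configuration")
> class ConfigurationController {
>     /** Declares the suggested media type instead of plain text. */
>     @Get
>     @Produces("application/hocon")
>     String getConfiguration() {
>         return "{ network { port = 3344 } }"; // HOCON rendered from the configuration tree.
>     }
> }
> {code}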



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-14986) Re-work error handling in meta storage component in accordance with error groups

2022-08-01 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573692#comment-17573692
 ] 

Ivan Bessonov commented on IGNITE-14986:


Looks good to me, thank you!

> Re-work error handling in meta storage component in accordance with error 
> groups
> 
>
> Key: IGNITE-14986
> URL: https://issues.apache.org/jira/browse/IGNITE-14986
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Vyacheslav Koptilin
>Assignee: Vyacheslav Koptilin
>Priority: Major
>  Labels: iep-84, ignite-3
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Need to introduce a new error group related to Meta Storage Service and add 
> all needed error codes.
> Also, the implementation should use _IgniteInternalException_ and 
> _IgniteInternalCheckedException_ with specific error codes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-17445) RocksDbKeyValueStorage recreates DB on start, so data can't be found until Raft log is replayed

2022-08-01 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573666#comment-17573666
 ] 

Ivan Bessonov edited comment on IGNITE-17445 at 8/1/22 9:56 AM:


This bug is pretty stable in IGNITE-17081, flaky rate of 
ItIgniteNodeRestartTest#nodeWithDataTest goes through the roof on the warmed-up 
JVM.

The reason is that, presumably, node startup (partitions recovery) became 
faster in that branch.


was (Author: ibessonov):
This bug is pretty stable in IGNITE-17081, flaky rate of 
ItIgniteNodeRestartTest#nodeWithDataTest goes through the roof on the warmed-up 
JVM.

The reason is that node startup (partitions recovery) became faster in that 
branch.

> RocksDbKeyValueStorage recreates DB on start, so data can't be found until 
> Raft log is replayed
> ---
>
> Key: IGNITE-17445
> URL: https://issues.apache.org/jira/browse/IGNITE-17445
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>
> RocksDbKeyValueStorage recreates DB on start. This means that entries that 
> were put to this storage earlier may or may not be found until the raft log 
> is replayed, i.e. the behavior is undefined. For example, this can cause an 
> assertion error on node recovery:
> {code:java}
> java.lang.AssertionError: Configuration revision must be greater than local 
> node applied revision [msRev=0, appliedRev=1]
> {code}
> which means that the applied revision in the vault is 1 but only 0 is found 
> in the meta storage, as the storage of the meta storage is being recreated.
> For now, the only thing that saves us from this assertion being thrown every 
> time is that operations related to node recovery, applied from the 
> distributed configuration (see IgniteImpl#notifyConfigurationListeners), take 
> some time and the raft log is small enough to be replayed faster than the 
> recovery operations are performed. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17445) RocksDbKeyValueStorage recreates DB on start, so data can't be found until Raft log is replayed

2022-08-01 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573666#comment-17573666
 ] 

Ivan Bessonov commented on IGNITE-17445:


This bug is pretty stable in IGNITE-17081, flaky rate of 
ItIgniteNodeRestartTest#nodeWithDataTest goes through the roof on the warmed-up 
JVM.

The reason is that node startup (partitions recovery) became faster in that 
branch.

> RocksDbKeyValueStorage recreates DB on start, so data can't be found until 
> Raft log is replayed
> ---
>
> Key: IGNITE-17445
> URL: https://issues.apache.org/jira/browse/IGNITE-17445
> Project: Ignite
>  Issue Type: Bug
>Reporter: Denis Chudov
>Priority: Major
>
> RocksDbKeyValueStorage recreates DB on start. This means that entries that 
> were put to this storage earlier may or may not be found until the raft log 
> is replayed, i.e. the behavior is undefined. For example, this can cause an 
> assertion error on node recovery:
> {code:java}
> java.lang.AssertionError: Configuration revision must be greater than local 
> node applied revision [msRev=0, appliedRev=1]
> {code}
> which means that the applied revision in the vault is 1 but only 0 is found 
> in the meta storage, as the storage of the meta storage is being recreated.
> For now, the only thing that saves us from this assertion being thrown every 
> time is that operations related to node recovery, applied from the 
> distributed configuration (see IgniteImpl#notifyConfigurationListeners), take 
> some time and the raft log is small enough to be replayed faster than the 
> recovery operations are performed. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17372) Implement DeltaFilePageStore

2022-07-28 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572327#comment-17572327
 ] 

Ivan Bessonov commented on IGNITE-17372:


Looks good to me. Thank you for the contribution!

> Implement DeltaFilePageStore
> 
>
> Key: IGNITE-17372
> URL: https://issues.apache.org/jira/browse/IGNITE-17372
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Kirill Tkalenko
>Assignee: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> For the new checkpoint, we need to implement *DeltaFilePageStore*.
> It will consist of:
> * A file header, similar to *FilePageStore*, but in addition it will store a 
> sorted list of the pageIdx values of the pages stored in the file;
> * The pages themselves, sorted by pageIdx.
> Some implementation notes:
> * Format of the file name for the *DeltaFilePageStore* is 
> *part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first 
> number is the partition identifier, and the second is the serial number of 
> the delta file for this partition;
> * Before creating *part-1-delta-3.bin*, a temporary file 
> *part-1-delta-3.bin.tmp* will be created at the checkpoint first, then 
> filled, then renamed to *part-1-delta-3.bin*;
> * In each delta file we will store the 
> *org.apache.ignite.internal.storage.pagememory.io.PartitionMetaIo*, which 
> will be the first page in this file, and it will be special.
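> A small sketch of the naming and rename flow described above (the helper 
> itself is hypothetical):
> {code:java}
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.StandardCopyOption;
>
> class DeltaFiles {
>     /** Creates "part-%d-delta-%d.bin" via a temporary ".tmp" file. */
>     static Path create(Path partitionDir, int partId, int deltaIdx) throws IOException {
>         String name = String.format("part-%d-delta-%d.bin", partId, deltaIdx);
>
>         Path tmp = Files.createFile(partitionDir.resolve(name + ".tmp"));
>
>         // ... fill the file: header, PartitionMetaIo page, pages sorted by pageIdx ...
>
>         // The rename makes the delta file visible only once it's fully written.
>         return Files.move(tmp, partitionDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
>     }
> }
> {code}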



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17372) Implement DeltaFilePageStore

2022-07-28 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17372:
---
Release Note:   (was: Looks good to me. Thank you for the contribution!)

> Implement DeltaFilePageStore
> 
>
> Key: IGNITE-17372
> URL: https://issues.apache.org/jira/browse/IGNITE-17372
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Kirill Tkalenko
>Assignee: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> For the new checkpoint, we need to implement *DeltaFilePageStore*.
> It will consist of:
> * A file header, similar to *FilePageStore*, but in addition it will store a 
> sorted list of the pageIdx values of the pages stored in the file;
> * The pages themselves, sorted by pageIdx.
> Some implementation notes:
> * Format of the file name for the *DeltaFilePageStore* is 
> *part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first 
> number is the partition identifier, and the second is the serial number of 
> the delta file for this partition;
> * Before creating *part-1-delta-3.bin*, a temporary file 
> *part-1-delta-3.bin.tmp* will be created at the checkpoint first, then 
> filled, then renamed to *part-1-delta-3.bin*;
> * In each delta file we will store the 
> *org.apache.ignite.internal.storage.pagememory.io.PartitionMetaIo*, which 
> will be the first page in this file, and it will be special.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17415) A node receives and resolves obsolete addresses from the previously restarted and killed nodes

2022-07-26 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17415:
---
Reviewer: Ivan Bessonov

> A node receives and resolves obsolete addresses from the previously restarted 
> and killed nodes
> --
>
> Key: IGNITE-17415
> URL: https://issues.apache.org/jira/browse/IGNITE-17415
> Project: Ignite
>  Issue Type: Bug
>  Components: networking
>Affects Versions: 2.13
>Reporter: Semyon Danilov
>Assignee: Semyon Danilov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Observation: Ignite tries to resolve *all* known IP and DNS names exposed by 
> the nodes since the cluster startup. This might cause a delay in response if 
> DNS resolution takes some time, which is critical for Kubernetes environments.
> The issue is due to Ignite keeping track of the topology history.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17415) A node receives and resolves obsolete addresses from the previously restarted and killed nodes

2022-07-26 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571375#comment-17571375
 ] 

Ivan Bessonov commented on IGNITE-17415:


Looks good to me, thank you!

> A node receives and resolves obsolete addresses from the previously restarted 
> and killed nodes
> --
>
> Key: IGNITE-17415
> URL: https://issues.apache.org/jira/browse/IGNITE-17415
> Project: Ignite
>  Issue Type: Bug
>  Components: networking
>Affects Versions: 2.13
>Reporter: Semyon Danilov
>Assignee: Semyon Danilov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Observation: Ignite tries to resolve *all* known IP and DNS names exposed by 
> the nodes since the cluster startup. This might cause a delay in response if 
> DNS resolution takes some time, which is critical for Kubernetes environments.
> The issue is due to Ignite keeping track of the topology history.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16265) Integration SQL Index and data storage

2022-07-26 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16265.

Resolution: Won't Fix

> Integration SQL Index and data storage
> --
>
> Key: IGNITE-16265
> URL: https://issues.apache.org/jira/browse/IGNITE-16265
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Yury Gerzhedovich
>Assignee: Konstantin Orlov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Need to think about the point of integration of data modification 
> (put/remove/amend) with updating data in SQL indexes. 
> As a first version of the integration, let's update indexes on commit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16202) Supports transactions by index

2022-07-26 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16202.

Resolution: Won't Fix

> Supports transactions by index
> --
>
> Key: IGNITE-16202
> URL: https://issues.apache.org/jira/browse/IGNITE-16202
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> Indexes must support the transaction protocol.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-14940) Investigation parallel index scan

2022-07-26 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-14940.

Resolution: Won't Fix

> Investigation parallel index scan
> -
>
> Key: IGNITE-14940
> URL: https://issues.apache.org/jira/browse/IGNITE-14940
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> Motivation: the 2.x version implements {{queryParallelism}} by creating index 
> segments. Each segment contains a subset of partitions. This approach has 
> several shortcomings:
> - index scan parallelism cannot be changed / scaled at runtime;
> - we always have to scan all segments (looks like a virtual MapNode for the 
> query);
> - many index storages for one logical index.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-14939) Tests coverage for index rebuild and recovery scenarios

2022-07-26 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-14939.

Resolution: Won't Fix

> Tests coverage for index rebuild and recovery scenarios
> ---
>
> Key: IGNITE-14939
> URL: https://issues.apache.org/jira/browse/IGNITE-14939
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> Test cases from version 2.x must be analyzed and ported to 3.0.
> See in 2.x {{AbstractRebuildIndexTest}} and its children.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14937) Index schema & Index management integration

2022-07-26 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14937:
---
Labels: ignite-3  (was: )

> Index schema & Index management integration
> ---
>
> Key: IGNITE-14937
> URL: https://issues.apache.org/jira/browse/IGNITE-14937
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The public index schema (required indexes) and the current index state on the 
> cluster are different.
> We have to track it, store it and provide the actual index schema state to 
> any component: SELECT queries, DDL queries, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-14936) Benchmark sorted index scan vs table's partitions scan

2022-07-26 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-14936.

Resolution: Won't Fix

> Benchmark sorted index scan vs table's partitions scan
> --
>
> Key: IGNITE-14936
> URL: https://issues.apache.org/jira/browse/IGNITE-14936
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> We have to decide which data structures are used for the PK and table scans.
> Possible cases:
> - table partitions sorted by plain bytes/hash (in fact: unsorted);
> - table partitions sorted by PK columns;
> - PK sorted index (one store for all partitions on the node).
> All cases have pros and cons. The choice should be based on benchmarks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14936) Benchmark sorted index scan vs table's partitions scan

2022-07-26 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14936:
---
Labels: ignite-3  (was: )

> Benchmark sorted index scan vs table's partitions scan
> --
>
> Key: IGNITE-14936
> URL: https://issues.apache.org/jira/browse/IGNITE-14936
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> We have to decide which data structures are used for the PK and table scans.
> Possible cases:
> - table partitions sorted by plain bytes/hash (in fact: unsorted);
> - table partitions sorted by PK columns;
> - PK sorted index (one store for all partitions on the node).
> All cases have pros and cons. The choice should be based on benchmarks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-07-26 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17081:
--

Assignee: Ivan Bessonov

> Implement checkpointIndex for RocksDB
> -
>
> Key: IGNITE-17081
> URL: https://issues.apache.org/jira/browse/IGNITE-17081
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
> prerequisites.
> Please also familiarize yourself with 
> https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding, 
> the description is continued from there.
> For RocksDB based storage the recovery process is trivial, because RocksDB 
> has its own WAL. So, for testing purposes, it would be enough to just store 
> the update index in the meta column family.
> Immediately we have a write amplification issue, on top of possible 
> performance degradation. The obvious solution is inherently bad and needs to 
> be improved.
> h2. General idea & implementation
> Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This kinda 
> breaks the RocksDB recovery procedure, so we need to take measures to avoid 
> it.
> The only feasible way to do so is to use DBOptions#setAtomicFlush in 
> conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
> all column families consistently, if you have batches that cover several CFs. 
> Basically, {{acquireConsistencyLock()}} would create a thread-local write 
> batch that's applied on lock release. Most of RocksDbMvPartitionStorage will 
> be affected by this change.
> NOTE: I believe that scans with unapplied batches should be prohibited for 
> now (gladly, there's a WriteBatchInterface#count() to check). I don't see any 
> practical value in it or a proper way of implementing it, considering how 
> spread-out in time the scan process is.
> h2. Callbacks and RAFT snapshots
> Simply storing and reading the update index is easy. Reading the committed 
> index is more challenging; I propose caching it and updating it only from the 
> closure, which can also be used by RAFT to truncate the log.
> For a closure, there are several things to account for during the 
> implementation:
>  * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
> ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
> atomic flush mode. And, once you have your first "completed" event, you have 
> a guarantee that *all* memtables are already persisted.
> This allows easy tracking of RocksDB flushes; monitoring event alternation is 
> all that's needed.
>  * Unlike the PDS implementation, here we will be writing the updateIndex 
> value into a memtable every time. This makes it harder to find persistedIndex 
> values for partitions. Gladly, considering the events that we have, during 
> the time between the first "completed" and the very next "begin", the state 
> on disk is fully consistent. And there's a way to read data from the storage 
> avoiding the memtable completely - ReadOptions#setReadTier(PERSISTED_TIER).
> Summarizing everything from the above, we should implement the following 
> protocol:
>  
> {code:java}
> During table start: read latest values of update indexes. Store them in an 
> in-memory structure.
> Set "lastEventType = ON_FLUSH_COMPLETED;".
> onFlushBegin:
>   if (lastEventType == ON_FLUSH_BEGIN)
> return;
>   waitForLastAsyncUpdateIndexesRead();
>   lastEventType = ON_FLUSH_BEGIN;
> onFlushCompleted:
>   if (lastEventType == ON_FLUSH_COMPLETED)
> return;
>   asyncReadUpdateIndexesFromDisk();
>   lastEventType = ON_FLUSH_COMPLETED;{code}
> Reading values from disk must be performed asynchronously to not stall the 
> flushing process. We don't control the locks that RocksDB holds while calling 
> the listener's methods.
> That asynchronous process would invoke closures that provide persisted 
> updateIndex values to other components.
> NOTE: One might say that we should call 
> "waitForLastAsyncUpdateIndexesRead();" as late as possible just in case. But 
> my implementation calls it during the first event. This is fine. I noticed 
> that column families are flushed in the order of their internal ids. These 
> ids correspond to the creation order of CFs, and the "default" CF is always 
> created first. This is the exact CF that we use to store meta. Maybe we're 
> going to change this and create a separate meta CF. Only then could we start 
> optimizing this part, and only if we have actual proof that there's a stall 
> in this exact place.
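> A sketch of this protocol on top of RocksJava's event listener API (the 
> listener is registered via DBOptions#setListeners; readUpdateIndexesFromDisk 
> is hypothetical and would use ReadOptions#setReadTier(PERSISTED_TIER)):
> {code:java}
> import java.util.concurrent.CompletableFuture;
> import org.rocksdb.AbstractEventListener;
> import org.rocksdb.FlushJobInfo;
> import org.rocksdb.RocksDB;
>
> class FlushListener extends AbstractEventListener {
>     private enum Event { BEGIN, COMPLETED }
>
>     private Event lastEvent = Event.COMPLETED;
>
>     private CompletableFuture<Void> lastRead = CompletableFuture.completedFuture(null);
>
>     @Override
>     public synchronized void onFlushBegin(RocksDB db, FlushJobInfo info) {
>         if (lastEvent == Event.BEGIN) {
>             return; // Atomic flush: a series of "begin" events counts once.
>         }
>
>         lastRead.join(); // Wait for the previous async read of update indexes.
>
>         lastEvent = Event.BEGIN;
>     }
>
>     @Override
>     public synchronized void onFlushCompleted(RocksDB db, FlushJobInfo info) {
>         if (lastEvent == Event.COMPLETED) {
>             return;
>         }
>
>         // Read persisted indexes asynchronously to avoid stalling the flush.
>         lastRead = CompletableFuture.runAsync(this::readUpdateIndexesFromDisk);
>
>         lastEvent = Event.COMPLETED;
>     }
>
>     private void readUpdateIndexesFromDisk() {
>         // Hypothetical: read the meta CF with ReadTier.PERSISTED_TIER and
>         // invoke the persisted-index closures for each partition.
>     }
> }
> {code}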
> h3. Types of storages
> RocksDB is used for:
>  * tables
>  * cluster management
>  * meta-storage
> All these types should use the same recovery procedure, but the code is 
> located in different places. I hope that it won't be a problem.

[jira] [Updated] (IGNITE-17076) Unify RowId format for different storages

2022-07-25 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17076:
---
Description: 
Current MV store bridge API has a fatal flaw, born from a misunderstanding. 
There's a method called "insert" that generates RowId by itself. This is wrong, 
because it can lead to different ids for the same row on replica storages. 
This completely breaks everything.

Every replicated write command that inserts a new value should produce the same 
row ids on every replica. There are several ways to achieve this:
 * Use timestamps as identifiers. This is not very convenient, because we would 
have to attach the partition id on top of it. It's mandatory to know the 
partition of the row.
 * Use a more complicated structure, for example a tuple of (raftCommitIndex, 
partitionId, batchCounter), where
 ** raftCommitIndex is the index of the write command that performs the 
insertion.
 ** partitionId is an integer identifier of the partition. Could be 4 bytes, 
considering that there are plans to support more than 65000 partitions per 
table.
 ** batchCounter is used to differentiate insertions made in a single write 
command. We can limit it to 2 bytes to save a little bit of space, if it's 
necessary.

I prefer the second option, but maybe it could be revised during the 
implementation.

Of course, the method "insert" should be removed from the bridge API. Tests have 
to be updated. With the lack of a RAFT group in storage tests, we can generate 
row ids artificially, it's not a big deal.

EDIT: the second option makes it difficult to use row ids in the action request 
processor in cases when data is inserted. So, hybrid clock + partition id is a 
better option.

  was:
Current MV store bridge API has a fatal flaw, born from a misunderstanding. 
There's a method called "insert" that generates RowId by itself. This is wrong, 
because it can lead to different ids for the same row on replica storages. 
This completely breaks everything.

Every replicated write command that inserts a new value should produce the same 
row ids on every replica. There are several ways to achieve this:
 * Use timestamps as identifiers. This is not very convenient, because we would 
have to attach the partition id on top of it. It's mandatory to know the 
partition of the row.
 * Use a more complicated structure, for example a tuple of (raftCommitIndex, 
partitionId, batchCounter), where
 ** raftCommitIndex is the index of the write command that performs the 
insertion.
 ** partitionId is an integer identifier of the partition. Could be 4 bytes, 
considering that there are plans to support more than 65000 partitions per 
table.
 ** batchCounter is used to differentiate insertions made in a single write 
command. We can limit it to 2 bytes to save a little bit of space, if it's 
necessary.

I prefer the second option, but maybe it could be revised during the 
implementation.

Of course, the method "insert" should be removed from the bridge API. Tests have 
to be updated. With the lack of a RAFT group in storage tests, we can generate 
row ids artificially, it's not a big deal.


> Unify RowId format for different storages
> -
>
> Key: IGNITE-17076
> URL: https://issues.apache.org/jira/browse/IGNITE-17076
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Current MV store bridge API has a fatal flaw, born from a misunderstanding. 
> There's a method called "insert" that generates RowId by itself. This is 
> wrong, because it can lead to different ids for the same row on replica 
> storages. This completely breaks everything.
> Every replicated write command that inserts a new value should produce the 
> same row ids on every replica. There are several ways to achieve this:
>  * Use timestamps as identifiers. This is not very convenient, because we 
> would have to attach the partition id on top of it. It's mandatory to know 
> the partition of the row.
>  * Use a more complicated structure, for example a tuple of (raftCommitIndex, 
> partitionId, batchCounter), where
>  ** raftCommitIndex is the index of the write command that performs the 
> insertion.
>  ** partitionId is an integer identifier of the partition. Could be 4 bytes, 
> considering that there are plans to support more than 65000 partitions per 
> table.
>  ** batchCounter is used to differentiate insertions made in a single write 
> command. We can limit it to 2 bytes to save a little bit of space, if it's 
> necessary.
> I prefer the second option, but maybe it could be revised during the 
> implementation.
> Of course, the method "insert" should be removed from the bridge API. Tests 
> have to be updated. With the lack of a RAFT group in storage tests, we can 
> generate row ids artificially, it's not a big deal.
> EDIT: the second option makes it difficult to use row ids in the action 
> request processor in cases when data is inserted. So, hybrid clock + 
> partition id is a better option.

[jira] [Updated] (IGNITE-16665) [Native Persistence 3.0] Move the group ID to the configuration

2022-07-22 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16665:
---
Reviewer: Semyon Danilov

> [Native Persistence 3.0] Move the group ID to the configuration
> ---
>
> Key: IGNITE-16665
> URL: https://issues.apache.org/jira/browse/IGNITE-16665
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Tkalenko
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Problem:
> Currently, persistent storage engines such as *PageMemory* and *RocksDB* use 
> the 
> *org.apache.ignite.configuration.schemas.table.TableConfigurationSchema#name* 
> (as part or all) as the name for the directory in which the data will be 
> stored. This does not allow the table to be renamed correctly: after a node 
> restart the data would be expected in the directory with the new name, but it 
> remains in the directory with the old name.
> Possible solution:
> Do not use the name of the table as the directory name, instead add the 
> *org.apache.ignite.configuration.schemas.table.TableConfigurationSchema#id* 
> which should never change, and must also be unique.
> Please see:
>  * *org.apache.ignite.internal.storage.StorageUtils#groupId*
>  * 
> *org.apache.ignite.internal.storage.pagememory.PersistentPageMemoryTableStorage#start*
>  * 
> *org.apache.ignite.internal.storage.rocksdb.RocksDbStorageEngine#createTable*
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-16665) [Native Persistence 3.0] Move the group ID to the configuration

2022-07-22 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-16665:
--

Assignee: Ivan Bessonov

> [Native Persistence 3.0] Move the group ID to the configuration
> ---
>
> Key: IGNITE-16665
> URL: https://issues.apache.org/jira/browse/IGNITE-16665
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Tkalenko
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> Problem:
> Currently, persistent storage engines such as *PageMemory* and *RocksDB* use 
> the 
> *org.apache.ignite.configuration.schemas.table.TableConfigurationSchema#name* 
> (as part or all) as the name for the directory in which the data will be 
> stored. This does not allow the table to be renamed correctly: after a node 
> restart the data would be expected in the directory with the new name, but it 
> remains in the directory with the old name.
> Possible solution:
> Do not use the name of the table as the directory name, instead add the 
> *org.apache.ignite.configuration.schemas.table.TableConfigurationSchema#id* 
> which should never change, and must also be unique.
> Please see:
>  * *org.apache.ignite.internal.storage.StorageUtils#groupId*
>  * 
> *org.apache.ignite.internal.storage.pagememory.PersistentPageMemoryTableStorage#start*
>  * 
> *org.apache.ignite.internal.storage.rocksdb.RocksDbStorageEngine#createTable*
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17278) TableManager#directTableIds can't be implemented effectively

2022-07-21 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17278:
---
Reviewer: Kirill Tkalenko

> TableManager#directTableIds can't be implemented effectively
> 
>
> Key: IGNITE-17278
> URL: https://issues.apache.org/jira/browse/IGNITE-17278
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I propose adding a special method "internalIds" to the direct proxy, so that 
> there is no need to read all tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16907) Add ability to use Raft log as storage WAL within the scope of local recovery

2022-07-20 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16907:
---
Epic Link: IGNITE-16923

> Add ability to use Raft log as storage WAL within the scope of local recovery
> -
>
> Key: IGNITE-16907
> URL: https://issues.apache.org/jira/browse/IGNITE-16907
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> h4. Problem
> From a bird's eye view, the raft-to-storage flow looks similar to:
>  # 
> {code:java}
> RaftGroupService#run(writeCommand());{code}
>  # Inner raft replication logic; when replicated on a majority, adjust 
> raft.commitedIndex.
>  # Propagate the command to RaftGroupListener (raft state machine).
> {code:java}
> RaftGroupListener#onWrite(closure(writeCommand()));{code}
>  # Within the state machine, insert data from the writeCommand into the 
> underlying storage:
> {code:java}
> var insertRes = storage.insert(cmd.getRow(), cmd.getTimestamp());{code}
>  # Ack that the data was applied successfully:
> {code:java}
> clo.result(insertRes);{code}
>  # Move raft.appliedIndex to the corresponding value, meaning that the data 
> for this index is applied to the state machine.
> The most interesting part, especially for this ticket, relates to step 4.
> In the real world, the storage doesn't flush every mutator to disk; instead 
> it buffers some amount of such mutators and flushes them all together as part 
> of some checkpointing process. Thus, if the node fails before 
> mutatorsBuffer.flush(), it might lose some data, because raft will apply data 
> starting from appliedIndex + 1 on recovery.
> h4. Possible solutions:
> There are several possibilities to solve this issue:
>  # In-storage WAL. Bad solution, because there's already a raft log that can 
> be used as a WAL. Such duplication is redundant.
>  # Local recovery starting from appliedIndex - mutatorsBuffer.size. Bad 
> solution. Won't work for non-idempotent operations. Exposes inner storage 
> details such as mutatorBuffer.size.
>  # proposedIndex propagation + checkpointIndex synchronization. Seems fine. 
> More details below:
>  * First of all, in order to coordinate the raft replicator and the storage, 
> the proposedIndex should be propagated to the raftGroupListener and the 
> storage.
>  * On every checkpoint, the storage will persist the corresponding proposed 
> index as checkpointIndex.
>  ** In case of storage inner checkpoints, the storage won't notify the raft 
> replicator about the new checkpointIndex. This kind of notification is an 
> optimization that does not affect the correctness of the protocol.
>  ** In case of outer checkpoint intention, e.g. raft snapshotting for the 
> purposes of raft log truncation, the corresponding checkpointIndex will be 
> propagated to the raft replicator within a callback "onSnapshotDone".
>  * During local recovery raft will apply raft log entries from the very 
> beginning. If the checkpointIndex turns out to be bigger than the 
> proposedIndex of another raft log entry, it fails the proposed closure with 
> IndexMismatchException(checkpointIndex), which leads to a proposedIndex shift 
> and optional async raft log truncation.
> Let's consider the following example:
> checkpointBuffer = 3. [P] - persisted entities, [!P] - not persisted / 
> in-memory ones.
>  # raft.put(k1,v1)
>  ## -> raftlog[cmd(k1,v1, index:1)]
>  ## -> storage[(k1,v1), index:1]
>  ## -> appliedIndex:1
>  # raft.put(k2,v2)
>  ## -> raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2)]
>  ## -> storage[(k1,v1), (k2,v2), index:2]
>  ## -> appliedIndex:2
>  # raft.put(k3,v3)
>  ## -> raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2), cmd(k3,v3, index:3)]
>  ## -> storage[(k1,v1), (k2,v2), (k3,v3), index:3]
>  ## -> appliedIndex:3
>  ## inner storage checkpoint
>  ### raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2), cmd(k3,v3, index:3)]
>  ### storage[(k1,v1, proposedIndex:1), (k2,v2, proposedIndex:2), (k3,v3, 
> proposedIndex:3)]
>  ### checkpointedData[(k1,v1), (k2,v2), (k3,v3), checkpointIndex:3]
>  # raft.put(k4,v4)
>  ## -> raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2), cmd(k3,v3, index:3), 
> cmd(k4,v4, index:4)]
>  ## -> storage[(k1,v1), (k2,v2), (k3,v3), (k4,v4), index:4]
>  ## -> checkpointedData[(k1,v1), (k2,v2), (k3,v3), checkpointIndex:3]
>  ## -> appliedIndex:4
>  # Node failure
>  # Node restart
>  ## StorageRecovery: storage.apply(checkpointedData)
>  ## raft-to-storage data application starting from index: 1 // raft doesn't 
> know the checkpointedIndex at this point.
>  ### -> storageResponse::IndexMismatchException(3)

[jira] [Updated] (IGNITE-17393) Make JRaftServiceFactory properly configurable

2022-07-20 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17393:
---
Reviewer: Semyon Danilov

> Make JRaftServiceFactory properly configurable
> --
>
> Key: IGNITE-17393
> URL: https://issues.apache.org/jira/browse/IGNITE-17393
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, the ported JRaft code is polluted with unnecessary changes.
> On top of that, the way we configure the raft service is far from optimal.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17393) Make JRaftServiceFactory properly configurable

2022-07-19 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17393:
--

 Summary: Make JRaftServiceFactory properly configurable
 Key: IGNITE-17393
 URL: https://issues.apache.org/jira/browse/IGNITE-17393
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov


Currently, the ported JRaft code is polluted with unnecessary changes.

On top of that, the way we configure the raft service is far from optimal.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16907) Add ability to use Raft log as storage WAL within the scope of local recovery

2022-07-13 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16907:
---
Reviewer: Kirill Tkalenko

> Add ability to use Raft log as storage WAL within the scope of local recovery
> -
>
> Key: IGNITE-16907
> URL: https://issues.apache.org/jira/browse/IGNITE-16907
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h4. Problem
> From a bird's eye view, the raft-to-storage flow looks similar to:
>  # 
> {code:java}
> RaftGroupService#run(writeCommand());{code}
>  # Inner raft replication logic; when replicated on a majority, adjust 
> raft.commitedIndex.
>  # Propagate the command to RaftGroupListener (raft state machine).
> {code:java}
> RaftGroupListener#onWrite(closure(writeCommand()));{code}
>  # Within the state machine, insert data from the writeCommand into the 
> underlying storage:
> {code:java}
> var insertRes = storage.insert(cmd.getRow(), cmd.getTimestamp());{code}
>  # Ack that the data was applied successfully:
> {code:java}
> clo.result(insertRes);{code}
>  # Move raft.appliedIndex to the corresponding value, meaning that the data 
> for this index is applied to the state machine.
> The most interesting part, especially for this ticket, relates to step 4.
> In the real world, the storage doesn't flush every mutator to disk; instead 
> it buffers some amount of such mutators and flushes them all together as part 
> of some checkpointing process. Thus, if the node fails before 
> mutatorsBuffer.flush(), it might lose some data, because raft will apply data 
> starting from appliedIndex + 1 on recovery.
> h4. Possible solutions:
> There are several possibilities to solve this issue:
>  # In-storage WAL. Bad solution, because there's already a raft log that can 
> be used as a WAL. Such duplication is redundant.
>  # Local recovery starting from appliedIndex - mutatorsBuffer.size. Bad 
> solution. Won't work for non-idempotent operations. Exposes inner storage 
> details such as mutatorBuffer.size.
>  # proposedIndex propagation + checkpointIndex synchronization. Seems fine. 
> More details below:
>  * First of all, in order to coordinate the raft replicator and the storage, 
> the proposedIndex should be propagated to the raftGroupListener and the 
> storage.
>  * On every checkpoint, the storage will persist the corresponding proposed 
> index as checkpointIndex.
>  ** In case of storage inner checkpoints, the storage won't notify the raft 
> replicator about the new checkpointIndex. This kind of notification is an 
> optimization that does not affect the correctness of the protocol.
>  ** In case of outer checkpoint intention, e.g. raft snapshotting for the 
> purposes of raft log truncation, the corresponding checkpointIndex will be 
> propagated to the raft replicator within a callback "onSnapshotDone".
>  * During local recovery raft will apply raft log entries from the very 
> beginning. If the checkpointIndex turns out to be bigger than the 
> proposedIndex of another raft log entry, it fails the proposed closure with 
> IndexMismatchException(checkpointIndex), which leads to a proposedIndex shift 
> and optional async raft log truncation.
> Let's consider the following example:
> checkpointBuffer = 3. [P] - persisted entities, [!P] - not persisted / 
> in-memory ones.
>  # raft.put(k1,v1)
>  ## -> raftlog[cmd(k1,v1, index:1)]
>  ## -> storage[(k1,v1), index:1]
>  ## -> appliedIndex:1
>  # raft.put(k2,v2)
>  ## -> raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2)]
>  ## -> storage[(k1,v1), (k2,v2), index:2]
>  ## -> appliedIndex:2
>  # raft.put(k3,v3)
>  ## -> raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2), cmd(k3,v3, index:3)]
>  ## -> storage[(k1,v1), (k2,v2), (k3,v3), index:3]
>  ## -> appliedIndex:3
>  ## inner storage checkpoint
>  ### raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2), cmd(k3,v3, index:3)]
>  ### storage[(k1,v1, proposedIndex:1), (k2,v2, proposedIndex:2), (k3,v3, 
> proposedIndex:3)]
>  ### checkpointedData[(k1,v1), (k2,v2), (k3,v3), checkpointIndex:3]
>  # raft.put(k4,v4)
>  ## -> raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2), cmd(k3,v3, index:3), 
> cmd(k4,v4, index:4)]
>  ## -> storage[(k1,v1), (k2,v2), (k3,v3), (k4,v4), index:4]
>  ## -> checkpointedData[(k1,v1), (k2,v2), (k3,v3), checkpointIndex:3]
>  ## -> appliedIndex:4
>  # Node failure
>  # Node restart
>  ## StorageRecovery: storage.apply(checkpointedData)
>  ## raft-to-storage data application starting from index: 1 // raft doesn't 
> know the checkpointedIndex at this point.
>  ### -> storageResponse::IndexMismatchException(3)
>  

[jira] [Assigned] (IGNITE-16907) Add ability to use Raft log as storage WAL within the scope of local recovery

2022-07-11 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-16907:
--

Assignee: Ivan Bessonov

> Add ability to use Raft log as storage WAL within the scope of local recovery
> -
>
> Key: IGNITE-16907
> URL: https://issues.apache.org/jira/browse/IGNITE-16907
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> h4. Problem
> From a bird's eye view, the raft-to-storage flow looks similar to:
>  # 
> {code:java}
> RaftGroupService#run(writeCommand());{code}
>  # Inner raft replication logic; when replicated on a majority, adjust 
> raft.commitedIndex.
>  # Propagate the command to RaftGroupListener (raft state machine).
> {code:java}
> RaftGroupListener#onWrite(closure(writeCommand()));{code}
>  # Within the state machine, insert data from the writeCommand into the 
> underlying storage:
> {code:java}
> var insertRes = storage.insert(cmd.getRow(), cmd.getTimestamp());{code}
>  # Ack that the data was applied successfully:
> {code:java}
> clo.result(insertRes);{code}
>  # Move raft.appliedIndex to the corresponding value, meaning that the data 
> for this index is applied to the state machine.
> The most interesting part, especially for this ticket, relates to step 4.
> In the real world, the storage doesn't flush every mutator to disk; instead 
> it buffers some amount of such mutators and flushes them all together as part 
> of some checkpointing process. Thus, if the node fails before 
> mutatorsBuffer.flush(), it might lose some data, because raft will apply data 
> starting from appliedIndex + 1 on recovery.
> h4. Possible solutions:
> There are several possibilities to solve this issue:
>  # In-storage WAL. Bad solution, because there's already a raft log that can 
> be used as a WAL. Such duplication is redundant.
>  # Local recovery starting from appliedIndex - mutatorsBuffer.size. Bad 
> solution. Won't work for non-idempotent operations. Exposes inner storage 
> details such as mutatorBuffer.size.
>  # proposedIndex propagation + checkpointIndex synchronization. Seems fine. 
> More details below:
>  * First of all, in order to coordinate the raft replicator and the storage, 
> the proposedIndex should be propagated to the raftGroupListener and the 
> storage.
>  * On every checkpoint, the storage will persist the corresponding proposed 
> index as checkpointIndex.
>  ** In case of storage inner checkpoints, the storage won't notify the raft 
> replicator about the new checkpointIndex. This kind of notification is an 
> optimization that does not affect the correctness of the protocol.
>  ** In case of outer checkpoint intention, e.g. raft snapshotting for the 
> purposes of raft log truncation, the corresponding checkpointIndex will be 
> propagated to the raft replicator within a callback "onSnapshotDone".
>  * During local recovery raft will apply raft log entries from the very 
> beginning. If the checkpointIndex turns out to be bigger than the 
> proposedIndex of another raft log entry, it fails the proposed closure with 
> IndexMismatchException(checkpointIndex), which leads to a proposedIndex shift 
> and optional async raft log truncation.
> Let's consider the following example:
> checkpointBuffer = 3. [P] - persisted entities, [!P] - not persisted / 
> in-memory ones.
>  # raft.put(k1,v1)
>  ## -> raftlog[cmd(k1,v1, index:1)]
>  ## -> storage[(k1,v1), index:1]
>  ## -> appliedIndex:1
>  # raft.put(k2,v2)
>  ## -> raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2)]
>  ## -> storage[(k1,v1), (k2,v2), index:2]
>  ## -> appliedIndex:2
>  # raft.put(k3,v3)
>  ## -> raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2), cmd(k3,v3, index:3)]
>  ## -> storage[(k1,v1), (k2,v2), (k3,v3), index:3]
>  ## -> appliedIndex:3
>  ## inner storage checkpoint
>  ### raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2), cmd(k3,v3, index:3)]
>  ### storage[(k1,v1, proposedIndex:1), (k2,v2, proposedIndex:2), (k3,v3, 
> proposedIndex:3)]
>  ### checkpointedData[(k1,v1), (k2,v2), (k3,v3), checkpointIndex:3]
>  # raft.put(k4,v4)
>  ## -> raftlog[cmd(k1,v1, index:1), cmd(k2,v2, index:2), cmd(k3,v3, index:3), 
> cmd(k4,v4, index:4)]
>  ## -> storage[(k1,v1), (k2,v2), (k3,v3), (k4,v4), index:4]
>  ## -> checkpointedData[(k1,v1), (k2,v2), (k3,v3), checkpointIndex:3]
>  ## -> appliedIndex:4
>  # Node failure
>  # Node restart
>  ## StorageRecovery: storage.apply(checkpointedData)
>  ## raft-to-storage data application starting from index: 1 // raft doesn't 
> know the checkpointedIndex at this point.
>  ### -> storageResponse::IndexMismatchException(3)
>  #### raft-to-storage data application starting from index: 4

[jira] [Created] (IGNITE-17341) Support RAFT configuration with HOCON

2022-07-08 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17341:
--

 Summary: Support RAFT configuration with HOCON
 Key: IGNITE-17341
 URL: https://issues.apache.org/jira/browse/IGNITE-17341
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Currently, the only way to change RAFT settings is to explicitly change them in 
code, which is not convenient for actual usage of the product. Some options 
have to be available via HOCON configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17340) Disable fsync in RAFT log by default

2022-07-08 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17340:
--

 Summary: Disable fsync in RAFT log by default
 Key: IGNITE-17340
 URL: https://issues.apache.org/jira/browse/IGNITE-17340
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Currently, {{org.apache.ignite.raft.jraft.option.RaftOptions#sync}} leads to 
long inserts into tables. The Runner TC suite can be made significantly faster 
by disabling fsync by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17340) Disable fsync in RAFT log by default

2022-07-08 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17340:
--

Assignee: Ivan Bessonov

> Disable fsync in RAFT log by default
> 
>
> Key: IGNITE-17340
> URL: https://issues.apache.org/jira/browse/IGNITE-17340
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Currently, {{org.apache.ignite.raft.jraft.option.RaftOptions#sync}} leads to 
> long inserts into tables. The Runner TC suite can be made significantly 
> faster by disabling fsync by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17339) Implement B+Tree based hash index storage

2022-07-08 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17339:
--

 Summary: Implement B+Tree based hash index storage
 Key: IGNITE-17339
 URL: https://issues.apache.org/jira/browse/IGNITE-17339
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Please refer to IGNITE-17320 and the issues from the epic for the gist. It's 
basically the same thing, but with the hash slapped inside the tree pages and 
a simplified comparison algorithm.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17338) Implement RocksDB based hash index storage

2022-07-08 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17338:
--

 Summary: Implement RocksDB based hash index storage
 Key: IGNITE-17338
 URL: https://issues.apache.org/jira/browse/IGNITE-17338
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Please see IGNITE-17318 for a partial description of what needs to be achieved.

I expect that hash index records will have the following structure:
{code:java}
[ indexId | partitionId | hash | tuple | rowId ] -> []{code}
The fixed-length prefix should cover indexId, partitionId and the hash value.

Searching for rows effectively becomes a scan, but this is fine.

Hashing must be performed internally; a hash function is already present 
somewhere in the code.

As far as I understand, the PK is going to be implemented as a secondary hash 
index.
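A minimal sketch of assembling such a key, under assumed field widths (the 
actual layout and sizes are not specified here, so treat all of it as an 
illustration):
{code:java}
import java.nio.ByteBuffer;
import java.util.UUID;

class HashIndexKeys {
    /** Builds [ indexId | partitionId | hash | tuple | rowId ]; the value is empty. */
    static byte[] key(int indexId, int partitionId, int hash, byte[] tuple, UUID rowId) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 2 + 4 + tuple.length + 16);

        buf.putInt(indexId);               // fixed-length prefix starts here...
        buf.putShort((short) partitionId);
        buf.putInt(hash);                  // ...and ends here

        buf.put(tuple);                    // variable-length binary tuple

        buf.putLong(rowId.getMostSignificantBits());
        buf.putLong(rowId.getLeastSignificantBits());

        return buf.array();
    }
}{code}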



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17006) Add protection against arbitrary page memory links in LinkRowId

2022-07-07 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563653#comment-17563653
 ] 

Ivan Bessonov commented on IGNITE-17006:


Given that LinkRowId won't be a thing soon, I propose closing this issue. See 
IGNITE-17076

> Add protection against arbitrary page memory links in LinkRowId
> ---
>
> Key: IGNITE-17006
> URL: https://issues.apache.org/jira/browse/IGNITE-17006
> Project: Ignite
>  Issue Type: Improvement
>  Components: persistence
>Reporter: Roman Puchkovskiy
>Assignee: Roman Puchkovskiy
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> It's theoretically possible to pass an arbitrary page memory link (via 
> LinkRowId) which might cause troubles:
>  # If pageId exceeds page memory limit, the JVM might crash
>  # If the page with this pageId was never initialized, an attempt to read 
> will fail with an internal assertion (because lock state will be 0)
> A possibility for item ID to be invalid is already handled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17325) Consider in-place compare for BinaryTuple comparator

2022-07-07 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17325:
--

 Summary: Consider in-place compare for BinaryTuple comparator
 Key: IGNITE-17325
 URL: https://issues.apache.org/jira/browse/IGNITE-17325
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


We should be able to compare columns in IGNITE-17318 / IGNITE-17320 without 
deserializing them. This includes String, BigInteger, BigDecimal, BitMap, and 
maybe other types that I forgot about.
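As an illustration of what comparing without deserialization can look like for 
one of those types: UTF-8 bytes compare in the same order as code points, so 
strings can be compared in place. The offset/length parameters below are 
assumptions about how the tuple layout might expose the column:
{code:java}
import java.nio.ByteBuffer;

class InPlaceCompare {
    /** Compares two UTF-8 encoded strings without materializing java.lang.String. */
    static int compareUtf8(ByteBuffer a, int offA, int lenA,
            ByteBuffer b, int offB, int lenB) {
        int n = Math.min(lenA, lenB);

        for (int i = 0; i < n; i++) {
            // Unsigned byte comparison preserves code-point order for UTF-8.
            int x = a.get(offA + i) & 0xFF;
            int y = b.get(offB + i) & 0xFF;

            if (x != y)
                return x - y;
        }

        return lenA - lenB; // the shorter string is a prefix, hence smaller
    }
}{code}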



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-16665) [Native Persistence 3.0] Move the group ID to the configuration

2022-07-07 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563636#comment-17563636
 ] 

Ivan Bessonov commented on IGNITE-16665:


I'm sorry, but this description has nothing to do with groupId. Is this 
intentional?

> [Native Persistence 3.0] Move the group ID to the configuration
> ---
>
> Key: IGNITE-16665
> URL: https://issues.apache.org/jira/browse/IGNITE-16665
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> Problem:
> Currently, persistent storage engines such as *PageMemory* and *RocksDB* use 
> the 
> *org.apache.ignite.configuration.schemas.table.TableConfigurationSchema#name* 
> (as part or all) as the name for the directory in which the data will be 
> stored. This does not allow you to rename the table correctly, since the data 
> will have to be in the new directory so that it is not lost after restarting 
> the node, but it will be in the directory with the old name.
> Possible solution:
> Do not use the name of the table as the directory name, instead add the 
> *org.apache.ignite.configuration.schemas.table.TableConfigurationSchema#id* 
> which should never change, and must also be unique.
> Please see:
>  * *org.apache.ignite.internal.storage.StorageUtils#groupId*
>  * 
> *org.apache.ignite.internal.storage.pagememory.PersistentPageMemoryTableStorage#start*
>  * 
> *org.apache.ignite.internal.storage.rocksdb.RocksDbStorageEngine#createTable*
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17320) Implement B+Tree based sorted index storage

2022-07-06 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17320:
--

 Summary: Implement B+Tree based sorted index storage
 Key: IGNITE-17320
 URL: https://issues.apache.org/jira/browse/IGNITE-17320
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Like in IGNITE-17318, the binary tuple format is the main goal here, because 
that's what we're going to pass into the API.

Implementing a tree for indexes isn't hard by itself. I expect it to be stored 
in the same partition files as the raw storage; everything should be colocated 
in a single file.

What's complicated is the inlining of data into tree pages. I see it as the 
following:
 * If the tuple has no offset table, we can always store the entire payload in 
the tree. This is the best-case scenario, because the size is fixed and known 
a priori; we don't even need to store it before the payload.
 * If the tuple has an offset table, the inline size will immediately get 
bigger. In my view, we will have to:
 ** store the size of the inlined payload, that's 4 bytes
 ** store the null table if it's there, that's a known amount of bytes
 ** store the header and the offset table:
 *** if there are non-fixed-length columns, then a single entry in the offset 
table can be up to 4 bytes
 *** if there are only fixed-length columns, like ints, floats or even 
bitsets, the number of bytes per single entry can be accurately estimated with 
an upper bound
 ** then store a good amount of the actual columns data. How much? I'd be 
generous, but then we would probably have too much space, so all of this is 
debatable (a rough estimation sketch follows below):
 *** for columns with fixed size, allocate room for the entire value
 *** for strings and numbers (is there something else?) we have to 
pre-allocate a reasonable amount of bytes. Like, 8, for example. There are 
defaults somewhere in the code of Ignite 2.x, we can use them.

So, my point is, there's no new data format for the inlined section of the 
tuple; we should reuse the existing one and thus avoid many possible errors. 
And, if a record fits into a tree page, there's no need to insert it into a 
free list. Good!

And yes, of course, there has to be room for the row id in the inlined section 
as well.
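A rough sketch of that upper-bound estimation. The 4-byte size field, the 
8-byte default for variable-length columns, and the column metadata shape are 
all assumptions for illustration:
{code:java}
class InlineSizeEstimator {
    static final int SIZE_FIELD = 4;     // length of the inlined payload
    static final int VARLEN_DEFAULT = 8; // assumed pre-allocation for strings/numbers

    /** fixedSizes[i] >= 0 for fixed-length columns, -1 for variable-length ones. */
    static int estimate(int[] fixedSizes, int nullTableBytes, int rowIdBytes) {
        boolean hasVarlen = false;
        int payload = 0;

        for (int size : fixedSizes) {
            if (size < 0) {
                hasVarlen = true;
                payload += VARLEN_DEFAULT;
            } else {
                payload += size;
            }
        }

        // Offset table: up to 4 bytes per entry with variable-length columns,
        // otherwise bounded by the known total of the fixed sizes.
        int entryBytes = hasVarlen ? 4 : bytesToAddress(payload);
        int offsetTable = fixedSizes.length * entryBytes;

        return SIZE_FIELD + nullTableBytes + offsetTable + payload + rowIdBytes;
    }

    private static int bytesToAddress(int maxOffset) {
        return maxOffset <= 0xFF ? 1 : maxOffset <= 0xFFFF ? 2 : 4;
    }
}{code}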

Lastly, the meta tree is still a thing. But we shouldn't identify indexes by 
their names, since there's a UUID id or even an integer id (see IGNITE-17318).
h3. Other ideas

I don't like how durable background tasks work in Ignite 2.x; there are always 
some issues. I would prefer having a general-purpose "recycle bin" in the 
partition and a background cleaner process that would clean it. Maybe this 
queue should contain other entities in the future.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17318) Implement RocksDB based sorted index storage

2022-07-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17318:
---
Labels: ignite-3  (was: )

> Implement RocksDB based sorted index storage
> 
>
> Key: IGNITE-17318
> URL: https://issues.apache.org/jira/browse/IGNITE-17318
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Pretty straightforward. Complicated places are:
>  * Binary tuples comparator: should be as fast as possible. Some 
> optimizations might be moved to other issues.
>  * Thorough testing is required. We have both Java and native comparators 
> planned. They should behave identically. This means a specific way of writing 
> tests, to account for this in advance.
>  * Bounds checking on range scan:
> by default, the comparator should include the lower bound and exclude the 
> upper bound. This is how prefix search works. This means that exclusion of 
> the lower bound (if needed) and inclusion of the upper bound (if needed) 
> +must be performed manually+ inside the scan method.
> The question is, do we use separate column families for indexes? On one 
> hand, this increases the number of files and potentially even increases 
> flush time, but on the other, it looks easy (or is it?).
> Currently, every index has only a UUID id. Just like for tables, we could 
> create an integer identifier, because why not. This way we could store all 
> indexes in a single column family without too much overhead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17318) Implement RocksDB based sorted index storage

2022-07-06 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17318:
--

 Summary: Implement RocksDB based sorted index storage
 Key: IGNITE-17318
 URL: https://issues.apache.org/jira/browse/IGNITE-17318
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Pretty straightforward. Complicated places are:
 * Binary tuples comparator: should be as fast as possible. Some optimizations 
might be moved to other issues.
 * Thorough testing is required. We have both Java and native comparators 
planned. They should behave identically. This means a specific way of writing 
tests, to account for this in advance.
 * Bounds checking on range scan:
by default, the comparator should include the lower bound and exclude the 
upper bound. This is how prefix search works. This means that exclusion of the 
lower bound (if needed) and inclusion of the upper bound (if needed) +must be 
performed manually+ inside the scan method (a sketch follows below).

The question is, do we use separate column families for indexes? On one hand, 
this increases the number of files and potentially even increases flush time, 
but on the other, it looks easy (or is it?).

Currently, every index has only a UUID id. Just like for tables, we could 
create an integer identifier, because why not. This way we could store all 
indexes in a single column family without too much overhead.
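A self-contained sketch of the manual bound handling described above, assuming 
plain byte[] keys and the default prefix semantics (inclusive lower bound, 
exclusive upper bound); all names are illustrative:
{code:java}
import java.util.Arrays;

class RangeBounds {
    /** True iff {@code key} starts with {@code prefix}. */
    static boolean startsWith(byte[] key, byte[] prefix) {
        return key.length >= prefix.length
                && Arrays.equals(key, 0, prefix.length, prefix, 0, prefix.length);
    }

    /** Default prefix semantics, with the deviations applied manually in scan. */
    static boolean inRange(byte[] key, byte[] low, byte[] up,
            boolean lowInclusive, boolean upInclusive) {
        if (low != null) {
            if (Arrays.compareUnsigned(key, low) < 0)
                return false;

            // Manual exclusion of the lower bound: skip keys matching its prefix.
            if (!lowInclusive && startsWith(key, low))
                return false;
        }

        if (up != null) {
            // Manual inclusion of the upper bound: keep keys matching its prefix.
            if (startsWith(key, up))
                return upInclusive;

            if (Arrays.compareUnsigned(key, up) >= 0)
                return false;
        }

        return true;
    }
}{code}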



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-07-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17081:
---
Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding; 
the description is continued from there.

For the RocksDB-based storage, the recovery process is trivial, because 
RocksDB has its own WAL. So, for testing purposes, it would be enough to just 
store the update index in the meta column family.

Immediately we have a write amplification issue, on top of possible 
performance degradation. The obvious solution is inherently bad and needs to 
be improved.
h2. General idea & implementation

Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This 
effectively breaks the RocksDB recovery procedure, so we need to take measures 
to compensate.

The only feasible way to do so is to use DBOptions#setAtomicFlush in 
conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
all column families consistently, if you have batches that cover several CFs. 
Basically, {{acquireConsistencyLock()}} would create a thread-local write 
batch that's applied on the lock's release. Most of RocksDbMvPartitionStorage 
will be affected by this change (a sketch follows below).

NOTE: I believe that scans with unapplied batches should be prohibited for now 
(luckily, there's a WriteBatchInterface#count() to check). I don't see any 
practical value or a proper way of implementing it, considering how spread-out 
in time the scan process is.
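A minimal sketch of that thread-local write batch, using RocksJava types; the 
surrounding lock/storage structure and method names are assumptions:
{code:java}
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatchWithIndex;
import org.rocksdb.WriteOptions;

class ConsistencyLock {
    private final RocksDB db;
    private final WriteOptions writeOptions = new WriteOptions().setDisableWAL(true);
    private final ThreadLocal<WriteBatchWithIndex> batch = new ThreadLocal<>();

    ConsistencyLock(RocksDB db) {
        this.db = db;
    }

    /** All writes made under the lock go through the returned batch. */
    WriteBatchWithIndex acquire() {
        WriteBatchWithIndex b = new WriteBatchWithIndex();
        batch.set(b);
        return b;
    }

    /** Applies the batch atomically, covering several column families at once. */
    void release() throws RocksDBException {
        WriteBatchWithIndex b = batch.get();
        batch.remove();

        db.write(writeOptions, b);
        b.close();
    }
}{code}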
h2. Callbacks and RAFT snapshots

Simply storing and reading the update index is easy. Reading the committed 
index is more challenging; I propose caching it and updating it only from the 
closure, which can also be used by RAFT to truncate the log.

For a closure, there are several things to account for during the 
implementation:
 * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
atomic flush mode. And, once you have your first "completed" event, you have a 
guarantee that *all* memtables are already persisted.
This allows easy tracking of RocksDB flushes; monitoring event alternation is 
all that's needed.
 * Unlike the PDS implementation, here we will be writing the updateIndex 
value into a memtable every time. This makes it harder to find persistedIndex 
values for partitions. Luckily, considering the events that we have, during 
the time between the first "completed" and the very next "begin", the state on 
disk is fully consistent. And there's a way to read data from the storage 
bypassing the memtable completely - ReadOptions#setReadTier(PERSISTED_TIER).

Summarizing everything from the above, we should implement the following 
protocol:
{code:java}
// During table start: read the latest values of update indexes and cache them
// in an in-memory structure.
lastEventType = ON_FLUSH_COMPLETED;

onFlushBegin:
  if (lastEventType == ON_FLUSH_BEGIN)
    return;

  waitForLastAsyncUpdateIndexesRead();

  lastEventType = ON_FLUSH_BEGIN;

onFlushCompleted:
  if (lastEventType == ON_FLUSH_COMPLETED)
    return;

  asyncReadUpdateIndexesFromDisk();

  lastEventType = ON_FLUSH_COMPLETED;{code}
Reading values from disk must be performed asynchronously so as not to stall 
the flushing process. We don't control the locks that RocksDB holds while 
calling the listener's methods.

That asynchronous process would invoke closures that provide persisted 
updateIndex values to other components (a listener sketch follows below).
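A minimal listener sketch of this protocol, assuming RocksJava's 
AbstractEventListener; the two helper methods are hypothetical:
{code:java}
import org.rocksdb.AbstractEventListener;
import org.rocksdb.FlushJobInfo;
import org.rocksdb.RocksDB;

class FlushTracker extends AbstractEventListener {
    private enum EventType { BEGIN, COMPLETED }

    private volatile EventType lastEvent = EventType.COMPLETED;

    @Override
    public void onFlushBegin(RocksDB db, FlushJobInfo info) {
        if (lastEvent == EventType.BEGIN)
            return; // atomic flush: all "begin" events come in a row

        waitForLastAsyncUpdateIndexesRead();

        lastEvent = EventType.BEGIN;
    }

    @Override
    public void onFlushCompleted(RocksDB db, FlushJobInfo info) {
        if (lastEvent == EventType.COMPLETED)
            return;

        // Between the first "completed" and the next "begin" the on-disk state
        // is fully consistent, so indexes can be re-read asynchronously with
        // ReadOptions#setReadTier(PERSISTED_TIER).
        asyncReadUpdateIndexesFromDisk();

        lastEvent = EventType.COMPLETED;
    }

    private void waitForLastAsyncUpdateIndexesRead() { /* hypothetical helper */ }

    private void asyncReadUpdateIndexesFromDisk() { /* hypothetical helper */ }
}{code}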

NOTE: One might say that we should call "waitForLastAsyncUpdateIndexesRead();" 
as late as possible, just in case. But my implementation calls it during the 
first event. This is fine. I noticed that column families are flushed in the 
order of their internal ids. These ids correspond to the creation sequence of 
CFs, and the "default" CF is always created first. This is the exact CF that 
we use to store meta. Maybe we're going to change this and create a separate 
meta CF. Only then could we start optimizing this part, and only if we have 
actual proof that there's a stall in this exact place.
h3. Types of storages

RocksDB is used for:
 * tables
 * cluster management
 * meta-storage

All these types should use the same recovery procedure, but the code is 
located in different places. I hope it won't be a big problem and we can do 
everything at once.

  was:
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding, 
the description is continued from there.

For RocksDB based storage the recovery process is trivial, because RocksDB has 
its own WAL. So, for testing purposes, it would be enough to just store update 
index in meta column family.

Immediately we have a write amplification issue, on top 

[jira] [Created] (IGNITE-17310) Integrate IndexStorage into a TableStorage API

2022-07-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17310:
--

 Summary: Integrate IndexStorage into a TableStorage API
 Key: IGNITE-17310
 URL: https://issues.apache.org/jira/browse/IGNITE-17310
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


As an endpoint, we need an interface that represents a single index storage 
for a single partition. But creating/destroying these storages is not as 
obvious from an API standpoint.

When an index is created, storages should be created for every existing 
partition. And when a partition is created, index storages should be created 
for it as well. This complicates things a little bit, but, generally speaking, 
something like this could be a solution (sketched below):
 * CompletableFuture createIndex(indexConfiguration);
 * CompletableFuture dropIndex(indexId);
 * IndexMvStorage getIndexStorage(indexId, partitionId);

Build / rebuild API will be figured out later in another issue.
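A hedged sketch of how these methods might look on the table storage; the 
generics and the IndexConfiguration/IndexMvStorage types are assumed stand-ins:
{code:java}
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

interface IndexConfiguration { } // hypothetical stand-in
interface IndexMvStorage { }     // hypothetical stand-in

interface TableStorage {
    /** Creates index storages for every existing partition. */
    CompletableFuture<Void> createIndex(IndexConfiguration indexConfiguration);

    /** Destroys all per-partition storages of the index. */
    CompletableFuture<Void> dropIndex(UUID indexId);

    /** Endpoint: a single index storage for a single partition. */
    IndexMvStorage getIndexStorage(UUID indexId, int partitionId);
}{code}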



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17308) Revisit SortedIndexMvStorage interface

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17308:
---
Description: 
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{{}InternalTuple{}}}, with the requirement that every 
internal tuple can be converted into a IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
Index scan should NOT by itself filter-out invalid rows, this will be performed 
outside of scan.
 * TxId / Timestamp parameters are no longer applicable, given that index does 
not perform rows validation.
 * Partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. Former can be brought back in the future, while latter makes no sense 
considering that indexes are not multiversioned.
 * new methods, like {{update}} and {{remove}} should be added to API.

h3. New API for removed functions
 * There should be a new entity on top of partition and index store. It updates 
indexes and filters scan queries. There's no point in fully designing it right 
now, all we need is working tests for now. Porting current tests to new API is 
up to a developer.

h3. Other

I would say that effective InternalTuple comparison is out of scope. We could 
just adapt current test code somehow.

  was:
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{{}InternalTuple{}}}, with the requirement that every 
internal tuple can be converted into a IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
Index scan should NOT by itself filter-out invalid rows, this will be performed 
outside of scan.
 * TxId / Timestamp parameters are no longer applicable, given that index does 
not perform rows validation.
 * Partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. Former can be brought back in the future, while latter makes no sense 
considering that indexes are not multiversioned.

h3. New API for removed functions
 * There should be a new entity on top of partition and index store. It updates 
indexes and filters scan queries. There's no point in fully designing it right 
now, all we need is working tests for now.


> Revisit SortedIndexMvStorage interface
> --
>
> Key: IGNITE-17308
> URL: https://issues.apache.org/jira/browse/IGNITE-17308
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
> contract is far from obvious and it's only used in tests as a part of 
> "reference implementation".
> Originally, it was implemented when the vision of MV store wasn't fully 
> solidified.
> h3. API changes
>  * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
> should be replaced with {{{}InternalTuple{}}}, with the requirement that 
> every internal tuple can be converted into a IEP-92 format.
>  * {{scan}} should not return rows, but only indexed rows and RowId 
> instances. Index scan should NOT by itself filter-out invalid rows, this will 
> be performed outside of scan.
>  * TxId / Timestamp parameters are no longer applicable, given that index 
> does not perform rows validation.
>  * Partition filter should be removed as well. To simplify things, every 
> partition will be indexed {+}independently{+}.
>  * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
> now. Former can be brought back in the future, while latter makes no sense 
> considering that indexes are not multiversioned.
>  * new methods, like {{update}} and {{remove}} should be added to API.
> h3. New API for removed functions
>  * There should be a new entity on top of partition and index store. It 
> updates indexes and filters scan queries. There's no point in fully designing 
> it right now, all we need is working tests for now. Porting current tests 

[jira] [Updated] (IGNITE-17308) Revisit SortedIndexMvStorage interface

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17308:
---
Description: 
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{{}InternalTuple{}}}, with the requirement that every 
internal tuple can be converted into a IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
Index scan should NOT by itself filter-out invalid rows, this will be performed 
outside of scan.
 * TxId / Timestamp parameters are no longer applicable, given that index does 
not perform rows validation.
 * Partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. Former can be brought back in the future, while latter makes no sense 
considering that indexes are not multiversioned.

h3. New API for removed functions
 * There should be a new entity on top of partition and index store. It updates 
indexes and filters scan queries. There's no point in fully designing it right 
now, all we need is working tests for now.

  was:
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{{}InternalTuple{}}}, with the requirement that every 
internal tuple can be converted into a IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
Index scan should NOT by itself filter-out invalid rows, this will be performed 
outside of scan.
 * TxId / Timestamp parameters are no longer applicable, given that index does 
not perform rows validation.
 * Partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. Former can be brought back in the future, while latter makes no sense 
considering that indexes are not multiversioned.


> Revisit SortedIndexMvStorage interface
> --
>
> Key: IGNITE-17308
> URL: https://issues.apache.org/jira/browse/IGNITE-17308
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
> contract is far from obvious and it's only used in tests as a part of 
> "reference implementation".
> Originally, it was implemented when the vision of MV store wasn't fully 
> solidified.
> h3. API changes
>  * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
> should be replaced with {{{}InternalTuple{}}}, with the requirement that 
> every internal tuple can be converted into a IEP-92 format.
>  * {{scan}} should not return rows, but only indexed rows and RowId 
> instances. Index scan should NOT by itself filter-out invalid rows, this will 
> be performed outside of scan.
>  * TxId / Timestamp parameters are no longer applicable, given that index 
> does not perform rows validation.
>  * Partition filter should be removed as well. To simplify things, every 
> partition will be indexed {+}independently{+}.
>  * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
> now. Former can be brought back in the future, while latter makes no sense 
> considering that indexes are not multiversioned.
> h3. New API for removed functions
>  * There should be a new entity on top of partition and index store. It 
> updates indexes and filters scan queries. There's no point in fully designing 
> it right now, all we need is working tests for now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17308) Revisit SortedIndexMvStorage interface

2022-07-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17308:
--

 Summary: Revisit SortedIndexMvStorage interface
 Key: IGNITE-17308
 URL: https://issues.apache.org/jira/browse/IGNITE-17308
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{{}InternalTuple{}}}, with the requirement that every 
internal tuple can be converted into a IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
Index scan should NOT by itself filter-out invalid rows, this will be performed 
outside of scan.
 * TxId / Timestamp parameters are no longer applicable, given that index does 
not perform rows validation.
 * Partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. Former can be brought back in the future, while latter makes no sense 
considering that indexes are not multiversioned.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16156) Byte ordered index keys.

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16156.

Resolution: Won't Fix

A different data format will be used

> Byte ordered index keys.
> 
>
> Key: IGNITE-16156
> URL: https://issues.apache.org/jira/browse/IGNITE-16156
> Project: Ignite
>  Issue Type: Task
>  Components: sql
>Reporter: Alexander Belyak
>Assignee: Alexander Belyak
>Priority: Major
>  Labels: ignite-3
>
> To improve the speed of index operations, Ignite can store keys in a 
> byte-ordered format, so that the natural byte[] comparator is enough to scan 
> them.
> Required features:
> 1) Write any (almost) data types.
> Must have: boolean, byte, short, int, long, float, double, bigint, 
> bigdecimal, String, Date, Time, DateTime.
> Nice to have: byte[], bitset.
> Unlikely to have: timestamp with timezone.
> 2) Support null values for any columns. Nice to have: support 
> nullFirst/nullLast.
> 3) Write asc/desc ordering (in any combination of columns, for indexes like 
> "col1 asc, col2 desc, col3 asc").
> Non-functional requirements: space usage and speed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16105) Replace sorted index binary storage protocol

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16105.

Resolution: Won't Fix

IGNITE-17192 will be used instead

> Replace sorted index binary storage protocol
> 
>
> Key: IGNITE-16105
> URL: https://issues.apache.org/jira/browse/IGNITE-16105
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> Sorted Index Storage currently uses {{BinaryRow}} as a way to convert column 
> values into byte arrays. This approach is not optimal for the following 
> reasons:
> # Data is stored in RocksDB and we can't use its native lexicographic 
> comparator; we rely on a custom Java-based comparator that needs to 
> de-serialize all columns in order to compare them. This is bad 
> performance-wise, because Java-based comparators are slower and we need to 
> extract all column values;
> # Range scans can't use the prefix seek operation from RocksDB, because 
> {{BinaryRow}} serialization is not stable: a serialized prefix of column 
> values will not be a prefix of the whole serialized row, because the format 
> depends on the columns being serialized;
> # {{BinaryRow}} serialization is designed to store versioned row data and is 
> overall badly suited to the Sorted Index purposes; its API usage looks 
> awkward in this context.
> We need to find a new serialization protocol that will (ideally) satisfy the 
> following requirements:
> # It should be comparable lexicographically;
> # It should support null values;
> # It should support variable length columns (though this requirement can 
> probably be dropped);
> # It should support both ascending and descending order for individual 
> columns;
> # It should support all data types that {{BinaryRow}} uses.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16079) Rename search and data keys for the Partition Storage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16079.

Resolution: Won't Fix

> Rename search and data keys for the Partition Storage
> -
>
> Key: IGNITE-16079
> URL: https://issues.apache.org/jira/browse/IGNITE-16079
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> There are currently the following classes in the {{PartitionStorage}} that 
> act as data and search keys: {{SearchRow}} and {{DataRow}}. This makes the 
> {{SortedIndexStorage}} interface hard to understand, because it stores 
> {{SearchRows}} as values. It is proposed to rename these classes:
>  {{SearchRow}} -> {{PartitionKey}}
>  {{DataRow}} -> {{PartitionData}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16059) Add options to the "range" method in SortedIndexStorage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16059.

Resolution: Won't Fix

> Add options to the "range" method in SortedIndexStorage
> ---
>
> Key: IGNITE-16059
> URL: https://issues.apache.org/jira/browse/IGNITE-16059
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> [IEP-74|https://cwiki.apache.org/confluence/display/IGNITE/IEP-74+Data+Storage]
>  declares the following API for the {{SortedIndexStorage#range}} method:
> {code:java}
> /** Exclude lower bound. */
> byte GREATER = 0;
>  
> /** Include lower bound. */
> byte GREATER_OR_EQUAL = 1;
>  
> /** Exclude upper bound. */
> byte LESS = 0;
>  
> /** Include upper bound. */
> byte LESS_OR_EQUAL = 1 << 1;
> /**
>  * Returns rows between the lower and upper bounds.
>  * Fills result rows with the fields specified in the projection set.
>  *
>  * @param low Lower bound of the scan.
>  * @param up Upper bound of the scan.
>  * @param scanBoundMask Scan bound mask (specifies how to treat rows equal 
> to the bounds: include or exclude).
>  * @param proj Set of column IDs to fill result rows with.
>  */
> Cursor scan(Row low, Row up, byte scanBoundMask, BitSet proj);
> {code}
> The {{scanBoundMask}} flags are currently not implemented. This API should be 
> revised and implemented, if needed (an illustrative usage sketch follows).
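> As an illustration only, assuming the flag values above, composing a mask 
> for a half-open range [low, up):
> {code:java}
> // Include the lower bound, exclude the upper bound. GREATER and LESS are
> // both 0: they denote the absence of the corresponding "OR_EQUAL" bit.
> byte scanBoundMask = (byte) (GREATER_OR_EQUAL | LESS); // == 1
> 
> Cursor rows = index.scan(low, up, scanBoundMask, projection);{code}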



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

