Ivan Bessonov created IGNITE-16933:
--------------------------------------

             Summary: PageMemory-based MV storage implementation
                 Key: IGNITE-16933
                 URL: https://issues.apache.org/jira/browse/IGNITE-16933
             Project: Ignite
          Issue Type: New Feature
            Reporter: Ivan Bessonov


Similar to IGNITE-16611, we need an MV-storage implementation for page memory 
storage engine. Currently, I expect only row storage implementation, without 
primary or secondary indexes.

Here I'm going to describe a data format. Each row is stored as a versioned 
chain. It will be represented by a number of data entries that will have 
references to each other.
{code:java}
[ Timestamp | NextLink | PayloadSize | Payload ]{code}
 * Timestamp is a 16 bytes value derived from 
{{org.apache.ignite.internal.tx.Timestamp}} instance.
 * NextLink is a link to the next element in the chain or a NULL_LINK (or any 
other convenient name). It's a long value in standard format for Page Memory 
links (itemId, flag, partitionId, pageIdx). Technically, partition id is not 
needed here, because it's always the same. Removing it could allow us to save 2 
bytes per chain element.
 * PayloadSize is a 4-bytes integer value that gives us the size of actual data 
in arbitrary format.
 * Payload - I expect it to be a serialized BinaryRow data. This is how it's 
implemented in RocksDB right now.

For uncommitted (pending) entries I propose using maximal possible timestamp - 
{{{}(Long.MAX_VALUE, Long.MAX_VALUE){}}}. This will simplify things. Note that 
we never store tx id in chain itself.

Overall, every chain element will have a (16 + 6 + 4 = 26) bytes header. It 
should be used as a header size in corresponding FreeList.

There's a requirement to have an immutable RowId for every versioned chain. One 
could argue that we should just make chain head immutable, but it would result 
in lots of complications. It's better to have a separate structure with 
immutable link, that will point to an actual head of the versioned chain.
{code:java}
[ TransactionId | HeadLink | NextLink ]{code}
 * TransactionId is a UUID. Can only be applied to pending entries. For 
committed head I propose storing 16 zeroes.
 * HeadLink is a link to the chain's head. Either 8 or 6 bytes. As already 
mentioned, I'd prefer 6.
 * NextLink is a "NextLink" value from the head chain element. It's a cheap 
shortcut for read-only transactions, you can skip uncommitted entry without 
even trying to read it, if there's a non-null transaction id. Debatable, I 
know, but looks cheap enough.

In total, RowId is a 8 bytes link, pointing to a structure that has (16 + 6 + 6 
= 28) bytes of data. There must e a separate FreeList for every partition even 
in In-Memory mode for reasons that I'll give later. "Header" size in that list 
must be equal to these 28 bytes. I wander how effective FreeList will be for 
this case, where every chunk has the same size. We'll see. Maybe we should 
adjust a number of buckets somehow.

Now, the fun part. There's no mention of B+Tree here. That's because we can 
probably just avoid it. If it existed, it would just point RowId to a described 
RowId structure in partition, but RowId is already a pointer itself. The only 
other problem that is usually solved by a tree-like structure is a full-scan of 
all rows in partition. This is useful when you need to rebuild indexes, for 
example.

We should keep in mind that there's no code yet for rebuilding indexes. On the 
other hand, there's a method for partition scan in the API. It could be used to 
implement a Primary Index imitation until we have a real implementation.

There's not FreeList full-scan currently in the code, it needs to be 
implemented. And, this particular full-scan is the reason why every partition 
should have its own list of row ids.

There's also a chance that introducing new flag for row ids might be 
convenient. I don't know yet, let's not do it for now.

Finally, we need an adequate protection from assertions if we, for some reason, 
have invalid row id. Things that can be checked be a normal code, not assertion:
 * data page type
 * number of items in the page



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to