On Fri, Jun 8, 2018 at 7:34 AM, Asim R P <aprav...@pivotal.io> wrote:

> Hi,
>
> In the pluggable storage unconference session in Ottawa,
> column oriented storage was a key point of discussion.  We would like
> to share an overview of Greenplum's column store.  We hope that this
> will aid a discussion to carve out a pluggable storage interface.
> This is just an overview, source code is where the truth is:
> https://github.com/greenplum-db/gpdb
>
> Greenplum introduced "appendonly" as a table type to support row as
> well as column orientation.  The design goal for appendonly tables is to
> reduce I/O cost so that workloads involving bulk I/O (OLAP) run
> faster.  As the name suggests, tuples stored in the underlying files
> are immutable (no in-place updates / deletes).  The files can only be
> extended with new data.  We will focus only on column oriented
> appendonly tables in what follows.
>

Thanks for sharing the details of how Greenplum has implemented
columnar storage and its use cases. This will be helpful in creating
a better pluggable table access method interface.


> A create table SQL command to create a column oriented table:
>
>     CREATE TABLE col_table (c1 int, c2 varchar) WITH (appendonly=true,
> orientation=column);
>

The pluggable table access method patch currently provides the
following syntax to create a table with a specific access method [1].

CREATE TABLE ... [USING ACCESSMETHOD] ...

The access method must be created by the columnar storage extension.


> Storage format:
>
> Each column of a column oriented table gets a dedicated file in the
> data directory.  A file contains a series of variable length blocks
> ("varblocks").  Each concurrent transaction writing to an appendonly
> table is allocated a dedicated set of files, one per column.  At
> most 128 concurrent transactions are allowed to write to the same
> appendonly table, leading to at most 128 groups of files (one per
> column).  Let's assume two concurrent transactions inserted into the
> "col_table" in the above example.  The table will have two groups of
> files, each group containing two files (one per column).  A group of
> files is called "segment".  Segment 1 for "col_table" contains files
> "<relfilenode>.1" (c1) and "<relfilenode>.129" (c2).  Segment 2
> contains files "<relfilenode>.2" (c1) and "<relfilenode>.130" (c2).
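
If I am reading the numbering correctly, the file suffix for a given
column and segment follows a simple formula. A minimal sketch of my
reading of the description above (not Greenplum's actual code):

    /* File-suffix sketch: colno is 0-based, segno is 1-based, and at
     * most 128 concurrent writers (segments) are allowed per table. */
    static int
    aocs_file_suffix(int colno, int segno)
    {
        return colno * 128 + segno; /* c1/seg1 -> 1, c2/seg1 -> 129 */
    }
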
>
> Appendonly tuples do not contain MVCC information (xmin/xmax).  An
> auxiliary table, pg_aoseg.pg_aoseg_<oid>, is created for each
> appendonly table.  It keeps track of the length of each segment file
> for the appendonly table.  The aoseg table is a heap table; normal
> MVCC rules apply.  This setup isolates transactions concurrently
> reading or writing to an appendonly table from each other.
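
So, if I read this right, a scan's visibility boundary is simply the
segment file length recorded in the MVCC-visible aoseg tuple. A sketch
of the reader side with hypothetical names (scan_segfile and
read_varblock are not real functions):

    #include <stdint.h>

    /* Isolation sketch: read a segfile only up to the EOF recorded in
     * the aoseg tuple visible to this scan's snapshot.  Bytes appended
     * by concurrent writers lie past visible_eof and are never read. */
    static void
    scan_segfile(int fd, int64_t visible_eof,
                 int64_t (*read_varblock)(int fd, int64_t offset))
    {
        int64_t offset = 0;

        while (offset < visible_eof)
            offset += read_varblock(fd, offset); /* varblock length */
    }
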
>
> Similar to heap TID, appendonly TID is a 6-byte struct
> (appendonlytid.h).  It comprises a segment number and a row number.
> Appendonly TID is recorded in indexes without much distinction from
> heap TID.  Unlike heap TID, an appendonly TID is not sufficient to
> locate a row number within a column's file.  Another auxiliary table,
> block directory (pg_aoseg.pg_aoblckdir_<oid>), maps a row number to
> an offset in the segment.  The offset points to the start of the varblock
> containing the row number.  Index lookups first obtain the appendonly
> TID from the index, then map the row number to its offset using block
> directory and then read the varblock from the file.  The aoblckdir
> table is a heap table; normal MVCC rules apply.
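
For reference, a 6-byte TID along these lines could be laid out as in
the sketch below. The field layout here is only my illustration (see
appendonlytid.h for the real definition), and the lookup flow is
paraphrased from your description:

    #include <stdint.h>

    /* Sketch of a 6-byte appendonly TID: a small segment-file number
     * plus a row number packed into three uint16s, so it occupies the
     * same space as a heap ItemPointerData and fits in index entries. */
    typedef struct AOTupleIdSketch
    {
        uint16_t bytes_0_1;
        uint16_t bytes_2_3;
        uint16_t bytes_4_5;
    } AOTupleIdSketch;

    /* Index lookup flow, per the description above:
     *   1. read the appendonly TID from the index entry
     *   2. map (segno, rownum) to a varblock offset via the block
     *      directory (pg_aoseg.pg_aoblckdir_<oid>)
     *   3. read the varblock at that offset and extract the datum
     */
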
>
> UPDATE and DELETE:
>
> Updates and deletes are supported by maintaining a bitmap representing
> visibility of appendonly row numbers.  Note that visibility is
> an attribute of a row.  E.g. if row number 10 is invisible in
> "col_table", the column values corresponding to row number 10 in both
> "<relfilenode>.1" as well as "<relfilenode>.129" files should be
> considered invisible.  The bitmap is maintained in another auxiliary
> table - pg_aoseg.aovisimap, which is also a normal heap table.
> Deleting a row results in no change to the appendonly files.  The bit
> representing the deleted row number is set in the aovisimap table.
> UPDATE is delete followed by insert.  No update chains are maintained.
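
A worked sketch of the delete path as described, with hypothetical
names; I am assuming one visimap entry covers a fixed, contiguous
range of row numbers (32768 here is a guess, not the real constant):

    #include <stdint.h>

    #define VISIMAP_RANGE 32768  /* rows per visimap entry; assumed */

    /* Delete sketch: the column data files are untouched; we only set
     * the bit for the row inside the covering visimap entry, which is
     * itself stored as a normal heap tuple in aovisimap. */
    static void
    visimap_delete(uint8_t *bitmap, int64_t rownum)
    {
        int64_t bit = rownum % VISIMAP_RANGE; /* position in the entry */

        bitmap[bit / 8] |= (uint8_t) (1 << (bit % 8));
    }
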
>
> Catalog changes:
>
> * pg_appendonly table contains OIDs of auxiliary tables for each
> appendonly table.
> * For each appendonly table, the auxiliary tables created are: aoseg,
> aovisimap and block directory.  Similar to toast tables, these have
> their own entries in pg_class with their own relfilenodes.
> * New pg_class.relstorage values 'a' for row oriented AO and 'c' for
> column oriented.
>

The current pluggable table access method interface provides a way to
create additional tables in pg_class with relkind 'E' (external
table); these can be accessed only by the storage access method, not
directly by the user.

Fujitsu has also developed its own columnar storage [2]. It is
currently tightly coupled with a modified version of PostgreSQL, and
we want to port it to the pluggable table access method interface.

In our columnar storage, we also need to create additional relations
to provide the columnar storage support. To generalize this for every
other storage engine, I thought of the above external table approach.


> Compression:
>
> Each column can be compressed.  Supported compression algorithms are
> zlib, zstd, RLE (run length encoding), and delta compression on top
> of RLE.
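
For readers unfamiliar with the latter two: a generic sketch of
run-length encoding is below (this is not Greenplum's on-disk format);
delta compression would store differences between consecutive values
before such an RLE pass, which helps with sorted or slowly changing
columns.

    #include <stddef.h>
    #include <stdint.h>

    /* Generic RLE sketch: emit (value, count) pairs; returns the
     * number of pairs written to out_val[]/out_cnt[]. */
    static size_t
    rle_encode(const int32_t *in, size_t n,
               int32_t *out_val, int32_t *out_cnt)
    {
        size_t pairs = 0;
        size_t i = 0;

        while (i < n)
        {
            size_t j = i + 1;

            while (j < n && in[j] == in[i])
                j++;
            out_val[pairs] = in[i];
            out_cnt[pairs] = (int32_t) (j - i);
            pairs++;
            i = j;
        }
        return pairs;
    }
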
>
> Buffer management:
>
> Backend private memory is used to buffer contents of appendonly files.
> Block size (a multiple of 8K) can be specified per column.  The amount
> of memory buffered per file is equal to the block size configured for
> that column.
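
In other words, the buffer memory held by a scan scales with the
number of projected columns times their block sizes, roughly as in
this sketch (ColBufSketch and alloc_col_buffers are made-up names;
palloc is the regular backend allocator):

    #include "postgres.h"

    typedef struct ColBufSketch
    {
        int32 blocksize; /* per-column block size, a multiple of 8K */
        char *buf;       /* backend-private buffer for this column */
    } ColBufSketch;

    /* Buffer sizing sketch: each projected column's file gets a
     * private buffer equal to that column's configured block size. */
    static void
    alloc_col_buffers(ColBufSketch *cols, int nprojected)
    {
        int i;

        for (i = 0; i < nprojected; i++)
            cols[i].buf = palloc(cols[i].blocksize);
    }
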
>
> Executor touch points:
>
> Column store offers access methods such as
> aocs_beginscan(), aocs_getnext(), aocs_fetch(), aocs_rescan(),
> aocs_insert().  One may come across a bunch of conditionals sprinkled
> around executor code to invoke either heap, appendonly row or column
> store access methods.  A better abstraction layer would have made the
> code much cleaner but we don't have it today.
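
For concreteness, I imagine the kind of branching you describe looks
roughly like the sketch below (hypothetical shape, not the actual
Greenplum code); a table access method dispatch table would replace
exactly this sort of conditional:

    /* Hypothetical prototypes standing in for the real scan APIs. */
    void *heap_beginscan_sketch(void *rel);
    void *appendonly_beginscan_sketch(void *rel);
    void *aocs_beginscan_sketch(void *rel);

    /* The branching a table-AM vtable would eliminate. */
    static void *
    begin_scan_dispatch(char relstorage, void *rel)
    {
        switch (relstorage)
        {
            case 'a': return appendonly_beginscan_sketch(rel); /* row AO */
            case 'c': return aocs_beginscan_sketch(rel); /* column AO */
            default:  return heap_beginscan_sketch(rel); /* 'h' heap */
        }
    }
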
>
> Scan of column oriented tables is performed by reading the projected
> columns' files in a loop.  Note that the values stored in the files
> contain only data, no MVCC information is stored.  Scan node returns a
> tuple table slot, just like in case of heap tables.  The slot contains
> virtual tuples.  The values and nulls arrays contain the values read
> from respective files.  If needed during execution, the virtual tuples
> are converted to what we call "memtuples".  Memtuples are similar to
> minimal tuples but supposedly more optimized for accessing individual
> datums.
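
So a per-tuple scan step presumably amounts to filling the slot's
values/nulls arrays straight from the per-column files. A sketch with
made-up names (ColReaderSketch, col_reader_next):

    #include <stdbool.h>

    /* Scan sketch: fill a virtual tuple's values/nulls arrays by
     * advancing each projected column's file reader to the next row;
     * no MVCC fields are read from disk. */
    typedef struct ColReaderSketch ColReaderSketch;
    bool col_reader_next(ColReaderSketch *rd, void **value, bool *isnull);

    static bool
    aocs_getnext_sketch(ColReaderSketch **readers, int ncols,
                        void **values, bool *nulls)
    {
        int i;

        for (i = 0; i < ncols; i++)
            if (!col_reader_next(readers[i], &values[i], &nulls[i]))
                return false; /* end of the segment files */
        return true; /* the slot now holds a virtual tuple */
    }
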
>
>
> Vacuum:
>
> Vacuum operates on one appendonly segment at a time.  The appendonly
> segment is locked such that concurrent insert/update/delete are
> blocked from operating on this segment.  The segment is scanned using
> aocs_getnext() and aovisimap is used to filter out invisible row
> numbers.  Only the visible row numbers are written to a new segment.
> After the scan is finished, the old segment is marked for deletion in
> pg_aocsseg_<oid> table.  Transactions reading the table do not read
> files belonging to segments that are marked for deletion.  After all
> segments have been scanned, all auxiliary tables (aocsseg, aovisimap,
> block directory and toast) are vacuumed.
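
The compaction step, as I understand it, is essentially the loop
below (all helper names here are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    bool aocs_getnext_from(int segno, int64_t *rownum);
    bool visimap_is_visible(int64_t rownum);
    void aocs_insert_into(int new_segno, int64_t rownum);
    void aoseg_mark_for_deletion(int segno);

    /* Vacuum sketch: copy only the visible rows of one locked segment
     * into a fresh segment, then mark the old one for deletion in the
     * aoseg table so that readers skip it afterwards. */
    static void
    compact_segment_sketch(int old_segno, int new_segno)
    {
        int64_t rownum;

        while (aocs_getnext_from(old_segno, &rownum))
            if (visimap_is_visible(rownum))
                aocs_insert_into(new_segno, rownum);

        aoseg_mark_for_deletion(old_segno);
    }
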
>
> Problem worth highlighting:
>
> The design of the aovisimap table poses hurdles for concurrent update and
> delete operations on the same appendonly table.  One aovisimap entry
> covers multiple row numbers in the appendonly table.  If two
> concurrent transactions delete or update different row numbers such
> that they are covered by the same aovisimap tuple, one of them must
> abort.
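
A concrete instance of the hurdle: with one entry covering a fixed
range of row numbers, two transactions conflict whenever their target
rows map to the same entry key, since both must then update the same
aovisimap heap tuple. Sketch (VISIMAP_RANGE is assumed, as above):

    #include <stdint.h>

    #define VISIMAP_RANGE 32768 /* rows per visimap entry; assumed */

    /* Two deletes conflict iff their row numbers share a covering
     * entry, i.e. the same (segno, rownum / VISIMAP_RANGE) key. */
    static int64_t
    visimap_entry_key(int64_t rownum)
    {
        return rownum / VISIMAP_RANGE; /* rows 5 and 7 -> same key 0 */
    }
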
>
> Because updated versions of the same tuple are not chained, the
> aocs_update() interface differs from its heap counterpart, leading to
> anomalies such as
> https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/73LOuQFkIps/1TNHW0lGAwAJ
>
> Unique indexes are not supported on column oriented tables.
> Identifying a concurrent transaction that might be inserting duplicate
> tuples becomes hard because (1) lack of MVCC information in tuples and
> (2) one block directory tuple covers multiple row numbers.
>

Thanks for sharing your design details. I will go through them
carefully to understand whether the currently proposed pluggable table
access method interfaces can handle all the other storage methods that
are implemented outside PostgreSQL.


[1] - https://www.postgresql.org/message-id/CAJrrPGfRGtRz_-Qyt_hK9KCQYdN%3DpbpjPooAHM9dZhJmqwN8fg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/cajrrpgfac7wc9nk6ptty6yn-nn+hcy8xolah2doyhvg5d6h...@mail.gmail.com


Regards,
Haribabu Kommi
Fujitsu Australia
