Re: Column store in Greenplum

2018-06-07 Thread Haribabu Kommi
On Fri, Jun 8, 2018 at 7:34 AM, Asim R P  wrote:

> Hi,
>
> In the pluggable storage unconference session in Ottawa,
> column oriented storage was a key point of discussion.  We would like
> to share an overview of Greenplum's column store.  We hope that this
> will aid a discussion to carve out a pluggable storage interface.
> This is just an overview, source code is where the truth is:
> https://github.com/greenplum-db/gpdb
>
> Greenplum introduced "appendonly" as a table type to support row as
> well as column orientation.  Design goal for appendonly tables is to
> reduce I/O cost so that workloads involving bulk I/O (OLAP) run
> faster.  As the name suggests, tuples stored in the underlying files
> are immutable (no in-place updates / deletes).  The files can only be
> extended with new data.  We will focus only on column oriented
> appendonly tables in what follows.
>

Thanks for sharing the details of how Greenplum has implemented
columnar storage and its use cases. This will be helpful in creating
better pluggable table access methods.


> A create table SQL command to create a column oriented table:
>
> CREATE TABLE col_table (c1 int, c2 varchar) WITH (appendonly=true,
> orientation=column);
>

The current pluggable table access method work provides the following
syntax to create a table with a specific access method [1].

CREATE TABLE ... [USING access_method] ...

The access method must be created by the columnar storage extension.


> Storage format:
>
> Each column of a column oriented table gets a dedicated file in the
> data directory.  A file contains a series of variable length blocks
> ("varblocks").  Each concurrent transaction writing to an appendonly
> table is allocated a dedicated set of files, one per column.  At the
> most, 128 concurrent transactions are allowed to write to the same
> appendonly table, leading to at the most 128 groups of files (one per
> column).  Let's assume two concurrent transactions inserted into the
> "col_table" in the above example.  The table will have two groups of
> files, each group containing two files (one per column).  A group of
> files is called "segment".  Segment 1 for "col_table" contains files
> ".1" (c1) and ".129" (c2).  Segment 2
> contains files ".2" (c1) and ".130" (c2).
>
> Appendonly tuples do not contain MVCC information (xmin/xmax).  An
> auxiliary table, pg_aoseg.pg_aoseg_, is created for each
> appendonly table.  It keeps track of the length of each segment file
> for the appendonly table.  The aoseg table is a heap table, normal
> MVCC rules apply.  This setup isolates transactions concurrently
> reading or writing to an appendonly table from each other.
>
> Similar to heap TID, appendonly TID is a 6-byte struct
> (appendonlytid.h).  It comprises a segment number and a row number.
> Appendonly TID is recorded in indexes without much distinction from
> heap TID.  Unlike heap TID, an appendonly TID is not sufficient to
> locate a row number within a column's file.  Another auxiliary table,
> block directory (pg_aoseg.pg_aoblckdir_), maps a row number to
> offsets in the segment.  The offset points to start of the varblock
> containing the row number.  Index lookups first obtain the appendonly
> TID from the index, then map the row number to its offset using block
> directory and then read the varblock from the file.  The aoblckdir
> table is heap, normal MVCC rules apply.
>
> UPDATE and DELETE:
>
> Updates and deletes are supported by maintaining a bitmap representing
> visibility of appendonly row numbers.  Note that visibility is an
> attribute of a row.  E.g. if row number 10 is invisible in
> "col_table", the column values corresponding to row number 10 in both
> ".1" as well as ".129" files should be
> considered invisible.  The bitmap is maintained in another auxiliary
> table - pg_aoseg.aovisimap, which is also a normal heap table.
> Deleting a row results in no change to the appendonly files.  The bit
> representing the deleted row number is set in the aovisimap table.
> UPDATE is delete followed by insert.  No update chains are maintained.
>
> Catalog changes:
>
> * pg_appendonly table contains OIDs of auxiliary tables for each
> appendonly table.
> * For each appendonly table, the auxiliary tables created are: aoseg,
> aovisimap and block directory.  Similar to toast tables, these have
> their own entries in pg_class with their own relfilenodes.
> * New pg_class.relstorage values 'a' for row oriented AO and 'c' for
> column oriented.
>

The current pluggable table access method interface provides a way to
create additional tables in pg_class with relkind 'E' (external table);
these can be accessed only by the storage access method, not directly
by the user.

Fujitsu has also developed its own columnar storage [2]. It is
currently tightly coupled with a modified version of PostgreSQL, and
we would like to port our version to the pluggable table access
method interface.

In our columnar storage, we also need to create additional relations to

Column store in Greenplum

2018-06-07 Thread Asim R P
Hi,

In the pluggable storage unconference session in Ottawa,
column oriented storage was a key point of discussion.  We would like
to share an overview of Greenplum's column store.  We hope that this
will aid a discussion to carve out a pluggable storage interface.
This is just an overview, source code is where the truth is:
https://github.com/greenplum-db/gpdb

Greenplum introduced "appendonly" as a table type to support row as
well as column orientation.  Design goal for appendonly tables is to
reduce I/O cost so that workloads involving bulk I/O (OLAP) run
faster.  As the name suggests, tuples stored in the underlying files
are immutable (no in-place updates / deletes).  The files can only be
extended with new data.  We will focus only on column oriented
appendonly tables in what follows.

A create table SQL command to create a column oriented table:

CREATE TABLE col_table (c1 int, c2 varchar) WITH (appendonly=true,
orientation=column);

Storage format:

Each column of a column oriented table gets a dedicated file in the
data directory.  A file contains a series of variable length blocks
("varblocks").  Each concurrent transaction writing to an appendonly
table is allocated a dedicated set of files, one per column.  At the
most, 128 concurrent transactions are allowed to write to the same
appendonly table, leading to at the most 128 groups of files (one per
column).  Let's assume two concurrent transactions inserted into the
"col_table" in the above example.  The table will have two groups of
files, each group containing two files (one per column).  A group of
files is called "segment".  Segment 1 for "col_table" contains files
".1" (c1) and ".129" (c2).  Segment 2
contains files ".2" (c1) and ".130" (c2).

Appendonly tuples do not contain MVCC information (xmin/xmax).  An
auxiliary table, pg_aoseg.pg_aoseg_, is created for each
appendonly table.  It keeps track of the length of each segment file
for the appendonly table.  The aoseg table is a heap table, normal
MVCC rules apply.  This setup isolates transactions concurrently
reading or writing to an appendonly table from each other.
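The isolation mechanism can be modeled in a few lines: the physical file
grows immediately on append, but readers only trust the EOF recorded in
the MVCC-protected aoseg row, which moves at commit.  This is a toy
model of the idea, not Greenplum code:

```python
class AppendOnlySegment:
    """Toy model of one appendonly segment file plus its aoseg entry."""
    def __init__(self):
        self.data = bytearray()   # the physical file: append-only
        self.committed_eof = 0    # what the heap-backed aoseg row records

    def append(self, payload: bytes):
        # Physical append happens immediately ...
        self.data.extend(payload)

    def commit(self):
        # ... but the visible length only advances at commit, via the
        # aoseg entry, so normal heap MVCC rules protect readers.
        self.committed_eof = len(self.data)

    def visible_bytes(self) -> bytes:
        # A reader trusts the committed EOF, never the physical file size.
        return bytes(self.data[:self.committed_eof])

seg = AppendOnlySegment()
seg.append(b"row1")
seg.commit()
seg.append(b"row2")                     # in-flight, not yet committed
assert seg.visible_bytes() == b"row1"   # concurrent reader sees old EOF
seg.commit()
assert seg.visible_bytes() == b"row1row2"
```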

Similar to heap TID, appendonly TID is a 6-byte struct
(appendonlytid.h).  It comprises a segment number and a row number.
Appendonly TID is recorded in indexes without much distinction from
heap TID.  Unlike heap TID, an appendonly TID is not sufficient to
locate a row number within a column's file.  Another auxiliary table,
block directory (pg_aoseg.pg_aoblckdir_), maps a row number to
offsets in the segment.  The offset points to start of the varblock
containing the row number.  Index lookups first obtain the appendonly
TID from the index, then map the row number to its offset using block
directory and then read the varblock from the file.  The aoblckdir
table is heap, normal MVCC rules apply.
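A 6-byte TID carrying both a segment number and a row number can be
pictured as a 48-bit integer split into two fields.  The 7-bit/40-bit
split below is an illustrative assumption (128 segments fit in 7 bits);
appendonlytid.h in the Greenplum sources has the real layout:

```python
# Illustrative packing of a 6-byte (48-bit) appendonly TID.
ROWNUM_BITS = 40   # assumed split: 7 bits segment number, 40 bits row number

def pack_aotid(segno: int, rownum: int) -> int:
    assert 0 <= segno < 128 and 0 <= rownum < (1 << ROWNUM_BITS)
    return (segno << ROWNUM_BITS) | rownum

def unpack_aotid(tid: int):
    return tid >> ROWNUM_BITS, tid & ((1 << ROWNUM_BITS) - 1)

# Round-trip: an index stores the packed TID; a lookup unpacks it, then
# consults the block directory to turn the row number into a file offset.
assert unpack_aotid(pack_aotid(2, 12345)) == (2, 12345)
```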

UPDATE and DELETE:

Updates and deletes are supported by maintaining a bitmap representing
visibility of appendonly row numbers.  Note that visibility is an
attribute of a row.  E.g. if row number 10 is invisible in
"col_table", the column values corresponding to row number 10 in both
".1" as well as ".129" files should be
considered invisible.  The bitmap is maintained in another auxiliary
table - pg_aoseg.aovisimap, which is also a normal heap table.
Deleting a row results in no change to the appendonly files.  The bit
representing the deleted row number is set in the aovisimap table.
UPDATE is delete followed by insert.  No update chains are maintained.
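The delete path can be sketched as a bitmap keyed by row number: a
DELETE flips a bit and never rewrites a column file, and an UPDATE is
the same bit-flip followed by a fresh insert.  A toy model (the real
aovisimap stores compressed per-segment bitmaps in a heap table):

```python
class Visimap:
    """Toy visibility map for appendonly row numbers."""
    def __init__(self):
        self.deleted = set()      # stand-in for the per-segment bitmap

    def delete(self, rownum: int):
        self.deleted.add(rownum)  # DELETE: set the bit, touch no data file

    def is_visible(self, rownum: int) -> bool:
        return rownum not in self.deleted

vm = Visimap()
vm.delete(10)
# Visibility is per row: row 10 is now hidden in every column file
# (".1" and ".129" alike), while its neighbors remain visible.
assert not vm.is_visible(10)
assert vm.is_visible(11)
```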

Catalog changes:

* pg_appendonly table contains OIDs of auxiliary tables for each
appendonly table.
* For each appendonly table, the auxiliary tables created are: aoseg,
aovisimap and block directory.  Similar to toast tables, these have
their own entries in pg_class with their own relfilenodes.
* New pg_class.relstorage values 'a' for row oriented AO and 'c' for
column oriented.

Compression:

Each column can be compressed.  Supported compression algorithms are
zlib, zstd, RLE (run length encoding), delta compression on top of
RLE.
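The run-length idea behind the last two algorithms can be sketched in a
few lines of Python.  This is a conceptual illustration, not Greenplum's
encoder; delta compression on top of RLE would additionally store
differences between successive run values, which pays off on sorted
columns:

```python
def rle_encode(values):
    """Plain run-length encoding: [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

data = [7, 7, 7, 9, 9, 7]
assert rle_encode(data) == [(7, 3), (9, 2), (7, 1)]
assert rle_decode(rle_encode(data)) == data
```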

Buffer management:

Backend private memory is used to buffer contents of appendonly files.
Block size (a multiple of 8K) can be specified per column.  The amount
of memory buffered per file is equal to the block size configured for
that column.

Executor touch points:

Column store offers access methods such as
aocs_beginscan(), aocs_getnext(), aocs_fetch(), aocs_rescan(),
aocs_insert().  One may come across a bunch of conditionals sprinkled
around executor code to invoke either heap, appendonly row or column
store access methods.  A better abstraction layer would have made the
code much cleaner but we don't have it today.
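The abstraction layer the paragraph wishes for amounts to replacing
those scattered conditionals with a dispatch table of access-method
callbacks, which is essentially what a pluggable interface provides.  A
sketch with illustrative names (only aocs_beginscan comes from the text;
the rest are stand-ins, not Greenplum functions):

```python
# Stand-in scan entry points, one per storage kind.
def heap_beginscan(rel):  return f"heap scan of {rel}"
def aocs_beginscan(rel):  return f"column-store scan of {rel}"

# One lookup table instead of if/else chains on pg_class.relstorage
# ('h' heap, 'c' column oriented AO, per the catalog changes above).
SCAN_DISPATCH = {'h': heap_beginscan, 'c': aocs_beginscan}

def beginscan(relstorage: str, rel: str):
    return SCAN_DISPATCH[relstorage](rel)

assert beginscan('c', 'col_table') == "column-store scan of col_table"
assert beginscan('h', 'heap_table') == "heap scan of heap_table"
```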

Scan of column oriented tables is performed by reading the projected
columns' files in a loop.  Note that the values stored in the files
contain only data, no MVCC information is stored.  Scan node returns a
tuple table slot, just like in case of heap tables.  The slot contains
virtual tuples.  The values and nulls arrays