Re: Zedstore - compressed in-core columnar storage

2019-05-24 Thread Mark Kirkwood



On 23/05/19 12:07 PM, Ashwin Agrawal wrote:


We (Heikki, Melanie and I) are continuing to build Zedstore, and wish to
share the recent additions and modifications. A patch with the latest
code is attached, and the github branch [1] can be used to follow
along. The approach we have been leaning towards is to build the required
functionality, get the tests passing, and then keep iterating to
optimize. It's still work-in-progress.

Sharing the details now, as we have reached our next milestone for
Zedstore. All table AM APIs are implemented for Zedstore (except
compute_xid_horizon_for_tuples, which seems to need a test first).

Current State:

- A new type of item, the "array item", has been added to Zedstore to
  boost compression and performance, based on Konstantin's performance
  experiments [2] and input from Tomas Vondra [3]. An array item holds
  multiple datums with consecutive TIDs and the same visibility
  information, and saves space compared to multiple single items by
  leaving out the repetitive UNDO and TID fields. An array item cannot
  mix NULLs and non-NULLs. So, those experiments should now show
  improved performance. Inserting data via COPY currently creates array
  items (see the short SQL sketch after this list); the insert code has
  not been modified since last time, and making singleton inserts and
  INSERT ... SELECT performant is still on the todo list.

- We now have a separate, dedicated meta-column btree alongside the
  rest of the data column btrees. This special, first btree is used to
  assign TIDs to tuples and to track the UNDO location which provides
  visibility information. Because it always exists, it also supports
  zero-column tables (which can result from ADD COLUMN / DROP COLUMN
  actions; the sketch after this list shows that case as well). Keeping
  meta-data separate from data helps to get better compression ratios,
  and it further simplifies the overall design/implementation: deletes
  only need to edit the meta-column and avoid touching the actual data
  btrees, index scans can perform visibility checks based on this
  meta-column alone and fetch datums only for visible tuples, and tuple
  locks also only need to access this meta-column. Previously, every
  column btree carried the same undo pointer, so a visibility check
  could potentially be performed using any column; given the overall
  simplification the new layout provides, it's fine to give up that
  aspect. Having a dedicated meta-column also greatly simplified
  handling ADD COLUMN with default and null values, as this column
  deterministically provides all the TIDs present in the table, which
  can't be said for any other data column due to default or null values
  during ADD COLUMN.

- Free Page Map implemented. The Free Page Map keeps track of unused
  pages in the relation. The FPM is also a b-tree, indexed by physical
  block number. To be more compact, it stores "extents", i.e. block
  ranges, rather than just blocks, when possible. An interesting paper [4]
  on how modern filesystems manage space acted as a good source of ideas.

- Tuple locks implemented

- Serializable isolation handled

- With "default_table_access_method=zedstore"
  - 31 out of 194 failing regress tests
  - 10 out of 86 failing isolation tests
Many of the current failing tests are due to plan differences, like
Index scans selected for zedstore over IndexOnly scans, as zedstore
doesn't yet have visibility map. I am yet to give a thought on
index-only scans. Or plan diffs due to table size differences between
heap and zedstore.
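
To make the array-item and zero-column points above concrete, here is a
minimal, hedged SQL sketch (hypothetical table and file names; assumes
the zedstore table AM is available via USING zedstore or
default_table_access_method):

-- Bulk COPY is currently the path that builds array items; per-row
-- INSERTs still create singleton items.
CREATE TABLE measurements (id int, note text) USING zedstore;

COPY measurements FROM '/tmp/measurements.csv' WITH (FORMAT csv);  -- array items
INSERT INTO measurements VALUES (1, 'one');                        -- singleton item

-- Dropping every user column leaves a zero-column table whose rows are
-- still tracked by the meta-column's TIDs.
ALTER TABLE measurements DROP COLUMN id;
ALTER TABLE measurements DROP COLUMN note;
SELECT count(*) FROM measurements;   -- row count is preserved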

Next few milestones we wish to hit for Zedstore:
- Make check regress green
- Make check isolation green
- Make Zedstore crash safe (which also means replication safe) by
  implementing WAL logging
- Performance profiling and optimizations for Inserts, Selects, Index
  Scans, etc.
- Once the UNDO framework lands upstream, have Zedstore leverage it
  instead of its own version of UNDO

Open questions / discussion items:

- how best to get the "column projection list" from the planner?
  (currently, we walk the plan and find the columns required for the
  query in the executor; refer to GetNeededColumnsForNode()) - a tiny
  SQL illustration follows this list

- how to pass the "column projection list" to the table AM? (as stated
  in the initial email, we currently have modified the table AM API to
  pass the projection to the AM)

- TID treated as (block, offset) in current indexing code

- Physical tlist optimization? (currently, we disabled it for
  zedstore)
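
As a tiny illustration of what the projection list buys a column store
(hypothetical wide table; for the query below the projection list would
contain only {region, amount}, so only those two column btrees need to
be scanned):

CREATE TABLE fact_wide (
    region  text,
    amount  numeric,
    note1   text,
    note2   text
    -- ... imagine many more columns
) USING zedstore;

SELECT region, sum(amount)
FROM fact_wide
GROUP BY region;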

Team:
Melanie has joined Heikki and me in writing code for zedstore. The
majority of the code continues to be contributed by Heikki. We are
continuing to have fun building a column store implementation and
iterating aggressively.

References:
1] https://github.com/greenplum-db/postgres/tree/zedstore
2] https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
3] https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
4]

Re: Zedstore - compressed in-core columnar storage

2019-04-10 Thread Mark Kirkwood



On 11/04/19 4:01 PM, Mark Kirkwood wrote:

On 9/04/19 12:27 PM, Ashwin Agrawal wrote:


Heikki and I have been hacking recently for a few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and a link to the github branch [1] to
follow along.




Very nice. I realize that it is very early days, but applying this 
patch I've managed to stumble over some compression bugs doing some 
COPY's:


benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'
USING DELIMITERS ',';
psql: ERROR:  compression failed. what now?
CONTEXT:  COPY dim1, line 458

The log has:

2019-04-11 15:48:43.976 NZST [2006] ERROR:  XX000: compression failed. 
what now?

2019-04-11 15:48:43.976 NZST [2006] CONTEXT:  COPY dim1, line 458
2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, 
zedstore_compression.c:287
2019-04-11 15:48:43.976 NZST [2006] STATEMENT:  COPY dim1 FROM 
'/data0/dump/dim1.dat'

    USING DELIMITERS ',';

The dataset is generated from an old DW benchmark I wrote 
(https://sourceforge.net/projects/benchw/). The row concerned looks like:


457,457th interesting measure,1th measure 
type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp
458,458th interesting measure,2th measure 
type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg



I'll see if changing to LZ4 makes any difference.




The COPY works with LZ4 configured.





Re: Zedstore - compressed in-core columnar storage

2019-04-10 Thread Mark Kirkwood

On 9/04/19 12:27 PM, Ashwin Agrawal wrote:


Heikki and I have been hacking recently for a few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and a link to the github branch [1] to
follow along.




Very nice. I realize that it is very early days, but applying this patch 
I've managed to stumble over some compression bugs doing some COPY's:


benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'
USING DELIMITERS ',';
psql: ERROR:  compression failed. what now?
CONTEXT:  COPY dim1, line 458

The log has:

2019-04-11 15:48:43.976 NZST [2006] ERROR:  XX000: compression failed. 
what now?

2019-04-11 15:48:43.976 NZST [2006] CONTEXT:  COPY dim1, line 458
2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, 
zedstore_compression.c:287
2019-04-11 15:48:43.976 NZST [2006] STATEMENT:  COPY dim1 FROM 
'/data0/dump/dim1.dat'

    USING DELIMITERS ',';

The dataset is generated from an old DW benchmark I wrote 
(https://sourceforge.net/projects/benchw/). The row concerned looks like:


457,457th interesting measure,1th measure 
type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp
458,458th interesting measure,2th measure 
type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg



I'll see if changing to LZ4 makes any difference.

best wishes

Mark





Re: PostgreSQL pollutes the file system

2019-03-21 Thread Mark Kirkwood
On 22/03/19 3:05 PM, Tom Lane wrote:
> Michael Paquier  writes:
>> I would be curious to hear the reason why such tool names have been
>> chosen from the start.  The tools have been switched to C in 9e0ab71
>> from 2003, have been introduced by Peter Eisentraut as of 240e4c9 from
>> 1999, and I cannot spot the thread from the time where this was
>> discussed.
> createuser, at least, dates back to Berkeley days: my copy of the
> PG v4r2 tarball contains a "src/bin/createuser/createuser.sh" file
> dated 1994-03-19.  (The 1999 commit you mention just moved the
> functionality around; it was there before.)  So I imagine the answer
> is that nobody at the time thought of fitting these scripts into a
> larger ecosystem.


FWIW the whole set is there in version 6.4.2:

markir@vedavec:/download/postgres/src/postgresql-6.4.2/src/bin$ ls -l
total 72
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 cleardbdir
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 createdb
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 createuser
drwxr-sr-x 2 markir adm 4096 Dec 31  1998 CVS
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 destroydb
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 destroyuser
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 initdb
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 initlocation
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 ipcclean
-rw-r--r-- 1 markir adm  795 Dec 19  1998 Makefile
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pgaccess
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pg_dump
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pg_encoding
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pg_id
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pg_passwd
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pgtclsh
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pg_version
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 psql

--

Mark





Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Mark Kirkwood



On 26/02/19 5:41 PM, Stephen Frost wrote:

Greetings Mark,

* Mark Kirkwood (mark.kirkw...@catalyst.net.nz) wrote:

ISTM that the onus should be on the patch submitter to provide additions to
pg_basebackup that make it as painless as possible for those people *not*
using pgBackRest to continue making backups. Breaking this is just not
right. Submitting patches that mean that people *must* use pgBackRest is
also not right IMHO.

I'm sorry that there's some confusion here- to be clear, no one is
required to use pgBackRest.  pg_basebackup works quite well and wouldn't
be impacted by the changes proposed on this thread.  The arguments
against removing the exclusive backup feature don't have anything to do
with pg_basebackup.

Ah yes (checks pg_basebackup code), you are correct! Reading this thread 
I thought I saw a comment to the effect that pg_basebackup was being 
broken, hence the less-than-impressed post.


Your relentless bashing of people doing their own backups and heavy 
marketing of pgBackRest - unfortunately - made it easy for me to believe 
that this was a possibility you might see as ok. So - apologies for 
the misunderstanding; however, less marketing of your own product would 
help me avoid jumping to the wrong conclusion.


regards

Mark




Re: Code of Conduct plan

2018-09-15 Thread Mark Kirkwood




On 15/09/18 08:17, Tom Lane wrote:

Yeah, this.  The PG community is mostly nice people, AFAICT.  I'll be
astonished (and worried) if the CoC committee finds much to do.  We're
implementing this mostly to make newcomers to the project feel that
it's a safe space.


Agreed. However, I think the all-of-life clause opens the door to 
potentially less-than-well-intentioned new members joining up to push an 
SJW agenda. So in fact the unintended consequence of this may be a 
*less* safe place for some existing members - unless all of their social 
media utterances are agreeable to the angry militant left.



It's also worth reminding people that this is v1.0 of the CoC document.
We plan to revisit it in a year or so, and thereafter as needed, to
improve anything that's causing problems or not working well.


+1. At least this means we can address the above if it emerges as a problem.

regards
Mark


regards, tom lane






Re: Is a modern build system acceptable for older platforms

2018-04-27 Thread Mark Kirkwood

On 27/04/18 19:10, Yuriy Zhuravlev wrote:



1. You can remove the tools/msvc folder because all your build rules will 
be universal. (The cmake build now has much fewer lines of code.)
2. You can forget about the terminal on Windows (important for Windows 
folks).
3. You can properly check the environment on Windows; right now we have 
hardcoded headers and many options. The configure process will be the same 
on all platforms.
4. You can generate not only GNU Make or MSVC projects, you can also 
make Xcode projects, Ninja, or NMake builds. For Windows, you can also 
easily switch MSVC to Clang; it's not hardcoded at all.
5. With CMake you have an easy way to build extra modules (plugins); I 
already have a working prototype for Windows PGXS.  A plugin should just 
include a .cmake file generated with the Postgres build.
Example: 
https://github.com/stalkerg/postgres_cmake/blob/cmake/contrib/adminpack/CMakeLists.txt 
If PGXS is True it means we build the module outside postgres.


But in my opinion, you should just try CMake to figure out all benefits.




I note that MySQL (yeah I know, we don't love 'em greatly, but their 
product niche is similar to ours) and Ceph (ok, it is a distributed 
storage system, but still a highly popular open source product) have 
switched to using cmake (relatively) recently. Both of these projects were 
using autoconf-related builds previously and seem to be doing just 
fine with cmake.


regards
Mark



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-18 Thread Mark Kirkwood

On 19/04/18 00:45, Craig Ringer wrote:



I guarantee you that when you create a 100GB EBS volume on AWS EC2,
you don't get 100GB of storage preallocated. AWS are probably pretty
good about not running out of backing store, though.




Some db folks (used to, anyway) advise dd'ing to your freshly attached 
devices on AWS (mainly for performance, IIRC), but that would help 
prevent some failure scenarios for any thin-provisioned storage (though 
it would probably really annoy the admins thereof).


regards
Mark



Re: zheap: a new storage format for PostgreSQL

2018-03-02 Thread Mark Kirkwood



On 03/03/18 05:03, Robert Haas wrote:

On Fri, Mar 2, 2018 at 5:35 AM, Alexander Korotkov
 wrote:

I would propose "zero-bloat heap" disambiguation of zheap.  Seems like fair
enough explanation for me without need to rename :)

It will be possible to bloat a zheap table in certain usage patterns.
For example, if you bulk-load the table with a ton of data, commit the
transaction, delete every other row, and then never insert any more
rows ever again, the table is bloated: it's twice as large as it
really needs to be, and we have no provision for shrinking it.  In
general, I think it's very hard to keep bulk deletes from leaving
bloat in the table, and to the extent that it *is* possible, we're not
doing it.  One could imagine, for example, an index-organized table
that automatically combines adjacent pages when they're empty enough,
and that also relocates data to physically lower-numbered pages
whenever possible.  Such a storage engine might automatically shrink
the on-disk footprint after a large delete, but we have no plans to go
in that direction.

Rather, our assumption is that the bloat most people care about comes
from updates.  By performing updates in-place as often as possible, we
hope to avoid bloating both the heap (because we're not adding new row
versions to it which then have to be removed) and the indexes (because
if we don't add new row versions at some other TID, then we don't need
to add index pointers to that new TID either, or remove the old index
pointers to the old TID).  Without delete-marking, we can basically
optimize the case that is currently handled via HOT updates: no
indexed columns have changed.  However, the in-place update has a
major advantage that it still works even when the page is completely
full, provided that the row does not expand.  As Amit's results show,
that can hugely reduce bloat and increase performance in the face of
long-running concurrent transactions.  With delete-marking, we can
also optimize the case where indexed columns have been changed.  We
don't know exactly how well this will work yet because the code isn't
written and therefore can't be benchmarked, but I am hopeful that
in-place updates will be a big win here too.

So, I would not describe a zheap table as zero-bloat, but it should
involve a lot less bloat than our standard heap.



For folk doing ETL-type data warehousing this should be great, as the 
typical workload tends to be: COPY (or similar) from a foreign data 
source, then do several sets of UPDATEs to fix/check/scrub the 
data... which tends to result in huge bloat with the current heap design 
(despite telling people 'you can do it another way' to avoid bloat - 
I guess it seems more intuitive to just do it as described).
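
For concreteness, a hedged sketch (hypothetical table and file names) of
the two workloads mentioned above - bulk load followed by a big delete,
and ETL-style scrub passes - both of which bloat the current heap:

-- Bulk load, then delete every other row: the table ends up roughly
-- twice as large as it needs to be.
CREATE TABLE staging (id bigint, payload text);
COPY staging FROM '/tmp/extract.csv' WITH (FORMAT csv);
DELETE FROM staging WHERE id % 2 = 0;

-- ETL scrub passes: each full-table UPDATE writes new row versions
-- under the current heap, so bloat accumulates pass after pass.
UPDATE staging SET payload = trim(payload);
UPDATE staging SET payload = upper(payload) WHERE payload ~ '^[a-z]';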


regards
Mark




Re: zheap: a new storage format for PostgreSQL

2018-03-01 Thread Mark Kirkwood

On 02/03/18 16:53, Alvaro Herrera wrote:


I think it was impolite to post this on the very same day the commitfest
started.  We have enough patches as it is ...



To be fair - he did say things like "wanting feedback..." and "shows an 
example of using pluggable storage..", and that it is for PG 12. If he had 
held onto the patches and waited, he'd get criticism of the form "you 
should have given a heads-up earlier...".


This is earlier :-)

Best wishes

Mark

P.s: awesome work.



Re: Changing default value of wal_sync_method to open_datasync on Linux

2018-02-19 Thread Mark Kirkwood

On 20/02/18 13:27, Tsunakawa, Takayuki wrote:


Hello,

I propose changing the default value of wal_sync_method from fdatasync to 
open_datasync on Linux.  The patch is attached.  I'm feeling this may be 
controversial, so I'd like to hear your opinions.

The reason for change is better performance.  Robert Haas said open_datasync 
was much faster than fdatasync with NVRAM in this thread:

https://www.postgresql.org/message-id/flat/c20d38e97bcb33dad59e...@lab.ntt.co.jp#c20d38e97bcb33dad59e...@lab.ntt.co.jp

pg_test_fsync shows higher figures for open_datasync:

[SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
--
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
 open_datasync 50829.597 ops/sec  20 usecs/op
 fdatasync 42094.381 ops/sec  24 usecs/op
 fsync 42209.972 ops/sec  24 usecs/op
 fsync_writethrough  n/a
 open_sync 48669.605 ops/sec  21 usecs/op
--


[HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
(the figures seem oddly high, though; this may be due to some VM configuration)
--
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
 open_datasync 34648.778 ops/sec  29 usecs/op
 fdatasync 31570.947 ops/sec  32 usecs/op
 fsync 27783.283 ops/sec  36 usecs/op
 fsync_writethrough  n/a
 open_sync 35238.866 ops/sec  28 usecs/op
--


pgbench only shows marginally better results, although the difference is within 
an error range.  The following is the tps of the default read/write workload of 
pgbench.  I ran the test with all the tables and indexes preloaded with 
pg_prewarm (except pgbench_history), and the checkpoint not happening.  I ran a 
write workload before running the benchmark so that no new WAL file would be 
created during the benchmark run.

[SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
--
                   1      2      3    avg
fdatasync  17610  17164  16678  17150
open_datasync  17847  17457  17958  17754 (+3%)

[HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
(the figures seem oddly high, though; this may be due to some VM configuration)
--
                  1     2     3   avg
fdatasync  4911  5225  5198  5111
open_datasync  4996  5284  5317  5199 (+1%)


As the removed comment describes, when wal_sync_method is open_datasync (or 
open_sync), open() fails with errno=EINVAL if the ext4 volume is mounted with 
data=journal.  That's because open() specifies O_DIRECT in that case.  I don't 
think that's a problem in practice, because data=journal will not be used for 
performance, and wal_level needs to be changed from its default replica to 
minimal and max_wal_senders must be set to 0 for O_DIRECT to be used.




I think the use of 'nobarrier' is probably disabling most/all reliable 
writing to the devices. What do the numbers look like if you remove this 
option?
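
As an aside, for anyone wanting to repeat the comparison on a test
instance, a minimal sketch of switching the setting under discussion -
wal_sync_method can also simply be edited in postgresql.conf, and takes
effect on a configuration reload rather than a restart:

ALTER SYSTEM SET wal_sync_method = 'open_datasync';
SELECT pg_reload_conf();
SHOW wal_sync_method;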


regards

Mark



Re: Partition pruning for Star Schema

2017-12-08 Thread Mark Kirkwood

On 04/12/17 17:20, Mark Kirkwood wrote:


On 04/12/17 16:08, Ashutosh Bapat wrote:


On Sun, Dec 3, 2017 at 5:56 AM, legrand legrand
<legrand_legr...@hotmail.com> wrote:

Hello,

I have a typical star schema, having dimension tables "product",
"calendar" and "country", and a fact table "sales".
This fact table is partitioned by time (range by month) and country
(list).

Will a query like:

select product.name, calendar.month, sum(sales.net_price)
from sales
  inner join product on (product.id = sales.cust_id)
  inner join country on (country.id = sales.country_id)
  inner join calendar on (calendar.id = sales.calendar_id)
where
  country.name = 'HERE'
  and calendar.year = '2017'
group by product.name,calendar.month

be able to identify the needed partitions?


AFAIU partition pruning, it works only with the partition key columns.
So, if country.name and calendar.year are the partition keys, partition
pruning would identify the needed partitions from those tables. But the
planner doesn't know that calendar.year is somehow related to
calendar.id, and so can't transfer that knowledge so that partitions of
sales can be identified.



If you can get your code to perform a star transformation on this type 
of query, then you might see some partition pruning.




Actually it won't - sorry. To get that to work, you would need to 
evaluate the additional subqueries to produce fixed values! The patch 
for 'runtime partition pruning' might be what you want tho.
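
To illustrate the "fixed values" idea (hypothetical constants, and
assuming the partition key columns of sales are country_id and a
sale_date column): evaluate the dimension lookups first, then restate
them as constants on the fact table's partition keys so plan-time
pruning can apply:

-- Step 1: evaluate the dimension lookups to constants.
SELECT id FROM country WHERE name = 'HERE';   -- suppose this returns 42

-- Step 2: restate the query with those constants on the partition keys;
-- the join to country is no longer needed, calendar stays for calendar.month.
select product.name, calendar.month, sum(sales.net_price)
from sales
  inner join product on (product.id = sales.cust_id)
  inner join calendar on (calendar.id = sales.calendar_id)
where
  sales.country_id = 42
  and sales.sale_date >= date '2017-01-01'
  and sales.sale_date < date '2018-01-01'
group by product.name, calendar.month;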


Cheers

Mark



Re: Partition pruning for Star Schema

2017-12-03 Thread Mark Kirkwood

On 04/12/17 16:08, Ashutosh Bapat wrote:


On Sun, Dec 3, 2017 at 5:56 AM, legrand legrand
 wrote:

Hello,

I have a typical star schema, having dimension tables "product", "calendar"
and "country", and a fact table "sales".
This fact table is partitioned by time (range by month) and country (list).

Will a query like:

select product.name, calendar.month, sum(sales.net_price)
from sales
  inner join product on (product.id = sales.cust_id)
  inner join country on (country.id = sales.country_id)
  inner join calendar on (calendar.id = sales.calendar_id)
where
  country.name = 'HERE'
  and calendar.year = '2017'
group by product.name,calendar.month

be able to identify the needed partitions?


AFAIU partition pruning, it works only with the partition key columns.
So, if country.name and calendar.year are the partition keys, partition
pruning would identify the needed partitions from those tables. But the
planner doesn't know that calendar.year is somehow related to
calendar.id, and so can't transfer that knowledge so that partitions of
sales can be identified.



If you can get your code to perform a star transformation on this type 
of query, then you might see some partition pruning.


Cheers

Mark