Re: Zedstore - compressed in-core columnar storage

2019-11-04 Thread Taylor Vesely
> When a zedstore table is queried using *invalid* ctid, the server
> crashes due to assertion failure. See below,
>
> postgres=# select * from t1 where ctid = '(0, 0)';
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> I believe the above should have either returned 0 rows or failed with
> some user-friendly error.

We pushed a fix for this today. It now returns zero rows, like the
equivalent query with heap. Thanks for reporting!


Re: Zedstore - compressed in-core columnar storage

2019-10-30 Thread Taylor Vesely
          120.67   0.00  965.33   0.00    0.00    0.00   0.00   0.00   16.51    0.00   1.99     8.00     0.00   8.21  99.07

Device       r/s    w/s   rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb       121.00   0.00  968.00    0.00    0.00    0.00   0.00   0.00   16.76    0.00   2.02     8.00     0.00   8.19  99.07
***/ZEDSTORE***

The total query time:

***HEAP***
 Execution Time: 758807.571 ms
***/HEAP***

***ZEDSTORE***
 Execution Time: 2111576.259 ms
***/ZEDSTORE***

Every attribute in zedstore is stored in its own btree, keyed by TID.
Unlike in heap, the TID is a logical address, not a physical one, and
the pages of one attribute are interspersed with the pages of all the
other attributes. A sequential scan of a zedstore table therefore does
not read pages in physical order, so the access pattern can resemble
random I/O.
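
To make that concrete, here is a small standalone C toy (not zedstore
code; the layout, page size, and column count are all made up) that
models column pages allocated round-robin as rows arrive, and prints
the physical blocks a scan of a single column would touch:

/*
 * Toy model, NOT zedstore internals: pages for the table's columns are
 * allocated round-robin as rows arrive, so the pages of any one column
 * end up interspersed with the pages of the others.  A scan of one
 * column then touches a strided, non-contiguous set of physical
 * blocks, which is why the I/O can look semi-random.
 */
#include <stdio.h>

#define NCOLS         4     /* columns in the table        */
#define ROWS_PER_PAGE 100   /* datums that fit on one page */

/* hypothetical physical block holding row 'tid' of column 'col' */
static int
block_of(int col, long tid)
{
    long rowgroup = tid / ROWS_PER_PAGE;

    return (int) (rowgroup * NCOLS + col);
}

int
main(void)
{
    /* physical blocks visited when scanning only column 0 in TID order */
    for (long tid = 0; tid < 5 * ROWS_PER_PAGE; tid += ROWS_PER_PAGE)
        printf("tid %ld -> block %d\n", tid, block_of(0, tid));
    /* prints blocks 0, 4, 8, 12, 16: logically sequential, physically strided */
    return 0;
}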

On our system, this query was around 3x slower with zedstore than with
heap (2,111,576 ms vs. 758,808 ms). If your storage does not handle
semi-random reads well, zedstore can be very slow. This setup was close
to a worst case, because random reads were about 150x slower than
sequential reads on this device. On hardware with better random I/O,
zedstore would really shine.

On a side note, a second run of this query with zedstore finished in
around 57 seconds, because the ~4GB of column data was already cached
in memory. That working set is smaller because only the relevant
columns are read, and the datums are compressed and encoded. In
contrast, running the same query again with heap still takes around
750 seconds, because our system cannot hold 119GB of relation data in
shared buffers or the OS page cache.

Our main takeaway is that anything we can do to group together data
that is accessed together will help zedstore issue larger, more
sequential reads.


tpcds.02.explain_verbose.zedstore.out
Description: Binary data


tpcds.02.explain_verbose.heap.out
Description: Binary data


Re: Zedstore - compressed in-core columnar storage

2019-10-28 Thread Taylor Vesely
> When a zedstore table is queried using *invalid* ctid, the server
> crashes due to assertion failure. See below,
>
> postgres=# select * from t2 where ctid = '(0, 0)';
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Thank you for pointing that out! I will look into fixing that some
time this week. If we run without assertions the query still fails
with this error because zedstoream_tuple_tid_valid incorrectly reports
the TID as valid:

ERROR:  arrived at incorrect block 2 while descending zedstore btree
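
For comparison, heap rejects such a TID before touching any page. A
minimal sketch of that kind of pre-check is below; it is purely
illustrative (the function name is invented and, since zedstore TIDs
are logical, the real check would compare against the table's TID
range rather than the physical block count):

/*
 * Illustrative only -- not the actual zedstoream_tuple_tid_valid fix.
 * Reject a ctid like '(0,0)' up front so the caller can report "no
 * such tuple" instead of tripping an assertion while descending the
 * btree.
 */
#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

static bool
tid_is_plausible(Relation rel, ItemPointer tid)
{
    /* offset 0 is never a valid tuple slot */
    if (ItemPointerGetOffsetNumberNoCheck(tid) == InvalidOffsetNumber)
        return false;

    /* heap-style physical check: the block must exist in the relation */
    if (ItemPointerGetBlockNumberNoCheck(tid) >= RelationGetNumberOfBlocks(rel))
        return false;

    return true;
}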

> I believe the above should have either returned 1 row or failed with
> some user-friendly error.

Agreed. I think it should match the behavior of heap as closely as
possible.


Re: Zedstore - compressed in-core columnar storage

2019-09-19 Thread Taylor Vesely
> When doing update operation, for each tuple being modified,
> *tuplebuffers_insert()* says that there is no entry for the relation
> being modified in the hash table although it was already added when
> the first tuple in the table was updated. Why is it so?

Currently, when doing an update, zedstore flushes the tuple buffers
every time a tuple is updated, so we only ever spool up one tuple at a
time. This is a good place to add an optimization like the one
implemented for insert, but I haven't gotten around to looking into
that yet.
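
As a rough illustration of the kind of batching insert already gets
(a standalone toy with invented names, not the tuplebuffers code):
spool tuples per relation and flush only when the spool fills, instead
of flushing after every tuple.

/*
 * Standalone toy, not zedstore's tuplebuffers: accumulate "tuples" and
 * flush in batches, instead of flushing once per tuple.
 */
#include <stdio.h>

#define SPOOL_SIZE 60           /* arbitrary batch size for the toy */

typedef struct TupleBuffer
{
    int ntuples;                /* tuples currently spooled         */
    int nflushes;               /* how many times we hit the btrees */
} TupleBuffer;

static void
flush(TupleBuffer *buf)
{
    if (buf->ntuples == 0)
        return;
    buf->nflushes++;            /* one btree descent per flush */
    buf->ntuples = 0;
}

static void
spool_tuple(TupleBuffer *buf)
{
    buf->ntuples++;
    if (buf->ntuples >= SPOOL_SIZE)
        flush(buf);             /* flush only when the spool is full */
}

int
main(void)
{
    TupleBuffer buf = {0, 0};

    for (int i = 0; i < 1000; i++)
        spool_tuple(&buf);
    flush(&buf);                /* final partial batch */
    printf("1000 updates -> %d flushes (vs. 1000 when flushing per tuple)\n",
           buf.nflushes);
    return 0;
}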

The memory leak is actually happening because it isn't freeing the
attbuffers after flushing. Alexandra Wang and I have a working
branch[1] where we tried to plug the leak by freeing the attbuffers,
but it has exposed an issue with triggers that I need to understand
before I push the fix into the main zedstore branch.

I don't like our solution of freeing the buffers either, because they
could easily be reused. I'm going to take a stab at making that better
before merging in the fix.

[1] https://github.com/l-wang/postgres-1/tree/zedstore-fix-memory-issues


Re: Memory-Bounded Hash Aggregation

2019-08-28 Thread Taylor Vesely
I started to review this patch yesterday with Melanie Plageman, and we
rebased it over the current master. The main conflicts were due to a
simplehash patch that was committed separately[1]. I've attached the
rebased patch.

I was playing with the code and noticed that if one of the table's
most common values isn't placed into the initial hash table, it spills
a whole lot of tuples to disk that could have been avoided if we had
some way to 'seed' the hash table with MCVs from the statistics. You
would also need some way of dealing with values that are in the MCV
list but ultimately don't show up in the scan. I imagine this kind of
optimization would be most useful for aggregates over a full table
scan.
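
As a toy illustration (standalone C with made-up numbers, nothing from
the patch): if the hash table fills up before a very common value is
first seen, every one of its tuples spills, whereas pre-seeding the
table with that MCV keeps them all in memory.

/*
 * Toy illustration, not the patch: a fixed-capacity set of group keys.
 * If the most common value first appears after the table is full, all
 * of its tuples spill; seeding the table with MCVs from the statistics
 * would keep them in memory.
 */
#include <stdbool.h>
#include <stdio.h>

#define CAPACITY 4

static int groups[CAPACITY];
static int ngroups;

static bool
table_contains(int key)
{
    for (int i = 0; i < ngroups; i++)
        if (groups[i] == key)
            return true;
    return false;
}

/* returns false when the group's tuple has to spill */
static bool
table_add(int key)
{
    if (table_contains(key))
        return true;
    if (ngroups == CAPACITY)
        return false;           /* table full: this group spills */
    groups[ngroups++] = key;
    return true;
}

static long
run(bool seed_mcv)
{
    long spilled = 0;

    ngroups = 0;
    if (seed_mcv)
        table_add(99);          /* 99 is the most common value */

    /* a few rare keys arrive first, then many tuples of the MCV */
    for (int key = 0; key < 5; key++)
        if (!table_add(key))
            spilled++;
    for (int i = 0; i < 100000; i++)
        if (!table_add(99))
            spilled++;
    return spilled;
}

int
main(void)
{
    printf("no seeding:  %ld tuples spilled\n", run(false));
    printf("MCV seeding: %ld tuples spilled\n", run(true));
    return 0;
}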

Some questions:

Right now the patch always initializes 32 spill partitions. Have you
given any thought to how to intelligently pick an optimal number of
partitions?
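
For what it's worth, one simple heuristic (just an example sketch, not
something taken from the patch) would be to size the partition count
from the estimated spill volume and work_mem, rounded up to a power of
two:

/*
 * Example heuristic only, not from the patch: pick enough partitions
 * that each spilled partition should fit within work_mem on the next
 * pass, clamped to a sane power-of-two range.
 */
#include <stdio.h>

static int
choose_num_partitions(double est_spill_bytes, double work_mem_bytes)
{
    int npartitions = 4;        /* floor, to limit re-spill passes */

    while (npartitions * work_mem_bytes < est_spill_bytes &&
           npartitions < 256)
        npartitions *= 2;

    return npartitions;
}

int
main(void)
{
    /* e.g. ~1 GB of estimated spill with 4 MB work_mem -> 256 partitions */
    printf("%d\n", choose_num_partitions(1e9, 4e6));
    return 0;
}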

> That can be done as an add-on to approach #1 by evicting the entire
> hash table (writing out the partial states), then resetting the memory
> context.

By add-on approach, do you mean to say that you have something in mind
to combine the two strategies? Or do you mean that it could be implemented
as a separate strategy?

> I think it's clear there's no perfect eviction strategy - for every
> algorithm we came up with we can construct a data set on which it
> performs terribly (I'm sure we could do that for the approach used by
> Greenplum, for example).
>
> So I think it makes sense to do what Jeff proposed, and then maybe try
> improving that in the future with a switch to different eviction
> strategy based on some heuristics.

I agree. It definitely feels like both spilling strategies have their
own use case.

That said, I think it's worth mentioning that with parallel aggregates
it might actually be more useful to spill the trans values instead,
and have them combined in a Gather or Finalize stage.

[1]
https://www.postgresql.org/message-id/flat/48abe675e1330f0c264ab2fe0d4ff23eb244f9ef.camel%40j-davis.com


v1-0001-Rebased-memory-bounded-hash-aggregation.patch
Description: Binary data