Re: Zedstore - compressed in-core columnar storage
> When a zedstore table is queried using *invalid* ctid, the server
> crashes due to assertion failure. See below,
>
> postgres=# select * from t1 where ctid = '(0, 0)';
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> I believe above should have either returned 0 rows or failed with some
> user friendly error.

We pushed a fix for this today. It now returns zero rows, just like the equivalent query does with heap. Thanks for reporting!
Re: Zedstore - compressed in-core columnar storage
Device     r/s     w/s   rkB/s  wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdb     120.67    0.00  965.33   0.00    0.00    0.00   0.00   0.00    16.51     0.00    1.99      8.00      0.00   8.21  99.07

Device     r/s     w/s   rkB/s  wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdb     121.00    0.00  968.00   0.00    0.00    0.00   0.00   0.00    16.76     0.00    2.02      8.00      0.00   8.19  99.07
***/ZEDSTORE***

The total query time:

***HEAP***
Execution Time: 758807.571 ms
***/HEAP***

***ZEDSTORE***
Execution Time: 2111576.259 ms
***/ZEDSTORE***

Every attribute in zedstore is stored in a btree keyed by TID. Unlike heap, the TID is a logical address, not a physical one. The pages of one attribute are interspersed with the pages of all other attributes, so during a sequential scan the pages are not read in sequential order and the access pattern can resemble random I/O. On our system, zedstore was around 3x slower than heap for this query. If your storage does not handle semi-random reads well, zedstore can be very slow. This setup was a worst-case scenario because random reads were about 150x slower than sequential reads; on hardware with better random I/O, zedstore would really shine.

On a side note, a second run of this query with zedstore finished in around 57 seconds, because the ~4GB of column data was already in the relcache. The data size is smaller because we only keep the relevant columns in memory, and the datums are compressed and encoded. Conversely, subsequently running the same query with heap still takes around 750 seconds, because our system cannot hold 119GB of relation data in the relcache/system caches.

Our main takeaway is that anything we can do to group together data that is accessed together will help zedstore get larger, more frequent sequential reads.

On Mon, Oct 28, 2019 at 3:22 PM Taylor Vesely wrote:
> > When a zedstore table is queried using *invalid* ctid, the server
> > crashes due to assertion failure.
> > See below,
> >
> > postgres=# select * from t2 where ctid = '(0, 0)';
> > server closed the connection unexpectedly
> >     This probably means the server terminated abnormally
> >     before or while processing the request.
> > The connection to the server was lost. Attempting reset: Failed.
>
> Thank you for pointing that out! I will look into fixing that some
> time this week. If we run without assertions the query still fails
> with this error because zedstoream_tuple_tid_valid incorrectly reports
> the TID as valid:
>
> ERROR: arrived at incorrect block 2 while descending zedstore btree
>
> > I believe above should have either returned 1 rows or failed with some
> > user friendly error.
>
> Agreed. I think it should match the behavior of heap as closely as
> possible.

tpcds.02.explain_verbose.zedstore.out (Description: Binary data)
tpcds.02.explain_verbose.heap.out (Description: Binary data)
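To make the layout difference described above concrete, here is a toy Python model of the two page layouts. This is not zedstore's actual page allocator; the attribute names, page counts, and helper function are all invented for illustration. It shows why scanning one column touches non-contiguous blocks when each attribute's btree pages are interleaved, while a heap scan reads contiguous blocks.

```python
# Toy model (hypothetical, not zedstore internals): heap lays the
# relation's pages out contiguously, while zedstore allocates pages as
# each attribute's btree grows, interspersing the attributes' pages.

N_PAGES_PER_ATTR = 4
ATTRS = ["a", "b", "c"]

# Heap: every page of the relation holds whole rows, stored contiguously.
heap_layout = [("row", i) for i in range(N_PAGES_PER_ATTR * len(ATTRS))]

# Zedstore (simplified): pages of different attributes are interleaved.
zed_layout = []
for page in range(N_PAGES_PER_ATTR):
    for attr in ATTRS:
        zed_layout.append((attr, page))

def pages_read(layout, attr):
    """Physical block numbers touched when scanning one attribute."""
    return [blk for blk, (a, _) in enumerate(layout) if a == attr]

# A heap seqscan reads blocks 0..11 in order; a zedstore scan of
# column "a" alone has to skip through the file: blocks 0, 3, 6, 9.
print(pages_read(heap_layout, "row"))
print(pages_read(zed_layout, "a"))
```

The gaps between the blocks for a single column are what turn a "sequential" scan into semi-random I/O on storage that penalizes seeks.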
Re: Zedstore - compressed in-core columnar storage
> When a zedstore table is queried using *invalid* ctid, the server
> crashes due to assertion failure. See below,
>
> postgres=# select * from t2 where ctid = '(0, 0)';
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Thank you for pointing that out! I will look into fixing that some time this week. If we run without assertions the query still fails with this error, because zedstoream_tuple_tid_valid incorrectly reports the TID as valid:

ERROR: arrived at incorrect block 2 while descending zedstore btree

> I believe above should have either returned 1 rows or failed with some
> user friendly error.

Agreed. I think it should match the behavior of heap as closely as possible.
Re: Zedstore - compressed in-core columnar storage
> When doing update operation, for each tuple being modified,
> *tuplebuffers_insert()* says that there is no entry for the relation
> being modified in the hash table although it was already added when
> the first tuple in the table was updated. Why is it so?

Currently, when doing an update, we flush the tuple buffers every time a tuple is updated, so we only ever spool up one tuple at a time. This is a good place to add an optimization like the one implemented for insert, but I haven't gotten around to looking into that yet.

The memory leak actually happens because the attbuffers are not freed after flushing. Alexandra Wang and I have a working branch[1] where we tried to plug the leak by freeing the attbuffers, but it has exposed an issue with triggers that I need to understand before I push the fix into the main zedstore branch. I don't like our solution of freeing the buffers either, because they could easily be reused. I'm going to take a stab at making that better before merging in the fix.

[1] https://github.com/l-wang/postgres-1/tree/zedstore-fix-memory-issues
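As a rough illustration of the batching optimization suggested above, here is a hypothetical Python sketch. The `TupleBuffer` class, its capacity, and the per-relation dict are all invented for illustration and are not zedstore's actual tuplebuffers API; the point is spooling updated tuples per relation and reusing the buffer after each flush rather than freeing it.

```python
# Hypothetical sketch of batching updates per relation instead of
# flushing after every tuple (names and numbers are illustrative).

class TupleBuffer:
    def __init__(self, capacity=60):
        self.capacity = capacity
        self.tuples = []
        self.flushed = []            # stands in for writing btree pages

    def spool(self, tup):
        self.tuples.append(tup)
        if len(self.tuples) >= self.capacity:
            self.flush()

    def flush(self):
        if self.tuples:
            self.flushed.append(list(self.tuples))
            self.tuples.clear()      # reuse the buffer; don't free it

buffers = {}                         # one buffer per relation

def update_tuple(rel, tup):
    buf = buffers.setdefault(rel, TupleBuffer())
    buf.spool(tup)

for i in range(150):
    update_tuple("t1", i)
buffers["t1"].flush()                # flush the remainder at end of command

# 150 tuples with capacity 60 are written in three batches: 60, 60, 30.
print([len(batch) for batch in buffers["t1"].flushed])
```

Reusing the cleared buffer avoids both the per-tuple flush and the free/re-allocate churn that the leak fix would otherwise introduce.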
Re: Memory-Bounded Hash Aggregation
I started to review this patch yesterday with Melanie Plageman, so we rebased it over the current master. The main conflicts were due to a simplehash patch that has been committed separately[1]. I've attached the rebased patch.

While playing with the code, I noticed that if one of the table's most common values isn't placed into the initial hash table, it spills a whole lot of tuples to disk that might have been avoided if we had some way to 'seed' the hash table with MCVs from the statistics. It seems to me that you would need some way of dealing with values that are in the MCV list but ultimately don't show up in the scan. I imagine this kind of optimization would be most useful for aggregates over a full table scan.

Some questions:

Right now the patch always initializes 32 spill partitions. Have you given any thought yet to how to intelligently pick an optimal number of partitions?

> That can be done as an add-on to approach #1 by evicting the entire
> hash table (writing out the partial states), then resetting the memory
> context.

By add-on approach, do you mean that you have something in mind to combine the two strategies? Or do you mean that it could be implemented as a separate strategy?

> I think it's clear there's no perfect eviction strategy - for every
> algorithm we came up with we can construct a data set on which it
> performs terribly (I'm sure we could do that for the approach used by
> Greenplum, for example).
>
> So I think it makes sense to do what Jeff proposed, and then maybe try
> improving that in the future with a switch to different eviction
> strategy based on some heuristics.

I agree. It definitely feels like both spilling strategies have their own use cases. That said, I think it's worth mentioning that with parallel aggregates it might actually be more useful to spill the trans values instead, and have them combined in a Gather or Finalize stage.
[1] https://www.postgresql.org/message-id/flat/48abe675e1330f0c264ab2fe0d4ff23eb244f9ef.camel%40j-davis.com

v1-0001-Rebased-memory-bounded-hash-aggregation.patch (Description: Binary data)
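The MCV-seeding idea floated above can be sketched with a toy model. This is hypothetical and not part of the attached patch: a capped hash aggregate that pre-loads groups for the statistics' most common values, spills tuples that find no room, and at the end drops any seeded group the scan never actually touched.

```python
# Toy sketch (hypothetical): count(*) per group with a hard cap on
# resident groups, optionally seeded with MCVs from statistics.

def hashagg_count(rows, max_groups, mcvs=()):
    table = {v: 0 for v in mcvs}      # seeded groups start at zero
    spilled = []                      # stands in for spill partitions
    for v in rows:
        if v in table:
            table[v] += 1
        elif len(table) < max_groups:
            table[v] = 1
        else:
            spilled.append(v)         # no room: tuple goes to disk
    # discard seeded values that never appeared in the scan
    return {v: n for v, n in table.items() if n > 0}, spilled

rows = ["y"] * 5 + ["x"] * 90        # "x" is the MCV, but "y" arrives first

# Unseeded: "y" claims the only slot, so all 90 "x" tuples spill.
t1, s1 = hashagg_count(rows, max_groups=1)

# Seeded with the MCV: the 90 "x" tuples aggregate in memory and only
# the 5 "y" tuples spill.
t2, s2 = hashagg_count(rows, max_groups=1, mcvs=("x",))
```

The final filtering step is one simple answer to the "MCV that never shows up in the scan" problem, at the cost of having held a slot for it during the whole scan.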