Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf
On Sat, Oct 1, 2016 at 2:44 AM, Andres Freund wrote:
> Hi,
>
> On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
> > In the attached patch I've attached simplehash.h, which can be customized
> > by a bunch of macros, before being inlined. There's also a patch using
> > this for tidbitmap.c and nodeAgg/nodeSubplan/... via execGrouping.c.
>
> Attached is a significantly improved version of this series. The main
> changes are:
>
> - The hash table now uses robin-hood style hash collision handling. See the
>   commit message of 0002 (or simplehash.h) for an introduction. That
>   significantly reduces the impact of "clustering" due to linear addressing.
> - Significant comment and correctness fixes, both in simplehash.h itself,
>   and in 0003/0004.
> - A lot of other random performance improvements for the hashing code.
>
> Unfortunately I'm running out of battery right now, so I don't want to
> re-run the benchmarks posted upthread (loading takes a while). But the last
> time I ran them, all the results after the patches were better than before.
>
> This patch series currently consists of four patches:
>
> - 0001 - add unlikely()/likely() macros. Their use here is not strictly
>   mandatory, but they do provide a quite consistent improvement. There are
>   other threads where they proved to be beneficial, so I see little reason
>   not to introduce them.
> - 0002 - the customizable hashtable itself. It now contains documentation.
> - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
>   for the issue mentioned in [1], to improve performance in heavily lossy
>   scenarios (otherwise we could regress in some extreme cases).
> - 0004 - convert execGrouping.c, e.g. used by hash aggregates.
>
> While not quite perfect yet, I do think this is at a state where input is
> needed. I don't want to go a lot deeper into this rabbit hole before we
> have some agreement that this is a sensible course of action.
> I think there are several possible additional users of simplehash.h, e.g.
> catcache.c - which has an open-coded, not particularly fast hash table and
> frequently shows up in profiles - but I think the above two conversions are
> plenty to start with.
>
> Comments?
>
> Greetings,
>
> Andres Freund
>
> [1] http://archives.postgresql.org/message-id/20160923205843.zcs533sqdzlh4cpo%40alap3.anarazel.de
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

This is a great addition. A couple of comments:

* 80% occupancy is a bit conservative for Robin Hood hashing; it works well up to 90% if you use the early stops based on probe distance. So that TODO is worth pursuing.

* An optimized memory layout for a Robin Hood hashmap is along the lines of HHHKVKVKV, using one bit of each hash as the present/absent flag. That plays especially well with high occupancies (~90%). Thanks to Robin Hood ordering, the probes over the H array are very short and stay within a single cache line. Even with a 31-bit hash it's really rare to have to probe more than one entry in the KV array, essentially guaranteeing that the 99th percentile of lookups only touches a couple of cache lines (if you ignore crossing cache lines and other details).
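To illustrate the early-stop trick I'm referring to, here's a toy Robin Hood probing sketch (plain C with invented names - this is not the simplehash.h API; it assumes a power-of-two capacity, nonzero keys, and a table that never becomes completely full):

```c
#include <assert.h>
#include <stdint.h>

/* Toy Robin Hood hash table. Keys are nonzero uint32; 0 marks an empty
 * bucket. Capacity is a power of two and the table is never full. */
#define CAP 16

typedef struct { uint32_t keys[CAP]; } rh_table;

static uint32_t rh_hash(uint32_t k) { return k * 2654435761u; }
static uint32_t rh_home(uint32_t k) { return rh_hash(k) & (CAP - 1); }

/* distance of the key stored at bucket b from its home bucket */
static uint32_t rh_distance(uint32_t key, uint32_t b)
{
    return (b - rh_home(key)) & (CAP - 1);
}

static void rh_insert(rh_table *t, uint32_t key)
{
    uint32_t b = rh_home(key), dist = 0;

    while (t->keys[b] != 0)
    {
        uint32_t cur_dist = rh_distance(t->keys[b], b);

        if (cur_dist < dist)            /* "rich" entry: steal its bucket */
        {
            uint32_t tmp = t->keys[b];
            t->keys[b] = key;
            key = tmp;
            dist = cur_dist;
        }
        b = (b + 1) & (CAP - 1);
        dist++;
    }
    t->keys[b] = key;
}

/* Lookup can stop early: once we reach a bucket whose occupant is closer to
 * its own home than our current probe distance, the key cannot be present. */
static int rh_contains(const rh_table *t, uint32_t key)
{
    uint32_t b = rh_home(key), dist = 0;

    while (t->keys[b] != 0)
    {
        if (t->keys[b] == key)
            return 1;
        if (rh_distance(t->keys[b], b) < dist)
            return 0;                   /* early stop */
        b = (b + 1) & (CAP - 1);
        dist++;
    }
    return 0;
}
```

The invariant that entries stay ordered by probe distance is what makes the early return in rh_contains() safe, and it's what keeps negative lookups cheap even at ~90% occupancy.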
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Sep 14, 2016 5:18 PM, "Robert Haas" wrote:
>
> On Wed, Sep 14, 2016 at 8:16 AM, Pavan Deolasee wrote:
> > Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the
> > per-page bitmap size. That's about 36 bytes for an 8K page. IOW, if on
> > average there are 6 or more dead tuples per page, the bitmap will
> > outperform the current representation, assuming max allocation for the
> > bitmap. If we can use additional estimates to restrict the size to a
> > somewhat more conservative value and then keep an overflow area, then
> > probably the break-even happens even earlier than that. I hope this gives
> > us a good starting point, but let me know if you think it's still a wrong
> > approach to pursue.
>
> Well, it's certainly a bigger change. I think the big concern is that the
> amount of memory now becomes fixed based on the table size. So one problem
> is that you have to figure out what you're going to do if the bitmap
> doesn't fit in maintenance_work_mem. A related problem is that it might
> fit but use more memory than before, which could cause problems for some
> people. Now on the other hand it could also use less memory for some
> people, and that would be good.
>
> I am kind of doubtful about this whole line of investigation because we're
> basically trying pretty hard to fix something that I'm not sure is broken.
> I do agree that, all other things being equal, the TID lookups will
> probably be faster with a bitmap than with a binary search, but maybe not
> if the table is large and the number of dead TIDs is small, because cache
> efficiency is pretty important. But even if it's always faster, does TID
> lookup speed even really matter to overall VACUUM performance? Claudio's
> early results suggest that it might, but maybe that's just a question of
> some optimization that hasn't been done yet.
> I'm fairly sure that our number one priority should be to minimize the
> number of cases where we need to do multiple scans of the indexes to stay
> within maintenance_work_mem. If we're satisfied we've met that goal, then
> within that we should try to make VACUUM as fast as possible with as
> little memory usage as possible. I'm not 100% sure I know how to get
> there, or how much work it's worth expending. In theory we could even
> start with the list of TIDs and switch to the bitmap if the TID list
> becomes larger than the bitmap would have been, but I don't know if it's
> worth the effort.
>
> /me thinks a bit.
>
> Actually, I think that probably *is* worthwhile, specifically because it
> might let us avoid multiple index scans in cases where we currently
> require them. Right now, our default maintenance_work_mem value is 64MB,
> which is enough to hold a little over ten million tuples. It's also large
> enough to hold a bitmap for a 14GB table. So right now if you deleted,
> say, 100 tuples per page, you would end up with an index vacuum cycle for
> every ~100,000 pages = 800MB, whereas switching to the bitmap
> representation for such cases would require only one index vacuum cycle
> for every 14GB, more than an order of magnitude improvement!
>
> On the other hand, if we switch to the bitmap as the ONLY possible
> representation, we will lose badly when there are scattered updates -
> e.g. 1 deleted tuple every 10 pages. So it seems like we probably want to
> have both options. One tricky part is figuring out how we switch between
> them when memory gets tight; we have to avoid bursting above our memory
> limit while making the switch. And even if our memory limit is very high,
> we want to avoid using memory gratuitously; I think we should try to grow
> memory usage incrementally with either representation.
> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation. So we have an array of dead-tuple-representation objects, one
> for every 1GB of the relation. If there are no dead tuples in a given 1GB
> segment, then this pointer can just be NULL. Otherwise, it can point to
> either the bitmap representation (which will take ~4.5MB) or it can point
> to an array of TIDs (which will take 6 bytes/TID). That could handle an
> awfully wide variety of usage patterns efficiently; it's basically never
> worse than what we're doing today, and when the dead tuple density is
> high for any portion of the relation it's a lot better.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

I'd say it's an idea worth pursuing. It's the basic idea behind Roaring bitmaps, arguably the best overall compressed bitmap implementation.
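For reference, the per-segment break-even math Robert and Pavan describe can be sketched like this (toy C, invented names; the 6-bytes-per-TID and ~37-bytes-per-page figures follow the numbers quoted above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-segment dead-tuple store sketch: each segment of the relation keeps
 * its dead TIDs either as a sorted array of 6-byte TIDs or as a bitmap of
 * one bit per possible line pointer, whichever is smaller. */
#define MAX_OFFSETS 291          /* cf. MaxHeapTuplesPerPage on 8K pages */

typedef enum { DT_ARRAY, DT_BITMAP } dt_repr;

/* 37 bytes per page; a 1GB segment (131072 pages) needs ~4.6MB,
 * in line with the ~4.5MB figure quoted above */
static size_t
dt_bitmap_bytes(uint32_t npages)
{
    return (size_t) npages * ((MAX_OFFSETS + 7) / 8);
}

static size_t
dt_array_bytes(uint64_t ndead)
{
    return (size_t) ndead * 6;   /* 6 bytes per TID */
}

/* pick the smaller representation for one segment */
static dt_repr
dt_choose(uint32_t npages, uint64_t ndead)
{
    return dt_array_bytes(ndead) <= dt_bitmap_bytes(npages)
        ? DT_ARRAY : DT_BITMAP;
}
```

The break-even lands at roughly 6 dead tuples per page, which matches Pavan's estimate upthread; dense segments get the bitmap, sparse ones stay on the TID array.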
Re: [HACKERS] Status of 64 bit atomics
On May 27, 2016 5:01 PM, "John Gorman" wrote:
>
> Hi All
>
> Someone recently told me that the postgresql atomics library was
> incomplete for 64 bit operations such as pg_atomic_fetch_add_u64() and
> should not be used.
>
> Can someone definitively confirm whether it is okay to rely on the 64 bit
> atomics or whether it is better to protect 64 bit operations with a
> spinlock?
>
> Thanks!
> John

Golang has asm implementations of these even on 32-bit platforms (see https://github.com/golang/go/tree/master/src/sync/atomic). Couldn't we borrow them? Or better yet, transparently fall back to a spinlock for these operations.
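For illustration, a transparent spinlock-based fallback could look roughly like this (a sketch with invented names, using a compiler test-and-set lock rather than Postgres's s_lock; the real generic fallback would sit behind the pg_atomic_* API so callers never notice which implementation they got):

```c
#include <assert.h>
#include <stdint.h>

/* Toy spinlock-protected 64-bit "atomic" for platforms without native
 * 64-bit atomics. The lock is a GCC/Clang test-and-set spinlock. */
typedef struct
{
    volatile char lock;   /* 0 = free, 1 = held */
    uint64_t      value;
} my_atomic_u64;

static void
my_spin_lock(volatile char *lock)
{
    while (__sync_lock_test_and_set(lock, 1))
        ;                 /* spin until the lock is acquired */
}

static void
my_spin_unlock(volatile char *lock)
{
    __sync_lock_release(lock);
}

static void
my_atomic_init_u64(my_atomic_u64 *a, uint64_t v)
{
    a->lock = 0;
    a->value = v;
}

/* same contract as pg_atomic_fetch_add_u64: returns the old value */
static uint64_t
my_atomic_fetch_add_u64(my_atomic_u64 *a, int64_t add)
{
    uint64_t old;

    my_spin_lock(&a->lock);
    old = a->value;
    a->value += (uint64_t) add;
    my_spin_unlock(&a->lock);
    return old;
}
```

Callers only ever see the fetch-add signature, so whether the platform uses a native instruction or this lock-protected path stays an implementation detail.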
Re: [HACKERS] 9.5 release notes
On Sat, Aug 8, 2015 at 11:04 PM, Bruce Momjian br...@momjian.us wrote:
> On Sun, Aug 9, 2015 at 01:24:33AM +1200, David Rowley wrote:
> > On 7 August 2015 at 14:24, Bruce Momjian br...@momjian.us wrote:
> > > On Tue, Jun 30, 2015 at 09:00:44PM +0200, Andres Freund wrote:
> > > > * 2014-12-08 [519b075] Simon ..: Use GetSystemTimeAsFileTime directly in win32
> > > >   2014-12-08 [8001fe6] Simon ..: Windows: use GetSystemTimePreciseAsFileTime if ..
> > > >
> > > > Timer resolution isn't an unimportant thing for people using explain?
> > >
> > > This all seemed very internals-only, e.g.: "On most Windows systems
> > > this change will actually have no significant effect on timestamp
> > > resolution as the system timer tick is typically between 1ms and 15ms
> > > depending on what timer resolution currently running applications have
> > > requested. You can check this with clockres.exe from sysinternals.
> > > Despite the platform limitation this change still permits capture of
> > > finer timestamps where the system is capable of producing them and it
> > > gets rid of an unnecessary syscall." Was I wrong?
> >
> > This does have a user-visible change. Timestamps are now likely to have
> > 6 digits after the decimal point, if they're on a version of Windows
> > which supports GetSystemTimePreciseAsFileTime():
> >
> > Master:
> > postgres=# select now();
> >               now
> > -------------------------------
> >  2015-08-09 01:14:01.959645+12
> > (1 row)
> >
> > 9.4.4:
> > postgres=# select now();
> >            now
> > ----------------------------
> >  2015-08-09 01:15:09.783+12
> > (1 row)
>
> Yes, this was already in the release notes:
>
>     Allow higher-precision timestamp resolution on <systemitem
>     class="osname">Windows 8</> or <systemitem class="osname">Windows
>     Server 2012</> and later Windows systems (Craig Ringer)
>
> I am not sure why people were saying it was missing.
>
> --
> Bruce Momjian br...@momjian.us  http://momjian.us
> EnterpriseDB http://enterprisedb.com
> + Everyone has their own god. +

Are we landing pg_trgm 1.2 in pg 9.5?
If yes (and we should), the up-to-an-order-of-magnitude improvements are a worthy inclusion in the release notes.

-- Arthur Silva
Re: [HACKERS] 64-bit XIDs again
On Thu, Jul 30, 2015 at 5:31 PM, Gavin Flower gavinflo...@archidevsys.co.nz wrote:
> On 31/07/15 02:24, Heikki Linnakangas wrote:
> > On 07/30/2015 04:26 PM, Alexander Korotkov wrote:
> > > Also, I think it's possible to migrate to 64-bit XIDs without breaking
> > > pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples
> > > would be created with 64-bit XIDs. We can use free bits in t_infomask2
> > > to distinguish old and new formats. I think we should move to 64-bit
> > > XIDs in in-memory structs: snapshots, proc array etc. And expand clog
> > > to handle 64-bit XIDs.
> >
> > But keep the xmin/xmax fields on heap pages at 32 bits, and add an
> > epoch-like field to the page header so that logically the xmin/xmax
> > fields on the page are 64 bits wide, but physically stored in 32 bits.
> > That's possible as long as no two XIDs on the same page are more than
> > 2^31 XIDs apart. So you still need to freeze old tuples on the page
> > when that's about to happen, but it would make it possible to have more
> > than 2^32 XID transactions in the clog. You'd never be forced to do
> > anti-wraparound vacuums, you could just let the clog grow arbitrarily
> > large.
> >
> > There is a big downside to expanding xmin/xmax to 64 bits: it takes
> > space. More space means more memory needed for caching, more memory
> > bandwidth, more I/O, etc.
> >
> > - Heikki
>
> I think having a special case to save 32 bits per tuple would cause
> unnecessary complications, and the savings are minimal compared to the
> size of current modern storage devices and the typical memory used in
> serious database servers. I think it is too much pain for very little
> gain, especially when looking into the future growth in storage capacity
> and bandwidth.
>
> The early mainframes used a base-displacement technique to keep the size
> of addresses down in instructions: 16-bit addresses, comprising 4 bits
> for a base register and 12 bits for the displacement (hence the use of
> 4KB page sizes now!). Necessary at the time, when mainframes often had
> less than 128 KB!
> Now it would be ludicrous to do that for modern servers!
>
> Cheers,
> Gavin
> (Who is ancient enough to have programmed such mainframes!)

On the other hand, PG's tuple overhead is already the largest among the alternatives. Even if storage keeps getting faster and cheaper, you can't ignore the overhead of adding yet another 8 bytes to each tuple.
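For what it's worth, Heikki's epoch-in-the-page-header idea can be sketched like this (toy C with invented names; a real patch would also need to rebase/freeze the page before the 2^31 window is exceeded):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch: the page header keeps a 64-bit XID base; tuples store 32-bit
 * offsets from it. Any full XID on the page can be recovered as
 * base + offset, as long as no two XIDs on the page are more than 2^31
 * apart (so the page can be rebased before an offset would overflow). */
typedef struct { uint64_t xid_base; } page_hdr;

static bool
xid_fits_on_page(const page_hdr *p, uint64_t full_xid)
{
    return full_xid >= p->xid_base &&
           full_xid - p->xid_base < (UINT64_C(1) << 31);
}

/* 64-bit XID -> 32-bit on-page representation */
static uint32_t
xid_shrink(const page_hdr *p, uint64_t full_xid)
{
    return (uint32_t) (full_xid - p->xid_base);
}

/* 32-bit on-page representation -> 64-bit XID */
static uint64_t
xid_expand(const page_hdr *p, uint32_t stored)
{
    return p->xid_base + stored;
}
```

This is what lets the clog address more than 2^32 transactions while heap pages keep paying only 32 bits per xmin/xmax.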
Re: [HACKERS] GSoC 2015: SP-GIST for geometrical objects
On Mar 27, 2015 11:08 AM, Dima Ivanovskiy dima...@mail.ru wrote:
> Hello, I am Dmitrii, student of Moscow Institute of Physics and Technology.
>
> Abstract: I chose the project "Indexing prolonged geometrical objects
> (i.e. boxes, circles, polygons, not points) with SP-GiST by mapping to
> 4d-space". According to the presentation
> https://www.pgcon.org/2011/schedule/attachments/197_pgcon-2011.pdf
> SP-GiST is 3 times faster than GiST in some cases. But GiST supports the
> geometrical data types box, circle, polygon with operators: | | @ @ @ | | ~ ~=
> The popular spatial extension PostGIS doesn't include SP-GiST, but has a
> lot of geometrical features.
>
> Project details: After meeting with Alexander Korotkov, I wrote a plan.
> Using K-D-trees and quadtrees to build indexes for geometrical data types
> can increase search speed in some cases. The main idea is representing
> 2-D geometrical objects by their bounding box. A set of 2-D boxes is a
> 4-D space. New _ops will work with points from 4-D space, for example
> kd_box_ops, quad_circle_ops, and will support all geometrical operators.
> After converting objects to their bounding boxes, the algorithm has a set
> of tuples (x1, y1, x2, y2). Our goal is to separate this space as equally
> as possible. For the K-D-tree, in the first step the algorithm will split
> the space in 2 parts by the first coordinate, in the next step by the
> second coordinate etc.; after the 4th coordinate we repeat the procedure.
> At the end we have an index on geometrical objects and use tree traversal
> for every search operator. PostgreSQL already has an implementation of
> the MBR ideas in gist/gistproc.c, so I will transfer it to the other type
> of tree. Of course, I assume that SP-GiST may not be the best solution
> for this problem, so after testing these plain methods I will try to find
> a more effective way, maybe using a combination of different spatial tree
> structures.
>
> Project Schedule:
> until May 25 - Read documentation and source code, clarify details of
> implementation.
> 1st month - Implement new '_ops' with all geometrical operators for box,
> circle, polygon.
> 2nd month - Research new methods to increase the speed of geometrical
> queries.
> 3rd month - Final refactoring, testing and submitting a patch.
>
> Links:
> http://www.sai.msu.su/~megera/postgres/talks/gist_tutorial.html - about GiST
> https://toster.ru/q/27135#answer_110197 - people need SP-GiST for cubes
> http://www.slideshare.net/profyclub_ru/o-lt - presentation about indexes
> http://pgconf.ru/static/presentations/2015/korotkov_spatial.pdf - working with geo objects

Nice proposal. Dynamic k-d trees can perform badly, as the splitting medians can drift way off as updates come in. What are your thoughts about that? Also, what's up with the 4d space? I don't quite get it; these types are 2- or 3-dimensional.
Re: [HACKERS] GSoC 2015: SP-GIST for geometrical objects
On Mar 27, 2015 6:41 PM, Dima Ivanovskiy dima...@mail.ru wrote:
> [full quote of the proposal and reply above trimmed]
> I read the spgist README one more time. I didn't find a mechanism for
> maintaining good balance after updates. I think we can use the Bkd-tree,
> https://www.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf, but it
> may not be the best solution. I included research time in the 2nd month
> of the timeline.
>
> About 4d space: all these types are 2-dimensional. Just as in the R-tree,
> an object is approximated by its MBR. The MBR of a 2d object can be
> mapped to a 4d point. More generally, the MBR of an n-dimensional object
> can be mapped to a 2n-dimensional point.

The reason I said that is that you pointed to performance as one motivating factor (and the lack of balancing properties can degrade performance really fast on larger indexes). The Bkd variant seems interesting, but I don't think spgist provides enough abstraction to implement it.

A bounding box can still be inserted/queried with a 2d kd-tree, so I don't know why you call it 4d; I assume it's a matter of naming. Overall the proposal seems totally doable, although I have my doubts about its usefulness. But I'm no authority - you should have some feedback from the mentors soon.
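To make the mapping concrete, here is what "a 2-D box becomes a 4-D point" looks like, with the k-d-style axis cycling (toy C with invented names; a real opclass would of course express this through the SP-GiST API):

```c
#include <assert.h>
#include <stdbool.h>

/* A 2-D bounding box (x1,y1,x2,y2) viewed as one point in 4-D space. */
typedef struct { double coord[4]; } point4d;

static point4d
box_to_point4d(double x1, double y1, double x2, double y2)
{
    point4d p = {{ x1, y1, x2, y2 }};
    return p;
}

/* which coordinate a k-d node at the given depth splits on: cycle
 * through the four coordinates by tree level */
static int
kd_split_axis(int depth)
{
    return depth % 4;
}

/* does this 4-D point (i.e. box) fall left of the node's split plane? */
static bool
kd_go_left(const point4d *p, int depth, double split_value)
{
    return p->coord[kd_split_axis(depth)] < split_value;
}

/* 2-D box overlap expressed as four half-space conditions on the 4-D
 * points: a.x1 <= b.x2, a.x2 >= b.x1, a.y1 <= b.y2, a.y2 >= b.y1 */
static bool
box_overlaps(const point4d *a, const point4d *b)
{
    return a->coord[0] <= b->coord[2] && a->coord[2] >= b->coord[0] &&
           a->coord[1] <= b->coord[3] && a->coord[3] >= b->coord[1];
}
```

The point of the 4-D view is the last function: 2-D overlap (and the other box operators) turn into simple half-space conditions on the 4-D points, which is exactly what a k-d-style partitioning can prune against.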
Re: [HACKERS] pg_rewind in contrib
On Mar 26, 2015 4:20 AM, Vladimir Borodin r...@simply.name wrote:
> On 26 March 2015 at 7:32, Michael Paquier michael.paqu...@gmail.com wrote:
> > On Thu, Mar 26, 2015 at 12:23 PM, Venkata Balaji N nag1...@gmail.com wrote:
> > > Test 1 : [...] If the master is crashed or killed abruptly, it may not
> > > be possible to do a rewind. Is my understanding correct?
> >
> > Yep. This is mentioned in the documentation:
> > http://www.postgresql.org/docs/devel/static/app-pgrewind.html
> > "The target server must shut down cleanly before running pg_rewind."
>
> You can start the old master, wait for crash recovery to complete, stop
> it cleanly and then use pg_rewind. It works.

Shouldn't we have a flag so it does that automatically when necessary?

> > > Test 2 :
> > > - On a successfully running streaming replication setup with one
> > > master and one slave, I did a clean shutdown of the master
> > > - promoted the slave
> > > - performed some operations (data changes) on the newly promoted slave
> > > and did a clean shutdown
> > > - Executed pg_rewind on the old master to sync with the latest changes
> > > on the new master. I got the below message:
> > > "The servers diverged at WAL position 0/A298 on timeline 1. No rewind
> > > required."
> > > I am not getting this either.
> >
> > In this case the master WAL visibly did not diverge from the slave WAL
> > line. A rewind is done if the master touches new relation pages after
> > the standby has been promoted, and before the master is shut down.
> > --
> > Michael
>
> --
> May the force be with you...
> https://simply.name/ru
[HACKERS] Paper from IBM: Memory-Efficient Hash Joins
I've come across this paper today, I thought I'd share it here too. http://www.vldb.org/pvldb/vol8/p353-barber.pdf -- Arthur Silva
Re: [HACKERS] logical column ordering
On Fri, Feb 27, 2015 at 4:44 PM, Tomas Vondra tomas.von...@2ndquadrant.com wrote:
> On 27.2.2015 20:34, Alvaro Herrera wrote:
> > Tomas Vondra wrote:
> > > I think we could add calls to the randomization functions into some of
> > > the regression tests (say 'create_tables.sql'), but that makes the
> > > regression tests ... well, random, and I'm not convinced that's a good
> > > thing. Also, this makes regression tests harder to think about,
> > > because SELECT * does different things depending on the attlognum
> > > order.
> >
> > No, that approach doesn't seem very useful. Rather, randomize the
> > columns in the CREATE TABLE statement, and then fix up the attlognums
> > so that the SELECT * expansion is the same as it would be with the
> > not-randomized CREATE TABLE.
>
> Yes, that's a possible approach too - possibly a better one for
> regression tests as it fixes the 'SELECT *', but it effectively uses
> fixed 'attlognum' and 'attnum' values (it's difficult to randomize those,
> as they may be referenced in other catalogs).
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training Services

Sorry to intrude; I've been following this thread and I was wondering whether this would allow (in the currently planned form or in the future) a wider set of non-rewriting DDLs in Postgres. For example, dropping a column without rewriting the table.
Re: [HACKERS] logical column ordering
On Feb 27, 2015 5:02 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
> Arthur Silva wrote:
> > Sorry to intrude; I've been following this thread and I was wondering
> > whether this would allow (in the currently planned form or in the
> > future) a wider set of non-rewriting DDLs in Postgres. For example,
> > dropping a column without rewriting the table.
>
> Perhaps. But dropping a column already does not rewrite the table, it
> only marks the column as dropped in the system catalogs, so do you have a
> better example?
>
> One obvious example is that you have
>
>     CREATE TABLE t (
>         t1 int,
>         t3 int
>     );
>
> and later want to add t2 in the middle. The only way currently is to drop
> the table and start again (re-creating all dependent views, FKs, etc).
> With the patch you will be able to add the column at the right place. If
> no default value is supplied for the new column, no table rewrite is
> necessary at all.
>
> --
> Álvaro Herrera  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training Services

Cool! I didn't know I could drop columns without a rewrite. Yes, that's another example; people do these from GUI tools. That's a nice side effect. Cool (again)!
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 11:19 PM, Peter Geoghegan p...@heroku.com wrote:
> On Tue, Feb 10, 2015 at 5:14 PM, Arthur Silva arthur...@gmail.com wrote:
> > I don't think the password-storing best practices apply to db connection
> > authentication.
>
> Why not?
>
> --
> Peter Geoghegan

I assume that if an attacker can intercept the server's unencrypted traffic and/or has access to its hard drive, the database is compromised anyway. I may be missing something, though.
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 11:25 PM, Peter Geoghegan p...@heroku.com wrote:
> On Tue, Feb 10, 2015 at 5:22 PM, Arthur Silva arthur...@gmail.com wrote:
> > I assume that if an attacker can intercept the server's unencrypted
> > traffic and/or has access to its hard drive, the database is compromised
> > anyway.
>
> That sounds like an argument against hashing the passwords in general.
>
> --
> Peter Geoghegan

Indeed. In a perfect world, SCRAM would be my choice. FWIW, MongoDB 3.0 also uses SCRAM as the preferred method for password-based authentication.
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 10:32 PM, Peter Geoghegan p...@heroku.com wrote:
> On Tue, Feb 10, 2015 at 4:21 PM, Robert Haas robertmh...@gmail.com wrote:
> > Although the patch was described as relatively easy to write, it never
> > went anywhere, because it *replaced* MD5 authentication with bcrypt,
> > which would be a big problem for existing clients. It seems clear that
> > we should add something new and not immediately kill off what we've
> > already got, so that people can transition smoothly.
> >
> > An idea I just had today is to keep using basically the same system
> > that we are currently using for MD5, but with a stronger hash
> > algorithm, like SHA-1 or SHA-2 (which includes SHA-224, SHA-256,
> > SHA-384, and SHA-512). Those are slower, but my guess is that even
> > SHA-512 is not enough slower for anybody to care very much, and if they
> > do, well that's another reason to make use of the new stuff optional.
>
> I believe that a big advantage of bcrypt for authentication is the
> relatively high memory requirements. This frustrates GPU-based attacks.
>
> --
> Peter Geoghegan

There's also scrypt, which can be tuned for both memory and compute requirements. I don't think the password-storing best practices apply to db connection authentication, though, so SHA-256 (or any other not-terribly-broken hash) is probably fine for Pg.
Re: [HACKERS] VODKA?
On Jan 6, 2015 7:14 PM, Josh Berkus j...@agliodbs.com wrote:
> Oleg, Teodor:
>
> I take it VODKA is sliding to version 9.6?
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com

This is kinda off-topic, but I was wondering whether anyone has ever considered running a crowd-funding campaign for this sort of (potentially very popular) feature.
Re: [HACKERS] Commitfest problems
On Fri, Dec 19, 2014 at 6:28 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:
> On 19/12/14 20:48, Andres Freund wrote:
> > On 2014-12-18 10:02:25 -0800, Joshua D. Drake wrote:
> > > I think a lot of hackers forget exactly how tender their egos are.
> > > Now I say this knowing that a lot of them will go, "Oh give me a
> > > break", but as someone who employs hackers, deals with open source
> > > AND normal people :P every single day, I can tell you without a
> > > single inch of sarcasm that petting egos is one of the ways you get
> > > things done in the open source (and really any male dominated)
> > > community.
> >
> > To me that's a bit over the top stereotyping.
>
> +1
>
> Having been mentioned one or two times myself - it was an unexpected
> "wow - cool" rather than a grumpy/fragile "I must be noticed" thing. I
> think some folk have forgotten the underlying principle of the open
> source community - it is about freely giving - time or code etc. The
> "there must be something in it for me me me" meme is - well - the
> *other* world view.
>
> > > However, doing crappy work (and let's not be shy about it, there is
> > > NOTHING fun about reviewing someone else's code) needs to have
> > > incentive.
> >
> > FWIW, I don't agree with this at all. Reviewing code can be quite
> > interesting - with the one constraint that the problem the patch solves
> > needs to be somewhat interesting. The latter is what I think gets many
> > of the more experienced reviewers - lots of the patches just solve
> > stuff they don't care about.
>
> Yeah, and also it helps if the patch addresses an area that you at least
> know *something* about - otherwise it is really hard to review in any
> useful way.
>
> Cheers,
> Mark

I'm trying to follow this thread as much as I can, but I may have missed part of it. Merit/credits aside, what would really help Postgres right now is a more streamlined/modern (the only words I could come up with) development process.
Using the mailing list for everything is alright - it works. But there are lots of tools that could be used alongside it: gerrit/gitlab/github/bitbucket/jira and other tools that handle pull requests, code review and bug tracking (or any combination of them). That would reduce friction for new contributors and for further development. These tools are being used everywhere, and I find it hard to believe that PG couldn't benefit from any of them.

-- Arthur Silva
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Wed, Dec 10, 2014 at 12:10 PM, Rahila Syed rahilasye...@gmail.com wrote:
> > What I would suggest is instrument the backend with getrusage() at
> > startup and shutdown and have it print the difference in user time and
> > system time. Then, run tests for a fixed number of transactions and see
> > how the total CPU usage for the run differs.
>
> Following are the numbers obtained on tests with absolute CPU usage, a
> fixed number of transactions and longer duration, with the latest FPW
> compression patch.
>
> pgbench command: pgbench -r -t 25 -M prepared
>
> To ensure that data is not highly compressible, the empty filler columns
> were altered using
>
>     alter table pgbench_accounts alter column filler type text using
>     gen_random_uuid()::text
>
> checkpoint_segments = 1024
> checkpoint_timeout = 5min
> fsync = on
>
> The tests ran for around 30 mins. A manual checkpoint was run before each
> test.
>
> Compression  WAL generated  %compression  Latency-avg  CPU usage (seconds)                      TPS     Latency stddev
> on           1531.4 MB      ~35%          7.351 ms     user diff: 562.67s, system diff: 41.40s  135.96  13.759 ms
> off          2373.1 MB      -             6.781 ms     user diff: 354.20s, system diff: 39.67s  147.40  14.152 ms
>
> The compression obtained is quite high, close to 35%. CPU usage at the
> user level when compression is on is noticeably high compared to when
> compression is off. But the gain in terms of WAL reduction is also high.
>
> Server specifications:
> Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2
> RAM: 32GB
> Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8
>       1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
>
> Thank you,
> Rahila Syed
>
> On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas robertmh...@gmail.com wrote:
> > On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed rahilasyed...@gmail.com wrote:
> >
> > If that's really true, we could consider having no configuration any
> > time, and just compressing always. But I'm skeptical that it's actually
> > true.
I was referring to this for CPU utilization: http://www.postgresql.org/message-id/1410414381339-5818552.p...@n5.nabble.com The above tests were performed on a machine with the following configuration. Server specifications: Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos RAM: 32GB Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos (1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm) I think that measurement methodology is not very good for assessing the CPU overhead, because you are only measuring the percentage CPU utilization, not the absolute amount of CPU utilization. It's not clear whether the duration of the tests was the same for all the configurations you tried - in which case the number of transactions might have been different - or whether the number of operations was exactly the same - in which case the runtime might have been different. Either way, it could obscure an actual difference in absolute CPU usage per transaction. It's unlikely that both the runtime and the number of transactions were identical for all of your tests, because that would imply that the patch makes no difference to performance; if that were true, you wouldn't have bothered writing it. What I would suggest is instrument the backend with getrusage() at startup and shutdown and have it print the difference in user time and system time. Then, run tests for a fixed number of transactions and see how the total CPU usage for the run differs. Last cycle, Amit Kapila did a bunch of work trying to compress the WAL footprint for updates, and we found that compression was pretty darn expensive there in terms of CPU time. So I am suspicious of the finding that it is free here. It's not impossible that there's some effect which causes us to recoup more CPU time than we spend compressing in this case that did not apply in that case, but the projects are awfully similar, so I tend to doubt it.
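The methodology Robert suggests boils down to diffing getrusage() counters taken around the run. A minimal sketch of the idea (in Python for brevity; the real instrumentation would go into the backend's startup/shutdown paths in C, and the function name here is illustrative, not anything from the patch):

```python
import resource

def cpu_usage_of(workload):
    """Run `workload` and return the absolute (user, system) CPU seconds it
    consumed, by diffing getrusage() counters taken before and after --
    the same idea as instrumenting backend startup and shutdown."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    workload()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)
```

Dividing the returned totals by a fixed transaction count yields per-transaction CPU cost, which stays comparable even when wall-clock durations differ between runs.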
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company This can be improved in the future by using other algorithms.
Re: [HACKERS] GiST kNN search queue (Re: KNN-GiST with recheck)
On Wed, Dec 10, 2014 at 1:50 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/28/2014 04:12 PM, Alexander Korotkov wrote: 3. A binary heap would be a better data structure to buffer the rechecked values. A Red-Black tree allows random insertions and deletions, but in this case you need to insert arbitrary values but only remove the minimum item. That's exactly what a binary heap excels at. We have a nice binary heap implementation in the backend that you can use, see src/backend/lib/binaryheap.c. Hmm. For me a binary heap would be a better data structure for KNN-GiST overall :-) I decided to give this a shot, replacing the red-black tree in GiST with the binary heap we have in lib/binaryheap.c. It made the GiST code somewhat simpler, as the binaryheap interface is simpler than the red-black tree one. Unfortunately, performance was somewhat worse. That was quite surprising, as insertions and deletions are both O(log N) in both data structures, but the red-black tree implementation is more complicated. I implemented another data structure called a Pairing Heap. It's also a fairly simple data structure, but insertions are O(1) instead of O(log N). It also performs fairly well in practice. With that, I got a small but measurable improvement. To test, I created a table like this: create table gisttest (id integer, p point); insert into gisttest select id, point(random(), random()) from generate_series(1, 100) id; create index i_gisttest on gisttest using gist (p); And I ran this query with pgbench: select id from gisttest order by p <-> '(0,0)' limit 1000; With unpatched master, I got about 650 TPS, and with the patch 720 TPS. That's a nice little improvement, but perhaps more importantly, the pairing heap implementation consumes less memory. To measure that, I put a MemoryContextStats(so->queueCtx) call into gistendscan.
With the above query, but without the limit clause, on master I got: GiST scan context: 2109752 total in 10 blocks; 2088456 free (24998 chunks); 21296 used And with the patch: GiST scan context: 1061160 total in 9 blocks; 1040088 free (12502 chunks); 21072 used That's 2MB vs 1MB. While that's not much in absolute terms, it'd be nice to reduce that memory consumption, as there is no hard upper bound on how much might be needed. If the GiST tree is really disorganized for some reason, a query might need a lot more. So all in all, I quite like this patch, even though it doesn't do anything too phenomenal. It adds some code, in the form of the new pairing heap implementation, but it makes the GiST code a little bit simpler. And it gives a small performance gain, and reduces memory usage a bit. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers It may be better to replace lib/binaryheap altogether, as the pairing heap offers comparable or better performance.
[HACKERS] Faster relevance ranking didn't make it into PostgreSQL 9.4
This is slightly off topic but please bear with me. I came across this post: http://pauleveritt.wordpress.com/2014/10/29/faster-relevance-ranking-didnt-make-it-into-postgresql-9-4/ I was curious about it so I checked several commit fest pages and searched the mailing lists but I wasn't able to locate the relevant discussion. Can someone point me to it? -- Arthur Silva
Re: [HACKERS] [WIP Patch] Using 128-bit integers for sum, avg and statistics aggregates
On Sat, Oct 25, 2014 at 12:38 PM, Andreas Karlsson andr...@proxel.se wrote: Hi, There was recently talk about if we should start using 128-bit integers (where available) to speed up the aggregate functions over integers which use numeric for their internal state. So I hacked together a patch for this to see what the performance gain would be. Previous thread: http://www.postgresql.org/message-id/20141017182500.gf2...@alap3.anarazel.de What the patch does is switch from using numerics in the aggregate state to int128, and then convert the type from the 128-bit integer in the final function. The functions where we can make use of int128 states are: - sum(int8) - avg(int8) - var_*(int2) - var_*(int4) - stdev_*(int2) - stdev_*(int4) The initial benchmark results look very promising. When summing 10 million int8 I get a speedup of ~2.5x and similarly for var_samp() on 10 million int4 I see a speedup of ~3.7x. To me this indicates that it is worth the extra code. What do you say? Is this worth implementing? The current patch still requires work. I have not written the detection of int128 support yet, and the patch needs code cleanup (for example: I used an int16_ prefix on the added functions, suggestions for better names are welcome). I also need to decide on what estimate to use for the size of that state. The patch should work and pass make check on platforms where __int128_t is supported.
The simple benchmarks:

CREATE TABLE test_int8 AS SELECT x::int8 FROM generate_series(1, 1000) x;

Before:
# SELECT sum(x) FROM test_int8;
  sum
--------
 500500
(1 row)
Time: 2521.217 ms

After:
# SELECT sum(x) FROM test_int8;
  sum
--------
 500500
(1 row)
Time: 1022.811 ms

CREATE TABLE test_int4 AS SELECT x::int4 FROM generate_series(1, 1000) x;

Before:
# SELECT var_samp(x) FROM test_int4;
  var_samp
------------
 83416.6667
(1 row)
Time: 3808.546 ms

After:
# SELECT var_samp(x) FROM test_int4;
  var_samp
------------
 83416.6667
(1 row)
Time: 1033.243 ms

Andreas

These are some nice improvements. As far as I'm aware int128 types are supported on every major compiler when compiling for 64bit platforms. Right?
Re: [HACKERS] Fixed xloginsert_locks for 9.4
On Fri, Oct 3, 2014 at 1:40 PM, Bruce Momjian br...@momjian.us wrote: On Fri, Oct 3, 2014 at 04:11:30PM +0200, Andres Freund wrote: On 2014-10-03 10:07:39 -0400, Gregory Smith wrote: On 10/3/14, 8:26 AM, Andres Freund wrote:

#define NUM_XLOGINSERT_LOCKS 1   tps = 52.711939 (including connections establishing)
#define NUM_XLOGINSERT_LOCKS 8   tps = 286.496054 (including connections establishing)
#define NUM_XLOGINSERT_LOCKS 16  tps = 346.113313 (including connections establishing)
#define NUM_XLOGINSERT_LOCKS 24  tps = 363.242111 (including connections establishing)

Just to clarify: that 10% number I threw out was meant as a rough estimate for a system with the default configuration, which is all that I tested. It seemed like people would likely need to tune all the usual things like checkpoint_segments, shared_buffers, etc. as well before seeing much better. You did all that, and sure enough the gain went up; thanks for confirming my guess. I still don't think that means this needs a GUC for 9.4. Look at that jump from 1 to 8. The low-hanging fruit here hasn't just been knocked off. It's been blended into a frozen drink, poured into a glass, and had a little paper umbrella put on top. I think that's enough for 9.4. But, yes, let's see if we can add delivery to the side of the pool in the next version too. So a 25% performance improvement on a relatively small machine isn't worth a GUC? One that's likely to be larger on a bigger machine? I utterly fail to see why that's a service to our users. Well, I think the issue is that having a GUC that can't reasonably be tuned by 95% of our users is nearly useless. Few users are going to run benchmarks to see what the optimal value is. I remember Informix had a setting that had no description except try different values to see if it helps performance --- we don't want to do that. What if we emit a server message if the setting is too low? That's how we handle checkpoint_segments.
-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + Not all GUCs need to be straightforward to tune. If the gains are worthy I don't see any reason not to have it.
Re: [HACKERS] Fixed xloginsert_locks for 9.4
On Fri, Oct 3, 2014 at 3:10 PM, Bruce Momjian br...@momjian.us wrote: On Fri, Oct 3, 2014 at 02:07:45PM -0400, Bruce Momjian wrote: On Fri, Oct 3, 2014 at 03:00:56PM -0300, Arthur Silva wrote: I remember Informix had a setting that had no description except try different values to see if it helps performance --- we don't want to do that. What if we emit a server message if the setting is too low? That's how we handle checkpoint_segments. Not all GUCs need to be straightforward to tune. If the gains are worthy I don't see any reason not to have it. Every GUC adds complexity to the system because people have to understand it to know whether they should tune it. No GUC is zero-cost. Please see my blog post about the cost of adding GUCs: http://momjian.us/main/blogs/pgblog/2009.html#January_10_2009 That's true Bruce (nice post, it was a good read). But how can we ignore 25%+ improvements (from 8 to 24)? At the very least we should deliver some pretty good defaults.
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Sep 16, 2014 at 4:20 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Sep 16, 2014 at 1:11 PM, Josh Berkus j...@agliodbs.com wrote: Well, I can only judge from the use cases I personally have, none of which involve more than 100 keys at any level for most rows. So far I've seen some people argue hypothetical use cases involving hundreds of keys per level, but nobody who *actually* has such a use case. I already told you that I did, and that it was the one and only app I had written for JSONB. Ah, ok, I thought yours was a test case. Did you check how it performed on the two patches at all? My tests with 185 keys didn't show any difference, including for a last-key case. No, I didn't test it. But I think Heikki's test results pretty much tell us everything there is to see here. This isn't really that complicated; I've read a few papers on index compression over the years and they seem to often use techniques that have the same general flavor as what Heikki did here, adding complexity in the data format to gain other advantages. So I don't think we should be put off. I second this reasoning. Even though I ran a couple of very realistic test cases that support all-lengths, I do feel that the hybrid approach would be better as it covers all bases. To put things in perspective, Tom's latest patch isn't much simpler either. Since it would still be a breaking change we should consider changing the layout to key-key-key-value-value-value, as it seems to pay off. Basically, I think that if we make a decision to use Tom's patch rather than Heikki's patch, we're deciding that the initial decision, by the folks who wrote the original jsonb code, to make array access less than O(n) was misguided. While that could be true, I'd prefer to bet that those folks knew what they were doing.
The only reason we're even considering changing it is that the array of lengths doesn't compress well, and we've got an approach that fixes that problem while preserving the advantages of fast lookup. We should have a darn fine reason to say no to that approach, and "it didn't benefit my particular use case" is not it. In practice, I'm not very surprised that the impact doesn't seem too bad when you're running SQL queries from the client. There's so much other overhead, for de-TOASTing and client communication and even just planner and executor costs, that this gets lost in the noise. But think about a PL/pgsql procedure, say, where somebody might loop over all of the elements in an array. If those operations go from O(1) to O(n), then the loop goes from O(n) to O(n^2). I will bet you a beverage of your choice that somebody will find that behavior within a year of release and be dismayed by it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
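The tradeoff under discussion can be seen in miniature: storing per-element lengths compresses well (many small, repeated values) but makes element access O(n), while storing cumulative end offsets gives O(1) access at the cost of monotonically increasing, poorly compressible values. A toy sketch of the two layouts (illustrative only; this is not jsonb's actual on-disk format):

```python
def nth_by_lengths(lengths, payload, i):
    # lengths layout: locating element i requires summing all
    # preceding lengths -> O(i) per access
    start = sum(lengths[:i])
    return payload[start:start + lengths[i]]

def nth_by_offsets(offsets, payload, i):
    # end-offsets layout: element i spans [offsets[i-1], offsets[i]) -> O(1)
    start = offsets[i - 1] if i > 0 else 0
    return payload[start:offsets[i]]
```

Heikki's hybrid keeps mostly lengths but stores an offset every few entries, bounding the summing loop while restoring compressibility.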
Re: [HACKERS] jsonb format is pessimal for toast compression
I couldn't get my hands on the twitter data but I'm generating my own. The json template is http://paste2.org/wJ1dfcjw and data was generated with http://www.json-generator.com/. It has 35 top level keys, just in case someone is wondering. I generated 1 random objects and I'm inserting them repeatedly until I got 320k rows.

Test query: SELECT data->'name', data->'email' FROM t_json
Test storage: EXTERNAL
Test jsonb lengths quartiles: {1278,1587,1731,1871,2231}
Tom's lengths+cache aware: 455ms
HEAD: 440ms

This is a realistic-ish workload in my opinion and Tom's patch performs within 4% of HEAD. Due to the overall lengths I couldn't really test compressibility, so I re-ran the test. This time I inserted an array of 2 objects in each row, as in: [obj, obj]. The objects were taken in sequence from the 1 pool so contents match in both tests.

Test query: SELECT data #> '{0,name}', data #> '{0,email}', data #> '{1,name}', data #> '{1,email}' FROM t_json
Test storage: EXTENDED
HEAD: 17mb table + 878mb toast
HEAD size quartiles: {2015,2500,2591,2711,3483}
HEAD query runtime: 15s
Tom's: 220mb table + 580mb toast
Tom's size quartiles: {1665,1984,2061,2142.25,2384}
Tom's query runtime: 13s

This is an intriguing edge case where Tom's patch actually outperforms the base implementation for 3~4kb jsons.
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Em 14/09/2014 12:21, Andres Freund and...@2ndquadrant.com escreveu: On 2014-09-13 20:27:51 -0500, k...@rice.edu wrote: What we are looking for here is uniqueness, thus better error detection. Not avalanche effect, nor cryptographically secure, nor bit distribution. As far as I'm aware CRC32C is unbeaten collision-wise and time proven. I couldn't find tests with xxhash and crc32 on the same hardware so I spent some time putting together a benchmark (see attachment; to run it just start run.sh). I included a crc32 implementation using SSE4.2 instructions (which work on pretty much any Intel processor built after 2008 and AMD built after 2012), a portable Slice-By-8 software implementation, and xxhash, since it's the fastest software 32bit hash I know of. Here're the results running the test program on my i5-4200M:

crc sb8: 90444623  elapsed: 0.513688s  speed:  1.485220 GB/s
crc hw:  90444623  elapsed: 0.048327s  speed: 15.786877 GB/s
xxhash:  7f4a8d5   elapsed: 0.182100s  speed:  4.189663 GB/s

The hardware version is insanely fast and works on the majority of Postgres setups, and the fallback software implementation is only 2.8x slower than the fastest 32bit hash around. Hopefully it'll be useful in the discussion. Note that all these numbers aren't fully relevant to the use case here. For the WAL - which is what we're talking about and the only place where CRC32 is used with high throughput - the individual parts of a record are pretty darn small on average. So performance of checksumming small amounts of data is more relevant. Mind, that's not likely to go well for CRC32, especially not slice-by-8. The cache footprint of the large tables is likely going to be noticeable in non micro benchmarks. Indeed, the small input sizes are something I was missing. Something more cache friendly would be better, it's just a matter of finding a better candidate.
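For reference, the variant being discussed is just a different generator polynomial: CRC-32C (Castagnoli) uses 0x1EDC6F41 (0x82F63B78 in reflected form), which is what the SSE4.2 crc32 instruction implements. A bit-at-a-time reference version pins down the exact polynomial and bit order; it is far slower than slice-by-8 or the hardware instruction and is shown only for illustration:

```python
def crc32c(data: bytes) -> int:
    """Bitwise reference CRC-32C (Castagnoli), reflected polynomial
    0x82F63B78 -- the same CRC the SSE4.2 crc32 instruction computes.
    The standard check value for b"123456789" is 0xE3069283."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```

Production implementations replace the inner bit loop with table lookups (slice-by-4/8) or the dedicated instruction, but all must agree with this definition bit-for-bit.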
Although I find it highly unlikely that the 4kb extra table of sb8 brings its performance down to sb4 level, even considering the small inputs and cache misses. For what it's worth MySQL, Cassandra, Kafka, ext4 and XFS all use crc32c checksums in their WAL/journals. Also, while I understand that CRC has a very venerable history and is well studied for transmission type errors, I have been unable to find any research on its applicability to validating file/block writes to a disk drive. Which incidentally doesn't really match what the CRC is used for here. It's used for individual WAL records. Usually these are pretty small, far smaller than disk/postgres blocks on average. There's a couple of scenarios where they can get large, true, but most of them are small. The primary reason they're important is to correctly detect the end of the WAL: to ensure we're not interpreting half-written records, or records from before the WAL file was overwritten. While it is, to quote you, unbeaten collision wise, xxhash, both the 32-bit and 64-bit versions, are its equal. Aha? You take that from the smhasher results? Since there seems to be a lack of research on disk based error detection versus CRC polynomials, it seems likely that any of the proposed hash functions are on an equal footing in this regard. As Andres commented up-thread, xxhash comes along for free with lz4. This is pure handwaving. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Sat, Sep 13, 2014 at 10:27 PM, k...@rice.edu k...@rice.edu wrote: On Sat, Sep 13, 2014 at 09:50:55PM -0300, Arthur Silva wrote: On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-09-13 08:52:33 +0300, Ants Aasma wrote: On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva arthur...@gmail.com wrote: That's not entirely true. CRC-32C beats pretty much everything with the same length quality-wise and has both hardware implementations and highly optimized software versions. For better or for worse CRC is biased by detecting all single bit errors; the detection capability of larger errors is slightly diminished. The quality of the other algorithms I mentioned is also very good, while producing uniformly varying output. There's also much more literature about the various CRCs in comparison to some of these hash algorithms. Indeed. CRCs have well-understood properties for error detection. Have any of these new algorithms been analyzed even a hundredth as thoroughly? No. I'm unimpressed by evidence-free claims that something else is also very good. Now, CRCs are designed for detecting the sorts of short burst errors that are (or were, back in the day) common on phone lines. You could certainly make an argument that that's not the type of threat we face for PG data. However, I've not seen anyone actually make such an argument, let alone demonstrate that some other algorithm would be better. To start with, you'd need to explain precisely what other error pattern is more important to defend against, and why. regards, tom lane MySQL went this way as well, changing the CRC polynomial in 5.6. What we are looking for here is uniqueness, thus better error detection. Not avalanche effect, nor cryptographically secure, nor bit distribution. As far as I'm aware CRC32C is unbeaten collision-wise and time proven.
I couldn't find tests with xxhash and crc32 on the same hardware so I spent some time putting together a benchmark (see attachment; to run it just start run.sh). I included a crc32 implementation using SSE4.2 instructions (which work on pretty much any Intel processor built after 2008 and AMD built after 2012), a portable Slice-By-8 software implementation, and xxhash, since it's the fastest software 32bit hash I know of. Here're the results running the test program on my i5-4200M: crc sb8: 90444623 elapsed: 0.513688s speed: 1.485220 GB/s crc hw: 90444623 elapsed: 0.048327s speed: 15.786877 GB/s xxhash: 7f4a8d5 elapsed: 0.182100s speed: 4.189663 GB/s The hardware version is insanely fast and works on the majority of Postgres setups, and the fallback software implementation is 2.8x slower than the fastest 32bit hash around. Hopefully it'll be useful in the discussion. Thank you for running this sample benchmark. It definitely shows that the hardware version of the CRC is very fast, unfortunately it is really only available on x64 Intel/AMD processors, which leaves all the rest lacking. For current 64-bit hardware, it might be instructive to also try using the XXH64 version and just take one half of the hash. It should come in at around 8.5 GB/s, or very nearly the speed of the hardware accelerated CRC. Also, while I understand that CRC has a very venerable history and is well studied for transmission type errors, I have been unable to find any research on its applicability to validating file/block writes to a disk drive. While it is, to quote you, unbeaten collision wise, xxhash, both the 32-bit and 64-bit versions, are its equal. Since there seems to be a lack of research on disk based error detection versus CRC polynomials, it seems likely that any of the proposed hash functions are on an equal footing in this regard. As Andres commented up-thread, xxhash comes along for free with lz4.
Regards, Ken For the sake of completeness, the results for xxhash64 on my machine: xxhash64 speed: 7.365398 GB/s Which is indeed very fast.
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
That's not entirely true. CRC-32C beats pretty much everything with the same length quality-wise and has both hardware implementations and highly optimized software versions. Em 12/09/2014 17:18, Ants Aasma a...@cybertec.at escreveu: On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I don't mean that we should abandon this patch - compression makes the WAL smaller, which has all kinds of other benefits, even if it makes the raw TPS throughput of the system worse. But I'm just saying that these TPS comparisons should be taken with a grain of salt. We probably should consider switching to a faster CRC algorithm again, regardless of what we do with compression. CRC is a pretty awfully slow algorithm for checksums. We should consider switching it out for something more modern. CityHash, MurmurHash3 and xxhash look like pretty good candidates, being around an order of magnitude faster than CRC. I'm hoping to investigate substituting the WAL checksum algorithm in 9.5. Given the room for improvement in this area I think it would make sense to just short-circuit the CRC calculations for testing this patch, to see if the performance improvement is due to less data being checksummed. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)
Em 12/09/2014 17:23, Andres Freund and...@2ndquadrant.com escreveu: On 2014-09-12 23:03:00 +0300, Heikki Linnakangas wrote: On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote: At 2014-09-12 22:38:01 +0300, hlinnakan...@vmware.com wrote: We probably should consider switching to a faster CRC algorithm again, regardless of what we do with compression. As it happens, I'm already working on resurrecting a patch that Andres posted in 2010 to switch to zlib's faster CRC implementation. As it happens, I also wrote an implementation of Slice-by-4 the other day :-). Haven't gotten around to posting it, but here it is. What algorithm does zlib use for CRC calculation? Also slice-by-4, with a manually unrolled loop doing 32 bytes at once, using individual slice-by-4's. IIRC I tried removing that and it slowed things down overall. What it also did was move crc to a function. I'm not sure why I did it that way, but it really might be beneficial - if you look at profiles today there's sometimes icache/decoding stalls... Hm. Let me look: http://archives.postgresql.org/message-id/201005202227.49990.andres%40anarazel.de Ick, there's quite some debugging leftovers ;) I think it might be a good idea to also switch the polynomial at the same time. I really really think we should, when the hardware supports it, use the polynomial that's available in SSE4.2. It has similar properties and can be implemented in software just the same... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services This Google library is worth a look https://code.google.com/p/crcutil/ as it has some extremely optimized versions.
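For context on what slice-by-4/slice-by-8 buy: the baseline table-driven CRC processes one byte per step through a single 256-entry table, and slice-by-N uses N such tables to fold 4 or 8 bytes per iteration, which is where the cache-footprint concern comes from. A byte-at-a-time sketch using the SSE4.2 (Castagnoli) polynomial (illustrative only; not zlib's or the patch's actual code):

```python
# Build the 256-entry lookup table once. Slice-by-4/slice-by-8 generalize
# this scheme with 4 or 8 such tables to consume multiple bytes per step.
CRC32C_TABLE = []
for n in range(256):
    c = n
    for _ in range(8):
        c = (c >> 1) ^ (0x82F63B78 if c & 1 else 0)
    CRC32C_TABLE.append(c)

def crc32c_bytewise(data: bytes) -> int:
    """Table-driven CRC-32C: one table lookup and one shift per input byte."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ CRC32C_TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF
```

One 256-entry table of 32-bit values is 1 KB; slice-by-8 needs 8 KB of tables, which is why its advantage can shrink on small inputs once cache misses enter the picture.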
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
I agree that there's no reason to fix an algorithm to it, unless maybe it's pglz. There's some initial talk about implementing pluggable compression algorithms for TOAST and I guess the same must be taken into consideration for the WAL. -- Arthur Silva On Thu, Sep 11, 2014 at 2:46 AM, Rahila Syed rahilasyed...@gmail.com wrote: I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results. Average % of CPU usage at user level for each of the compression algorithms is as follows:

Compression   Multiple   Single
Off           81.1338    81.1267
LZ4           81.0998    81.1695
Snappy        80.9741    80.9703
pglz          81.2353    81.2753

http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png

The numbers show CPU utilization of Snappy is the least. The CPU utilization, from highest to lowest, is pglz > No compression > LZ4 > Snappy. The variance of the average CPU utilization numbers is very low. However, Snappy seems to be best when it comes to lesser utilization of CPU. As per the measurement results posted to date, LZ4 outperforms Snappy and pglz in terms of compression ratio and performance. However, CPU utilization numbers show Snappy utilizes the least amount of CPU. The difference is not much though. As there has been no consensus yet about which compression algorithm to adopt, is it better to make this decision independent of the FPW compression patch, as suggested earlier in this thread? FPW compression can be done using the built-in compression pglz, as it shows considerable performance over uncompressed WAL and a good compression ratio. Also, the patch to compress multiple blocks at once gives better compression compared to a single block. ISTM that the performance overhead introduced by multiple-blocks compression is slightly higher than single-block compression, which can be tested again after modifying the patch to use pglz.
Hence, this patch can be built using multiple blocks compression. Thoughts?
Re: [HACKERS] Memory Alignment in Postgres
On Wed, Sep 10, 2014 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Sep 9, 2014 at 10:08 AM, Arthur Silva arthur...@gmail.com wrote: I'm continuously studying the Postgres codebase. Hopefully I'll be able to make some contributions in the future. For now I'm intrigued about the extensive use of memory alignment. I'm sure there's some legacy, and some architectures that require it, behind this reasoning. That aside, since it wastes space (a lot of space in some cases) there must be a tipping point somewhere. I'm sure one can prove aligned access is faster in a micro-benchmark but I'm not sure it's the case in a DBMS like Postgres, especially in the page/rows area. Just for the sake of comparison, MySQL COMPACT storage (default and recommended since 5.5) doesn't align data at all. MySQL NDB uses a fixed 4-byte alignment. Not sure about Oracle and others. Is it worth the extra space in newer architectures (especially Intel)? Do you guys think this is something worth looking at? Yes. At least in my opinion, though, it's not a good project for a beginner. If you get your changes to take effect, you'll find that a lot of things will break in places that are not easy to find or fix. You're getting into really low-level areas of the system that get touched infrequently and require a lot of expertise in how things work today to adjust. I thought all memory alignment was (or at least the bulk of it) handled using some codebase-wide macros/settings, otherwise how could different parts of the code inter-op? Poking this area might suffice for some initial testing to check if it's worth any more attention. Unaligned memory access received a lot of attention in the Intel post-Nehalem era, so it may very well pay off on Intel servers. You might find this blog post and its comments/external links interesting: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ I'm a newbie in the codebase, so please let me know if I'm saying anything nonsensical.
The idea I've had before is to try to reduce the widest alignment we ever require from 8 bytes to 4 bytes. That is, look for types with typalign = 'd', and rewrite them to have typalign = 'i' by having them use two 4-byte loads to load an eight-byte value. In practice, I think this would probably save a high percentage of what can be saved, because 8-byte alignment implies a maximum of 7 bytes of wasted space, while 4-byte alignment implies a maximum of 3 bytes of wasted space. And it would probably be pretty cheap, too, because any type with less than 8-byte alignment wouldn't be affected at all, and even those types that were affected would only be slightly slowed down by doing two loads instead of one. In contrast, getting rid of alignment requirements completely would save a little more space, but probably at the cost of a lot more slowdown: any type with alignment requirements would have to fetch the value byte-by-byte instead of pulling the whole thing out at once. Does byte-by-byte access still hold true nowadays? I thought modern processors would fetch memory at the very least in word-sized chunks, so 4/8 bytes, then merge/slice. But there are a couple of obvious problems with this idea, too, such as: 1. It's really complicated and a ton of work. 2. It would break pg_upgrade pretty darn badly unless we employed some even-more-complex strategy to mitigate that. 3. The savings might not be enough to justify the effort. Very true. It might be interesting for someone to develop a tool measuring the number of bytes of alignment padding we lose per tuple or per page and gather some statistics on it on various databases. That would give us some sense as to the possible savings. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
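Robert's suggested measurement is straightforward to prototype: walk a tuple's attributes in order, round each offset up according to its alignment, and count the rounded-away bytes. A sketch (Python, for illustration; in the backend this corresponds to the TYPEALIGN-style rounding applied per attribute, with the real column descriptions coming from the catalogs):

```python
def type_align(offset, alignment):
    # round offset up to the next multiple of `alignment` (a power of two) --
    # the same computation as PostgreSQL's TYPEALIGN macro
    return (offset + alignment - 1) & ~(alignment - 1)

def padding_waste(cols):
    """cols: list of (size, alignment) pairs per attribute, in declared order.
    Returns (total_data_size, padding_bytes) for the fixed-width portion."""
    off = padding = 0
    for size, align in cols:
        aligned = type_align(off, align)
        padding += aligned - off
        off = aligned + size
    return off, padding
```

For example, an int4 followed by an int8 (alignments 4 and 8) wastes 4 padding bytes that declaring the int8 first would eliminate, which is exactly the kind of per-tuple statistic the proposed tool would aggregate.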
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 11:27 AM, Andres Freund and...@2ndquadrant.com wrote:

On 2014-09-11 10:32:24 -0300, Arthur Silva wrote:
Unaligned memory access received a lot of attention from Intel in the post-Nehalem era, so it may very well pay off on Intel servers. You might find this blog post and its comments/external links interesting: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/

FWIW, the reported results are imo pretty meaningless for postgres. It's sequential access over a larger amount of memory, i.e. a perfectly prefetchable workload where it doesn't matter if superfluous cachelines are fetched, because they're going to be needed next round anyway. In many production workloads one of the busiest accesses to individual datums is the binary search on individual pages during index lookups. That's pretty much exactly the contrary of the above. Not saying that it's not going to be a benefit in many scenarios, but it's far from as simple as saying that unaligned accesses on their own aren't penalized anymore.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

I modified the test code to use a completely random scan pattern, to test something that completely trashes the cache. Not realistic, but it still confirms the hypothesis that the overhead is minimal on modern Intel.
-- test results compiling for 32bit --

processing word of size 2
  offset 0: 422.7    offset 1: 422.85
processing word of size 4
  offset 0: 436.6    offset 1: 451      offset 2: 444.3    offset 3: 441.9
processing word of size 8
  offset 0: 630.15   offset 1: 653      offset 2: 655.5    offset 3: 660.85
  offset 4: 650.1    offset 5: 656.9    offset 6: 656.6    offset 7: 656.9

-- test results compiling for 64bit --

processing word of size 2
  offset 0: 402.55   offset 1: 406.9
processing word of size 4
  offset 0: 424.05   offset 1: 436.55   offset 2: 435.1    offset 3: 435.3
processing word of size 8
  offset 0: 444.9    offset 1: 470.25   offset 2: 468.95   offset 3: 476.75
  offset 4: 474.9    offset 5: 468.25   offset 6: 469.8    offset 7: 469.1

// g++ -O2 -o test test.cpp && ./test
// (the archive stripped the <, > and & characters from this listing;
// they are restored below, and the message is cut off where marked)
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <iostream>
#include <cassert>
#include <vector>
#include <inttypes.h>

using namespace std;

class WallClockTimer {
public:
    struct timeval t1, t2;
    WallClockTimer() : t1(), t2() {
        gettimeofday(&t1, 0);
        t2 = t1;
    }
    void reset() {
        gettimeofday(&t1, 0);
        t2 = t1;
    }
    int elapsed() {
        return (t2.tv_sec * 1000 + t2.tv_usec / 1000)
             - (t1.tv_sec * 1000 + t1.tv_usec / 1000);
    }
    int split() {
        gettimeofday(&t2, 0);
        return elapsed();
    }
};

// xor shift
uint32_t xor128(void) {
    static uint32_t x = 123456789;
    static uint32_t y = 362436069;
    static uint32_t z = 521288629;
    static uint32_t w = 88675123;
    uint32_t t;
    t = x ^ (x << 11);
    x = y; y = z; z = w;
    return w = w ^ (w >> 19) ^ (t ^ (t >> 8));
}

template <class T>
void runtest() {
    size_t N = 10 * 1000 * 1000;
    int repeat = 20;
    WallClockTimer timer;
    const bool paranoid = false;
    cout << "processing word of size " << sizeof(T) << endl;
    for (unsigned int offset = 0; offset < sizeof(T); ++offset) {
        vector<T> bigarray(N + 2);
        cout << "offset = " << offset << endl;
        T * const begin = reinterpret_cast<T *>(
            reinterpret_cast<uintptr_t>(&bigarray[0]) + offset);
        assert(offset + reinterpret_cast<uintptr_t>(&bigarray[0])
               == reinterpret_cast<uintptr_t>(begin));
        T * const end = begin + N;
        if (paranoid)
            assert(reinterpret_cast<uintptr_t>(end)
                   < reinterpret_cast<uintptr_t>(&bigarray.back()));
        int sumt = 0;
        for (int k = 0; k < repeat; ++k) {
            timer.reset();
            for (size_t i = 0; i < N; ++i) {
                int ri = xor128() % N;   // random, cache-trashing access
                begin[ri] = static_cast<T>(i);
            }
            volatile T val = 1
// (listing truncated here in the archive)
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Merlin Moncure mmonc...@gmail.com writes:
Be advised of the difficulties you are going to face here. Assuming for a second there is no reason not to go unaligned on Intel, and there are material benefits to justify the effort, that doesn't necessarily hold for other platforms like arm/power.

Note that on many (most?) non-Intel architectures, unaligned access is simply not an option. The chips themselves will throw SIGBUS or equivalent if you try it. Some kernels provide signal handlers that emulate the unaligned access in software rather than killing the process; but the performance consequences of hitting such traps more than very occasionally would be catastrophic. Even on Intel, I'd wonder what unaligned accesses do to atomicity guarantees and suchlike. This is not a big deal for row data storage, but we'd have to be careful about it if we were to back off alignment requirements for in-memory data structures such as latches and buffer headers. Another fun thing you'd need to deal with is ensuring that the C structs we overlay onto catalog data rows still match up with the data layout rules. On the whole, I'm pretty darn skeptical that such an effort would repay itself. There are lots of more promising things to hack on.

regards, tom lane

Indeed, I don't know of any other architecture where this would even be an option, so if this ever moves forward it would have to be turned on at compile time for x86-64 only. I wonder how MySQL handles its rows on those architectures, given that its storage format is completely packed. If we just reduced the alignment requirements when laying out columns in rows and indexes (reducing/removing the typalign padding), that'd be enough gain in my (humble) opinion. Once you assume alignment is not an issue you see potential savings everywhere, which is kinda insane... I'm unsure how this translates into patch complexity, but judging by the reactions so far I'm assuming a lot.
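For context on the SIGBUS point: the usual portable way to read a misaligned value is to go through memcpy, which imposes no alignment requirement and which compilers lower to a single load on x86. This sketch is illustrative only and is not from any patch in the thread:

    #include <cassert>
    #include <cstdint>
    #include <cstring>

    // Portable unaligned read: memcpy never requires alignment; on x86 it
    // compiles to one mov, and on strict-alignment targets it becomes
    // whatever byte/word sequence the chip permits (no SIGBUS).
    static inline uint64_t read_u64_unaligned(const void *p) {
        uint64_t v;
        std::memcpy(&v, p, sizeof v);
        return v;
    }

    int main() {
        unsigned char buf[16] = {0};
        const uint64_t x = 0x1122334455667788ULL;
        std::memcpy(buf + 3, &x, sizeof x);        // deliberately misaligned
        assert(read_u64_unaligned(buf + 3) == x);  // safe on every platform
        // By contrast, *reinterpret_cast<const uint64_t *>(buf + 3) is
        // undefined behaviour, and a SIGBUS on SPARC or older ARM chips.
        return 0;
    }

Note this addresses only plain loads; Tom's concern about atomicity still stands, since unaligned accesses that cross cacheline boundaries are not atomic even on Intel.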
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Sep 11, 2014 at 10:01 PM, Josh Berkus j...@agliodbs.com wrote:

So, I finally got time to test Tom's latest patch on this. TLDR: we want to go with Tom's latest patch and release beta3.

Figures: I tested HEAD against the latest lengths patch. Per Arthur Silva, I checked uncompressed times for JSONB against compressed times. This changed the picture considerably.

TABLE SIZES

HEAD
      ?column?       | pg_size_pretty
---------------------+----------------
 json text format    | 393 MB
 jsonb: compressed   | 1147 MB
 jsonb: uncompressed | 1221 MB

PATCHED
      ?column?       | pg_size_pretty
---------------------+----------------
 json text format    | 394 MB
 jsonb: compressed   | 525 MB
 jsonb: uncompressed | 1200 MB

EXTRACTION TIMES (ms)

HEAD
Q1 (search via GIN index followed by extracting 100,000 values from rows):
  jsonb compressed: 4000
  jsonb uncompressed: 3250
Q2 (seq scan and extract 200,000 values from rows):
  json: 11700
  jsonb compressed: 3150
  jsonb uncompressed: 2700

PATCHED
Q1:
  jsonb compressed: 6750
  jsonb uncompressed: 3350
Q2:
  json: 11796
  jsonb compressed: 4700
  jsonb uncompressed: 2650

Conclusion: with Tom's patch, JSONB is 55% smaller when compressed (EXTENDED). Extraction times are 50% to 70% slower, but this appears to be almost entirely due to decompression overhead. When not compressing (EXTERNAL), extraction times for patched versions are statistically the same as HEAD, and file sizes are similar to HEAD.

USER REACTION

I polled at both PDXpgDay and at FOSS4G, asking ~80 Postgres users how they would feel about a compression vs. extraction time tradeoff. The audience was evenly split. However, with the current patch, the user can choose: users who know enough for performance tuning can set JSONB columns to EXTERNAL and get the same performance as the unpatched version.

-- Josh Berkus PostgreSQL Experts Inc.
http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers

The compression ratio difference is exaggerated in this case, but it does support that Tom's patch alleviates the extraction penalty. In my tests with the GitHub archive data the savings vs. performance penalty trade-off was fine, but I'm not confident in those results since there were only 8 top-level keys; for comparison, some Twitter API objects[1] have 30+ top-level keys. If I have time in the next couple of days I'll run some tests with the public Twitter firehose data.

[1] https://dev.twitter.com/rest/reference/get/statuses/home_timeline
[HACKERS] Memory Alignment in Postgres
I'm continuously studying the Postgres codebase. Hopefully I'll be able to make some contributions in the future.

For now I'm intrigued by the extensive use of memory alignment. I'm sure there's some reasoning behind it (some legacy, and some architectures that require it). That aside, since it wastes space (a lot of space in some cases) there must be a tipping point somewhere. I'm sure one can prove aligned access is faster in a micro-benchmark, but I'm not sure that's the case in a DBMS like Postgres, especially in the page/rows area.

Just for the sake of comparison, MySQL's COMPACT storage (default and recommended since 5.5) doesn't align data at all, and MySQL NDB uses a fixed 4-byte alignment. Not sure about Oracle and others.

Is it worth the extra space on newer architectures (especially Intel)? Do you guys think this is something worth looking at? I've been trying to mess with the *ALIGN macros, but so far I wasn't able to get any conclusive results. My guess is that I'm missing something in the code, or pg_bench doesn't stress the difference enough.

-- Arthur Silva
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Tue, Sep 2, 2014 at 9:11 AM, Rahila Syed rahilasye...@gmail.com wrote:

Hello,

It'd be interesting to check avg cpu usage as well

I have collected average CPU utilization numbers by collecting sar output at an interval of 10 seconds for the following benchmark:

Server specifications:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos, 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Benchmark:
Scale: 16
Command: java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js -sleepTime 550,250,250,200,200
Warmup time: 1 sec
Measurement time: 900 sec
Number of tx types: 5
Number of agents: 16
Connection pool size: 16
Statement cache size: 40
Auto commit: false
Checkpoint segments: 1024
Checkpoint timeout: 5 mins

Average % of CPU utilization at user level for multiple-block compression:
Compression Off = 3.34133
Snappy = 3.41044
LZ4 = 3.59556
Pglz = 3.66422

The numbers show that average CPU utilization is in the following order: pglz > LZ4 > Snappy > no compression. Attached is a graph which plots % CPU utilization versus time elapsed for each of the compression algorithms. Also, the overall CPU utilization during the tests is very low, i.e. below 10%; the CPU remained idle for a large (~90%) percentage of the time. I will repeat the above tests with high load on the CPU, using the benchmark given by Fujii-san, and post the results.

Thank you,

On Wed, Aug 27, 2014 at 9:16 PM, Arthur Silva arthur...@gmail.com wrote:
On 26/08/2014 09:16, Fujii Masao masao.fu...@gmail.com wrote:
On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed rahilasye...@gmail.com wrote:
Hello, Thank you for comments. Could you tell me where the patch for single block in one run is? Please find attached patch for single block compression in one run.

Thanks! I ran the benchmark using pgbench and compared the results. I'd like to share the results.

[RESULT] Amount of WAL generated during the benchmark. Unit is MB.
           Multiple   Single
 off          202.0    201.5
 on          6051.0   6053.0
 pglz        3543.0   3567.0
 lz4         3344.0   3485.0
 snappy      3354.0   3449.5

Latency average during the benchmark. Unit is ms.

           Multiple   Single
 off           19.1     19.0
 on            55.3     57.3
 pglz          45.0     45.9
 lz4           44.2     44.7
 snappy        43.4     43.3

These results show that FPW compression is really helpful for decreasing the WAL volume and improving the performance. The compression ratio by lz4 or snappy is better than that by pglz. But it's difficult to conclude which of lz4 or snappy is best, according to these results. ISTM that the compression-of-multiple-pages-at-a-time approach can compress WAL more than compression-of-single-... does.

[HOW TO BENCHMARK]
Create a pgbench database with scale factor 1000. Change the data type of the column filler on each pgbench table from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's gen_random_uuid() in order to avoid empty columns, e.g.:

    alter table pgbench_accounts alter column filler type text using gen_random_uuid()::text

After creating the test database, run pgbench as follows. The number of transactions executed during the benchmark is almost the same between runs because the -R option is used:

    pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared

checkpoint_timeout is 5min, so it's expected that checkpoint was executed at least two times during the benchmark.

Regards,
-- Fujii Masao

It'd be interesting to check avg cpu usage as well. Is there any reason to default to LZ4-HC? Shouldn't we try the default as well? LZ4-default is known for its near-realtime speeds in exchange for a few % of compression, which sounds optimal for this use case. Also, we might want to compile these libraries with -O3 instead of the default -O2. They're finely tuned to work with all possible compiler optimizations w/ hints and other tricks; this is especially true for LZ4, not sure about snappy.
In my virtual machine, LZ4 compression built with -O3 runs at twice the speed (950MB/s) of -O2 (450MB/s) at a 61.79% compression ratio; LZ4-HC seems unaffected (58MB/s at 60.27%). Yes, that's right, almost 1GB/s! And the compression ratio is only 1.5% short of LZ4-HC's.
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On 26/08/2014 09:16, Fujii Masao masao.fu...@gmail.com wrote:
On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed rahilasye...@gmail.com wrote:
Hello, Thank you for comments. Could you tell me where the patch for single block in one run is? Please find attached patch for single block compression in one run.

Thanks! I ran the benchmark using pgbench and compared the results. I'd like to share the results.

[RESULT] Amount of WAL generated during the benchmark. Unit is MB.

           Multiple   Single
 off          202.0    201.5
 on          6051.0   6053.0
 pglz        3543.0   3567.0
 lz4         3344.0   3485.0
 snappy      3354.0   3449.5

Latency average during the benchmark. Unit is ms.

           Multiple   Single
 off           19.1     19.0
 on            55.3     57.3
 pglz          45.0     45.9
 lz4           44.2     44.7
 snappy        43.4     43.3

These results show that FPW compression is really helpful for decreasing the WAL volume and improving the performance. The compression ratio by lz4 or snappy is better than that by pglz. But it's difficult to conclude which of lz4 or snappy is best, according to these results. ISTM that the compression-of-multiple-pages-at-a-time approach can compress WAL more than compression-of-single-... does.

[HOW TO BENCHMARK]
Create a pgbench database with scale factor 1000. Change the data type of the column filler on each pgbench table from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's gen_random_uuid() in order to avoid empty columns, e.g.:

    alter table pgbench_accounts alter column filler type text using gen_random_uuid()::text

After creating the test database, run pgbench as follows. The number of transactions executed during the benchmark is almost the same between runs because the -R option is used:

    pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared

checkpoint_timeout is 5min, so it's expected that checkpoint was executed at least two times during the benchmark.
Regards, -- Fujii Masao

It'd be interesting to check avg cpu usage as well.
Re: [HACKERS] jsonb format is pessimal for toast compression
On Wed, Aug 27, 2014 at 1:09 AM, Arthur Silva arthur...@gmail.com wrote:

It won't be faster by any means, but it should definitely be incorporated if any format changes are made (like Tom already suggested). I think it's important we gather at least 2 more things before making any calls:
* Josh's tests w/ the cache-aware patch, which should confirm cache-aware is indeed preferred
* Tests with toast hacked to use lz4 instead, which might ease any decisions
-- Arthur Silva

On Wed, Aug 27, 2014 at 12:53 AM, Peter Geoghegan p...@heroku.com wrote:
On Tue, Aug 26, 2014 at 8:41 PM, Arthur Silva arthur...@gmail.com wrote:
The difference is small but it's definitely faster, which makes sense since cache line misses are probably slightly reduced. As in the previous runs, I ran the query a dozen times and took the average after excluding runs with a high deviation.

I'm not surprised that it hasn't beaten HEAD. I haven't studied the problem in detail, but I don't think that the cache awareness of the new revision is necessarily a distinct advantage. -- Peter Geoghegan

I'm attaching a quick-n-dirty patch that uses lz4 compression instead of pglz in case someone wants to experiment with it. Seems to work in my test env; I'll run more tests when I get home. PS: gotta love gmail's top-posting defaults...

lz4.patch Description: Binary data
Re: [HACKERS] jsonb format is pessimal for toast compression
Tom, here are the results with the github data (8 top-level keys) only. Here's a sample object: https://gist.github.com/igrigorik/2017462

All-Lengths + Cache-Aware EXTERNAL
Query 1: 516ms
Query 2: 350ms

The difference is small but it's definitely faster, which makes sense since cache line misses are probably slightly reduced. As in the previous runs, I ran the query a dozen times and took the average after excluding runs with a high deviation.

Compare to (copied from my previous email):

HEAD (aka, all offsets) EXTERNAL
Test query 1 runtime: 505ms
Test query 2 runtime: 350ms

All Lengths (Tom Lane patch) EXTERNAL
Test query 1 runtime: 525ms
Test query 2 runtime: 355ms

-- Arthur Silva

On Tue, Aug 26, 2014 at 7:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
I wrote: I wish it were cache-friendly too, per the upthread tangent about having to fetch keys from all over the place within a large JSON object. ... and while I was typing that sentence, lightning struck. The existing arrangement of object subfields with keys and values interleaved is just plain dumb. We should rearrange that as all the keys in order, then all the values in the same order. Then the keys are naturally adjacent in memory and object-key searches become much more cache-friendly: you probably touch most of the key portion of the object, but none of the values portion, until you know exactly what part of the latter to fetch. This approach might complicate the lookup logic marginally, but I bet not very much; and it will be a huge help if we ever want to do smart access to EXTERNAL (non-compressed) JSON values. I will go prototype that just to see how much code rearrangement is required.

This looks pretty good from a coding point of view. I have not had time yet to see if it affects the speed of the benchmark cases we've been trying. I suspect that it won't make much difference in them. I think if we do decide to make an on-disk format change, we should seriously consider including this change.
The same concept could be applied to offset-based storage of course, although I rather doubt that we'd make that combination of choices since it would be giving up on-disk compatibility for benefits that are mostly in the future. Attached are two patches: one is a delta against the last jsonb-lengths patch I posted, and the other is a merged patch showing the total change from HEAD, for ease of application. regards, tom lane
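Tom's keys-then-values rearrangement can be sketched with ordinary containers; the point is that a key search only walks the contiguous key region and jumps into the value region exactly once. std::string and std::vector stand in for the real JEntry-based byte layout here, so this is purely illustrative:

    #include <algorithm>
    #include <cassert>
    #include <string>
    #include <vector>

    // All keys stored contiguously (and sorted), then all values in the
    // same order: a lookup binary-searches only the key region, which is
    // what makes it cache-friendly for large objects.
    struct JsonObject {
        std::vector<std::string> keys;
        std::vector<std::string> values;   // values[i] belongs to keys[i]

        const std::string *find(const std::string &k) const {
            auto it = std::lower_bound(keys.begin(), keys.end(), k);
            if (it == keys.end() || *it != k)
                return nullptr;
            // One jump into the value region, only after the key is found.
            return &values[it - keys.begin()];
        }
    };

    int main() {
        JsonObject o{{"actor", "type", "url"},
                     {"igrigorik", "PushEvent", "https://example"}};
        assert(*o.find("type") == "PushEvent");
        assert(o.find("missing") == nullptr);
        return 0;
    }

With the interleaved layout, by contrast, every binary-search probe lands next to a value it does not need, pulling those bytes into cache for nothing.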
Re: [HACKERS] jsonb format is pessimal for toast compression
It won't be faster by any means, but it should definitely be incorporated if any format changes are made (like Tom already suggested). I think it's important we gather at least 2 more things before making any calls:
* Josh's tests w/ the cache-aware patch, which should confirm cache-aware is indeed preferred
* Tests with toast hacked to use lz4 instead, which might ease any decisions
-- Arthur Silva

On Wed, Aug 27, 2014 at 12:53 AM, Peter Geoghegan p...@heroku.com wrote:
On Tue, Aug 26, 2014 at 8:41 PM, Arthur Silva arthur...@gmail.com wrote:
The difference is small but it's definitely faster, which makes sense since cache line misses are probably slightly reduced. As in the previous runs, I ran the query a dozen times and took the average after excluding runs with a high deviation.

I'm not surprised that it hasn't beaten HEAD. I haven't studied the problem in detail, but I don't think that the cache awareness of the new revision is necessarily a distinct advantage. -- Peter Geoghegan
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Aug 21, 2014 at 6:20 PM, Josh Berkus j...@agliodbs.com wrote:

On 08/20/2014 03:42 PM, Arthur Silva wrote: What data are you using right now, Josh?

The same data as upthread. Can you test the three patches (9.4 head, 9.4 with Tom's cleanup of Heikki's patch, and 9.4 with Tom's latest lengths-only) on your workload? I'm concerned that my workload is unusual and don't want us to make this decision based entirely on it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com

Here are my test results so far with the github archive data. It's important to keep in mind that the PushEvent objects I use in the queries contain only a small number of keys (8 to be precise), so these tests don't really stress the changed code. Anyway, in this dataset (with the small objects) the all-lengths patch provides small compression savings, and the overhead is minimal.

Test data: 610MB of JSON, 341969 items
Index size (jsonb_ops): 331MB

Test query 1: SELECT data->'url', data->'actor' FROM t_json WHERE data @> '{"type": "PushEvent"}'
Test query 1 items: 169732
Test query 2: SELECT data FROM t_json WHERE data @> '{"type": "PushEvent"}'
Test query 2 items:

HEAD (aka, all offsets) EXTENDED
Size: 374MB
Toast Size: 145MB
Test query 1 runtime: 680ms
Test query 2 runtime: 405ms

HEAD (aka, all offsets) EXTERNAL
Size: 366MB
Toast Size: 333MB
Test query 1 runtime: 505ms
Test query 2 runtime: 350ms

All Lengths (Tom Lane patch) EXTENDED
Size: 379MB
Toast Size: 108MB
Test query 1 runtime: 720ms
Test query 2 runtime: 420ms

All Lengths (Tom Lane patch) EXTERNAL
Size: 366MB
Toast Size: 333MB
Test query 1 runtime: 525ms
Test query 2 runtime: 355ms

-- Arthur Silva
Re: [HACKERS] jsonb format is pessimal for toast compression
What data are you using right now Josh? There's the github archive http://www.githubarchive.org/ Here's some sample data https://gist.github.com/igrigorik/2017462 -- Arthur Silva On Wed, Aug 20, 2014 at 6:09 PM, Josh Berkus j...@agliodbs.com wrote: On 08/20/2014 08:29 AM, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: On 08/15/2014 04:19 PM, Tom Lane wrote: Personally I'd prefer to go to the all-lengths approach, but a large part of that comes from a subjective assessment that the hybrid approach is too messy. Others might well disagree. ... So, that extraction test is about 1% *slower* than the basic Tom Lane lengths-only patch, and still 80% slower than original JSONB. And it's the same size as the lengths-only version. Since it's looking like this might be the direction we want to go, I took the time to flesh out my proof-of-concept patch. The attached version takes care of cosmetic issues (like fixing the comments), and includes code to avoid O(N^2) penalties in findJsonbValueFromContainer and JsonbIteratorNext. I'm not sure whether those changes will help noticeably on Josh's test case; for me, they seemed worth making, but they do not bring the code back to full speed parity with the all-offsets version. But as we've been discussing, it seems likely that those costs would be swamped by compression and I/O considerations in most scenarios with large documents; and of course for small documents it hardly matters. Table sizes and extraction times are unchanged from the prior patch based on my workload. We should be comparing all-lengths vs length-and-offset maybe using another workload as well ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] [PATCH] Incremental backup: add backup profile to base backup
On Mon, Aug 18, 2014 at 10:05 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 08/18/2014 08:05 AM, Alvaro Herrera wrote:
Marco Nenciarini wrote: To calculate the md5 checksum I've used the md5 code present in the pgcrypto contrib, as the code in src/include/libpq/md5.h is not suitable for large files. Since a core feature cannot depend on a piece of contrib, I've moved the files contrib/pgcrypto/md5.c and contrib/pgcrypto/md5.h to src/backend/utils/hash/md5.c and src/include/utils/md5.h, changing the pgcrypto extension to use them.

We already have the FNV checksum implementation in the backend -- can't we use that one for this and avoid messing with MD5? (I don't think we're looking for a cryptographic hash here. Am I wrong?)

Hmm. Any user that can update a table can craft such an update that its checksum matches an older backup. That may seem like an onerous task; to correctly calculate the checksum of a file in a previous backup, you need to know the LSNs and the exact data, including deleted data, on every block in the table, and then construct a suitable INSERT or UPDATE that modifies the table such that you get a collision. But for some tables it could be trivial; you might know that a table was bulk-loaded with a particular LSN and there are no dead tuples. Or you can simply create your own table and insert exactly the data you want. Messing with your own table might seem harmless, but it'll e.g. let you construct a case where an index points to a tuple that doesn't exist anymore, or where there's a row that doesn't pass a CHECK constraint that was added later. Even if there's no direct security issue with that, you don't want that kind of uncertainty from a backup solution.

But more to the point, I thought the consensus was to use the highest LSN of all the blocks in the file, no? That's essentially free to calculate (if you have to read all the data anyway), and isn't vulnerable to collisions.
- Heikki

We also have both crc32 and crc64 implementations in pg_crc. If the goal is just verifying file integrity (we can't really protect against intentional modification) crc sounds more appropriate to me.
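Heikki's highest-LSN suggestion is cheap because pd_lsn is the first field of every page header, so it can be tracked while streaming the file. A rough sketch: BLCKSZ matches the default page size, but the LSN is treated as an opaque 64-bit value here, whereas a real implementation would assemble it from pd_lsn's xlogid/xrecoff halves:

    #include <cassert>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    constexpr size_t BLCKSZ = 8192;   // default PostgreSQL page size

    // Scan a relation file page by page, reading the first 8 bytes of
    // each page (pd_lsn) and keeping the maximum: "essentially free"
    // when you are reading every block for the backup anyway.
    uint64_t max_page_lsn(const std::vector<uint8_t> &file) {
        uint64_t maxlsn = 0;
        for (size_t off = 0; off + BLCKSZ <= file.size(); off += BLCKSZ) {
            uint64_t lsn;
            std::memcpy(&lsn, file.data() + off, sizeof lsn);
            if (lsn > maxlsn)
                maxlsn = lsn;
        }
        return maxlsn;
    }

    int main() {
        std::vector<uint8_t> file(3 * BLCKSZ, 0);
        uint64_t lsns[3] = {100, 500, 250};
        for (int i = 0; i < 3; ++i)
            std::memcpy(file.data() + i * BLCKSZ, &lsns[i], sizeof lsns[i]);
        assert(max_page_lsn(file) == 500);
        return 0;
    }

Unlike a checksum, this value can only move forward as the table is modified, which is why it is not vulnerable to the crafted-collision attack described above.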
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Aug 15, 2014 at 8:19 PM, Tom Lane t...@sss.pgh.pa.us wrote: Arthur Silva arthur...@gmail.com writes: We should add some sort of versionning to the jsonb format. This can be explored in the future in many ways. If we end up making an incompatible change to the jsonb format, I would support taking the opportunity to stick a version ID in there. But I don't want to force a dump/reload cycle *only* to do that. As for the current problem, we should explore the directory at the end option. It should improve compression and keep good access performance. Meh. Pushing the directory to the end is just a band-aid, and since it would still force a dump/reload, it's not a very enticing band-aid. The only thing it'd really fix is the first_success_by issue, which we could fix *without* a dump/reload by using different compression parameters for jsonb. Moving the directory to the end, by itself, does nothing to fix the problem that the directory contents aren't compressible --- and we now have pretty clear evidence that that is a significant issue. (See for instance Josh's results that increasing first_success_by did very little for the size of his dataset.) I think the realistic alternatives at this point are either to switch to all-lengths as in my test patch, or to use the hybrid approach of Heikki's test patch. IMO the major attraction of Heikki's patch is that it'd be upward compatible with existing beta installations, ie no initdb required (but thus, no opportunity to squeeze in a version identifier either). It's not showing up terribly well in the performance tests I've been doing --- it's about halfway between HEAD and my patch on that extract-a-key-from-a-PLAIN-stored-column test. But, just as with my patch, there are things that could be done to micro-optimize it by touching a bit more code. 
I did some quick stats comparing compressed sizes for the del.icio.us data, printing quartiles as per Josh's lead:

all-lengths     {440,569,609,655,1257}
Heikki's patch  {456,582,624,671,1274}
HEAD            {493,636,684,744,1485}

(As before, this is pg_column_size of the jsonb within a table whose rows are wide enough to force tuptoaster.c to try to compress the jsonb; otherwise many of these values wouldn't get compressed.) These documents don't have enough keys to trigger the first_success_by issue, so HEAD doesn't look too awful, but still there's about an 11% gain from switching from offsets to lengths. Heikki's method captures much of that, but not all. Personally I'd prefer to go to the all-lengths approach, but a large part of that comes from a subjective assessment that the hybrid approach is too messy. Others might well disagree. In case anyone else wants to do measurements on some more data sets, attached is a copy of Heikki's patch updated to apply against git tip.

regards, tom lane

I agree that versioning might sound silly at this point, but let's keep it in mind. Row-level compression is quite slow itself, so it sounds odd to me to pay a 25% performance penalty everywhere for the sake of a better compression ratio in the directory area. Consider, for example, an optimization that stuffs integers (up to 28 bits) inside the JEntry itself. That alone would save 8 bytes for each integer.
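Arthur's inline-integer idea can be sketched as a type tag in the JEntry word: if the tag says "inline int", the low bits hold the value directly and no separate datum needs to be stored or aligned. The tag value and field widths below are made up for illustration and do not match the real JEntry layout:

    #include <cassert>
    #include <cstdint>

    // Hypothetical 32-bit entry word: top 4 bits are a type tag, and for
    // the (made-up) "inline int" tag the low 28 bits carry the value.
    constexpr uint32_t JENTRY_TAG_MASK  = 0xF0000000u;
    constexpr uint32_t JENTRY_INLINEINT = 0x30000000u;  // illustrative tag
    constexpr uint32_t JENTRY_VAL_MASK  = 0x0FFFFFFFu;

    inline bool fits_inline(int64_t v) {
        // 28 usable bits; anything wider must be stored out of line.
        return v >= 0 && v <= (int64_t)JENTRY_VAL_MASK;
    }
    inline uint32_t make_inline_int(uint32_t v) {
        return JENTRY_INLINEINT | (v & JENTRY_VAL_MASK);
    }
    inline uint32_t get_inline_int(uint32_t e) {
        return e & JENTRY_VAL_MASK;
    }

    int main() {
        assert(fits_inline(123456));
        uint32_t e = make_inline_int(123456);
        assert((e & JENTRY_TAG_MASK) == JENTRY_INLINEINT);
        assert(get_inline_int(e) == 123456);
        assert(!fits_inline(1LL << 28));  // too wide: out-of-line storage
        return 0;
    }

The 8-byte saving Arthur cites would come from dropping the separate numeric datum (plus its alignment padding) for every integer small enough to fit.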
Re: [HACKERS] jsonb format is pessimal for toast compression
I'm still getting up to speed on postgres development, but I'd like to leave an opinion.

We should add some sort of versioning to the jsonb format. This can be explored in the future in many ways. As for the current problem, we should explore the directory-at-the-end option; it should improve compression and keep good access performance. A 4-byte header is sufficient to store the directory offset and some versioning bits.

On 15/08/2014 17:39, Tom Lane t...@sss.pgh.pa.us wrote:
Josh Berkus j...@agliodbs.com writes: On 08/14/2014 07:24 PM, Tom Lane wrote: We can certainly reduce that. The question was whether it would be worth the effort to try. At this point, with three different test data sets having shown clear space savings, I think it is worth the effort. I'll poke into it tomorrow or over the weekend, unless somebody beats me to it.

Note that I specifically created that data set to be a worst case: many top-level keys, no nesting, and small values. However, I don't think it's an unrealistic worst case. Interestingly, even in the unpatched 1GB table case, the *index* on the JSONB is only 60MB, which shows just how terrific the improvement in GIN index size/performance is.

I've been poking at this, and I think the main explanation for your result is that with more JSONB documents being subject to compression, we're spending more time in pglz_decompress. There's no free lunch in that department: if you want compressed storage it's gonna cost ya to decompress. The only way I can get decompression and TOAST access to not dominate the profile on cases of this size is to ALTER COLUMN SET STORAGE PLAIN. However, when I do that, I do see my test patch running about 25% slower overall than HEAD on an "explain analyze select jfield -> 'key' from table" type of query with 200-key documents with narrow fields (see attached perl script that generates the test data). It seems difficult to improve much on that for this test case.
I put some logic into findJsonbValueFromContainer to calculate the offset sums just once, not once per binary-search iteration, but that only improved matters 5% at best. I still think it'd be worth modifying the JsonbIterator code to avoid repetitive offset calculations, but that's not too relevant to this test case. Having said all that, I think this test is something of a contrived worst case. More realistic cases are likely to have many fewer keys (so that speed of the binary search loop is less of an issue) or else to have total document sizes large enough that inline PLAIN storage isn't an option, meaning that detoast+decompression costs will dominate.

regards, tom lane
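The offset-sums optimization Tom mentions amounts to turning repeated length summation into a one-time prefix sum: with the all-lengths format, the offset of element i is the sum of the lengths before it, which would otherwise be recomputed on every binary-search probe. A minimal illustration (not the actual jsonb code):

    #include <cassert>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Compute all element offsets once, up front: offs[i] is where
    // element i starts, so each later binary-search probe is O(1)
    // instead of re-summing lengths (O(n) per probe).
    std::vector<uint32_t> offsets_from_lengths(const std::vector<uint32_t> &lens) {
        std::vector<uint32_t> offs(lens.size() + 1, 0);
        std::partial_sum(lens.begin(), lens.end(), offs.begin() + 1);
        return offs;   // offs.back() is the total payload size
    }

    int main() {
        std::vector<uint32_t> lens = {3, 7, 2, 5};
        auto offs = offsets_from_lengths(lens);
        assert(offs[0] == 0 && offs[1] == 3 && offs[2] == 10 && offs[3] == 12);
        assert(offs[4] == 17);   // total payload size
        return 0;
    }

This is also why the all-lengths format compresses better than all-offsets: the small repeated lengths are far more compressible than ever-growing offsets, at the price of having to reconstruct offsets like this at access time.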