Re: [HACKERS] WAL Re-Writes
On 27/01/2016 13:30, Amit Kapila wrote:
> Thoughts?

Are the decreases observed with SSD as well as spinning rust? I might imagine that decreasing the wear would be advantageous, especially if the performance decrease is less with low read latency.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Securing "make check" (CVE-2014-0067)
On 02/03/2014 15:30, Magnus Hagander wrote:
> Terminal Services have definitely become more common over time, but with
> faster and cheaper virtualization, a lot of people have switched to that
> instead, which would remove the problem of course. I wonder how common it
> actually is, though, to *build postgres* on a terminal services machine
> with other users on it...

Well, the banks I've contracted at recently are all rather keen on virtual desktops for developers, and some of those are terminal services. We're a headache, and packaging up all the things we need is a pain, so there is some mileage in buying grunty servers and doing specific installs that are then shared, rather than making an MSI generally available. Also I have experience of being given accounts for Jenkins etc. that are essentially terminal services logins, and having these things unable to maintain a software stack can effectively disqualify tech we would otherwise use.
[HACKERS] swapcache-style cache?
Has anyone considered managing a system like the DragonFly swapcache for a DBMS like PostgreSQL? i.e. where the admin can assign drives with good random read behaviour (but perhaps also-ran random write), such as SSDs, to provide a cache for blocks that were dirtied, with async write that hopefully writes them out before they are forcibly discarded. And where a cache fail (whether by timeout, hard fail, or CRC fail) just means having to go back to the real transactional storage. I'd been thinking that swapcache would help where the working set won't fit in RAM, as would L2ARC on Solaris - but it seems to me that there is no reason not to allow the DBMS to manage the set-aside area itself where it is given either access to the raw device or to a pre-sized file on the device it can map in segments. While L2ARC is obviously very heavyweight and entwined in ZFS, DragonFly's swapcache seems to me remarkably elegant and, it would seem, very effective.

James
Re: [HACKERS] Parallel Sort
> Have you considered GPU-based sorting? I know there's been discussion in
> the past.

If you use OpenCL, then you can use a CPU driver if there is no GPU, and that can allow you to leverage all the CPU cores without having to do the multi-thread stuff in the backend. While the compilation of a specific kernel can be quite expensive, it also has the effect of a JIT compiler in terms of system independence.
Re: [HACKERS] C++ compiler
On 25/06/2013 05:16, Tom Lane wrote:
> It might be time to reconsider whether we should move the baseline
> portability requirement up to C99.

My understanding was that you picked up a lot of users when the Win32 port became useful. While you can build with msys, I would think that leaving Microsoft's tooling behind would be a mistake, and as far as I am aware they have said that they are supporting C++11 but not bothering with C99.

> I'm really not in favor of moving to C++ though, as the
> portability-vs-usefulness tradeoffs seem pretty unattractive there.

As a long-time C++ programmer I don't see what the problem would be beyond (some) existing contributors being wary of the unknown. It's not as if any platform developed enough to be a sane db server has not got a decent C++ compiler or two. Portability is only really a problem with a subset of new C++11 features.
Re: [HACKERS] plpython implementation
On 01/07/2013 02:43, Claudio Freire wrote:
> In essence, you'd have to use another implementation. CPython guys have
> left it very clear they don't intend to "fix" that, as they don't
> consider it a bug. It's just how it is.

Given how useful it is to have a scripting language that can be used outside of the database as well as inside it, would it be reasonable to consider 'promoting' pllua? My understanding is that it (lua) is much cleaner under the hood (than CPython). Although I do recognise that Python as a whole has always had more traction.
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 14/07/2013 20:13, Greg Smith wrote:
> The most efficient way to write things out is to delay those writes as
> long as possible.

That doesn't smell right to me. It might be that delaying allows more combining and allows the kernel to see more at once and optimise it, but I think the counter-argument is that it is an efficiency loss to have either CPU or disk idle waiting on the other. It cannot make sense from a throughput point of view to have disks doing nothing and then become overloaded so they are a bottleneck (primarily seeking) while the CPU does nothing. Now, I have NOT measured behaviour, but I'd observe that we see disks that can stream 100MB/s yet do only 5% of that if they are doing random IO. Some random seeks during sync can't be helped, but if they are done when we aren't waiting for sync completion then they are in effect free. The flip side is that we can't really know whether they will get merged with adjacent writes later, so it's hard to schedule them early. But we can observe that if we have a bunch of writes to adjacent data, then a seek to do the write is effectively amortised across them. So it occurs to me that perhaps we can watch for patterns where we have groups of adjacent writes that might stream, and when they form we might schedule them to be pushed out early (if not immediately), ideally out as far as the drive (but not flushed from its cache) and without forcing all other data to be flushed too. And perhaps we should always look to be getting drives dedicated to the dbms to do something, even if it turns out to have been redundant in the end. That's not necessarily easy on Linux without using direct unbuffered IO, but to me that is Linux's problem. For a start, it's not the only target system; feedback on what 'we need' from db and mail system groups to the NT kernel devs hasn't hurt, and it never hurt Solaris to hear what Oracle and Sybase devs felt they needed either.
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
On 03/10/2014 05:53, Kouhei Kaigai wrote:
> Yep, that's my pain. Even though usual query does not take many buffers
> pinned, my use case needs to fetch megabytes scale data at once because
> of performance reason; page-by-page synchronous scan makes GPU being idle.

Doesn't your GPU have an async queue and exec mechanism? Then you could do an async DMA to the GPU with an event, use that event in the GPU to start the kernel, and in the DB to release the pin?
Re: [HACKERS] json api WIP patch
> The processing functions have been extended to provide populate_record()
> and populate_recordset() functions. The latter in particular could be
> useful in decomposing a piece of JSON representing an array of flat
> objects (a fairly common pattern) into a set of Postgres records in a
> single pass.

So this would allow an 'insert into ... select ... from (...)'? I had been wondering how to do such an insertion efficiently in the context of SPI, but it seems that there is no SPI_copy equiv that would allow a query parse and plan to be avoided. Is this mechanism likely to be as fast as we can get at the moment in contexts where copy is not feasible?
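For instance, the single-pass decomposition described above might look like this (the table and the JSON literal are invented for illustration, assuming the populate_recordset() signature discussed in the thread):

```sql
-- hypothetical target table
CREATE TABLE page_hit (ts timestamptz, url text, hits int);

-- decompose a JSON array of flat objects into records in one pass
INSERT INTO page_hit (ts, url, hits)
SELECT ts, url, hits
FROM json_populate_recordset(NULL::page_hit,
       '[{"ts":"2013-01-07 10:00:00+00","url":"/a","hits":3},
         {"ts":"2013-01-07 10:05:00+00","url":"/b","hits":1}]');
```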
Re: [HACKERS] json api WIP patch
>> I had been wondering how to do such an insertion efficiently in the
>> context of SPI, but it seems that there is no SPI_copy equiv that would
>> allow a query parse and plan to be avoided.
>
> Your query above would need to be planned too, although the plan will be
> trivial.

Ah yes, I meant that I had not found a way to avoid it (for multi-row inserts etc.) from a stored proc context where I have SPI functions available.

> You should not try to use it as a general bulk load facility. And it
> will not be as fast as COPY for several reasons, including that the Json
> parsing routines are necessarily much heavier than the COPY parse
> routines, which have in any case been optimized over quite a long
> period. Also, a single json datum is limited to no more than 1Gb. If you
> have such a datum, parsing it involves having it in memory and then
> taking a copy (I wonder if we could avoid that step - will take a look).
> Then each object is decomposed into a hash table of key value pairs,
> which is then used to construct the record datum. Each field name in the
> result record is used to look up the value in the hash table - this
> happens once in the case of populate_record() and once per object in the
> array in the case of populate_recordset(). In the latter case the
> resulting records are put into a tuplestore structure (which spills to
> disk if necessary) which is then returned to the caller when all the
> objects in the JSON array are processed. COPY doesn't have these sorts
> of issues. It knows without having to look things up where each datum is
> in each record, and it stashes the result straight into the target
> table. It can read and insert huge numbers of rows without significant
> memory implications.

Yes - but I don't think I can use COPY from a stored proc context, can I? If I could use binary COPY from a stored proc that has received a binary param and unpacked the data, it would be handy.
If SPI provided a way to perform a copy to a temp table and then some callback on an iterator that yields rows to it, that would do the trick I guess.

> Perhaps if you give us a higher level view of what you're trying to
> achieve we can help you better.

I had been trying to identify a way to work with record sets where the records might be used for insert, or for updates or deletion statements, preferably without forming a large custom SQL statement that must then be parsed and planned (and which would be a PITA if I wanted to use the SQL-C preprocessor or some language bindings that like to prepare a statement and execute with params). The data I work with has a master-detail structure and insertion performance matters, so I'm trying to limit manipulations to one statement per table per logical operation even where there are multiple detail rows. Sometimes the network latency can be a pain too, and that also suggests an RPC with unpack and insert locally.

Cheers

James
Re: [HACKERS] json api WIP patch
> You can use COPY from a stored procedure, but only to and from files.

I think that's in the chocolate fireguard realm though as far as efficiency for this sort of scenario goes, even if it's handled by retaining an mmap'd file as workspace. If SPI provided a way to perform a copy to a temp table and then some callback on an iterator that yields rows to it, that would do the trick I guess.

> SPI is useful, but it's certainly possible to avoid its use. After all,
> that's what almost the whole backend does, including the COPY code. Of
> course, it's a lot harder to write that way, which is part of why SPI
> exists. Efficiency has its price.

So it is possible to use a lower level interface from a C stored proc? SPI is the (only) documented direct function extension API, isn't it? Is the issue with using the JSON data-to-record set that the parsing can be costly? Perhaps it can be achieved with B64 of compressed protobuf, or such. I don't mind if it seems a bit messy - the code can be generated from the table easily enough, especially if I can use C++. I guess an allocator that uses SPI_palloc would solve issues with memory management on error?
Re: [HACKERS] Strange Windows problem, lock_timeout test request
> So, while no native 64-bit compilers are available for free as part of
> Visual Studio Express 2012 (11), it doesn't matter much. The issue is
> more that it's very much Microsoft's whim whether they release compilers
> at all and if so, which ones, when and how.

I think I have a pretty vanilla Visual Studio Express 2012 for Desktop and:

C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64>.\cl
Microsoft (R) C/C++ Optimizing Compiler Version 17.00.51106.1 for x64
Copyright (C) Microsoft Corporation. All rights reserved.

usage: cl [ option... ] filename... [ /link linkoption... ]

C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64>

Am I misunderstanding the discussion here? Isn't that the 64-bit tool suite? Does anyone care if the compiler is a 64 bit binary itself?
Re: [HACKERS] Visual Studio 2012 RC
> Anyway, this is getting way off track. The point is that the MS SDKs and
> compilers are a bit of a mess and that MinGW support is useful because
> we can't rely on them continuing to offer free SDKs and compilers in
> future.

Well, more compilers are always useful, but complaining that Microsoft might withdraw their working compilers smacks of 'what if?' paranoia. What if MinGW support for Win64 was (sometimes/often/always/still) a bit rubbish? Oh wait ...
Re: [HACKERS] Visual Studio 2012 RC
> On the contrary, only a few months ago there was a far from groundless
> fear that Microsoft would do just that. Following considerable outcry
> they changed their mind. But this is definitely not just paranoia. As
> for w64 support, the mingw-64 project exists more or less explicitly to
> produce 64 bit compilers, including those hosted on mingw/msys.

Huh. The only reason we have to use mingw64 or one of the assorted personal builds is because 'mingw support' doesn't deliver on its own, and last I looked there was a confusing variety of personal builds with various strengths and weaknesses. I managed to make some progress, but we seem to be a ways off having a reference download (and ideally one with clang too, I guess). I'd very much like there to be a good reference implementation, but the whole mingw/mingw64 thing is indicative of some problems, and reminds me of egcs. Do you have references to back up your statements and demonstrate that it wasn't primarily FUD? FWIW I think the higher entry price of pay-for VStudio almost guarantees continued availability of a free compiler, though it might end up slightly crippled, but I'm not a product planner for MS any more than you are.
Re: [HACKERS] Have we out-grown Flex?
I haven't tried quex, but I have tried lemon (which can be broken out of SQLite) and re2c and ragel. I like ragel and lemon because the combination supports a push-parser style (from memory); many tools are inconvenient unless you are prepared to suck in a whole message before parsing, or let the parser drive a pull loop, or use a coroutine structure. Could go all trendy and use a PEG tool like, er, peg (http://piumarta.com/software/peg/). (I haven't tried them tho')

James
Re: [HACKERS] Have we out-grown Flex?
Doesn't that imply that a plan cache might be worthwhile? But no matter: didn't the OP really have an issue with packaging and Windows support? And there are a lot of Windows users, and in general there are many Windows devs: making it easier for them to contribute has to be good, doesn't it?
Re: [HACKERS] Have we out-grown Flex?
> I believe there are tools that are significantly faster than flex.

I believe re2c generates code that is faster. But the key thing is to test, probably, or perhaps ask around. I'm out of touch, but from memory flex wasn't the be-all and end-all. Lemon is definitely easy to maintain/port and the result is pretty nice, too (I know Bison/Yacc wasn't the focus here).
Re: [HACKERS] Bugs in our Windows socket code
> That is, if you request FD_WRITE events for a pre-existing socket with
> WSAEventSelect, you will not get one until the outbound network buffer
> has been filled and then has partially emptied. (This is incredibly
> broken, but Microsoft evidently has no intention of fixing it.)

I think you should distinguish between 'broken' and 'not what I am used to'. Microsoft have always had edge-triggered delivery here, rather than the level-triggered delivery on BSD. But WinSock is not BSD sockets. If you want to remember what the state is, associate a flag with the socket. Microsoft are very good at retaining compatibility; 'fixing it' in your terms would be crass breakage.
[HACKERS] Fake async rep target
How easy would it be to implement a fake async rep target? Perhaps even as something that a server could allow a connection to request? (i.e. a suitably permissioned connection could convert itself to receive an async replication stream, rather than being statically configured?) I know that it sounds a bit bonkers, but a while back I worked on a system where we configured a rep target (using OpenServer) so we could observe changes to tables and enqueue secondary processing. Rather painful in that case because of the way that repserver is configured, and I'm not sure it was worth the pain when configuring test and dev environments. However, in principle, it seems that this is quite an elegant stand-in for a whole raft of trigger functions - and probably a lot cheaper to execute. The key, I think, is to be able to allow dynamic attachment of such a 'change feed' by an account that has god-like read access. Is the existing async rep code amenable to this sort of abuse?
Re: [HACKERS] Fake async rep target
Well, I was assuming that there was some intelligence in the receiver that could effectively parse this for the application; are you suggesting that it is effectively binary deltas to apply to raw pages? Certainly, Sybase rep server works by creating function calls or SQL updates (depending on how you set it all up), and a fairly simple parser can be used to process the received stream. I was hoping that the receiver code could be used to at least identify which tuples are affected and their before/after primary keys. Ideally the result would be a packet per transaction, listing the impacted tables and, for each table, a list of affected primary keys and whether they were inserted, deleted or updated.

James
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 05/01/2014 16:50, Robert Haas wrote:
> But on Windows, segments are *automatically* destroyed *by the operating
> system* when the last process unmaps them, so it's not quite so clear to
> me how we can allow it there. The main shared memory segment is no
> problem because the postmaster always has it mapped, even if no one else
> does, but that doesn't help for dynamic shared memory segments.

Surely you just need to DuplicateHandle into the parent process? If you want to (tidily) dispose of it at some time, then you'll need to tell the postmaster that you have done so and what the handle is in its process, but if you just want it to stick around, then you can just pass it up.
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 05/01/2014 18:02, Robert Haas wrote:
> On Sun, Jan 5, 2014 at 12:34 PM, james wrote:
>> On 05/01/2014 16:50, Robert Haas wrote:
>>> But on Windows, segments are *automatically* destroyed *by the
>>> operating system* when the last process unmaps them, so it's not quite
>>> so clear to me how we can allow it there. The main shared memory
>>> segment is no problem because the postmaster always has it mapped,
>>> even if no one else does, but that doesn't help for dynamic shared
>>> memory segments.
>>
>> Surely you just need to DuplicateHandle into the parent process? If you
>> want to (tidily) dispose of it at some time, then you'll need to tell
>> the postmaster that you have done so and what the handle is in its
>> process, but if you just want it to stick around, then you can just
>> pass it up.
>
> Uh, I don't know, maybe? Does the postmaster have to do something to
> receive the duplicated handle

In principle, no, so long as the child has a handle to the parent process that has the appropriate permissions. Given that these processes have a parent/child relationship, that shouldn't be too hard to arrange.

> , or can the child just throw it over the wall to the parent and let it
> rot until the postmaster finally exits?

Yes. Though it might be a good idea to record the handle somewhere (perhaps in a table) so that any potential issues from an insane system spamming the postmaster with handles are apparent. I'm intrigued - how are the handles shared between children that are peers in the current scheme? Some handle transfer must already be in place. Could you share the handles to an immortal worker if you want to reduce any potential impact on the postmaster?

> The latter would be nicer for our purposes, perhaps, as running more
> code from within the postmaster is risky for us. If a regular backend
> process dies, the postmaster will restart everything and the database
> will come back on line, but if the postmaster itself dies, we're hard
> down.
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 06/01/2014 03:14, Robert Haas wrote:
> That's up to the application. After calling dsm_create(), you call
> dsm_segment_handle() to get the 32-bit integer handle for that segment.
> Then you have to get that to the other process(es) somehow. If you're
> trying to share a handle with a background worker, you can stuff it in
> bgw_main_arg. Otherwise, you'll probably need to store it in the main
> shared memory segment, or a file, or whatever.

Well, that works for sysv shm, sure. But I was interested (possibly from Konstantin) how the handle transfer takes place at the moment, particularly if it is possible to create additional segments dynamically. I haven't looked at the extension at all.
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 06/01/2014 04:20, Amit Kapila wrote:
> Duplicate handle should work, but we need to communicate the handle to
> other process using IPC.

Only if the other process needs to use it. The IPC is not to transfer the handle to the other process, just to tell it which slot in its handle table contains the handle. If you just want to ensure that its use-count never goes to zero, the receiver does not need to know what the handle is. However, the point remains that you need to duplicate it into every process that might want to use it subsequently, so it makes sense to DuplicateHandle into the parent, and then to advertise that handle value publicly so that other child processes can DuplicateHandle it back into their own process. The handle value can change, so you also need to refer to the handle in the parent and map it in each child to the local equivalent.
Re: [HACKERS] One process per session lack of sharing
On 15/07/2016 09:28, Craig Ringer wrote:
> I don't think anyone's considering moving from multi-processing to
> multi-threading in PostgreSQL. I really, really like the protection that
> the shared-nothing-by-default process model gives us, among other
> things.

As I understand it, the main issue is that it is hard to integrate extensions that use heavyweight runtimes and are focussed on isolation within a virtual machine. Perhaps it would be possible for the postmaster (or a delegate process) to host such a runtime, and find a way for a user process that wants to use such a runtime to communicate with it, whether by copying function parameters over RPC or by sharing some of its address space explicitly to the runtime to operate on directly. Such a host delegate process could be explicitly built with multithread support and not 'infect' the rest of the code with its requirements. Using granular RPC is nice for isolation, but I am concerned that the latencies might be high.
Re: [HACKERS] todo - I want the elog() thingy
> > * Allow elog() to return error codes, module name, file name, line
> > number, not just messages [elog]

I bags this one. A nice relatively easy place for me to start hacken' the Postgres. Which source tree do I diff and patch against? Er, I have no idea how to use these diff and patch things, but I know that a manual exists. How do I get the CVS source tree? Surely I don't have to download the whole thing every day? I only have 1KB/sec of connectivity and it's extremely expensive ($300/month). Can I just download the files for elog() and do it that way? I'll write some driver function to unit test it, and send the patch when I'm done to the patches list. Any developers got some tips for me?

--- James

- Original Message -
From: "Oleg Bartunov" <[EMAIL PROTECTED]>
To: "Bruce Momjian" <[EMAIL PROTECTED]>
Cc: "PostgreSQL-development" <[EMAIL PROTECTED]>
Sent: Sunday, May 13, 2001 9:35 PM
Subject: Re: [HACKERS] 7.2 items

> I'd like to have partial sorting implemented in 7.2.
> While it's rather narrow optimization for case ORDER BY ... LIMIT ...
> it has big (in my opinion) impact to Web application.
> We get up to 6x performance improvement in our experiments with our very
> crude patch for 7.1. The idea is very simple - stop sorting when we get
> requested rows. Unfortunately, our knowledge of internals is poor and
> we need some help.
>
> Regards,
> Oleg
>
> On Thu, 10 May 2001, Bruce Momjian wrote:
>
> > Here is a small list of big TODO items. I was wondering which ones
> > people were thinking about for 7.2?
> > * Add replication of distributed databases [replication]
> >     o automatic fallover
> >     o load balancing
> >     o master/slave replication
> >     o multi-master replication
> >     o partition data across servers
> >     o sample implementation in contrib/rserv
> >     o queries across databases or servers (two-phase commit)
> > * Point-in-time data recovery using backup and write-ahead log
> > * Allow row re-use without vacuum (Vadim)
> > * Add the concept of dataspaces/tablespaces [tablespaces]
> > * Allow better control over user privileges [privileges]
> > * Allow elog() to return error codes, module name, file name, line
> >   number, not just messages [elog]
> > * Allow international error message support and add error codes [elog]
> > * Make binary/file in/out interface for TOAST columns
> > * Large object interface improvements
> > * Allow inherited tables to inherit index, UNIQUE constraint, and
> >   primary key [inheritance]
> > * Add ALTER TABLE DROP COLUMN feature [drop]
> > * Add ALTER TABLE ... DROP CONSTRAINT
> > * Automatically drop constraints/functions when object is dropped
>
> Regards,
> Oleg
>
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
Re: [HACKERS] Parallel Aggregate
Hi,

I've done some testing with one of my data sets in an 8VPU virtual environment and this is looking really, really good.

My test query is:

SELECT pageview, sum(pageview_count)
FROM fact_agg_2015_12
GROUP BY date_trunc('DAY'::text, pageview);

The query returns 15 rows. The fact_agg table is 5398MB and holds around 25 million records.

Explain with a max_parallel_degree of 8 tells me that the query will only use 6 background workers. I have no indexes on the table currently.

Finalize HashAggregate  (cost=810142.42..810882.62 rows=59216 width=16)
  Group Key: (date_trunc('DAY'::text, pageview))
  ->  Gather  (cost=765878.46..808069.86 rows=414512 width=16)
        Number of Workers: 6
        ->  Partial HashAggregate  (cost=764878.46..765618.66 rows=59216 width=16)
              Group Key: date_trunc('DAY'::text, pageview)
              ->  Parallel Seq Scan on fact_agg_2015_12  (cost=0.00..743769.76 rows=4221741 width=12)

I am getting the following timings (everything was cached before I started testing). I didn't average the runtime, but I ran each one three times and took the middle value.

max_parallel_degree    runtime
0                      11693.537 ms
1                       6387.937 ms
2                       4328.629 ms
3                       3292.376 ms
4                       2743.148 ms
5                       2278.449 ms
6                       2000.599 ms

I'm pretty happy!

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect
Level 2, 50 Queen St, Melbourne VIC 3000
P (+61) 3 8370 8000  W www.lisasoft.com  F (+61) 3 8370 8099

On Mon, Mar 14, 2016 at 8:44 AM, David Rowley wrote:
> On 12 March 2016 at 16:31, David Rowley wrote:
> > I've attached an updated patch which is based on commit 7087166,
> > things are really changing fast in the grouping path area at the
> > moment, but hopefully the dust is starting to settle now.
>
> The attached patch fixes a harmless compiler warning about a possible
> uninitialised variable.
> > -- > David Rowley http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- -- The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. Please advise the sender of your incorrect receipt of this correspondence.
Re: [HACKERS] Parallel Aggregate
Hi,

Happy to test, really looking forward to seeing this stuff in core. The explain analyze is below:

Finalize HashAggregate  (cost=810142.42..810882.62 rows=59216 width=16) (actual time=2282.092..2282.202 rows=15 loops=1)
  Group Key: (date_trunc('DAY'::text, pageview_start_tstamp))
  ->  Gather  (cost=765878.46..808069.86 rows=414512 width=16) (actual time=2281.749..2282.060 rows=105 loops=1)
        Number of Workers: 6
        ->  Partial HashAggregate  (cost=764878.46..765618.66 rows=59216 width=16) (actual time=2276.879..2277.030 rows=15 loops=7)
              Group Key: date_trunc('DAY'::text, pageview_start_tstamp)
              ->  Parallel Seq Scan on celebrus_fact_agg_1_p2015_12  (cost=0.00..743769.76 rows=4221741 width=12) (actual time=0.066..1631.650 rows=3618887 loops=7)

One question - how is the upper limit of workers chosen?

James Sewell, Solutions Architect
Level 2, 50 Queen St, Melbourne VIC 3000
P (+61) 3 8370 8000  W www.lisasoft.com  F (+61) 3 8370 8099

On Mon, Mar 14, 2016 at 12:30 PM, David Rowley wrote:
> On 14 March 2016 at 14:16, James Sewell wrote:
>
>> I've done some testing with one of my data sets in an 8VPU virtual
>> environment and this is looking really, really good.
>>
>> My test query is:
>>
>> SELECT pageview, sum(pageview_count)
>> FROM fact_agg_2015_12
>> GROUP BY date_trunc('DAY'::text, pageview);
>>
>> The query returns 15 rows. The fact_agg table is 5398MB and holds around
>> 25 million records.
>>
>> Explain with a max_parallel_degree of 8 tells me that the query will
>> only use 6 background workers. I have no indexes on the table currently.
>>
>> Finalize HashAggregate  (cost=810142.42..810882.62 rows=59216 width=16)
>>   Group Key: (date_trunc('DAY'::text, pageview))
>>   ->  Gather  (cost=765878.46..808069.86 rows=414512 width=16)
>>         Number of Workers: 6
>>         ->  Partial HashAggregate  (cost=764878.46..765618.66 rows=59216 width=16)
>>               Group Key: date_trunc('DAY'::text, pageview)
>>               ->  Parallel Seq Scan on fact_agg_2015_12  (cost=0.00..743769.76 rows=4221741 width=12)
>>
>
> Great! Thanks for testing this.
>
> If you run EXPLAIN ANALYZE on this with the 6 workers, does the actual
> number of Gather rows come out at 105? I'd just like to get an idea of
> whether my cost estimates for the Gather are going to be accurate for
> real world data sets.
>
> --
> David Rowley    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
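[Editor's note: David's 105-row question checks out against the EXPLAIN ANALYZE posted earlier in the thread. With 6 workers plus the leader participating, each process emits one partial aggregate row per group, and there are 15 groups. The arithmetic below is plain bookkeeping, not PostgreSQL code:]

```python
# 6 workers + the leader each emit one partial aggregate row per group;
# the Gather node therefore receives groups * processes rows in total.
groups = 15
processes = 6 + 1  # workers plus the leader (hence loops=7 in the plan)
gather_rows = groups * processes
print(gather_rows)  # 105, matching "rows=105" on the Gather node
```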
Re: [HACKERS] Parallel Aggregate
Cool, I've been testing how this works with partitioning (which seems to be strange, but I'll post separately about that) and something odd seems to be going on now with the parallel triggering: postgres=# create table a as select * from base_p2015_11; SELECT 2000 postgres=# select * from a limit 1; ts | count | a | b | c | d | e +---+-+--+--+--+--- 2015-11-26 21:10:04.856828 | 860 | 946 | 1032 | 1118 | 1204 | (1 row) postgres-# \d a Table "datamart_owner.a" Column |Type | Modifiers +-+--- ts | timestamp without time zone | count | integer | a | integer | b | integer | c | integer | d | integer | e | integer | postgres=# select pg_size_pretty(pg_relation_size('a')); pg_size_pretty 1149 MB postgres=# explain select sum(count) from a group by date_trunc('DAY',ts); QUERY PLAN -- Finalize GroupAggregate (cost=218242.96..218254.46 rows=200 width=16) Group Key: (date_trunc('DAY'::text, ts)) -> Sort (cost=218242.96..218245.96 rows=1200 width=16) Sort Key: (date_trunc('DAY'::text, ts)) -> Gather (cost=218059.08..218181.58 rows=1200 width=16) Number of Workers: 5 -> Partial HashAggregate (cost=217059.08..217061.58 rows=200 width=16) Group Key: date_trunc('DAY'::text, ts) -> Parallel Seq Scan on a (cost=0.00..197059.06 rows=405 width=12) (9 rows) postgres=# analyze a; postgres=# explain select sum(count) from a group by date_trunc('DAY',ts); QUERY PLAN -- GroupAggregate (cost=3164211.55..3564212.03 rows=2024 width=16) Group Key: (date_trunc('DAY'::text, ts)) -> Sort (cost=3164211.55..3214211.61 rows=2024 width=12) Sort Key: (date_trunc('DAY'::text, ts)) -> Seq Scan on a (cost=0.00..397059.30 rows=2024 width=12) (5 rows) Unsure what's happening here. James Sewell, PostgreSQL Team Lead / Solutions Architect __ Level 2, 50 Queen St, Melbourne VIC 3000 *P *(+61) 3 8370 8000 *W* www.lisasoft.com *F *(+61) 3 8370 8099 On Mon, Mar 14, 2016 at 1:31 PM, David Rowley wrote: > On 14 March 2016 at 14:52, James Sewell wrote: > > One question - how is the upper limit of workers chosen? 
>
> See create_parallel_paths() in allpaths.c. Basically the bigger the
> relation (in pages) the more workers will be allocated, up until
> max_parallel_degree.
>
> There is also a comment in that function which states:
> /*
>  * Limit the degree of parallelism logarithmically based on the size of the
>  * relation. This probably needs to be a good deal more sophisticated, but we
>  * need something here for now.
>  */
>
> So this will likely see some revision at some point, after 9.6.
>
> --
> David Rowley    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Parallel Aggregate
Hi again, I've been playing around with inheritance combined with this patch. Currently it looks like you are taking max(parallel_degree) from all the child tables and using that for the number of workers. For large machines it makes much more sense to use sum(parallel_degree) - but I've just seen this comment in the code: /* * Decide what parallel degree to request for this append path. For * now, we just use the maximum parallel degree of any member. It * might be useful to use a higher number if the Append node were * smart enough to spread out the workers, but it currently isn't. */ Does this mean that even though we are aggregating in parallel, we are only operating on one child table at a time currently? Cheers, James Sewell, Solutions Architect __ Level 2, 50 Queen St, Melbourne VIC 3000 *P *(+61) 3 8370 8000 *W* www.lisasoft.com *F *(+61) 3 8370 8099 On Mon, Mar 14, 2016 at 2:39 PM, James Sewell wrote: > Cool, > > I've been testing how this works with partitioning (which seems to be > strange, but I'll post separately about that) and something odd seems to be > going on now with the parallel triggering: > > postgres=# create table a as select * from base_p2015_11; > SELECT 2000 > > postgres=# select * from a limit 1; > ts | count | a | b | c | d | e > +---+-+--+--+--+--- > 2015-11-26 21:10:04.856828 | 860 | 946 | 1032 | 1118 | 1204 | > (1 row) > > postgres-# \d a > Table "datamart_owner.a" > Column |Type | Modifiers > +-+--- > ts | timestamp without time zone | > count | integer | > a | integer | > b | integer | > c | integer | > d | integer | > e | integer | > > postgres=# select pg_size_pretty(pg_relation_size('a')); > pg_size_pretty > > 1149 MB > > postgres=# explain select sum(count) from a group by date_trunc('DAY',ts); > QUERY PLAN > > -- > Finalize GroupAggregate (cost=218242.96..218254.46 rows=200 width=16) >Group Key: (date_trunc('DAY'::text, ts)) >-> Sort (cost=218242.96..218245.96 rows=1200 width=16) > Sort Key: (date_trunc('DAY'::text, ts)) > -> 
Gather (cost=218059.08..218181.58 rows=1200 width=16) >Number of Workers: 5 >-> Partial HashAggregate (cost=217059.08..217061.58 > rows=200 width=16) > Group Key: date_trunc('DAY'::text, ts) > -> Parallel Seq Scan on a (cost=0.00..197059.06 > rows=405 width=12) > (9 rows) > > postgres=# analyze a; > > postgres=# explain select sum(count) from a group by date_trunc('DAY',ts); > QUERY PLAN > -- > GroupAggregate (cost=3164211.55..3564212.03 rows=2024 width=16) >Group Key: (date_trunc('DAY'::text, ts)) >-> Sort (cost=3164211.55..3214211.61 rows=2024 width=12) > Sort Key: (date_trunc('DAY'::text, ts)) > -> Seq Scan on a (cost=0.00..397059.30 rows=2024 width=12) > (5 rows) > > Unsure what's happening here. > > > > James Sewell, > PostgreSQL Team Lead / Solutions Architect > __ > > > Level 2, 50 Queen St, Melbourne VIC 3000 > > *P *(+61) 3 8370 8000 *W* www.lisasoft.com *F *(+61) 3 8370 8099 > > > On Mon, Mar 14, 2016 at 1:31 PM, David Rowley < > david.row...@2ndquadrant.com> wrote: > >> On 14 March 2016 at 14:52, James Sewell >> wrote: >> > One question - how is the upper limit of workers chosen? >> >> See create_parallel_paths() in allpaths.c. Basically the bigger the >> relation (in pages) the more workers will be allocated, up until >> max_parallel_degree. >> >> There is also a comment in that function which states: >> /* >> * Limit the degree of parallelism logarithmically based on the size of the >> * relation. This probably needs to be a good deal more sophisticated, >> but we >> * need something here for now. >> */ >> >> So this will likely see some revision at some point, after 9.6. >> >> -- >> D
Re: [HACKERS] Parallel Aggregate
On Mon, Mar 14, 2016 at 3:05 PM, David Rowley wrote:
>
> Things to try:
> 1. alter table a add column ts_date date; update a set ts_date =
> date_trunc('DAY',ts); vacuum full analyze ts;
> 2. or, create index on a (date_trunc('DAY',ts)); analyze a;
> 3. or for testing, set the work_mem higher.

Ah, that makes sense. Tried with a BTREE index, and it works perfectly, but the index is 428MB - which is a bit rough. Removed that and put on a BRIN index instead: same result from a 48kB index - perfect!

Thanks for the help,

James
Re: [HACKERS] Parallel Aggregate
On Tuesday, 15 March 2016, Robert Haas wrote:
>
> > Does the cost of the aggregate function come into this calculation at
> > all? In PostGIS land, much smaller numbers of rows can generate loads
> > that would be effective to parallelize (worker time much >> than
> > startup cost).
>
> Unfortunately, no - only the table size. This is a problem, and needs
> to be fixed. However, it's probably not going to get fixed for 9.6.
> :-(

Any chance of getting a GUC (say min_parallel_degree) added to allow setting the initial value of parallel_degree, then changing the small relation check to also pass if parallel_degree > 1?

That way you could set min_parallel_degree on a query by query basis if you are running aggregates which you know will take a lot of CPU.

I suppose it wouldn't make much sense at all to set globally though, so it could just confuse matters.

Cheers,
Re: [HACKERS] Parallel Aggregate
On Tue, Mar 15, 2016 at 9:32 AM, Robert Haas wrote:
>
> I kind of doubt this would work well, but somebody could write a patch
> for it and try it out.

OK I'll give this a go today and report back.

Would the eventual plan be to use pg_proc.procost for the functions from each aggregate concerned? If so I might have a peek at that too, although I imagine I won't get far.

Cheers,
Re: [HACKERS] Choosing parallel_degree
Thanks David, Eventually it would be great to take into account the cost of the function doing the agg (pg_proc.procost, which is a multiple of CPU units). This would allow people to mark specific aggregations as needing more CPU power, therefore needing more workers per page (or should it be tuple in this case?). In the meantime some way to manually influence this would be good. I just did some testing (on an 8VCPU machine) with a 139MB table, which gets 3 workers currently. For a count(*) I get a time of 131.754 ms. If I increase this to 8 workers I get around 86.193 ms. Obviously this doesn't mean much as YMMV - but it does show that the ability to manually adjust the scaling would be great, especially in data warehouse or reporting environments. I did want to test with some really slow aggs, but even when I take out the small table test in create_parallel_paths I can't seem to get a parallel plan for a tiny table. Any idea on why this would be David? Cheers, James Sewell, PostgreSQL Team Lead / Solutions Architect __ Level 2, 50 Queen St, Melbourne VIC 3000 *P *(+61) 3 8370 8000 *W* www.lisasoft.com *F *(+61) 3 8370 8099 On Tue, Mar 15, 2016 at 12:25 PM, David Rowley wrote: > Over in [1] James mentioned about wanting more to be able to have more > influence over the partial path's parallel_degree decision. At risk > of a discussion on that hijacking the parallel aggregate thread, I > thought I'd start this for anyone who would want to discuss making > changes to that. > > I've attached a simple C program which shows the parallel_degree which > will be chosen at the moment. For now it's based on the size of the > base relation. Perhaps that will need to be rethought later, perhaps > based on costs. But I just don't think it's something for 9.6. > > Here's the output of the C program. 
>
> For 1 pages there will be 1 workers (rel size 0 MB, 0 GB)
> For 3001 pages there will be 2 workers (rel size 23 MB, 0 GB)
> For 9001 pages there will be 3 workers (rel size 70 MB, 0 GB)
> For 27001 pages there will be 4 workers (rel size 210 MB, 0 GB)
> For 81001 pages there will be 5 workers (rel size 632 MB, 0 GB)
> For 243001 pages there will be 6 workers (rel size 1898 MB, 1 GB)
> For 729001 pages there will be 7 workers (rel size 5695 MB, 5 GB)
> For 2187001 pages there will be 8 workers (rel size 17085 MB, 16 GB)
> For 6561001 pages there will be 9 workers (rel size 51257 MB, 50 GB)
> For 19683001 pages there will be 10 workers (rel size 153773 MB, 150 GB)
> For 59049001 pages there will be 11 workers (rel size 461320 MB, 450 GB)
> For 177147001 pages there will be 12 workers (rel size 1383960 MB, 1351 GB)
> For 531441001 pages there will be 13 workers (rel size 4151882 MB, 4054 GB)
> For 1594323001 pages there will be 14 workers (rel size 12455648 MB, 12163 GB)
>
> [1]
> http://www.postgresql.org/message-id/CANkGpBtUvzpdvF2=_iq64ujmvrpycs6d4i9-wepbusq1sq+...@mail.gmail.com
>
> --
> David Rowley    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
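[Editor's note: the progression in David's table can be reproduced with a small sketch. The 1000-page starting threshold and the tripling step are inferred from the numbers above and from the "logarithmically" comment quoted earlier in the thread, not copied from the PostgreSQL source, so treat the constants as assumptions:]

```python
def parallel_degree(pages, max_parallel_degree=14, threshold=1000):
    """Add one worker each time the relation triples in size past the
    initial page threshold (constants inferred from the table above)."""
    degree = 1
    while pages > threshold * 3 and degree < max_parallel_degree:
        degree += 1
        threshold *= 3
    return degree

for pages in (1, 3001, 9001, 27001, 81001):
    print(pages, parallel_degree(pages))
# prints 1, 2, 3, 4 and 5 workers respectively, matching the table
```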
Re: [HACKERS] Choosing parallel_degree
On Wed, Mar 16, 2016 at 11:26 AM, Julien Rouhaud wrote:
>
> I'm not too familiar with parallel planning, but I tried to implement
> both in attached patch. I didn't put much effort into the
> parallel_threshold GUC documentation, because I didn't really see a good
> way to explain it. I'd be happy to improve it if needed. Also, to make
> this parameter easier to tune for users, perhaps we could divide the
> default value by 3 and use it as is in the first iteration in
> create_parallel_path() ?
>
> Also, global max_parallel_degree still needs to be at least 1 for the
> per table value to be considered.

All applies and works from my end.

Is the max_parallel_degree per table of much use here? It allows the max number of workers per table to be set - but it's still bound by the same formula (now from the GUC). So in reality it's only really useful for limiting the number of workers, not raising it.

Would there be a common use case for limiting parallelism on a subset of tables in a database you've explicitly set to have a higher amount of parallel operations via the GUC? I struggle to think of one.

I think in practicality the reverse would be more common: you'd want to set certain tables to a starting point of a certain number of workers (and ramp up to more if the formula allowed it). You could set this to 0 for never use parallel agg on this table.

Another option is to allow access to the threshold multiplier (currently hard coded to 3) per table - but this might become pretty hard to explain succinctly in the documentation.

Cheers,

James
Re: [HACKERS] Choosing parallel_degree
Hey,

I think there are definitely use cases for using parallel agg on a small table when the time for each agg operation is very high. PostGIS can be used to create many examples of low row count and table size but high CPU operations.

This does bring up an interesting point I don't quite understand though. If I run parallel agg on a table with 4 rows with 2 workers will it run on two workers (2 rows each) or will the first one grab all 4 rows?

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect
Level 2, 50 Queen St, Melbourne VIC 3000
P (+61) 3 8370 8000  W www.lisasoft.com  F (+61) 3 8370 8099

On Thu, Mar 17, 2016 at 10:30 PM, Julien Rouhaud wrote:
> On 17/03/2016 12:21, David Rowley wrote:
> > On 18 March 2016 at 00:13, Julien Rouhaud wrote:
> >> With the current threshold, you need a table bigger than 8 MB to be able
> >> to force parallel workers. I'm not sure there'll be benefits for
> >> multiple workers on a table smaller than 8 MB, since setting up all the
> >> parallel stuff takes time.
> >
> > It would be really nice if it were possible to drop the setting really
> > low, so that combined with a low parallel_setup_cost we could enable
> > parallel query on small tables in the regression test suite.
>
> Indeed. That could also be a use case for moving parallel_threshold to a
> GUC, but not sure what'd be best.
>
> --
> Julien Rouhaud
> http://dalibo.com - http://dalibo.org
Re: [HACKERS] Parallel Aggregate
Hi again, This is probably me missing something, but is there a reason parallel aggregate doesn't seem to ever create append nodes containing Index scans? SET random_page_cost TO 0.2; SET max_parallel_degree TO 8; postgres=# explain SELECT sum(count_i) FROM base GROUP BY view_time_day; QUERY PLAN - Finalize GroupAggregate (cost=310596.32..310598.03 rows=31 width=16) Group Key: view_time_day -> Sort (cost=310596.32..310596.79 rows=186 width=16) Sort Key: view_time_day -> Gather (cost=310589.00..310589.31 rows=186 width=16) Number of Workers: 5 -> Partial HashAggregate (cost=310589.00..310589.31 rows=31 width=16) Group Key: view_time_day -> Parallel Seq Scan on base (cost=0.00..280589.00 rows=600 width=12) SET max_parallel_degree TO 0; postgres=# explain SELECT sum(count_i) FROM base GROUP BY view_time_day; QUERY PLAN --- GroupAggregate (cost=0.56..600085.92 rows=31 width=16) Group Key: view_time_day -> Index Only Scan using base_view_time_day_count_i_idx on base (cost=0.56..450085.61 rows=3000 width=12) (3 rows) Cheers, James Sewell, Solutions Architect __ Level 2, 50 Queen St, Melbourne VIC 3000 *P *(+61) 3 8370 8000 *W* www.lisasoft.com *F *(+61) 3 8370 8099 On Thu, Mar 17, 2016 at 8:08 AM, David Rowley wrote: > On 17 March 2016 at 01:29, Robert Haas wrote: > > On Wed, Mar 16, 2016 at 8:19 AM, Amit Kapila > wrote: > >> Isn't it better to call it as Parallel Aggregate instead of Partial > >> Aggregate. Initialy, we have kept Partial for seqscan, but later on we > >> changed to Parallel Seq Scan, so I am not able to think why it is > better to > >> call Partial incase of Aggregates. > > > > I think partial is the right terminology. Unlike a parallel > > sequential scan, a partial aggregate isn't parallel-aware and could be > > used in contexts having nothing to do with parallelism. It's just > > that it outputs transition values instead of a finalized value. 
>
> +1 the reason the partial aggregate patches have been kept separate
> from the parallel aggregate patches is that partial aggregate will
> serve for many other purposes. Parallel Aggregate is just one of many
> possible use cases for this, so it makes little sense to give it a
> name according to a single use case.
>
> --
> David Rowley    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
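[Editor's note: Robert's distinction - partial aggregation emits transition states rather than finalized values - can be illustrated with a toy two-phase average, entirely outside PostgreSQL. The function names below are invented for the sketch; an average is used because its transition state (sum, count) carries more than the final answer:]

```python
# Each "worker" produces a transition state for its chunk of rows; only
# the finalize step turns the combined state into a finished value.
def partial_avg(rows):
    return (sum(rows), len(rows))      # transition state: (sum, count)

def combine(states):
    total = sum(s for s, _ in states)  # merge the partial states
    count = sum(n for _, n in states)
    return (total, count)

def finalize(state):
    total, count = state
    return total / count

chunks = [[1, 2, 3], [4, 5], [6]]      # rows as split across three workers
states = [partial_avg(c) for c in chunks]
print(finalize(combine(states)))       # 3.5, same as averaging all rows at once
```

Averaging each chunk and then averaging the averages would give the wrong answer; shipping transition states instead of finalized values is what makes the parallel plan correct.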
Re: [HACKERS] Choosing parallel_degree
On Thu, Mar 17, 2016 at 5:05 AM, Julien Rouhaud wrote:
>
> attached v3 drops the GUC part.

This looks good. I do think that some threshold control would be good in the long term - but you are right, Robert, it just feels strange. Maybe once the final formula is implemented in 9.7+ and this gets some real world use cases it can be revisited?

One thing I really, really like about the way the new patch works is that you can set parallel_degree on an inheritance parent, then that will set the minimum workers for all associated children (when accessing from the parent).

Currently this patch will not fire on small tables even when parallel_degree is set; can we fix this by adding a check for ref->parallel_degree to the table size condition?

Cheers,

James
Re: [HACKERS] Choosing parallel_degree
OK cool, thanks.

Can we remove the minimum size limit when the per table degree setting is applied? This would help for tables with 2 - 1000 pages combined with a high CPU cost aggregate.

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect
Level 2, 50 Queen St, Melbourne VIC 3000
P (+61) 3 8370 8000  W www.lisasoft.com  F (+61) 3 8370 8099

On Sun, Mar 20, 2016 at 11:23 PM, David Rowley wrote:
> On 18 March 2016 at 10:13, James Sewell wrote:
> > This does bring up an interesting point I don't quite understand though.
> > If I run parallel agg on a table with 4 rows with 2 workers will it run on
> > two workers (2 rows each) or will the first one grab all 4 rows?
>
> It works on a per page basis, workers just each grab the next page to
> be scanned from a page counter that sits in shared memory, the worker
> just increments the page number, releases the lock on the counter and
> scans that page.
>
> See heap_parallelscan_nextpage()
>
> So the answer to your question is probably no. At least not unless the
> page only contained 2 rows.
>
> --
> David Rowley    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
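[Editor's note: the claim-a-page-at-a-time scheme David describes can be sketched with a shared counter guarded by a lock. The class and names below are illustrative, not PostgreSQL's; the real implementation lives in shared memory across processes, not Python threads:]

```python
import threading

class ParallelScan:
    """A shared page counter: each worker atomically claims the next
    page number until the relation runs out of pages."""
    def __init__(self, npages):
        self.npages = npages
        self.next_page = 0
        self.lock = threading.Lock()

    def next(self):
        with self.lock:
            if self.next_page >= self.npages:
                return None          # scan finished
            page = self.next_page
            self.next_page += 1      # increment counter, release lock, scan page
            return page

scan = ParallelScan(4)               # a tiny 4-page relation
claimed = []

def worker():
    while (page := scan.next()) is not None:
        claimed.append(page)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(claimed))  # [0, 1, 2, 3] - every page claimed exactly once
```

Which worker gets which page is nondeterministic, which matches David's answer: with a 4-row table on a single page, one worker will likely claim the only page and do all the work.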
Re: [HACKERS] Parallel Aggregate
Good news!

On Tuesday, 22 March 2016, David Rowley wrote:
> On 22 March 2016 at 02:35, Robert Haas wrote:
> > I have committed this after changing some of the comments.
> >
> > There might still be bugs ... but I don't see them. And the speedups
> > look very impressive.
> >
> > Really nice work, David.
>
> Thanks for that, and thank you for taking the time to carefully review
> it and commit it.
>
> --
> David Rowley    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

--
James Sewell, PostgreSQL Team Lead / Solutions Architect
Level 2, 50 Queen St, Melbourne VIC 3000
P (+61) 3 8370 8000  W www.lisasoft.com  F (+61) 3 8370 8099
[HACKERS] Random note of encouragement
Hey All,

I've been doing some (futile) work trying to speed up aggregates with a group by in PostgreSQL 9.5. I installed PostgreSQL 9.6 on the same machine to see if I could get anything running in parallel when using partitioning - which didn't work.

But - I did find this. With the following setup:

CREATE TABLE base(
    view_time TIMESTAMP WITHOUT TIME ZONE,
    view_time_day TIMESTAMP WITHOUT TIME ZONE,
    count_n numeric);

INSERT INTO base
SELECT view_time, date_trunc('day', view_time), COUNT::numeric
FROM (SELECT TIMESTAMP '2015-12-01' + random() * interval '30 days' AS view_time,
             trunc(random() * 99 + 1) AS COUNT
      FROM generate_series(1,3000)) a;

analyze base;

Now when I run the following SQL (multiple times to allow for getting everything into shared buffers, which is 4GB on my machine):

select sum(count_n) from base group by view_time_day;

I get the following results:

PSQL 9.5 - ~21 seconds
PSQL 9.6 devel - ~8.5 seconds

I think that's pretty good! I know this is a devel release, things may change, blah blah. But still, something has changed for the better here!

I get the same plan on both nodes:

HashAggregate  (cost=670590.56..670590.95 rows=31 width=13)
  Group Key: view_time_day
  ->  Seq Scan on base  (cost=0.00..520590.04 rows=3104 width=13)

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect
Level 2, 50 Queen St, Melbourne VIC 3000
P (+61) 3 8370 8000  W www.lisasoft.com  F (+61) 3 8370 8099
Re: [HACKERS] Random note of encouragement
I've actually just tested this on 9.3 - and I get roughly the same as 9.6devel. Now going back to make sure my 9.5 environment is sane. Hopefully this isn't me jumping the gun.

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect
Level 2, 50 Queen St, Melbourne VIC 3000
P (+61) 3 8370 8000  W www.lisasoft.com  F (+61) 3 8370 8099

On Thu, Feb 25, 2016 at 10:50 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Thu, Feb 25, 2016 at 12:26 PM, Bruce Momjian wrote:
> > On Thu, Feb 25, 2016 at 10:06:34AM +1100, James Sewell wrote:
> >> Now when I run the following SQL (multiple times to allow for getting
> >> everything into shared buffers, which is 4GB on my machine):
> >>
> >> select sum(count_n) from base group by view_time_day;
> >>
> >> I get the following results:
> >>
> >> PSQL 9.5 - ~21 seconds
> >> PSQL 9.6 devel - ~8.5 seconds
> >>
> >> I think that's pretty good!
> >>
> >> I know this is a devel release, things may change, blah blah. But still,
> >> something has changed for the better here!
> >
> > Wow, that is cool. Can anyone suggest which commit improved this?
>
> Since it sums numerics, maybe integer transition functions from commit
> 959277a4f579da5243968c750069570a58e92b38 helped?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
Re: [HACKERS] Random note of encouragement
Argh, seems like a false alarm for now. I installed 9.5 from RPM source (the other was one I had installed previously) and the performance matched 9.6.

Sorry about that, I must have *something* screwed up on the other one.

Cheers,

James Sewell, PostgreSQL Team Lead / Solutions Architect
Level 2, 50 Queen St, Melbourne VIC 3000
P (+61) 3 8370 8000  W www.lisasoft.com  F (+61) 3 8370 8099
Re: [HACKERS] Replication vs. float timestamps is a disaster
>>>>> "TL" == Tom Lane writes: TL> The question to be asked is whether there is still anybody out there TL> using float timestamps. Gentoo's ebuild includes: $(use_enable !pg_legacytimestamp integer-datetimes) \ meaning that by default --enable-integer-datetimes is passed to configure, but if the pg_legacytimestamp use flag is set, then --disable-integer-datetimes is passed instead. They document it as: Use double precision floating-point numbers instead of 64-bit integers for timestamp storage. Ie, w/o any kind of deprecation notice. I don't know how many (how few?) add pg_legacytimestamp to USE when merging postgresql. But it is still available as of 9.6 and also with their live build of git://git.postgresql.org/git/postgresql.git. -JimC -- James Cloos OpenPGP: 0x997A9F17ED7DAEA6 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PostrgeSQL vs oracle doing 1 million sqrts am I doing it wrong?
>>>>> "ST" == Shaun Thomas writes: ST> That said, the documentation here says FLOAT4 is an alias for REAL, ST> so it's somewhat nonintuitive for FLOAT4 to be so much slower than ST> FLOAT8, which is an alias for DOUBLE PRECISION. There are some versions of glibc where doing certain math on double is faster than doing it on float, depending on how things are compiled. Maybe this is one of them? -JimC -- James Cloos OpenPGP: 0x997A9F17ED7DAEA6
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, 2014-01-13 at 19:48 -0500, Trond Myklebust wrote: > On Jan 13, 2014, at 19:03, Hannu Krosing wrote: > > > On 01/13/2014 09:53 PM, Trond Myklebust wrote: > >> On Jan 13, 2014, at 15:40, Andres Freund wrote: > >> > >>> On 2014-01-13 15:15:16 -0500, Robert Haas wrote: > >>>> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner > >>>> wrote: > >>>>> I notice, Josh, that you didn't mention the problems many people > >>>>> have run into with Transparent Huge Page defrag and with NUMA > >>>>> access. > >>>> Amen to that. Actually, I think NUMA can be (mostly?) fixed by > >>>> setting zone_reclaim_mode; is there some other problem besides that? > >>> I think that fixes some of the worst instances, but I've seen machines > >>> spending horrible amounts of CPU (& BUS) time in page reclaim > >>> nonetheless. If I analyzed it correctly it's in RAM << working set > >>> workloads where RAM is pretty large and most of it is used as page > >>> cache. The kernel ends up spending a huge percentage of time finding and > >>> potentially defragmenting pages when looking for victim buffers. > >>> > >>>> On a related note, there's also the problem of double-buffering. When > >>>> we read a page into shared_buffers, we leave a copy behind in the OS > >>>> buffers, and similarly on write-out. It's very unclear what to do > >>>> about this, since the kernel and PostgreSQL don't have intimate > >>>> knowledge of what each other are doing, but it would be nice to solve > >>>> somehow. > >>> I've wondered before if there wouldn't be a chance for postgres to say > >>> "my dear OS, that the file range 0-8192 of file x contains y, no need to > >>> reread" and do that when we evict a page from s_b but I never dared to > >>> actually propose that to kernel people... > >> O_DIRECT was specifically designed to solve the problem of double > >> buffering > >> between applications and the kernel. Why are you not able to use that in > >> these situations? 
> > What is asked is the opposite of O_DIRECT - the write from a buffer inside > > postgresql to linux *buffercache* and telling linux that it is the same > > as what > > is currently on disk, so don't bother to write it back ever. > > I don’t understand. Are we talking about mmap()ed files here? Why > would the kernel be trying to write back pages that aren’t dirty? No ... if I have it right, it's pretty awful: they want to do a read of a file into a user-provided buffer, thus obtaining a page cache entry and a copy in their userspace buffer, then insert the page of the user buffer back into the page cache as the page cache page ... that's right, isn't it, postgres people? Effectively you end up with buffered read/write that's also mapped into the page cache. It's a pretty awful way to hack around mmap. James
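As an aside, the zone_reclaim_mode mitigation quoted earlier in this thread is normally a one-line sysctl; setting it to 0 disables zone reclaim, the usual recommendation for database servers (the file path is illustrative):

```
# /etc/sysctl.d/90-postgresql.conf (path is illustrative)
vm.zone_reclaim_mode = 0
```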
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 15:39 +0100, Hannu Krosing wrote: > On 01/14/2014 09:39 AM, Claudio Freire wrote: > > On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing > > wrote: > >> Again, as said above the linux file system is doing fine. What we > >> want is a few ways to interact with it to let it do even better when > >> working with postgresql by telling it some stuff it otherwise would > >> have to second guess and by sometimes giving it back some cache > >> pages which were copied away for potential modifying but ended > >> up clean in the end. > > You don't need new interfaces. Only a slight modification of what > > fadvise DONTNEED does. > > > > This insistence in injecting pages from postgres to kernel is just a > > bad idea. > Do you think it would be possible to map copy-on-write pages > from linux cache to postgresql cache ? > > this would be a step in direction of solving the double-ram-usage > of pages which have not been read from syscache to postgresql > cache without sacrificing linux read-ahead (which I assume does > not happen when reads bypass system cache). The current mechanism for coherency between a userspace cache and the in-kernel page cache is mmap ... that's the only way you get the same page in both currently. glibc used to have an implementation of read/write in terms of mmap, so it should be possible to insert it into your current implementation without a major rewrite. The problem I think this brings you is uncontrolled writeback: you don't want dirty pages to go to disk until you issue a write() I think we could fix this with another madvise(): something like MADV_WILLUPDATE telling the page cache we expect to alter the pages again, so don't be aggressive about cleaning them. Plus all the other issues with mmap() ... but if you can detail those, we might be able to fix them. > and we can write back the copy at the point when it is safe (from > postgresql perspective) to let the system write them back ? Using MADV_WILLUPDATE, possibly ... 
you're still not going to have absolute control. The kernel will write back the pages if the dirty limits are exceeded, for instance, but we could tune it to be useful. > Do you think it is possible to make it work with good performance > for a few million 8kb pages ? > > > At the very least, it still needs postgres to know too much > > of the filesystem (block layout) to properly work. Ie: pg must be > > required to put entire filesystem-level blocks into the page cache, > > since that's how the page cache works. > I was more thinking of a simple write() interface with extra > flags/sysctls to tell kernel that "we already have this on disk" > > At the very worst, it may > > introduce serious security and reliability implications, when > > applications can destroy the consistency of the page cache (even if > > full access rights are checked, there's still the possibility this > > inconsistency might be exploitable). > If you allow write() which just writes clean pages, I can not see > where the extra security concerns are beyond what normal > write can do. The problem is we can't give you absolute control of when pages are written back because that interface can be used to DoS the system: once we get too many dirty uncleanable pages, we'll thrash looking for memory and the system will livelock. James
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote: > On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley > wrote: > > No, I'm sorry, that's never going to be possible. No user space > > application has all the facts. If we give you an interface to force > > unconditional holding of dirty pages in core you'll livelock the system > > eventually because you made a wrong decision to hold too many dirty > > pages. I don't understand why this has to be absolute: if you advise > > us to hold the pages dirty and we do up until it becomes a choice to > > hold on to the pages or to thrash the system into a livelock, why would > > you ever choose the latter? And if, as I'm assuming, you never would, > > why don't you want the kernel to make that choice for you? > > If you don't understand how write-ahead logging works, this > conversation is going nowhere. Suffice it to say that the word > "ahead" is not optional. No, I do ... you mean the order of write out, if we have to do it, is important. In the rest of the kernel, we do this with barriers, which cause ordered grouping of I/O chunks. If we could force a similar ordering in the writeout code, is that enough? James
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote: > On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas wrote: > > > > In terms of avoiding double-buffering, here's my thought after reading > > what's been written so far. Suppose we read a page into our buffer > > pool. Until the page is clean, it would be ideal for the mapping to > > be shared between the buffer cache and our pool, sort of like > > copy-on-write. That way, if we decide to evict the page, it will > > still be in the OS cache if we end up needing it again (remember, the > > OS cache is typically much larger than our buffer pool). But if the > > page is dirtied, then instead of copying it, just have the buffer pool > > forget about it, because at that point we know we're going to write > > the page back out anyway before evicting it. > > > > This would be pretty similar to copy-on-write, except without the > > copying. It would just be forget-from-the-buffer-pool-on-write. > > > But... either copy-on-write or forget-on-write needs a page fault, and > thus a page mapping. > > Is a page fault more expensive than copying 8k? > > (I really don't know). A page fault can be expensive, yes ... but perhaps you don't need one. What you want is a range of memory that's read from a file but treated as anonymous for writeout (i.e. written to swap if we need to reclaim it). Then at some time later, you want to designate it as written back to the file instead so you control the writeout order. I'm not sure we can do this: the separation between file backed and anonymous pages is pretty deeply ingrained into the OS, but if it were possible, is that what you want? James -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote: > James Bottomley writes: > > The current mechanism for coherency between a userspace cache and the > > in-kernel page cache is mmap ... that's the only way you get the same > > page in both currently. > > Right. > > > glibc used to have an implementation of read/write in terms of mmap, so > > it should be possible to insert it into your current implementation > > without a major rewrite. The problem I think this brings you is > > uncontrolled writeback: you don't want dirty pages to go to disk until > > you issue a write() > > Exactly. > > > I think we could fix this with another madvise(): > > something like MADV_WILLUPDATE telling the page cache we expect to alter > > the pages again, so don't be aggressive about cleaning them. > > "Don't be aggressive" isn't good enough. The prohibition on early write > has to be absolute, because writing a dirty page before we've done > whatever else we need to do results in a corrupt database. It has to > be treated like a write barrier. > > > The problem is we can't give you absolute control of when pages are > > written back because that interface can be used to DoS the system: once > > we get too many dirty uncleanable pages, we'll thrash looking for memory > > and the system will livelock. > > Understood, but that makes this direction a dead end. We can't use > it if the kernel might decide to write anyway. No, I'm sorry, that's never going to be possible. No user space application has all the facts. If we give you an interface to force unconditional holding of dirty pages in core you'll livelock the system eventually because you made a wrong decision to hold too many dirty pages. I don't understand why this has to be absolute: if you advise us to hold the pages dirty and we do up until it becomes a choice to hold on to the pages or to thrash the system into a livelock, why would you ever choose the latter? 
And if, as I'm assuming, you never would, why don't you want the kernel to make that choice for you? James
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 12:39 -0500, Robert Haas wrote: > On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley > wrote: > > On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote: > >> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas wrote: > >> > In terms of avoiding double-buffering, here's my thought after reading > >> > what's been written so far. Suppose we read a page into our buffer > >> > pool. Until the page is clean, it would be ideal for the mapping to > >> > be shared between the buffer cache and our pool, sort of like > >> > copy-on-write. That way, if we decide to evict the page, it will > >> > still be in the OS cache if we end up needing it again (remember, the > >> > OS cache is typically much larger than our buffer pool). But if the > >> > page is dirtied, then instead of copying it, just have the buffer pool > >> > forget about it, because at that point we know we're going to write > >> > the page back out anyway before evicting it. > >> > > >> > This would be pretty similar to copy-on-write, except without the > >> > copying. It would just be forget-from-the-buffer-pool-on-write. > >> > >> But... either copy-on-write or forget-on-write needs a page fault, and > >> thus a page mapping. > >> > >> Is a page fault more expensive than copying 8k? > >> > >> (I really don't know). > > > > A page fault can be expensive, yes ... but perhaps you don't need one. > > > > What you want is a range of memory that's read from a file but treated > > as anonymous for writeout (i.e. written to swap if we need to reclaim > > it). Then at some time later, you want to designate it as written back > > to the file instead so you control the writeout order. I'm not sure we > > can do this: the separation between file backed and anonymous pages is > > pretty deeply ingrained into the OS, but if it were possible, is that > > what you want? > > Doesn't sound exactly like what I had in mind. 
What I was suggesting > is an analogue of read() that, if it reads full pages of data to a > page-aligned address, shares the data with the buffer cache until it's > first written instead of actually copying the data. The only way to make this happen is mmap the file to the buffer and use MADV_WILLNEED. > The pages are > write-protected so that an attempt to write the address range causes a > page fault. In response to such a fault, the pages become anonymous > memory and the buffer cache no longer holds a reference to the page. OK, so here I thought of another madvise() call to switch the region to anonymous memory. A page fault works too, of course, it's just that one per page in the mapping will be expensive. Do you care about handling aliases ... what happens if someone else reads from the file, or will that never occur? The reason for asking is that it's much easier if someone else mmapping the file gets your anonymous memory than we create an alias in the page cache. James -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 15:09 -0500, Robert Haas wrote: > On Tue, Jan 14, 2014 at 3:00 PM, James Bottomley > wrote: > >> Doesn't sound exactly like what I had in mind. What I was suggesting > >> is an analogue of read() that, if it reads full pages of data to a > >> page-aligned address, shares the data with the buffer cache until it's > >> first written instead of actually copying the data. > > > > The only way to make this happen is mmap the file to the buffer and use > > MADV_WILLNEED. > > > >> The pages are > >> write-protected so that an attempt to write the address range causes a > >> page fault. In response to such a fault, the pages become anonymous > >> memory and the buffer cache no longer holds a reference to the page. > > > > OK, so here I thought of another madvise() call to switch the region to > > anonymous memory. A page fault works too, of course, it's just that one > > per page in the mapping will be expensive. > > I don't think either of these ideas works for us. We start by > creating a chunk of shared memory that all processes (we do not use > threads) will have mapped at a common address, and we read() and > write() into that chunk. Yes, that's what I was thinking: it's a cache. About how many files comprise this cache? Are you thinking it's too difficult for every process to map the files? > > Do you care about handling aliases ... what happens if someone else > > reads from the file, or will that never occur? The reason for asking is > > that it's much easier if someone else mmapping the file gets your > > anonymous memory than we create an alias in the page cache. > > All reads and writes go through the buffer pool stored in shared > memory, but any of the processes that have that shared memory region > mapped could be responsible for any individual I/O request. That seems to be possible with the abstraction. 
The initial mapping gets the file-backed pages: you can do madvise to read them (using readahead), flush them (using wontneed) and flip them to anonymous (using something TBD). Since it's a shared mapping API based on the file, any of the mapping processes can do any operation. Future mappers of the file get the mix of real and anon memory, so it's truly shared. Given that you want to use this as a shared cache, it seems that the API to flip back from anon to file mapped is wontneed. That would also trigger writeback of any dirty pages in the previously anon region ... which you could force with msync. As far as I can see, this is identical to read/write on a shared region with the exception that you don't need to copy in and out of the page cache. From our point of view, the implementation is nice because the pages effectively never leave the page cache. We just use an extra per-page flag (which I'll get shot for suggesting) to alter the writeout path (which is where the complexity that may kill the implementation lies). James
[HACKERS] PostgreSQL Failback without rebuild
Hello All, I have been reading through some of the recent discussions about failback when in a streaming replication setup. I define failback as: 1. Node A is master, Node B is slave 2. Node A crashes || Node A is stopped || nothing happens 3. Promote Node B to Master 4. Attach Node A as slave My understanding is that currently, to achieve step four, you need to take a base backup of Node B and deploy it to Node A before starting streaming replication (or use rsync etc.). This is very undesirable for many users, especially if they have a very large database. From the discussions I can see that the problem is to do with Node A writing changes to disk that are not streamed to Node B before Node A crashes. Has there been any consensus on this issue? Are there any solutions which might make it into 9.4 or 9.5? I've seen some proposals and a tool (pg_rewind), but all seem to have drawbacks. I've been looking mainly at these threads: http://www.postgresql.org/message-id/CAF8Q-Gy7xa60HwXc0MKajjkWFEbFDWTG=ggyu1kmt+s2xcq...@mail.gmail.com http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com http://www.postgresql.org/message-id/519df910.4020...@vmware.com Cheers, James Sewell, PostgreSQL Team Lead / Solutions Architect
Re: [HACKERS] PostgreSQL Failback without rebuild
I've just noticed that on PostgreSQL 9.3 I can do the following with a master node A and a slave node B (as long as I have set recovery_target_timeline = 'latest'): 1. Stop Node A 2. Promote Node B 3. Attach Node A as slave This is sufficient for my needs (I know it doesn't cover a crash), can anyone see any potential problems with this approach? Cheers, James Sewell, PostgreSQL Team Lead / Solutions Architect __ Level 2, 50 Queen St, Melbourne VIC 3000 *P *(+61) 3 8370 8000 *W* www.lisasoft.com *F *(+61) 3 8370 8099 On Wed, Feb 5, 2014 at 6:03 PM, Michael Paquier wrote: > On Wed, Feb 5, 2014 at 3:14 PM, Amit Kapila > wrote: > > On Wed, Feb 5, 2014 at 10:30 AM, James Sewell > > >> I've seen some proposals and a tool (pg_rewind), but all seem to have > draw > >> backs. > > > > As far as I remember, one of the main drawbacks for pg_rewind was > related to > > hint bits which can be avoided by wal_log_hints. pg_rewind is not part of > > core > > PostgreSQL code, however if you wish, you can try that tool to see if > can it > > solve your purpose. > For 9.3, pg_rewind is only safe with page checksums enabled. For 9.4, > yes wal_log_hints or checksums is mandatory. The code contains as well > some safety checks as well to ensure that a node not using those > parameters cannot be rewinded. > Regards, > -- > Michael > -- -- The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. Please advise the sender of your incorrect receipt of this correspondence.
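Reattaching node A as a slave (step 3) amounts to giving it a recovery.conf that follows the new master's timeline; a minimal sketch (hostname and user are placeholders):

```
# recovery.conf on node A (9.3-era syntax; these settings moved
# into postgresql.conf in PostgreSQL 12)
standby_mode = 'on'
primary_conninfo = 'host=nodeB port=5432 user=replicator'
recovery_target_timeline = 'latest'
```

The recovery_target_timeline = 'latest' line is what lets node A follow the timeline switch that node B's promotion created.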
Re: [HACKERS] PostgreSQL Failback without rebuild
Node A could get ahead even if it has been shut down cleanly BEFORE the promotion? I'd always assumed that if I shut down the master, the slave would be at the same point after shutdown - is this incorrect? Cheers, James Sewell, PostgreSQL Team Lead / Solutions Architect On Fri, Feb 7, 2014 at 4:58 PM, Michael Paquier wrote: > > > > On Fri, Feb 7, 2014 at 1:57 PM, James Sewell wrote: > >> I've just noticed that on PostgreSQL 9.3 I can do the following with a >> master node A and a slave node B (as long as I have set >> recovery_target_timeline = 'latest'): >> >>1. Stop Node A >>2. Promote Node B >>3. Attach Node A as slave >> >> This is sufficient for my needs (I know it doesn't cover a crash), can >> anyone see any potential problems with this approach? >> > Yes, node A could get ahead of the point where WAL forked when promoting > B. In this case you cannot reconnect A to B, and need to actually recreate > a node from a fresh base backup, or rewind it. pg_rewind targets the > latter, postgres core is able to do the former, and depending on things > like your environment and/or the size of your server, you might prefer one > or the other. > Regards, > -- > Michael >
Re: [HACKERS] pg_dump and schema names
I was under the impression that every command that references a relation makes use of the search path, regardless of what it is *doing*. Maybe this is different in older versions though? I actually ran into this recently and had to remove all the xx. schema components using vi before I could run the dump to move to a different schema. Just to be sure: postgres=# create schema test; CREATE SCHEMA postgres=# set search_path to test; SET postgres=# create table test1(id serial); NOTICE: CREATE TABLE will create implicit sequence "test1_id_seq" for serial column "test1.id" CREATE TABLE postgres=# alter table test1 owner to postgres; ALTER TABLE Cheers, James Sewell Solutions Architect _ [image: http://www.lisasoft.com/sites/lisasoft/files/u1/2013hieghtslogan_0.png] Level 2, 50 Queen St, Melbourne, VIC, 3000 P: 03 8370 8000 F: 03 8370 8099 W: www.lisasoft.com On Fri, Aug 9, 2013 at 2:04 PM, Bruce Momjian wrote: > pg_dump goes to great lengths not to hard-code the schema name into > commands like CREATE TABLE, instead setting the search_path before > creating the table; these commands: > > CREATE SCHEMA xx; > CREATE TABLE xx.test(x int); > > generates this output: > > SET search_path = xx, pg_catalog; > CREATE TABLE test ( > x integer > ); > > If you dump a schema and want to reload it into another schema, you > should only need to update that one search_path line. However, later in > the dump file, we hardcode the schema name for setting the object owner: > > ALTER TABLE xx.test OWNER TO postgres; > > Could we use search_path here to avoid the schema designation? I am not > sure this possible because while CREATE defaults to the first existing > schema in the search_path, I am not sure ALTER has the same behavior > because you are not _creating_ something with ALTER. > > -- > Bruce Momjian http://momjian.us > EnterpriseDB http://enterprisedb.com > > + It's impossible for everything to be true. 
+ > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
[HACKERS] PSQL return coder
Hello, I am using psql to run SQL from a file with the -f flag as follows: BEGIN SQL SQL ... END This gives me rollback on error and a nicer output than -1. This works fine. My question is: in a rollback scenario, is it possible to get psql to return a non-zero exit status? Cheers, James -- James Sewell, PostgreSQL Team Lead / Solutions Architect
Re: [HACKERS] PSQL return coder
I was avoiding ON_ERROR_STOP because I was using ON_ERROR_ROLLBACK, but have just realised that if I encase my SQL in a transaction then rollback will still happen. Perfect! James Sewell, PostgreSQL Team Lead / Solutions Architect __ Level 2, 50 Queen St, Melbourne VIC 3000 *P *(+61) 3 8370 8000 * **W* www.lisasoft.com *F *(+61) 3 8370 8099 On Fri, Oct 11, 2013 at 12:25 AM, Merlin Moncure wrote: > On Thu, Oct 10, 2013 at 1:52 AM, Tom Lane wrote: > > James Sewell writes: > >> My question is in a rollback scenario is it possible to get PSQL to > return > >> a non 0 exit status? > > > > Maybe you could use -c instead of -f? > > > > $ psql -c 'select 1; select 1/0' regression > > ERROR: division by zero > > $ echo $? > > 1 > > > > You won't need explicit BEGIN/END because this is already a single > > transaction. > > According to the man page, > "EXIT STATUS >psql returns 0 to the shell if it finished normally, 1 if a fatal > error >of its own (out of memory, file not found) occurs, 2 if the > connection >to the server went bad and the session was not interactive, and 3 > if an >error occurred in a script and the variable ON_ERROR_STOP was set." > > So for a longer script ON_ERROR_STOP might be the ticket (which is > usually a good idea anyways). > > merlin > -- -- The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. Please advise the sender of your incorrect receipt of this correspondence.
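Combining the two suggestions from this thread, a single invocation can get both behaviours: --single-transaction wraps the file in BEGIN/COMMIT (so any error rolls everything back), and ON_ERROR_STOP makes psql exit with status 3 on a script error. An untested sketch (file and database names are placeholders):

```shell
# --single-transaction wraps script.sql in BEGIN/COMMIT, so an error
# rolls everything back; ON_ERROR_STOP makes psql exit non-zero.
psql --set ON_ERROR_STOP=on --single-transaction -f script.sql mydb
status=$?   # 0 on success, 3 on a script error per the psql man page
echo "psql exited with $status"
```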
Re: [HACKERS] [PATCH] Add an ldapoption to disable chasing LDAP referrals
Hey All, I had missed these emails, sorry. The search+bind mode issue is one of documentation location; I have fixed it by moving the section to the applied-to-both list. As the patch is to do with the post-auth response this is correct. As for the issue when something other than 0 or 1 is set, I am happy to throw an error (although this doesn't seem to be how options such as LDAPTLS work: 1 if 1 else 0). I assume I would use the ereport() function to do this (using the second example from this page http://www.postgresql.org/docs/9.2/static/error-message-reporting.html)? Cheers, James On Thu, Sep 19, 2013 at 12:56 AM, Peter Eisentraut wrote: > On 7/8/13 9:33 PM, James Sewell wrote: > > New patch attached. I've moved from using a boolean to an enum trivalue. > > When ldapreferrals is set to something other than "0" or "1" exactly, it > just ignores the option. That's not good, I think. It should probably > be an error. >
[HACKERS] Streaming Replication clusters and load balancing
Hello all, I have recently been working with PostgreSQL and HAProxy to provide seamless load balancing to a group of database servers. This on its own isn't a hard thing: I have an implementation finished and am now thinking about the best way to bring it to a production-ready state which could be used by others, and used in load-balancers other than HAProxy with minimal config changes. My initial requirements were: given a group of PostgreSQL servers, check each server every x seconds and:

- Allow read/write access only to the master server (via IPA / portA)
- Disallow access if there are multiple master servers
- Allow read access to all servers (via IPB / portB) as long as the following holds:
  - They are attached to the current master server via streaming replication (or they are the current master server)
  - They can currently contact the master server (safest option, disallow all access when master-less)
  - They are in the same timeline as the master server (do I need this check?)
  - The master server reports that they have less than x bytes of lag

HAProxy can talk to PostgreSQL for a health check via TCP or PSQL (connection check only). Neither of these allows the logic above - therefore this logic has to be hosted outside of HAProxy. This might change in the future if HAProxy gets the ability to send SQL statements (like an F5 can). Today the best way to provide this information to HAProxy (and many other load balancers, application frameworks, proxies, monitoring solutions) is via HTTP, where HTTP 200 is pass the check and HTTP 503 is fail the check (and don't load balance to this node). In my case I have a script which accepts HTTP requests to /read to check if this node is available for read only, and /write which checks if this node is available for read/writes.
The options as I see them are:

- Implement a script / small app which connects to PostgreSQL and executes these checks
  - Implemented and proven today at many independent sites
  - Should it run on HAProxy server or PSQL server?
  - Integrated HTTP server or x.inetd script?
  - Platform independence?
  - What if it dies?
- Implement a custom PostgreSQL BGworker which provides this information over HTTP
  - No outside of PostgreSQL config needed
  - No reliance on another daemon / interface being up
  - libmicrohttpd or similar should help with platform independence
  - Security / acceptance by community?
  - Only newer versions of PostgreSQL
- Spend the time working on getting SQL checks into HAProxy
  - What about other platforms which only support HTTP?

I think all of the options would benefit from a PSQL extension which does the following:

- Encapsulates the check logic (easier to upgrade, manipulate)
- Stores historic check data for a number of hours / days / months
- Stores defaults (override via HTTP could be possible for things like lag)

Does anyone else have any thoughts on this topic? Eventually many cool features could flow out of this kind of work:

- Integration with High Availability products - I have this working with EnterpriseDB EFM now.
  - Locate the current master using the HA product - more than one master doesn't cause loss of service as long as HA state is sane
  - Locate all clustered standby servers using the HA product
    - if a standby is removed from the HA cluster, it is removed from load balancing
    - if a standby is not part of the cluster, it is removed from load balancing (even if it is part of streaming replication)
- HTTP replication status requests which facilitate dynamically managing HAProxy (or other) PostgreSQL server pools
  - Add a node to streaming replication, it automatically shows up in the pool and starts being checked to see if it can service reads
  - Great for cloud scale out
- Allocation of additional load balancer groups based on some criteria (?), for example
  - read/write (as above)
  - read only (as above)
  - data warehouse (reporting reads only)
  - DR (replica with no reads or writes - until it becomes a master)

Keen to hear comments. Cheers, James Sewell, Solutions Architect __ Level 2, 50 Queen St, Melbourne VIC 3000 *P *(+61) 3 8370 8000 *W* www.lisasoft.com *F *(+61) 3 8370 8099
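The /read and /write decision rules described in this proposal can be sketched as a pure function of replication state. The field names, the `NodeState` type, and the 16 MB default lag threshold below are all illustrative assumptions, not part of any existing tool; a real checker would populate them from pg_stat_replication and pg_is_in_recovery():

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    is_master: bool
    masters_visible: int      # how many masters the checker can currently see
    attached_to_master: bool  # streaming from (or is) the current master
    same_timeline: bool
    lag_bytes: int

def write_check(s: NodeState) -> int:
    """HTTP 200 only for the single master; 503 otherwise."""
    if s.is_master and s.masters_visible == 1:
        return 200
    return 503

def read_check(s: NodeState, max_lag: int = 16 * 1024 * 1024) -> int:
    """HTTP 200 for the master or any attached, same-timeline, low-lag standby."""
    if s.masters_visible != 1:
        return 503                      # master-less or multi-master: refuse all
    if not s.is_master:
        if not (s.attached_to_master and s.same_timeline):
            return 503
        if s.lag_bytes > max_lag:
            return 503
    return 200
```

An HTTP front end (script, inetd wrapper, or BGworker, per the options above) would simply map /write to write_check and /read to read_check and emit the returned status code.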
[HACKERS] [PATCH] Add an ldapoption to disable chasing LDAP referrals
Hey All, This patch request grew from this post (of mine) to pgsql-general: http://www.postgresql.org/message-id/cabuevezouae-g1_oejagujjmem675dnystwybp4d_wz6om+...@mail.gmail.com The patch adds another available LDAP option (ldapnochaseref) for search+bind mode in the pg_hba.conf file. If set to 1 (0 is the default) then it performs an ldap_set_option call which disables chasing of any LDAP references which are returned as part of the search LDIF. I can think of two use cases for this:

1. (the case which spawned my email) A valid search is performed, but for some reason a "ref:" with a non-responsive LDAP server is returned as well, which causes the authentication to time out (could be intermittent if DNS round robin or similar is used and some of the LDAP servers are not functioning / a packet-dropping firewall is in the way).

2. (a case I found when testing with AD) A valid search is performed and 6 "ref:" entries are returned, which all must be chased before authentication can succeed. Setting ldapnochaseref speeds up authentication with no negative cost (assuming you understand your LDAP schema).

I think it's worth noting that this setting seems to be the default for ldapsearch on Linux these days. Hopefully I found all the documentation that I was meant to update, let me know if not though. Cheers, James Sewell PostgreSQL Team Lead / Solutions Architect _ [image: http://www.lisasoft.com/sites/lisasoft/files/u1/2013hieghtslogan_0.png] Level 2, 50 Queen St, Melbourne, VIC, 3000 P: 03 8370 8000 F: 03 8370 8099 W: www.lisasoft.com
<> pgsql_ldapnochaseref_v1.diff Description: Binary data
Re: [HACKERS] [PATCH] Add an ldapoption to disable chasing LDAP referrals
Hey Peter, You are correct, it is the same as the referrals option in pam_ldap. It's also the -C (sometimes -R - it seems ldapsearch options are pretty non-standard) option in ldapsearch. As far as I'm aware you can't pass this in an LDAP URL, primarily because this never gets sent to the LDAP server. The server always returns an LDIF with inline references, this just determines if you chase them client side or just list them as is. I could be missing something here, but using: ldapreferrals={0|1} Would require a three state type, as we need a way of not interfering with the library defaults? To 'enable' the new behavior here using a boolean you would need to set ldapreferrals=false - which with the normal way of dealing with config booleans would alter the default behavior if the option was not specified. How do you feel about: ldapdisablereferrals=(0|1) Cheers, James Sewell James Sewell PostgreSQL Team Lead / Solutions Architect _ [image: http://www.lisasoft.com/sites/lisasoft/files/u1/2013hieghtslogan_0.png] Level 2, 50 Queen St, Melbourne, VIC, 3000 P: 03 8370 8000 F: 03 8370 8099 W: www.lisasoft.com On Tue, Jul 2, 2013 at 10:46 PM, Peter Eisentraut wrote: > On 7/2/13 12:20 AM, James Sewell wrote: > > Hey All, > > > > This patch request grew from this post (of mine) to pgsql-general: > > > > > http://www.postgresql.org/message-id/cabuevezouae-g1_oejagujjmem675dnystwybp4d_wz6om+...@mail.gmail.com > > > > The patch adds another available LDAP option (ldapnochaseref) for > > search+bind mode in the pg_hba.conf fil. If set to 1 (0 is default) then > > it performs a ldap_set_option which disables chasing of any LDAP > > references which are returned as part of the search LDIF. > > This appears to be the same as the "referrals" option in pam_ldap > (http://linux.die.net/man/5/pam_ldap). So it seems legitimate. > > For consistency, I would name the option ldapreferrals={0|1}. I prefer > avoiding double negatives. 
> > Do you know of a standard way to represent this option in an LDAP URL, > perhaps as an extension? > >
Re: [HACKERS] [PATCH] Add an ldapoption to disable chasing LDAP referrals
Heya, I see what you are saying, the problem as I see it is that the action we are taking here is "disable chasing ldap referrals". If the name is ldapreferrals and we use a boolean then setting it to 1 reads in a counter intuitive manner: "set ldapreferals=true to disable chasing LDAP referrals." Perhaps you are fine with this though if it's documented? It does work in the inverse way to pam_ldap, where setting to true enables referral chasing. pam_ldap works like so: not set : library default set to 0 : disable referral chasing set to 1 : enable referral chasing The other option would be to have the default value (of the parameter) be true and set the boolean to false to disable it. I can't find any other examples of this though - I assume having a one off like this in the code is a bad thing also? I'm happy to let you guys decide. Cheers, James James Sewell PostgreSQL Team Lead / Solutions Architect _ [image: http://www.lisasoft.com/sites/lisasoft/files/u1/2013hieghtslogan_0.png] Level 2, 50 Queen St, Melbourne, VIC, 3000 P: 03 8370 8000 F: 03 8370 8099 W: www.lisasoft.com On Wed, Jul 3, 2013 at 6:12 PM, Magnus Hagander wrote: > > On Wed, Jul 3, 2013 at 3:04 AM, James Sewell wrote: > >> Hey Peter, >> >> You are correct, it is the same as the referrals option in pam_ldap. >> It's also the -C (sometimes -R - it seems ldapsearch options are pretty >> non-standard) option in ldapsearch. >> >> As far as I'm aware you can't pass this in an LDAP URL, primarily because >> this never gets sent to the LDAP server. The server always returns an LDIF >> with inline references, this just determines if you chase them client side >> or just list them as is. >> >> I could be missing something here, but using: >> >> ldapreferrals={0|1} >> >> Would require a three state type, as we need a way of not interfering >> with the library defaults? 
To 'enable' the new behavior here using a >> boolean you would need to set ldapreferrals=false - which with the normal >> way of dealing with config booleans would alter the default behavior if the >> option was not specified. >> >> How do you feel about: >> >> ldapdisablereferrals=(0|1) >> >> > I agree with Peter that the negative thing is bad. l don't see the > problem, really. If you don't specify it, you rely on library defaults. If > you do specify it, we lock it to that setting. I don't see the need to > specifically have a setting to rely on library defaults - just remove it > from the line and you get that. > > -- > Magnus Hagander > Me: http://www.hagander.net/ > Work: http://www.redpill-linpro.com/ > -- -- The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. Please advise the sender of your incorrect receipt of this correspondence. <>
Re: [HACKERS] [PATCH] Add an ldapoption to disable chasing LDAP referrals
Hey, New patch attached. I've moved from using a boolean to an enum trivalue. Let me know what you think. Cheers, James James Sewell PostgreSQL Team Lead / Solutions Architect _ [image: http://www.lisasoft.com/sites/lisasoft/files/u1/2013hieghtslogan_0.png] Level 2, 50 Queen St, Melbourne, VIC, 3000 P: 03 8370 8000 F: 03 8370 8099 W: www.lisasoft.com On Thu, Jul 4, 2013 at 8:23 PM, Magnus Hagander wrote: > > On Thu, Jul 4, 2013 at 2:30 AM, James Sewell wrote: > >> Heya, >> >> I see what you are saying, the problem as I see it is that the action we >> are taking here is "disable chasing ldap referrals". If the name is >> ldapreferrals and we use a boolean then setting it to 1 reads in a counter >> intuitive manner: >> > > That assumes that the default in the ldap library is always going to be to > chase them. Does the standard somehow mandate that it should be? > > > "set ldapreferals=true to disable chasing LDAP referrals." >> > > You'd obviously reverse the meaning as well. "set ldapreferals=false to > disable chasing LDAP referrals." > > > Perhaps you are fine with this though if it's documented? It does work in >> the inverse way to pam_ldap, where setting to true enables referral >> chasing. pam_ldap works like so: >> >> not set : library default >> set to 0 : disable referral chasing >> set to 1 : enable referral chasing >> >> > That is exactly what I'm suggesting it should do, and I'm pretty sure > that's what Peter suggested as well. > > > > -- > Magnus Hagander > Me: http://www.hagander.net/ > Work: http://www.redpill-linpro.com/ > -- -- The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. Please advise the sender of your incorrect receipt of this correspondence. 
<> pgsql_ldapnochaseref_v1.2.diff Description: Binary data
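The tri-valued semantics the thread converged on — option absent means leave the LDAP library default alone, "0"/"1" force the behaviour, anything else is an error — can be sketched like this. The names below are illustrative, not the patch's actual C code:

```python
from enum import Enum

class Referrals(Enum):
    DEFAULT = "default"   # option not present: never call ldap_set_option
    OFF = "0"             # force referral chasing off
    ON = "1"              # force referral chasing on

def parse_ldapreferrals(value):
    """Parse the pg_hba.conf option value (None when the option is absent)."""
    if value is None:
        return Referrals.DEFAULT
    if value == "0":
        return Referrals.OFF
    if value == "1":
        return Referrals.ON
    # Per Peter's review, silently ignoring other values would be wrong.
    raise ValueError(f"invalid value for ldapreferrals: {value!r}")
```

The key design point is that DEFAULT is distinct from OFF: only the explicit values translate into an ldap_set_option(LDAP_OPT_REFERRALS, ...) call.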
Re: [HACKERS] Rounding to even for numeric data type
>>>>> "MP" == Michael Paquier writes: MP> So, attached is a patch that does 1) and 2) to make clear to the MP> user how numeric and double precision behave regarding rounding. MP> I am adding it to CF 2015-06 to keep track of it... Given that the examples show -2.5 rounds to -3, the IEEE term is roundTiesToAway, and the typical conversational English is round ties away from zero. RoundUp means towards +Infinity. 754 specifies that for decimal, either roundTiesToEven or roundTiesToAway are acceptable defaults, and which of the two applies is language dependent. Does ANSI SQL say anything about how numeric should round? In general, for decimals (or anything other than binary), there are twelve possible roundings: ToEven ToOdd AwayFromZero ToZero Up Down TiesToEven TiesToOdd TiesAwayFromZero TiesToZero TiesUp TiesDown (Up is the same as ceil(3), Down as floor(3).) -JimC -- James Cloos OpenPGP: 0x997A9F17ED7DAEA6
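The roundTiesToEven vs roundTiesToAway distinction above can be demonstrated with Python's decimal module, where ROUND_HALF_EVEN and ROUND_HALF_UP implement exactly those two tie-breaking rules (ROUND_HALF_UP sends ties away from zero, so -2.5 goes to -3):

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def rnd(x, mode):
    """Round a decimal string to an integer under the given tie-breaking mode."""
    return Decimal(x).quantize(Decimal("1"), rounding=mode)

print(rnd("-2.5", ROUND_HALF_UP))    # -3 : ties away from zero (roundTiesToAway)
print(rnd("-2.5", ROUND_HALF_EVEN))  # -2 : ties to even (roundTiesToEven)
print(rnd("2.5", ROUND_HALF_UP))     # 3
print(rnd("2.5", ROUND_HALF_EVEN))   # 2  : 2 is the even neighbour
```

Non-tie cases round identically under both modes; only exact halves differ, which is why the -2.5 example in the patch pins down which rule numeric uses.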
[HACKERS] ILIKE vs indices
While tuning an application, I found the posts from 2003 recommending the use of LOWER() and LIKE in place of ILIKE to take advantage of indices. For this app, given the limitations of the upper-layer protocol it must support, that change replaced about 30 minutes of repeated seq scans with about 1 minute of repeated index scans! On a query-set often repeated several times per day. (Probably more times per day now.) Is there any contraindication to recasting: foo ILIKE 'bar' into: LOWER(foo) LIKE LOWER('bar') and documenting that an index has to be on LOWER(column) to benefit ILIKE? Perhaps the parser could read the former as the latter? -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6
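A rough model of the proposed equivalence, in Python terms (LIKE translated to an anchored regex; this deliberately ignores the locale issues raised later in the thread, where lowercasing both sides and case-insensitive matching can diverge):

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (% and _ wildcards) to an anchored regex."""
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "^" + "".join(out) + "$"

def ilike(text, pattern):
    """Model of: text ILIKE pattern."""
    return re.match(like_to_regex(pattern), text, re.IGNORECASE) is not None

def lower_like(text, pattern):
    """Model of: LOWER(text) LIKE LOWER(pattern)."""
    return re.match(like_to_regex(pattern.lower()), text.lower()) is not None

# For ASCII data the two forms agree; the rewrite only wins because an
# index on LOWER(column) can then serve the second form.
for t, p in [("FooBar", "foo%"), ("BAR", "b_r"), ("baz", "foo%")]:
    assert ilike(t, p) == lower_like(t, p)
```

The point of the rewrite is not semantics but plannability: LOWER(foo) is an indexable expression, while ILIKE's case folding happens inside the operator where an index on foo cannot help.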
Re: [HACKERS] ILIKE vs indices
>>>>> "TL" == Tom Lane writes: JC>> Is there any contraindication to recasting: JC>> foo ILIKE 'bar' JC>> into: JC>> LOWER(foo) LIKE LOWER('bar') TL> In some locales those are not equivalent, I believe, or at least TL> shouldn't be. (What the current code actually does is a separate TL> question.) I see. After determining indexing based on the existence of an initial fixed string, excluding anything matching isalpha(), it uses tolower(3) and friends to do the actual match. So my proposal wouldn't change what matches, but might make fixing any bugs in what *should* match more difficult? TL> In any case it's not obvious why LOWER rather than UPPER. I suggested lower() because that matched all of the suggestions I found. And as it turns out matches the current behaviour, too. The footnote about adding explicit mention to the docs was expressly because it is not otherwise obvious whether indices should use lower() or upper(). I'll ask on one of the unicode lists whether there are any locales where a case-insensitive match should be different than a case-preserving match of tolower() vs tolower(). -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6
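Tom's point that lowercase-both-sides matching and true case-insensitive matching can diverge is easy to demonstrate with Unicode full case folding, which lower() does not perform:

```python
# German sharp s: uppercasing 'ß' produces 'SS', but lowercasing 'SS'
# gives 'ss', not 'ß' -- so LOWER()-equality misses the pair while a
# case-insensitive comparison (full case folding) catches it.
s = "Straße"
t = "STRASSE"
assert s.lower() != t.lower()        # 'straße' vs 'strasse'
assert s.casefold() == t.casefold()  # both fold to 'strasse'

# Turkish dotted capital I: its lowercase form is 'i' plus a combining
# dot above, so it is not equal to a plain ASCII 'i' either.
assert "İ".lower() != "i"
```

So the rewrite is safe only where simple one-to-one case mapping holds (e.g. the C locale / ASCII data); in general, which transformation ILIKE "should" use is locale-dependent, which is exactly the concern above.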
XLog vs SSD [Was: Re: [HACKERS] random write in xlog?]
>>>>> "JJ" == Jeff Janes writes: JJ> Anyway, the writes are logically sequentially, but not physically. JJ> If I remember correctly, it always writes out full blocks, even if JJ> the last part of the block has not yet been filled with new data. JJ> When the remainder gets filled, it then writes out the full block JJ> again, both the already written and the new part. What does that mean for use of a flash SSD for the xlog dir? Does the block writing mesh up well with the usage pattern a flash SSD needs to maximize lifespan? I'd love a dram ssd for pg_xlog and the journals for the other filesystems, but they cost too much. -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: XLog vs SSD [Was: Re: [HACKERS] random write in xlog?]
>>>>> "JJ" == Jeff Janes writes: JJ> So PG always writing 8K at a time is unlikely to make a difference JJ> than if it wrote a smaller amount. Ah. Somehow I was thinking of the xlog files' 16M filesize rather than the internal 8k block size. If it is only writing 8k blocks then there is probably little hope of making efficient use of flash. -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6
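The write pattern Jeff describes — full blocks only, with the partially filled tail block rewritten in place as more records arrive — can be shown with a toy model. This is an illustrative sketch of the pattern, not PostgreSQL's actual xlog code:

```python
import io

BLOCK = 8192  # WAL's internal block size

class WalWriter:
    """Toy model: records are buffered, and file writes always cover whole
    8 kB blocks. The partially filled tail block is written padded with
    zeros, then rewritten at the same offset once later records fill it --
    the repeated-overwrite pattern that worries flash wear levelling."""

    def __init__(self, f):
        self.f = f
        self.buf = bytearray()
        self.dirty_from = 0   # first block not yet written in final form

    def append(self, record):
        self.buf.extend(record)
        nblocks = (len(self.buf) + BLOCK - 1) // BLOCK
        start = self.dirty_from * BLOCK
        chunk = bytes(self.buf[start:]).ljust((nblocks - self.dirty_from) * BLOCK, b"\0")
        self.f.seek(start)
        self.f.write(chunk)            # includes rewriting the old tail block
        self.dirty_from = nblocks - 1  # tail will be rewritten on the next append
        return nblocks

w = WalWriter(io.BytesIO())
print(w.append(b"x" * 1000))   # 1: block 0 written, mostly zero padding
print(w.append(b"y" * 1000))   # 1: block 0 rewritten at the same offset
print(w.append(b"z" * 9000))   # 2: block 0 rewritten again, block 1 written
```

Each small commit thus re-dirties the same 8 kB region, which is why the physical write pattern is not purely sequential even though the log is logically append-only.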
Re: [HACKERS] [GENERAL] Floating point error
>>>>> "HL" == Heikki Linnakangas writes: HL> It would be nice to have a base-2 text format to represent floats. HL> It wouldn't be as human-friendly as base-10, but it could be used HL> when you don't want to lose precision. pg_dump in particular. Hexadecimal notation for floats exists. The printf format flag is %a for miniscule and %A for majuscule. The result of 1./3. is 0xa.aabp-5. This site has some info and a conversion demo: http://gregstoll.dyndns.org/~gregstoll/floattohex/ -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6
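Python exposes the same lossless base-2 text form as C's %a, via float.hex() and float.fromhex(), which makes the round-trip property easy to check:

```python
# The hex float form encodes the exact binary significand and exponent,
# so text -> float -> text loses nothing, unlike a short decimal rendering.
x = 1.0 / 3.0
s = x.hex()                          # '0x1.5555555555555p-2' for a double
assert float.fromhex(s) == x         # exact round trip

# The value quoted in the mail, 0xa.aabp-5, is a low-precision rendering
# of 1/3 (it differs from the double value in the 6th decimal place).
approx = float.fromhex("0xa.aabp-5")
print(s, approx)
```

This exactness is precisely what makes a hex i/o format attractive for dump/reload of float columns.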
[HACKERS] Testing 9.2 in ~production environment
I'm giving 9.2-beta2 a test simulating a production workflow. Everything looks OK except the speed. Most (all?) queries take about five to six times as long as they do with 9.1. The configurations are essentially the same, the query plans are the same. A (hot) example, pulled semi-randomly from a run, with the names mangled to protect the innocent:

= 9.1 =
Nested Loop (cost=0.00..26.92 rows=1 width=28) (actual time=0.114..0.514 rows=19 loops=1)
  -> Index Scan using ms_pkey on ms msg (cost=0.00..26.03 rows=1 width=20) (actual time=0.026..0.207 rows=19 loops=1)
       Index Cond: ((ms_id >= 407) AND (ms_id <= 435) AND (mb_id = 50222))
       Filter: (status = ANY ('{0,1,2}'::integer[]))
  -> Index Scan using ph_pkey on ph pm (cost=0.00..0.87 rows=1 width=16) (actual time=0.010..0.010 rows=1 loops=19)
       Index Cond: (id = msg.ph_id)
Total runtime: 0.605 ms

= 9.2 =
Nested Loop (cost=0.00..30.12 rows=1 width=28) (actual time=0.439..2.540 rows=19 loops=1)
  -> Index Scan using ms_pkey on ms msg (cost=0.00..29.18 rows=1 width=20) (actual time=0.155..1.157 rows=19 loops=1)
       Index Cond: ((ms_id >= 407) AND (ms_id <= 435) AND (mb_id = 50222))
       Filter: (status = ANY ('{0,1,2}'::integer[]))
  -> Index Scan using ph_pkey on ph pm (cost=0.00..0.93 rows=1 width=16) (actual time=0.053..0.054 rows=1 loops=19)
       Index Cond: (id = msg.ph_id)
Total runtime: 2.752 ms

All of the tables and indices for the run in question fit into RAM. The effective cache, work mem, costs, etc. were optimized in 9.0, and kept for 9.1 and the beta. That the plans are the same suggests that isn't the problem, yes? I think I recall mention from a previous beta (but goog isn't helping me confirm) that there is some extra debugging or such enabled in the betas. If so, and if turning that off would provide a better comparison, where in the src should I look?
-JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Testing 9.2 in ~production environment
>>>>> "AF" == Andres Freund writes: AF> Is it possible that you compiled with assertions enabled? That would roughly AF> fit that magnitude. SHOW debug_assertions; Should show you whether it was AF> enabled. Thanks, but SHOW debug_assertions reports off. -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Testing 9.2 in ~production environment
>>>>> "PE" == Peter Eisentraut writes: PE> That depends on how you built it. Just being a beta by itself doesn't PE> turn on any extra debugging. OK. So either I misremembered or it was something no longer done. PE> That depends on how you built it. Its a Gentoo box; both were build from their ebuilds, with the same gcc, flags, etc. PE> Compare the output of pg_config --configure from both installations. The only differences are 9.1 vs 9.2 in the paths. Thanks, -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Testing 9.2 in ~production environment
>>>>> "JB" == Josh Berkus writes: JB> Can you check the collations of the two databases? I'm wondering if 9.1 JB> is in "C" collation and 9.2 is something else. Thanks! pg_dump -C tells me these two differences: -SET client_encoding = 'SQL_ASCII'; +SET client_encoding = 'UTF8'; -CREATE DATABASE dbm WITH TEMPLATE = template0 ENCODING = 'SQL_ASCII' LC_COLLATE = 'C' LC_CTYPE = 'C'; +CREATE DATABASE dbm WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'C' LC_CTYPE = 'en_US.UTF-8'; for every db in the clusters. I presume that lc_ctype is the significant difference? LC_CTYPE *is* specified as 'C' in the dump from which I created the 9.2 cluster, so it must have been overridden by pg_restore. I see that my dist's /etc rc script now sets LC_CTYPE. Would that explain why lc_ctype changed between the two clusters? Is there any way to alter a db's lc_ctype w/o dumping and restoring? I want to preserve some of the changes made since I copied the 9.1 cluster. Alter database reports that lc_ctype cannot be changed. -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Testing 9.2 in ~production environment
Updating pg_database to set datctype='C' did solve the speed issues with the two large dbs. Presumably, since LC_CTYPE=en_US.UTF-8 was in the env when I ran pg_restore, it overrode the ctype setting in the dump files. Some of the slow selects do use ilike; even w/ datctype='C' the indices are skipped for at least this query:

# explain analyze SELECT mb_id FROM mb WHERE name ILIKE 'foo@bar' AND ownerid=7;

QUERY PLAN
-
Seq Scan on mb (cost=0.00..570.96 rows=3 width=4) (actual time=9.443..25.039 rows=1 loops=1)
  Filter: ((name ~~* 'foo@bar'::text) AND (ownerid = 7))
  Rows Removed by Filter: 34827
Total runtime: 25.071 ms
(4 rows)

The mb table has several indices, including separate ones on name and ownerid. (Not my design, btw. And I really do need to re-write the middleware.) Whether it is strcoll(3) (even though LC_COLLATE is explicitly C) or LIKE, it does make a significant difference for those two apps. -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6
Re: [HACKERS] Testing 9.2 in ~production environment
As a followup, I find that I can avoid the seq scan by adding an index to that table as: create index mb_name_own_idx on mb ( lower(name), ownerid ); and changing the query from using the idiom: WHERE name ILIKE 'foo@bar' AND ownerid=7; to using: WHERE lower(name) = lower('foo@bar') AND ownerid=7; which saves 20+ ms on each of the 30+ k such selects in a full run. I haven't tested how fast it would be with that change and a utf8 ctype. Because of how the middleware achieves its portability between pg, my et al, changing it to use lower and = will require significant surgery. Is there any way to specify the index such that the ILIKE query will use said index? -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6
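The shape of that fix can be sketched without a running PostgreSQL instance: SQLite also supports expression indexes, so the same lower(name) index and rewritten predicate can be demonstrated through Python's sqlite3 module. This is an analogue of the technique, not PostgreSQL's planner:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mb (mb_id INTEGER, name TEXT, ownerid INTEGER)")
# Expression index mirroring the mail's: create index ... on mb (lower(name), ownerid)
conn.execute("CREATE INDEX mb_name_own_idx ON mb (lower(name), ownerid)")
conn.execute("INSERT INTO mb VALUES (1, 'Foo@Bar', 7)")

# The rewritten predicate matches the indexed expression, so the plan
# can search the index instead of scanning the table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT mb_id FROM mb "
    "WHERE lower(name) = lower('foo@bar') AND ownerid = 7"
).fetchall()
print(plan)
```

A bare ILIKE-style predicate (case-insensitive match on the raw column) has no matching indexed expression, which is why the original query fell back to a seq scan.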
[HACKERS] Postgres vs. intel ccNUMA on Linux
Hackers, Any tips / conventional wisdom regarding running postgres on large- ish memory ccNUMA intel machines, such as a 32G dual-quad-core, showing two NUMA nodes of 16G each? I expect each postgres backend's non-shared memory usage to remain nice and reasonably sized, hopefully staying within the confines of its processor's local memory region, but how will accesses to shared memory and / or buffer cache play out? Do people tune their backends via 'numactl' ? Furthermore, if one had more than one database being served by the machine, would it be advisable to do this via multiple clusters instead of a single cluster, tweaking the processor affinity of each postmaster accordingly, trying to ensure each cluster's shared memory segments and buffer cache pools remain local for the resulting backends? Thanks! James Robinson Socialserve.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] .gitignore files, take two
I'm reading this a bit late, but... We (Xorg) found that ignoring: *~ *.bak *.patch in addition to the files generated by building is very helpful. We do use git tag and git describe in the make dist process, as well as git log >ChangeLog. That may be relevant; avoiding git describe's dirty designation is important when using it in that fashion. But it helps to be able to run git add --all cleanly. I understand that other git users have similar experience. -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Floating-point timestamps versus Range Types
>>>>> "JD" == Jeff Davis writes: JD> 2. Fix the input/output functions in a special mode for dump/reload, JD>to make them true inverses. That can be done by supporting the %A printf(3)/scanf(3) format. -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Floating-point timestamps versus Range Types
>>>>> "JD" == Jeff Davis writes: JD> 2. Fix the input/output functions in a special mode for dump/reload, JD> to make them true inverses. JC> That can be done by supporting the %A printf(3)/scanf(3) format. JD> I don't happen to see a %A format in the man page, but I doubt the JD> output would look like a timestamp. Modern printf(1) also supports it, so an easy example: :; printf '%A\n' 3.1415926535897932384626433832795029 0XC.90FDAA22168C235P-2 %a is the same, but with miniscule letters. It is the hex format of the float types, and defaults to exactly enough precision. The length modifier L makes %a expect a long double. JD> And if it doesn't look like a timestamp, it violates the spirit of a JD> logical dump of the data. Point taken. Had I read the whole thread before replying I would have been reminded that the float timestamps were archaic; that avoids any need of %A for timestamps. That said, the possibility of hex i/o format for the float datatypes would be welcome. -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6
Re: [HACKERS] Floating-point timestamps versus Range Types
>>>>> "TL" == Tom Lane writes: JC> That said, the possibility of hex i/o format for the float datatypes JC> would be welcome. TL> It's unportable, for two different reasons: TL> 2. The printf specifiers you want us to rely on are not standard. They are in C99. TL> 1. pg_dump output would become platform-specific. This is highly TL> undesirable. It is true that pg would have to test for them in configure and supply alternative code wherever libc fails to support them. I can readily accept that there are many more pressing needs. But would such a patch for master be rejected? -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6
Re: [HACKERS] SSL: better default ciphersuite
>>>>> "MK" == Marko Kreen writes: >>>>> "PE" == Peter Eisentraut writes: MK>> Well, we should - the DEFAULT is clearly a client-side default MK>> for compatibility only. No server should ever run with it. PE> Any other opinions on this out there? For reference, see: https://wiki.mozilla.org/Security/Server_Side_TLS for the currently suggested suite for TLS servers. That is: ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256: ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384: DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM: ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA: ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384: ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256: DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256: DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256: AES256-GCM-SHA384:ECDHE-RSA-RC4-SHA:ECDHE-ECDSA-RC4-SHA: AES128:AES256:RC4-SHA:HIGH: !aNULL:!eNULL:!EXPORT:!DES:!3DES:!MD5:!PSK The page explains why. But for pgsql, I'd leave off the !PSK; pre-shared keys may prove useful for some. And RC4, perhaps, also should be !ed. And if anyone wants Kerberos tls-authentication, one could add KRB5-DES-CBC3-SHA, but that is ssl3-only. Once salsa20-poly1305 lands in openssl, that should be added to the start of the list. -JimC -- James Cloos OpenPGP: 1024D/ED7DAEA6 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
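As an illustration of how a cipher preference string like the one above is applied through OpenSSL, here is Python's ssl module doing the equivalent of setting a server-side cipher list. The string below is a shortened variant for demonstration, not the exact Mozilla list or any PostgreSQL default; an entirely invalid string raises an error rather than silently permitting everything:

```python
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
# Prefer ECDHE/DHE AES-GCM, fall back to HIGH, and exclude the weak
# families singled out above (anonymous, export, DES/3DES, MD5).
ctx.set_ciphers("ECDHE+AESGCM:DHE+AESGCM:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!3DES:!MD5")
names = [c["name"] for c in ctx.get_ciphers()]
print(len(names), "ciphersuites enabled")
```

The same grammar (suite names, `+` conjunction, `!` exclusion, order = preference) is what a server's ssl_ciphers-style setting hands to OpenSSL.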
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote: > On 1/13/14, 2:27 PM, Claudio Freire wrote: > > On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby wrote: > >> On 1/13/14, 2:19 PM, Claudio Freire wrote: > >>> > >>> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas > >>> wrote: > >>>> > >>>> On a related note, there's also the problem of double-buffering. When > >>>> we read a page into shared_buffers, we leave a copy behind in the OS > >>>> buffers, and similarly on write-out. It's very unclear what to do > >>>> about this, since the kernel and PostgreSQL don't have intimate > >>>> knowledge of what each other are doing, but it would be nice to solve > >>>> somehow. > >>> > >>> > >>> > >>> There you have a much harder algorithmic problem. > >>> > >>> You can basically control duplication with fadvise and WONTNEED. The > >>> problem here is not the kernel and whether or not it allows postgres > >>> to be smart about it. The problem is... what kind of smarts > >>> (algorithm) to use. > >> > >> > >> Isn't this a fairly simple matter of when we read a page into shared > >> buffers > >> tell the kernel do forget that page? And a corollary to that for when we > >> dump a page out of shared_buffers (here kernel, please put this back into > >> your cache). > > > > > > That's my point. In terms of kernel-postgres interaction, it's fairly > > simple. > > > > What's not so simple, is figuring out what policy to use. Remember, > > you cannot tell the kernel to put some page in its page cache without > > reading it or writing it. So, once you make the kernel forget a page, > > evicting it from shared buffers becomes quite expensive. > > Well, if we were to collaborate with the kernel community on this then > presumably we can do better than that for eviction... even to the > extent of "here's some data from this range in this file. It's (clean| > dirty). Put it in your cache. Just trust me on this." 
This should be covered by the madvise() interface (with MADV_WILLNEED and MADV_DONTNEED); is there something in that interface that is insufficient? James -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
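The madvise() hints named above can be exercised from Python's mmap wrapper (POSIX only; the constants are present where the OS supports them), which makes the semantics easy to demonstrate without writing C; a minimal sketch:

```python
import mmap
import os
import tempfile

# Sketch of the MADV_WILLNEED / MADV_DONTNEED hints discussed above.
# Both are advisory: WILLNEED asks the kernel to prefetch, DONTNEED says
# the cached pages may be reclaimed. File-backed data survives either way.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * mmap.PAGESIZE * 4)
m = mmap.mmap(fd, 0)

if hasattr(mmap, "MADV_WILLNEED"):
    m.madvise(mmap.MADV_WILLNEED)    # hint: prefetch these pages
if hasattr(mmap, "MADV_DONTNEED"):
    m.madvise(mmap.MADV_DONTNEED)    # hint: drop the cached copies

intact = (m[0:1] == b"x")            # a later access simply refaults from disk
assert intact

m.close()
os.close(fd)
os.unlink(path)
```

The PostgreSQL-side objection in the thread is not that these calls are wrong but that issuing mmap/madvise/munmap per 8kb block is too expensive, so any real interface would need to work on read()/write()-style file access.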
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, 2014-01-13 at 22:12 +0100, Andres Freund wrote: > On 2014-01-13 12:34:35 -0800, James Bottomley wrote: > > On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote: > > > Well, if we were to collaborate with the kernel community on this then > > > presumably we can do better than that for eviction... even to the > > > extent of "here's some data from this range in this file. It's (clean| > > > dirty). Put it in your cache. Just trust me on this." > > > > This should be the madvise() interface (with MADV_WILLNEED and > > MADV_DONTNEED) is there something in that interface that is > > insufficient? > > For one, postgres doesn't use mmap for files (and can't without major > new interfaces). I understand, that's why you get double buffering: because we can't replace a page in the range you give us on read/write. However, you don't have to switch entirely to mmap: you can use mmap/madvise exclusively for cache control and still use read/write (and still pay the double buffer penalty, of course). It's only read/write with directio that would cause problems here (unless you're planning to switch to DIO?). > Frequently mmap()/madvise()/munmap()ing 8kb chunks has > horrible consequences for performance/scalability - very quickly you > contend on locks in the kernel. Is this because of problems in the mmap_sem? > Also, that will mark that page dirty, which isn't what we want in this > case. You mean madvise (page_addr)? It shouldn't ... the state of the dirty bit should only be updated by actual writes. Which MADV_ primitive is causing the dirty marking, because we might be able to fix it (unless there's some weird corner case I don't know about). > One major usecase is transplanting a page comming from postgres' > buffers into the kernel's buffercache because the latter has a much > better chance of properly allocating system resources across independent > applications running. 
If you want to share pages between the application and the page cache, the only known interface is mmap ... perhaps we can discuss how better to improve mmap for you? We also do have a way of transplanting pages: it's called splice. How do the semantics of splice differ from what you need? > Oh, and the kernel's page-cache management while far from perfect, > actually scales much better than postgres'. Well, then, it sounds like the best way forward would be to get postgres to use the kernel page cache more efficiently. James -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
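For readers unfamiliar with the splice() interface mentioned here, it moves page references between file descriptors inside the kernel rather than copying bytes through user space. A minimal sketch (Linux; Python exposes it from 3.10, so the call is guarded):

```python
import os

# Sketch of splice(): transplant bytes from one pipe to another without a
# user-space copy. At least one end of a splice must be a pipe.
r1, w1 = os.pipe()
r2, w2 = os.pipe()
os.write(w1, b"hello")

if hasattr(os, "splice"):
    n = os.splice(r1, w2, 5)     # kernel moves the page reference, no copy
    moved = os.read(r2, 5)
else:
    n, moved = 5, os.read(r1, 5)  # fallback where os.splice is unavailable

assert n == 5 and moved == b"hello"

for fd in (r1, w1, r2, w2):
    os.close(fd)
```

The open question in the thread is whether these semantics (one-way movement between descriptors) match what a buffer manager needs, which is closer to "adopt this page, with its clean/dirty state, into your cache".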
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Mon, 2014-01-13 at 21:29 +, Greg Stark wrote: > On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund wrote: > > For one, postgres doesn't use mmap for files (and can't without major > > new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has > > horrible consequences for performance/scalability - very quickly you > > contend on locks in the kernel. > > > I may as well dump this in this thread. We've discussed this in person > a few times, including at least once with Ted T'so when he visited > Dublin last year. > > The fundamental conflict is that the kernel understands better the > hardware and other software using the same resources, Postgres > understands better its own access patterns. We need to either add > interfaces so Postgres can teach the kernel what it needs about its > access patterns or add interfaces so Postgres can find out what it > needs to know about the hardware context. > > The more ambitious and interesting direction is to let Postgres tell > the kernel what it needs to know to manage everything. To do that we > would need the ability to control when pages are flushed out. This is > absolutely necessary to maintain consistency. Postgres would need to > be able to mark pages as unflushable until some point in time in the > future when the journal is flushed. We discussed various ways that > interface could work but it would be tricky to keep it low enough > overhead to be workable. So in this case, the question would be what additional information do we need to exchange that's not covered by the existing interfaces. Between madvise and splice, we seem to have most of what you want; what's missing? > The less exciting, more conservative option would be to add kernel > interfaces to teach Postgres about things like raid geometries. Then > Postgres could use directio and decide to do prefetching based on the > raid geometry, how much available i/o bandwidth and iops is available, > etc. 
> > Reimplementing i/o schedulers and all the rest of the work that the > kernel provides inside Postgres just seems like something outside our > competency and that none of us is really excited about doing. This would also be a well trodden path ... I believe that some large database company introduced Direct IO for roughly this purpose. James -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Multiple synchronous_standby_names rules
Hello, When working with a production (P) and a DR (D) environment it is often a requirement to be able to protect against data loss when promoting within a site, and also when losing P and promoting a master at D. The current synchronous_standby_names setting does not allow this. In a simple example we could take the following nodes: P1 (current master), P2, P3 D1, D2, D3 Where P1 is replicating to (P2, P3, D1, D2, D3). The closest synchronous_standby_names setting you could get to my use case would be: 1 (D1, D2, D3) This would allow the loss of either site without losing data - however it would not allow promotion within site P from P1 -> (P2 | P3) without the potential for data loss. What is needed to support this is the ability to configure Px with something like: 1 (P1, P2, P3), 1 (D1, D2, D3) Would there be any appetite for this - or would it be seen as over complication of the current rules? Cheers, James Sewell, PostgreSQL Team Lead / Solutions Architect Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009 *P *(+61) 2 8099 9000 *W* www.jirotech.com *F * (+61) 2 8099 9099 -- -- The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. Please advise the sender of your incorrect receipt of this correspondence.
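The proposed semantics can be sketched as a quorum check per group: a commit waits until, for every group in the rule list, at least the required number of standbys in that group have acknowledged. This is a hypothetical illustration of the proposal, not actual syntax or server behaviour:

```python
# Hedged sketch: each rule is (required_acks, [standby names]); a commit
# is releasable only when every group independently meets its quorum.
def satisfied(rules, acked):
    return all(len(acked & set(group)) >= need for need, group in rules)

# One ack required among the local standbys, one among the DR standbys.
rules = [(1, ["P2", "P3"]), (1, ["D1", "D2", "D3"])]

assert not satisfied(rules, {"P2"})        # no DR ack yet: must keep waiting
assert satisfied(rules, {"P2", "D1"})      # one ack per group: safe to commit
```

Under these semantics either whole site can be lost, and a promotion within site P can also proceed, without the possibility of acknowledged-but-unreplicated commits.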
[HACKERS] ADD FOREIGN KEY locking
Hello all, When I add a FK with a statement like this: ALTER TABLE a ADD FOREIGN KEY (id) REFERENCES b(id); I see a lock on table b: select locktype,mode,granted from pg_locks, pg_stat_activity where relation::regclass::text = 'b' AND pg_locks.pid = pg_stat_activity.pid; locktype | relation mode | AccessShareLock granted | t query | SOME LONG RUNNING QUERY WHICH SELECTS FROM b locktype | relation mode | AccessExclusiveLock granted | f query | ALTER TABLE a ADD FOREIGN KEY (id) REFERENCES b(id); This means that my add key won't complete until my long running query does. That seems a bit odd to me? In this database there are lots of datawarehouse type queries running, which makes it a bit hard for me to schedule this operation. Is this just a manifestation of adding the key being in an ALTER TABLE, which always needs an AccessExclusiveLock? Or am I missing some edge case when this lock would be required in this circumstance? No real urgency on this question, I just found it a bit strange and thought someone might be able to shed some light. James Sewell, Solutions Architect __ Level 2, 50 Queen St, Melbourne VIC 3000 *P *(+61) 3 8370 8000 *W* www.lisasoft.com *F *(+61) 3 8370 8099 -- -- The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. Please advise the sender of your incorrect receipt of this correspondence.
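The waiting behaviour follows directly from PostgreSQL's table-lock conflict rules: AccessExclusiveLock conflicts with every other mode, including the AccessShareLock a plain SELECT holds for its whole duration. A tiny sketch of just the two modes involved (a deliberately partial conflict matrix, for illustration only):

```python
# Hedged sketch of the relevant corner of PostgreSQL's lock conflict table.
# True means the requested mode must queue behind the held mode.
CONFLICTS = {
    ("AccessExclusiveLock", "AccessShareLock"): True,
    ("AccessShareLock", "AccessShareLock"): False,
}

def must_wait(requested, held):
    return CONFLICTS.get((requested, held), False)

assert must_wait("AccessExclusiveLock", "AccessShareLock")   # ALTER queues behind SELECT
assert not must_wait("AccessShareLock", "AccessShareLock")   # readers coexist freely
```

Note also that once the ALTER is queued, later lock requests queue behind it in turn, so a long-running reader can effectively stall new readers as well.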
Re: [HACKERS] ADD FOREIGN KEY locking
Oh, I've just noticed something in the Commit fest post - Reducing lock strength of trigger and foreign key DDL Perhaps I just need to be more patient. Cheers, James Sewell, Solutions Architect __ Level 2, 50 Queen St, Melbourne VIC 3000 *P *(+61) 3 8370 8000 *W* www.lisasoft.com *F *(+61) 3 8370 8099 On Wed, Feb 18, 2015 at 10:57 AM, James Sewell wrote: > Hello all, > > When I add a FK with a statement like this: > > ALTER TABLE a ADD FOREIGN KEY (id) REFERENCES b(id); > > I see a lock on table b: > > select locktype,mode,granted from pg_locks, pg_stat_activity where > relation::regclass::text = 'b' AND pg_locks.pid = pg_stat_activity.pid; > > locktype | relation > mode | AccessShareLock > granted | t > query | SOME LONG RUNNING QUERY WHICH SELECTS FROM b > > locktype | relation > mode | AccessExclusiveLock > granted | f > query | ALTER TABLE a ADD FOREIGN KEY (id) REFERENCES b(id); > > > This means that my add key won't complete until my long running query > does. That seems a bit odd to me? In this database there are lots of > datawarehouse type queries running, which makes it a bit hard for me to > schedule this operation. > > Is this just a manifestation of adding the key being in an ALTER TABLE, > which always needs an AccessExclusiveLock? Or am I missing some edge case > when this lock would be required in this circumstance? > > No real urgency on this question, I just found it a bit strange and > thought someone might be able to shed some light. > > James Sewell, > Solutions Architect > __ > > > Level 2, 50 Queen St, Melbourne VIC 3000 > > *P *(+61) 3 8370 8000 *W* www.lisasoft.com *F *(+61) 3 8370 8099 > > -- -- The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. 
Please advise the sender of your incorrect receipt of this correspondence.
[HACKERS] Composite index and min()
Hello, I have the following table: \d a Table "phxconfig.a" Column | Type | Modifiers ---+-+--- phx_run_id| integer | cell_id | integer | Indexes: "a_phx_run_id_cell_id_idx" btree (phx_run_id, cell_id) When I use a min() query I get the following plans: test=# explain select min(phx_run_id) from a; QUERY PLAN --- Result (cost=0.22..0.23 rows=1 width=0) InitPlan 1 (returns $0) -> Limit (cost=0.14..0.22 rows=1 width=4) -> Index Only Scan using a_phx_run_id_cell_id_idx on a (cost=0.14..7.89 rows=100 width=4) Index Cond: (phx_run_id IS NOT NULL) test=# explain select min(cell_id) from a; QUERY PLAN - Aggregate (cost=2.25..2.26 rows=1 width=4) -> Seq Scan on a (cost=0.00..2.00 rows=100 width=4) Can anyone comment on why this happens? The index kicks in when I do an explicit cell_id comparison. These are large tables, and they are in a partition layout so it really hurts when I do the min call on the parent table. Cheers, James Sewell, PostgreSQL Team Lead / Solutions Architect __ Level 2, 50 Queen St, Melbourne VIC 3000 *P *(+61) 3 8370 8000 *W* www.lisasoft.com *F *(+61) 3 8370 8099 -- -- The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. Please advise the sender of your incorrect receipt of this correspondence.
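The asymmetry comes from btree ordering: an index on (phx_run_id, cell_id) stores entries in lexicographic order, so the minimum of the leading column is simply the first entry, while the minimum of the second column can sit anywhere in the index. A small sketch with a sorted tuple list standing in for the index:

```python
# Sketch: a composite btree is, conceptually, a list sorted lexicographically
# on (phx_run_id, cell_id). min(phx_run_id) is the first entry; min(cell_id)
# is not at any predictable position, so the planner falls back to a scan.
index = sorted([(1, 9), (1, 3), (2, 1), (3, 7)])   # -> [(1,3),(1,9),(2,1),(3,7)]

assert index[0][0] == min(r for r, c in index)   # leading column: O(1) peek
assert index[0][1] != min(c for r, c in index)   # second column: full scan needed
```

This is also why the index does kick in once an explicit phx_run_id comparison pins the leading column: within one run id, the cell_id values are contiguous and ordered.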
[HACKERS] ALTER TABLE ... DISABLE TRIGGER vs. AccessExclusiveLock
Hackers, Experience and a read through backend/commands/tablecmds.c's AlterTable() indicate that ALTER TABLE ... DISABLE TRIGGER obtains an exclusive lock on the table (as does any ALTER TABLE). Blocking other readers from a table when, within the body of a transaction, we're performing a bulk update operation where we don't want or need triggers to fire seems at first glance to be overkill. I can see how AlterTable()'s complex logic is made less complex through 'get and keep a big lock', since most of its operational modes really do need exclusive access, but is it strictly required for ... DISABLE / REENABLE TRIGGER? Could, say, RowExclusiveLock hypothetically provide adequate protection, allowing concurrent reads, but blocking out any other writers (for ENABLE / DISABLE TRIGGER) -- such as if driven through a new statement other than ALTER TABLE -- such as "DISABLE TRIGGER foo ON tbar" ? Thanks! James Robinson Socialserve.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Postgres / plpgsql equivalent to python's getattr() ?
Hackers, Python's getattr() allows for dynamic lookup of attributes on an object, as in: inst = MyClass(x=12, y=24) v = getattr(inst, 'x') assert v == 12 Oftentimes in writing data validating trigger functions, it'd be real handy to be able to do a similar thing in plpgsql against column values in a row or record type, such as making use of a trigger argument for a hint as to what column to consider in this table's case. Oh, to be able to do something like (toy example known to be equivalent to a check):

CREATE OR REPLACE FUNCTION must_be_positive() RETURNS TRIGGER AS $$
begin
    if getattr(NEW, TG_ARGV[0]) <= 0 then
        raise exception '% must be positive', TG_ARGV[0];
    end if;
    -- after trigger
    return null;
end;
$$ LANGUAGE PLPGSQL;

A function which takes a row + a text column name, and / or a peer function taking row + index within row would really open up plpgsql's expressivity in cases where you're writing mainly SQL stuff, not really wanting to go over to plpythonu or whatnot (whose description of rows are as dicts). Is there something in the internals which inherently prevents this? Or am I a fool and it already exists? Not having to defer to EXECUTE would be attractive. James Robinson Socialserve.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
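For comparison, PL/Python already delivers the dynamic lookup asked for here, because a trigger sees the new row as a dict (TD["new"]). A standalone sketch of the same validation logic, with the column name chosen at call time (names are illustrative):

```python
# Sketch of the desired trigger logic as plain Python: the row is a dict,
# so "getattr" is just subscripting with a runtime-chosen column name.
def must_be_positive(new_row, column):
    if new_row[column] <= 0:            # the dynamic lookup plpgsql lacks
        raise ValueError(f"{column} must be positive")

must_be_positive({"qty": 5, "price": 2}, "qty")    # passes silently

failed = False
try:
    must_be_positive({"qty": -1}, "qty")
except ValueError as e:
    failed = "must be positive" in str(e)
assert failed
```

In plpgsql itself the closest workaround remains converting the row and using EXECUTE (or, in later releases, casting the record through hstore/jsonb), which is exactly the indirection the post hopes to avoid.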
Re: [HACKERS] Postgres / plpgsql equivalent to python's getattr() ?
On Aug 4, 2011, at 4:55 AM, Florian Pflug wrote: @OP: Here's my implementation of the feature you desire as a set of C-language functions: https://github.com/fgp/pg_record_inspect. Other people did code up similar things in the past, but I currently cannot find any links to their work. But a little bit of digging in the mailing list archives should turn them up. Many thanks, Florian, we'll be checking that out. James James Robinson Socialserve.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Connection and transaction triggers
I've been looking at the new trigger features planned for Firebird 2.1 and I'm a bit jealous. I'm interested in 'stateful' proc packages that can benefit from accumulating data from updates during a transaction and then performing an action at the end - perhaps doing a bulk insert in a pre-commit trigger, or communicating with an external system in a post-commit trigger, and using a begin tran or rollback trigger to tidy up the accumulator. Would this be a difficult thing to add? I quite like their temporary table treatment too. James ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Another idea for index-only scans
Decibel! wrote: The advantage to Bruce's idea is that it sounds pretty simple to implement. While it wouldn't be of use for many general cases, it *would* be useful for read-only tables, ie: old partitions. Wouldn't the most common case be foreign key checks against tables that essentially map application enums to display strings? This is a rather common scenario. It would be nice if such tables (which are typically small) could be retained in each backend process with a simple check that the cached data is still valid. James ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] INSERT/SELECT and excessive foreign key checks
Andrew Dunstan wrote: Could we achieve the same thing in a more general way by having a per-FK tiny (say 10?) LRU cache of values checked. Then it wouldn't only be restricted to constant expressions. Of course, then the trigger would need to keep state, so it might well be too complex (e.g. what if there are concurrent inserts?) I was wondering whether one could try to identify what might be termed 'enum tables' that exist to provide lookups. There are perhaps three main types of table that are the target of a foreign key lookup: 1) tables that map to program language enumerations: typically small (less than a hundred rows) and changing very infrequently. 2) tables that hold quasi-static reference data where rows are 'never' deleted (they may be amended, perhaps to identify that they are logically inactivated, but still needed for reference lookup from existing rows elsewhere) - typically customer definitions, product definitions, site definitions and that sort of thing that is often regarded as 'static data' by a user application session but which may change. 3) master records in master/detail relationships such as order/orderline. If you can have mechanisms that reflect the likelihood of an update and optimise accordingly, then hopefully performance in real-world applications can be improved. In the case of 1) for example, we might reasonably have a single logical read/write lock that controls access to ALL such tables in a schema, and a single 'update generation count'. The lock would effectively provide repeatable read stability across all of the tables (a multi-table table lock) while in place, and the generation count (which can be a tran id) indicates to caching processes when the cache is stale. This means that unlike normal MVCC the readers will block a writer, but in this case we expect the write to happen only during application release. 
In the case of 2), we can't use the cross-table lock, and the tables may be large, so the suggested LRU cache per table (with a table-level read/write lock again) may be most effective, but we may elect to regard a read lock as allowing any operation that doesn't invalidate the primary key. And in the case of 3) we don't do anything special at all. I certainly think that anything which can materially reduce lookups in case 1) and hopefully 2) will encourage good database design and declarative referential integrity - in some classes of high performance application the cost is too high to be done inline with an update, which is a shame. James ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
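The small per-FK LRU cache of already-checked key values proposed at the top of this exchange is simple to sketch; a hit means the lookup on the referenced table is skipped entirely (names and the capacity of 10 are illustrative, and cache invalidation on parent-table changes is deliberately out of scope here):

```python
from collections import OrderedDict

# Hedged sketch of a tiny per-constraint LRU cache of FK values already
# verified to exist in the referenced table.
class FKCheckCache:
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.seen = OrderedDict()

    def checked(self, key):
        """Return True if key was recently verified; record it otherwise."""
        if key in self.seen:
            self.seen.move_to_end(key)       # refresh LRU position
            return True
        self.seen[key] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)    # evict least recently used
        return False

cache = FKCheckCache(capacity=2)
assert not cache.checked("order-1")   # miss: would probe the parent table
assert cache.checked("order-1")       # hit: probe skipped
cache.checked("order-2")
cache.checked("order-3")
assert not cache.checked("order-1")   # evicted, must re-check
```

The hard part, as the thread notes, is exactly what such a sketch omits: keeping the cache correct under concurrent deletes from the referenced table.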
Re: [HACKERS] Proposal: real procedures again (8.4)
Josh Berkus wrote: Not only would they be generally useful for SP programming, but multisets would eliminate one of the big hurdles in re-writing T-SQL stored procedures in PG, and thus make it easier to port from SQL Server. You don't hear a lot of demand for multisets on the mailing lists because we're not getting those SQL Server / Sybase crossovers now. It's true that multiple result sets are a big deal with T-SQL programming: but I think you'll also need to provide a way for the locking model to behave in a similar way, and also, very importantly, to be able to emulate the after-statement triggers' view of new and old images. James ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Win32 shared memory speed
Magnus Hagander wrote: IIRC, there hasn't been any direct benchmark for it (though I've wanted to do that but had no time), but it's been the only real explanation put forward for the behaviour we've seen. And it does make sense given the thread-centric view of the windows mm. /Magnus How is it supposed to be slow, once it's mapped into your process? There's no OS interaction at all then. If you are suggesting that the inter-process synch objects are slow, then that may be so: just use interlocked increment and a spin lock in place of a mutex and use an associated event to wake up if necessary. You don't have to use a named kernel mutex, though it may be handy while setting up the shared memory. If you are repeatedly changing the mappings - well, that may be something that needs optimisation. James ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] .NET or Mono functions in PG
Magnus Hagander wrote: I did look at this at some earlier point as well. One big problem at that time was that once you embedded mono, you had all sorts of threads running in your backend ;-) Is that necessarily a problem? You have to compile with a thread-capable libc and take some care that the heap implementation is well tuned, but there's no reason why the mono housekeeping threads should cause you any problem, is there? It should be much like embedded Java. Another way to do it is "the PL/J" way (I think). Which is starting up a separate process with the VM in it and then do RPC of some kind to it. Which has more overhead per call, but lower per backend etc. And a lot less "dangerous". Given that one would want to embed to have very low latency both on trigger invocation and on calls back into the data engine, I don't really see the point personally. I'm not sure how important it is to make the embedded interface look like a standard interface (in the way that the embedded CLR in MSSQL does - mostly) or whether it can be a thin wrapper over the C API. I think there's good mileage in starting with the thin wrapper, then at least some common business logic code can be used. James ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] .NET or Mono functions in PG
Tom Lane wrote: It is also pretty well established that if pltcl or plperl cause the backend to become multithreaded, things break horribly. I strongly Isn't that something that needs to be fixed? It's one thing not to allow for multiple threads to be in your code, but not having a threaded library linked in is going to become more and more of a problem. James ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly