Re: [HACKERS] WAL logging problem in 9.4.3?
On Tue, Jul 21, 2015 at 02:24:47PM -0400, Todd A. Cook wrote:
> Hi,
>
> This thread seemed to trail off without a resolution.  Was anything done?

Not that I can tell. I was the original poster of this thread. We've worked
around the issue by placing a CHECKPOINT command at the end of the migration
script. For us it's not a performance issue but a correctness one: tables
were empty when they shouldn't have been.

I'm hoping a fix will appear in the 9.5 release, since we're intending to
release with that version. A forced checkpoint every now and then probably
won't be a serious problem though.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 03, 2015 at 12:53:56PM -0400, Tom Lane wrote:
> Fujii Masao masao.fu...@gmail.com writes:
> > Okay, so probably we need to change WAL replay of TRUNCATE so that the
> > index file is truncated to one containing only the meta page instead of
> > an empty one. That is, the WAL replay of TRUNCATE would need to call
> > index_build() after smgrtruncate(), maybe.
>
> That seems completely unworkable. For one thing, index_build would expect
> to be able to do catalog lookups, but we can't assume that the catalogs
> are in a good state yet. I think the responsibility has to be on the
> WAL-writing end to emit WAL instructions that lead to a correct on-disk
> state. Putting complex behavior into the reading side is fundamentally
> misguided.

Am I missing something? ISTM that if the truncate record was simply not
logged at all, everything would work fine. The whole point is that the
table was created in this transaction, so if it exists, the table on disk
must be the correct representation. The broken index is just one symptom.

The heap also shouldn't be truncated at all. If you insert a row before
commit, then after replay the tuple should still be there.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 03, 2015 at 07:21:21PM +0200, Andres Freund wrote:
> On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote:
> > Am I missing something? ISTM that if the truncate record was simply not
> > logged at all everything would work fine. The whole point is that the
> > table was created in this transaction and so if it exists the table on
> > disk must be the correct representation.
>
> That'd not work either. Consider:
>
>   BEGIN;
>   CREATE TABLE ...;
>   INSERT;
>   TRUNCATE;
>   INSERT;
>   COMMIT;
>
> If you replay that without a truncation WAL record the second INSERT will
> try to add stuff to already occupied space. And they can have different
> lengths and stuff, so you cannot just ignore that fact.

I was about to disagree with you by suggesting that if the table was
created in this transaction then WAL logging is skipped. But testing shows
that inserts are indeed logged, as you point out. With inserts the WAL
records look as follows (relfilenodes changed):

martijn@martijn-jessie:~/git/ctm/docker$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /tmp/pgtest/postgres/pg_xlog/ 00010001 | grep -wE '16386|16384|16390'
rmgr: Storage   len (rec/tot):  16/ 48, tx:   0, lsn: 0/016A79C8, prev 0/016A79A0, bkp: , desc: file create: base/12139/16384
rmgr: Sequence  len (rec/tot): 158/190, tx: 683, lsn: 0/016B4258, prev 0/016B2508, bkp: , desc: log: rel 1663/12139/16384
rmgr: Storage   len (rec/tot):  16/ 48, tx: 683, lsn: 0/016B4318, prev 0/016B4258, bkp: , desc: file create: base/12139/16386
rmgr: Storage   len (rec/tot):  16/ 48, tx: 683, lsn: 0/016B9468, prev 0/016B9418, bkp: , desc: file create: base/12139/16390
rmgr: Sequence  len (rec/tot): 158/190, tx: 683, lsn: 0/016BC938, prev 0/016BC880, bkp: , desc: log: rel 1663/12139/16384
rmgr: Sequence  len (rec/tot): 158/190, tx: 683, lsn: 0/016BCAF0, prev 0/016BCAA0, bkp: , desc: log: rel 1663/12139/16384
rmgr: Heap      len (rec/tot):  35/ 67, tx: 683, lsn: 0/016BCBB0, prev 0/016BCAF0, bkp: , desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree     len (rec/tot):  20/ 52, tx: 683, lsn: 0/016BCBF8, prev 0/016BCBB0, bkp: , desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree     len (rec/tot):  34/ 66, tx: 683, lsn: 0/016BCC30, prev 0/016BCBF8, bkp: , desc: insert: rel 1663/12139/16390; tid 1/1
rmgr: Storage   len (rec/tot):  16/ 48, tx: 683, lsn: 0/016BCC78, prev 0/016BCC30, bkp: , desc: file truncate: base/12139/16386 to 0 blocks
rmgr: Storage   len (rec/tot):  16/ 48, tx: 683, lsn: 0/016BCCA8, prev 0/016BCC78, bkp: , desc: file truncate: base/12139/16390 to 0 blocks
rmgr: Heap      len (rec/tot):  35/ 67, tx: 683, lsn: 0/016BCCD8, prev 0/016BCCA8, bkp: , desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree     len (rec/tot):  20/ 52, tx: 683, lsn: 0/016BCD20, prev 0/016BCCD8, bkp: , desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree     len (rec/tot):  34/ 66, tx: 683, lsn: 0/016BCD58, prev 0/016BCD20, bkp: , desc: insert: rel 1663/12139/16390; tid 1/1

 relname     | relfilenode
-------------+-------------
 test        |       16386
 test_id_seq |       16384
 test_pkey   |       16390
(3 rows)

And amazingly, the database cluster now successfully recovers and there's
no error. So the problem occurs *only* because there is no data in the
table at commit time. Which indicates that it's the 'newroot' record that
saves the day normally, and it's apparently generated by the first insert.

> Agreed. I think the problem is something else though. Namely that we
> reuse the relfilenode for heap_truncate_one_rel(). That's just entirely
> broken afaics. We need to allocate a new relfilenode and write stuff into
> that. Then we can forgo WAL logging the truncation record.

Would that properly initialise the index though? Anyway, this is way
outside my expertise, so I'll bow out now. Let me know if I can be of more
assistance.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:
> > Hmm, for me it is 100% reproducible. Are you familiar with Docker? I can
> > probably construct a Dockerfile that reproduces it pretty reliably.
>
> I could reproduce the problem in the master branch by doing the following
> steps.

Thank you. I wasn't sure if you could kill the server fast enough without
containers, but it looks like immediate mode is enough.

> 1. start the PostgreSQL server with wal_level = minimal
> 2. execute the following SQL statements:
>      begin;
>      create table test(id serial primary key);
>      truncate table test;
>      commit;
> 3. shutdown the server with immediate mode
> 4. restart the server (crash recovery occurs)
> 5. execute the following SQL statement:
>      select * from test;
>
> The optimization of the TRUNCATE operation that we can use when CREATE
> TABLE and TRUNCATE are executed in the same transaction block seems to
> cause the problem. In this case, only the index file truncation is
> logged, and the index creation in btbuild() is not logged because
> wal_level is minimal. Then at the subsequent crash recovery, the index
> file is truncated to 0 bytes... A very simple fix is to log the index
> creation in that case, but I'm not sure if that's OK to do...

Looks plausible to me. For reference I attach a small tarball for
reproduction with Docker:

1. Unpack the tarball into an empty dir (it has three small files)
2. docker build -t test .
3. docker run -v /tmp/pgtest:/data test
4. docker run -v /tmp/pgtest:/data test

The data dir is in /tmp/pgtest.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer

postgresql-test.tgz Description: GNU Unix tar archive
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
> Hi,
>
> On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
> > === Start with an empty database
>
> My guess is you have wal_level = minimal?

Default config, was just initdb'd. So yes, the default wal_level = minimal.

> > === Note the index file is 8KB.
> > === At this point nuke the database server (in this case it was simply
> > === destroying the container it was running in.
>
> How did you continue from there? The container has persistent storage?
> Or are you reapplying the WAL to somewhere else?

The container has persistent storage on the host. What I think is actually
unusual is that the script that started postgres was missing an 'exec', so
postgres never gets the signal to shut down.

martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 00010001 | grep -wE '16389|16387|16393'
rmgr: XLOG      len (rec/tot):  72/104, tx:   0, lsn: 0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage   len (rec/tot):  16/ 48, tx:   0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
rmgr: Sequence  len (rec/tot): 158/190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage   len (rec/tot):  16/ 48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
rmgr: Storage   len (rec/tot):  16/ 48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
rmgr: Sequence  len (rec/tot): 158/190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage   len (rec/tot):  16/ 48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage   len (rec/tot):  16/ 48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740

> Note that the truncate will lead to a new, different, relfilenode.

Really? Comparing the relfilenodes gives the same values before and after
the truncate.

ctmp=# select * from test;
ERROR: could not read block 0 in file base/16385/16393: read only 0 of 8192 bytes

> Hm. I can't reproduce this. Can you include a bit more details about how
> to reproduce?

Hmm, for me it is 100% reproducible. Are you familiar with Docker? I can
probably construct a Dockerfile that reproduces it pretty reliably.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
[HACKERS] WAL logging problem in 9.4.3?
this helps.

-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] upper planner path-ification
On Thu, May 14, 2015 at 12:19:44PM -0400, Robert Haas wrote:
> Well, I'm just shooting from the hip here, but it seems to me that the
> basic pipeline as it exists today is Join - Aggregate - SetOp - Limit -
> LockRows. I don't think Limit or LockRows can be moved any earlier.
> SetOps have a lot in common with Aggregates, though, and might be able to
> commute. For instance, if you had an EXCEPT that was going to knock out
> most of the rows and also a DISTINCT, you might want to postpone the
> DISTINCT until after you do the EXCEPT, or you might just want to do both
> at once.

Also thinking a little from the side: an SQL query is an expression of
some tree in relational algebra, and a Path is a representation of a way
to achieve some sub-part of it. The planner attempts to find alternative
ways of generating paths by reordering joins, but AIUI it doesn't do much
about aggregations. What is essentially being discussed here is also
allowing commuting of aggregations and joins. DISTINCT and DISTINCT ON are
just special kinds of aggregations, so they don't need to be considered
separately.

ISTM you should be able to note, for each aggregation, which joins it
commutes with and which it doesn't, perhaps even with a simple bitmap. The
backbone of the plan is the order of the aggregations, which generally
won't commute at all, and the joins can float around as long as the
dependencies (stuff that won't commute) are satisfied.

> And both set operations and aggregates can sometimes commute with joins,
> which seems like the stickiest wicket, because there can't be more than a
> couple of levels of grouping or aggregation in the same subquery (GROUP
> BY, DISTINCT, UNION?) but there can be lots of joins, and if a particular
> aggregate can be implemented in lots of places, things start to get
> complicated.

I think this is basically the same idea. I'm not sure aggregates and set
operations can commute in general, unless you could somehow (conceptually)
describe them as a join/antijoin. UNION might be special here.

> Like, if you've got a SELECT DISTINCT ON (a.x) ... FROM ...lots...-type
> query, I think you can pretty much slap the DISTINCT on there at any
> point in the join nest, probably ideally at the point where the number of
> rows is the lowest it's going to get, or maybe when the sort order is
> convenient. For a GROUP BY query with ungrouped dependent columns, you
> can do the aggregation before or after those joins, but you must do it
> after the joins that are needed to provide all the values needed for the
> aggregated columns. If you model this with RelOptInfos, you're basically
> going to need a RelOptInfo to say "I include these relids and these
> aggregation or setop operations". So in the worst case each agg/setop
> doubles the number of RelOptInfos, although probably in reality it
> doesn't because you won't have infinite flexibility to reorder things.

I think it's more like the number of possibilities doubles for each join
that could commute with the aggregate. But as you say, the number of
actual possibilities doesn't grow that fast. It may be better to have each
path track the relids and aggregates it covers, but then you need an
efficient way to work out which rels/aggregates can be considered for each
path. Either way, it sounds like quite a planner overhaul.

Hope this helps,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 12:37:44AM -0400, Robert Haas wrote:
> > I think such a solution will be good for the cases when many evictions
> > need to be performed to satisfy the workload; OTOH when there are not
> > too many evictions to be done, some of the buffers that are accessed
> > much more will have an equal probability of getting evicted compared to
> > buffers which are accessed less.
>
> Possibly, but I think it's even worse under the current algorithm. Under
> this proposal, if we go for a long time without any buffer evictions,
> every buffer's usage count will top out at 2 more than wherever it was
> after the last clock sweep. In the worst case, every buffer (or most of
> them) could end up with the same usage count. But under the status quo,
> they'll all go to 5, which is an even bigger loss of information, and
> which will make the first eviction much more expensive than if they were
> all pegged at 2 or 3.

I've been following this thread from the side with interest and got
twigged by the point about loss of information. If you'd like better
information about relative ages, you can achieve it by raising the cap on
the usage count and dividing (or right-shifting) each count on every
sweep. This would let you remember much more about the relative worth of
often-used pages.

With a cap of 32 you'd have the same effect as now, where after 5 sweeps
an unused buffer is evicted. Mathematically the count converges to the
number of times the block is used per sweep.

If you wanted to be really clever, you could at the beginning of each
sweep take an estimate of the number of buffers used since the last sweep
(from the stats collector, perhaps) and use that to drive your divisor, so
if you have lots of allocations you become more aggressive about reducing
the counts. Or if the load is light, fall back to just subtracting one.
Then you don't need a cap at all.

(Apologies if this has been suggested before; Google didn't find anything
for me.)
Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory
On Thu, Feb 12, 2015 at 08:52:56AM -0500, Robert Haas wrote:
> > BTW, I'm not all that thrilled with the "deserialized object"
> > terminology. I found myself repeatedly tripping up on which form was
> > serialized and which de-. If anyone's got a better naming idea I'm
> > willing to adopt it.
>
> My first thought is that we should form some kind of TOAST-like
> backronym, like Serialization Avoidance Loading and Access Device (SALAD)
> or Break-up, Read, Edit, Assemble, and Deposit (BREAD). I don't think
> there is anything per se wrong with the terms serialization and
> deserialization; indeed, I used the same ones in the parallel-mode stuff.
> But they are fairly general terms, so it might be nice to have something
> more specific that applies just to this particular usage.

The words that sprang to mind for me were: packed/unpacked.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] longjmp clobber warnings are utterly broken in modern gcc
On Sun, Jan 25, 2015 at 07:11:12PM -0500, Tom Lane wrote:
> Martijn van Oosterhout klep...@svana.org writes:
> > On Sun, Jan 25, 2015 at 02:02:47PM -0500, Tom Lane wrote:
> > It's a bit of a long shot, but perhaps if you put something like:
> >     asm volatile("" ::: "memory");
> > at the beginning of the catch-block it might convince the compiler to
> > forget any assumptions about what is in the local variables...
>
> Meh. Even if that worked for gcc (which as you say is uncertain), it
> would help not at all for other compilers. The POSIX requirements for
> portable code are clear: we need a volatile marker on affected variables.

Never mind, it doesn't work. It's not that GCC doesn't know setjmp() is
special; it does (the returns_twice attribute). So GCC effectively does
the above itself. The problem is that local variables may be stored in
memory over calls in the PG_TRY() block, and volatile is a sledgehammer
way of preventing that.

The problem is, GCC doesn't know anything about what the return value of
setjmp() means, which means it can never produce any sensible warnings in
this area. If you want the compiler to catch this, I don't see any way
without requiring the code to indicate specifically which local variables
it intends to use, or not using the locals at all by using a separate
cleanup function (as discussed elsewhere in this thread). With information
about the locals you might be able to conjure up some GCC macros that
complain if you use anything else.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] longjmp clobber warnings are utterly broken in modern gcc
On Sun, Jan 25, 2015 at 02:02:47PM -0500, Tom Lane wrote:
> ... and compared the assembly language generated with and without adding
> volatile to Tmpfd's declaration. Without volatile (ie, in the code as
> shipped), gcc optimizes away the assignment Tmpfd = -1 within PG_TRY, and
> it also optimizes away the if-test in PG_CATCH, apparently believing that
> control cannot transfer from inside the PG_TRY to the PG_CATCH. This is
> utterly wrong of course. The issue is masked because we don't bother to
> test for a failure return from the second close() call, but it's not hard
> to think of similar coding patterns where this type of mistaken
> optimization would be disastrous. (Even here, the bogus close call could
> cause a problem if we'd happened to open another file during the last
> part of the PG_TRY stanza.)
>
> <snip>
>
> This is scary as hell. I intend to go around and manually audit every
> single PG_TRY in the current source code, but that is obviously not a
> long-term solution. Anybody have an idea about how we might get
> trustworthy mechanical detection of this type of situation?

It's a bit of a long shot, but perhaps if you put something like:

    asm volatile("" ::: "memory");

at the beginning of the catch-block it might convince the compiler to
forget any assumptions about what is in the local variables...

Hope this helps,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Thu, Jan 22, 2015 at 03:50:03PM -0600, Merlin Moncure wrote:
> Quick update: not done yet, but I'm making consistent progress, with
> several false starts. (For example, I had a .conf problem with the new
> dynamic shared memory setting and git merrily bisected down to the
> introduction of the feature.) I have to triple-check everything :(. The
> problem is generally reproducible but I get false negatives that throw
> off the bisection. I estimate that early next week I'll have it narrowed
> down significantly, if not to the exact offending revision.

I've never used it, but the BBChop project claims to be able to bisect
even in the case of intermittent failure:

https://github.com/Ealdwulf/bbchop/

It claims to be slow, but I don't think that's the limiting factor here...

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Fri, Dec 19, 2014 at 05:32:43PM -0800, Peter Geoghegan wrote:
> > Most people would list the columns, but if there is a really bizarre
> > constraint, with non-default opclasses, or an exclusion constraint,
> > it's probably been given a name that you could use.
>
> What I find curious about the opclass thing is: when do you ever have an
> opclass that has a different idea of equality than the default opclass
> for the type? In other words, when is B-Tree strategy number 3 not
> actually '=' in practice, for *any* B-Tree opclass? Certainly, it doesn't
> appear to be the case that it isn't so with any shipped opclasses; the
> shipped non-default B-Tree opclasses only serve to provide alternative
> notions of sort order, and never of equals.

Well, in theory you could build a case-insensitive index on a text column.
You could argue that the column should have been defined as citext in the
first place, but it might not be for various reasons.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] [PATCH] add ssl_protocols configuration option
On Wed, Oct 22, 2014 at 09:36:59PM +0200, Dag-Erling Smørgrav wrote:
> Martijn van Oosterhout klep...@svana.org writes:
> > Dag-Erling Smørgrav d...@des.no writes:
> > > If I understand correctly, imaps has been shown to be vulnerable as
> > > well, so I wouldn't be so sure.
> > Reference?
>
> Sorry, no reference. I was told that Thunderbird was vulnerable to POODLE
> when talking imaps.

Ugh, found it. It does the same connection-fallback stuff as Firefox:

https://securityblog.redhat.com/2014/10/20/can-ssl-3-0-be-fixed-an-analysis-of-the-poodle-attack/

> > Since you can already specify the cipher list, couldn't you just add
> > -SSLv3 to the cipher list and be done?
>
> I didn't want to change the existing behavior; all I wanted was to give
> users a way to do so if they wish.

I think we should just disable SSL 3.0 altogether. The only way this could
cause problems is if people are using PostgreSQL with an OpenSSL library
from the last century. As for client libraries, even Windows XP supports
TLS 1.0.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] [PATCH] add ssl_protocols configuration option
On Wed, Oct 22, 2014 at 03:14:26PM +0200, Dag-Erling Smørgrav wrote:
> > In a case like POODLE we probably wouldn't have done it anyway, as it
> > doesn't affect our connections (we're not http)
>
> If I understand correctly, imaps has been shown to be vulnerable as well,
> so I wouldn't be so sure.

Reference? It's an SSL3-specific attack, and most servers are not
vulnerable because TLS will negotiate to the highest supported protocol.
So unless one of the client/server pair doesn't support TLS 1.0 there's no
issue. The only reason http is vulnerable is because browsers do protocol
downgrading, something strictly forbidden by the spec. Additionally, the
man-in-the-middle must be able to control the padding in the startup
packet, which just isn't possible (there's no scripting language in the
client).

Since you can already specify the cipher list, couldn't you just add
-SSLv3 to the cipher list and be done?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Collations and Replication; Next Steps
On Thu, Sep 18, 2014 at 01:35:10PM +0900, Tatsuo Ishii wrote:
> In my understanding PostgreSQL's manual MUST include the ICU license
> terms (this is not a problem). What I am not so sure about is whether any
> software that uses PostgreSQL also MUST include the ICU license. If yes,
> I think this is surely a problem.

Only if we're thinking of distributing it. If the user gets ICU from their
distribution then there is no need to include the licence (just like we
don't need to mention the licence of glibc). We only need to link against
it, not distribute it.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Anonymous code block with parameters
On Wed, Sep 17, 2014 at 10:17:22PM +0200, Pavel Stehule wrote:
> > Because you still have to do SELECT pg_temp.my_temp_function(blah); to
> > execute it.
>
> this problem should be solvable. I can use a temporary table without
> using the pg_temp schema.

Umm, IIRC it used to work that way but was changed to work like this. IIRC
the reason was that anyone can create functions in the temp schema and
thus hijack other functions that more privileged functions might call. Or
something like that. I think it was even a CVE.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Collations and Replication; Next Steps
On Wed, Sep 17, 2014 at 03:57:38PM +0100, Greg Stark wrote:
> Then there's the concern that ICU is a *huge* dependency. ICU is itself
> larger than the entire Postgres install. It's a big burden on users to
> have to install and configure a second collation library in addition to
> the system library, and a complete non-starter for embedded systems or
> low-memory systems.

$ apt-cache show libicu52 | grep Installed-Size
Installed-Size: 27283

That's 27MB, or less than 2 WAL files. Or about 4 times the template
database, of which 3 copies are created during install and the first user
database will be a fourth. The installed size of Postgres is 11MB, not
including any of the libraries it already depends on. Yes, it's not a
small library, but let's not get carried away. And if it's optional then
low-memory systems can configure it out.

As for configuration, ICU doesn't require configuration, just like glibc
doesn't require configuration.

Mvg,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Collations and Replication; Next Steps
On Tue, Sep 16, 2014 at 02:57:00PM -0700, Peter Geoghegan wrote:
> On Tue, Sep 16, 2014 at 2:07 PM, Peter Eisentraut pete...@gmx.net wrote:
> > Clearly, this is worth documenting, but I don't think we can completely
> > prevent the problem.
>
> There has been talk of a built-in index integrity checking tool. That
> would be quite useful. We could at least use the GNU facility for
> versioning collations where available, LC_IDENTIFICATION [1]. By not
> versioning collations, we are going against the express advice of the
> Unicode consortium (they also advise doing a strcmp() tie-breaker,
> something that I think we independently discovered in 2005 because of a
> bug report - this is what I like to call the "Hungarian issue". They know
> what our constraints are.). I recognize it's a tricky problem, because of
> our historic dependence on OS collations, but I think we should
> definitely do something. That said, I'm not volunteering for the task,
> because I don't have time. While I'm not sure of what the long-term
> solution should be, it *is not* okay that we don't version collations. I
> think that even the best possible B-Tree check tool is not a solution.

Personally I think we should just support ICU as an option. FreeBSD has
been maintaining an out-of-tree patch for 10 years now, so we know it
works. The FreeBSD patch is not optimal though; these days ICU supports
UTF-8 directly, so many of the push-ups FreeBSD does are no longer
necessary. It is often faster than glibc, and the key sizes for strxfrm
are more compact [1], which is relevant for the recent optimisation patch.

Let's solve this problem once and for all.

[1] http://site.icu-project.org/charts/collation-icu4c48-glibc

-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Collations and Replication; Next Steps
On Wed, Sep 17, 2014 at 01:07:56PM +0000, Matthew Kelly wrote:
> I'm with Martijn here, let's go ICU, if only because it moves sorting to
> a user-level library instead of a system-level one. Martijn, do you have
> a link to the out-of-tree patch? If not, I'll find it. I'd like to apply
> it to a branch and start playing with it.

http://people.freebsd.org/~girgen/postgresql-icu/README.html
http://people.freebsd.org/~girgen/postgresql-icu/

Note I said optional. It is a large library for sure, but for some
installations I think the benefits are sufficient.

Mvg,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Aussie timezone database changes incoming
On Thu, Sep 11, 2014 at 03:42:14PM +1000, Andrew McNamara wrote: Anyone from down under care to remark about the actual usage of old and new abbreviations? About bloody time! AEST/AEDT/etc are the official abbreviations and are commonly used. They have been increasingly used over the last 20 years or so, and the EST/EDT stuff in the Olson tz database has been a source of annoyance for a very long time, e.g.: http://thread.gmane.org/gmane.comp.time.tz/2262 Quite likely this change will break stuff, but my feeling is more people will be cheering than screaming. Indeed, this has been a pain in the ass for a long, long time. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] On partitioning
On Tue, Sep 02, 2014 at 09:44:17AM -0400, Bruce Momjian wrote: On Sun, Aug 31, 2014 at 10:45:29PM +0200, Martijn van Oosterhout wrote: There is one situation where you need to be more flexible, and that is if you ever want to support online repartitioning. To do that you have to distinguish between "I want to insert tuple X, which partition should it go into" and "I want to know which partitions I need to look in for partition_key=Y". I am unclear why having information per-partition rather than on the parent table helps with online repartitioning.

An example: we have three partitions, one for X<0 (A), one for 0<=X<5 (B) and one for X>=5 (C). These are in three different tables. Now we give the command to merge the last two partitions, B and C. You then have the choice of locking the table while you move all the tuples from C to B. Or you can make some adjustments such that new tuples that would have gone to C now go to B, and if there is a query for X=10 you look in *both* B and C. Then the existing tuples can be moved from C to B at any time without blocking any other operations. Is this clearer? If you decide up front that which partition to query is determined by a function that can only return one table, then the above becomes impossible.

Robert's idea of using normal table inheritance means we can access/move the data independently of the partitioning system. My guess is that we will need to do repartitioning with some tool, rather than as part of normal database operation.

Doing it with some tool seems like a hack to me. And since the idea was (I thought) that partitions would not be directly accessible from SQL, it has to be in the database itself. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] On partitioning
On Fri, Aug 29, 2014 at 12:35:50PM -0400, Tom Lane wrote: Each partition is assigned an Expression that receives a tuple and returns boolean. This expression returns true if a given tuple belongs in it, false otherwise. -1, in fact minus a lot. One of the core problems of the current approach is that the system, particularly the planner, hasn't got a lot of insight into exactly what the partitioning scheme is in a partitioned table built on inheritance. If you allow the partitioning rule to be a black box then that doesn't get any better. I want to see a design wherein the system understands *exactly* what the partitioning behavior is. I'd start with supporting range-based partitioning explicitly, and maybe we could add other behaviors such as hashing later. In particular, there should never be any question at all that there is exactly one partition that a given row belongs to, not more, not less. You can't achieve that with a set of independent filter expressions; a meta-rule that says exactly one of them should return true is an untrustworthy band-aid. (This does not preclude us from mapping the tuple through the partitioning rule and finding that the corresponding partition doesn't currently exist. I think we could view the partitioning rule as a function from tuples to partition numbers, and then we look in pg_class to see if such a partition exists.) There is one situation where you need to be more flexible, and that is if you ever want to support online repartitioning. To do that you have to distinguish between "I want to insert tuple X, which partition should it go into" and "I want to know which partitions I need to look in for partition_key=Y". For the latter you really need an expression per partition, or something equivalent. If performance is an issue I suppose you could live with having an old and a new partition scheme, so you couldn't have two live repartitionings happening simultaneously. 
Now, if you want to close the door on online repartitioning forever then that's fine. But being in the position of having to say "yes, our partitioning scheme sucks, but we would have to take the database down for a week to fix it" is no fun. Unless logical replication provides a way out, maybe? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] SQL MERGE is quite distinct from UPSERT
On Sat, Jul 19, 2014 at 09:55:19PM -0700, Peter Geoghegan wrote: At a high level SQL MERGE is quite distinct from UPSERT, in that it is a utility command that performs inserts, updates and deletes while avoiding race conditions (e.g. unique constraint violations) on a more or less best effort basis. MERGE is conceptually messy. In contrast UPSERT is actually atomic, and having its behavior be relatively easy to reason about ought to be the top priority. There is a *really* big demand for UPSERT from users, not MERGE, although MERGE is certainly useful too. FWIW, I agree. MERGE is hard enough as it is, but trying to guarantee some kind of atomicity makes it nigh on impossible. Indeed, after reading what you wrote I think it may well be impossible to make it atomic *and* make it perform in the general case. So, +1 for UPSERT. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] better atomics - v0.5
On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote: As per my understanding of the general theory around barriers, read and write barriers are defined to avoid reordering due to the compiler, and full memory barriers are defined to avoid reordering due to processors. There are some processors that support instructions for memory barriers with acquire, release and fence semantics. Not exactly: both compilers and processors can rearrange loads and stores. Because the architecture most developers work on (x86) has such a strong memory model (it goes to a lot of effort to hide reordering), people are under the impression that it's only the compiler you need to worry about, but that's not true for any other architecture. Memory barriers/fences are on the whole discouraged if you can use acquire/release semantics, because the latter are faster and much easier to optimise for. I strongly recommend the "Atomic Weapons" talk on the C11 memory model to help understand how they work. As a bonus it includes correct implementations for the major architectures. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Can simplify 'limit 1' with slow function?
Fascinating. On Fri, Jul 04, 2014 at 10:47:06AM +0800, gotoschool6g wrote: slow query (8531 ms): SELECT ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 40.12211338311868)')) FROM road order by id LIMIT 1; explain output:

Limit  (cost=4653.48..4653.48 rows=1 width=3612)
  ->  Sort  (cost=4653.48..4683.06 rows=11832 width=3612)
        Sort Key: id
        ->  Seq Scan on road  (cost=0.00..4594.32 rows=11832 width=3612)

fast query (16 ms): select ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 40.12211338311868)')) from (SELECT shape FROM road order by id LIMIT 1) a; explain output:

Subquery Scan on a  (cost=1695.48..1695.74 rows=1 width=3608)
  ->  Limit  (cost=1695.48..1695.48 rows=1 width=3612)
        ->  Sort  (cost=1695.48..1725.06 rows=11832 width=3612)
              Sort Key: road.id
              ->  Seq Scan on road  (cost=0.00..1636.32 rows=11832 width=3612)

So Postgres knows perfectly well that the function is expensive; it just doesn't appear to understand that it has the option of moving the calculation above the limit. In this case, though, it seems an index on road(id) would make it instant in any case. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Can simplify 'limit 1' with slow function?
On Tue, Jul 01, 2014 at 02:36:55PM -0500, Merlin Moncure wrote: On Tue, Jul 1, 2014 at 2:16 PM, Martijn van Oosterhout klep...@svana.org wrote: On Sun, Jun 29, 2014 at 10:05:50PM +0800, gotoschool6g wrote: The simplified scene: select slowfunction(s) from a order by b limit 1; is slower than select slowfunction(s) from (select s from a order by b limit 1) as z; if there are many records in table 'a'. The real scene: function ST_Distance_Sphere is slow, and the query: SELECT ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from road order by c limit 1; is slower than: select ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from (SELECT s from road order by c limit 1) as a; There are about 7000 records in 'road'. I think to help here we need the EXPLAIN ANALYSE output for both queries. Well, I think the problem is a well understood one: there is no guarantee that functions-in-select-list are called exactly once per output row. This is documented -- for example see here: http://www.postgresql.org/docs/9.1/static/explicit-locking.html#ADVISORY-LOCKS. In short, if you want very precise control of function evaluation use a subquery, or, if you're really paranoid, a CTE. I'm probably dense, but I'm not sure I understand. Or is it that the slowfunction() is called prior to the sort? That seems insane. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Can simplify 'limit 1' with slow function?
On Wed, Jul 02, 2014 at 04:17:13PM -0400, Tom Lane wrote: David G Johnston david.g.johns...@gmail.com writes: Martijn van Oosterhout wrote I'm probably dense, but I'm not sure I understand. Or is it that the slowfunction() is called prior to the sort? That seems insane. The basic reality is that limit applies to the final set of rows that could be output. It's not so much the limit as that the sort has to happen before the limit, and yes, evaluation of the targetlist happens before the sort. I guess I assumed the column c was indexable, and in that case I believe the slowfunction() would indeed only be called once. This is fundamental to the SQL conceptual model; remember that SQL92 had SELECT slowfunction(), ... ORDER BY 1, which certainly requires the function to be evaluated before the sort happens. And there's nothing in the conceptual model suggesting that different targetlist entries should be evaluated at different times, so just ordering by something other than the slowfunction() entry doesn't get you out of that. I'm not sure how much of this there is chapter and verse for in the SQL standard, but ISTM the stage sequencing we lay out in our SELECT reference page is pretty much forced by the standard. In the conceptual model the limit must happen after the select. But as an optimisation, moving the evaluation above the limit node (when possible) should always be a win. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Can simplify 'limit 1' with slow function?
On Sun, Jun 29, 2014 at 10:05:50PM +0800, gotoschool6g wrote: The simplified scene: select slowfunction(s) from a order by b limit 1; is slower than select slowfunction(s) from (select s from a order by b limit 1) as z; if there are many records in table 'a'. The real scene: function ST_Distance_Sphere is slow, and the query: SELECT ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from road order by c limit 1; is slower than: select ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from (SELECT s from road order by c limit 1) as a; There are about 7000 records in 'road'. I think to help here we need the EXPLAIN ANALYSE output for both queries. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Escaping from blocked send() reprised.
On Tue, Jul 01, 2014 at 12:26:43PM +0900, Kyotaro HORIGUCHI wrote: 1. I think it's the case that there are platforms around where a signal won't cause send() to return EINTR and I'd be entirely unsurprised if SSL_write() doesn't necessarily return EINTR in that case. I'm not sure what, if anything, we can do about that. man 2 send on FreeBSD has no description of EINTR. And even on Linux, send won't return EINTR in most cases; at least I haven't seen it. So send() = -1 with EINTR seems to me only an equivalent of send() = 0. I have no idea what the implementer thought the difference was. Whether send() returns EINTR or not depends on whether the signal has been marked restartable or not. This is configurable per signal, see sigaction(). If the signal is marked to restart, the kernel returns ERESTARTHAND (IIRC) and the libc will redo the call internally. Default BSD does not return EINTR normally, but supports sigaction(). Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Minmax indexes
I'm sorry if I missed something, but ISTM this is beginning to look a lot like GiST. This was pointed out by Robert Haas last year. On Wed, Jun 18, 2014 at 12:09:42PM -0300, Claudio Freire wrote: So, you have:

An aggregate to generate a compressed set from several values
Which GiST does by calling 'compress' on each value, and then 'unions' the results together.

A function which adds a new value to the compressed set and returns the new compressed set
Again, 'compress' + 'union'.

A function which tests if a value is in a compressed set
Which GiST does using 'compress' + 'consistent'.

A function which tests if a compressed set overlaps another compressed set of equal type
Which GiST calls 'consistent'.

So I'm wondering why you can't just reuse the btree_gist functions we already have in contrib. It seems to me that these MinMax indexes are in fact a variation on GiST that indexes the pages of a table based upon the 'union' of all the elements in a page. By reusing the GiST operator class you get support for many datatypes for free. If you can define different compressed sets, you can use this to generate both min/max indexes as well as bloom filter indexes. Whether we'd want to have both is perhaps questionable, but having the ability to is probably desirable. You could implement a bloom filter in GiST too. It's been discussed before but I can't find any implementation. Probably because the filter needs to be parameterised, and if you store the bloom filter for each element it gets expensive very quickly. However, hooked into a minmax structure which only indexes whole pages it could be quite efficient. One problem with such a generalized implementation would be that I'm not sure in-place modification of the compressed set on-disk can be assumed to be safe in all cases. Surely for strictly-enlarging sets it would, but while min/max and bloom filters both fit the bill, it's not clear that one can assume this for all structures. I think GiST has already solved this problem. 
Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] UPDATE SET (a,b,c) = (SELECT ...) versus rules
On Sat, Jun 14, 2014 at 03:35:33PM -0400, Tom Lane wrote: The best that I think is reasonable to do in such cases is to pull out a separate copy of the sub-select for each actual NEW reference in a rule query. So the example above would give rise to an expanded rule query along the lines of INSERT INTO foolog VALUES ( (SELECT x as a, y as b, ...).a, (SELECT x as a, y as b, ...).b, ... ); Would it not be possible to use WITH here, like: WITH bar AS ( ... subselect ... ) INSERT INTO foolog VALUES (bar.a, bar.b, ...) Or am I missing something? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Allowing NOT IN to use ANTI joins
On Mon, Jun 09, 2014 at 12:36:30AM +1200, David Rowley wrote: Currently pull_up_sublinks_qual_recurse only changes the plan for NOT EXISTS queries and leaves NOT IN alone. The reason for this is because the values returned by a subquery in the IN clause could have NULLs. Awesome. I've had a brief look at the patch and other than a line of extraneous whitespace it looks sane. Since it only tests NOT IN queries, I don't think there are any issues with it slowing down simple queries. I also note you can't prove id+1 not null. At first I thought you might be able to prove this not null if the operator/function was strict, but then I realised that strict only means 'null if input is null', not 'output is only null if inputs are null'. Pity. Nice work. -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Supporting Windows SChannel as OpenSSL replacement
On Mon, Jun 09, 2014 at 03:35:23PM +0200, Magnus Hagander wrote: On Mon, Jun 9, 2014 at 3:19 PM, Andreas Karlsson andr...@proxel.se wrote: On 06/09/2014 01:45 PM, Heikki Linnakangas wrote: There was a patch set for this from Martijn van Oosterhout which was quite complete. http://www.postgresql.org/message-id/20060504134807.gk4...@svana.org Wow, blast from the past. A lot has, unfortunately, changed since 2006. It might be a good starting point. But actively starting from the point of "let's try to support multiple libraries" rather than "let's try to support gnutls" is probably also important. The patch did provide an API. The idea was that there was a set of functions which would need to be defined to support an SSL library. Each library would then have a wrapper which wrapped the library, and based on the results of configure the right file was compiled into the backend. These functions were:

extern void pgtls_initialize(void);
extern void pgtls_destroy(void);
extern int pgtls_open_server(Port *);
extern void pgtls_close(Port *);
extern ssize_t pgtls_read(Port *port, void *ptr, size_t len);
extern ssize_t pgtls_write(Port *port, void *ptr, size_t len);

These should be easy enough to support for any library. These days you'd need to add support for verifying certificates, but I don't think that would be difficult (unless the actual certificate formats are different). No switching after compile time, that would just lead to useless overhead. At some point we should design a new API, so that we can deprecate the old one. Even if we don't have the code ready, we need to get rid of PQgetssl(), and replace it with something else. I'm thinking probably a function that returns both a void pointer and an enum that tells you which library is actually in use. And a boolean just saying ssl on/off, because that's what a lot of clients are interested in and they don't care about more than that. Obviously, we also have to do something about PQinitOpenSSL(). 
Yeah, I think this was one of the more controversial parts. Support in the backend was primarily moving code around and renaming functions, fairly straightforward. Even error handling was not so hard (I found the gnutls handling of errors much easier than openssl's). One tricky part is that programs like to use libpq for the authentication, and then hijack the connection using PQgetssl(). The way I dealt with this was defining a new state, 'passthrough', where the caller would get a few function pointers to read/write/check the connection. Then the callers would not need to know what library libpq was compiled with, and libpq would know the connection was hijacked and refuse to act any more. I don't think everyone was pleased with this, but no real alternative was presented (other than requiring people hijacking the connection to do the hard work). For information about which library was in use there was PQgettlsinfo(), which returned a PGresult with information about the library and connection. I believe since then new functions have been added to libpq to retrieve info about certificates, so that might need a rethink also. Basically, I think these last two points are the hard parts to get agreement on (assuming there's agreement to do anything at all about the problem), and without nailing those down first whoever picks this up will be in for a lot of work. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Supporting Windows SChannel as OpenSSL replacement
On Mon, Jun 09, 2014 at 11:39:17PM +0900, MauMau wrote: From: Heikki Linnakangas hlinnakan...@vmware.com Thoughts? While we're at it, we'll probably want to refactor things so that it's easy to support other SSL implementations too, like gnutls. That may be good because it provides users with choices. But I wonder if it is worth the complexity and maintainability cost in the PostgreSQL code. The complexity is very low. SSL is a standard protocol and so all libraries offer the same functionality. We're not really doing anything complex. * Are SChannel and other libraries more secure than OpenSSL? IIRC, recently I read in the news that GnuTLS had a vulnerability. OpenSSL is probably the most widely used library, and many people are getting more interested in its quality. I expect the quality will improve thanks to the help from The Linux Foundation and other organizations/researchers. Does that matter? What's wrong with letting people choose? OpenVPN these days supports multiple SSL libraries, because PolarSSL (for example) has been vetted for a higher security level than OpenSSL. * Do other libraries get support from commercial vendor product support? For example, Safenet Inc., the famous HSM (hardware security module) vendor, supports OpenSSL to access the private key stored in its HSM product. Intel offered AES-NI implementation code to the OpenSSL community. I guess OpenSSL will continue to be the most functional and obtain the widest adoption and support. And the crappiest license. I think it's silly for PostgreSQL to dictate what SSL library users must use, when there are so many possibilities. We also support libedit for, in my opinion, worse reasons. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
On Mon, Jun 02, 2014 at 01:29:25PM -0400, Robert Haas wrote: I agree, but I think it's important to note that Alex's complaint is not unique - the way things work now is a real source of frustration for users. In a previous job, I wrote a schema-upgrade script that dropped all of the views in reverse creation order, applied the schema updates, and then recreated all the views. This worked, but it was a lot of hassle that I would have preferred to avoid, and in a higher-volume application, simultaneously grabbing exclusive locks on a large number of critical views would have been a non-starter. In the job before that, I did the same thing manually, which was no fun at all. This was actually what prompted me to write one of my first patches, committed by Bruce as ff1ea2173a92dea975d399a4ca25723f83762e55. Would it be sufficient to automatically pass the type change through only if nothing in the view actually references it in a function, operator, group by, order by, etc? That is, it only appears in the SELECT list unadorned? Or is that too limiting? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Re: [COMMITTERS] pgsql: Fix bogus %name-prefix option syntax in all our Bison files.
On Thu, May 29, 2014 at 08:33:05AM -0400, Peter Eisentraut wrote: On 5/28/14, 7:02 PM, Andres Freund wrote: That's a good idea. What i've been thinking about is to add -Wno-deprecated to the bison rule in the interim. Maybe after a configure test for the option. All deprecation warnings so far seem to be pretty unhelpful. Here is a patch. Does this need a comment indicating why it's needed and when it can be removed? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] UUIDs in core WAS: 9.4 Proposal: Initdb creates a single table
On Fri, Apr 25, 2014 at 04:18:18PM +0100, Greg Stark wrote: Which isn't to say they're a bad idea but like everything else in engineering there are tradeoffs and no such thing as a free lunch. You can avoid depleting the entropy pool by including data you expect to be unique as a kind of fake entropy -- which quickly gets you back to looking for things like MAC address to avoid duplicates across systems. ISTM you could use the database identifier we already have to at least produce UUIDs which are unique amongst PostgreSQL instances. That might be something worth aiming for? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] PQputCopyData dont signal error
On Tue, Apr 01, 2014 at 01:53:13PM -0400, Robert Haas wrote: One of the things you mentioned is I often find it necessary to refer to existing examples of code when trying to figure out how to do things correctly. I couldn't agree more. Haven't seen one yet, but found plenty of discussion that tap danced around one or more of the components of the copy, put, end paradigm. Maybe I should have just asked for a sample code snippet but didn't after a day or so of frustration and trying to piece together other people's incomplete samples. FWIW, I've generally found that the best examples are what's in the core distribution. I'd go and look at a tool like psql or pg_restore and find the code that handles this, and then copy it and cut it down to what you need. To move the conversation along: https://github.com/postgres/postgres/blob/master/src/bin/psql/copy.c#L664 Seems possibly even more robust than most people will code, but it's had a lot of real world testing. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
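For readers who just want the bare put/end shape being asked for, a minimal sketch follows. It assumes a reachable server (via the usual PG* environment variables) and a table t(a int), and trims the error handling that psql's handleCopyIn does properly; it is an illustration, not a replacement for the copy.c code linked above:

```c
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int
main(void)
{
    /* empty conninfo: fall back to PGHOST/PGDATABASE/etc. */
    PGconn     *conn = PQconnectdb("");
    PGresult   *res;
    const char *rows = "1\n2\n3\n";

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* put the connection into COPY IN mode */
    res = PQexec(conn, "COPY t (a) FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return 1;
    }
    PQclear(res);

    /* send the data, then the end-of-data marker */
    if (PQputCopyData(conn, rows, (int) strlen(rows)) != 1 ||
        PQputCopyEnd(conn, NULL) != 1)
    {
        fprintf(stderr, "sending copy data failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* PQputCopyEnd only queues the terminator; the server's verdict
     * arrives as an ordinary result and must be collected. */
    res = PQgetResult(conn);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "COPY did not complete: %s", PQerrorMessage(conn));
    PQclear(res);
    PQfinish(conn);
    return 0;
}
```

The easy-to-miss step is the final PQgetResult(): skipping it is exactly the kind of corner the discussions "tap danced around", because errors reported by the server after the data was sent only show up there.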
Re: [HACKERS] [RFC] What should we do for reliable WAL archiving?
On Sat, Mar 22, 2014 at 06:22:37AM +0900, MauMau wrote: From: Jeff Janes jeff.ja...@gmail.com Do people really just copy the files from one directory of local storage to another directory of local storage? I don't see the point of that. It makes sense to archive WAL to a directory of local storage for media recovery. Here, the local storage is a different disk drive which is directly attached to the database server or directly connected through SAN. I'm one of those people. They are archived into a local directory in preparation for an rsync over ssh. The recommendation is to refuse to overwrite an existing file of the same name, and exit with failure. Which essentially brings archiving to a halt, because it keeps trying but it will keep failing. If we make a custom version, one thing it should do is determine if the existing archived file is just a truncated version of the attempting-to-be-archived file, and if so overwrite it. Because if the first archival command fails with a network glitch, it can leave behind a partial file. What I'm trying to address is just an alternative to cp/copy which fsyncs a file. It just overwrites an existing file. I ran into a related problem with cp, where halfway through the copy the disk was full and I was left with half a WAL file. This caused the rsync to copy only half a file and the replication broke. This is clearly a recoverable situation, but it didn't recover in this case. Yes, you're right, the failed archive attempt leaves behind a partial file which causes subsequent attempts to fail, if you follow the PG manual. That's another undesirable point in the current doc. To overcome this, someone on this ML recommended me to do "cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". Does this solve your problem? 
This would probably have handled it, but I find it odd that there's a program to handle restoring of archives properly, yet on the archiving side you have to cobble together your own shell scripts, which fail in various corner cases. I'd love a program that just Did The Right Thing. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Planner hints in Postgresql
On Mon, Mar 17, 2014 at 01:20:47PM -0500, Merlin Moncure wrote: A query plan is a complicated thing that is the result of detailed analysis of the data. I bet there are less than 100 users on the planet with the architectural knowledge of the planner to submit a 'plan'. What users do have is knowledge of the data that the database can't effectively gather for some reason. Looking at my query above, what it would need (assuming the planner could not be made to look through length()) would be something like: SELECT * FROM foo WHERE length(bar) <= 1000 WITH SELECTIVITY 0.999 AND length(bar) >= 2 WITH SELECTIVITY 0.999;

A small issue with selectivity is that the selectivity is probably not what the users are expecting anyway, since many will relate to conditional selectivities. PostgreSQL is pretty good at single column statistics; it just sometimes screws up on cross-column correlations. This ties in with alerting about a bad plan: if the EXPLAIN output could list for each condition what the actual selectivity was, it might give users a way of understanding the problem. So the example given might lead to output like:

clause                                  selectivity   estimated
length(bar) > 2                         0.50          0.50
length(bar) < 1000 | length(bar) > 2    0.50          0.25

The execution engine can only output conditional selectivities because of the order of execution. But this would at least give users a handle on the problem. Note that a first cut at the problem might simply be something like likely()/unlikely() as in gcc. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] HBA files w/include support?
On Fri, Feb 14, 2014 at 11:10:48AM -0500, Tom Lane wrote:
> The argument about wanting to assemble a pg_hba file from separately managed configuration pieces seems to have some merit, but the weak spot there is how do you define the search order? Or are you planning to just cross your fingers and hope it doesn't matter too much?

Well, in my case, since the only auth method used is md5, the order really doesn't matter. Besides, each combination of dbname and username only appears once. But for a general-use feature I can imagine it would be a concern.

This is not an important feature for me though: the config file is generated by puppet with a bunch of loops, and an include directory would not really reduce the amount of work.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Making strxfrm() blobs in indexes work
On Sun, Feb 02, 2014 at 05:09:06PM -0800, Peter Geoghegan wrote:
> However, it also occurs to me that strxfrm() blobs have another useful property: We (as, say, the author of an equality operator on text, an operator intended for a btree operator class) *can* trust a strcmp()'s result on blobs, provided the result isn't 0/equal, *even if the blobs are truncated*. So maybe a better scheme, and certainly a simpler one, would be to have a pseudo-attribute in inner pages only with, say, the first 8 bytes of a strxfrm() blob formed from the logically-leading text attribute of the same indexTuple. Because we're only doing this on inner pages, there is a very good chance that that will be good enough most of the time. This also allows us to reduce bloat very effectively.

(A bit late to the party). This idea has come up before, and the most annoying thing is that braindead strxfrm() API. Namely, to strxfrm a large string you need to transform it completely even if you only want the first 8 bytes. That said, since btree index tuples are limited to 3k anyway, the overhead probably isn't that bad. I think it would make a noticeable difference if it can be made to work.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
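The property Peter describes — a comparison of truncated strxfrm() blobs is trustworthy whenever it is not a tie — can be sketched with Python's `locale.strxfrm`. The "C" locale is assumed so the example is deterministic (there strxfrm is essentially the identity on ASCII):

```python
import locale

# If two strxfrm() blobs differ within the first n bytes, that prefix
# comparison is already decisive; only a tie forces a full comparison.
locale.setlocale(locale.LC_COLLATE, "C")

def prefix_cmp(a, b, n=8):
    """Compare the first n characters of the transformed strings.
    Returns -1/1 when decisive, None when the truncated blobs tie."""
    xa = locale.strxfrm(a)[:n]
    xb = locale.strxfrm(b)[:n]
    if xa < xb:
        return -1
    if xa > xb:
        return 1
    return None  # prefixes equal: truncation lost the deciding bytes

print(prefix_cmp("aardvark", "zebra"))       # -1: decided within 8 bytes
print(prefix_cmp("abcdefghX", "abcdefghY"))  # None: must compare full blobs
```

The None case is why an inner-page prefix can only prune, never prove equality — exactly the "provided the result isn't 0/equal" caveat.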
Re: [HACKERS] RFC: Async query processing
On Fri, Jan 03, 2014 at 04:46:23PM +0100, Florian Weimer wrote:
> On 01/03/2014 04:20 PM, Tom Lane wrote:
>> I think Florian has a good point there, and the reason is this: what you are talking about will be of exactly zero use to applications that want to see the results of one query before launching the next. Which eliminates a whole lot of apps. I suspect that almost the *only* common use case in which a stream of queries can be launched without feedback is going to be bulk data loading. It's not clear at all that pipelining the PQexec code path is the way to better performance for that --- why not use COPY, instead?
> The data I encounter has to be distributed across multiple tables. Switching between the COPY TO commands would again need client-side buffering and heuristics for sizing these buffers. Lengths of runs vary a lot in my case.

Why switch between COPY commands, why could you not do it in one? For example:

COPY table1(col1, col2, ...), table2(col1, col2, ...) FROM STDIN WITH (tableoids);
tableoid1<tab>col1<tab>col2...
tableoid2<tab>...
...
\.

There's no especially good reason why a COPY can only write to one table. In this way you provide, per row, which table this row should go to.

There's always the issue of generated primary keys. So you could, if you wanted to, do:

COPY table1(col1, col2, ...) RETURNING pkey1, table2(col1, col2, ...) FROM STDIN WITH (tableoids);
tableoid1<tab>col1<tab>col2...
tableoid2<tab>\K tableoid<tab>...
...
\.

So, like we have \N for NULL, a \K oid which would be the value of the RETURNING column for the table with that oid.

This may be way out there, but we have a streaming interface; there's no reason why we can't extend it. There's also the idea of WITH x AS (COPY ... FROM STDIN), which you could do with a similar multiplexing of rows in one stream. This would be extremely powerful together with MERGE.
Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
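The multiplexed COPY stream proposed above is hypothetical syntax — the `WITH (tableoids)` clause and oid-first line format are the mail's invention, not real PostgreSQL. Still, the client- or server-side demultiplexing it implies is simple to sketch:

```python
# Demultiplex a hypothetical multiplexed COPY stream: each data line
# carries the target table's oid as its first tab-separated field.
# The oids and column values below are made up.
def demux_copy_stream(lines):
    per_table = {}
    for line in lines:
        if line == r"\.":          # COPY's end-of-data marker
            break
        oid, _, rest = line.partition("\t")
        per_table.setdefault(oid, []).append(rest.split("\t"))
    return per_table

stream = [
    "16384\tone\t1",
    "16385\talpha",
    "16384\ttwo\t2",
    r"\.",
    "anything after the terminator is ignored",
]
print(demux_copy_stream(stream))
# {'16384': [['one', '1'], ['two', '2']], '16385': [['alpha']]}
```

Runs of rows for different tables can interleave freely, which is exactly what removes the client-side buffering Florian was worried about.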
Re: [HACKERS] stuck spinlock
On Thu, Dec 26, 2013 at 03:18:23PM -0800, Robert Haas wrote:
> On Thu, Dec 26, 2013 at 11:54 AM, Peter Eisentraut pete...@gmx.net wrote:
>> On 12/12/13, 8:45 PM, Tom Lane wrote:
>>> Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that most systems dump core files with process IDs embedded in the names.
>> Which systems are those?
> MacOS X dumps core files into /cores/core.$PID, and at least some Linux systems seem to dump them into ./core.$PID

On Linux it's configurable, and at least on Ubuntu you get this:

$ cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport %p %s %c

But yes, it can be configured to include the PID in the filename.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] trailing comment ghost-timing
On Tue, Dec 24, 2013 at 03:40:58AM +0100, Andreas Karlsson wrote:
> On 12/24/2013 03:17 AM, David Johnston wrote:
>> It is not sent to the server as a trailing comment.
> The following file is sent to the server like this.
>
> File:
> /**/;
> /**/
>
> Commands:
> PQexec(..., "/**/;");
> PQexec(..., "/**/");
>
> If this has to be fixed it should be in the client. I think people would complain if we broke the API by starting to throw an exception on PQexec with a string containing no actual query.

(Untested.) Isn't this just a case of psql not printing out a timing if the server responds with PGRES_EMPTY_QUERY?

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] PoC: Partial sort
On Sun, Dec 22, 2013 at 07:38:05PM +0400, Alexander Korotkov wrote: Hi! Next revision. It expected to do better work with optimizer. It introduces presorted_keys argument of cost_sort function which represent number of keys already sorted in Path. Then this function uses estimate_num_groups to estimate number of groups with different values of presorted keys and assumes that dataset is uniformly divided by groups. get_cheapest_fractional_path_for_pathkeys tries to select the path matching most part of path keys. You can see it's working pretty good on single table queries. Nice work! The plans look good and the calculated costs seem sane also. I suppose the problem with the joins is generating the pathkeys? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] PoC: Partial sort
On Sat, Dec 14, 2013 at 06:21:18PM +0400, Alexander Korotkov wrote:
>> Is that actually all that beneficial when sorting with a bog standard qsort() since that doesn't generally benefit from data being pre-sorted? I think we might need to switch to a different algorithm to really benefit from mostly pre-sorted input.
> In this patch I don't do full sort of dataset. For instance, index returns data ordered by first column and we need to order them also by second column. Then this node sorts groups (assumed to be small) where values of the first column are same, by value of second column. And with limit clause only required number of such groups will be processed. But, I don't think we should expect pre-sorted values of second column inside a group.

Nice. I imagine this would be mostly beneficial for fast-start plans, since you no longer need to sort the whole table prior to returning the first tuple. Reduced memory usage might be a factor, especially for large sorts where you otherwise might need to spool to disk. You can now use an index on (a) to improve sorting for (a,b). Cost of sorting n groups of size l goes from O(nl log nl) to just O(nl log l), useful for large n.

Minor comments: I find cmpTuple a bad name. That's what it's doing, but perhaps cmpSkipColumns would be clearer. I think it's worthwhile adding a separate path for the skipCols = 0 case, to avoid extra copies.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
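The partial-sort idea described above — input already ordered by the first key, only each group of equal first-key values sorted on the second — can be sketched in a few lines. This is an illustration of the algorithm, not the patch's C implementation:

```python
from itertools import groupby
from operator import itemgetter

# Rows arrive sorted by the first key (e.g. from an index on (a)); only
# each group of equal a-values needs an in-memory sort on b. With a LIMIT,
# later groups are never sorted at all, giving the fast-start behaviour.
def partial_sort(rows, limit=None):
    out = []
    for _, group in groupby(rows, key=itemgetter(0)):
        out.extend(sorted(group, key=itemgetter(1)))
        if limit is not None and len(out) >= limit:
            return out[:limit]
    return out

rows = [(1, "c"), (1, "a"), (2, "b"), (2, "a"), (3, "z")]  # sorted by a only
print(partial_sort(rows))           # fully ordered by (a, b)
print(partial_sort(rows, limit=2))  # only the first group gets sorted
```

Each `sorted()` call works on one small group, which is where the O(nl log nl) versus O(nl log l) difference comes from.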
Re: [HACKERS] patch: make_timestamp function
On Thu, Dec 12, 2013 at 08:50:26PM -0200, Fabrízio de Royes Mello wrote: On Thu, Dec 12, 2013 at 3:11 PM, Pavel Stehule pavel.steh...@gmail.comwrote: Hello this patch try to complete a set of functions make_date and make_timestamp. Could we have the 'make_timestamptz' function too? Wouldn't this just be: SELECT make_timestamp(...) at time zone 'foo'; (assuming make_timestamp actually returns a timestamp and not a timestamptz). or do you mean something else? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] ANALYZE sampling is too good
On Thu, Dec 12, 2013 at 07:22:59AM +1300, Gavin Flower wrote:
> Surely we want to sample a 'constant fraction' (obviously, in practice you have to sample an integral number of rows in a page!) of rows per page? The simplest way, as Tom suggests, is to use all the rows in a page. However, if you wanted the same number of rows from a greater number of pages, you could (for example) select a quarter of the rows from each page. In which case, when this is a fractional number: take the integral number of rows, plus one extra row with a probability equal to the fraction (here 0.25).

In this discussion we've mostly used block = 1 PostgreSQL block of 8k. But when reading from a disk, once you've read one block you can read the following ones practically for free. So I wonder if you could make your sampling read always 16 consecutive blocks, but then use 25-50% of the tuples. That way you get many more tuples for the same number of disk seeks.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
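The trade-off suggested above can be put into back-of-the-envelope arithmetic. All the numbers here (100 tuples per 8k block, a 30,000-row sample target) are invented for illustration; the model assumes a run of consecutive blocks costs roughly one seek:

```python
import math

# If reading a 16-block run costs about one seek (sequential blocks are
# nearly free), keeping only 25% of the tuples per block still gathers
# far more sample rows per seek than reading single random blocks.
def seeks_needed(target_tuples, tuples_per_block, run_length, keep_fraction):
    kept_per_seek = run_length * tuples_per_block * keep_fraction
    return math.ceil(target_tuples / kept_per_seek)

# 30,000 sample rows, ~100 tuples per 8k block:
print(seeks_needed(30000, 100, 1, 1.0))    # 300 seeks: one random block each
print(seeks_needed(30000, 100, 16, 0.25))  # 75 seeks: 16-block runs, 25% kept
```

Under these assumptions the run-based scheme needs a quarter of the seeks for the same number of sampled tuples, while still spreading the sample over more of the table than taking whole pages would.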
Re: [HACKERS] Platform-dependent(?) failure in timeout handling
On Tue, Nov 26, 2013 at 06:50:28PM -0500, Tom Lane wrote:
> I believe the reason for this is the mechanism that I speculated about in that previous thread. The platform is blocking SIGALRM while it executes handle_sig_alarm(), and that calls LockTimeoutHandler() which does kill(MyProcPid, SIGINT), and that SIGINT is being delivered immediately (or at least before we can get out of handle_sig_alarm). So now the platform blocks SIGINT, too, and calls StatementCancelHandler(), which proceeds to longjmp out of the whole signal-handling call stack. So the signal unblocking that would have happened after the handlers returned doesn't happen. In simpler cases we don't see an issue because the longjmp returns to the sigsetjmp(foo,1) call in postgres.c, which will result in restoring the signal mask that was active at that stack level, so we're all good. However, PG_TRY() uses sigsetjmp(foo,0), which means that no signal mask restoration happens if we catch the longjmp and don't ever re-throw it. Which is exactly what happens in plpgsql because of the EXCEPTION clause in the above example. I don't know how many platforms block signals during handlers in this way, but I'm seeing it on Linux (RHEL6.5 to be exact) and we know that at least OpenBSD acts likewise, so that's a pretty darn large chunk of the world.

Isn't this why sigsetjmp/siglongjmp were invented? Is there a situation where you don't want the signal mask restored?

BTW, it seems on BSD systems sigsetjmp == setjmp: http://www.gnu.org/software/libc/manual/html_node/Non_002dLocal-Exits-and-Signals.html

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Wildcard usage enhancements in .pgpass
On Sat, Nov 16, 2013 at 09:26:33PM +0100, Alexey Klyukin wrote:
> Hi, Attached is the patch that improves usage of '*' wildcard in .pgpass, particularly in the host part. The use case is below.

Looks interesting, though I wonder if you could use fnmatch(3) here. Or would that match more than you expect? For example, it would allow 'foo*bar' to match 'foo.bar', which your code doesn't.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
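The question about fnmatch matching "more than you expect" is easy to demonstrate with Python's fnmatch module, which implements the same glob semantics as fnmatch(3):

```python
from fnmatch import fnmatchcase

# fnmatch-style '*' crosses '.' boundaries, so it matches more than a
# matcher that refuses to let the wildcard span a hostname label would.
print(fnmatchcase("fooXbar", "foo*bar"))               # True
print(fnmatchcase("foo.bar", "foo*bar"))               # True: the dot is no barrier
print(fnmatchcase("db.example.com", "*.example.com"))  # True
```

So reusing fnmatch(3) would make `foo*bar` match across dots, which for host patterns in .pgpass may or may not be the behaviour one wants — exactly the concern raised above.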
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Tue, Nov 12, 2013 at 03:57:52PM +0900, Tatsuo Ishii wrote:
> I have been thinking about this for years and I think the key idea for this is implementing a universal encoding. The universal encoding should have the following characteristics to implement N2 encoding in a database: 1) no loss of round trip encoding conversion 2) no mapping table is necessary to convert from/to existing encodings. Once we implement the universal encoding, other problems such as pg_database with multiple encodings can be solved easily.

Isn't this essentially what the MULE internal encoding is?

> Currently there's no such universal encoding in the universe, I think the only way is inventing it by ourselves.

This sounds like a terrible idea. In the future people are only going to want more advanced text functions, regular expressions and indexing, and inventing encodings that don't exist anywhere else seems like a way to make a lot of work for little benefit.

A better idea seems to me to (if postgres is configured properly) embed the non-round-trippable characters in the custom character part of the Unicode character set. In other words, adjust the mapping tables on demand and voila.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] logical column order and physical column order
On Sun, Nov 03, 2013 at 09:40:18PM +1300, Gavin Flower wrote:
> I think the system should PHYSICALLY store the columns in the most space efficient order, and have a facility for mapping to/from the LOGICAL order - so that users and application developers only have to worry about the logical order. Even system programmers would normally not have to be concerned with the physical order. I am a little surprised that this is not already done, to be honest.

This has been discussed before, extensively. I believe there have even been some patches. Apart from the space savings it would also allow postgres to support column reordering of tables.

The main objection IIRC is that you now have a logical order and a physical order, and there would be an endless stream of bugs caused by code confusing the two. I don't really buy this: you can make the two identifiers non-overlapping so you can always tell which kind you have, and some properly placed checks will catch obvious problems. Logical order is only used in a handful of places anyway. You could even make them two separate datatypes so the compiler will complain if you screw up.

Dig through the archives for the full story.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Tue, Oct 22, 2013 at 11:16:19AM -0700, Jeff Janes wrote: On Mon, Oct 21, 2013 at 4:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Hm. It's been a long time since college statistics, but doesn't the entire concept of standard deviation depend on the assumption that the underlying distribution is more-or-less normal (Gaussian)? It is easy to misinterpret the standard deviation if the distribution is not gaussian, but that is also true of the average. The standard deviation (or the variance) is commonly used with non-gaussian distributions, either because it is the most efficient estimator for those particular distributions, or just because it is so commonly available. Well, the standard deviation is the square root of the variance, which is the second moment of the distribution. The first moment being the mean. No matter what distribution it is, these are useful numbers. If I had to guess a distribution for query runtimes I'd go for Poisson, which would mean you'd expect the mean to equal the variance. Don't have enough experience with such measurements to say whether that is reasonable. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
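The moment identities mentioned above are distribution-free and easy to check numerically. The data here is made up purely to exercise the identity:

```python
from math import sqrt
from statistics import mean, pvariance

# The (population) variance is the second central moment,
# E[X^2] - E[X]^2, defined for any distribution, Gaussian or not;
# the standard deviation is just its square root.
data = [2, 4, 4, 4, 5, 5, 7, 9]

m = mean(data)                          # first moment
var = pvariance(data, m)                # second central moment
second_raw = mean(x * x for x in data)  # E[X^2]

assert var == second_raw - m * m        # E[X^2] - E[X]^2
print(m, var, sqrt(var))                # mean 5, variance 4, stddev 2.0
```

For a Poisson distribution specifically, mean and variance are both equal to the rate parameter, which is why comparing the two sample statistics gives a quick sanity check on that guess about query runtimes.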
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Fri, Sep 20, 2013 at 08:58:53AM +0900, Tatsuo Ishii wrote: For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is SHIFT-JIS and database encoding is UTF-8 because there is a conversion between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is SHIFT-JIS and database encoding is ISO-8859-1 because there's no conversion between them. As far as I can tell the whole reason for introducing NCHAR is to support SHIFT-JIS, there hasn't been call for any other encodings, that I can remember anyway. So rather than this whole NCHAR thing, why not just add a type sjistext, and a few type casts and call it a day... Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] record identical operator - Review
On Thu, Sep 19, 2013 at 06:31:38PM -0400, Steve Singer wrote:
> I think there is agreement that better (as in more obscure) operators than === and !== need to be picked, and we need to find a place in the user-docs to warn users of the behaviour of these operators. Hannu has proposed:
>
> *==*  binary equal, surely very equal by any other definition as well
> !==?  maybe not equal -- binary unequal, may still be equal by other comparison methods

It's a pity operators must be non-alpha and can't be named. Something like:

SELECT foo OPERATOR(byte_equivalent) bar;

is simultaneously obscure, yet clear.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])
On Fri, Aug 23, 2013 at 06:41:04PM +0530, Amit Kapila wrote:
>> Not with this proposal... If it's fixed then it makes no sense to make it look like it can be modified.
> This proposal is to have a special include which user can only use for enable/disable, which means he can remove symbol '#' or add '#'. We cannot stop user from changing file name or add some different location, but that can lead to problems. We can have a note on top of this include indicating it is only for enable/disable.

Note, my whole purpose for suggesting something like:

include_auto_conf_file 'postgresql.auto.conf'

is because I want the file location to be configurable. If I put in my configuration:

include_auto_conf_file '/etc/postgresql/9.4/postgresql.auto.conf'

it had better put the options there. And if I comment the line out, ALTER SYSTEM should stop working.

If I put it at the beginning of the config then any other option in the file will override it (which we can detect and warn about). If I put it at the end of the file, ALTER SYSTEM overrides any statically configured options. And I can imagine hosting providers putting the configuration for memory usage, shared library directories and other such options after it, and options like cpu_cost, enable_merge_join, etc. before it. That way they have fine-grained control over which options the user can set and which not.

> I think if we add more meaning to it, like allow user to change it, handling and defining of that will be bit complex.

Letting the user configure the location seems like common courtesy. Note this line isn't in itself a GUC, so you can't configure it via ALTER SYSTEM.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])
On Tue, Aug 20, 2013 at 11:23:06AM -0400, Stephen Frost wrote:
>> What I was proposing upthread is that enable_alter_system=off/on would be present in postgresql.conf, and there is no include line for auto.conf.
> I really think that's a terrible approach, to be honest. I want to see an 'include' line in postgresql.conf for auto.conf, so the hapless sysadmin who is trying to figure out what the crazy DBA did has some clue what to look for. enable_alter_system doesn't tell him diddly about an 'auto.conf' file which is included in the system config.

ISTM you want some kind of hybrid setting like:

#include_system auto.conf

which simultaneously does three things:

1. Sets the enable_alter_system flag
2. Indicates the file to use
3. Indicates the priority of the setting relative to other settings

Comment it out, and ALTER SYSTEM stops working. Put it back and it's immediately clear what it means. And the user can control where the settings go. Syntax is a bit fugly though.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] improve Chinese locale performance
On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:
> I pretty much lost interest in ICU upon reading that they use UTF-16 as their internal format. http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

The UTF-8 support has been steadily improving: "For example, icu::Collator::compareUTF8() compares two UTF-8 strings incrementally, without converting all of the two strings to UTF-16 if there is an early base letter difference." http://userguide.icu-project.org/strings/utf-8

For all other encodings you should be able to use an iterator. As to performance I have no idea.

The main issue with strxfrm() is its lame API. If it supported returning prefixes you'd be set, but as it is you need 10MB of memory just to transform a 10MB string, even if only the first few characters would be enough to sort.

Regards, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] pg_memory_barrier() doesn't compile, let alone work, for me
On Sun, Jul 14, 2013 at 09:26:38PM -0400, Robert Haas wrote:
> I'm pretty sure we've got latent memory-ordering risks in our existing code which we just haven't detected and fixed yet. Consider, for example, this exciting code from GetNewTransactionId:
>
> myproc->subxids.xids[nxids] = xid;
> mypgxact->nxids = nxids + 1;
>
> I don't believe that's technically safe even on an architecture like x86, because the compiler could decide to reorder those assignments. Of course there is probably no reason to do so, and even if it does you'd have to get really unlucky to see a user-visible failure, and if you did you'd probably misguess the cause.

You're probably right. Note that it's not even just the compiler that might reorder them; the CPU, cache subsystem and memory bus all play their part in memory reordering. x86 is pretty forgiving, which is why it works.

I found this to be a really good explanation of all the things that can go wrong with memory ordering. It also explains why, in the long run, memory barriers are not optimal. http://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/

That talk discusses how the hardware world is converging on SC [1] as the memory model to use. And C11/C++11 atomics will implement this for the programmer. With these you can actually make guarantees. For example, by marking mypgxact->nxids as an atomic type the compiler will emit all the necessary markings to let the CPU know what you want, so everything works the way you expect it to. Even on arcane architectures. No explicit barriers needed. Unfortunately, it won't help on compilers that don't support it.

[1] http://en.wikipedia.org/wiki/Sequential_consistency

There are sites where you can put code in and verify it does what you want. With this one you can put test programs in and it will tell you all possible results due to memory reordering.
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/help.html Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Proposal - Support for National Characters functionality
On Mon, Jul 15, 2013 at 05:11:40PM +0900, Tatsuo Ishii wrote:
>> Does support for alternative multi-byte encodings have something to do with the Han unification controversy? I don't know terribly much about this, so apologies if that's just wrong.
> There's a famous problem regarding conversion between Unicode and other encodings, such as Shift JIS. There are lots of discussions on this. Here is the one from Microsoft: http://support.microsoft.com/kb/170559/EN-US

Apart from Shift-JIS not being well defined (it's more a family of encodings), it has the unusual feature of providing multiple ways to encode the same character. This is not even a Han unification issue; those have largely been addressed. For example, the square-root symbol exists twice (0x8795 and 0x81E3), and many other mathematical symbols also. Here's the code page which you can browse online: http://msdn.microsoft.com/en-us/goglobal/cc305152

Which means that to be round-trippable, Unicode would have to double those characters, but this would make it hard/impossible to round-trip with any other character set that had those characters. No easy solution here.

Something that has been done before [1] is to map the doubles to the custom area of the unicode space (0xe000-0x). It gives you round-trip support at the expense of having to handle those characters yourself. But since postgres doesn't do anything meaningful with unicode characters this might be acceptable.

[1] Python does a similar trick to handle filenames coming from disk in an unknown encoding: http://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
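The Python trick referenced in [1] is the `surrogateescape` error handler, and it shows the same escape-hatch shape as mapping unstable characters into a reserved area of the code space:

```python
# Undecodable bytes are smuggled into the reserved low-surrogate range
# (U+DC80..U+DCFF) so that the original byte string round-trips exactly.
raw = b"valid \xff\xfe invalid"

decoded = raw.decode("utf-8", errors="surrogateescape")
assert "\udcff" in decoded and "\udcfe" in decoded  # smuggled bytes
restored = decoded.encode("utf-8", errors="surrogateescape")
print(restored == raw)  # True: lossless round trip
```

The cost is the one mentioned above: those smuggled code points are not real characters, so anything that processes the text (collation, regexes, display) must either tolerate or special-case them.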
Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements
On Sun, Jul 14, 2013 at 03:36:09AM -0300, Fabrízio de Royes Mello wrote:
>> Next, changes in src/backend, starting with parser changes: the patch adds IF_P NOT EXISTS variants for various productions. For example: snip I think opt_if_not_exists should be used for the others as well.
> I could not use the opt_if_not_exists because bison emits an error:
>
> /usr/bin/bison -d -o gram.c gram.y
> gram.y: conflicts: 10 shift/reduce
> gram.y: expected 0 shift/reduce conflicts
> make[3]: *** [gram.c] Error 1
>
> I really don't know how to solve this problem. I'm just doing adjustments like that:

This probably isn't solvable, which is why the coding is duplicated in many existing places. The issue is that by using opt_if_not_exists you force bison to decide much earlier which rule it is parsing. Bison only has one token of lookahead, and if that's not enough you get conflicts.

BTW, bison dumps a large file describing all its states; you should be able to work out from that where the exact problem lies.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] How to implement Gin method?
On Mon, Jul 08, 2013 at 03:21:09PM +0900, kenji uno wrote:
> Hi. Ok, ok thanks. My problem is to shorten time of searching full text stored in text field.

Ok, your explanation of your problem really helps, thanks.

> However the following query is very slow! 9,400ms. It uses Seq Scan lol.
>
> SELECT * FROM xxx WHERE TRUE AND (ddstrike(title,'ABC') OR (filter(fts1body) @ filter('AAA') AND ddstrike(fts1body,'AAA')))

Well, in this case it still needs to scan the whole table to search the title, obviously. Apply filter to the title column too.

> The best query result costs 3,700ms. 18 hits. It surely uses the expected query plan: two Bitmap Index Scans - BitmapOr - Bitmap Heap Scan.
>
> SELECT * FROM xxx WHERE TRUE AND (filter(title) @ filter('ABC') OR filter(fts1body) @ filter('ABC')) AND (ddstrike(title,'ABC') OR ddstrike(fts1body,'ABC'))

It would be useful to see the EXPLAIN ANALYZE of this query. Note that looking up 11,000 entries in an index could very well take as long as sequentially scanning the whole table.

> However, I may think good idea which uses inverted index.

I think your above idea is a good one, but you need to work out why your above implementation didn't work out and why you think implementing it directly will be better.

> So I want to know... - the actual work of extractQuery and consistent. - the detail interface of extractValue/extractQuery/consistent. It may help understanding. I looked at contrib/_int.sql of PG8.2.22

Whoa, very old version, please look at something newer. For example the RECHECK flag below is no longer used. There are definitions of int[] GIN support.
---
CREATE OPERATOR CLASS gin__int_ops
FOR TYPE _int4 USING gin
AS
    OPERATOR  3  &&,
    OPERATOR  6  =  (anyarray, anyarray)  RECHECK,
    OPERATOR  7  @>,
    OPERATOR  8  <@  RECHECK,
    OPERATOR  13 @,
    OPERATOR  14 ~  RECHECK,
    OPERATOR  20 @@ (_int4, query_int),
    FUNCTION  1  btint4cmp (int4, int4),
    FUNCTION  2  ginarrayextract (anyarray, internal),
    FUNCTION  3  ginint4_queryextract (internal, internal, int2),
    FUNCTION  4  ginint4_consistent (internal, int2, internal),
    STORAGE int4;
---
Both ginint4_queryextract and ginint4_consistent assume that the query argument is a PGARRAY (ArrayType *). Where is that decided? Is it an array of the STORAGE type? Remember, the above defines operators, which are what is indexed. The left-hand side is the array; the right-hand side is whatever is defined. intarray defines the operator (int[], int[]), hence the query argument is int[] in that case. Apparently intarray accepts many kinds of queries; it is the operators that define what actually happens. Both extractQuery (ginint4_queryextract) and extractValue (ginarrayextract) seem to return a similar value type. They return a Datum array of int4. Is it an array of the STORAGE type? From my reading of http://www.postgresql.org/docs/9.2/static/gin-extensibility.html, yes, they must return an array of the STORAGE type. The last paragraph on that page says: The actual data types of the various Datum values mentioned above vary depending on the operator class. The item values passed to extractValue are always of the operator class's input type, and all key values must be of the class's STORAGE type. The type of the query argument passed to extractQuery and consistent is whatever is specified as the right-hand input type of the class member operator identified by the strategy number. This need not be the same as the item type, so long as key values of the correct type can be extracted from it. I want to understand the overview of GIN extension. Please let us know what the documentation is missing so it can be improved. 
Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] How to implement Gin method?
On Sun, Jul 07, 2013 at 10:00:16AM +0900, Kenji Uno wrote: Hi. I want to try GIN and learn about the programming details of GIN. Please teach me about these functions: extractValue, extractQuery and consistent. I have posted a question at Stack Overflow. http://stackoverflow.com/questions/17489703/postgresql-how-to-implement-gin The documentation refers to the authors' pages: http://www.sai.msu.su/~megera/wiki/Gin Did they help at all? Also, GIN cannot just be applied to anything. Its strength is indexing certain kinds of data which are difficult to index any other way, like full-text search. If you give some idea of what you'd like to index then we can give an idea of what the functions should do. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] plpython implementation
On Sun, Jun 30, 2013 at 01:49:53PM +0200, Szymon Guz wrote: I'm reading through the plperl and plpython implementations and I don't understand the way they work. Comments for plperl say that there are two interpreters (trusted and untrusted) for each user session, and they are stored in a hash. The point is that python has no version for untrusted users, since it's been accepted that there's no way to build a python sandbox for untrusted code. There was actually a small competition to make one, but it failed; since then nobody has bothered. Perl does provide a sandbox, hence you can have two interpreters in a single backend. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] plpython implementation
On Sun, Jun 30, 2013 at 02:18:07PM +0200, Szymon Guz wrote: python does not have any sort of reliable sandbox, so there is no plpython, only plpythonu - hence only one interpreter per backend is needed. Is there any record of the discussion concluding that there is no way to make the sandbox? I managed to create some kind of sandbox, a simple modification which totally disables importing modules, so I'm just wondering why it cannot be done. http://wiki.python.org/moin/SandboxedPython This is the thread I was thinking of: http://mail.python.org/pipermail/python-dev/2009-February/086401.html If you read through it I think you will understand the difficulties. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Hard to Use WAS: Hard limit on WAL space
On Sat, Jun 15, 2013 at 12:43:10PM +0800, Craig Ringer wrote: Bloat -- Table bloat. Table bloat has been a major issue with PostgreSQL users/admins for years. Anyone care to explain to me in a simple paragraph how to find out if you have table or index bloat issues in your database and what to do about it? (Maybe we need pg_catalog.pg_index_bloat and pg_catalog.pg_table_bloat views including FILLFACTOR correction?) I think I'll draft up a patch to add exactly that. Nice list btw. I monitor this by using the excellent check_postgres Nagios plugin, which has checks for things like this. Which makes me think that it might be possible to add some other checks like this in, for example, pg_ctl. A big fat warning 'your data may be eaten' might get noticed at startup. (A minor annoyance is that in recent versions of PostgreSQL you have to give check_postgres admin rights, otherwise it can't warn you about idle-in-transaction problems.) Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Batch API for After Triggers
On Sun, Jun 09, 2013 at 10:15:09AM +0100, Simon Riggs wrote: As I mentioned in my post, I did consider that and then chose not to do that. However, having a final func is a major modification in the way that we specify trigger functions. We'd also need to cope with recursive trigger execution, which would mean the final func would get called potentially many times, so there's no way of knowing if the final func is actually the last call needed. That sounded complex and confusing to me. The proposed API allows you to do exactly that anyway, more easily, by just waiting until tg_event_num == tg_tot_num_events. Can you signal partial completion? For example, if a trigger knows that blocks of 10,000 are optimal and it sees tg_tot_num_events == 1,000,000, it could do work every 10,000 entries, as in when: (tg_event_num % 10000) == 0 || tg_event_num == tg_tot_num_events Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
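The batching condition suggested above can be sketched as a tiny helper (the names tg_event_num, tg_tot_num_events and the 10,000 block size come from the proposal; the helper itself is hypothetical):

```c
#include <stdbool.h>

/* Flush accumulated work at every batch boundary, and once more at the
 * final event so a short last batch is not lost. */
bool should_flush(long tg_event_num, long tg_tot_num_events, long batch_size)
{
    return (tg_event_num % batch_size) == 0
        || tg_event_num == tg_tot_num_events;
}
```

The second clause is what handles a total that isn't a multiple of the batch size.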
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
On Mon, Jun 03, 2013 at 11:27:57AM +0300, Ants Aasma wrote: I can't rule that out. Personally, I've always attributed it to the fact that it's (a) long and (b) I/O-intensive. But it's not impossible there could also be bugs lurking. It could be related to the OS. I have no evidence for or against, but it's possible that OS write-out routines defeat the careful cost based throttling that PostgreSQL does by periodically dumping a large portion of dirty pages into the write queue at once. That does nasty things to query latencies as evidenced by the work on checkpoint spreading. In other contexts I've run into issues relating to large continuous writes stalling. The issue is basically that the Linux kernel allows (by default) writes to pile up until they reach 5% of physical memory before deciding that the sucker who wrote the last block becomes responsible for writing the whole lot out. At full speed of course. Depending on the amount of memory and the I/O speed of your disks this could take a while, and interfere with other processes. This leads to extremely bursty I/O behaviour. The solution, as usual, is to make it more aggressive, so the kernel background writer triggers at 1% memory. I'm not saying that's the problem here, but it is an example of a situation where the write queue can become very large very quickly. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
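The Linux thresholds described above are tunable; a hedged example of making background writeback kick in earlier, as suggested (the exact default percentages vary by kernel version, and these values are illustrative rather than a recommendation):

```
# /etc/sysctl.d/99-writeback.conf -- start background writeback at 1% of
# RAM dirty instead of letting dirty pages pile up until writers are
# forced to flush everything at once
vm.dirty_background_ratio = 1
vm.dirty_ratio = 5
```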
Re: [HACKERS] Combo xids
On Sat, Jun 01, 2013 at 09:22:05AM +0100, Simon Riggs wrote: When would this make sense? Frequently. Most of the time a tuple needs only one xid set. In most cases, we set xmin and xmax a long time apart. Very few cases end with both of them set inside the *same* xmin horizon. In a heavy transactional environment, the horizon moves forward quickly, on the order of a few seconds. Very few rows get inserted and then updated/deleted that quickly. With long reporting queries, data tends to be updated less, so again the rows aren't touched within the same horizon. As a result, we hardly ever need both xmin and xmax at the same time - when we need to set xmax, xmin is already committed/cleaned. Is this really true? Consider a long-running query A and a tuple created by transaction B after A started. If another transaction comes to update B's tuple, you can't throw away the xmin, because you need it to prove that A can't see the tuple. Or is the idea to create multixacts for each combination of xmin/xmax encountered? And the assumption is that there aren't that many? That could be measured. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Which table stored in which file in PGDATA/base/[db-oid]
On Sat, Jun 01, 2013 at 03:11:50PM +0430, Soroosh Sardari wrote: Dear Hackers I've created a new DB, and a bunch of files were created in base/12054; 12054 is the OID of the new DB. I want to find out which table is stored in each file. BTW, I read this http://www.postgresql.org/docs/9.2/interactive/storage-file-layout.html I have 156 files with numerical names; vm and fsm files are ignored. 107 files match pg_class.reltoastrelid, so I have 49 files for which I do not know what tables are stored in them. Any idea how to find out? From that page: Each table and index is stored in a separate file. For ordinary relations, these files are named after the table or index's filenode number, which can be found in pg_class.relfilenode. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Which table stored in which file in PGDATA/base/[db-oid]
On Sat, Jun 01, 2013 at 03:27:40PM +0430, Soroosh Sardari wrote: Yes, I have some files which are not in pg_class.relfilenode of any table or index. I want to know which table or index is stored in such files. That shouldn't happen. Are you sure you're looking in the right database? Can you list the filenames? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] about index inheritance
On Wed, May 08, 2013 at 10:19:08AM +0200, Vincenzo Melandri wrote: On Tue, May 7, 2013 at 11:55 PM, Robert Haas robertmh...@gmail.com wrote: This is a really hard problem. If you pick this as your first project hacking on PostgreSQL, you will almost certainly fail. Thank you very much, I guessed that already -.- Still, I have needed this at my office for a long time, struggled with it many times and had to come up with some exotic solutions... Now I have spare time between projects, so I can work on it full-time. At least it's worth a try, isn't it? Well, you can work on it, but I think it will be less programming and more coming up with a feasible solution. Anyway, I'm working to better understand the problem, trying to identify at least the main points involved. At the moment I'm figuring out how the inheritance mechanism works for relations (in tablecmds.c).. Then I'll figure out how indexes work.. While there are probably old threads in the archives, I find the easiest way to look at the problem is through the locking. In particular, I think if you can get unique indexes to work then the rest will follow. Consider the case of an inheritance hierarchy where you want a unique index on a column. Since you want to be able to create and drop children easily, each child needs to have an index just for itself. But if you insert a row into one child you need to, somehow, prevent other people from inserting the same value into a different child. Efficiently and deadlock-free. This is hard, though we're open to crazy, out-of-the-box ideas. Note, there is one very special case, namely: - The children are used for partitioning. - The unique index you want is on the partition key. Since each value can only possibly appear in one table, your locking problems vanish. The question is: how often does this happen? 
Hope this helps, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Substituting Checksum Algorithm (was: Enabling Checksums)
On Tue, Apr 30, 2013 at 01:05:30PM -0400, Greg Smith wrote: I re-ran the benchmark that's had me most worried against the committed code and things look good so far. I've been keeping quiet because my tests recently have all agreed with what Ants already described. This is more a confirmation summary than new data. I came across this today: Data Integrity Extensions, basically a standard for having an application calculate a checksum of a block and submit it together with the block, so that the disk can verify that the block it is writing matches what the application sent. It appears SCSI has standardised on a CRC-16 checksum with polynomial 0x18BB7. http://www.t10.org/ftp/t10/document.03/03-290r0.pdf https://oss.oracle.com/~mkp/docs/dix-draft.pdf Not directly relevant to PostgreSQL now, but possibly in the future. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
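For the curious, that CRC is simple to compute; a minimal bit-at-a-time sketch of the T10 DIF CRC-16 (polynomial 0x18BB7 written as 0x8BB7 with the x^16 term implicit; initial value 0, no reflection, no final XOR -- production code uses table-driven or hardware-assisted versions):

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-16 with the T10 DIF polynomial, MSB-first. */
uint16_t crc16_t10dif(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;        /* fold next byte into the top */
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}
```

A handy property of this CRC form (init 0, no final XOR): appending the 16-bit checksum big-endian to the message makes the CRC of the whole come out zero, which gives an easy self-check.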
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On Wed, Apr 17, 2013 at 12:49:10PM +0200, Florian Pflug wrote: Fixing this on the receive side alone seems quite messy and fragile. So instead, I think we should let the master send a shutdown message after it has sent everything it wants to send, and wait for the client to acknowledge it before shutting down the socket. If the client fails to respond, we could log a fat WARNING. ISTM the master should half-close the socket, using shutdown(). That way the client receives an EOF and can still send its reply to the master. When the master receives that, it can close() completely. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
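The half-close sequence described above, sketched with plain BSD sockets (a generic illustration, not actual walsender/walreceiver code):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Half-close our write side so the peer reads EOF, drain any final
 * reply the peer still wants to send, then fully close the socket.
 * Returns 0 on a clean peer EOF, -1 otherwise. */
int graceful_close(int fd)
{
    char buf[256];
    ssize_t n;

    if (shutdown(fd, SHUT_WR) != 0)   /* peer's next read() returns 0 */
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        ;                             /* consume the peer's last words */
    close(fd);
    return n == 0 ? 0 : -1;
}
```

The key point is that shutdown(SHUT_WR) delivers EOF to the peer while our receive direction stays open for the acknowledgement.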
Re: [HACKERS] Interesting post-mortem on a near disaster with git
On Sun, Mar 24, 2013 at 11:52:17AM -0400, Tom Lane wrote: Over the weekend, KDE came within a gnat's eyelash of losing *all* their authoritative git repos, despite having seemingly-extensive redundancy. Read about it here: http://jefferai.org/2013/03/24/too-perfect-a-mirror/ We should think about protecting our own repo a bit better, especially after the recent unpleasantness with a bogus forced update. The idea of having clones that are deliberately a day or two behind seems attractive ... I think the lesson here is that a mirror is not a backup. RAID, ZFS, and version control are all not backups. Taking a tarball of the entire repository and storing it on a different machine would solve just about any problem you can think of in this area. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Considering Gerrit for CFs
On Wed, Feb 06, 2013 at 10:17:09PM +0100, Magnus Hagander wrote: I just took a quick look at their system, and when they start talking about requirements in the 100's of GB of RAM, 24-core machines and SSDs, I get scared :) But that's to scale it - it doesn't mention when you need to do anything like that. I'm assuming we'd be tiny. FWIW, what we have now could easily run on a box with 128 MB RAM... I'm right now setting up a gerrit instance for another project (not as large as Postgres) on a VM with 1 GB of RAM and it works alright. You'll want to run a Postgres database next to it for storing the actual reviews and such. It's a nice tool and integrates reasonably well with other things. We have a bot connected to it which sends messages on IRC in response to various state changes. I've never seen an instance respond to email though. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Enabling Checksums
On Tue, Dec 18, 2012 at 04:06:02AM -0500, Greg Smith wrote: On 12/18/12 3:17 AM, Simon Riggs wrote: Clearly part of the response could involve pg_dump on the damaged structure, at some point. This is the main thing I wanted to try out more, once I have a decent corruption generation tool. If you've corrupted a single record but can still pg_dump the remainder, that seems the best we can do to help people recover from that. Providing some documentation on how to figure out what rows are in that block, presumably by using the contrib inspection tools, would be helpful too. FWIW, Postgres is pretty resilient against corruption. I've maintained a postgres db on a server with bad memory (don't ask) and since most of the scrambling was in text strings you just got funny output sometimes. The most common failure was a memory allocation failure as postgres tried to copy a datum whose length field was corrupted. If things went really wonky you could identify the bad tuples by hand and then delete them by ctid. Regular reindexing helped too. All I'm saying is that a mode where you log a warning but proceed anyway is useful. It won't pin down the exact error, but it will tell you where to look and help find the non-obvious corruption (so you can possibly fix it by hand). Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Memory leaks in record_out and record_send
On Tue, Nov 13, 2012 at 05:50:08PM -0500, Stephen Frost wrote: * Robert Haas (robertmh...@gmail.com) wrote: Yeah. The thing that concerns me is that I think we have a pretty decent number of memory contexts where the expected number of allocations is very small ... and we have the context *just in case* we do more than that in certain instances. I've seen profiles where the setup/teardown costs of memory contexts are significant ... which doesn't mean that those examples would perform better with fewer memory contexts, but it's enough to make me pause for thought. So, for my 2c, I'm on the other side of this, personally. We have memory contexts for more-or-less exactly this issue. It's one of the great things about PG - it's resilient and very unlikely to have large or bad memory leaks in general, much of which can, imv, be attributed to our use of memory contexts. If the problem is that we create memory context overhead which is not necessary in many cases, perhaps we can reduce the overhead somehow. IIRC we have separate functions for resetting a context and freeing it entirely. If there were a quick test we could do such that resetting a context did nothing unless at least (say) 16 kB had been allocated, that might reduce the cost for many very small allocations. Of course, unless someone comes up with a way to measure the cost this is all handwaving, but it might be a nice project for someone interested in learning to hack postgres. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
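A toy model of the threshold idea (the 16 kB cut-off is the hypothetical number from the text; real PostgreSQL memory contexts chain multiple blocks and keep a "keeper" block around rather than malloc/free like this):

```c
#include <stdlib.h>

#define RESET_THRESHOLD (16 * 1024)

typedef struct {
    char  *block;   /* one backing block; real contexts chain many */
    size_t cap;     /* size of the block */
    size_t used;    /* bytes handed out since the last reset */
} ToyContext;

void ctx_init(ToyContext *c, size_t cap)
{
    c->block = malloc(cap);
    c->cap = cap;
    c->used = 0;
}

void *ctx_alloc(ToyContext *c, size_t n)
{
    if (c->used + n > c->cap)
        return NULL;               /* toy model: no block growth */
    void *p = c->block + c->used;
    c->used += n;
    return p;
}

/* Reset: if little was allocated, just rewind the pointer; only pay
 * the cost of giving memory back when usage crossed the threshold.
 * Returns 1 if the expensive path ran, 0 otherwise. */
int ctx_reset(ToyContext *c)
{
    int expensive = c->used >= RESET_THRESHOLD;
    if (expensive) {               /* shrink back to a minimal block */
        free(c->block);
        c->block = malloc(RESET_THRESHOLD);
        c->cap = RESET_THRESHOLD;
    }
    c->used = 0;
    return expensive;
}
```

The cheap path is a single comparison and a store, which is the kind of saving the suggestion is after.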
Re: [HACKERS] Successor of MD5 authentication, let's use SCRAM
On Sun, Oct 21, 2012 at 09:55:50AM +0200, Magnus Hagander wrote: I don't see a problem at all with providing the snakeoil cert. In fact, it's quite useful. I see a problem with enabling it by default. Because it makes people think they are more secure than they are. So, what you're suggesting is that any use of SSL to a remote machine without the sslrootcert option should generate a warning. Something along the lines of remote server not verified? For completeness it should also show this for any non-SSL connection. libpq should export a server-verified flag which would always be false unless the connection is SSL and the CA is verified. In a browser, they will get a big fat warning every time, so they will know it. There is no such warning in psql. Actually, maybe we should *add* such a warning. We could do it in psql. We can't do it in libpq for everyone, but we can do it in our own tools... Particularly since we do print the SSL information already - we could just add a warning: cert not verified, or something like that, to the same piece of information. It bugs me that you have to jump through hoops and get red warnings for an unknown CA, whereas no encryption whatsoever is treated as fine while actually being even worse. Transport encryption is a *good thing*; we should be encouraging it wherever possible. If it weren't for the performance issues I'd suggest defaulting to SSL everywhere transparently with ephemeral certs. It would protect against any number of passive attacks. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Bugs in planner's equivalence-class processing
On Tue, Oct 16, 2012 at 11:56:52AM -0400, Tom Lane wrote: Is anybody concerned about the compatibility implications of fixing this bug in the back branches? I'm worried about people complaining that we broke their application in a minor release. Maybe they were depending on incorrect behavior, but they might complain anyway. On the other hand, the fact that this hasn't been reported from the field in nine years suggests that not many people write queries like this. Nice detective work. I'd personally say that it should be fixed. I personally haven't written these kinds of queries so I'm not affected, but I don't like the idea of known bugs going unfixed. It's a pity we can't have a system that can somehow independently check the results of the planner. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Detecting libpq connections improperly shared via fork()
On Thu, Oct 04, 2012 at 12:14:02AM +0200, Andres Freund wrote: On Thursday, October 04, 2012 12:08:18 AM Daniel Farina wrote: It would be fantastic for libpq to somehow monitor use of a connection from multiple PIDs that share a parent and deliver an error indicating what is wrong. Unfortunately detecting that would require either a file or some kind of shared memory map, AFAIK, and I don't know how keen anyone is on accepting that patch. So, may I ask: how keen is anyone on accepting such a patch, and under what conditions of mechanism? Hm. An easier version of this could just be storing the pid of the process that did the PQconnectdb* in the PGconn struct. You can then check that PGconn->pid == getpid() at relatively few places and error out on a mismatch. That should be doable with only minor overhead. On systems that support it, pthread_atfork() might help. Though, being like a signal handler, you don't have access to the PGconn structure, so you can't flag anything easily. Maybe just use it to cache getpid()? In any case, adding a check to PQexec and friends would probably be sufficient, since that's what every program is going to start with. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
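The pid-in-PGconn check is easy to prototype outside libpq. In this stand-in sketch ConnGuard is hypothetical -- in a real patch the field would live inside the PGconn struct and the check would run at the top of PQexec and friends:

```c
#include <unistd.h>
#include <sys/types.h>

typedef struct {
    pid_t owner_pid;     /* pid that created the connection */
    /* ... the rest of the connection state would live here ... */
} ConnGuard;

void guard_init(ConnGuard *g)
{
    g->owner_pid = getpid();   /* recorded at PQconnectdb time */
}

/* Returns 1 if the caller is the process that opened the connection.
 * A forked child sees a mismatch and can raise a clean error instead
 * of corrupting the shared protocol stream. */
int guard_check(const ConnGuard *g)
{
    return g->owner_pid == getpid();
}
```

After fork(), getpid() differs in the child, so the mismatch is detected without any shared state beyond the struct itself.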
Re: [HACKERS] do we EXEC_BACKEND on Mac OS X?
On Wed, Oct 03, 2012 at 01:57:47PM -0400, Bruce Momjian wrote: On Wed, Oct 3, 2012 at 01:53:28PM -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: Yes, but those framework libraries are typically supposed to prevent such problems from being seen by applications calling them. How exactly would a library prevent such problems? In particular, let's see a proposal for how libpq might make it look like a fork was transparent for an open connection. I guess that is impossible. Well, not quite impossible. A simple solution would be to use pthread_atfork() to register a handler that puts the socket into an invalid state in either the parent or the child. http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html Of course, the backward compatibility issues prevent us doing that, but it's *possible*. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
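A minimal sketch of the pthread_atfork() idea (the flag here is a stand-in for marking the real libpq connection state invalid; handlers registered this way run in the child immediately after every fork()):

```c
#include <pthread.h>
#include <unistd.h>     /* fork, _exit -- used in the example below */
#include <sys/wait.h>   /* waitpid */

static volatile int conn_usable = 1;   /* stand-in for per-connection state */

static void mark_unusable_in_child(void)
{
    conn_usable = 0;    /* runs in the child right after fork() */
}

/* Call once when the connection is set up; from then on any child
 * created by fork() sees conn_usable == 0 and can refuse to use the
 * inherited socket.  Returns 0 on success, like pthread_atfork(). */
int install_fork_guard(void)
{
    return pthread_atfork(NULL, NULL, mark_unusable_in_child);
}
```

Note that once registered the handler cannot be removed, and it fires for every subsequent fork(), which is part of why this approach is awkward to retrofit.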
Re: [HACKERS] Hash id in pg_stat_statements
On Tue, Oct 02, 2012 at 12:58:15PM -0400, Stephen Frost wrote: I simply do not understand objections to the proposal. Have I missed something? It was my impression that the concern is the stability of the hash value and ensuring that tools which operate on it don't mistakenly lump two different queries into one because they had the same hash value (caused by a change in our hashing algorithm or input into it over time, eg a point release). I was hoping to address that to allow this proposal to move forward.. That makes no sense though. The moment you talk about hashes you accept the possibility of lumping together things that aren't the same. Any tools using these hashes must have realised this. Fortunately, the statistics are better than the birthday paradox suggests: the chance that the two most important queries in your system end up having the same hash is minuscule. As mentioned elsewhere, a system with more than 10,000 different queries sounds rare to me (once you exclude query parameters, of course). Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
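The two probabilities being discussed are easy to put numbers on (a sketch using the usual birthday approximation, valid while the result is small; the 32-bit hash width is an assumption for illustration):

```c
#include <stdint.h>

/* Approximate P(at least one collision among n distinct queries hashed
 * into 2^bits buckets).  For small results the birthday bound
 * 1 - exp(-n(n-1)/2^(bits+1)) is ~= n(n-1)/2^(bits+1). */
double p_any_collision(double n, int bits)
{
    double buckets = (double)(1ULL << bits);   /* assumes bits <= 63 */
    return n * (n - 1.0) / 2.0 / buckets;
}

/* P(two *specific* queries collide): simply 1/2^bits. */
double p_pair_collision(int bits)
{
    return 1.0 / (double)(1ULL << bits);
}
```

With 10,000 distinct queries and a 32-bit hash this gives roughly a 1% chance that *some* pair collides, but only about 2.3e-10 that two *particular* queries do -- which is the distinction being made above.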
Re: [HACKERS] Oid registry
On Tue, Sep 25, 2012 at 09:18:30AM -0400, Andrew Dunstan wrote: I'm not at all familiar with record_to_json or the json datatype, but wouldn't it be appropriate to create a cast from hstore to json to handle that case? No, the difficulty (or at least the first difficulty) is in having the code recognize that it has an hstore at all. The code picks apart the record field by field at run time and takes various actions depending on the field's type. For any type it doesn't recognize it just outputs the value as a string, and so that's what it does with hstore. Mostly this is the right thing, but in the hstore case it's rather sad. Is there a particular reason to special-case hstore though? Wouldn't it be sufficient to simply look for a cast from the given type OID to json and use that if present? If you don't like casts, allow other extensions to define a function to_json_hook(yourtype) -> json which is used. Hardcoding OIDs in extensions just doesn't seem right somehow... (Hmm, I see someone else in the thread pointed this out too.) Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Rules and WITH and LATERAL
On Sun, Aug 19, 2012 at 12:06:30PM -0400, Tom Lane wrote: While thinking about this I wondered whether it might be possible to clean up the implementation of rules, and perhaps also get rid of some of their semantic issues, by making the rule rewriter rely on WITH and/or LATERAL, neither of which we had back in the dark ages when the current rules implementation was built. In particular, WITH might offer a fix for the multiple-evaluation gotchas that people so often trip over. For instance, perhaps an UPDATE with rules could be rewritten into something like Making the rule system use WITH has always seemed like a good idea to me. ISTM though that it would tax the optimiser, as it would need to become much more clever at pushing conditions down. For example, on 9.1 at least you still get this:

$ explain with x as (select * from pg_class) select * from x where relname = 'test';
                              QUERY PLAN
------------------------------------------------------------------------
 CTE Scan on x  (cost=14.15..23.49 rows=2 width=189)
   Filter: (relname = 'test'::name)
   CTE x
     ->  Seq Scan on pg_class  (cost=0.00..14.15 rows=415 width=194)
(4 rows)

whereas without the WITH you get an index scan. So in its current form you can't use WITH to simplify the implementation of views, because performance would suck. OTOH, the intelligence in the current rule system may be a good guide for optimising WITH statements. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] spinlock-pthread_mutex : real world results
On Mon, Aug 06, 2012 at 08:54:11AM -0400, Robert Haas wrote: 2. Should we be modifying our spinlock implementation on Linux to use futexes rather than pulling pthreads into the mix? Anyone have data on the first point, or opinions on the second one? I'm not sure whether pthreads is such a thick layer. Or are you referring to the fact that you don't want to link against the library at all? If we've found a situation where our locks work better than the ones in pthreads, then either (a) we're doing something wrong or (b) the pthreads implementation could do with improvement. In either case it might be worth some investigation. If we can improve the standard pthreads implementation, everybody wins. BTW, I read that some *BSDs have futex implementations (to emulate Linux); it might be an idea to see where they're going. e.g. http://osdir.com/ml/os.dragonfly-bsd.kernel/2003-10/msg00232.html Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] Resetting libpq connections after an app error
On Sat, Jul 21, 2012 at 01:08:58AM +0100, Daniele Varrazzo wrote: Hello, apologies for bumping the question to -hackers but I got no answer from -general. If there is a better ML to post it let me know. In a libpq application, if there is an application error between PQsendQuery and PQgetResult, is there a way to revert a connection back to a usable state (i.e. from transaction status ACTIVE to IDLE) without using the network in a blocking way? We are currently doing

    while (NULL != (res = PQgetResult(conn->pgconn)))
        PQclear(res);

but this is blocking, and if the error had been caused by the network being down, we'll just get stuck in a poll() waiting for a timeout. There is PQreset(), which also exists in a non-blocking variant (PQresetStart()/PQresetPoll()). Hope this helps, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] huge tlb support
On Mon, Jul 09, 2012 at 02:11:00AM -0400, Tom Lane wrote: y...@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes: Also, I was under the impression that recent Linux kernels use hugepages automatically if they can, so I wonder exactly what Andres was testing on ... if you mean the transparent hugepage feature, IIRC it doesn't affect MAP_SHARED mappings like this. Oh! That would explain some things. It seems like a pretty nasty restriction though ... do you know why they did that? It doesn't say explicitly in the documentation (found at http://lwn.net/Articles/423592/ aka transhuge.txt), but reading between the lines I'm guessing it's due to the fact that huge pages must be aligned to 2 or 4 MB, and when dealing with a shared mapping you probably need to require it to be aligned in all address spaces. However, it seems it should work for SysV shared memory, see: http://lwn.net/Articles/375096/ . The same page suggests shared mappings should work fine. However, this page refers to the non-transparent feature. If you think about it, it must work, since huge pages are inherited through fork(). Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] Bug tracker tool we need
On Sat, Jul 07, 2012 at 11:36:41AM +0200, Magnus Hagander wrote: I've never thought of it in terms of friction, but I think that term makes a lot of sense. And it sums up pretty much one of the things that I find the most annoying with the commitfest app - in the end it boils down to the same thing. I find it annoying that whenever someone posts a new review or new comments, one has to *also* go into the CF app and add them there. Which leads to a lot of friction, which in turn leads to people not actually putting their comments in there, which decreases the value of the tool. Don't get me wrong - the CF app is a huge step up from no app at all. But it's just not done yet. Well, if that's the issue then there are well-known solutions to that. Give each commitfest entry a tag, say CF#8475, and add a bot to the mail server that examines each email for this tag and, if found, adds it to the CF app. This could then easily track commit messages, emails on the mailing list and even bug reports. (For example, someone replies to a bug report with "See CF#8484" and then a reference to the bug thread gets pushed into the CF app.) It's also a searchable identifier, which is also useful for Google. We do this at my work; it's a very low-barrier method of linking different systems. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] experimental: replace s_lock spinlock code with pthread_mutex on linux
On Wed, Jun 27, 2012 at 12:58:47AM +0200, Nils Goroll wrote: So it looks like using pthread_mutexes could at least be an option on Linux. Using futexes directly could be even cheaper. Note that below this you only have the futex(2) system call. Futexes require all counter manipulation to happen in userspace, just like now, so all the per-architecture stuff remains. On Linux, pthread mutexes are really just a thin wrapper on top of this. The futex(2) system call merely provides an interface for handling the blocking and waking of other processes, and releasing locks on process exit (so everything can still work after a kill -9). So it's more a replacement for the SysV semaphores than anything else. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] why roll-your-own s_lock? / improving scalability
On Tue, Jun 26, 2012 at 01:46:06PM -0500, Merlin Moncure wrote: Well, that would introduce a backend dependency on pthreads, which is unpleasant. Also you'd need to feature-test via _POSIX_THREAD_PROCESS_SHARED to make sure you can mutex between processes (and configure your mutexes as such when you do). There are probably other reasons why this can't be done, but I personally don't know of any. And then you have fabulous things like: https://git.reviewboard.kde.org/r/102145/ (OSX defines _POSIX_THREAD_PROCESS_SHARED but does not actually support it.) Seems not very well tested in any case. It might be worthwhile testing futexes on Linux though; they are specifically supported on any kind of shared memory (shm/mmap/fork/etc) and quite well tested. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] Pg default's verbosity?
On Mon, Jun 18, 2012 at 09:30:14PM -0400, Robert Haas wrote: There might be something to the idea of demoting a few of the things we've traditionally had as NOTICEs, though. IME, the following two messages account for a huge percentage of the chatter:

NOTICE: CREATE TABLE will create implicit sequence "foo_a_seq" for serial column "foo.a"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"

+1 Absolutely. And if you also suppress the output of the setval() calls produced by pg_dump, a successful restore of a dump would produce barely any output at all with -q. That would make errors significantly more visible. Not sure how to go about that though. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] libpq compression
On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote: The fly in the ointment with any of these ideas is that the configure list is not a list of exact cipher names, as per Magnus' comment that the current default includes tests like !aNULL. I am not sure that we know how to evaluate such conditions if we are applying an after-the-fact check on the selected cipher. Does OpenSSL expose any API for evaluating whether a selected cipher meets such a test? I'm not sure whether there's an API for it, but you can certainly check manually with openssl ciphers -v, for example:

$ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
NULL-SHA                SSLv3 Kx=RSA     Au=RSA  Enc=None      Mac=SHA1
NULL-MD5                SSLv3 Kx=RSA     Au=RSA  Enc=None      Mac=MD5
...etc...

So unless openssl includes the code twice, there must be a way to extract the list from the library. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] Last gasp
On Wed, Apr 11, 2012 at 12:00:39PM -0300, Alvaro Herrera wrote: remote in their main PG tree, and so changesets could be pulled into the same clone and cherry-picked into the master branch. If you're talking about a way of using git to support reviewing, the Gerrit tool has an interesting workflow. Essentially, anything you want reviewed you push to a magic ref, refs/for/master, which always creates a new branch. As such you have a repository which contains every patch ever submitted, but it simultaneously tracks the parents, so you know which version of the tree a patch was against. In the case of Postgres, each entry in the CF app would have its own ref (say refs/cf/234) which would collect each new version of the patch for that entry. In the end, accepted patches are cherry-picked onto the real tree. But because all patches are now in the same place you can build tooling around it more easily, like testing: does this patch cherry-pick cleanly or is there a conflict? No merge commits, just using git purely as patch storage. (Note: to make this work, Gerrit has a git server emulation, which may or may not be easy to do, but it's just a thought about workflow.) Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] Re: Cross-backend signals and administration (Was: Re: pg_terminate_backend for same-role)
On Tue, Mar 27, 2012 at 03:17:36AM +0100, Greg Stark wrote: I may be forgetting something obvious here, but is there even a function to send an interrupt signal? That would trigger the same behaviour that a user hitting C-c would trigger, which would only be handled at the next CHECK_FOR_INTERRUPTS. That seems like it would be non-controversial, and IIRC we don't currently have a function to do this for other connections the user may have, if he doesn't have access to the original terminal and doesn't have raw shell access to run arbitrary commands. Sure, C-c just sends a SIGINT. But IIRC the problem wasn't so much cancelling running queries - SIGINT appears to work fine there - it's that a connection blocked waiting for client input doesn't wake up when sent a signal. To fix that you'd need to change the signal mode and teach the various input layers (like SSL) to handle the EINTR case. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] sortsupport for text
On Mon, Mar 19, 2012 at 12:19:53PM -0400, Robert Haas wrote: On Sat, Mar 17, 2012 at 6:58 PM, Greg Stark st...@mit.edu wrote: On Fri, Mar 2, 2012 at 8:45 PM, Robert Haas robertmh...@gmail.com wrote:

12789 28.2686  libc-2.13.so  strcoll_l
 6802 15.0350  postgres      text_cmp

I'm still curious how it would compare to call strxfrm and sort the resulting binary blobs. I don't think the sortsupport stuff actually makes this any easier though. Since using it requires storing the binary blob somewhere, I think the support would have to be baked into tuplesort (or hacked into the sortkey as an expr that was evaluated earlier somehow). Well, the real problem here is that the strxfrm'd representations aren't just bigger - they are huge. On MacBook Pro, if the input representation is n characters, the strxfrm'd representation is 9n+3 characters. Ouch. I was holding out hope that you could get a meaningful improvement if we could use the first X bytes of the strxfrm output, so you only need to do a strcoll on strings that actually nearly match. But with an information density of 9 bytes per character it doesn't seem worthwhile. That, and this gem in the strxfrm manpage:

RETURN VALUE
    The strxfrm() function returns the number of bytes required to store the
    transformed string in dest, excluding the terminating '\0' character. If
    the value returned is n or more, the contents of dest are indeterminate.

Which means that you have to take the entire transformed string; you can't just ask for the first bit. I think that kind of leaves the whole idea dead in the water. Just for interest I looked at the ICU API for this, and they have the same restriction. There is another function which you can use to return partial sort keys (ucol_nextSortKeyPart), but it produces uncompressed sort keys (which seems to be what Mac OS X is doing), which seems useless for our purposes. Either this is a hard problem or we're nowhere near a target use case.
Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
On Sat, Mar 10, 2012 at 11:38:51AM -0500, Tom Lane wrote: Shigeru Hanada shigeru.han...@gmail.com writes: Thanks for the review. Agreed to write own deparser for pgsql_fdw which handles nodes which can be pushed down. Every SQL-based FDW which constructs SQL statement for each local query would need such module inside. Yeah. That's kind of annoying, and the first thing you think of is that we ought to find a way to share that code somehow. But I think it's folly to try to design a shared implementation until we have some concrete implementations to compare. An Oracle FDW, for instance, would need to emit SQL code with many differences in detail from pgsql_fdw. It's not clear to me whether a shared implementation is even practical, but for sure I don't want to try to build it before we've done some prototype single-purpose implementations. FWIW, this sounds like the compiler mechanism in SQLAlchemy for turning SQL node trees into strings. The basic idea is you define functions for converting nodes to strings. Stuff like And and Or works for every database, but then dialects can override different things. So a parameter node renders as $1 for Postgres, but for other databases perhaps as :field1. Most of the other code can be shared. http://docs.sqlalchemy.org/en/latest/core/compiler.html In my experience it works well for generating custom constructs. They have compilers for 11 different database engines, so it seems flexible enough. Mind you, they also handle DDL mapping (where most of the variation is) and datatype translations, which seems a lot further than we need here. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Re: [HACKERS] Our regex vs. POSIX on longest match
On Mon, Mar 05, 2012 at 11:28:24AM -0500, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: I think the right way to imagine this is as though the regular expression were being matched to the source text in left-to-right fashion. No, it isn't. You are headed down the garden path that leads to a Perl-style definition-by-implementation, and in particular you are going to end up with an implementation that fails to satisfy the POSIX standard. POSIX requires an *overall longest* match (at least for cases where all quantifiers are greedy), and that sometimes means that the quantifiers can't be processed strictly left-to-right greedy. An example of this is On the other hand, I think requiring an overall longest match makes your implementation non-polynomial in complexity. The simplest example I can think of is the knapsack problem: given weights x_n and a total W, it can be converted to a regex problem by matching a string of W a's against the regex a{x_1}?a{x_2}?a{x_3}? etc... Yes, Perl (and others) don't guarantee an overall longest match. I think they want you to consider regular expressions as a specialised parsing language where you can configure a state machine to process your strings. Not ideal, but predictable. The question is: what are users expecting of the PostgreSQL regex implementation? Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/