Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-21 Thread Martijn van Oosterhout
On Tue, Jul 21, 2015 at 02:24:47PM -0400, Todd A. Cook wrote:
 Hi,
 
 This thread seemed to trail off without a resolution.  Was anything done?

Not that I can tell. I was the original poster of this thread. We've
worked around the issue by placing a CHECKPOINT command at the end of
the migration script.  For us it's not a performance issue, more a
correctness one: tables were empty when they shouldn't have been.

I'm hoping a fix will appear in the 9.5 release, since we're intending
to release with that version.  A forced checkpoint every now and then
probably won't be a serious problem though.
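For concreteness, the workaround amounts to something like this (table and
columns are hypothetical, not our actual migration):

BEGIN;
CREATE TABLE migrated_data (id serial PRIMARY KEY, payload text);
TRUNCATE migrated_data;  -- truncate in the creating transaction: the risky pattern
INSERT INTO migrated_data (payload) VALUES ('...');
COMMIT;
CHECKPOINT;  -- force the new relation files to disk so crash recovery
             -- cannot replay us into an inconsistent state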

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Martijn van Oosterhout
On Fri, Jul 03, 2015 at 12:53:56PM -0400, Tom Lane wrote:
 Fujii Masao masao.fu...@gmail.com writes:
  Okay, so probably we need to change WAL replay of TRUNCATE so that
  the index file is truncated to one containing only meta page instead of
  empty one. That is, the WAL replay of TRUNCATE would need to call
  index_build() after smgrtruncate() maybe.
 
 That seems completely unworkable.  For one thing, index_build would expect
 to be able to do catalog lookups, but we can't assume that the catalogs
 are in a good state yet.
 
 I think the responsibility has to be on the WAL-writing end to emit WAL
 instructions that lead to a correct on-disk state.  Putting complex
 behavior into the reading side is fundamentally misguided.

Am I missing something? ISTM that if the truncate record was simply not
logged at all everything would work fine. The whole point is that the
table was created in this transaction and so if it exists, the table on
disk must be the correct representation.

The broken index is just one symptom. The heap also shouldn't be
truncated at all. If you insert a row before commit then after replay
the tuple should be there still.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Martijn van Oosterhout
On Fri, Jul 03, 2015 at 07:21:21PM +0200, Andres Freund wrote:
 On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote:
  Am I missing something? ISTM that if the truncate record was simply not
  logged at all everything would work fine. The whole point is that the
  table was created in this transaction and so if it exists the table on
  disk must be the correct representation.
 
 That'd not work either. Consider:
 
 BEGIN;
 CREATE TABLE ...
 INSERT;
 TRUNCATE;
 INSERT;
 COMMIT;
 
 If you replay that without a truncation wal record the second INSERT
 will try to add stuff to already occupied space. And they can have
 different lengths and stuff, so you cannot just ignore that fact.

I was about to disagree with you by suggesting that if the table was
created in this transaction then WAL logging is skipped. But testing
shows that inserts are indeed logged, as you point out.

With inserts the WAL records look as follows (relfilenodes changed):

martijn@martijn-jessie:~/git/ctm/docker$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /tmp/pgtest/postgres/pg_xlog/ 000000010000000000000001 | grep -wE '16386|16384|16390'
rmgr: Storage  len (rec/tot): 16/48, tx: 0, lsn: 0/016A79C8, prev 0/016A79A0, bkp: 0000, desc: file create: base/12139/16384
rmgr: Sequence len (rec/tot): 158/190, tx: 683, lsn: 0/016B4258, prev 0/016B2508, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Storage  len (rec/tot): 16/48, tx: 683, lsn: 0/016B4318, prev 0/016B4258, bkp: 0000, desc: file create: base/12139/16386
rmgr: Storage  len (rec/tot): 16/48, tx: 683, lsn: 0/016B9468, prev 0/016B9418, bkp: 0000, desc: file create: base/12139/16390
rmgr: Sequence len (rec/tot): 158/190, tx: 683, lsn: 0/016BC938, prev 0/016BC880, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Sequence len (rec/tot): 158/190, tx: 683, lsn: 0/016BCAF0, prev 0/016BCAA0, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Heap     len (rec/tot): 35/67, tx: 683, lsn: 0/016BCBB0, prev 0/016BCAF0, bkp: 0000, desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree    len (rec/tot): 20/52, tx: 683, lsn: 0/016BCBF8, prev 0/016BCBB0, bkp: 0000, desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree    len (rec/tot): 34/66, tx: 683, lsn: 0/016BCC30, prev 0/016BCBF8, bkp: 0000, desc: insert: rel 1663/12139/16390; tid 1/1
rmgr: Storage  len (rec/tot): 16/48, tx: 683, lsn: 0/016BCC78, prev 0/016BCC30, bkp: 0000, desc: file truncate: base/12139/16386 to 0 blocks
rmgr: Storage  len (rec/tot): 16/48, tx: 683, lsn: 0/016BCCA8, prev 0/016BCC78, bkp: 0000, desc: file truncate: base/12139/16390 to 0 blocks
rmgr: Heap     len (rec/tot): 35/67, tx: 683, lsn: 0/016BCCD8, prev 0/016BCCA8, bkp: 0000, desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree    len (rec/tot): 20/52, tx: 683, lsn: 0/016BCD20, prev 0/016BCCD8, bkp: 0000, desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree    len (rec/tot): 34/66, tx: 683, lsn: 0/016BCD58, prev 0/016BCD20, bkp: 0000, desc: insert: rel 1663/12139/16390; tid 1/1
 
   relname   | relfilenode 
-+-
 test|   16386
 test_id_seq |   16384
 test_pkey   |   16390
(3 rows)

And amazingly, the database cluster successfully recovers and there's no
error now.  So the problem is *only* because there is no data in the
table at commit time.  Which indicates that it's the 'newroot' record
that saves the day normally.  And it's apparently generated by the
first insert.

 Agreed. I think the problem is something else though. Namely that we
 reuse the relfilenode for heap_truncate_one_rel(). That's just entirely
 broken afaics. We need to allocate a new relfilenode and write stuff
 into that. Then we can forgo WAL logging the truncation record.

Would that properly initialise the index though?

Anyway, this is way outside my expertise, so I'll bow out now. Let me
know if I can be of more assistance.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-03 Thread Martijn van Oosterhout
On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:
  Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
  can probably construct a Dockerfile that reproduces it pretty reliably.
 
 I could reproduce the problem in the master branch by doing
 the following steps.

Thank you, I wasn't sure if you could kill the server fast enough
without containers, but it looks like immediate mode is enough.

 1. start the PostgreSQL server with wal_level = minimal
 2. execute the following SQL statements
  begin;
  create table test(id serial primary key);
  truncate table test;
  commit;
 3. shutdown the server with immediate mode
 4. restart the server (crash recovery occurs)
 5. execute the following SQL statement
 select * from test;
 
 The optimization of TRUNCATE operation that we can use when
 CREATE TABLE and TRUNCATE are executed in the same transaction block
 seems to cause the problem. In this case, only index file truncation is
 logged, and index creation in btbuild() is not logged because wal_level
 is minimal. Then at the subsequent crash recovery, index file is truncated
 to 0 bytes... Very simple fix is to log an index creation in that case,
 but not sure if that's ok to do..

Looks plausible to me.

For reference I attach a small tarball for reproduction with docker.

1. Unpack tarball into empty dir (it has three small files)
2. docker build -t test .
3. docker run -v /tmp/pgtest:/data test
4. docker run -v /tmp/pgtest:/data test

Data dir is in /tmp/pgtest

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


postgresql-test.tgz
Description: GNU Unix tar archive




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-02 Thread Martijn van Oosterhout
On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
 Hi,
 
 On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
  === Start with an empty database
 
 My guess is you have wal_level = minimal?

Default config, was just initdb'd. So yes, the default wal_level =
minimal.

  === Note the index file is 8KB.
  === At this point nuke the database server (in this case it was simply 
  === destroying the container it was running in.
 
 How did you continue from there? The container has persistent storage?
 Or are you repapplying the WAL to somewhere else?

The container has persistent storage on the host. What I think is
actually unusual is that the script that started postgres was missing
an 'exec', so postgres never gets the signal to shut down.

  martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 | grep -wE '16389|16387|16393'
  rmgr: XLOG     len (rec/tot): 72/104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
  rmgr: Storage  len (rec/tot): 16/48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
  rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
  rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
  rmgr: Storage  len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
  pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740
 
 Note that the truncate will lead to a new, different, relfilenode.

Really? Comparing the relfilenodes gives the same values before and
after the truncate.
 
  ctmp=# select * from test;
  ERROR:  could not read block 0 in file base/16385/16393: read only 0 of 
  8192 bytes
 
 Hm. I can't reproduce this. Can you include a bit more details about how
 to reproduce?

Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
can probably construct a Dockerfile that reproduces it pretty reliably.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




[HACKERS] WAL logging problem in 9.4.3?

2015-07-02 Thread Martijn van Oosterhout
 this helps.
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] upper planner path-ification

2015-05-14 Thread Martijn van Oosterhout
On Thu, May 14, 2015 at 12:19:44PM -0400, Robert Haas wrote:
 Well, I'm just shooting from the hip here, but it seems to me that the
 basic pipeline as it exists today is Join -> Aggregate -> SetOp ->
 Limit -> LockRows. I don't think Limit or LockRows can be moved any
 earlier.  SetOps have a lot in common with Aggregates, though, and
 might be able to commute.  For instance, if you had an EXCEPT that was
 going to knock out most of the rows and also a DISTINCT, you might
 want to postpone the DISTINCT until after you do the EXCEPT, or you
 might just want to do both at once:

Also thinking a little from the side: an SQL query is an expression of
some tree in relational algebra, and a Path is a representation of a
way to achieve some sub-part of it.  The planner tries to find
alternative ways of generating a path by reordering joins but AIUI
doesn't do much about aggregations.

What is essentially being discussed here is also allowing aggregations
and joins to commute.  DISTINCT and DISTINCT ON are just special
kinds of aggregations, so they don't need to be considered specially.
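To make that concrete, a hedged illustration of a DISTINCT commuting with
a set operation (hypothetical tables):

-- DISTINCT ahead of the set operation:
SELECT DISTINCT x FROM t1
EXCEPT
SELECT x FROM t2;
-- EXCEPT already de-duplicates its output, so a planner that knows the two
-- operations commute could postpone the DISTINCT past the EXCEPT and then
-- drop it entirely, doing both in one pass.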

ISTM you should be able to note, for each aggregation, which joins it
commutes with and which it doesn't, perhaps even with a simple bitmap.
The backbone of the plan is the order of the aggregations, which
generally won't commute at all, and the joins, which can float around as
long as the dependencies (stuff that won't commute) are satisfied.

 And both set operations and aggregates can sometimes commute with
 joins, which seems like the stickiest wicket, because there can't be
 more than a couple of levels of grouping or aggregation in the same
 subquery (GROUP BY, DISTINCT, UNION?) but there can be lots of joins,
 and if a particular aggregate can be implemented in lots of places,
 things start to get complicated.

I think this is basically the same idea. I'm not sure aggregates and
set operations can commute in general, unless you could somehow
(conceptually) describe them as a join/antijoin.  UNION might be
special here.

 Like, if you've got a SELECT DISTINCT ON (a.x) ... FROM
 ...lots...-type query, I think you can pretty much slap the DISTINCT
 on there at any point in the join nest, probably ideally at the point
 where the number of rows is the lowest it's going to get, or maybe
 when the sortorder is convenient.  For a GROUP BY query with ungrouped
 dependent columns, you can do the aggregation before or after those
 joins, but you must do it after the joins that are needed to provide
 all the values needed for the aggregated columns.  If you model this
 with RelOptInfos, you're basically going to need a RelOptInfo to say
 "I include these relids and these aggregation or setop operations."
 So in the worst case each agg/setop doubles the number of RelOptInfos,
 although probably in reality it doesn't because you won't have
 infinite flexibility to reorder things.

I think it's more like the number of possibilities doubles for each
join that could commute with the aggregate.  But as you say the number
of actual possibilities doesn't actually grow that fast.  I think it
may be better to have each path track the relids and aggregates it
covers, but then you need to have an efficient way to work out which
rels/aggregates can be considered for each path.  Either way it
sounds like quite a planner overhaul.

Hope this helps,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-15 Thread Martijn van Oosterhout
On Wed, Apr 15, 2015 at 12:37:44AM -0400, Robert Haas wrote:
  I think such a solution will be good for the cases when many evictions
  need to be performed to satisfy the workload,  OTOH when there are
  not too many evictions that need to be done, in such a case some of
  the buffers that are accessed much more will have equal probability to
  get evicted as compared to buffers which are less accessed.
 
 Possibly, but I think it's even worse under the current algorithm.
 Under this proposal, if we go for a long time without any buffer
 evictions, every buffer's usage count will top out at 2 more than
 wherever it was after the last clock sweep.   In the worst case, every
 buffer (or most of them) could end up with the same usage count.  But
 under the status quo, they'll all go to 5, which is an even bigger
 loss of information, and which will make the first eviction much more
 expensive than if they are all pegged at 2 or 3.

I've been following this thread from the side with interest and got
twigged by the point about loss of information.  If you'd like better
information about relative ages, you can achieve this by raising the
cap on the usage count and dividing (or right-shifting) each sweep.

This would allow you to remember much more about the relative
worth of often used pages.  With a cap of 32 you'd have the same effect
as now where after 5 sweeps the buffer is evicted.  Mathematically the
count would converge to the number of times the block is used per
sweep.

If you wanted to be really clever, you could at the beginning of each
sweep take an estimate of the number of buffers used since the last
sweep (from the stats collector perhaps) and use that to drive your
divisor, so if you have a lot of allocations you become more
aggressive about reducing the counts.  Or if the load is light fall
back to just subtracting one.  Then you don't need a cap at all.

(Apologies if this has been suggested before, Google didn't find
anything for me).

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory

2015-02-13 Thread Martijn van Oosterhout
On Thu, Feb 12, 2015 at 08:52:56AM -0500, Robert Haas wrote:
  BTW, I'm not all that thrilled with the "deserialized object" terminology.
  I found myself repeatedly tripping up on which form was serialized and
  which de-.  If anyone's got a better naming idea I'm willing to adopt it.
 
 My first thought is that we should form some kind of TOAST-like
 backronym, like Serialization Avoidance Loading and Access Device
 (SALAD) or Break-up, Read, Edit, Assemble, and Deposit (BREAD).  I
 don't think there is anything per se wrong with the terms
 serialization and deserialization; indeed, I used the same ones in the
 parallel-mode stuff.  But they are fairly general terms, so it might
 be nice to have something more specific that applies just to this
 particular usage.

The words that sprung to mind for me were: packed/unpacked.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] longjmp clobber warnings are utterly broken in modern gcc

2015-02-01 Thread Martijn van Oosterhout
On Sun, Jan 25, 2015 at 07:11:12PM -0500, Tom Lane wrote:
 Martijn van Oosterhout klep...@svana.org writes:
  On Sun, Jan 25, 2015 at 02:02:47PM -0500, Tom Lane wrote:
  It's a bit of a long shot, but perhaps if you put something like:
 
  asm volatile("" ::: "memory")
 
  at the beginning of the catch-block it might convince the compiler to
  forget any assumptions about what is in the local variables...
 
 Meh.  Even if that worked for gcc (which as you say is uncertain),
 it would help not at all for other compilers.  The POSIX requirements
 for portable code are clear: we need a volatile marker on affected
 variables.

Never mind, it doesn't work. It's not that GCC doesn't know setjmp() is
special; it does (the returns_twice attribute), so GCC effectively does
the above itself.  The problem is that local variables may be stored
in memory over calls in the PG_TRY() block; volatile is a sledgehammer
way of preventing that.

The problem is, GCC doesn't know anything about what the return value
of setjmp() means, so it can never produce any sensible
warnings in this area.

If you want the compiler to catch this, I don't see any way without
requiring the code to indicate specifically which local variables it
intends to use, or not using the locals at all by using a separate
cleanup function (as discussed elsewhere in this thread).  With
information about the locals you might be able to conjure up some GCC
macros to set things up to complain if you use anything else.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] longjmp clobber warnings are utterly broken in modern gcc

2015-01-25 Thread Martijn van Oosterhout
On Sun, Jan 25, 2015 at 02:02:47PM -0500, Tom Lane wrote:
 and compared the assembly language generated with and without adding
 volatile to Tmpfd's declaration.  Without volatile (ie, in the
 code as shipped), gcc optimizes away the assignment Tmpfd = -1
 within PG_TRY, and it also optimizes away the if-test in PG_CATCH,
 apparently believing that control cannot transfer from inside the
 PG_TRY to the PG_CATCH.  This is utterly wrong of course.  The issue
 is masked because we don't bother to test for a failure return from the
 second close() call, but it's not hard to think of similar coding
 patterns where this type of mistaken optimization would be disastrous.
 (Even here, the bogus close call could cause a problem if we'd happened
 to open another file during the last part of the PG_TRY stanza.)

<snip>

 This is scary as hell.  I intend to go around and manually audit
 every single PG_TRY in the current source code, but that is obviously
 not a long-term solution.  Anybody have an idea about how we might
 get trustworthy mechanical detection of this type of situation?

It's a bit of a long shot, but perhaps if you put something like:

asm volatile("" ::: "memory")

at the beginning of the catch-block it might convince the compiler to
forget any assumptions about what is in the local variables...

Hope this helps,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-24 Thread Martijn van Oosterhout
On Thu, Jan 22, 2015 at 03:50:03PM -0600, Merlin Moncure wrote:
 Quick update:  not done yet, but I'm making consistent progress, with
 several false starts.  (for example, I had a .conf problem with the
 new dynamic shared memory setting and git merrily bisected down to the
 introduction of the feature.).
 I have to triple check everything :(. The problem is generally
 reproducible but I get false negatives that throws off the bisection.
 I estimate that early next week I'll have it narrowed down
 significantly if not to the exact offending revision.

I've never used it but the BBChop project claims to be able to bisect
even in the case of intermittent failure.

https://github.com/Ealdwulf/bbchop/

It claims to be slow, but I don't think that's the limiting factor here...

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-12-20 Thread Martijn van Oosterhout
On Fri, Dec 19, 2014 at 05:32:43PM -0800, Peter Geoghegan wrote:
  Most people would list the columns, but if there is a really bizarre
  constraint, with non-default opclasses, or an exclusion constraint, it's
  probably been given a name that you could use.
 
 What I find curious about the opclass thing is: when do you ever have
 an opclass that has a different idea of equality than the default
 opclass for the type? In other words, when is B-Tree strategy number 3
 not actually '=' in practice, for *any* B-Tree opclass? Certainly, it
 doesn't appear to be the case that it isn't so with any shipped
 opclasses - the shipped non-default B-Tree opclasses only serve to
 provide alternative notions of sort order, and never equals.

Well, in theory you could build a case-insensitive index on a text
column.  You could argue that the column should have been defined as
citext in the first place, but it might not have been for various reasons.
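In practice (a sketch with a hypothetical table; a genuinely
case-insensitive opclass would bake the folding into the comparison
itself):

-- the usual workaround when the column wasn't declared citext:
CREATE UNIQUE INDEX users_email_ci ON users (lower(email));
-- or, with contrib/citext, equality itself becomes case-insensitive:
-- CREATE TABLE users (email citext UNIQUE);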

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] [PATCH] add ssl_protocols configuration option

2014-10-23 Thread Martijn van Oosterhout
On Wed, Oct 22, 2014 at 09:36:59PM +0200, Dag-Erling Smørgrav wrote:
 Martijn van Oosterhout klep...@svana.org writes:
  Dag-Erling Smørgrav d...@des.no writes:
   If I understand correctly, imaps has been shown to be vulnerable as
   well, so I wouldn't be so sure.
  Reference?
 
 Sorry, no reference.  I was told that Thunderbird was vulnerable to
 POODLE when talking imaps.

Ugh, found it. It does the same connection-fallback stuff as Firefox.

https://securityblog.redhat.com/2014/10/20/can-ssl-3-0-be-fixed-an-analysis-of-the-poodle-attack/

  Since you can already specify the cipher list, couldn't you just add
  -SSLv3 to the cipher list and be done?
 
 I didn't want to change the existing behavior; all I wanted was to give
 users a way to do so if they wish.

I think we should just disable SSL3.0 altogether. The only way this
could cause problems is if people are using PostgreSQL with an OpenSSL
library from last century.  As for client libraries, even Windows XP
supports TLS1.0.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] [PATCH] add ssl_protocols configuration option

2014-10-22 Thread Martijn van Oosterhout
On Wed, Oct 22, 2014 at 03:14:26PM +0200, Dag-Erling Smørgrav wrote:
  In a case like POODLE we probably wouldn't have done it anyway, as it
  doesn't affect our connections (we're not http)
 
 If I understand correctly, imaps has been shown to be vulnerable as
 well, so I wouldn't be so sure.

Reference? It's an SSL3-specific attack, so most servers are not
vulnerable because TLS will negotiate to the highest supported
protocol.  So unless either client or server doesn't support TLS 1.0
there's no issue.  The only reason http is vulnerable is because
browsers do protocol downgrading, something strictly forbidden by the
spec.

Additionally, the man-in-the-middle must be able to control the padding
in the startup packet, which just isn't possible (no scripting language
in the client).

Since you can already specify the cipher list, couldn't you just add
-SSLv3 to the cipher list and be done?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Collations and Replication; Next Steps

2014-09-18 Thread Martijn van Oosterhout
On Thu, Sep 18, 2014 at 01:35:10PM +0900, Tatsuo Ishii wrote:
 In my understanding PostgreSQL's manual MUST include the ICU license
 term (this is not a problem).  What I am not so sure about is whether any
 software that uses PostgreSQL also MUST include the ICU license. If yes, I
 think this is surely a problem.

Only if we're thinking of distributing it. If the user gets ICU from
their distribution then there is no need to list the licence (just like
we don't need to mention the licence of glibc).  We only need to link
against it, not distribute it.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Anonymous code block with parameters

2014-09-18 Thread Martijn van Oosterhout
On Wed, Sep 17, 2014 at 10:17:22PM +0200, Pavel Stehule wrote:
  Because you still have to do
 
  SELECT pg_temp.my_temp_function(blah);
 
  to execute it.
 
 
 this problem should be solvable. I can use temporary tables without
 using the pg_temp schema.

Umm, IIRC it used to work that way but was changed to work like this.
The reason was that anyone can create functions in the temp
schema and thus hijack other functions that more privileged
functions might call.

Or something like that. I think it was even a CVE.
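The behaviour we ended up with, as I understand it (sketch):

-- anyone can create a function in their own temp schema...
CREATE FUNCTION pg_temp.my_func() RETURNS int
  LANGUAGE sql AS $$ SELECT 1 $$;
SELECT pg_temp.my_func();  -- works: explicitly qualified
SELECT my_func();          -- fails: unqualified function lookup skips pg_temp,
                           -- precisely so temp functions can't hijack callers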

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Collations and Replication; Next Steps

2014-09-18 Thread Martijn van Oosterhout
On Wed, Sep 17, 2014 at 03:57:38PM +0100, Greg Stark wrote:
 Then there's the concern that ICU is a *huge* dependency. ICU is
 itself larger than the entire Postgres install. It's a big burden on
 users to have to install and configure a second collation library in
 addition to the system library and a complete non-starter for embedded
 systems or low-memory systems.

$ apt-cache show libicu52|grep Installed-Size
Installed-Size: 27283

That's 27MB, or less than 2 WAL files.  Or about 4 times the template
database, of which 3 are created during install and the first user
database will be a fourth.

The installed size of Postgres is 11MB not including any of the
libraries it already depends on.

Yes, it's not a small library but lets not get carried away.

And if it's optional then low memory systems can configure it out.

As for configuration, ICU doesn't require configuration just like glibc
doesn't require configuration.

Kind regards,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Collations and Replication; Next Steps

2014-09-17 Thread Martijn van Oosterhout
On Tue, Sep 16, 2014 at 02:57:00PM -0700, Peter Geoghegan wrote:
 On Tue, Sep 16, 2014 at 2:07 PM, Peter Eisentraut pete...@gmx.net wrote:
  Clearly, this is worth documenting, but I don't think we can completely
  prevent the problem.  There has been talk of a built-in index integrity
  checking tool.  That would be quite useful.
 
 We could at least use the GNU facility for versioning collations where
 available, LC_IDENTIFICATION [1]. By not versioning collations, we are
 going against the express advice of the Unicode consortium (they also
 advise to do a strcmp() tie-breaker, something that I think we
 independently discovered in 2005, because of a bug report - this is
 what I like to call the Hungarian issue. They know what our
 constraints are.). I recognize it's a tricky problem, because of our
 historic dependence on OS collations, but I think we should definitely
 do something. That said, I'm not volunteering for the task, because I
 don't have time. While I'm not sure of what the long term solution
 should be, it *is not* okay that we don't version collations. I think
 that even the best possible B-Tree check tool is a not a solution.

Personally I think we should just support ICU as an option. FreeBSD has
been maintaining an out of tree patch for 10 years now so we know it
works.

The FreeBSD patch is not optimal though; these days ICU supports UTF-8
directly, so many of the push-ups FreeBSD does are no longer necessary.
It is often faster than glibc and the key sizes for strxfrm are more
compact [1], which is relevant for the recent optimisation patch.

Let's solve this problem once and for all.

[1] http://site.icu-project.org/charts/collation-icu4c48-glibc

-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Collations and Replication; Next Steps

2014-09-17 Thread Martijn van Oosterhout
On Wed, Sep 17, 2014 at 01:07:56PM +, Matthew Kelly wrote:
  I'm with Martijn here, let's go ICU, if only because it moves sorting
  to a user level library, instead of a system level.  Martijn do you
  have a link to the out of tree patch?  If not I'll find it.  I'd like
 to apply it to a branch and start playing with it.

http://people.freebsd.org/~girgen/postgresql-icu/README.html
http://people.freebsd.org/~girgen/postgresql-icu/

Note I said optional. It is a large library for sure, but for some
installations I think the benefits are sufficient.

Kind regards,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Aussie timezone database changes incoming

2014-09-14 Thread Martijn van Oosterhout
On Thu, Sep 11, 2014 at 03:42:14PM +1000, Andrew McNamara wrote:
 Anyone from down under care to remark about the actual usage of old
 and new abbreviations?

About bloody time!

 
 AEST/AEDT/etc are the official abbreviations and are commonly used.
 They have been increasingly used over the last 20 years or so, and the
 EST/EDT stuff on the Olsen tz database has been a source of annoyance
 for a very long time, eg:
 
 http://thread.gmane.org/gmane.comp.time.tz/2262
 
 Quite likely this change will break stuff, but my feeling is more people
 will be cheering than screaming.

Indeed, this has been a pain in the ass for a long long time.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] On partitioning

2014-09-02 Thread Martijn van Oosterhout
On Tue, Sep 02, 2014 at 09:44:17AM -0400, Bruce Momjian wrote:
 On Sun, Aug 31, 2014 at 10:45:29PM +0200, Martijn van Oosterhout wrote:
  There is one situation where you need to be more flexible, and that is
  if you ever want to support online repartitioning. To do that you have
  to distinguish between I want to insert tuple X, which partition
  should it go into and I want to know which partitions I need to look
  for partition_key=Y.
 
 I am unclear why having information per-partition rather than on the
  parent table helps with online repartitioning.

An example:

We have three partitions: one for X < 0 (A), one for 0 <= X < 5 (B) and one
for X >= 5 (C).  These are in three different tables.

Now we give the command to merge the last two partitions B & C. You now
have the choice to lock the table while you move all the tuples from C
to B.

Or you can make some adjustments such that new tuples that would have gone
to C now go to B. And if there is a query for X=10, you look in
*both* B & C. Then the existing tuples can be moved from C to B at any
time without blocking any other operations.

Is this clearer? If you decide up front that the partition to query
is determined by a function that can only return one table, then
the above becomes impossible.
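In today's inheritance terms the three partitions might look like this
(a sketch, hypothetical names):

CREATE TABLE parent (x int);
CREATE TABLE a (CHECK (x < 0))            INHERITS (parent);
CREATE TABLE b (CHECK (x >= 0 AND x < 5)) INHERITS (parent);
CREATE TABLE c (CHECK (x >= 5))           INHERITS (parent);
-- merging C into B online means first widening B's routing rule so new
-- x >= 5 tuples land in B, while queries transiently scan both B and C
-- until the old tuples have been moved over.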

 Robert's idea of using normal table inheritance means we can access/move
 the data independently of the partitioning system.  My guess is that we
 will need to do repartitioning with some tool, rather than as part of
 normal database operation.

Doing it with some tool seems like a hack to me. And since the idea was (I
thought) that partitions would not be directly accessible from SQL, it
has to be in the database itself.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] On partitioning

2014-08-31 Thread Martijn van Oosterhout
On Fri, Aug 29, 2014 at 12:35:50PM -0400, Tom Lane wrote:
  Each partition is assigned an Expression that receives a tuple and
  returns boolean.  This expression returns true if a given tuple belongs
  into it, false otherwise.
 
 -1, in fact minus a lot.  One of the core problems of the current approach
 is that the system, particularly the planner, hasn't got a lot of insight
 into exactly what the partitioning scheme is in a partitioned table built
 on inheritance.  If you allow the partitioning rule to be a black box then
 that doesn't get any better.  I want to see a design wherein the system
 understands *exactly* what the partitioning behavior is.  I'd start with
 supporting range-based partitioning explicitly, and maybe we could add
 other behaviors such as hashing later.
 
 In particular, there should never be any question at all that there is
 exactly one partition that a given row belongs to, not more, not less.
 You can't achieve that with a set of independent filter expressions;
 a meta-rule that says exactly one of them should return true is an
 untrustworthy band-aid.
 
 (This does not preclude us from mapping the tuple through the partitioning
 rule and finding that the corresponding partition doesn't currently exist.
 I think we could view the partitioning rule as a function from tuples to
 partition numbers, and then we look in pg_class to see if such a partition
 exists.)

There is one situation where you need to be more flexible, and that is
if you ever want to support online repartitioning. To do that you have
to distinguish between "I want to insert tuple X, which partition
should it go into" and "I want to know which partitions I need to look
in for partition_key=Y".

For the latter you really need an expression per partition, or
something equivalent.  If performance is an issue I suppose you could
live with having an "old" and a "new" partition scheme, so you
couldn't have two live repartitionings happening simultaneously.

Now, if you want to close the door on online repartitioning forever
then that's fine. But being in the position of having to say "yes, our
partitioning scheme sucks, but we would have to take the database down
for a week to fix it" is no fun.

Unless logical replication provides a way out maybe??

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] SQL MERGE is quite distinct from UPSERT

2014-07-20 Thread Martijn van Oosterhout
On Sat, Jul 19, 2014 at 09:55:19PM -0700, Peter Geoghegan wrote:
 At a high level SQL MERGE is quite distinct from UPSERT, in that it is
 a utility command that performs inserts, updates and deletes while
 avoiding race conditions (e.g. unique constraint violations) on a more
 or less best effort basis. MERGE is conceptually messy. In contrast
 UPSERT is actually atomic, and having its behavior be relatively easy
 to reason about ought to be the top priority. There is a *really* big
 demand for UPSERT from users, not MERGE, although MERGE is certainly
 useful too.

FWIW, I agree. MERGE is hard enough as it is, but trying to guarantee
some kind of atomicity makes it nigh on impossible.  Indeed, after
reading what you wrote I think it may well be impossible to make it
atomic *and* make it perform in the general case.

So, +1 UPSERT.
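For reference, the kind of atomic statement being asked for looks roughly
like this (a sketch only; the exact syntax is precisely what's being
debated):

INSERT INTO counters (key, n)
VALUES ('hits', 1)
ON CONFLICT (key) DO UPDATE SET n = counters.n + EXCLUDED.n;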

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] better atomics - v0.5

2014-07-11 Thread Martijn van Oosterhout
On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote:
 As per my understanding of the general theory around barriers,
  read and write barriers are defined to avoid reordering due to the compiler,
  and full memory barriers are defined to avoid reordering due to processors.
 There are some processors that support instructions for memory
 barriers with acquire, release and fence semantics.

Not exactly: both compilers and processors can rearrange loads and
stores.  Because the architecture most developers work on (x86) has
such a strong memory model (it goes to a lot of effort to hide
reordering), people are under the impression that it's only the compiler
you need to worry about, but that's not true for any other
architecture.

Memory barriers/fences are on the whole discouraged if you can use
acquire/release semantics because the latter are faster and much easier
to optimise for.

I strongly recommend the "Atomic Weapons" talk on the C11 memory model
to help understand how they work. As a bonus it includes correct
implementations for the major architectures.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Can simplify 'limit 1' with slow function?

2014-07-04 Thread Martijn van Oosterhout
Fascinating.

On Fri, Jul 04, 2014 at 10:47:06AM +0800, gotoschool6g wrote:
 slow query(8531 ms):
 SELECT ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 
 40.12211338311868)')) FROM road order by id LIMIT 1;
 
 explain output:
 Limit  (cost=4653.48..4653.48 rows=1 width=3612)
   ->  Sort  (cost=4653.48..4683.06 rows=11832 width=3612)
         Sort Key: id
         ->  Seq Scan on road  (cost=0.00..4594.32 rows=11832 width=3612)
 
 fast query(16ms):
 select ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 
 40.12211338311868)')) from (SELECT shape FROM road order by id  LIMIT 1) a
 
 explain output:
 Subquery Scan on a  (cost=1695.48..1695.74 rows=1 width=3608)
   ->  Limit  (cost=1695.48..1695.48 rows=1 width=3612)
         ->  Sort  (cost=1695.48..1725.06 rows=11832 width=3612)
               Sort Key: road.id
               ->  Seq Scan on road  (cost=0.00..1636.32 rows=11832 width=3612)

So Postgres knows perfectly well that it's expensive, it just doesn't
appear to understand it has the option of moving the calculation above
the limit.

In this case though, it seems an index on road(id) would make it
instant in any case.
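Something like this (hypothetical; assumes id is the sort column):

CREATE INDEX road_id_idx ON road (id);
-- ORDER BY id LIMIT 1 then becomes a single index probe, and the expensive
-- function is evaluated for only that one surviving row.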

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Can simplify 'limit 1' with slow function?

2014-07-02 Thread Martijn van Oosterhout
On Tue, Jul 01, 2014 at 02:36:55PM -0500, Merlin Moncure wrote:
 On Tue, Jul 1, 2014 at 2:16 PM, Martijn van Oosterhout
 klep...@svana.org wrote:
  On Sun, Jun 29, 2014 at 10:05:50PM +0800, gotoschool6g wrote:
  The simplified scene:
  select slowfunction(s) from a order by b limit 1;
  is slower than
  select slowfunction(s) from (select s from a order by b limit 1) as z;
  if there are many records in table 'a'.
 
 
  The real scene. Function  ST_Distance_Sphere is slow, the query:
  SELECT ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from road 
  order by c limit 1;
  is slower than:
  select ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from (SELECT s 
  from road order by c limit 1) as a;
  There are about 7000 records in 'road'.
 
  To help here I think we need the EXPLAIN ANALYSE output for
  both queries.
 
 Well, I think the problem is a well understood one: there is no
 guarantee that functions-in-select-list are called exactly once per
 output row.  This is documented -- for example see here:
 http://www.postgresql.org/docs/9.1/static/explicit-locking.html#ADVISORY-LOCKS.
 In short, if you want very precise control of function evaluation use
 a subquery, or, if you're really paranoid, a CTE.

I'm probably dense, but I'm not sure I understand. Or is it that the
slowfunction() is called prior to the sort? That seems insane.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Can simplify 'limit 1' with slow function?

2014-07-02 Thread Martijn van Oosterhout
On Wed, Jul 02, 2014 at 04:17:13PM -0400, Tom Lane wrote:
 David G Johnston david.g.johns...@gmail.com writes:
  Martijn van Oosterhout wrote
  I'm probably dense, but I'm not sure I understand. Or it is that the
  slowfunction() is called prior to the sort? That seems insane.
 
  The basic reality is that limit applies to the final set of rows that could
  be output.
 
 It's not so much the limit as that the sort has to happen before the
 limit, and yes, evaluation of the targetlist happens before the sort.

I guess I assumed the column c was indexable, and in that case I
believe the slowfunction() would indeed only be called once.

 This is fundamental to the SQL conceptual model; remember that SQL92 had
 SELECT slowfunction(), ... ORDER BY 1, which certainly requires the
 function to be evaluated before the sort happens.  And there's nothing in
 the conceptual model suggesting that different targetlist entries should
 be evaluated at different times, so just ordering by something other than
 the slowfunction() entry doesn't get you out of that.
 
 I'm not sure how much of this there is chapter and verse for in the
 SQL standard, but ISTM the stage sequencing we lay out in our SELECT
 reference page is pretty much forced by the standard.

In the conceptual model the limit must happen after the select. But as
an optimisation moving the evaluation above the limit node (when
possible) should always be a win.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Can simplify 'limit 1' with slow function?

2014-07-01 Thread Martijn van Oosterhout
On Sun, Jun 29, 2014 at 10:05:50PM +0800, gotoschool6g wrote:
 The simplified scene: 
 select slowfunction(s) from a order by b limit 1;
 is slower than
 select slowfunction(s) from (select s from a order by b limit 1) as z;
 if there are many records in table 'a'.
 
 
 The real scene. Function  ST_Distance_Sphere is slow, the query:
 SELECT ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from road order 
 by c limit 1;
 is slower than:
 select ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from (SELECT s 
 from road order by c limit 1) as a;
 There are about 7000 records in 'road'.  

To help here I think we need the EXPLAIN ANALYSE output for
both queries.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Escaping from blocked send() reprised.

2014-07-01 Thread Martijn van Oosterhout
On Tue, Jul 01, 2014 at 12:26:43PM +0900, Kyotaro HORIGUCHI wrote:
  1. I think it's the case that there are platforms around where a
  signal won't cause send() to return EINTR and I'd be entirely
  unsurprised if SSL_write() doesn't necessarily return EINTR in that
  case.  I'm not sure what, if anything, we can do about that.
 
 man 2 send on FreeBSD has no description of EINTR.. And even
 on linux, send won't return EINTR in most cases, at least I
 haven't seen it. So send()=-1,EINTR seems to me to be only an
 equivalent of send() = 0. I have no idea what the
 implementer thought the difference is.

Whether send() returns EINTR or not depends on whether the signal has
been marked restartable or not. This is configurable per signal, see
sigaction(). If the signal is marked to restart, the kernel returns
ERESTARTSYS (IIRC) and the libc will redo the call internally.

Default BSD does not return EINTR normally, but supports sigaction().

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Minmax indexes

2014-06-21 Thread Martijn van Oosterhout
I'm sorry if I missed something, but ISTM this is beginning to look a
lot like GiST. This was pointed out by Robert Haas last year.

On Wed, Jun 18, 2014 at 12:09:42PM -0300, Claudio Freire wrote:
 So, you have:
 
 An aggregate to generate a compressed set from several values

Which GiST does by calling 'compress' on each value, and then 'unions' the
results together.

 A function which adds a new value to the compressed set and returns
 the new compressed set

Again, 'compress' + 'union'

 A function which tests if a value is in a compressed set

Which GiST does using 'compress' + 'consistent'

 A function which tests if a compressed set overlaps another
 compressed set of equal type

Which GiST calls 'consistent'

So I'm wondering why you can't just reuse the btree_gist functions we
already have in contrib.  It seems to me that these MinMax indexes are
in fact a variation on GiST that indexes the pages of a table based
upon the 'union' of all the elements in a page.  By reusing the GiST
operator class you get support for many datatypes for free.
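For example (a sketch using contrib, hypothetical table):

CREATE EXTENSION btree_gist;
-- btree_gist supplies GiST opclasses with plain btree semantics for the
-- ordinary scalar types, so a page-level 'union' structure could lean on them:
CREATE INDEX ON measurements USING gist (recorded_at);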

 If you can define different compressed sets, you can use this to
 generate both min/max indexes as well as bloom filter indexes. Whether
 we'd want to have both is perhaps questionable, but having the ability
 to is probably desirable.

You could implement bloom filter in GiST too. It's been discussed
before but I can't find any implementation. Probably because the filter
needs to be parameterised and if you store the bloom filter for each
element it gets expensive very quickly. However, hooked into a minmax
structure which only indexes whole pages it could be quite efficient.

 One problem with such a generalized implementation would be, that I'm
 not sure in-place modification of the compressed set on-disk can be
 assumed to be safe on all cases. Surely, for strictly-enlarging sets
 it would, but while min/max and bloom filters both fit the bill, it's
 not clear that one can assume this for all structures.

I think GiST has already solved this problem.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] UPDATE SET (a,b,c) = (SELECT ...) versus rules

2014-06-17 Thread Martijn van Oosterhout
On Sat, Jun 14, 2014 at 03:35:33PM -0400, Tom Lane wrote:
 The best that I think is reasonable to do in such cases is to pull out
 a separate copy of the sub-select for each actual NEW reference in a
 rule query.  So the example above would give rise to an expanded
 rule query along the lines of
 
   INSERT INTO foolog VALUES ( (SELECT x as a, y as b, ...).a,
   (SELECT x as a, y as b, ...).b,
   ... );

Would it not be possible to use WITH here, like:

WITH bar AS ( ... subselect ... )
INSERT INTO foolog VALUES (bar.a, bar.b, ...)

Or am I missing something?
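To be precise, I mean something of this shape (hypothetical names, and
assuming the subselect yields a single row):

WITH bar AS (SELECT x AS a, y AS b FROM src)
INSERT INTO foolog SELECT bar.a, bar.b FROM bar;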

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Allowing NOT IN to use ANTI joins

2014-06-09 Thread Martijn van Oosterhout
On Mon, Jun 09, 2014 at 12:36:30AM +1200, David Rowley wrote:
 Currently pull_up_sublinks_qual_recurse only changes the plan for NOT
 EXISTS queries and leaves NOT IN alone. The reason for this is because the
 values returned by a subquery in the IN clause could have NULLs.

Awesome. I've had a brief look at the patch and other than a line of
extraneous whitespace it looks sane.

Since it is only testing on NOT IN queries I don't think there are any
issues with it slowing down simple queries.

I also note you can't prove id+1 to be not null. At first I thought you
might be able to prove it not null if the operator/function was
strict, but then I realised that strict only means "null if input is
null", not "output is only null if inputs are null". Pity.
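To illustrate with a hedged example (hypothetical tables; both
orders.customer_id and customers.id declared NOT NULL):

-- NOT IN can only become an anti-join when neither side can yield NULL:
SELECT * FROM orders o
WHERE o.customer_id NOT IN (SELECT id FROM customers);
-- under those NOT NULL constraints this is equivalent to:
SELECT * FROM orders o
WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id);
-- but a subquery like "SELECT id + 1 FROM customers" defeats the proof:
-- strict only guarantees NULL-in => NULL-out, not non-NULL-in => non-NULL-out.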

Nice work.
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Supporting Windows SChannel as OpenSSL replacement

2014-06-09 Thread Martijn van Oosterhout
On Mon, Jun 09, 2014 at 03:35:23PM +0200, Magnus Hagander wrote:
 On Mon, Jun 9, 2014 at 3:19 PM, Andreas Karlsson andr...@proxel.se wrote:
 
  On 06/09/2014 01:45 PM, Heikki Linnakangas wrote:
  There was a patch set for this from Martijn van Oosterhout which was quite
  complete.
 
  http://www.postgresql.org/message-id/20060504134807.gk4...@svana.org

Wow, blast from the past.

 A lot has, unfortunately, changed since 2006. It might be a good
 starting point. But also actively starting from the point of "let's try to
 support multiple libraries" rather than "let's try to support gnutls" is
 probably also important.

The patch did provide an API. The idea was that there were a number of
functions which would need to be defined to support an SSL library.
Each library would then have a wrapper which wrapped that library, and
based on the results of configure the right file was compiled into the
backend.

These functions were:

extern void pgtls_initialize(void);
extern void pgtls_destroy(void);
extern int pgtls_open_server(Port *);
extern void pgtls_close(Port *);
extern ssize_t pgtls_read(Port *port, void *ptr, size_t len);
extern ssize_t pgtls_write(Port *port, void *ptr, size_t len);

Which should be easy enough to support for any library. These days
you'd need to add support for verifying certificates, but I don't think
that that would be difficult (unless the actual certificate formats are
different).

No switching after compile time, that would just lead to useless
overhead.

 At some point we should design a new API, so that we can deprecate the old
 one. Even if we don't have the code ready, we need to get rid of PQgetssl(),
 and replace it with something else. I'm thinking probably a function that
 returns both a void pointer and an enum that tells you which library is
 actually in use. And a boolean just saying ssl on/off, because that's
 what a lot of clients are interested in and they don't care about more than
 that.
 
 Obviously, we also have to do something about PQinitOpenSSL().

Yeah, I think this was one of the more controversial parts. Support in
the backend was primarily moving code around and renaming functions,
fairly straightforward.  Even error handling was not so hard (I found
the gnutls handling of errors much easier than openssl).

One tricky part is that programs like to use libpq for the
authentication, and then they hijack the connection using PQgetssl(). 
The way I dealt with this is by defining a new state, passthrough,
where the caller would get a few function pointers to read/write/check
the connection.  Then the callers would not need to know what library
libpq was compiled with.  And libpq would know the connection was
hijacked and refuse to act anymore.  I don't think everyone was pleased
with this, but no real alternative was presented (other than requiring
people hijacking the connection to do the hard work).

For information about which library was in use there was PQgettlsinfo()
which returned a PGresult with information about the library and
connection.  I believe since then new functions have been added to
libpq to retrieve info about certificates, so that might need a rethink
also.

Basically, I think these last two points are the hard parts to get
agreement (assuming there's agreement to do anything at all about the
problem) and without nailing those down first whoever picks this up
will be in for a lot of work.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Supporting Windows SChannel as OpenSSL replacement

2014-06-09 Thread Martijn van Oosterhout
On Mon, Jun 09, 2014 at 11:39:17PM +0900, MauMau wrote:
 From: Heikki Linnakangas hlinnakan...@vmware.com
 Thoughts? While we're at it, we'll probably want to refactor
 things so that it's easy to support other SSL implementations too,
 like gnutls.
 
 That may be good because it provides users with choices.  But I
 wonder if it is worth the complexity and maintainability of
 PostgreSQL code.

The complexity is very low. SSL is a standard protocol and so all
libraries offer the same functionality. We're not really doing anything
complex.

 * Are SChannel and other libraries more secure than OpenSSL?  IIRC,
 recently I read in the news that GnuTLS had a vulnerability.
 OpenSSL is probably the most widely used library, and many people
 are getting more interested in its quality.  I expect the quality
 will improve thanks to the help from The Linux foundation and other
 organizations/researchers.

Does that matter? What's wrong with letting people choose? OpenVPN
these days supports multiple SSL libraries, because PolarSSL (for
example) has been vetted for a higher security level than OpenSSL.

 * Do other libraries get support from commercial vendor product
 support? For example, Safenet Inc., the famous HSM (hardware
 security module) vendor, supports OpenSSL to access the private key
 stored in its HSM product.  Intel offered AES-NI implementation code
 to OpenSSL community.  I guess OpenSSL will continue to be the most
 functional and obtain the widest adoption and support.

And the crappiest license. I think it's silly for PostgreSQL to dictate
what SSL library users must use, when there are so many possibilities. 
We also support libedit for, in my opinion, worse reasons.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?

2014-06-03 Thread Martijn van Oosterhout
On Mon, Jun 02, 2014 at 01:29:25PM -0400, Robert Haas wrote:
 I agree, but I think it's important to note that Alex's complaint is
 not unique - the way things work now is a real source of frustration
 for users.  In a previous job, I wrote a schema-upgrade script that
 dropped all of the views in reverse creation order, applied the schema
 updates, and then recreated all the views. This worked, but it was a
 lot of hassle that I would have preferred to avoid, and in a
 higher-volume application, simultaneously grabbing exclusive locks on
 a large number of critical views would have been a non-starter.  In
 the job before that, I did the same thing manually, which was no fun
 at all.  This was actually what posted me to write one of my first
 patches, committed by Bruce as
 ff1ea2173a92dea975d399a4ca25723f83762e55.

Would it be sufficient to automatically pass the type change through
only if nothing in the view actually references it in a function,
operator, group by, order by, etc?  That is, it only appears in the
SELECT list unadorned?  Or is that too limiting?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Re: [COMMITTERS] pgsql: Fix bogus %name-prefix option syntax in all our Bison files.

2014-05-29 Thread Martijn van Oosterhout
On Thu, May 29, 2014 at 08:33:05AM -0400, Peter Eisentraut wrote:
 On 5/28/14, 7:02 PM, Andres Freund wrote:
  That's a good idea. What i've been thinking about is to add
  -Wno-deprecated to the bison rule in the interim. Maybe after a
  configure test for the option. All deprecation warnings so far seem to
  be pretty unhelpful.
 
 Here is a patch.
 

Does this need a comment indicating why it's needed and when it can be
removed?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] UUIDs in core WAS: 9.4 Proposal: Initdb creates a single table

2014-04-26 Thread Martijn van Oosterhout
On Fri, Apr 25, 2014 at 04:18:18PM +0100, Greg Stark wrote:
 Which isn't to say they're a bad idea but like everything else in
 engineering there are tradeoffs and no such thing as a free lunch.
 You can avoid depleting the entropy pool by including data you expect
 to be unique as a kind of fake entropy -- which quickly gets you back
 to looking for things like MAC address to avoid duplicates across
 systems.

ISTM you could use the database identifier we already have to at least
produce UUIDs which are unique amongst PostgreSQL instances. That
might be something worth aiming for?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] PQputCopyData dont signal error

2014-04-07 Thread Martijn van Oosterhout
On Tue, Apr 01, 2014 at 01:53:13PM -0400, Robert Haas wrote:
  One of the things you mentioned is I often find it necessary to refer to
  existing examples of code when trying to figure out how to do things
  correctly.  I couldn't agree more.  Haven't seen one yet, but found plenty
  of discussion that tap danced around one or more of the components of the
  copy, put, end paradigm.  Maybe I should have just asked for a sample code
  snippet but didn't after a day or so of frustration and trying to piece
  together other people's incomplete samples.
 
 FWIW, I've generally found that the best examples are what's in the
 core distribution.  I'd go and look at a tool like psql or pg_restore
 and find the code that handles this, and then copy it and cut it down
 to what you need.

To move the conversation along:

https://github.com/postgres/postgres/blob/master/src/bin/psql/copy.c#L664

Seems possibly even more robust than most people will code, but it's
had a lot of real world testing.
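For the archives, here is a minimal sketch of the copy/put/end
sequence, using a made-up table and row and with error handling
trimmed to the bare minimum (the psql code above is the
battle-hardened version):

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

static int
copy_one_row(PGconn *conn)
{
    PGresult   *res = PQexec(conn, "COPY foo FROM STDIN");

    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        PQclear(res);
        return -1;
    }
    PQclear(res);

    const char *row = "1\thello\n";

    if (PQputCopyData(conn, row, (int) strlen(row)) != 1 ||
        PQputCopyEnd(conn, NULL) != 1)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return -1;
    }

    /* the server's verdict arrives as an ordinary result */
    res = PQgetResult(conn);

    int         ok = (PQresultStatus(res) == PGRES_COMMAND_OK) ? 0 : -1;

    PQclear(res);
    return ok;
}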

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] [RFC] What should we do for reliable WAL archiving?

2014-03-22 Thread Martijn van Oosterhout
On Sat, Mar 22, 2014 at 06:22:37AM +0900, MauMau wrote:
 From: Jeff Janes jeff.ja...@gmail.com
 Do people really just copy the files from one directory of local
 storage to
 another directory of local storage?  I don't see the point of that.
 
 It makes sense to archive WAL to a directory of local storage for
 media recovery.  Here, the local storage is a different disk drive
 which is directly attached to the database server or directly
 connected through SAN.

I'm one of those people. They are archived into a local directory in
preparation for an rsync over ssh.

 The recommendation is to refuse to overwrite an existing file of the same
 name, and exit with failure.  Which essentially brings archiving
 to a halt,
 because it keeps trying but it will keep failing.  If we make a custom
 version, one thing it should do is determine if the existing archived file
 is just a truncated version of the attempting-to-be archived file, and if
 so overwrite it.  Because if the first archival command fails with a
 network glitch, it can leave behind a partial file.
 
 What I'm trying to address is just an alternative to cp/copy which
 fsyncs a file.  It just overwrites an existing file.

I ran into a related problem with cp, where halfway through the copy
the disk was full and I was left with half a WAL file. This caused the
rsync to copy only half a file and the replication broke. This is
clearly a recoverable situation, but it didn't recover in this case.

 Yes, you're right, the failed archive attempt leaves behind a
 partial file which causes subsequent attempts to fail, if you follow
 the PG manual. That's another undesirable point in the current doc.
 To overcome this, someone on this ML recommended me to do cp %p
 /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f.
 Does this solve your problem?

This would probably have handled it, but I find it odd that there's a
program to handle restoring of archives properly, while on the
archiving side you have to cobble together your own shell scripts
which fail in various corner cases.

I'd love a program that just Did The Right Thing.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Planner hints in Postgresql

2014-03-17 Thread Martijn van Oosterhout
On Mon, Mar 17, 2014 at 01:20:47PM -0500, Merlin Moncure wrote:
 A query plan is a complicated thing that is the result of detail
 analysis of the data.  I bet there are less than 100 users on the
 planet with the architectural knowledge of the planner to submit a
 'plan'.  What users do have is knowledge of the data that the database
 can't effectively gather for some reason.  Looking at my query above,
 what it would need (assuming the planner could not be made to look
 through length()) would be something like:
 
  SELECT * FROM foo WHERE
    length(bar) <= 1000 WITH SELECTIVITY 0.999
    AND length(bar) >= 2 WITH SELECTIVITY 0.999;

A small issue with selectivity is that the selectivity is probably not
what the users are expecting anyway, since many will relate to
conditional selectivities.  PostgreSQL is pretty good at single column
statistics, it just sometimes screws up on cross-column correlations. 
This ties in with alerting about a bad plan: if the EXPLAIN output
could list for each condition what the actual selectivity was, it
might give users a way of understanding the problem.
   
So the example given might lead to output like:
   
clause                                selectivity  estimated
length(bar) > 2                       0.50         0.50
length(bar) > 1000 | length(bar) > 2  0.50         0.25
   
The execution engine can only output conditional selectivities because 
of the order of execution. But this would at least give users a handle 
on the problem.

Note that a first cut of the problem might simply be something like
likely()/unlikely() as in gcc.
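For reference, the conventional gcc-style definitions look like this
(just a sketch of the hint mechanism, not anything PostgreSQL ships):

#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

The planner analogue would simply pin a clause's estimated selectivity
near 0 or 1 instead of hinting a branch.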

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] HBA files w/include support?

2014-02-15 Thread Martijn van Oosterhout
On Fri, Feb 14, 2014 at 11:10:48AM -0500, Tom Lane wrote:
 The argument about wanting to assemble a pg_hba file from separately
 managed configuration pieces seems to have some merit, but the weak
 spot there is how do you define the search order?  Or are you planning
 to just cross your fingers and hope it doesn't matter too much?

Well, in my case since the only auth method used is md5 the order
really doesn't matter.  Besides the point that each combination of
dbname and username only appears once.

But for a general use feature I can imagine it would be a concern.

This is not an important feature for me though: the config file is
generated by puppet with a bunch of loops and an include directory
would not really reduce the amount of work.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Making strxfrm() blobs in indexes work

2014-02-12 Thread Martijn van Oosterhout
On Sun, Feb 02, 2014 at 05:09:06PM -0800, Peter Geoghegan wrote:
 However, it also occurs to me that strxfrm() blobs have another useful
 property: We (as, say, the author of an equality operator on text, an
 operator intended for a btree operator class) *can* trust a strcmp()'s
 result on blobs, provided the result isn't 0/equal, *even if the blobs
 are truncated*. So maybe a better scheme, and certainly a simpler one
 would be to have a pseudo-attribute in inner pages only with, say, the
 first 8 bytes of a strxfrm() blob formed from the logically-leading
 text attribute of the same indexTuple. Because we're only doing this
 on inner pages, there is a very good chance that that will be good
 enough most of the time. This also allows us to reduce bloat very
 effectively.

(A bit late to the party). This idea has come up before and the most
annoying thing is that braindead strxfrm API.  Namely, to strxfrm a
large string you need to transform it completely even if you only want
the first 8 bytes.
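To spell the wart out: you cannot portably ask for just a prefix,
because a truncating call leaves the output buffer indeterminate, so a
helper (a sketch, illustration only) has to transform the whole string
anyway:

#include <stdlib.h>
#include <string.h>

/* return a malloc'd sortable prefix of src, at most prefix_len bytes */
char *
strxfrm_prefix(const char *src, size_t prefix_len)
{
    size_t      full = strxfrm(NULL, src, 0) + 1;   /* full size */
    char       *buf = malloc(full);

    if (buf == NULL)
        return NULL;
    strxfrm(buf, src, full);
    if (full - 1 > prefix_len)
        buf[prefix_len] = '\0';     /* keep only the prefix */
    return buf;
}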

That said, since btree index tuples are limited to 3k anyway, the
overhead probably isn't that bad.

I think it would make a noticable difference if it can be made to work.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] RFC: Async query processing

2014-01-04 Thread Martijn van Oosterhout
On Fri, Jan 03, 2014 at 04:46:23PM +0100, Florian Weimer wrote:
 On 01/03/2014 04:20 PM, Tom Lane wrote:
 
 I think Florian has a good point there, and the reason is this: what
 you are talking about will be of exactly zero use to applications that
 want to see the results of one query before launching the next.  Which
 eliminates a whole lot of apps.  I suspect that almost the *only*
 common use case in which a stream of queries can be launched without
 feedback is going to be bulk data loading.  It's not clear at all
 that pipelining the PQexec code path is the way to better performance
 for that --- why not use COPY, instead?
 
 The data I encounter has to be distributed across multiple tables.
 Switching between the COPY TO commands would again need client-side
 buffering and heuristics for sizing these buffers.  Lengths of runs
 vary a lot in my case.

Why switch between COPY commands, why could you not do it in one? For
example:

COPY table1(col1, col2, ...),
 table2(col1, col2, ...)
FROM STDIN WITH (tableoids);
tableoid1<tab>col1<tab>col2...
tableoid2<tab>...
...
\.

There's no especially good reason why a COPY can only write to one
table. In this way you provide, per row, which table this row should go
to.

There's always the issue of generated primary keys. So you could, if
you wanted to, do:

COPY table1(col1, col2, ...) RETURNING pkey1,
 table2(col1, col2, ...)
FROM STDIN WITH (tableoids);
tableoid1<tab>col1<tab>col2...
tableoid2<tab>\K tableoid<tab>...
...
\.

So, like we have \N for NULL, a \K oid which would be the value of the
RETURNING column for the table with that oid.

This may be way outfield, but we have a streaming interface, there's no
reason why we can't extend it.

There's also the idea of WITH x AS (COPY ... FROM STDIN) which you
could do with a similar multiplexing of rows in one stream. This would
be extremely powerful together with MERGE.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] stuck spinlock

2013-12-26 Thread Martijn van Oosterhout
On Thu, Dec 26, 2013 at 03:18:23PM -0800, Robert Haas wrote:
 On Thu, Dec 26, 2013 at 11:54 AM, Peter Eisentraut pete...@gmx.net wrote:
  On 12/12/13, 8:45 PM, Tom Lane wrote:
  Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
  most systems dump core files with process IDs embedded in the names.
 
  Which systems are those?
 
 MacOS X dumps core files into /cores/core.$PID, and at least some
 Linux systems seem to dump them into ./core.$PID

On Linux it's configurable and at least on Ubuntu you get this:

$ cat /proc/sys/kernel/core_pattern 
|/usr/share/apport/apport %p %s %c

But yes, it can be configured to include the PID in the filename.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] trailing comment ghost-timing

2013-12-23 Thread Martijn van Oosterhout
On Tue, Dec 24, 2013 at 03:40:58AM +0100, Andreas Karlsson wrote:
 On 12/24/2013 03:17 AM, David Johnston wrote:
 It is not sent to the server as a trailing comment. The following
 file is sent to the server like this.
 
 File:
 /**/;
 /**/
 
 Commands:
 PQexec(..., /**/;);
 PQexec(..., /**/);
 
 If this has to be fixed it should be in the client. I think people
 would complain if we broke the API by starting to throw an exception
 on PQexec with a string containing no actual query.

(Untested). Isn't this just a case of psql not printing out a timing if
the server responds with PGRES_EMPTY_QUERY?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] PoC: Partial sort

2013-12-22 Thread Martijn van Oosterhout
On Sun, Dec 22, 2013 at 07:38:05PM +0400, Alexander Korotkov wrote:
 Hi!
 
 Next revision. It expected to do better work with optimizer. It introduces
 presorted_keys argument of cost_sort function which represent number of
 keys already sorted in Path. Then this function uses estimate_num_groups to
 estimate number of groups with different values of presorted keys and
 assumes that dataset is uniformly divided by
 groups. get_cheapest_fractional_path_for_pathkeys tries to select the path
 matching most part of path keys.
 You can see it's working pretty good on single table queries.

Nice work! The plans look good and the calculated costs seem sane also.

I suppose the problem with the joins is generating the pathkeys?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] PoC: Partial sort

2013-12-14 Thread Martijn van Oosterhout
On Sat, Dec 14, 2013 at 06:21:18PM +0400, Alexander Korotkov wrote:
  Is that actually all that beneficial when sorting with a bog standard
  qsort() since that doesn't generally benefit from data being pre-sorted?
  I think we might need to switch to a different algorithm to really
  benefit from mostly pre-sorted input.
 
 
 In this patch I don't do full sort of dataset. For instance, index returns
 data ordered by first column and we need to order them also by second
 column. Then this node sorts groups (assumed to be small) where values of
 the first column are same by value of second column. And with limit clause
 only required number of such groups will be processed. But, I don't think
 we should expect pre-sorted values of second column inside a group.

Nice. I imagine this would be mostly beneficial for fast-start plans,
since you no longer need to sort the whole table prior to returning the
first tuple.

Reduced memory usage might be a factor, especially for large sorts
where you otherwise might need to spool to disk.

You can now use an index on (a) to improve sorting for (a,b).

Cost of sorting n groups of size l goes from O(nl log nl) to just O(nl
log l), useful for large n.
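Spelling out the arithmetic:

    nl \log(nl) = nl \log n + nl \log l     (one big sort)
    n \cdot l \log l = nl \log l            (n per-group sorts)

so the saving is the nl \log n term, which dominates for large n.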

Minor comments:

I find cmpTuple a bad name. That's what it's doing but perhaps
cmpSkipColumns would be clearer.

I think it's worthwhile adding a separate path for the skipCols = 0
case, to avoid extra copies.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] patch: make_timestamp function

2013-12-13 Thread Martijn van Oosterhout
On Thu, Dec 12, 2013 at 08:50:26PM -0200, Fabrízio de Royes Mello wrote:
 On Thu, Dec 12, 2013 at 3:11 PM, Pavel Stehule pavel.steh...@gmail.comwrote:
 
  Hello
 
  this patch try to complete a set of functions make_date and make_timestamp.
 
 
 Could we have the 'make_timestamptz' function too?

Wouldn't this just be:

SELECT make_timestamp(...) at time zone 'foo';

(assuming make_timestamp actually returns a timestamp and not a
timestamptz).

or do you mean something else?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] ANALYZE sampling is too good

2013-12-11 Thread Martijn van Oosterhout
On Thu, Dec 12, 2013 at 07:22:59AM +1300, Gavin Flower wrote:
 Surely we want to sample a 'constant fraction' (obviously, in
 practice you have to sample an integral number of rows in a page!)
 of rows per page? The simplest way, as Tom suggests, is to use all
 the rows in a page.
 
 However, if you wanted the same number of rows from a greater number
 of pages, you could (for example) select a quarter of the rows from
 each page.  In which case, when this is a fractional number: take
 the integral number of rows, plus on extra row with a probability
 equal to the fraction (here 0.25).

In this discussion we've mostly used block = one PostgreSQL block of
8k.  But when reading from a disk, once you've read one block you can
basically read the following ones practically for free.

So I wonder if you could make your sampling always read 16 consecutive
blocks, but then use 25-50% of the tuples.  That way you get many more
tuples for the same number of disk seeks.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Platform-dependent(?) failure in timeout handling

2013-11-27 Thread Martijn van Oosterhout
On Tue, Nov 26, 2013 at 06:50:28PM -0500, Tom Lane wrote:
 I believe the reason for this is the mechanism that I speculated about in
 that previous thread.  The platform is blocking SIGALRM while it executes
 handle_sig_alarm(), and that calls LockTimeoutHandler() which does
 kill(MyProcPid, SIGINT), and that SIGINT is being delivered immediately
 (or at least before we can get out of handle_sig_alarm).  So now the
 platform blocks SIGINT, too, and calls StatementCancelHandler(), which
 proceeds to longjmp out of the whole signal-handling call stack.  So
 the signal unblocking that would have happened after the handlers
 returned doesn't happen.  In simpler cases we don't see an issue because
 the longjmp returns to the setsigjmp(foo,1) call in postgres.c, which
 will result in restoring the signal mask that was active at that stack
 level, so we're all good.  However, PG_TRY() uses setsigjmp(foo,0),
 which means that no signal mask restoration happens if we catch the
 longjmp and don't ever re-throw it.  Which is exactly what happens in
 plpgsql because of the EXCEPTION clause in the above example.
 
 I don't know how many platforms block signals during handlers in this way,
 but I'm seeing it on Linux (RHEL6.5 to be exact) and we know that at least
 OpenBSD acts likewise, so that's a pretty darn large chunk of the world.

Isn't this why sigsetjmp/siglongjmp were invented? Is there a
situation where you don't want the signal mask restored?

BTW, it seems that on BSD systems sigsetjmp == setjmp:

http://www.gnu.org/software/libc/manual/html_node/Non_002dLocal-Exits-and-Signals.html
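A standalone illustration of the difference (hypothetical toy code,
not from the backend):

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf env;

static void
handler(int sig)
{
    /* SIGINT is blocked by the kernel while we are in here */
    siglongjmp(env, 1);
}

int
main(void)
{
    signal(SIGINT, handler);
    if (sigsetjmp(env, 1) == 0)     /* savemask = 1 */
        raise(SIGINT);
    /* SIGINT is unblocked again here because the saved mask was
     * restored; with sigsetjmp(env, 0) it would stay blocked, which
     * is the PG_TRY() situation described above */
    printf("back in main, mask restored\n");
    return 0;
}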

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Wildcard usage enhancements in .pgpass

2013-11-17 Thread Martijn van Oosterhout
On Sat, Nov 16, 2013 at 09:26:33PM +0100, Alexey Klyukin wrote:
 Hi,
 
 Attached is the patch that improves usage of '*' wildcard in .pgpass,
 particularly in the host part. The use case is below.

Looks interesting, though I wonder if you could use fnmatch(3) here. Or
would that match more than you expect?  For example, it would allow
'foo*bar' to match 'foo.bar' which your code doesn't.
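A quick standalone check of that over-matching (made-up strings):

#include <fnmatch.h>
#include <stdio.h>

int
main(void)
{
    /* fnmatch() returns 0 on a match: '*' happily crosses '.' */
    printf("%d\n", fnmatch("foo*bar", "foo.bar", 0));   /* prints 0 */

    /* FNM_PATHNAME only stops '*' at '/', so it is no help for
     * hostname dots */
    printf("%d\n", fnmatch("foo*bar", "foo/bar", FNM_PATHNAME));
    return 0;
}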

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.

2013-11-13 Thread Martijn van Oosterhout
On Tue, Nov 12, 2013 at 03:57:52PM +0900, Tatsuo Ishii wrote:
 I have been thinking about this for years and I think the key idea for
 this is, implementing universal encoding. The universal encoding
 should have following characteristics to implement N2 encoding in a
 database.
 
 1) no loss of round trip encoding conversion
 
 2) no mapping table is necessary to convert from/to existing encodings
 
 Once we implement the universal encoding, other problem such as
 pg_database with multiple encoding problem can be solved easily.

Isn't this essentially what the MULE internal encoding is?

 Currently there's no such an universal encoding in the universe, I
 think the only way is, inventing it by ourselves.

This sounds like a terrible idea. In the future people are only going
to want more advanced text functions, regular expressions and
indexing, and inventing encodings that don't exist anywhere else seems
like a way to create a lot of work for little benefit.

A better idea, it seems to me, is to (if postgres is configured
properly) embed the non-round-trippable characters in the custom area
of the Unicode character set. In other words, adjust the mapping
tables on demand and voila.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] logical column order and physical column order

2013-11-03 Thread Martijn van Oosterhout
On Sun, Nov 03, 2013 at 09:40:18PM +1300, Gavin Flower wrote:
 I think the system should PHYSICALLY store the columns in the most
 space efficient order, and have a facility for mapping to  from the
 LOGICAL order - so that users  application developers only have
 worry about the logical order.  Even system programers would
 normally not have to be concerned with the physical order.  I am a
 little surprised that this is not already done, to be honest.

This has been discussed before, extensively. I believe there have even
been some patches. Apart from the space savings it would also allow
postgres to support column reordering of tables.

The main objection IIRC is that you now have a logical order and a
physical order and there would be an endless stream of bugs caused by
code confusing the two.

I don't really buy this: you can make the two identifiers
non-overlapping so you can always tell which kind you have, and some
properly placed checks will catch obvious problems.  Logical order is
only used in a handful of places anyway.  You could even make them two
separate datatypes so the compiler will complain if you screw up.

Dig through the archives for the full story.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2013-10-23 Thread Martijn van Oosterhout
On Tue, Oct 22, 2013 at 11:16:19AM -0700, Jeff Janes wrote:
 On Mon, Oct 21, 2013 at 4:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Hm.  It's been a long time since college statistics, but doesn't the
  entire concept of standard deviation depend on the assumption that the
  underlying distribution is more-or-less normal (Gaussian)?
 
 It is easy to misinterpret the standard deviation if the distribution is
 not gaussian, but that is also true of the average.  The standard deviation
 (or the variance) is commonly used with non-gaussian distributions, either
 because it is the most efficient estimator for those particular
 distributions, or just because it is so commonly available.

Well, the standard deviation is the square root of the variance, which
is the second central moment of the distribution, the first moment
being the mean. No matter what the distribution is, these are useful
numbers.

If I had to guess a distribution for query runtimes I'd go for Poisson,
which would mean you'd expect the mean to equal the variance. Don't
have enough experience with such measurements to say whether that is
reasonable.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.

2013-09-20 Thread Martijn van Oosterhout
On Fri, Sep 20, 2013 at 08:58:53AM +0900, Tatsuo Ishii wrote:
 For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is
 UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is
 SHIFT-JIS and database encoding is UTF-8 because there is a conversion
 between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is
 SHIFT-JIS and database encoding is ISO-8859-1 because there's no
 conversion between them.

As far as I can tell the whole reason for introducing NCHAR is to
support SHIFT-JIS; there hasn't been a call for any other encoding,
that I can remember anyway.

So rather than this whole NCHAR thing, why not just add a type
sjistext, and a few type casts and call it a day...

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] record identical operator - Review

2013-09-20 Thread Martijn van Oosterhout
On Thu, Sep 19, 2013 at 06:31:38PM -0400, Steve Singer wrote:
 I think there is agreement that better (as in more obscure)
 operators than === and !== need to be picked  and we need to find a
 place in the user-docs to warn users of the behaviour of this
 operators.   Hannu has proposed
 
 *==*   binary equal, surely very equal by any other definition as wall
 !==?  maybe not equal -- binary inequal, may still be equal by
 other comparison methods

It's a pity operators must be non-alpha and can't be named. Something
like:

SELECT foo OPERATOR(byte_equivalent) bar;

is simultaneously obscure, yet clear.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-08-24 Thread Martijn van Oosterhout
On Fri, Aug 23, 2013 at 06:41:04PM +0530, Amit Kapila wrote:
  Not with this proposal...  If it's fixed then it makes no sense to make it
  look like it can be modified.
 
This proposal is to have a special include which user can only use
 for enable/disable
which means he can remove symbol '#' or add '#'.
We cannot stop user from changing file name or add some different
 location, but that can lead to problems.
We can have a note on top of this include indicating it is only for
 enable/disable.

Note, my whole purpose for suggesting something like:

include_auto_conf_file postgresql.auto.conf

is because I want the file location to be configurable. If I put in my
configuration:

include_auto_conf_file /etc/postgresql/9.4/postgresql.auto.conf

it better put the options there. And if I comment the line out ALTER
SYSTEM should stop working.  If I put it at the beginning of the config
then any other option in the file will override it (which we can detect
and warn about).  If I put it at the end of the file, ALTER SYSTEM
overrides any statically configured options.

And I can imagine hosting providers putting the configuration for
memory usage, shared library directories and other such options after,
and options like cpu_cost, enable_merge_join, etc before.  That way
they have fine grained control over which options the user can set and
which not.
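Concretely, a provider's postgresql.conf might then look like this
(a sketch only; the include directive itself is still hypothetical):

# user-tunable knobs first: ALTER SYSTEM writes to the include below,
# which is read later and therefore overrides these defaults
cpu_tuple_cost = 0.01
enable_mergejoin = on

include_auto_conf_file /etc/postgresql/9.4/postgresql.auto.conf

# pinned settings last: these override anything ALTER SYSTEM wrote
shared_buffers = 4GB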

I think if we add more meaning to it, like allow user to change it,
 handling and defining of that will be bit complex.

Letting the user configure the location seems like common courtesy. Note
this line isn't in itself a GUC, so you can't configure it via ALTER
SYSTEM.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-08-20 Thread Martijn van Oosterhout
On Tue, Aug 20, 2013 at 11:23:06AM -0400, Stephen Frost wrote:
  What I was proposing upthread is that enable_alter_system=off/on would
  be present in postgresql.conf, and there is no include line for
  auto.conf.  
 
 I really think that's a terrible approach, to be honest.  I want to see
 an 'include' line in postgresql.conf for auto.conf, so the hapless
 sysadmin who is trying to figure out what the crazy DBA did has some
 clue what to look for.  enable_alter_system doesn't tell him diddly
 about an 'auto.conf' file which is included in the system config.

ISTM you want some kind of hybrid setting like:

#include_system auto.conf

which simultaneously does three things:

1. Sets the enable_alter_system flag
2. Indicates the file to use
3. Indicates the priority of the setting re other settings.

Comment it out, and ALTER SYSTEM stops working. Put it back and it's
immediately clear what it means. And the user can control where the
settings go.

Syntax is a bit fugly though.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] improve Chinese locale performance

2013-07-28 Thread Martijn van Oosterhout
On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:
 I pretty much lost interest in ICU upon reading that they use UTF-16
 as their internal format.
 
 http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

The UTF-8 support has been steadily improving:

  For example, icu::Collator::compareUTF8() compares two UTF-8 strings
  incrementally, without converting all of the two strings to UTF-16 if
  there is an early base letter difference.

http://userguide.icu-project.org/strings/utf-8

For all other encodings you should be able to use an iterator. As to
performance I have no idea.

The main issue with strxfrm() is its lame API. If it supported
returning prefixes you'd be set, but as it is you need 10MB of memory
just to transform a 10MB string, even if only the first few characters
would be enough to sort...

Mvg,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] pg_memory_barrier() doesn't compile, let alone work, for me

2013-07-16 Thread Martijn van Oosterhout
On Sun, Jul 14, 2013 at 09:26:38PM -0400, Robert Haas wrote:
 I'm pretty sure we've got latent memory-ordering risks in our existing
 code which we just haven't detected and fixed yet.  Consider, for
 example, this exciting code from GetNewTransactionId:
 
 myproc->subxids.xids[nxids] = xid;
 mypgxact->nxids = nxids + 1;
 
 I don't believe that's technically safe even on an architecture like
 x86, because the compiler could decide to reorder those assignments.
 Of course there is probably no reason to do so, and even if it does
 you'd have to get really unlucky to see a user-visible failure, and if
 you did you'd probably misguess the cause.

You're probably right. Note that it's not even just the compiler that
might reorder them, the CPU/cache subsystem/memory bus all play their
part in memory reordering.  x86 is pretty forgiving, which is why it
works.

I found this to be a really good explanation of all the things that can
go wrong with memory ordering.  It also explains why, in the long run,
memory barriers are not optimal.

http://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/

That talk discusses how the hardware world is converging on SC [1] as
the memory model to use.  And C11/C++11 atomics will implement this for
the programmer.  With these you can actually make guarantees.  For
example, by marking mypgxact->nxids as an atomic type the compiler will
emit all the necessary markings to let the CPU know what you want, so
everything works the way you expect it to.  Even on arcane
architectures.  No explicit barriers needed.
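In C11 terms the quoted snippet needs exactly one release store (a
sketch; the struct merely mirrors the snippet above, not the real
code):

#include <stdatomic.h>

typedef struct
{
    int         xids[64];
    _Atomic int nxids;
} FakePGXact;

void
publish_xid(FakePGXact *pgxact, int nxids, int xid)
{
    pgxact->xids[nxids] = xid;
    /* release ordering: a reader that loads nxids with acquire
     * semantics is guaranteed to also see the xid stored above */
    atomic_store_explicit(&pgxact->nxids, nxids + 1,
                          memory_order_release);
}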

Unfortunately, it won't help on compilers that don't support it.

[1] http://en.wikipedia.org/wiki/Sequential_consistency

There are places where you put code in and verify it does what you
want.  With this one you can put test programs in and it can tell you
all possible results due to memory reordering.

http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/help.html

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Proposal - Support for National Characters functionality

2013-07-16 Thread Martijn van Oosterhout
On Mon, Jul 15, 2013 at 05:11:40PM +0900, Tatsuo Ishii wrote:
  Does support for alternative multi-byte encodings have something to do
  with the Han unification controversy? I don't know terribly much about
  this, so apologies if that's just wrong.
 
 There's a famous problem regarding conversion between Unicode and other
 encodings, such as Shift Jis.
 
 There are lots of discussion on this. Here is the one from Microsoft:
 
 http://support.microsoft.com/kb/170559/EN-US

Apart from Shift-JIS not being well defined (it's more a family of
encodings), it has the unusual feature of providing multiple ways to
encode the same character.  This is not even a Han unification issue;
those have largely been addressed.  For example, the square-root symbol
exists twice (0x8795 and 0x81E3), and so do many other mathematical
symbols.

Here's the code page which you can browse online:

http://msdn.microsoft.com/en-us/goglobal/cc305152

Which means that to be round-trippable, Unicode would have to double
those characters, but this would make it hard/impossible to round-trip
with any other character set that had those characters.  No easy
solution here.

Something that has been done before [1] is to map the doubles to the
custom area of the unicode space (0xe000-0xf8ff).  It gives you
round-trip support at the expense of having to handle those characters
yourself.  But since postgres doesn't do anything meaningful with
unicode characters this might be acceptable.

[1] Python does a similar trick to handle filenames coming from disk in
an unknown encoding:
http://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements

2013-07-14 Thread Martijn van Oosterhout
On Sun, Jul 14, 2013 at 03:36:09AM -0300, Fabrízio de Royes Mello wrote:
  Next, changes in src/backend, starting with parser changes: the patch
  adds IF_P NOT EXISTS variants for various productions. For example:

snip

  I think opt_if_not_exists should be used for the others as well.
 
 
 I could not use the opt_if_not_exists because bison emits an error:
 
 /usr/bin/bison -d -o gram.c gram.y
 gram.y: conflicts: 10 shift/reduce
 gram.y: expected 0 shift/reduce conflicts
 make[3]: *** [gram.c] Error 1
 
 I really don't know how to solve this problem. I'm just do ajustments like
 that:

This probably isn't solvable, which is why the coding is duplicated in
many existing places. The issue is that using opt_if_not_exists forces
bison to decide much earlier which rule it is parsing. Bison only has
one token of lookahead, and if that's not enough you get errors.

BTW, bison can dump a large file describing all its states; from that
you should be able to work out where the exact problem lies.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] How to implement Gin method?

2013-07-08 Thread Martijn van Oosterhout
On Mon, Jul 08, 2013 at 03:21:09PM +0900, kenji uno wrote:
 Hi.
 
 Ok, ok thanks.
 
 My problem is to shorten time of searching full text stored in text field.

Ok, your explanation of your problem really helps, thanks.

 However the following query is very slow! 9,400ms. It uses Seq Scan lol.
 
  SELECT * FROM xxx 
  WHERE TRUE
  AND (ddstrike(title,'ABC') OR (filter(fts1body) @ filter('AAA') AND 
 ddstrike(fts1body,'AAA')))

Well, in this case it still needs to scan the whole table to search the
title obviously.

 Apply filter to title column too.
 
 The best query result costs 3,700ms. 18 hits. It surely uses expected query 
 plan: two Bitmap index scan - Bitmap Or - Bitmap Heap Scan.
 
  SELECT * FROM xxx 
  WHERE TRUE
  AND (filter(title) @ filter('ABC') OR filter(fts1body) @ filter('ABC')) 
  AND (ddstrike(title,'ABC') OR ddstrike(fts1body,'ABC'))

It would be useful to see the explain analyze of this query. Note that
looking up 11,000 entries in an index could very well take as long as
sequentially scanning the whole table.

 However, I may think good idea which uses inverted index.

I think your idea above is a good one, but you need to work out why
the implementation above didn't work out and why you think
implementing it directly will be better.
 
 So I want to know...
 - the actual work of extractQuery and consistant.
 - the detail interface of extractValue/extractQuery/consistant. It may help 
 understanding.
 
 I looked at contrib/_int.sql of PG8.2.22

Whoa, very old version, please look at something newer. For example the
RECHECK flag below is no longer used.

 There are definitions of int[] GIN support.
 ---
 CREATE OPERATOR CLASS gin__int_ops
 FOR TYPE _int4 USING gin
 AS
  OPERATOR  3   &&,
  OPERATOR  6   = (anyarray, anyarray)  RECHECK,
  OPERATOR  7   @>,
  OPERATOR  8   <@  RECHECK,
  OPERATOR  13  @,
  OPERATOR  14  ~   RECHECK,
  OPERATOR  20  @@ (_int4, query_int),
  FUNCTION  1   btint4cmp (int4, int4),
  FUNCTION  2   ginarrayextract (anyarray, internal),
  FUNCTION  3   ginint4_queryextract (internal, internal, int2),
  FUNCTION  4   ginint4_consistent (internal, int2, internal),
  STORAGE int4;
 ---
 Both ginint4_queryextract and ginint4_consistent assume that query argument 
 is a PGARRAY (ArrayType *). Where is it decided? Is it array of STORAGE type?

Remember that the above defines the operators which are what is
indexed. The left hand side is the array.  The right hand side is
whatever is defined.  intarray defines the operator && (int[], int[]),
hence the query argument is int[] in that case. Apparently intarray
accepts many kinds of queries; it is the operators that define what
actually happens.

 Both extractQuery(ginint4_queryextract) and extractValue(ginarrayextract) 
 seem to return similar value type. They return Datum array of int4. Is it 
 array of STORAGE type?

From my reading of
http://www.postgresql.org/docs/9.2/static/gin-extensibility.html, yes
they must return an array of the STORAGE type. The last paragraph on
that page says:

The actual data types of the various Datum values mentioned above
vary depending on the operator class.  The item values passed to
extractValue are always of the operator class's input type, and all
key values must be of the class's STORAGE type.  The type of the
query argument passed to extractQuery and consistent is whatever is
specified as the right-hand input type of the class member operator
identified by the strategy number.  This need not be the same as
the item type, so long as key values of the correct type can be
extracted from it.

 I want to understand the overview of GIN extension.

Please let us know what the documentation is missing so it can be
improved.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] How to implement Gin method?

2013-07-07 Thread Martijn van Oosterhout
On Sun, Jul 07, 2013 at 10:00:16AM +0900, Kenji uno wrote:
 Hi.
 
 I want to try GIN and know programming information of GIN technology.
 
 Please teach me about these functions extractValue, extractQuery and 
 consistent.
 
 I have posted question at stack overflow.
 
 http://stackoverflow.com/questions/17489703/postgresql-how-to-implement-gin

The documentation refers to the author's pages:
http://www.sai.msu.su/~megera/wiki/Gin

Did they help at all?

Also, GIN cannot just be applied to anything. Its strength is indexing
certain types which are difficult to index any other way, like
full-text search.  If you give some idea of what you'd like to index
then we can give an idea of what the functions should do.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] plpython implementation

2013-06-30 Thread Martijn van Oosterhout
On Sun, Jun 30, 2013 at 01:49:53PM +0200, Szymon Guz wrote:
 I'm reading through plperl and plpython implementations and I don't
 understand the way they work.
 
 Comments for plperl say that there are two interpreters (trusted and
 untrusted) for each user session, and they are stored in a hash.

The point is that python has no version for untrusted users, since
it's been accepted that there's no way to build a python sandbox for
untrusted code. There was actually a small competition to make one,
but it failed; since then nobody has bothered.

Perl does provide a sandbox, hence you can have two interpreters in a
single backend.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] plpython implementation

2013-06-30 Thread Martijn van Oosterhout
On Sun, Jun 30, 2013 at 02:18:07PM +0200, Szymon Guz wrote:
  python does not any any sort of reliable sandbox, so there is no plpython,
  only plpythonu - hence only one interpreter per backend is needed.
 
 Is there any track of the discussion that there is no way to make the
 sandbox? I managed to create some kind of sandbox, a simple modification
 which totally disables importing modules, so I'm just wondering why it
 cannot be done.

http://wiki.python.org/moin/SandboxedPython

This is the thread I was thinking of:
http://mail.python.org/pipermail/python-dev/2009-February/086401.html

If you read through it I think you will understand the difficulties.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Hard to Use WAS: Hard limit on WAL space

2013-06-15 Thread Martijn van Oosterhout
On Sat, Jun 15, 2013 at 12:43:10PM +0800, Craig Ringer wrote:
 Bloat
 --
 
 Table bloat. Table bloat has been a major issue with PostgreSQL
 users/admins for years. Anyone care to explain to me in a simple
 paragraph how to find out if you have table or index bloat issues in
 your database and what to do about it? (Maybe we need
 pg_catalog.pg_index_bloat and pg_catalog.pg_table_bloat views
 including FILLFACTOR correction?)
 
 I think I'll draft up a patch to add exactly that.

Nice list btw. I monitor this by using the excellent check_postgres
nagios plugin, which has stuff to check for things like this.

Which makes me think that it might be possible to add some other checks
like this, in for example pg_ctl.  A big fat warning 'your data may be
eaten' might get noticed at startup.

(A minor annoyance is that in recent versions of PostgreSQL you have to
give check_postgres admin rights, otherwise it can't warn you about
idle-in-transaction problems.)

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Batch API for After Triggers

2013-06-09 Thread Martijn van Oosterhout
On Sun, Jun 09, 2013 at 10:15:09AM +0100, Simon Riggs wrote:
 As I mentioned in my post, I did consider that and then chose not to
 do that. However, having a final func is a major modification in the
 way that we specify trigger functions. We'd also need to cope with
 recursive trigger execution, which would mean the final func would get
 called potentially many times, so there's no way of knowing if the
 final func is actually the last call needed. That sounded complex and
 confusing to me.
 
 The proposed API allows you to do exactly that anyway, more easily, by
 just waiting until tg_event_num == tg_tot_num_events.

Can you signal partial completion? For example, if a trigger knows that
blocks of 10,000 are optimal and it sees tg_tot_num_events == 1,000,000,
it could do work every 10,000 entries, as in when:

(tg_event_num % 10000) == 0 || tg_event_num == tg_tot_num_events

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-06-03 Thread Martijn van Oosterhout
On Mon, Jun 03, 2013 at 11:27:57AM +0300, Ants Aasma wrote:
  I can't rule that out.  Personally, I've always attributed it to the
  fact that it's (a) long and (b) I/O-intensive.  But it's not
  impossible there could also be bugs lurking.
 
 It could be related to the OS. I have no evidence for or against, but
 it's possible that OS write-out routines defeat the careful cost based
 throttling that PostgreSQL does by periodically dumping a large
 portion of dirty pages into the write queue at once. That does nasty
 things to query latencies as evidenced by the work on checkpoint
 spreading.

In other contexts I've run into issues relating to large continuous
writes stalling.  The issue is basically that the Linux kernel allows
(by default) writes to pile up until they reach 5% of physical memory
before deciding that the sucker who wrote the last block becomes
responsible for writing the whole lot out.  At full speed of course. 
Depending on the amount of memory and the I/O speed of your disks this
could take a while, and interfere with other processes.

This leads to extremely bursty I/O behaviour.

The solution, as usual, is to make it more aggressive, so that the
kernel background writer triggers at 1% of memory.
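On Linux that tuning is exposed as sysctls; illustrative values only,
not a recommendation:

# /etc/sysctl.conf
vm.dirty_background_ratio = 1   # background writeback starts at 1% of RAM
vm.dirty_ratio = 5              # writers start blocking at 5% of RAM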

I'm not saying that's the problem here, but it is an example of a
situation where the write queue can become very large very quickly.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Combo xids

2013-06-01 Thread Martijn van Oosterhout
On Sat, Jun 01, 2013 at 09:22:05AM +0100, Simon Riggs wrote:
 When would this make sense?
 Frequently. Most of the time a tuple needs only one xid set. In most
 cases, we set xmin and xmax a long time apart. Very few cases end with
 both of them set inside the *same* xmin horizon. In a heavy
 transactional environment, the horizon moves forwards quickly, on the
 order of a few seconds. Very few rows get inserted and then
 updated/deleted that quickly. With long reporting queries, data tends
 to be updated less, so again the rows aren't touched within the same
 horizon. As a result, we hardly ever need both xmin and xmax at the
 same time - when we need to set xmax, xmin is already
 committed/cleaned.

Is this really true? Consider a long-running query A and a tuple
created by transaction B after A started. If another transaction comes
along to update that tuple, you can't throw away the xmin because you
need it to prove that A can't see the tuple.

Or is the idea to create multixacts for each combination of xmin/xmax
encountered? And the assumption is that there aren't that many? That
could be measured.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Which table stored in which file in PGDATA/base/[db-oid]

2013-06-01 Thread Martijn van Oosterhout
On Sat, Jun 01, 2013 at 03:11:50PM +0430, Soroosh Sardari wrote:
 Dear Hackers
 
 I've created a new DB, and  a bunch of files created in base/12054, 12054
 is oid of the new DB.
 I want to find what table stored in each file.
 BTW, I read this
 http://www.postgresql.org/docs/9.2/interactive/storage-file-layout.html
 I have 156 files with numerical names, vm and fsm file are ignored.
 107 files are same as pg_class.reltoastrelid,so I have 49 files that I do
 not know what tables stored in them.
 Any idea to find ?

From that page:

Each table and index is stored in a separate file. For ordinary
relations, these files are named after the table or index's filenode
number, which can be found in pg_class.relfilenode. 

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Which table stored in which file in PGDATA/base/[db-oid]

2013-06-01 Thread Martijn van Oosterhout
On Sat, Jun 01, 2013 at 03:27:40PM +0430, Soroosh Sardari wrote:
 Yes, I have some files which is not in pg_class.relfilenode of any table or
 index.
 I want to know which table or index stored in such files.

That shouldn't happen. Are you sure you're looking in the right
database? Can you list the filenames?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] about index inheritance

2013-05-08 Thread Martijn van Oosterhout
On Wed, May 08, 2013 at 10:19:08AM +0200, Vincenzo Melandri wrote:
 On Tue, May 7, 2013 at 11:55 PM, Robert Haas robertmh...@gmail.com wrote:
  This is a really hard problem.  If you pick this as your first project
  hacking on PostgreSQL, you will almost certainly fail.
 
 Thank you very much, i guessed that already -.-
 Still, I needed that at my office for a long time, struggled with it many
 times and had to come out with some exotic solutions...
 Now I have spare time between projects, so I can work on it full-time. At
 least it's worth a try, isn't it?

Well, you can work on it, but I think it will be less programming and
more coming up with a feasible solution.

 Anyway, I'm working to better understand the problem, trying to identify at
 least the main involved points.
 At the moment I'm figuring out how the inherit mechanism works for
 relations (in tablecmds.c).. Then I'll figure out about how indexes work..

While there are probably old threads in the archives, I find the
easiest way to look at the problem is in the locking.  In particular, I
think if you can get unique indexes to work then the rest will follow.

Consider the case of an inheritance hierarchy where you want a unique
index on a column.  Since you want to be able to create and drop
children easily, each child needs to have an index just for itself.  But
if you insert a row into one child you need to, somehow, prevent other
people also inserting the same value into a different child.  Efficiently
and deadlock-free.  This is hard, though we're up for crazy,
out-of-the-box ideas.

Note, there is one very special case, namely:

- The children are used for partitioning.

- The unique index you want is on the partition key.

Since each value can only possibly appear in one table your locking
problems vanish. The question is: how often does this happen?

Hope this helps,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Substituting Checksum Algorithm (was: Enabling Checksums)

2013-04-30 Thread Martijn van Oosterhout
On Tue, Apr 30, 2013 at 01:05:30PM -0400, Greg Smith wrote:
 I re-ran the benchmark that's had me most worried against the
 committed code and things look good so far.  I've been keeping quiet
 because my tests recently have all agreed with what Ants already
 described.  This is more a confirmation summary than new data.

I came across this today: Data Integrity Extensions, basically a
standard for having an application calculate a checksum of a block and
submit it together with the block, so that the disk can verify that
the block it is writing matches what the application sent.

It appears SCSI has standardised on a CRC-16 checksum with polynomial
0x18bb7.

http://www.t10.org/ftp/t10/document.03/03-290r0.pdf
https://oss.oracle.com/~mkp/docs/dix-draft.pdf
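
For the curious, the bitwise form of that CRC is tiny. A sketch in C
(assuming the usual T10 convention: polynomial 0x8BB7 once you drop
the implicit leading bit, initial value 0, no bit reflection):

#include <stddef.h>
#include <stdint.h>

static uint16_t
crc16_t10dif(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= (uint16_t) data[i] << 8;     /* feed in the next byte */
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (crc << 1) ^ 0x8BB7 : crc << 1;
    }
    return crc;
}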

Not directly relevant to PostgreSQL now, but possibly in the future.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-19 Thread Martijn van Oosterhout
On Wed, Apr 17, 2013 at 12:49:10PM +0200, Florian Pflug wrote:
 Fixing this on the receive side alone seems quite messy and fragile.
 So instead, I think we should let the master send a shutdown message
 after it has sent everything it wants to send, and wait for the client
 to acknowledge it before shutting down the socket.
 
 If the client fails to respond, we could log a fat WARNING.

ISTM the master should half-close the socket, using shutdown(). That
way the client receives an EOF and can still send its reply to the
master.  Then, when the master receives that, it can close() completely.
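
A sketch of the handshake I have in mind (error handling omitted;
read_client_ack() is a stand-in for however the master already reads
replies from the standby, not a real function):

#include <sys/socket.h>
#include <unistd.h>

static void
close_with_handshake(int sock)
{
    shutdown(sock, SHUT_WR);    /* standby sees EOF, can still write */
    read_client_ack(sock);      /* stand-in: wait for the final reply */
    close(sock);                /* now tear the socket down fully */
}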

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Interesting post-mortem on a near disaster with git

2013-03-24 Thread Martijn van Oosterhout
On Sun, Mar 24, 2013 at 11:52:17AM -0400, Tom Lane wrote:
 Over the weekend, KDE came within a gnat's eyelash of losing *all*
 their authoritative git repos, despite having seemingly-extensive
 redundancy.  Read about it here:
 http://jefferai.org/2013/03/24/too-perfect-a-mirror/
 
 We should think about protecting our own repo a bit better, especially
 after the recent unpleasantness with a bogus forced update.  The idea
 of having clones that are deliberately a day or two behind seems
 attractive ...

I think the lesson here is that a mirror is not a backup. RAID, ZFS,
and version control are all not backups.

Taking a tarball of the entire repository and storing it on a different
machine would solve just about any problem you can think of in this
area.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Considering Gerrit for CFs

2013-02-06 Thread Martijn van Oosterhout
On Wed, Feb 06, 2013 at 10:17:09PM +0100, Magnus Hagander wrote:
 I just took a quick look at their system, and when they start talking
 about requirements in the 100's of Gb of RAM, 24 core machines and
 SSD, I get scared :) But that's to scale it - doesn't mention when
 you need to do anything like that. I'm assuming we'd be tiny.
 
 FWIW, what we have now could easily run on a box with 128Mb RAM...

I'm right now setting up a gerrit instance for another project (not as
large as Postgres) on a VM with 1GB of RAM and it works alright. You'll
want to run a Postgres database next to it for storing the actual
reviews and such.

It's a nice tool and integrates reasonably well with other things. We
have a bot connected to it which sends messages on IRC in response to
various state changes. I've never seen an instance respond to email
though.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Enabling Checksums

2012-12-20 Thread Martijn van Oosterhout
On Tue, Dec 18, 2012 at 04:06:02AM -0500, Greg Smith wrote:
 On 12/18/12 3:17 AM, Simon Riggs wrote:
 Clearly part of the response could involve pg_dump on the damaged
 structure, at some point.
 
 This is the main thing I wanted to try out more, once I have a
 decent corruption generation tool.  If you've corrupted a single
 record but can still pg_dump the remainder, that seems the best we
 can do to help people recover from that.  Providing some
 documentation on how to figure out what rows are in that block,
 presumably by using the contrib inspection tools, would be helpful
 too.

FWIW, Postgres is pretty resilient against corruption. I've maintained
a postgres db on a server with bad memory (don't ask) and since most
scrambling was in text strings you just got funny output sometimes. The
most common failure was a memory allocation failure as postgres tried
to copy a datum whose length field was corrupted.

If things went really wonky you could identify the bad tuples by hand
and then delete them by ctid. Regular reindexing helped too.

All I'm saying is that a mode where you log a warning but proceed
anyway is useful.  It won't pin down the exact error, but it will tell
you where to look and help find the non-obvious corruption (so you can
possibly fix it by hand).

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Memory leaks in record_out and record_send

2012-11-13 Thread Martijn van Oosterhout
On Tue, Nov 13, 2012 at 05:50:08PM -0500, Stephen Frost wrote:
 * Robert Haas (robertmh...@gmail.com) wrote:
  Yeah.  The thing that concerns me is that I think we have a pretty
  decent number of memory contexts where the expected number of
  allocations is very small ... and we have the context *just in case*
  we do more than that in certain instances.  I've seen profiles where
  the setup/teardown costs of memory contexts are significant ... which
  doesn't mean that those examples would perform better with fewer
  memory contexts, but it's enough to make me pause for thought.
 
 So, for my 2c, I'm on the other side of this, personally.  We have
 memory contexts for more-or-less exactly this issue.  It's one of the
 great things about PG- it's resiliant and very unlikely to have large or
 bad memory leaks in general, much of which can, imv, be attributed to
 our use of memory contexts.

If the problem is that we create memory context overhead which is not
necessary in many cases, perhaps we can reduce the overhead somehow.
IIRC we have separate functions for resetting a context and for
freeing it entirely.  If there were a quick test we could do such that
resetting a context did nothing unless at least (say) 16k had been
allocated, that might reduce the cost for many very small allocations.
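
As a handwaving sketch of that idea (mem_allocated is a made-up
per-context byte counter here, not an existing field):

#include "postgres.h"
#include "utils/memutils.h"

static void
MemoryContextResetIfWorthIt(MemoryContext cxt)
{
    /* only walk the block lists when enough was allocated to matter;
     * tiny contexts keep their few bytes until a later full reset */
    if (cxt->mem_allocated >= 16 * 1024)
        MemoryContextReset(cxt);
}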

Of course, unless someone comes up with a way to measure the cost this
is all handwaving, but it might be a nice project for someone
interested in learning to hack postgres.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Successor of MD5 authentication, let's use SCRAM

2012-10-21 Thread Martijn van Oosterhout
On Sun, Oct 21, 2012 at 09:55:50AM +0200, Magnus Hagander wrote:
 I don't see a problem at all with providing the snakeoil cert. In
 fact, it's quite useful.
 
 I see a problem with enabling it by default. Because it makes people
 think they are more secure than they are.

So, what you're suggesting is that any use of SSL to a remote machine
without the sslrootcert option should generate a warning.  Something
along the lines of "remote server not verified"?  For completeness it
should also show this for any non-SSL connection.

libpq should export a serververified flag which would always be false
unless the connection is SSL and the CA is verified.
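
Something like this, purely hypothetical (neither the function nor the
struct fields exist in libpq today):

int
PQserverVerified(const PGconn *conn)
{
    /* true only for an SSL connection whose peer certificate chained
     * up to a CA supplied via sslrootcert (hypothetical fields) */
    return conn->ssl_in_use && conn->peer_cert_verified;
}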

 In a browser, they will get a big fat warning every time, so they will
 know it. There is no such warning in psql. Actually, maybe we should
 *add* such a warning. We could do it in psql. We can't do it in libpq
 for everyone, but we can do it in our own tools... Particularly since
 we do print the SSL information already - we could just add a
 warning: cert not verified or something like that to the same piece
 of information.

It bugs me that you have to jump through hoops and get red warnings
for an unknown CA, whereas no encryption whatsoever is treated as fine
while actually being even worse.

Transport encryption is a *good thing*; we should be encouraging it
wherever possible. If it weren't for the performance issues I'd suggest
defaulting to SSL everywhere transparently with ephemeral certs. It
would protect against any number of passive attacks.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Bugs in planner's equivalence-class processing

2012-10-17 Thread Martijn van Oosterhout
On Tue, Oct 16, 2012 at 11:56:52AM -0400, Tom Lane wrote:
 Is anybody concerned about the compatibility implications of fixing this
 bug in the back branches?  I'm worried about people complaining that we
 broke their application in a minor release.  Maybe they were depending
 on incorrect behavior, but they might complain anyway.  On the other
 hand, the fact that this hasn't been reported from the field in nine
 years suggests that not many people write queries like this.

Nice detective work. I'd say it should be fixed. I personally haven't
written these kinds of queries so I'm not affected, but I don't like
the idea of known bugs going unfixed.

It's a pity we can't have a system that somehow independently checks
the results of the planner.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Detecting libpq connections improperly shared via fork()

2012-10-09 Thread Martijn van Oosterhout
On Thu, Oct 04, 2012 at 12:14:02AM +0200, Andres Freund wrote:
 On Thursday, October 04, 2012 12:08:18 AM Daniel Farina wrote:
  It would be fantastic for libpq to somehow monitor use of a connection
  from multiple PIDs that share a parent and deliver an error indicating
  what is wrong.  Unfortunately detecting that would require either a
  file or some kind of shared memory map, AFAIK, and I don't know how
  keen anyone is on accepting that patch.  So, may I ask: how keen is
  anyone on accepting such a patch, and under what conditions of
  mechanism?
 Hm. An easier version of this could just be storing the pid of the process 
 that did the PQconnectdb* in the PGconn struct. You can then check that 
PGconn->pid == getpid() at relatively few places and error out on a mismatch. 
 That should be doable with only minor overhead.

On systems that support it, pthread_atfork() might help. Though, being
like a signal handler, you don't have access to the PGconn structure,
so you can't flag anything easily. Maybe just use it to cache getpid()?

In any case, adding a check to PQexec and friends would probably be
sufficient, since that's what every program is going to start with.
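
Something like this, say (creator_pid being a hypothetical field that
PQconnectdb would fill in; it is not part of the real PGconn struct):

#include <unistd.h>

static int
connection_usable_here(const PGconn *conn)
{
    /* false in any child that inherited the connection via fork() */
    return conn->creator_pid == getpid();
}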

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] do we EXEC_BACKEND on Mac OS X?

2012-10-03 Thread Martijn van Oosterhout
On Wed, Oct 03, 2012 at 01:57:47PM -0400, Bruce Momjian wrote:
 On Wed, Oct  3, 2012 at 01:53:28PM -0400, Tom Lane wrote:
  Bruce Momjian br...@momjian.us writes:
   Yes, but those framework libraries are typically supposed to prevent
   such problems from being seen by applications calling them.
  
  How exactly would a library prevent such problems?  In particular,
  let's see a proposal for how libpq might make it look like a fork
  was transparent for an open connection.
 
 I guess that is impossible.

Well, not quite impossible. A simple solution would be to use
pthread_atfork() to register a handler that puts the socket into an
invalid state in either the parent or the child.

http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html

Of course, the backward compatibility issues prevent us from doing
that, but it's *possible*.
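
Just to sketch it (mark_connections_invalid() is a made-up helper;
libpq keeps no such registry of open connections today):

#include <pthread.h>

static void
atfork_child(void)
{
    /* poison inherited sockets so the child can't corrupt the stream */
    mark_connections_invalid();
}

static void
install_fork_guard(void)
{
    /* run atfork_child() in the child after every fork() */
    pthread_atfork(NULL, NULL, atfork_child);
}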

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-02 Thread Martijn van Oosterhout
On Tue, Oct 02, 2012 at 12:58:15PM -0400, Stephen Frost wrote:
  I simply do not understand objections to the proposal. Have I missed 
  something?
 
 It was my impression that the concern is the stability of the hash value
 and ensuring that tools which operate on it don't mistakenly lump two
 different queries into one because they had the same hash value (caused
 by a change in our hashing algorithm or input into it over time, eg a
 point release).  I was hoping to address that to allow this proposal to
 move forward..

That makes no sense though. The moment you talk about a hash you
accept the possibility of lumping together things that aren't the
same.  Any tools using these hashes must have realised this.

Fortunately, the statistics are better than the birthday paradox
suggests. The chance that the two most important queries in your
system end up with the same hash is minuscule.

As mentioned elsewhere, a system with more than 10,000 different
queries sounds rare to me (once you exclude query parameters, of
course).
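
To put rough numbers on it (assuming a uniform 32-bit hash): the chance
that two particular queries collide is 2^-32, about 2.3e-10, and even
the chance of *any* collision among n = 10,000 distinct queries is only
about 1 - exp(-n*(n-1)/2^33), which works out to a bit over 1%.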

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Oid registry

2012-09-25 Thread Martijn van Oosterhout
On Tue, Sep 25, 2012 at 09:18:30AM -0400, Andrew Dunstan wrote:
 I'm not at all familiar with record_to_json or the json datatype,
 but wouldn't it be appropriate to create a cast from hstore to
 json to handle that case?
 
 No, the difficulty (or at least the first difficulty) is in having
 the code recognize that it has an hstore at all. The code picks
 apart the record field by field at run time and takes various
 actions depending on the field's type. For any type it doesn't
 recognize it just outputs the value as a string, and so that's what
 it does with hstore. Mostly this is the right thing but in the
 hstore case it's rather sad.

Is there a particular reason to special-case hstore though? Wouldn't it
be sufficient to simply look for a cast from the given type oid to json
and use that if present?

If you don't like casts, allow other extensions to define a function
to_json_hook(yourtype) -> json which is used instead.

Hardcoding OIDs in extensions just doesn't seem right somehow...

(Hmm, I see someone else in the thread pointed this out too).

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Rules and WITH and LATERAL

2012-08-20 Thread Martijn van Oosterhout
On Sun, Aug 19, 2012 at 12:06:30PM -0400, Tom Lane wrote:
 While thinking about this I wondered whether it might be possible to
 clean up the implementation of rules, and perhaps also get rid of some
 of their semantic issues, by making the rule rewriter rely on WITH
 and/or LATERAL, neither of which we had back in the dark ages when the
 current rules implementation was built.  In particular, WITH might offer
 a fix for the multiple-evaluation gotchas that people so often trip
 over.  For instance, perhaps an UPDATE with rules could be rewritten
 into something like

Making the rule system use WITH always seemed like a good idea to me.
ISTM though that it would tax the optimiser, as it would need to become
much more clever at pushing conditions down. For example, on 9.1 at
least you still get this:

$ explain with x as (select * from pg_class) select * from x where relname = 'test';
                            QUERY PLAN
------------------------------------------------------------------
 CTE Scan on x  (cost=14.15..23.49 rows=2 width=189)
   Filter: (relname = 'test'::name)
   CTE x
     ->  Seq Scan on pg_class  (cost=0.00..14.15 rows=415 width=194)
(4 rows)

whereas without the WITH you get an index scan.

So in its current form you can't use WITH to simplify the
implementation of views, because performance would suck.  OTOH, the
intelligence in the current rule system may be a good guide for
optimising WITH statements.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] spinlock-pthread_mutex : real world results

2012-08-06 Thread Martijn van Oosterhout
On Mon, Aug 06, 2012 at 08:54:11AM -0400, Robert Haas wrote:
 2. Should we be modifying our spinlock implementation on Linux to use
 futexes rather than pulling pthreads into the mix?
 
 Anyone have data on the first point, or opinions on the second one?

I'm not sure whether pthreads is such a thick layer. Or are you
referring to the fact that you don't want to link against the library
at all?

If we've found a situation where our locks work better than the ones in
pthreads then either (a) we're doing something wrong or (b) the
pthreads implementation could do with improvement.

In either case it might be worth some investigation. If we can improve
the standard pthreads implementation everybody wins.

BTW, I read that some *BSDs have futex implementations (to emulate
linux), it might be an idea to see where they're going.

e.g. http://osdir.com/ml/os.dragonfly-bsd.kernel/2003-10/msg00232.html

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Resetting libpq connections after an app error

2012-07-21 Thread Martijn van Oosterhout
On Sat, Jul 21, 2012 at 01:08:58AM +0100, Daniele Varrazzo wrote:
 Hello,
 
 apologize for bumping the question to -hackers but I got no answer
 from -general. If there is a better ML to post it let me know.
 
 In a libpq application, if there is an application error between
 PQsendQuery and PQgetResult, is there a way to revert a connection
 back to an usable state (i.e. from transaction status ACTIVE to IDLE)
 without using the network in a blocking way? We are currently doing
 
  while (NULL != (res = PQgetResult(conn->pgconn))) {
 PQclear(res);
 }
 
 but this is blocking, and if the error had been caused by the network
 down, we'll just get stuck in a poll() waiting for a timeout.

There is PQreset(), which also exists in a non-blocking variant.
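
The non-blocking variant is driven from your own event loop, roughly
like this (wait_for_socket() stands in for whatever poll()/select()
wrapper you already have; it is not a libpq function):

#include <libpq-fe.h>

static void
reset_nonblocking(PGconn *conn)
{
    PostgresPollingStatusType st;

    if (!PQresetStart(conn))
        return;                 /* reset could not even be started */

    st = PQresetPoll(conn);
    while (st == PGRES_POLLING_READING || st == PGRES_POLLING_WRITING)
    {
        /* block in your own event loop, not inside libpq */
        wait_for_socket(PQsocket(conn), st == PGRES_POLLING_READING);
        st = PQresetPoll(conn);
    }
    /* st is now PGRES_POLLING_OK or PGRES_POLLING_FAILED */
}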

Hope this helps,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] huge tlb support

2012-07-09 Thread Martijn van Oosterhout
On Mon, Jul 09, 2012 at 02:11:00AM -0400, Tom Lane wrote:
 y...@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
  Also, I was under the impression that recent Linux kernels use hugepages
  automatically if they can, so I wonder exactly what Andres was testing
  on ...
 
  if you mean the trasparent hugepage feature, iirc it doesn't affect
  MAP_SHARED mappings like this.
 
 Oh!  That would explain some things.  It seems like a pretty nasty
 restriction though ... do you know why they did that?

It doesn't say explicitly in the documentation (found at
http://lwn.net/Articles/423592/ aka transhuge.txt) but reading between
the lines I'm guessing it's because huge pages must be aligned to 2 or
4MB, and with a shared mapping you'd probably need to require that
alignment in all address spaces.

However, it seems it should work for SysV shared memory, see:
http://lwn.net/Articles/375096/ .  The same page suggests shared
mappings should work fine.  However, this page refers to the
non-transparent feature.

If you think about it, it must work since huge pages are inherited
through fork().

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Bug tracker tool we need

2012-07-07 Thread Martijn van Oosterhout
On Sat, Jul 07, 2012 at 11:36:41AM +0200, Magnus Hagander wrote:
 I've never thought of it in terms of friction, but I think that term
 makes a lot of sense. And it sums up pretty much one of the things
 that I find the most annoying with the commitfest app - in the end it
 boils down to the same thing. I find it annoying that whenever someone
 posts a new review or new comments, one has to *also* go into the CF
 app and add them there. Which leads to a lot of friction, which in
 turn leads to people not actually putting their comments in there,
 which decreases the value of the tool.
 
 Don't get me wrong - the cf app is a huge step up from no app at all.
 But it's just not done yet.

Well, if that's the issue then there are well-known solutions to that.
Give each commitfest entry a tag, say CF#8475, and add a bot to the
mail server that examines each email for this tag and, if found, adds
it to the CF app.

This could then easily track commit messages, emails on the mailing
list and even bug reports.  (For example, someone replies to a bug
report with "See CF#8484" and then a reference to the bug thread gets
pushed into the CF app.)

It's also a searchable identifier, which is also useful for google.

We do this at my work, it's a very low barrier method of linking
different systems.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-06-27 Thread Martijn van Oosterhout
On Wed, Jun 27, 2012 at 12:58:47AM +0200, Nils Goroll wrote:
 So it looks like using pthread_mutexes could at least be an option on Linux.
 
 Using futexes directly could be even cheaper.

Note that below this you only have the futex(2) system call. Futexes
require all counter manipulation to happen in userspace, just like now,
so all the per-architecture stuff remains.  On Linux pthread mutexes
are really just a thin wrapper on top of this.

The futex(2) system call merely provides an interface for handling the
blocking and waking of other processes and releasing locks on process
exit (so everything can still work after a kill -9).

So it's more a replacement for the SysV semaphores than anything else.
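
To illustrate how thin the interface is, the kernel side of a futex
lock boils down to two wrappers (Linux-only sketch; the fast path is a
plain atomic operation in userspace and never enters the kernel):

#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* sleep as long as *addr still holds 'expected' (may wake spuriously) */
static long
futex_wait(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* wake at most 'n' processes blocked in futex_wait() on addr */
static long
futex_wake(uint32_t *addr, int n)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
}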

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] why roll-your-own s_lock? / improving scalability

2012-06-26 Thread Martijn van Oosterhout
On Tue, Jun 26, 2012 at 01:46:06PM -0500, Merlin Moncure wrote:
 Well, that would introduce a backend dependency on pthreads, which is
 unpleasant.  Also you'd need to feature test via
 _POSIX_THREAD_PROCESS_SHARED to make sure you can mutex between
 processes (and configure your mutexes as such when you do).  There are
 probably other reasons why this can't be done, but I personally don't
 know of any.

And then you have fabulous things like:

https://git.reviewboard.kde.org/r/102145/
(OSX defines _POSIX_THREAD_PROCESS_SHARED but does not actually support
it.)

Seems not very well tested in any case.
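
For reference, the dance you'd have to get right on platforms where it
does work (a sketch; the mutex itself must live in shared memory):

#include <pthread.h>

static int
init_shared_mutex(pthread_mutex_t *m)   /* m must be in shared memory */
{
    pthread_mutexattr_t attr;
    int rc;

    pthread_mutexattr_init(&attr);
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}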

It might be worthwhile testing futexes on Linux though; they are
specifically supported on any kind of shared memory (shm/mmap/fork/etc)
and quite well tested.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Pg default's verbosity?

2012-06-19 Thread Martijn van Oosterhout
On Mon, Jun 18, 2012 at 09:30:14PM -0400, Robert Haas wrote:
 There might be something to the idea of demoting a few of the things
 we've traditionally had as NOTICEs, though.  IME, the following two
 messages account for a huge percentage of the chatter:
 
 NOTICE:  CREATE TABLE will create implicit sequence foo_a_seq for
 serial column foo.a
 NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
 foo_pkey for table foo

+1

Absolutely. And if you also suppress the output of the setval calls
produced by pg_dump, a successful restore of a dump would produce
barely any output at all with -q.  That would make errors significantly
more visible.

Not sure how to go about that though.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] libpq compression

2012-06-18 Thread Martijn van Oosterhout
On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
 The fly in the ointment with any of these ideas is that the configure
 list is not a list of exact cipher names, as per Magnus' comment that
 the current default includes tests like !aNULL.  I am not sure that
 we know how to evaluate such conditions if we are applying an
 after-the-fact check on the selected cipher.  Does OpenSSL expose any
 API for evaluating whether a selected cipher meets such a test?

I'm not sure whether there's an API for it, but you can certainly check
manually with openssl ciphers -v, for example:

$ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
NULL-SHA                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=SHA1
NULL-MD5                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=MD5

...etc...

So unless openssl includes the code twice, there must be a way to
extract the list from the library.
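
Indeed there is. Roughly (untested sketch, error checking omitted; on
old OpenSSL you may need to call SSL_library_init() first):

#include <stdio.h>
#include <openssl/ssl.h>

static void
print_cipher_list(const char *spec)
{
    SSL_CTX *ctx = SSL_CTX_new(SSLv23_method());
    SSL *ssl;
    int i;

    SSL_CTX_set_cipher_list(ctx, spec);     /* same syntax as the CLI */
    ssl = SSL_new(ctx);

    STACK_OF(SSL_CIPHER) *ciphers = SSL_get_ciphers(ssl);
    for (i = 0; i < sk_SSL_CIPHER_num(ciphers); i++)
        printf("%s\n", SSL_CIPHER_get_name(sk_SSL_CIPHER_value(ciphers, i)));

    SSL_free(ssl);
    SSL_CTX_free(ctx);
}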

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Last gasp

2012-04-12 Thread Martijn van Oosterhout
On Wed, Apr 11, 2012 at 12:00:39PM -0300, Alvaro Herrera wrote:
 remote in their main PG tree, and so changesets could be pulled into the
 same clone and cherry-picked into the master branch.

If you're talking about a way of using git to support reviewing, the
Gerrit tool has an interesting workflow.  Essentially anything you want
reviewed you push to a fake tag refs/for/master which always creates a
new branch.  As such you have a repository which contains every patch
ever submitted, but it simultaneously tracks the parents so you know
which version of the tree a patch was against.

In the case of Postgres each entry in the CF app would have its own tag
(say refs/cf/234) which would create a new patch for that entry.  In
the end accepted patches are cherry-picked onto the real tree.  But
because all patches are now in the same place you can build tooling
around it easier, like testing: does this patch cherry-pick cleanly or
is there a conflict.

No merge commits, just using git purely as patch storage.

(Note to make this work it has a git server emulation which may or may
not be easy to do, but it's just a thought about workflow.)

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Re: Cross-backend signals and administration (Was: Re: pg_terminate_backend for same-role)

2012-03-27 Thread Martijn van Oosterhout
On Tue, Mar 27, 2012 at 03:17:36AM +0100, Greg Stark wrote:
 I may be forgetting something obvious here but is there even a
 function to send an interrupt signal? That would trigger the same
 behaviour that a user hitting C-c would trigger which would only be
 handled at the next CHECK_FOR_INTERRUPTS which seems like it would be
 non-controversial and iirc we don't currently have a function to do
 this for other connections the user may have if he doesn't have access
 to the original terminal and doesn't have raw shell access to run
 arbitrary commands.

Sure, C-c just sends a SIGINT. But IIRC the problem wasn't so much
cancelling running queries (SIGINT appears to work fine there); it's
that a connection blocked waiting for client input doesn't wake up when
sent a signal. To fix that you'd need to change the signal mode and
teach the various input layers (like SSL) to handle the EINTR case.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] sortsupport for text

2012-03-19 Thread Martijn van Oosterhout
On Mon, Mar 19, 2012 at 12:19:53PM -0400, Robert Haas wrote:
 On Sat, Mar 17, 2012 at 6:58 PM, Greg Stark st...@mit.edu wrote:
  On Fri, Mar 2, 2012 at 8:45 PM, Robert Haas robertmh...@gmail.com wrote:
  12789    28.2686  libc-2.13.so             strcoll_l
  6802     15.0350  postgres                 text_cmp
 
  I'm still curious how it would compare to call strxfrm and sort the
  resulting binary blobs. I don't think the sortsupport stuff actually
  makes this any easier though. Since using it requires storing the
  binary blob somewhere I think the support would have to be baked into
  tuplesort (or hacked into the sortkey as an expr that was evaluated
  earlier somehow).
 
 Well, the real problem here is that the strxfrm'd representations
 aren't just bigger - they are huge.  On MacBook Pro, if the input
 representation is n characters, the strxfrm'd representation is 9n+3
 characters.  

Ouch. I was holding out hope that you could get a meaningful
improvement if we could use the first X bytes of the strxfrm output, so
you'd only need to do a strcoll on strings that actually nearly match.
But with an information density of 9 bytes per input character it
doesn't seem worthwhile.

That and this gem in the strxfrm manpage:

RETURN VALUE
   The  strxfrm()  function returns the number of bytes required to
   store the transformed string in dest excluding the terminating
   '\0' character.  If the value returned is n or more, the
   contents of dest are indeterminate.

Which means that you have to take the entire transformed string, you
can't just ask for the first bit. I think that kind of leaves the whole
idea dead in the water.
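
For reference, the only safe usage is the call-twice pattern, sizing
the buffer for the whole transformed string up front (which is exactly
what makes prefixes useless):

#include <stdlib.h>
#include <string.h>

static char *
xfrm_copy(const char *s)
{
    size_t n = strxfrm(NULL, s, 0) + 1;     /* full length plus NUL */
    char *buf = malloc(n);

    if (buf != NULL)
        strxfrm(buf, s, n);     /* only now are the contents defined */
    return buf;
}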

Just for interest I looked at the ICU API for this, and it has the same
restriction.  There is another function you can use to return partial
sort keys (ucol_nextSortKeyPart), but it produces uncompressed sortkeys
(which is apparently what Mac OSX is doing), and those seem useless for
our purposes.  Either this is a hard problem or we're nowhere near a
target use case.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-03-10 Thread Martijn van Oosterhout
On Sat, Mar 10, 2012 at 11:38:51AM -0500, Tom Lane wrote:
 Shigeru Hanada shigeru.han...@gmail.com writes:
  Thanks for the review.  Agreed to write own deparser for pgsql_fdw
  which handles nodes which can be pushed down.  Every SQL-based FDW
  which constructs SQL statement for each local query would need such
  module inside.
 
 Yeah.  That's kind of annoying, and the first thing you think of is that
 we ought to find a way to share that code somehow.  But I think it's
 folly to try to design a shared implementation until we have some
 concrete implementations to compare.  An Oracle FDW, for instance, would
 need to emit SQL code with many differences in detail from pgsql_fdw.
 It's not clear to me whether a shared implementation is even practical,
 but for sure I don't want to try to build it before we've done some
 prototype single-purpose implementations.

FWIW, this sounds like the compiler mechanism in SQLAlchemy for
turning SQL node trees into strings. The basic idea is that you define
functions for converting nodes to strings. Stuff like And and Or
works for every database, but individual dialects can override
particular things.

So for Postgres you have: Node(foo) => $1, but for other databases
perhaps :field1. But most of the other code can be shared.

http://docs.sqlalchemy.org/en/latest/core/compiler.html

In my experience it works well for generating custom constructs. They
have compilers for 11 different database engines, so it seems flexible
enough.  Mind you, they also handle DDL mapping (where most of the
variation is) and datatype translations, which goes a lot further than
we need here.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Our regex vs. POSIX on longest match

2012-03-05 Thread Martijn van Oosterhout
On Mon, Mar 05, 2012 at 11:28:24AM -0500, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  I think the right way to imagine this is as though the regular
  expression were being matched to the source text in left-to-right
  fashion.
 
 No, it isn't.  You are headed down the garden path that leads to a
 Perl-style definition-by-implementation, and in particular you are going
 to end up with an implementation that fails to satisfy the POSIX
 standard.  POSIX requires an *overall longest* match (at least for cases
 where all quantifiers are greedy), and that sometimes means that the
 quantifiers can't be processed strictly left-to-right greedy.  An
 example of this is 

On the other hand, I think requiring an overall longest match makes
the implementation non-polynomial. The simplest example I can think of
is the knapsack problem: given weights x_n and a total W, it can be
converted to a regex problem by matching a string of W a's against the
regex:

a{x_1}?a{x_2}?a{x_3}? etc...

Yes, Perl (and others) don't guarantee an overall longest match. I
think they want you to consider regular expressions as a specialised
parsing language in which you configure a state machine to process
your strings. Not ideal, but predictable.

The question is, what are users expecting of the PostgreSQL regex
implementation?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature

