Re: [GENERAL] Postgres 10.1 fails to start: server did not start in time

2017-11-12 Thread Peter J. Holzer
On 2017-11-12 13:26:58 +0100, Christoph Berg wrote:
> Re: To Adam Brusselback 2017-11-11 
> <2017205316.u56lkmkakdmcx...@msg.df7cb.de>
> > I'm investigating if it's a good idea to tell systemd to ignore the
> > exit code of pg_ctl(cluster).
> 
> Telling systemd to ignore ExecStart errors seems to be the correct
> solution. The service will still be active, with the startup error
> being shown:

Wouldn't it be better to remove the timeout? If some other service
depends on PostgreSQL it probably shouldn't be started until PostgreSQL
is really up, and services which don't need PostgreSQL (e.g. SSH or X11
login or a web or mail server) shouldn't depend on it.

One of the purported advantages of systemd over SystemV init is that it
starts up services in parallel, so a service which takes a long (or
infinite) time to start doesn't block other services.
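
If one does want to remove the timeout, a drop-in override along these
lines should do it (hedged: the unit name postgresql@.service is what
Debian uses, and whether you write 0 or "infinity" depends on the
systemd version):

    # /etc/systemd/system/postgresql@.service.d/no-start-timeout.conf
    [Service]
    # 0 (or "infinity" on newer systemd) disables the start timeout
    TimeoutStartSec=0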

    hp



Re: [GENERAL] Client Authentication methods

2017-11-10 Thread Peter J. Holzer
On 2017-11-10 08:25:24 -0500, chiru r wrote:
> I am trying to understand the Authentication method in pg_hba.conf file
> (password & md5) in PostgreSQL database server. 
> 
> I am assuming that the user provides the connection string host/user/
> password, then the client will go and contact the DB server's pg_hba.conf file
> in memory without carrying the password over the network initially, and then it
> confirms the authentication method from pg_hba.conf, then it decides whether it
> sends a clear text or md5 encrypted password from client to server to make a
> session?

I'm not sure what "it" refers to in this sentence. If "it" refers to the
client (as grammatically it should) then the answer is no. The client
doesn't have access to the pg_hba.conf file.

The client connects to the server, sending the username and database
name, but not (yet) the password. Then the server checks the pg_hba.conf
file to determine which authentication method to use. The server then
sends an authentication request to the client, to which the client sends
a response (including, or based on, the password).


> Is my assumption correct? Or what exactly makes the difference for the client
> if I use md5/password in the pg_hba.conf file on the DB server?

See
https://www.postgresql.org/docs/10/static/auth-methods.html#AUTH-PASSWORD

With the password method, passwords are sent in plain text. With md5, a
salted MD5 hash derived from the password, the username, and a
server-supplied nonce is sent instead.
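
To make that concrete, a hedged illustration (user 'alice' and password
'secret' are made up): with md5 the server stores the string 'md5'
followed by md5(password || username), and during authentication the
client sends a further MD5 of that stored value combined with a
per-connection salt, so the cleartext password never crosses the wire:

    select 'md5' || md5('secret' || 'alice') as stored_md5_hash;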

hp



Re: [GENERAL] EXPLAIN command just hangs...

2017-11-03 Thread Peter J. Holzer
On 2017-11-02 20:51:23 +, Rhhh Lin wrote:
[...]
> where timestamp BETWEEN 150667656 AND 150875022 
[...]
> *Also, as a sidenote - can someone please expand on why one (I was not involved
> in the creation of this DB/schema definition) would choose to have the
> definition of the timestamp column as a bigint in this case?

The numbers look like Java timestamps (Milliseconds since the epoch). So
probably the programs accessing the database are written in Java and the
programmer decided that it's simpler to do all timestamp computations in
Java than in SQL. Or maybe the programmer just felt more comfortable
with Java-like timestamps than with calendar dates. (I have done the
same (with Unix timestamps, i.e. seconds since the epoch). Although in
the cases where I've done it I had the additional excuse that the
database didn't support timestamps with timezones, which isn't the case
for PostgreSQL.)
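
If you ever need to treat such a value as a real timestamp, converting
on the fly is easy enough (the literal below is just an example value):

    select to_timestamp(1508750220000 / 1000.0);  -- bigint milliseconds -> timestamptz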

hp



[GENERAL] Query plan for Merge Semi Join

2017-11-01 Thread Peter J. Holzer
[truncated EXPLAIN ANALYZE output]
 Execution time: 3275.341 ms

That is almost certainly not ideal, but this is not my question.

My question is what does that merge semi join actually do?

In general a merge join needs two inputs sorted by the merge key. It
walks both in parallel and joins matching lines. Correct?

The first input is the index scan. 

The second is the output of the materialize. Since we need only the
column arbeitsvolumen this would be something like
select arbeitsvolumen from facttable_kon_eh where thema='E'
order by arbeitsvolumen;

So far so good. But there are a few things I don't understand:

Where does Rows Removed by Filter: 3874190 come from? The number doesn't
match any count I can come up with: It is a bit larger than the total number
of rows where term is not null but smaller than the total number of
rows where the filter doesn't match. And it is much larger than the
number of rows I would expect if the merge stopped once there could not
be a possible match any more. And does it really check the filter
condition even for rows that don't satisfy the merge condition? Of
course it makes sense from a modularization point of view, but that's a
lot of random accesses, most of which are unnecessary.

The materialize returns 184791 rows. This one I understand: There are 6
non-null distinct values of arbeitsvolumen in facttable_kon_eh, and each
appears 36958 times. 36958 * 5 + 1 = 184791. So it stops once it reaches
the largest value. (Although now I'm wondering how it knows that this is
the largest value without scanning to the end.)
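
For what it's worth, a simple sanity check of those per-value counts
(same filter as the materialize input):

    select arbeitsvolumen, count(*)
    from facttable_kon_eh
    where thema = 'E'
    group by arbeitsvolumen
    order by arbeitsvolumen;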

    hp



Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread Peter J. Holzer
On 2017-10-13 12:49:21 -0300, Seamus Abshere wrote:
> In the spreadsheet world, there is this concept of "shared strings," a
> simple way of compressing spreadsheets when the data is duplicated in
> many cells.
> 
> In my database, I have a table with >200 million rows and >300 columns
> (all the households in the United States). For clarity of development
> and debugging, I have not made any effort to normalize its contents, so
> millions of rows have, for example, "SINGLE FAMILY RESIDENCE /
> TOWNHOUSE" (yes, that whole string!) instead of some code representing
> it.
> 
> Theoretically / blue sky, could there be a table or column type that
> transparently handles "shared strings" like this, reducing size on disk
> at the cost of lookup overhead for all queries?

Theoretically it's certainly possible and I think some column-oriented
databases store data that way.

> (I guess maybe it's like TOAST, but content-hashed and de-duped and not
> only for large objects?)

Yes, but if you want to automatically delete entries which are no longer
needed you need to keep track of that. So either a reference count or an
index lookup on the parent table. This is starting to look a lot like a
foreign key - just hidden from the user. Performance would probably be
similar, too.
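
Just to illustrate what such a hidden foreign key would amount to if done
by hand (a hedged sketch, table and column names made up):

    create table property_type (
        id   smallint primary key,
        name text not null unique
    );

    create table household (
        id               bigint primary key,
        property_type_id smallint references property_type (id)
        -- ... plus the other ~300 columns
    );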

We have done something similar (although for different reasons). We
ended up doing the "join" in the application. For most purposes we don't
need the descriptive strings and when we need them we can do a
dictionary lookup just before sending them to the client (of course the
dictionary has to be read from the database, too, but it doesn't change
that often, so it can be cached). And from a software maintainability
POV I think a dictionary lookup in Perl is a lot nicer than 50 joins
(or 300 in your case).

hp



Re: [GENERAL] pglogical bidirectional replication of sequences

2017-09-01 Thread Peter J. Holzer
On 2017-09-01 10:29:51 +0200, Peter J. Holzer wrote:
> pglogical supports replication of sequences, and although the way it
> does this suggests that it can't really work in both directions
> (actually I'm sceptical that it works reliably in one direction), of
> course I had to try it.
> 
> So I created a sequence on both nodes and called
> select pglogical.replication_set_add_sequence('default', 'test_sequence');
> on both nodes.
> 
> The result was ... interesting.
> 
> First I got the same sequence (1, 2, 3, 4, 5) on both nodes.
> 
> After a few seconds the replication kicked in, and then I got the same
> value (1005) on both nodes most of the time, with a few variants (2005,
> 3005) thrown in.
> 
> In a word, the sequence was completely unusable.

[...some failed attempts to recover...]

> So, is there a way to recover from this situation without drastic
> measures like nuking the whole database.

To answer my own question:

delete from pglogical.queue where message_type='S';
on both nodes seems to have the desired effect.
A vacuum full pglogical.queue afterwards is a good idea to get the
bloated table back to a reasonable size.
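
For reference, the whole cleanup, to be run on both nodes:

    delete from pglogical.queue where message_type = 'S';
    vacuum full pglogical.queue;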

hp





Re: [GENERAL] pglogical bidirectional replication of sequences

2017-09-01 Thread Peter J. Holzer
On 2017-09-01 09:57:52 -0600, Rob Sargent wrote:
> On 09/01/2017 02:29 AM, Peter J. Holzer wrote:
> >TLDR: Don't.
> >
> >I'm currently conducting tests which should eventually lead to a 2 node
> >cluster with working bidirectional logical replication.
> >
> >(Postgresql 9.6.4-1.pgdg90+1, pglogical 2.0.1-1.jessie+1 on Debian 9
> >(Stretch))
> >
> >pglogical supports replication of sequences, and although the way it
> >does this suggests that it can't really work in both directions
> >(actually I'm sceptical that it works reliably in one direction), of
> >course I had to try it.
> >
[and it blew up]

> I trust you mean don't use sequences

I trust you don't mean what I understood ;-).

Seriously:

Sequences in general are fine and very useful. I think they should be
used where appropriate.

Sequences and logical replication don't mix well. That still doesn't
mean that you can't use sequences, you just have to be careful how you
use them. 

Since replicating sequence state doesn't really work, I think it is best
to use independent sequences on each node and just configure them in a
way that they can not produce the same values. A naive approach would be
to use MINVALUE/MAXVALUE/START WITH to ensure non-overlapping ranges. A
somewhat more elegant approach is to increment by $n$ (the number of
nodes in the cluster) and use different start values (I got that idea
from
http://thedumbtechguy.blogspot.co.at/2017/04/demystifying-pglogical-tutorial.html).
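
A hedged sketch of that approach for a two-node cluster (the sequence
name is made up):

    -- on node 1:
    create sequence my_table_id_seq increment by 2 start with 1;
    -- on node 2:
    create sequence my_table_id_seq increment by 2 start with 2;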
 

There are other ways to get unique ids: A uuid should work pretty well
in most cases, and in some even a random 64 bit int might be enough.

hp



[GENERAL] pglogical bidirectional replication of sequences

2017-09-01 Thread Peter J. Holzer
TLDR: Don't.

I'm currently conducting tests which should eventually lead to a 2 node
cluster with working bidirectional logical replication.

(Postgresql 9.6.4-1.pgdg90+1, pglogical 2.0.1-1.jessie+1 on Debian 9
(Stretch))

pglogical supports replication of sequences, and although the way it
does this suggests that it can't really work in both directions
(actually I'm sceptical that it works reliably in one direction), of
course I had to try it.

So I created a sequence on both nodes and called
select pglogical.replication_set_add_sequence('default', 'test_sequence');
on both nodes.

The result was ... interesting.

First I got the same sequence (1, 2, 3, 4, 5) on both nodes.

After a few seconds the replication kicked in, and then I got the same
value (1005) on both nodes most of the time, with a few variants (2005,
3005) thrown in.

In a word, the sequence was completely unusable.

Experiment completed, so I removed the sequence from the replication
set:

select pglogical.replication_set_remove_sequence('default', 'test_sequence');
on both nodes.

But the behaviour of the sequence doesn't change. It still returns 1005
most of the time, and sometimes 2005 or 3005. This is true even after
restarting both nodes. 

Plus, I can't drop the sequence any more (as the user who created the
sequence):

wds=> drop sequence public.test_sequence ;
ERROR:  permission denied for schema pglogical

So, clearly, pglogical is still managing that sequence.

If I drop the sequence as postgres and then recreate it, it works
normally for some time (also the sequence on the other node now works
normally), but after some time, the replication kicks in again and the
sequence is stuck again at 1005. 

So, is there a way to recover from this situation without drastic
measures like nuking the whole database.

hp



Re: [GENERAL] Porting libpq to QNX 4.25

2017-08-25 Thread Peter J. Holzer
On 2017-08-22 12:57:15 -0300, marcelo wrote:
> We'll replace those QNX machines with WIndows XP ones

The future is already here — it's just not very evenly distributed.

SCNR,
hp



Re: [GENERAL] Count column with name 'count' returns multiple rows. Why?

2017-08-19 Thread Peter J. Holzer
On 2017-08-18 15:57:39 -0500, Justin Pryzby wrote:
> On Fri, Aug 18, 2017 at 10:47:37PM +0200, Peter J. Holzer wrote:
> > On 2017-08-18 06:37:15 -0500, Justin Pryzby wrote:
> > > On Fri, Aug 18, 2017 at 01:01:45PM +0200, Rob Audenaerde wrote:
> > > > Can anyone please explain this behaviour?
> > > 
> > > https://www.postgresql.org/docs/9.6/static/sql-expressions.html#SQL-EXPRESSIONS-FUNCTION-CALLS
> > > https://www.postgresql.org/docs/9.6/static/rowtypes.html#ROWTYPES-USAGE
> > 
> > Maybe I overlooked it, but I don't see anything in those pages which
> > explains why «count» is parsed as a column name in the first example and
> > as a function name in the second.
> > 
> > Nor do I see what «count(base.*)» is supposed to mean. It seems to be
> > completely equivalent to just writing «count», but the part in
> > parentheses is not ignored: It has to be either the table name or the
> > table name followed by «.*». Everything else I tried either led to a
> > syntax error or to «count» being recognized as a function. So apparently
> > columnname open-parenthesis tablename closed-parenthesis is a specific
> > syntactic construct, but I can't find it documented anywhere.
> 
> | Another special syntactical behavior associated with composite values is that
> | we can use functional notation for extracting a field of a composite value. The
> | simple way to explain this is that the notations field(table) and table.field
> | are interchangeable. For example, these queries are equivalent:

Thanks. I see it now.
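
For anyone else who trips over this, a minimal illustration (the table is
made up; note that count here is a column, not the aggregate):

    create temp table mytable (id int, count int);
    insert into mytable values (1, 10), (2, 20);

    select count(base.*) from mytable base;  -- field notation: the "count" column, one row per table row
    select base.count    from mytable base;  -- equivalent
    select count(1)      from mytable base;  -- the aggregate: a single row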

hp



Re: [GENERAL] Count column with name 'count' returns multiple rows. Why?

2017-08-18 Thread Peter J. Holzer
On 2017-08-18 06:37:15 -0500, Justin Pryzby wrote:
> On Fri, Aug 18, 2017 at 01:01:45PM +0200, Rob Audenaerde wrote:
> > I don't understand why this query:
> > 
> >select count(base.*) from mytable base;
> > 
> > does return multiple rows.
> > 
> >select count(1) from mytable base;
> > 
> > returns the proper count.
> > 
> > There is a column with the name 'count'.
> > 
> > Can anyone please explain this behaviour?
> 
> https://www.postgresql.org/docs/9.6/static/sql-expressions.html#SQL-EXPRESSIONS-FUNCTION-CALLS
> https://www.postgresql.org/docs/9.6/static/rowtypes.html#ROWTYPES-USAGE

Maybe I overlooked it, but I don't see anything in those pages which
explains why «count» is parsed as a column name in the first example and
as a function name in the second.

Nor do I see what «count(base.*)» is supposed to mean. It seems to be
completely equivalent to just writing «count», but the part in
parentheses is not ignored: It has to be either the table name or the
table name followed by «.*». Everything else I tried either led to a
syntax error or to «count» being recognized as a function. So apparently
columnname open-parenthesis tablename closed-parenthesis is a specific
syntactic construct, but I can't find it documented anywhere.

hp



Re: [GENERAL] Dealing with ordered hierarchies

2017-07-31 Thread Peter J. Holzer
On 2017-07-25 01:15:56 +1200, Tim Uckun wrote:
> I don't like the approach with a large increment. It would mean complicated
> logic to see if you filled the gap and then update all the other peers if you
> did. It sounds like the re-order is going to be expensive no matter what. My
> primary concern are race conditions though. What if two or more users are
> trying to update the hierarchy either by inserts or updates? I can definitely
> see a situation where we have issues transactions trip over each other.

You could add a unique index over (parent, sequence_number). That way
two transactions won't be able to add a node with the same sequence
number under the same parent. You will have to handle duplicate key
errors, though.
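
Concretely, something like this (table and column names are assumptions):

    create unique index on hierarchy (parent, sequence_number);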

    hp



Re: [GENERAL] plpgsql function with offset - Postgres 9.1

2017-06-15 Thread Peter J. Holzer
On 2017-06-16 10:19:45 +1200, Patrick B wrote:
> 2017-05-29 19:27 GMT+12:00 Albe Laurenz <laurenz.a...@wien.gv.at>:
> Patrick B wrote:
> > I am running a background task on my DB, which will copy data from 
> tableA
> to tableB. For
> > that, I'm writing a PL/PGSQL function which basically needs to do the
> following:
> >
> >
> > 1.    Select the data from tableA
> > 2.    The limit will be put when calling the function
> > 3.    insert the selected data on Step 1 onto new table
[...]
> >               FOR row IN EXECUTE '
> >                           SELECT
> >                                   id,
> >                                   path,
> >                                   name,
> >                                   name_last,
> >                                   created_at
> >                           FROM
> >                                   tablea
> >                           WHERE
> >                                   ready = true
> >                           ORDER BY 1 LIMIT ' || rows || ' OFFSET ' || rows || ''
> 
> '... LIMIT ' || p_limit || ' OFFSET ' || p_offset
> 
> >               LOOP
> 
> num_rows := num_rows + 1;
>
> >               INSERT INTO tableB (id,path,name,name_last,created_at)
> >               VALUES (row.id,row.path,row.name,row.name_last,row.created_at);
> >
> >               END LOOP;
[...]
> 
> There are two problems with this approach:
> 
> 1. It will do the wrong thing if rows are added or deleted in "tablea"
> while
>    you process it.
> 
> 
> 
> There will be actually records being inserted in tablea while processing the
> migration Any ideas here?

Is id monotonically increasing? You might be able to use that, as Albe
suggests:

> The solution is to avoid OFFSET and to use "keyset pagination":
> http://use-the-index-luke.com/no-offset

But it works only if rows cannot become ready after their id range has
already been processed. Otherwise you will miss them.
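
A hedged sketch of what that could look like here, with id as the key:

    select id, path, name, name_last, created_at
    from tablea
    where ready = true
      and id > $1        -- the largest id processed in the previous batch
    order by id
    limit 2000;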

> I can add another column in tablea, like example: row_migrated boolean --> if
> that helps

Yes that's probably the best way. Instead of using an additional column
you could also make ready tristate: New -> ready_for_migration -> migrated.

> 2. Queries with hight OFFSET values have bad performance.
> 
> 
> No problem. The plan is to perform 2k rows at once, which is not much.

Are rows deleted from tablea after they are migrated? Otherwise you will
have a problem:

select ... limit 2000 offset 1234000

will have to retrieve 1236000 rows and then discard 1234000 of them.

hp



Re: [GENERAL] storing large files in database - performance

2017-05-16 Thread Peter J. Holzer
On 2017-05-16 12:25:03 +, Eric Hill wrote:
> I searched and found a few discussions of storing large files in the database
> in the archives, but none that specifically address performance and how large
> of files can realistically be stored in the database.
> 
>  
> 
> I have a node.js application using PostgreSQL to store uploaded files.  The
> column in which I am storing the file contents is of type “bytea” with
> “Storage” type set to “EXTENDED”.

I have mentioned this little experiment before, but I finally put the
results on my web site: https://www.hjp.at/databases/blob-bench/

(Please note that so far I have run this only on one system.
Generalizing to other systems might be premature).


> Storing a 12.5 MB file is taking 10 seconds, and storing a 25MB file
> is taking 37 seconds.  Two notable things about those numbers:  It
> seems like a long time, and the time seems to grow exponentially with
> file size rather than linearly.
> 
>  
> 
> Do these numbers surprise you?

Yes. On my system, storing a 25 MB bytea value takes well under 1 second.
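
A quick, hedged way to reproduce that measurement in psql (the table is a
throwaway):

    \timing on
    create temp table blob_test (id int, data bytea);
    insert into blob_test
        values (1, convert_to(repeat('x', 25 * 1024 * 1024), 'UTF8'));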

hp



Re: [GENERAL] Caching and Blobs in PG? Was: Can PG replace redis, amqp, s3 in the future?

2017-05-05 Thread Peter J. Holzer
On 2017-05-05 11:46:55 -0700, John R Pierce wrote:
> On 5/5/2017 11:28 AM, Peter J. Holzer wrote:
> 
> On 2017-05-04 23:08:25 +0200, Sven R. Kunze wrote:
> 
> On 03.05.2017 12:57, Thomas Güttler wrote:
> 
> Am 02.05.2017 um 05:43 schrieb Jeff Janes:
> 
> No.  You can certainly use PostgreSQL to store blobs.  But 
> then, you
> need to store the PostgreSQL data **someplace**.
> If you don't store it in S3, you have to store it somewhere 
> else.
> 
> I don't understand what you mean here. AFAIK storing blobs in PG 
> is not
> recommended since it is not very efficient.
> 
> Seems like several people here disagree with this conventional wisdom.
> 
> I think it depends very much on what level of "efficiency" you need. On
> my home server (i5 processor, 32GB RAM, Samsung 850 SSD - not a piece of
> junk, but not super powerful either) I can retrieve a small blob from a
> 100GB table in about 0.1 ms, and for large blobs the speed approaches
> 200MB/s. For just about everything I'd do on that server (or even at
> work) this is easily fast enough.
> 
> 
> S3 is often used for terabyte to petabyte file collections.   I would not want
> to burden my relational database with this.

I repeat the the first sentence I wrote: "I think it depends very much
on what level of 'efficiency' you need." Just because some people need
to store petabytes of blob data doesn't mean everybody does. If you need
to store petabytes of blobs, PostgreSQL may not be the right tool. But
it may be the right tool if you just need to store a few thousand PDFs.
To tell people to never store blobs in PostgreSQL because PostgreSQL is
"not efficient" is just bullshit. There are many factors which determine
how you should store your data, and "efficiency" (however that is
defined, if it's defined at all and not just used as a buzzword) is only
one of them - and rarely, in my experience, the most important one.

hp



Re: [GENERAL] Caching and Blobs in PG? Was: Can PG replace redis, amqp, s3 in the future?

2017-05-05 Thread Peter J. Holzer
On 2017-05-04 23:08:25 +0200, Sven R. Kunze wrote:
> On 03.05.2017 12:57, Thomas Güttler wrote:
> >Am 02.05.2017 um 05:43 schrieb Jeff Janes:
> >>No.  You can certainly use PostgreSQL to store blobs.  But then, you
> >>need to store the PostgreSQL data **someplace**.
> >>If you don't store it in S3, you have to store it somewhere else.
> >
> >I don't understand what you mean here. AFAIK storing blobs in PG is not
> >recommended since it is not very efficient.
> 
> Seems like several people here disagree with this conventional wisdom.

I think it depends very much on what level of "efficiency" you need. On
my home server (i5 processor, 32GB RAM, Samsung 850 SSD - not a piece of
junk, but not super powerful either) I can retrieve a small blob from a
100GB table in about 0.1 ms, and for large blobs the speed approaches
200MB/s. For just about everything I'd do on that server (or even at
work) this is easily fast enough.

Sure, just telling the kernel "send data from file descriptor A (which
happens to be a file) to file descriptor B (a socket)" is a lot more
efficient than copying data from disk into a postgresql process, then
from that process to an application server, from that to the webserver
and that finally sends it to the socket. But if that just lets my server
be 99.9% idle instead of 99.0% idle, I haven't gained much. Similarly,
if my server spends 90% of its resources doing other stuff, I won't
gain much by optimizing this (I should better optimize that other stuff
it's spending so much time on).

I am in this regard a firm believer in not optimizing prematurely. Do
whatever makes sense from an application point of view. If the blobs are
logically part of some other data (e.g. PDFs in a literature database),
store them together (either all of them in PostgreSQL, or all in some
NoSQL database, or maybe on stone tablets, if that makes sense for some
reason). Only if you have good reason[1] to believe that physically
separating data which logically belongs together will resolve a
bottleneck, then by all means separate them.

hp

[1] "I read somewhere on the internet" is usually not a good reason.



Re: [GENERAL] Handling psql lost connections

2017-03-30 Thread Peter J. Holzer
On 2017-03-29 08:49:57 -0700, Steve Crawford wrote:
> When firewalls/VPNs stand between my psql client and a remote PostgreSQL 
> server
> the connection will on occasion time out and drop. This results in the
> following scenario:
> 
> -Leave for lunch mid project - leave psql open.
> 
> -Return from lunch, complete and submit large query.
> 
> -Notice query is taking too long. cancel it.
> 
> -Cancel doesn't return - realize that connection has dropped.
> 
> -Kill psql - history is not written out. Start query from scratch.
> 
> Is there:
[...]
> Yes, I know I and my coworkers could spend brain cycles trying to unerringly
> remember to close and restart connections, write all queries in an external
> editor and then submit them, etc. but I'm looking for more user friendly
> options.

One workaround could be to login to the server, start a screen session
and psql in the screen session. Then if your network connection drops
you can simply login again and resume the screen session. Of course this
only works if you have a shell login on the server which may not be the
case.

hp



Re: [GENERAL] inevitability of to_date() when converting representations which don't represent whole timestamps

2017-03-30 Thread Peter J. Holzer
I don't understand what you mean by "inevitability" in the subject.

On 2017-03-29 21:19:56 -0400, Shaun Cutts wrote:
> When being asked to convert a day of the week, the to_date() function
> returns the same day ('0001-01-01 BC’) no matter which day is
> converted:
> 
> # select to_date(‘Monday’, ‘Day’)
> '0001-01-01 BC’
> 
> # select to_date(‘Tuesday’, ‘Day’)
> '0001-01-01 BC’
> 
> However, if it were to return a date that was that day of the week, it
> could be inverted:
> 
> # select extract(dow from '0001-01-01 BC'::date); — this date should be the 
> result of to_date(‘Sunday’, ‘Day’)
> 6
> 
> # select extract(dow from '0001-01-02 BC'::date); — this date should be the 
> result of to_date(‘Monday’, ‘Day’)
> 0
> 
> ….
> 
> David tells this is not a bug, but it still seems like a reasonable
> requirement on to_date() to me. Is there some reason why this isn’t
> possible?

The documentation warns that to_date “interpret input liberally, with
minimal error checking” and while it “produce[s] valid output, the
conversion can yield unexpected results.”

I would agree that producing the same date for every day of the week
crosses the line between “unexpected (but valid) result” and “bug”.

On the other hand I have no idea what the result of to_date(‘Monday’,
‘Day’) should be. “Any date which is a Monday” seems too vague. “The
nearest Monday”, “the previous Monday”, “the next Monday” might be
useful in practice, but whichever of them you pick, you've picked the
wrong one with a probability of 2/3. “The first monday in the year -1 of
the proleptic Gregorian calendar” would be consistent with how
to_timestamp('12:34:56', 'HH24:MI:SS') works, but apart from that and
being invertible it seems to be a quite useless choice.

hp



Re: [GENERAL] Postgres Permissions Article

2017-03-30 Thread Peter J. Holzer
On 2017-03-29 08:05:23 -0700, Paul Jungwirth wrote:
> On 03/29/2017 06:36 AM, Tom Lane wrote:
> >Karsten Hilbert <karsten.hilb...@gmx.net> writes:
> >>Being able to create foreign keys may allow to indirectly
> >>discover whether certain values exists in a table which I
> >>don't otherwise have access to (by means of failure or
> >>success to create a judiciously crafted FK).
> >
> >Aside from that, an FK can easily be used to cause effective
> >denial-of-service, for example preventing rows from being deleted
> >within a table, or adding enormous overhead to such a deletion.
> 
> Thank you both for taking a look! I agree those are both worthwhile
> concerns. It still seems a little strange it is not just part of the CREATE
> permission (for example). I understand why not everyone can create a foreign
> key, I just have trouble imagining a use case where it is helpful to
> separate it from other DDL commands.

A foreign key affects not only the table on which it is defined but also
the table it references. 

If Alice creates a table “master” and Bob creates a table “detail”
referencing “master”, Bob can prevent Alice from deleting entries from
her own table. So Alice must be able to decide whom she allows to
reference her tables.

I don't see how how this could be part of the create privilege - I
certainly want different roles to be able to create their own tables (or
views, or whatever) without being able to DOS each other (accidentally
or intentionally).

(Also I don't understand why you wrote “You need the permission on both
tables”: Only the owner of a table can add constraints to it - this
privilege cannot be granted to other roles at all. So to create a
foreign key constraint you need to be the owner of the referencing table
and have the references privilege on the referenced table. It's not
symmetrical.)
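
To make the asymmetry concrete, a hedged sketch (alice owns "master", bob
owns "detail"):

    -- as alice:
    create table master (id int primary key);
    grant references on master to bob;

    -- as bob:
    create table detail (
        id        int primary key,
        master_id int references master (id)  -- works only because of the grant
    );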

hp



Re: [GENERAL] Autoanalyze oddity

2017-03-23 Thread Peter J. Holzer
On 2017-03-05 12:01:07 +0100, Peter J. Holzer wrote:
[...]
> At the current rate of inserts, this threshold will be reached on
> March 24th. I'll check whether the table is analyzed then.

It was (a little earlier than expected because pg_class.reltuples didn't
increase in the meantime).

hp




Re: [GENERAL] Autoanalyze oddity

2017-03-05 Thread Peter J. Holzer
On 2017-03-05 08:39:05 -0800, Adrian Klaver wrote:
> On 03/05/2017 03:01 AM, Peter J. Holzer wrote:
> >So it is likely that something happened on that day (disk full?) which
> >wiped out the contents of pg_stat_user_tables.
> 
> Are there any logs from that time, either Postgres or system?
> 
> I would think a full disk would have been noticed at the time so alternate
> theories:
> 
> https://www.postgresql.org/docs/9.5/static/monitoring-stats.html
> 
> "...  When the server shuts down cleanly, a permanent copy of the statistics
> data is stored in the pg_stat subdirectory, so that statistics can be
> retained across server restarts. When recovery is performed at server start
> (e.g. after immediate shutdown, server crash, and point-in-time recovery),
> all statistics counters are reset.
> ..."

Oh, of course. That was the day we found out the hard way that the
bypass for the UPS didn't work. I knew that date looked familiar, but
somehow couldn't place it. Mystery solved, thanks!

hp




Re: [GENERAL] Autoanalyze oddity

2017-03-05 Thread Peter J. Holzer
On 2017-03-03 06:39:35 -0800, Adrian Klaver wrote:
> On 03/03/2017 12:33 AM, Peter J. Holzer wrote:
> >This is with PostgreSQL 9.5.6 on Debian Linux.
> >
> >I noticed that according to pg_stat_user_tables autoanalyze has never
> >run on a lot of tables. Here is one example:
> >
> >wdsah=> select * from pg_stat_user_tables where schemaname='public' and 
> >relname='facttable_wds_indexstats';
> >─[ RECORD 1 ]───┬─
[...]
> >n_tup_ins   │ 47128
[...]
> >n_live_tup  │ 47128
> >n_dead_tup  │ 0
> >n_mod_since_analyze │ 47128
> >last_vacuum │ (∅)
> >last_autovacuum │ (∅)
> >last_analyze│ (∅)
> >last_autoanalyze│ (∅)
> >vacuum_count│ 0
> >autovacuum_count│ 0
> >analyze_count   │ 0
> >autoanalyze_count   │ 0
> >
> >wdsah=> select count(*) from facttable_wds_indexstats;
> > count
> >
> > 857992
> >(1 row)
> >
> >So, n_live_tup is way off, and n_tup_ins and n_mod_since_analyze also
> >seem to be wrong. Looks like this hasn't been updated in a year or so.
> >But track_counts is on:
> >
> >wdsah=> show track_counts;
> > track_counts
> >──
> > on
> >(1 row)
> 
> What are your settings for autovacuum?:
> 
> https://www.postgresql.org/docs/9.5/static/runtime-config-autovacuum.html

All the values in the autovacuum section of postgresql.conf are
commented out, so they should be the default values:

Just to be sure here's the output of show for each of the parameters:

wdsah=> show autovacuum;  on
wdsah=> show log_autovacuum_min_duration; -1
wdsah=> show autovacuum_max_workers;  3
wdsah=> show autovacuum_naptime;  1min
wdsah=> show autovacuum_vacuum_threshold; 50
wdsah=> show autovacuum_analyze_threshold;50
wdsah=> show autovacuum_vacuum_scale_factor;  0.2
wdsah=> show autovacuum_analyze_scale_factor; 0.1
wdsah=> show autovacuum_freeze_max_age;   200000000
wdsah=> show autovacuum_multixact_freeze_max_age; 400000000
wdsah=> show autovacuum_vacuum_cost_delay;20ms
wdsah=> show autovacuum_vacuum_cost_limit;-1


> Have the storage parameters for the table been altered?:
> 
> https://www.postgresql.org/docs/9.5/static/sql-createtable.html#SQL-CREATETABLE-STORAGE-PARAMETERS

No.

> >And even if it wasn't, shouldn't the autovacuum daemon notice that
> >n_mod_since_analyze is greater than n_live_tup *
> >autovacuum_analyze_scale_factor and run an autoanalyze?
> 
> That value is added to autovacuum_analyze_threshold:
> 
> autovacuum_analyze_scale_factor (floating point)
> 
> Specifies a fraction of the table size to add to
> autovacuum_analyze_threshold when deciding whether to trigger an ANALYZE.
> The default is 0.1 (10% of table size). This parameter can only be set in
> the postgresql.conf file or on the server command line; but the setting can
> be overridden for individual tables by changing table storage parameters.

True. But 50 is negligible compared to 47128*0.1. So that shouldn't make
much of a difference.

But now that I look closer, I notice that the number in n_tup_ins for
that table is exactly the number of records inserted since
2017-02-08T13:00 and there were no records inserted between 09:00 and
13:00 on that day.

So it is likely that something happened on that day (disk full?) which
wiped out the contents of pg_stat_user_tables.

Looking into the source code, I find that 
reltuples = classForm->reltuples;
Am I correct to assume that this is pg_class.reltuples? That would
explain why analyze hasn't run yet: This is 862378, which is exactly
correct. 862378 * 0.1 + 50 is 86287.8, which is larger than
pg_stat_user_tables.n_mod_since_analyze. At the current rate of inserts,
this threshold will be reached on March 24th. I'll check whether the
table is analyzed then.
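
For the record, a hedged helper query to watch that threshold (it just
recomputes the formula above from pg_class and the current settings):

    select s.relname,
           s.n_mod_since_analyze,
           c.reltuples * current_setting('autovacuum_analyze_scale_factor')::float8
             + current_setting('autovacuum_analyze_threshold')::int as analyze_threshold
    from pg_stat_user_tables s
    join pg_class c on c.oid = s.relid
    where s.relname = 'facttable_wds_indexstats';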

hp




[GENERAL] Autoanalyze oddity

2017-03-03 Thread Peter J. Holzer
This is with PostgreSQL 9.5.6 on Debian Linux.

I noticed that according to pg_stat_user_tables autoanalyze has never
run on a lot of tables. Here is one example:

wdsah=> select * from pg_stat_user_tables where schemaname='public' and 
relname='facttable_wds_indexstats';
─[ RECORD 1 ]───┬─
relid   │ 112723
schemaname  │ public
relname │ facttable_wds_indexstats
seq_scan│ 569
seq_tup_read│ 474779212
idx_scan│ 59184
idx_tup_fetch   │ 59184
n_tup_ins   │ 47128
n_tup_upd   │ 0
n_tup_del   │ 0
n_tup_hot_upd   │ 0
n_live_tup  │ 47128
n_dead_tup  │ 0
n_mod_since_analyze │ 47128
last_vacuum │ (∅)
last_autovacuum │ (∅)
last_analyze│ (∅)
last_autoanalyze│ (∅)
vacuum_count│ 0
autovacuum_count│ 0
analyze_count   │ 0
autoanalyze_count   │ 0

wdsah=> select count(*) from facttable_wds_indexstats;
 count
────────
 857992
(1 row)

So, n_live_tup is way off, and n_tup_ins and n_mod_since_analyze also
seem to be wrong. Looks like this hasn't been updated in a year or so.
But track_counts is on:

wdsah=> show track_counts;
 track_counts 
──
 on
(1 row)

And even if it wasn't, shouldn't the autovacuum daemon notice that
n_mod_since_analyze is greater than n_live_tup *
autovacuum_analyze_scale_factor and run an autoanalyze?

But the really weird thing is that pg_stats seems to be reasonably
current: I see entries in most_common_vals which were only inserted in
January. Is it possible that autoanalyze runs without updating
pg_stat_user_tables?

hp




Re: [GENERAL] Loose indexscan and partial indexes

2017-02-10 Thread Peter J. Holzer
On 2017-02-10 14:24:36 +0100, Thomas Kellerer wrote:
> Peter J. Holzer schrieb am 10.02.2017 um 14:02:
> > So it's doing a sequential scan on the initial select in the recursive
> > CTE, but using the index on the subsequent selects.
> > 
> > But why? If it uses the index on
> > SELECT MIN(periodizitaet) FROM facttable_imf_ifs WHERE periodizitaet > 
> > 'x'
> > shouldn't it be able to use the same index on 
> > SELECT MIN(periodizitaet) FROM facttable_imf_ifs
> 
> What is the definition of the index facttable_imf_ifs_periodizitaet_idx?

The solution to the puzzle was just 2 paragraphs further down. 

Looks like I have to practice this arc of suspense thing ;-)

hp



[GENERAL] Loose indexscan and partial indexes

2017-02-10 Thread Peter J. Holzer
.51 rows=1 width=2) (actual time=0.039..0.040 rows=1 loops=1)
                 ->  Index Only Scan using facttable_imf_ifs_periodizitaet_idx on facttable_imf_ifs  (cost=0.44..1516760.47 rows=21080284 width=2) (actual time=0.038..0.038 rows=1 loops=1)
                       Index Cond: ((periodizitaet IS NOT NULL) AND (periodizitaet IS NOT NULL))
                       Heap Fetches: 1
   ->  WorkTable Scan on t t_1  (cost=0.00..6.19 rows=10 width=32) (actual time=0.161..0.162 rows=1 loops=4)
         Filter: (periodizitaet IS NOT NULL)
         Rows Removed by Filter: 0
         SubPlan 3
           ->  Result  (cost=0.59..0.60 rows=1 width=0) (actual time=0.212..0.212 rows=1 loops=3)
                 InitPlan 2 (returns $3)
                   ->  Limit  (cost=0.44..0.59 rows=1 width=2) (actual time=0.211..0.211 rows=1 loops=3)
                         ->  Index Only Scan using facttable_imf_ifs_periodizitaet_idx on facttable_imf_ifs facttable_imf_ifs_1  (cost=0.44..1061729.65 rows=7026761 width=2) (actual time=0.208..0.208 rows=1 loops=3)
                               Index Cond: ((periodizitaet IS NOT NULL) AND (periodizitaet > t_1.periodizitaet))
                               Heap Fetches: 2
 Planning time: 8.883 ms
 Execution time: 0.801 ms
(23 rows)

800 times faster :-).
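
For context, a hedged reconstruction of the recursive-CTE "loose index
scan" pattern that produces a plan of this shape (the exact query was cut
off above; this is the usual skip-scan idiom using the column from the
plan):

    with recursive t as (
        select min(periodizitaet) as periodizitaet from facttable_imf_ifs
      union all
        select (select min(periodizitaet) from facttable_imf_ifs
                where periodizitaet > t.periodizitaet)
        from t
        where t.periodizitaet is not null
    )
    select periodizitaet from t where periodizitaet is not null;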

hp



Re: [GENERAL] Are new connection/security features in order, given connection pooling?

2017-01-13 Thread Peter J. Holzer
On 2017-01-11 00:49:19 -0800, Guyren Howe wrote:
> I’m not following. What I would like is just a lightweight way to switch the
> connections to use a different role, or some moral equivalent, that would
> prevent an SQL injection from wrecking havoc. I’m not proposing anything that
> will change anything else about how the application is using the database.
> 
> SET ROLE doesn’t work, because an SQL injection can just SET ROLE back to the
> privileged user.

But then you are no worse off than with the commonly used scheme of
executing all queries as the same (necessarily "privileged") user.

In both cases the attacker can execute queries as a privileged user IF
he succeeds at sql injections.

But as others have already noted this is relatively easy to prevent.
Just preparing all queries is sufficient, even if you don't actually
parametrize them. Perl DBI does this, so this dangerous-looking line of
code (assume that the string wasn't hardcoded but the result of an SQL
injection):

$r = $dbh->selectall_arrayref("select * from twoqueries; insert into twoqueries(t) values('b')");

will fail with

DBD::Pg::db selectall_arrayref failed: ERROR:  cannot insert
multiple commands into a prepared statement at ./twoqueries line 21.

So I think just using 

set local role 

at the beginning of each transaction should work well with session
pooling. It doesn't protect you against sql injections, but you won't
have to reinvent the authorization system.
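
A minimal sketch of that pattern (the role name app_user_ro is made up):

    begin;
    set local role app_user_ro;  -- reverts automatically at commit/rollback
    select current_user;         -- now runs with app_user_ro's privileges
    commit;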

> I would like a regime where there is no permanent privileged relationship
> between the client application and the database; a user would need to supply
> validating information that we can use to switch a connection to something 
> with
> minimal privileges for just that role, for the duration of one session or
> transaction.

I haven't read the blog post referenced in this thread yet, so maybe
this is covered there, but I think "supplying validating information"
would be the hard part. In general you wouldn't want a web-frontend to
cache plain-text passwords to resubmit them for each transaction, but to
use something more ethereal, like session cookies or kerberos tickets.

hp



Re: [GENERAL] COPY: row is too big

2017-01-04 Thread Peter J. Holzer
On 2017-01-04 06:53:31 -0800, Adrian Klaver wrote:
> On 01/04/2017 05:00 AM, vod vos wrote:
> >Now I am confused about I can create 1100 columns in a table in
> >postgresql, but I can't copy 1100 values into the table. And I really
> 
> As pointed out previously:
> 
> https://www.postgresql.org/about/
> Maximum Columns per Table 250 - 1600 depending on column types
> 
> That being dependent on both the number of columns and the actual data in
> the columns.

I think this is confusingly phrased. In my mind "column type" is static -
the type is the same, independent of the values which are stored. So
"250 - 1600 depending on column types" implies to me that there is
some type A of which I can have only 250 columns and another type B of
which I can have 1600 columns. But it doesn't imply to me that the
number of columns depends on the values which are put into those columns.

May I suggest these improvements?

In https://www.postgresql.org/about/:
Instead of
| 250 - 1600 depending on column types
write
| 250 - 1600 depending on column types and data

In https://www.postgresql.org/docs/9.6/static/ddl-basics.html:
Replace the sentence:
| Depending on the column types, it is between 250 and 1600.
with:
| For all columns in a row, some information (either the data itself or
| a pointer to the data) must be stored in a single block (8 kB).
| Because for some types this data is itself of variable length, the
| maximum number of columns depends not only on the types of the columns
| but also on the data (e.g., a NULL uses less space than a non-NULL
| value). Therefore there is no simple way to compute the maximum number
| of columns, and it is possible to declare a table with more columns
| than can be filled. Keeping all this in mind, the limit is between 250
| and 1600.

hp



Re: [GENERAL] explain analyze showed improved results without changes, why?

2016-12-25 Thread Peter J. Holzer
On 2016-12-23 10:35:26 -0600, Chris Richards wrote:
> Howdy. I was noticing a significant problem with a query on one of my tables. 
> I
> tried recreating the problem and wasn't able to do so on a different install,
> and so a few days later I ran the same query on the problem table. Lo' and
> behold, there wasn't a problem anymore. I'm at a loss to why.
[...]
>     "blocks_off_sz_idx" btree (off, sz)
> 
> mdb=> explain analyze SELECT * FROM blocks
>  WHERE cloudidx=98038 AND off+sz >= 0 AND state='seeded'
>  ORDER BY off LIMIT 1 FOR UPDATE;
>                                                         QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.43..587.83 rows=1 width=100) (actual time=4814.579..4814.579 rows=1 loops=1)
>    ->  LockRows  (cost=0.43..1358633.99 rows=2313 width=100) (actual time=4814.577..4814.577 rows=1 loops=1)
>          ->  Index Scan using blocks_off_sz_idx on blocks  (cost=0.43..1358610.86 rows=2313 width=100) (actual time=4813.498..4814.384 rows=2 loops=1)
>                Filter: ((cloudidx = 98038) AND (state = 'seeded'::block_state) AND ((off + sz) >= 0))
>                Rows Removed by Filter: 6935023
>  Total runtime: 4814.619 ms
> (6 rows)

This scans the table in ascending (off, sz) order until it finds one row
matching the filter. Apparently at the time of the query there were
6935023 rows in the table before the matching row.

[...]
> And here's the second. Notice that even though there are more rows, it was 
> much
> faster and the "rows removed by filter" were significantly reduced by several
> orders of magnitude.
> 
> 
> mdb=> explain analyze SELECT * FROM blocks
>  WHERE cloudidx=98038 AND off+sz >= 0 AND state='seeded'
>  ORDER BY off LIMIT 1 FOR UPDATE;
>                                                                    QUERY PLAN
> -
>  Limit  (cost=0.43..584.57 rows=1 width=100) (actual time=0.071..0.071 rows=1 loops=1)
>    ->  LockRows  (cost=0.43..1390825.21 rows=2381 width=100) (actual time=0.070..0.070 rows=1 loops=1)
>          ->  Index Scan using blocks_off_sz_idx on blocks  (cost=0.43..1390801.40 rows=2381 width=100) (actual time=0.055..0.055 rows=1 loops=1)
>                Filter: ((cloudidx = 98038) AND (state = 'seeded'::block_state) AND ((off + sz) >= 0))
>                Rows Removed by Filter: 26
>  Total runtime: 0.114 ms
> (6 rows)

The plan here is exactly the same, but only 26 rows are discarded. My
guess is that between those two queries a row was inserted with a really
low (off, sz) value which matches the query. So now the query can return
after checking only a handful of rows.

LIMIT, EXISTS, etc. are awful when you want predictable performance. You
may be lucky and the rows you are looking for are just at the start or
you may be unlucky and you have to scan through the whole table to find
them. The optimizer (usually) doesn't have enough information and
assumes they are spread randomly through the table.

hp



Re: [GENERAL] Is is safe to use SPI in multiple threads?

2016-12-23 Thread Peter J. Holzer
On 2016-12-09 16:52:05 +0800, Qiu Xiafei wrote:
> I'm new to PG and want to implement my domain-specific system based on PG. I
> wish to arrange my data as several tables in database and translate my DSL 
> into
> SQL statements for query. Since one DSL statement may be mapped to several SQL
> statements, it's better to push the DSL server as close to the PG server as
> possible. I found PG's backgroud worker meet my needs. I can setup a 
> background
> worker bounded to PG server and listen to a port for network requests. 
> 
> But I encounter a problem that the Server Programing Interfaces are not THREAD
> SAFE. There are some global variables defined like: SPI_processed,
> SPI_tuptable, etc. This limit to my DSL server to work in single thread mode
> which is quite inefficient.

I had a similar requirement. I solved it by moving the application logic
out of the stored procedures. All the stored procedure does is an RPC
call (I use ØMQ for that) to a server process and send the result back
to the client. The server process converts the request into multiple SQL
queries which can be processed in parallel.

The downside is of course that the communication overhead is much
higher (A minimum of 4 network messages per request). That's not a
problem in my case, but you mileage may vary.

The advantages in my opinion are:

* A standalone server process is easier to test and debug than a bunch
  of stored procedures.
* I can easily scale out if necessary: Currently my database and server
  process run on the same machine, but I could distribute them over
  several machines with (almost) no change in logic.

hp



Re: [GENERAL] Index size

2016-12-10 Thread Peter J. Holzer
On 2016-12-09 21:45:35 -0500, Melvin Davidson wrote:
> On Fri, Dec 9, 2016 at 6:40 PM, Samuel Williams 
> <space.ship.travel...@gmail.com>
> wrote:
> >I also read that when you change a column which is not index, all the
> >indexes for that row need to be updated anyway. Is that correct?
> 
> That is not correct. Indexes are changed under the following conditions:
> A. An insert is done to the table which involves an index.
> B. A delete is done to the table which involves an index.
> C. An update is done that involves columns included in an index.
> D. An index is REINDEXed
> 
> Indexes point to the tid of the row for which the column(s) in the index
> are involved. So if columns updated are not involved in the index,
> there is no need to change the index.

I don't think this is generally correct. The TID is a (block, item)
tuple. If the updated version of the row doesn't fit into the same block
it has to be stored in a different block, so the TID will change (AIUI
there is a bit of trickery to avoid changing the TID if the new version
is stored in the same block). This means that all the index entries for
this row (not just for the changed field) will have to be updated. You
can set fillfactor to a smaller value to make this less likely.
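
For example (the table name and the value 70 are just illustrative):

    alter table mytable set (fillfactor = 70);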

hp



Re: [GENERAL] Ascii Elephant for text based protocols

2016-05-15 Thread Peter J. Holzer
On 2016-05-15 14:02:56 +0200, Charles Clavadetscher wrote:
> ++
> |     __  ___|
> |  /)/  \/   \   |
> | ( / ___\)  |
> |  \(/ o)  ( o)   )  |
> |   \_  (_  )   \ )  /   |
> | \  /\_/\)_/|
> |  \/  //|  |\\  |
> |  v |  | v  |
> |\__/|
> ||
> |  PostgreSQL 1996-2016  |
> |  20 Years of success   |
> +----+

Nice.

hp



Re: [GENERAL] Using both ident and password in pg_hba.conf

2016-05-14 Thread Peter J. Holzer
On 2016-05-09 16:18:39 -0400, D'Arcy J.M. Cain wrote:
> On Mon, 9 May 2016 13:02:53 -0700
> Adrian Klaver <adrian.kla...@aklaver.com> wrote:
> > So define PHP runs as 'nobody'?
> 
> Because of the way PHP and Apache works PHP script have to run as the
> Apache user which, in my case anyway, is "nobody" so every PHP script
> runs as nobody.

This is not really true. You can use FastCGI to run PHP for each site as
a different user. For Apache there is also an MPM
(http://mpm-itk.sesse.net/) which lets you run apache processes (and
therefore also any embedded mod_php) under different uids. So while
running everything as nobody is the default, it is possible to use
different users, and I would strongly recommend doing this if you have
multiple customers.

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] Allow disabling folding of unquoted identifiers to lowercase

2016-05-08 Thread Peter J. Holzer
On 2016-04-29 19:21:30 +0200, Evgeny Morozov wrote:
> It would be great if Postgres had a server setting that allowed the automatic
> folding of identifiers to lowercase to be disabled, so that camel case
> identifiers could be used without having to quote every single identifier, 
> i.e.
> 
> SELECT MyColumn FROM MyTable ORDER BY MyColumn
> 
> instead of
> 
> SELECT "MyColumn" FROM "MyTable" ORDER BY "MyColumn"
[...]
> My company is looking into doing this. Currently our table and column names
> exactly match our class and property names, which are in camel case. MSSQL
> supports this just fine. To move to Postgres we would have to either quote
> *everything* or translate names back-and-forth between code and database. Both
> options are OK for auto-generated SQL, but we also have many users writing
> ad-hoc SQL queries. Having to quote everything would have those users 
> screaming
> to switch back to MSSQL very quickly! That leaves us with the mapping 
> approach,
> which is doable, but also a constant "mental speedbump" at best.

What exactly is the problem you are trying to solve? 

If you and your users are consistent about never using quotes, your
users can write:

SELECT MyColumn FROM MyTable ORDER BY MyColumn;

It will select mycolumn from mytable, but that doesn't matter, since you
created the table with

CREATE TABLE MyTable (MyColumn varchar);

so you really have a table mytable with a column mycolumn, not a table
MyTable with a column MyColumn.

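A quick illustration of the folding (the names are just examples):

    CREATE TABLE MyTable (MyColumn varchar); -- really creates mytable(mycolumn)
    SELECT MyColumn FROM MyTable;            -- works: folded to mytable/mycolumn
    SELECT "MyColumn" FROM "MyTable";        -- fails: relation "MyTable" does not exist
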
There are three potential problems I can see:

1) Users might be confused that PgAdmin (or whatever tool they use to
   inspect the database) displays all the names in lowercase, and they
   might find a name like sometablewithareallylongname less readable
   than SomeTableWithAReallyLongName.

2) Since case doesn't matter, they might be inconsistent: One programmer
   might write MyTable, another mytable, the third MYTABLE, ...

3) You might want to use a tool to automatically generate SQL queries,
   but that tool quotes identifiers.

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] Distributed Table Partitioning

2016-03-13 Thread Peter J. Holzer
On 2016-03-12 21:19:11 -0500, Melvin Davidson wrote:
> - Original Message -
> From: "Leonardo M. Ramé" <l.r...@griensu.com>
> To: "PostgreSql-general" <pgsql-general@postgresql.org>
> Sent: Saturday, 12 March, 2016 8:25:01 PM
> Subject: [GENERAL] Distributed Table Partitioning
> 
> I have this problem: a Master table containing records with a timestamp
> column registering creation date-time, and one Detail table containing
> info related to the Master table.
> 
> As time went by, those tables grew enormously, and I can't afford
> expanding my SSD VPS. So I'm thinking about storing only NEW data into
> it, and move OLD data to a cheaper SATA VPS.
[...]
> Why don't you just make use of tablespaces and partition the child tablespaces
> so that the newer parttion is on the SSD and the older one is on SATA?

Since he mentioned virtual private servers (VPS) the reason might be
that his hoster offers VPS with SSDs (of various sizes) and VPS with
rotating hard disks (of various sizes), but not VPS with both. So he
can't rent a VPS with a relatively small SSD and a larger hard disk. 

That might be a reason to look for an alternate hoster, but if he's
otherwise happy, switching to an unknown provider might be considered
too large a risk.
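
For reference, the approach Melvin suggests would look roughly like this
(only a sketch: the tablespace locations, the partition names and the
created_at column are made up):

    CREATE TABLESPACE fast_ssd  LOCATION '/mnt/ssd/pgdata';
    CREATE TABLESPACE slow_sata LOCATION '/mnt/sata/pgdata';

    -- inheritance partitioning: new data on the SSD, old data on SATA
    CREATE TABLE master_2016_03 (
        CHECK (created_at >= '2016-03-01' AND created_at < '2016-04-01')
    ) INHERITS (master) TABLESPACE fast_ssd;

    CREATE TABLE master_2015 (
        CHECK (created_at < '2016-01-01')
    ) INHERITS (master) TABLESPACE slow_sata;

    -- and a partition can be moved wholesale once it has aged:
    ALTER TABLE master_2016_03 SET TABLESPACE slow_sata;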

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] Unable to match same value in field.

2016-03-12 Thread Peter J. Holzer
On 2016-03-10 11:09:00 +0200, Condor wrote:
> I using postgresql 9.5.1 and I have problem to match value in one field.
> Both tables are text:
[...]
> =# select imsi from list_cards_tbl where imsi = '28411123315';
>  imsi
> --
> (0 rows)
> 
> No value, lets change to LIKE
> 
> =# select imsi, md5(imsi), bit_length(imsi) from list_cards_tbl where imsi
> like '28411123315%';
>   imsi   |   md5| bit_length
> -+--+
>  28411123315 | b438e984c97483bb942eaaed5c0147f3 |120
> (1 row)

That looks familiar. I think I've seen something similar recently. That
was on 9.5beta1 (I haven't gotten around to upgrade to 9.5.1 yet).

> =# reindex table list_cards_tbl;
> REINDEX
[...]
> Still cant find value.

Dropping and recreating the index helped in my case. Still, I find it
worrying if a value which obviously is in the table can't be found using
the index.
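
If you want to try that, too, it would be something like this (I'm
guessing at your index name, adjust as needed):

    -- REINDEX didn't help in my case, dropping and recreating did:
    DROP INDEX list_cards_tbl_imsi_idx;
    CREATE INDEX list_cards_tbl_imsi_idx ON list_cards_tbl (imsi);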

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] index problems (again)

2016-03-12 Thread Peter J. Holzer
On 2016-03-12 21:00:04 +, Geoff Winkless wrote:
> On 12 March 2016 at 18:43, Peter J. Holzer <hjp-pg...@hjp.at> wrote:
> > The question is what can be done to improve the situation.
> >
> > Tom thinks that correlation statistics would help. That seems plausible
> > to me.
[...]
> > You claim that no statistics are needed.
> 
> Well that's a bit confrontational.

Sorry. Didn't want to sound confrontational. I was just repeating points
made by Tom and you previously in this thread to establish a baseline.

> > That may or may not be true: You haven't proposed an alternate method
> > yet.
> 
> You could make an assumption that perfect distribution isn't true:
> that actually the distribution is within a certain _deviation_ of that
> perfect distribution. It wouldn't have to have been very much to make
> the index-only scan win here and would still keep the planner from
> choosing less optimal queries most of the time (and where it did end
> up making the "wrong" choice it's not going to be far off anyway).
> 
> But I'm making assumptions here, I'm aware of that. Chances are that
> actually most people's data _does_ fit into this perfect distribution
> set. Is there any research that shows that real-world data usually
> does?

I don't think most people's data is perfectly distributed. But as you
say most data is probably within some deviation of being perfectly
distributed and as long as that deviation isn't too big it doesn't
matter.

But there are certainly some common examples of highly correlated
columns. Having a serial id and a date as in your case is probably quite
common. Another example might be a surrogate primary key which is
computed from some other fields (e.g. a timeseries code starting with a
country code, or a social security number starting with the birth date,
...). That's probably not that uncommon either. 

So, I agree with you. This is a problem and it should be fixed. I'm just
sceptical that it can be done with a simple cost adjustment.


> As Jeff points out I'd have a much larger win in this instance by
> someone spending the time implementing skip index scans rather than
> messing with the planner :)

Yeah. I think I have some code which could benefit from this, too. I'll
have to try that trick from the wiki.
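
(For the archives: I assume that's the recursive-CTE emulation of a
loose index scan. Roughly -- with placeholder table/column names:

    WITH RECURSIVE t AS (
        SELECT min(col) AS col FROM tbl
      UNION ALL
        SELECT (SELECT min(col) FROM tbl WHERE col > t.col)
        FROM t WHERE t.col IS NOT NULL
    )
    SELECT col FROM t WHERE col IS NOT NULL;

This probes the index once per distinct value instead of scanning it
completely.)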

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] index problems (again)

2016-03-12 Thread Peter J. Holzer
On 2016-03-08 10:16:57 +, Geoff Winkless wrote:
> On 7 March 2016 at 20:40, Peter J. Holzer <hjp-pg...@hjp.at> wrote:
> > As Tom wrote, the estimate of having to read only about 140 rows is only
> > valid if sc_id and sc_date are uncorrelated. In reality your query has
> > to read a lot more than 140 rows, so it is much slower.
> 
> But as I've said previously, even if I do select from scdate values
> that I know to be in the first 1% of the data (supposedly the perfect
> condition) the scan method is insignificantly quicker than the index
> (scdate,scid) method.

Actually the planner expects to find a match within the first 0.0035 %, so
to find out how fast that would be you would have to use a value from
that range.

> Even with the absolute perfect storm (loading in the entire index for
> the full range) it's still not too bad (1.3 seconds or so).
> 
> The point is that to assume, knowing nothing about the data, that the
> data is in an even distribution is only a valid strategy if the worst
> case (when that assumption turns out to be wildly incorrect) is not
> catastrophic. That's not the case here.

True. The fundamental problem here is that the planner doesn't have any
notion of a worst case. It only knows "cost", and that is a single
number for each operation. For many operations, both the best case and
the worst case are unusable as cost - the first would almost always
underestimate the time and choose a plan which is far from optimal and
the second would almost always overestimate it and reject an optimal
plan. The art of programming a planner (which I've dabbled with in a
previous (not postgresql-related) project but certainly can't claim any
expertise in) lies in choosing a cost function which is quite close most
of the time and catastrophically wrong only very rarely. It is clear
that PostgreSQL hasn't succeeded in the latter category: Correlated
columns do occur, and the current cost function, which assumes that all
columns are uncorrelated, can catastrophically underestimate the cost in
this case. 

The question is what can be done to improve the situation.

Tom thinks that correlation statistics would help. That seems plausible
to me.

You claim that no statistics are needed.

That may or may not be true: You haven't proposed an alternate method
yet.

I feel fairly certain that using the worst case (the cost for scanning
the whole table) would be just as bad and would cause inferior plans
to be used in many instances.

Maybe computing the cost as weighted average of the best, average and
worst case (e.g. cost = cost_best*0.05 + cost_avg*0.90 + cost_worst*0.05)
would penalize methods with a large spread between best and worst case
enough - but that still leaves the problem of determining the weights
and determining what the "average" is. So it's the same black magic as
now, just the little more complicated (on the plus side, this would
probably be a relatively simple patch).

If we assume that we could revamp the planner completely, other
possibilities come to mind:

For example, since I think that the core problem is having a single
number for the cost, the planner could instead compute a distribution
(in the most simple case just best and worst case, but ideally many
values). Then the planner could say something like: I have two plans A
and B, and A is at most 20 % faster in almost all cases. But in the worst
case, A is 1000 times slower. Being 20 % faster most of the time is nice
but doesn't outweigh the risk of being 1000 times slower sometimes, so
I'll use B anyway. 

Another possibility I've been considering for some time is feeding back
the real execution times into the planner, but that sounds like a major
research project. (Actually I think Oracle does something like this
since version 12)

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] index problems (again)

2016-03-07 Thread Peter J. Holzer
On 2016-03-07 16:37:37 +, Geoff Winkless wrote:
> On 7 March 2016 at 16:02, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > In English, what that plan is trying to do is scan the index
> > in sc_id order until it hits a row with scdate in the target range.
> > The first such row, by definition, has the correct min(sc_id) value.
> > The problem is that we're guessing at how soon we'll hit such a row.
> > If the columns are independent, then the planner can guess based on how
> > many rows in the whole table have scdate in the target range, and it
> > will probably be about right.  But that estimate can fall down very
> > badly if sc_id and scdate increase together, because then the target
> > rows aren't randomly distributed in the index sequence but could all be
> > all the way at the far end of the index.
> 
> I'm sorry, I'm obviously not being clear. I already accepted this
> argument when Victor gave it, although I believe that in part it falls
> down because sc_id is also (potentially) randomly distributed so it's
> not like you're doing a sequential table scan (it might work better on
> a clustered table, but we don't have those :) )
> 
> So you still have an extra layer of indirection into a large table
> with lots of random accesses.
> 
> > If we had cross-column correlation stats we could detect this pitfall,
> > but without that it's hard to do.
> 
> But as far as I can see, apart from the absolute extremes, the
> index-only scan is _always_ going to be quicker than the index+table
> scan.

We are talking about an "absolute extreme" here. You have about 420 date
values and you are looking for 3 of them. Assuming for the moment that
your distribution is uniform, that's 1/140th of the whole table.

So if PostgreSQL were using the (sc_date,sc_id) index, it would have to
scan 4E6/140 = 29000 index entries, extract the id value and get the
minimum of those 29000 values.

OTOH, if it uses the sc_id index, it only expects to have to scan 140
entries until it finds a matching entry. And then it is finished.

So it's 140 index entries plus 140 row accesses versus 29000 index entries.
To choose the second plan, the planner would have to estimate that
reading a random row is more than 200 times slower than reading an index
entry, which apparently it doesn't.

As Tom wrote, the estimate of having to read only about 140 rows is only
valid if sc_id and sc_date are uncorrelated. In reality your query has
to read a lot more than 140 rows, so it is much slower.


> I don't believe you need any further statistics than what is currently
> available to be able to make that judgement, and that's why I believe
> it's suboptimal.

We all know it is suboptimal, but unfortunately, without additional
statistics I don't think there is a better way. The other way around -
assuming that the columns are correlated in the worst possible way -
would remove viable plans in many cases. 

This is, I think one of the places where hints are a good idea. The
programmer sometimes knows more about the characteristics of the data
than the planner can possibly know and it is a pity that there is no way
for the programmer to pass that knowledge to the planner. (And yes, I
know that quite often the programmer is wrong - but I do believe in
giving people enough rope to hang themselves with)

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] BRIN Usage

2016-02-21 Thread Peter J. Holzer
On 2016-02-18 13:37:37 -0500, Tom Smith wrote:
> it is for reducing index size as the table become huge. 
> sorry for confusion, by timestamp, I meant a time series number, not the sql
> timestamp type.
> I need the unique on the column to ensure no duplicate,   but the btree index
> is getting
> huge so BRIN seems to solve problem but can not ensure unique

If it is getting huge, then this is because there are a large number of
timestamps. If you want an index to ensure uniqueness, it will have to
store every value. I don't think there's a way around that. 

With a BRIN index, you would only get a list of page ranges which could
possibly contain the new value. All these pages would then have to be
scanned sequentially to make sure it isn't already there. That could be
implemented, but it would make inserts very slow - I don't think you
would want that on a huge table even if postgres implemented it.
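
In SQL terms (sketch, table/column names assumed):

    -- enforces uniqueness, but has to store every value => large:
    CREATE UNIQUE INDEX series_ts_uniq ON series_tbl (ts);

    -- stores only a summary (min/max) per block range => tiny,
    -- but cannot enforce uniqueness:
    CREATE INDEX series_ts_brin ON series_tbl USING brin (ts);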

hp


signature.asc
Description: Digital signature


Re: [GENERAL] strange sql behavior

2016-02-06 Thread Peter J. Holzer
On 2016-02-01 12:35:35 -0600, Yu Nie wrote:
> Recently I am working with a large amount of taxis GIS data and had 
> encountered
> some weird performance issues.  I am hoping someone in this community can help
> me figure it out.
> 
> The taxi data were loaded in 5 minute block into a table.  I have two separate
> such tables, one stores a month of data with about 700 million rows, another
> stores about 10 days of data with about 300 million rows.  The two tables have
> the exactly same schema and indexes. There are two indexes: one on taxiid
> (text), and the other on the time stamp (date time).  In order to process the
> data, I need to get all points for a single taxis; to do that, I use something
> like:
>  select * from table1 where taxiid = 'SZB00S41' order by time;
> What puzzled me greatly is that this query runs consistently much faster for
> the large table than for the small table, which seems to contradict with
> intuition.
[...]

> Results for the small table: it took 141 seconds to finish.  The planning time
> is 85256.31
> 
> "Sort  (cost=85201.05..85256.31 rows=22101 width=55) (actual time=
> 141419.499..141420.025 rows=20288 loops=1)"
> "  Sort Key: "time""
> "  Sort Method: quicksort  Memory: 3622kB"
> "  Buffers: shared hit=92 read=19816"
> "  ->  Bitmap Heap Scan on data2013_01w  (cost=515.86..83606.27 rows=22101
> width=55) (actual time=50.762..141374.777 rows=20288 loops=1)"
> "    Recheck Cond: ((taxiid)::text = 'SZB00S41'::text)"
> "    Heap Blocks: exact=19826"
> "    Buffers: shared hit=92 read=19816"
  ^^
[...]
> 
> Results for the large table: it took 5 seconds to finish.  The planning time 
> is
> 252077.10
> "Sort  (cost=251913.32..252077.10 rows=65512 width=55) (actual time=
> 5038.571..5039.765 rows=44204 loops=1)"
> "  Sort Key: "time""
> "  Sort Method: quicksort  Memory: 7753kB"
> "  Buffers: shared hit=2 read=7543"
> "  ->  Bitmap Heap Scan on data2011_01  (cost=1520.29..246672.53 rows=65512
> width=55) (actual time=36.935..5017.463 rows=44204 loops=1)"
> "    Recheck Cond: ((taxiid)::text = 'SZB00S41'::text)"
> "    Heap Blocks: exact=7372"
> "    Buffers: shared hit=2 read=7543"
 ^
[...]

The obvious difference is that the query for the smaller table needs to
read about 2.5 times as many blocks (for 1/3 of the records) from the
disk. This suggests that the data for a single taxi is more localized in
the larger table. In addition, the average time per block on the smaller
table is about 7 ms, which is a typical random seek time for a disk. So
the blocks are probably randomly scattered through the table. For the
larger table, the average time is well below 1 ms, so there are probably
many consecutive blocks to read.

There are 2880 5 minute intervals in 10 days. You have about 22k records
per taxi, so there are about 7.6 records for each taxi per interval.
This is very close to the number of records per block in your second
query (65512/7372 = 8.9). I suspect that the records in your larger
table are sorted by taxiid within each interval. 

You can almost certainly get a similar speedup by sorting each 5 minute
interval by taxi id before appending it to the table.

If querying by taxiid is typical and your table is static, you should
consider clustering the table by taxiid. If your table is updated every
5 minutes, you could partition it by day and cluster each partition as
soon as it is not written any more.
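
Roughly like this (a sketch -- I'm assuming an index on taxiid and
making up its name and that of the staging table):

    -- one-off physical reordering by taxiid (takes an exclusive lock):
    CLUSTER data2013_01w USING data2013_01w_taxiid_idx;

    -- or sort each 5 minute batch before appending it:
    INSERT INTO data2013_01w
    SELECT * FROM staging_5min ORDER BY taxiid;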

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] plperlu stored procedure seems to freeze for a minute

2015-12-04 Thread Peter J. Holzer
On 2015-12-03 10:02:18 -0500, Tom Lane wrote:
> "Peter J. Holzer" <hjp-pg...@hjp.at> writes:
> > Can those signals be safely ignored? Just blocking them (so that they
> > are delivered after the UDF finishes) might be safer. But even that may
> > be a problem: If the UDF then executes some SQL, could that rely on
> > signals being delivered? I have no idea. 
> 
> The minute you start fooling with a backend's signal behavior, we're
> going to politely refuse to support whatever breakage you run into.

As I understood Jim he was talking about possible changes to postgresql
to shield UDFs from those signals, not something the author of a UDF
should do.


> We aren't sending those signals just for amusement's sake.

Right. That's why I was sceptical whether those signals could be
ignored. I wouldn't have thought so, but Jim clearly knows a lot more
about the inner workings of postgresql than I do (which is easy - I know
almost nothing) and maybe he knows of a way (something like "we can
ignore signals while executing the UDF and just assume that we missed at
least one signal and call the magic synchronize state function
afterwards")

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] plperlu stored procedure seems to freeze for a minute

2015-12-03 Thread Peter J. Holzer
On 2015-12-02 19:07:55 -0600, Jim Nasby wrote:
> On 12/2/15 9:26 AM, Peter J. Holzer wrote:
> >As explained in backend/utils/misc/timeout.c, the timers are never
> >cancelled: If a timeout is cancelled, postgres just sees that it has
> >nothing to do and resumes whatever it is doing.
> 
> Hrm, if those timers are really just for auth purposes then perhaps they
> should be cancelled. But aside from that, there's certainly other things
> that can signal a backend (including fairly normal things, like DDL).

Yep. I noticed that, too. In one of my test runs I got two signals
instead of the one I expected. Checking the logs I found that it seemed to be
caused by another user dropping a table.

> Offhand I don't think functions run in a CRITICAL block (and I don't think
> it'd be a good idea for them to). So really, functions have to be handle
> being interrupted.

Right. I think that should be mentioned somewhere in the manual.
Something like this:

Note: PostgreSQL uses signals for various purposes. These signals
may be delivered while a user-defined function is executed.
Therefore user-defined functions must be able to handle being
interrupted, in particular they must expect system calls to fail
with errno=EINTR and handle that case appropriately.

I'm not sure wether that's an issue with all procedural languages. If it
is, it should probable go into "Chapter 39. Procedural Languages". If it
is specific to plperl(u), I would put it in "42.8.2. Limitations and
Missing Features".

> Yeah, it'd be nice to detect that this had happened. Or maybe it's worth it
> to ignore SIGALARM while a UDF is running.

Can those signals be safely ignored? Just blocking them (so that they
are delivered after the UDF finishes) might be safer. But even that may
be a problem: If the UDF then executes some SQL, could that rely on
signals being delivered? I have no idea. 

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] plperlu stored procedure seems to freeze for a minute

2015-12-02 Thread Peter J. Holzer
On 2015-12-01 20:55:02 +0100, Peter J. Holzer wrote:
> On 2015-12-01 18:58:31 +0100, Peter J. Holzer wrote:
> > I suspect such an interaction because I cannot reproduce the problem
> > outside of a stored procedure. A standalone Perl script doing the same
> > requests doesn't get a timeout.
[...]
> The strace doesn't show a reason for the SIGALRM, though. No alarm(2) or
> setitimer(2) system call (I connected strace to a running postgres
> process just after I got the prompt from "psql" and before I typed
> "select * from mb_search('export');" (I used a different (but very
> similar) stored procedure for those tests because it is much easier to
> find a search which is slow enough to trigger a timeout at least
> sometimes than a data request (which normally finishes in
> milliseconds)).
> 
> So I guess my next task will be to find out where that SIGALRM comes
> from and/or whether I can just restart the zmq_msg_recv if it happens. 

Ok, I think I know where that SIGALRM comes from: It's the
AuthenticationTimeout. What I'm seeing in strace (if I attach it early
enough) is that during authentication the postgres worker process calls
setitimer with a 60 second timeout twice. This matches the comment in
backend/postmaster/postmaster.c:

 * Note: AuthenticationTimeout is applied here while waiting for the
 * startup packet, and then again in InitPostgres for the duration of any
 * authentication operations.  So a hostile client could tie up the
 * process for nearly twice AuthenticationTimeout before we kick him off.

As explained in backend/utils/misc/timeout.c, the timers are never
cancelled: If a timeout is cancelled, postgres just sees that it has
nothing to do and resumes whatever it is doing. 

This is also what I'm seeing: 60 seconds after start, the process
receives a SIGALRM. 
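
For anyone else chasing this: the setting behind those timers is
authentication_timeout, whose default is 1min:

    SHOW authentication_timeout;   -- 1min by default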

If the process is idle or in a "normal" SQL statement at the time, that's
not a problem. But if it is in one of my stored procedures which is
currently calling a ØMQ function which is waiting for some I/O
(zmq_msg_recv(), most likely), that gets interrupted and returns an
error which my code doesn't know how to handle (yet). So the error gets
back to the user. 

A strange interaction between postgres and ØMQ indeed. But now that I
know what's causing it I can handle that. Thanks for your patience.

hp


-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] plperlu stored procedure seems to freeze for a minute

2015-12-01 Thread Peter J. Holzer
On 2015-12-01 13:13:27 -0500, Tom Lane wrote:
> "Peter J. Holzer" <hjp-pg...@hjp.at> writes:
> > Postgres worker processes are single-threaded, are they? Is there
> > something else which could interact badly with a moderately complex
> > multithreaded I/O library used from a stored procedure? 
> 
> Yes, lots.  If you cause additional threads to appear inside a backend
> process, things could break arbitrarily badly.  It's up to you to ensure
> that none of those extra threads ever escape to execute any non-Perl
> code.

Actually, non-ØMQ code. Perl doesn't like to be unexpectedly
multithreaded either. Yes, those threads should only ever execute code
from the ØMQ library. In fact they are automatically created and
destroyed by the library and there is no way to control them from Perl
code (there may be a way to do that from the C API, but I don't remember
seeing that in the manual).

> I suspect this could easily explain the problems you're seeing.

Quite.

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] plperlu stored procedure seems to freeze for a minute

2015-12-01 Thread Peter J. Holzer
On 2015-12-01 07:16:04 -0800, Adrian Klaver wrote:
> On 12/01/2015 06:51 AM, Peter J. Holzer wrote:
> >A rather weird observation from the log files of our server (9.5 beta1):
> >
> >2015-12-01 09:23:37 CET [26265]: [42-1] user=fragner,db=wds,pid=26265 
> >WARNING:  Use of uninitialized value $success in concatenation (.) or string 
> >at /usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
> >2015-12-01 09:23:37 CET [26265]: [43-1] user=fragner,db=wds,pid=26265 
> >CONTEXT:  PL/Perl function "mb_timeseriesdata_zmq"
> >[lots of other stuff from different connections]
> >2015-12-01 09:24:45 CET [26265]: [44-1] user=fragner,db=wds,pid=26265 ERROR: 
> > impossible result '' (payload=) at 
> >/usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
> > WDS::Macrobond::Utils::decode_result("") called at line 30
> > main::__ANON__("gen_wqehur") called at -e line 0
> > eval {...} called at -e line 0
> >
> >Two messages from the same line of the same plperlu stored procedure, 68
> >seconds apart. So what is this line 36?
> >
> > confess "impossible result '$success' (payload=$payload)";
> >
> >What? The first message clearly comes from interpolating $success
> >(which is undef at that point) into the argument. The second from
> >confess itself. What could cause a plperlu procedure to freeze for 68
> >seconds between the call to confess and its output?
> >
> >Is it possible that only the writing of the log entry is delayed?
> >
> >Another weird thing: $success is undef because a ØMQ rpc call[1] timed
> 
> And the call is?

The sequence is:

my $req_sck = zmq_socket($context, ZMQ_REQ);
zmq_connect($req_sck, $url);
my $qry_msg = join(...);
zmq_send($req_sck, $qry_msg);
my $res_msg = zmq_msg_init();
my $rv = zmq_msg_recv($res_msg, $req_sck);
# check rv here.
my $data = zmq_msg_data($res_msg); # $data is "" here
my $result = WDS::Macrobond::Utils::decode_result($data); # the error messages 
are from this function

(Yeah, ØMQ is quite low-level. There is a higher level Perl Module, but
I'm not using it). 

I omitted that because I don't think it's terribly relevant here.
Details of the usage of ØMQ are better discussed on the ØMQ mailing
list.

But there is something else which may be relevant: ØMQ uses threads
internally, and I don't actually know whether zmq_msg_recv returning
means that the read(2) call (or whatever) on the socket terminates.
It may actually continue in another thread. But I still don't see how
that could block the main thread (or wake it up again in a place which
has nothing to do with ØMQ (confess is a standard Perl function to print
a stack trace and die)). Or - just thinking aloud here - I fear I'm
abusing you guys as support teddy bears[1] - maybe it's the other way
round: confess dies, so maybe it frees some lock during cleanup which
allows the message which should have been sent by zmq_send to finally go
out on the wire. But that still doesn't explain the 68 seconds spent in
confess ...

Postgres worker processes are single-threaded, are they? Is there
something else which could interact badly with a moderately complex
multithreaded I/O library used from a stored procedure? 

I suspect such an interaction because I cannot reproduce the problem
outside of a stored procedure. A standalone Perl script doing the same
requests doesn't get a timeout.

I guess Alvaro is right: I should strace the postgres worker process
while it executes the stored procedure. The problem of course is that
it happens often enough be annoying, but rarely enough that it's not
easily reproducible.

> >out (after 60 seconds, which is also a mystery, because ØMQ doesn't seem
> >to have a default timeout of 60 seconds, and I don't set one). But at
> 
> Network timeout?

That was my first guess, but I don't see where it would come from. Or
why it only is there if I call the code from a stored procedure, not
from a standalone script.

> >09:24:45 (i.e. the time of the error message) the answer for that RPC
> >call arrived. So it kind of looks like confess waited for the message to
> >arrive (which makes no sense at all) or maybe that confess waited for
> >something which also blocked the sending of the request (because
> >according to the server logs, the RPC request only arrived there at
> >09:24:45 and was answered within 1 second), but that doesn't make any
> 
> So if the request timed out how did you get a reply, a second request?

Nope. I don't really "get" the reply. I just see in the logs of the
other server that it sent a reply at that time. 

The time line is like this

time    postgres process        mb_dal process
T       zmq_send()
        zmq_msg_recv()

Re: [GENERAL] plperlu stored procedure seems to freeze for a minute

2015-12-01 Thread Peter J. Holzer
On 2015-12-01 10:20:09 -0800, Adrian Klaver wrote:
> On 12/01/2015 09:58 AM, Peter J. Holzer wrote:
> >On 2015-12-01 07:16:04 -0800, Adrian Klaver wrote:
> >>On 12/01/2015 06:51 AM, Peter J. Holzer wrote:
> >>>A rather weird observation from the log files of our server (9.5 beta1):
> >>>
> >>>2015-12-01 09:23:37 CET [26265]: [42-1] user=fragner,db=wds,pid=26265 
> >>>WARNING:  Use of uninitialized value $success in concatenation (.) or 
> >>>string at /usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
> >>>2015-12-01 09:23:37 CET [26265]: [43-1] user=fragner,db=wds,pid=26265 
> >>>CONTEXT:  PL/Perl function "mb_timeseriesdata_zmq"
> >>>[lots of other stuff from different connections]
> >>>2015-12-01 09:24:45 CET [26265]: [44-1] user=fragner,db=wds,pid=26265 
> >>>ERROR:  impossible result '' (payload=) at 
> >>>/usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
> >>> WDS::Macrobond::Utils::decode_result("") called at line 30
> >>> main::__ANON__("gen_wqehur") called at -e line 0
> >>> eval {...} called at -e line 0
> >>>
> >>>Two messages from the same line of the same plperlu stored procedure, 68
> >>>seconds apart. So what is this line 36?
> >>>
> >>> confess "impossible result '$success' (payload=$payload)";
> >>>
> >>>What? The first message clearly comes from interpolating $success
> >>>(which is undef at that point) into the argument. The second from
> >>>confess itself. What could cause a plperlu procedure to freeze for 68
> >>>seconds between the call to confess and its output?
> >>>
> >>>Is it possible that only the writing of the log entry is delayed?
> >>>
> >>>Another weird thing: $success is undef because a ØMQ rpc call[1] timed
> >>
> >>And the call is?
> >
> >The sequence is:
> >
> >my $req_sck = zmq_socket($context, ZMQ_REQ);
> >zmq_connect($req_sck, $url);
> >my $qry_msg = join(...);
> >zmq_send($req_sck, $qry_msg);
> >my $res_msg = zmq_msg_init();
> >my $rv = zmq_msg_recv($res_msg, $req_sck);
> ># check rv here.
> >my $data = zmq_msg_data($res_msg); # $data is "" here
> >my $result = WDS::Macrobond::Utils::decode_result($data); # the error 
> >messages are from this function
> >
> >(Yeah, ØMQ is quite low-level. There is a higher level Perl Module, but
> >I'm not using it).
> >
> >I omitted that because I don't think it's terribly relevant here.
> >Details of the usage of ØMQ are better discussed on the ØMQ mailing
> >list.
> >
> >But there is something else which may be relevant: ØMQ uses threads
> >internally, and I don't actually know whether zmq_msg_recv returning
> 
> Except I see this here:
> 
> http://api.zeromq.org/4-0:zmq-socket
> 
> Thread safety
> 
> ØMQ sockets are not thread safe. Applications MUST NOT use a socket from
> multiple threads except after migrating a socket from one thread to another
> with a "full fence" memory barrier.

Well yes, but I don't use a ØMQ socket in multiple threads. It is
created in the stored procedure and destroyed at the end (just checked
the strace output: Yes it is. For a moment I wasn't sure whether lexical
variables in plperlu procedures go out of scope.). It's the ØMQ library
itself which creates extra threads (And it should terminate them
properly and afaics from strace it does).


[...]
> From here:
> 
> http://api.zeromq.org/4-0:zmq-connect
> 
> It seems something like(I am not a Perl programmer, so approach carefully):
> 
> my $rc = zmq_connect($req_sck, $url);
> 
> Then you will have an error code to examine.

Yes. I have been a bit sloppy with error checking. I check only the
return value of the zmq_msg_recv() call which returns the empty message.
It is possible that the problem actually occurs earlier and I should
check those calls as well. Mea culpa.

However, in at least one case the failed call was indeed zmq_msg_recv()
not one of the earlier ones (see my next mail).

> Have you looked at the Notes at the bottom of this page:
> 
> http://www.postgresql.org/docs/9.4/interactive/plperl-trusted.html

I have. I don't think that's a problem here: Debian perl is built
with both multiplicity and ithreads, and I would assume that the .deb
packages from postgresql.org use the shared library provided by the
system.

But even if that wasn't the case it should not be a problem as only
plperlu stored procedures are called.

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] plperlu stored procedure seems to freeze for a minute

2015-12-01 Thread Peter J. Holzer
On 2015-12-01 18:58:31 +0100, Peter J. Holzer wrote:
> I suspect such an interaction because I cannot reproduce the problem
> outside of a stored procedure. A standalone Perl script doing the same
> requests doesn't get a timeout.
> 
> I guess Alvaro is right: I should strace the postgres worker process
> while it executes the stored procedure. The problem of course is that
> it happens often enough to be annoying, but rarely enough that it's not
> easily reproducible.

I did manage to catch a timeout once with strace in the mean time,
although that one was much more straightforward and less mysterious than
the original case: postgres process sends message, about 10 seconds
later it receives a SIGALRM which interrupts an epoll, reply hasn't yet
arrived, error message to client and log file. No waits in functions
which shouldn't wait or messages which arrive much later than they were
(presumably) sent.

The strace doesn't show a reason for the SIGALRM, though. No alarm(2) or
setitimer(2) system call (I connected strace to a running postgres
process just after I got the prompt from "psql" and before I typed
"select * from mb_search('export');" (I used a different (but very
similar) stored procedure for those tests because it is much easier to
find a search which is slow enough to trigger a timeout at least
sometimes than a data request (which normally finishes in
milliseconds)).

So I guess my next task will be to find out where that SIGALRM comes
from and/or whether I can just restart the zmq_msg_recv if it happens. 

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


[GENERAL] plperlu stored procedure seems to freeze for a minute

2015-12-01 Thread Peter J. Holzer
A rather weird observation from the log files of our server (9.5 beta1):

2015-12-01 09:23:37 CET [26265]: [42-1] user=fragner,db=wds,pid=26265 WARNING:  
Use of uninitialized value $success in concatenation (.) or string at 
/usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
2015-12-01 09:23:37 CET [26265]: [43-1] user=fragner,db=wds,pid=26265 CONTEXT:  
PL/Perl function "mb_timeseriesdata_zmq" 
[lots of other stuff from different connections]
2015-12-01 09:24:45 CET [26265]: [44-1] user=fragner,db=wds,pid=26265 ERROR:  
impossible result '' (payload=) at 
/usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36. 
WDS::Macrobond::Utils::decode_result("") called at line 30
main::__ANON__("gen_wqehur") called at -e line 0
eval {...} called at -e line 0

Two messages from the same line of the same plperlu stored procedure, 68
seconds apart. So what is this line 36?

confess "impossible result '$success' (payload=$payload)";

What? The first message clearly comes from interpolating $success
(which is undef at that point) into the argument. The second from
confess itself. What could cause a plperlu procedure to freeze for 68
seconds between the call to confess and its output?

Is it possible that only the writing of the log entry is delayed?

Another weird thing: $success is undef because a ØMQ rpc call[1] timed
out (after 60 seconds, which is also a mystery, because ØMQ doesn't seem
to have a default timeout of 60 seconds, and I don't set one). But at
09:24:45 (i.e. the time of the error message) the answer for that RPC
call arrived. So it kind of looks like confess waited for the message to
arrive (which makes no sense at all) or maybe that confess waited for
something which also blocked the sending of the request (because
according to the server logs, the RPC request only arrived there at
09:24:45 and was answered within 1 second), but that doesn't make any
sense either. (Just noticed that 60 + 68 == 128, which is also a round
number).

hp


[1] ØMQ is an IPC framework: See http://zeromq.org/ We use it to make
RPC calls from stored procedures to a server process.


-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


[GENERAL] Upgrade from postgresql-9.5alpha1 to postgresql-9.5alpha2 (Debian packages)

2015-08-18 Thread Peter J. Holzer
It looks like the catalog version has changed between 9.5alpha1 and
9.5alpha2:

 FATAL:  database files are incompatible with server
 DETAIL:  The database cluster was initialized with CATALOG_VERSION_NO 
201506282,
  but the server was compiled with CATALOG_VERSION_NO 201507281.
 HINT:  It looks like you need to initdb.

Normally, one would use pg_upgradecluster to do the upgrade, but
pg_upgradecluster assumes that the old and new version are installed in
parallel. Likewise, the low-level tool pg_upgrade needs the old bindir,
if I read the man-page correctly, and of course, apt-get upgrade
overwrites that, since it's just two versions of the same package
(unlike a major upgrade which is a new package). 

So, what's the best way to do the upgrade?

* Copy the bindir before the upgrade (or restore from backup) to a safe
  place and do pg_upgrade?
* Initdb a new cluster and restore yesterdays backup? 
* Something else?

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] Upgrade from postgresql-9.5alpha1 to postgresql-9.5alpha2 (Debian packages)

2015-08-18 Thread Peter J. Holzer
On 2015-08-18 20:40:10 +0900, Masahiko Sawada wrote:
 On Tue, Aug 18, 2015 at 6:06 PM, Peter J. Holzer hjp-pg...@hjp.at wrote:
  It looks like the catalog version has changed between 9.5alpha1 and
  9.5alpha2:
[...]
  So, what's the best way to do the upgrade?
 
  * Copy the bindir before the upgrade (or restore from backup) to a safe
place and do pg_upgrade?
  * Initdb a new cluster and restore yesterdays backup?
  * Something else?
 
 
 I've not used pg_upgrade at such case, but If you have a enough time
 to do upgrading, I will suggest to take backup(pg_dumpall) from
 current cluster and then restore it to new cluster.

Of course you would have to make a backup before the upgrade to restore
it afterwards. I could of course have forcibly downgraded to alpha1
again and made a new backup, but since this is a test system I just
decided to drop and recreate the cluster and restore yesterday's backup.
(One of my colleagues won't be too pleased about that, I think)

 I think pg_upgrade is basically used at major version upgrading.

This was basically a major version upgrade. The problem was that it
wasn't reflected in the package/directory structure (normally the
postgres debian packages are quite fastidious in separating everything
so that you have both an old and a new installation in the places where
pg_upgradecluster expects them), and that I didn't expect it (the
possibility of catalog version changes from one alpha release to the
next was discussed before the release of alpha1, but I somehow
classified that as theoretically possible but not likely - my fault),
and finally that I don't really understand the finer points of
pg_upgrade (I managed to use it in a similar situation some time ago,
but I had to read the source code of pg_upgradecluster (and I think I
even single-stepped through it in the debugger) to figure out the
parameters and unfortunately I didn't take notes). 

No big harm done (alpha software on a test system - I expect things to
blow up once in a while), but maybe the person preparing the alpha
releases can figure out how to make the upgrade smoother. At least a
warning in the release announcement would be nice (wouldn't have helped
me as I have to admit that I read that only after I upgraded, but it
would help those who do things in the right order ;-) ).

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] Queries for unused/useless indexes

2015-05-25 Thread Peter J. Holzer
On 2015-05-22 09:41:57 -0400, Melvin Davidson wrote:
 I'd like to share those queries with the community, as I know there must be
 others out there with the same problem.
 
 /* useless_indexes.sql */
 SELECT
    idstat.schemaname AS schema,
    idstat.relname AS table_name,
    indexrelname AS index_name,
    idstat.idx_scan AS times_used,
    pg_size_pretty(pg_relation_size(quote_ident(idstat.schemaname) || '.' 
 ||
 quote_ident(idstat.relname))) AS table_size,
    pg_size_pretty(pg_relation_size(quote_ident(idstat.schemaname) || '.' 
 ||
 quote_ident(indexrelname))) AS index_size,
    n_tup_upd + n_tup_ins + n_tup_del as num_writes,
    indexdef AS definition
 FROM pg_stat_user_indexes AS idstat
 JOIN pg_indexes ON indexrelname = indexname
 JOIN pg_stat_user_tables AS tabstat ON idstat.relname = tabstat.relname
 WHERE idstat.idx_scan < 200
 AND indexdef !~* 'unique'
 ORDER BY idstat.schemaname,
  idstat.relname,
  indexrelname;

Thanks, that's useful.

However, it doesn't quite work if there are indexes with the same name
in different schemas. Better join on the schemaname, too:

FROM pg_stat_user_indexes AS idstat
JOIN pg_indexes AS idx ON indexrelname = indexname and idstat.schemaname = 
idx.schemaname
JOIN pg_stat_user_tables AS tabstat ON idstat.relname = tabstat.relname and 
idstat.schemaname = tabstat.schemaname

(for some reason that makes it a lot slower, though)
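
For completeness, the whole query with those joins (and the original
"< 200" threshold) would be roughly:

    SELECT
        idstat.schemaname AS schema,
        idstat.relname AS table_name,
        indexrelname AS index_name,
        idstat.idx_scan AS times_used,
        pg_size_pretty(pg_relation_size(quote_ident(idstat.schemaname) || '.' || quote_ident(idstat.relname))) AS table_size,
        pg_size_pretty(pg_relation_size(quote_ident(idstat.schemaname) || '.' || quote_ident(indexrelname))) AS index_size,
        n_tup_upd + n_tup_ins + n_tup_del AS num_writes,
        indexdef AS definition
    FROM pg_stat_user_indexes AS idstat
    JOIN pg_indexes AS idx
        ON indexrelname = indexname
        AND idstat.schemaname = idx.schemaname
    JOIN pg_stat_user_tables AS tabstat
        ON idstat.relname = tabstat.relname
        AND idstat.schemaname = tabstat.schemaname
    WHERE idstat.idx_scan < 200
        AND indexdef !~* 'unique'
    ORDER BY idstat.schemaname, idstat.relname, indexrelname;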

hp

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature


Re: [GENERAL] Queries for unused/useless indexes

2015-05-25 Thread Peter J. Holzer
On 2015-05-25 12:25:01 -0400, Melvin Davidson wrote:
 I'm not sure why you are using pg_stat_user_indexes.

Because you did. I didn't change that.

 My original query below
 uses pg_stat_all_indexes and the schema names are joined and it does work.

I'm not sure what you mean by original, but this:

 SELECT n.nspname as schema,
    i.relname as table,
    i.indexrelname as index,
    i.idx_scan,
    i.idx_tup_read,
    i.idx_tup_fetch,
    pg_size_pretty(pg_relation_size(quote_ident(n.nspname) || '.' ||
 quote_ident(i.relname))) AS table_size,
    pg_size_pretty(pg_relation_size(quote_ident(n.nspname) || '.' ||
 quote_ident(i.indexrelname))) AS index_size,
    pg_get_indexdef(idx.indexrelid) as idx_definition
   FROM pg_stat_all_indexes i
   JOIN pg_class c ON (c.oid = i.relid)
   JOIN pg_namespace n ON (n.oid = c.relnamespace)
   JOIN pg_index idx ON (idx.indexrelid =  i.indexrelid )
  WHERE i.idx_scan < 200
    AND NOT idx.indisprimary
    AND NOT idx.indisunique
  ORDER BY 1, 2, 3;

is not the query you posted in your original message. 

Here is what you posted:

 On Mon, May 25, 2015 at 10:41 AM, Peter J. Holzer hjp-pg...@hjp.at wrote:
 
 On 2015-05-22 09:41:57 -0400, Melvin Davidson wrote:
  I'd like to share those queries with the community, as I know there must
 be
  others out there with the same problem.
 
  /* useless_indexes.sql */
  SELECT
     idstat.schemaname AS schema,
     idstat.relname AS table_name,
     indexrelname AS index_name,
     idstat.idx_scan AS times_used,
     pg_size_pretty(pg_relation_size(quote_ident(idstat.schemaname) ||
 '.' ||
  quote_ident(idstat.relname))) AS table_size,
     pg_size_pretty(pg_relation_size(quote_ident(idstat.schemaname) ||
 '.' ||
  quote_ident(indexrelname))) AS index_size,
     n_tup_upd + n_tup_ins + n_tup_del as num_writes,
     indexdef AS definition
  FROM pg_stat_user_indexes AS idstat
  JOIN pg_indexes ON indexrelname = indexname
  JOIN pg_stat_user_tables AS tabstat ON idstat.relname = tabstat.relname
  WHERE idstat.idx_scan < 200
  AND indexdef !~* 'unique'
  ORDER BY idstat.schemaname,
   idstat.relname,
   indexrelname;

-- 
   _  | Peter J. Holzer| I want to forget all about both belts and
|_|_) || suspenders; instead, I want to buy pants 
| |   | h...@hjp.at | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


signature.asc
Description: Digital signature