Re: [PERFORM] postgresql tuning with perf

2017-10-23 Thread Tomas Vondra


On 10/23/2017 09:19 PM, Purav Chovatia wrote:
> Hello Experts,
> 
> We are trying to tune our postgresql DB using perf.

Can you share some of the perf reports, then?

> We are running a C program that connects to postgres DB and calls
> very simple StoredProcs, one each for SELECT, INSERT & UPDATE.
> 
> The SPs are very simple. 
> *SELECT_SP*:
> CREATE OR REPLACE PROCEDURE query_dept_new(p1 IN numeric, p2 OUT
> numeric,p3 OUT numeric,...,p205 OUT numeric) AS
> BEGIN
>     SELECT c2,c3,..,c205
>         INTO p2,p3,...,p205
>         FROM dept_new
>         WHERE c1 = p1;
> END;
> 
> *UPDATE_SP*:
> CREATE OR REPLACE PROCEDURE query_dept_update(p1 IN numeric, p2 IN
> numeric,,p205 IN numeric) AS
> BEGIN
>     update dept_new set  c2 = p2,c3 = p3,.,c205 = p205 
>         WHERE c1 = p1;
> commit;
> END;
> 
> *INSERT_SP*:
> CREATE OR REPLACE PROCEDURE query_dept_insert(p1 IN numeric, p2 IN
> numeric,.,p205 IN numeric) AS
> BEGIN
> insert into dept_new values(p1,p2,.,p205);
> commit;
> END;
> 
> As shown above, its all on a single table. Before every test, the table
> is truncated and loaded with 1m rows. WAL is on a separate disk.
> 

It'd be nice if you could share more details about the structure of the
table, hardware and observed metrics (throughput, ...). Otherwise we
can't try reproducing it, for example.

> Its about 3x slower as compared to Oracle and major events are WAL
> related. With fsync=off or sync_commit=off it gets 10% better but still
> far from Oracle. Vacuuming the table does not help. Checkpoint too is
> not an issue. 

So how do you know the major events are WAL related? Can you share how
you measure that and the measurements?
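
For example, one crude way to quantify the WAL volume a test run generates
(a sketch for 9.x; the LSN literal below is just a placeholder copied from
the first call):

   SELECT pg_current_xlog_location();   -- note the returned LSN
   -- run the test for, say, 60 seconds, then:
   SELECT pg_xlog_location_diff(pg_current_xlog_location(),
                                '3/402A6CD0') / 1024 / 1024 AS wal_mb;

Sharing numbers like that, together with the wait events you're referring
to, would help a lot.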

> 
> Since we dont see any other way to find out what is slowing it down, we
> gathered data using the perf tool. Can somebody pls help on how do we go
> about reading the perf report.

Well, that's hard to do when you haven't shared the report.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Row level security policy policy versus SQL constraints. Any performance difference?

2017-10-17 Thread Tomas Vondra
Hi,

On 10/17/2017 10:44 PM, Joe Carlson wrote:
> Hello.
> 
> I have not used row level security policies in the past but am
> considering using them for a project in which I would like to restrict
> the set returned in a query based on specific fields. This is more as a
> convenience issue (for me) rather than a security issue.
> 
> What I was wondering is what is the performance differences between a
> row level security implementation:
> 
> CREATE POLICY  ON  TO  USING
> (=ANY());
> 
> DROP POLICY 
> 
> and an implementation where I add on the constraints as part of each
> select statement:
> 
> SELECT  FROM  WHERE  AND
> =ANY()
> 
> In my (admittedly small) number of EXPLAINs I've looked at, it appears
> that the policy logic is added to the SELECT statement as a constraint.
> So I would not expect any fundamental performance difference in the 2
> different forms.
> 
> Is this true? Or is there some extra behind-the-scenes things to be
> aware of? Can there be excessive overhead from the CREATE/DROP POLICY
> statements?
> 

The main point of the RLS is enforcing an order in which the conditions
are evaluated. That is, the "security" quals (coming from RLS policies)
have to be evaluated first, before any quals that might leak information
about the values (imagine a custom PL/pgSQL function inserting the data
somewhere, or perhaps just printing debug messages).

(Many built-in operators are, however, exempt from that, as we consider
them leak-proof. This allows us to use non-RLS conditions for index
scans etc., which might otherwise be impossible.)

Otherwise yes - it's pretty much the same as if you combined the
conditions using AND. It's "just" a much more convenient approach.
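
For illustration, a minimal sketch (table, column and role names are made
up) mirroring the structure of the policy you posted:

   ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

   CREATE POLICY orders_by_region ON orders
       FOR SELECT
       TO reporting_role
       USING (region_id = ANY (ARRAY[1, 2, 3]));

   -- roughly equivalent to writing the condition by hand:
   SELECT * FROM orders WHERE region_id = ANY (ARRAY[1, 2, 3]);

The main difference is the leak-proofness requirement described above,
not the plan shape.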

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] blocking index creation

2017-10-11 Thread Tomas Vondra


On 10/11/2017 04:11 PM, Neto pr wrote:
> 
> 2017-10-11 10:46 GMT-03:00 Laurenz Albe <laurenz.a...@cybertec.at>:
> 
> Neto pr wrote:
> > When creating an index on a table of approximately 10GB of data, the
> > DBMS hangs (I think), because even after waiting 10 hours there was no
> > return of the command.
> > It happened when creating Hash indexes and B+ tree indexes.
> > However, for some columns it was successful (L_RETURNFLAG, L_PARTKEY).
> 
> > If someone has a hint how to speed up index creation so that it
> > completes successfully.
> 
> Look if CREATE INDEX is running or waiting for a lock (check the
> "pg_locks" table, see if the backend consumes CPU time).
> 
> 
> In this moment now, there is an index being created in the Lineitem
> table (+ - 10 Gb), and apparently it is locked, since it started 7 hours
> ago.
> I've looked at the pg_locks table and look at the result, it's with
> "ShareLock" lock mode.
> Is this blocking correct? or should it be another type?
> 

Yes, CREATE INDEX acquires a SHARE lock, see

   https://www.postgresql.org/docs/9.1/static/explicit-locking.html

> Before creating the index, should I set the type of transaction lock? What?

Eeee? Not sure I understand. The command acquires all necessary locks
automatically.

> ---
> SELECT
>   L.mode, c.relname, locktype,  l.GRANTED, l.transactionid,
> virtualtransaction
> FROM   pg_locks l, pg_class   c
> where  c.oid = l.relation
> 
> -- RESULT
> --
> AccessShareLock   pg_class_tblspc_relfilenode_index   relation   TRUE   (null)   3/71
> AccessShareLock   pg_class_relname_nsp_index          relation   TRUE   (null)   3/71
> AccessShareLock   pg_class_oid_index                  relation   TRUE   (null)   3/71
> AccessShareLock   pg_class                            relation   TRUE   (null)   3/71
> AccessShareLock   pg_locks                            relation   TRUE   (null)   3/71
> ShareLock         lineitem                            relation   TRUE   (null)   21/3769
>  

Well, we see something is holding a SHARE lock on the "lineitem" table,
but we don't really know what the session is doing.

There's also a "pid" column in pg_locks - you can use it to look up the
session in pg_stat_activity, which includes the query (and also a "state"
column that will tell you if it's active or waiting for a lock).
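
Something like this should work (assuming 9.2 or newer, where
pg_stat_activity has the "state" and "query" columns):

   SELECT l.pid, l.mode, l.granted, a.state, a.query
     FROM pg_locks l
     JOIN pg_class c ON c.oid = l.relation
     JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE c.relname = 'lineitem';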

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] https://stackoverflow.com/questions/28844170/how-to-limit-the-memory-that-is-available-for-postgressql-server

2017-09-19 Thread Tomas Vondra


On 09/19/2017 02:49 AM, 園田祥平 wrote:
> Hi experts,
> 
> For an academic experiment I need to *restrict the total amount of
> memory that is available for a pgSQL server* to compute a given set of
> queries.
> 
> I know that I can do this through |postgressql.conf|file, where I can
> adjust some parameters related with Resource Management.
> 
> The problem is that: it's not clear for me--given the several parameters
> available on the config file--which is the parameter that I should change. 
> > When I first opened the config file I'm expecting someting like
> this: |max_server_memmory|. Instead I found a lot
> of: |shared_buffers|, |temp_buffers|, |work_mem|, and so on...
> 
> Given that, I've consulted pgSQL docs. on Resource Consumption
> <http://www.postgresql.org/docs/9.3/static/runtime-config-resource.html> and
> I come up with the |shared_buffers| as the best candidate for what I'm
> looking for: *the parameter that restricts the total amount of memory
> that a pgSQL server can use to perform its computation*. But I'm not
> completely sure about this. 
> 
> Can you guys give me some insight about which parameters should I adjust
> to restrict the pgSQL server's memory, please?
> 

The short answer is "You can't do that from within PostgreSQL alone."
You can define the size of some memory buffers, but not a hard total
limit. One reason is that queries may use multiple work_mem buffers, we
don't know how much memory the other queries are consuming, etc. We also
don't have any control over the page cache, for example.

If you really need to do that, you'll need to do that at the OS level,
e.g. by specifying "mem=X" kernel parameter, at the VM level, or
something like that.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] query of partitioned object doesnt use index in qa

2017-09-15 Thread Tomas Vondra
On 09/16/2017 12:05 AM, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
>> I'm wondering if we should extend the new CREATE STATISTICS
>> framework to be able to do that without requiring an index.
> 
> I think that's already on the roadmap --- it's one of the reasons we 
> ended up with a SELECT-like syntax for CREATE STATISTICS. But it 
> didn't get done for v10.
> 

Right. It's one of the things I'd like to be working on after getting in
the more complex statistics types (MCV & histograms).

> If we do look at that as a substitute for "make an expression index
> just so you get some stats", it would be good to have a way to
> specify that you only want the standard ANALYZE stats on that value
> and not the extended ones.
> 

Not sure I understand what you mean by "extended" - the statistics we
collect for expression indexes, or the CREATE STATISTICS stuff? I assume
the former, because if you don't want the latter then just don't create
the statistics. Or am I missing something?
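
Just to make the two options concrete, roughly (made-up table t with
columns a and b):

   -- today: an expression index created mostly to get stats on the expression
   CREATE INDEX ON t ((a + b));

   -- v10 extended statistics, currently limited to plain column references
   CREATE STATISTICS s1 (ndistinct, dependencies) ON a, b FROM t;

The idea discussed here is allowing expressions in the second form, so
the index would not be needed just for the statistics.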

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Execution plan analysis

2017-08-25 Thread Tomas Vondra
Hi,

So looking at the plans, essentially the only part that is different is
the scan node at the very bottom - in one case it's a sequential scan,
in the other case (the slow one) it's the bitmap index scan.

Essentially it's this:

->  Seq Scan on lineitem
(cost=0.00..2624738.17 ...)
(actual time=0.839..74391.087 ...)

vs. this:

->  Bitmap Heap Scan on lineitem
(cost=336295.10..1970056.39 ...)
(actual time=419620.817..509685.421 ...)
->  Bitmap Index Scan on idx_l_shipmodelineitem000
(cost=0.00..336227.49 ...)
(actual time=419437.172..419437.172 ...)

All the other nodes are the same and perform about the same in both
cases, so you can ignore them. This difference is the root cause you need
to investigate.

The question is why is the sequential scan so much faster than bitmap
index scan? Ideally, the bitmap heap scan should scan the index (in a
mostly sequential way), build a bitmap, and then read just the matching
part of the table (sequentially, by skipping some of the pages).

Now, there are a few reasons why this might not work that well.

Perhaps the table fits into RAM, but table + index does not. That would
make the sequential scan much faster than the index path. Not sure if
this is the case, as you haven't mentioned which TPC-H scale you are
testing, but you only have 4GB of RAM, which is fairly low.

Another bit is prefetching - with sequential scans, the OS is able to
prefetch the next bit of data automatically (read-ahead). With bitmap
index scans that's not the case, producing a lot of individual
synchronous I/O requests. See if increasing effective_io_concurrency
(from the default 1 to 16 or 32) helps.

Try generating the plans with EXPLAIN (ANALYZE, BUFFERS), that should
tell us more about how many blocks are found in shared buffers, etc.
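
Something along these lines (substitute your actual query; the
effective_io_concurrency bit only matters for the bitmap heap scan path):

   SET effective_io_concurrency = 32;

   EXPLAIN (ANALYZE, BUFFERS)
   SELECT count(*) FROM lineitem WHERE l_shipmode = 'AIR';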

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] query runs for more than 24 hours!

2017-08-22 Thread Tomas Vondra



On 08/22/2017 04:23 PM, Mariel Cherkassky wrote:
> Hi, I have a query that I run in my postgresql 9.6 database and it runs
> for more than 24 hours and doesn't finish.
> 
> My select consists of a few joins:

I'm sorry, but the query and plans are completely broken (wrapped in
funny ways, missing important bits, ...). I don't know what client you
use or how that happened, but I recommend attaching the information as
text files instead of pasting it into the message directly.


Regarding the query analysis - we can't really help you much without 
seeing an explain analyze (that is, not just the plan and estimates, but 
actual performance and row counts). That usually identifies the query 
operations (scans, join, ...) causing issues.


Of course, if the query has already been running for 24h and you don't
know how much longer it will take to complete, running EXPLAIN ANALYZE on
it is not very practical. The best thing you can do is break the query
into smaller parts and debug those - start with one table, and then add
tables/conditions until the performance gets bad. Hopefully the explain
analyze on that will complete in reasonable time.
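
For example (made-up table and column names, just to illustrate the
approach):

   -- step 1: the two largest tables only
   EXPLAIN ANALYZE
   SELECT count(*)
     FROM orders o
     JOIN order_items i ON i.order_id = o.id
    WHERE o.created_at > '2017-01-01';

   -- step 2: add the next table / condition from the original query,
   -- re-run, and repeat until the runtime explodes

Whichever join you added last is the one to look at more closely.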


Of course, you haven't told us anything about what's happening on the
machine. Is it reading a lot of data from the disks? Random or
sequential? Is it writing a lot of data into temporary files? Is it
consuming a lot of CPU? And so on.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Simple SQL too slow

2017-07-02 Thread Tomas Vondra



On 07/02/2017 03:26 AM, Daviramos Roussenq Fortunato wrote:

> I am using pgAdmin for SQL test.
> 
> Linux:
> 
> EXPLAIN ANALYZE select * from
> "Seq Scan on lancamentosteste  (cost=0.00..49289.74 rows=1883774
> width=92) (actual time=0.016..1194.453 rows=1883699 loops=1)"
> 
> "Total runtime: 2139.067 ms"
> 
> Windows:
> "Seq Scan on lancamentosteste  (cost=0.00..49288.67 rows=1883967
> width=92) (actual time=0.036..745.409 rows=1883699 loops=1)"
> 
> "Total runtime: 797.159 ms"



I'm really, really confused. In the first message you claimed the 
queries take 7 and 3 minutes, yet here we see the queries taking just a 
few seconds.





> I did some tests reading the disk and monitored with iotop.
> 
> #hdparm -t /dev/sdc
> 
> /dev/sdc:
>   Timing buffered disk reads: 730 MB in  3.01 seconds = 242.65 MB/sec
> 
> #hdparm -T /dev/sdc
> 
> /dev/sdc:
>   Timing cached reads:   9392 MB in  2.00 seconds = 4706.06 MB/sec
> 
> #time sh -c "dd if=/dev/zero of=ddfile bs=8k count=25 && sync"; rm
> ddfile
> 
> 25+0 records in
> 25+0 records out
> 204800 bytes (2.0 GB) copied, 5.84926 s, 350 MB/s
> 
> real    0m9.488s
> user    0m0.068s
> sys     0m5.488s
> 
> In the tests monitoring the disk with iotop, the read rate stayed
> constant between 100MB/s and 350MB/s.
> 
> Doing the same monitoring with iotop while running the SELECT, the disk
> reading does not exceed 100kB/s. I have the impression that some
> configuration of Linux or Postgres is limiting the use of the total
> capacity of the DISK.
> 
> Does anyone know if there is any setting for this?



There is no such setting. But it's possible that the network is very 
slow, so transferring the results from the server to the client takes 
very long. Or that formatting the results in the client takes a lot of 
time (I'm not sure why there'd be a difference between Windows and Linux 
though).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Simple SQL too slow

2017-07-01 Thread Tomas Vondra

Hello,

On 07/01/2017 10:39 PM, Daviramos Roussenq Fortunato wrote:
> What tests could I do? The rigid Linux disk is much faster than the
> Windows one, so I should get much better performance on this Linux.
> What test battery do you recommend I do?




I'm not sure what you mean by "rigid disk" or "test battery", but I
agree with Andreas that clearly there's something wrong at the system
level. It's hard to guess what exactly, but a sequential scan on a 250MB
table (computed from the relpages value) should only take a few seconds
on any decent hardware, not 3 or 7 minutes.
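
For reference, the table size can be checked directly like this (the
~250MB figure is only an estimate based on relpages):

   SELECT relpages,
          pg_size_pretty(pg_relation_size('lancamentosteste')) AS table_size
     FROM pg_class
    WHERE relname = 'lancamentosteste';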


The first thing I would do is running basic system-level tests, for 
example benchmarking storage using fio.


After that, you need to determine what is the bottleneck. Perhaps the 
resources are saturated by something else running on the system - other 
queries, maybe something else running next to PostgreSQL. Look at top 
and iotop while running the queries, and other system tools.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: FW: Re: [PERFORM] Query is running very slow......

2017-05-25 Thread Tomas Vondra



On 5/25/17 2:26 PM, Dinesh Chandra 12108 wrote:

Hi Tomas,

Please find the below input for slow query.

(a) something about the hardware it's running on
  RAM-->64 GB, CPU->40core

(b) amounts of data in the tables / databases
Database size   :32GB
-
Tables size
-
Workflow.project: 8194 byte
workflow.tool_performance   :175 MB
workflow.evidence_to_do :580 MB

(c) EXPLAIN or even better EXPLAIN ANALYZE of the query

"GroupAggregate  (cost=16583736169.63..18157894828.18 rows=5920110 width=69)"
"  ->  Sort  (cost=16583736169.63..16714893857.43 rows=52463075120 width=69)"
"Sort Key: tool_performance.project_id, project.project_name, 
tool_performance.step_id, (date_trunc('day'::text, tool_performance.insert_time)), 
tool_performance.user_id"
"->  Nested Loop  (cost=2.42..787115179.07 rows=52463075120 width=69)"
"  ->  Seq Scan on evidence_to_do  (cost=0.00..119443.95 rows=558296 
width=0)"
"Filter: (status_id = ANY 
('{15100,15150,15200,15300,15400,15500}'::bigint[]))"
"  ->  Materialize  (cost=2.42..49843.24 rows=93970 width=69)"
"->  Hash Join  (cost=2.42..49373.39 rows=93970 width=69)"
"  Hash Cond: (tool_performance.project_id = 
project.project_id)"
"  ->  Seq Scan on tool_performance  (cost=0.00..48078.88 
rows=93970 width=39)"
"Filter: ((insert_time > '2017-05-01 
00:00:00+05:30'::timestamp with time zone) AND (insert_time < '2017-05-02 
00:00:00+05:30'::timestamp with time zone))"
"  ->  Hash  (cost=1.63..1.63 rows=63 width=38)"
"->  Seq Scan on project  (cost=0.00..1.63 rows=63 
width=38)"



Are you sure this is the same query? The query you posted includes these
two join conditions:


evidence_to_do.project_id = tool_performance.project_id
evidence_to_do.project_id = project.project_id

But the plan only seems to enforce the equality between 'project' and 
'tool_performance'. So when joining the evidence_to_do, it performs a 
cartesian product, producing ~52B rows (estimated). That can't be fast.
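
To make that concrete, a rough reconstruction (based on the plan above,
so the column list is a guess) with the join condition that seems to be
missing spelled out:

   SELECT p.project_name, tp.step_id, tp.user_id,
          date_trunc('day', tp.insert_time), count(*)
     FROM workflow.tool_performance tp
     JOIN workflow.project p
       ON p.project_id = tp.project_id
     JOIN workflow.evidence_to_do e
       ON e.project_id = tp.project_id   -- this equality is not enforced in the plan
    WHERE e.status_id IN (15100, 15150, 15200, 15300, 15400, 15500)
      AND tp.insert_time > '2017-05-01'
      AND tp.insert_time < '2017-05-02'
    GROUP BY 1, 2, 3, 4;

If the real query does contain that condition, then the plan you posted
is from a different query (or an older version of it).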


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Query is running very slow......

2017-05-24 Thread Tomas Vondra

Hi,

On 5/24/17 7:04 PM, Dinesh Chandra 12108 wrote:

> Dear Expert,
> 
> While executing the below query, it's taking too long to fetch the output.
> 
> Could you please help to fine-tune the same?



You'll have to provide far more details - the query alone is certainly 
not enough for anyone to guess why it's slow. Perhaps look at this:


https://wiki.postgresql.org/wiki/Slow_Query_Questions

In particular, you'll have to tell us

(a) something about the hardware it's running on

(b) amounts of data in the tables / databases

(c) EXPLAIN or even better EXPLAIN ANALYZE of the query

(d) configuration of the database (work_mem, shared_buffers etc.)


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Questionaire: Common WAL write rates on busy servers.

2017-04-28 Thread Tomas Vondra

On 04/28/2017 01:34 AM, Andres Freund wrote:

> On 2017-04-28 01:29:14 +0200, Tomas Vondra wrote:
> > I can confirm this observation. I bought the Intel 750 NVMe SSD last year,
> > the device has 1GB DDR3 cache on it (power-loss protected), can do ~1GB/s of
> > sustained O_DIRECT sequential writes. But when running pgbench, I can't push
> > more than ~300MB/s of WAL to it, no matter what I do because of
> > WALWriteLock.
> 
> Hm, interesting.  Even if you up wal_buffers to 128MB, use
> synchronous_commit = off, and play with wal_writer_delay/flush_after?



I think I've tried things like that, but let me do some proper testing. 
I'll report the numbers in a few days.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Questionaire: Common WAL write rates on busy servers.

2017-04-27 Thread Tomas Vondra

On 04/27/2017 07:35 PM, Andres Freund wrote:

On 2017-04-27 10:29:48 -0700, Joshua D. Drake wrote:

On 04/27/2017 09:34 AM, Andres Freund wrote:

On 2017-04-27 09:31:34 -0700, Joshua D. Drake wrote:

On 04/27/2017 08:59 AM, Andres Freund wrote:




I would agree it isn't yet a widespread issue.


I'm not yet sure about that actually. I suspect a large
percentage of people with such workloads aren't lingering lots on
the lists.


That would probably be true. I was thinking of it more as the
"most new users are in the cloud" and the "cloud" is going to be
rare that a cloud user is going to be able to hit that level of
writes. (at least not without spending LOTS of money)


You can get pretty decent NVMe SSD drives on serveral cloud
providers these days, without immediately bancrupting you. Sure, it's
instance storage, but with a decent replication and archival setup,
that's not necessarily an issue.

It's not that hard to get to the point where postgres can't keep up
with storage, at least for some workloads.



I can confirm this observation. I bought the Intel 750 NVMe SSD last 
year, the device has 1GB DDR3 cache on it (power-loss protected), can do 
~1GB/s of sustained O_DIRECT sequential writes. But when running 
pgbench, I can't push more than ~300MB/s of WAL to it, no matter what I 
do because of WALWriteLock.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Questionaire: Common WAL write rates on busy servers.

2017-04-27 Thread Tomas Vondra

On 04/27/2017 06:34 PM, Andres Freund wrote:

On 2017-04-27 09:31:34 -0700, Joshua D. Drake wrote:

On 04/27/2017 08:59 AM, Andres Freund wrote:



Ok, based on the, few, answers I've got so far, my experience is
indeed skewed. A number of the PG users I interacted with over
the last couple years had WAL write ranges somewhere in the range
of 500MB/s to 2.2GB/s (max I'veseen). At that point WAL insertion
became a major bottleneck, even if storage was more than fast
enough to keep up. To address these we'd need some changes, but
the feedback so far suggest that it's not yet a widespread
issue...


I would agree it isn't yet a widespread issue.


I'm not yet sure about that actually. I suspect a large percentage
of people with such workloads aren't lingering lots on the lists.



To a certain extent, this is a self-fulfilling prophecy. If you know 
you'll have such a busy system, you probably do some research and 
testing first, before choosing the database. If we don't perform well 
enough, you pick something else. Which removes the data point.


Obviously, there are systems that start small and get busier and busier 
over time. And those are the ones we see.


cheers

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Questionaire: Common WAL write rates on busy servers.

2017-04-27 Thread Tomas Vondra

Hi,

On 04/25/2017 06:17 AM, Andres Freund wrote:

Hi,

I've lately seen more and more installations where the generation of
write-ahead-log (WAL) is one of the primary bottlenecks. I'm curious
whether that's primarily a "sampling error" of mine, or whether
that's indeed more common.



I see those cases too. To some degree it's a sampling bias. People 
generally don't call us to look at the 99% of systems that perform fine, 
so we tend to see the 1% of systems under pressure.


That doesn't make that observation irrelevant, though. Those demanding 
systems are one of the things that pushes us forward.


>

The primary reason I'm curious is that I'm pondering a few potential
optimizations, and would like to have some guidance which are more
and which are less important.



I think any optimization you do will improve at least some of those busy 
systems.



Questions (answer as many you can comfortably answer):
- How many MB/s, segments/s do you see on busier servers?


That depends on the cause (see the next point).


- What generates the bulk of WAL on your servers (9.5+ can use
  pg_xlogdump --stats to compute that)?


a) systems doing large batches
   - bulk loads/updates/deletes, one or few sessions doing a lot
   - easily high hundreds of MB/s (on a separate device)

b) OLTP systems doing a lot of tiny/small transactions
   - many concurrent sessions
   - often end up much more limited by WAL, due to locking etc.
   - often the trouble is random updates all over the place, causing
 amplification due to FPI (PK on UUID is a great way to cause this
 unnecessarily even on apps with naturally tiny working set)


- Are you seeing WAL writes being a bottleneck?


On the write-intensive systems, yes. Often the CPUs are waiting for WAL 
I/O to complete during COMMIT.



- What kind of backup methods are you using and is the WAL volume a
  problem?


The large and busy systems can easily produce so much WAL, that the 
basebackup is not the largest part of the backup. That is somewhat 
solvable by using other means of obtaining the basebackup snapshot (e.g. 
by taking some sort of filesystem / SAN / ... snapshot). That reduces 
the amount of WAL needed to make the basebackup consistent, but it 
doesn't solve the WAL archiving issue.



- What kind of replication are you using and is the WAL volume a
  problem?


Generally streaming replication, and yes, the amount of WAL may be an 
issue, partially because the standby is a single-process thing. And as 
it has to process something generated by N sessions on the primary, that 
can't end well.


Interestingly enough, FPIs can actually make it way faster, because the 
standby does not need to read the data from disk during recovery.



- What are your settings for wal_compression, max_wal_size (9.5+) /
  checkpoint_segments (< 9.5), checkpoint_timeout and wal_buffers?


I'd say the usual practice is to tune for timed checkpoints, say 30+ 
minutes apart (or more). wal_compression is typically 'off' (i.e. the 
default value).
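
As a sketch of what that tuning usually looks like in practice
(illustrative values only, on 9.5+ where max_wal_size replaced
checkpoint_segments):

   ALTER SYSTEM SET checkpoint_timeout = '30min';
   ALTER SYSTEM SET max_wal_size = '64GB';
   ALTER SYSTEM SET checkpoint_completion_target = 0.9;
   SELECT pg_reload_conf();

The point is to make max_wal_size large enough that checkpoints are
triggered by the timeout, not by running out of WAL space.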



- Could you quickly describe your workload?


Pretty much a little bit of everything, depending on the customer.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Delete, foreign key, index usage

2017-04-25 Thread Tomas Vondra

On 04/25/2017 08:28 AM, Johann Spies wrote:

On 24 April 2017 at 15:17, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

On 04/24/2017 08:48 AM, Johann Spies wrote:



Why would the planner prefer the use the gin index and not the btree
index in this case?



You'll need to show what queries are you running - that's a quite important
piece of information, and I don't see it anywhere in this thread. Seeing
explain plans would also be helpful.


> It is a simple "delete from wos_2017_1.article;" which causes a domino
> effect of deletes due to foreign keys. In the case of one table with more
> than 50 million records where the primary key was also the foreign
> key, the process only started to use the index when we built a gin
> index.  In the case of the "belongs_to" table (shown in my first
> email) we first built a btree index on the foreign key - and it was
> ignored.  Only after the gin index was created did it use the index.
> 
> Regards.
> Johann


Wouldn't it be easier to simply show the queries (with the exact 
condition) and the associated explain plans? I understand you're doing 
your best to explain what's happening, but the explain plans contain a 
lot of information that you might have missed.


I suppose you actually ran EXPLAIN ANALYZE to verify that the query was
not using the btree index and then started using the gin index. Or how
did you verify that?
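
One way to check that directly (a sketch; the delete is rolled back, and
the trigger timing lines are the interesting part):

   BEGIN;
   EXPLAIN (ANALYZE, BUFFERS)
   DELETE FROM wos_2017_1.article;
   ROLLBACK;

The FK checks on the referencing tables show up as "Trigger for
constraint ..." lines at the end of the output, so you can see exactly
how much time is spent in them before and after creating the index.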


Also, which PostgreSQL version have you observed this on? I see you've 
mentioned 9.6 when talking about parallel scans, but I suppose the issue 
was originally observed on some older version.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Delete, foreign key, index usage

2017-04-24 Thread Tomas Vondra

On 04/24/2017 08:48 AM, Johann Spies wrote:


> Why would the planner prefer to use the gin index and not the btree
> index in this case?



You'll need to show what queries you are running - that's quite an
important piece of information, and I don't see it anywhere in this 
thread. Seeing explain plans would also be helpful.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Number of characters in column preventing index usage

2017-02-17 Thread Tomas Vondra

On 02/17/2017 11:42 PM, David G. Johnston wrote:

On Fri, Feb 17, 2017 at 3:19 PM, Hustler DBA <hustler...@gmail.com> wrote:


my_db=# create index tab_idx1 on tab(ID);

CREATE INDEX
my_db=# explain (analyze, buffers) select count(*) from tab where ID
= '01625cfa-2bf8-45cf' ;
  QUERY
PLAN

---
 Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual
time=0.048..0.048 rows=1 loops=1)
   Buffers: shared read=2
   ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29
rows=1 width=0) (actual time=0.043..0.043 rows=0 loops=1)
 Index Cond: (ID = '01625cfa-2bf8-45cf'::text)



   ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual
time=0.031..0.108 rows=5 loops=1)
 Filter: ((ID)::text =
'01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
 Rows Removed by Filter: 218
 Buffers: shared hit=12
 Planning time: 0.122 ms
 Execution time: 0.180 ms
(8 rows)


​IIRC the only reason the first query cares to use the index is because
it can perform an Index Only Scan and thus avoid touching the heap at
all.  If it cannot avoid touching the heap the planner is going to just
use a sequential scan to retrieve the records directly from the heap and
save the index lookup step.



I don't follow - the queries are exactly the same in both cases, except 
for the parameter value. So both cases are eligible for an index-only scan.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Number of characters in column preventing index usage

2017-02-17 Thread Tomas Vondra

Hi,

On 02/17/2017 11:19 PM, Hustler DBA wrote:

I am seeing this strange behavior, I don't know if this is by design by
Postgres.

I have an index on a column which is defined as "character
varying(255)". When the value I am searching for is of a certain length,
the optimizer uses the index but when the value is long, the optimizer
doesn't use the index but does a seq scan on the table. Is this by
design? How can I make the optimizer use the index no matter what the
size/length of the value being searched for?



AFAIK there are no such checks, i.e. the optimizer does not consider the 
length of the value when deciding between scan types.




PostgreSQL version: 9.4



That's good to know, but we also need information about the table 
involved in your queries. I'd bet the table is tiny (it seems to be just 
12 pages, so ~100kB), making the indexes rather useless.



my_db=# explain (analyze, buffers) select count(*) from tab where ID =
'01625cfa-2bf8-45cf' ;
  QUERY PLAN

---
 Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual time=0.048..0.048
rows=1 loops=1)
   Buffers: shared read=2
   ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29 rows=1
width=0) (actual time=0.043..0.043 rows=0 loops=1)
 Index Cond: (ID = '01625cfa-2bf8-45cf'::text)
 Heap Fetches: 0
 Buffers: shared read=2
 Planning time: 0.250 ms
 Execution time: 0.096 ms
(8 rows)

my_db=# explain (analyze, buffers) select count(*) from tab where ID =
'01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea' ;
QUERY PLAN

---
 Aggregate  (cost=14.80..14.81 rows=1 width=0) (actual time=0.115..0.115
rows=1 loops=1)
   Buffers: shared hit=12
   ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual
time=0.031..0.108 rows=5 loops=1)
 Filter: ((ID)::text = '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
 Rows Removed by Filter: 218
 Buffers: shared hit=12
 Planning time: 0.122 ms
 Execution time: 0.180 ms
(8 rows)


The only difference I see is that for the long value the planner expects 
5 rows, while for the short one it expects 1 row. That may seem a bit 
strange, but I'd bet it finds the short value in some statistic (MCV, 
histogram) and so can provide a very accurate estimate. For the longer
one, it ends up using some default (0.5% for equality IIRC) or a value
deduced from ndistinct. Or something like that.
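
You can check what the statistics actually contain, e.g. (assuming the
column is called "id"):

   SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
     FROM pg_stats
    WHERE tablename = 'tab' AND attname = 'id';

That will show whether either value appears in most_common_vals, and
which defaults would apply otherwise.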


The differences between the two plans are rather negligible, both in 
terms of costs (8.3 vs. 14.81) and runtime (0.1 vs 0.2 ms). The choice 
of a sequential scan seems perfectly reasonable for such tiny tables.


FWIW it's impossible to draw conclusions based on two EXPLAIN ANALYZE 
executions. The timing instrumentation in EXPLAIN ANALYZE may have a
significant impact (different for each plan!). You also need to test with
more values and longer runs, not just a single execution (there are 
caching effects etc.).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Querying with multicolumn index

2016-12-10 Thread Tomas Vondra

Hi,

On 12/10/2016 12:51 AM, Tom Lane wrote:

Eric Jiang <e...@doublemap.com> writes:

I have a query that I *think* should use a multicolumn index, but
sometimes isn't, resulting in slow queries.


I tried to duplicate this behavior, without success.  Are you running
with nondefault planner parameters?



My guess is this is a case of LIMIT assuming the matching rows are
uniformly distributed in the input data. The planner likely concludes
that for a driver with a lot of data we'll find the first row using
ix_updates_time very quickly, and that it will be cheaper than inspecting
the larger multi-column index. But imagine a driver with lots of data a
long time ago. That breaks the LIMIT assumption fairly quickly.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Perf decreased although server is better

2016-11-02 Thread Tomas Vondra

On 11/02/2016 02:26 PM, Benjamin Toueg wrote:

Hi everyone,

I'm facing a peformance decrease after switching to a more performant
VPS :
http://serverfault.com/questions/812702/posgres-perf-decreased-although-server-is-better



Well, changing so many things at once (CPU, RAM, storage, Ubuntu 
version, probably kernel version, PostgreSQL version) is a bad idea, 
exactly because it makes investigating regressions more complicated.



My questions are:

 1. What benchmark should I perform before switching to a new server?


Three types of benchmarks, in this order:

1) system-level benchmarks to test various resources (fio to test disks, 
etc.)


2) general-purpose PostgreSQL benchmarks (regular pgbench, ...)

3) application-specific benchmarks, or at least pgbench with templates 
that match your workload somewhat better than the default one


Start with (1), compare results between machines, if it's OK start with 
(2) and so on.



 2. What's your rule of thumb regarding my specific issue? What should
be investigated first?



There's a bottleneck somewhere. You need to identify which resource it
is and why; until then it's just wild guessing. Try measuring how long 
the requests take at different points - at the app server, at the 
database, etc. That will tell you whether it's a database issue, a 
network issue etc. If the queries take longer on the database, use 
something like perf to profile the system.


ISTM it's not a disk issue (at least the chart shows minimal usage). But 
you're doing ~400tps, returning ~5M rows per second.


Also, if it turns out to be a database issue, more info about config and 
data set would be useful.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Tuning Checkpoints

2016-10-31 Thread Tomas Vondra

On 10/31/2016 08:19 PM, Andre Henry wrote:

My PG 9.4.5 server runs on Amazon RDS some times of the day we have a
lot of checkpoints really close (less than 1 minute apart, see logs
below) and we are trying to tune the DB to minimize the impact of the
checkpoint or reduce the number of checkpoints.

Server Stats

· Instance Type db.r3.4xl

• 16 vCPUs 122GB of RAM

• PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc
(GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit



Some PG Stats

• Shared Buffers = 31427608kB

• Checkpoint Segments = 64

• Checkpoint completion target = .9

• Rest of the configuration is below



Things we are doing

• We have a huge table where each row is over 1kB and its very
busy. We are splitting that into multiple tables especially the one json
field that making it large.



Questions

• Each checkpoint log writes out the following checkpoint
complete: wrote 166481 buffers (4.2%); 0 transaction log file(s) added,
0 removed, 64 recycled; write=32.441 s, sync=0.050 s, total=32.550 s;
sync files=274, longest=0.049 s, average=0.000 s



OK, each checkpoint has to write out all data dirtied since the previous
checkpoint. You have ~170k buffers worth of dirty data, i.e. ~1.3GB.



• What does buffers mean? How do I find out how much RAM that is
equivalent to?



A buffer holds 8kB of data, which is one "chunk" (page) of the data files.
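
A quick sanity check of that arithmetic:

   SELECT pg_size_pretty(166481 * 8192::bigint);   -- ~1.3 GB of dirty data per checkpoint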


• Based on my RDS stats I don't think IOPs will help, because I
don't see any flat lines on my write operations / second graph. Is this
a good assumption?



Not sure what you mean by this. Also, maybe you should talk to AWS if 
you're on RDS.



• What else can we tune to spread out checkpoints?



Based on the logs, your checkpoints are triggered by filling WAL. I see 
your checkpoints happen every 30 - 40 seconds, and you only have 64 
segments.


So to get checkpoints triggered by timeout (which I assume is 5 minutes,
because you have not mentioned checkpoint_timeout), you need to increase
checkpoint_segments enough to hold 5 minutes worth of WAL.


That means 300/30 * 64, i.e. roughly 640 segments (it's likely an 
overestimate, due to full page writes, but well).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] query slowdown after 9.0 -> 9.4 migration

2016-10-26 Thread Tomas Vondra

On 10/26/2016 03:48 PM, Filip Rembiałkowski wrote:

Hi.

Query run time degraded after migration from Pg 9.0 + postgis 1.5 to Pg
9.4 + postgis 2.2.

1 ms versus 7 ms.

Same query, same data, same schema, similar hardware. Data is small and
fits in cache.

EXPLAIN shows heap scan cost increase. What can be the reason for
40-fold increase in page scans needed to run Bitmap Heap Scan with
Filter and Recheck?



On 9.0 the scan accesses only 8 buffers:

Buffers: shared hit=8

while on 9.4 it has to inspect 316 of them:

Buffers: shared hit=316

Perhaps the table is organized / sorted differently, or something like 
that. How did you do the upgrade?
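
One thing worth checking is the physical ordering of the table, which the
planner tracks as "correlation" (a sketch; substitute the actual table
name):

   SELECT attname, correlation
     FROM pg_stats
    WHERE tablename = 'my_table';

A value close to 1 (or -1) for the filtered column means the matching
rows are physically clustered, which makes the bitmap heap scan touch far
fewer pages.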



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] PostgreSQL on ZFS: performance tuning

2016-09-27 Thread Tomas Vondra

On 09/27/2016 06:00 PM, Torsten Zuehlsdorff wrote:



On 29.07.2016 08:30, Tomas Vondra wrote:



On 07/29/2016 08:04 AM, trafdev wrote:

Hi.

I have an OLAP-oriented DB (light occasional bulk writes and heavy
aggregated selects over large periods of data) based on Postgres 9.5.3.

Server is a FreeBSD 10.3 with 64GB of RAM and 2x500GB SSD (root on ZFS,
mirror).

The largest table is 13GB (with a 4GB index on it), other tables are 4,
2 and less than 1GB.

After reading a lot of articles and "howto-s" I've collected following
set of tweaks and hints:


ZFS pools creation:
zfs create zroot/ara/sqldb
zfs create -o recordsize=8k -o primarycache=all zroot/ara/sqldb/pgsql


zfs get primarycache,recordsize,logbias,compression
zroot/ara/sqldb/pgsql
NAME   PROPERTY  VALUE SOURCE
zroot/ara/sqldb/pgsql  primarycache  all   local
zroot/ara/sqldb/pgsql  recordsize8Klocal
zroot/ara/sqldb/pgsql  logbias   latency   local
zroot/ara/sqldb/pgsql  compression   lz4   inherited from zroot

L2ARC is disabled
VDEV cache is disabled


pgsql -c "mkdir /ara/sqldb/pgsql/data_ix"
pgsql -c "initdb --locale=en_US.UTF-8 -E UTF-8 -D /ara/sqldb/pgsql/data"


/etc/sysctl.conf
vfs.zfs.metaslab.lba_weighting_enabled=0


postgresql.conf:
listen_addresses = '*'
max_connections = 100
shared_buffers = 16GB
effective_cache_size = 48GB


It may not be a problem for your workload, but this effective_cache_size
value is far too high.


May I ask why? ZFS by default caches your RAM size minus 1 GB.
Subtracting the shared buffers from the 64 GB of RAM, I would assume
47 GB would be a better value. But this would not be far too high. So
can you please explain this?


Because it's not a global value, but an estimate of how much RAM is 
available as a cache for a single query. So if you're running 10 queries 
at the same time, they'll have to share the memory.


It's a bit trickier as there's often a fair amount of cross-backend 
sharing (backends accessing the same data, so it's likely one backend 
loads data into cache, and then other backends access it too).


It also ignores that memory may get allocated for other reasons - some 
queries may allocate quite a bit of memory for sorts/aggregations. So

   effective_cache_size = RAM - shared_buffers

is excessive not only because it ignores the per-query nature of the
setting, but also because it neglects these other allocations.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] PostgreSQL on ZFS: performance tuning

2016-07-29 Thread Tomas Vondra


Notes\concerns:

- primarycache=metadata (recommended in most articles) produces a
significant performance degradation (in SELECT queries);


Those articles are wrong. PostgreSQL relies on the filesystem cache, so
it needs primarycache=all.




- from what I can see, Postgres uses memory too carefully. I would like
somehow to force it to keep accessed data in memory as long as possible.
Instead I often see that even frequently accessed data is pushed out of
memory cache for no apparent reasons.

>

This is probably a consequence of the primarycache misconfiguration.



Do I miss something important in my configs? Are there any double
writes\reads somewhere because of OS\ZFS\Postgres caches? How to avoid
them?

Please share your experience\tips. Thanks.




--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] using stale statistics instead of current ones because stats collector is not responding

2016-03-09 Thread Tomas Vondra
Hi,

On Tue, 2016-03-08 at 16:18 -0800, Tory M Blue wrote:
> No hits on the intratubes on this.
> 
> 
> Any idea ? We are doing some massive deletes so was curious as to what
> would cause this error. The DB is up, not overburdened, just some big
> deletes and slon replication processes.
>
> CentOS 6.6 
> Postgres 9.4.5
> 
> First time I've ever seen this alert/error just curious about it.
>
> 2016-03-08 16:17:29 PST11877 2016-03-08 16:17:29.850 PSTLOG:
> using stale statistics instead of current ones because stats collector
> is not responding

PostgreSQL tracks 'runtime statistics' (number of scans of a table,
tuples fetched from index/table etc.) in a file, maintained by a
separate process (collector). When a backend process requests some of the
stats (e.g. when a monitoring tool selects from pg_stat_all_tables) it
requests a recent snapshot of the file from the collector.

The log message you see means that the collector did not handle such
requests fast enough, and the backend decided to read an older snapshot
instead. So you may see some stale data in monitoring for example.

This may easily happen if the I/O system is overloaded, for example. The
simplest solution is to move the statistics file to RAM disk (e.g. tmpfs
mount on Linux) using stats_temp_directory in postgresql.conf.

The space needed depends on the number of objects (databases, tables,
indexes), and usually it's a megabyte in total or so.
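
Something like this (the path is just an example - any tmpfs mount with
enough space and the right permissions will do):

   SHOW stats_temp_directory;   -- default is 'pg_stat_tmp' inside the data directory

   -- in postgresql.conf, then reload:
   -- stats_temp_directory = '/var/run/postgresql/stats_tmp'

The setting only needs a reload (SIGHUP), not a restart.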

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [PERFORM] Hash join gets slower as work_mem increases?

2016-02-01 Thread Tomas Vondra

On 02/01/2016 10:38 AM, Albe Laurenz wrote:

Tomas Vondra wrote:

...

I didn't post the whole plan since it is awfully long, I'll include hyperlinks
for the whole plan.

work_mem = '100MB' (http://explain.depesz.com/s/7b6a):

->  Hash Join  (cost=46738.74..285400.61 rows=292 width=8) (actual 
time=4296.986..106087.683 rows=187222 loops=1)
Hash Cond: ("*SELECT* 1_2".postadresse_id = p.postadresse_id)
Buffers: shared hit=1181177 dirtied=1, temp read=7232 written=7230
[...]
->  Hash  (cost=18044.92..18044.92 rows=4014 width=8) (actual 
time=4206.892..4206.892 rows=3096362 loops=1)
  Buckets: 1024  Batches: 2 (originally 1)  Memory Usage: 102401kB
  Buffers: shared hit=1134522 dirtied=1, temp written=5296

work_mem = '500MB' (http://explain.depesz.com/s/Cgkl):

->  Hash Join  (cost=46738.74..285400.61 rows=292 width=8) (actual 
time=3802.849..245970.049 rows=187222 loops=1)
Hash Cond: ("*SELECT* 1_2".postadresse_id = p.postadresse_id)
Buffers: shared hit=1181175 dirtied=111
[...]
->  Hash  (cost=18044.92..18044.92 rows=4014 width=8) (actual 
time=3709.584..3709.584 rows=3096360 loops=1)
  Buckets: 1024  Batches: 1  Memory Usage: 120952kB
  Buffers: shared hit=1134520 dirtied=111

Does that support your theory?

There is clearly an underestimate here, caused by correlated attributes, but
is that the cause for the bad performance with increased work_mem?


Yes, that's clearly the culprit here. In both cases the estimate is only
~4000 tuples in the hash, and 9.3 sizes the hash table to have at 
most ~10 tuples per bucket (in a linked list).


However we actually get ~3M rows, so there will be ~3000 tuples per 
bucket, and that's extremely expensive to walk. The reason why 100MB is 
faster is that it's using 2 batches, thus making the lists "just" ~1500 
tuples long.


This is pretty much exactly the reason why I reworked hash joins in 9.5. 
I'd bet it's going to be ~20x faster on that version.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] checkpoints, proper config

2015-12-10 Thread Tomas Vondra



On 12/10/2015 11:45 PM, Alvaro Herrera wrote:

Tomas Vondra wrote:


Also, I don't think it makes much sense to set

(checkpoint_warning > checkpoint_timeout)

as it kinda defeats the whole purpose of the warning.


I agree, but actually, what is the sense of checkpoint_warning?  I think
it was useful back when we didn't have log_checkpoints, but now that we
have detailed checkpoint logging I think it's pretty much useless noise.



Not entirely. The WARNING only triggers when you get below the 30s (or 
whatever value is set in the config) and explicitly warns you about 
doing checkpoints too often. log_checkpoints=on logs all checkpoints and 
you have to do further analysis on the data (and it's just LOG).



--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] checkpoints, proper config

2015-12-10 Thread Tomas Vondra

On 12/10/2015 06:20 PM, Joshua D. Drake wrote:

On 12/10/2015 01:12 AM, Tory M Blue wrote:


checkpoint_timeout = 5min

checkpoint_completion_target = 0.9



The above is your problem. Make checkpoint_timeout = 1h . Also,
considering turning synchronous_commit off.


I doubt that. The report mentioned that the checkpoints happen 0.027... 
minutes apart (assuming the minutes_between_checkpoints is computed in a 
sane way). That's way below 5 minutes, so the checkpoints have to be 
triggered by something else - probably by running out of segments, but 
we don't know the value before Tory increased it to 150.


Also, recommending synchronous_commit=off is a bit silly, because not 
only does it introduce data loss issues, it'll also likely cause even
more frequent checkpoints.


Tory, please enable logging of checkpoints (log_checkpoints=on). Also, I 
don't think it makes much sense to set


   (checkpoint_warning > checkpoint_timeout)

as it kinda defeats the whole purpose of the warning.
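
If you're on 9.4 or newer you can do that without editing the config file
by hand (otherwise set it in postgresql.conf and reload):

   ALTER SYSTEM SET log_checkpoints = on;
   SELECT pg_reload_conf();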

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Server slowing down over time

2015-09-04 Thread Tomas Vondra

Hi,

On 08/27/2015 07:21 PM, Jean Cavallo wrote:

Hi,

I am currently working on a data migration for a client.
The general plan is :
   - Read data from a postgresql database
   - Convert them to the new application
   - Insert in another database (same postgresql instance).

The source database is rather big (~40GB, wo indexes), and the
conversion process takes some time. It is done by multiple workers
on a separate Linux environnement, piece by piece.

When we start the migration, at first it looks good.
Performances are good, and it ran smoothly. After a few hours,
we noticed that things started to slow down. Some queries seemed
to be stuck, so we waited for them to end, and restarted the server.

After that it went well for some time (~10 minutes), then it slowed
down again. We tried again (a few times), and the pattern repeats.


If you're moving a lot of data (especially if the destination database 
is empty), one possible problem is statistics. This generally is not a 
problem in regular operation, because the data growth is gradual and 
autovacuum analyzes the tables regularly, but in batch processes this is 
often a big issue.


The usual scenario is that there's an empty (or very small) table, where 
indexes are inefficient so PostgreSQL plans the queries with sequential 
scans. The table suddenly grows, which would make indexes efficient, but 
the planner has no idea about that until autovacuum kicks in. But before 
that happens, the batch process executes queries on that table.


Try adding ANALYZE after steps that add a lot of data.
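
For example, right after each bulk-load step in the migration
(target_table being whatever table the step populated):

   ANALYZE target_table;

   -- or, if many tables were touched and you don't mind the extra time:
   ANALYZE;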



My postgresql specific problem is that it looks like the server gets
stuck. CPU usage is <10%, RAM usage is under 50% max, there is no
noticeable disk usage. But, there are some (<10) active queries, some
of which may take several hours to complete. Those queries work
properly (i.e < 1min) right after the server restarts.


That's a bit strange. Essentially what you're saying is that the 
workload is neither CPU nor I/O bound. To make it CPU bound, at least 
one CPU would have to be 100% utilized, and with 4 CPUs that's 25%, but 
you're saying there's only 10% used. But you're saying I/O is not the 
bottleneck either.



So my question is : What could slow the queries from ~1min to 2hours
 which does not involve CPU, Memory, or disk usage, and which would
"reset" when restarting the server ?


A lot of things, unfortunately, and the fact that this is a migration 
moving data between two databases makes it even more complicated. The 
virtualization does not make it less complex either.


For example, are you sure it's not stuck on the other database? I assume 
you're running some long queries, so maybe it's stuck there and the 
destination database is just waiting for data? That'd be consistent with 
the low CPU and I/O usage you observe.


Locking is another possibility, although it probably is not the only 
cause - it'd be utilizing at least one CPU otherwise.




For information, the number of processes does not seem to be the
problem, there are ~20 connections with max_connection set to 100.
We noticed at some point that the hard drive holding the target
database was heavily fragmented (100%...), but defrag did not
seem to change anything.


If it was a problem, you'd see high I/O usage. And that's not the case.



Also, the queries that appear to get stuck are "heavy" queries,
though after a fresh restart they execute in a reasonable time.


Does the plan change? If not, check waiting locks in pg_locks.
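
Something like this (a simple sketch; works on 9.2+, older releases have 
current_query instead of query) shows the backends waiting for a lock, 
together with the query they are running:

   SELECT l.pid, l.locktype, l.relation::regclass AS relation,
          l.mode, a.query
     FROM pg_locks l
     JOIN pg_stat_activity a ON (a.pid = l.pid)
    WHERE NOT l.granted;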



Finally, whatever causes the database to wait also causes the
Windows instance to slow down. But restarting Postgresql fixes
this as well.


That's a bit strange, I guess. If you're only observing light CPU and I/O 
usage, then the instance should not be slow, unless there's something 
else going on - possibly at the virtualization level (e.g. another busy 
instance on the same hardware, some sort of accounting that limits the 
resources after a time, etc.)



Configuration :

The Postgresql server runs on a Windows Virtual Machine under
VMWare. The VM has dedicated resources, and the only other
VM on the host is the applicative server (which runs idle while
waiting for the database). There is nothing else running on the
server except postgresql (well, there were other things, but we
stopped everything to no avail).

PostgreSQL 9.3.5, compiled by Visual C++ build 1600, 64-bit


You're 4 versions behind.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query > 1000× slowdown after adding datetime comparison

2015-08-31 Thread Tomas Vondra



On 08/31/2015 06:09 PM, twoflower wrote:

I have the following three tables:

...

This is the query plan:

<http://postgresql.nabble.com/file/n5864045/qp1.png>

If I remove the WHERE condition, it returns immediately.

Am I doing something obviously wrong?


Please share explain plans for both the slow and the fast query. That 
makes it easier to spot the difference, and possibly identify the cause.


Also, what PostgreSQL version is this, and what are "basic" config 
parameters (shared buffers, work mem)?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index creation running now for 14 hours

2015-08-26 Thread Tomas Vondra

Hi,

On 08/26/2015 11:53 PM, Tory M Blue wrote:



On Wed, Aug 26, 2015 at 2:45 PM, Qingqing Zhou
zhouqq.postg...@gmail.com wrote:

On Wed, Aug 26, 2015 at 1:26 PM, Tory M Blue tmb...@gmail.com wrote:
 
  Right now the 100% cpu process which is this index is only using
3.5GB
  and has been for the last 15 hours
 

If 100% cpu, you can do 'sudo perf top' to see what the CPU is busy
about.

Regards,
Qingqing



I appreciate the attempted help, but I know what it's doing, it's
creating indexes for the last 14+ hours.


Sure, but what exactly was it doing? 'perf top' might give us a hint 
which function is consuming most of the time, for example.


 I've killed it now,  as it was

about to run my machine out of disk space, stopped it at 97% full, could
not go any longer.


Which suggests it's using a lot of temp files.

Indexes are built by reading all the necessary data from the table (just 
the indexed columns), sorting it, and then building the index from the 
sorted data (because that can be done very efficiently - much faster than 
simply inserting the tuples into the btree index one by one).


The fact that you ran out of disk space probably means that you don't 
have enough space for the sort (it clearly does not fit into 
maintenance_work_mem), and there's no way around that - you need enough 
disk space.



I will now clean up the table a bit but will still have 500million rows
with 6 indexes on it. I will create the indexes after the data is laid
down vs during,  so it doesn't block my other table replications. I will
then fire off my index creations in parallel  for my other tables so I
can actually use the hardware the DB is sitting on.


That's a very bad idea, because each of the index builds will require 
disk space for the sort, and you're even more likely to run out of disk 
space.




But I guess the answer is, no real way to tell what the box is doing
when it's creating an index. Yes there was a lock, no I could not find a
way to see how it's progressing so there was no way for me to gauge when
it would be done.


Had it been waiting on a lock, it wouldn't consume 100% of CPU.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index creation running now for 14 hours

2015-08-26 Thread Tomas Vondra



On 08/26/2015 10:26 PM, Tory M Blue wrote:



the table is 90GB without indexes,  285GB with indexes and bloat,  The
row count is not actually completing.. 125Million rows over 13 months,
this table is probably close to 600million rows.


You don't need to do SELECT COUNT(*) if you only need an approximate 
number. You can look at pg_class.reltuples:


   SELECT reltuples FROM pg_class WHERE relname = 'impressions';

That should be a sufficiently accurate estimate.


The above is when it had finished copying the table and started on the
index..

Well as I said I'm running out of storage as the index is creating some
serious data on the filesystem, I'll have to kill it, try to massage the
data a bit and increase the maintenance_work mem to use some of my 256GB
of ram to try to get through this. Right now the 100% cpu process which
is this index is only using 3.5GB and has been for the last 15 hours


Please post details on the configuration (shared_buffer, work_mem, 
maintenance_work_mem and such).


BTW while the CREATE INDEX is reporting 3.5GB, it most likely wrote 
a lot of data into on-disk chunks when sorting the data. So it's 
actually using the memory through page cache (i.e. don't increase 
maintenance_work_mem too much, you don't want to force the data to disk 
needlessly).
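
If you do raise it, a session-level setting right before the index build is 
usually enough (a sketch only - the column name is made up and the value is 
just illustrative):

   SET maintenance_work_mem = '4GB';
   CREATE INDEX impressions_some_column_idx ON impressions (some_column);
   RESET maintenance_work_mem;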


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] query not using GIN index

2015-08-21 Thread Tomas Vondra

Hi,

On 08/22/2015 03:55 AM, Guo, Yun wrote:

Hi,

We have a query on a column with GIN index, but query plan chooses not
using the index but does a seq scan, which is much slower

CREATE INDEX idx_access_grants_on_access_tokens ON access_grants USING
gin (access_tokens);

explain analyze SELECT access_grants.* FROM access_grants  WHERE
(access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']);
 QUERY PLAN
--
  Limit  (cost=0.00..7.46 rows=1 width=157) (actual
time=260.376..260.377 rows=1 loops=1)
->  Seq Scan on access_grants  (cost=0.00..29718.03 rows=3985
width=157) (actual time=260.373..260.373 rows=1 loops=1)
  Filter: (access_tokens @>
'{124e5a1f9de325fc176a7c89152ac734}'::text[])
  Rows Removed by Filter: 796818
  Total runtime: 260.408 ms



I find it very likely that the explain output actually comes from a 
slightly different query, including a LIMIT 1 clause.


That might easily be the problem here, because the optimizer expects the 
3985 matches to be uniformly distributed in the table, so it thinks 
it'll scan just a tiny fraction of the table (1/3985) until the first 
match. But it's quite possible all the matching rows are at the end of the 
table, and the executor has to actually scan the whole table.


It's difficult to say without further details of the table and how the 
data are generated.



We tested on smaller table in development region and it chooses to use
the index there. However, in production size table it decides to ignore
the index for unknown reasons.


Please provide explain output from that table. It's difficult to say 
what's different without seeing the details.


Also please provide important details about the system (e.g. which 
PostgreSQL version, how much RAM, what work_mem/shared_buffers and such 
stuff).



Is the large number of tuples skewing the query planner’s decision
or the index itself is larger than the table therefor it would decide
to do table scan?


What large number of tuples? The indexes are supposed to be more 
efficient the larger the table is.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Do work_mem and shared buffers have 1g or 2g limit on 64 bit linux?

2015-06-15 Thread Tomas Vondra



On 06/15/15 05:44, Kaijiang Chen wrote:

I've checked the source codes in postgresql 9.2.4. In function
static bool
grow_memtuples(Tuplesortstate *state)

the codes:
/*
 * On a 64-bit machine, allowedMem could be high enough to get us into
 * trouble with MaxAllocSize, too.
 */
if ((Size) (state->memtupsize * 2) >= MaxAllocSize / sizeof(SortTuple))
return false;

Note that MaxAllocSize == 1GB - 1
that means, at least for sorting, it uses at most 1GB work_mem! And
setting larger work_mem has no use at all...


That's not true. This only limits the size of 'memtuples' array, which 
only stores a pointer to the actual tuple, and some additional data. The 
tuple itself is not counted against MaxAllocSize directly. The SortTuple 
structure has ~24B which means you can track 33M tuples in that array, 
and the tuples may take a lot more space.


regards

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow query - lots of temporary files.

2015-06-10 Thread Tomas Vondra



On 06/10/15 15:42, Johann Spies wrote:

On 10 June 2015 at 15:02, Claudio Freire klaussfre...@gmail.com wrote:


The joins are different on both versions, and the most likely culprit
is the join against D. It's probably wrong, and the first query is
building a cartesian product.

Without more information about the schema it's difficult to be sure
though.


Thanks for your reply. I will experiment further with different joins.


I don't know what you mean by experimenting with joins - that should 
be determined by the schema.


The problematic piece of the explain plan is this:

 ->  Merge Join  (cost=4384310.92..21202716.78 rows=6664163593
  width=390)
   Output: a.ut, c.gt, b.go, b.gn, d.au
   Merge Cond: ((c.ut)::text = (d.rart_id)::text)

That is, the planner expects ~6.7 billion rows, each ~390B wide. That's 
~2.5TB of data that needs to be stored to disk (so that the sort can 
process it).


The way the schema is designed might be one of the issues - ISTM the 
'ut' column is somehow universal, mixing values referencing different 
columns in multiple tables. Not only that's utterly misleading for the 
planner (and may easily cause issues with huge intermediate results), 
but it also makes formulating the queries very difficult. And of course, 
the casting between text and int is not very good either.


Fix the schema to follow relational best practices - separate the values 
into multiple columns, and most of this will go away.



regards

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to reduce writing on disk ? (90 gb on pgsql_tmp)

2015-06-04 Thread Tomas Vondra



On 06/04/15 02:58, Scott Marlowe wrote:

On Wed, Jun 3, 2015 at 6:53 PM, Scott Marlowe scott.marl...@gmail.com wrote:

On Wed, Jun 3, 2015 at 4:29 PM, Joshua D. Drake j...@commandprompt.com wrote:


On 06/03/2015 03:16 PM, Tomas Vondra wrote:


Cache is not free memory - it's there for a purpose and usually
plays a significant role in performance. Sure, it may be freed
and used for other purposes, but that has consequences - e.g.
it impacts performance of other queries etc. You generally
don't want to do that on production.



Exactly. If your cache is reduced your performance is reduced
because less things are in cache. It is not free memory. Also the
command free is not useful in this scenario. It is almost
always better to use sar so you can see where the data points are
that free is using.


But if that WAS happening he wouldn't still HAVE 60G of cache!
That's my whole point. He's NOT running out of memory. He's not
even having to dump cache right now.


No one claimed he's running out of memory ...

What I claimed is that considering page cache equal to free memory is 
not really appropriate, because it is used for caching data, which plays 
a significant role.


Regarding the free output, we have no clue when the free command was 
executed. It might have been executed while the query was running, right 
after it failed or long after that. That has significant impact on 
interpretation of the output.


Also, we have no clue what happens on the machine, so it's possible 
there are other queries competing for the page cache, quickly filling/reusing 
free memory (used by the large query moments ago) for page cache.


And finally, we have no clue what plan the query is using, so we don't 
know how much memory it's using before it starts spilling to disk. For 
example it might easily be a single sort node, taking only 384MB (the 
work_mem) of RAM before it starts spilling to disk.




Further if he started using a few gig here for this one it wouldn't
have a big impact on cache (60G-1G etc) but might make it much
faster, as spilling to disk is a lot less intrusive when you've got a
bigger chunk of ram to work in. OTOH doing something like setting
work_mem to 60G would likely be fatal.


It'd certainly be fatal, because this query is spilling 95G to disk, 
and keeping that in memory would easily require 2-3x more space.




But he's not down to 3GB of memory by any kind of imagination. Any
working machine will slowly, certainly fill its caches since it's
not using the memory for anything else. That's normal. As long as
you're not blowing out the cache you're fine.


Once again, what about the 15GB shared buffers? Not that it'd change 
anything, really.



--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to reduce writing on disk ? (90 gb on pgsql_tmp)

2015-06-03 Thread Tomas Vondra



On 06/03/15 17:09, Scott Marlowe wrote:

On Wed, Jun 3, 2015 at 8:56 AM, Tomas Vondra



I don't see why you think you have less than 3GB used. The output you posted
clearly shows there's only ~300MB memory free - there's 15GB shared buffers
and ~45GB of page cache (file system cache).


Because you subtract cached from used to see how much real spare
memory you have. The kernel will dump cached mem as needed to free up
space for memory usage. So 64141-61761=2380MB used.


Well, except that 15GB of that is shared_buffers, and I wouldn't call 
that 'free'. Also, I don't see page cache as entirely free - you 
probably want at least some caching at this level.


In any case, even if all 64GB were free, this would not be enough for 
the query that needs 95GB for temp files.



--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to reduce writing on disk ? (90 gb on pgsql_tmp)

2015-06-03 Thread Tomas Vondra



On 06/03/15 23:18, Scott Marlowe wrote:

On Wed, Jun 3, 2015 at 1:24 PM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:


On 06/03/15 17:09, Scott Marlowe wrote:


On Wed, Jun 3, 2015 at 8:56 AM, Tomas Vondra


Well, except that 15GB of that is shared_buffers, and I wouldn't call that
'free'. Also, I don't see page cache as entirely free - you probably want at
least some caching at this level.

In any case, even if all 64GB were free, this would not be enough for the
query that needs 95GB for temp files.


You can argue all you want, but this machine has plenty of free memory
right now, and unless the OP goes crazy and cranks up work_mem to some
much higher level it'll stay that way, which is good. There's far far
more than 300MB free here. At the drop of a hat there can be ~60G
freed up as needed, either for shared_buffers or work_mem or other
things to happen. Cache doesn't count as used in terms of real
memory pressure. IE you're not gonna start getting swapping because you
need more memory, it'll just come from the cache.


Please, could you explain how you free 60GB 'as needed' when 15GB of that 
is actually used for shared buffers? Also, we don't know how much of 
that cache is 'dirty' which makes it more difficult to free.


What is more important, though, is the amount of memory. OP reported the 
query writes ~95GB of temp files (and dies because of full disk, so 
there may be more). The on-disk format is usually more compact than the 
in-memory representation - for example on-disk sort often needs 3x less 
space than in-memory qsort. So we can assume the query needs 95GB of 
data. Can you explain how that's going to fit into the 64GB RAM?



Cache is free memory. If you think of it any other way when you're
looking at memory usage and pressure on theings like swap you're
gonna make some bad decisions.


Cache is not free memory - it's there for a purpose and usually plays a 
significant role in performance. Sure, it may be freed and used for 
other purposes, but that has consequences - e.g. it impacts performance 
of other queries etc. You generally don't want to do that on production.



--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to reduce writing on disk ? (90 gb on pgsql_tmp)

2015-06-03 Thread Tomas Vondra



On 06/04/15 01:54, Yves Dorfsman wrote:

On 2015-06-03 16:29, Joshua D. Drake wrote:


On 06/03/2015 03:16 PM, Tomas Vondra wrote:


What is more important, though, is the amount of memory. OP reported the
query writes ~95GB of temp files (and dies because of full disk, so
there may be more). The on-disk format is usually more compact than the
in-memory representation - for example on-disk sort often needs 3x less
space than in-memory qsort. So we can assume the query needs 95GB of
data. Can you explain how that's going to fit into the 64GB RAM?


Cache is free memory. If you think of it any other way when you're
looking at memory usage and pressure on theings like swap you're
gonna make some bad decisions.


Cache is not free memory - it's there for a purpose and usually plays a
significant role in performance. Sure, it may be freed and used for
other purposes, but that has consequences - e.g. it impacts performance
of other queries etc. You generally don't want to do that on production.


Exactly. If your cache is reduced your performance is reduced because less
things are in cache. It is not free memory. Also the command free is not
useful in this scenario. It is almost always better to use sar so you can see
where the data points are that free is using.



It's one thing to consciously keep free memory for the OS cache, but
you should not take the free column from the first line output of
the program free as meaning that's all there is left, or that you
need all that memory.


No one suggested using the 'free' column this way, so I'm not sure what 
you're responding to?



You should look at used from the second line (-/+ buffers/cache).
That value is what the kernel and all the apps are using on your
machine. Add whatever you want to have for OS cache, and this is the
total amount of memory you want in your machine.


Except that the second line is not particularly helpful too, because it 
does not account for the shared buffers clearly, nor does it show what 
part of the page cache is dirty etc.



Note that for a machine that has run long enough, and done enough
I/O ops, free from the first line will always be close to 0,
because the OS tries to use as much memory as possible for caching,
do enough I/O and  you'll fill that up.


That's generally true, but the assumption is that on a 300GB database 
the page cache has a significant benefit for performance. What however 
makes this approach utterly futile is the fact that OP has only 64GB of 
RAM (and only ~45GB of that in page cache), and the query writes 95GB 
temp files on disk (and then fails). So even if you drop the whole page 
cache, the query will fail anyway.


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to reduce writing on disk ? (90 gb on pgsql_tmp)

2015-06-03 Thread Tomas Vondra



On 06/03/15 15:27, chiru r wrote:

Hi Benjamin,

It looks like you are facing a disk space issue for some queries.
In order to avoid the disk space issue you can do the following.
1) Increase the work_mem parameter at session level before executing the
queries.
2) If you observe the disk space issue for particular user queries, increase
the work_mem parameter at user level.


The suggestion to increase work_mem is a bit naive, IMHO. The query is 
writing ~95GB to disk, and it usually takes more space to keep the same data 
in memory. They only have 64GB of RAM ...


In the good case, it will crash just like now. In the worse case, the 
OOM killer will intervene, possibly crashing the whole database.




3) Check with developer to tune the query.


That's a better possibility. Sadly, we don't know what the query is 
doing, so we can't judge how much it can be optimized.


--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to reduce writing on disk ? (90 gb on pgsql_tmp)

2015-06-03 Thread Tomas Vondra


On 06/03/15 16:06, ben.play wrote:

The query is (unfortunately) generated by Doctrine 2 (Symfony 2).
We can’t change the query easily.


Well, then you'll probably have to buy more RAM, apparently.


This is my config :

max_connections = 80
shared_buffers = 15GB
work_mem = 384MB
maintenance_work_mem = 1GB
#temp_buffers = 8MB
#temp_file_limit = -1
effective_cache_size = 44GB


If I put a temp_file_limit …Are all my queries (who have to write on
disk) will crash ?

As you can see… I have 64 gb of Ram, but less than 3 Gb is used !

ben@bdd:/home/benjamin# free -m
             total       used       free     shared    buffers     cached
Mem:         64456      64141        315      15726         53      61761
-/+ buffers/cache:       2326      62130
Swap:         1021         63        958


Thanks guys for your help :)


I don't see why you think you have less than 3GB used. The output you 
posted clearly shows there's only ~300MB memory free - there's 15GB 
shared buffers and ~45GB of page cache (file system cache).


But you still haven't shown us the query (the EXPLAIN ANALYZE of it), so 
we can't really give you advice.


--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Connection time when using SSL

2015-06-01 Thread Tomas Vondra

Hi

On 06/01/15 22:51, Marco Di Cesare wrote:

I am connecting to a Postgres instance using SSL and seeing fairly slow
connect times. I would expect there would be some overhead but it's more
than I anticipated. The connection is happening over a network. I am
using a wildcard SSL certificate on the server side only.
Using one of these JDBC SSL connect strings takes on average: 1060 ms to
connect to the database:
jdbc:postgresql://db01-dev.pointclickcare.com:5432/testdb?ssl=true&sslmode=require&sslfactory=org.postgresql.ssl.jdbc4.LibPQFactory
- or -
jdbc:postgresql://db01-dev.pointclickcare.com:5432/testdb?ssl=true&sslmode=require&sslfactory=org.postgresql.ssl.NonValidatingFactory
Using this JDBC non-SSL connect string takes on average: 190 ms to
connect to the database:
jdbc:postgresql://db01-dev.pointclickcare.com:5432/testdb
Does this sound like a reasonable overhead that SSL would add to the
connection time or does this seem high? (~870ms/~443% slower using SSL)


What is the network latency (ping) between the two hosts? SSL requires a 
handshake, exchanging a number of messages between the two hosts, and if 
each roundtrip takes a significant amount of time ...


The 190ms seems quite high. On my rather slow workstation, a local 
connection without SSL takes ~30ms , with SSL ~70ms. So I wouldn't be 
surprised by ~100ms roundtrips in your case, and that is going to slow 
down the SSL handshake significantly.



There's very little you can do with the roundtrip time, usually, but you 
can keep the connections open in a pool. That'll amortize the costs.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Different plan for very similar queries

2015-05-31 Thread Tomas Vondra



On 05/31/15 18:22, Tom Lane wrote:

Tomas Vondra tomas.von...@2ndquadrant.com writes:

On 05/31/15 13:00, Peter J. Holzer wrote:

(There was no analyze on facttable_stat_fta4 (automatic or manual) on
facttable_stat_fta4 between those two tests, so the statistics on
facttable_stat_fta4 shouldn't have changed - only those for term.)



So maybe there was autoanalyze, because otherwise it really should be
the same in both plans ...


No, because that's the inside of a nestloop with significantly different
outer-side rowcount estimates.  The first case gets a benefit from the
expectation that it will be re-executed many times (see the impact of
loop_count on cost_index).


Meh, I got confused by the plan a bit - I thought there's a problem in 
the outer path (e.g. change of row count). But actually this is the path 
scanning the 'term' table, so the change is expected there.


The fact that the index scan cost 'suddenly' grows from 386k to 2M is 
confusing at first, but yeah - it's caused by the 'averaging' in 
cost_index() depending on loop_count.


But I think this does not really change the problem with eliminating 
inner paths solely on the basis of total cost - in fact it probably 
makes it slightly worse, because the cost also depends on estimates in 
the outer path (while the bitmapscan does not).



--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Different plan for very similar queries

2015-05-31 Thread Tomas Vondra

On 05/31/15 13:00, Peter J. Holzer wrote:

[I've seen in -hackers that you already seem to have a fix]

On 2015-05-30 15:04:34 -0400, Tom Lane wrote:

Tomas Vondra tomas.von...@2ndquadrant.com writes:

Why exactly does the second query use a much slower plan I'm not sure. I
believe I've found an issue in planning semi joins (reported to
pgsql-hackers a few minutes ago), but may be wrong and the code is OK.


I think you are probably right that there's a bug there: the planner is
vastly overestimating the cost of the nestloop-with-inner-indexscan
plan.  However, the reason why the mergejoin plan gets chosen in some
cases seems to be that an additional estimation error is needed to make
that happen; otherwise the nestloop still comes out looking cheaper.
The undesirable case looks like:


  Merge Semi Join  (cost=316864.57..319975.79 rows=1 width=81) (actual 
time=7703.917..30948.271 rows=2 loops=1)
Merge Cond: ((t.term)::text = (f.berechnungsart)::text)
->  Index Scan using term_term_idx on term t  (cost=0.00..319880.73 
rows=636 width=81) (actual time=7703.809..7703.938 rows=3 loops=1)
  Filter: (((facttablename)::text = 'facttable_stat_fta4'::text) AND 
((columnname)::text = 'berechnungsart'::text))


Just noticed that this is a bit strange, too:

This scans the whole index term_term_idx and for every row found it
checks the table for the filter condition. So it has to read the whole
index and the whole table, right? But the planner estimates that it will
return only 636 rows (out of 6.1E6), so using
term_facttablename_columnname_idx to extract those 636 and then sorting
them should be quite a bit faster (even just a plain full table scan
and then sorting should be faster).


That seems a bit strange, yes. I don't see why a simple index scan (with 
Index Cond), expected to produce 636, should be more expensive than 
scanning the whole index (with a Filter). Even if there's an additional 
Sort node, sorting those 636 rows.


But I've been unable to reproduce that (both on 9.1 and HEAD) without 
significant 'SET enable_*' gymnastics, so I'm not sure why that happens. 
Don't you have some 'enable_sort=off' or something like that?


A test case with a data set would help a lot, in this case.



Another thought: For the merge semi join postgresql doesn't actually
have to scan the whole inner index. It can skip from the first 'm' entry
to the first 'n' entry reading only a few non-leaf blocks, skipping many
leaf blocks in the process. The times (7703.917..30948.271) indicate that
it doesn't actually do this, but maybe the planner assumes it does?


How would it know how far to skip? I mean, assume you're on the first 
'n' entry - how do you know where the first 'm' entry is?


If you only really need to check existence, a nested loop with an inner 
index scan is probably the right thing anyway, especially if the number 
of outer rows (and thus loops performed) is quite low. This is clearly 
demonstrated by the first plan in this thread:


  QUERY PLAN
- ...
 Nested Loop Semi Join  (cost=0.00..384860.48 rows=1 width=81 ...
   ->  Index Scan using term_facttablename_columnname_idx on  ...
 Index Cond: (((facttablename)::text = 'facttable_sta ...
   ->  Index Scan using facttable_stat_fta4_einheit_idx on fa ...
 Index Cond: ((einheit)::text = (t.term)::text)
 Total runtime: 0.173 ms
(6 rows)

This is probably the best plan you can get in cases like this ...



I also suspected that the culprit is the columnname column. That one has a 
very
skewed distribution:

wdsah= select columnname, count(*) from term group by columnname order by 
count(*);
  columnname  |  count
-+-
  warenstrom  |   3
  einheit |   3
  berechnungsart  |   3
  og  |  26
  berichtsregion  | 242
  partnerregion   | 246
  sitcr4  |4719
  kurzbezeichnung | 1221319
  macrobondtimeseries | 1221320
  | 3661206
(10 rows)

So random variation in the sample could throw off the estimated
frequencies of the the least frequent columnnames by quite a bit.

But given that both plans estimated the number of rows returned by the
outer index scan as 636, that was probably a red herring.

But there does seem to be a connection to this column: In one case
pg_stats contained n_distinct=7 and only the two most common values.
Then the plan looked like this:

wdsah= explain analyze select facttablename, columnname, term, concept_id, 
t.hidden, language, register
 from term t where facttablename='facttable_stat_fta4' and 
columnname='warenstrom' and exists (select 1 from facttable_stat_fta4 f where 
f.warenstrom=t.term );

  QUERY PLAN

Re: [PERFORM] Postmaster eating up all my cpu

2015-05-30 Thread Tomas Vondra

Hi,

On 05/28/15 13:25, birimblongas wrote:

Hi,

My app was working just fine. A month ago postmaster started to eat all my
cpu sometimes (not all day, but a lot of times and everyday) and then my app
gets really slow and sometimes don't even complete the requests.

What could it be?


A lot of things. The first step should be looking at pg_stat_activity, 
to see what the process eating the CPU is doing.
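
For example (a sketch; assumes 9.2 or newer - older releases have 
current_query instead of the state/query columns):

   SELECT pid, state, now() - query_start AS runtime, query
     FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC NULLS LAST;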


We also need much more information about your system - what PostgreSQL 
version are you using, what kind of OS, configuration etc.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres is using 100% CPU

2015-05-30 Thread Tomas Vondra

Hi,

On 05/30/15 15:46, Ashik S L wrote:

We are using postgres SQL version 8.4.17..


FYI 8.4 is already unsupported for ~1 year, so you should consider 
upgrading to a newer release. Also, the newest version in that branch is 
8.4.22, so with 8.4.17 you're missing ~1 year of patches.



Postgres DB size is 900 MB and we are inserting 273 rows at once, and
each row is 60 bytes. Every time we insert 16380 bytes of data.


So you insert 273 rows and it gets stuck for 40 minutes? That's really 
strange, and I find it rather unlikely even with a badly misconfigured 
instance. It should easily insert thousands of rows per second.


Can you elaborate more about the database structure, or at least the 
table(s) you're inserting into. Are there any foreign keys (in either 
direction), indexes or triggers?
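
For example, something like this lists the triggers and foreign keys on the 
table (a sketch - replace 'your_table' with the actual table name; on 8.4 
the trigger list also includes the internal triggers that enforce FKs):

   SELECT tgname FROM pg_trigger
    WHERE tgrelid = 'your_table'::regclass;

   SELECT conname, confrelid::regclass AS referenced_table
     FROM pg_constraint
    WHERE conrelid = 'your_table'::regclass AND contype = 'f';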



I tried to make some config changes using above link. But I did not
see any improvement.
I made following changes in postgres.conf file:
shared_buffers = 512MB // It was 32MB
work_mem = 30MB
effective_cache_size = 512MB // I tried with 128MB 256MB also

Please let me know any config changes that I can try out.


I don't think this has anything to do with configuration. This seems 
like an issue at the application level, or maybe poorly designed schema.


You mentioned you have multiple machines, and only some of them are 
having this issue. What are the differences between the machines? Are 
all the machines using the same schema? I assume each has a different 
amount of data.


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Different plan for very similar queries

2015-05-29 Thread Tomas Vondra

Hi,

On 05/29/15 11:51, Peter J. Holzer wrote:

A couple of additional observations:

The total cost of both queries is quite similar, so random variations
might push into one direction or the other. Indeed, after dropping
and recreating indexes (I tried GIN indexes as suggested by Heikki on
[1]) and calling analyze after each change, I have now reached a
state where both queries use the fast plan.


I don't think bitmap indexes are a particularly good match for this use 
case. The queries need to check an existence of a few records, and btree 
indexes are great for that - the first plan is very fast.


Why exactly does the second query use a much slower plan I'm not sure. I 
believe I've found an issue in planning semi joins (reported to 
pgsql-hackers a few minutes ago), but may be wrong and the code is OK.


Can you try forcing the same plan for the second query, using enable 
flags? E.g.


   SET enable_mergejoin = off;

will disable the merge join, and push the optimizer towards a different 
join type. You may have to disable a few more node types until you get 
the same plan as for the first query, i.e.


   nestloop semi join
 - index scan
 - index scan

See this for more info:

   http://www.postgresql.org/docs/9.1/static/runtime-config-query.html

Also, have you tuned the PostgreSQL configuration? How?

Can you provide the dataset? Not necessarily all the columns, it should 
be sufficient to provide the columns used in the join/where clauses:


term - facttablename, columnname, term
facttable_stat_fta4 - einheit, berechnungsart

That'd make reproducing the problem much easier.


In the first case the query planner seems to add the cost of the two
index scans to get the total cost, despite the fact that for a semi
join the second index scan can be aborted after the first hit (so
either the cost of the second scan should be a lot less than
384457.80 or it needs to be divided by a large factor for the semi
join).

In the second case the cost of the second index scan (2545748.85) is
either completely ignored or divided by a large factor: It doesn't
seem to contribute much to the total cost.


I believe this is a consequence of the semi join semantics, because the 
explain plan contains total costs and row counts, as if the whole 
relation was scanned (in this case all the 43M rows), but the optimizer 
only propagates a fraction of the cost estimate (depending on how much of 
the relation it expects to scan). In this case it expects to scan a tiny 
part of the index, so the impact on the total cost is small.


A bit confusing, yeah.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres is using 100% CPU

2015-05-29 Thread Tomas Vondra



On 05/29/15 20:10, Ashik S L wrote:

Hi All,

I am using postgresDB on redhat machine which is having 4GB RAM
machine. As soon as it starts to Inserting rows into the postgres DB
it will reach 100%cpu. It will comedown to normal after 40 minutes. I
tried perform some tuning on the postgres DB, But result was same.I
am not postgres DB expert. Even we are not seeing in all machine.
Only few machines we are seeing this issue. Any help on this would
be appreciated.


Ashik, before pointing you to this list, I asked for some basic 
information that are needed when diagnosing issues like this - database 
size, postgres version etc. We can't really help you without this info, 
because right now we only know you're doing some inserts (while before 
you mentioned updates), and it's slow.


Also, can you please provide info about the configuration and what 
changes have you done when tuning it?


Have you seen this?

  https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] extract(year from date) doesn't use index but maybe could?

2015-04-19 Thread Tomas Vondra



On 04/19/15 22:10, Jon Dufresne wrote:

On Sun, Apr 19, 2015 at 10:42 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:


Or you might try creating an expression index ...

CREATE INDEX date_year_idx ON dates((extract(year from d)));



Certainly, but won't this add additional overhead in the form of two
indexes; one for the column and one for the expression?


It will, but it probably will be more efficient than poorly performing 
queries. Another option is to use the first type of queries with 
explicit date ranges, thus making it possible to use a single index.
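
For example, the year condition can be written as an explicit range, which 
returns the same rows but can use the plain index on d directly:

   SELECT * FROM dates WHERE d >= '1900-01-01' AND d < '1901-01-01';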




My point is, why force the user to take these extra steps or add
overhead when the the two queries (or two indexes) are functionally
equivalent. Shouldn't this is an optimization handled by the
database so the user doesn't need to hand optimize these differences?


Theoretically yes.

But currently the extract function call is pretty much a black box for 
the planner, just like any other function - it has no idea what happens 
inside, what fields are extracted and so on. It certainly is unable to 
infer the date range as you propose.


It's possible that in the future someone will implement an optimization 
like this, but I'm not aware of anyone working on that and I wouldn't 
hold my breath.


Until then you either have to create an expression index, or use queries 
with explicit date ranges (without extract calls).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] extract(year from date) doesn't use index but maybe could?

2015-04-19 Thread Tomas Vondra



On 04/19/15 19:16, Jon Dufresne wrote:

Given the table:

CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)

With an *index* on field d. The following two queries are functionally
equivalent:

1. SELECT * FROM dates WHERE d = '1900-01-01'
2. SELECT * FROM dates WHERE EXTRACT(year from d) = 1900'

By functionally equivalent, they will return the same result set.

Query 2 does not use the index, adding a performance cost. It seems
there is an opportunity for optimization to handle these two queries
equivalently to take advantage of the index.


Or you might try creating an expression index ...

CREATE INDEX date_year_idx ON dates((extract(year from d)));


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index only scan sometimes switches to sequential scan for small amount of rows

2015-03-26 Thread Tomas Vondra
On 26.3.2015 08:48, Jeff Janes wrote:

 OK, this is starting to look like a long-standing bug to me.
 
 If it only sees 3 distinct values, and all three are present at least
 twice, it throws all of them into the MCV list. But if one of those 3
 were present just once, then it tests them to see if they qualify.
 The test for inclusion is that it has to be present more than once,
 and that it must be over-represented by 25%.
 
 Let's say it sampled 30,000 rows and found 29,900 of one value, 99 of
 another, and 1 of a third.
 
 But that turns into the second one needing to be present 12,500 times. 
 The average value is present 10,000 times (30,000 samples with 3
 distinct values) and 25% more than that is 12,500.  So it is excluded.
 
 It seems to me that a more reasonable criteria is that it must be
 over-represented 25% compared to the average of all the remaining values
 not yet accepted into the MCV list.  I.e. all the greater ones should be
 subtracted out before computing the over-representation threshold.

That might work IMO, but maybe we should increase the coefficient a bit
(say, from 1.25 to 2), not to produce needlessly long MCV lists.


 It is also grossly inconsistent with the other behavior.  If they are
 29900; 98; 2 then all three go into the MCV.

Isn't the mincount still 12500? How could all three get into the MCV?


-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index only scan sometimes switches to sequential scan for small amount of rows

2015-03-26 Thread Tomas Vondra
On 26.3.2015 17:35, Jeff Janes wrote:
 On Thu, Mar 26, 2015 at 5:44 AM, Tomas Vondra
 tomas.von...@2ndquadrant.com wrote:
 
 That might work IMO, but maybe we should increase the coefficient a
 bit (say, from 1.25 to 2), not to produce needlessly long MCV
 lists.
 
 That wouldn't work here, because at the point of decision the value 
 present 99 times contributes half the average, so the average is 50, 
 and of course it can't possibly be twice of that.

Oh, right. How could I miss that? ;-)

 I have a patch, but is there a way to determine how it affects a
 wide variety of situations? I guess run `make installcheck`, then
 analyze, then dump pg_stats, with the patch and without the patch,
 and then compare the dumps?

I doubt there's such a way. I'd argue that if you can show this always
generates longer MCV lists, we can assume the stats are probably more
accurate, and thus the plans should be better.

Of course, there's always the possibility that the plan was good by
luck, and improving the estimates will result in a worse plan. But I
don't think we can really fix that :-(

 It is also grossly inconsistent with the other behavior. If they
 are 29900; 98; 2 then all three go into the MCV.
 
 Isn't the mincount still 12500? How could all three get into the
 MCV?
 
 If all observed values are observed at least twice, it takes a 
 different path through the code. It just keeps them all in the MCV 
 list. That is what is causing the instability for the OP. If the 3rd 
 most common is seen twice, then all three are kept. If it is seen 
 once, then only the most common is kept. See if statements at 2494 
 and 2585
 
 else if (toowide_cnt == 0 && nmultiple == ndistinct)
 
 if (track_cnt == ndistinct &&

Aha, I think I see it now. I've been concentrating on this code:

   avgcount = (double) samplerows / ndistinct;
   /* set minimum threshold count to store a value */
   mincount = avgcount * 1.25;
   if (mincount < 2)
  mincount = 2;

but this is actually too late, because first we do this:

  else if (toowide_cnt == 0 && nmultiple == ndistinct)
  {
     stats->stadistinct = ndistinct;
  }

and that only happens if each item is observed at least 2x in the sample
(and the actual Haas and Stokes estimator is not used).

And then we do this:

  if (track_cnt == ndistinct && toowide_cnt == 0 &&
      stats->stadistinct > 0 && track_cnt <= num_mcv)
  {
      num_mcv = track_cnt;
  }

so that we track everything.

If at least one value is seen only 1x, it works differently, and we use
the code with (1.25*avgcount) threshold.

I wonder where the 1.25x threshold comes from - whether it's something
we came up with, or if it comes from some paper. I guess the former.


-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Index only scan sometimes switches to sequential scan for small amount of rows

2015-03-25 Thread Tomas Vondra
On 25.3.2015 13:04, Feike Steenbergen wrote:
...
 When analyzing pg_stats we have sometimes have the following: (Note:
 'NOT_YET_PRINTED' has not been found during this analyze, these are
 real values)
 
  attname| status
  inherited  | f
  null_frac  | 0
  avg_width  | 4
  n_distinct | 3
  most_common_vals   | {PRINTED}
  most_common_freqs  | {0.996567}
  histogram_bounds   | {PREPARED,ERROR}
  correlation| 0.980644
 
 A question about this specific entry, which some of you may be able to
 shed some light on:
 
 most_common_vals contains only 1 entry, why is this? I would expect to
 see 3 entries, as it has n_distinct=3

To be included in the MCV list, the value has to actually appear in the
random sample at least twice, IIRC. If the values are very rare (e.g. if
you only have 10 such rows out of 3.5M), that may not happen.

You may try increasing the statistics target for this column, which
should make the sample larger and stats more detailed (max is 10000,
which should use a sample of ~3M rows, i.e. almost the whole table).
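
For example (assuming the column shown above lives in the print_list table 
mentioned below; 10000 is the maximum target):

   ALTER TABLE print_list ALTER COLUMN status SET STATISTICS 10000;
   ANALYZE print_list;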

 When looking at
 http://www.postgresql.org/docs/current/static/row-estimation-examples.html
 we can see that an estimate > 5000 is what is to be expected for these
 statistics:
 
 # select ( (1 - 0.996567)/2 * 3500000 )::int;
  int4
 --
  6008
 (1 row)
 
 But why does it not record the frequency of 'PREPARED' and 'ERROR'
 in most_common_*?

Can you post results for this query?

SELECT status, COUNT(*) FROM print_list GROUP BY 1

I'd like to know how frequent the other values are.

 
 Our current strategies in mitigating this problem is decreasing the 
 autovacuum_*_scale_factor for this specific table, therefore 
 triggering more analyses and vacuums.

I'm not sure this is a good solution. The problem is elsewhere, IMHO.

 This is helping somewhat, as if the problem occurs it often solved 
 automatically if autoanalyze analyzes this table, it is analyzed
 many times an hour currently.
 
 We can also increase the 'Stats target' for this table, which will
 cause the statistics to contain information about 'NOT_YET_PRINTED'
 more often, but even then, it may not find any of these records, as
 they sometimes do not exist.

This is a better solution, IMHO.

 
 Could you help us to find a strategy to troubleshoot this issue
 further?

You might also make the index scans cheaper, so that the switch to
sequential scan happens later (when more rows are estimated). Try
decreasing random_page_cost from 4 (default) to 1.5 or something like that.

It may hurt other queries, though, depending on the dataset size etc.
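
You can test the effect at the session level first, before changing it 
globally (1.5 is just a starting point, not a recommendation):

   SET random_page_cost = 1.5;
   -- then re-run EXPLAIN ANALYZE on the problem query and compare the plans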

-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-22 Thread Tomas Vondra
On 22.3.2015 22:50, Vivekanand Joshi wrote:
 Any documentation regarding how to configure postgresql.conf file as per
 individual user?

That can't be done in postgresql.conf, but it can be done with ALTER ROLE commands.
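
For example (a sketch - the role name and the values are just illustrative):

   ALTER ROLE reporting_user SET work_mem = '256MB';
   ALTER ROLE reporting_user SET enable_nestloop = off;

The settings then apply to new sessions of that role only, without touching
postgresql.conf.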



-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-18 Thread Tomas Vondra
Hi,

On 18.3.2015 18:31, Vivekanand Joshi wrote:
 So, here is the first taste of success and which gives me the
 confidence that if properly worked out with a good hardware and
 proper tuning, PostgreSQL could be a good replacement.
 
 Out of the 9 reports which needs to be migrated in PostgreSQL, 3 are
 now running.
 
 Report 4 was giving an issue and I will see it tomorrow.
 
 Just to inform you guys that, the thing that helped most is setting 
 enable_nestloops to false worked. Plans are now not miscalculated.

The estimates are still miscalculated, but you're forcing the database
not to use the nested loop. The problem is the nested loop may be
appropriate in some cases (maybe only in a few places of the plan) so
this is a really coarse-grained solution.

 But this is not a production-suitable setting. So what do you think
 how to get a work around this?

(a) Try to identify why the queries are poorly estimated, and rephrase
them somehow. This is the best solution, but takes time, expertise
and may not be feasible in some cases.

(b) Tweak the database structure, possibly introducing intermediate
tables, materialized views (or tables maintained by triggers - this
might work for the 'latest record' subquery; see the sketch after this
list), etc.

(c) Try to tweak the cost parameters, to make the nested loops more
expensive (and thus less likely to be selected), but in a more
gradual way than enable_nestloops=false.
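
For (b), a minimal sketch of the materialized-view approach (assumes 9.3+
so materialized views are available; all the names here are made up, this
is not your actual schema):

   CREATE MATERIALIZED VIEW mv_latest_record AS
   SELECT DISTINCT ON (customer_id) customer_id, record_date, status
     FROM history_table
    ORDER BY customer_id, record_date DESC;

   CREATE INDEX ON mv_latest_record (customer_id);

   -- refresh after each data load
   REFRESH MATERIALIZED VIEW mv_latest_record;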

regards

-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Hardware Configuration and other Stuff

2015-03-18 Thread Tomas Vondra
Hi,

On 18.3.2015 20:07, Vivekanand Joshi wrote:
 Hi Team,
 
 I don't know under which section does this question comes, so I am
 posting this question to both Admin and performance mailing list.
 Apologies in advance.

Let's keep this in pgsql-performance.

 
 Objective:
 
 We are planning to use PostgreSQL instead of Netezza for our data 
 warehouse as well as database solutions. Right now, we have all our 
 clients in one Netezza box. What we are thinking of migrating our 
 clients to dedicated PostgreSQL for each of them. We will start with
 one of the client. If it works successfully, we will be migrating all
 the clients one by one. The objective is to get a better performance
 than our existing solution. We are hopeful of that mainly because of
 two reasons. Firstly, we will have a dedicated server for each of the
 client with good hardware instead of having one server with all the
 clients on that. Secondly, we can spend on hardware much easily than
 spending on a proprietary appliance.


OK.

 I am hoping this community can help us to know that what would be the
 good infrastructure/hardware that can help us in achieving our goal.
 
 Here are few of the statistics which might act as a starting point.
 
 Availability: High (24*7).
 
 User Data : 700 GB which will increase to 1.5 TB in next 2-3 years.

How do you measure the amount of data? Is that the amount of data before
loading, size of the database, or what?

Also, is this a single client (thus placed on a single box), or multiple
clients?

 Number of User Databases : 2 (One is the main database, other is
 used only for working tables where tables gets deleted in every 48
 hours)

You mentioned 700GB of data - is that just the main database, or both
databases?

 
 Number of tables : 200 (in the main database), (2000-3000 in working 
 database)
 
 Size of top 5 biggest tables : 20-40 GB
 
 No of users concurrently accessing the system : 5-6 with write
 access. 10 with read access.
 
 No of User Queries running on the system in a day : ~80K
 
 Read-only Queries (Select): ~60K
 
 Write queries: ~20K
 
 Data Import Queries: ~1K
 
 Typical Business Day : 18-20 hours.

So is this a typical batch environment where you do the loads at night,
with no user queries running during the loads? That might be possible with clients on dedicated
boxes and would allow various optimizations.

 
 I can pass on few complex queries to let you guys know what are we
 doing.
 
 Here are few questions:
  
 1.) I don't need a load balancing solution. It must be high availability
 server and I can work with asynchronous replication. The most important
 thing here would be recovery should be as fast as possible.
 
 What approach would you recommend?

Streaming replication. I would probably start with sync replication.

 2.) Recommendations on indexes, WAL, table spaces. I am not asking
 about on which key I need to make indexes, but an high level approach
 about how to keep them? This might come out as a weird question to
 many but please excuse me for being a novice.

Not sure what exactly you are looking for - there's a lot of things, and
many of them depend on what hardware you plan to use.

The simplest indexing strategy is to design them along with the schema,
and evaluate them on queries (collect slow queries - create suitable
indexes - repeat).
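
For collecting the slow queries, pg_stat_statements is usually the
easiest tool - a minimal sketch, assuming the extension is available
(it needs to be in shared_preload_libraries, which requires a restart):

    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    SELECT calls,
           round((total_time / calls)::numeric, 2) AS avg_ms,
           query
      FROM pg_stat_statements
     ORDER BY total_time DESC
     LIMIT 20;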

 
 3.) What would be the ideal hardware configuration for this
 requirement? I know there is not a one-stop answer for this, but
 let's take it is a starting point. We can come to a proper conclusion
 after a discussion.

This is a very complex question, to be honest. I assume you're looking
for regular servers; in that case a good server with not that many CPUs
(say, 32 cores seems to be enough for your workload), plenty of RAM and
a good disk system to handle the load would be a good start.

 What are the best on-line resources/books which can tell us about
 the hardware requirements?

I'd say these two books would be helpful:

(1)
https://www.packtpub.com/big-data-and-business-intelligence/postgresql-9-high-availability-cookbook

   - explains capacity planning etc.

(2)
https://www.packtpub.com/big-data-and-business-intelligence/postgresql-90-high-performance

   - a good book about PostgreSQL performance tuning

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-17 Thread Tomas Vondra
On 17.3.2015 16:10, Vivekanand Joshi wrote:
 The confusion for me here is that :
 
 
 I am getting results from the view in around 3 seconds
 (S_V_D_CAMPAIGN_HIERARCHY) and 25 seconds (S_V_F_PROMOTION_HISTORY_EMAIL)
 
 But when I am using these two views in the query as the joining 
 tables, it doesn't give any result. As per my understanding, the 
 planner is making new plan and that is costly instead of using
 output from the view, which is actually understandable.

In general, yes. The problem is that the plan is constructed based on
the estimates, and those are very inaccurate in this case.

The planner may do various changes, but let's assume that does not
happen and the plans are executed and the results are joined.

For example what might happen is this:

   for each row in 's_v_d_campaign_hierarchy' (1 row expected):
  execute s_v_f_promotion_history_email  join (11644 rows exp.)

But then it gets 45k rows from s_v_d_campaign_hierarchy, and ~400x more
rows from s_v_f_promotion_history_email (I'm neglecting the join
condition here, but that's not really significant). Kaboom!

In reality, the plan is reorganized (e.g. different join order), but the
misestimates are still lurking there.

 Is there a way, we can do anything about it?

Rephrasing the query so that the planner can estimate it more accurately.



-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-17 Thread Tomas Vondra
Hi,

On 17.3.2015 08:41, Vivekanand Joshi wrote:
 Hi Guys,
 
 Next level of query is following:
 
 If this works, I guess 90% of the problem will be solved.
 
 SELECT
 COUNT(DISTINCT TARGET_ID)
 FROM
 S_V_F_PROMOTION_HISTORY_EMAIL PH
 INNER JOIN S_V_D_CAMPAIGN_HIERARCHY CH
 ON PH.TOUCHPOINT_EXECUTION_ID =
 CH.TOUCHPOINT_EXECUTION_ID
 WHERE
 1=1
 AND SEND_DT >= '2014-03-13'
 AND SEND_DT <= '2015-03-14'
 
 
 In this query, I am joining two views which were made earlier with CTEs. I
 have replaced the CTE's with subqueries. The view were giving me output in
 around 5-10 minutes and now I am getting the same result in around 3-4
 seconds.
 
 But when I executed the query written above, I am again stuck. I am
 attaching the query plan as well the link.
 
 http://explain.depesz.com/s/REeu
 
 I can see most of the time is spending inside a nested loop and total
 costs comes out be cost=338203.81..338203.82.

Most of that cost comes from this:

Seq Scan on s_f_promotion_history base (cost=0.00..283,333.66 rows=1
width=32)
  Filter: ((send_dt >= '2014-03-13 00:00:00'::timestamp without time
zone) AND (send_dt <= '2015-03-14 00:00:00'::timestamp without time
zone))

That's a bit weird, I guess. If you analyze this part of the query
separately, i.e.

EXPLAIN ANALYZE SELECT * FROM s_f_promotion_history
 WHERE (send_dt >= '2014-03-13 00:00:00')
   AND (send_dt <= '2015-03-14 00:00:00')

what do you get?

I suspect it's used in EXISTS, i.e. something like this:

... WHERE EXISTS (SELECT * FROM s_f_promotion_history
   WHERE ... send_dt conditions ...
 AND touchpoint_execution_id =
 s_f_touchpoint_execution_status_history_1.touchpoint_execution_id)

and this is transformed into a nested loop join. If there's a
misestimate, this may be quite expensive - try to create index on

   s_f_promotion_history (touchpoint_execution_id, send_dt)
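
In DDL form that's something like this (the index name is arbitrary):

    CREATE INDEX idx_promo_hist_tpexec_send_dt
        ON s_f_promotion_history (touchpoint_execution_id, send_dt);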


regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-17 Thread Tomas Vondra
On 17.3.2015 12:07, Vivekanand Joshi wrote:
 EXPLAIN ANALYZE didn't give result even after three hours.

In that case the only thing you can do is 'slice' the query into smaller
parts (representing subtrees of the plan), and analyze those first. Look
for misestimates (significant differences between estimated and actual
row counts, and very expensive parts).

We can't do that, because we don't have your data or queries, and
without the explain analyze it's difficult to give advice.


-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-17 Thread Tomas Vondra
Just as I feared, the attached explain analyze results show significant
misestimates, like this for example:

Nested Loop  (cost=32782.19..34504.16 rows=1 width=16)
(actual time=337.484..884.438 rows=46454 loops=1)

Nested Loop  (cost=18484.94..20366.29 rows=1 width=776)
  (actual time=2445.487..3741.049 rows=45360 loops=1)

Hash Left Join  (cost=34679.90..37396.37 rows=11644 width=148)
  (actual time=609.472..9070.675 rows=4559289 loops=1)

There's plenty of nested loop joins - the optimizer believes there will
be only a few rows in the outer relation, but gets orders of magnitude
more tuples. And nested loops are terrible in that case.

In case of the first view, it seems to be caused by this:

Merge Cond:
((s_f_touchpoint_execution_status_history.touchpoint_execution_id =
s_f_touchpoint_execution_status_history_1.touchpoint_ex
ecution_id) AND (s_f_touchpoint_execution_status_history.creation_dt =
(max(s_f_touchpoint_execution_status_history_1.creation_dt

especially the ':id = max(:id)' condition is probably giving the
optimizer a hard time. This is a conceptually difficult problem (i.e.
fixing this at the optimizer level is unlikely to happen any time soon,
because it effectively means you have to predict the statistical
properties of the aggregation).

You may try increasing the statistics target, which makes the stats a
bit more detailed (the default on 9.4 is 100):

SET default_statistics_target = 10000;
ANALYZE;

But I don't really believe this might really fix the problem.

But maybe it's possible to rewrite the query somehow?

Let's experiment a bit - remove the aggregation, i.e. join directly to
s_f_touchpoint_execution_status_history. It'll return wrong results, but
the estimates should be better, so let's see what happens.

You may also try disabling nested loops - the other join algorithms
usually perform better with large row counts.

SET enable_nestloop = false;

This is not a production-suitable solution, but for experimenting that's OK.

ISTM what the aggregation (or the whole mergejoin) does is selecting the
last s_f_touchpoint_execution_status_history record for each
touchpoint_execution_id.

There are better ways to determine that, IMHO. For example:

 (1) adding an 'is_last' flag to s_f_touchpoint_execution_status_history

 This however requires maintaining that flag somehow, but the join
 would not be needed at all.

 The last IDs might be maintained in a separate table - the join
 would still be necessary, but it might be less intrusive and
 cheaper to maintain.

 (2) using window functions, e.g. like this:

 SELECT * FROM (
   SELECT *,
ROW_NUMBER() OVER (PARTITION BY touchpoint_execution_id
   ORDER BY max_creation_dt) AS rn
   FROM s_f_touchpoint_execution_status_history
 ) foo WHERE rn = 1

 But estimating this is also rather difficult ...

 (3) Using temporary table / MV - this really depends on your
 requirements, load schedule, how you run the queries etc. It would
 however fix the estimation errors (probably).

The 2nd view seems to suffer because of the same issue (underestimates
leading to choice of nested loops), but caused by something else:

->  Hash Join  (cost=1954.13..6249.67 rows=13 width=108)
 (actual time=31.777..210.346 rows=72670 loops=1)
  Hash Cond: ((tp_exec.touchpoint_id = tp.touchpoint_id)
  AND (wave_exec.wave_id = tp.wave_id))

Estimating cardinality of joins with multi-column conditions is
difficult, no idea how to fix that at the moment.


-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-17 Thread Tomas Vondra
On 17.3.2015 15:19, Thomas Kellerer wrote:
 Tomas Vondra wrote on 17.03.2015 at 14:55:
  (2) using window functions, e.g. like this:

  SELECT * FROM (
SELECT *,
 ROW_NUMBER() OVER (PARTITION BY touchpoint_execution_id
 ORDER BY max_creation_dt) AS rn
FROM s_f_touchpoint_execution_status_history
  ) foo WHERE rn = 1

  But estimating this is also rather difficult ...
 
 
 From my experience rewriting something like the above using DISTINCT 
 ON is usually faster.

How do you get the last record (with respect to a timestamp column)
using a DISTINCT ON?


-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-17 Thread Tomas Vondra
On 17.3.2015 16:24, Thomas Kellerer wrote:
 Tomas Vondra wrote on 17.03.2015 at 15:43:
 On 17.3.2015 15:19, Thomas Kellerer wrote:
 Tomas Vondra wrote on 17.03.2015 at 14:55:
  (2) using window functions, e.g. like this:

  SELECT * FROM (
SELECT *,
 ROW_NUMBER() OVER (PARTITION BY touchpoint_execution_id
ORDER BY max_creation_dt) AS rn
FROM s_f_touchpoint_execution_status_history
  ) foo WHERE rn = 1

  But estimating this is also rather difficult ...


 From my experience rewriting something like the above using DISTINCT 
 ON is usually faster.

 How do you get the last record (with respect to a timestamp column)
 using a DISTINCT ON?
 
 You need to use order by ... desc. See here: 
 http://sqlfiddle.com/#!15/d4846/2

Nice, thanks!

 
 Btw: your row_number() usage wouldn't return the latest row either. 
 It would return the oldest row.

Oh, right. I forgot the DESC in the window.
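
For the archives, the corrected variants would look roughly like this
(assuming the timestamp column is the creation_dt mentioned earlier in
the thread):

    -- DISTINCT ON variant
    SELECT DISTINCT ON (touchpoint_execution_id) *
      FROM s_f_touchpoint_execution_status_history
     ORDER BY touchpoint_execution_id, creation_dt DESC;

    -- window function variant, with the missing DESC
    SELECT * FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY touchpoint_execution_id
                                ORDER BY creation_dt DESC) AS rn
        FROM s_f_touchpoint_execution_status_history
    ) foo WHERE rn = 1;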


-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-16 Thread Tomas Vondra
On 16.3.2015 20:43, Jim Nasby wrote:
 On 3/13/15 7:12 PM, Tomas Vondra wrote:
 (4) I suspect many of the relations referenced in the views are not
  actually needed in the query, i.e. the join is performed but
  then it's just discarded because those columns are not used.
 Try to simplify the views as much as possible - remove all the
  tables that are not really necessary to run the query. If two
  queries need different tables, maybe defining two views is
  a better approach.
 
 A better alternative with multi-purpose views is to use an outer
 join instead of an inner join. With an outer join if you ultimately
 don't refer to any of the columns in a particular table Postgres will
 remove the table from the query completely.

Really? Because a quick test suggests otherwise:

db=# create table test_a (id int);
CREATE TABLE
db=# create table test_b (id int);
CREATE TABLE
db=# explain select test_a.* from test_a left join test_b using (id);
  QUERY PLAN
--
 Merge Left Join  (cost=359.57..860.00 rows=32512 width=4)
   Merge Cond: (test_a.id = test_b.id)
   ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
         Sort Key: test_a.id
         ->  Seq Scan on test_a  (cost=0.00..35.50 rows=2550 width=4)
   ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
         Sort Key: test_b.id
         ->  Seq Scan on test_b  (cost=0.00..35.50 rows=2550 width=4)
(8 rows)

Also, how would that work with duplicate rows in the referenced table?


-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-16 Thread Tomas Vondra
On 16.3.2015 18:49, Marc Mamin wrote:
 
 Hi Team,

 This is the EXPLAIN ANALYZE for one of the view : S_V_D_CAMPAIGN_HIERARCHY:

FWIW, this is a somewhat more readable version of the plan:

http://explain.depesz.com/s/nbB

In the future, please do two things:

(1) Attach the plan as a text file, because the mail clients tend to
screw things up (wrapping long lines). Unless the plan is trivial,
of course - but pgsql-performance usually deals with complex stuff.

(2) Putting the plan on explain.depesz.com helps too, because it's
considerably more readable (but always do 1, because resources
placed somewhere else tend to disappear, and the posts then make
very little sense, which is bad when searching in the archives)

(3) Same for stuff pasted somewhere else - always attach it to the
message. For example I'd like to give you more accurate advice, but
I can't as http://pgsql.privatepaste.com/41207bea45 is unavailable.

 
 
   Rows Removed by Join Filter: 3577676116
 
   That's quite a lot.
   You're possibly missing a clause in a join, resulting in a cross join.
   It is also helpful to put your result here:
   http://explain.depesz.com/
   regards,

IMHO this is merely a consequence of using the CTE, which produces 52997
rows and is scanned 67508x as the inner relation of a nested loop. That
gives you 3577721476 tuples in total, and only 45360 are kept (hence
3577676116 are removed).

This is a prime example of why CTEs are not just aliases for subqueries,
but may actually cause serious trouble.

There are other issues (e.g. the row count estimate of the CTE is
seriously off, most likely because of the HashAggregate in the outer
branch), but that's a secondary issue IMHO.

Vivekanand, try this (in the order of intrusiveness):

(1) Get rid of the CTE, and just replace it with subselect in the FROM
part of the query, so instead of this:

WITH valid_executions AS (...)
SELECT ... FROM ... JOIN valid_executions ON (...)

you'll have something like this:

SELECT ... FROM ... JOIN (...) AS valid_executions ON (...)

This way the subselect will be optimized properly.


(2) Replace the CTE with a materialized view, or a temporary table.
This has both advantages and disadvantages - the main advantage is
that you can create indexes and collect statistics. The disadvantage is
you have to refresh the MV, fill the temporary table etc.
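
A rough sketch of (2), assuming (as the rest of the thread suggests)
that the valid_executions CTE essentially picks the last status record
per touchpoint_execution_id, and that the server is new enough (9.3+)
for materialized views:

    CREATE MATERIALIZED VIEW valid_executions AS
    SELECT touchpoint_execution_id, max(creation_dt) AS max_creation_dt
      FROM s_f_touchpoint_execution_status_history
     GROUP BY touchpoint_execution_id;

    CREATE INDEX ON valid_executions (touchpoint_execution_id);

    -- has to be refreshed whenever the underlying data changes
    REFRESH MATERIALIZED VIEW valid_executions;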

I expect (1) to improve the performance significantly, and (2) might
improve it even further by fixing the misestimates.


regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] MusicBrainz postgres performance issues

2015-03-15 Thread Tomas Vondra
On 15.3.2015 13:07, Robert Kaye wrote:

 If the load problem really is being caused by swapping when things
 really shouldn't be swapping, it could be a matter of adjusting your
 swappiness - what does cat /proc/sys/vm/swappiness show on your server?
 
 0 
 
 We adjusted that too, but no effect.
 
 (I’ve updated the blog post with these two comments)

IMHO setting swappiness to 0 is way too aggressive. Just set it to
something like 10 - 20, that works better in my experience.


 There are other linux memory management things that can cause
 postgres and the server running it to throw fits like THP and zone
 reclaim. I don't have enough info about your system to say they are
 the cause either, but check out the many postings here and other
 places on the detrimental effect that those settings *can* have.
 That would at least give you another angle to investigate.
 
 If there are specific things you’d like to know, I’ve be happy to be a
 human proxy. :)

I'd start with vm.* configuration, so the output from this:

# sysctl -a | grep '^vm.*'

and possibly /proc/meminfo. I'm especially interested in the overcommit
settings, because per the free output you provided there's ~16GB of free
RAM.

BTW what amounts of data are we talking about? How large is the database
and how large is the active set?


I also noticed you use kernel 3.2 - that's not the best kernel version
for PostgreSQL - see [1] or [2] for example.

[1]
https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4

[2]
http://www.databasesoup.com/2014/09/why-you-need-to-avoid-linux-kernel-32.html



-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] MusicBrainz postgres performance issues

2015-03-15 Thread Tomas Vondra
On 15.3.2015 23:47, Andres Freund wrote:
 On 2015-03-15 12:25:07 -0600, Scott Marlowe wrote:
 Here's the problem with a large shared_buffers on a machine that's
 getting pushed into swap. It starts to swap BUFFERs. Once buffers
 start getting swapped you're not just losing performance, that huge
 shared_buffers is now working against you because what you THINK are
 buffers in RAM to make things faster are in fact blocks on a hard
 drive being swapped in and out during reads. It's the exact opposite
 of fast. :)
 
 IMNSHO that's tackling things from the wrong end. If 12GB of shared 
 buffers drive your 48GB dedicated OLTP postgres server into swapping
 out actively used pages, the problem isn't the 12GB of shared
 buffers, but that you require so much memory for other things. That
 needs to be fixed.

I second this opinion.

As was already pointed out, the 500 connections is rather insane
(assuming the machine does not have hundreds of cores).

If there are memory pressure issues, it's likely because many queries
are performing memory-expensive operations at the same time (might even
be a bad estimate causing hashagg to use much more than work_mem).


 But! We haven't even established that swapping is an actual problem
 here. The ~2GB of swapped out memory could just as well be the java raid
 controller management monstrosity or something similar. Those pages
 won't ever be used and thus can better be used to buffer IO.
 
 You can check what's actually swapped out using:
 grep ^VmSwap /proc/[0-9]*/status|grep -v '0 kB'
 
 For swapping to be actually harmful you need to have pages that are 
 regularly swapped in. vmstat will tell.

I've already asked for vmstat logs, so let's wait.

 In a concurrent OLTP workload (~450 established connections do
 suggest that) with a fair amount of data keeping the hot data set in 
 shared_buffers can significantly reduce problems. Constantly
 searching for victim buffers isn't a nice thing, and that will happen
 if your most frequently used data doesn't fit into s_b. On the other
 hand, if your data set is so large that even the hottest part doesn't
 fit into memory (perhaps because there's no hottest part as there's
 no locality at all), a smaller shared buffers can make things more
 efficient, because the search for replacement buffers is cheaper with
 a smaller shared buffers setting.

I've met many systems with max_connections values this high, and it was
mostly idle connections because of separate connection pools on each
application server. So mostly idle (90% of the time), but at peak time
all the application servers want to do stuff at the same time. And it
all goes KABOOOM! just like here.


-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] MusicBrainz postgres performance issues

2015-03-15 Thread Tomas Vondra
On 16.3.2015 00:55, mich...@sqlexec.com wrote:
 Why is 500 connections insane. We got 32 CPU with 96GB and 3000
 max connections, and we are doing fine, even when hitting our max
 concurrent connection peaks around 4500. At a previous site, we were
 using 2000 max connections on 24 CPU and 64GB RAM, with about 1500
 max concurrent connections. So I wouldn't be too hasty in saying more
 than 500 is asking for trouble. Just as long as you got your kernel
 resources set high enough to sustain it (SHMMAX, SHMALL, SEMMNI, and
 ulimits), and RAM for work_mem.

If all the connections are active at the same time (i.e. running
queries), they have to share the 32 cores somehow. Or I/O, if that's the
bottleneck.

In other words, you're not improving the throughput of the system,
you're merely increasing latencies. And it may easily happen that the
latency increase is not linear, but grows faster - because of locking,
context switches and other process-related management.

Imagine you have a query taking 1 second of CPU time. If you have 64
such queries running concurrently on 32 cores, each gets only 1/2 a CPU
and so takes >=2 seconds. With 500 queries, it's >=15 seconds per query, etc.

If those queries are acquiring the same locks (e.g. updating the same
rows, or so), you can imagine what happens ...

Also, if part of the query required a certain amount of memory for part
of the plan, it now holds that memory for much longer too. That only
increases the chance of OOM issues.

It may work fine when most of the connections are idle, but it makes
storms like this possible.


-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] MusicBrainz postgres performance issues

2015-03-15 Thread Tomas Vondra
On 15.3.2015 18:54, Ilya Kosmodemiansky wrote:
 On Sun, Mar 15, 2015 at 8:46 PM, Andres Freund and...@2ndquadrant.com wrote:
 That imo doesn't really have anything to do with it. The primary
 benefit of a BBU with writeback caching is accelerating
 (near-)synchronous writes. Like the WAL.
 
 My point was, that having no proper raid controller (today bbu
 surely needed for the controller to be a proper one) + heavy writes
 of any kind, it is absolutely impossible to live with large
 shared_buffers and without io problems.

That is not really true, IMHO.

The benefit of the write cache is that it can absorb certain amount of
writes, equal to the size of the cache (nowadays usually 512MB or 1GB),
without forcing them to disks.

It however still has to flush the dirty data to the drives later, but
that side usually has much lower throughput - e.g. while you can easily
write several GB/s to the controller, the drives usually handle only
~1MB/s of random writes each (I assume rotational drives here).

But if you do a lot of random writes (which is likely the case for
write-heavy databases), you'll fill the write cache pretty soon and will
be bounded by the drives anyway.

The controller really can't help with sequential writes much, because
the drives already handle that quite well. And SSDs are a completely
different story of course.

That does not mean the write cache is useless - it can absorb short
bursts of random writes, fix the write hole with RAID5, the controller
may compute the parity computation etc. Whenever someone asks me whether
they should buy a RAID controller with write cache for their database
server, my answer is absolutely yes in 95.23% cases ...

... but really it's not something that magically changes the limits for
write-heavy databases - the main limit are still the drives.

regards
Tomas

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-13 Thread Tomas Vondra
On 13.3.2015 21:46, Vivekanand Joshi wrote:
 Since I was doing it only for the testing purposes and on a
 development server which has only  8 GB of RAM, I used only 10m rows.
 But the original table has 1.5 billion rows. We will obviously be
 using a server with very high capacity, but I am not satisfied with
 the performance at all. This might be only a start, so I might get a
 better performance later.

OK, understood.

 Yes, the view is complex and almost is created by using 10 tables. Same
 goes with other views as well but this is what we are using in Netezza
 as well. And we are getting results of the full report in less than 5
 seconds. And add to that, this is only a very little part of the whole
 query used in a report.

Well, in the very first message you asked "Is the query written
correctly as per the PostgreSQL?" - how can we decide that when most of
the query is hidden in some unknown view?

 I will post the result of whole query with Explain analyze tomorrow.

Please also collect some information about the system using iostat,
vmstat and such, so that we know what is the bottleneck.

 We might even consider taking experts advice on how to tune queries
 and server, but if postgres is going to behave like this, I am not
 sure we would be able to continue with it.

That's probably a good idea.


-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-13 Thread Tomas Vondra
Hi,

On 13.3.2015 20:59, Vivekanand Joshi wrote:
 I am really worried about the performance of PostgreSQL as we have
 almost 1.5 billion records in promotion history table. Do you guys

In the previous message you claimed the post table has 10M rows ...

 really think PostgreSQL can handle this much load. We have fact
 tables which are more than 15 GB in size and we have to make joins
 with those tables in almost every query.

That depends on what performance you're looking for. You'll have to
provide considerably more information until we can help you. You might
want to check this:

   https://wiki.postgresql.org/wiki/Slow_Query_Questions

You have not provided the full query, just a query apparently
referencing views, so that the actual query is way more complicated.
Also, plain EXPLAIN is not really sufficient, we need EXPLAIN ANALYZE.

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2015-03-13 Thread Tomas Vondra
On 14.3.2015 00:28, Vivekanand Joshi wrote:
 Hi Guys,
 
 So here is the full information attached as well as in the link
 provided below:
 
 http://pgsql.privatepaste.com/41207bea45
 
 I can provide new information as well.

Thanks.

We still don't have EXPLAIN ANALYZE - how long was the query running (I
assume it got killed at some point)? It's really difficult to give you
any advices because we don't know where the problem is.

If EXPLAIN ANALYZE really takes too long (say, it does not complete
after an hour / over night), you'll have to break the query into parts
and first tweak those independently.

For example in the first message you mentioned that select from the
S_V_D_CAMPAIGN_HIERARCHY view takes ~9 minutes, so start with that. Give
us EXPLAIN ANALYZE for that query.

Few more comments:

(1) You're using CTEs - be aware that CTEs are not just aliases, but
impact planning / optimization, and in some cases may prevent
proper optimization. Try replacing them with plain views.

(2) Varadharajan Mukundan already recommended you to create an index on
s_f_promotion_history.send_dt. Have you tried that? You may also
try creating an index on all the columns needed by the query, so
that an Index Only Scan is possible (a concrete sketch follows
after this list).

(3) There are probably additional indexes that might be useful here.
What I'd try is adding indexes on all columns that are either a
foreign key or used in a WHERE condition. This might be an
overkill in some cases, but let's see.

(4) I suspect many of the relations referenced in the views are not
actually needed in the query, i.e. the join is performed but
then it's just discarded because those columns are not used.
Try to simplify the views as much as possible - remove all the
tables that are not really necessary to run the query. If two
queries need different tables, maybe defining two views is
a better approach.

(5) The vmstat / iostat data are pretty useless - what you provided are
averages since the machine was started, but we need a few samples
collected when the query is running. I.e. start the query, and then
give us a few samples from these commands:

iostat -x -k 1
vmstat 1
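
Regarding (2), a concrete sketch of such a covering index - the column
list is only a guess based on the query posted earlier, so adjust it to
whatever the query actually reads:

    CREATE INDEX idx_promo_hist_covering
        ON s_f_promotion_history (touchpoint_execution_id, send_dt, target_id);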

 Would like to see if queries of these type can actually run in
 postgres server?

Why not? We're running DWH applications on tens/hundreds of GBs.

 If yes, what would be the minimum requirements for hardware? We would
 like to migrate our whole solution on PostgreSQL as we can spend on
 hardware as much as we can but working on a proprietary appliance is
 becoming very difficult for us.

That's difficult to say, because we really don't know where the problem
is and how much the queries can be optimized.


-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT

2015-03-11 Thread Tomas Vondra
On 11.3.2015 18:30, Jeff Janes wrote:
 On Sat, Mar 7, 2015 at 7:44 AM, Tomas Vondra
 tomas.von...@2ndquadrant.com wrote:
 
 On 7.3.2015 03:26, Jeff Janes wrote:
  On Fri, Mar 6, 2015 at 5:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
  But the actual query is using a seq scan, and so it would hint the
  table in efficient sequential order, rather than hinting the table
  haphazardly in index order like probing the endpoint does.
 
 I think this has nothing to do with the plan itself, but with the
 estimation in optimizer - that looks up the range from the index in some
 cases, and that may generate random I/O to the table.
 
 
 Right.  Tom was saying that the work needs to be done anyway, but it is
 just that some ways of doing the work are far more efficient than
 others.  It just happens that the executed plan in this case would do it
 more efficiently, (but in general you aren't going to get any less
 efficient than having the planner do it in index order).

Oh! Now I see what you meant. I parsed it as if you were suggesting that
the theory does not match the symptoms because the plan contains a
sequential scan yet there's a lot of random I/O. But now I see that's
not what you claimed, so sorry for the noise.



-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT

2015-03-07 Thread Tomas Vondra
On 7.3.2015 03:26, Jeff Janes wrote:
 On Fri, Mar 6, 2015 at 5:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
 But the actual query is using a seq scan, and so it would hint the
 table in efficient sequential order, rather than hinting the table
 haphazardly in index order like probing the endpoint does.

I think this has nothing to do with the plan itself, but with the
estimation in optimizer - that looks up the range from the index in some
cases, and that may generate random I/O to the table.

 
 Also, it's less than clear why only this particular query is
 showing any stress. Dead rows should be a hazard for anything,
 especially if there are enough of them to require hours to re-hint.
 And why wouldn't autovacuum get to them first?
 
 
 Say the timing of this query is such that 10% of the parent turns
 over between invocations of this query, and that this 10% is all at
 the end of some index but random over the table heap. If autovac
 kicks in at 20% turn over, then half the time autovac does get to
 them first, and half the time it doesn't. It would be interesting to
 know if this query is bad every time it is planned, or just
 sometimes.

Yeah, this might be the reason. Another possibility is that this is part
of some large batch, and autovacuum simply did not have a chance to do the
work.

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT

2015-03-06 Thread Tomas Vondra
On 6.3.2015 01:44, Tom Lane wrote:
 Tomas Vondra tomas.von...@2ndquadrant.com writes:
 On 5.3.2015 16:01, Gunnlaugur Thor Briem wrote:
 - postgres version is 9.1.13
 
 The only thing I can think of is some sort of memory exhaustion,
 resulting in swapping out large amounts of memory.
 
 I'm wondering about the issue addressed by commit fccebe421 (Use 
 SnapshotDirty rather than an active snapshot to probe index
 endpoints). Now, that was allegedly fixed in 9.1.13 ... but if the
 OP were confused and this server were running, say, 9.1.12, that
 could be a viable explanation. Another possibly viable explanation
 for seeing the issue in 9.1.13 would be if I fat-fingered the
 back-patch somehow :-(.

How would fccebe421 explain the large amount of random writes (~4MB/s
for more than an hour), reported in the initial post? And why would that
only affect the EXPLAIN and not the bare query?

I guess there might be two sessions, one keeping uncommitted changes
(thus invisible tuples), and the other one doing the explain. And the
actual query might be executed after the first session does a commit.

But the random writes don't really match in this scenario ...


regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] EXPLAIN (no ANALYZE) taking an hour for INSERT FROM SELECT

2015-03-05 Thread Tomas Vondra
Hi,

On 5.3.2015 16:01, Gunnlaugur Thor Briem wrote:
 Hi, thanks for your follow-up questions.
 
 - postgres version is 9.1.13
 - the number of rows (in this latest instance) is 28,474,842
 - I've clustered and vacuum-full-ed and analyzed this table frequently,
 attempting to troubleshoot this. (Running vacuum full on the whole
 catalog seems a little excessive, and unlikely to help.)
 - no other processes are likely to be interfering; nothing other than
 PostgreSQL runs on this machine (except for normal OS processes and New
 Relic server monitoring service); concurrent activity in PostgreSQL is
 low-level and unrelated, and this effect is observed systematically
 whenever this kind of operation is performed on this table
 - no override for this table; the system default_statistics_target is
 100 (the default)
 - yes, an ANALYZE is performed on the temp table after the COPY and
 before the INSERT
 - no index on the temp table (but I'm scanning the whole thing anyway).
 There are indexes on f_foo as detailed in my original post, and I expect
 the PK to make the WHERE NOT EXISTS filtering efficient (as it filters
 on exactly all columns of the PK) ... but even if it didn't, I would
 expect that to only slow down the actual insert execution, not the EXPLAIN.

The only thing I can think of is some sort of memory exhaustion,
resulting in swapping out large amounts of memory. That'd explain the
I/O load. Can you run something like vmstat to see if this really is swap?

The fact that plain INSERT does not do that contradicts that, as it
should be able to plan either both queries (plain and EXPLAIN), or none
of them.

Can you prepare a self-contained test case? I.e. a script that
demonstrates the issue? I tried to reproduce the issue using the
information provided so far, but unsuccessfully :-(

Even if you could reproduce the problem on another machine (because of
keeping the data internal) on a server with debug symbols and see where
most of the time is spent (e.g. using 'perf top'), that'd be useful.

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] working around JSONB's lack of stats?

2015-01-28 Thread Tomas Vondra
On 29.1.2015 00:03, Josh Berkus wrote:
 On 01/28/2015 11:48 AM, Tomas Vondra wrote:
 On 27.1.2015 08:06, Josh Berkus wrote:
 Folks,

 ...

 On a normal column, I'd raise n_distinct to reflect the higher
 selecivity of the search terms.  However, since @ uses contsel,
 n_distinct is ignored.  Anyone know a clever workaround I don't
 currently see?

 I don't see any reasonable workaround :-(

 ISTM we'll have to invent a way to collect useful stats about contents
 of JSON/JSONB documents. JSONB is cool, but at the moment we're mostly
 relying on defaults that may be reasonable, but still misfire in many
 cases. Do we have any ideas of how that might work?

 We're already collecting stats about contents of arrays, and maybe we
 could do something similar for JSONB? The nested nature of JSON makes
 that rather incompatible with the flat MCV/histogram stats, though.
 
 Well, I was thinking about this.
 
 We already have most_common_elem (MCE) for arrays and tsearch. What
 if we put JSONB's most common top-level keys (or array elements,
 depending) in the MCE array? Then we could still apply a simple rule
 for any path criteria below the top-level keys, say assuming that any
 sub-key criteria would match 10% of the time. While it wouldn't be
 perfect, it would be better than what we have now.

So how would that work with your 'tags' example? ISTM most of your
documents have 'tags' as top-level key, so that would end up in the MCV
list. But there's no info about the elements of the 'tags' array (thus
the 10% default, which may work in this particular case, but it's hardly
a general solution and I doubt it's somehow superior to the defaults
we're using right now).

I think a 'proper' solution to JSONB stats needs to somehow reflect the
nested structure. What I was thinking about is tracking MCV for
complete paths, i.e. for a document:

  {
    "keyA" : {
      "keyB" : "x",
      "keyC" : "z"
    },
    "keyD" : [1, 2, 3, 4]
  }

We'd extract three paths

   keyA.keyB
   keyA.keyC
   keyD

and aggregate that over all the documents to select the MCV paths.
And then, for each of those MCV paths track the most common values.
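
Just to illustrate the first step (top-level keys only), the key
frequencies could be collected with something like this - assuming a
hypothetical table docs with a jsonb column doc:

    SELECT key, count(*) AS freq
      FROM docs, LATERAL jsonb_object_keys(doc) AS key
     GROUP BY key
     ORDER BY freq DESC
     LIMIT 100;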

ISTM this would allow better estimations, but it has issues too:

Firstly, it does not match the MCV structure, because it requires
storing (a) MCV paths and (b) MCV values for those paths. Moreover, (b)
probably stores different data types (some values are strings, some
integers, etc.). Arrays might be handled just like regular arrays, i.e.
tracking stats of elements, but it's still mixed data types.

Secondly, I think it's based on the assumption of independence (i.e.
that the occurrence of one path does not depend on the occurrence of a
different path in the same document). Same for values vs. paths. Which may
or may not be true - it's essentially the same as the assumption of
independence for predicates on multiple columns. While I do have ideas
on how to approach this in the multi-column case, handling this for
JSONB is going to be much more complex I think.

But the first question (what stats to collect and how to store them) is
the most important at this point, I guess.

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] working around JSONB's lack of stats?

2015-01-28 Thread Tomas Vondra
On 27.1.2015 08:06, Josh Berkus wrote:
 Folks,
 
...

 On a normal column, I'd raise n_distinct to reflect the higher
 selectivity of the search terms.  However, since @> uses contsel,
 n_distinct is ignored.  Anyone know a clever workaround I don't
 currently see?

I don't see any reasonable workaround :-(

ISTM we'll have to invent a way to collect useful stats about contents
of JSON/JSONB documents. JSONB is cool, but at the moment we're mostly
relying on defaults that may be reasonable, but still misfire in many
cases. Do we have any ideas of how that might work?

We're already collecting stats about contents of arrays, and maybe we
could do something similar for JSONB? The nested nature of JSON makes
that rather incompatible with the flat MCV/histogram stats, though.

regards
-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Why is PostgreSQL not using my index?

2015-01-26 Thread Tomas Vondra
Hi,

On 26.1.2015 17:32, Christian Roche wrote:
 select *
 from mixpanel_events_201409 mp
   inner join mixpanel_event_list ev on ( ev.id = mp.event_id )
 where ev.id in (3, 4, 5, 6, 7, 8, 9, 10, 11, 373, 375, 376, 318);

 Hash Join  (cost=20.73..2892183.32 rows=487288 width=1000)
   Hash Cond: (mp.event_id = ev.id)
   ->  Seq Scan on mixpanel_events_201409 mp  (cost=0.00..2809276.70 rows=20803470 width=949)
   ->  Hash  (cost=20.57..20.57 rows=13 width=51)
         ->  Seq Scan on mixpanel_event_list ev  (cost=0.00..20.57 rows=13 width=51)
               Filter: (id = ANY ('{3,4,5,6,7,8,9,10,11,373,375,376,318}'::integer[]))
 
 Both tables have been vacuum analyzed.

Can we get EXPLAIN ANALYZE please, and maybe some timings for the two
plans? Otherwise we have no clue how accurate those estimates really
are, making it difficult to judge the plan choice.

You might also use enable_hashjoin=off to force a different join
algorithm (it may not switch to nested loop immediately, so maybe try
the other enable_* options).
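
For example, limited to a single transaction so nothing else gets
affected:

    BEGIN;
    SET LOCAL enable_hashjoin = off;
    EXPLAIN ANALYZE
    SELECT *
      FROM mixpanel_events_201409 mp
           INNER JOIN mixpanel_event_list ev ON (ev.id = mp.event_id)
     WHERE ev.id IN (3, 4, 5, 6, 7, 8, 9, 10, 11, 373, 375, 376, 318);
    ROLLBACK;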

The estimated row counts are quite near each other (410k vs. 487k), but
the costs are not. I'm pretty sure that's because while the first query
has the WHERE condition directly on the event_id column, the second one
moves the condition to the 'list' table, forcing this particular plan.

But as the condition is on the join column, you may try moving it back:

  select *
  from mixpanel_events_201409 mp
inner join mixpanel_event_list ev on ( ev.id = mp.event_id )
  where mp.event_id in (3, 4, 5, 6, 7, 8, 9, 10, 11, 373, 375, 376, 318);

Of course, this only works on this particular column - it won't work for
other columns in the 'list' table.

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query performance

2015-01-25 Thread Tomas Vondra
Hi,

On 25.1.2015 07:38, Joe Van Dyk wrote:
 
 Here's one that's not quite as well: http://explain.depesz.com/s/SgT

As Pavel already pointed out, the first problem is this part of the plan:

Seq Scan on events e2 (cost=0.00..120,179.60 rows=4,450,241 width=51)
(actual time=0.014..33,773.370 rows=4,450,865 loops=1)
Filter: (product_id <> '81716'::citext)

Consuming ~33 seconds of the runtime. If you can make this faster
somehow (e.g. by getting rid of the citext cast), that'd be nice.

Another issue is that the hashjoin is batched:

   Buckets: 65536 Batches: 8 Memory Usage: 46085kB

The hash preparation takes ~40 seconds, so maybe try to give it a bit
more memory - I assume you have work_mem=64MB, so try doubling that
(ISTM 512MB should work with a single batch). Maybe this won't really
improve the performance, though. It still has to process ~4.5M rows.

Increasing the work mem could also result in switching to hash
aggregate, making the sort (~30 seconds) unnecessary.
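
That's easy to test at the session level, e.g.:

    SET work_mem = '512MB';
    -- then re-run the EXPLAIN ANALYZE and check whether the hash join
    -- runs in a single batch and the sort turns into a hash aggregate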

Anyway, ISTM this works as expected, i.e.

(a) with rare product_id values the queries are fast
(b) with common product_id values the queries are slow

That's expected, because (b) needs to process much more data. I don't
think you can magically make it run as fast as (a). The best solution
might be to keep pre-aggregated results - I don't think you really
need exact answers when recommending similar products.

I also wonder if you really need to join the tables? I mean, what if you
do something like this:

CREATE TABLE events_aggregated AS SELECT
   site_id,
   array_agg(product_id) AS product_ids,
   count(nullif(type = 'viewed', false)) AS view_count,
   count(nullif(type = 'purchased', false)) AS purchase_count
FROM events
GROUP BY 1;

and then using intarray with GIN indexes to query this table?
Something like this:

  CREATE INDEX products_agg_idx ON events_aggregated
   USING GIN (product_ids gin__int_ops);

  SELECT * FROM events_aggregated WHERE product_ids @> ARRAY['82503'];

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to tell ANALYZE to collect statistics from the whole table?

2015-01-24 Thread Tomas Vondra
Hi,

On 25.1.2015 00:33, AlexK987 wrote:
 The documentation states that The extent of analysis can be 
 controlled by adjusting the default_statistics_target configuration 
 variable. It looks like I can tell Postgres to create more 
 histograms with more bins, and more distinct values. This implicitly
 means that Postgres will use a larger random subset to calculate
 statistics.
 
 However, this is not what I want. My data may be quite skewed, and I 
 want full control over the size of the sample. I want to explicitly 
 tell Postgres to analyze the whole table. How can I accomplish that?

I don't think there's an official way to do that - at least I can't
think of one. The only thing you can do is increasing statistics target
(either globally by setting default_statistics_target, or per column
using ALTER TABLE ... SET STATISTICS).
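
For example (the table / column names are just placeholders):

    ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 10000;
    ANALYZE my_table;

    -- or for the whole session, before running ANALYZE
    SET default_statistics_target = 10000;
    ANALYZE;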

As you noticed, this however controls two things - sample size and how
detailed the statistics (MCV list / histogram) will be. The statistics
target is used as upper bound for number of MCV items / histogram bins,
and the number of sampled rows is (300 * statistics_target). With
default_statistics_target = 10000 (which is the max allowed value since
9.0), this produces very detailed stats and uses a sample of ~3M rows.

It's a bit more complicated though, because there's an algorithm that
decides how many MCV items / histogram buckets to actually create, based
on the data. So you may not get more detailed stats, even when using
larger sample.

That being said, I really doubt increasing the statistics target above
10000 (or even sampling the whole table) will help you in practice.
Might be worth showing an example of a bad estimate with your data, or
maybe a test case to play with.

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to tell ANALYZE to collect statistics from the whole table?

2015-01-24 Thread Tomas Vondra
On 25.1.2015 02:04, AlexK987 wrote:
 Tomas,
 
 Thank you for a very useful reply. Right now I do not have a case of
 poor performance caused by strong data skew which is not properly
 reflected in statistics. I was being defensive, trying to prevent
 every possible thing that might go wrong.

OK. My recommendation is not to mess with default_statistics_target
unless you actually have to (e.g. increasing the value on all tables,
without a query where the current value causes trouble). It increases
the time to plan the queries, collect statistics (ANALYZE / autovacuum) etc.

regards
-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Initial insert

2015-01-24 Thread Tomas Vondra
 it to 1 ms, I
don't see how this would make a difference in a batch-style job.

If you're doing many such queries (with different id_article values),
you may do something like this

select caracteris1_.id_article,
       sum(quantitest0_.quantite_valeur) as col_0_0_
  from dm5_quantitestock quantitest0_,
       dm5_caracteristiquearticlestock caracteris1_
 where quantitest0_.id_caracteristiquearticlestock = caracteris1_.id
 group by caracteris1_.id_article

and then query this (supposedly much smaller) table.
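
For example materialized into a helper table (the table name and the
literal id are made up):

    CREATE TABLE stock_par_article AS
    SELECT caracteris1_.id_article,
           sum(quantitest0_.quantite_valeur) AS quantite_totale
      FROM dm5_quantitestock quantitest0_,
           dm5_caracteristiquearticlestock caracteris1_
     WHERE quantitest0_.id_caracteristiquearticlestock = caracteris1_.id
     GROUP BY caracteris1_.id_article;

    CREATE INDEX ON stock_par_article (id_article);

    -- each per-article lookup then only touches a single small row
    SELECT quantite_totale FROM stock_par_article WHERE id_article = 123;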

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Yet another abort-early plan disaster on 9.3

2014-11-23 Thread Tomas Vondra
On 21.11.2014 19:38, Jeff Janes wrote:

 When I run this patch on the regression database, I get a case where 
 the current method is exact but the adaptive one is off:

 WARNING:  ndistinct estimate current=676.00 adaptive=906.00
 
 select count(distinct stringu1) from onek;
 676
 
 It should be seeing every single row, so I don't know why the
 adaptive method is off. Seems like a bug.

Thanks for noticing this. I wouldn't call it a bug, but there's clearly
room for improvement.

The estimator, as described in the original paper, does not expect the
sampling to be done our way (using fixed number of rows) but assumes
to get a fixed percentage of rows. Thus it does not expect the number of
sampled rows to get so close (or equal) to the total number of rows.

I think the only way to fix this is by checking if samplerows is close
to totalrows, and use a straightforward estimate in that case (instead
of a more sophisticated one). Something along these lines:

if (samplerows >= 0.95 * totalrows)
    stats->stadistinct = (d + d/0.95) / 2;

which means if we sampled >= 95% of the table, use the number of
observed distinct values directly.

I have modified the estimator to do the adaptive estimation, and then do
this correction too (and print the values). And with that in place I get
these results

  WARNING:  ndistinct estimate current=676.00 adaptive=996.00
  WARNING:  corrected ndistinct estimate current=676.00 adaptive=693.79

So it gets fairly close to the original estimate (and exact value).

In the end, this check should be performed before calling the adaptive
estimator at all (and not calling it in case we sampled most of the rows).

I also discovered an actual bug in the optimize_estimate() function,
using 'f_max' instead of the number of sampled rows.

Attached is a patch fixing the bug, and implementing the sample size check.

regards
Tomas
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 732ab22..3975fb6 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -110,6 +110,8 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
+static double optimize_estimate(int total_rows, int sample_rows, int ndistinct,
+int *f, int f_max);
 
 /*
  *	analyze_rel() -- analyze one relation
@@ -2369,6 +2371,11 @@ compute_scalar_stats(VacAttrStatsP stats,
 		int			slot_idx = 0;
 		CompareScalarsContext cxt;
 
+		/* f values for the estimator - messy and we likely need much
+		 * less memory, but who cares */
+		int			f_max = 0; /* max number of duplicates */
+		int		   *f_count = (int*)palloc0(sizeof(int)*values_cnt);
+
 		/* Sort the collected values */
 		cxt.ssup = ssup;
 		cxt.tupnoLink = tupnoLink;
@@ -2410,6 +2417,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 ndistinct++;
 if (dups_cnt > 1)
 {
+
 	nmultiple++;
 	if (track_cnt < num_mcv ||
 		dups_cnt > track[track_cnt - 1].count)
@@ -2435,6 +2443,12 @@ compute_scalar_stats(VacAttrStatsP stats,
 		track[j].first = i + 1 - dups_cnt;
 	}
 }
+
+/* increment the number of values with this number of
+ * repetitions and the largest number of repetitions */
+f_count[dups_cnt] += 1;
+f_max = (f_max < dups_cnt) ? dups_cnt : f_max;
+
 dups_cnt = 0;
 			}
 		}
@@ -2481,6 +2495,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 			double		numer,
 		denom,
 		stadistinct;
+			double		adaptdistinct;	/* adaptive estimate */
 
 			numer = (double) samplerows *(double) d;
 
@@ -2494,6 +2509,20 @@ compute_scalar_stats(VacAttrStatsP stats,
 			if (stadistinct > totalrows)
 stadistinct = totalrows;
 			stats->stadistinct = floor(stadistinct + 0.5);
+
+			/* compute the adaptive estimate */
+			adaptdistinct = optimize_estimate(totalrows, samplerows, d, f_count, f_max);
+
+			elog(WARNING, "ndistinct estimate current=%.2f adaptive=%.2f", stadistinct, adaptdistinct);
+
+			/* if we've seen 'almost' all rows, use the estimate instead */
+			if (samplerows >= 0.95 * totalrows)
+			{
+adaptdistinct = (d + d/0.95)/2;
+elog(WARNING, "corrected ndistinct estimate current=%.2f adaptive=%.2f",
+	 stadistinct, adaptdistinct);
+			}
+
 		}
 
 		/*
@@ -2819,3 +2848,84 @@ compare_mcvs(const void *a, const void *b)
 
 	return da - db;
 }
+
+
+/*
+ * We need to minimize this equality (find m solving it)
+ *
+ * m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
+ *
+ * where A, B are effectively constants (not depending on m), and A(m)
+ * and B(m) are functions. This is equal to solving
+ *
+ * 0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)
+ *
+ * Instead of looking for the exact solution to this equation (which
+ * might be fractional), we'll look for a natural number minimizing
+ * the absolute difference. Number of (distinct) elements is a natural
+ * number, and we don't mind if the 

Re: [PERFORM] Increased shared_buffer setting = lower hit ratio ?

2014-11-13 Thread Tomas Vondra
Hi,

On 14.11.2014 00:16, CS DBA wrote:
 This is on a CentOS 6.5 box running PostgreSQL 9.2
 
 
 On 11/13/14 4:09 PM, CS DBA wrote:
 All;

 We have a large db server with 128GB of ram running complex
 functions.

 with the server set to have the following we were seeing a
 somewhat low hit ratio and lots of temp buffers

 shared_buffers = 18GB
 work_mem = 75MB
 effective_cache_size = 105GB
 checkpoint_segments = 128


 when we increased the values to these not only did the hit ratio
 drop but query times are now longer as well:


 shared_buffers = 28GB
 work_mem = 150MB
 effective_cache_size = 105GB
 checkpoint_segments = 256

 This does not seem to make sense to me, anyone have any thoughts
 on why more memory resources would cause worse performance?

what exactly do you mean by hit ratio - is that the page cache hit ratio
(filesystem cache), or shared buffers hit ratio (measured e.g. using
pg_buffercache)?

Regarding the unexpected decrease of performance after increasing
shared_buffers - that's actually quite common behavior. First, the
management of shared buffers is not free, and the more pieces you need
to manage the more expensive it is. Also, by using larger shared buffers
you make that memory unusable for page cache etc. There are also other
negative consequences - double buffering, accumulating more changes for
a checkpoint etc.

The common wisdom (which some claim to be obsolete) is not to set shared
buffers over ~10GB of RAM. It's however very workload-dependent so your
mileage may vary.

To get some basic idea of the shared_buffers utilization, it's possible
to compute stats using pg_buffercache. Also pg_stat_bgwriter contains
useful data.

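Something along these lines (just a sketch - pg_buffercache needs to be
installed as an extension, and it only gives you a snapshot):

   -- shared buffers hit ratio from the runtime counters
   SELECT sum(heap_blks_hit)
          / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS hit_ratio
     FROM pg_statio_user_tables;

   -- which relations occupy most of shared_buffers
   CREATE EXTENSION pg_buffercache;
   SELECT c.relname, count(*) AS buffers
     FROM pg_buffercache b
     JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
      AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                                 WHERE datname = current_database()))
    GROUP BY c.relname
    ORDER BY 2 DESC
    LIMIT 10;

   -- checkpoint / bgwriter activity
   SELECT * FROM pg_stat_bgwriter;
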
BTW, it's difficult to say why a query is slow - can you post explain
analyze of the query with both shared_buffers settings?

And just to check - what kind of hardware/kernel version is this? Do you
have numa / transparent huge pages or similar trouble-inducing issues?

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] 9.3 performance issues, lots of bind and parse log entries

2014-11-05 Thread Tomas Vondra
On 5.11.2014 20:16, Tory M Blue wrote:
 
 Thanks Thomas, 
 
 
 On 4.11.2014 21:07, Tory M Blue wrote:
  Well after fighting this all day and dealing with a really sluggish db
  where even my slon processes were taking several seconds, I reduced my
  shared_buffers back to 2GB from 10GB and my work_mem from 7.5GB to
 2GB.
  i actually undid all my changes, including dropping my effective_cache
  back to 7GB  and restarted.
 
 Have you been using the same parameter values on 9.2, or have you bumped
 them up only on the new 9.3? I'm wondering whether 9.2 was performing
 better with the values?
 
 
 Things seem to have been running better on 9.2 at this point I'm using
 the same config file from 9.2 and I'm still experiencing slowness under
 heavier write access. And my disk subsystem has not changed. Hardware
 has not changed, heck i'm even running the old version of slony (have
 not upgraded it yet).

So with shared_buffers=10GB and work_mem=7.5GB you saw significant
slowdown both for read and write queries, and after reverting to lower
values the read queries are OK but writes still take much longer?

 But since the upgrade to 9.3 even calls to my sl_log tables which
 are tiny can take:
 
 2014-11-04 02:58:40 PST clsdb postgres 10.13.200.242(52022) 21642
 2014-11-04 02:58:40.515 PSTLOG:  duration: 1627.019 ms  statement: fetch
 500 from LOG;  (log had 145K items).
 
  I have 300 connections configured, we will use around 87 normally
  with some spikes, but I'm wondering if the 10GB shared memory caused
  me some grief, I don't believe it was the work_mem and don't believe
  it was the effective cache, but something caused my DB to run into
  issues with basic queries, same queries after restart are finishing
 in milliseconds instead of 2-3 seconds. No disk issues seen.
 
 I assume only some of the connections will be active (running queries)
 at the same time. If you expect > 32 active queries at the same time,
 you're only increasing latency.
 
 Based on your description I assume you're CPU bound (otherwise the
 machine would not get hotter, and planning is not about I/O).
 
 I'm not sure if this is a production machine or how much you can
 experiment with it, but it'd be helpful if you could provide some
 profiling information
 
$ iostat -x -k 1
$ vmstat 1
 
 and such data. A perf profile would be even better, but to get the most
 useful info it may be necessary to recompile the postgres with debug
 info and '-fno-omit-frame-pointer'. Then this should do the trick:
 
   perf record -a -g (for a few seconds, then Ctrl-C)
   perf report
 
 or just perf top to see what functions are at the top.
 
 
 This is a production server, but it was not really CPU bound with 9.2
 so something is odd and I'm starting to stress, because it is a
 production environment :)

Yeah, I was talking about the 9.3 - that's clearly CPU bound.

 Connections correct, I have less than 20 or so active requests at a
 time, but i would say active queries are in the handful. I was was not
 seeing IO, but was seeing load increase as queries started taking
 longer, but nothing in iostat or vmstat/free showed any contention. Heck
 even Top while showed some cores as busy, nothing was sitting at over
 60% utilized. And we are talking a load of 12-14 here on a 32 core
 system, when it's normally asleep!

Right. That's consistent with being CPU bound.

 This is my master slon insert server, so I can run commands, tweak 
 configs but any type of rebuild or restart of postgres is a
 scheduled affair.

OK, understood. That however mostly rules out recompiling with debug
info and frame pointers, as that might make it significantly slower.
That's not something you'd like to do on production.

 These work loads that seem to be creating the issues run between 
 midnight and now almost 6am, prior to 9.3 it was taking maybe 4
 hours, now it's taking 6. So tomorrow AM , I'll grab some stats when
 I see that it's struggling.
 
 But even now with almost no connections or really any major access
 i'm seeing the sl_log grab 500 rows take 1-3 seconds, which is just
 plain silly (but it's not a constant, so I may see 1 of these alerts
 every hour)

Is that plain SELECT * FROM sl_log or something more complex? When you
do explain analyze on the query, what do you see?

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] 9.3 performance issues, lots of bind and parse log entries

2014-11-04 Thread Tomas Vondra
Hi Tory,

On 4.11.2014 21:07, Tory M Blue wrote:
 Well after fighting this all day and dealing with a really sluggish db
 where even my slon processes were taking several seconds, I reduced my
 shared_buffers back to 2GB from 10GB and my work_mem from 7.5GB to 2GB.
 i actually undid all my changes, including dropping my effective_cache
 back to 7GB  and restarted.

Have you been using the same parameter values on 9.2, or have you bumped
them up only on the new 9.3? I'm wondering whether 9.2 was performing
better with the values?

 I have 300 connections configured, we will use around 87 normally
 with some spikes, but I'm wondering if the 10GB shared memory caused
 me some grief, I don't believe it was the work_mem and don't believe
 it was the effective cache, but something caused my DB to run into
 issues with basic queries, same queries after restart are finishing
 in milliseconds instead of 2-3 seconds. No disk issues seen.

I assume only some of the connections will be active (running queries)
at the same time. If you expect > 32 active queries at the same time,
you're only increasing latency.

Based on your description I assume you're CPU bound (otherwise the
machine would not get hotter, and planning is not about I/O).

I'm not sure if this is a production machine or how much you can
experiment with it, but it'd be helpful if you could provide some
profiling information

   $ iostat -x -k 1
   $ vmstat 1

and such data. A perf profile would be even better, but to get the most
useful info it may be necessary to recompile the postgres with debug
info and '-fno-omit-frame-pointer'. Then this should do the trick:

  perf record -a -g (for a few seconds, then Ctrl-C)
  perf report

or just perf top to see what functions are at the top.


 So if this is not a 9.3 issue, it's an issue with me upping my config
 params to a level I thought would give a nice bump..
 
 CentOS 6.x
 Postgres:  9.3.4
 256GB Mem
 32Core

What kernel version are you using? I assume 6.x means 6.5, or are you
using an older CentOS version?

Are you using transparent huge pages, NUMA or similar features?
Although, that'd probably impact 9.2 too.

Also, what package is this? Is it coming from the CentOS repository,
yum.postgresql.org or some other repository?

regards
Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Incredibly slow restore times after 9.0 -> 9.2 upgrade

2014-10-30 Thread Tomas Vondra
On 29.10.2014 16:12, jmcdonagh wrote:
 Hi Tomas- thank you for your thoughtful response!
 
 
 Tomas Vondra wrote
 On 28.10.2014 21:55, jmcdonagh wrote:
 Hi, we have a nightly job that restores current production data to
 the development databases in a 'warm spare' database so that if the
 developers need fresh data, it's ready during the day. When we moved
 from 9.0 to 9.2 suddenly the restores began to take from a few hours
 to more like 15 hours or so. We're in Amazon EC2, I've tried new EBS
 volumes, warmed them up, threw IOPS at them, pretty much all the
 standard stuff to get more disk performance.

 So, if I understand it correctly, you've been restoring into 9.0, then
 you switched to 9.2 and it's much slower?
 
 Yes- but since the move was done utilizing snapshots so the move
 involves new volumes, but I have created new volumes since then to
 rule out a single bad volume.

My advice would be to do some basic low-level performance tests to rule
this out. Use dd or (better) fio to test basic I/O performance, it's
much easier to spot issues that way.

 Tomas Vondra wrote
 Is the 9.2 configured equally to 9.0? If you do something like this

   SELECT name, setting
 FROM pg_settings
WHERE source = 'configuration file';

 on both versions, what do you get?
 
 I no longer have the 9.0 box up but we do track configuration via
 puppet and git. The only configuration change made for 9.2 is:
 
 -#standard_conforming_strings = off
 +standard_conforming_strings = off

Compared to 9.0, I suppose? Anyway, post the non-default config values
at least for 9.2, please.

 Cause we have an old app that needs this setting on otherwise we'd
 spend a lot of time trying to fix it.

I doubt standard_conforming_strings has anything to do with the issues.

 Tomas Vondra wrote
 Here's the thing, the disk isn't saturated. The behavior I'm seeing 
 seems very odd to me; I'm seeing the source disk which holds the dump
 saturated by reads, which is great, but then I just see nothing being
 written to the postgres volume. Just nothing happening, then a
 small burst. There is no write queue backup on the destination disk
 either. if I look at pg_stat_activity I'll see something like:

 COPY salesforce_reconciliation (salesforce_id, email,
 advisor_salesforce_id, processed) FROM stdin

 and even for small tables, that seems to take a very long time even
 though the destination disk is almost at 0 utilization.

 So, where's the bottleneck? Clearly, there's one, so is it a CPU, a
 disk or something else? Or maybe network, because you're using EBS?

 What do you mean by 'utilization'? How do you measure that?
 
 The bottleneck is I/O somehow. I say somehow, because I see iowait 
 averaging about 50% between two CPUs, but there is just no writes to 
 the destination EBS volume really happening, just reads from the
 disk where the source dump is located, then bursts of writes to the
 destination volume every so often. It's kind of puzzling. This is
 happening on multiple database servers, in multiple availability
 zones. Driving me bonkers.
 
 What I mean by utilization is util% from iostat -m -x 1.

I find this rather contradictory. At one moment you say the disk isn't
saturated, the next moment you say you're I/O bound.

Also, iowait (as reported e.g. by 'top') is tricky to interpret
correctly, especially on multi-cpu systems (nice intro to the complexity
[1]). It's really difficult to interpret the 50% iowait without more
info about what's happening on the machine.

IMHO, the utilization (as reported by iotop) is much easier to
interpret, because it means '% of time the device was servicing
requests'. It has issues too, because 100% does not mean 'saturated'
(especially on RAID arrays that can service multiple requests in
parallel), but it's better than iowait.

If I had to guess based from your info, I'd bet you're CPU bound, so
there's very little idle time and about 50% of it is spent waiting for
I/O requests (hence the 50% iowait). But in total the amount of I/O is
very small, so %util is ~0.

Please, post a few lines of 'iostat -x -k 1' output. Samples from 'top'
and 'vmstat 1' would be handy too.

regards
Tomas

[1] http://veithen.blogspot.cz/2013/11/iowait-linux.html



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Incredibly slow restore times after 9.0 -> 9.2 upgrade

2014-10-28 Thread Tomas Vondra
On 28.10.2014 21:55, jmcdonagh wrote:
 Hi, we have a nightly job that restores current production data to
 the development databases in a 'warm spare' database so that if the
 developers need fresh data, it's ready during the day. When we moved
 from 9.0 to 9.2 suddenly the restores began to take from a few hours
 to more like 15 hours or so. We're in Amazon EC2, I've tried new EBS
 volumes, warmed them up, threw IOPS at them, pretty much all the
 standard stuff to get more disk performance.

So, if I understand it correctly, you've been restoring into 9.0, then
you switched to 9.2 and it's much slower?

Is the 9.2 configured equally to 9.0? If you do something like this

  SELECT name, setting
FROM pg_settings
   WHERE source = 'configuration file';

on both versions, what do you get?

 Here's the thing, the disk isn't saturated. The behavior I'm seeing 
 seems very odd to me; I'm seeing the source disk which holds the dump
 saturated by reads, which is great, but then I just see nothing being
 written to the postgres volume. Just nothing happening, then a
 small burst. There is no write queue backup on the destination disk
 either. if I look at pg_stat_activity I'll see something like:
 
 COPY salesforce_reconciliation (salesforce_id, email,
 advisor_salesforce_id, processed) FROM stdin
 
 and even for small tables, that seems to take a very long time even
 though the destination disk is almost at 0 utilization.

So, where's the bottleneck? Clearly, there's one, so is it a CPU, a disk
or something else? Or maybe network, because you're using EBS?

What do you mean by 'utilization'? How do you measure that?


 The dumps are created with pg_dump -Fc and restored with pg_restore
 -d db -j 2 -O -U postgres PostgreSQL-db.sql.

OK

 Is it possible that some default settings were changed from 9.0 to 
 9.2 that would cause this kind of behavior? I'm stumped here. Thanks
 in advance for any consideration here.

I doubt that. There probably were some changes (after all, we're talking
about 2 major versions), but we generally don't change it in a way
that'd hurt performance.

regards
Tomas



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Sanity checking big select performance

2014-10-28 Thread Tomas Vondra
On 28.10.2014 22:15, Jeff Chen wrote:
 Hi friends!
 
 I'd love to get a sanity check on whether a fat select query I'm doing
 makes sense given the infrastructure that we have.
 
 We have 3 big tables that we typically join together for certain
 queries: a ~40 million row photos table, a ~20 million row users table,
 and a ~50 million row photo_to_album table that maps photos to albums.

So how much data is it? Does it fit within RAM (after loading into DB,
with all the indexes)?

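pg_total_relation_size() includes indexes and TOAST, so something like this
(using the table names from your mail) gives a rough idea:

   SELECT pg_size_pretty(pg_total_relation_size('photos')),
          pg_size_pretty(pg_total_relation_size('users')),
          pg_size_pretty(pg_total_relation_size('photo_to_album'));
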
 We like to display real time analytics, which often results in a query like:
 
 select (random aggregations )
 from
 photo_to_album join photos on photos.id = photo_to_album.photo_id
 join users on users.id = photos.user_id
 where
 photo_to_album.album_id = something
 and
 photos.created_at between some dates
 and other junk
 
 We have indexes on all of the joins, and the where clauses.

Can we get EXPLAIN (and ideally EXPLAIN ANALYZE) for such queries?

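Even a plan with anonymized values helps - e.g. run it like this (the
column names / values below are just placeholders based on your
description):

   EXPLAIN (ANALYZE, BUFFERS)
   SELECT count(*)
     FROM photo_to_album
     JOIN photos ON photos.id = photo_to_album.photo_id
     JOIN users ON users.id = photos.user_id
    WHERE photo_to_album.album_id = 12345
      AND photos.created_at BETWEEN '2014-09-01' AND '2014-10-01';
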
 One of these queries that should be targeting something like 300K
 photos takes 38 seconds to run (with an aggregate/nested loop taking 
 effectively all of that time), and then upon second execution with a 
 warm cache, 4 seconds.

Well, if you're hitting disk, it's going to be slow. As you observed,
after loading it into page cache, it's much faster.

 Also worryingly, it spikes read IOPS to almost 1500/sec during the time
 and write IOPS 200/sec.  When not running the query, steady level read
 iops basically nil, write hovers around 50-100.
 
 This also increases the queue depth from basically 0 up to 6.  Keeping
 the queue depth high seems to cause timeouts in other queries.  The CPU
 is barely if at all affected, hovering around 20%.  Memory also barely
 affected.

20% is ~2 CPU cores (as you have 8 of them).

 We have a RDS Postgres database, m3.2xlarge with 2000 Provisioned IOPS
 and 400GB storage.  This translates to 8 virtual CPUs, 30GiB memory, and
 all SSD drives.

AFAIK there are two PostgreSQL major versions supported on RDS - 9.1 and
9.3. Which one are you using?

Also, can you list values for some basic parameters (shared_buffers,
work_mem)? We don't know what are the default values on RDS, neither if
you somehow modified them.

 Several questions here:
 
 1) Is that level of IOPS normal?

Um, why wouldn't it be? Each IO request works with 16 KB (on EBS),
and you're reading/writing a certain amount of data.

 2) Is it bad that the level of iops can queue requests that screw up the
 whole database even if it's just select queries?  Especially when the
 CPU and Memory are still plentiful?

You're saturating a particular resource. If you hit I/O wall, you can't
use the CPU / memory. The fact that it slows down your queries is
somehow expected.

Is it bad? Well, if you need to minimize impact on other queries, then
probably yes.

 3) What is up with the huge difference between cold and warm cache?

I don't understand why you're surprised by this? With EBS on m3.2xlarge
(EBS-Optimized networking, i.e. 1 Gbit dedicated to EBS) you get up to
~120 MB/s, except that you set 2000 IOPS, which is ~32 MB/s (2000 x 16kB).
Memory is orders of magnitude faster, hence the difference.

regards
Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] ERROR: out of memory | with 23GB cached 7GB reserved on 30GB machine

2014-10-21 Thread Tomas Vondra
On 22 October 2014, 0:25, Montana Low wrote:
 I'm running postgres-9.3 on a 30GB ec2 xen instance w/ linux kernel
 3.16.3.
 I receive numerous Error: out of memory messages in the log, which are
 aborting client requests, even though there appears to be 23GB available
 in
 the OS cache.

 There is no swap on the box. Postgres is behind pgbouncer to protect from
 the 200 real clients, which limits connections to 32, although there are
 rarely more than 20 active connections, even though postgres
 max_connections is set very high for historic reasons. There is also a 4GB
 java process running on the box.




 relevant postgresql.conf:

 max_connections = 1000  # (change requires restart)
 shared_buffers = 7GB# min 128kB
 work_mem = 40MB # min 64kB
 maintenance_work_mem = 1GB  # min 1MB
 effective_cache_size = 20GB



 sysctl.conf:

 vm.swappiness = 0
 vm.overcommit_memory = 2

This means you have 'no overcommit', so the amount of memory is limited by
overcommit_ratio + swap. The default value for overcommit_ratio is 50%
RAM, and as you have no swap that effectively means only 50% of the RAM is
available to the system.

If you want to verify this, check /proc/meminfo - see the lines
CommitLimit (the current limit) and Committed_AS (committed address space).
Once the committed_as reaches the limit, it's game over.

There are different ways to fix this, or at least improve that:

(1) increasing the overcommit_ratio (clearly, 50% is way too low -
something like 90% might be more appropriate on 30GB RAM without swap)

(2) adding swap (say a small ephemeral drive, with swappiness=10 or
something like that)

Tomas



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Yet another abort-early plan disaster on 9.3

2014-10-18 Thread Tomas Vondra
On 17.10.2014 19:25, Greg Stark wrote:
 On Wed, Oct 15, 2014 at 7:02 PM, Tomas Vondra t...@fuzzy.cz wrote:
 If you know the title of the article, it's usually available
 elsewhere on the web - either at the university site, or elsewhere.
 I found these two articles about block-based sampling:
 
 
 http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf

 
 There are a series of papers with Chaudhuri as lead author which I 
 agree sounds like what Josh is talking about. Note that he's
 Microsoft Research's database group lead and it would be a pretty
 safe bet anything published from there is going to be covered by
 patents from here till next Tuesday (and seventeen years beyond).

Hmmm. I have 0 experience with handling patents and related issues. Any
idea how to address that?

 I think this is all putting the cart before the horse however. If we 
 could fix our current sampling to use the data more efficiently that 
 would be a good start before we start trying to read even more data.
 
 We currently read just one row from each block on average instead of 
 using the whole block. That's what would be needed in the worst case 
 if the blocks were a very biased sample (which indeed they probably 
 are in most databases due to the way Postgres handles updates). But
 we could at least give users the option to use more than one row per 
 block when they know it's ok (such as data that hasn't been updated) 
 or detect when it's ok (such as by carefully thinking about how 
 Postgres's storage mechanism would bias the data).

I think this will be very tricky, and in fact it may easily make the
estimates much worse, because all the algorithms assume random sampling.

For example the ndistinct estimator uses the low-frequency values (that
were observed only once or twice in the sample). By using multiple rows
from each block, you'll significantly influence this probability for
columns with values correlated to the block (which is quite common).

Take for example fact tables in data warehouses - those are usually
denormalized, mostly append-only. Say each row has date_id which is a
sequential number of a day, with 0 sometime in the past. Daily
increments are usually stored on many consecutive blocks, so on each
block there's usually a single date_id value.

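For example something like this (a completely made-up table, just to
illustrate the correlation):

   -- rows are appended in date_id order, so each 8kB block usually
   -- contains just a single date_id value
   CREATE TABLE facts AS
   SELECT (i / 100000) AS date_id, i AS fact_id
     FROM generate_series(1, 10000000) s(i);
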
By sampling all rows on a block you gain exactly nothing, and in fact it
results in observing no low-frequency values, making the estimator
absolutely useless.

I can imagine fixing this (although I don't see how exactly), but the
thing is we need to fix *all* the estimators we have, not just
ndistinct. And that's going to be tough.

I don't think adding a knob to tune the number of tuples sampled per
block is a good approach. Either we can solve the issues I described
(and in that case it's unnecessary), or we can't solve them and it turns
into a massive foot gun.

 But I looked into this and ran into a problem. I think our algorithm 
 for calculating the most frequent values list is bunk. The more rows
 I picked from each block the more biased that list was towards
 values seen earlier in the table. What's worse, when that list was
 biased it threw off the histogram since we exclude the most frequent
 values from the histogram, so all estimates were thrown off.

I think the 'minimal' stats (when we have just '=' for the type) behaves
like this, but fixing it by switching to a two-pass approach should not
be that difficult (but would cost a few more CPU cycles).

Or do you suggest that even the scalar MCV algorithm has this bias
issue? I doubt that, because the MCV works with an array sorted by
number of occurrences, so the position within the table is irrelevant.

 If we could fix the most frequent values collection to not be biased 
 when it sees values in a clumpy way then I think we would be okay to 
 set the row sample size in Vitter's algorithm to a factor of N
 larger than the block sample size where N is somewhat smaller than
 the average number of rows per block. In fact even if we used all the
 rows in the block I think I've convinced myself that the results
 would be accurate in most circumstances.

I don't expect fixing the MCV to be overly difficult (although it will
need a few more CPU cycles).

But making it work with the block sampling will be much harder, because
of the bias. The phrase 'in most circumstances' doesn't sound really
convincing to me ...

 I think to calcualte the most frequent values more accurately it 
 would take a two pass approach. Scan some random sample of blocks 
 with a counting bloom filter then do a second pass (possibly for the 
 same sample?) keeping counts only for values that the counting bloom 
 filter said hashed to the most common hash values. That might not be 
 exactly the most common values but should be at least a 
 representative sample of the most common values.

I don't see why the counting bloom filter would be necessary, in a two
pass approach?

regards
Tomas


-- 
Sent via pgsql

Re: [PERFORM] Yet another abort-early plan disaster on 9.3

2014-10-15 Thread Tomas Vondra
On 15.10.2014 19:20, Josh Berkus wrote:
 On 10/10/2014 04:16 AM, Greg Stark wrote:
 On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus j...@agliodbs.com wrote:
 Yes, it's only intractable if you're wedded to the idea of a tiny,
 fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
 can get a MUCH more accurate n_distinct estimate using multiple
 algorithms, of which HLL is one.  While n_distinct will still have some
 variance, it'll be over a much smaller range.

 I've gone looking for papers on this topic but from what I read this
 isn't so. To get any noticeable improvement you need to read 10-50% of
 the table and that's effectively the same as reading the entire table
 -- and it still had pretty poor results. All the research I could find
 went into how to analyze the whole table while using a reasonable
 amount of scratch space and how to do it incrementally.
 
 So, right now our estimation is off on large tables by -10X to
 -1X. First, the fact that it's *always* low is an indication
 we're using the wrong algorithm. Second, we can most certainly do
 better than a median of -1000X.

A few days ago I posted an experimental patch with the adaptive
estimator, described in [1]. Not perfect, but based on the testing I did
I believe it's a superior algorithm to the one we use now. Would be nice
to identify a few datasets where the current estimate is way off.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf


 One interesting set of algorithms is block-based sampling. That is,
 you read 5% of the physical table in random blocks, reading every row
 in the block. The block size is determined by your storage block
 size, so you're not actually reading any more physically than you are
 logically; it really is just 5% of the table, especially on SSD.
 
 Then you apply algorithms which first estimate the correlation of
 common values in the block (i.e. how likely is it that the table is
 completely sorted?), and then estimates of how many values there
 might be total based on the correlation estimate.

I think we might also use a different approach - instead of sampling the
data when ANALYZE kicks in, we might collect a requested sample of rows
on the fly. Say we want a 1% sample - whenever you insert a new row, you
do [random() < 0.01] and if it happens to be true you keep a copy of the
row aside. Then, when you need the sample, you simply read the sample
and you're done - no random access to the main table, no problems with
estimates being off due to block-level sampling, etc.

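A trivial sketch of the idea (table / function names made up, and a real
implementation would have to be smarter than this):

   CREATE TABLE test_sample (LIKE test);

   CREATE OR REPLACE FUNCTION sample_new_row() RETURNS trigger AS $$
   BEGIN
       -- keep ~1% of the inserted rows aside
       IF random() < 0.01 THEN
           INSERT INTO test_sample VALUES (NEW.*);
       END IF;
       RETURN NEW;
   END;
   $$ LANGUAGE plpgsql;

   CREATE TRIGGER test_sample_trg AFTER INSERT ON test
       FOR EACH ROW EXECUTE PROCEDURE sample_new_row();
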
Not sure how to track deletions/updates, though. Maybe rebuilding the
sample if the number of deletions exceeds some threshold, but that
contradicts the whole idea a bit.

 I no longer have my ACM membership, so I can't link this, but 
 researchers were able to get +/- 3X accuracy for a TPCH workload 
 using this approach. A real database would be more variable, of 
 course, but even so we should be able to achieve +/- 50X, which
 would be an order of magnitude better than we're doing now.

If you know the title of the article, it's usually available elsewhere
on the web - either at the university site, or elsewhere. I found these
two articles about block-based sampling:


http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf

   https://www.stat.washington.edu/research/reports/1999/tr355.pdf

Maybe there are more, but most of the other links were about how Oracle
does this in 11g.

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Yet another abort-early plan disaster on 9.3

2014-10-10 Thread Tomas Vondra
On 10 October 2014, 13:16, Greg Stark wrote:
 On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus j...@agliodbs.com wrote:
 Yes, it's only intractable if you're wedded to the idea of a tiny,
 fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
 can get a MUCH more accurate n_distinct estimate using multiple
 algorithms, of which HLL is one.  While n_distinct will still have some
 variance, it'll be over a much smaller range.

 I've gone looking for papers on this topic but from what I read this
 isn't so. To get any noticeable improvement you need to read 10-50% of
 the table and that's effectively the same as reading the entire table
 -- and it still had pretty poor results. All the research I could find
 went into how to analyze the whole table while using a reasonable
 amount of scratch space and how to do it incrementally.

I think it's really difficult to discuss the estimation without some basic
agreement on what are the goals. Naturally, we can't get a perfect
estimator with small samples (especially when the sample size is fixed and
not scaling with the table). But maybe we can improve the estimates
without scanning most of the table?

FWIW I've been playing with the adaptive estimator described in [1] and
the results looks really interesting, IMHO. So far I was testing it on
synthetic datasets outside the database, but I plan to use it instead of
our estimator, and do some more tests.

Would be helpful to get a collection of test cases that currently perform
poorly. I have collected a few from the archives, but if those who follow
this thread can provide additional test cases / point to a thread
describing related issues etc., that'd be great.

It certainly won't be perfect, but if it considerably improves the
estimates then I believe it's a step forward. Ultimately, it's impossible to
improve the estimates without increasing the sample size.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

regards
Tomas



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Yet another abort-early plan disaster on 9.3

2014-10-10 Thread Tomas Vondra

On 10.10.2014 16:21, Craig James wrote:
 On Fri, Oct 10, 2014 at 5:10 AM, Tomas Vondra t...@fuzzy.cz wrote:
 
  I've gone looking for papers on this topic but from what I read this
  isn't so. To get any noticeable improvement you need to read 10-50% of
  the table and that's effectively the same as reading the entire table
  -- and it still had pretty poor results. All the research I could find
  went into how to analyze the whole table while using a reasonable
  amount of scratch space and how to do it incrementally.
 
 I think it's really difficult to discuss the estimation without some
 basic
 agreement on what are the goals. Naturally, we can't get a perfect
 estimator with small samples (especially when the sample size is
 fixed and
 not scaling with the table). But maybe we can improve the estimates
 without scanning most of the table?
 
 FWIW I've been playing with the adaptive estimator described in [1] and
 the results looks really interesting, IMHO. So far I was testing it on
 synthetic datasets outside the database, but I plan to use it instead of
 our estimator, and do some more tests.
 
 
 We've solved this problem using an external (non-Postgres) dynamically
 optimizing index. In addition to the early abort, we also require an
 efficient late start, the equivalent of offset 100 limit 10. It's a
 common problem for web sites that let users page through data with just
 a tiny amount of state information (a cookie).

Yeah, paging is a known example, both for the inefficiency once you get
to pages far away, and because of the planning challenges. I think there
are known solutions to this problem
(http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way),
although those are not applicable to all cases.

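What that link describes is essentially keyset pagination - roughly
something like this (table / column names made up):

   SELECT id, created_at
     FROM some_table
    WHERE (created_at, id) < ('2014-10-01 12:34:56', 123456)
    ORDER BY created_at DESC, id DESC
    LIMIT 10;

where the values in the WHERE clause come from the last row of the
previous page, instead of ORDER BY ... OFFSET N LIMIT 10.
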
But I'm not sure how that's related to the ndistinct estimation problem,
discussed in this thread (or rather in this subthread)?

 Our index is for chemical structures. Chemicals are indexed on
 chemical fragments
 http://emolecules.com/info/molecular-informatics. A search 
 typically starts with 50-200 indexed columns (chemical fragments).
 The query is always flat, A and B and ... and Z. The indexed
 fragments are both correlated (the existence of one strongly raises
 the chances of another) and anti-correlated (certain combinations are
 very rare).

Maybe I don't understand the problem well enough, but isn't this a
perfect match for GIN indexes? I mean, you essentially need to do
queries like WHERE substance @@ ('A & B & !C') etc. Which is exactly
what GIN does, because it keeps pointers to tuples for each fragment.

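For example (just a sketch - I'm assuming the fragments can be represented
as a tsvector, the actual schema will certainly differ):

   CREATE TABLE substances (id bigint PRIMARY KEY, fragments tsvector);
   CREATE INDEX substances_fragments_idx
       ON substances USING gin (fragments);

   -- 'A and B and not C' style search
   SELECT id FROM substances
    WHERE fragments @@ to_tsquery('simple', 'A & B & !C');
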
 Static planners simply can't handle the early abort condition,
 even with good statistics. Many have pointed out that data are
 lumpy rather than well distributed. A more subtle problem is that
 you can have evenly distributed data, but badly distributed
 correlations. Agnes and Bob may be names that are distributed
 well in a real-estate database, but it might happen that all of the
 information about homes whose owners' names are Agnes and Bob
 occurs at the very end of all of your data because they just got
 married and bought a house.
 
 The end result is that even with perfect statistics on each column,
 you're still screwed. The combinatorial explosion of possible
 correlations between indexes is intractable.

Static planners clearly have limitations, but we don't have dynamic
planning in PostgreSQL, so we have to live with them. And if we could
improve the quality of estimates - lowering the probability of poorly
performing plans, it's probably good to do that.

It won't be perfect, but until we have dynamic planning it's better than
nothing.

regards
Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Yet another abort-early plan disaster on 9.3

2014-10-10 Thread Tomas Vondra
On 10.10.2014 14:10, Tomas Vondra wrote:
 On 10 October 2014, 13:16, Greg Stark wrote:
 On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus j...@agliodbs.com wrote:
 Yes, it's only intractable if you're wedded to the idea of a tiny,
 fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
 can get a MUCH more accurate n_distinct estimate using multiple
 algorithms, of which HLL is one.  While n_distinct will still have some
 variance, it'll be over a much smaller range.

 I've gone looking for papers on this topic but from what I read this
 isn't so. To get any noticeable improvement you need to read 10-50% of
 the table and that's effectively the same as reading the entire table
 -- and it still had pretty poor results. All the research I could find
 went into how to analyze the whole table while using a reasonable
 amount of scratch space and how to do it incrementally.
 
 I think it's really difficult to discuss the estimation without some basic
 agreement on what are the goals. Naturally, we can't get a perfect
 estimator with small samples (especially when the sample size is fixed and
 not scaling with the table). But maybe we can improve the estimates
 without scanning most of the table?
 
 FWIW I've been playing with the adaptive estimator described in [1] and
 the results looks really interesting, IMHO. So far I was testing it on
 synthetic datasets outside the database, but I plan to use it instead of
 our estimator, and do some more tests.

Attached is an experimental patch implementing the adaptive estimator.

It was fairly simple (although it's a bit messy). It only computes the
estimates for the  scalar case (i.e. data types that we can sort).
Implementing this for the minimal case is possible, but requires a bit
more work.

It only computes the estimate and prints a WARNING with both the current
and new estimate, but the old estimate is stored.

I also attach a few synthetic examples of synthetic datasets with
distributions stored in various ways, that I used for testing. In all
cases there's a single table with 10M rows and a single INT column.
There are three kinds of skew:

1) smooth skew

   - N distinct values (100, 10.000 and 100.000 values)
   - average moves to 0 as 'k' increases ('k' between 1 and 9)
   - smooth distribution of frequencies

   INSERT INTO test
SELECT pow(random(),k) * 1 FROM generate_series(1,1000);

2) step skew

   - a few very frequent values, many rare values
   - for example this generates 5 very frequent and ~10k rare values

   INSERT INTO test
SELECT (CASE WHEN (v  9) THEN MOD(v,5) ELSE v END)
  FROM (
 SELECT (random()*10)::int AS v
   FROM generate_series(1,1000)
  ) foo;


Results
===

I tested this with various statistics target settings (10, 100, 1000),
which translates to different sample sizes.

statistics target 100 (default, 30k rows, 0.3% sample)
==

a) smooth skew, 101 values, different skew ('k')

   k   current   adaptive
   -------------------------
   1   101       102
   3   101       102
   5   101       102
   7   101       102
   9   101       102

b) smooth skew, 10.001 values, different skew ('k')

   k   current   adaptive
   -------------------------
   1   9986   10542
   3   8902   10883
   5   7579   10824
   7   6639   10188
   9   5947   10013

c) step skew (different numbers of values)

   values   current   adaptive
   ------------------------------
   106  106107
   106  35 104
   1006 2591262
   100062823   11047


statistics target 10 (3k rows, 0.03% sample)


a) smooth skew, 101 values, different skew ('k')

   k   current   adaptive
   -------------------------
   1   101       102
   3   101       102
   5   101       102
   7   101       102
   9   101       102

b) smooth skew, 10.001 values, different skew ('k')

   k   current   adaptive
   -------------------------
   1   9846   10014
   3   4399   7190
   5   2532   5477
   7   1938   4932
   9   1623   1623

c) step skew (different numbers of values)

   values   current   adaptive
   ------------------------------
   106  100114
   106  5  5
   1006 37 532
   1000632320970

statistics target 1000 (300k rows, 3% sample)
=

a) smooth skew, 101 values, different skew ('k')

   k   current   adaptive
   -------------------------
   1   101       102
   3   101       102
   5   101       102
   7   101       102
   9   101       102

b) smooth skew, 10.001 values, different skew ('k')

   k   current   adaptive
   -------------------------
   1   10001  10002
   3   1  1
   5   9998   10011
   7   9973   10045
   9   9939   10114

c) step skew (different numbers of values)

   values   current   adaptive

Re: [PERFORM] Yet another abort-early plan disaster on 9.3

2014-10-10 Thread Tomas Vondra
On 10.10.2014 19:59, Craig James wrote:
 On Fri, Oct 10, 2014 at 9:53 AM, Tomas Vondra t...@fuzzy.cz wrote:
 
 
 On 10.10.2014 16:21, Craig James wrote:
  Our index is for chemical structures. Chemicals are indexed on
  chemical fragments
  http://emolecules.com/info/molecular-informatics. A search
  typically starts with 50-200 indexed columns (chemical fragments).
  The query is always flat, A and B and ... and Z. The indexed
  fragments are both correlated (the existence of one strongly raises
  the chances of another) and anti-correlated (certain combinations are
  very rare).
 
 Maybe I don't understand the problem well enough, but isn't this a
 perfect match for GIN indexes? I mean, you essentially need to do
 queries like WHERE substance @@ ('A & B & !C') etc. Which is exactly
 what GIN does, because it keeps pointers to tuples for each fragment.
 
 
 On the day our web site opened we were using tsearch. Before the end of
 the day we realized it was a bad idea, for the very reasons discussed
 here. The early-abort/late-start problem (offset N limit M) could take
 minutes to return the requested page. With the external
 dynamically-optimized index, we can almost always get answers in less
 than a couple seconds, often in 0.1 seconds.

In the early days of tsearch, it did not support GIN indexes, and AFAIK
GiST are not nearly as fast for such queries. Also, the GIN fastscan
implemented by Alexander Korotkov in 9.4 makes a huge difference for
queries combining frequent and rare terms.

Maybe it'd be interesting to try this on 9.4. I'm not saying it will
make it faster than the optimized index, but it might be an interesting
comparison.

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Yet another abort-early plan disaster on 9.3

2014-10-03 Thread Tomas Vondra
On 3.10.2014 21:58, Jeff Janes wrote:
 On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus j...@agliodbs.com wrote:
 
 Yes, it's only intractable if you're wedded to the idea of a tiny,
 fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
 can get a MUCH more accurate n_distinct estimate using multiple
 algorithms, of which HLL is one.  While n_distinct will still have some
 variance, it'll be over a much smaller range.
 
 
 In my hands, the problems with poor n_distinct were not due to the
 insufficient size of the sample, but the insufficient randomness of it. 
 Increasing  default_statistics_target did help but not because it
 increases the number of rows sampled, but rather because it increases
 the number of blocks sampled.  Once substantially all of the blocks are
 part of the block sampling, the bias is eliminated even though it was
 still sampling a small fraction of the rows (roughly one per block).

I don't think that's entirely accurate. According to [1], there's a
lower boundary on ratio error, depending on the number of sampled rows.

Say there's a table with 10M rows, we sample 30k rows (which is the
default). Then with probability 5% we'll get ratio error over 20. That
is, we may either estimate 5% or 2000% of the actual ndistinct value.
Combined with our arbitrary 10% limit that we use to decide whether
ndistinct scales with the number of rows, this sometimes explodes.

By increasing the statistics target, you get much larger sample and thus
lower probability of such error. But nevertheless, it breaks from time
to time, and the fact that statistics target is static (and not scaling
with the size of the table to get appropriate sample size) is not really
helping IMHO. Static sample size may work for histograms, for ndistinct
not so much.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf


 So one idea would be go get rid of the 2-stage sampling algorithm 
 (sample blocks, sample rows from the chosen blocks) and just read
 the whole table and sample rows from it unbiased, at least under
 some conditions. Some low level benchmarking on my favorite server
 showed that reading 1% of a system's blocks (in block number order
 within each file) was no faster than reading all of them from an IO
 perspective. But that is a virtualized server that wasn't really
 speced out to be an IO intensive database server in the first place.
 It would be interesting to see what people get on real hardware that
 they actually designed for the task.

I think there was a discussion about the sampling on pgsql-hackers a
while ago ... and yes, here it is [2]. However it seems there was no
clear conclusion on how to change it at that time ...


[2]
http://www.postgresql.org/message-id/ca+tgmozaqygsual2v+yfvsx06dqdqh-pev0nobgpws-dnwa...@mail.gmail.com

regards
Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Yet another abort-early plan disaster on 9.3

2014-10-03 Thread Tomas Vondra
On 3.10.2014 02:54, Peter Geoghegan wrote:
 On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus j...@agliodbs.com wrote:
 Yes, it's only intractable if you're wedded to the idea of a tiny, 
 fixed-size sample. If we're allowed to sample, say, 1% of the
 table, we can get a MUCH more accurate n_distinct estimate using
 multiple algorithms, of which HLL is one. While n_distinct will
 still have some variance, it'll be over a much smaller range.
 
 I think that HyperLogLog, as a streaming algorithm, will always 
 require that the entire set be streamed. This doesn't need to be a
 big deal - in the case of my abbreviated key patch, it appears to 
 basically be free because of the fact that we were streaming 
 everything anyway. It's a very cool algorithm, with fixed overhead
 and constant memory usage. It makes very useful guarantees around 
 accuracy.

I think you're mixing two things here - estimating the number of
distinct values in a sample (which can be done very efficiently using
HLL) and estimating the number of distinct values in the whole table.
For that HLL is not usable, unless you process all the data.

Sadly HLL is rather incompatible with the usual estimators, because the
ones I'm aware of need to know the number of occurrences for the distinct
values etc.

But couldn't we just piggyback this on autovacuum? One of the nice HLL
features is that it's additive - you can build partial counters for
ranges of blocks (say, a few MBs per range), and then merge them when
needed. By keeping the parts it's possible to rebuild it separately.

But maybe this idea is way too crazy ...

regards
Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] 60 core performance with 9.3

2014-07-30 Thread Tomas Vondra
On 30 July 2014, 3:44, Mark Kirkwood wrote:

 While these numbers look great in the middle range (12-96 clients), then
 benefit looks to be tailing off as client numbers increase. Also running
 with no stats (and hence no auto vacuum or analyze) is way too scary!

I assume you've disabled statistics collector, which has nothing to do
with vacuum or analyze.

There are two kinds of statistics in PostgreSQL - data distribution
statistics (which is collected by ANALYZE and stored in actual tables
within the database) and runtime statistics (which is collected by the
stats collector and stored in a file somewhere on the disk).

By disabling the statistics collector you lose runtime counters - number of
sequential/index scans on a table, tuples read from a relation etc. But
it does not influence VACUUM or planning at all.

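For example (the first query reads the runtime counters maintained by the
stats collector, the second one the distribution stats built by ANALYZE -
'some_table' is just a placeholder):

   SELECT relname, seq_scan, idx_scan, n_tup_ins
     FROM pg_stat_user_tables;

   SELECT tablename, attname, n_distinct, null_frac
     FROM pg_stats
    WHERE tablename = 'some_table';
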
Also, it's mostly async (send over UDP and you're done) and shouldn't make
much difference unless you have a large number of objects. There are ways to
improve this (e.g. by placing the stat files into a tmpfs).

Tomas



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

