Re: [PERFORM] How clustering for scale out works in PostgreSQL

2013-09-02 Thread Craig Ringer
On 08/30/2013 01:48 AM, bsreejithin wrote:
 Ya..sure... Migration to 9.2 is one of the activities planned and in fact
 it's already on track. Thanks Thomas

You'll want to re-do your performance testing; a huge amount has changed
since 8.2.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] How clustering for scale out works in PostgreSQL

2013-09-02 Thread Craig Ringer
bsreejithin wrote:
 
 I am *expecting 1000+ hits to my PostgreSQL DB* and I doubt my standalone DB
 will be able to handle it.
 
 So I want to *scale out by adding more servers to share the load*. For this,
 I want to do clustering.

  The DB server was 4 cores, 12GB RAM.

You're jumping way ahead here. You have a medium sized server that
should effortlessly handle most loads if its I/O subsystem is up to it.

It's a good idea to plan for what you'll do as load grows, but there's
no need to jump straight to engineering some "web scale" monstrosity.

 The performance test that was conducted was for 1 Hour. 

 There are 6 transactions. 2 DB inserts and 4 SELECTs.
 Every 2 minutes there will be 4 SELECTs. And every 3 minutes there will be 2
 DB inserts.

It's not possible to give useful specific advice without knowing what
the selects and updates you're dealing with actually are. After all, a
single-tuple update of a non-indexed field with no trigger side-effects
will be way sub-millisecond. On the other hand, a big update over a
multi-table join that causes updates on several multi-column indexes,
GIN indexes, cascade updates, etc, might take hours.

You need to work out what the actual load is. Determine whether you're
bottlenecked on disk reads, disk writes, disk flushes (fsync), CPU, etc.

Ask some basic tuning questions. Does your DB fit in RAM? Do at least
the major indexes and hot smaller tables fit in RAM? Is
effective_cache_size set to tell the query planner that?
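
A quick sanity check (a sketch; the table name is illustrative, and the
setting is only a common starting point for a 12GB machine, not a
recommendation):

  -- How big is the database, and how big are the hot tables?
  SELECT pg_size_pretty(pg_database_size(current_database()));
  SELECT pg_size_pretty(pg_total_relation_size('some_hot_table'));

  -- In postgresql.conf, tell the planner roughly how much data the OS
  -- cache plus shared_buffers can be expected to hold:
  --   effective_cache_size = 8GB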

Look at the query plans. Is there anything grossly unreasonable? Do you
need to adjust any tuning params (random_page_cost,
effective_cache_size, etc)? Figure out whether there are any indexes
that're worth creating that won't make the write load too much worse.
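
For example (a sketch; the query and table are stand-ins for your real
workload):

  -- Look at timing and buffer hits/reads for a real query:
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM orders WHERE customer_id = 42;

  -- Tuning params can be tried per-session before touching
  -- postgresql.conf:
  SET random_page_cost = 2.0;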

Find the point where throughput stops scaling up with load on the
server. Put a connection pooler in place and limit concurrent working
connections to PostgreSQL to about that level; overall performance will
be greatly improved by not trying to do too much all at once.
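
For instance, with PgBouncer, something like this (a sketch; the
database name, port and pool size are assumptions you'd replace with the
saturation point you measured):

  ; pgbouncer.ini
  [databases]
  mydb = host=127.0.0.1 port=5432 dbname=mydb

  [pgbouncer]
  listen_port = 6432
  auth_type = md5
  auth_file = /etc/pgbouncer/userlist.txt
  pool_mode = transaction
  max_client_conn = 1000
  default_pool_size = 20   ; about where throughput stopped scaling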

 I am *curious to know how clustering works in PostgreSQL.* (I don't want to
 know how to setup cluster - as of now. Just want to know how clustering
 works).

The clustering available in PostgreSQL is a variety of forms of
replication.

It is important to understand that attempting to scale out to
multi-server setups requires significant changes to many applications.
There is no transparent multi-master clustering for PostgreSQL.

If you're on a single server, you can rely on the strict rules
PostgreSQL follows for transaction isolation. Row-level locking ensures
that two updates can't silently conflict, and SERIALIZABLE isolation
protects against a wide variety of subtler concurrency problems.
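
For example (table and values are hypothetical; the point is that this
needs nothing beyond plain SQL on one node):

  BEGIN;
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  -- If a concurrent transaction would create a serialization anomaly,
  -- this one aborts with SQLSTATE 40001 and can simply be retried.
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  COMMIT;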

Most of that goes away when you go multi-server. If you're using a
single master and multiple read-replicas you have to deal with lags,
where the replicas haven't yet seen / replayed transactions performed on
the master. So you might UPDATE a row in one transaction, only to find
that when you SELECT it the update isn't there ... then it suddenly
appears when you SELECT again. Additionally, long-running queries on the
read-only replicas can be aborted to allow the replica to continue
replaying changes from the master.
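
The lag is at least easy to observe (9.1+):

  -- On a hot standby: roughly how far behind is replay?
  SELECT now() - pg_last_xact_replay_timestamp() AS approx_replay_lag;

  -- true on a standby, false on the master:
  SELECT pg_is_in_recovery();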

You can work around that one with synchronous replication, but then you
create another set of performance challenges on the master.
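
Turning it on is simple enough (a sketch; 'standby1' is an assumed
application_name that the standby would set in its primary_conninfo):

  # postgresql.conf on the master
  synchronous_standby_names = 'standby1'
  synchronous_commit = on   # each commit waits for standby confirmation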

There are also a variety of logical / row-level replication options.
They have their own trade-offs in terms of impact on master performance,
transaction consistency, etc.

It only gets more fun when you want multiple masters, where you can
write to more than one server.  Don't go there unless you have to.

 When I look at some of the content available while googling, I am getting
 more and more confused, as I find that in most of the sites, clustering is
 used interchangeably with replication.

Well, a cluster of replicas is still a cluster.

If you mean transparent multi-master clustering, well that's another
thing entirely.

I strongly recommend you go back to basics. Evaluate the capacity of the
server you've got, update PostgreSQL, characterize the load, do some
basic tuning, benchmark based on a simulation of your load, get a
connection pooler in place, do some basic query pattern and plan
analysis, etc.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Varchar vs foreign key vs enumerator - table and index size

2013-09-02 Thread Łukasz Walkowski

On 1 Sep 2013, at 03:31, Craig James cja...@emolecules.com wrote:

 If your applications are read-heavy and only have a small-ish amount of code 
 that inserts/updates the table, it may not be that much of a rewrite. You can 
 create an integer/varchar table of key/values, use its key to replace the 
 current varchar column, rename the original table, and create a view with the 
 original table's name.  Code that only reads the data won't know the 
 difference. And it's a portable solution.
 
 I did this and it worked out well. If the key/value pairs table is relatively 
 small, the planner does an excellent job of generating efficient queries 
 against the big table.
 
 Craig

Actually this (event) table is write heavy. But the concept is really cool and 
worth trying. Thanks.
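
For the archives, a minimal sketch of the rename-and-view approach Craig
describes (all table and column names are hypothetical):

  -- Lookup table for the old varchar column:
  CREATE TABLE browser_lookup (
      id   integer PRIMARY KEY,
      name varchar NOT NULL UNIQUE
  );

  ALTER TABLE event RENAME TO event_raw;
  ALTER TABLE event_raw
      ADD COLUMN browser_id integer REFERENCES browser_lookup(id);
  -- ... populate browser_lookup and browser_id, then drop the old
  -- varchar column ...

  -- Readers still see a table called "event":
  CREATE VIEW event AS
  SELECT e.id, l.name AS browser   -- plus the remaining columns
  FROM event_raw e
  JOIN browser_lookup l ON l.id = e.browser_id;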


Lukasz


Re: [PERFORM] Varchar vs foreign key vs enumerator - table and index size

2013-09-02 Thread Łukasz Walkowski

On 1 Sep 2013, at 05:10, Vitalii Tymchyshyn tiv...@gmail.com wrote:
 
 
 Well, there are some more options:
 a) Store int keys and do the mapping in the application (e.g. with Java 
 enums). This can save you a join, which is especially useful if you are 
 going to do paged output in a limit/offset scenario. The optimizer 
 sometimes produces suboptimal plans for joins in offset/limit queries.
 b) Store small varchar values as keys (up to "char" type if you really 
 want to save space) and do the user-display mapping in the application. 
 It's different from (a) since it's harder to mess with the mapping and 
 values are still more or less readable with a simple select. But it can 
 be less efficient than (a).
 c) Use a mixed approach with a mapping table, loaded at startup into 
 application memory. This would be an optimization in case you run into 
 optimizer troubles.
 
 Best regards, Vitalii Tymchyshyn

I'd like to leave the database in readable form, because before I add
new queries and REST endpoints to the application I test them as ad-hoc
queries on the command line. So variant a) isn't good for me. Variant
b) is worth trying, and c) is easy to code, but I still prefer having
all this data in the database, independent of application logic.
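
A sketch of variant b), assuming a column that only ever holds a handful
of known codes (names and codes are illustrative):

  -- "char" (quoted) is a one-byte type, unlike char(1):
  CREATE TABLE event (
      id      bigserial PRIMARY KEY,
      browser "char" NOT NULL   -- 'f' = firefox, 'c' = chrome, ...
  );

  -- Still readable enough for ad-hoc queries:
  SELECT count(*) FROM event WHERE browser = 'f';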

Thanks for the suggestion,
Lukasz


Re: [PERFORM] Varchar vs foreign key vs enumerator - table and index size

2013-09-02 Thread Andrew Dunstan


On 09/02/2013 05:53 AM, Łukasz Walkowski wrote:

On 1 Sep 2013, at 05:10, Vitalii Tymchyshyn tiv...@gmail.com wrote:


Well, there are some more options:
a) Store int keys and do the mapping in the application (e.g. with Java 
enums). This can save you a join, which is especially useful if you are 
going to do paged output in a limit/offset scenario. The optimizer 
sometimes produces suboptimal plans for joins in offset/limit queries.
b) Store small varchar values as keys (up to "char" type if you really 
want to save space) and do the user-display mapping in the application. 
It's different from (a) since it's harder to mess with the mapping and 
values are still more or less readable with a simple select. But it can 
be less efficient than (a).
c) Use a mixed approach with a mapping table, loaded at startup into 
application memory. This would be an optimization in case you run into 
optimizer troubles.

Best regards, Vitalii Tymchyshyn

I'd like to leave the database in readable form, because before I add
new queries and REST endpoints to the application I test them as ad-hoc
queries on the command line. So variant a) isn't good for me. Variant
b) is worth trying, and c) is easy to code, but I still prefer having
all this data in the database, independent of application logic.




I think the possible use of Postgres enums has been written off too
easily in this thread. Looking at the original problem description, they
look like quite a good fit, despite the OP's skepticism. What exactly is
wanted that can't be done with database enums? You can add new values to
the type very simply. You can change the labels of existing values
slightly less simply, but still without any great difficulty. The things
that are hard are removing labels from the set and changing the sort
order, because those would require processing the tables where the type
is used, unlike the simple things. But neither of those is required for
typical use cases. For most uses of this kind, enums are very efficient
in both storage and processing.
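
For example (type and labels are illustrative; ADD VALUE needs 9.1+, and
relabelling currently means a careful, superuser-only catalog update):

  CREATE TYPE browser_type AS ENUM ('firefox', 'chrome', 'safari');

  -- Adding a new label is simple:
  ALTER TYPE browser_type ADD VALUE 'opera';

  -- Changing an existing label is slightly less simple:
  UPDATE pg_enum
  SET enumlabel = 'chromium'
  WHERE enumtypid = 'browser_type'::regtype
    AND enumlabel = 'chrome';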


cheers

andrew



[PERFORM] planner parameters

2013-09-02 Thread Torsten Förtsch
Hi,

Depending on the OFFSET parameter, I have seen at least 3 different
query plans for the query below.

SELECT * FROM
(
 SELECT * FROM transaction tt
 WHERE
 tt.account_id = '1376641'
 AND tt.transaction_time >= '2013-02-03 05:37:24'
 AND tt.transaction_time < '2013-08-23 05:37:24'
 ORDER BY
 tt.transaction_time ASC,
 tt.id ASC
 LIMIT 1
 OFFSET 0
) t1
LEFT OUTER JOIN fmb t2
ON (t1.fmb_id = t2.id)
LEFT OUTER JOIN payment.payment t3
ON (t1.payment_id = t3.id);

The best of them is this:

 Nested Loop Left Join  (cost=1488.34..126055.47 rows=9985 width=1015)
(actual time=26.894..78.711 rows=1 loops=1)
   ->  Nested Loop Left Join
   (cost=1487.91..86675.47 rows=9985 width=828)
   (actual time=26.892..72.170 rows=1 loops=1)
 ->  Limit  (cost=1487.35..1911.50 rows=9985 width=597)
(actual time=26.873..33.735 rows=1 loops=1)
   ->  Index Scan using xxx on transaction tt
   (cost=0.57..1911.50 rows=44985 width=597)
   (actual time=0.020..31.707 rows=45000 loops=1)
 Index Cond: ((account_id = 1376641::bigint) AND
  (transaction_time >= '...') AND
  (transaction_time < '...'))
 ->  Index Scan using pk_fmb on fmb t2
 (cost=0.56..8.47 rows=1 width=231)
 (actual time=0.003..0.003 rows=1 loops=1)
   Index Cond: (tt.fmb_id = id)
   ->  Index Scan using pk_payment on payment t3
   (cost=0.43..3.93 rows=1 width=187)
   (actual time=0.000..0.000 rows=0 loops=1)
 Index Cond: (tt.payment_id = id)
 Total runtime: 79.219 ms

Another one is this:

 Hash Left Join  (cost=55139.59..140453.16 rows=9985 width=1015)
 (actual time=715.450..762.989 rows=1 loops=1)
   Hash Cond: (tt.payment_id = t3.id)
   ->  Nested Loop Left Join
   (cost=1487.91..86675.47 rows=9985 width=828)
   (actual time=27.472..70.723 rows=1 loops=1)
 ->  Limit  (cost=1487.35..1911.50 rows=9985 width=597)
(actual time=27.453..34.066 rows=1 loops=1)
   ->  Index Scan using xxx on transaction tt
   (cost=0.57..1911.50 rows=44985 width=597)
   (actual time=0.076..32.050 rows=45000 loops=1)
 Index Cond: ((account_id = 1376641::bigint) AND
  (transaction_time >= '...') AND
  (transaction_time < '...'))
 ->  Index Scan using pk_fmb on fmb t2
 (cost=0.56..8.47 rows=1 width=231)
 (actual time=0.003..0.003 rows=1 loops=1)
   Index Cond: (tt.fmb_id = id)
   ->  Hash  (cost=40316.30..40316.30 rows=1066830 width=187)
 (actual time=687.651..687.651 rows=1066830 loops=1)
 Buckets: 131072  Batches: 1  Memory Usage: 235206kB
 ->  Seq Scan on payment t3
 (cost=0.00..40316.30 rows=1066830 width=187)
 (actual time=0.004..147.681 rows=1066830 loops=1)
 Total runtime: 781.584 ms


As you can see, the 2nd plan takes 10 times longer.

Now, if I

  set enable_seqscan=off;

the planner generates the 1st plan also for this parameter set and it
executes in about the same time (~80 ms).
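
Scoping the override to a single transaction, rather than the whole
session, also works (a sketch):

  BEGIN;
  SET LOCAL enable_seqscan = off;  -- reverts at COMMIT/ROLLBACK
  -- ... run the query from above ...
  COMMIT;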

Then I created a new tablespace with very low cost settings:

  alter tablespace trick_indexes set
(seq_page_cost=0.0001, random_page_cost=0.0001);

and moved the pk_payment there. The tablespace is located on the same
disk. The only reason for its existence is the differing cost parameters.

Now I could turn enable_seqscan back on and still get the better query plan.


Is there another way to make the planner generate the 1st plan?

Why does it generate the 2nd plan at all?

Does the planner take into account what is currently present in shared
memory? If so, it could know that the pk_payment index is probably in
RAM most of the time.


Thanks,
Torsten

