Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Niels Kristian Schjødt
On 11/12/2012 at 00.58, Jeff Janes wrote:

> On Mon, Dec 10, 2012 at 2:51 PM, Niels Kristian Schjødt
>  wrote:
> 
>> synchronous_commit = off
>> 
>> The pg_xlog folder has been moved onto the SSD array (md3), and symlinked
>> back into the postgres dir.
> 
> With synchronous_commit = off, or with large transactions, there is
> probably no advantage to moving those to SSD.
> 
> 
>> 2)
>> When the database is loaded like this, I see a lot of queries taking up to
>> 1000 times as long as they would when the database is not loaded so
>> heavily.
> 
> What kinds of queries are they?  single-row look-ups, full table scans, etc.
Mostly they are updates, like the one shown in the previous question I
referenced.
>> 
>>  Notes and thoughts
>> ##
>> 
>> As you can see, even though I have moved the pg_xlog folder to the SSD array
>> (md3) the by far largest amount of writes still goes to the regular HDD's
>> (md2), which puzzles me - what can that be?
> 
> Every row you insert or non-HOT update has to do maintenance on all
> indexes of that table.  If the rows are not inserted/updated in index
> order, this means every row inserted/updated dirties a randomly
> scattered 8KB page for each of the indexes.  If you have lots of indexes
> per table, that adds up fast.
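Whether updates are HOT (and so skip index maintenance entirely) can be checked from the statistics views; a rough sketch, with the ORDER BY and LIMIT purely illustrative:

```sql
-- Fraction of updates that were HOT, per table: a low hot_pct on a
-- heavily indexed table means most updates dirty one page per index.
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 0
ORDER BY n_tup_upd DESC
LIMIT 10;
```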
> 
> The fact that there is much more writing than reading tells me that
> most of your indexes are in RAM.  The amount of index you are rapidly
> reading and dirtying is large enough to fit in RAM, but is not large
> enough to fit in shared_buffers + kernel's dirty-buffer comfort level.
Maybe I should mention that I never see more than 5 GB out of my total 32 GB
being in use on the server… Can I somehow utilize more of it?
> So you are redirtying the same blocks over and over, PG is
> desperately dumping them to the kernel (because shared_buffers it too
> small to hold them) and the kernel is desperately dumping them to
> disk, because vm.dirty_background_ratio is so low.  There is little
> opportunity for write-combining, because they don't sit in memory long
> enough to accumulate neighbors.
> 
> How big are your indexes?
This is a size list of all my indexes: 117 MB, 118 MB, 11 MB, 12 MB, 12 MB, 12 
MB, 12 MB, 140 MB, 15 MB, 15 MB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 
kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 MB, 16 MB, 176 kB, 176 kB, 17 MB, 18 
MB, 19 MB, 23 MB, 240 kB, 24 MB, 256 kB, 25 MB, 25 MB, 26 MB, 26 MB, 27 MB, 27 
MB, 27 MB, 27 MB, 280 MB, 2832 kB, 2840 kB, 288 kB, 28 MB, 28 MB, 28 MB, 28 MB, 
28 MB, 28 MB, 28 MB, 28 MB, 29 MB, 29 MB, 3152 kB, 3280 kB, 32 kB, 32 MB, 32 
MB, 3352 kB, 3456 kB, 34 MB, 36 MB, 3744 kB, 3776 kB, 37 MB, 37 MB, 3952 kB, 
400 kB, 408 kB, 40 kB, 40 kB, 40 kB, 416 kB, 416 kB, 42 MB, 432 kB, 4520 kB, 
4720 kB, 47 MB, 48 kB, 496 kB, 49 MB, 512 kB, 52 MB, 52 MB, 5304 kB, 5928 kB, 
6088 kB, 61 MB, 6224 kB, 62 MB, 6488 kB, 64 kB, 6512 kB, 71 MB, 72 kB, 72 kB, 
8192 bytes, 8400 kB, 88 MB, 95 MB, 98 MB
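A list like that, with the index names attached, can be pulled straight from the statistics views; one way to sketch it:

```sql
-- Per-index on-disk size, largest first
SELECT schemaname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
```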
> You could really crank up shared_buffers or vm.dirty_background_ratio,
> but doing so might cause problems with checkpoints stalling and
> latency spikes.  That would probably not be a problem during the
> night, but could be during the day.
What do you have in mind here? Tweaking what parameters to what values?
> Rather than moving maintenance to the day and hoping it doesn't
> interfere with normal operations, I'd focus on making night-time
> maintenance more efficient, for example by dropping indexes (either
> just at night, or, if some indexes are not useful, getting rid of them
> altogether), cranking up shared_buffers at night, or maybe
> partitioning, or looking into pg_bulkload.
> 
>> From stat 3) (the iostat) I notice that the SSDs don't seem to be
>> anywhere near fully utilized - maybe something other than just pg_xlog could
>> be moved here?
> 
> I don't know how big each disk is, or how big your various categories
> of data are.  Could you move everything to SSD?  Could you move all
> your actively updated indexes there?
With table spaces you mean?
> Or, more fundamentally, it looks like you spent too much on CPUs (86%
> idle) and not nearly enough on disks.  Maybe you can fix that for less
> money than it will cost you in your optimization time to make the best
> of the disks you already have.
The SSDs I use are 240 GB each, which will grow too small within a few months
- so - how does moving the whole data dir onto four of those in a RAID5 array
sound?
> 
> Cheers,
> 
> Jeff



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Craig Ringer
On 12/11/2012 06:04 PM, Niels Kristian Schjødt wrote:
>
> Maybe I should mention, that I never see more than max 5Gb out of my total 
> 32Gb being in use on the server… Can I somehow utilize more of it?
For an update-mostly workload it probably won't do you tons of good so
long as all your indexes fit in RAM. You're clearly severely
bottlenecked on disk I/O not RAM.
> The SSD's I use a are 240Gb each which will grow too small within a
> few months - so - how does moving the whole data dir onto four of
> those in a RAID5 array sound? 

Not RAID 5!

Use a RAID10 of four or six SSDs.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Niels Kristian Schjødt

On 11/12/2012 at 14.29, Craig Ringer wrote:

> On 12/11/2012 06:04 PM, Niels Kristian Schjødt wrote:
>> 
>> Maybe I should mention, that I never see more than max 5Gb out of my total 
>> 32Gb being in use on the server… Can I somehow utilize more of it?
> For an update-mostly workload it probably won't do you tons of good so
> long as all your indexes fit in RAM. You're clearly severely
> bottlenecked on disk I/O not RAM.
>> The SSD's I use a are 240Gb each which will grow too small within a
>> few months - so - how does moving the whole data dir onto four of
>> those in a RAID5 array sound? 
> 
> Not RAID 5!
> 
> Use a RAID10 of four or six SSDs.
> 
> -- 
> Craig Ringer   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
> 
Hehe, got it - did you have a look at the SSDs I am considering building it
from?
http://ark.intel.com/products/66250/Intel-SSD-520-Series-240GB-2_5in-SATA-6Gbs-25nm-MLC
Do you think they are suitable?





Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Evgeny Shishkin

On Dec 11, 2012, at 5:35 PM, Niels Kristian Schjødt 
 wrote:

> Hehe got it - did you have a look at the SSD's I am considering building it 
> of? 
> http://ark.intel.com/products/66250/Intel-SSD-520-Series-240GB-2_5in-SATA-6Gbs-25nm-MLC
>  
> Are they suitable do you think?
> 

I am not Craig, but I use them in production in a RAID10 array now.






Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Jeff Janes
On Tue, Dec 11, 2012 at 2:04 AM, Niels Kristian Schjødt
 wrote:
> On 11/12/2012 at 00.58, Jeff Janes wrote:
>
>>
>> The fact that there is much more writing than reading tells me that
>> most of your indexes are in RAM.  The amount of index you are rapidly
>> reading and dirtying is large enough to fit in RAM, but is not large
>> enough to fit in shared_buffers + kernel's dirty-buffer comfort level.

> Maybe I should mention, that I never see more than max 5Gb out of my total 
> 32Gb being in use on the server… Can I somehow utilize more of it?

What tool do you use to determine that?  Is that on top of the 4GB
shared_buffers, or including it?

How big is your entire data set?  Maybe all your data fits in 5GB
(believable, as all your indexes listed below sum to < 2.5GB) so there
is no need to use more.

Or maybe you have hit a bug in the 3.2 kernel.  At least one of those
has been frequently discussed.


>> You could really crank up shared_buffers or vm.dirty_background_ratio,
>> but doing so might cause problems with checkpoints stalling and
>> latency spikes.  That would probably not be a problem during the
>> night, but could be during the day.

> What do you have in mind here? Tweaking what parameters to what values?

I'd set shared_buffers to 20GB (or 10GB, if that will hold all of your
data) and see what happens.  And probably increase checkpoint_timeout
and checkpoint_segments about 3x each.  Also, turn on log_checkpoints
so you can see what kinds of problems those changes may be causing
(i.e. long sync times).  Preferably, do this on some kind of
pre-production or test server.
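As a sketch, those settings would look something like this in postgresql.conf; the numbers are illustrative for this 32 GB machine and the "about 3x" suggestion above, not recommendations:

```ini
# Experimental values along the lines suggested above -- test first
shared_buffers = 20GB          # or 10GB if that already holds the working set
checkpoint_timeout = 15min     # ~3x the 5min default
checkpoint_segments = 32       # ~3x a typical tuned value (the default is only 3)
log_checkpoints = on           # watch the logs for long sync times
```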

But if your database is growing so rapidly that it soon won't fit on
240GB, then cranking up shared_buffers won't do for long.  If you can
get your tables and all of their indexes clustered together, then you
can do the updates in an order that makes IO more efficient.  Maybe
partitioning would help.


>> I don't know how big each disk is, or how big your various categories
>> of data are.  Could you move everything to SSD?  Could you move all
>> your actively updated indexes there?

> With table spaces you mean?

Yes.  Or moving everything to SSD if it fits; then you don't have to go
through and separate objects.

The UPDATE you posted in a previous thread looked like the table
blocks might also be getting dirtied in a fairly random order, which
means the table blocks are in the same condition as the index blocks
so maybe singling out the indexes isn't warranted.

Cheers,

Jeff




Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Niels Kristian Schjødt
And what is your experience so far?


Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Evgeny Shishkin

On Dec 11, 2012, at 10:54 PM, Niels Kristian Schjødt 
 wrote:

> And what is your experience so far?
> 
Increased tps by a factor of 10; the database is no longer a limiting factor
for the application.
And it is cheaper than brand-name rotating drives.





[PERFORM] Occasional timeouts on TRUNCATE and simple INSERTs

2012-12-11 Thread Osborn, Jeff
Hello,

I am seeing some strange performance on a new pg9.1 instance.  We are seeing 
occasional statement timeouts on some TRUNCATEs and INSERTs.  In both cases, 
the statements are quite simple:
   - TRUNCATE schema.table;
   - INSERT INTO schema.table VALUES ($1, 2, $2, '');

Sometimes these will succeed.  Occasionally I see timeouts.  The 
statement_timeout is set to 60 seconds.  These tables are not particularly 
large; in the case of the insert, the table only has three rows.  

Our previous PostgreSQL 8.2 instance did not have this problem.  Any ideas
about how to track down the issue?

Thanks,

--Jeff O



Re: [PERFORM] Occasional timeouts on TRUNCATE and simple INSERTs

2012-12-11 Thread Sergey Konoplev
On Tue, Dec 11, 2012 at 1:19 PM, Osborn, Jeff  wrote:
> I am seeing some strange performance on a new pg9.1 instance.  We are seeing 
> occasional statement timeouts on some TRUNCATEs and INSERTs.  In both cases, 
> the statements are quite simple:
>- TRUNCATE schema.table;
>   -  INSERT  INTO schema.table VALUES  ($1,2,$2,'');
>
> Sometimes these will succeed.  Occasionally I see timeouts.  The 
> statement_timeout is set to 60 seconds.  These tables are not particularly 
> large; in the case of the insert, the table only has three rows.

The most common case is when a backup (pg_dump*) is running: TRUNCATE has to
wait for it because TRUNCATE acquires an access exclusive lock on the table,
and all other queries, including the INSERT, then have to wait for the
TRUNCATE. Check the backup case first.
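On 9.1, a query along these lines shows who is blocked and by whom. It is a simplified sketch: it matches on relation locks only, and it uses the 9.1 column names (procpid, current_query) that became pid and query in 9.2:

```sql
-- Which statements are waiting on a lock, and which session holds it
SELECT blocked.pid      AS blocked_pid,
       ba.current_query AS blocked_query,
       blocking.pid     AS blocking_pid,
       bb.current_query AS blocking_query
FROM pg_locks blocked
JOIN pg_locks blocking
  ON  blocking.granted
  AND blocking.pid <> blocked.pid
  AND blocking.locktype = blocked.locktype
  AND blocking.relation IS NOT DISTINCT FROM blocked.relation
JOIN pg_stat_activity ba ON ba.procpid = blocked.pid
JOIN pg_stat_activity bb ON bb.procpid = blocking.pid
WHERE NOT blocked.granted;
```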

> Our previous Postgresql 8.2 instance did not have this problem.

That is strange to me.

-- 
Sergey Konoplev
Database and Software Architect
http://www.linkedin.com/in/grayhemp

Phones:
USA +1 415 867 9984
Russia, Moscow +7 901 903 0499
Russia, Krasnodar +7 988 888 1979

Skype: gray-hemp
Jabber: gray...@gmail.com




Re: [PERFORM] Occasional timeouts on TRUNCATE and simple INSERTs

2012-12-11 Thread Merlin Moncure
On Tue, Dec 11, 2012 at 3:38 PM, Sergey Konoplev  wrote:
> On Tue, Dec 11, 2012 at 1:19 PM, Osborn, Jeff  wrote:
>> I am seeing some strange performance on a new pg9.1 instance.  We are seeing 
>> occasional statement timeouts on some TRUNCATEs and INSERTs.  In both cases, 
>> the statements are quite simple:
>>- TRUNCATE schema.table;
>>   -  INSERT  INTO schema.table VALUES  ($1,2,$2,'');
>>
>> Sometimes these will succeed.  Occasionally I see timeouts.  The 
>> statement_timeout is set to 60 seconds.  These tables are not particularly 
>> large; in the case of the insert, the table only has three rows.
>
> A most common case is when backup (pg_dump*) is running TRUNCATE has
> to wait for it because it acquires an access exclusive lock on a table
> and all other queries including INSERT have to wait for the TRUNCATE.
> Check the backup case first.

Yeah: the absolute first thing to check is whether your statements are being
blocked -- you can get that via pg_stat_activity from another session.
It's a completely different beast if that's the case.

merlin




Re: [PERFORM] Occasional timeouts on TRUNCATE and simple INSERTs

2012-12-11 Thread Osborn, Jeff
Yeah, I've been running a cron job pulling relevant info from pg_stat_activity.
Haven't seen anything yet.  Currently looking into the pg_dump situation.

--Jeff O
 
On Dec 11, 2012, at 5:16 PM, Merlin Moncure wrote:

> On Tue, Dec 11, 2012 at 3:38 PM, Sergey Konoplev  wrote:
> 
> Yeah: absolute first thing to check is if your statements are being
> blocked -- you can get that via pg_stat_activity from another session.
> It's a completely different beast if that's the case.
> 
> merlin





Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Niels Kristian Schjødt
Are you using a hardware-based RAID controller with them?


Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Evgeny Shishkin

On Dec 12, 2012, at 2:41 AM, Niels Kristian Schjødt 
 wrote:

> Are you using a hardware based raid controller with them?
> 
Yes, of course. A hardware RAID with cache and BBU is a must; you can't get
fast fsync without it.
Also, mdadm is a pain to administer and is suitable only on Amazon and other
cloud platforms.




Re: [PERFORM] Why is PostgreSQL 9.2 slower than 9.1 in my tests?

2012-12-11 Thread Jeff Janes
On Tue, Dec 11, 2012 at 2:50 AM, Patryk Sidzina
 wrote:

> The differences come up when you change the "INSERT" to "EXECUTE 'INSERT'" (
> and i checked this time on 3 machines, one of which was Windows):
>
> CREATE TEMP TABLE test_table_md_speed(id serial primary key, n integer);
>
> CREATE OR REPLACE FUNCTION test_db_speed(cnt integer)
>  RETURNS text
>  LANGUAGE plpgsql
> AS $function$
> DECLARE
> time_start timestamp;
> time_stop timestamp;
> time_total interval;
> BEGIN
> time_start := cast(timeofday() AS TIMESTAMP);
> FOR i IN 1..cnt LOOP
> EXECUTE 'INSERT INTO test_table_md_speed(n) VALUES (' || i
> || ')';
> END LOOP;
>
> time_stop := cast(timeofday() AS TIMESTAMP);
> time_total := time_stop-time_start;
>
> RETURN extract (milliseconds from time_total);
> END;
> $function$;
>
> SELECT test_db_speed(10);

The culprit is the commit below.  I don't know exactly why this slows
down your case.  A preliminary oprofile analysis suggests that most
of the slowdown is that it calls AllocSetAlloc more often.  I suspect
that this slow-down will be considered an acceptable trade-off for
getting good parameterized plans.


commit e6faf910d75027bdce7cd0f2033db4e912592bcc
Author: Tom Lane 
Date:   Fri Sep 16 00:42:53 2011 -0400

Redesign the plancache mechanism for more flexibility and efficiency.


Cheers,

Jeff




Re: [PERFORM] Why is PostgreSQL 9.2 slower than 9.1 in my tests?

2012-12-11 Thread Tom Lane
Jeff Janes  writes:
> On Tue, Dec 11, 2012 at 2:50 AM, Patryk Sidzina
>  wrote:
>> The differences come up when you change the "INSERT" to "EXECUTE 'INSERT'" (
>> and i checked this time on 3 machines, one of which was Windows):

>> FOR i IN 1..cnt LOOP
>> EXECUTE 'INSERT INTO test_table_md_speed(n) VALUES (' || i || ')';
>> END LOOP;

> The culprit is the commit below.  I don't know exactly why this slows
> down your case.  A preliminary oprofile analysis suggests that most
> of the slowdown is that it calls AllocSetAlloc more often.  I suspect
> that this slow-down will be considered an acceptable trade-off for
> getting good parameterized plans.

I'm having a hard time getting excited about optimizing the above case:
the user can do far more to make it fast than we can, simply by not
using EXECUTE, which is utterly unnecessary in this example.
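For comparison, the loop from the earlier test function written without EXECUTE; the INSERT here is parsed and planned once and reused on every iteration instead of being re-planned each time:

```sql
-- Same effect as the EXECUTE version, but the statement is prepared
-- a single time instead of once per loop iteration
FOR i IN 1..cnt LOOP
    INSERT INTO test_table_md_speed(n) VALUES (i);
END LOOP;
```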

Having said that, though, it's not real clear to me why the plancache
changes would have affected the speed of EXECUTE at all --- the whole
point of that command is we don't cache a plan for the query.

regards, tom lane




Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Craig Ringer
On 12/12/2012 06:44 AM, Evgeny Shishkin wrote:
>
> On Dec 12, 2012, at 2:41 AM, Niels Kristian Schjødt
> mailto:nielskrist...@autouncle.com>> wrote:
>
>> Are you using a hardware based raid controller with them?
>>
> Yes, of course. Hardware raid with cache and bbu is a must. You can't
> get fast fsync without it.

Most SSDs should offer fairly fast fsync without a hardware RAID
controller, as they do write-back caching. The trick is to find ones
that do write-back caching safely, so you don't get severe data
corruption on power-loss.

A HW RAID controller is an absolute must for rotating magnetic media,
though.

> Also mdadm is a pain in the ass and is suitable only on amazon and
> other cloud shit.

I've personally been pretty happy with mdadm. I find the array
portability it offers very useful, so I don't need to buy a second RAID
controller just in case my main controller dies and I need a compatible
one to get the array running again. If you don't need a BBU for safe
write-back caching then mdadm has advantages over hardware RAID.

I'll certainly use mdadm over onboard fakeraid solutions or low-end
hardware RAID controllers. I suspect a mid- to high end HW RAID unit
will generally win.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Evgeny Shishkin

On Dec 12, 2012, at 5:03 AM, Craig Ringer  wrote:

> On 12/12/2012 06:44 AM, Evgeny Shishkin wrote:
>> 
>> On Dec 12, 2012, at 2:41 AM, Niels Kristian Schjødt 
>>  wrote:
>> 
>>> Are you using a hardware based raid controller with them?
>>> 
>> Yes, of course. Hardware raid with cache and bbu is a must. You can't get 
>> fast fsync without it.
> 
> Most SSDs should offer fairly fast fsync without a hardware RAID controller, 
> as they do write-back caching. The trick is to find ones that do write-back 
> caching safely, so you don't get severe data corruption on power-loss. 
> 

Actually, most low-end SSDs don't do write caching; they do not have enough
RAM for that. SandForce-based drives, for example.

> A HW RAID controller is an absolute must for rotating magnetic media, though.
> 
>> Also mdadm is a pain in the ass and is suitable only on amazon and other 
>> cloud shit.
> 
> I've personally been pretty happy with mdadm. I find the array portability it 
> offers very useful, so I don't need to buy a second RAID controller just in 
> case my main controller dies and I need a compatible one to get the array 
> running again. If you don't need a BBU for safe write-back caching then mdadm 
> has advantages over hardware RAID.
> 

If we are talking about a dedicated machine for a database on SSD drives, why
wouldn't anybody buy a hardware RAID for about $500-700?

> I'll certainly use mdadm over onboard fakeraid solutions or low-end hardware 
> RAID controllers. I suspect a mid- to high end HW RAID unit will generally 
> win.
> 

> -- 
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Craig Ringer
On 12/12/2012 09:17 AM, Evgeny Shishkin wrote:
>
> Actually most of low-end SSDs don't do write caching, they do not have
> enough ram for that. Sandforce for example.
>
Or, worse, some of them do limited write caching but don't protect their
write cache from power loss. Instant data corruption!

I would be extremely reluctant to use low-end SSDs for a database server.

> If we are talking about dedicated machine for database with ssd
> drives, why would anybody don't by hardware raid for about 500-700$?
I'd want to consider whether the same money is better spent on faster,
higher quality SSDs with their own fast write caches.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Evgeny Shishkin

On Dec 12, 2012, at 5:29 AM, Craig Ringer  wrote:

> On 12/12/2012 09:17 AM, Evgeny Shishkin wrote:
>> 
>> Actually most of low-end SSDs don't do write caching, they do not have 
>> enough ram for that. Sandforce for example.
>> 
> Or, worse, some of them do limited write caching but don't protect their 
> write cache from power loss. Instant data corruption!
> 
> I would be extremely reluctant to use low-end SSDs for a database server.
> 
>> If we are talking about dedicated machine for database with ssd drives, why 
>> would anybody don't by hardware raid for about 500-700$?
> I'd want to consider whether the same money is better spent on faster, higher 
> quality SSDs with their own fast write caches.
> 

High-quality SSDs cost $5-7 per GB, consumer-grade about $1, high-end about
$11. The new Intel DC S3700 is $2-3 per GB, as far as I remember.

So far - more than a year already - I have used consumer SSDs with a $300-400
HW RAID. Cost effective and fast; maybe not very safe, but so far so good. All
data protection measures from PostgreSQL are on, of course.
> -- 
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Rosser Schwarz
On Tue, Dec 11, 2012 at 5:17 PM, Evgeny Shishkin wrote:

> Actually most of low-end SSDs don't do write caching, they do not have
> enough ram for that.
>

AIUI, *all* SSDs do write-caching of a sort: writes are actually flushed to
the NAND media by erasing, and then overwriting the erased space, and
erasing is done in fixed-size blocks, usually much larger than a
filesystem's pages.  The drive's controller accumulates writes in an
on-board cache until it has an "erase block"'s worth of them, which are
then flushed.  From casual searching, a common erase block size is 256
kbytes, while filesystem-level pages are usually 4k.
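In the figures used above, the mismatch works out to a worst-case factor of 64: a 256 kB erase block holds 64 filesystem pages, so a purely random 4 kB write could, on a naively managed drive with no remapping or coalescing, cause 64 pages' worth of media traffic:

```sql
SELECT (256 * 1024) / (4 * 1024) AS pages_per_erase_block;  -- 64
```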

Most low-end (and even many mid-range) SSDs, including Sandforce-based
drives, don't offer any form of protection (e.g., supercaps, as featured on
the Intel 320 and 710-series drives) for the data in that write cache,
however, which may be what you're thinking of.  I wouldn't let one of those
anywhere near one of my servers, unless it was a completely disposable,
load-balanced slave, and probably not even then.

rls

-- 
:wq


Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Craig Ringer
On 12/12/2012 09:44 AM, Evgeny Shishkin wrote:
> So far, more than a year already, i bought consumer ssds with 300-400$
> hw raid. Cost effective and fast, may be not very safe, but so far so
> good. All data protection measures from postgresql are on, of course.

You're aware that many low end SSDs lie to the RAID controller about
having written data, right? Even if the RAID controller sends a flush
command, the SSD might cache the write in non-durable cache. If you're
using such SSDs and you lose power, data corruption is extremely likely,
because your SSDs are essentially ignoring fsync.

Your RAID controller's BBU won't save you, because once the disks tell
the RAID controller the data has hit durable storage, the RAID
controller feels free to flush it from its battery backed cache. If the
disks are lying...

The only solid way to find out if this is an issue with your SSDs is to
do plug-pull testing and find out.
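One common way to script the write-durability half of that test is Brad Fitzpatrick's diskchecker.pl, the tool the PostgreSQL reliability documentation points to for exactly this check. A rough sketch of the procedure, with the host name, path, and size being placeholders:

```
# On a second machine that will survive the power cut, start the listener:
diskchecker.pl -l

# On the machine under test ("testhost" is hypothetical), write and sync a
# test file on the SSD under test, then pull the power cord mid-run:
diskchecker.pl -s testhost create /mnt/ssd/test_file 500

# After power is restored, check that every write the drive acknowledged
# as synced actually survived:
diskchecker.pl -s testhost verify /mnt/ssd/test_file
```

If verify reports lost writes that had been acknowledged, something in the stack (usually the drive's volatile write cache) is lying about fsync.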

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Evgeny Shishkin

On Dec 12, 2012, at 6:02 AM, Craig Ringer  wrote:

> On 12/12/2012 09:44 AM, Evgeny Shishkin wrote:
>> So far, more than a year already, i bought consumer ssds with 300-400$ hw 
>> raid. Cost effective and fast, may be not very safe, but so far so good. All 
>> data protection measures from postgresql are on, of course.
> 
> You're aware that many low end SSDs lie to the RAID controller about having 
> written data, right? Even if the RAID controller sends a flush command, the 
> SSD might cache the write in non-durable cache. If you're using such SSDs and 
> you lose power, data corruption is extremely likely, because your SSDs are 
> essentially ignoring fsync.
> 
> Your RAID controller's BBU won't save you, because once the disks tell the 
> RAID controller the data has hit durable storage, the RAID controller feels 
> free to flush it from its battery backed cache. If the disks are lying...
> 
> The only solid way to find out if this is an issue with your SSDs is to do 
> plug-pull testing and find out.
> 

Yes, I am aware of this issue. I have never experienced it on either the Intel 520 
or the OCZ Vertex 3.
Have you heard of such failures on this list?
> -- 
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread David Boreham

On 12/11/2012 7:13 PM, Evgeny Shishkin wrote:
Yes, I am aware of this issue. I have never experienced it on either the 
Intel 520 or the OCZ Vertex 3.

Have you heard of such failures on this list?
People have done plug-pull tests and reported the results on the list 
(sometime in the past couple of years).


But you don't need to do the test to know these drives are not safe. 
They're unsafe by design.





--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Evgeny Shishkin

On Dec 12, 2012, at 6:15 AM, David Boreham  wrote:

> On 12/11/2012 7:13 PM, Evgeny Shishkin wrote:
>> Yes, I am aware of this issue. I have never experienced it on either the Intel 
>> 520 or the OCZ Vertex 3.
>> Have you heard of such failures on this list?
> People have done plug-pull tests and reported the results on the list 
> (sometime in the past couple of years).
> 
> But you don't need to do the test to know these drives are not safe. They're 
> unsafe by design.
> 

Oh, there is no 100% safe system. The only way to be sure is to read the data back.
Everything in system design is a tradeoff between cost and risk.






Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread David Boreham

On 12/11/2012 7:20 PM, Evgeny Shishkin wrote:

Oh, there is no 100% safe system.
In this case we're discussing specifically "safety in the event of power 
loss shortly after the drive indicates to the controller that it has 
committed a write operation". Some drives do provide 100% safety against 
this event, and they don't cost much more than those that don't.







Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Evgeny Shishkin

On Dec 12, 2012, at 6:26 AM, David Boreham  wrote:

> On 12/11/2012 7:20 PM, Evgeny Shishkin wrote:
>> Oh, there is no 100% safe system.
> In this case we're discussing specifically "safety in the event of power loss 
> shortly after the drive indicates to the controller that it has committed a 
> write operation". Some drives do provide 100% safety against this event, and 
> they don't cost much more than those that don't.

Which drives would you recommend? Besides intel 320 and 710.







Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread David Boreham

On 12/11/2012 7:38 PM, Evgeny Shishkin wrote:

Which drives would you recommend? Besides intel 320 and 710.
Those are the only drive types we have deployed in servers at present 
(almost all 710, but we have some 320 for less mission-critical 
machines). The new DC-S3700 Series looks nice too, but isn't yet in the 
sales channel:

http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html







Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Craig Ringer
On 12/12/2012 10:13 AM, Evgeny Shishkin wrote:
>
> Yes, I am aware of this issue. I have never experienced it on either the Intel
> 520 or the OCZ Vertex 3.
>

I wouldn't trust either of those drives. The 520 doesn't have Intel's
"Enhanced Power Loss Data Protection"; it's going to lose its buffers if
it loses power. Similarly, the Vertex 3 doesn't have any kind of power
protection. See:

http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html
http://ark.intel.com/products/family/56572/Intel-SSD-500-Family

http://www.ocztechnology.com/res/manuals/OCZ_SSD_Breakdown_Q2-11_1.pdf

The only way I'd use those for a production server would be if I had
synchronous replication running to another machine with trustworthy,
durable storage - and if I didn't mind some downtime to restore the
corrupt DB from the replica after power loss.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Evgeny Shishkin

On Dec 12, 2012, at 6:41 AM, David Boreham  wrote:

> On 12/11/2012 7:38 PM, Evgeny Shishkin wrote:
>> Which drives would you recommend? Besides intel 320 and 710.
> Those are the only drive types we have deployed in servers at present (almost 
> all 710, but we have some 320 for less mission-critical machines). The new 
> DC-S3700 Series looks nice too, but isn't yet in the sales channel:
> http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html

Yeah, the S3700 looks promising, but the SATA interface is a limiting factor 
for this drive.
I'm looking towards the SMART SSD 
http://www.storagereview.com/smart_storage_systems_optimus_sas_enterprise_ssd_review

but I haven't heard of it anywhere else.





Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Mark Kirkwood

On 12/12/12 15:41, David Boreham wrote:

On 12/11/2012 7:38 PM, Evgeny Shishkin wrote:

Which drives would you recommend? Besides intel 320 and 710.
Those are the only drive types we have deployed in servers at present 
(almost all 710, but we have some 320 for less mission-critical 
machines). The new DC-S3700 Series looks nice too, but isn't yet in 
the sales channel:
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html

+1

The s3700 is probably the one to get (when it is available). I'd opt for 
the 710 if you need something now. I'd avoid the 320 - we have 
encountered the firmware bug whereby you get an 8MB (yes 8MB) capacity 
after powerdown with a depressingly large number of them (they were 
updated to the latest firmware too).


Regards

Mark




Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread David Boreham

On 12/11/2012 7:49 PM, Evgeny Shishkin wrote:
Yeah, the S3700 looks promising, but the SATA interface is a limiting factor 
for this drive.
I'm looking towards the SMART SSD 
http://www.storagereview.com/smart_storage_systems_optimus_sas_enterprise_ssd_review



What don't you like about SATA?

I prefer to avoid SAS drives if possible due to the price premium for 
dubious benefits besides vague hand-waving "enterprise-ness" promises.







Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread Evgeny Shishkin

On Dec 12, 2012, at 7:05 AM, David Boreham  wrote:

> On 12/11/2012 7:49 PM, Evgeny Shishkin wrote:
>> Yeah, the S3700 looks promising, but the SATA interface is a limiting factor 
>> for this drive.
>> I'm looking towards the SMART SSD 
>> http://www.storagereview.com/smart_storage_systems_optimus_sas_enterprise_ssd_review
>> 
> What don't you like about SATA?
> 
> I prefer to avoid SAS drives if possible due to the price premium for dubious 
> benefits besides vague hand-waving "enterprise-ness" promises.
> 

Quoting 
http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review

Intel makes the case that the S3700 is the ideal drive for entry, mainstream 
and performance enterprise computing, including HPC use cases. The claim is 
bold, largely because of the decision to go with a SATA interface, which has 
several limitations in the enterprise. The SATA interface tops out at a queue 
depth of 32 (SAS scales as high as 256 in most cases), which means that when 
requests go above that level, average and peak latency spike, as we saw in all 
of our workloads.

Another huge advantage of SAS is the ability to offer dual-port modes for high 
availability scenarios, where there are two controllers interfacing with the 
same drive at the same time. In the event one goes offline, the connection with 
the SSD is not lost, as it would with a standard SATA interface without 
additional hardware. Some SAS drives also offer wide-port configurations used 
to increase total bandwidth above a single-link connection. While the Intel SSD 
DC S3700 against other SATA competitors is very fast, the story changes when 
you introduce the latest MLC and SLC-based SAS SSDs, which can cope better with 
increased thread and queue levels.

We picked the primary post-preconditioning sections of our benchmarks after 
each SSD had reached steady-state. For the purposes of this section, we added 
the Intel SSD DC S3700 onto the throughput charts of the newest SAS 
high-performance SSDs. There are also significant latency differences at higher 
queue depths that play a significant factor, but for the sake of easy 
comparison we stick with raw I/O speed across varying thread and queue counts.

In a 100% 4K random write or random read scenario, the Intel SSD DC 3700 
performs quite well up against the high-end SAS competition, with the second 
fastest 4K steady-state speed. When you switch focus to read throughput at a 
heavy 16T/16Q load it only offers 1/2 to 1/3 the performance of SSDs in this 
category.

http://www.storagereview.com/images/intel_ssd_dc_s3700_main_slc_4kwrite_throughput.png





Re: [PERFORM] Do I have a hardware or a software problem?

2012-12-11 Thread David Boreham

On 12/11/2012 8:11 PM, Evgeny Shishkin wrote:


Quoting 
http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review

Heh. A fine example of the kind of hand-waving of which I spoke ;)

Higher performance is certainly a benefit, although at present we can't 
saturate even a single 710-series drive (the application, CPU, OS, etc. are 
the bottleneck). Similarly, while dual-porting certainly has its uses, 
it is not something I need.








[PERFORM] hash join vs nested loop join

2012-12-11 Thread Huan Ruan
Hello All

While investigating switching to Postgres, we came across a query plan that
uses a hash join and is a lot slower than a nested loop join.

I don't understand why the optimiser chooses the hash join in favor of the
nested loop. What can I do to get the optimiser to make a better decision
(nested loop in this case)? I have run analyze on both tables.

The query is,

/*
   smalltable has about 48,000 records.
   bigtable has about 168,000,000 records.
   invtranref is char(10) and is the primary key for both tables
*/
SELECT
  *
FROM IM_Match_Table smalltable
  inner join invtran bigtable on
bigtable.invtranref = smalltable.invtranref

The hash join plan is,

"Hash Join  (cost=1681.87..6414169.04 rows=48261 width=171)"
"  Output: smalltable.invtranref, smalltable.itbatchref,
smalltable.trantype, smalltable.trandate, smalltable.invprodref,
smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref,
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref,
bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units,
bigtable.tranamount, bigtable.createmode, bigtable.transtat,
bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat,
bigtable.seqnum, bigtable.transign"
"  Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
"  ->  Seq Scan on public.invtran bigtable  (cost=0.00..4730787.28
rows=168121728 width=108)"
"Output: bigtable.itbatchref, bigtable.invtranref,
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref,
bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units,
bigtable.tranamount, bigtable.createmode, bigtable.transtat,
bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat,
bigtable.seqnum, bigtable.transign"
"  ->  Hash  (cost=1078.61..1078.61 rows=48261 width=63)"
"Output: smalltable.invtranref, smalltable.itbatchref,
smalltable.trantype, smalltable.trandate, smalltable.invprodref,
smalltable.invheadref"
"->  Seq Scan on public.im_match_table smalltable
(cost=0.00..1078.61 rows=48261 width=63)"
"  Output: smalltable.invtranref, smalltable.itbatchref,
smalltable.trantype, smalltable.trandate, smalltable.invprodref,
smalltable.invheadref"

The nested loop join plan is,

"Nested Loop  (cost=0.00..12888684.07 rows=48261 width=171)"
"  Output: smalltable.invtranref, smalltable.itbatchref,
smalltable.trantype, smalltable.trandate, smalltable.invprodref,
smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref,
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref,
bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units,
bigtable.tranamount, bigtable.createmode, bigtable.transtat,
bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat,
bigtable.seqnum, bigtable.transign"
"  ->  Seq Scan on public.im_match_table smalltable  (cost=0.00..1078.61
rows=48261 width=63)"
"Output: smalltable.invtranref, smalltable.itbatchref,
smalltable.trantype, smalltable.trandate, smalltable.invprodref,
smalltable.invheadref"
"  ->  Index Scan using pk_invtran on public.invtran bigtable
(cost=0.00..267.03 rows=1 width=108)"
"Output: bigtable.itbatchref, bigtable.invtranref,
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref,
bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units,
bigtable.tranamount, bigtable.createmode, bigtable.transtat,
bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat,
bigtable.seqnum, bigtable.transign"
"Index Cond: (bigtable.invtranref = smalltable.invtranref)"

The version is PostgreSQL 9.2.0 on x86_64-unknown-linux-gnu, compiled by
gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit. Server specs are:

   - Centos, ext4
   - 24GB memory
   - 6 cores hyper-threaded (Intel(R) Xeon(R) CPU E5645).
   - raid 10 on 4 sata disks

Config changes are

   - shared_buffers = 6GB
   - effective_cache_size = 18GB
   - work_mem = 10MB
   - maintenance_work_mem = 3GB

Many Thanks
Huan


Re: [PERFORM] hash join vs nested loop join

2012-12-11 Thread Evgeny Shishkin

On Dec 12, 2012, at 8:25 AM, Huan Ruan  wrote:

> Hello All
> 
> While investigating switching to Postgres, we came across a query plan that 
> uses a hash join and is a lot slower than a nested loop join.
> 
> I don't understand why the optimiser chooses the hash join in favor of the 
> nested loop. What can I do to get the optimiser to make a better decision 
> (nested loop in this case)? I have run analyze on both tables.
> 

The optimiser thinks the nested loop is more expensive because of the point PK 
lookups, which are random I/O.
Can you set random_page_cost to 2 or 3 and try again?
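A minimal way to try this, using the query from the original post; SET only changes the current session, so nothing else is affected (note EXPLAIN ANALYZE actually runs the query, so expect it to take a while against the 168M-row table):

```sql
-- Lower the random-I/O penalty for this session only, then compare plans.
SET random_page_cost = 2;

EXPLAIN ANALYZE
SELECT *
FROM IM_Match_Table smalltable
  INNER JOIN invtran bigtable
    ON bigtable.invtranref = smalltable.invtranref;

-- Return to the server default afterwards.
RESET random_page_cost;
```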


> The query is,
> /*
>smalltable has about 48,000 records.
>bigtable has about 168,000,000 records.
>invtranref is char(10) and is the primary key for both tables
> */
> SELECT
>   *
> FROM IM_Match_Table smalltable
>   inner join invtran bigtable on 
> bigtable.invtranref = smalltable.invtranref
> The hash join plan is,
> "Hash Join  (cost=1681.87..6414169.04 rows=48261 width=171)"
> "  Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, 
> smalltable.trandate, smalltable.invprodref, smalltable.invheadref, 
> bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, 
> bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, 
> bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, 
> bigtable.createmode, bigtable.transtat, bigtable.sysversion, 
> bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, 
> bigtable.transign"
> "  Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
> "  ->  Seq Scan on public.invtran bigtable  (cost=0.00..4730787.28 
> rows=168121728 width=108)"
> "Output: bigtable.itbatchref, bigtable.invtranref, 
> bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, 
> bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, 
> bigtable.tranamount, bigtable.createmode, bigtable.transtat, 
> bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, 
> bigtable.seqnum, bigtable.transign"
> "  ->  Hash  (cost=1078.61..1078.61 rows=48261 width=63)"
> "Output: smalltable.invtranref, smalltable.itbatchref, 
> smalltable.trantype, smalltable.trandate, smalltable.invprodref, 
> smalltable.invheadref"
> "->  Seq Scan on public.im_match_table smalltable  
> (cost=0.00..1078.61 rows=48261 width=63)"
> "  Output: smalltable.invtranref, smalltable.itbatchref, 
> smalltable.trantype, smalltable.trandate, smalltable.invprodref, 
> smalltable.invheadref"
> The nested loop join plan is,
> "Nested Loop  (cost=0.00..12888684.07 rows=48261 width=171)"
> "  Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, 
> smalltable.trandate, smalltable.invprodref, smalltable.invheadref, 
> bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, 
> bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, 
> bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, 
> bigtable.createmode, bigtable.transtat, bigtable.sysversion, 
> bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, 
> bigtable.transign"
> "  ->  Seq Scan on public.im_match_table smalltable  (cost=0.00..1078.61 
> rows=48261 width=63)"
> "Output: smalltable.invtranref, smalltable.itbatchref, 
> smalltable.trantype, smalltable.trandate, smalltable.invprodref, 
> smalltable.invheadref"
> "  ->  Index Scan using pk_invtran on public.invtran bigtable  
> (cost=0.00..267.03 rows=1 width=108)"
> "Output: bigtable.itbatchref, bigtable.invtranref, 
> bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, 
> bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, 
> bigtable.tranamount, bigtable.createmode, bigtable.transtat, 
> bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, 
> bigtable.seqnum, bigtable.transign"
> "Index Cond: (bigtable.invtranref = smalltable.invtranref)"
> The version is PostgreSQL 9.2.0 on x86_64-unknown-linux-gnu, compiled by gcc 
> (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit. Server specs are:
> Centos, ext4
> 24GB memory 
> 6 cores hyper-threaded (Intel(R) Xeon(R) CPU E5645).
> raid 10 on 4 sata disks
> Config changes are
> 
> shared_buffers = 6GB
> effective_cache_size = 18GB
> work_mem = 10MB
> maintenance_work_mem = 3GB
> Many Thanks
> Huan



Re: [PERFORM] hash join vs nested loop join

2012-12-11 Thread Evgeny Shishkin

On Dec 12, 2012, at 8:57 AM, Evgeny Shishkin  wrote:

> 
> On Dec 12, 2012, at 8:44 AM, Huan Ruan  wrote:
> 
>> 
>> On 12 December 2012 15:33, Evgeny Shishkin  wrote:
>> Optimiser thinks that nested loop is more expensive, because of point PK 
>> lookups, which a random io.
>> Can you set random_page_cost to 2 or 3 and try again?
>> 
>> Hi Evgeny
>> 
>> Thanks for the quick reply. Setting random_page_cost to 3 doesn't make a 
>> difference, but 2 makes the optimiser choose the nested loop. However, 
>> with such a small penalty for random I/O, I worry that this setting will 
>> make other small queries incorrectly use an index when a sequential scan 
>> would be better. I understand random I/O is expensive, but in this case 
>> the optimiser already knows the big table is really big; shouldn't it 
>> conclude that a sequential scan will be slower than an index lookup? 
>> Scanning 170 million records vs an index lookup of 50,000 records. Any thoughts?
>> 
> 
> Yes, this is the most common issue for me. 
> Usually you just have to find the right combination of random and seq scan 
> costs, shared_buffers and effective_cache_size.
> If some of the queries work well with another value of, say, 
> random_page_cost, then, since it is a per-session parameter, you can SET it in 
> your session before the query. But over time your table may change in size 
> and distribution and everything breaks. Not to mention the general ugliness 
> from the application standpoint.
> 
> May be somebody more experienced would help.
> 
> Also you can set different costs per tablespace.
> 
>> Thanks
>> Huan
> 

Added CC.
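The per-tablespace option mentioned above looks like this (supported since PostgreSQL 9.0; the tablespace name is hypothetical):

```sql
-- Tell the planner that pages on an SSD-backed tablespace are cheap to fetch
-- randomly, without changing the global random_page_cost.
ALTER TABLESPACE ssd_space SET (random_page_cost = 2, seq_page_cost = 1);
```

This way, tables on spinning disks keep the default costs while tables on SSD get the cheaper random-I/O estimate.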