Re: [PERFORM] [PERFORMANCE] expanding to SAN: which portion best to move

2011-05-16 Thread Robert Klemme
On Fri, May 13, 2011 at 9:04 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, May 4, 2011 at 6:31 AM, Willy-Bas Loos willy...@gmail.com wrote:
 I'm asking them for (real) benchmarks, thanks for the advice. (fio is not
 available for us now to do it myself, grmbl)
 It just occurred to me that it is not necessarily the case that reading the
 indexes causes a lot of random I/O (on the index itself).
 I mean, maybe the index is generally read sequentially and then, when
 retrieving the data, there is a lot of random I/O.
 If it's a long story, any tips for info about this (book or website)?

 If you don't do anything special, and if the query plan says Index
 Scan rather than Bitmap Index Scan, then both the index I/O and the
 table I/O are likely to be fairly random.  However, there are a number
 of cases in which you can expect the table I/O to be sequential:

 - In some cases, you may happen to insert rows with an ordering that
 matches the index.  For example, if you have a table with not too many
 updates and deletes, and an index on a serial column, then new rows
 will have a higher value in that column than old rows, and will also
 typically be physically after older rows in the file.  Or you might be
 inserting timestamped data from oldest to newest.
 - If the planner chooses a Bitmap Index Scan, it effectively scans the
 index to figure out which table blocks to read, and then reads those
 table blocks in block number order, so that the I/O is sequential,
 with skips.

Are these two separate phases (i.e., first scan the index completely,
then access the table)?

 - If you CLUSTER the table on a particular index, it will be
 physically ordered to match the index's key ordering.  As the table is
 further modified the degree of clustering will gradually decline;
 eventually you may wish to re-CLUSTER.
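A minimal sketch of that, for a hypothetical orders table clustered on a
timestamp index (note that CLUSTER takes an exclusive lock and rewrites
the table):

    CLUSTER orders USING orders_created_at_idx;

    -- check how well the physical row order still matches the index key;
    -- correlation near 1.0 means well clustered, near 0 means scattered:
    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'orders' AND attname = 'created_at';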

 It's also worth keeping in mind that the index itself won't
 necessarily be accessed in physically sequential order.  The point of
 the index is to emit the rows in key order, but if the table is
 heavily updated, it won't necessarily be the case that a page
 containing lower-valued keys physically precedes a page containing
 higher-valued keys.  I'm actually somewhat fuzzy on how this works,
 and to what extent it's a problem in practice, but I am fairly sure it
 can happen.

Separating indexes from tables might not be a good idea in general.
Richard Foote has an excellent article about Oracle, but I assume at
least a few things apply to PostgreSQL as well - it's at least worth
checking PostgreSQL's access patterns against:

http://richardfoote.wordpress.com/2008/04/16/separate-indexes-from-tables-some-thoughts-part-i-everything-in-its-right-place/

I would probably rather try to separate data by the nature and
frequency of accesses.  One reasonable separation would be to leave
all frequently accessed tables *and* their indexes on local RAID and
move less frequently accessed data to the SAN.  This separation is
easy to identify if you have separate tables for current and
historic data.
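A sketch of that kind of split, assuming separate current/historic
tables (tablespace names and paths are made up for illustration):

    CREATE TABLESPACE local_raid LOCATION '/data/local_raid';
    CREATE TABLESPACE san_space  LOCATION '/san/pgdata';

    -- hot data (and its indexes) stays on local storage:
    ALTER TABLE orders_current SET TABLESPACE local_raid;
    ALTER INDEX orders_current_pkey SET TABLESPACE local_raid;

    -- cold data moves to the SAN:
    ALTER TABLE orders_2009 SET TABLESPACE san_space;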

Kind regards

robert


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



Re: [PERFORM] Why query takes soo much time

2011-05-16 Thread Craig Ringer

On 05/16/2011 01:39 PM, Adarsh Sharma wrote:

Dear all,
I have a query on 3 tables in a database:

Explain Analyze output:

explain analyze select c.clause, s.subject, s.object, s.verb,
s.subject_type, s.object_type, s.doc_id, s.svo_id from clause2 c,
svo2 s, page_content p where c.clause_id=s.clause_id and
s.doc_id=c.source_id and c.sentence_id=s.sentence_id and
s.doc_id=p.crawled_page_id order by s.svo_id limit 1000 offset 17929000




Using limit and offset can be horrifyingly slow for non-trivial queries. 
Are you trying to paginate results? If not, what are you trying to achieve?


In most (all?) cases, Pg will have to execute the query up to the point 
where it's found limit+offset rows, producing and discarding offset rows 
as it goes. Needless to say, that's horrifyingly inefficient.


Reformatting your query for readability (to me) as:

EXPLAIN ANALYZE
SELECT c.clause, s.subject, s.object, s.verb, s.subject_type,
       s.object_type, s.doc_id, s.svo_id
FROM clause2 c
     INNER JOIN svo2 s ON (c.clause_id=s.clause_id AND
                           c.source_id=s.doc_id AND
                           c.sentence_id=s.sentence_id)
     INNER JOIN page_content p ON (s.doc_id=p.crawled_page_id)
ORDER BY s.svo_id LIMIT 1000 OFFSET 17929000

... I can see that you're joining on 
(c.clause_id,c.source_id,c.sentence_id)=(s.clause_id,s.doc_id,s.sentence_id). 
You have matching indexes idx_clause2_id and idx_svo2_id_dummy with 
matching column ordering. Pg is using idx_clause2_id in the join of svo2 
and clause2, but instead of doing a bitmap index scan using it and 
idx_svo2_id_dummy it's doing a nested loop using idx_clause2_id and 
pk_svo_id.


First: make sure your stats are up to date by ANALYZE-ing your tables,
and probably increase the stats collected on the join columns and/or
increase default_statistics_target. If that doesn't help, I'd personally
play with random_page_cost and seq_page_cost to see if they reflect
your machine's actual performance, and to see if you get a more
favourable plan. If I were experimenting with this I'd also see if
giving the query lots of work_mem allowed it to try a different approach
to the join.
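A sketch of those experiments, with table names from this thread (the
specific values are illustrative starting points, not recommendations):

    ANALYZE clause2; ANALYZE svo2; ANALYZE page_content;

    -- gather more detailed stats on a join column, then re-ANALYZE:
    ALTER TABLE svo2 ALTER COLUMN doc_id SET STATISTICS 500;
    ANALYZE svo2;

    -- session-local experiments:
    SET random_page_cost = 2.0;   -- try values that match the hardware
    SET work_mem = '256MB';       -- may let the planner pick a hash join
    -- then re-run the EXPLAIN ANALYZE above and compare plans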



Limit  (cost=21685592.91..21686802.44 rows=1000 width=2624) (actual time=414601.802..414622.920 rows=1000 loops=1)
  ->  Nested Loop  (cost=59.77..320659013645.28 rows=265112018116 width=2624) (actual time=0.422..404902.314 rows=1793 loops=1)
        ->  Nested Loop  (cost=0.00..313889654.42 rows=109882338 width=2628) (actual time=0.242..174223.789 rows=17736897 loops=1)
              ->  Index Scan using pk_svo_id on svo2 s  (cost=0.00..33914955.13 rows=26840752 width=2600) (actual time=0.157..14691.039 rows=14238271 loops=1)
              ->  Index Scan using idx_clause2_id on clause2 c  (cost=0.00..10.36 rows=4 width=44) (actual time=0.007..0.008 rows=1 loops=14238271)
                    Index Cond: ((c.source_id = s.doc_id) AND (c.clause_id = s.clause_id) AND (c.sentence_id = s.sentence_id))
        ->  Bitmap Heap Scan on page_content p  (cost=59.77..2885.18 rows=2413 width=8) (actual time=0.007..0.008 rows=1 loops=17736897)
              Recheck Cond: (p.crawled_page_id = s.doc_id)
              ->  Bitmap Index Scan on idx_crawled_id  (cost=0.00..59.17 rows=2413 width=0) (actual time=0.005..0.005 rows=1 loops=17736897)
                    Index Cond: (p.crawled_page_id = s.doc_id)
Total runtime: 414623.634 ms

My table & index definitions are as under:

Estimated rows in the 3 tables are:

clause2 10341700
svo2 26008000
page_content 479785

CREATE TABLE clause2
(
  id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass),
  source_id integer,
  sentence_id integer,
  clause_id integer,
  tense character varying(30),
  clause text,
  CONSTRAINT pk_clause_id PRIMARY KEY (id)
) WITH (OIDS=FALSE);
CREATE INDEX idx_clause2_id ON clause2 USING btree (source_id, clause_id, sentence_id);


CREATE TABLE svo2
(
  svo_id bigint NOT NULL DEFAULT nextval('svo_svo_id_seq'::regclass),
  doc_id integer,
  sentence_id integer,
  clause_id integer,
  negation integer,
  subject character varying(3000),
  verb character varying(3000),
  object character varying(3000),
  preposition character varying(3000),
  subject_type character varying(3000),
  object_type character varying(3000),
  subject_attribute character varying(3000),
  object_attribute character varying(3000),
  verb_attribute character varying(3000),
  subject_concept character varying(100),
  object_concept character varying(100),
  subject_sense character varying(100),
  object_sense character varying(100),
  subject_chain character varying(5000),
  object_chain character varying(5000),
  sub_type_id integer,
  obj_type_id integer,
  CONSTRAINT pk_svo_id PRIMARY KEY (svo_id)
) WITH (OIDS=FALSE);
CREATE INDEX idx_svo2_id_dummy ON svo2 USING btree (doc_id, clause_id, sentence_id);


CREATE TABLE page_content
(
  content_id integer NOT NULL DEFAULT nextval('page_content_ogc_fid_seq'::regclass),
  wkb_geometry 

Re: [PERFORM] Why query takes soo much time

2011-05-16 Thread Denis de Bernardy
[big nestloop with a huge number of rows]

You're in an edge case, and I doubt you'll get things to run much faster: you 
want the last 1k rows out of an 18M row result set... It will be slow no matter 
what you do.

What the plan is currently doing is walking through these 18M rows one by 
one until it has returned the 1k requested rows. Without the offset, the 
plan is absolutely correct (and quite fast, I take it). With the enormous 
offset, it's a different story, as you've noted.

An alternative plan could have been to hash join the tables together, to sort 
the result set, and to apply the limit/offset to the resulting set. You can 
probably force the planner to do so by rewriting your statement using a WITH 
clause, too:

EXPLAIN ANALYZE
WITH rows AS (
    SELECT c.clause, s.subject, s.object, s.verb, s.subject_type,
           s.object_type, s.doc_id, s.svo_id
    FROM clause2 c
         INNER JOIN svo2 s ON (c.clause_id=s.clause_id AND
                               c.source_id=s.doc_id AND
                               c.sentence_id=s.sentence_id)
         INNER JOIN page_content p ON (s.doc_id=p.crawled_page_id)
)
SELECT *
FROM rows
ORDER BY svo_id LIMIT 1000 OFFSET 17929000


I have my doubts that it'll make much of a difference, though: you'll still be 
extracting the last 1k rows out of 18M.

D


Re: [PERFORM] Using pgiosim realistically

2011-05-16 Thread John Rouillard
On Sat, May 14, 2011 at 12:07:02PM -0500, k...@rice.edu wrote:
 On Fri, May 13, 2011 at 09:09:41PM +, John Rouillard wrote:
  I am adding pgiosim to our testing for new database hardware and I am
  seeing something I don't quite get and I think it's because I am using
  pgiosim incorrectly.
  
  Specs:
  
OS: centos 5.5 kernel: 2.6.18-194.32.1.el5
memory: 96GB
cpu: 2x Intel(R) Xeon(R) X5690  @ 3.47GHz (6 core, ht enabled)
disks: WD2003FYYS RE4
raid: lsi - 9260-4i with 8 disks in raid 10 configuration
1MB stripe size
raid cache enabled w/ bbu
disk caches disabled
filesystem: ext3 created with -E stride=256
  
  I am seeing really poor (~70) iops with pgiosim.  According to:
  http://www.tomshardware.com/reviews/2tb-hdd-7200,2430-8.html in the
  database benchmark they are seeing ~170 iops on a single disk for
  these drives. I would expect an 8 disk raid 10 to get better than
  3x the single disk rate (assuming the data is randomly distributed).
 Those drives are 7200 rpm drives, which would give you a maximum write
 rate of 120/sec at best with the cache disabled. I actually think your
 70/sec is closer to reality and what you should anticipate in real use.
 I do not see how they could make 170/sec. Did they strap a jet engine to
 the drive? :)

Hmm, I stated the disk cache was disabled. I should have said the disk
write cache; it's possible the readahead cache is disabled as well
(not quite sure how to tell on the lsi cards). Also, there isn't a lot
of detail on what the database test mix is, and I haven't tried
researching the site to see if they specify the exact test. If it included
a lot of writes that were being handled by a cache, then that could
explain it.

However, in my case I have an 8 disk raid 10 with a read only load (in
this testing configuration). Shouldn't I expect more iops than a
single disk can provide? Maybe pgiosim is hitting some other boundary
than just i/o?

Also, it turns out that pgiosim can only handle 64 files. I haven't
checked to see if this is a compile-time changeable item or not.

-- 
-- rouilj

John Rouillard   System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111



Re: [PERFORM] KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)

2011-05-16 Thread Merlin Moncure
On Sat, May 14, 2011 at 5:10 AM, Stefan Keller sfkel...@gmail.com wrote:
 Hi,

 I am conducting a benchmark to compare a KVP table vs. hstore, and got
 bad hstore performance results when the number of records is greater than
 about 500,000.

 CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
 -- with index on key
 CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
 -- with GIST index on obj
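For reference, the lookups being compared would presumably look like the
sketch below (the @> containment operator is what a GiST index on obj
accelerates; names follow the schema above):

    -- hstore side: rows whose obj contains the pair
    SELECT id FROM myhstore WHERE obj @> 'key=>value'::hstore;

    -- KVP side: the equivalent lookup against the indexed key column
    SELECT id FROM kvp WHERE key = 'key' AND value = 'value';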

 Does anyone have experience with that?

hstore is not really designed for large-ish sets like that.

merlin



Re: [PERFORM] Why query takes soo much time

2011-05-16 Thread Tom Lane
Denis de Bernardy ddeberna...@yahoo.com writes:
 An alternative plan could have been to hash join the tables together,
 to sort the result set, and to apply the limit/offset on the resulting
 set.

Indeed.  I rather wonder why the planner didn't do that to start with.
This plan looks to me like it might be suffering from insufficient
work_mem to allow use of a hash join.  Or possibly the OP changed some
of the cost_xxx or enable_xxx settings in a misguided attempt to force
it to use indexes instead.  As a rule of thumb, whole-table joins
probably ought not be using nestloop plans, and that frequently means
that indexes are worthless for them.

But in any case, as Craig noted, the real elephant in the room is the
huge OFFSET value.  It seems likely that this query is not standing
alone but is meant as one of a series that's supposed to provide
paginated output, and if so the total cost of the series is just going
to be impossible no matter what.  The OP needs to think about using a
cursor or some such to avoid repeating most of the work each time.
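A minimal sketch of the cursor approach, reusing the query from upthread;
each FETCH resumes where the previous one stopped, so the join work is
done once rather than once per page:

    BEGIN;
    DECLARE svo_cur CURSOR FOR
        SELECT c.clause, s.subject, s.object, s.verb, s.subject_type,
               s.object_type, s.doc_id, s.svo_id
        FROM clause2 c
             INNER JOIN svo2 s ON (c.clause_id = s.clause_id AND
                                   c.source_id = s.doc_id AND
                                   c.sentence_id = s.sentence_id)
             INNER JOIN page_content p ON (s.doc_id = p.crawled_page_id)
        ORDER BY s.svo_id;
    FETCH 1000 FROM svo_cur;   -- first page
    FETCH 1000 FROM svo_cur;   -- next page, continuing from row 1001
    CLOSE svo_cur;
    COMMIT;

Alternatively, remembering the last svo_id already delivered and filtering
with WHERE s.svo_id > ... (keyset pagination) avoids redoing the skipped
work without holding a transaction open.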

regards, tom lane



Re: [PERFORM] [PERFORMANCE] expanding to SAN: which portion best to move

2011-05-16 Thread Robert Haas
On Mon, May 16, 2011 at 4:19 AM, Robert Klemme
shortcut...@googlemail.com wrote:
 - If the planner chooses a Bitmap Index Scan, it effectively scans the
 index to figure out which table blocks to read, and then reads those
 table blocks in block number order, so that the I/O is sequential,
 with skips.

 Are these two separate phases (i.e., first scan the index completely,
 then access the table)?

Yes.
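In an EXPLAIN tree the two phases show up as nested nodes, schematically
(a made-up single-table example, costs omitted):

    Bitmap Heap Scan on t                    -- phase 2: heap pages, in block order
      Recheck Cond: (col = 42)
      ->  Bitmap Index Scan on t_col_idx     -- phase 1: scan index, build bitmap
            Index Cond: (col = 42)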

 Separating indexes from tables might not be a good idea in general.
 Richard Foote has an excellent article about Oracle, but I assume at
 least a few things apply to PostgreSQL as well - it's at least worth
 checking PostgreSQL's access patterns against:

 http://richardfoote.wordpress.com/2008/04/16/separate-indexes-from-tables-some-thoughts-part-i-everything-in-its-right-place/

 I would probably rather try to separate data by the nature and
 frequency of accesses.  One reasonable separation would be to leave
 all frequently accessed tables *and* their indexes on local RAID and
 move less frequently accessed data to the SAN.  This separation is
 easy to identify if you have separate tables for current and
 historic data.

Yeah, I think the idea of putting tables and indexes in separate
tablespaces is mostly to bring more I/O bandwidth to bear on the same
data.  But there are other reasonable things you might do also - e.g.
put the indexes on an SSD, and the tables on a spinning disk, figuring
that the SSD is less reliable but you can always rebuild the index if
you need to...

Also, a lot of people have reported big speedups from putting pg_xlog
on a dedicated RAID 1 pair, or moving the PostgreSQL logs off the data
partition.  So those sorts of divisions should be considered also.
Your idea of dividing things by access frequency is another good
thought.
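A sketch of the SSD variant mentioned above, with made-up paths and
names; the point is that losing the index tablespace is recoverable:

    CREATE TABLESPACE ssd_idx LOCATION '/mnt/ssd/pg_idx';

    -- the table stays on the default (spinning-disk) tablespace:
    CREATE INDEX orders_id_idx ON orders (id) TABLESPACE ssd_idx;

    -- after replacing a failed SSD, rebuild from the table:
    REINDEX INDEX orders_id_idx;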

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-05-16 Thread Robert Haas
On Mon, May 16, 2011 at 12:49 AM, Jesper Krogh jes...@krogh.cc wrote:
 To me it seems like a robust and fairly trivial way to get better
 numbers.  The fear is that the OS cache is too much in flux to get any
 stable numbers out of it.

 Ok, it may not work as well with indexes, since having 1% in cache may very
 well mean that 90% of all requested blocks are there... for tables it should
 be more trivial.

Tables can have hot spots, too.  Consider a table that holds calendar
reservations.  Reservations can be inserted, updated, deleted.  But
typically, the most recent data will be what is most actively
modified, and the older data will be relatively more (though not
completely) static, and less frequently accessed.  Such examples are
common in many real-world applications.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-05-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, May 16, 2011 at 12:49 AM, Jesper Krogh jes...@krogh.cc wrote:
 Ok, it may not work as well with indexes, since having 1% in cache may very
 well mean that 90% of all requested blocks are there... for tables it should
 be more trivial.

 Tables can have hot spots, too.  Consider a table that holds calendar
 reservations.  Reservations can be inserted, updated, deleted.  But
 typically, the most recent data will be what is most actively
 modified, and the older data will be relatively more (though not
 completely) static, and less frequently accessed.  Such examples are
 common in many real-world applications.

Yes.  I'm not convinced that measuring the fraction of a table or index
that's in cache is really going to help us much.  Historical cache hit
rates might be useful, but only to the extent that the incoming query
has a similar access pattern to those in the (recent?) past.  It's not
an easy problem.

I almost wonder if we should not try to measure this at all, but instead
let the DBA set a per-table or per-index number to use, analogous to the
override we added recently for column n-distinct statistics ...
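For reference, the existing per-column n-distinct override looks like the
sketch below (PostgreSQL 9.0 syntax; the table and column names are
hypothetical), so a per-table cache knob could plausibly take a similar form:

    ALTER TABLE reservations ALTER COLUMN room_id SET (n_distinct = 500);
    -- a negative value is interpreted as a fraction of the row count:
    ALTER TABLE reservations ALTER COLUMN guest_id SET (n_distinct = -0.05);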

regards, tom lane



Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-05-16 Thread Jeff Janes
On Sun, May 15, 2011 at 9:49 PM, Jesper Krogh jes...@krogh.cc wrote:

 Ok, it may not work as well with indexes, since having 1% in cache may very
 well mean that 90% of all requested blocks are there... for tables it should
 be more trivial.

Why would the index have a meaningful hot-spot unless the underlying
table had one as well?  (Of course the root block will be a hot-spot,
but certainly not 90% of all requests)

Cheers,

Jeff



Re: [PERFORM] Using pgiosim realistically

2011-05-16 Thread John Rouillard
On Mon, May 16, 2011 at 12:23:13PM -0400, Jeff wrote:
 On May 16, 2011, at 9:17 AM, John Rouillard wrote:
 However, in my case I have an 8 disk raid 10 with a read only load (in
 this testing configuration). Shouldn't I expect more iops than a
 single disk can provide? Maybe pgiosim is hitting some other boundary
 than just i/o?
 
 
 given your command line you are only running a single thread - use
 the -t argument to add more threads and that'll increase
 concurrency.  A single process can only process so much at once, and
 with multiple threads requesting different things the drive will
 actually be able to respond faster since it will have more work to
 do.
 I tend to test various levels - usually a single thread (-t 1, the
 default) to get a baseline, then -t (#drives / 2), -t (#drives), up
 to probably 4x #drives (you'll see iops level off).

Ok cool. I'll try that.
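Something like this shell sketch, using only the options discussed in
this thread (-t for threads, -w for the rewrite percentage; flags may
differ between pgiosim versions):

    # baseline, then half the spindles, all spindles, then oversubscribed
    for t in 1 4 8 16 32; do
        ./pgiosim -t $t file*
    done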
 
 Also it turns out that pgiosim can only handle 64 files. I haven't
 checked to see if this is a compile-time changeable item or not.
 
 That is a #define in pgiosim.c

So which is a better test: modifying the #define to allow specifying
200-300 1GB files, or using 64 files but increasing the size of my
files to 2-3GB so that the total bytes in the files is two or three
times the memory in my server (96GB)?

 also, are you running the latest pgiosim from pgfoundry?

Yup, version 0.5 from the foundry.

 the -w param to pgiosim has it rewrite blocks out as it runs. (it is
 a percentage).

Yup, I was running with that and getting low enough numbers that I
switched to pure read tests. It looks like I just need multiple
threads so I can have multiple reads/writes in flight at the same
time.

-- 
-- rouilj

John Rouillard   System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111



Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-05-16 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 On Sun, May 15, 2011 at 9:49 PM, Jesper Krogh jes...@krogh.cc wrote:
 Ok, it may not work as well with indexes, since having 1% in cache may very
 well mean that 90% of all requested blocks are there... for tables it should
 be more trivial.

 Why would the index have a meaningful hot-spot unless the underlying
 table had one as well?  (Of course the root block will be a hot-spot,
 but certainly not 90% of all requests)

The accesses to an index are far more likely to be clustered than the
accesses to the underlying table, because the index is organized in a
way that's application-meaningful and the table not so much.  Continuing
the earlier example of a timestamp column, accesses might preferentially
hit near the right end of the index while the underlying rows are all
over the table.

IOW, hot spots measured at the row level and hot spots measured at the
page level could very easily be different between table and index.

regards, tom lane



Re: [PERFORM] Using pgiosim realistically

2011-05-16 Thread Jeff


On May 16, 2011, at 1:06 PM, John Rouillard wrote:


that is a #define in pgiosim.c


So which is a better test: modifying the #define to allow specifying
200-300 1GB files, or using 64 files but increasing the size of my
files to 2-3GB so that the total bytes in the files is two or three
times the memory in my server (96GB)?



I tend to make 10G chunks with dd and run pgiosim over that.
dd if=/dev/zero of=bigfile bs=1M count=10240


the -w param to pgiosim has it rewrite blocks out as it runs. (it is
a percentage).


Yup, I was running with that and getting low enough numbers that I
switched to pure read tests. It looks like I just need multiple
threads so I can have multiple reads/writes in flight at the same
time.



Yep - you need multiple threads to get max throughput of your io.

--
Jeff Trout j...@jefftrout.com
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/






Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-05-16 Thread Nathan Boley
 The accesses to an index are far more likely to be clustered than the
 accesses to the underlying table, because the index is organized in a
 way that's application-meaningful and the table not so much.

So, to clarify, are you saying that if a query were actually requesting
rows uniformly at random, then there would be no reason to suspect that
index accesses would have hotspots? It seems like the index structure
(i.e., the top node in b-trees) could also get in the way.

Best,
Nathan



[PERFORM] hash semi join caused by IN (select ...)

2011-05-16 Thread Clemens Eisserer
Hi,

I have a quite complex, performance-sensitive query in a system with a
few (7) joins:
select  from t1 left join t2  WHERE id IN (select )

For this query the planner evaluates the IN with a hash semi join last,
and all the joining is done by hash joins for all rows contained in t1.

However, when I specify the ids manually (IN (1, 2, 3, 4, 5)) the
planner first does an index lookup on the primary key column id,
and subsequently does nested loop joins using an index on t2 - which
gives way better results.

Is there any way to guide the planner to evaluate the IN condition
first, instead of last?
Why is the planner behaving this way? (postgresql 8.4.??)

Thank you in advance, Clemens


Query plan with IN(select):

Sort  (cost=165.77..165.77 rows=2 width=16974) (actual time=13.459..13.460 rows=2 loops=1)
  Sort Key: this_.id
  Sort Method:  quicksort  Memory: 26kB
  ->  Hash Semi Join  (cost=123.09..165.76 rows=2 width=16974) (actual time=12.741..13.432 rows=2 loops=1)
        Hash Cond: (this_.id = kladdenent0_.id)
        ->  Hash Left Join  (cost=119.17..160.90 rows=348 width=16974) (actual time=8.765..13.104 rows=342 loops=1)
              Hash Cond: (flugzeug2_.flugzeugtyp_id = flugzeugty3_.id)
              ->  Hash Left Join  (cost=118.10..155.08 rows=348 width=16454) (actual time=8.724..12.412 rows=342 loops=1)
                    Hash Cond: (flugzeug2_.zaehlertyp_id = bmintype4_.id)
                    ->  Hash Left Join  (cost=117.06..152.71 rows=348 width=15934) (actual time=8.660..11.786 rows=342 loops=1)
                          Hash Cond: (this_.lehrerid = pilot5_.id)
                          ->  Hash Left Join  (cost=96.66..130.46 rows=348 width=8912) (actual time=6.395..8.899 rows=342 loops=1)
                                Hash Cond: (this_.nachid = flugplatz6_.id)
                                ->  Hash Left Join  (cost=93.89..122.90 rows=348 width=8370) (actual time=6.354..8.429 rows=342 loops=1)
                                      Hash Cond: (this_.flugzeugid = flugzeug2_.id)
                                      ->  Hash Left Join  (cost=23.17..47.04 rows=348 width=7681) (actual time=1.992..3.374 rows=342 loops=1)
                                            Hash Cond: (this_.pilotid = pilot7_.id)
                                            ->  Hash Left Join  (cost=2.78..22.04 rows=348 width=659) (actual time=0.044..0.548 rows=342 loops=1)
                                                  Hash Cond: (this_.vonid = flugplatz8_.id)
                                                  ->  Seq Scan on startkladde this_  (cost=0.00..14.48 rows=348 width=117) (actual time=0.004..0.074 rows=342 loops=1)
                                                  ->  Hash  (cost=1.79..1.79 rows=79 width=542) (actual time=0.032..0.032 rows=79 loops=1)
                                                        ->  Seq Scan on flugplatz flugplatz8_  (cost=0.00..1.79 rows=79 width=542) (actual time=0.003..0.010 rows=79 loops=1)
                                            ->  Hash  (cost=15.73..15.73 rows=373 width=7022) (actual time=1.938..1.938 rows=375 loops=1)
                                                  ->  Seq Scan on pilot pilot7_  (cost=0.00..15.73 rows=373 width=7022) (actual time=0.006..0.769 rows=375 loops=1)
                                      ->  Hash  (cost=51.43..51.43 rows=1543 width=689) (actual time=4.351..4.351 rows=1543 loops=1)
                                            ->  Seq Scan on flugzeug flugzeug2_  (cost=0.00..51.43 rows=1543 width=689) (actual time=0.006..1.615 rows=1543 loops=1)
                                ->  Hash  (cost=1.79..1.79 rows=79 width=542) (actual time=0.031..0.031 rows=79 loops=1)
                                      ->  Seq Scan on flugplatz flugplatz6_  (cost=0.00..1.79 rows=79 width=542) (actual time=0.003..0.011 rows=79 loops=1)
                          ->  Hash  (cost=15.73..15.73 rows=373 width=7022) (actual time=2.236..2.236 rows=375 loops=1)
                                ->  Seq Scan on pilot pilot5_  (cost=0.00..15.73 rows=373 width=7022) (actual time=0.005..0.781 rows=375 loops=1)
                    ->  Hash  (cost=1.02..1.02 rows=2 width=520) (actual time=0.005..0.005 rows=2 loops=1)
                          ->  Seq Scan on bmintype bmintype4_  (cost=0.00..1.02 rows=2 width=520) (actual time=0.003..0.004 rows=2 loops=1)
              ->  Hash  (cost=1.03..1.03 rows=3 width=520) (actual time=0.004..0.004 rows=3 loops=1)
                    ->  Seq Scan on flugzeugtype flugzeugty3_  (cost=0.00..1.03 rows=3 width=520) (actual time=0.002..0.002 rows=3 loops=1)
        ->  Hash  (cost=3.90..3.90 rows=2 width=4) (actual time=0.239..0.239 rows=2 loops=1)
              ->  Limit  (cost=0.00..3.88 rows=2 width=4) (actual time=0.202..0.236 rows=2 loops=1)
                    ->  Index Scan using startkladde_pkey on startkladde kladdenent0_  (cost=0.00..56.24 rows=29 width=4) (actual time=0.200..0.233 rows=2 loops=1)

Re: [PERFORM] hash semi join caused by IN (select ...)

2011-05-16 Thread Tom Lane
Clemens Eisserer linuxhi...@gmail.com writes:
 I have a quite complex, performance-sensitive query in a system with a
 few (7) joins:
 select  from t1 left join t2  WHERE id IN (select )

Does it work as expected with one less join?  If so, try increasing
join_collapse_limit ...
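A sketch of that experiment (the values are illustrative; the default
for both settings in 8.4 is 8):

    SET join_collapse_limit = 12;  -- let the planner reorder across more joins
    SET from_collapse_limit = 12;  -- the related knob for subqueries in FROM
    EXPLAIN ANALYZE <the problem query>;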

regards, tom lane



Re: [PERFORM] hash semi join caused by IN (select ...)

2011-05-16 Thread Dave Johansen
On Mon, May 16, 2011 at 3:30 PM, Clemens Eisserer linuxhi...@gmail.com wrote:


 [quoted message and EXPLAIN ANALYZE plan snipped - identical to Clemens' post above]