Re: [PERFORM] pl/pgsql functions outperforming sql ones?

2012-01-30 Thread Pavel Stehule
2012/1/31 Carlo Stonebanks :
> Pavel, thank you very much for your explanation.
>
> Is it possible to define under what conditions sql procs will
> outperform plpgsql ones, and vice versa?

yes, a little bit :)

when inlining is possible, the SQL function will be faster - the typical
use case is simple scalar functions (with nonvolatile actual parameters).
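
For example, a minimal sketch of an inlinable scalar function (names are
illustrative, not from this thread):

CREATE FUNCTION is_small(integer) RETURNS boolean
LANGUAGE sql AS $$ SELECT $1 < 10 $$;

EXPLAIN VERBOSE SELECT x FROM generate_series(1, 100) AS g(x)
WHERE is_small(x);
-- the Filter line shows the expanded predicate (x < 10),
-- not a call to is_small(x)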

Regards

Pavel




Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Josh Berkus

> Looking at
> http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-RANDOM-PAGE-COST
> I wonder if I should try reducing random_page_cost?

Yes, and I should speak to Heroku about reducing it by default.  RPC
represents the ratio between the cost of a sequential lookup of a single
row vs. the cost of a random lookup.  On standard spinning media on a
dedicated server 4.0 is a pretty good estimate of this.  However, you
are running on shared storage in a cloud, which has different math.
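
A quick session-level experiment (no server restart needed):

SHOW seq_page_cost;          -- 1 by default
SHOW random_page_cost;       -- 4 by default
SET random_page_cost = 2.0;  -- affects only the current session
-- then re-run the EXPLAIN ANALYZE and compare plans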

> Something that might help when it comes to advice on performance tuning is
> that this database is used only for analytics. It's essentially a partial
> replication of a production (document-oriented) database. So a lot of
> normal operations that might employ a series of sequential fetches may not
> actually be the norm in my case. Rather, I'm doing a lot of counts on data
> that is typically randomly distributed.

In that case, you might consider increasing default_statistics_target to
1000 and ANALYZEing your whole database.  That increases the sample size
for the database statistics collector, and most of the time will result
in somewhat better plans on large tables and data with skewed
distributions.  This is not something which Heroku would do as standard,
since most of their users are doing basic transactional webapps.
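
A sketch of how that looks (the database name is illustrative):

ALTER DATABASE analytics SET default_statistics_target = 1000;
-- or raise it only for a skewed column:
ALTER TABLE blocks ALTER COLUMN created SET STATISTICS 1000;
ANALYZE;  -- run from a new session so the database-level setting applies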

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [PERFORM] pl/pgsql functions outperforming sql ones?

2012-01-30 Thread Carlo Stonebanks
Pavel, thank you very much for your explanation.

Is it possible to define under what conditions sql procs will
outperform plpgsql ones, and vice versa?

-Original Message-
From: Pavel Stehule [mailto:pavel.steh...@gmail.com] 
Sent: January 30, 2012 2:57 AM
To: Carlo Stonebanks
Cc: Merlin Moncure; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] pl/pgsql functions outperforming sql ones?

Hello

2012/1/30 Carlo Stonebanks :
> Pavel, are you saying that the code of the stored function is actually
> being added to the SQL query, instead of a call to it? For example, I
> have seen this:
>
> SELECT myVar
> FROM myTable
> WHERE myVar > 0 AND myFunc(myVar)
>
> And seen the SQL body of myFunc appended to the outer query:
>
> ... Filter: SELECT CASE WHEN myVar < 10 THEN true ELSE false END
>
> Is this what we are talking about? Two questions:

yes - it is SQL function "inlining"

>
> 1) Is this also done when the function is called as a SELECT column;
>   e.g. would:
>      SELECT myFunc(myVar) AS result
>   - become:
>      SELECT (
>         SELECT CASE WHEN myVar < 10 THEN true ELSE false END
>      ) AS result?
>

yes

CREATE OR REPLACE FUNCTION public.fx(integer, integer)
 RETURNS integer
 LANGUAGE sql
AS $function$
select coalesce($1, $2)
$function$

postgres=# explain verbose select fx(random()::int, random()::int);
  QUERY PLAN
--
 Result  (cost=0.00..0.02 rows=1 width=0)
   Output: COALESCE((random())::integer, (random())::integer)
(2 rows)


> 2) Does that not bypass the benefits of IMMUTABLE?
>

no - the optimizer works with the expanded query. The usually preferred
style is to write SQL functions without flags, because the optimizer can
work with the definition of the SQL function and set the flags well by
itself. A SQL function is not a black box for the optimizer the way a
plpgsql function is, so the optimizer can choose inlining or other
optimizations. Sometimes explicit flags are necessary, but usually not
for scalar SQL functions.
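
For contrast, a minimal sketch of the same function in plpgsql, which the
planner never inlines (the function name is illustrative):

CREATE OR REPLACE FUNCTION public.fx_plpgsql(integer, integer)
 RETURNS integer
 LANGUAGE plpgsql
AS $function$
begin
  return coalesce($1, $2);
end
$function$;

-- EXPLAIN VERBOSE SELECT fx_plpgsql(random()::int, random()::int);
-- the Output line shows the opaque call fx_plpgsql(...) rather than the
-- expanded COALESCE expression seen with the SQL version above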

postgres=# create or replace function public.fxs(int)
postgres-# returns setof int as $$
postgres$# select * from generate_series(1,$1)
postgres$# $$ language sql;
CREATE FUNCTION
postgres=# explain verbose select * from fxs(10);
QUERY PLAN
---
 Function Scan on public.fxs  (cost=0.25..10.25 rows=1000 width=4)
   Output: fxs
   Function Call: fxs(10)
(3 rows)

postgres=# create or replace function public.fxs(int)
returns setof int as $$
select * from generate_series(1,$1)
$$ language sql IMMUTABLE;
CREATE FUNCTION
postgres=# explain verbose select * from fxs(10);
QUERY PLAN
---
 Function Scan on pg_catalog.generate_series  (cost=0.00..10.00
rows=1000 width=4)
   Output: generate_series.generate_series
   Function Call: generate_series(1, 10) -- inlined query
(3 rows)

Regards

Pavel Stehule

>
>
> -Original Message-
> From: pgsql-performance-ow...@postgresql.org
> [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Pavel Stehule
> Sent: January 28, 2012 1:38 AM
> To: Carlo Stonebanks
> Cc: Merlin Moncure; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] pl/pgsql functions outperforming sql ones?
>
> 2012/1/27 Carlo Stonebanks :
>> Yes, I did test it - i.e. I ran the functions on their own as I had
>> always noticed a minor difference between EXPLAIN ANALYZE results and
>> direct query calls.
>>
>> Interesting, so sql functions DON'T cache plans? Will plan-caching be of
>> any benefit to SQL that makes no reference to any tables? The SQL is
>> emulating the straight non-set-oriented procedural logic of the original
>> plpgsql.
>>
>
> It is not usually necessary - simple SQL functions are merged into the
> outer query - there are a few cases where this optimization cannot be
> applied, and then there is a performance loss.
>
> For example, this optimization is (sometimes) not possible when some
> parameter is volatile.
>
> Regards
>
> Pavel Stehule
>




Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
Got it (with a little bit of klutzing around). :) Thanks!

On Mon, Jan 30, 2012 at 2:24 PM, Scott Marlowe wrote:

> On Mon, Jan 30, 2012 at 3:19 PM, Scott Marlowe 
> wrote:
> > On Mon, Jan 30, 2012 at 2:55 PM, Alessandro Gagliardi
> >  wrote:
> >> I set random_page_cost to 2 (with enable_seqscan on) and get the same
> >> performance I got with enable_seqscan off.
> >> So far so good. Now I just need to figure out how to set it globally. :-/
> >
> > alter database set random_page_cost=2.0;
>
> That should be:
>
> alter database dbnamegoeshere set random_page_cost=2.0;
>
>
>
> --
> To understand recursion, one must first understand recursion.
>


Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Scott Marlowe
On Mon, Jan 30, 2012 at 3:19 PM, Scott Marlowe  wrote:
> On Mon, Jan 30, 2012 at 2:55 PM, Alessandro Gagliardi
>  wrote:
>> I set random_page_cost to 2 (with enable_seqscan on) and get the same
>> performance I got with enable_seqscan off.
>> So far so good. Now I just need to figure out how to set it globally. :-/
>
> alter database set random_page_cost=2.0;

That should be:

alter database dbnamegoeshere set random_page_cost=2.0;



-- 
To understand recursion, one must first understand recursion.



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Scott Marlowe
On Mon, Jan 30, 2012 at 2:55 PM, Alessandro Gagliardi
 wrote:
> I set random_page_cost to 2 (with enable_seqscan on) and get the same
> performance I got with enable_seqscan off.
> So far so good. Now I just need to figure out how to set it globally. :-/

alter database set random_page_cost=2.0;



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
I set random_page_cost to 2 (with enable_seqscan on) and get the same
performance I got with enable_seqscan off.
So far so good. Now I just need to figure out how to set it globally. :-/

On Mon, Jan 30, 2012 at 1:45 PM, Scott Marlowe wrote:

> On Mon, Jan 30, 2012 at 2:39 PM, Alessandro Gagliardi
>  wrote:
> > Looking at
> > http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-RANDOM-PAGE-COST
> > I wonder if I should try reducing random_page_cost?
> >
>
> Yes, try lowering it.  Generally speaking, random page cost should
> always be >= seq page cost.  Start with a number between 1.5 and 2.0
> and see if that helps.  You can make it "sticky" for your user or
> database with alter user or alter database...
>


Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
Pretty sure. I just ran the same query twice in a row
with enable_seqscan=true and the "actual time" was on the order of 42
seconds both times. With enable_seqscan=false, it was on the order 3
seconds. Going back to enable_seqscan=true, it's back to 42 seconds. Unless
you're saying that enable_seqscan is determining whether or not the data is
being cached?

On Mon, Jan 30, 2012 at 1:13 PM, Fernando Hevia  wrote:
>
> On Mon, Jan 30, 2012 at 17:35, Alessandro Gagliardi 
> wrote:
>
>> Well that was a *lot* faster:
>>
>> "HashAggregate  (cost=156301.82..156301.83 rows=2 width=26) (actual
>> time=2692.806..2692.807 rows=2 loops=1)"
>> "  ->  Bitmap Heap Scan on blocks  (cost=14810.54..155828.95 rows=472871
>> width=26) (actual time=289.828..1593.893 rows=575186 loops=1)"
>> "Recheck Cond: (created > '2012-01-29 00:00:00+00'::timestamp
>> with time zone)"
>> "Filter: (shared IS FALSE)"
>> "->  Bitmap Index Scan on blocks_created_idx
>>  (cost=0.00..14786.89 rows=550404 width=0) (actual time=277.407..277.407
>> rows=706663 loops=1)"
>> "  Index Cond: (created > '2012-01-29 00:00:00+00'::timestamp
>> with time zone)"
>> "Total runtime: 2693.107 ms"
>>
>>
> Are you sure the new timing isn't due to cached data? If I am reading it
> correctly, from the latest explain you posted the Index Scan shouldn't have
> made a difference as it is reporting pretty much all rows in the table have
> created > 'yesterday'.
> If the number of rows with created < 'yesterday' isn't significant (~ over
> 25% with default config) a full scan will be chosen and it will probably be
> the better choice too.
>
>
>


Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Scott Marlowe
On Mon, Jan 30, 2012 at 2:39 PM, Alessandro Gagliardi
 wrote:
> On Mon, Jan 30, 2012 at 1:25 PM, Josh Berkus  wrote:
>>
>> You can do "SHOW random_page_cost" yourself right now, too.
>>
> 4
>
> I also tried "SHOW seq_page_cost" and that's 1.
>
> Looking
> at http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-RANDOM-PAGE-COST
> I wonder if I should try reducing random_page_cost?
>
> Something that might help when it comes to advice on performance tuning is
> that this database is used only for analytics. It's essentially a partial
> replication of a production (document-oriented) database. So a lot of normal
> operations that might employ a series of sequential fetches may not actually
> be the norm in my case. Rather, I'm doing a lot of counts on data that is
> typically randomly distributed.

Yes, try lowering it.  Generally speaking, random page cost should
always be >= seq page cost.  Start with a number between 1.5 and 2.0
and see if that helps.  You can make it "sticky" for your user or
database with alter user or alter database...
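
For instance (the role and database names are illustrative):

ALTER ROLE analytics_user SET random_page_cost = 1.5;
ALTER DATABASE analytics SET random_page_cost = 1.5;
-- new sessions for that role, or in that database, pick up the setting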



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
On Mon, Jan 30, 2012 at 1:25 PM, Josh Berkus  wrote:

> You can do "SHOW random_page_cost" yourself right now, too.
>
4

I also tried "SHOW seq_page_cost" and that's 1.

Looking at
http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-RANDOM-PAGE-COST
I wonder if I should try reducing random_page_cost?

Something that might help when it comes to advice on performance tuning is
that this database is used only for analytics. It's essentially a partial
replication of a production (document-oriented) database. So a lot of
normal operations that might employ a series of sequential fetches may not
actually be the norm in my case. Rather, I'm doing a lot of counts on data
that is typically randomly distributed.

Thanks,

-Alessandro


Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Josh Berkus
On 1/30/12 12:59 PM, Claudio Freire wrote:
> On Mon, Jan 30, 2012 at 5:55 PM, Alessandro Gagliardi
>  wrote:
>> Hm. Well, it looks like setting enable_seqscan=false is session specific, so
>> it seems like I can use it with this query alone; but it sounds like even if
>> that works, it's a bad practice. (Is that true?)
> 
> Yep

The issue with that is that the enable_seqscan setting is not limited to
that one table in that query, and won't change over time.  So by all
means use it as a hotfix right now, but it's not a long-term solution to
your problem.

> 
>> My effective_cache_size is 153kB

That seems appropriate for the Ronin; I'll test one out and see what
random_page_cost is set to as well, possibly Heroku needs to adjust the
basic template for the Ronin.  For Heroku, we want to favor index scans
a bit more than you would on regular hardware because the underlying
storage is Amazon, which has good seeks but crap throughput.

You can do "SHOW random_page_cost" yourself right now, too.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Fernando Hevia
On Mon, Jan 30, 2012 at 17:35, Alessandro Gagliardi wrote:

> Well that was a *lot* faster:
>
> "HashAggregate  (cost=156301.82..156301.83 rows=2 width=26) (actual
> time=2692.806..2692.807 rows=2 loops=1)"
> "  ->  Bitmap Heap Scan on blocks  (cost=14810.54..155828.95 rows=472871
> width=26) (actual time=289.828..1593.893 rows=575186 loops=1)"
> "Recheck Cond: (created > '2012-01-29 00:00:00+00'::timestamp
> with time zone)"
> "Filter: (shared IS FALSE)"
> "->  Bitmap Index Scan on blocks_created_idx  (cost=0.00..14786.89
> rows=550404 width=0) (actual time=277.407..277.407 rows=706663 loops=1)"
> "  Index Cond: (created > '2012-01-29 00:00:00+00'::timestamp
> with time zone)"
> "Total runtime: 2693.107 ms"
>
>
Are you sure the new timing isn't due to cached data? If I am reading it
correctly, from the latest explain you posted the Index Scan shouldn't have
made a difference as it is reporting pretty much all rows in the table have
created > 'yesterday'.
If the number of rows with created < 'yesterday' isn't significant (~ over
25% with default config) a full scan will be chosen and it will probably be
the better choice too.


Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Claudio Freire
On Mon, Jan 30, 2012 at 5:55 PM, Alessandro Gagliardi
 wrote:
> Hm. Well, it looks like setting enable_seqscan=false is session specific, so
> it seems like I can use it with this query alone; but it sounds like even if
> that works, it's a bad practice. (Is that true?)

Yep

> My effective_cache_size is 153kB

Um... barring some really bizarre GUC setting, I cannot imagine how it
could be preferring the sequential scan.
Maybe some of the more knowledgeable folks have a hint.

In the meanwhile, you can use the seqscan stuff on that query alone.
Be sure to use it on that query alone - ie, re-enable it afterwards,
or discard the connection.



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
Hm. Well, it looks like setting enable_seqscan=false is session specific,
so it seems like I can use it with this query alone; but it sounds like
even if that works, it's a bad practice. (Is that true?)

My effective_cache_size is 153kB

On Mon, Jan 30, 2012 at 12:50 PM, Claudio Freire wrote:

> On Mon, Jan 30, 2012 at 5:35 PM, Alessandro Gagliardi
>  wrote:
> > To answer your (non-)question about Heroku, it's a cloud service, so I
> don't
> > host PostgreSQL myself. I'm not sure how much I can mess with things like
> > GUC since I don't even have access to the "postgres" database on the
> server.
> > I am a long time SQL user but new to Postgres so I welcome suggestions on
> > where to start with that sort of thing. Setting enable_seqscan=false
> made a
> > huge difference, so I think I'll start there.
>
> It's not a good idea to abuse the enable_* settings; they're
> for debugging, not for general use. In particular, disabling sequential
> scans everywhere can have a disastrous effect on performance.
>
> It sounds as if PG had a misconfigured effective_cache_size. What does
> "show effective_cache_size" tell you?
>


Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Claudio Freire
On Mon, Jan 30, 2012 at 5:35 PM, Alessandro Gagliardi
 wrote:
> To answer your (non-)question about Heroku, it's a cloud service, so I don't
> host PostgreSQL myself. I'm not sure how much I can mess with things like
> GUC since I don't even have access to the "postgres" database on the server.
> I am a long time SQL user but new to Postgres so I welcome suggestions on
> where to start with that sort of thing. Setting enable_seqscan=false made a
> huge difference, so I think I'll start there.

It's not a good idea to abuse the enable_* settings; they're for
debugging, not for general use. In particular, disabling sequential scans
everywhere can have a disastrous effect on performance.

It sounds as if PG had a misconfigured effective_cache_size. What does
"show effective_cache_size" tell you?



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
Well that was a *lot* faster:

"HashAggregate  (cost=156301.82..156301.83 rows=2 width=26) (actual
time=2692.806..2692.807 rows=2 loops=1)"
"  ->  Bitmap Heap Scan on blocks  (cost=14810.54..155828.95 rows=472871
width=26) (actual time=289.828..1593.893 rows=575186 loops=1)"
"Recheck Cond: (created > '2012-01-29 00:00:00+00'::timestamp with
time zone)"
"Filter: (shared IS FALSE)"
"->  Bitmap Index Scan on blocks_created_idx  (cost=0.00..14786.89
rows=550404 width=0) (actual time=277.407..277.407 rows=706663 loops=1)"
"  Index Cond: (created > '2012-01-29 00:00:00+00'::timestamp
with time zone)"
"Total runtime: 2693.107 ms"

To answer your (non-)question about Heroku, it's a cloud service, so I
don't host PostgreSQL myself. I'm not sure how much I can mess with things
like GUC since I don't even have access to the "postgres" database on the
server. I am a long time SQL user but new to Postgres so I welcome
suggestions on where to start with that sort of thing.
Setting enable_seqscan=false made a huge difference, so I think I'll start
there.

Thank you very much!
-Alessandro

On Mon, Jan 30, 2012 at 11:24 AM, Claudio Freire wrote:

> On Mon, Jan 30, 2012 at 4:13 PM, Alessandro Gagliardi
>  wrote:
> > So, here's the query:
> >
> > SELECT private, COUNT(block_id) FROM blocks WHERE created > 'yesterday'
> AND
> > shared IS FALSE GROUP BY private
> >
> > What confuses me is that though this is a largish table (millions of
> rows)
> > with constant writes, the query is over indexed columns of types
> timestamp
> > and boolean so I would expect it to be very fast. The clause where
> created >
> > 'yesterday' is there mostly to speed it up, but apparently it doesn't
> help
> > much.
>
> The number of rows touched is ~0.5M, and is correctly estimated, which
> would lead me to believe PG estimates the index plan to be slower.
>
> You could try by executing first "set enable_seqscan=false;" and then
> your query with explain analyze again. You'll probably get an index
> scan, and you'll see both how it performs and how PG thought it would
> perform. Any mismatch between the two probably means you'll have to
> change the planner tunables (the x_tuple_cost ones) to better match
> your hardware.
>
>
> > As for Hardware: I'm using Heroku's "Ronin" setup which involves 1.7 GB
> > Cache. Beyond that I don't really know.
> snip
> > As for GUC Settings: Again, I don't know what this is. Whatever Heroku
> > defaults to is what I'm using.
>
> And there's your problem. Without knowing/understanding those, you
> won't get anywhere. I don't know what Heroku is, but you should find
> out both hardware details and PG configuration details.
>
> > As for Maintenance Setup: I let Heroku handle that, so I again, I don't
> > really know. FWIW though, vacuuming should not really be an issue (as I
> > understand it) since I don't really do any updates or deletions. It's
> pretty
> > much all inserts and selects.
>
> Maintenance also includes analyzing the table, to gather stats that
> feed the optimizer, and it's very important to keep the stats
> accurate. You can do it manually - just perform an ANALYZE. However,
> the plan doesn't show any serious mismatch between expected and actual
> rowcounts, which suggests stats aren't your problem.
>


Re: [PERFORM] How to improve insert speed with index on text column

2012-01-30 Thread Jeff Janes
On Mon, Jan 30, 2012 at 9:46 AM, Saurabh  wrote:
> Thank you for the information.
>
> Schema of table is:
>
> ID                         bigint
> company_name     text
> data_set                text
> time                      timestamp
> Date                     date
>
> The length of company_name is not known in advance, so it is of datatype
> text. I need to build the index on company_name and ID and then insert
> the records. I cannot create the index after insertion because users can
> search the data while insertion is in progress.
>
> Machine is of 8 core, os centos6 and 8 GB of RAM.
>
> Here is my configuration:
>
> shared_buffers = 32MB

That is very small for your server.  I'd use at least 512MB, and maybe 2GB

> wal_buffers = 1024KB

If you are using 9.1, I would have this set to the default of -1 and
let the database decide for itself what to use.
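
A sketch of the relevant postgresql.conf lines for an 8 GB machine (the
exact values are judgment calls, not from this thread):

shared_buffers = 2GB
wal_buffers = -1        # 9.1+ sizes this automatically from shared_buffers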



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Claudio Freire
On Mon, Jan 30, 2012 at 4:13 PM, Alessandro Gagliardi
 wrote:
> So, here's the query:
>
> SELECT private, COUNT(block_id) FROM blocks WHERE created > 'yesterday' AND
> shared IS FALSE GROUP BY private
>
> What confuses me is that though this is a largish table (millions of rows)
> with constant writes, the query is over indexed columns of types timestamp
> and boolean so I would expect it to be very fast. The clause where created >
> 'yesterday' is there mostly to speed it up, but apparently it doesn't help
> much.

The number of rows touched is ~0.5M, and is correctly estimated, which
would lead me to believe PG estimates the index plan to be slower.

You could try by executing first "set enable_seqscan=false;" and then
your query with explain analyze again. You'll probably get an index
scan, and you'll see both how it performs and how PG thought it would
perform. Any mismatch between the two probably means you'll have to
change the planner tunables (the x_tuple_cost ones) to better match
your hardware.


> As for Hardware: I'm using Heroku's "Ronin" setup which involves 1.7 GB
> Cache. Beyond that I don't really know.
snip
> As for GUC Settings: Again, I don't know what this is. Whatever Heroku
> defaults to is what I'm using.

And there's your problem. Without knowing/understanding those, you
won't get anywhere. I don't know what Heroku is, but you should find
out both hardware details and PG configuration details.

> As for Maintenance Setup: I let Heroku handle that, so I again, I don't
> really know. FWIW though, vacuuming should not really be an issue (as I
> understand it) since I don't really do any updates or deletions. It's pretty
> much all inserts and selects.

Maintenance also includes analyzing the table, to gather stats that
feed the optimizer, and it's very important to keep the stats
accurate. You can do it manually - just perform an ANALYZE. However,
the plan doesn't show any serious mismatch between expected and actual
rowcounts, which suggests stats aren't your problem.



[PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
So, here's the query:

SELECT private, COUNT(block_id) FROM blocks WHERE created > 'yesterday' AND
shared IS FALSE GROUP BY private

What confuses me is that though this is a largish table (millions of rows)
with constant writes, the query is over indexed columns of types timestamp
and boolean, so I would expect it to be very fast. The clause
created > 'yesterday' is there mostly to speed it up, but apparently it
doesn't help much.

Here's the Full Table and Index Schema:

CREATE TABLE blocks
(
  block_id character(24) NOT NULL,
  user_id character(24) NOT NULL,
  created timestamp with time zone,
  locale character varying,
  shared boolean,
  private boolean,
  moment_type character varying NOT NULL,
  user_agent character varying,
  inserted timestamp without time zone NOT NULL DEFAULT now(),
  networks character varying[],
  lnglat point,
  CONSTRAINT blocks_pkey PRIMARY KEY (block_id )
)

WITH (
  OIDS=FALSE
);

CREATE INDEX blocks_created_idx
  ON blocks
  USING btree
  (created  DESC NULLS LAST);

CREATE INDEX blocks_lnglat_idx
  ON blocks
  USING gist
  (lnglat );

CREATE INDEX blocks_networks_idx
  ON blocks
  USING btree
  (networks );

CREATE INDEX blocks_private_idx
  ON blocks
  USING btree
  (private );

CREATE INDEX blocks_shared_idx
  ON blocks
  USING btree
  (shared );

Here are the results from EXPLAIN ANALYZE:

"HashAggregate  (cost=156619.01..156619.02 rows=2 width=26) (actual
time=43131.154..43131.156 rows=2 loops=1)"
"  ->  Seq Scan on blocks  (cost=0.00..156146.14 rows=472871 width=26)
(actual time=274.881..42124.505 rows=562888 loops=1)"
"Filter: ((shared IS FALSE) AND (created > '2012-01-29
00:00:00+00'::timestamp with time zone))"
"Total runtime: 43131.221 ms"

I'm using Postgres version: 9.0.5 (courtesy of Heroku)

As for History: I've only recently started using this query, so there
really isn't any.

As for Hardware: I'm using Heroku's "Ronin" setup which involves 1.7 GB
Cache. Beyond that I don't really know.

As for Maintenance Setup: I let Heroku handle that, so again, I don't
really know. FWIW though, vacuuming should not really be an issue (as I
understand it) since I don't really do any updates or deletions. It's
pretty much all inserts and selects.

As for WAL Configuration: I'm afraid I don't even know what that is. The
query is normally run from a Python web server, though the above explain
was run using pgAdmin3; I doubt that's relevant.

As for GUC Settings: Again, I don't know what this is. Whatever Heroku
defaults to is what I'm using.

Thank you in advance!
-Alessandro Gagliardi


Re: [PERFORM] How to improve insert speed with index on text column

2012-01-30 Thread Andy Colson

On 1/30/2012 3:27 AM, Saurabh wrote:

Hi all,

I am using Postgresql database for our project and doing some
performance testing. We need to insert millions of records with indexed
columns. We have 5 columns in the table. I created an index on the integer
column only and performance was good, but when I created an index on the
text column as well, performance dropped to about 1/8th of that. My
question is: how can I improve insert performance with an index on a text
column?

Thanks,
Saurabh



Do it in a single transaction, and use COPY.
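
A minimal sketch of that (table and file names are illustrative):

BEGIN;
COPY company_data (id, company_name, data_set, "time", "Date")
  FROM '/path/to/data.csv' WITH CSV;
COMMIT;
-- without superuser rights, psql's \copy streams the file from the client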

-Andy



Re: [PERFORM] How to improve insert speed with index on text column

2012-01-30 Thread Claudio Freire
On Mon, Jan 30, 2012 at 2:46 PM, Saurabh  wrote:
> max_connections = 100
> shared_buffers = 32MB
> wal_buffers = 1024KB
> checkpoint_segments = 3

That's a default config, isn't it?

You'd do well to try and optimize it for your system. The defaults are
really, reeallly conservative.

You should also consider normalizing. I'm assuming company_name could
be company_id ? (ie: each will have many rows). Otherwise I cannot see
how you'd expect to be *constantly* inserting millions of rows. If
it's a one-time initialization thing, just drop the indices and
recreate them, as has been suggested. If you create new records all
the time, I'd bet you'll also have many rows with the same
company_name, so normalizing would be a clear win.



Re: [PERFORM] How to improve insert speed with index on text column

2012-01-30 Thread Saurabh
Thank you for the information.

Schema of table is:

ID            bigint
company_name  text
data_set      text
time          timestamp
Date          date

The length of company_name is not known in advance, so it is of datatype
text. I need to build the index on company_name and ID and then insert the
records. I cannot create the index after insertion because users can
search the data while insertion is in progress.

Machine is of 8 core, os centos6 and 8 GB of RAM.

Here is my configuration:

max_connections = 100
shared_buffers = 32MB
wal_buffers = 1024KB
checkpoint_segments = 3




Re: [PERFORM] How to improve insert speed with index on text column

2012-01-30 Thread Jeff Janes
On Mon, Jan 30, 2012 at 1:27 AM, Saurabh  wrote:
> Hi all,
>
> I am using Postgresql database for our project and doing some
> performance testing. We need to insert millions of records with indexed
> columns. We have 5 columns in the table. I created an index on the
> integer column only and performance was good, but when I created an index
> on the text column as well, performance dropped to about 1/8th of that.

Inserting into an indexed table causes a lot of random access to the
underlying index (unless the data is inserted in an order which
corresponds to the index order of all indexes, which is not likely to
happen with multiple indexes).  As soon as your indexes don't fit in
cache, your performance will collapse.

What if you don't have the integer index but just the text?  What is
the average length of the data in the text field?  Is your system CPU
limited or IO limited during the load?

> My question is how I
> can improve performance when inserting data using index on text
> column?

The only "magic" answer is to drop the index and rebuild after the
insert.  If that doesn't work for you, then you have to identify your
bottleneck and fix it.  That can't be done with just the information
you provide.
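
A sketch of the drop-and-rebuild approach (index and table names are
illustrative):

DROP INDEX IF EXISTS company_name_idx;
-- ... bulk insert here ...
CREATE INDEX company_name_idx ON company_data (company_name);
-- one bulk index build is far cheaper than maintaining the index row by
-- row during the load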

Cheers,

Jeff



Re: [PERFORM] pl/pgsql functions outperforming sql ones?

2012-01-30 Thread Carlo Stonebanks
Update: The main stored function in question and all of its sub-functions
were recoded as new pure sql functions.

I then stub tested the sub functions sql vs. plpgsql.

Here were the results for new sql vs old plpgsql:

Individual sub functions tested 20-30% faster

But the main function calling new sql sub functions ran 100% slower

So I tried this:

I modified the old plpgsql function to call the new sql sub functions.

THAT ran 20-30% faster than the unmodified version.

That modified function is listed below. All the functions ending in 2 are
the new SQL versions.

Any thoughts or insight would be much appreciated.

Carlo


CREATE OR REPLACE FUNCTION mdx_lib.lex_compare_candidate3(character varying,
character varying)
  RETURNS numeric AS
$BODY$
/*
Rate two strings candidacy for lex_compare.
param 1: first string to compare
param 2: 2nd string to compare
returns: numeric result like mdx_lib.lex_distance
0 is a failure, 1 a perfect match
*/
declare
   str1 varchar = $1;
   str2 varchar = $2;
   acro1 varchar;
   acro2 varchar;
   str_dist numeric;
   acro_dist numeric;
   result numeric;
begin
   if str1 = str2 then
      result = 0;
   else
      -- compare lower-cased, alphanumeric-only forms of the inputs
      str1 = lower(regexp_replace(str1, '[^[:alnum:]]', '', 'g'));
      str2 = lower(regexp_replace(str2, '[^[:alnum:]]', '', 'g'));

      if str1 = str2 then
         result = 0.1;
      else
         -- combine string distance with acronym distance, weighting the
         -- full-string distance double
         str_dist = mdx_lib.lex_distance2(str1, str2);
         acro1 = mdx_lib.lex_acronym2(str1);
         acro2 = mdx_lib.lex_acronym2(str2);
         acro_dist = mdx_lib.lex_distance2(acro1, acro2);
         result = (acro_dist + (str_dist * 2)) / 2;
      end if;
   end if;

   -- invert so that 1 is a perfect match, and clamp at 0
   result = 1 - result;
   if result < 0 then
      result = 0;
   end if;
   return result;
end;
$BODY$
  LANGUAGE plpgsql IMMUTABLE
  COST 100;



-Original Message-
From: pgsql-performance-ow...@postgresql.org
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Pavel Stehule
Sent: January 28, 2012 1:38 AM
To: Carlo Stonebanks
Cc: Merlin Moncure; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] pl/pgsql functions outperforming sql ones?

2012/1/27 Carlo Stonebanks :
> Yes, I did test it - i.e. I ran the functions on their own as I had
> always noticed a minor difference between EXPLAIN ANALYZE results and
> direct query calls.
>
> Interesting, so sql functions DON'T cache plans? Will plan-caching be of
> any benefit to SQL that makes no reference to any tables? The SQL is
> emulating the straight non-set-oriented procedural logic of the original
> plpgsql.
>

It is not usually necessary - simple SQL functions are merged into the
outer query - there are a few cases where this optimization cannot be
applied, and then there is a performance loss.

For example, this optimization is (sometimes) not possible when some
parameter is volatile.
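
A minimal sketch of that case (the function is illustrative, not from
this thread):

CREATE OR REPLACE FUNCTION public.fx2(integer) RETURNS integer
LANGUAGE sql AS $$ SELECT $1 + $1 $$;

EXPLAIN VERBOSE SELECT fx2(random()::int);
-- $1 is referenced twice, so inlining would make random() run twice;
-- the planner keeps the opaque function call instead of expanding the body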

Regards

Pavel Stehule





Re: [PERFORM] How to improve insert speed with index on text column

2012-01-30 Thread Claudio Freire
On Mon, Jan 30, 2012 at 6:27 AM, Saurabh  wrote:
> Hi all,
>
> I am using Postgresql database for our project and doing some
> performance testing. We need to insert millions of records with indexed
> columns. We have 5 columns in the table. I created an index on the
> integer column only and performance was good, but when I created an index
> on the text column as well, performance dropped to about 1/8th of that.
> My question is: how can I improve insert performance with an index on a
> text column?

Post all the necessary details. Schema, table and index sizes, some config...

Assuming your text column is a long one (long text), this results in
really big indices.
Assuming you only search by equality, you can make it a lot faster by hashing.
Last time I checked, hash indices were quite limited and performed
badly, but I've heard they improved quite a bit. If hash indices don't
work for you, you can always build them on top of btree indices by
indexing on the expression hash(column) and comparing as hash(value) =
hash(column) and value = column.
On a table indexed by URL I have, this improved things immensely. Both
lookup and insertion times improved.
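
A sketch of that trick (names are illustrative; hashtext() is an internal
Postgres function, so any stable hash of your choosing would do):

CREATE INDEX company_name_hash_idx
  ON company_data (hashtext(company_name));

SELECT * FROM company_data
WHERE hashtext(company_name) = hashtext('Acme Corp')  -- uses the small index
  AND company_name = 'Acme Corp';                     -- guards against collisions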



Re: [PERFORM] Postgress is taking lot of CPU on our embedded hardware.

2012-01-30 Thread Marti Raudsepp
On Sat, Jan 28, 2012 at 19:11, Jayashankar K B
 wrote:
> But we are stumped by the amount of CPU Postgres is eating up.

You still haven't told us *how* slow it actually is and how fast you
need it to be. What's your database layout like (tables, columns,
indexes, foreign keys)? What do the queries look like that you have
problems with?

> Our database file is located on a class 2 SD Card. So it is understandable
> if there is a lot of IO activity and speed is low.

Beware that most SD cards are unfit for database write workloads,
since they only perform very basic wear levelling (in my experience
anyway -- things might have changed, but I'm doubtful). It's a matter
of time before you wear out some frequently-written blocks and they
start returning I/O errors or corrupted data.

If you can spare the disk space, increase checkpoint_segments, as that
means at least WAL writes are spread out over a larger number of
blocks. (But heap/index writes are still a problem)

They can also corrupt your data if you lose power in the middle of a
write -- since they use much larger physical block sizes than regular
hard drives, they can lose a whole block, which file systems and
Postgres are not designed to handle. They also tend not to respect
flush/barrier requests that are required for database consistency.

Certainly you should do such power-loss tests before you release your
product. I've built an embedded platform with a database. Due to disk
corruptions, in the end I opted for mounting all file systems
read-only and keeping the database only in RAM.

> Any configuration settings we could check up?

For one, you should reduce max_connections to a more reasonable number
-- I'd guess you don't need more than 5 or 10 concurrent connections.

Also set synchronous_commit=off; this means that you may lose some
committed transactions after power loss, but I think with SD cards all
bets are off anyway.
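
Putting those together, a sketch of the relevant postgresql.conf lines
(the values are starting points, not tested on your hardware):

max_connections = 10        # down from the default 100
synchronous_commit = off    # accept losing recent commits on power loss
checkpoint_segments = 16    # spread WAL writes over more blocks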

Regards,
Marti



[PERFORM] How to improve insert speed with index on text column

2012-01-30 Thread Saurabh
Hi all,

I am using Postgresql database for our project and doing some
performance testing. We need to insert millions of records with indexed
columns. We have 5 columns in the table. I created an index on the integer
column only and performance was good, but when I created an index on the
text column as well, performance dropped to about 1/8th of that. My
question is: how can I improve insert performance with an index on a text
column?

Thanks,
Saurabh
