Re: [PERFORM] Comparing user attributes with bitwise operators

2004-09-18 Thread Greg Stark

Patrick Clery <[EMAIL PROTECTED]> writes:

> PLAN
> --------------------------------------------------------------------------------
>  Limit  (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10 loops=1)
>    ->  Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418 rows=10 loops=1)
>          Sort Key: age
>          ->  Index Scan using people_attributes_search on people_attributes  (cost=0.00..6.02 rows=1 width=68) (actual time=0.068..61.648 rows=937 loops=1)
>                Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
>  Total runtime: 69.899 ms
> (6 rows)
...
> - Is there a way of speeding up the sort?

The sort seems to have taken only 8ms of the 69ms, or just over 10%. As long as
the index scan doesn't match too many records the sort should never be much
slower, so it shouldn't be the performance bottleneck. You could also wrap the
scan in a subquery with a limit, so the outer sort never sees more than some
safe maximum number of rows. Something like:

select * from (select * from people_attributes where ... limit 1000) as sub
order by age limit 10;

This means if the query matches more than 1000 it won't be sorted properly by
age; you'll get the top 10 out of some random subset. But you're protected
against ever having to sort more than 1000 records.

> - Will using queries like " WHERE orientation IN (1,2,4) " be any better/worse?

Well they won't use the GiST index, so no. If there was a single column with a
btree index then this would be the cleanest way to go.
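Something like this, under a hypothetical changed model where orientation is a
plain int column rather than an int[] (a sketch, not the schema in this thread):

create index people_attributes_orientation on people_attributes (orientation);
select * from people_attributes where orientation in (1, 2, 4);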

> - The queries with the GiST index are faster, but is it of any benefit when
> the int[] arrays all contain a single value?

Well you've gone from 5 minutes to 60ms. You'll have to do more than one test
to be sure but it sure seems like it's of some benefit.

If they're always a single value you could make it an expression index instead
and not have to change your data model.

Just have the fields be integers individually and make an index as:

create index idx on people_attributes using gist (
  (array[age]) gist__int_ops, 
  (array[gender]) gist__int_ops,
...
)


However I would go one step further. I would make the index simply:

create index idx on people_attributes using gist (
  (array[age,gender,orientation,...]) gist__int_ops
)

And ensure that all of these attributes have distinct domains. I.e., that they
don't share any values. There are 4 billion integer values available so that
shouldn't be an issue.

Then you could use query_int to compare them the way you want. You
misunderstood how query_int is supposed to work. You index an array column and
then you can check it against a query_int just as you're currently checking
for overlap. Think of @@ as a more powerful version of the overlap operator
that can do complex logical expressions.

The equivalent of 

where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
  and '{2}'::int[] && gender
  and '{1,2,4}'::int[] && orientation

would then become:

   WHERE array[age,gender,orientation] @@ '(30|31|32|33|34|35|36|37|38|39|40)&(2)&(1|2|4)'

except you would have to change orientation and gender to not both have a
value of 2. 
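For instance, with hypothetical disjoint codes (say female = 102 and
straight/gay/bi = 103/104/105, as in the lookup table sketched below), the
rewritten query would read:

   WHERE array[age,gender,orientation] @@ '(30|31|32|33|34|35|36|37|38|39|40)&102&(103|104|105)'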

Actually, the expression index may be overkill. You might consider just storing
a column "attributes" with an integer array directly in the table.

You would also want a table that lists the valid attributes to be sure not to
have any overlaps:

id   attribute    value
1    age          1
2    age          2
...
101  gender       male
102  gender       female
103  orientation  straight
104  orientation  gay
105  orientation  bi
106  bodytype     scrawny
...
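A minimal sketch of such a lookup table (these table and column names are
hypothetical, just to make the idea concrete):

create table attribute_values (
    value_id  integer primary key,  -- unique across ALL attributes
    attribute text not null,        -- e.g. 'age', 'gender', 'orientation'
    label     text not null
);
insert into attribute_values values (101, 'gender', 'male');
insert into attribute_values values (102, 'gender', 'female');
insert into attribute_values values (103, 'orientation', 'straight');
insert into attribute_values values (104, 'orientation', 'gay');
insert into attribute_values values (105, 'orientation', 'bi');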


> - Is there any hope for this structure?

You'll have to test this carefully. I tried using GiST indexes for my project
and found that I couldn't load the data and build the GiST indexes fast
enough. You have to test the costs of building and maintaining this index,
especially since it has so many columns in it.

But it looks like your queries are in trouble without it so hopefully it'll be
ok on the insert/update side for you.

-- 
greg




Re: [PERFORM] Comparing user attributes with bitwise operators

2004-09-18 Thread Patrick Clery
I have currently implemented a schema for my "Dating Site" that is storing 
user search preferences and user attributes in an int[] array using the 
contrib/intarray package (suggested by Greg Stark). But there are a few 
problems.
a) query_int can't be cast to int4.
b) query_int can't be indexed.

datingsite=> alter table people_attributes add column bla query_int;
ALTER TABLE
datingsite=> create index idx_query_int on people_attributes (bla);
ERROR:  data type query_int has no default operator class for access method "btree"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.
datingsite=> create index idx_query_int on people_attributes (bla gist__int_ops);
ERROR:  operator class "gist__int_ops" does not exist for access method "btree"
datingsite=> alter table people_attributes drop column bla;
ALTER TABLE

c) query_int can only be used in one operation against int[]:

README.intarray:
int[] @@ query_int  - returns TRUE if array satisfies query (like '1&(2|3)')

It is not possible to use >=, <=, =, etc. Also, this operator does not work
like the example says:

datingsite=> select '{2,3}'::int[] @@ '1'::query_int;
 ?column? 
----------
 f
(1 row)

d) I can't find a way to simply check whether an integer is in an array without
declaring the column as an array; therefore, I need to use an int[] type for a
column that will only ever store one int4 if I want to compare it to an int[] array:

README.intarray:
int[] && int[]  - overlap - returns TRUE if arrays has at least one common 
elements.

e) int[] and query_int are somewhat ugly to deal with since query_int needs 
to be quoted as a string, and int[] is returned as '{1,2,3}'. Or maybe I'm 
just being anal :)


Because of these limitations, I've chosen to declare the attribute columns as
int[] arrays (even though they will only contain one value) so that I can use
the overlap operator: '{1,2,3}'::int[] && column_name.
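For example, a search against this schema looks like the overlap query in
Greg's earlier EXPLAIN output (a sketch using the column names from the
schema below):

select person_id
  from people_attributes
 where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
   and '{2}'::int[] && gender
   and '{1,2,4}'::int[] && orientation;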

Here is the schema:

create table people (
person_id serial,
datecreated timestamp with time zone default now (),
signup_ip cidr not null,
username character varying(30) not null,
password character varying(28) not null,
email character varying(65) not null,
dob date not null,
primary key (person_id)
);

create table people_attributes (
person_id int references people (person_id) on delete cascade initially 
deferred,
askmecount int not null default 0,
age int[] not null default '{1}'::int[],
gender int[] not null default '{1}'::int[],
orientation int[] not null default '{1}'::int[],
bodytype int[] not null default '{1}'::int[],
children int[] not null default '{1}'::int[],
drinking int[] not null default '{1}'::int[],
education int[] not null default '{1}'::int[],
ethnicity int[] not null default '{1}'::int[],
eyecolor int[] not null default '{1}'::int[],
haircolor int[] not null default '{1}'::int[],
hairstyle int[] not null default '{1}'::int[],
height int[] not null default '{1}'::int[],
income int[] not null default '{1}'::int[],
occupation int[] not null default '{1}'::int[],
relation int[] not null default '{1}'::int[], /* multiple answer */
religion int[] not null default '{1}'::int[],
seeking int[] not null default '{1}'::int[], /* multiple answer */
smoking int[] not null default '{1}'::int[],
want_children int[] not null default '{1}'::int[],
weight int[] not null default '{1}'::int[],

primary key (person_id)
)
without oids;

create index people_attributes_search on people_attributes using gist (
age gist__int_ops,
gender gist__int_ops,
orientation gist__int_ops,
bodytype gist__int_ops,
children gist__int_ops,
drinking gist__int_ops,
education gist__int_ops,
ethnicity gist__int_ops,
eyecolor gist__int_ops,
haircolor gist__int_ops,
hairstyle gist__int_ops,
height gist__int_ops,
income gist__int_ops,
occupation gist__int_ops,
relation gist__int_ops,
religion gist__int_ops,
seeking gist__int_ops,
smoking gist__int_ops,
want_children gist__int_ops,
weight gist__int_ops
);

/* These will be compared against the people_attributes table */
create table people_searchprefs (
person_id int references people (person_id) on delete cascade initially 
deferred,
age int[] not null default 
'{18,19,20,21,22,23,24,25,26,27,28,29,30}'::int[],
gender int[] not null default '{1,2,4}'::int[],
orientation int[] not null default '{1,2,8}'::int[],
bodytype int[] not null default '{1,2,3,4,5,6}'::int[],
children int[] not null default '{0}'::int[],
drinking int[] not null default '{0}'::int[],
education int[] not null default '{0}'::int[],
ethnicity int[] not null default '{0}'::int[],
eyeco

Re: [PERFORM] Planner having way wrong estimate for group aggregate

2004-09-18 Thread Steinar H. Gunderson
On Sat, Sep 18, 2004 at 03:48:13PM -0400, Tom Lane wrote:
> 7.4 doesn't have any statistics on expression indexes.  8.0 will do what
> you want though.  (I just fixed an oversight that prevented it from
> doing so...)

OK, so I'll have to wait for 8.0.0beta3 or 8.0.0 (I tried 8.0.0beta2, and it
gave me zero difference) -- fortunately, I can probably wait at the rate
everything else is progressing here. :-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] Planner having way wrong estimate for group aggregate

2004-09-18 Thread Tom Lane
"Steinar H. Gunderson" <[EMAIL PROTECTED]> writes:
> Now, my first notion was creating a functional index to help the planner:
> ...
> However, this obviously didn't help the planner (this came as a surprise to
> me, but probably won't come as a surprise to the more seasoned users here :-)

7.4 doesn't have any statistics on expression indexes.  8.0 will do what
you want though.  (I just fixed an oversight that prevented it from
doing so...)

> Actually, it seems that the higher I set statistics on "tid", the worse the
> estimate becomes.

I believe that the estimate of number of groups will be exactly the same
as the estimate of the number of values of tid --- there's no knowledge
that date_trunc() might reduce the number of distinct values.

> Any ideas for speeding this up?

In 7.4, the only way I can see to force this to use a hash aggregate is
to temporarily set enable_sort false or raise sort_mem.
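Something along these lines (a sketch; sort_mem is the 7.4 name of the
parameter that becomes work_mem in 8.0):

begin;
set local enable_sort = false;  -- or: set local sort_mem = 131072;
select sum(belop) as omsetning, date_trunc('day', tid) as dato
  from enkeltsalg group by date_trunc('day', tid);
commit;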

regards, tom lane



Re: [PERFORM] Trying to match Solaris-Oracle performance with directio?

2004-09-18 Thread Shachar Shemesh
Mischa Sandberg wrote:
> In the meantime, what I gather from browsing mail archives is that
> postgresql on Solaris seems to get hung up on IO rather than CPU.

Well, people more knowledgeable in the secrets of postgres seem confident
that this is not your problem. Fortunately, however, there is a simple way
to find out.

Just download the utinyint var type from pgfoundry
(http://pgfoundry.org/projects/sql2pg/). There is some stuff there you will
need to compile yourself from CVS; I'm sorry, but I haven't done a proper
release just yet. In any case, the utinyint type should provide the data type
you seek, and thus allow you to find out whether this is, indeed, the problem.

--
Shachar Shemesh
Lingnu Open Source Consulting ltd.
http://www.lingnu.com/


[PERFORM] Planner having way wrong estimate for group aggregate

2004-09-18 Thread Steinar H. Gunderson
Hi,

I'm using PostgreSQL 7.4 on a table with ~700,000 rows looking like this:

   Table "public.enkeltsalg"
   Column   |   Type   |   Modifiers   

+--+---
 id | integer  | not null default 
nextval('"enkeltsalg_id_seq"'::text)
 kommentar  | text | not null default ''::text
 antall | numeric(14,4)| not null
 belop  | numeric(10,0)| not null
 type   | character(1) | not null
 tid| timestamp with time zone | default now()
 eksternid  | integer  | 
 kasseid| integer  | 
 baraapning | integer  | 
 salgspris  | integer  | 
 firma  | integer  | 
 bongid | integer  | 
Indexes:
"enkeltsalg_pkey" primary key, btree (id)
"enkeltsalg_aapn" btree (baraapning)
"enkeltsalg_aapn_pris" btree (baraapning, salgspris)
"enkeltsalg_aapn_type" btree (baraapning, "type")
"enkeltsalg_pris" btree (salgspris)
Check constraints:
"enkeltsalg_type_valid" CHECK ("type" = 'K'::bpchar OR "type" = 'B'::bpchar OR 
"type" = 'M'::bpchar OR "type" = 'T'::bpchar)

And I'm doing the query (after VACUUM ANALYZE)

smt=# explain analyze select sum(belop) as omsetning,date_trunc('day',tid) as dato from enkeltsalg group by date_trunc('day',tid);
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=108062.34..114477.98 rows=172735 width=17) (actual time=20977.544..23890.020 rows=361 loops=1)
   ->  Sort  (cost=108062.34..109912.99 rows=740263 width=17) (actual time=20947.372..21627.107 rows=710720 loops=1)
         Sort Key: date_trunc('day'::text, tid)
         ->  Seq Scan on enkeltsalg  (cost=0.00..18010.29 rows=740263 width=17) (actual time=0.091..7180.528 rows=710720 loops=1)
 Total runtime: 23908.538 ms
(5 rows)

Now, as you can see, the GroupAggregate here is _way_ off, so the planner
makes the wrong choice (it should do a hash aggregate). If I set sort_mem to
131072 instead of 16384, it does a hash aggregate (which is 10 seconds
instead of 24), but I can't have sort_mem that high generally.

Now, my first notion was creating a functional index to help the planner:

smt=# create index enkeltsalg_dag on enkeltsalg ( date_trunc('day',tid) );
CREATE INDEX  
smt=# vacuum analyze;
VACUUM

However, this obviously didn't help the planner (this came as a surprise to
me, but probably won't come as a surprise to the more seasoned users here :-)
):

smt=# explain analyze select sum(belop) as omsetning,date_trunc('day',tid) as dato from enkeltsalg group by date_trunc('day',tid);
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=103809.15..110017.11 rows=175512 width=17) (actual time=21061.357..23917.370 rows=361 loops=1)
   ->  Sort  (cost=103809.15..105585.95 rows=710720 width=17) (actual time=21032.239..21695.674 rows=710720 loops=1)
         Sort Key: date_trunc('day'::text, tid)
         ->  Seq Scan on enkeltsalg  (cost=0.00..17641.00 rows=710720 width=17) (actual time=0.091..7231.387 rows=710720 loops=1)
 Total runtime: 23937.791 ms
(5 rows)

I also tried to increase the statistics on the "tid" column:

smt=# alter table enkeltsalg alter column tid set statistics 500;
ALTER TABLE
smt=# analyze enkeltsalg;
ANALYZE

However, this made the planner only do a _worse_ estimate:

smt=# explain analyze select sum(belop) as omsetning,date_trunc('day',tid) as dato from enkeltsalg group by date_trunc('day',tid);
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=107906.59..114449.09 rows=199715 width=17) (actual time=20947.197..23794.389 rows=361 loops=1)
   ->  Sort  (cost=107906.59..109754.56 rows=739190 width=17) (actual time=20918.001..21588.735 rows=710720 loops=1)
         Sort Key: date_trunc('day'::text, tid)
         ->  Seq Scan on enkeltsalg  (cost=0.00..17996.88 rows=739190 width=17) (actual time=0.092..7166.488 rows=710720 loops=1)
 Total runtime: 23814.624 ms
(5 rows)

Actually, it seems that the higher I set statistics on "tid", the worse the
estimate becomes.

Also, I was told (on #postgresql :-) ) to include the following information:

smt

Re: [PERFORM] Trying to match Solaris-Oracle performance with directio?

2004-09-18 Thread Tom Lane
Mischa Sandberg <[EMAIL PROTECTED]> writes:
> Our product (Sophos PureMessage) runs on a Postgres database.
> Some of our Solaris customers have Oracle licenses, and they've 
> commented on the performance difference between Oracle and Postgresql
> on such boxes. In-house, we've noticed the 2:1 (sometimes 5:1)
> performance difference in inserting rows (mostly 2-4K), between
> Postgresql on Solaris 8 and on Linux, for machines with comparable
> CPU's and RAM.

You haven't given any evidence at all to say that I/O is where the
problem is.  I think it would be good first to work through the
conventional issues such as configuration parameters, foreign key
problems, etc.  Give us some more detail about the slow INSERT
queries ...

regards, tom lane



[PERFORM] Trying to match Solaris-Oracle performance with directio?

2004-09-18 Thread Mischa Sandberg
Our product (Sophos PureMessage) runs on a Postgres database.
Some of our Solaris customers have Oracle licenses, and they've 
commented on the performance difference between Oracle and Postgresql
on such boxes. In-house, we've noticed the 2:1 (sometimes 5:1)
performance difference in inserting rows (mostly 2-4K), between
Postgresql on Solaris 8 and on Linux, for machines with comparable
CPU's and RAM.

These (big) customers are starting to ask, why don't we just port our 
dataserver to Oracle for them? I'd like to avoid that, if possible :-)

What we can test on in-house are leetle Sun workstations, while some of
our customers have BIG Sun iron --- so I have no means to date to
reproduce what their bottleneck is :-( Yes, it has been recommended that
we talk to Sun about their iForce test lab ... that's in the pipe.

In the meantime, what I gather from browsing mail archives is that 
postgresql on Solaris seems to get hung up on IO rather than CPU.
Furthermore, I notice that Oracle and now MySQL use directio to bypass the
system cache when doing heavy writes to the disk, and Postgresql does not.

Not wishing to alter backend/storage/file for this test, I figured I could
get a customer to mount the UFS volume for pg_xlog with the option
"forcedirectio".

Any comment on this? No consideration of what the wal_sync_method is at 
this point. Presumably it's defaulting to fdatasync on Solaris.
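A quick way to check the current setting, and the knob to experiment with
(a sketch; the available values depend on the platform):

show wal_sync_method;
-- then try the candidates one at a time in postgresql.conf:
-- wal_sync_method = open_datasync   # or fsync, fdatasync, open_sync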

BTW this is Postgres 7.4.1, and our customers are Solaris 8 and 9.


[PERFORM] CPU maximized out!

2004-09-18 Thread Qing Zhao
Hi, there,
I am running PostgreSQL 7.3.4 on Mac OS X on a G5 with dual processors and
8GB of memory. The shared buffer was set to 512MB.
The database had been running great until about 10 days ago, when our
developers decided to add some indexes to some tables to speed up certain
uploading ops.
Now CPU usage constantly reaches 100% when just a few users are accessing
their information with SELECTs on tables in the databases. If I REINDEX
all the indexes, database performance improves a bit, but before long it
goes back to bad again.

My suspicion is that now that a few indexes have been added, every operation
PostgreSQL runs considers those indexes when calculating cost, and this
leads to the performance downgrade.
What do you think of this? What is the possible solution?
Thanks!
Qing
The following is the output from the top command:
Processes:  92 total, 4 running, 88 sleeping... 180 threads            13:09:18
Load Avg:  2.81, 2.73, 2.50     CPU usage:  95.2% user, 4.8% sys, 0.0% idle
SharedLibs: num =  116, resident = 11.5M code, 1.66M data, 4.08M LinkEdit
MemRegions: num = 12132, resident =  148M + 2.82M private,  403M shared
PhysMem:   435M wired, 5.04G active, 2.22G inactive, 7.69G used,  316M free
VM: 32.7G + 81.5M   5281127(13) pageins, 8544145(0) pageouts

  PID COMMAND      %CPU   TIME     #TH #PRTS #MREGS  RPRVT  RSHRD  RSIZE  VSIZE
27314 postgres    92.2%  2:14.75    1     9     49  12.8M+  396M  75.0M+  849M
26099 postgres    91.1% 19:28.04    1     9     67  15.9M+  396M   298M+  850M
24754 top          2.8%  4:48.33    1    29     26   272K   404K   648K  27.1M
    0 kernel_tas   1.9%  2:12:05   40     2   8476  67.1M     0K   281M  1.03G
  294 hwmond       0.5%  2:26:34    8    75     57   240K   544K  1.09M  31.0M
  347 lookupd      0.3%  1:52:28    2    35     73  3.05M   648K  3.14M  33.6M
   89 configd      0.1% 53:05.16    3   126    151   304K   644K   832K  29.2M
26774 servermgrd   0.1%  0:02.93    1    10     40   344K- 1.17M+ 1.86M  28.2M
  170 coreservic   0.1%  0:09.04    1    40     93   152K   532K  2.64M  28.5M
  223 DirectoryS   0.1% 19:42.47    8    84    135   880K+ 1.44M  4.60M+ 37.1M+
  125 dynamic_pa   0.0%  0:26.79    1    12     17    16K   292K    28K  17.7M
   87 kextd        0.0%  0:01.23    2    17     21     0K   292K    36K  28.2M
  122 update       0.0% 14:27.71    1     9     15    16K   300K    44K  17.6M
    1 init         0.0%  0:00.03    1    12     16    28K   320K    76K  17.6M
    2 mach_init    0.0%  3:36.18    2    95     18    76K   320K   148K  18.2M
   81 syslogd      0.0%  0:19.96    1    10     17    96K   320K   148K  17.7M
