Re: [PERFORM] Slow in morning hours

2004-06-06 Thread Harald Fuchs
In article <[EMAIL PROTECTED]>,
<[EMAIL PROTECTED]> writes:

> Hi All,
> I am using Linux 7.2 and postgresql 7.2.

> Our office hours are over at 6 pm, but we keep our server
> running 24 hours a day.  The next morning, our PGSQL
> server is very slow.

> After about an hour of continuous use, it gradually starts responding
> faster!  This has become an everyday routine!

> Do you have any idea what might be causing this?  Is there anything
> else I should check?

> Any idea how to get relief from this daily morning problem would be
> appreciated!

I guess you're doing a VACUUM at night which invalidates the buffer
cache.  If that's what happens, it's easy to fix: run some dummy
queries after the VACUUM which cause the buffer cache to get filled.
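
For illustration, a warm-up script along these lines might do (a minimal
sketch; the table names are hypothetical and should be replaced by your
most frequently hit tables):

-- run right after the nightly VACUUM; table names are hypothetical
SELECT count(*) FROM orders;     -- full scan pulls the table back into the cache
SELECT count(*) FROM customers;  -- repeat for the other frequently used tables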




[PERFORM] Query involving views

2004-06-06 Thread Laurent Martelli
Hello again,

This question is related to my previous one (Unused table of view, see
http://archives.postgresql.org/pgsql-performance/2004-06/msg00043.php).
For the moment, I have queries where I join tables by hand. Since a
few tables are always joined together, I thought I could define a view
to centralize this and make my queries more readable. But I then
observe a drop in performance on some queries, because it seems the
view is not "broken up" by the planner, so some optimizations can no
longer be applied.  Maybe this assertion is plain wrong; it's just my
feeling about the situation.

I'm using postgresql 7.4.2 on Debian GNU/Linux.

Here are the details of my tables, queries and views:

CREATE TABLE pictures (
PictureID serial PRIMARY KEY,
RollID character varying(64) NOT NULL REFERENCES rolls,
FrameID character varying(64) NOT NULL,
Description character varying(255),
Filename character varying(255),
Owner integer NOT NULL REFERENCES users,
EntryDate timestamp DEFAULT now(),
Date timestamp,
NbClick integer DEFAULT 0,
NbRates integer DEFAULT 0,
MaxRate int2,
MinRate int2,
AverageRate float4 DEFAULT 5,
SumRates integer DEFAULT 0);

-- Each picture can belong to a number of topics
CREATE TABLE topicscontent (
TopicID integer REFERENCES topics ON DELETE cascade,
PictureID integer REFERENCES pictures ON DELETE cascade,
Direct boolean NOT NULL,
PRIMARY KEY (TopicID,PictureID) );

-- Each picture can be viewed by a number of groups
CREATE TABLE permissions (
GroupID integer NOT NULL REFERENCES groups ON DELETE cascade,
PictureID integer NOT NULL REFERENCES pictures ON DELETE cascade,
UNIQUE (GroupID, PictureID));

-- Each user can belong to a number of groups
CREATE TABLE groupsdef (
UserID integer REFERENCES users,
GroupID integer REFERENCES groups,
PRIMARY KEY (UserID,GroupID));

-- Each picture can have a number of keywords
CREATE TABLE keywords (
Type integer,
PictureID integer NOT NULL REFERENCES pictures ON DELETE cascade,
Value character varying(128) NOT NULL,
UNIQUE (Type,PictureID,Value));


Without views, if I want all the pictures with a keyword value of
'laurent' that a user with ID of 2 can see, sorted by AverageRate:

SELECT DISTINCT ON (AverageRate,PictureID) P.*
   FROM Pictures AS P, GroupsDef AS G, Permissions AS A, Keywords AS K
   WHERE P.PictureID=A.PictureID 
 AND G.GroupID=A.GroupID 
 AND K.Value in ('laurent') 
 AND K.PictureID=P.PictureID 
 AND UserID=2
   ORDER BY AverageRate,PictureID;

QUERY PLAN

 Unique  (cost=528.93..532.71 rows=504 width=97) (actual time=32.447..33.062 rows=274 loops=1)
   ->  Sort  (cost=528.93..530.19 rows=504 width=97) (actual time=32.443..32.590 rows=505 loops=1)
         Sort Key: p.averagerate, p.pictureid
         ->  Hash Join  (cost=297.36..506.31 rows=504 width=97) (actual time=12.495..29.312 rows=505 loops=1)
               Hash Cond: ("outer".groupid = "inner".groupid)
               ->  Hash Join  (cost=292.14..466.79 rows=900 width=101) (actual time=12.056..26.180 rows=750 loops=1)
                     Hash Cond: ("outer".pictureid = "inner".pictureid)
                     ->  Seq Scan on permissions a  (cost=0.00..125.05 rows=8305 width=8) (actual time=0.007..6.271 rows=8305 loops=1)
                     ->  Hash  (cost=291.43..291.43 rows=285 width=101) (actual time=11.961..11.961 rows=0 loops=1)
                           ->  Hash Join  (cost=110.26..291.43 rows=285 width=101) (actual time=6.378..11.573 rows=274 loops=1)
                                 Hash Cond: ("outer".pictureid = "inner".pictureid)
                                 ->  Seq Scan on pictures p  (cost=0.00..68.33 rows=2933 width=97) (actual time=0.007..2.426 rows=2933 loops=1)
                                 ->  Hash  (cost=109.55..109.55 rows=285 width=4) (actual time=6.163..6.163 rows=0 loops=1)
                                       ->  Seq Scan on keywords k  (cost=0.00..109.55 rows=285 width=4) (actual time=0.032..5.929 rows=274 loops=1)
                                             Filter: ((value)::text = 'laurent'::text)
               ->  Hash  (cost=5.19..5.19 rows=12 width=4) (actual time=0.217..0.217 rows=0 loops=1)
                     ->  Seq Scan on groupsdef g  (cost=0.00..5.19 rows=12 width=4) (actual time=0.038..0.197 rows=11 loops=1)
                           Filter: (userid = 2)
 Total runtime: 33.554 ms

Now, if I use the following view to abstract access rights:

CREATE VIEW userpictures (
   PictureID,RollID,FrameID,Description,Filename,
   Owner,EntryDate,Date,
   NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
   UserID)
   AS SELECT DISTINCT ON (Permissions.PictureID,UserID)
  Pictures.PictureID,RollID,FrameID,Description,Filename,Owner,
  EntryDate,Date,NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
  UserID
  FROM Permissions
   JOIN Groupsdef USING (GroupID)
   JOIN Pictures USING (PictureID);

Re: [PERFORM] Query involving views

2004-06-06 Thread Tom Lane
Laurent Martelli <[EMAIL PROTECTED]> writes:
> Now, if I use the following view to abstract access rights:

> CREATE VIEW userpictures (
>PictureID,RollID,FrameID,Description,Filename,
>Owner,EntryDate,Date,
>NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
>UserID) 
>AS SELECT DISTINCT ON (Permissions.PictureID,UserID) 
>Pictures.PictureID,RollID,FrameID,Description,Filename,Owner,
>EntryDate,Date,NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
>UserID 
>FROM Permissions 
> JOIN Groupsdef using (GroupID) 
> JOIN pictures using (PictureID);

> [ performance sucks ]

Find a way to get rid of the DISTINCT ON.  That's essentially an
optimization fence.  Worse, the way you are using it here, it doesn't
even give well-defined results, since there's no ORDER BY constraining
which row will be selected out of a set of duplicates.  (I think it may
not matter to you, since you don't really care which groupsdef row is
selected, but in general a view constructed like this is broken.)

It might work to do the view as

SELECT ... all that stuff ...
FROM pictures p, users u
WHERE
  EXISTS (SELECT 1 FROM permissions prm, groupsdef g
  WHERE p.pictureid = prm.pictureid AND prm.groupid = g.groupid
AND g.userid = u.userid);

I'm not sure offhand about the performance properties of this either,
but it would be worth trying.

A cruder answer is just to accept that the view may give you multiple
hits, and put the DISTINCT in the top-level query.
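
For instance (just a sketch, assuming the view is rewritten without the
DISTINCT ON), the top-level query could look like:

SELECT DISTINCT ON (AverageRate, PictureID) *
   FROM userpictures
   WHERE UserID = 2
   ORDER BY AverageRate, PictureID;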

I think though that in the long run you're going to need to rethink this
representation of permissions.  It's nice and simple but it's not going
to scale well.  Even your "fast" query is going to look like a dog once
you get to many thousands of permission entries.

It might work to maintain a derived table (basically a materialized
view) of the form (userid, groupid, pictureid) signifying that a user
can access a picture through membership in a group.  Put a nonunique
index on (userid, pictureid) on it.  This could then drive the EXISTS
test efficiently.
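
A minimal sketch of such a derived table (the names are made up, and you
would keep it current with triggers on permissions and groupsdef):

CREATE TABLE usergrouppictures (
UserID integer NOT NULL,
GroupID integer NOT NULL,
PictureID integer NOT NULL);
CREATE INDEX usergrouppictures_user_pic ON usergrouppictures (UserID, PictureID);

-- initial population
INSERT INTO usergrouppictures (UserID, GroupID, PictureID)
SELECT g.UserID, g.GroupID, p.PictureID
FROM groupsdef g JOIN permissions p USING (GroupID);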

regards, tom lane



Re: [PERFORM] Query involving views

2004-06-06 Thread Laurent Martelli
> "Tom" == Tom Lane <[EMAIL PROTECTED]> writes:

  Tom> Laurent Martelli <[EMAIL PROTECTED]> writes:
  >> Now, if I use the following view to abstract access rights:

  >> CREATE VIEW userpictures (
  >> PictureID,RollID,FrameID,Description,Filename,
  >> Owner,EntryDate,Date,
  >> NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates, UserID) AS
  >> SELECT DISTINCT ON (Permissions.PictureID,UserID)
  >> Pictures.PictureID,RollID,FrameID,Description,Filename,Owner,
  >> EntryDate,Date,NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
  >> UserID FROM Permissions JOIN Groupsdef using (GroupID) JOIN
  >> pictures using (PictureID);

  >> [ performance sucks ]

  Tom> Find a way to get rid of the DISTINCT ON.  That's essentially
  Tom> an optimization fence.  Worse, the way you are using it here,
  Tom> it doesn't even give well-defined results, since there's no
  Tom> ORDER BY constraining which row will be selected out of a set
  Tom> of duplicates.  (I think it may not matter to you, since you
  Tom> don't really care which groupsdef row is selected, 

That's true. I do not use columns from groupsdef in the end. 

  Tom> but in general a view constructed like this is broken.)

  Tom> It might work to do the view as

  Tom> SELECT ... all that stuff ...  FROM pictures p, users u WHERE
  Tom> EXISTS (SELECT 1 FROM permissions prm, groupsdef g WHERE
  Tom> p.pictureid = prm.pictureid AND prm.groupid = g.groupid AND
  Tom> g.userid = u.userid);

  Tom> I'm not sure offhand about the performance properties of this
  Tom> either, but it would be worth trying.

This one does not yield very good performance.  In fact, the best
performance I get is when I use a WHERE clause like this one:

WHERE PictureID IN 
   (SELECT PictureID FROM permissions JOIN groupsdef USING(GroupID) 
  WHERE groupsdef.UserID=2)

But it's not as elegant to write as the initial view using "distinct
on". I could create a view like this:

CREATE VIEW userpictures (PictureID,UserID) 
AS SELECT pictureid,userid 
FROM permissions JOIN groupsdef USING(GroupID)

and then do queries like this:

SELECT * FROM pictures 
WHERE PictureID IN (SELECT PictureID FROM userpictures WHERE UserID=2)

but it's still not as elegant as

SELECT * FROM userpictures WHERE UserID=2 

I think I'll try a function: 

CREATE FUNCTION picturesID(int) RETURNS SETOF int AS '
SELECT PictureID FROM permissions JOIN groupsdef USING(GroupID)
WHERE groupsdef.UserID=$1
' LANGUAGE sql;

SELECT * FROM pictures WHERE PictureID IN (select * from picturesID(2));

Here's something funny: using a function seems to give slightly better results
than inlining the query (I did a dozen runs and the timings were consistent):

SELECT * FROM pictures WHERE PictureID IN (select * from picturesID(2));
QUERY PLAN
 Hash Join  (cost=15.50..100.49 rows=200 width=97) (actual time=28.609..46.568 rows=2906 loops=1)
   Hash Cond: ("outer".pictureid = "inner".picturesid)
   ->  Seq Scan on pictures  (cost=0.00..68.33 rows=2933 width=97) (actual time=0.018..2.610 rows=2933 loops=1)
   ->  Hash  (cost=15.00..15.00 rows=200 width=4) (actual time=28.467..28.467 rows=0 loops=1)
         ->  HashAggregate  (cost=15.00..15.00 rows=200 width=4) (actual time=23.698..26.201 rows=2906 loops=1)
               ->  Function Scan on picturesid  (cost=0.00..12.50 rows=1000 width=4) (actual time=16.202..19.952 rows=5076 loops=1)
 Total runtime: 48.601 ms

SELECT * FROM pictures WHERE PictureID IN (
SELECT PictureID FROM permissions JOIN groupsdef USING(GroupID) 
WHERE groupsdef.UserID=2);
QUERY PLAN

 Hash Join  (cost=394.93..504.24 rows=2632 width=97) (actual time=35.770..53.574 rows=2906 loops=1)
   Hash Cond: ("outer".pictureid = "inner".pictureid)
   ->  Seq Scan on pictures  (cost=0.00..68.33 rows=2933 width=97) (actual time=0.014..2.543 rows=2933 loops=1)
   ->  Hash  (cost=388.35..388.35 rows=2632 width=4) (actual time=35.626..35.626 rows=0 loops=1)
         ->  HashAggregate  (cost=388.35..388.35 rows=2632 width=4) (actual time=30.988..33.502 rows=2906 loops=1)
               ->  Merge Join  (cost=5.40..376.72 rows=4652 width=4) (actual time=0.247..26.628 rows=5076 loops=1)
                     Merge Cond: ("outer".groupid = "inner".groupid)
                     ->  Index Scan using permissions_groupid_key on permissions  (cost=0.00..280.77 rows=8305 width=8) (actual time=0.031..11.629 rows=7633 loops=1)
                     ->  Sort  (cost=5.40..5.43 rows=12 width=4) (actual time=0.207..1.720 rows=5078 loops=1)
                           Sort Key: groupsdef.groupid
                           ->  Seq Scan on groupsdef  (cost=0.00..5.19 rows=12 width=4) (actual time=0.030..0.182 rows=11 loops=1)
                                 Filter: (userid = 2)
 Total runtime: 54.748 ms

  

Re: [PERFORM] postgres performance: comparing 2 data centers

2004-06-06 Thread Rod Taylor
> The members table contains about 500k rows.  It has an index on
> (group_id, member_id) and on (member_id, group_id).


Yes, bad stats are causing it to pick a poor plan, but you're giving it
too many options (which doesn't help) and using space up unnecessarily.

Keep (group_id, member_id)
Remove (member_id, group_id)
Add (member_id)

An index on just (member_id) is actually going to perform better than
(member_id, group_id), since it has a smaller footprint on disk.
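
In SQL the change might look like this (a sketch; the index names are
assumptions, use whatever \di shows for your schema):

-- keep the existing (group_id, member_id) index as it is
DROP INDEX members_member_id_group_id_idx;   -- the (member_id, group_id) index
CREATE INDEX members_member_id_idx ON members (member_id);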

Any time both group_id and member_id appear in the query, the
(group_id, member_id) index will likely be used.

-- 
Rod Taylor 

Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
PGP Key: http://www.rbt.ca/signature.asc




Re: [PERFORM] [JDBC] Using a COPY...FROM through JDBC?

2004-06-06 Thread Markus Schaber
Hi, Steve,

On Sat, 05 Jun 2004 13:12:29 -0700
Steve Wampler <[EMAIL PROTECTED]> wrote:

> I've got a simple database (no indices, 6 columns) that I need
> to write data into quickly, through JDBC connections from
> multiple clients simultaneously in a distributed
> environment.  (This is going to be a message logging service
> for software generated messages.)
> Using a PreparedStatement, I can get about 400/s inserted.  If I
> (on the java side) buffer up the entries and dump them in large
> transaction blocks I can push this up to about 1200/s.  I'd
> like to go faster.  One approach that I think might be
> promising would be to try using a COPY command instead of
> an INSERT, but I don't have a file for input, I have a 
> Java collection, so COPY isn't quite right.  Is there any way to
> efficiently use COPY without having to create a file (remember
> that the java apps are distributed on a LAN and aren't running
> on the DB server.)  Is this a dead end because of the way
> COPY is implemented to only use a file?

We also found that using COPY through the psql frontend gives a speedup
of a factor of 10 or more.  Sadly, as far as I know, the current JDBC
driver does not support COPY ... FROM STDIN.

As a very bad workaround, it might be acceptable to use Runtime.exec()
to start the psql command line tool and issue the statement there, or
even to call a C library via JNI.  Of course, the best "workaround" would
be to implement COPY support in the driver and send the patches to the
PGJDBC team for inclusion :-)
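
For reference, the statement psql would have to run looks roughly like
this (a sketch; table and column names are made up, the data lines are
tab-separated and terminated by a backslash-period):

COPY messages (logged_at, severity, body) FROM STDIN;
2004-06-06 10:00:00	1	first example message
2004-06-06 10:00:01	2	second example message
\.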

We also had to do some trickery to make insertion of lots of rows fast.
We did lots of benchmarks, and currently use the following method:

Our input data is divided into chunks (the optimal size depends on the
machine, and seems to be between 250 and 3000 rows).  Since the current
pgjdbc PreparedStatement implementation just does a text replacement,
and we wanted to get the last bit of speed out of the machine, we issue
a "PREPARE" statement for the insertion on connection setup, and then
addBatch() an "EXECUTE blubb (data, row, values)" statement.
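
Roughly, the SQL that ends up going over the connection looks like this
(a sketch; the table, column and statement names are made up, "blubb"
being the prepared statement name):

-- once, at connection setup:
PREPARE blubb (timestamp, integer, text) AS
INSERT INTO messages (logged_at, severity, body) VALUES ($1, $2, $3);

-- then each addBatch() call queues one of these:
EXECUTE blubb ('2004-06-06 10:00:00', 1, 'example message');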

Then we have several concurrent threads, all running essentially a {get
batch, write batch, commit} loop on their own connection. Increasing
the thread number to more than three did not show further substantial
performance improvements.  This led us to the conclusion that concurrency
can compensate for the time the postmaster is forced to wait while it
syncs the WAL to disk, but that there's still a concurrency limit inside
postgres for inserts (I presume some locking has to happen at times; the
multiversioning does not seem to cover inserts very well).

Also, we were surprised to find that setting the transaction isolation
level to "serializable" can speed things up remarkably in some cases...

> Is there something else I can do?  Ultimately, this will end
> up on a machine running 1+0 RAID, so I expect that will give
> me some performance boost as well, but I'd like to push it
> up as best I can with my current hardware setup.

As any sane setup runs with syncing enabled in the backend, and each
sync (and so each commit) has to write at least one block, you can
calculate the theoretical maximum number of commits your machine can
achieve.

If you have 15k rpm disks (AFAIK the fastest currently available), they
spin at 250 rotations per second, so you cannot have more than 250
commits per second.

Given that your machine has to do some work between the sync() calls
(e.g. processing the whole next batch), it will very likely miss the
next disk rotation, so in reality you are likely to see a factor of 2
or 3 fewer commits.

One way to overcome this limit is to use multiple writer threads and
(given a highly capable I/O subsystem) to enable commit delay in your
backend, so that more than one commit can share the same write
operation.
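
The relevant settings in postgresql.conf are commit_delay and
commit_siblings; just as a sketch (the values are illustrative and need
benchmarking on your own hardware):

commit_delay = 100# microseconds to wait before flushing the WAL
commit_siblings = 5   # only delay if at least this many transactions are open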

It might also help to put the WAL on a different disk (just symlink or
mount or mount --bind the appropriate subdirectory of your database
directory), or even to put the indices on a third disk (which needs ugly
trickery) - it's a shame that the postmaster does not really support
these techniques, which are considered standard in any professional
database.

If you really need much more speed, you could try putting the WAL on a
solid state disk (essentially battery-backed RAM) to overcome this
physical limit, or (if you really trust your hardware and your power
supply) put the WAL into a RAM disk or switch off syncing in your
postmaster configuration.

One thing you should check is whether I/O or CPU is the limiting factor.
If you have a CPU utilization higher than 90%, then all the tricks I
told you about won't help much.  (But using COPY could still help a lot,
as it cuts down the CPU usage very much.)

We tested with two machines, a single-processor developer machine and a
2-way 64-bit Itanium SMP machine.  On the desktop machine, a single
thread already utilized 80% CPU, so only a small improvement was
possible using 2 or more threads.

Re: [PERFORM] Column correlation drifts, index ignored again

2004-06-06 Thread Tom Lane
John Siracusa <[EMAIL PROTECTED]> writes:
>   I want to do something that will convince Postgres that using the date 
> index is, by far, the best plan when running my queries, even when the 
> date column correlation stat drops well below 1.0.

Have you tried experimenting with random_page_cost?  Seems like your
results suggest that you need to lower it.
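
For example, you can experiment per session before changing
postgresql.conf (the value and the query are only illustrative):

SET random_page_cost = 2;
EXPLAIN ANALYZE SELECT * FROM mytable WHERE date_col >= '2004-01-01';

Then compare the chosen plan and the actual runtime against the default
setting of 4.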

regards, tom lane



Re: [PERFORM] General performance questions about postgres on Apple

2004-06-06 Thread Tom Lane
Sean Shanny <[EMAIL PROTECTED]> writes:
> We have the following setting for random page cost:
> random_page_cost = 1# units are one sequential page fetch cost
> Any suggestions on what to bump it up to?

Well, the default setting is 4 ... what measurements prompted you to
reduce it to 1?  The particular example you showed suggested that the
true value on your setup might be 10 or more.

Now I would definitely not suggest that you settle on any particular
value based on only one test case.  You need to try to determine an
appropriate average value, bearing in mind that there's likely to be
lots of noise in any particular measurement.

But in general, setting random_page_cost to 1 is only reasonable when
you are dealing with a fully-cached-in-RAM database, which yours isn't.

regards, tom lane



Re: [PERFORM] General performance questions about postgres on Apple

2004-06-06 Thread Gregory S. Williamson

Why is my name on a mail from Tom Lane?  Really, he knows a *lot* more
than I do and should get due credit.

Seriously, is this the performance remailer mangling something?

Greg Williamson
(the real one)

-Original Message-
From:   Gregory S. Williamson
Sent:   Sun 6/6/2004 10:46 PM
To: Sean Shanny
Cc: [EMAIL PROTECTED]
Subject:Re: [PERFORM] General performance questions about postgres on Apple
Date: Sun, 22 Feb 2004 22:24:29 -0500
From: Tom Lane <[EMAIL PROTECTED]>
