Re: [PERFORM] tsearch2/GIST performance factors?

2005-10-18 Thread Craig A. James

Oleg wrote:

Did you consider *decreasing* SIGLENINT? The size of the index will diminish
and performance could increase. In my current project I use SIGLENINT=15.


The default value for SIGLENINT actually didn't work at all.  It was only by 
increasing it that I got any performance at all.  An examination of the GIST 
indexes showed that most of the first level and many of the second level 
bitmaps were saturated.

tsearch2's index is a lossy index (read
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals),
so search results should be rechecked!


Yes, thanks.  We do indeed recheck the actual results.  The tests I'm running are just on 
the raw index performance - how long does it take to select ... where dockeys @@ 
to_tsquery(...).
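
For reference, the raw timing run is a sketch like this (placeholder table
name; EXPLAIN ANALYZE shows where the time goes in the index scan):

    EXPLAIN ANALYZE
    SELECT id
      FROM docs
     WHERE dockeys @@ to_tsquery('somekeyword');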


We have our TODO at http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
and hope to find sponsorships for the FTS project for the 8.2 release.
Unfortunately, I haven't found spare time to package tsearchd for you;
it should certainly help you.


At this point we may not have time to try tsearchd, and unfortunately we're not 
in a position to sponsor anything yet.

My original question is still bothering me.  Is it normal for a keyword that 
occurs in more than about 2% of the documents to cause such inconsistent 
performance?  Is there any single thing I might look at that would help improve 
performance (like, do I need more memory?  More shared memory?  Different 
config parameters?)

Thanks,
Craig



Re: [PERFORM] Sequential scan on FK join

2005-10-18 Thread Richard Huxton

Martin Nickel wrote:

When I turn off seqscan it does use the index - and it runs 20 to 30%
longer.  Based on that, the planner is correctly choosing a sequential
scan - but that's just hard for me to comprehend.  I'm joining on an int4
key, 2048 per index page - I guess that's a lot of reads - then the
data-page reads.  Still, the 8-minute query time seems excessive.


You'll be getting (many) fewer than 2048 index entries per page. There's
a page header and various pointers involved too, and index pages aren't
going to be full. So it needs to search the table on dates, fetch the
IDs, and then assemble them for the hash join. Of course, if you have
too many to join then all this will spill to disk, slowing you further.


Now, you'd rather get down below 8 minutes. There are a number of options:
 1. Make sure your disk I/O is being pushed to its limit.
 2. Look into increasing the sort memory for this one query by setting
work_mem (see the runtime configuration section of the manual, and the
sketch after this list).
 3. Actually - are you happy that your general configuration is OK?
 4. Perhaps use a cursor - I'm guessing you want to process these
mailings in some way and only want them one at a time in any case.
 5. Try the query one day at a time and see if the balance tips the
other way - you'll be dealing with substantially less data per query,
which might match your system better. Of course, this may not be
practical for your application.
 6. If your lead table is updated only rarely, you could try a CLUSTER
on the table by mailing_id - that should speed the scan. Read the manual
for the CLUSTER command first, though.
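
A rough sketch of options 2 and 6 (the index name is hypothetical):

    -- Option 2: more sort/hash memory for this session only
    -- (work_mem is an integer in KB on 8.0/8.1)
    SET work_mem = 65536;

    -- Option 6: physically order the lead table by mailing_id.
    -- CLUSTER rewrites the table and locks it while it runs.
    CLUSTER lead_mailing_id_idx ON lead;
    ANALYZE lead;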


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] Help tuning postgres

2005-10-18 Thread Robert Treat
reindex should be faster, since you're not dumping/reloading the table
contents on top of rebuilding the index, you're just rebuilding the
index. 
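
For example (object names are hypothetical; as Tom notes in the quoted
thread below, REINDEX locks out writes on the table while it runs):

    REINDEX TABLE big_table;        -- rebuild all indexes on the table
    REINDEX INDEX big_table_pkey;   -- or rebuild just the bloated one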


Robert Treat
emdeon Practice Services
Alachua, Florida

On Wed, 2005-10-12 at 13:32, Steve Poe wrote:
 
 Would it not be faster to do a dump/reload of the table than reindex or
 is it about the same? 
 
 Steve Poe
 
 On Wed, 2005-10-12 at 13:21 -0400, Tom Lane wrote:
  Emil Briggs [EMAIL PROTECTED] writes:
   Not yet, the db is in production use and I have to plan for a down-time
   for that... or is it not impacting the activity on the table ?
  
   It will cause some performance hit while you are doing it.
  
  It'll also lock out writes on the table until the index is rebuilt,
  so he does need to schedule downtime.
  
  regards, tom lane
  

-- 
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL




Re: [PERFORM] Help tuning postgres

2005-10-18 Thread Csaba Nagy
In the light of what you've explained below about nonremovable row
versions reported by vacuum, I wonder if I should worry about the
following type of report:

INFO:  vacuuming public.some_table
INFO:  some_table: removed 29598 row versions in 452 pages
DETAIL:  CPU 0.01s/0.04u sec elapsed 18.77 sec.
INFO:  some_table: found 29598 removable, 39684 nonremovable row
versions in 851 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.02s/0.07u sec elapsed 23.16 sec.
VACUUM


Does that mean that the 39684 nonremovable row versions are actually the
active live rows in the table (as it reports 0 dead)? I'm sure I don't
have any long-running transaction, at least according to pg_stat_activity
(backed up by the Linux ps too). Or should I run a VACUUM FULL...

This table is one which has frequently updated rows.

TIA,
Csaba.


On Thu, 2005-10-13 at 14:40, Andrew Sullivan wrote:
 On Thu, Oct 13, 2005 at 10:15:03AM +0200, Csaba Nagy wrote:
  
  OK, this sounds interesting, but I don't understand: why would an update
  chase down a lot of dead tuples?  Should I read up on some docs, because
  I obviously don't know enough about how updates work in Postgres...
 
 Right.  Here's the issue:
 
 MVCC does not replace rows when you update.  Instead, it marks the
 old row as expired, and sets the new values.  The old row is still
 there, and it's available for other transactions that need to see it.
 As the docs say (see
 http://www.postgresql.org/docs/8.0/interactive/transaction-iso.html),
 "In effect, a SELECT query sees a snapshot of the database as of the
 instant that that query begins to run."  And that can be true because
 the original data is still there, although marked as expired for
 subsequent transactions.
 
 UPDATE works the same way as SELECT in terms of searching for rows
 (so does any command that searches for data).
 
 Now, when you select data, you actually have to traverse all the
 existing versions of the tuple in order to get the one that's live
 for you.  This is normally not a problem: VACUUM goes around and
 cleans out old, expired data that is not live for _anyone_.  It does
 this by looking for the oldest transaction that is open.  (As far as
 I understand it, this is actually the oldest transaction in the
 entire back end; but I've never understood why that should be the
 case, and I'm too incompetent/dumb to understand the code, so I may
 be wrong on this point.)  If you have very long-running transactions,
 then, you can end up with a lot of versions of dead tuples on the
 table, and so reading the few records you want can turn out actually
 to be a very expensive operation, even though it ought to be cheap.
 
 You can see this by using the VERBOSE option to VACUUM:
 
 test=# VACUUM VERBOSE eval1 ;
 INFO:  vacuuming public.eval1
 INFO:  eval1: found 0 removable, 0 nonremovable row versions in 0
 pages
 DETAIL:  0 dead row versions cannot be removed yet.
 There were 0 unused item pointers.
 0 pages are entirely empty.
 CPU 0.00s/0.00u sec elapsed 0.00 sec.
 INFO:  vacuuming pg_toast.pg_toast_18831
 INFO:  index pg_toast_18831_index now contains 0 row versions in 1
 pages
 DETAIL:  0 index pages have been deleted, 0 are currently reusable.
 CPU 0.00s/0.00u sec elapsed 0.00 sec.
 INFO:  pg_toast_18831: found 0 removable, 0 nonremovable row
 versions in 0 pages
 DETAIL:  0 dead row versions cannot be removed yet.
 There were 0 unused item pointers.
 0 pages are entirely empty.
 CPU 0.00s/0.00u sec elapsed 0.00 sec.
 VACUUM
 
 Note those removable and nonremovable row versions.  It's the
 nonremovable ones that can hurt.  WARNING: doing VACUUM on a big table
 on a disk that's already pegged is going to cause you performance
 pain, because it scans the whole table.  In some cases, though, you
 have no choice: if the winds are already out of your sails, and
 you're effectively stopped, anything that might get you moving again
 is an improvement.
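 
 One quick way to check whether an old open transaction is holding VACUUM
 back is to look at pg_stat_activity (a sketch; it assumes the
 stats_command_string parameter is on so that current_query is populated):
 
     SELECT procpid, usename, query_start, current_query
       FROM pg_stat_activity
      ORDER BY query_start;
 
 Backends sitting <IDLE> in transaction with an old query_start are the
 usual culprits.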
 
  And how would the analyze help in finding this out?  I thought it would
  only additionally show me the actual timings, not more detail about what
  was done...
 
 Yes, it shows the actual timings, and the actual number of rows.  But
 if the estimates that the planner makes are wildly different from the
 actual results, then you know your statistics are wrong, and that the
 planner is going about things the wrong way.  ANALYZE is a big help.
 There's also a VERBOSE option to it, but it's usually less useful in
 production situations.
 
 A




Re: [PERFORM] Help tuning postgres

2005-10-18 Thread Csaba Nagy
First of all thanks all for the input.

I probably can't afford even the reindex till Christmas, when we have
about 2 weeks of company holiday... but I guess I'll have to do
something before then.

The system should at least appear to be working all the time. I can have
downtime, but only for short periods, preferably less than 1 minute. The
tables we're talking about have ~10 million rows for the smaller ones and
~150 million rows for the bigger ones, and I guess a reindex will take
quite some time.

I wonder if I could devise a scheme like this:
 
 - create a temp table exactly like the production table, including
indexes and foreign keys;
 - create triggers on the production table which log all inserts,
deletes and updates to a log table;
 - activate these triggers;
 - copy all data from the production table to the temp table (this will
take the bulk of the time needed for the whole operation);
 - replay the log on the temp table repeatedly if necessary, until the
temp table is sufficiently close to the original;
 - rename the original table to something else, and then rename the temp
table to the original name, all this in a transaction - this would
ideally be the only visible delay for the user, and if the system is not
busy, it should be quick, I guess;
 - replay the log one more time;

All this should happen at a point in time when there's little traffic to
the database.

Replaying could be as simple as a few delete triggers on the log table,
which apply the logged change to the temp table, so that the replay itself
is just a delete operation on the log table. This way new log entries can
be replayed later without replaying again what was already replayed.
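
Roughly, the logging and replay machinery might look like this (a sketch
only, using 8.0 syntax: the names are hypothetical, the table is reduced to
a key plus one payload column, and if the order of replay matters the log
would also need a sequence column that the deletes follow):

    CREATE TABLE prod_copy (LIKE prod INCLUDING DEFAULTS);
    -- (LIKE does not copy indexes or foreign keys on 8.0; add those separately)

    CREATE TABLE prod_log (
        op      char(1) NOT NULL,   -- 'I', 'U' or 'D'
        id      integer NOT NULL,   -- key of the affected row
        payload text                -- new value for inserts and updates
    );

    -- log every change made to the production table
    CREATE FUNCTION prod_log_change() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO prod_log VALUES ('D', OLD.id, NULL);
            RETURN OLD;
        ELSIF TG_OP = 'UPDATE' THEN
            INSERT INTO prod_log VALUES ('U', NEW.id, NEW.payload);
            RETURN NEW;
        ELSE
            INSERT INTO prod_log VALUES ('I', NEW.id, NEW.payload);
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER prod_log_trig
        AFTER INSERT OR UPDATE OR DELETE ON prod
        FOR EACH ROW EXECUTE PROCEDURE prod_log_change();

    -- replaying = deleting from the log: this trigger applies each deleted
    -- log row to the copy before the log row goes away
    CREATE FUNCTION prod_log_replay() RETURNS trigger AS $$
    BEGIN
        IF OLD.op = 'D' THEN
            DELETE FROM prod_copy WHERE id = OLD.id;
        ELSIF OLD.op = 'U' THEN
            UPDATE prod_copy SET payload = OLD.payload WHERE id = OLD.id;
        ELSE
            INSERT INTO prod_copy VALUES (OLD.id, OLD.payload);
        END IF;
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER prod_log_replay_trig
        BEFORE DELETE ON prod_log
        FOR EACH ROW EXECUTE PROCEDURE prod_log_replay();

    -- bulk copy (the slow part), then replay as often as needed
    INSERT INTO prod_copy SELECT * FROM prod;
    DELETE FROM prod_log;

    -- final switch: block writers, catch up, rename, all in one transaction
    BEGIN;
    LOCK TABLE prod IN EXCLUSIVE MODE;
    DELETE FROM prod_log;            -- final replay
    ALTER TABLE prod RENAME TO prod_old;
    ALTER TABLE prod_copy RENAME TO prod;
    COMMIT;
    -- afterwards: drop the triggers, prod_log and prod_old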

The big tables I would do this procedure on have a low probability of
conflicting operations (like insert and immediate delete of the same
row, multiple inserts of the same row, multiple conflicting updates of
the same row, etc.), which is why I think replaying the log will work
fine... of course this whole setup will be a lot more work than just a
reindex...

I wonder if somebody has tried anything like this, and whether it has a
chance of working?

Thanks,
Csaba.

On Tue, 2005-10-18 at 17:18, Robert Treat wrote:
 reindex should be faster, since you're not dumping/reloading the table
 contents on top of rebuilding the index, you're just rebuilding the
 index. 
 
 
 Robert Treat
 emdeon Practice Services
 Alachua, Florida
 
 On Wed, 2005-10-12 at 13:32, Steve Poe wrote:
  
  Would it not be faster to do a dump/reload of the table than reindex or
  is it about the same? 
  
  Steve Poe
  
  On Wed, 2005-10-12 at 13:21 -0400, Tom Lane wrote:
   Emil Briggs [EMAIL PROTECTED] writes:
Not yet, the db is in production use and I have to plan for a down-time
for that... or is it not impacting the activity on the table ?
   
It will cause some performance hit while you are doing it.
   
   It'll also lock out writes on the table until the index is rebuilt,
   so he does need to schedule downtime.
   
 regards, tom lane
   




Re: [PERFORM] Help tuning postgres

2005-10-18 Thread Andrew Sullivan
On Tue, Oct 18, 2005 at 05:21:37PM +0200, Csaba Nagy wrote:
 INFO:  vacuuming public.some_table
 INFO:  some_table: removed 29598 row versions in 452 pages
 DETAIL:  CPU 0.01s/0.04u sec elapsed 18.77 sec.
 INFO:  some_table: found 29598 removable, 39684 nonremovable row
 versions in 851 pages
 DETAIL:  0 dead row versions cannot be removed yet.

 Does that mean that the 39684 nonremovable row versions are actually the
 active live rows in the table (as it reports 0 dead)? I'm sure I don't
 have any long-running transaction, at least according to pg_stat_activity
 (backed up by the Linux ps too). Or should I run a VACUUM FULL...
 
 This table is one which has frequently updated rows.

No, you should be ok there.  What that should tell you is that you
have about 40,000 live rows in the table.  But notice that your vacuum
run just removed dead row versions amounting to about 75% of those live
rows.  Moreover, your 39684 rows are taking up 851 pages.  On a standard
installation that's 8 KB per page, so that's about 6,808 KB of physical
storage space you're using.  Is that consistent with the size of your
data?  If it's very large compared to the data you have stored in
there, you may want to ask whether you're leaking space from the free
space map (because of that table turnover, which seems pretty
severe).
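
A quick way to keep an eye on that (relpages and reltuples are refreshed by
VACUUM and ANALYZE, and each index has its own pg_class row):

    SELECT relname, relpages, reltuples,
           relpages * 8 AS approx_kb
      FROM pg_class
     WHERE relname = 'some_table';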

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
The whole tendency of modern prose is away from concreteness.
--George Orwell



[PERFORM] Inefficient escape codes.

2005-10-18 Thread Rodrigo Madera
Hello there,

This is my first post to the list. I have a deep low-level background
in computer programming, but I am a total newbie to SQL databases. I am
using Postgres because of its commercially friendly license.

My problem is with storing large values. I have a database that stores
large amounts of data (each row consisting of up to 5 MB). After
carefully reading the Postgres 8.0 manual (the version I'm using), I
concluded that the best option was to create a bytea field.

Large objects are out of the question here since we have lots of tables.

As I understand it, the client needs to put the data into the server
using a text-based command. This makes the 5 MB of data grow by up to 5x,
making it 25 MB in the worst case (for example, the byte 0x01 becomes
\\001).

My questions are:

1) Is there any way for me to send the binary field directly without needing escape codes?
2) Will this mean that the client actually wastes my network bandwidth
converting binary data to text? Or does the client transparently manage
this?

Thanks for any light on the subject,
Rodrigo



Re: [PERFORM] Inefficient escape codes.

2005-10-18 Thread Michael Fuhr
On Tue, Oct 18, 2005 at 06:07:12PM +, Rodrigo Madera wrote:
 1) Is there any way for me to send the binary field directly without needing
 escape codes?

In 7.4 and later the client/server protocol supports binary data
transfer.  If you're programming with libpq you can use PQexecParams()
to send and/or retrieve values in binary instead of text.

http://www.postgresql.org/docs/8.0/interactive/libpq-exec.html#LIBPQ-EXEC-MAIN

APIs built on top of libpq, or that implement the protocol themselves,
might provide hooks into this capability; check your documentation.
What language and API are you using?

See also COPY BINARY:

http://www.postgresql.org/docs/8.0/interactive/sql-copy.html
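
For example, a server-side bulk load might look like this (table and file
names are made up; the file must already be in PostgreSQL's binary COPY
format, and COPY from a file runs on the server and requires superuser):

    COPY images FROM '/tmp/images.copy' WITH BINARY;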

 2) Will this mean that the client actually wastes my network bandwidth
 converting binary data to text? Or does the client transparently manage
 this?

Binary transfer sends data in binary, not by automatically converting
to and from text.

-- 
Michael Fuhr



Re: [PERFORM] Inefficient escape codes.

2005-10-18 Thread Michael Fuhr
[Please copy the mailing list on replies so others can participate
in and learn from the discussion.]

On Tue, Oct 18, 2005 at 07:09:08PM +, Rodrigo Madera wrote:
  What language and API are you using?
 
 I'm using libpqxx. A nice STL-style library for C++ (I am 101% C++).

I've only dabbled with libpqxx; I don't know if or how you can make
it send data in binary instead of text.  See the documentation or
ask in a mailing list like libpqxx-general or pgsql-interfaces.

  Binary transfer sends data in binary, not by automatically converting
  to and from text.
 
 Uh, I'm sorry I didn't get that... If I send: insert into foo
 values('\\001\\002') will libpq send 0x01, 0x02 or 001002??

If you do it that way libpq will send the string as text with escape
sequences; you can use a sniffer like tcpdump or ethereal to see this
for yourself.  To send the data in binary you'd call PQexecParams()
with a query like "INSERT INTO foo VALUES ($1)".  The $1 is a
placeholder; the other arguments to PQexecParams() provide the data
itself, the data type and length, and specify whether the data is in
text format or binary.  See the libpq documentation for details.

-- 
Michael Fuhr
