Re: [HACKERS] [PERFORM] Hints proposal

2006-10-20 Thread Mischa Sandberg


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Bucky
Jordan
Sent: Thursday, October 12, 2006 2:19 PM
To: josh@agliodbs.com; Jim C. Nasby
Cc: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] Hints proposal

  Well, one nice thing about the per-query method is you can post
  before and after EXPLAIN ANALYZE along with the hints.

 One bad thing is that application designers will tend to use the hint,
 fix the immediate issue, and never report a problem at all.  And query
 hints would not be collectable in any organized way except the query
 log, which would then require very sophisticated text parsing to get
 any useful information at all.
 
Or they'll report it when the next version of Postgres breaks their
app because the hints changed, or because the planner does something
else which makes those hints obsolete.

My main concern with hints (aside from the fact I'd rather see more
intelligence in the planner/stats) is managing them appropriately. I
have two general types of SQL where I'd want to use hints: big OLAP
stuff (where I have a lot of big queries, so it's not just one or two
where I'd need them) and large dynamically generated queries (users
building custom queries). Either way, I don't want to put them on the
query itself.

What about using regular expressions for matching, plus, if you have a
function (view, or any other statement that is stored), the ability to
assign a rule to that particular object? So you get pattern matching,
plus explicit selection. This way it's easy to find all your hints,
turn them off, manage them, etc. (Not to mention that dynamically
generated SQL is ugly enough without having to put hints in there.)
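
For illustration only, a sketch of what such a hint catalog might look
like. Nothing like this exists in PG; the table and column names are
invented to make the idea concrete:

-- Hypothetical catalog for planner hints, kept out of application SQL.
-- 'pattern' is a regex matched against incoming query text;
-- 'target' instead pins a hint to a particular stored function or view.
CREATE TABLE plan_hints (
    hint_id  serial PRIMARY KEY,
    pattern  text,                      -- regex matched against the query text
    target   regprocedure,              -- or an explicit function to hint
    hint     text NOT NULL,             -- e.g. 'enable_nestloop=off'
    enabled  boolean NOT NULL DEFAULT true
);

-- Turning every hint off for a test run would then be one statement:
-- UPDATE plan_hints SET enabled = false;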

- Bucky






[PERFORM] Bad plan for join to aggregate of join.

2006-09-12 Thread Mischa Sandberg

PG 8.0.3 is choosing a bad plan for a query.
I'm going to force the plan (by making one join into a function).

I'd like to know if this is unexpected; in general,
can PG see that a join on a grouped-by field
can be pushed down into the query as an indexable filter?

The query below joins a table message to an aggregate of
message_recipient joined to recipient. The joins are all on
indexed PK-FK columns. message_recipient is an intersection table.

  message :: message_recipient :: recipient

In the query plan below, the right side of the join returns one row of message,
and PG knows it.

The left side of the join computes the entire aggregate of message_recipient
(est 700K rows), then does a merge join against the single message row.

I would have hoped for a nested-loop join, where the message id
field would be used to index-scan message_recipient,
which in turn would index-scan recipient by recipient id.

This is PG 8.0.3. All tables have been (very) recently analyzed.
The query plans estimated rowcounts all look bang-on.
message and message_recipient are tables of about 3M rows each.

As usual, this is on a system to which I only have restricted access.
But I'd be happy to expand on the info below with anything short of
the pg_dump.

---

EXPLAIN
SELECT  message.id AS m_db_id, message.m_global_id AS id, m_global_id,
        m_queue_id, h_message_id,
        m_date AS c_date_iso, m_date, c_subject_utf8, message.reason_id AS reason_id,
        m_reason.name AS m_reason, m_spam_probability, m_spam_level, h_to, m_message_size,
        m_header_size, date_part('epoch', message.m_date) AS c_qdate_time,
        h_from_local || '@' || h_from_domain AS h_from,
        env_from_local || '@' || env_from_domain AS env_from,
        env_from_local || '@' || env_from_domain AS m_envelope_from,
        location_name AS location,
        m_milter_host, m_relay, virus_name AS m_virus_name, m_all_recipients
FROM message
JOIN m_reason ON message.reason_id = m_reason.reason_id
JOIN message_all_recipients ON message.id = message_all_recipients.m_id
WHERE message.m_global_id = '2211000-1';


QUERY PLAN
---
Nested Loop  (cost=254538.42..283378.44 rows=1 width=425)
  Join Filter: ("outer".reason_id = "inner".reason_id)
  ->  Merge Join  (cost=254538.42..283377.33 rows=1 width=416)
        Merge Cond: ("outer".m_id = "inner".id)
        ->  Subquery Scan message_all_recipients  (cost=254535.40..281604.95 rows=707735 width=40)
              ->  GroupAggregate  (cost=254535.40..274527.60 rows=707735 width=36)
                    ->  Sort  (cost=254535.40..258250.57 rows=1486069 width=36)
                          Sort Key: message_recipient.message_id
                          ->  Merge Join  (cost=0.00..78970.52 rows=1486069 width=36)
                                Merge Cond: ("outer".id = "inner".recipient_id)
                                ->  Index Scan using pk_recipient on recipient  (cost=0.00..5150.65 rows=204514 width=36)
                                ->  Index Scan using pk_message_recipient on message_recipient  (cost=0.00..56818.25 rows=1486069 width=16)
                                      Filter: (is_mapped = 1)
        ->  Sort  (cost=3.02..3.03 rows=1 width=384)
              Sort Key: message.id
              ->  Index Scan using unq_message_m_global_id on message  (cost=0.00..3.01 rows=1 width=384)
                    Index Cond: ((m_global_id)::text = '2211000-1'::text)
  ->  Seq Scan on m_reason  (cost=0.00..1.04 rows=4 width=13)



--- Relevant tables and view:

# \d message
  Table public.message
  Column   |Type |
Modifiers
+-+-
id | bigint  | not null default 
nextval('public.message_id_seq'::text)
m_global_id| character varying(255)  | not null
reason_id  | smallint| not null
location_name  | character varying(255)  | not null
m_date | timestamp without time zone |
m_queue_id | character varying(255)  |
h_message_id   | character varying(255)  |
c_subject_utf8 | character varying(255)  |
env_from_local | character varying(255)  |
env_from_domain| character varying(255)  |
h_from_local   | character varying(255)  |
h_from_domain  | character varying(255)  |
h_from | character varying(255)  |
h_to   | character varying(255)  |
m_milter_host  | character varying(255)  |
m_relay| character varying(255)  |
m_spam_probability | double precision|
m_message_size | integer |
m_header_size  | integer |
m_spam_level

Re: [PERFORM] Bad plan for join to aggregate of join.

2006-09-12 Thread Mischa Sandberg

Tom Lane wrote:

Mischa Sandberg [EMAIL PROTECTED] writes:

can PG see that a join on a grouped-by field
can be pushed down into the query as an indexable filter?


No.  The GROUP BY serves as a partial optimization fence.  If you're
concerned about the speed of this query, I recommend making a different
view in which 'message' is joined inside the GROUP BY.


Thanks.
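
For anyone hitting the same wall: a sketch of the kind of view Tom is
suggesting. The view name is invented and the aggregate is reduced to a
count as a stand-in for whatever message_all_recipients really computes;
the point is that message (and m_global_id) is inside the GROUP BY, so a
filter on m_global_id can be applied below the aggregate.

CREATE VIEW message_recipients_v AS
SELECT  m.id AS m_id,
        m.m_global_id,
        count(r.id) AS recipient_count     -- stand-in aggregate
FROM    message m
JOIN    message_recipient mr ON mr.message_id = m.id
JOIN    recipient r          ON r.id = mr.recipient_id
WHERE   mr.is_mapped = 1
GROUP BY m.id, m.m_global_id;

-- A query such as
--   SELECT * FROM message_recipients_v WHERE m_global_id = '2211000-1';
-- can then use unq_message_m_global_id before grouping.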



Re: [PERFORM] Optimizer internals

2006-06-15 Thread Mischa Sandberg

Mark Lewis wrote:

On Thu, 2006-06-15 at 14:05 -0400, John Vincent wrote:

Now I've been told by our DBA that we should have been able to wholly
satisfy that query via the indexes.



DB2 can satisfy the query using only indexes because DB2 doesn't do
MVCC.


You can get pretty much the same effect with materialized views.
Create a table that LOOKS like the index (just those columns),
with a foreign key relationship to the original table (cascade delete),
and have the after-insert trigger on the main table write a row to the derived table.
Now (index and) query the skinny table.

Advantage of these tables: you can cluster them regularly,
because doing so doesn't hard-lock the main table.
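
A minimal sketch of the idea; big_table and its two queried columns are
invented names, and the trigger only covers inserts, as described above:

-- Skinny shadow table that mimics a covering index.
CREATE TABLE big_table_skinny (
    id    bigint PRIMARY KEY REFERENCES big_table(id) ON DELETE CASCADE,
    col_a integer,
    col_b date
);
CREATE INDEX big_table_skinny_ab ON big_table_skinny (col_a, col_b);

-- Keep it in step with the main table on insert.
CREATE FUNCTION big_table_skinny_ins() RETURNS trigger AS $$
BEGIN
    INSERT INTO big_table_skinny (id, col_a, col_b)
    VALUES (NEW.id, NEW.col_a, NEW.col_b);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER big_table_skinny_ai AFTER INSERT ON big_table
    FOR EACH ROW EXECUTE PROCEDURE big_table_skinny_ins();

-- Queries then hit big_table_skinny, and CLUSTER big_table_skinny
-- doesn't lock big_table.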

--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.



Re: [PERFORM] Solaris shared_buffers anomaly?

2006-06-14 Thread Mischa Sandberg

Tom Lane wrote:

Mischa Sandberg [EMAIL PROTECTED] writes:



Tom Lane wrote:

Does Solaris have any call that allows locking a shmem segment in RAM?

Yes, mlock(). But want to understand what's going on before patching.


Sure, but testing it with mlock() might help you understand what's going
on, by eliminating one variable: we don't really know if the shmem is
getting swapped, or something else.



For a dedicated DB server machine, Solaris has a feature:
create intimate shared memory with shmat(..., SHM_SHARE_MMU).
All backends share the same TLB entries (!). 


We use that already.  (Hmm, might be interesting for you to turn it
*off* and see if anything changes.  See src/backend/port/sysv_shmem.c.)


Gah. Always must remember to RTFSource.
And reproduce the problem on a machine I control :-)

--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.



[PERFORM] Solaris shared_buffers anomaly?

2006-06-13 Thread Mischa Sandberg

Jim C. Nasby wrote:
...

Actually, in 8.1.x I've seen some big wins from greatly increasing the
amount of shared_buffers, even as high as 50% of memory, thanks to the
changes made to the buffer management code. ...


Anyone else run into a gotcha that one of our customers ran into?
PG 7.4.8 running on Solaris 2.6, USparc w 4GB RAM.
Usually about 50 active backends.
(No reason to believe this wouldn't apply to 8.x).

Initially shared_buffers were set to 1000 (8MB).
Then, we moved all apps but the database server off the box.

Raised shared_buffers to 2000 (16MB).
Modest improvement in some frequent repeated queries.

Raised shared_buffers to 16000 (128MB).
DB server dropped to a CRAWL.

vmstat showed that it was swapping like crazy.
Dropped shared_buffers back down again. 
Swapping stopped.


Stared at "ps u" output a lot, and realized that the shm seg appeared to
be counted as part of the resident set (RSS).
Theory was that the kernel was reading the numbers the same way,
and swapping out resident sets, since they obviously wouldn't
all fit in RAM :-)

Anyone from Sun reading this list, willing to offer an opinion?



Re: [PERFORM] Solaris shared_buffers anomaly?

2006-06-13 Thread Mischa Sandberg

Tom Lane wrote:

Mischa Sandberg [EMAIL PROTECTED] writes:

vmstat showed that it was swapping like crazy.
Dropped shared_buffers back down again. 
Swapping stopped.


Does Solaris have any call that allows locking a shmem segment in RAM?


Yes, mlock(). But want to understand what's going on before patching.
No reason to believe that the multiply-attached shm seg was being swapped out
(which is frankly insane). Swapping out (and in) just the true resident set of
every backend would be enough to explain the vmstat io we saw.

http://www.carumba.com/talk/random/swol-09-insidesolaris.html

For a dedicated DB server machine, Solaris has a feature:
create intimate shared memory with shmat(..., SHM_SHARE_MMU).
All backends share the same TLB entries (!). 


Context switch rates on our in-house solaris boxes running PG
have been insane (4000/sec). Reloading the TLB map on every process
context switch might be one reason Solaris runs our db apps at less
than half the speed of our perftesters' Xeon beige-boxes.

That's guesswork. Sun is making PG part of their distro ... 
perhaps they've some knowledgeable input.


--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.




Re: [PERFORM] Solaris shared_buffers anomaly?

2006-06-13 Thread Mischa Sandberg

Jim C. Nasby wrote:

On Tue, Jun 13, 2006 at 03:21:34PM -0700, Mischa Sandberg wrote:

Raised shared_buffers to 16000 (128MB).
DB server dropped to a CRAWL.

vmstat showed that it was swapping like crazy.
Dropped shared_buffers back down again. 
Swapping stopped.


What's sort_mem set to? I suspect you simply ran the machine out of
memory.


8192 (8MB). No issue when shared_buffers was 2000; same apps always.


Also, Solaris by default will only use a portion of memory for
filesystem caching, which will kill PostgreSQL performance.


Yep, tested /etc/system segmap_percent at 20,40,60. 
No significant difference between 20 and 60.

Default is 10%? 12%? Can't recall.

Was not changed from 20 during the shared_buffer test.
--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.



Re: [PERFORM] Solaris shared_buffers anomaly?

2006-06-13 Thread Mischa Sandberg

Jim C. Nasby wrote:

On Tue, Jun 13, 2006 at 04:20:34PM -0700, Mischa Sandberg wrote:

Jim C. Nasby wrote:

What's sort_mem set to? I suspect you simply ran the machine out of
memory.

8192 (8MB). No issue when shared_buffers was 2000; same apps always.
 
 So if all 50 backends were running a sort, you'd use 400MB. The box has
 4G, right?


Umm ... yes. if. 35-40 of them are doing pure INSERTS. 
Not following your train.


Yep, tested /etc/system segmap_percent at 20,40,60. 
No significant difference between 20 and 60.

That's pretty disturbing... how large is your database?


~10GB. Good locality. Where heading?

--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.



Re: [PERFORM] vacuuming problems continued

2006-06-06 Thread Mischa Sandberg

Andrew Sullivan wrote:

On Thu, Jun 01, 2006 at 01:54:08PM +0200, Antoine wrote:

Hi,
We just don't seem to be getting much benefit from autovacuum. Running
a manual vacuum seems to still be doing a LOT, which suggests to me
that I should either run a cron job and disable autovacuum, or just
run a cron job on top of autovacuum.


Don't know if this was covered in an earlier thread. Bear with me if so.

I'm working with 7.4.8 and 8.0.3 systems, and pg_autovacuum does have some 
glitches ... in part solved by the integrated autovac in 8.1:


- in our env, clients occasionally hit max_connections. This is a known and 
(sort of) desired pushback on load. However, that sometimes knocks pg_autovacuum 
out.


- db server goes down for any reason: same problem.


Just restarting pg_autovacuum is not good enough; when pg_autovacuum terminates, 
it loses its state, so big tables that change less than 50% between such 
terminations may never get vacuumed (!)


For that reason, we've switched to a Perl script, run from cron every 5 minutes,
that persists its state in a table. The script is not a plug-compatible match
for pg_autovacuum (hardcoded rates; hardcoded distinction between user and
system tables), but you may find it useful.


--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.
#!/usr/bin/env perl

use strict;
use warnings;
use Carp;
use DBI;
use POSIX   qw(strftime);

# Hardcoded (aggressive) autovacuum parameters:
my ($VAC_BASE, $VAC_RATE)  = (1000, 0.8);
my ($ANA_BASE, $ANA_RATE)  = ( 500, 0.4);

my $VERBOSE = 'VERBOSE';
my $start = time;
my $stamp = strftime "%FT%T autovac: ", localtime;

open STDERR, ">&STDOUT"; # Redirect PG VERBOSE output.
my $dbh = DBI->connect("dbi:Pg(PrintError=1,RaiseError=1):");

# REVISIT: move this to schema:

my $oid = $dbh->selectall_arrayref(<<__SQL__)->[0][0];
SELECT oid FROM pg_class WHERE relname = 'autovac_state';
__SQL__

$dbh->do(<<__SQL__) if !defined $oid;
CREATE TABLE public.autovac_state(
relid   oid NOT NULL PRIMARY KEY,
nametext NOT NULL,
analyze_timetimestamptz,
vacuum_time timestamptz,
analyze_tups bigint, -- (n_tup_ins+n_tup_upd+n_tup_del) at last ANALYZE
vacuum_tups  bigint  -- (n_tup_upd+n_tup_del) at last VACUUM
);
__SQL__

# Calculate updates/additions to autovac_state:

my $anavac = join ";", map {$_->[0]} @{$dbh->selectall_arrayref(<<__SQL__)};

SELECT * INTO TEMP new_state FROM (
SELECT  
relid,          -- identify tables by ID, so that (re)created tables
                --  are always treated as fresh tables.
name,           -- for constructing the vacuum/analyze command
old_relid,  -- NULL means this will need a new state table entry
analyze_tups,-- _tups are used to update autovac_state
vacuum_tups,

CASE WHEN analyze_tups - prev_analyze_tups NOT BETWEEN 0 AND analyze_point
        OR old_relid IS NULL
     THEN now()
END AS analyze_time,

CASE WHEN vacuum_tups - prev_vacuum_tups NOT BETWEEN 0 AND vacuum_point
     THEN now()
END AS vacuum_time

FROM (
SELECT  N.nspname || '.' || C.relname           AS name,
        A.relid                                 AS old_relid,
        C.oid                                   AS relid,
        S.n_tup_ins + S.n_tup_upd + S.n_tup_del AS analyze_tups,
        S.n_tup_upd + S.n_tup_del               AS vacuum_tups,
        COALESCE(A.analyze_tups,0)              AS prev_analyze_tups,
        COALESCE(A.vacuum_tups,0)               AS prev_vacuum_tups,
        CASE WHEN nspname ~ '^pg_' THEN 1.0 ELSE $ANA_RATE END
            * C.reltuples + $ANA_BASE           AS analyze_point,
        CASE WHEN nspname ~ '^pg_' THEN 2.0 ELSE $VAC_RATE END
            * C.reltuples + $VAC_BASE           AS vacuum_point
FROMpg_classAS C
JOINpg_namespaceAS N ON N.oid = C.relnamespace
JOINpg_stat_all_tables  AS S ON S.relid = C.oid
LEFT JOIN autovac_state AS A ON A.relid = S.relid
WHERE   N.nspname NOT LIKE 'pg_temp%'
) AS X
) AS X
WHERE   analyze_time IS NOT NULL OR vacuum_time IS NOT NULL;

SELECT CASE WHEN vacuum_time IS NOT NULL 
THEN 'VACUUM ANALYZE $VERBOSE ' || name
ELSE 'ANALYZE $VERBOSE ' || name 
END
FROM new_state;
__SQL__

if ($anavac) {
    print STDERR $stamp."start\n";

    $dbh->do(<<__SQL__);
$anavac;

UPDATE  autovac_state
SET analyze_tups 

Re: [PERFORM] vacuuming problems continued

2006-06-06 Thread Mischa Sandberg

Joshua D. Drake wrote:
- in our env, clients occasionally hit max_connections. This is a 
known and (sort of) desired pushback on load. However, that sometimes 
knocks pg_autovacuum out.


That is when you use:

superuser_reserved_connections


Blush. Good point. Though, when we hit max_connections on 7.4.8 systems,
it's been a lemonade-from-lemons plus that vacuuming didn't fire up on top of 
everything else :-)



- db server goes down for any reason: same problem.


I believe you can use
stats_reset_on_server_start = on


We do. The problem is not the loss of pg_stat_user_tables.(n_tup_ins,...)
It's the loss of pg_autovacuum's CountAtLastVacuum (and ...Analyze)
numbers, which are kept in process memory. Never considered patching
pg_autovacuum to just sleep and try again, rather than exit, on a failed
db connection.

--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.



Re: [PERFORM] Selects query stats?

2006-05-23 Thread Mischa Sandberg

Dan Gorman wrote:

All,

I might be completely crazy here, but it seems every other database 
exposes select query stats. Postgres only exposes 
updates/deletes/inserts. Is there something I am missing here?


Perhaps.

You can EXPLAIN ANALYZE a SELECT, just like i/u/d -- but then you
don't get the normal result set back. Is that what you mean?

You can turn on log_min_duration_statement and get total SELECT duration
logged.
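
For example (the 250 ms threshold is arbitrary; log_min_duration_statement
can also be set in postgresql.conf, and changing it per session needs
superuser):

-- Plan plus actual timing for one statement, in place of its result set:
EXPLAIN ANALYZE SELECT count(*) FROM pg_class;

-- Log any statement, SELECTs included, that runs longer than 250 ms:
SET log_min_duration_statement = 250;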

There's a thread in pgsql-hackers (Re: Porting MSSQL to PGSQL: trace and 
profile) about server-side logging of query plans and stats (for all four of 
s/i/u/d), which is indeed not there in PG.


--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.



Re: [PERFORM] Selects query stats?

2006-05-23 Thread Mischa Sandberg

Dan Gorman wrote:

What I am looking for is that our DB is doing X selects a min.


What specifically would you like to measure?
Duration for specific queries?
Queries in an app for which you have no source?
There may be a way to get what you want by other means ...
Details?

I gather you cannot just time the app that's doing the selects,
nor extract those selects and run them via psql and time them
on their own?


Dan Gorman wrote:

All,
I might be completely crazy here, but it seems every other database 
exposes select query stats. Postgres only exposes 
updates/deletes/inserts. Is there something I am missing here?





Re: [PERFORM] Selects query stats?

2006-05-23 Thread Mischa Sandberg

Alvaro Herrera wrote:

Yeah, the problem seems underspecified.


So, Dan, the question is, what are you trying to measure?
This might be a statistic that management has always been given,
for Oracle, and you need to produce the same number for PostgreSQL.

If not, it's hard to figure out what a statement counter actually can measure,
to the extent that you can say, "If that number does THIS, I should do THAT."

--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.



Re: [PERFORM] Best hardware

2005-06-04 Thread Mischa Sandberg
Quoting Bernd Jagla [EMAIL PROTECTED]:

 ... the speed of the head of the HD is actually
 limiting. Also, I only experimented with RAID5, and heard that
 RAID10 will be good for reading but not writing.

Au contraire. RAID5 is worse than RAID10 for writing, because it has the
extra implicit read (parity stripe) for every write. I've switched all
my perftest boxes over from RAID5 to RAID10, and the smallest
performance increase was x1.6 . This is in an update-intensive system;
the WAL log's disk write rate was the controlling factor.

 Are Quad CPUs any good? I heard that the IBM quad system is supposed
 to be 40% faster than HP or Dell???
Check out the other threads for negative experiences with Xeon 2x2 and
perhaps quad CPU's. Me, I'm looking forward to my first Opteron box
arriving next week.

 And how much RAM should I go for: are 8GB enough? Oh, of course I wanted
 to run it under RedHat...

First off, you need enough RAM to hold all your connections. Run your
app, watch the RSS column of ps. For my own simpler apps (that pump
data into the db) I allow 20MB/connection.

Next, if you are heavy on inserts, your tables will never fit in RAM,
and you really just need enough to hold the top levels of the indexes.
Look at the disk space used in your $PGDATA/base/<dboid>/<tableoid>
files, and you can work out whether holding ALL your indexes in memory
is feasible.
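
Roughly the same information is available from SQL, if digging through
$PGDATA isn't convenient; a sketch (relpages is only as current as the
last VACUUM/ANALYZE):

-- Approximate index sizes, largest first (8 kB pages).
SELECT relname, relpages, relpages * 8 / 1024 AS approx_mb
FROM   pg_class
WHERE  relkind = 'i'
ORDER BY relpages DESC;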

If you are heavy on updates, the above holds, but ymmv depending on
locality of reference, you have to run your own tests. 

If you have concurrent big queries, all bets are off --- ask not how
much RAM you need, but how much you can afford :-)





Re: [PERFORM] PostgreSQL strugling during high load

2005-05-13 Thread Mischa Sandberg
Quoting Tom Lane [EMAIL PROTECTED]:

 Mindaugas Riauba [EMAIL PROTECTED] writes:
  ... So contents of database changes very fast. Problem is that
 when
  pg_autovacuum does vacuum those changes slows down too much.
 
 The vacuum cost parameters can be adjusted to make vacuums fired
 by pg_autovacuum less of a burden.  I haven't got any specific
 numbers
 to suggest, but perhaps someone else does.

I solved one problem by cranking sleep scaling to -S 20.
It made pg_autovacuum back off longer during extended periods of heavy
disk-intensive query activity. Our update activity is near-constant
insert rate, then once or twice a day, massive deletes.
-- 
Dreams come true, not free.




Re: [PERFORM] Prefetch

2005-05-11 Thread Mischa Sandberg
Quoting Christopher Kings-Lynne [EMAIL PROTECTED]:

  Another trick you can use with large data sets like this when you
 want 
  results
  back in seconds is to have regularly updated tables that aggregate
 the data
  along each column normally aggregated against the main data set.
 
  Maybe some bright person will prove me wrong by posting some
 working
  information about how to get these apparently absent features
 working.
 
 Most people just use simple triggers to maintain aggregate summary
 tables...

Don't know if this is more appropriate to bizgres, but:
What the first poster is talking about is what OLAP cubes do.

For big aggregating systems (OLAP), triggers perform poorly, 
compared to messy hand-rolled code. You may have dozens
of aggregates at various levels. Consider the effect of having 
each detail row cascade into twenty updates. 

It's particularly silly-looking when data is coming in as 
batches of thousands of rows in a single insert, e.g.

   COPY temp_table FROM STDIN;
   UPDATE fact_table ... FROM ... temp_table
   INSERT INTO fact_table ...FROM...temp_table

   (the above pair of operations is so common, 
Oracle added its MERGE operator for it).
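
Spelled out a little further, with invented table and column names, that
pattern looks something like this (ignoring the concurrency care a real
loader needs):

   -- Land the incoming batch in a staging table.
   CREATE TEMP TABLE stage (dim_key integer, measure bigint);
   COPY stage FROM STDIN;

   -- Fold it into the fact table: update rows that already exist ...
   UPDATE fact_table f
   SET    measure = f.measure + s.measure
   FROM   stage s
   WHERE  f.dim_key = s.dim_key;

   -- ... then insert the ones that don't.
   INSERT INTO fact_table (dim_key, measure)
   SELECT s.dim_key, s.measure
   FROM   stage s
   WHERE  NOT EXISTS (SELECT 1 FROM fact_table WHERE dim_key = s.dim_key);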

Hence my recent post (request) for using RULES to aggregate 
--- given no luck with triggers FOR EACH STATEMENT.




Federated PG servers -- Was: Re: [PERFORM] [GENERAL] Hash index vs. b-tree index (PostgreSQL

2005-05-11 Thread Mischa Sandberg
Was curious why you pointed out SQL-MED as a SQL-standard approach to
federated servers. Always thought of it as covering access to non-SQL
data, the way the lo_* interface works; as opposed to meshing compatible
(to say nothing of identical) SQL servers. Just checked Jim Melton's
last word on that, to make sure, too. Is there something beyond that,
that I'm missing?

The approach that made first best sense to me (perhaps from having gone
there before) is to leave the SQL syntactically unchanged, and to manage
federated relations via pg_ tables and probably procedures. MSSQL and
Sybase went that route. It won't preclude moving to a system embedded in
the SQL language. 

The hurdles for federated SQL service are:
- basic syntax (how to refer to a remote object)
- connection management and delegated security
- timeouts and temporary connection failures
- efficient distributed queries with >1 remote table
- distributed transactions
- interserver integrity constraints

Sometimes the lines get weird because of opportunistic implementations.
For example, for the longest time, MSSQL supported server.db.user.object
references WITHIN STORED PROCEDURES, since the proc engine could hide
some primitive connection management. 

PG struck me as such a natural for cross-server queries, because
it keeps everything out in the open, including statistics.
PG is also well set-up to handle heterogeneous table types,
and has functions that return rowsets. Nothing needs to be bent out of
shape syntactically, or in the cross-server interface, to get over the
hurdles above.

The fact that queries hence transactions can't span multiple databases
tells me, PG has a way to go before it can handle dependency on a
distributed transaction monitor. 

My 2c.




Re: [PERFORM] Bad plan after vacuum analyze

2005-05-11 Thread Mischa Sandberg
Quoting Guillaume Smet [EMAIL PROTECTED]:

 Hi,
 
 We have some performances problem on a particular query.
...

I have to say it, this was the best laid-out set of details behind a
problem I've ever seen on this list; I'm going to try live up to it, the
next time I have a problem of my own.





Re: [PERFORM] [GENERAL] Hash index vs. b-tree index (PostgreSQL

2005-05-10 Thread Mischa Sandberg
Quoting Jim C. Nasby [EMAIL PROTECTED]:

 I'm not really familiar enough with hash indexes to know if this
 would
 work, but if the maximum bucket size was known you could use that to
 determine a maximum range of buckets to look at. In some cases, that
 range would include only one bucket, otherwise it would be a set of
 buckets. If you found a set of buckets, I think you could then just
 go
 to the specific one you need.
 
 If we assume that the maximum bucket size is one page it becomes
 more
 realistic to take an existing large bucket and split it into several
 smaller ones. This could be done on an update to the index page, or
 a
 background process could handle it.
 
 In any case, should this go on the TODO list?
 
  Allowing a bucket size to be specified at CREATE INDEX doesn't seem
 out
  of line though.  We'd have to think up a scheme for
 index-AM-specific
  index parameters ...
 -- 
 Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
 Give your computer some brain candy! www.distributed.net Team #1828

Google "dynamic hash" or "linear hash". It takes care of not needing to
have varying bucket sizes.

Hash indexes are useful if you ALWAYS require disk access; they behave
like worst-case random cache-thrash tests. That's probably why dbs have
gravitated toward tree indexes instead. On the other hand, there's more
(good) to hashing than initially meets the eye.

Dynamic multiway hashing has come a long way from just splicing the bits
together from multiple columns' hash values. If you can lay your hands
on Tim Merrett's old text "Relational Information Systems", it's an
eye-opener. Picture an efficient terabyte spreadsheet.

For one thing, unlike a btree, a multicolumn hash is symmetric: it
doesn't matter which column(s) you do not specify in a partial match.

For another, a multiway hash is useful for much lower selectivity than a
btree. I built such indexes for OLAP cubes, and some dimensions were
only 10 elements wide. At the point where btree indexing becomes worse
than seqscan, a multiway hash tells you which 10% of the disk to scan.




Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Mischa Sandberg
Quoting [EMAIL PROTECTED]:

  exploring the option of buying 10 cheapass
  machines for $300 each.  At the moment, that $300 buys you, from
 Dell, a
  2.5Ghz Pentium 4
 
 Buy cheaper ass Dells with an AMD 64 3000+.  Beats the crap out of
 the 2.5
 GHz Pentium, especially for PostgreSQL.

Whence Dells with an AMD 64 ?? Perhaps you skimmed:

  http://www.thestreet.com/tech/kcswanson/10150604.html
or
  http://www.eweek.com/article2/0,1759,1553822,00.asp







Re: [PERFORM] [GENERAL] Hash index vs. b-tree index (PostgreSQL

2005-05-10 Thread Mischa Sandberg
Quoting Jim C. Nasby [EMAIL PROTECTED]:

 Well, in a hash-join right now you normally end up feeding at least
 one
 side of the join with a seqscan. Wouldn't it speed things up
 considerably if you could look up hashes in the hash index instead?

You might want to google on "grace hash" and "hybrid hash".

The PG hash join is the simplest possible: build a hash table in memory,
and match an input stream against it.

*Hybrid hash* is where you spill the hash to disk in a well-designed
way. Instead of thinking of it as building a hash table in memory, think
of it as partitioning one input; if some or all of it fits in memory,
all the better. The boundary condition is the same. 

The real wizard of hybrid hash has to be Goetz Graefe, who sadly has now
joined the MS Borg. He demonstrated that for entire-table joins, hybrid
hash completely dominates sort-merge. MSSQL now uses what he developed
as an academic, but I don't know what the patent state is.

Grace hash is the original implementation of hybrid hash:
  Kitsuregawa, M., Tanaka, H., and Moto-oka, T. (1984).
  Architecture and Performance of Relational Algebra Machine Grace. 





Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Mischa Sandberg
Quoting Alex Stapleton [EMAIL PROTECTED]:

 This is why I mention partitioning. It solves this issue by storing 
 different data sets on different machines under the same schema.  
 These seperate chunks of the table can then be replicated as well for
 data redundancy and so on. MySQL are working on these things, but PG 
 just has a bunch of third party extensions, I wonder why these are  
 not being integrated into the main trunk :/ Thanks for pointing me to
 PGCluster though. It looks like it should be better than Slony at
 least.

Across a decade or two of projects, including creating a federated
database engine for Simba, I've become rather dubious of horizontal
partitions (across disks or servers), either to improve performance, or
just to scale up and not lose performance. [[The one exception is for
*non-time-critical read-only* systems, with
Slony-style replication.]]

The most successful high-volume systems I've seen have broken up
databases functionally, like a pipeline, where different applications
use different sections of the pipe. 

The highest-volume system I've worked on is Acxiom's gigantic
data-cleansing system. This is the central clearinghouse for every scrap
of demographic that can be associated with some North American,
somewhere. Think of DB for 300M people (some dead). The volumes are
just beyond belief, for both updates and queries. At Acxiom, the
datasets are so large, even after partitioning, that they just
constantly cycle them through memory, and commands are executed in
"convoys" --- sort of like riding a paternoster.
..
Anybody been tracking on what Mr Stonebraker's been up to, lately?
Datastream management. Check it out. Like replication, everybody
hand-rolled their own datastream systems until finally somebody else
generalized it well enough that it didn't have to be built from scratch
every time.

Datastream systems require practically no locking, let alone distributed
transactions. They give you some really strong guarantees on transaction
elapsed-time and throughput. 
...
Where is this all leading? Well, for scaling data like this, the one
feature that you need is the ability of procedures/rules on one server
to perform queries/procedures on another. MSSQL has linked servers and
(blech) OpenQuery. This lets you do reasonably-efficient work when you
only deal with one table at a time. Do NOT try anything fancy with
multi-table joins; timeouts are unavoidable, and painful.

Postgres has a natural advantage in such a distributed server system:
all table/index stats are openly available through the SQL interface,
for one server to make rational query plans involving another server's
resources. God! I would have killed for that when I was writing a
federated SQL engine; the kluges you need to do this at arms-length from
that information are true pain.

So where should I go look, to see what's been done so far, on a Postgres
that can treat another PG server as a new table type?





Re: [PERFORM] [GENERAL] Hash index vs. b-tree index (PostgreSQL

2005-05-10 Thread Mischa Sandberg
Quoting Tom Lane [EMAIL PROTECTED]:

 Mischa Sandberg [EMAIL PROTECTED] writes:
  The PG hash join is the simplest possible: build a hash table in
 memory,  and match an input stream against it.
 
 [ raised eyebrow... ]  Apparently you've not read the code.  It's
 been hybrid hashjoin since we got it from Berkeley.  Probably not the
 best possible implementation of the concept, but we do 
 understand about spill to disk.

Apologies. I stopped reading around line 750 (PG 8.0.1) in
src/backend/executor/nodeHashjoin.c

if (!node->hj_hashdone)
{
    /*
     * execute the Hash node, to build the hash table
     */
    hashNode->hashtable = hashtable;
    (void) ExecProcNode((PlanState *) hashNode);
    ...

and missed the comment:
/*
 * Open temp files for outer batches,
 */

Will quietly go and read twice, talk once. 




Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Mischa Sandberg
Quoting Christopher Kings-Lynne [EMAIL PROTECTED]:

  This is why I mention partitioning. It solves this issue by storing
  different data sets on different machines under the same schema. 
  These seperate chunks of the table can then be replicated as well for 
  data redundancy and so on. MySQL are working on these things
 *laff*
 Yeah, like they've been working on views for the last 5 years, and
 still haven't released them :D :D :D

? 
http://dev.mysql.com/doc/mysql/en/create-view.html
...for MySQL 5.0.1+ ?




Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Mischa Sandberg
Quoting Christopher Kings-Lynne [EMAIL PROTECTED]:

 
 *laff*
 Yeah, like they've been working on views for the last 5 years, and
 still haven't released them :D :D :D
  
  ? 
  http://dev.mysql.com/doc/mysql/en/create-view.html
  ...for MySQL 5.0.1+ ?
 
 Give me a call when it's RELEASED.


:-) Touche'





Re: [PERFORM] Partitioning / Clustering

2005-05-10 Thread Mischa Sandberg
Quoting Jim C. Nasby [EMAIL PROTECTED]:

 To the best of my knowledge no such work has been done. There is a
 project (who's name escapes me) that lets you run queries against a
 remote postgresql server from a postgresql connection to a different
 server, which could serve as the basis for what you're proposing.

Okay, if the following looks right to the powerthatbe, I'd like to start
a project. Here's the proposition:

servername.dbname.schema.object would change RangeVar, which would
affect much code. dbname.schema.object itself is not implemented in
8.0. So, simplicity dictates something like:

table pg_remote(schemaname text, connectby text, remoteschema text)
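
Spelled out as DDL, just to make the shape of the proposal concrete
(nothing here exists in PG; it is only the one-line sketch above written
out):

CREATE TABLE pg_remote (
    schemaname   text PRIMARY KEY,  -- local schema that fronts remote objects
    connectby    text,              -- libpq connection string for the remote server
    remoteschema text               -- schema to map on the remote side
);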

The pg_statistic info from a remote server cannot be cached in local
pg_statistic, without inventing pseudo reloids as well as a
pseudoschema. Probably cleaner to cache it somewhere else. I'm still
reading down the path that puts pg_statistic data where costsize can get
at it.

First step: find out whether one can link libpq.so to postmaster :-)




Re: [PERFORM] Query tuning help

2005-05-09 Thread Mischa Sandberg
Quoting Russell Smith [EMAIL PROTECTED]:

 On Mon, 9 May 2005 11:49 am, Dan Harris wrote:
  On May 8, 2005, at 6:51 PM, Russell Smith wrote:
 [snip]
  select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
  FROM em
  JOIN ea ON em.incidentid = ea.incidentid  --- slight paraphrase /Mischa.
  AND em.entrydate between '2005-1-1' and '2005-5-9'
  AND ea.recordtext like '%RED%'  AND ea.recordtext like
'%CORVETTE%'

  Here's the situation:
  Due to the format of the systems with which I integrate ( I have no
  control over these formats ), we will get these 'recordtext' values one
  line at a time, accumulating over time.  The only way I can find to
  make this work is to insert a new record for each line.  The problem
  is, that when someone wants to search multiple keywords, they expect
  these words to be matched across multiple records with a given incident
  number.
 
For a very simple example:
 
  IncidentID  Date              Recordtext
  ----------  ----------------  -----------------------------
  1           2005-05-01 14:21  blah blah blah RED blah blah
              2005-05-01 14:23  not what we are looking for
  1           2005-05-02 02:05  blah CORVETTE blah blah
 

select em.incidentid, ea.recordtext as retdata
from   em
join   ea using (incidentid)   -- ea joined here so recordtext is in scope
join ( -- equivalent to where incidentid in (...), sometimes faster.
  select incidentid
  from  em join ea using (incidentid)
  where em.entrydate between '2005-1-1' and '2005-5-9'
  group by incidentid
  having 1 = min(case when recordtext like '%RED%' then 1 end)
     and 1 = min(case when recordtext like '%CORVETTE%' then 1 end)
 ) as X using (incidentid);





Re: [PERFORM] Whence the Opterons?

2005-05-08 Thread Mischa Sandberg
Thanks to everyone for their pointers to suppliers of Opteron systems.

The system I'm pricing is under a tighter budget than a production machine,
because it will be for perftests. Our customers tend to run on Dells but
occasionally run on (Sun) Opterons. 





[PERFORM] Whence the Opterons?

2005-05-06 Thread Mischa Sandberg
After reading the comparisons between Opteron and Xeon processors for Linux,
I'd like to add an Opteron box to our stable of Dells and Sparcs, for 
comparison.

IBM, Sun and HP have their fairly pricey Opteron systems.
The IT people are not swell about unsupported purchases off ebay.
Anyone care to suggest any other vendors/distributors?
Looking for names with national support, so that we can recommend as much to our
customers.

Many thanks in advance.
-- 
Dreams come true, not free. -- S.Sondheim




Re: [PERFORM] COPY vs INSERT

2005-05-04 Thread Mischa Sandberg
Quoting David Roussel [EMAIL PROTECTED]:

  COPY invokes all the same logic as INSERT on the server side
  (rowexclusive locking, transaction log, updating indexes, rules). 
  The difference is that all the rows are inserted as a single
  transaction. This reduces the number of fsync's on the xlog,
  which may be a limiting factor for you. You'll want to crank 
  WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though. 
  One of my streams has 6K records; I run with WB=1000, CS=128.
 
 So what's the difference between a COPY and a batch of INSERT
 statements.  Also, surely, fsyncs only occur at the end of a
 transaction, no need to fsync before a commit has been issued,
 right?

Sorry, I was comparing granularities the other way around. As far as
xlog is concerned, a COPY is ALWAYS one big txn, no matter how many
putline commands you use to feed the copy. With inserts, you can choose
whether to commit every row, every nth row, etc.

Copy makes better use of the TCP connection for transmission. COPY uses
the TCP connection like a one-way pipe. INSERT is like an RPC: the
sender has to wait until the insert's return status roundtrips.
-- 
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.




Re: [PERFORM] COPY vs INSERT

2005-05-04 Thread Mischa Sandberg
Quoting Kris Jurka [EMAIL PROTECTED]: 
 
 On Wed, 4 May 2005, Mischa Sandberg wrote:

  Copy makes better use of the TCP connection for transmission. COPY uses
  the TCP connection like a one-way pipe. INSERT is like an RPC: the
  sender has to wait until the insert's return status roundtrips.

 Not true.  A client may send any number of Bind/Execute messages on a
 prepared statement before a Sync message.  So multiple inserts may be sent
 in one network roundtrip.  This is exactly how the JDBC driver
 implements batch statements.  There is some limit to the number of queries
 in flight at any given moment because there is the potential to deadlock
 if both sides of network buffers are filled up and each side is blocked
 waiting on a write.  The JDBC driver has conservatively selected 256 as
 the maximum number of queries to send at once.
 
Hunh. Interesting optimization in the JDBC driver. I gather it is 
sending a string of (;)-separated inserts. Sounds like 
efficient-but-risky stuff we did for ODBC drivers at Simba ... gets 
interesting when one of the insert statements in the middle fails. 
Good to know. Hope that the batch size is parametric, given that you 
can have inserts with rather large strings bound to 'text' columns in 
PG --- harder to identify BLOBs when talking to PG, than when talking 
to MSSQL/Oracle/Sybase. 
 
-- 
Engineers think equations approximate reality. 
Physicists think reality approximates the equations. 
Mathematicians never make the connection. 




Re: [PERFORM] COPY vs INSERT

2005-05-03 Thread Mischa Sandberg
 Steven Rosenstein [EMAIL PROTECTED] writes:
  My question is, are there any advantages, drawbacks, or outright
  restrictions to using multiple simultaneous COPY commands to load
 data into
  the same table?

Do you mean, multiple COPY commands (connections) being putline'd from
the same thread (process)? I have indirect evidence that this may hurt.

Two copy commands from different threads/processes are fine, and can
help, if they alternate contention on some other resource (disk/CPU).

I'm basing this on being at the third generation of a COPY
implementation. The app loads about 1M objects/hour from 6 servers. 
Each object is split across four tables.
The batch load opens four connections and firehoses records down each.
A batch is 10K objects.

COPY invokes all the same logic as INSERT on the server side
(rowexclusive locking, transaction log, updating indexes, rules). 
The difference is that all the rows are inserted as a single
transaction. This reduces the number of fsync's on the xlog,
which may be a limiting factor for you. You'll want to crank 
WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though. 
One of my streams has 6K records; I run with WB=1000, CS=128.

The downside I found with multiple clients inserting large blocks of
rows was that they serialized. I THINK that's because at some point
they all needed to lock the same portions of the same indexes. I'm still
working on how to avoid that, tuning the batch size and inserting into a
queue table with fewer indexes.
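
A sketch of that queue-table variant (table names invented; the batch
move is shown with a table lock standing in for whatever coordination a
real multi-loader setup would use):

-- Index-light landing table: same columns, no indexes, cheap to insert into.
CREATE TABLE fact_queue (LIKE fact_table);

-- Loaders COPY/INSERT into fact_queue; a single mover then drains it:
BEGIN;
LOCK TABLE fact_queue IN EXCLUSIVE MODE;  -- hold new inserts off during the drain
INSERT INTO fact_table SELECT * FROM fact_queue;
DELETE FROM fact_queue;
COMMIT;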

COPY (via putline) didn't do measurably better than INSERT until I
batched 40 newline-separated rows into one putline call, which improved
it 2-3:1. The suspect problem was stalling on the TCP stream; the driver
was flushing small packets. This may or may not be relevant to you;
depends on how much processing (waiting) your app does between posting
of rows.

In such a case, writing alternately to two TCP streams from the same
process increases the likelihood of a stall. I've never tested that
set-up; it would have been heading AWAY from the solution in my case.

Hope that helps.
-- 
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.




Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Mischa Sandberg
Quoting Markus Schaber [EMAIL PROTECTED]:

 Hi, Josh,
 
 Josh Berkus wrote:
 
  Yes, actually.   We need 3 different estimation methods:
  1 for tables where we can sample a large % of pages (say, = 0.1)
  1 for tables where we sample a small % of pages but are easily
 estimated
  1 for tables which are not easily estimated by we can't afford to
 sample a 
  large % of pages.
  
  If we're doing sampling-based estimation, I really don't want
 people to lose 
  sight of the fact that page-based random sampling is much less
 expensive than 
  row-based random sampling.   We should really be focusing on
 methods which 
  are page-based.

Okay, although given the track record of page-based sampling for
n-distinct, it's a bit like looking for your keys under the streetlight,
rather than in the alley where you dropped them :-)

How about applying the distinct-sampling filter on a small extra data
stream to the stats collector? 

-- 
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.




Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Mischa Sandberg
Quoting Josh Berkus josh@agliodbs.com: 
 
 Mischa, 
  
  Okay, although given the track record of page-based sampling for 
  n-distinct, it's a bit like looking for your keys under the 
 streetlight, 
  rather than in the alley where you dropped them :-) 
  
 Bad analogy, but funny. 
 
Bad analogy? Page-sampling effort versus row-sampling effort, c'est 
moot. It's not good enough for stats to produce good behaviour on the 
average. Straight random sampling, page or row, is going to cause 
enough untrustworthy engine behaviour,for any %ages small enough to 
allow sampling from scratch at any time. 
 
I'm curious what the problem is with relying on a start-up plus 
incremental method, when the method in the distinct-sampling paper 
doesn't degenerate: you can start when the table is still empty. 
Constructing an index requires an initial full scan plus incremental 
update; what's the diff? 
 
 Unless, of course, we use indexes for sampling, which seems like a 
 *really  
 good* idea to me  
 
distinct-sampling applies for indexes, too. I started tracking the 
discussion of this a bit late.  Smart method for this is in VLDB'92: 
Gennady Antoshenkov, "Random Sampling from Pseudo-ranked B+-trees". I 
don't think this is online anywhere, except if you have a DBLP 
membership. Does anybody else know better? 
Antoshenkov was the brains behind some of the really cool stuff in DEC 
Rdb (what eventually became Oracle). Compressed bitmap indices, 
parallel competing query plans, and smart handling of keys with 
hyperbolic distributions.  
--  
Engineers think equations approximate reality. 
Physicists think reality approximates the equations. 
Mathematicians never make the connection. 




Re: [PERFORM] Suggestions for a data-warehouse migration routine

2005-04-28 Thread Mischa Sandberg
Quoting Richard Rowell [EMAIL PROTECTED]:

 I've ported enough of my companies database to Postgres to make
 warehousing on PG a real possibility.  I thought I would toss my
 data
 migration architecture ideas out for the list to shoot apart..
 
[...]
Not much feedback required.

Yes, dropping the entire database is faster and simpler.
If your database is small enough that you can rebuild it from scratch
every time, go for it.

Yes, vacuum analyze required; creating indexes alone does not create
statistics.

I'd dump an extract of pg_stat[io_]user_(tables|indexes)
to see how index usage and table load change over time.
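
For instance, a periodic snapshot along these lines (column list trimmed;
pg_stat_user_tables has more):

SELECT now() AS sampled_at, relname,
       seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
FROM   pg_stat_user_tables
ORDER BY relname;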
-- 
Dreams come true, not free. -- S.Sondheim, ITW 




Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Mischa Sandberg
Quoting Josh Berkus josh@agliodbs.com:

  Perhaps I can save you some time (yes, I have a degree in Math). If I
  understand correctly, you're trying extrapolate from the correlation
  between a tiny sample and a larger sample. Introducing the tiny sample
  into any decision can only produce a less accurate result than just
  taking the larger sample on its own; GIGO. Whether they are consistent
  with one another has no relationship to whether the larger sample
  correlates with the whole population. You can think of the tiny sample
  like anecdotal evidence for wonderdrugs.

 Actually, it's more to characterize how large of a sample we need.  For
 example, if we sample 0.005 of disk pages, and get an estimate, and then
 sample another 0.005 of disk pages and get an estimate which is not even
 close to the first estimate, then we have an idea that this is a table
which
 defies analysis based on small samples.   Wheras if the two estimates
are 
 1.0 stdev apart, we can have good confidence that the table is easily
 estimated.  Note that this doesn't require progressively larger
samples; any
 two samples would work.

We're sort of wandering away from the area where words are a good way
to describe the problem. Lacking a common scratchpad to work with,
could I suggest you talk to someone you consider has a background in
stats, and have them draw for you why this doesn't work?

About all you can get out of it is, if the two samples are
disjunct by a stddev, yes, you've demonstrated that the union
of the two populations has a larger stddev than either of them;
but your two stddevs are less info than the stddev of the whole.
Breaking your sample into two (or three, or four, ...) arbitrary pieces
and looking at their stddevs just doesn't tell you any more than what
you start with.

-- 
Dreams come true, not free. -- S.Sondheim, ITW 




Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Mischa Sandberg
Quoting Andrew Dunstan [EMAIL PROTECTED]: 
 
 After some more experimentation, I'm wondering about some sort of
 adaptive algorithm, a bit along the lines suggested by Marko Ristola,
 but limited to 2 rounds.

 The idea would be that we take a sample (either of fixed size, or some
 small proportion of the table), see how well it fits a larger sample
 (say a few times the size of the first sample), and then adjust the
 formula accordingly to project from the larger sample the estimate for
 the full population. Math not worked out yet - I think we want to
 ensure that the result remains bounded by [d,N].
 
Perhaps I can save you some time (yes, I have a degree in Math). If I 
understand correctly, you're trying extrapolate from the correlation 
between a tiny sample and a larger sample. Introducing the tiny sample 
into any decision can only produce a less accurate result than just 
taking the larger sample on its own; GIGO. Whether they are consistent 
with one another has no relationship to whether the larger sample 
correlates with the whole population. You can think of the tiny sample 
like anecdotal evidence for wonderdrugs.  
--  
Dreams come true, not free. -- S.Sondheim, ITW  




Re: [PERFORM] Joel's Performance Issues WAS : Opteron vs Xeon

2005-04-22 Thread Mischa Sandberg
Quoting Alvaro Herrera [EMAIL PROTECTED]:

 One further question is: is this really a meaningful test?  I mean, in
 production are you going to query 30 rows regularly?  And is the
 system always going to be used by only one user?  I guess the question
 is if this big select is representative of the load you expect in
 production.

While there may be some far-out queries that nobody would try,
you might be surprised what becomes the norm for queries,
as soon as the engine feasibly supports them. SQL is used for
warehouse and olap apps, as a data queue, and as the co-ordinator
or bridge for (non-SQL) replication apps. In all of these,
you see large updates, large result sets and volatile tables
(large to me means over 20% of a table and over 1M rows).

To answer your specific question: yes, every 30 mins,
in a data redistribution app that makes a 1M-row query, 
and writes ~1000 individual update files, of overlapping sets of rows. 
It's the kind of operation SQL doesn't do well, 
so you have to rely on one big query to get the data out.

My 2c
-- 
Dreams come true, not free. -- S.Sondheim, ITW




Re: [PERFORM] Index bloat problem?

2005-04-21 Thread Mischa Sandberg
Quoting Bill Chandler [EMAIL PROTECTED]:

 ... The normal activity is to delete 3-5% of the rows per day,
 followed by a VACUUM ANALYZE. 
...
 However, on occasion, deleting 75% of rows is a 
 legitimate action for the client to take.  

  In case nobody else has asked: is your max_fsm_pages
  big enough to handle all the deleted pages, 
  across ALL tables hit by the purge?

 This parameter is most likely set incorrectly.  So
 that could be causing problems.  Could that be a
 culprit for the index bloat, though?

Look at the last few lines of vacuum verbose output.
It will say something like:

free space map: 55 relations, 88416 pages stored; 89184 total pages needed
  Allocated FSM size: 1000 relations + 100 pages = 5920 kB shared memory.

100 here is [max_fsm_pages] from my postgresql.conf.
If the total pages needed is bigger than the pages 
fsm is allocated for, then you are bleeding.
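
On 7.4/8.0 you can pull the current settings straight from the server to
compare against those numbers; a quick check is something like:

   SHOW max_fsm_pages;
   SELECT name, setting FROM pg_settings
   WHERE name IN ('max_fsm_pages', 'max_fsm_relations');
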
-- 
Dreams come true, not free. -- S.Sondheim, ITW


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Slow copy with little CPU/disk usage

2005-04-19 Thread Mischa Sandberg
Quoting Tom Lane [EMAIL PROTECTED]: 
 
 Jim C. Nasby [EMAIL PROTECTED] writes: 
  A friend of mine has an application where he's copying in 4000 rows at a 
  time into a table that has about 4M rows. Each row is 40-50 bytes. This 
  is taking 25 seconds on a dual PIII-1GHz with 1G of RAM and a 2 disk 
  SATA mirror, running FBSD 4.10-stable. There's one index on the table. 
  
 If there's no hidden costs such as foreign key checks, that does seem 
 pretty dang slow. 
  
  What's really odd is that neither the CPU or the disk are being 
  hammered. The box appears to be pretty idle; the postgresql proces is 
  using 4-5% CPU. 
--  
This sounds EXACTLY like my problem, if you make the box a Xeon 2.4GHz with
2GB RAM and two SCSI drives (xlog and base); loading 10K rows of about 200
bytes each takes about 20 secs at best, and much longer at worst.

By any chance does your friend have several client machines/processes
trying to mass-load rows at the same time? Or at least some other processes
updating that table in a bulkish way? What I get is low disk IO, low CPU,
even low context switches ... and I'm betting he should take a look at
pg_locks. For my own problem, I gather that an exclusive lock is necessary
while updating indexes and heap, and the multiple processes doing the
update can make that pathological.
 
Anyway, have your friend check pg_locks. 
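
A minimal sketch of such a check against the pg_locks view (column names as
of 7.4/8.0; adjust for your catalog):

   SELECT l.pid, c.relname, l.mode, l.granted
   FROM pg_locks l
   LEFT JOIN pg_class c ON c.oid = l.relation
   WHERE NOT l.granted;

Any rows returned are the waiters; their mode and relname tell you what
they're queued behind.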
 
 
Dreams come true, not free. -- S.Sondheim, ITW 


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


[PERFORM] Strange serialization problem

2005-04-13 Thread Mischa Sandberg
I have a performance problem; I'd like any suggestions on where to continue
investigation. 

A set of insert-only processes seems to serialize itself. :-(

The processes appear to be blocked on disk IO, and probably the table drive,
rather than the pg_xlog drive.

Each process is inserting a block of 10K rows into a table.
I'm guessing they are serialized because one process by itself takes 15-20
secs; running ten processes in parallel averages 100-150 secs (each), with
elapsed (wall) time  of 150-200 secs. 

Polling pg_locks shows each process has (been granted) only the locks you would
expect. I RARELY see an Exclusive lock on an index, and then only on one index
at a time.

A sample from pg_locks:

TABLE/INDEX   GRANTED  PID   MODE
m_reason      t        7340  AccessShare
message       t        7340  AccessShare
message       t        7340  RowExclusive
pk_message    t        7340  AccessShare
tmp_message   t        7340  AccessShare
(m_reason is a one-row lookup table; see INSERT cmd below).

--
The query plan is quite reasonable (see below).

On a side note, this is the first app I've had to deal with that is sweet to
pg_xlog, but hammers the drive bearing the base table (3x the traffic).

log_executor_stats for a sample insert look reasonable (except the elapsed!)

! system usage stats:
! 308.591728 elapsed 3.48 user 1.27 system sec
! [4.00 user 1.39 sys total]
! 0/0 [0/0] filesystem blocks in/out
! 18212/15 [19002/418] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [0/0] voluntary/involuntary context switches
! buffer usage stats:
! Shared blocks:   9675 read,   8781 written, buffer hit rate = 97.66%
! Local  blocks:504 read, 64 written, buffer hit rate = 0.00%
! Direct blocks:  0 read,  0 written

Summarized ps output for the above backend process, sampled every 5 secs,
shows it is 94% in the 'D' state, 3% in the 'S' state.


== BACKGROUND ==


**SOFTWARE
- PG 7.4.6, RedHat 8.

--
**HARDWARE
Xeon 2x2 2.4GHz 2GB RAM
4 x 73GB SCSI; pg_xlog and base on separate drives.

--
**APPLICATION

Six machines post batches of 10K messages to the PG db server.
Machine #nn generates its ID keys as nn001::bigint etc.

Each process runs:
- COPY tmp_message FROM STDIN loads its own one-use TEMP table.
-  INSERT INTO message 
SELECT tmp.* FROM tmp_message AS tmp
JOIN m_reason ON m_reason.name = tmp.reason
LEFT JOIN message USING (ID) WHERE message.ID is null
  (check required because crash recovery logic requires idempotent insert)
  DROP TABLE tmp_message  --- call me paranoid, this is 7.4

The COPY step time is almost constant when #processes varies from 1 to 10.
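
Spelled out, each batch amounts to roughly the following (a sketch only; the
staging columns besides id/reason, and the m_reason columns, are
placeholders for the real schema):

   -- one-use staging table, filled via COPY
   CREATE TEMP TABLE tmp_message (
       id      bigint,
       reason  text
       -- ... remaining message columns
   );
   COPY tmp_message FROM STDIN;

   -- idempotent insert: map reason name to its id, skip ids already loaded
   INSERT INTO message (id, reason_id /*, ... */)
   SELECT tmp.id, r.reason_id /*, ... */
   FROM tmp_message AS tmp
   JOIN m_reason AS r ON r.name = tmp.reason
   LEFT JOIN message AS m ON m.id = tmp.id
   WHERE m.id IS NULL;

   DROP TABLE tmp_message;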

--
**POSTGRES
pg_autovacuum is running with default parameters.

Non-default GUC values:
checkpoint_segments= 512
default_statistics_target  = 200
effective_cache_size   = 50
log_min_duration_statement = 1000
max_fsm_pages  = 100
max_fsm_relations  = 1000
random_page_cost   = 1
shared_buffers = 1
sort_mem   = 16384
stats_block_level  = true
stats_command_string   = true
stats_row_level= true
vacuum_mem = 65536
wal_buffers= 2000

wal_buffers and checkpoint_segments look outrageous,
but were tuned for another process that posts batches of 1 6KB rows
in a single insert.
--
TABLE/INDEX STATISTICS

--
MACHINE STATISTICS

ps gives the backend process as 98% in (D) state, with 1% CPU.

A top snapshot:
CPU states:  cpuusernice  systemirq  softirq  iowaitidle
   total2.0%0.0%0.8%   0.0% 0.0%   96.9%0.0%
   cpu002.5%0.0%1.9%   0.0% 0.0%   95.4%0.0%
   cpu011.7%0.0%0.1%   0.0% 0.3%   97.6%0.0%
   cpu020.5%0.0%0.7%   0.0% 0.0%   98.6%0.0%
   cpu033.1%0.0%0.5%   0.0% 0.0%   96.2%0.0%
Mem:  2061552k av, 2041752k used,   19800k free,   0k shrd,   21020k buff

iostat reports that the $PGDATA/base drive is being worked but not overworked.
The pg_xlog drive is underworked:

   KBPS   TPS   KBPS   TPS   KBPS   TPS   KBPS   TPS
12:30  1 276316 31 8   3336   269
12:40  5 3   115122  5 5   2705   320
  ^pg_xlog^  ^base^

The base drive has run as much as 10MBPS, 5K TPS.
--
EXPLAIN ANALYZE output:
The plan is eminently reasonable. But there's no visible relationship
between the top-level actual time and the elapsed (wall-clock) time
reported in the executor stats above.

Re: [PERFORM] Normal case or bad query plan?

2004-10-19 Thread Mischa Sandberg
This may sound more elaborate than it's worth, but I don't know of
a better way to avoid a table scan.
You want to index on a computed value that is a common prefix of your
FROM and TO fields.
The next step is to search on a fixed SET of prefixes of different 
lengths. For example, some of your ranges might be common in the first 3 
bytes of ipaddr, some in two, some in only one.

You create an index on one common prefix of either 1, 2 or 3 bytes, for 
each row.

Your query then looks something like (pardon my ignorance in PGSQL)
select  *
fromip2location
where   ip2prefix in (
network(:myaddr || '/8'),
network(:myaddr || '/16'),
network(:myaddr || '/24'),
:myaddr --- assuming single-address ranges are possible
)
and :myaddr between ip_address_from and ip_address_to
Although this looks a little gross, it hits very few records.
It also adapts cleanly to a join between ip2location and a table of
ip addrs.
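
Concretely, the prefix column might be set up as below (a sketch with
hypothetical names: it assumes the from/to addresses are also kept in inet
form, here as inet_from and inet_to, so the built-in network functions
apply):

   ALTER TABLE ip2location ADD COLUMN ip2prefix cidr;

   -- store the longest /24, /16 or /8 prefix shared by both ends of the range
   UPDATE ip2location SET ip2prefix =
     CASE
       WHEN network(set_masklen(inet_from, 24)) = network(set_masklen(inet_to, 24))
         THEN network(set_masklen(inet_from, 24))
       WHEN network(set_masklen(inet_from, 16)) = network(set_masklen(inet_to, 16))
         THEN network(set_masklen(inet_from, 16))
       ELSE network(set_masklen(inet_from, 8))
     END;

   CREATE INDEX ip2location_prefix_idx ON ip2location (ip2prefix);

The probe then tests ip2prefix against the same three prefixes of :myaddr,
much as in the query above.
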
Gabriele Bartolini wrote:
Hi guys,
   please consider this scenario. I have this table:
CREATE TABLE ip2location (
ip_address_from BIGINT NOT NULL,
ip_address_to BIGINT NOT NULL,
id_location BIGINT NOT NULL,
PRIMARY KEY (ip_address_from, ip_address_to)
);
I created a cluster on its primary key, by running:
CLUSTER ip2location_ip_address_from_key ON ip2location;
This allowed me to organise data in a more efficient way: the data that 
is contained are ranges of IP addresses with empty intersections; for 
every IP class there is a related location's ID. The total number of 
entries is 1392443.

For every IP address I have, an application retrieves the corresponding 
location's id from the above table, by running a query like:

SELECT id_location FROM ip2location WHERE '1102111' >= 
ip_address_from AND '1102111' <= ip_address_to;

For instance, by running the 'EXPLAIN ANALYSE' command, I get this 
funny result:

 QUERY PLAN
- 

 Seq Scan on ip2location  (cost=0.00..30490.65 rows=124781 width=8) 
(actual time=5338.120..40237.283 rows=1 loops=1)
   Filter: ((1040878301::bigint >= ip_address_from) AND 
(1040878301::bigint <= ip_address_to))
 Total runtime: 40237.424 ms

With other data, that returns an empty set, I get:
explain SELECT id_location FROM ip2location WHERE '1102111' >= 
ip_address_from AND '1102111' <= ip_address_to;

  QUERY PLAN
--- 

 Index Scan using ip2location_ip_address_from_key on ip2location  
(cost=0.00..419.16 rows=140 width=8)
   Index Cond: ((1102111::bigint >= ip_address_from) AND 
(1102111::bigint <= ip_address_to))

I guess the planner chooses the best of the available options for the 
first case, the sequential scan. This is not confirmed though by the 
fact that, after I ran SET enable_seqscan TO off, I got this:
 
QUERY PLAN
 

 Index Scan using ip2location_ip_address_from_key on ip2location  
(cost=0.00..31505.73 rows=124781 width=8) (actual 
time=2780.172..2780.185 rows=1 loops=1)
   Index Cond: ((1040878301::bigint >= ip_address_from) AND 
(1040878301::bigint <= ip_address_to))
 Total runtime: 2780.359 ms

Is this a normal case or should I worry? What am I missing? Do you have 
any suggestion or comment to do (that would be extremely appreciated)? 
Is the CLUSTER I created worthwhile or not?

Thank you,
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig  IWA/HWG Member, 
ht://Check maintainer
Current Location: Prato, Toscana, Italia
[EMAIL PROTECTED] | http://www.prato.linux.it/~gbartolini | ICQ#129221447
  Leave every hope, ye who enter!, Dante Alighieri, Divine Comedy, 
The Inferno



---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] sequential scan on select distinct

2004-10-11 Thread Mischa Sandberg
Tom Lane wrote:
Ole Langbehn [EMAIL PROTECTED] writes:
What do you think about the idea of an UniqueSort which would do
sort+unique in one pass ? 

This is what oracle does and it is quite fast with it...

Hashing is at least as fast, if not faster.
			regards, tom lane
I got good mileage in a different SQL engine, by combining the 
hash-aggregate and sort nodes into a single operator.
The hash table was just an index into the equivalent of the heap used 
for generating runs. That gave me partially aggregated data,
or eliminated duplicate keys, without extra memory overhead of the 
hash-aggregation node below the sort. Memory was scarce then ... :-)
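
As a practical aside (not from the original thread): on PG of this vintage
only GROUP BY could be planned as a HashAggregate, so a DISTINCT can often
be rewritten to sidestep the sort entirely:

   SELECT DISTINCT col FROM tab;      -- plans as Sort + Unique
   SELECT col FROM tab GROUP BY col;  -- can plan as HashAggregate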

BTW I'm really puzzled that Oracle is pushing 'index skip scan' as a new 
feature. Wasn't this in the original Oracle Rdb --- one of Gennady 
Antoshenkov's tweaks?

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


[PERFORM] Tryint to match Solaris-Oracle performance with directio?

2004-09-18 Thread Mischa Sandberg
Our product (Sophos PureMessage) runs on a Postgres database.
Some of our Solaris customers have Oracle licenses, and they've 
commented on the performance difference between Oracle and Postgresql
on such boxes. In-house, we've noticed the 2:1 (sometimes 5:1)
performance difference in inserting rows (mostly 2-4K), between
Postgresql on Solaris 8 and on Linux, for machines with comparable
CPU's and RAM.

These (big) customers are starting to ask, why don't we just port our 
dataserver to Oracle for them? I'd like to avoid that, if possible :-)

What we can test on in-house are leetle Sun workstations, while some of 
our customers have BIG Sun iron --- so to date I have no means of 
reproducing what their bottleneck is :-( Yes, it has been recommended that 
we talk to Sun about their iForce test lab ... that's in the pipe.

In the meantime, what I gather from browsing mail archives is that 
postgresql on Solaris seems to get hung up on IO rather than CPU.
Furthermore, I notice that Oracle and now MySQL use directio to bypass 
the system cache, when doing heavy writes to the disk; and Postgresql 
does not.

Not wishing to alter backend/store/file for this test, I figured I could 
get a customer to mount the UFS volume for pg_xlog  with the option 
forcedirectio.
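
For concreteness, that amounts to a mount line along these lines (the
device and mount point are placeholders):

   mount -F ufs -o forcedirectio /dev/dsk/c1t1d0s0 /pgdata/pg_xlog

or the equivalent forcedirectio entry in the options column of /etc/vfstab.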

Any comment on this? No consideration of what the wal_sync_method is at 
this point. Presumably it's defaulting to fdatasync on Solaris.

BTW this is Postgres 7.4.1, and our customers are Solaris 8 and 9.
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-15 Thread Mischa Sandberg
Simon Riggs wrote:
Jim C. Nasby
On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote:
PostgreSQL's functionality is in many ways similar to Oracle
Partitioning.
Loading up your data in many similar tables, then creating a view like:
CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
SELECT 200409130800, col1, col2, col3... FROM table200409130800
UNION ALL
SELECT 200409131000, col1, col2, col3... FROM table200409131000
UNION ALL
SELECT 200409131200, col1, col2, col3... FROM table200409131200
...etc...
will allow the PostgreSQL optimizer to eliminate partitions
from the query
when you run queries which include a predicate on the
partitioning_col, e.g.
select count(*) from bigtable where idate = 200409131000
The partitions are just tables, so no need for other management tools.
Oracle treats the partitions as sub-tables, so you need a range of commands
to add, swap etc the partitions of the main table.
A few years ago I wrote a federated query engine (wrapped as an ODBC 
driver) that had to handle thousands of contributors (partitions) to a 
pseudotable / VIEWofUNIONs. Joins did require some special handling in 
the optimizer, because of the huge number of crossproducts between 
different tables. It was definitely worth the effort at the time, 
because you need different strategies for: joining a partition to 
another partition on the same subserver; joining two large partitions on 
different servers; and joining a large partition on one server to a 
small one on another.

The differences may not be so great for a solitary server;
but they're still there, because of disparity in subtable sizes. The 
simplistic query plans tend to let you down, when you're dealing with 
honking warehouses.

I'm guessing that Oracle keeps per-subtable AND cross-all-subtables 
statistics, rather than building the latter from scratch in the course 
of evaluating the query plan. That's the one limitation I see in 
emulating their partitioned tables with Views.

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Mischa Sandberg
Googling 'upsert' (an Oraclism, I believe) will get you hits on Oracle 
and DB2's implementation of MERGE, which amounts to what is described 
below (one mass UPDATE...FROM, one mass INSERT...WHERE NOT EXISTS).

No, you shouldn't iterate row-by-row through the temp table.
Whenever possible, try to do updates in one single (mass) operation.
Doing it that way gives the optimizer the best chance at amortizing
fixed costs, and batching operations.
-
In any database other than Postgres, I would recommend doing the
INSERT /followed by/ the UPDATE. That order looks wonky --- your update
ends up pointlessly operating on the rows just INSERTED. The trick is, 
UPDATE acquires and holds write locks (the rows were previously visible 
to other processes), while INSERT's write locks refer to rows that no 
other process could try to lock.
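
For reference, a minimal sketch of that two-statement form, borrowing the
X/Y/key/col names from the message quoted below and using NOT EXISTS in
place of the outer join:

   UPDATE X
      SET colA = a.colA,
          colB = a.colB
     FROM Y a
    WHERE X.keyA = a.keyA
      AND X.keyB = a.keyB;

   INSERT INTO X (keyA, keyB, colA, colB)
   SELECT a.keyA, a.keyB, a.colA, a.colB
     FROM Y a
    WHERE NOT EXISTS (SELECT 1 FROM X b
                       WHERE b.keyA = a.keyA
                         AND b.keyB = a.keyB);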

Stephen Frost wrote:
* Markus Schaber ([EMAIL PROTECTED]) wrote:
Generally, what is the fastest way for doing bulk processing of 
update-if-primary-key-matches-and-insert-otherwise operations?

This is a very good question, and I havn't seen much of an answer to it
yet.  I'm curious about the answer myself, actually.  In the more recent
SQL specs, from what I understand, this is essentially what the 'MERGE'
command is for.  This was recently added and unfortunately is not yet
supported in Postgres.  Hopefully it will be added soon.
Otherwise, what I've done is basically an update followed by an insert
using outer joins.  If there's something better, I'd love to hear about
it.  The statements looks something like:
update X
  set colA = a.colA,
  colB = a.colB
  from Y a
  where keyA = a.keyA and
keyB = a.keyB;
insert into X
  select a.keyA,
 a.keyB,
 a.colA,
 a.colB
  from Y a left join X b
   using (keyA, keyB)
  where b.keyA is NULL and
b.keyB is NULL;
With the appropriate indexes, this is pretty fast but I think a merge
would be much faster.
Thanks,
			Stephen
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Mischa Sandberg
Mark Cotner wrote:
Hi all,
I had a difficult time deciding which list to post
this to, so please forgive me if this list doesn't
perfectly match my questions.  My decision will not
solely be based on performance, but it is the primary
concern.  I would be very appreciative if you all
could comment on my test plan.  Real world examples of
a Postgres implementation of =600G with a web
front-end would be great, or any data warehouse with
some size to it.
I'm only in the 30GB range of database, in case that's a consideration 
for my comments that follow.

At this time, I'm working out the best ROLAP options for our PG 
transaction store. The transaction store is highly volatile -- longest a 
txn stays in it is 15 days ... so you imagine the need for historic 
summaries :-)

I've also implemented multiple data servers, including
a federated server that had to make the best of existing engines
(like MySQL, PG and everything from MSJet to RedBrick in the commercial 
world).

The time has come to reevaluate/rearchitect an
application which I built about 3 years ago.  There
are no performance concerns with MySQL, but it would
benefit greatly from stored procedures, views, etc. 
If your company is currently happy with MySQL, there probably are other 
(nontechnical) reasons to stick with it. I'm impressed that you'd 
consider reconsidering PG.

Some of the mining that we do could benefit from
stored procedures as well.  MySQL may have these in
the works, but we won't be able to move to a version
of MySQL that supports stored procs for another year
or two.
And PG lets you back-end with some powerful pattern- and 
aggregate-handling languages, like Perl. This was definitely a plus for 
data mining of web traffic, for example. The power of server-side 
extensibility for bailing you out of a design dead-end is not 
inconsequential.

PG doesn't have PIVOT operators (qv Oracle and MSSQL), but it makes the 
translation from rows to columns fairly painless otherwise.

Requirements:
Merge table definition equivalent.  We use these
extensively.
Looked all over mysql.com etc, and afaics merge table
is indeed exactly a view of a union-all. Is that right?
PG supports views, of course, as well (now) as tablespaces, allowing you 
to split tables/tablesets across multiple disk systems.
PG is also pretty efficient in query plans on such views, where (say) 
you make one column a constant (identifier, sort of) per input table.

Merge table equivalent with all tables containing over
100M rows(and about 40 columns, some quite wide) will
need to do index scans in at least 5 seconds(MySQL
currently does 2, but we can live with 5) and return
~200 rows.
PG has TOAST for handling REALLY BIG columns, and the generic TEXT type 
is as efficient as any size-specific VARCHAR() type ... should make 
things easier for you.

Um, gonna sound silly, but the web interface has to
remain snappy under load.  I don't see this as a
major concern since you don't require table locking.
Agreed. It's more in your warehouse design, and intelligent bounding of 
queries. I'd say PG's query analyzer is a few years ahead of MySQL for 
large and complex queries.

If business logic is moved to the database(likely with
Postgres) performance for inserting with light logic
on each insert has to keep up with the 4.5M inserts
per 2 hours(which MySQL completes in ~35min
currently).  Acceptable numbers for this aggregation
would be 45-55min using stored procedures.
Again, it's a matter of pipeline design. The tools for creating an 
efficient pipeline are at least as good in PG as MySQL.

If you try to insert and postprocess information one row at a time,
procedures or no, there's no offhand way to guarantee your performance 
without a test/prototype.

On the other hand, if you do warehouse-style loading (Insert, or PG 
COPY, into a temp table; and then 'upsert' into the perm table), I can 
guarantee 2500 inserts/sec is no problem.

Here's our case study if you're interested . . . 
http://www.mysql.com/customers/customer.php?id=16
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Help with extracting large volumes of records across related

2004-09-13 Thread Mischa Sandberg
Damien Dougan wrote:
Basically we have a number of tables, which are exposed as 2 public
views (say PvA and PvB). For each row in PvA, there are a number of
related rows in PvB (this number is arbitrary, which is one of the
reasons why it cant be expressed as additional columns in PvA - so we
really need 2 sets of tables - which leads to two sets of extract calls
- interwoven to associate PvA with PvB).
Are there any tools/tricks/tips with regards to extracting large volumes
of data across related tables from Postgres? It doesnt have to export
into XML, we can do post-processing on the extracted data as needed -
the important thing is to keep the relationship between PvA and PvB on a
row-by-row basis.
Just recently had to come up with an alternative to MSSQL's SQL..FOR 
XML, for some five-level nested docs, that turned out to be faster (!)
and easier to understand:

Use SQL to organize each of the row types into a single text field, plus 
a single key field, as well as any filter fields you need. Sort the union, 
and have the reading process break them into documents.

For example, if PvA has key (account_id, order_id) and 
fields(order_date, ship_date) and PvB has key (order_id, product_id) and 
fields (order_qty, back_order)

CREATE VIEW PvABxml AS
SELECT  account_id::text || order_id::text AS quay
       ,'order_date=' || order_date::text
        || ' ship_date=' || ship_date::text AS info
       ,ship_date
FROM    PvA
UNION ALL
SELECT  account_id::text || order_id::text || product_id::text
       ,'order_qty=' || order_qty::text
       ,ship_date
FROM    PvA JOIN PvB USING (order_id);
Then:
SELECT quay, info
FROM pvABxml
WHERE ship_date = '...'
ORDER BY quay
gives you a stream of info in the (parent,child,child... 
parent,child,child...) order you want, that assemble very easily into 
XML documents. If you need to pick out, say, orders where there are 
backordered items, you probably need to work with a temp table with 
which to prefilter.

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?

2004-09-10 Thread Mischa Sandberg
Mischa Sandberg wrote:
Coming from the MSSQL world, I'm used to the first step in optimization
being: choose your clustered index and choose it well.
I see that PG has a one-shot CLUSTER command, but doesn't support
continuously-updated clustered indexes.
What I infer from newsgroup browsing is, such an index is impossible,
given the MVCC versioning of records (happy to learn I'm wrong).
I'd be curious to know what other people, who've crossed this same
bridge from MSSQL or Oracle or Sybase to PG, have devised,
faced with the same kind of desired performance gain for retrieving
blocks of rows with the same partial key.
Just to let people know, after trying various options, this looks the 
most promising:

- segment the original table into four tables (call them A,B,C,D)
- all insertions go into A.
- longterm data lives in B.
- primary keys of all requests to delete rows from (B) go into D -- no 
actual deletions are done against B. Deletions against A happen as normal.

- all queries are made against a view: a union of A and B and (not 
exists) D.

- daily merge A,B and (where not exists...) D, into C
- run cluster on C, then swap names on B and C, truncate A and D.
Not rocket science, but it seems to give the payback of normal 
clustering without locking the table for long periods of time. It also 
saves on VACUUM FULL time.
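
In SQL terms the moving parts look roughly like this (a sketch; the 'id'
key column and the index name are placeholders):

   -- what the application actually queries
   CREATE VIEW current_rows AS
       SELECT * FROM A
       UNION ALL
       SELECT b.* FROM B b
       WHERE NOT EXISTS (SELECT 1 FROM D WHERE D.id = b.id);

   -- the daily merge / cluster / swap
   INSERT INTO C
       SELECT * FROM A
       UNION ALL
       SELECT b.* FROM B b
       WHERE NOT EXISTS (SELECT 1 FROM D WHERE D.id = b.id);
   CLUSTER c_pk_idx ON C;
   ALTER TABLE B RENAME TO B_old;
   ALTER TABLE C RENAME TO B;
   ALTER TABLE B_old RENAME TO C;
   -- views bind to table OIDs, so recreate current_rows after the swap
   TRUNCATE TABLE A;
   TRUNCATE TABLE D;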

At present, we're only at 1M rows in B on this. More when I know it.
Advance warning on any gotchas with this approach would be much 
appreciated. Making a complete copy of (B) is a bit of an ouch.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?

2004-08-29 Thread Mischa Sandberg
Ummm ... not quite. In MSSQL/Sybase/Oracle, a clustered index maintains 
its space saturation as part of each update operation. High activity 
does indeed result in less-full pages (typically 60-80% full for tables 
with heavy deletions or rowsize changes). To bring the percentage back 
up, you run DBCC INDEXDEFRAG, which also does what you'd expect of a 
normal file defragmenter -- put related disk pages together on the platter.

But the performance difference is hardly as severe as I gather it can be 
if you neglect to vacuum.

As for SQL Server being a 'single-user database' ... ummm ... no, I 
don't think so. I'm REALLY happy to be shut of the Microsoft world, but 
MSSQL 7/2000/2005 is a serious big DB engine. It also has some serious 
bright heads behind it. They hired Goetz Graefe and Paul (aka Per-Ake) 
Larsen away from academia, and it shows, in the join and aggregate 
processing. I'll be a happy camper if I manage to contribute something
to PG that honks the way their stuff does. Happy to discuss, too.

Josh Berkus wrote:
Bruce,

How do vendors actually implement auto-clustering?  I assume they move
rows around during quiet periods or have lots of empty space in each
value bucket.

That's how SQL Server does it.   In old versions (6.5) you had to manually 
send commands to update the cluster, same as PG.   Also, when you create a 
cluster (or an index or table for that matter) you can manually set an amount 
of space to be held open on each data page for updates.

Also keep in mind that SQL Server, as a single-user database has a much 
easier time with this.  They don't have to hold several versions of an index 
in memory and collapse it into a single version at commit time.

All that being said, we could do a better job of auto-balancing clustered 
tables.   I believe that someone was working on this in Hackers through what 
they called B-Tree Tables.  What happened to that?

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?

2004-08-29 Thread Mischa Sandberg
I think you've probably fingered the kicker of why PG doesn't have this 
kind of clustering already. Hence perhaps the need for other approaches
to the issue (the disk-IO efficiency of reading groups of rows related 
by a common key) that other DB's (with in-place update) address with
synchronous clustering ('heap rebalancing' ?).

Bruce Momjian wrote:
Adi Alurkar wrote:
IIRC it it to reduce the overflow of data or what oracle calls  
chained rows. i.e if a table has variable length columns and 10 rows  
get inserted into a datapage, if this datapage is full and one of the  
variable length field gets updated the row will now overflow into  
another datapage, but if the datapage is created with an appropriate  
amount of free space the updated row will be stored in one single  
datapage.

Agreed.  What I am wondering is with our system where every update gets
a new row, how would this help us?  I know we try to keep an update on
the same row as the original, but is there any significant performance
benefit to doing that which would offset the compaction advantage?
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?

2004-08-29 Thread Mischa Sandberg
This discussion is starting to sound like the split in HEAP memory 
management evolution, into garbage-collecting (e.g. Java) and 
non-garbage-collecting (e.g. C++).

Reclamation by GC's these days has become seriously sophisticated.
CLUSTER resembles the first generation of GC's, which were 
single-big-pass hold-everything-else threads.

Perhaps the latest in incremental GC algorithms would be worth scouting, 
for the next step in PG page management.

Greg Stark wrote:
Bruce Momjian [EMAIL PROTECTED] writes:
but is there any significant performance benefit to doing that which would
offset the compaction advantage?
Just as a side comment. Setting PCTFREE 0 PCTUSED 100 on tables that have no
updates on them has an astonishingly big effect on speed. So the penalty for
leaving some space free really is substantial.
I think the other poster is right. Oracle really needs pctfree because of the
way it handles updates. Postgres doesn't really need as much because it
doesn't try to squeeze the new tuple in the space the old one took up. If it
doesn't fit on the page the worst that happens is it has to store it on some
other page, whereas oracle has to do its strange row chaining thing.
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?

2004-08-29 Thread Mischa Sandberg
Sheer nitpick here...
A B-tree is where the records (data) live at all levels of the tree;
B+ tree is where the records are only at the leaf level.
That's what Knuth calls them, anyway.
Clustered indexes for all known dbs are true B+ trees.
Nonclustered indexes could be B-trees (probably aren't),
since there's no big fanout penalty for storing the little
(heap) row locators everywhere at all levels.
J. Andrew Rogers wrote:
As far as I know, Oracle does it by having a B-Tree organized heap (a
feature introduced around v8 IIRC), basically making the primary key
index and the heap the same physical structure.  
...
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?

2004-08-29 Thread Mischa Sandberg
J. Andrew Rogers wrote:
On Thu, 2004-08-26 at 12:30, Magnus Hagander wrote:
IIRC, SQL Server always creates clustered indexes by default for primary
keys.
That would surprise me actually.
Yaz, it should. It doesn't ALWAYS create a clustered (unique) index for 
primary keys, but clustered is the default if you just specify

CREATE TABLE Foo (col1, ...
,PRIMARY KEY(col1, ...)
)
Saying PRIMARY KEY NONCLUSTERED(...) is how you override the default.

((Weird to be discussing so much MSSQL here))
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] my boss want to migrate to ORACLE

2004-08-05 Thread Mischa Sandberg
Regarding RAID5 in general, you might want to look at http://www.baarf.com

Stephane Tessier [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 I think with your help guys I'll do it!

 I'm working on it!

 I'll work on theses issues:

 we have space for more ram(we use 2 gigs on possibility of 3 gigs)
 iowait is very high 98% -- look like postgresql wait for io access
 raid5 --raid0 if i'm right raid5 use 4 writes(parity,data, etc) for each




---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Inverted-list databases (was: Working on huge RAM based datasets)

2004-07-13 Thread Mischa Sandberg
Andy Ballingall [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]

 On another thread, (not in this mailing list), someone mentioned that
there
 are a class of databases which, rather than caching bits of database file
 (be it in the OS buffer cache or the postmaster workspace), construct a
 well indexed memory representation of the entire data in the postmaster
 workspace (or its equivalent), and this, remaining persistent, allows the
DB
 to service backend queries far quicker than if the postmaster was working
 with the assumption that most of the data was on disk (even if, in
practice,
 large amounts or perhaps even all of it resides in OS cache).

As a historical note, System R (grandaddy of all relational dbs) worked this
way.
And it worked under ridiculous memory constraints by modern standards.

Space-conscious MOLAP databases do this, FWIW.

Sybase 11 bitmap indexes pretty much amount to this, too.

I've built a SQL engine that used bitmap indexes within B-Tree indexes,
making it practical to index every field of every table (the purpose of the
engine).

You can also build special-purpose in-memory representations to test for
existence (of a key), when you expect a lot of failures. Google
superimposed coding e.g.  http://www.dbcsoftware.com/dbcnews/NOV94.TXT



---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Query performance

2004-06-30 Thread Mischa Sandberg



Usually, when you post a request like this, you should provide something
a little more concrete (the CREATE TABLE statement for that table, with
...). Since you didn't, I'll posit something that sounds like what you're
using, and take a stab at your problem.

TABLE Prices (
     stock    VARCHAR(9)
    ,asof     DATE
    ,opening  MONEY
    ,closing  MONEY
    ,PRIMARY KEY (stock, asof)
)

SELECT stock, AVG(ABS(closing - opening) / opening) AS ratio
FROM Prices
GROUP BY stock
ORDER BY ratio DESC LIMIT 10;   -- top 10 best-performing stocks

""Bill"" [EMAIL PROTECTED] wrote in 
message news:[EMAIL PROTECTED]...

  
  Actually, I have some 
  queries that are slow, however I was wondering if you could help me write a 
  query that is rather simple, but I, as a true database novice, can't seem to 
  conjure. So we have stocks, as I have previously said, and I have a huge 
  table which contains all of the opening and closing prices of some stocks from 
  each day. What I like to do, in English, for each stock in each day is 
  find a ratio: abs(closing-opening)/opening. Then I would like to average 
  all of the ratios of each day of each individual stock together to find a 
  final ratio for each stock, then I would like to find the highest average, to 
  find the best performing stock. So what query can I use, and (as is 
  appropriate for this group), how can it be optimized to run the 
  fastest?
  


Re: [PERFORM] Major differences between oracle and postgres performance - what can I do ?

2004-06-29 Thread Mischa Sandberg
Don't know about Oracle, but select-distinct in MSSQL2K will indeed throw
away duplicates, which chops the CPU time. Very easy to see in the graphic
query plan, both in terms of CPU and the number of rows retrieved from a
single-node or nested-loop subtree. Definitely a worthwhile optimization.

Tom Lane [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 Stephen Frost [EMAIL PROTECTED] writes:
  * Tom Lane ([EMAIL PROTECTED]) wrote:
  [... thinks for awhile ...]  It seems possible that they may use sort
  code that knows it is performing a DISTINCT operation and discards
  duplicates on sight.  Given that there are only 534 distinct values,
  the sort would easily stay in memory if that were happening.

  Could this optimization be added to PostgreSQL?  It sounds like a very
  reasonable thing to do.

 That's what I was wondering about too.  But first I'd like to get
 some kind of reading on how effective it would be.  If someone can
 demonstrate that Oracle can do sort-and-drop-dups a lot faster than
 it can do a straight sort of the same amount of input data, that
 would be a strong indication that it's worth doing.  At this point
 we don't know if that's the source of their win or not.

 regards, tom lane

 ---(end of broadcast)---
 TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]




---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[PERFORM] Range query optimization

2004-06-29 Thread Mischa Sandberg
I'm trying to make a (qua-technical, qua-business) case for switching from
MS SQL, and one of the types of query that really doesn't sit well with MS
SQL2K is:

-- All fields integers or equivalent.
-- Table T(k, x: nonkey fields...)
-- Table U(k, a, z: m)-- for each value of (k) a set of non-intersecting
ranges [a,z) that map to (m) values.

select T.*, U.m from T join U on T.k=U.k and T.x >= U.a and T.x < U.z

Typically there are about 1000-2000 U rows per value of (k), about 100K
values of (k) and about 50M
values of T.

By itself, this type of query grinds the CPU to dust. A clustered index on
fields of U (take your pick) barely halves the problem of the loop through
1000-2000 rows of U for each row of T.  Hash join likewise.
The current workaround is a 'manual'  radix index on top of the range table,
but it's something of a hack.
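
Hypothetically, and only for readers on much later PostgreSQL releases
(none of this existed when the above was written): the range types added
in 9.2 express exactly this, with btree_gist supplying the GiST opclass
for the plain integer key column:

   CREATE EXTENSION IF NOT EXISTS btree_gist;

   CREATE INDEX u_k_range_idx
       ON U USING gist (k, int4range(a, z, '[)'));

   SELECT T.*, U.m
     FROM T
     JOIN U ON U.k = T.k
           AND int4range(U.a, U.z, '[)') @> T.x;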

Would the geometric type extensions handle such queries efficiently? I'm not
familiar with applying R-trees to linear range problems.

Dreams come true, not free. -- S.Sondheim



---(end of broadcast)---
TIP 8: explain analyze is your friend