Re: [PERFORM] same plan, add 1 condition, 1900x slower

2005-11-11 Thread Mitch Skinner
On Thu, 2005-11-10 at 12:23 -0500, Tom Lane wrote:
> Apparently, you are using a platform and/or locale in which strcoll() is
> spectacularly, god-awfully slow --- on the order of 10 msec per comparison.

The version with the condition is definitely doing more I/O.  The
version without the condition doesn't read at all.  I strace'd an
explain analyze for each separately, and this is what I ended up with
(the first is with the condition, the second is without):

bash-2.05b$ cut '-d(' -f1 subsourcestrace | sort | uniq -c
   7127 gettimeofday
  75213 _llseek
  1 Process 30227 attached - interrupt to quit
  1 Process 30227 detached
 148671 read
  2 recv
  4 semop
  4 send
bash-2.05b$ cut '-d(' -f1 subsourcestrace-nocond | sort | uniq -c
   9103 gettimeofday
  7 _llseek
  1 Process 30227 attached - interrupt to quit
  1 Process 30227 detached
  2 recv
  4 send

For the moment, all of the rows in the view I'm selecting from satisfy
the condition, so the output of both queries is the same.  The relevant
rows of the underlying tables are probably pretty contiguous (all of the
rows satisfying the condition and the join were inserted at the same
time).  Could it just be the result of a weird physical distribution of
data in the table/index files?  For the fast query, the actual number of
rows is a lot less than the planner expects.

> This is a bit hard to believe but I can't make sense of those numbers
> any other way.  What is the platform exactly, and what database locale
> and encoding are you using?

It's RHEL 3 on x86:
[EMAIL PROTECTED] root]# uname -a
Linux rehoboam 2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005
i686 i686 i386 GNU/Linux

The glibc version is 2.3.2.

statgen=# select current_setting('lc_collate');
 current_setting
-----------------
 en_US.UTF-8

Not sure what's relevant, but here's some more info:
The machine has 4.5GiB of RAM and a 5-disk RAID 5.  It's a dual Xeon
3.2GHz.

           relname            | relpages |  reltuples
------------------------------+----------+-------------
 external_id_map              |   126883 | 1.55625e+07
 external_id_map_primary_key  |    64607 | 1.55625e+07
 subject                      |       31 |        1186
 subject_pkey                 |       19 |        1186

I've attached the output of "select name, setting from pg_settings".

And, in case my original message isn't handy, the explain analyze output
and table/view info is below.

Thanks for taking a look,
Mitch

statgen=> explain analyze select * from subject_source;

                                 QUERY PLAN
----------------------------------------------------------------------------
 Merge Join  (cost=0.00..330.72 rows=1186 width=46) (actual
time=0.051..8.890 rows=1186 loops=1)
   Merge Cond: ("outer".id = "inner".target_id)
   ->  Index Scan using subject_pkey on subject norm  (cost=0.00..63.36
rows=1186 width=28) (actual time=0.022..1.441 rows=1186 loops=1)
   ->  Index Scan using external_id_map_primary_key on external_id_map
eim  (cost=0.00..2485226.70 rows=15562513 width=26) (actual
time=0.016..2.532 rows=2175 loops=1)
 Total runtime: 9.592 ms
(5 rows)

statgen=> explain analyze select * from subject_source where
source='SCH';

                                 QUERY PLAN
----------------------------------------------------------------------------
 Merge Join  (cost=0.00..1147.33 rows=1 width=46) (actual
time=0.054..20258.161 rows=1186 loops=1)
   Merge Cond: ("outer".id = "inner".target_id)
   ->  Index Scan using subject_pkey on subject norm  (cost=0.00..63.36
rows=1186 width=28) (actual time=0.022..1.478 rows=1186 loops=1)
   ->  Index Scan using external_id_map_primary_key on external_id_map
eim  (cost=0.00..2524132.99 rows=2335 width=26) (actual
time=0.022..20252.326 rows=1186 loops=1)
 Filter: (source = 'SCH'::bpchar)
 Total runtime: 20258.922 ms
(6 rows)

statgen=> \d subject_source
 View "public.subject_source"
  Column   | Type  | Modifiers
---+---+---
 id| bigint|
 sex   | integer   |
 parent1   | bigint|
 parent2   | bigint|
 source| character(3)  |
 source_id | character varying(32) |
View definition:
 SELECT norm.id, norm.sex, norm.parent1, norm.parent2, eim.source,
eim.source_id
   FROM subject norm
   JOIN util.external_id_map eim ON norm.id = eim.target_id;

statgen=> \d subject
Table "public.subject"
 Column  |  Type   | Modifiers
-+-+---
 id  | bigint  | not null
 sex | integer |
 parent1 | bigint  |
 parent2 | bigint  |
Indexes:
"subject_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "subject_parent1" FOREIGN KEY (parent1) REFERENCES subject(id)
        DEFERRABLE INITIALLY DEFERRED
    "subject_parent2" FOREIGN KEY (parent2) REFERENCES subject(id)

Re: [PERFORM] same plan, add 1 condition, 1900x slower

2005-11-11 Thread Richard Huxton

Mitch Skinner wrote:


> The version with the condition is definitely doing more I/O.  The
> version without the condition doesn't read at all.

[snip]

>            relname            | relpages |  reltuples
> ------------------------------+----------+-------------
>  external_id_map              |   126883 | 1.55625e+07
>  external_id_map_primary_key  |    64607 | 1.55625e+07
>  subject                      |       31 |        1186
>  subject_pkey                 |       19 |        1186


Does external_id_map really have 15 million rows? If not, try a VACUUM 
FULL on it. Be prepared to give it some time to complete.
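
Something like this (just a sketch; VERBOSE is optional, but it shows how
many dead row versions get cleaned up):

    SELECT count(*) FROM util.external_id_map;   -- is it really ~15.5M live rows?
    VACUUM FULL VERBOSE util.external_id_map;    -- takes an exclusive lock while it runs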


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] same plan, add 1 condition, 1900x slower

2005-11-11 Thread Tom Lane
Mitch Skinner <[EMAIL PROTECTED]> writes:
> On Thu, 2005-11-10 at 12:23 -0500, Tom Lane wrote:
>> Apparently, you are using a platform and/or locale in which strcoll() is
>> spectacularly, god-awfully slow --- on the order of 10 msec per comparison.

> The version with the condition is definitely doing more I/O.  The
> version without the condition doesn't read at all.

That's pretty interesting, but what file(s) is it reading exactly?

It could still be strcoll's fault.  The only plausible explanation
I can think of for strcoll being so slow is if for some reason it were
re-reading the locale definition file every time, instead of setting up
just once.

If it is hitting Postgres files, it'd be interesting to look at exactly
which files and what the distribution of seek offsets is.

regards, tom lane



Re: [PERFORM] same plan, add 1 condition, 1900x slower

2005-11-11 Thread Tom Lane
Richard Huxton  writes:
> Mitch Skinner wrote:
>> The version with the condition is definitely doing more I/O.  The
>> version without the condition doesn't read at all.  

> Does external_id_map really have 15 million rows? If not, try a VACUUM 
> FULL on it. Be prepared to give it some time to complete.

Please don't, actually, until we understand what's going on.

The thing is that the given plan will fetch every row indicated by the
index in both cases, in order to check the row's visibility.  I don't
see how an additional test on a non-indexed column would cause any
additional I/O.  If the value were large enough to be toasted
out-of-line then it could cause toast table accesses ... but we're
speaking of a char(3).

regards, tom lane



Re: [PERFORM] same plan, add 1 condition, 1900x slower

2005-11-11 Thread Mitch Skinner
On Fri, 2005-11-11 at 11:51 +, Richard Huxton wrote:
> Does external_id_map really have 15 million rows? If not, try a VACUUM 
> FULL on it. Be prepared to give it some time to complete.

Thanks for the reply.  It does indeed have that many rows:
statgen=> select count(*) from util.external_id_map ;
  count
----------
 15562513
(1 row)

That table never gets deletions or updates, only insertions and reads.
For fun and base-covering, I'm running a full vacuum now.  Usually
there's just a nightly lazy vacuum.

If it helps, here's some background on what we're doing and why (plus
some stuff at the end about how it relates to Postgres):

We get very similar data from multiple sources, and I want to be able to
combine it all into one schema.  The data from different sources is
similar enough (it's generally constrained by the underlying biology,
e.g., each person has a father and a mother, two versions of each
regular chromosome, etc.) that I think putting it all into one set of
tables makes sense.

Different people in our group use different tools (Python, R, Java), so
instead of integrating at the code level (like a shared class hierarchy)
we use the schema as our shared idea of the data.  This helps make my
analyses comparable to the analyses from my co-workers.  We don't all
want to have to write basic sanity checks in each of our languages, so
we want to be able to have foreign keys in the schema.  Having foreign
keys and multiple data sources means that we have to generate our own
internal identifiers (otherwise we'd expect to have ID collisions from
different sources).  I'd like to be able to have a stable
internal-external ID mapping (this is actually something we spent a lot
of time arguing about), so we have a table that does exactly that.

When we import data, we do a bunch of joins against the external_id_map
table to translate external IDs into internal IDs.  It means that the
external_id_map table gets pretty big and the joins can take a long time
(it takes four hours to import one 11-million row source table into our
canonical schema, because we have to do 5 ID translations per row on
that one), but we don't need to import data too often so it works.  The
main speed concern is that exploratory data analyses are pretty
interactive, and also sometimes you want to run a bunch of analyses in
parallel, and if the queries are slow that can be a bottleneck.
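
For concreteness, the translation step looks roughly like the sketch
below (the staging table and column names are made up for illustration,
not our actual schema):

    INSERT INTO subject (id, sex, parent1, parent2)
    SELECT subj.target_id, stg.sex, p1.target_id, p2.target_id
      FROM staging_subject stg
      -- translate the subject's own external ID into our internal ID
      JOIN util.external_id_map subj
        ON subj.source = 'SCH' AND subj.source_id = stg.subject_ext_id
      -- and do the same translation for each parent reference
      LEFT JOIN util.external_id_map p1
        ON p1.source = 'SCH' AND p1.source_id = stg.parent1_ext_id
      LEFT JOIN util.external_id_map p2
        ON p2.source = 'SCH' AND p2.source_id = stg.parent2_ext_id;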

I'm looking forward to partitioning the external_id_map table with 8.1,
and when Greenplum comes out with their stuff we'll probably take a
look.  If the main Postgres engine had parallel query execution, I'd be
pretty happy.  I also followed the external sort thread with interest,
but I didn't get the impression that there was a very clear consensus
there.

Since some of our sources change over time, and I can't generally expect
them to have timestamps on their data, what we do when we re-import from
a source is delete everything out of the canonical tables from that
source and then re-insert.  It sounds like mass deletions are not such a
common thing to do; I think there was a thread about this recently and
Tom questioned the real-world need to worry about that workload.  I was
thinking that maybe the foreign key integrity checks might be better
done by a join rather than a per-deleted-row trigger queue, but since
all my foreign keys are indexed on both ends it doesn't look like a
bottleneck.  
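
Roughly, the delete side of a re-import is a sketch like this (again
with simplified table names):

    DELETE FROM subject
     WHERE id IN (SELECT target_id
                    FROM util.external_id_map
                   WHERE source = 'SCH');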

Anyway, all that probably has an effect on the data distribution in our
tables and indexes.  I'll report back on the effect of the full vacuum.

Thanks for reading,
Mitch



Re: [PERFORM] same plan, add 1 condition, 1900x slower

2005-11-11 Thread Mitch Skinner
On Fri, 2005-11-11 at 09:17 -0500, Tom Lane wrote:
> Richard Huxton  writes:
> > Does external_id_map really have 15 million rows? If not, try a VACUUM 
> > FULL on it. Be prepared to give it some time to complete.
> 
> Please don't, actually, until we understand what's going on.

Ack, I was in the middle of the vacuum full already when I got this.  I
still have the strace and lsof output from before the vacuum full.  It's
definitely reading Postgres files:

bash-2.05b$ grep '^read' subsourcestrace | cut -d, -f1 | sort |  uniq -c
 100453 read(44
  48218 read(47
bash-2.05b$ grep 'seek' subsourcestrace | cut -d, -f1 | sort |  uniq -c
  1 _llseek(40
  1 _llseek(43
  35421 _llseek(44
  1 _llseek(45
  1 _llseek(46
  39787 _llseek(47
  1 _llseek(48

File handles:
44 - external_id_map
47 - external_id_map_primary_key
40 - subject
43 - subject_pkey
45 - external_id_map_source
46 - external_id_map_source_target_id
48 - external_id_map_source_source_id_unique

As far as the seek offsets go, R doesn't want to do a histogram for me
without using up more RAM than I have.  I put up some files at:
http://arctur.us/pgsql/
They are:
subsourcestrace - the strace output from "select * from subject_source
where source='SCH'"
subsourcestrace-nocond - the strace output from "select * from
subject_source"
subsourcelsof - the lsof output (for mapping from file handles to file
names)
relfilenode.html - for mapping from file names to table/index names (I
think I've gotten all the relevant file handle-table name mappings
above, though)
seekoff-44 - just the beginning seek offsets for the 44 file handle
(external_id_map)
seekoff-47 - just the beginning seek offsets for the 47 file handle
(external_id_map_primary_key)

The vacuum full is still going; I'll let you know if it changes things.

> The thing is that the given plan will fetch every row indicated by the
> index in both cases, in order to check the row's visibility.  I don't
> see how an additional test on a non-indexed column would cause any
> additional I/O.  If the value were large enough to be toasted
> out-of-line then it could cause toast table accesses ... but we're
> speaking of a char(3).

Pardon my ignorance, but do the visibility check and the check of the
condition happen at different stages of execution?  Would it end up
checking the condition for all 15M rows, but only checking visibility
for the 1200 rows that come back from the join?  I guess I'm confused
about what "every row indicated by the index" means in the context of
the join.

Thanks for taking an interest,
Mitch




Re: [PERFORM] same plan, add 1 condition, 1900x slower

2005-11-11 Thread Tom Lane
Mitch Skinner <[EMAIL PROTECTED]> writes:
> On Fri, 2005-11-11 at 09:17 -0500, Tom Lane wrote:
>> Please don't, actually, until we understand what's going on.

> Ack, I was in the middle of the vacuum full already when I got this.

Given what you said about no deletions or updates, the vacuum should
have no effect anyway, so don't panic.

> I put up some files at: http://arctur.us/pgsql/

Great, I'll take a look ...

> Pardon my ignorance, but do the visibility check and the check of the
> condition happen at different stages of execution?  Would it end up
> checking the condition for all 15M rows, but only checking visibility
> for the 1200 rows that come back from the join?

No, the visibility check happens first.  The timing does seem consistent
with the idea that the comparison is being done at all 15M rows, but
your other EXPLAIN shows that only 2K rows are actually retrieved, which
presumably is because the merge doesn't need the rest.  (Merge will stop
scanning either input when it runs out of rows on the other side; so
this sort of plan is very fast if the range of keys on one side is
smaller than the range on the other.  The numbers from the no-comparison
EXPLAIN ANALYZE indicate that that is happening for your case.)  So the
comparison should happen for at most 2K rows too.

regards, tom lane



Re: [PERFORM] same plan, add 1 condition, 1900x slower

2005-11-11 Thread Tom Lane
I wrote:
> No, the visibility check happens first.  The timing does seem consistent
> with the idea that the comparison is being done at all 15M rows, but
> your other EXPLAIN shows that only 2K rows are actually retrieved, which
> presumably is because the merge doesn't need the rest.  (Merge will stop
> scanning either input when it runs out of rows on the other side; so
> this sort of plan is very fast if the range of keys on one side is
> smaller than the range on the other.  The numbers from the no-comparison
> EXPLAIN ANALYZE indicate that that is happening for your case.)  So the
> comparison should happen for at most 2K rows too.

After re-reading your explanation of what you're doing with the data,
I thought of a possible explanation.  Is the "source" value exactly
correlated with the external_id_map primary key?  What could be
happening is this:

1. We can see from the EXPLAIN ANALYZE for the no-comparison case that
the merge join stops after fetching only 2175 rows from external_id_map.
This implies that the subject table joins to the first couple thousand
entries in external_id_map and nothing beyond that.  In particular, the
merge join must have observed that the join key in the 2175'th row (in
index order) of external_id_map was larger than the last (largest) join
key in subject.

2. Let's suppose that source = 'SCH' is false for the 2175'th row of
external_id_map and every one after that.  Then what will happen is that
the index scan will vainly seek through the entire external_id_map,
looking for a row that its filter allows it to return, not knowing that
the merge join has no use for any of those rows.
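
A quick way to check this (just a sketch) is to compare how far into the
index order the 'SCH' rows extend versus the table as a whole:

    -- highest target_id that still carries source = 'SCH'
    SELECT max(target_id) FROM util.external_id_map WHERE source = 'SCH';
    -- highest target_id overall
    SELECT max(target_id) FROM util.external_id_map;

If the first number is tiny compared to the second, the filtered index
scan has to walk essentially the whole index before it can give up.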

If this is the story, and you need to make this sort of query fast,
then what you need to do is incorporate the "source" value into the
external_id_map index key somehow.  Then the index scan would be able to
realize that there is no possibility of finding another row with source
= 'SCH'.  The simplest way is just to make a 2-column index, but I
wonder whether the source isn't actually redundant with the
external_id_map primary key already ...

regards, tom lane



Re: [PERFORM] same plan, add 1 condition, 1900x slower

2005-11-11 Thread Mitchell Skinner
On Fri, 2005-11-11 at 10:53 -0500, Tom Lane wrote:
> After re-reading your explanation of what you're doing with the data,
> I thought of a possible explanation.  Is the "source" value exactly
> correlated with the external_id_map primary key?

Sort of.  In this case, at the beginning of external_id_map, yes, though
further down the table they're not.  For example, if we got new subjects
from 'SCH' at this point, they'd get assigned external_id_map.target_id
(the primary key) values that are totally unrelated to what the current
set are (the values in the external_id_map primary key just come off of
a sequence that we use for everything).

Right now though, since the 'SCH' data came in a contiguous chunk right
at the beginning and hasn't changed or grown since then, the correlation
is pretty exact, I think.  It's true that there are no 'SCH' rows in the
table after the first contiguous set (when I get back to work I'll check
exactly what row that is).  It's interesting that there are these
correlations in the data that didn't exist at all in my mental
model.

> what you need to do is incorporate the "source" value into the
> external_id_map index key somehow.  Then the index scan would be able to
> realize that there is no possibility of finding another row with source
> = 'SCH'.  The simplest way is just to make a 2-column index

I thought that's what I had done with the
external_id_map_source_target_id index:

statgen=> \d util.external_id_map
 Table "util.external_id_map"
  Column   | Type  | Modifiers
---+---+---
 source_id | character varying(32) | not null
 source| character(3)  | not null
 target_id | bigint| not null
Indexes:
"external_id_map_primary_key" PRIMARY KEY, btree (target_id)
"external_id_map_source_source_id_unique" UNIQUE, btree (source,
source_id)
"external_id_map_source" btree (source)
"external_id_map_source_target_id" btree (source, target_id)
Foreign-key constraints:
"external_id_map_source" FOREIGN KEY (source) REFERENCES
util.source(id)

So if I understand your suggestion correctly, we're back to the "why
isn't this query using index foo" FAQ.  For the external_id_map table,
the statistics target for "source" is 200; the other two columns are at
the default level because I didn't think of them as being very
interesting statistics-wise.  I suppose I should probably go ahead and
raise the targets for every column of that table; I expect the planning
time is negligible, and our queries tend to be large data-wise.  Beyond
that, I'm not sure how else to encourage the use of that index.  If I
changed that index to be (target_id, source) would it make a difference?
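
For reference, raising the targets would look roughly like this (200 is
just the value I already used for "source"):

    ALTER TABLE util.external_id_map ALTER COLUMN source_id SET STATISTICS 200;
    ALTER TABLE util.external_id_map ALTER COLUMN target_id SET STATISTICS 200;
    ANALYZE util.external_id_map;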

Thanks for your help,
Mitch 



Re: [PERFORM] same plan, add 1 condition, 1900x slower

2005-11-11 Thread Tom Lane
Mitchell Skinner <[EMAIL PROTECTED]> writes:
> On Fri, 2005-11-11 at 10:53 -0500, Tom Lane wrote:
>> what you need to do is incorporate the "source" value into the
>> external_id_map index key somehow.  Then the index scan would be able to
>> realize that there is no possibility of finding another row with source
>> = 'SCH'.  The simplest way is just to make a 2-column index

> I thought that's what I had done with the
> external_id_map_source_target_id index:
> "external_id_map_source_target_id" btree (source, target_id)

> If I changed that index to be (target_id, source) would it make a difference?

[ fools around with a test case ... ]  Seems like not :-(.  PG is not
bright enough to realize that an index on (source, target_id) can be
used with a mergejoin on target_id, because the index sort order isn't
compatible.  (Given the equality constraint on source, there is an
effective compatibility.  I had thought that 8.1 might be able to
detect this, but it seems not to in a simple test case --- there may be
a bug involved there.  In any case 8.0 definitely won't see it.)  An
index on (target_id, source) would be recognized as mergejoinable, but
that doesn't solve the problem because an index condition on the second
column doesn't provide enough information to know that the scan can stop
early.
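
To spell out the two orderings (a sketch; the index names are just
illustrative):

    -- sorted by source first: its sort order can't drive a merge on target_id
    CREATE INDEX eim_source_target ON util.external_id_map (source, target_id);
    -- merge-compatible on target_id, but a filter on the trailing "source"
    -- column gives the scan no way to know it can stop early
    CREATE INDEX eim_target_source ON util.external_id_map (target_id, source);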

Given your comment that the correlation is accidental, it may be that
there's not too much point in worrying.  The planner is picking this
plan only because it notices the asymmetry in key ranges, and as soon
as some more rows get added with higher-numbered target_ids it will
shift to something else (probably a hash join).

regards, tom lane



[PERFORM] slow queries after ANALYZE

2005-11-11 Thread DW

Hello,

I'm perplexed. I'm trying to find out why some queries are taking a long 
time, and have found that after running analyze,  one particular query 
becomes slow.


This query is based on a view that is based on multiple left outer joins 
to merge data from lots of tables.


If I drop the database and reload it from a dump, the query result is 
instantaneous (less than one second).


But after I run analyze, it then takes much longer to run -- about 10 
seconds, give or take a few depending on the hardware I'm testing it on.
Earlier today, it was taking almost 30 seconds on the actual production 
server -- I restarted the pgsql server and the time got knocked down to 
about 10 seconds -- another thing I don't understand.


I've run the query a number of times before and after running analyze, 
and the problem reproduces every time. I also ran with "explain", and saw 
that the costs go up dramatically after I run analyze.


I'm fairly new to postgresql and not very experienced as a db admin to 
begin with, but it looks like I'm going to have to get smarter about 
this stuff fast, unless it's something the programmers need to deal with 
when constructing their code and queries or designing the databases.


I've already learned that I've committed the cardinal sin of configuring 
my new database server with RAID 5 instead of something more sensible 
for databases like RAID 0+1, but I've been testing and replicating this 
problem on different hardware, so I know the RAID setup is not the 
direct cause of this problem.


Thanks for any info. I can supply more info (like config files, schemas, 
etc.) if you think it might help. But I thought I would just describe the 
problem for starters.


-DW




Re: [PERFORM] slow queries after ANALYZE

2005-11-11 Thread Tom Lane
DW <[EMAIL PROTECTED]> writes:
> I'm perplexed. I'm trying to find out why some queries are taking a long 
> time, and have found that after running analyze,  one particular query 
> becomes slow.

This implies that the planner's default choice of plan (without any
statistics) is better than its choice when informed by statistics.
This is undesirable but not unheard of :-(

It would be interesting to see EXPLAIN ANALYZE results in both cases,
plus the contents of the relevant pg_stats rows.  (BTW, you need not
dump and reload to get back to the virgin state --- just delete the
relevant rows from pg_statistic.)  Also we'd want to know exactly what
PG version this is, and on what sort of platform.
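
Roughly, for one table (a sketch; substitute the real table name):

    DELETE FROM pg_statistic
     WHERE starelid = (SELECT oid FROM pg_class WHERE relname = 'your_table');

That puts the planner back to its no-statistics defaults for that table
until the next ANALYZE.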

You might be able to fix things by increasing the statistics targets or
tweaking planner cost parameters, but it'd be best to investigate before
trying to fix.

regards, tom lane



Re: [PERFORM] slow queries after ANALYZE

2005-11-11 Thread DW

Tom Lane wrote:


> It would be interesting to see EXPLAIN ANALYZE results in both cases,
> plus the contents of the relevant pg_stats rows.  (BTW, you need not
> dump and reload to get back to the virgin state --- just delete the
> relevant rows from pg_statistic.)  Also we'd want to know exactly what
> PG version this is, and on what sort of platform.



Thanks for replying. I've got a message in to my team asking if I need 
to de-identify some of the table names before I go submitting output to 
a public mailing list.


In the meantime, again I'm new to this -- I got pg_stats; which rows are 
 the relevant ones?


Also, I am running postgresql-server-7.4.9 from FreeBSD port (with 
optimized CFLAGS turned on during compiling)


OS: FreeBSD 5.4 p8

Thanks,
DW




Re: [PERFORM] slow queries after ANALYZE

2005-11-11 Thread Tom Lane
DW <[EMAIL PROTECTED]> writes:
> In the meantime, again I'm new to this -- I got pg_stats; which rows are 
>   the relevant ones?

The ones for columns that are mentioned in the problem query.
I don't think you need to worry about columns used only in the SELECT
output list, but anything used in WHERE, GROUP BY, etc is interesting.
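
Something along these lines (just a sketch; fill in your table and the
columns used in the problem query):

    SELECT attname, null_frac, n_distinct, correlation, most_common_vals
      FROM pg_stats
     WHERE tablename = 'your_table'
       AND attname IN ('col_in_where', 'col_in_group_by');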

> Also, I am running postgresql-server-7.4.9 from FreeBSD port (with 
> optimized CFLAGS turned on during compiling)
> OS: FreeBSD 5.4 p8

The hardware environment (particularly disks/filesystems) is relevant
too.

regards, tom lane



Re: [PERFORM] 8.x index insert performance

2005-11-11 Thread Kelly Burkhart
On Thu, 2005-11-10 at 19:13 -0500, Tom Lane wrote:
> Kelly Burkhart <[EMAIL PROTECTED]> writes:
> > ...  A graph showing the performance
> > characteristics is here:
> 
> > 
> 
> I hadn't looked at this chart till just now, but it sure seems to put a
> crimp in my theory that you are running out of room to hold the indexes
> in RAM.  That theory would predict that once you fall over the knee of
> the curve, performance would get steadily worse; instead it gets
> markedly worse and then improves a bit.  And there's another cycle of
> worse-and-better around 80M rows.  I have *no* idea what's up with that.
> Anyone?  Kelly, could there be any patterns in the data that might be
> related?

I modified my original program to insert generated, sequential data.
The following graph shows the results to be flat:



Thus, hardware is sufficient to handle predictably sequential data.
There very well could be a pattern in the data which could affect
things; however, I'm not sure how to identify it in 100K rows out of
100M.

If I could identify a pattern, what could I do about it?  Could I do
some kind of a reversible transform on the data?  Is it better to insert
nearly random values?  Or nearly sequential?


I now have an 8G and a 16G machine I'm loading the data into.  I'll
report back after that's done.

I also want to try eliminating the order_main table, moving fields to
the transition table.  This will reduce the number of index updates
significantly at the cost of some wasted space in the table...

-K



Re: [PERFORM] 8.x index insert performance

2005-11-11 Thread Kevin Grittner
That sure seems to bolster the theory that performance is degrading
because you exhaust the cache space and need to start reading
index pages.  When inserting sequential data, you don't need to
randomly access pages all over the index tree.

-Kevin


>>> Kelly Burkhart <[EMAIL PROTECTED]>  >>>

I modified my original program to insert generated, sequential data.
The following graph shows the results to be flat:






Re: [PERFORM] 8.x index insert performance

2005-11-11 Thread Tom Lane
Kelly Burkhart <[EMAIL PROTECTED]> writes:
> On Thu, 2005-11-10 at 19:13 -0500, Tom Lane wrote:
>> Kelly, could there be any patterns in the data that might be
>> related?

> I modified my original program to insert generated, sequential data.
> The following graph shows the results to be flat:
> 
> Thus, hardware is sufficient to handle predictably sequential data.

Yeah, inserting sequentially increasing data would only ever touch the
right-hand edge of the btree, so memory requirements would be pretty low
and constant.

> There very well could be a pattern in the data which could affect
> things, however, I'm not sure how to identify it in 100K rows out of
> 100M.

I conjecture that the problem areas represent places where the key
sequence is significantly "more random" than it is elsewhere.  Hard
to be more specific than that though.

regards, tom lane
