Re: [PERFORM] 8.x index insert performance

2005-11-14 Thread Kelly Burkhart
On Fri, 2005-11-11 at 18:02 -0500, Tom Lane wrote:
> > There very well could be a pattern in the data which could affect
> > things, however, I'm not sure how to identify it in 100K rows out of
> > 100M.
>
> I conjecture that the problem areas represent places where the key
> sequence is significantly more random than it is elsewhere.  Hard
> to be more specific than that though.

OK, I understand the pattern now.

My two tables hold orders, and order state transitions.  Most orders
have two transitions: creation and termination.  The problem happens
when there is a significant number of orders where termination is
happening a long time after creation, causing order_transition rows with
old ord_id values to be inserted.

This is valid, so I have to figure out a way to accommodate it.

You mentioned playing with checkpointing and bgwriter earlier in this
thread.  I experimented with the bgwriter through the weekend, but I
don't have a good idea what sensible parameter changes are...

Re: checkpointing, currently my checkpoints are happening every 5
minutes (if I turn on fsync, the graph shows checkpoints dramatically).
If I increase the checkpoint_timeout, could that be beneficial?  Or
would I just have more time between larger spikes? 
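For reference, these are the relevant 8.1 settings in postgresql.conf;
the values below are purely illustrative starting points for
experimentation, not recommendations:

    checkpoint_segments    = 64     # default 3
    checkpoint_timeout     = 900    # seconds, default 300
    bgwriter_delay         = 200    # ms between bgwriter rounds (default)
    bgwriter_lru_percent   = 5.0    # default 1.0
    bgwriter_lru_maxpages  = 200    # default 5
    bgwriter_all_percent   = 1.0    # default 0.333
    bgwriter_all_maxpages  = 100    # default 5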

-K



Re: [PERFORM] 8.x index insert performance

2005-11-14 Thread Ron

At 09:43 AM 11/14/2005, Kelly Burkhart wrote:

> On Fri, 2005-11-11 at 18:02 -0500, Tom Lane wrote:
> > > There very well could be a pattern in the data which could affect
> > > things, however, I'm not sure how to identify it in 100K rows out
> > > of 100M.
> >
> > I conjecture that the problem areas represent places where the key
> > sequence is significantly more random than it is elsewhere.  Hard
> > to be more specific than that though.
>
> OK, I understand the pattern now.
>
> My two tables hold orders, and order state transitions.  Most orders
> have two transitions: creation and termination.  The problem happens
> when there is a significant number of orders where termination is
> happening a long time after creation, causing order_transition rows
> with old ord_id values to be inserted.
>
> This is valid, so I have to figure out a way to accommodate it.

Perhaps a small schema change would help?  Instead of having the
order state transitions explicitly listed in the table, why not
create two new tables: one for created orders and one for terminated
orders.  When an order is created, its ord_id goes into the
CreatedOrders table.  When an order is terminated, its ord_id is
added to the TerminatedOrders table and then deleted from the
CreatedOrders table.
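In rough SQL terms, the idea looks something like this (column types
and the ord_id values are illustrative only):

    create table CreatedOrders (
        ord_id varchar(12) primary key
        -- ... other order columns ...
    );

    create table TerminatedOrders (
        ord_id varchar(12) primary key
        -- ... same columns ...
    );

    -- on termination, move the row in a single transaction:
    begin;
    insert into TerminatedOrders
        select * from CreatedOrders where ord_id = 'ABCDE0000001';
    delete from CreatedOrders where ord_id = 'ABCDE0000001';
    commit;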


Downsides to this approach are some extra complexity, and that your
disaster-recovery procedures will have to guarantee that no ord_id
ends up in both the CreatedOrders and TerminatedOrders tables.
Upsides are that the insert problem goes away, and certain kinds of
accounting and inventory reports become easier to create.


Ron





Re: [PERFORM] 8.x index insert performance

2005-11-11 Thread Kelly Burkhart
On Thu, 2005-11-10 at 19:13 -0500, Tom Lane wrote:
> Kelly Burkhart [EMAIL PROTECTED] writes:
> > ...  A graph showing the performance
> > characteristics is here:
> >
> > http://kkcsm.net/pgcpy.jpg
>
> I hadn't looked at this chart till just now, but it sure seems to put a
> crimp in my theory that you are running out of room to hold the indexes
> in RAM.  That theory would predict that once you fall over the knee of
> the curve, performance would get steadily worse; instead it gets
> markedly worse and then improves a bit.  And there's another cycle of
> worse-and-better around 80M rows.  I have *no* idea what's up with that.
> Anyone?  Kelly, could there be any patterns in the data that might be
> related?

I modified my original program to insert generated, sequential data.
The following graph shows the results to be flat:

http://kkcsm.net/pgcpy_2005_1.jpg

Thus, hardware is sufficient to handle predictably sequential data.
There very well could be a pattern in the data which could affect
things, however, I'm not sure how to identify it in 100K rows out of
100M.

If I could identify a pattern, what could I do about it?  Could I do
some kind of a reversible transform on the data?  Is it better to insert
nearly random values?  Or nearly sequential?


I now have an 8G and a 16G machine I'm loading the data into.  I'll
report back after that's done.

I also want to try eliminating the order_main table, moving fields to
the transition table.  This will reduce the number of index updates
significantly at the cost of some wasted space in the table...

-K



Re: [PERFORM] 8.x index insert performance

2005-11-11 Thread Kevin Grittner
That sure seems to bolster the theory that performance is degrading
because you exhaust the cache space and need to start reading
index pages.  When inserting sequential data, you don't need to
randomly access pages all over the index tree.

-Kevin


Kelly Burkhart [EMAIL PROTECTED] wrote:
> I modified my original program to insert generated, sequential data.
> The following graph shows the results to be flat:
>
> http://kkcsm.net/pgcpy_2005_1.jpg




Re: [PERFORM] 8.x index insert performance

2005-11-11 Thread Tom Lane
Kelly Burkhart [EMAIL PROTECTED] writes:
> On Thu, 2005-11-10 at 19:13 -0500, Tom Lane wrote:
> > Kelly, could there be any patterns in the data that might be
> > related?
>
> I modified my original program to insert generated, sequential data.
> The following graph shows the results to be flat:
> http://kkcsm.net/pgcpy_2005_1.jpg
> Thus, hardware is sufficient to handle predictably sequential data.

Yeah, inserting sequentially increasing data would only ever touch the
right-hand edge of the btree, so memory requirements would be pretty low
and constant.

> There very well could be a pattern in the data which could affect
> things, however, I'm not sure how to identify it in 100K rows out of
> 100M.

I conjecture that the problem areas represent places where the key
sequence is significantly more random than it is elsewhere.  Hard
to be more specific than that though.

regards, tom lane



Re: [PERFORM] 8.x index insert performance

2005-11-10 Thread Kelly Burkhart
On Tue, 2005-11-01 at 10:37 -0500, Ron Peacetree wrote:
> I'm surprised that no one seems to have yet suggested the following
> simple experiment:
>
> Increase the RAM from 4GB to 8GB, tune for best performance, and
> repeat your 100M row insert experiment.
>
> Does overall insert performance change?  Does the performance
> drop foo rows in still occur?  Does it occur in ~ the same place?
> Etc.
>
> If the effect does seem to be sensitive to the amount of RAM in the
> server, it might be worth redoing the experiment(s) with 2GB and
> 16GB as well...

Ron,

I would like to try this, however, since I'm sitting about 1000 miles
away from the server, tweaking things is not as simple as one might
hope.  I would also like to understand what is going on before I start
changing things.  If I can't get a satisfactory explanation for what I'm
seeing with current hardware, I'll have memory added and see what
happens.

-K



Re: [PERFORM] 8.x index insert performance

2005-11-10 Thread Kelly Burkhart
Second try... no attachment this time.

I've finally gotten around to profiling the back end.  Here is a more
precise description of what I'm doing:

I am copying data into two tables, order_main and order_transition
(table defs at the end of this post).  The order_transition table has
roughly double the number of rows as the order_main table.

My program is a C program using the libpq copy api which effectively
simulates our real application.  It reads data from two data files, and
appends copy-formatted data into two in-memory buffers.  After 10,000
order_transitions, it copies the order_main data, then the
order_transition data, then commits.  The test program is running on a
different machine than the DB.
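At the psql level, each batch is equivalent to the following sketch
(the copy-formatted rows themselves are elided):

    begin;
    copy order_main from stdin;
    ...order rows...
    \.
    copy order_transition from stdin;
    ...transition rows...
    \.
    commit;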

After each batch it writes a record to stdout with the amount of time it
took to copy and commit the data (time only includes pg time, not the
time it took to build the buffers).  A graph showing the performance
characteristics is here:

http://kkcsm.net/pgcpy.jpg

The horizontal axis is the number of transitions * 10,000 that have been
written.  The vertical axis is time in milliseconds to copy and commit
the data.  The commit time is very consistent up until about 60,000,000
rows, then performance drops and times become much less consistent.

I profiled the backend at three points, on batches 4, 6042 and 6067.
The first is right after start, the second is right before we hit the
wall, and the third is one of the initial slow batches.

I'm including inline the first 20 lines of gprof output for each batch.
Please let me know if this is insufficient.  I'll supply any necessary
further info.

Since this thread is stale, I'll repeat relevant hardware/software
stats:  server is a dual, dual-core opteron with 4GB RAM.  Disk is an
EMC Symmetrix connected via FC.  Data, index, logs on three separate
LUNS.  OS is SuSE Enterprise 9.  Postgres version is 8.1.b4.
shared_buffers=32768, fsync=off.

Thanks in advance for your help.

-K

---
 head -n 20 gprof.txt.4.777.47
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds    seconds     calls  s/call   s/call  name
10.92      0.38       0.38     55027    0.00     0.00  XLogInsert
 6.90      0.62       0.24    702994    0.00     0.00  _bt_compare
 5.46      0.81       0.19         2    0.10     1.64  DoCopy
 4.60      0.97       0.16     16077    0.00     0.00  CopyReadLine
 3.74      1.10       0.13    484243    0.00     0.00  bttextcmp
 2.87      1.20       0.10     93640    0.00     0.00  _bt_binsrch
 2.59      1.29       0.09    484243    0.00     0.00  varstr_cmp
 2.59      1.38       0.09    364292    0.00     0.00  LWLockRelease
 2.30      1.46       0.08    703394    0.00     0.00  FunctionCall2
 2.01      1.53       0.07    138025    0.00     0.00  hash_any
 2.01      1.60       0.07    133176    0.00     0.00  ReadBuffer
 2.01      1.67       0.07    364110    0.00     0.00  LWLockAcquire
 2.01      1.74       0.07    132563    0.00     0.00  PinBuffer
 1.72      1.80       0.06     38950    0.00     0.00  _bt_insertonpg
 1.72      1.86       0.06     38767    0.00     0.00  _bt_mkscankey

---
 head -n 20 gprof.txt.6042.1344.84 
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds    seconds     calls  s/call   s/call  name
 9.67      0.52       0.52     50431    0.00     0.00  XLogInsert
 7.71      0.94       0.42   1045427    0.00     0.00  _bt_compare
 5.95      1.26       0.32    713392    0.00     0.00  bttextcmp
 4.28      1.49       0.23   1045814    0.00     0.00  FunctionCall2
 3.35      1.67       0.18    155756    0.00     0.00  _bt_binsrch
 2.60      1.81       0.14    713392    0.00     0.00  varstr_cmp
 2.60      1.95       0.14    475524    0.00     0.00  LWLockAcquire
 2.60      2.09       0.14    191837    0.00     0.00  ReadBuffer
 2.60      2.23       0.14         2    0.07     2.52  DoCopy
 2.60      2.37       0.14    197393    0.00     0.00  hash_search
 2.60      2.51       0.14    197205    0.00     0.00  hash_any
 2.23      2.63       0.12    190481    0.00     0.00  PinBuffer
 2.04      2.74       0.11    345866    0.00     0.00  AllocSetAlloc
 1.86      2.84       0.10    475788    0.00     0.00  LWLockRelease
 1.86      2.94       0.10     29620    0.00     0.00  pg_localtime

---
 head -n 20 gprof.txt.6067.9883.31 
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds    seconds     calls  s/call   s/call  name
17.17      1.14       1.14     51231    0.00     0.00  XLogInsert
10.82      1.85       0.72   1065556    0.00     0.00  _bt_compare
 4.77      2.17       0.32    158378    0.00     0.00  _bt_binsrch
 3.18      2.38       0.21    202921    0.00     0.00  hash_search
 3.18      2.59       0.21    742891    0.00     0.00  bttextcmp
 2.87      2.78

Re: [PERFORM] 8.x index insert performance

2005-11-10 Thread Tom Lane
Kelly Burkhart [EMAIL PROTECTED] writes:
> I've finally gotten around to profiling the back end.

Thanks for following up.

The sudden appearance of pg_detoast_datum() in the top ten in the third
profile is suspicious.  I wouldn't expect that to get called at all,
really, during a normal COPY IN process.  The only way I can imagine it
getting called is if you have index entries that require toasting, which
seems a bit unlikely to start happening only after 60 million rows.
Is it possible that the index keys are getting longer and longer as your
test run proceeds?
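A quick way to test that hypothesis, assuming the suspect key is the
varchar column discussed earlier in this thread (the column name here
is hypothetical), is to sample the key length at a few points during
the load:

    select max(length(alt_key)) from order_transition;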

Could you send me (off list) the complete gprof output files?

regards, tom lane



Re: [PERFORM] 8.x index insert performance

2005-11-10 Thread Tom Lane
Kelly Burkhart [EMAIL PROTECTED] writes:
> On Thu, 2005-11-10 at 17:18 -0500, Tom Lane wrote:
> > Could you send me (off list) the complete gprof output files?
>
> Sure,

Thanks.  Right offhand I can see no smoking gun here.  The
pg_detoast_datum entry I was worried about seems to be just measurement
noise --- the gprof trace shows that it's called a proportional number
of times in both cases, and it falls through without actually doing
anything in all cases.

The later trace involves a slightly larger amount of time spent
inserting into the indexes, which is what you'd expect as the indexes
get bigger, but it doesn't seem that CPU time per se is the issue.
The just-before-the-cliff trace shows total CPU of 5.38 sec and the
after-the-cliff one shows 6.61 sec.

What I now suspect is happening is that you hit the wall at the point
where the indexes no longer fit into main memory and it starts taking
significant I/O to search and update them.  Have you tried watching
iostat or vmstat output to see if there's a noticeable increase in I/O
at the point where things slow down?  Can you check the physical size of
the indexes at that point, and see if it seems related to your available
RAM?
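For example, a rough size check from pg_class, assuming the default 8K
block size (relpages is only refreshed by VACUUM/ANALYZE, so run one of
those first):

    select relname, relpages * 8192 as approx_bytes
      from pg_class
     where relkind = 'i'
     order by relpages desc;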

If that is the correct explanation, then the only solutions I can see
are (1) buy more RAM or (2) avoid doing incremental index updates;
that is, drop the indexes before bulk load and rebuild them afterwards.
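A sketch of option (2), with a hypothetical index name:

    drop index order_transition_alt_key_idx;
    -- ... bulk load ...
    create index order_transition_alt_key_idx
        on order_transition (alt_key);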

One point to consider is that an index will be randomly accessed only
if its data is being loaded in random order.  If you're loading keys in
sequential order then only the right-hand edge of the index would get
touched, and it wouldn't need much RAM.  So, depending on what order
you're loading data in, the primary key index may not be contributing
to the problem.

regards, tom lane



Re: [PERFORM] 8.x index insert performance

2005-11-10 Thread Tom Lane
Kelly Burkhart [EMAIL PROTECTED] writes:
> ...  A graph showing the performance
> characteristics is here:
>
> http://kkcsm.net/pgcpy.jpg

I hadn't looked at this chart till just now, but it sure seems to put a
crimp in my theory that you are running out of room to hold the indexes
in RAM.  That theory would predict that once you fall over the knee of
the curve, performance would get steadily worse; instead it gets
markedly worse and then improves a bit.  And there's another cycle of
worse-and-better around 80M rows.  I have *no* idea what's up with that.
Anyone?  Kelly, could there be any patterns in the data that might be
related?

The narrow spikes look like they are probably induced by checkpoints.
You could check that by seeing if their spacing changes when you alter
checkpoint_segments and checkpoint_timeout.  It might also be
entertaining to make the bgwriter parameters more aggressive to see
if you can ameliorate the spikes.

regards, tom lane



Re: [PERFORM] 8.x index insert performance

2005-11-02 Thread Greg Stark
Merlin Moncure [EMAIL PROTECTED] writes:

> > select * from sometable where somefield IS NULL won't work because
> > IS is not a normally indexable operator.
>
> Ah, I didn't know that.  So there is no real reason not to exclude null
> values from all your indexes :).  Reading Tom's recent comments
> everything is clear now.

There are other reasons. If you want a query like 

  SELECT * FROM tab ORDER BY col LIMIT 10

to use an index on col then it can't exclude NULLs or else it wouldn't be
useful. (Oracle actually has this problem, you frequently have to add WHERE
col IS NOT NULL in order to let it use an index.)


-- 
greg




Re: [PERFORM] 8.x index insert performance

2005-11-01 Thread Kelly Burkhart
On Mon, 2005-10-31 at 16:18 -0500, Tom Lane wrote:
> Kelly Burkhart [EMAIL PROTECTED] writes:
> > Ha!  So I'm creating an index 98% full of nulls!  Looks like this is
> > easily fixed with partial indexes.
>
> Still, though, it's not immediately clear why you'd be seeing a severe
> dropoff in insert performance after 50M rows.  Even though there are
> lots of nulls, I don't see why they'd behave any worse for insert speed
> than real data.  One would like to think that the insert speed would
> follow a nice O(log N) rule.
>
> Are you doing the inserts all in one transaction, or several?  If
> several, could you get a gprof profile of inserting the same number of
> rows (say a million or so) both before and after the unexpected dropoff
> occurs?

I'm doing the inserts via libpq copy.  Commits are in batches of approx
15000 rows.  I did a run last night after modifying the indexes and saw
the same pattern.  I'm dumping the database now and will modify my test
program to copy data from the dump rather than purely generated data.
Hopefully, this will allow me to reproduce the problem in a way that
takes less time to set up and run.

Tom, I'd be happy to profile the backend at several points in the run if
you think that would be helpful.  What compiler flags should I use?
Current settings in Makefile.global are:

CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Winline
-Wendif-labels -fno-strict-aliasing

Should I change this to:

CFLAGS = -g -pg -Wall ...

Or should I leave the -O2 in?

It may be the weekend by the time I get this done.

-K



Re: [PERFORM] 8.x index insert performance

2005-11-01 Thread Tom Lane
Kelly Burkhart [EMAIL PROTECTED] writes:
> Tom, I'd be happy to profile the backend at several points in the run if
> you think that would be helpful.  What compiler flags should I use?

Add -g -pg and leave the rest alone.  Also, if you're on Linux note that
you need -DLINUX_PROFILE.

> It may be the weekend by the time I get this done.

Well, it's probably too late to think of tweaking 8.1 anyway...

regards, tom lane



Re: [PERFORM] 8.x index insert performance

2005-11-01 Thread Ron Peacetree
I'm surprised that no one seems to have yet suggested the following
simple experiment:

Increase the RAM from 4GB to 8GB, tune for best performance, and
repeat your 100M row insert experiment.

Does overall insert performance change?  Does the performance
drop foo rows in still occur?  Does it occur in ~ the same place?
Etc.
 
If the effect does seem to be sensitive to the amount of RAM in the
server, it might be worth redoing the experiment(s) with 2GB and
16GB as well...

ron

-Original Message-
From: Kelly Burkhart [EMAIL PROTECTED]
Sent: Oct 31, 2005 12:12 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] 8.x index insert performance

Greetings,

We are running some performance tests in which we are attempting to
insert about 100,000,000 rows in a database at a sustained rate.  About
50M rows in, our performance drops dramatically.

This test is with data that we believe to be close to what we will
encounter in production.  However in tests with purely generated,
sequential data, we did not notice this slowdown.  I'm trying to figure
out what patterns in the real data may be causing us problems.

I have log, data and indexes on separate LUNs on an EMC SAN.  Prior to
slowdown, each partition is writing at a consistent rate.  Index
partition is reading at a much lower rate.  At the time of slowdown,
index partition read rate increases, all write rates decrease.  CPU
utilization drops.

The server is doing nothing aside from running the DB.  It is a dual
opteron (dual core, looks like 4 cpus) with 4GB RAM.  shared_buffers =
32768.  fsync = off.  Postgres version is 8.1.b4.  OS is SuSE Enterprise
server 9.

My leading hypothesis is that one indexed column may be leading to our
issue.  The column in question is a varchar(12) column which is non-null
in about 2% of the rows.  The value of this column is 5 characters which
are the same for every row, followed by a 7 character zero filled base
36 integer.  Thus, every value of this field will be exactly 12 bytes
long, and will be substantially the same down to the last bytes.
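For illustration, the values look something like this (the real prefix
differs):

    ABCDE0000001
    ABCDE0000002
    ...
    ABCDE000000Z
    ABCDE0000010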

Could this pattern be pessimal for a postgresql btree index?  I'm
running a test now to see if I can verify, but my runs take quite a long
time...

If this sounds like an unlikely culprit how can I go about tracking down
the issue?

Thanks,

-K



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Merlin Moncure
Kelly wrote:
> We are running some performance tests in which we are attempting to
> insert about 100,000,000 rows in a database at a sustained rate.  About
> 50M rows in, our performance drops dramatically.
>
> This test is with data that we believe to be close to what we will
> encounter in production.  However in tests with purely generated,
> sequential data, we did not notice this slowdown.  I'm trying to figure
> out what patterns in the real data may be causing us problems.
>
> I have log, data and indexes on separate LUNs on an EMC SAN.  Prior to
> slowdown, each partition is writing at a consistent rate.  Index
> partition is reading at a much lower rate.  At the time of slowdown,
> index partition read rate increases, all write rates decrease.  CPU
> utilization drops.
>
> The server is doing nothing aside from running the DB.  It is a dual
> opteron (dual core, looks like 4 cpus) with 4GB RAM.  shared_buffers =
> 32768.  fsync = off.  Postgres version is 8.1.b4.  OS is SuSE
> Enterprise server 9.
>
> My leading hypothesis is that one indexed column may be leading to our
> issue.  The column in question is a varchar(12) column which is
> non-null in about 2% of the rows.  The value of this column is 5
> characters which are the same for every row, followed by a 7 character
> zero filled base 36 integer.  Thus, every value of this field will be
> exactly 12 bytes long, and will be substantially the same down to the
> last bytes.
>
> Could this pattern be pessimal for a postgresql btree index?  I'm
> running a test now to see if I can verify, but my runs take quite a
> long time...
>
> If this sounds like an unlikely culprit how can I go about tracking
> down the issue?

well, can you defer index generation until after loading the set (or use
COPY?)

if that index is causing the problem, you may want to consider setting
up partial index to exclude null values.

One interesting thing to do would be to run your inserting process until
slowdown happens, stop the process, and reindex the table and then
resume it, and see if this helps.
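i.e. something like this, with a hypothetical index name:

    -- with the loading process stopped:
    reindex index order_transition_alt_key_idx;
    -- or rebuild every index on the table:
    reindex table order_transition;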

Merlin






Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Kelly Burkhart
On Mon, 2005-10-31 at 12:32 -0500, Merlin Moncure wrote:
> well, can you defer index generation until after loading the set (or
> use COPY?)

I cannot defer index generation.

We are using the copy API.  Copying 10,000 rows in a batch.

 
> if that index is causing the problem, you may want to consider setting
> up a partial index to exclude null values.

This is a single column index.  I assumed that null column values were
not indexed.  Is my assumption incorrect?

-K



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Claus Guttesen
> We are running some performance tests in which we are attempting to
> insert about 100,000,000 rows in a database at a sustained rate.  About
> 50M rows in, our performance drops dramatically.
>
> This test is with data that we believe to be close to what we will
> encounter in production.  However in tests with purely generated,
> sequential data, we did not notice this slowdown.  I'm trying to figure
> out what patterns in the real data may be causing us problems.
>
> I have log, data and indexes on separate LUNs on an EMC SAN.  Prior to
> slowdown, each partition is writing at a consistent rate.  Index
> partition is reading at a much lower rate.  At the time of slowdown,
> index partition read rate increases, all write rates decrease.  CPU
> utilization drops.

I'm doing some test-inserts (albeit with far fewer records) into
8.0.4 (on FreeBSD 6.0 RC1), and the import time decreased three-fold
when I increased the values mentioned below:

shared_buffers = 8192
commit_delay = 10
commit_siblings = 1000

When I increased shared_buffers the kernel needed minor tweaking.

regards
Claus



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Merlin Moncure
> > if that index is causing the problem, you may want to consider
> > setting up a partial index to exclude null values.
>
> This is a single column index.  I assumed that null column values were
> not indexed.  Is my assumption incorrect?
>
> -K

It turns out it is, or it certainly seems to be.  I didn't know that :).
So partial index will probably not help for null exclusion...

would be interesting to see if you are getting swaps (check pgsql_tmp)
when performance breaks down.  That is an easy fix, bump work_mem.
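e.g. (the 8.1 setting is in KB):

    set work_mem = 65536;   -- 64MB for the current session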

Merlin



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread mark
On Mon, Oct 31, 2005 at 12:32:03PM -0500, Merlin Moncure wrote:
> if that index is causing the problem, you may want to consider setting
> up a partial index to exclude null values.

Hey all.

Pardon my ignorance. :-)

I've been trying to figure out whether null values are indexed or not from
the documentation.  I was under the impression that null values are not
stored in the index.  Occasionally, though, I then see a suggestion such
as the above, that seems to indicate to me that null values *are* stored
in the index, allowing for the 'exclude null values' to have effect?

Which is it? :-)

Thanks,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/




Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Merlin Moncure
> On Mon, Oct 31, 2005 at 12:32:03PM -0500, Merlin Moncure wrote:
> > if that index is causing the problem, you may want to consider
> > setting up a partial index to exclude null values.
>
> Hey all.
>
> Pardon my ignorance. :-)
>
> I've been trying to figure out whether null values are indexed or not
> from the documentation.  I was under the impression that null values
> are not stored in the index.  Occasionally, though, I then see a
> suggestion such as the above, that seems to indicate to me that null
> values *are* stored in the index, allowing for the 'exclude null
> values' to have effect?
>
> Which is it? :-)

I think I'm the ignorant one...do explain on any lookup on an indexed
field where the field value is null and you get a seqscan.

Merlin



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Tom Lane
[EMAIL PROTECTED] writes:
> I've been trying to figure out whether null values are indexed or not
> from the documentation.  I was under the impression that null values
> are not stored in the index.

You're mistaken, at least with regard to btree indexes.

regards, tom lane



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Kelly Burkhart
On Mon, 2005-10-31 at 15:30 -0500, Tom Lane wrote:
> [EMAIL PROTECTED] writes:
> > I've been trying to figure out whether null values are indexed or not
> > from the documentation.  I was under the impression that null values
> > are not stored in the index.
>
> You're mistaken, at least with regard to btree indexes.

Ha!  So I'm creating an index 98% full of nulls!  Looks like this is
easily fixed with partial indexes.
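e.g. something like this (the column name is hypothetical):

    create index order_transition_alt_key_idx
        on order_transition (alt_key)
     where alt_key is not null;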

-K



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Merlin Moncure
> [EMAIL PROTECTED] writes:
> > I've been trying to figure out whether null values are indexed or not
> > from the documentation.  I was under the impression that null values
> > are not stored in the index.
>
> You're mistaken, at least with regard to btree indexes.

hmm. I tried several different ways to filter/extract null values from
an indexed key and got a seq scan every time.  The only way I could
query for/against null values was to convert to bool via function.

However I did try a partial index excluding nulls on a really big table
where only 1% of the values are non-null, and the index size dropped as
expected.

Merlin



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Scott Marlowe
On Mon, 2005-10-31 at 13:13, Merlin Moncure wrote:
> > > if that index is causing the problem, you may want to consider
> > > setting up a partial index to exclude null values.
> >
> > This is a single column index.  I assumed that null column values were
> > not indexed.  Is my assumption incorrect?
> >
> > -K
>
> It turns out it is, or it certainly seems to be.  I didn't know that :).
> So partial index will probably not help for null exclusion...
>
> would be interesting to see if you are getting swaps (check pgsql_tmp)
> when performance breaks down.  That is an easy fix, bump work_mem.

OK, here's the issue in a nutshell.

NULLS, like everything else, are indexed.  HOWEVER, there's no way for
them to be used by a normal query, since =NULL is not a legal
construct.  So, you can't do something like:

select * from sometable where somefield = NULL

because you won't get any answers, since nothing can equal NULL and

select * from sometable where somefield IS NULL won't work because IS is
not a normally indexable operator.

Which is why you can create two indexes on a table to get around this
like so:

create index iname1 on table (field) where field IS NULL

and

create index iname2 on table (field) where field IS NOT NULL

And then the nulls are indexable by IS / IS NOT NULL.



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Merlin Moncure
> select * from sometable where somefield IS NULL won't work because IS
> is not a normally indexable operator.

Ah, I didn't know that.  So there is no real reason not to exclude null
values from all your indexes :).  Reading Tom's recent comments
everything is clear now.

Instead of using your two index approach I prefer to:

create function nullidx(anyelement) returns boolean as
$$ select $1 is null; $$ language sql immutable;

create index t_nullidx on t (nullidx(f)); -- etc
Merlin



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Tom Lane
Merlin Moncure [EMAIL PROTECTED] writes:
> > You're mistaken, at least with regard to btree indexes.
>
> hmm. I tried several different ways to filter/extract null values from
> an indexed key and got a seq scan every time.

I said they were stored, not that you could query against them ;-)
IS NULL isn't considered an indexable operator, mainly because it's
not an operator at all in the strict sense of the word; and our index
access APIs only support querying on indexable operators.

The reason they're stored is that they have to be in order to make
multi-column indexes work right.  I suppose we could special-case
single-column indexes, but we don't.  In any case, it's more likely
that someone would one day get around to making IS NULL an indexable
operator than that we'd insert a special case like that.

regards, tom lane



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread Tom Lane
Kelly Burkhart [EMAIL PROTECTED] writes:
> Ha!  So I'm creating an index 98% full of nulls!  Looks like this is
> easily fixed with partial indexes.

Still, though, it's not immediately clear why you'd be seeing a severe
dropoff in insert performance after 50M rows.  Even though there are
lots of nulls, I don't see why they'd behave any worse for insert speed
than real data.  One would like to think that the insert speed would
follow a nice O(log N) rule.

Are you doing the inserts all in one transaction, or several?  If
several, could you get a gprof profile of inserting the same number of
rows (say a million or so) both before and after the unexpected dropoff
occurs?

regards, tom lane



Re: [PERFORM] 8.x index insert performance

2005-10-31 Thread mark
On Mon, Oct 31, 2005 at 03:27:31PM -0500, Merlin Moncure wrote:
> > On Mon, Oct 31, 2005 at 12:32:03PM -0500, Merlin Moncure wrote:
> > > if that index is causing the problem, you may want to consider
> > > setting up a partial index to exclude null values.
> >
> > Hey all.
> >
> > Pardon my ignorance. :-)
> >
> > I've been trying to figure out whether null values are indexed or not
> > from the documentation.  I was under the impression that null values
> > are not stored in the index.  Occasionally, though, I then see a
> > suggestion such as the above, that seems to indicate to me that null
> > values *are* stored in the index, allowing for the 'exclude null
> > values' to have effect?
> >
> > Which is it? :-)
>
> I think I'm the ignorant one...do explain on any lookup on an indexed
> field where the field value is null and you get a seqscan.

Nahhh... I think the documentation could use a more explicit or obvious
explanation.  Or, I could have checked the source code to see.  In any case,
I expect we aren't the only ones that lacked confidence.

Tom was kind enough to point out that null values are stored. I expect
that the seqscan is used if the null values are not selective enough,
the same as any other value that isn't selective enough.

Now we can both have a little more confidence! :-)

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/

