Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-16 Thread Manfred Koizar
On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs
[EMAIL PROTECTED] wrote:
I enclose a patch for checking out block sampling.

Can't comment on the merits of block sampling and your implementation
thereof.  Just some nitpicking:

|!  * Row Sampling: As of May 2004, we use the Vitter algorithm to create

Linking the use of the Vitter algorithm to May 2004 is not quite
appropriate.  We introduced two stage sampling at that time.
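
(For context: the row-sampling step referred to here is reservoir
sampling.  Below is a minimal sketch of the basic technique, Vitter's
Algorithm R, with invented names; it is not the actual two-stage code
in analyze.c.)

    #include <stdlib.h>

    /*
     * Minimal reservoir sampling (Vitter's Algorithm R): after feeding
     * all values through reservoir_add(), sample[] holds a uniform
     * random sample of up to targrows of them.
     */
    static void
    reservoir_add(int *sample, int targrows, int nseen, int value)
    {
        if (nseen < targrows)
            sample[nseen] = value;      /* still filling the reservoir */
        else
        {
            /* keep the new value with probability targrows / (nseen + 1) */
            int k = rand() % (nseen + 1);

            if (k < targrows)
                sample[k] = value;
        }
    }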

|   * a random sample of targrows rows (or less, if there are less in the
|!  * sample of blocks). In this case, targblocks is always the same as
|!  * targrows, so we always read one row per block.

This is just wrong, unless you add "on average".  Even then it is a
bit misleading, because in most cases we *read* more tuples than we
use.

|   * Although every row has an equal chance of ending up in the final
|   * sample, this sampling method is not perfect: not every possible
|   * sample has an equal chance of being selected.  For large relations
|   * the number of different blocks represented by the sample tends to be
|!  * too small.  In that case, block sampling should be used.

Is the last sentence a fact or personal opinion?

|   * block.  The previous sampling method put too much credence in the row
|   * density near the start of the table.

FYI, "previous" refers to the date mentioned above:
"previous" == before May 2004 == before two stage sampling.

|+ /* Assume that we will have rows at least 64 bytes wide 
|+  * Currently we're very unlikely to overflow availMem here 
|+  */
|+ if ((allocrows * sizeof(HeapTuple)) > (allowedMem >> 4))

This is a funny way of saying
if (allocrows * (sizeof(HeapTuple) + 60) > allowedMem)

It doesn't match the comment above; and it is something different if
the size of a pointer is not four bytes.
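
Spelled out, under that same assumption of a 4-byte pointer:

        allocrows * sizeof(HeapTuple) > allowedMem >> 4      /* as written      */
    <=> allocrows * 4                 > allowedMem / 16      /* sizeof == 4     */
    <=> allocrows * 64                > allowedMem
    <=> allocrows * (sizeof(HeapTuple) + 60) > allowedMem    /* the "64 bytes wide" reading */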

|-  if (bs.m > 0)
|+  if (bs.m > 0 )

Oops.

|+  ereport(DEBUG2,
|+ (errmsg("ANALYZE attr %u sample: n=%u nmultiple=%u f1=%u d=%u", 
|+ stats->tupattnum,samplerows, nmultiple, f1, d)));
                    ^
missing space here and in some more places

I haven't been following the discussion too closely but didn't you say
that a block sampling algorithm somehow compensates for the fact that
the sample is not random?
Servus
 Manfred

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-24 Thread Manfred Koizar
On Thu, 22 Dec 2005 10:40:24 -0500, Tom Lane [EMAIL PROTECTED]
wrote:
If you move items
from one page to the other in the opposite direction from the way the
scan is going, then it will miss those items.

AFAIU the (PG implementation of the) L&Y method is designed to make
scans immune against problems caused by items moving right within the
same page and against page splits, i.e. items moving to a *new* right
sibling.  So making scans work with items moving to an *existing*
right sibling doesn't seem out of reach.

The code following this comment in _bt_restscan
/*
 * The item we're looking for moved right at least one page, so
 * move right.  We are careful here to pin and read-lock the next
 * non-dead page before releasing the current one.  This ensures
 * that a concurrent btbulkdelete scan cannot pass our position
 * --- if it did, it might be able to reach and delete our target
 * item before we can find it again.
 */
looks like it'd work for the case of page merging as well, as long as
we are careful to move items always from left to right.

BTW, if after having locked both pages we find that we have
super-exclusive locks, i.e. nobody else has pins on these pages, we
can reorganize much more aggressively.  It might even be safe to move
items to the left page.  The parent page might need some special
handling, though.

Servus
 Manfred

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Re: Which qsort is used

2005-12-22 Thread Manfred Koizar
On Thu, 22 Dec 2005 08:01:00 +0100, Martijn van Oosterhout
kleptog@svana.org wrote:
But where are you including the cost to check how many cells are
already sorted? That would be O(H), right?

Yes.  I didn't mention it, because H <= N.

 This is where we come back
to the issue that comparisons in PostgreSQL are expensive.

So we agree that we should try to reduce the number of comparisons.
How many comparisons does it take to sort 100,000 items?  1.5 million?

Hmm, what are the chances you have 100,000 unordered items to sort and
that the first 8% will already be in order. ISTM that that probability
will be close enough to zero to not matter...

If the items are totally unordered, the check is so cheap you won't
even notice.  OTOH in Tom's example ...

|What I think is much more probable in the Postgres environment
|is almost-but-not-quite-ordered inputs --- eg, a table that was
|perfectly ordered by key when filled, but some of the tuples have since
|been moved by UPDATEs.

... I'd not be surprised if H is 90% of N.
Servus
 Manfred

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Re: Which qsort is used

2005-12-21 Thread Manfred Koizar
On Sat, 17 Dec 2005 00:03:25 -0500, Tom Lane [EMAIL PROTECTED]
wrote:
I've still got a problem with these checks; I think they are a net
waste of cycles on average.  [...]
 and when they fail, those cycles are entirely wasted;
you have not advanced the state of the sort at all.

How can we make the initial check advance the state of the sort?
One answer might be to exclude the sorted sequence at the start of the
array from the qsort, and merge the two sorted lists as the final
stage of the sort.

Qsorting N elements costs O(N*lnN), so excluding H elements from the
sort reduces the cost by at least O(H*lnN).  The merge step costs O(N)
plus some (<=50%) more memory, unless someone knows a fast in-place
merge.  So depending on the constant factors involved there might be a
usable solution.

I've been playing with some numbers and assuming the constant factors
to be equal for all the O()'s this method starts to pay off at
     H   for N
    20      100
   130     1000
  8000   100000
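
To make the idea concrete, here is a rough sketch on plain ints (with a
scratch buffer and invented names; this is not proposed tuplesort.c
code):

    #include <stdlib.h>
    #include <string.h>

    static int
    cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a;
        int y = *(const int *) b;

        return (x > y) - (x < y);
    }

    /* Sort a[], exploiting an already-sorted prefix of length H. */
    static void
    sort_with_sorted_prefix(int *a, size_t n)
    {
        size_t  h, i, j, k;
        int    *tmp;

        if (n < 2)
            return;

        /* O(H): find the length of the sorted prefix */
        for (h = 1; h < n && a[h - 1] <= a[h]; h++)
            ;
        if (h == n)
            return;             /* input was already sorted */

        /* O((N - H) * log(N - H)): sort only the tail */
        qsort(a + h, n - h, sizeof(int), cmp_int);

        /*
         * O(N) merge of a[0..h-1] and a[h..n-1].  For simplicity this
         * sketch uses a full-size scratch buffer; the <=50% figure above
         * assumes copying only the smaller run.
         */
        tmp = malloc(n * sizeof(int));
        if (tmp == NULL)
        {
            qsort(a, n, sizeof(int), cmp_int);  /* fall back: sort everything */
            return;
        }
        i = 0;
        j = h;
        k = 0;
        while (i < h && j < n)
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < h)
            tmp[k++] = a[i++];
        while (j < n)
            tmp[k++] = a[j++];
        memcpy(a, tmp, n * sizeof(int));
        free(tmp);
    }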
Servus
 Manfred

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Shared locking in slru.c

2005-12-02 Thread Manfred Koizar
On Wed, 30 Nov 2005 13:53:13 -0500, Tom Lane [EMAIL PROTECTED]
wrote:
The way the attached patch attacks this is for the shared-lock access
case to simply set the page's LRU counter to zero, without bumping up
the LRU counters of the other pages as the normal adjustment would do.

If you still plan to do this, you might also want to revert the
micro-optimisation introduced by the original SLRU patch:

| Apart from refactoring I made a little change to SlruRecentlyUsed,
| formerly ClogRecentlyUsed:  It now skips incrementing lru_counts, if
| slotno is already the LRU slot, thus saving a few CPU cycles.

|+#define SlruRecentlyUsed(shared, slotno)  \
|+  do { \
|+  if ((shared)->page_lru_count[slotno] != 0) { \
|+  int iilru; \
|+  for (iilru = 0; iilru < NUM_CLOG_BUFFERS; iilru++) \
|+  (shared)->page_lru_count[iilru]++; \
|+  (shared)->page_lru_count[slotno] = 0; \
|+  } \
|+  } while (0)

Otherwise you could end up with a stable state of several pages having
lru_count == 0.
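
For clarity, the reverted form would be roughly the following (a sketch
reconstructed from the quoted patch, not from what is actually in
slru.c): every access bumps all counters and then zeroes the touched
slot, so at most one page can sit at lru_count == 0 at any time.

    #define SlruRecentlyUsed(shared, slotno)  \
        do { \
            int iilru; \
            for (iilru = 0; iilru < NUM_CLOG_BUFFERS; iilru++) \
                (shared)->page_lru_count[iilru]++; \
            (shared)->page_lru_count[slotno] = 0; \
        } while (0)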
Servus
 Manfred


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Alternative variable length structure

2005-09-09 Thread Manfred Koizar
On Thu, 08 Sep 2005 18:02:44 +0900, ITAGAKI Takahiro
[EMAIL PROTECTED] wrote:
+  * The length of varlena2 is encoded as follows:
+  *
+  * | First | Trailing | Total | Max     |
+  * | byte  | bytes    | bits  | length  |
+  * +-------+----------+-------+---------+
+  * | 0***  |    0     |   7   | 127     |
+  * | 10**  |    1     |  14   | 16K  -1 |
+  * | 110*  |    2     |  21   | 2M   -1 |
+  * | 1110  |    3     |  28   | 256M -1 |
+  * | 1111  |    4     |  32   | 4G   -1 |

With external and compression flags this would look like
   * | ec0*     |    0     |   5   | 31      |
   * | ec10     |    1     |  12   | 4K   -1 |
   * | ec110*** |    2     |  19   | 512K -1 |
   * | ec1110** |    3     |  26   | 64M  -1 |
   * | ec1111-- |    4     |  32   | 4G   -1 |

Only a matter of taste, but I'd just waste one byte for sizes between 4K
and 512K and thus get a shorter table:
   * | ec0*     |    0     |   5   | 31      |
   * | ec10     |    1     |  12   | 4K   -1 |
   * | ec110*** |    3     |  27   | 128M -1 |
   * | ec111--- |    4     |  32   | 4G   -1 |

And you get back that one byte for sizes between 64M and 128M :-)

Another possible tweak:
   * | 0***     |    0     |   7   | 127     |
   * | 1ec0     |    1     |  12   | 4K   -1 |
   * | 1ec10*** |    3     |  27   | 128M -1 |
   * | 1ec11--- |    4     |  32   | 4G   -1 |

I.e. drop the flags for very short strings.  If these flags are needed
for a string of size 32 to 127, then use a two byte header.
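
To make the encodings above concrete, here is a hypothetical decoder for
the first (flag-less) table; the function and all names are invented and
are not part of the proposal:

    #include <stdint.h>

    /*
     * Hypothetical decoder for the flag-less encoding above.  Returns the
     * total data length and sets *hdrlen to the header size in bytes (1..5).
     */
    static uint32_t
    varlena2_length(const uint8_t *p, int *hdrlen)
    {
        uint8_t     b = p[0];
        int         extra;      /* number of trailing length bytes */
        int         i;
        uint32_t    len;

        if ((b & 0x80) == 0)        { extra = 0; len = b & 0x7F; }  /* 0***  7 bits */
        else if ((b & 0x40) == 0)   { extra = 1; len = b & 0x3F; }  /* 10** 14 bits */
        else if ((b & 0x20) == 0)   { extra = 2; len = b & 0x1F; }  /* 110* 21 bits */
        else if ((b & 0x10) == 0)   { extra = 3; len = b & 0x0F; }  /* 1110 28 bits */
        else                        { extra = 4; len = 0; }         /* 1111 32 bits */

        for (i = 1; i <= extra; i++)
            len = (len << 8) | p[i];

        *hdrlen = 1 + extra;
        return len;
    }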

Servus
 Manfred

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Remove xmin and cmin from frozen tuples

2005-09-03 Thread Manfred Koizar
On Fri, 2 Sep 2005 20:41:48 -0400 (EDT), Bruce Momjian
pgman@candle.pha.pa.us wrote:
 Once I had a patch based on 7.4 that stored cmin and cmax in
 backend-local memory.

Interesting idea, but how would you record the cmin/xmin values without
requiring unlimited memory?

That's exactly the reason for not sending it to -patches.  Without
spilling to disk this is just not ready for real life.  The problem is
that -- unlike other data structures that build up during a
transaction, e.g. trigger queues -- cmin/cmax lookup requires random
access, so we'd need some form of tree or hash.  Unfortunately I never
got beyond brainstorming :-(

BTW, is there anything else that'd need spilling to disk during long
transactions?

Servus
 Manfred


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Remove xmin and cmin from frozen tuples

2005-09-02 Thread Manfred Koizar
On Fri, 2 Sep 2005 15:51:15 -0400 (EDT), Bruce Momjian
pgman@candle.pha.pa.us wrote:
   * Merge xmin/xmax/cmin/cmax back into three header fields

And don't forget xvac, please.

 Before subtransactions, there used to be only three fields needed to
 store these four values.

... five values.

 This was possible because only the current
 transaction looks at the cmin/cmax values.

Which is a reason to get rid of cmin/cmax in tuple headers entirely.
Once I had a patch based on 7.4 that stored cmin and cmax in
backend-local memory.  It passed make check and some volume tests, but
I felt it was not ready to be applied without any spill-to-disk
mechanism.  Development stalled when I tried to eliminate xvac as
well, which would have required deep cuts into VACUUM code :-(

Servus
 Manfred


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Must be owner to truncate?

2005-08-24 Thread Manfred Koizar
On Wed, 24 Aug 2005 07:01:00 +0200, Andreas Seltenreich
[EMAIL PROTECTED] wrote:
However, a question arose quickly: According to the standard, revoking
INSERT, UPDATE and DELETE after GRANT ALL PRIVILEGES would leave the
relation read-only, but with the TRUNCATE privilege lying around, this
would no longer be true for PostgreSQL.

I'd say that the TRUNCATE privilege includes DELETE, so that REVOKE
DELETE implicitly revokes TRUNCATE and GRANT TRUNCATE implicitly
grants DELETE.

Servus
 Manfred


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Cost of XLogInsert CRC calculations

2005-05-31 Thread Manfred Koizar
On Tue, 31 May 2005 12:07:53 +0100, Mark Cave-Ayland
[EMAIL PROTECTED] wrote:
Perhaps Manfred can tell us the generator
polynomial that was used to create the lookup tables?

X^32 + X^26 + X^23 + X^22 + X^16 + X^12 + X^11 + X^10 + X^8 + X^7 + X^5 + X^4 + X^2 + X + 1

- http://www.opengroup.org/onlinepubs/009695399/utilities/cksum.html

Or google for 04c11db7.
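
For reference, a generic sketch of building a lookup table from that
polynomial in its MSB-first form (this is not the table-generation code
behind pg_crc.h):

    #include <stdint.h>

    static uint32_t crc_table[256];

    /* Build a CRC-32 lookup table for polynomial 0x04C11DB7 (MSB-first form). */
    static void
    init_crc32_table(void)
    {
        uint32_t    i, c;
        int         j;

        for (i = 0; i < 256; i++)
        {
            c = i << 24;
            for (j = 0; j < 8; j++)
                c = (c & 0x80000000U) ? (c << 1) ^ 0x04C11DB7U : (c << 1);
            crc_table[i] = c;
        }
    }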

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] WAL replay failure after file truncation(?)

2005-05-26 Thread Manfred Koizar
On Wed, 25 May 2005 18:19:19 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
 but it keeps a list (hash table, file, whatever) of those blocks.
 [...]  Is it sufficient to
 remember just the relation and the block number or do we need the
 contents a well?

We don't *have* the contents ... that's exactly why it's panicking ...

I meant the contents of the WAL record, not the original block
contents.  Anyway, I think it's not needed.

Servus
 Manfred


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] WAL replay failure after file truncation(?)

2005-05-25 Thread Manfred Koizar
On Wed, 25 May 2005 11:02:11 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
Plan B is for WAL replay to always be willing to extend the file to
whatever record number is mentioned in the log, even though this
may require inventing the contents of empty pages; we trust that their
contents won't matter because they'll be truncated again later in the
replay sequence.

Another idea: WAL replay does not apply changes to nonexistent blocks,
but it keeps a list (hash table, file, whatever) of those blocks.
When a truncate WAL record is found, all entries for blocks affected
by the truncation are removed from the list.  Is it sufficient to
remember just the relation and the block number or do we need the
contents a well?

If the list is non-empty at the end of WAL replay, this is evidence of
a serious problem (file system corruption or Postgres bug).
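
A bare-bones sketch of that bookkeeping, with invented types and a
fixed-size array; a real implementation would presumably hash on
RelFileNode and spill to disk:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct MissingBlock
    {
        uint32_t    rel_id;     /* stand-in for a RelFileNode */
        uint32_t    blkno;
        bool        valid;
    } MissingBlock;

    #define MAX_MISSING 1024
    static MissingBlock missing[MAX_MISSING];
    static int  nmissing = 0;

    /* A WAL record referenced a block beyond the current end of the file. */
    static void
    remember_missing(uint32_t rel_id, uint32_t blkno)
    {
        if (nmissing < MAX_MISSING)
        {
            missing[nmissing].rel_id = rel_id;
            missing[nmissing].blkno = blkno;
            missing[nmissing].valid = true;
            nmissing++;
        }
    }

    /* A truncate record for rel_id down to nblocks blocks was replayed. */
    static void
    forget_truncated(uint32_t rel_id, uint32_t nblocks)
    {
        int         i;

        for (i = 0; i < nmissing; i++)
            if (missing[i].valid &&
                missing[i].rel_id == rel_id &&
                missing[i].blkno >= nblocks)
                missing[i].valid = false;
    }

    /* At end of replay: any surviving entry indicates corruption or a bug. */
    static bool
    missing_list_empty(void)
    {
        int         i;

        for (i = 0; i < nmissing; i++)
            if (missing[i].valid)
                return false;
        return true;
    }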

Servus
 Manfred


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Cost of XLogInsert CRC calculations

2005-05-25 Thread Manfred Koizar
On Wed, 18 May 2005 13:50:22 +0200, I wrote:
The most important figure is, that at MaxSpeed (/O2) 2x32 is almost
twice as fast as CRC32 while only being marginally slower than CRC32.
                 ^
Silly typo!  That should have been:
The most important figure is, that at MaxSpeed (/O2) 2x32 is almost
twice as fast as CRC64 while only being marginally slower than CRC32.

Servus
 Manfred


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Cost of XLogInsert CRC calculations

2005-05-18 Thread Manfred Koizar
On Wed, 18 May 2005 01:12:26 -0400, Tom Lane [EMAIL PROTECTED] wrote:
 Wait, par for 32-bit CRCs? Or for 64-bit CRCs calculated using 32-bit ints?

Right, the latter.  We haven't actually tried to measure the cost of
plain 32bit CRCs... although I seem to recall that when we originally
decided to use 64bit, someone put up some benchmarks purporting to
show that there wasn't much difference.

That someone wasn't me (I wasn't around here at that time), but I have
done a few tests today on 32 bit Intel with VC6:

Optimization |                  CRC algorithms
Settings     |   32    32a    32b   2x32     64    64a    64b
-------------+------------------------------------------------
Default      |  7.6    7.6    6.2    8.3    9.1    9.2    9.5
MinSize      | 2.96   2.97   2.97   4.76   6.00   5.98   6.31
MaxSpeed     | 2.92   2.92   2.97   3.13   6.32   6.33   6.22

32a and 32b are functionally equivalent variants of CRC32 where the crc
is a plain uint32, not a struct with just one field.  Same for 64a, 64b,
and 64, respectively.

The most important figure is, that at MaxSpeed (/O2) 2x32 is almost
twice as fast as CRC32 while only being marginally slower than CRC32.

In case anybody wants to repeat my tests or find any flaw therein, the
source is attached.

Servus
 Manfred


crctest.tgz
Description: Binary data

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Bitmap scan cost model (was Re: bitmap scans, btree scans, and tid order)

2005-05-18 Thread Manfred Koizar
On Tue, 17 May 2005 22:12:17 -0700, Jeffrey W. Baker [EMAIL PROTECTED]
wrote:
Incrementing random_page_cost from 4 (the default) to 5 causes the
planner to make a better decision.

We have such a low default random_page_cost primarily to mask other
problems in the optimizer, two of which are

. multi-column index correlation

. interpolation between min_IO_cost and max_IO_cost, which approaches
max_IO_cost too fast.
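
The interpolation in question has roughly this shape (a paraphrase of
cost_index(), not a quote): the I/O cost is interpolated on the square
of the index correlation, so it falls toward max_IO_cost quickly as the
correlation drops below 1.

    /* Rough shape of the interpolation; not the actual planner code. */
    static double
    interpolated_io_cost(double min_IO_cost, double max_IO_cost,
                         double correlation)
    {
        double  csquared = correlation * correlation;

        /* correlation = +/-1 gives min_IO_cost, correlation = 0 gives max_IO_cost */
        return max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
    }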

Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Cost of XLogInsert CRC calculations

2005-05-17 Thread Manfred Koizar
On Mon, 16 May 2005 12:35:35 -0400, Tom Lane [EMAIL PROTECTED] wrote:
Anyone want to try it with non-gcc compilers?

MS VC++ 6.0 with various predefined optimizer settings

                           2x32         64
Default (without any /O) 0.828125   0.906250
MinSize (contains /O1)   0.468750   0.593750
MaxSpeed (contains /O2)  0.312500   0.640625

Not that it really matters -- but at least this looks like another hint
that letting the compiler emulate 64 bit operations on non 64 bit
hardware is suboptimal.

Servus
 Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] BTW, if anyone wants to work on it...

2005-05-17 Thread Manfred Koizar
On Tue, 03 May 2005 02:45:09 -0400, Tom Lane [EMAIL PROTECTED] wrote:
 I'm starting to think
it'd be worth setting up a mechanism to handle such changes
automatically.

I've been using this skeleton for quite some time now.  Magnus'
psql ... | while read D
might be more robust than my
for db in `enumdatabases`

Servus
 Manfred
#!/bin/sh
#---
# pg_plug -- painless upgrade
# PG 7.4 standard - PG 7.4 with various enhancements
#
# 2003-10-04  mk  new
# 2003-10-14  mk  addpatchinfo
# 2003-11-01  mk  VIEW pg_patchinfo
# 2004-11-04  mk  renamed VIEW to pg_patches;
# features selectable via command line (-F)
# 2004-12-14  mk  freeze
#---

ALLFEATURES="22 27"

#---

abortscript(){
echo 1>&2
echo "$1" 1>&2
exit 1
}

setconnectable(){
db=$1
allow=$2

$CMD template1 >/dev/null <<EOF
UPDATE pg_database SET datallowconn = $allow WHERE datname = '$db';
EOF
}

freeze(){
db=$1

$CMD $db >/dev/null <<EOF
VACUUM FREEZE;
EOF
}

# This does not work for database names with spaces or other funny
# characters.
enumdatabases(){
echo "SELECT datname FROM pg_database;" \
| $CMD template1 \
| sed -e '/1: datname = /!d' -e 's/.*1: datname = "//' -e 's/".*//'
}

add22indexstat(){
db=$1

echo adding \"pg_indexstat\", ignore error if it already exists
$CMD $db >/dev/null <<EOF
CREATE TABLE pg_catalog.pg_indexstat( \
istindex oid NOT NULL, \
istcorrel float4 NOT NULL \
) WITHOUT OIDS;
CREATE UNIQUE INDEX pg_indexstat_index_index \
ON pg_indexstat(istindex);
EOF
}

add27patchinfo(){
db=$1

echo adding \"pg_patchinfo()\", ignore error if it already exists
$CMD $db >/dev/null <<EOF
COPY pg_proc FROM stdin WITH OIDS;
2981pg_patchinfo11  1   12  f   f   t   t   
i   0   2249show_patchinfo  -   \N
\.
-- copy acl from version()
UPDATE pg_proc SET proacl=(SELECT proacl FROM pg_proc WHERE oid=89) WHERE 
oid=2981;
CREATE VIEW pg_patches AS SELECT * FROM pg_patchinfo() AS p(name text, version 
text, base text, descr text);
EOF
}

#---

CMDNAME=`basename $0`
BINDIR=bin
FEATURES=

while [ $# -gt 0 ]
do
case $1 in
--username|-U)
POSTGRES_SUPERUSERNAME=$2
shift;;
--username=*)
POSTGRES_SUPERUSERNAME=`echo $1 | sed 
's/^--username=//'`
;;
-U*)
POSTGRES_SUPERUSERNAME=`echo $1 | sed 's/^-U//'`
;;
--feature|-F)
FEATURES=$FEATURES $2
shift;;
--feature=*)
F=`echo $1 | sed 's/^--feature=//'`
FEATURES=$FEATURES $F
;;
-F*)
F=`echo $1 | sed 's/^-F//'`
FEATURES=$FEATURES $F
;;
# Data directory. No default, unless the environment
# variable PGDATA is set.
--pgdata|-D)
PGDATA=$2
shift;;
--pgdata=*)
PGDATA=`echo $1 | sed 's/^--pgdata=//'`
;;
-D*)
PGDATA=`echo $1 | sed 's/^-D//'`
;;

-*)
echo $CMDNAME: invalid option: $1
exit 1
;;

# Non-option argument specifies data directory
*)
PGDATA=$1
;;
esac
shift
done

#---

if [ -z "$PGDATA" ]
then
  (
echo "$CMDNAME: no data directory specified"
echo "You must identify the directory where the data for this database system"
echo "resides.  Do this with either the invocation option -D or the"
echo "environment variable PGDATA."
  ) 1>&2
exit 1
fi


#---
PG_OPT="-O"
PG_OPT="$PG_OPT -c search_path=pg_catalog"
PG_OPT="$PG_OPT -c exit_on_error=true"
#PG_OPT="$PG_OPT -c enable_indexstat=false"
PG_OPT="$PG_OPT -D $PGDATA"
CMD="$BINDIR/postgres $PG_OPT"

: ${FEATURES:=$ALLFEATURES}

# Enable connections to template0
setconnectable template0 true || abortscript "cannot enable connections to template0"

for db in `enumdatabases`
do
echo converting database $db ...
for f in $FEATURES
do
case $f in
22)
add22indexstat $db # || abortscript cannot convert 
database $db
   

Re: [HACKERS] pgFoundry

2005-05-17 Thread Manfred Koizar
On Mon, 16 May 2005 20:54:15 -0400 (EDT), Bruce Momjian
pgman@candle.pha.pa.us wrote:
I have modifed the TODO HTML so the completed items are in italics.

Isn't it a bit misleading to have those items on the TODO list at all?
Shouldn't there be a separate list:  DONE for the next release?

Servus
 Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: Learning curves and such (was Re: [HACKERS] pgFoundry)

2005-05-17 Thread Manfred Koizar
On Tue, 17 May 2005 14:45:00 -0300 (ADT), Marc G. Fournier
[EMAIL PROTECTED] wrote:
Also, how many 'bugs' have we seen go through the lists that someone 
hasn't jump'd on and fixed in a couple of days?

Just imagine our marketing crew being able to say: According to our
great bug tracking system NN serious bugs have been reported last year,
98% of which have been fixed within three days.

Servus
 Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Learning curves and such

2005-05-17 Thread Manfred Koizar
On Tue, 17 May 2005 14:29:49 -0700, Josh Berkus josh@agliodbs.com
wrote:
<grin> You're not going to win over many people on *this* list with marketing 
arguments.

Yeah, that's the problem with *my* learning curve ...

Servus
 Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Views, views, views! (long)

2005-05-15 Thread Manfred Koizar
On Wed, 4 May 2005 21:37:40 -0700, Josh Berkus josh@agliodbs.com
wrote:
As stated above, these system views, once incorporated into a pg distribution, 
are likely to be with us *forever*.

I don't think that this is doable.  :-(

You might want to put the system views into a version specific schema,
say pg_views81.  The next PG version will contain a new schema
pg_views82 plus a version of 8.1 views that have been adapted to new
features and catalog structures as far as possible without breaking
compatibility.  Ideally the views in pg_views81 and pg_views82 will
look the same, but most likely there will be some differences.  In PG
8.3 we will have schemas pg_views81, pg_views82, and pg_views83 ...

Obviously it will get harder and harder to maintain older system view
schemas with each new Postgres version.  If in PG 8.7 it becomes clear
that carrying on pg_views81 doesn't make sense any more, you simply
drop it.  By that time tool vendors should have had enough time to
make their tools compatible with pg_views8x, for some x >= 2.

Servus
 Manfred


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Best way to scan on-disk bitmaps

2005-05-15 Thread Manfred Koizar
On Thu, 12 May 2005 17:40:06 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
the planner believes that only
consecutive columns in the index are usable --- that is, if you have
quals for a and c but not for b, it will think that the condition for c
isn't usable with the index.  This is true for btree [...]

It's not difficult to setup a test case where an index is used, but
only with a=42 as an index condition, and c='foo' is applied as a
filter condition.  Now add a redundant condition on b
... AND b BETWEEN minb AND maxb ...
and watch how c='foo' moves into the index condition.

I did this test some time ago and I believe that adding the condition
on b did not change the estimated cost, only the actual execution
time.

Servus
 Manfred


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] We are not following the spec for HAVING without GROUP

2005-03-14 Thread Manfred Koizar
On Fri, 11 Mar 2005 10:37:13 +1300, Mark Kirkwood
[EMAIL PROTECTED] wrote:
Firebird 1.5.1 FreeBSD 5.3
[correct results]

Interbase 6.0:
SQL> create table tab (col integer);
SQL> select 1 from tab having 1=0;
SQL> select 1 from tab having 1=1;




           0    <---  :-)

SQL> insert into tab values(1);
SQL> insert into tab values(2);
SQL> select 1 from tab having 1=0;
SQL> select 1 from tab having 1=1;




   1

SQL>

Servus
 Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Group-count estimation statistics

2005-02-01 Thread Manfred Koizar
On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane [EMAIL PROTECTED] wrote:
Manfred Koizar [EMAIL PROTECTED] writes:
 That's not what I meant.  I tried to say that if we have a GROUP BY
 several columns and one of these columns alone has more than N/10
 distinct values, there's no way to get less than that many groups.

Oh, I see, you want a max calculation in there too.  Seems reasonable.
Any objections?

Yes.  :-(  What I said is only true in the absence of any WHERE clause
(or join).  Otherwise the same cross-column correlation issues you tried
to work around with the N/10 clamping might come back through the
backdoor.  I'm not sure whether coding for such a narrow use case is
worth the trouble.  Forget my idea.

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Group-count estimation statistics

2005-01-31 Thread Manfred Koizar
On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane [EMAIL PROTECTED] wrote:
 we should consider
something like clamp to size of table / 10 instead.

... unless a *single* grouping column is estimated to have more than
N/10 distinct values, which should be easy to check.

Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Refactoring

2005-01-31 Thread Manfred Koizar
On Wed, 19 Jan 2005 18:57:48 +0100, I wrote:
  My first vacuum.c
refactoring patch, rev 1.281 2004-06-08, added these comments in
repair_frag():

/*
 * VACUUM FULL has an exclusive lock on the relation.  So
 * normally no other transaction can have pending INSERTs or
 * DELETEs in this relation.  A tuple is either
 *   (a) a tuple in a system catalog, inserted or deleted by
 *   a not yet committed transaction or
 *   (b) dead (XMIN_INVALID or XMAX_COMMITTED) or
 *   (c) inserted by a committed xact (XMIN_COMMITTED) or
 *   (d) moved by the currently running VACUUM.
 * In case (a) we wouldn't be in repair_frag() at all.
 * In case (b) we cannot be here, because scan_heap() has
 * already marked the item as unused, see continue above.
 * Case (c) is what normally is to be expected.
 * Case (d) is only possible, if a whole tuple chain has been
 * moved while processing this or a higher numbered block.
 */

It turns out that this comment is not quite correct.  It is incomplete.
Case (b) should be: known dead (XMIN_INVALID, or XMAX_COMMITTED and xmax
is visible to all active transactions).

And there is a fifth possibility: (e) deleted (XMAX_COMMITTED) but at
least one active transaction does not see the deleting transaction.

The patch seems to imply that case (e) is a subcase of (b), but
effectively tuples in this state are treated more like (c).

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Group-count estimation statistics

2005-01-31 Thread Manfred Koizar
On Mon, 31 Jan 2005 11:20:31 -0500, Tom Lane [EMAIL PROTECTED] wrote:
Already done that way.
if (relvarcount > 1)
clamp *= 0.1;

That's not what I meant.  I tried to say that if we have a GROUP BY
several columns and one of these columns alone has more than N/10
distinct values, there's no way to get less than that many groups.

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Autotuning Group Commit

2005-01-24 Thread Manfred Koizar
On Fri, 21 Jan 2005 23:52:51 +, Simon Riggs [EMAIL PROTECTED]
wrote:
Currently, we have group commit functionality via GUC parameters
   commit_delay
andcommit_siblings

And since 7.3 we have ganged WAL writes (c.f. the thread starting at
http://archives.postgresql.org/pgsql-hackers/2002-10/msg00331.php) which
IMHO is a better solution to the same problem.  Maybe the code dealing
with commit_xxx parameters should just be removed.

Are you or is anybody else aware of benchmarks showing that group commit
via commit_xxx is still useful?

Servus
 Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Much Ado About COUNT(*)

2005-01-24 Thread Manfred Koizar
On Mon, 24 Jan 2005 08:28:09 -0700, Jonah H. Harris [EMAIL PROTECTED]
wrote:
UPDATE pg_user_table_counts
SET rowcount = rowcount + 1
WHERE schemaname = this_schemaname
AND tablename = TG_RELNAME;

This might work for small single user applications.  You'll have to keep
an eye on dead tuples in pg_user_table_counts though.

But as soon as there are several concurrent transactions doing both
INSERTs and DELETEs, your solution will in the best case serialise
access to test_tbl or it will break down because of deadlocks.

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] ARC patent

2005-01-21 Thread Manfred Koizar
On Fri, 21 Jan 2005 02:31:40 +0200, Hannu Krosing [EMAIL PROTECTED] wrote:
2) Another simple, but nondeterministic, hack would be using randomness,
i.e. 

  2.1) select a random buffer in LR side half (or 30% or 60%) of 
   for replacement. 

  2.2) dont last accessed pages to top of LRU list immediately, 
   just push them uphill some amount, either random, or 
   perhaps 1/2 the way to top at each access.

Sounds good, but how do you find the middle of a linked list?  Or the other
way round:  Given a list element, how do you find out its position in a
linked list?  So the only approach that is easily implementable is

2.3) If a sequential scan hint flag is set, put the buffer into the
 free list at a random position.

Servus
 Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Refactoring

2005-01-19 Thread Manfred Koizar
[Sorry, Neil, for abusing your thread.  Moving this discussion back to
where it belongs.]

On Tue, 18 Jan 2005 13:17:17 -0300, Alvaro Herrera
[EMAIL PROTECTED] wrote:
Hmm.  I think this is a good idea on principle, but what happens in case
a previous vacuum was interrupted?  Is there a possibility that tuples
belonging to that vacuum are still marked MOVED_OFF but are not in
vacpage-offsets, for example?

These bits are handled by an earlier phase of VACUUM, by
HeapTupleSatisfiesVacuum() in scan_heap().  My first vacuum.c
refactoring patch, rev 1.281 2004-06-08, added these comments in
repair_frag():

/*
 * VACUUM FULL has an exclusive lock on the relation.  So
 * normally no other transaction can have pending INSERTs or
 * DELETEs in this relation.  A tuple is either
 *   (a) a tuple in a system catalog, inserted or deleted by
 *   a not yet committed transaction or
 *   (b) dead (XMIN_INVALID or XMAX_COMMITTED) or
 *   (c) inserted by a committed xact (XMIN_COMMITTED) or
 *   (d) moved by the currently running VACUUM.
 * In case (a) we wouldn't be in repair_frag() at all.
 * In case (b) we cannot be here, because scan_heap() has
 * already marked the item as unused, see continue above.
 * Case (c) is what normally is to be expected.
 * Case (d) is only possible, if a whole tuple chain has been
 * moved while processing this or a higher numbered block.
 */

/*
 * There cannot be another concurrently running VACUUM.  If
 * the tuple had been moved in by a previous VACUUM, the
 * visibility check would have set XMIN_COMMITTED.  If the
 * tuple had been moved in by the currently running VACUUM,
 * the loop would have been terminated.  We had
 * elog(ERROR, ...) here, but as we are testing for a
 * can't-happen condition, Assert() seems more appropriate.
 */

/*
 * MOVED_OFF by another VACUUM would have caused the
 * visibility check to set XMIN_COMMITTED or XMIN_INVALID.
 */

This should answer your question, unless the comments are wrong.  (BTW
parts of that patch have been backed out by someone, so you won't find
the second comment in current sources.)

As for the problem whether the two code paths deal with the same set of
tuples, read
http://archives.postgresql.org/pgsql-hackers/2004-06/msg00410.php:

| [...] These assignments are
| preceded either by code that sets the appropriate infomask bits or by
| assertions that the bits are already set appropriately.

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Much Ado About COUNT(*)

2005-01-16 Thread Manfred Koizar
On Thu, 13 Jan 2005 00:39:56 -0500, Tom Lane [EMAIL PROTECTED]
wrote:
A would-be deleter of a tuple would have to go and clear the known
good bits on all the tuple's index entries before it could commit.
This would bring the tuple back into the uncertain status condition
where backends would have to visit the heap to find out what's up.
Eventually the state would become certain again (either dead to
everyone or live to everyone) and one or the other hint bit could be
set again.

Last time we discussed this, didn't we come to the conclusion, that
resetting status bits is not a good idea because of possible race
conditions?

In a previous post you wrote:
| I think we still have one free bit in index tuple headers...

AFAICS we'd need two new bits: "visible to all" and "maybe dead".

Writing the three status bits in the order "visible to all", "maybe
dead", "known dead", a normal index tuple lifecycle would be

  000 -> 100 -> 110 -> 111

In states 000 and 110 the heap tuple has to be read to determine
visibility.

The transitions 000 -> 100 and 110 -> 111 happen as side effects of
index scans.  100 -> 110 has to be done by the deleting transaction.
This is the operation where the additional run time cost lies.

One weakness of this approach is that once the index tuple state is
110 but the deleting transaction is aborted there is no easy way to
reset the maybe deleted bit.  So we'd have to consult the heap for
the rest of the tuple's lifetime.
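
Sketched with invented flag names (these are not existing index-tuple
flags):

    #define XIDX_VISIBLE_TO_ALL     0x4     /* "visible to all" */
    #define XIDX_MAYBE_DEAD         0x2     /* "maybe dead"     */
    #define XIDX_KNOWN_DEAD         0x1     /* "known dead"     */

    /*
     * Normal lifecycle, written as (visible, maybe-dead, known-dead):
     *
     *   000 -> 100   side effect of an index scan that finds the heap
     *                tuple visible to everyone
     *   100 -> 110   done by the deleting transaction (the extra cost)
     *   110 -> 111   side effect of an index scan that finds the heap
     *                tuple dead to everyone
     *
     * In states 000 and 110 the heap still has to be visited; in 100
     * and 111 the index alone is sufficient.  If the deleter aborts in
     * state 110, nothing ever clears the maybe-dead bit again.
     */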

Servus
 Manfred


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] Shared row locking

2004-12-30 Thread Manfred Koizar
On Wed, 29 Dec 2004 19:57:15 -0500, Tom Lane [EMAIL PROTECTED] wrote:
Manfred Koizar [EMAIL PROTECTED] writes:
 I don't see too much of a difference between #1 (an on-disk structure
 buffered in shared memory) and #2 (a shared memory structure spilling
 to disk).

If you stand back that far, maybe you can't see a difference ;-) ...

Well, I tried to step back a bit to see the whole picture.  You think I
went too far this time?  :-)

but the proposal on the table was for an extremely special-purpose
on-disk structure.  I'd prefer to see first if we can solve a more
general problem, namely the fixed size of the existing lock table.

I haven't been digging into the code for some time (:-() -- but isn't
this basically a key/value mapping with random access?  My concern is
that as soon as you start to push entries out of the memory structure
you have to implement some kind of on-disk structure to support fast
lookups, which sooner or later might lead to something that looks
suspiciously like the existing hash or btree code.

So the question is whether starting by making nbtree more flexible isn't
the lower hanging fruit...

Servus
 Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Shared row locking

2004-12-30 Thread Manfred Koizar
On Thu, 30 Dec 2004 13:36:53 -0500, Tom Lane [EMAIL PROTECTED] wrote:
Certainly not; indexes depend on locks, not vice versa.  You'd not be
able to do that without introducing an infinite recursion into the
system design.

Wouldn't you have to face the same sort of problems if you spill part of
the lock table to disk?  While you do I/O you have to hold some lock.
In either case there has to be a special class of locks that are pinned
in memory.

  In any case nbtree is much more heavyweight than we need
for this

Having funcionality we don't need is not a showstopper ... unless
heavyweight implies slow, which I have to admit may well be the case.

Servus
 Manfred

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Shared row locking

2004-12-29 Thread Manfred Koizar
On Thu, 16 Dec 2004 21:54:14 -0300, Alvaro Herrera
[EMAIL PROTECTED] wrote:
  Else, it will have to wait, using XactLockTableWait, for
the first transaction in the array that is still running.  We can be
sure that no one will try to share-lock the tuple while we check the
btree because we hold an exclusive lock on the tuple's heap page's
buffer.

Do you really want to XactLockTableWait while holding an exclusive
lock on a heap page?  Or are you going to release the lock before
entering the wait?

Thanks to tablespaces, it's very easy to create special Relations that
can be dealt with by standard buffer and storage manager, etc.

I don't get how tablespaces would help.

The btree idea:
- does not need crash recovery.  Maybe we could use a stripped down
  version of nbtree.  This could cause a maintanibility nightmare.

It could be a nightmare if you simply duplicate and then modify the
code.  A better way would be to refactor nbtree to be able to handle
btrees with different properties:

. shared/private

. logged/non-logged

. flexible value data type.

- can't hold more than 300 or so simultaneous lockers (because of value
  length, limited to 1/3 of a page).  I doubt this is a real problem.

Or you store each lock as a separate index tuple.  If this is expected
to be a problem because of many repeating key values, nbtree should be
enhanced to support duplicate key compression, which might be a good
thing per se.

Servus
 Manfred


---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Shared row locking

2004-12-29 Thread Manfred Koizar
On Mon, 20 Dec 2004 21:44:01 +0100, [EMAIL PROTECTED] wrote:
Tom Lane [EMAIL PROTECTED] wrote on 20.12.2004, 19:34:21:
 #1 could have a pretty serious performance impact, too.  For small
 numbers of FOR UPDATE locks (too few to force spill to disk) I would
 expect #2 to substantially beat #1.  #1 essentially imposes the worst
 case performance at all times, whereas #2 degrades (at a currently
 unknown rate) when there are lots and lots of FOR UPDATE locks.

I don't see too much of a difference between #1 (an on-disk structure
buffered in shared memory) and #2 (a shared memory structure spilling
to disk).

As long as we can avoid unnecessary writes, #1 has the nice property
that it automatically adapts to different usage patterns because it
uses the normal shared buffer pool and cache replacement strategy.

[My gut feeling would be against another permanent on-disk structure,
since this is one more thing for a user to delete to save space
etc...]

Irrelevant, IMHO.  Whoever manually deletes any file under PGDATA
deserves whatever this may cause.

I haven't seen many programs that use
extended SELECT FOR UPDATE logic.

AFAICS, SELECT FOR UPDATE is not the primary issue here, because it
locks rows exclusively.  This thread is about shared locks, which
should be used for foreign key checks.

Servus
 Manfred


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Bgwriter behavior

2004-12-29 Thread Manfred Koizar
[I know I'm late and this has already been discussed by Richrad, Tom,
et al., but ...]

On Tue, 21 Dec 2004 16:17:17 -0600, Jim C. Nasby
[EMAIL PROTECTED] wrote:
look at where the last page you wrote out has ended up in the LRU list
since you last ran, and start scanning from there (by definition
everything after that page would have to be clean).

This is a bit oversimplified, because that page will be moved to the
start of the list when it is accessed the next time.

  A = B = C = D = E = F = G = H = I = J = K = L = m = n = o = p = q
                                                  ^
would become

  M = A = B = C = D = E = F = G = H = I = J = K = L = n = o = p = q
  ^

(a-z ... known to be clean, A-Z ... possibly dirty)

But with a bit of cooperation from the backends this could be made to
work.  Whenever a backend takes the page which is the start of the
clean tail out of the list (most probably to insert it into another
list or to re-insert it at the start of the same list) the clean tail
pointer is advanced to the next list element, if any.  So we would get

  M = A = B = C = D = E = F = G = H = I = J = K = L = n = o = p = q
                                                      ^

As a little improvement the clean tail could be prevented from
shrinking unnecessarily fast by moving the pointer to the previous
list element if this is found to be clean:

  M = A = B = C = D = E = F = G = H = I = J = K = l = n = o = p = q
                                                  ^

Maybe this approach could serve both goals, (1) keeping a number of
clean pages at the LRU end of the list and (2) writing out other dirty
pages if there's not much to do near the end of the list.
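
A sketch of that cooperation with made-up list types (not bufmgr
structures):

    typedef struct BufEntry
    {
        struct BufEntry *prev;      /* toward the MRU end */
        struct BufEntry *next;      /* toward the LRU end */
        int              dirty;
    } BufEntry;

    static BufEntry *clean_tail;    /* first buffer of the known-clean tail */

    /* Called whenever a backend unlinks a buffer from the LRU list. */
    static void
    note_buffer_unlinked(BufEntry *buf)
    {
        /* the head of the clean tail is being taken out: advance the pointer */
        if (buf == clean_tail)
            clean_tail = buf->next;

        /* optional improvement: grow the tail backwards over clean buffers */
        while (clean_tail != NULL &&
               clean_tail->prev != NULL &&
               !clean_tail->prev->dirty)
            clean_tail = clean_tail->prev;
    }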

But ...
On Tue, 21 Dec 2004 10:26:48 -0500, Tom Lane [EMAIL PROTECTED]
wrote:
Also, the cntxDirty mechanism allows a block to be dirtied without
changing the ARC state at all.

... which might kill this proposal anyway.

Servus
 Manfred


---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Updateable Views?

2004-08-09 Thread Manfred Koizar
On Sat, 07 Aug 2004 10:24:34 -0400, Jan Wieck [EMAIL PROTECTED]
wrote:
I have not heard of updatable subselects yet.

http://asktom.oracle.com/pls/ask/f?p=4950:8:6693556430011788783::NO::F4950_P8_DISPLAYID,F4950_P8_CRITERIA:273215737113,

| Here we update a join. [...]
| [EMAIL PROTECTED] update
|   2( select columnName, value
|   3from name, lookup
|   4   where name.keyname = lookup.keyname
|   5 and lookup.otherColumn = :other_value )
|   6 set columnName = value
|   7  /

Google for
oracle delete statement syntax
or
oracle update statement syntax

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] More vacuum.c refactoring

2004-08-06 Thread Manfred Koizar
[Sorry for the late reply.  I'm still struggling to catch up after
vacation ...]

On Fri, 9 Jul 2004 21:29:52 -0400 (EDT), Bruce Momjian
[EMAIL PROTECTED] wrote:

Where are we on this, 2x.  :-)

Here:
 Tom Lane wrote:
  Will study these comments later, but it's too late at night here...

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Quick question regarding tablespaces

2004-08-06 Thread Manfred Koizar
On Thu, 1 Jul 2004 22:55:56 -0400, Mike Rylander [EMAIL PROTECTED]
wrote:
I was thinking of purely tablespace-based random_page_cost, as that variable 
is tied to the access time of a particular filesystem.

Strictly speaking we'd also need tablespace-based sequential_page_cost.

Servus
 Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] Trapping QUERY_CANCELED: yes, no, maybe?

2004-08-06 Thread Manfred Koizar
On Sat, 31 Jul 2004 21:24:33 -0400, Tom Lane [EMAIL PROTECTED] wrote:
Exactly.  There's a proof-of-concept test at the bottom of
regress/sql/plpgsql.sql, wherein a function gets control back
from a query that would have run for an unreasonably long time.

referring to
| -- we assume this will take longer than 1 second:
| select count(*) into x from tenk1 a, tenk1 b, tenk1 c;

On Thu, 29 Aug 2002 13:27:36 -0400, Tom Lane [EMAIL PROTECTED] wrote:
 You mean, that the test might fail on a system that takes more than
 ten seconds to INSERT or UPDATE a single row?  I don't think this is a
 real problem.

I don't like depending on a timeout *at all* in a regression test;
the exact value of the timeout is not particularly relevant to my
concern about it.

Maybe
SELECT sleep('0:0:2'::interval);
as used in regress/sql/stats.sql is a better way to ensure that the
query takes longer than one second?

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Trapping QUERY_CANCELED: yes, no, maybe?

2004-08-06 Thread Manfred Koizar
On Fri, 06 Aug 2004 18:55:49 -0400, Tom Lane [EMAIL PROTECTED] wrote:
You think there's a serious risk of failure there ;-) ?

Not on my hardware...

Servus
 Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Another unpleasant surprise using inheritance

2004-06-11 Thread Manfred Koizar
On Fri, 11 Jun 2004 14:11:00 +0200, Darko Prenosil
[EMAIL PROTECTED] wrote:
I think I found bug related to table inheritance (or at least very weird 
behavior). 

This is well known and there's a todo for it:

# Allow inherited tables to inherit index, UNIQUE constraint, and
primary key, foreign key [inheritance] 

See also http://momjian.postgresql.org/cgi-bin/pgtodo?inheritance.

Servus
 Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[HACKERS] More vacuum.c refactoring

2004-06-10 Thread Manfred Koizar
Near the end of repair_frag() in vacuum.c -- under the comment /* clean
moved tuples from last page in Nvacpagelist list */ -- there is code
that marks itemids as unused.  Itemids affected are those referring to
tuples that have been moved off the last page.

This code is very similar to vacuum_page().  The major difference is
that vacuum_page() uses vacpage->offsets while the code in repair_frag()
looks for MOVED_OFF bits in tuple headers.  AFAICS the tuples with the
MOVED_OFF bit set are exactly those referenced by vacpage->offsets.

The attached patch passes make check and make installcheck.  Please
apply unless I'm missing something.

Servus
 Manfred
diff -Ncr ../base/src/backend/commands/vacuum.c src/backend/commands/vacuum.c
*** ../base/src/backend/commands/vacuum.c   Wed Jun  2 21:46:59 2004
--- src/backend/commands/vacuum.c   Thu Jun 10 18:50:26 2004
***
*** 2288,2355 
vacpage->offsets_free > 0)
{
Buffer  buf;
-   Page            page;
-   OffsetNumber    unused[BLCKSZ / sizeof(OffsetNumber)];
-   OffsetNumber    offnum,
-                   maxoff;
-   int uncnt;
-   int num_tuples = 0;
  
buf = ReadBuffer(onerel, vacpage-blkno);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
!   page = BufferGetPage(buf);
!   maxoff = PageGetMaxOffsetNumber(page);
!   for (offnum = FirstOffsetNumber;
!offnum <= maxoff;
!offnum = OffsetNumberNext(offnum))
!   {
!   ItemId  itemid = PageGetItemId(page, offnum);
!   HeapTupleHeader htup;
! 
!   if (!ItemIdIsUsed(itemid))
!   continue;
!   htup = (HeapTupleHeader) PageGetItem(page, itemid);
!   if (htup->t_infomask & HEAP_XMIN_COMMITTED)
!   continue;
! 
!   /*
!   ** See comments in the walk-along-page loop above, why 
we
!   ** have Asserts here instead of if (...) elog(ERROR).
!   */
!   Assert(!(htup->t_infomask & HEAP_MOVED_IN));
!   Assert(htup->t_infomask & HEAP_MOVED_OFF);
!   Assert(HeapTupleHeaderGetXvac(htup) == myXID);
! 
!   itemid->lp_flags &= ~LP_USED;
!   num_tuples++;
! 
!   }
!   Assert(vacpage->offsets_free == num_tuples);
! 
!   START_CRIT_SECTION();
! 
!   uncnt = PageRepairFragmentation(page, unused);
! 
!   /* XLOG stuff */
!   if (!onerel->rd_istemp)
!   {
!   XLogRecPtr  recptr;
! 
!   recptr = log_heap_clean(onerel, buf, unused, uncnt);
!   PageSetLSN(page, recptr);
!   PageSetSUI(page, ThisStartUpID);
!   }
!   else
!   {
!   /*
!* No XLOG record, but still need to flag that XID 
exists
!* on disk
!*/
!   MyXactMadeTempRelUpdate = true;
!   }
! 
!   END_CRIT_SECTION();
! 
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
WriteBuffer(buf);
}
--- 2288,2297 
vacpage->offsets_free > 0)
{
Buffer  buf;
  
buf = ReadBuffer(onerel, vacpage-blkno);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
!   vacuum_page(onerel, buf, vacpage);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
WriteBuffer(buf);
}

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] More vacuum.c refactoring

2004-06-10 Thread Manfred Koizar
On Thu, 10 Jun 2004 17:19:22 -0400, Tom Lane [EMAIL PROTECTED] wrote:
This does not make me comfortable.

I understand you, honestly.  Do I read between your lines that you
didn't review my previous vacuum.c refactoring patch?  Please do.  It'd
make *me* more comfortable.

  You *think* that two different bits of code are doing the same thing,

I see three significant differences between the code in repair_frag()
and vacuum_page().

1) vacuum_page() has
Assert(vacpage->offsets_used == 0);

vacpage is the last VacPage that has been inserted into Nvacpagelist.
It is allocated in line 1566, offsets_used is immediately set to 0 and
never changed.  So this Assert(...) doesn't hurt.

2) In vacuum_page() the lp_flags are set inside a critical section.

This is no problem because the clear-used-flags loop does not
elog(ERROR, ...).  Please correct me if I'm wrong.

3) vacuum_page() uses vacpage->offsets to locate the itemids that are to
be updated.

If we can show that these are the same itemids that belong to the tuples
that are found by inspecting the tuple headers, then the two code
snippets are equivalent.  The first hint that this is the case is
Assert(vacpage->offsets_free == num_tuples);

So both spots expect to update the same number of itemids.  What about
the contents of the offsets[] array?  Offset numbers are entered into
this array by statements like

vacpage->offsets[vacpage->offsets_free++] = offnum;

in or immediately after the walk-along-page loop.  These assignments are
preceded either by code that sets the appropriate infomask bits or by
assertions that the bits are already set appropriately.

The rest (from PageRepairFragmentation to END_CRIT_SECTION) is
identical.

 so you want to hack up vacuum.c?  This
module is delicate code --- we've had tons of bugs there in the past

But why is it so delicate?  Not only because it handles difficult
problems, but also because it is written in a not very
maintenance-friendly way.  Before I started refactoring the code
repair_frag() had more than 1100 lines and (almost) all variables used
anywhere in the function were declared at function level.

We cannot declare a code freeze for a module just because it is so badly
written that every change is likely to break it.  Someday someone will
*have* to change it.

Better we break it today in an effort to make the code clearer.  

--- and no I have zero confidence that passing the regression tests
proves anything

Unfortunately you are right :-(

Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow

2004-06-01 Thread Manfred Koizar
On Fri, 28 May 2004 14:47:01 -0400, Tom Lane [EMAIL PROTECTED] wrote:
If putting back xmax is the price we must pay for nested transactions,
then we *will* pay that price.  Maybe not in this release, but it will
inevitably happen.

"we" = every Postgres user, even those that do not use subtransactions.

"price" = 2% to 5% performance loss for databases where the working set
is larger than main memory.

Don't bother hollering veto ;-)

Ok, I'll shut up till I have something concrete to support my opinion.

Servus
 Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow

2004-06-01 Thread Manfred Koizar
On Fri, 28 May 2004 21:59:53 -0400, Alvaro Herrera
[EMAIL PROTECTED] wrote:
So the assumption was that when we see that this has
happenned, the Cmin is no longer important (== every future command can
already see the tuple)

If it was that simple, we wouldn't have had to invent the CMIN_IS_CMAX
flag.  This has been discussed two years ago.  Did you follow the link I
posted last week?  Every future command is not enough.  You have to
consider the current command and even commands started before this.

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow

2004-05-28 Thread Manfred Koizar
On Thu, 27 May 2004 16:50:24 -0400, Alvaro Herrera
[EMAIL PROTECTED] wrote:
Now you are on the subject, can I ask you to take a peek at what I did
regarding tuple headers?

I did read your patch, but I didn't understand it. :-(

At first I thought I'd have to add back Xmax as a field on its own

Veto!  This would increase heap tuple header size == less tuples per
page == more pages per table == more I/O == performance loss.

 is there a situation
on which we should need to peek at Cmin after setting Xmax for a
particusar tuple?

http://archives.postgresql.org/pgsql-hackers/2002-05/msg00090.php


Servus
 Manfred

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow

2004-05-27 Thread Manfred Koizar
On Wed, 26 May 2004 18:17:55 -0400, Tom Lane [EMAIL PROTECTED] wrote:
(Some days I think we should dump
VACUUM FULL, because it's optimized for a case that's no longer very
interesting...)

So we still have to stick with VACUUM FULL for some time, right?

The next set of compatibility breakers I'm currently working on requires
a change in VACUUM FULL behaviour.  I would only move tuples that are
visible to all running transactions.  OTOH I wouldn't stop at the first
unmovable tuple.

With
X  active tuple
.  free space or dead tuple
y  new tuple, not yet visible to a running transaction
z  deleted tuple, still visible to a running transaction

the current implementation transforms this relation

    XXyX  XzXX 

into

 XzXX  XXyX

The new implementation would leave it as

   XX.. ..y.  .z..

If there are concurrent long-running transactions, the new VACUUM FULL
wouldn't truncate the relation as aggressively as it does now.  It could
leave the relation with lots of free space near the end.  This was
absolutely unacceptable at the time when VACUUM FULL was designed.  But
now we could use lazy VACUUM as an excuse for VACUUM FULL not working so
hard.  After the transaction still seeing z terminates, VACUUM (without
FULL) can truncate the relation to

   XX.. ..y.

and when y is updated the new version will be stored in a lower block
and plain VACUUM can truncate the relation again:

   XXY.

AFAICS this would make vacuum.c much simpler (no more chain moves).
Clearly this change alone doesn't have any merit.  But would such a
patch have any chance of being accepted, if it facilitates improvements
in other areas?

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow

2004-05-27 Thread Manfred Koizar
On Thu, 27 May 2004 14:23:07 -0400, Tom Lane [EMAIL PROTECTED] wrote:
 and when y is updated the new version will be stored in a lower block

Oh?  What makes you think that?  I see no guarantee of it.

You're right, I see only a tendency, because the majority of free space
is before the last block (obviously).  But don't we try to store the new
version on the same block as the old version?  That'd weaken my argument
a bit.

I think you'd have to not move *any* updated
tuples to be sure you don't need the chain-move mechanism.

Yes, per definitionem (move only tuples that are visible to all).

I'm disinclined to mess with VACUUM FULL without a clearer explanation
of where you're headed.

Have no fear.  I won't change anything in the near term.  As you were
talking about the future of VACUUM FULL,  I thought this might be a good
opportunity to ask.  The fact that you didn't outright reject the idea
is good enough for now.

I have no clear explanation at the moment, just some fuzzy ideas that
are beginning to crystallise.  I'm messing around with heap tuple
headers again, and the Xvac field seems to get in the way, unless I can
cut down the number of different scenarios where it is needed.

Servus
 Manfred

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] New horology failure

2004-05-24 Thread Manfred Koizar
[resending...]


On Sun, 23 May 2004 11:38:51 +0800, Christopher Kings-Lynne
[EMAIL PROTECTED] wrote:
I get this since Tom's commit.

--- ./results/horology.out  Sun May 23 11:39:49 2004
***
*** 1787,1796 
!  | Sat Sep 22 18:19:20 2001 PDT | @ 34 years| Fri Sep 22 
18:19:20 1967 PDT
[...]
--- 1787,1796 
!  | Sat Sep 22 18:19:20 2001 PDT | @ 34 years| Fri Sep 22 
18:19:20 1967 PST
[...]

I got the same with snapshot-20040521 yesterday [i.e. 2004-05-22]
afternoon when I ran make check.  But only once.  make installcheck
passed all tests, and the failure didn't reappear when I tried make
check again.

I just got the failure again with make check after having configured
with a new install directory.  My guess is that horology needs some
datafile from the install location.

Servus
 Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] zero-column table behavior

2004-05-24 Thread Manfred Koizar
[resending...]

On Sat, 22 May 2004 20:28:43 -0400, Neil Conway [EMAIL PROTECTED]
wrote:
-- Why is there a blank line before the -- that indicates the
-- end of the result set?

-- separates the header line from the *start* of the result set.  The
empty line is the header line, containing zero column headers.

-- If the result set contains two rows, ISTM the psql output
-- should emit either two or three blank lines before the --
-- that indicates the end of the result set

One empty header line before -- and then two empty data lines.

Servus
 Manfred

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Multiple Xids in PGPROC?

2004-05-05 Thread Manfred Koizar
On Tue, 04 May 2004 23:21:07 -0400, Tom Lane [EMAIL PROTECTED] wrote:
I thought we had devised a solution that did not require expansible
shared memory for this.  Bruce, Manfred, do you recall how that went?

AFAIR we did not discuss TransactionIdIsInProgress() specifically.
Currently this function is special insofar as it does not consult
pg_clog but loops over the PGPROC array.  The current implementation is
in sinval.c.  The straightforward pg_clog lookup is still in transam.c,
but has been deactivated:
 * Now this func in shmem.c and gives quality answer by scanning
 * PGPROC structures of all running backend. - vadim 11/26/96

What was the motivation for this change?  Consistency or speed?

With subtransactions we'd have to fall back to checking pg_clog (and
pg_subtrans) in certain cases.  There are lots of possible
implementations.  Here are some ideas (just brainstorming; a rough
sketch of the combined check order follows after the list):

 .  We could first scan the PGPROC array.  If the xid is an active main
transaction, we're finished.

 .  If xid is older than RecentGlobalXmin, it cannot be active.

 .  We could include a small number of subtransaction xids in PGPROC.

 .  For additional subtransactions not fitting into this small array
there could be minsubxid and maxsubxid fields in PGPROC.  If the xid we
are looking for is outside all these ranges, it cannot be an active
subtransaction.

 .  If all these tests fail, we fall back to checking pg_clog.
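
Putting these together, the check order would look roughly like this.
All helper functions and the subxid cache / min-max fields are
hypothetical; none of them exist in PGPROC today.  Only
TransactionIdPrecedes() and RecentGlobalXmin are real:

static bool
xid_is_in_progress_sketch(TransactionId xid)
{
    /* 1. active main transaction?  (scan of the PGPROC array) */
    if (xid_is_running_main_xact(xid))
        return true;

    /* 2. older than RecentGlobalXmin => cannot be active anywhere */
    if (TransactionIdPrecedes(xid, RecentGlobalXmin))
        return false;

    /* 3. small per-backend cache of subtransaction xids */
    if (xid_in_some_proc_subxid_cache(xid))
        return true;

    /* 4. outside every backend's [minsubxid, maxsubxid] range */
    if (!xid_in_any_proc_subxid_range(xid))
        return false;

    /* 5. fall back to pg_clog (and pg_subtrans) */
    return xid_in_progress_according_to_clog(xid);
}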

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Number of pages in a random sample

2004-04-29 Thread Manfred Koizar
On Mon, 26 Apr 2004 08:08:16 -0700, Sailesh Krishnamurthy
[EMAIL PROTECTED] wrote:
 A Bi-Level Bernoulli Scheme for Database Sampling
 Peter Haas, Christian Koenig (SIGMOD 2004) 

Does this apply to our problem?  AFAIK with Bernoulli sampling you don't
know the sample size in advance.

Anyway, thanks for the hint.  Unfortunately I couldn't find the
document.  Do you have a link?

Servus
 Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] btbulkdelete

2004-04-28 Thread Manfred Koizar
On Wed, 28 Apr 2004 00:08:48 -0400, Tom Lane [EMAIL PROTECTED] wrote:
 Is there a special reason for scanning the leaf pages in *logical*
 order, i.e. by following the opaque->btpo_next links?

Yes.  [..] interlocking between indexscans and deletions.

Thanks for refreshing my memory.  This has been discussed two years ago,
and I even participated in that discussion :-(

Servus
 Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] btbulkdelete

2004-04-26 Thread Manfred Koizar
On Mon, 26 Apr 2004 14:29:58 +0100, Simon Riggs [EMAIL PROTECTED]
wrote:
   Now that FSM
 covers free btree index pages this access pattern might be highly
 nonsequential.

I had considered implementing a mode where the index doesn't keep trying
to reuse space that was freed by earlier deletes.

Or maybe an FSM function a la Give me a free page near this one?

Servus
 Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


[HACKERS] btbulkdelete

2004-04-25 Thread Manfred Koizar
On -performance we have been discussing a configuration where a bulk
delete run takes almost a day (and this is not due to crappy hardware or
apparent misconfiguration).  Unless I misinterpreted the numbers,
btbulkdelete() processes 85 index pages per second, while lazy vacuum is
able to clean up 620 heap pages per second.

Is there a special reason for scanning the leaf pages in *logical*
order, i.e. by following the opaque->btpo_next links?  Now that FSM
covers free btree index pages this access pattern might be highly
nonsequential.

I'd expect the following scheme to be faster:

for blknum = 1 to nblocks {
read block blknum;
if (block is a leaf) {
process it;
}
}
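
In concrete terms this would be little more than the following sketch
(using the ordinary buffer manager calls; buffer locking and the
interlocking against concurrent index scans are deliberately ignored here):

static void
process_leaves_in_physical_order(Relation rel)
{
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
    BlockNumber blkno;

    for (blkno = BTREE_METAPAGE + 1; blkno < nblocks; blkno++)
    {
        Buffer      buf = ReadBuffer(rel, blkno);
        Page        page = BufferGetPage(buf);
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        if (P_ISLEAF(opaque) && !P_ISDELETED(opaque))
        {
            /* do the bulk-delete work on this leaf's items */
        }
        ReleaseBuffer(buf);
    }
}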

As there is no free lunch this has the downside that it pollutes the
cache with unneeded inner nodes and free pages.

OTOH there are far less inner pages than leaf pages (even a balanced
binary tree has more leaves than inner nodes), and if free pages become
a problem it's time to re-index.

Did I miss something else?

Servus
 Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


[HACKERS] Number of pages in a random sample (was: query slows down with more accurate stats)

2004-04-25 Thread Manfred Koizar
On Mon, 19 Apr 2004 12:00:10 -0400, Tom Lane [EMAIL PROTECTED] wrote:
A possible compromise is to limit the number of pages sampled to
something a bit larger than n, perhaps 2n or 3n.  I don't have a feeling
for the shape of the different-pages probability function; would this
make a significant difference, or would it just waste cycles?

I would have replied earlier, if I had a good answer.  What I have so
far contains at least one, probably two flaws.  Knowing not much more
than the four basic arithmetic operations I was not able to improve my
model.  So I post what I have:

As usual we assume a constant number c of tuples per page.  If we have a
table of size B pages and want to collect a sample of n tuples, the
number of possible samples is (again in OOo syntax)

left( binom{cB}{n} right)

If we select an arbitrary page, the number of possible samples that do
NOT contain any tuple from this page is

left( binom {c (B-1)} {n} right)

Let's forget about our actual implementations of sampling methods and
pretend we have a perfect random sampling method.  So the probability
Pnot(c, B, n) that a certain page is not represented in a random sample
is

left( binom {c (B-1)} {n} right) over left( binom{cB}{n} right)

which can be transformed into the more computing-friendly form

prod from{i=0} to{n-1} {{cB-c - i} over {cB - i}}

Clearly the probability that a certain page *is* represented in a sample
is

Pyes(c, B, n) = 1 - Pnot(c, B, n)

The next step assumes that these probabilities are independent for
different pages, which in reality they are not.  We simply estimate the
number of pages represented in a random sample as 

numPag(c, B, n) = B * Pyes(c, B, n)
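
These formulas are cheap to evaluate directly; the following throwaway
program (illustrative only) reproduces the kind of numbers shown below:

#include <stdio.h>

static double
pnot(double c, double B, double n)
{
    double  p = 1.0;
    long    i;

    for (i = 0; i < (long) n; i++)
        p *= (c * B - c - i) / (c * B - i);
    return p;
}

int
main(void)
{
    /* the c = 100, B = 3000 entry of the table below: prints ~1902 */
    double  c = 100, B = 3000, n = 3000;

    printf("numPag = %.0f\n", B * (1.0 - pnot(c, B, n)));
    return 0;
}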

Here are some results for n = 3000:

 B \ c |    10 |   100 |   200
-------+-------+-------+-------
   100 |  ---  |   100 |   100
  1000 |   972 |   953 |   951
  2000 |  1606 |  1559 |  1556
  3000 |  1954 |  1902 |  1899
  6000 |  2408 |  2366 |  2363
  9000 |  2588 |  2555 |  2553
 20000 |  2805 |  2788 |  2787
 30000 |  2869 |  2856 |  2856
100000 |  2960 |  2956 |  2956

This doesn't look to depend heavily on the number of tuples per page,
which sort of justifies the assumption that c is constant.

In the next step I tried to estimate the number of pages that contain
exactly 1, 2, ... tuples of the sample.  My naive procedure works as
follows (I'm not sure whether it is even valid as a rough approximation,
constructive criticism is very welcome):

For c=100, B=3000, n=3000 we expect 1902 pages to contain at least 1
tuple of the sample.  There are 1098 more tuples than pages, these
tuples lie somewhere in those 1902 pages from the first step.
numPag(99, 1902, 1098) = 836 pages contain at least a second tuple.
So the number of pages containing exactly 1 tuple is 1902 - 836 = 1066.
Repeating these steps we get 611 pages with 2 tuples, 192 with 3, 30
with 4, and 3 pages with 5 tuples.
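
Expressed as code, the iteration is essentially this (self-contained, with
the same pnot() helper as above; rounding makes the output differ slightly
from the hand calculation):

#include <stdio.h>

static double
pnot(double c, double B, double n)
{
    double  p = 1.0;
    long    i;

    for (i = 0; i < (long) n; i++)
        p *= (c * B - c - i) / (c * B - i);
    return p;
}

int
main(void)
{
    double  c = 100, B = 3000, n = 3000;
    double  atleast = B * (1.0 - pnot(c, B, n));    /* pages with >= 1 tuple */
    double  left = n - atleast;                     /* tuples not yet placed */
    int     k;

    for (k = 1; atleast >= 0.5 && left > 0.5; k++)
    {
        double  next = atleast * (1.0 - pnot(c - k, atleast, left));

        printf("pages with exactly %d tuple(s): %.0f\n", k, atleast - next);
        left -= next;
        atleast = next;
    }
    return 0;
}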

Here are some more numbers for c = 100 and n = 3000:

   B   | pages with 1, 2, ... tuples
-------+--------------------------------------------------------
   100 |  1 to 24 tuples: 0, then 1, 2, 4, 10, 18, 26, 24, 11, 4
  1000 |  108, 201, 268, 229, 113, 29, 5
  2000 |  616, 555, 292,  83, 12, 1
  3000 | 1066, 611, 192,  30,  3
  6000 | 1809, 484,  68,   5
  9000 | 2146, 374,  32,   2
 20000 | 2584, 196,   8
 30000 | 2716, 138,   3
100000 | 2912,  44

A small C program to experimentally confirm or refute these calculations
is attached.  Its results are fairly compatible with above numbers,
IMHO.

Servus
 Manfred
/*
** samsim.c  -  sampling simulator
*/
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

typedef int bool;


#define MAX_RANDOM_VALUE  (0x7FFFFFFF)

static void initrandom()
{
struct timeval tv;

gettimeofday(&tv, NULL);
srandom(tv.tv_sec ^ tv.tv_usec);
}/*initrandom*/

/* Select a random value R uniformly distributed in 0 < R < 1 */
static double
random_fract(void)
{
long        z;

/* random() can produce endpoint values, try again if so */
do
{
z = random();
} while (z <= 0 || z >= MAX_RANDOM_VALUE);
return (double) z / (double) MAX_RANDOM_VALUE;
}

/*
** data structure for (modified) Algorithm S from Knuth 3.4.2
*/
typedef struct
{
    long        N;              /* number of tuples, known in advance */
    int         n;              /* sample size */
    long        t;              /* current tuple number */
    int         m;              /* tuples selected so far */
} SamplerData;
typedef SamplerData *Sampler;

static void Sampler_Init(Sampler bs, long N, int samplesize);
static bool Sampler_HasMore(Sampler bs);
static long 

[HACKERS] Tuple sampling

2004-04-19 Thread Manfred Koizar
The proposed new sampling method
(http://archives.postgresql.org/pgsql-hackers/2004-04/msg00036.php and
http://archives.postgresql.org/pgsql-patches/2004-04/msg00045.php)
basically incorporates two independant changes:

(1) Two-stage sampling:  Stage one collects a random sample of pages,
stage two collects a random sample of tuples out of these pages.

(2) Improved row count estimation:  Once a page is read in, all tuples
on this page are looked at to see whether they are active (visible,
live) or dead.  The estimated row count is calculated from the number of
active tuples seen, the number of visited pages, and the total number of
pages.
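
In other words the estimate is just the observed live-tuple density scaled
up to the whole relation; a minimal sketch (the name and the rounding are
mine, not the patch's):

#include <math.h>

static double
estimate_reltuples(double live_tuples_seen, double pages_visited,
                   double total_pages)
{
    if (pages_visited <= 0)
        return 0.0;
    return floor(live_tuples_seen / pages_visited * total_pages + 0.5);
}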

The preview patch implements three sampling methods:  Method 0 is the
original sampling method;  method 1 uses both changes;  and method 2 is
sort of change (1) without change (2).

Theoretically we could apply only the second change, if we come to the
conclusion that we do not want two-stage sampling.  But I don't have
test results for this.

I'd like to present my *interpretation* of the results I got from
testing various sampling methods.  Whoever is interested in looking at
the raw test results can get them from
http://www.pivot.at/pg/SamplePerf2.sxc or (for those still without OOo)
http://www.pivot.at/pg/SamplePerf2.html.

This set of tests measures only the sampling times, actually analysing
the sample and storing the results in pg_statistic has been disabled.
To eliminate as many sources of noise as possible, following steps have
been taken prior to each test run:

  . SELECT count(*) FROM testtable;  
  to make sure that all visibility bits in the tuple headers are set
  correctly,

  . tar cf /dev/null something
  to fill the OS cache with unrelated stuff,

  . SELECT count(*) FROM othertable;
  to fill the shared buffers with unrelated data,

  . CHECKPOINT;
  to avoid a checkpoint happening during the test run.

The tests have been performed with two types of tables.  The tables
named x, x2, x3, x3d have initially been copied from tenk1 in the
regression database.  These tables have between 20 and 30 tuples per
page.  Tables of the other type -- named y, y3, y3d -- have much smaller
tuples, ca. 150 per page.


Method 1 vs. method 2:

With large tuples method 2 is sometimes faster, sometimes slower than
method 1, never enough to worry about.  This has been tested for
statistics targets 10 and 100.

Small tuples, statistics target 10 to 100:  No reproducible difference
up to 490 pages (80000 tuples).  Starting at 1900 pages (320000 tuples)
method 2 is consistently faster.  The difference can easily be explained
by hundreds of thousands or even millions of heap_fetch() calls.

OTOH with method 2 we get row count estimation errors of up to 20%
compared to less than 1% for method 1.  So I conclude that change (2) is
well worth the CPU cycles it costs.

Method 0 produced estimation errors of up to 60%.


What about two-stage sampling?

Comparing the speeds of method 0 and method 2 we get the following
pattern which is basically the same for all statistics targets I tried.

Sample size 3000 (30000):
For tables smaller than 1000 (5000) pages all sampling methods access
all or almost all pages and there is hardly any runtime difference.  For
table sizes between 1000 (5000) and 3000 (30000) the new method reads
the whole table while the old method starts to skip pages.  This doesn't
result in a significant runtime difference, however.  If the table size
grows beyond 3000 (30000) pages, the number of page reads continues to
grow only for the old method, but up to ca. 30000 (300000) pages the new
method is not much faster (if at all).  For tables larger than 30000
(300000) pages two-stage sampling is reliably faster.


Other pros and cons of two-stage sampling

Pro and con:  Cache pollution.  Two-stage sampling reads the same number
of pages as one-stage sampling for small tables, slightly more for
medium sized tables (worst case is ca. 15%), and a little less to far
less for large to very large tables (3000 vs. 20000 for 430000 pages
with statistics target 10, 30000 vs. 98000 for the same table with
statistics target 100).

Con:  In Tom's words There's still a risk of not being able to collect
N rows out of N blocks, if you are unfortunate enough to select a lot of
wholly-empty pages.  But that seems like a low-probability scenario;
besides such a table would be so desperately in need of VACUUM FULL that
the possible low quality of the stats hardly matters.

Con:  The sample generated is not perfectly random (cf. discussion
starting at
http://archives.postgresql.org/pgsql-performance/2004-04/msg00196.php).
If somebody has an idea how we could steer against that effect of
collecting tuples from too few different pages, I'd be happy to
implement it.  If nothing pops up, I think that the current consensus is
that we don't care.

Anything else?

Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] [GENERAL] Large DB

2004-04-02 Thread Manfred Koizar
On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane [EMAIL PROTECTED] wrote:
Manfred Koizar [EMAIL PROTECTED] writes:
 What I have in mind is a kind of Double Vitter algorithm.  [...]
 random sample of sample_size block numbers, and then to sample the rows
 out of this pool of blocks.

That assumption is faulty, though --- consider wholly-empty pages.

A bigger problem is that this makes the sampling quite nonuniform,
because rows that are on relatively low-density pages would be more
likely to become part of the final sample than rows that are on pages
with lots of tuples.

This sounds like you are assuming that I want to take exactly one tuple
out of each block of the block sample.  This is not the case.  In the
second round I plan to apply the same (or a better) Vitter method as it
is done now.  The main difference is that blocks will be adressed
indirectly through the array of block numbers obtained in the first
round.

  Thus for example your sample would tend to favor
rows with wide values of variable-width columns and exclude narrower
values.  (I am not certain that the existing algorithm completely avoids
this trap, but at least it tries.)

I'm reading 7.4 source code and I fail to see how it does this.  If the
relation starts with an atypical distribution of wide/narrow or
dead/alive tuples, a wrong value for tuplesperpage is used for the rest
of the sampling.

Tuples immediately following one or more dead tuples have a better
chance of being selected.  This may be called as random as anything else
and not favouring a special property.  OTOH after long runs of dead
tuples consecutive tuples are likely to be selected.

Your comment about nonuniformity above exactly describes the current
algorithm:  Once the initial sample is fetched and tuplesperpage is
determined, targpos is computed without any further feedback.  If
targpos points to a sparsely populated area (with wide tuples or with
many dead tuples) tuples in this area are more likely to get into the
sample than tuples in densely populated areas (with many small active
tuples).

I think that cutting down the number of blocks to be looked at does not
affect these problems.

Servus
 Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] [GENERAL] Large DB

2004-04-02 Thread Manfred Koizar
On Fri, 02 Apr 2004 18:06:12 -0500, Tom Lane [EMAIL PROTECTED] wrote:
You should not need to use the Vitter algorithm for the block-level
selection, since you can know the number of blocks in the table in
advance.  You can just use the traditional method of choosing each block
or not with probability (k/K), where k = number of sample blocks still
needed, K = number of blocks from here to the end.

Sounds reasonable.  I have to play around a bit more to get a feeling
where the Vitter method gets more efficient.
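
For reference, the traditional scheme you describe (Knuth's Algorithm S
applied to block numbers) is no more than this; random() is only a
stand-in for a proper uniform generator:

#include <stdlib.h>

static void
sample_block_numbers(long nblocks, int samplesize, long *chosen)
{
    int     picked = 0;
    long    b;

    for (b = 0; b < nblocks && picked < samplesize; b++)
    {
        double  u = random() / 2147483648.0;    /* uniform in [0, 1) */

        /* take this block with probability (still needed) / (blocks left) */
        if (u * (nblocks - b) < samplesize - picked)
            chosen[picked++] = b;
    }
}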

  You'd run the Vitter
algorithm separately to decide whether to keep or discard each live row
you find in the blocks you read.

You mean once a block is sampled we inspect it in any case?  This was
not the way I had planned to do it, but I'll keep this idea in mind.

Question: if the table size is less than N blocks, are you going to read
every block or try to reduce the number of blocks sampled?

Don't know yet.

people are setting the stats target to 100 which means a sample size of
30000 --- how do the page-access counts look in that case?

 rel  | page
 size | reads
------+-------
  300 |   300
 3000 |  3000
 5000 |  4999
  10K |  9.9K
  30K |  25.8K
 300K |   85K
   1M |  120K
  10M |  190K
 100M |  260K
   1G |  330K

This is exactly the table I posted before (for sample size 3000) with
every entry multiplied by 10.  Well, not quite exactly, but the
differences are far behind the decimal point.  So for our purposes, for
a given relation size the number of pages accessed is proportional to
the sample size.

Servus
 Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [GENERAL] Large DB

2004-04-02 Thread Manfred Koizar
On Fri, 02 Apr 2004 19:57:47 -0500, Tom Lane [EMAIL PROTECTED] wrote:
If you like I can send you the Vitter paper off-list (I have a PDF of
it).  The comments in the code are not really intended to teach someone
what it's good for ...

Yes, please.  [Would have sent this off-list.  But I'm blacklisted.]

Servus
 Manfred

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] [GENERAL] Large DB

2004-04-02 Thread Manfred Koizar
[time to move this to -hackers]

On Fri, 02 Apr 2004 11:16:21 -0500, Tom Lane [EMAIL PROTECTED] wrote:
Manfred Koizar [EMAIL PROTECTED] writes:
 The first step, however, (acquire_sample_rows() in analyze.c) has to
 read more rows than finally end up in the sample.  It visits less than
 O(nblocks) pages but certainly more than O(1).

 A vague feeling tries to tell me that the number of page reads is
 somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
 grow like O(ln(n)).

Good guess.  Vitter's paper says the expected time to sample n rows from
a table of size N is O(n * (1 + log(N/n))).

Well, for what I tried to find out my wild guess seems to be wrong.

I don't doubt that Vitter's formula is correct, but it assumes that
access to any tuple has the same cost.  This does not apply to our
problem, however.  With 100 tuples per page, we access the first
sample_size tuples at a cost of 0.01 sequential page reads per tuple.
Later we use less and less tuples per page which results in higher
per-tuple-cost.  Near the end of a large relation we can expect to
access only one tuple per page and more and more pages are skipped, so
that prefetching doesn't help any more.

Playing around with some real numbers (for 100 tuples/page and a sample
size of 3000) I got:

 rel  | page
 size | reads
------+-------
   30 |    30
  300 |   300        expectation is something like 299.9995
  500 |   499
   1K |   990
   3K |  2.6K
  30K |    8K
 100K |   12K
   1M |   19K
  10M |   26K
 100M |   33K

This growth rate is steeper than O(log(nblocks)).

 I have an idea how this could be done with O(1) page reads.

What I have in mind is a kind of Double Vitter algorithm.  Whatever we
do to get our sample of rows, in the end the sampled rows come from no
more than sample_size different blocks.  So my idea is to first create a
random sample of sample_size block numbers, and then to sample the rows
out of this pool of blocks.

I have to think harder though, what to do about those 400 pages that are
not accessed when the sample size is 3000 ...

The hard part is getting a genuinely random sample when we don't know N
in advance.  We do however know the table size in blocks, so if you're
willing to make assumptions about constant tuple density you could do
something different.  (But the tuple density assumption is exactly the
weak spot of what we've got, so I'm unconvinced that would be a big step
forward.)

Starting the scan at some random blocks should help against the common
case of unusual distribution of dead tuples near the start of the
relation.  And I plan to factor information about dead tuple hits into
an increasingly better estimation of dead/live tuple ratio.

Servus
 Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] PostgreSQL block size vs. LVM2 stripe width

2004-03-29 Thread Manfred Koizar
On Mon, 29 Mar 2004 08:50:42 -0800 (PST), [EMAIL PROTECTED] wrote:
In this case, I've only done 1 per each combination.  I've found the
results for this test to be reproduceable.

Pardon?

Linux-2.6.3, LVM2 Stripe Width
BLCKSZ
(going down)    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
2 KB            2617    2656    2652    2664    2667    2642
4 KB            4393    4486    4577    4557    4511    4448
8 KB            4337    4423    4471    4576    4111    3642
16 KB           4412    4495    4532    4536    2985    2312
32 KB           3705    3784    3886    3925    2936    2362

 Does this mean that you first ran all test with 8 KB, then with 4, 2, 16
 and 32 KB BLCKSZ?  If so, I suspect that you are measuring the effects
 of something different.

Yes, that's correct, but why do you suspect that?

Gut feelings, hard to put into words.  Let me try:

Nobody really knows what the optimal BLCKSZ is.  Most probably it
depends on the application, OS, hardware, and other factors.  8 KB is
believed to be a good general purpose BLCKSZ.

I wouldn't be surprised if 8 KB turns out to be suboptimal in one or the
other case (or even in most cases).  But if so, I would expect it to be
either too small or too large.

In your tests, however, there are three configurations where 8 KB is
slower than both 4 KB and 16 KB.  Absent any explanation for this
interesting effect, it is easier to mistrust your numbers.

If you run your tests in the opposite order, on the same hardware, in
the same freshly formatted partitions, and you get the same results,
that would be an argument in favour of their accuracy.

Maybe we find out that those 1.5% are just noise.

Servus
 Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] PostgreSQL block size vs. LVM2 stripe width

2004-03-27 Thread Manfred Koizar
Mark,

how often did you run your tests?  Are the results reproduceable?

On Fri, 26 Mar 2004 14:00:01 -0800 (PST), [EMAIL PROTECTED] wrote:
Linux-2.6.3, LVM2 Stripe Width
(going across)
PostgreSQL
BLCKSZ
(going down)    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
2 KB            2617    2656    2652    2664    2667    2642
4 KB            4393    4486    4577    4557    4511    4448
8 KB            4337    4423    4471    4576    4111    3642
16 KB           4412    4495    4532    4536    2985    2312
32 KB           3705    3784    3886    3925    2936    2362

Unless someone can present at least an idea of a theory why a BLCKSZ of
8 KB is at a local minimum (1 or 2% below the neighbouring values) for
stripe widths up to 64 KB I'm not sure whether we can trust these
numbers.

Before I hit the send button, I did a quick check of the link you
provided.  The links in the table contain the following test numbers:

16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
2 KB 72  71  70   69  66  65
4 KB 64  63  62   61  60  58
8 KB 54  53  52   51  50  49
16 KB   79  78  77   76  75  74
32 KB   86  85  84   83  82  80

Does this mean that you first ran all test with 8 KB, then with 4, 2, 16
and 32 KB BLCKSZ?  If so, I suspect that you are measuring the effects
of something different.

Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] log_line_info

2004-03-10 Thread Manfred Koizar
On Tue, 09 Mar 2004 10:02:14 -0500, Andrew Dunstan [EMAIL PROTECTED]
wrote:
After this is applied (fingers crossed) and everyone is happy, I will 
submit a patch to remove log_timestamp, log_pid and (if we are agreed on 
it) log_source_port.

Is there agreement on removing these 3 config vars?

Please don't.  Declare them obsolete for 7.5 and remove them in a later
release.

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Summary of Changes since last release (7.4.1)

2004-02-19 Thread Manfred Koizar
Simon,

On Thu, 19 Feb 2004 00:05:15 -, Simon Riggs
[EMAIL PROTECTED] wrote:
POSTGRESQL: Summary of Changes since last release (7.4.1)
--
18 Feb 2004

this is getting long over time.  If you plan to post it once a week,
flagging items that are new or have changed might help keeping track of
what's going on.

- Index performance improved when scanning highly non-unique indices;

! Index performance improved when scanning highly non-unique indices;
or
- (updated) Index performance improved ...


- ANALYZE will now collect statistics on expressional indexes, and make

+ ANALYZE will now collect statistics on expressional ...
or
- (new) ANALYZE will now collect statistics on expressional ...

Servus
 Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] [GENERAL] Transaction Question

2003-12-11 Thread Manfred Koizar
On Sat, 6 Dec 2003 10:43:18 -0500 (EST), Bruce Momjian
[EMAIL PROTECTED] wrote:
Where are we on nested transactions.  Is it something we can get for 7.5?

I honestly don't know.  I've been working on other things lately and
have not heard from Alvaro for some time.

Servus
 Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] [HACKERS] Index creation takes for ever

2003-12-02 Thread Manfred Koizar
On Mon, 01 Dec 2003 13:32:10 -0500, Tom Lane [EMAIL PROTECTED] wrote:
Manfred Koizar [EMAIL PROTECTED] writes:
 comparetup_index() compares two IndexTuples.  The structure
 IndexTupleData consists basically of not much more than an ItemPointer,
 and the patch is not much more than adding a comparison of two
 ItemPointers.  So how does the patch introduce a new low level
 implementation dependency?

Because it sorts on tuple position, which is certainly about as low
level as you can get.

The patch affects only the sort during index creation.  Mapping key
values to tuple positions is the sole purpose of an index.  The notion
that an index should not care for tuple positions looks a bit strange to
me.

  More to the point, though, no evidence has been
provided that this is a good idea.

The test script I posted with the patch shows that the patch produces
more efficient b-tree indices when there are lots of duplicates.

Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PATCHES] [HACKERS] Index creation takes for ever

2003-12-01 Thread Manfred Koizar
On Mon, 1 Dec 2003 00:02:54 -0500 (EST), Bruce Momjian
[EMAIL PROTECTED] wrote:
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
And if it doesn't help index
  creation speed, at least the resulting index has better correlation.

... which has been shown by the example in the original message:
 Result without patch:
ctid
 --
  (153,14)
  (306,23)
  (305,80)
  (152,91)
   (76,68)
   (38,34)
  (153,34)
  (305,50)
(9,62)
  (305,40)
 (10 rows)
 
 Result with patch:
   ctid
 
   (0,5)
  (0,10)
  (0,15)
  (0,20)
  (0,25)
  (0,30)
  (0,35)
  (0,40)
  (0,45)
  (0,50)
 (10 rows)

And I think we all agree, that better index correlation leads to better
performance.

 I think this is a *very* dubious idea.  It introduces a low-level
 implementation dependency into our sort behavior

The patch modifies the static function comparetup_index() in
tuplesort.c.
The comment above this function says
/*
 * Routines specialized for IndexTuple case
 *
 * NOTE: actually, these are specialized for the btree case; [...]
 */

comparetup_index() compares two IndexTuples.  The structure
IndexTupleData consists basically of not much more than an ItemPointer,
and the patch is not much more than adding a comparison of two
ItemPointers.  So how does the patch introduce a new low level
implementation dependency?
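
For illustration, the tie-break in question amounts to nothing more than
this (a sketch, not the patch text itself):

static int
itemptr_compare(ItemPointer a, ItemPointer b)
{
    BlockNumber blk_a = ItemPointerGetBlockNumber(a);
    BlockNumber blk_b = ItemPointerGetBlockNumber(b);

    if (blk_a != blk_b)
        return (blk_a < blk_b) ? -1 : 1;
    return (int) ItemPointerGetOffsetNumber(a) -
           (int) ItemPointerGetOffsetNumber(b);
}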

Roger --- patch removed.  Thanks.

Could we agree on only removing the first five and a half lines of the
comment in the patch?

Servus
 Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] logical column position

2003-11-20 Thread Manfred Koizar
On Wed, 19 Nov 2003 19:07:23 +0100, Andreas Pflug
[EMAIL PROTECTED] wrote:
is there 
any DB system out there that allows to reshuffle the column ordering?

Firebird:
ALTER TABLE tname ALTER COLUMN cname POSITION 7;

Servus
 Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: Defaults for GUC variables (was Re: [HACKERS] pg_ctl reports succes when start fails)

2003-10-28 Thread Manfred Koizar
On Mon, 27 Oct 2003 10:22:32 -0500, Tom Lane [EMAIL PROTECTED]
wrote:
The low-tech solution to this would be to stop listing the default
values as commented-out entries, but just make them ordinary uncommented
entries.

Please not.  How should we ask a newbie seeking assistance on one of
the support lists to show his non-default config settings?

Servus
 Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] Cannot dump/restore text value \N

2003-10-08 Thread Manfred Koizar
On Sun, 05 Oct 2003 19:12:50 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
it seems we have to compare the null representation string to the
pre-debackslashing input.

Here is a patch that does this and adds a few regression tests.

(This is probably fairly easy to make happen
in CVS tip, but it might be pretty painful in 7.3.)

There haven't been too much changes in this area between 7.3 and 7.4.
A patch against 7.3.4 will follow ...

Servus
 Manfred
diff -ruN ../base/src/backend/commands/copy.c src/backend/commands/copy.c
--- ../base/src/backend/commands/copy.c 2003-08-28 15:52:34.0 +0200
+++ src/backend/commands/copy.c 2003-10-08 10:43:02.0 +0200
@@ -90,7 +90,8 @@
   char *delim, char *null_print);
 static void CopyFrom(Relation rel, List *attnumlist, bool binary, bool oids,
 char *delim, char *null_print);
-static char *CopyReadAttribute(const char *delim, CopyReadResult *result);
+static char *CopyReadAttribute(const char *delim, const char *nullst,
+   CopyReadResult *result, bool *isnull);
 static Datum CopyReadBinaryAttribute(int column_no, FmgrInfo *flinfo,
Oid typelem, bool *isnull);
 static void CopyAttributeOut(char *string, char *delim);
@@ -1361,7 +1362,7 @@
 
if (file_has_oids)
{
-   string = CopyReadAttribute(delim, &result);
+   string = CopyReadAttribute(delim, null_print, &result, &isnull);

if (result == END_OF_FILE && *string == '\0')
{
@@ -1370,7 +1371,7 @@
break;
}
 
-   if (strcmp(string, null_print) == 0)
+   if (isnull)
ereport(ERROR,

(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
 errmsg("null OID in COPY data")));
@@ -1403,7 +1404,7 @@
 errmsg("missing data for column \"%s\"",
NameStr(attr[m]->attname))));
 
-   string = CopyReadAttribute(delim, &result);
+   string = CopyReadAttribute(delim, null_print, &result, &isnull);
 
if (result == END_OF_FILE && *string == '\0' &&
cur == attnumlist && !file_has_oids)
@@ -1413,7 +1414,7 @@
break;  /* out of per-attr loop */
}
 
-   if (strcmp(string, null_print) == 0)
+   if (isnull)
{
/* we read an SQL NULL, no need to do anything 
*/
}
@@ -1442,7 +1443,7 @@
{
if (attnumlist == NIL  !file_has_oids)
{
-   string = CopyReadAttribute(delim, &result);
+   string = CopyReadAttribute(delim, null_print, &result, &isnull);
if (result == NORMAL_ATTR || *string != '\0')
ereport(ERROR,

(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
@@ -1650,14 +1651,13 @@
  * END_OF_FILE:EOF indicator
  * In all cases, the string read up to the terminator is returned.
  *
- * Note: This function does not care about SQL NULL values -- it
- * is the caller's responsibility to check if the returned string
- * matches what the user specified for the SQL NULL value.
- *
  * delim is the column delimiter string.
+ * nullst says how NULL values are represented.
+ * *isnull is set true if a null attribute, else false.
  */
 static char *
-CopyReadAttribute(const char *delim, CopyReadResult *result)
+CopyReadAttribute(const char *delim, const char *nullst,
+  CopyReadResult *result, bool *isnull)
 {
int c;
int delimc = (unsigned char) delim[0];
@@ -1665,6 +1665,17 @@
unsigned char s[2];
char   *cvt;
int j;
+   boolmatchnull = true;
+   int matchlen = 0;
+
+#define CHECK_MATCH(c) \
+   do { \
+   if (matchnull) \
+   if (c == nullst[matchlen]) \
+   ++matchlen; \
+   else \
+   matchnull = false; \
+   } while (0)
 
s[1] = 0;
 
@@ -1733,6 +1744,7 @@
}
 

[HACKERS] Cannot dump/restore text value \N

2003-10-05 Thread Manfred Koizar
To be clear, this is not about \N as the default external
representation for NULL, I'm talking about a string consisting of the
two characters backslash and uppercase-N.

CREATE TABLE nonu (tx text NOT NULL);
INSERT INTO nonu VALUES ('\\N');
SELECT * FROM nonu;
COPY nonu TO stdout;

This correctly gives:
\\N

Now try to feed that back into the table:
DELETE FROM nonu;
COPY nonu FROM stdin;
\\N
\.

ERROR:  copy: line 1, CopyFrom: Fail to add null value in not null
attribute tx
lost synchronization with server, resetting connection

This happened with 7.3.4, while trying to restore a 1.3 GB dump :-(
ERROR:  copy: line 809051, CopyFrom: Fail to add null value in not
null attribute text
FATAL:  Socket command type 0 unknown

The bug is still in 7.4Beta3; didn't test with Beta 4 yet.

Servus
 Manfred

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Cannot dump/restore text value \N

2003-10-05 Thread Manfred Koizar
I have solved my restore problem by editing (the relevant part of) the
dump (:%s/^IN^I/^IN ^I/), a one-off solution <g>

Anyway, thanks for your investigation.

On Sun, 05 Oct 2003 19:12:50 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
it seems we have to compare the null representation string to the
pre-debackslashing input.

Sounds reasonable, IMHO.

  I wonder whether it would break any existing apps though.

Couldn't be worse than silently converting valid non-null values to
NULL ...

Servus
 Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] ADD FOREIGN KEY

2003-10-01 Thread Manfred Koizar
On Tue, 30 Sep 2003 08:00:07 -0400, Christopher Browne
[EMAIL PROTECTED] wrote:
I would be pretty game for a near-single-user-mode approach that
would turn off some of the usual functionality that we knew we didn't
need because the data source was an already-committed-and-FK-checked
set of data.

Single user mode is a good idea, IMHO.  But it should only make sure
that there is not more than one user connected to the database (or to
the postmaster).  Everything else should depend on special GUC
variables that are only settable in single user mode:

db=> SET disable-fk-verification = true;
ERROR: disable-fk-verification can only be set in single user mode
db=> SET SINGLE USER MODE ON;
ERROR: permission denied
HINT: Must be superuser or owner of database db.
db=> \c - dbo
You are now connected as new user dbo.
db=> SET SINGLE USER MODE ON;
ERROR: cannot enter single user mode
HINT: You are not the only user connected to database db.
-- after other users have logged out ...
db=> SET SINGLE USER MODE ON;
SET
db=> SET disable-fk-verification = true;
SET

Single user mode would also help in several cases where now a
standalone backend is required ...

Servus
 Manfred

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] [PERFORM] osdl-dbt3 run results - puzzled by the execution

2003-09-24 Thread Manfred Koizar
On Fri, 19 Sep 2003 11:35:35 -0700, Jenny Zhang [EMAIL PROTECTED]
wrote:
I posted more results as you requested:

Unfortunately they only confirm what I suspected earlier:

 2) ->  Index Scan using i_ps_suppkey on partsupp
  (cost=0.00..323.16 rows=80 width=34)
  (actual time=0.16..2.98 rows=80 loops=380)
  ctr=108.44

 the planner does not
 account for additional index scans hitting pages in the cache that
 have been brought in by preceding scans.  This is a known problem

PF1 = estimated number of page fetches for one loop ~ 320
L   = estimated number of loops ~ 400
P   = number of pages in relation ~ 21000

Cutting down the number of heap page fetches if PF1 * L > P and P <
effective_cache_size seems like an obvious improvement, but I was not
able to figure out where to make this change.  Maybe it belongs into
costsize.c near

run_cost += outer_path_rows *
(inner_path->total_cost - inner_path->startup_cost) *
joininfactor;

in cost_nestloop() or it should be pushed into the index cost
estimation functions.  Hackers?
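
Just to illustrate the kind of discount I have in mind (a made-up helper,
not existing costsize.c code):

static double
discounted_heap_fetches(double PF1, double L, double P,
                        double effective_cache_size)
{
    double  total = PF1 * L;

    if (total > P && P < effective_cache_size)
        return P;           /* each heap page read once, later hits cached */
    return total;
}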

For now you have to keep lying about effective_cache_size to make the
planner overestimate merge joins to compensate for the planner's
overestimation of nested loops.  Sorry for having no better answer.

Servus
 Manfred

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] [SQL] plpgsql doesn't coerce boolean expressions to boolean

2003-09-09 Thread Manfred Koizar
On Mon, 08 Sep 2003 11:40:32 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
4. Use the parser's coerce_to_boolean procedure, so that nonbooleans
   will be accepted in exactly the same cases where they'd be accepted
   in a boolean-requiring SQL construct (such as CASE).  (By default,
   none are, so this isn't really different from #2.  But people could
   create casts to boolean to override this behavior in a controlled
   fashion.)

I vote for 4.  And - being fully aware of similar proposals having
failed miserably - I propose to proceed as follows:

If the current behaviour is considered a bug, let i=4, else let i=5.

In 7.i:  Create a new GUC variable plpgsql_strict_boolean (silly
name, I know) in the VERSION/PLATFORM COMPATIBILITY section of
postgresql.conf.  Make the new behaviour dependent on this variable.
Default plpgsql_strict_boolean to false.  Place a warning into the
release notes and maybe into the plpgsql documentation.

In 7.j, j>i:  Change the default value of plpgsql_strict_boolean to
true.  Issue WARNINGs or NOTICEs as appropriate.  Update
documentation.

In 7.k, k>j:  Remove old behaviour and GUC variable.  Update
documentation.

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Index creation takes for ever

2003-09-08 Thread Manfred Koizar
On Mon, 8 Sep 2003 11:31:05 +0200, Zeugswetter Andreas SB SD
[EMAIL PROTECTED] wrote:
 As Tom mentioned, we might not want to keep the tid's in order after the
 index is created because he wants the most recent tid's first, so the
 expired ones migrate to the end.

But on average this argument only holds true for unique indexes, no ?
Is there any code that stops the heap lookup after the visible tuple is found ?
At least in an index with more rows per key you will fetch all heaps after the 
first one anyway to get at the next row. This is better done in heap order, no ?

Yes, yes, yes, and yes.  Seems we all agree on that; the patch has
been queued for 7.5.

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Index creation takes for ever

2003-09-07 Thread Manfred Koizar
On Sun, 7 Sep 2003 11:43:42 -0400 (EDT), Bruce Momjian
[EMAIL PROTECTED] wrote:
I assume this completes this TODO:

   * Order duplicate index entries by tid for faster heap lookups

I don't think so, because the patch does nothing to keep the sort
order once the index is initially created.

 If you want to post it now, [...]

I did already post it.  It's only the last page or so of the original
message.  The link in that message points to a testing aid which is
not part of what I would like to see committed.

Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [GENERAL] Buglist

2003-08-22 Thread Manfred Koizar
On Fri, 22 Aug 2003 16:27:53 +0530, Shridhar Daithankar
[EMAIL PROTECTED] wrote:
What does FSM does then?

FSM = Free Space Map.  VACUUM writes information into the FSM, INSERTs
consult the FSM to find pages with free space for new tuples.

 I was under impression that FSM stores page 
pointers and vacuum work on FSM information only. In that case, it wouldn't 
have to waste time to find out which pages to clean.

This has been discussed yesterday here and on -hackers (Decent VACUUM
(was: Buglist)).  We were talking about inventing a second data
structure: RSM.

Servus
 Manfred

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] [GENERAL] Buglist

2003-08-22 Thread Manfred Koizar
On Fri, 22 Aug 2003 10:45:50 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
One big question mark in my mind about these partial vacuum
proposals is whether they'd still allow adequate FSM information to be
maintained.  If VACUUM isn't looking at most of the pages, there's no
very good way to acquire info about where there's free space.

VACUUM has accurate information about the pages it just visited.  Free
space information for pages not touched by VACUUM is still in the FSM,
unless free space on a page is too low to be interesting.  VACUUM has
to merge these two lists and throw away entries with little free space
if running out of room.

Thus we might end up with new almost full pages in the FSM while there
are pages with more free space lying around that a previous VACUUM
failed to register because there was more free space at that time.

Considering that
 .  FSM is lossy per definitionem
 .  we are targeting at relations with large passive areas
 .  decent VACUUM shall not replace lazy VACUUM
I see no problem here.

Future advice could be: VACCUM DECENT every hour, VACUUM daily,
VACUUM FULL once a year  where the first two could be scheduled by
autovacuum ...

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] [GENERAL] Buglist

2003-08-22 Thread Manfred Koizar
On Fri, 22 Aug 2003 12:18:02 -0400, Jan Wieck [EMAIL PROTECTED]
wrote:
Okay, my proposal would be to have a VACUUM mode where it tells the 
buffer manager to only return a page if it is already in memory

But how can it know?  Yes, we know exactly what we have in PG shared
buffers.  OTOH we keep telling people that they should configure
moderate values for shared_buffers because the OS is better at
caching.  Your CACHEONLY VACUUM wouldn't catch those pages that are in
the OS cache but not in the shared buffers, although they are
retrievable at almost the same low cost.

We should not try to avoid _any_ physical disk access.  It's good
enough to avoid useless reads.  Hence my proposal for a reclaimable
space list ...

Servus
 Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] [SQL] SELECT IN Still Broken in 7.4b

2003-08-21 Thread Manfred Koizar
On Thu, 21 Aug 2003 16:42:20 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
The point is that given
   WHERE a = 1 OR b = 1
you could create a plan that first indexscans on a, then indexscans on
b --- but you mustn't return any tuples in the second scan that you
already returned in the first.  IndexNext solves this by evaluating the
prior-scan index conditions to see if they are true.

WHERE a = 1 OR b = 2
and
WHERE a = 1 OR a = 2

are totally different things.  In the latter case you don't have to
check prior conditions because the conditions are mutually exclusive.
Is this reasonably easy to find out at plan creation time?

Yes, I changed your example to make my point clear, because
WHERE a = 1 OR a = 1
has its own set of problems.

Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [GENERAL] [HACKERS] [pgsql-advocacy] Need concrete Why Postgres not MySQL bullet list

2003-08-21 Thread Manfred Koizar
On Thu, 21 Aug 2003 15:05:52 +0200, I wrote:
Just wondering, what other databases has transactable DDLs?

Firebird.

Stop!  I withdraw that statement.  I must have mis-read some feature
list :-(

Tests with InterBase 6 showed that you can change metadata within a
transaction, but when you ROLLBACK, metadata changes persist.

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


[HACKERS] Decent VACUUM (was: Buglist)

2003-08-21 Thread Manfred Koizar
On Wed, 20 Aug 2003 15:39:26 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
But I think the real point here is that there's no reason to think that
doing tuple deletion on-the-fly in foreground transactions is superior
to doing it in background with a vacuum process.  You're taking what
should be noncritical maintenance work and moving it into the critical
paths of your foreground applications.  Not only that, but you're
probably doing more total work per tuple --- VACUUM batches its work
in more ways than just the index cleanup aspect, IIRC.

Yes, I sign that, 100%.  That doesn't mean that we couldn't do any
better.  AFAICS Vivek's problem is that it is hard enough to hold a
good part of the working set in the cache, and still his disks are
saturated.  Now a VACUUM not only adds one more process to disk I/O
contention, but also makes sure that the working set pages are *not*
in memory which leads to higher I/O rates after the VACUUM.

I can imagine several use cases where only a small part of a large
relation is subject to DELETEs/UPDATEs.  Maybe Vivek's application
falls into this category.

If we teach VACUUM to not read pages that don't contain any dead
tuples, this could be a significant improvement.  I'm envisioning a
data structure (reclaimable space map, RSM) similar to the FSM.
Whenever a backend encounters a dead tuple it inserts a reference to
its page into the RSM.  Dead tuple detection is no problem, it is
already implemented for marking dead index tuples.  VACUUM, when run
in a new mode (decent), only checks pages that are listed in the
RSM.  To get full advantage of not doing unnecessary page reads, we'll
also need to redesign the index bulk delete routines.

The autovaccum daemon will watch the RSM and when the number of
entries is above a configurable threshold, it will start a
VACUUM DECENT ...
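
To make the idea a bit more concrete, here is a sketch of what an RSM entry
and the backend-side hook could look like.  All names, the types and the
fixed size are assumptions for illustration only; duplicate suppression and
shared-memory plumbing are left out:

typedef unsigned int BlockNumber;       /* as in the backend headers */

#define RSM_MAX_PAGES 8192

typedef struct RsmEntry
{
    unsigned int relid;
    int          npages;
    BlockNumber  pages[RSM_MAX_PAGES];
} RsmEntry;

/* called by a backend that deletes a tuple (or stumbles over a dead one) */
static void
rsm_note_dead_tuple(RsmEntry *rsm, BlockNumber blkno)
{
    if (rsm->npages < RSM_MAX_PAGES)
        rsm->pages[rsm->npages++] = blkno;
    /* else simply drop it: the map is lossy, a later full VACUUM catches up */
}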

Servus
 Manfred

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Decent VACUUM (was: Buglist)

2003-08-21 Thread Manfred Koizar
[ still brainstorming ... ]

On Thu, 21 Aug 2003 17:16:50 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
 Whenever a backend encounters a dead tuple it inserts a reference to
 its page into the RSM.

This assumes that backends will visit dead tuples with significant
probability.  I doubt that assumption is tenable;

Good point.  What about:  Whenever a backend *deletes* a tuple it
inserts a reference to its page into the RSM?  Then an entry in the
RSM doesn't necessarily mean that the referenced page has reclaimable
space, but it would still be valueable information.

Servus
 Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Decent VACUUM (was: Buglist)

2003-08-21 Thread Manfred Koizar
On Thu, 21 Aug 2003 17:56:02 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
Conceivably it could be a win, though,
if you could do frequent vacuum decents and only a full-scan vacuum
once in awhile (once a day maybe).

That's what I had in mind; similar to the current situation where you
can avoid expensive VACUUM FULL by doing lazy VACUUM frequently
enough.

Servus
 Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] [pgsql-advocacy] Need concrete Why Postgres not MySQL bullet list

2003-08-21 Thread Manfred Koizar
On Thu, 21 Aug 2003 14:45:03 +0530, Shridhar Daithankar
[EMAIL PROTECTED] wrote:
Just wondering, what other databases has transactable DDLs?

Firebird.

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Correlation in cost_index()

2003-08-20 Thread Manfred Koizar
On Fri, 8 Aug 2003 16:53:48 -0700, Sean Chittenden
[EMAIL PROTECTED] wrote:
the problem with your patch was
that it picked an index less often than the current code when there
was low correlation.

Maybe bit rot?  What version did you apply the patch against?  Here is
a new version for Postgres 7.3.4:
http://www.pivot.at/pg/16d-correlation_734.diff

The only difference to the previous version is that

for (nKeys = 1; index->indexkeys[nKeys] != 0; nKeys++)

is now replaced with

for (nKeys = 1; nKeys < index->ncolumns; nKeys++)

Don't know whether the former just worked by chance when I tested the
7.3.2 version :-(.  Tests with 7.4Beta1 showed that index correlation
comes out too low with the old loop termination condition.  Anyway,
the latter version seems more robust.

In my tests the new index_cost_algorithms (1, 2, 3, 4) gave
consistently lower cost estimates than the old method (set
index_cost_algorithm = 0), except of course for correlations of 1.0 or
0.0, because in these border cases you get always min_IO_cost or
max_IO_cost, respectively.

Care to re-evaluate?  BTW, there's a version of the patch for 7.4Beta1
(http://www.pivot.at/pg/16d-correlation_74b1.diff) which also applies
cleanly against cvs snapshot from 2003-08-17.

Servus
 Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[HACKERS] Again on index correlation

2003-08-20 Thread Manfred Koizar
Recent discussion of index cost estimation ([HACKERS] Correlation in
cost_index() ca. two weeks ago) has lead to the conclusion that the
column correlation calculated by VACUUM does not always help when we
want to find out how well index access order corresponds to physical
tuple position.  Most problems arise when dealing with multi-column or
functional indices.  Another case is when there are many equal values
in the first index column.

What we really want is not column correlation, but index correlation
which turns out to be surprisingly easy to calculate.  There's no need
to look up comparison functions and handle different datatypes;
there's no need to look at the key values at all:  Just read the index
items in index order, sort them by heap page, and compute the Spearman
Rho function.  This even works for non-btree indices.
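
For reference, the rank correlation itself is just the textbook Spearman
formula; here d[i] is the difference between the i-th item's position in
index order and the rank of its heap page (tie handling is omitted in this
sketch):

static double
spearman_rho(const double *d, long n)
{
    double  sumd2 = 0.0;
    long    i;

    for (i = 0; i < n; i++)
        sumd2 += d[i] * d[i];
    return 1.0 - 6.0 * sumd2 / ((double) n * ((double) n * (double) n - 1.0));
}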

Try it yourself:
 .  download http://www.pivot.at/pg/contrib_icorrel.tgz 
 .  unpack
 .  make
 .  make install
 .  psql
. \i path/to/share/postgresql/contrib/icorrel.sql
. SELECT icorrel('myindex'::regclass);

This should work with 7.3.x and 7.4Beta.


How could the planner make use of index correlation?  Here
(http://www.pivot.at/pg/22-IndexCorrel_74b1.diff) is an experimental
patch that introduces a new system table pg_indexstat.  If the GUC
variable enable_indexstat is set to on, genericcostestimate tries to
get index correlation from pg_indexstat;  if there is none,
btcostestimate falls back to the old method.

Compatibility:  Although there is a new catalog table, initdb is not
required.  A patched postmaster still works with an old cluster, even
without creating pg_catalog.pg_indexstat.  Before you can make use of
pg_indexstat you have to create it via a standalone backend:

$ bin/postgres -D data -O template1

backend> CREATE TABLE pg_catalog.pg_indexstat( \
istindex oid NOT NULL, \
istcorrel float4 NOT NULL) WITHOUT OIDS;
backend> CREATE UNIQUE INDEX pg_indexstat_index_index \
ON pg_indexstat(istindex);

Repeat this for each database.


Usage example (using Sean's data):

psql testdb
testdb=# \d rucc
 Table public.rucc
Column |   Type   | Modifiers
---+--+---
 user_id   | integer  | not null
 category_id   | integer  | not null
 img_bytes | bigint   | not null
 img_hits  | integer  | not null
 html_bytes| bigint   | not null
 html_hits | integer  | not null
 unknown_bytes | bigint   | not null
 unknown_hits  | integer  | not null
 utc_date  | timestamp with time zone | not null
 time_interval | interval | not null
Indexes:
rucc_htmlbytes_idx btree (html_bytes),
rucc_id_date_idx btree (user_id, utc_date)

testdb=# SELECT 'rucc_id_date_idx'::regclass::oid;
   oid
-
 1281422
(1 row)

testdb=# set enable_seqscan = off;
SET

testdb=# set enable_indexstat = on;
SET

testdb=# INSERT INTO pg_indexstat VALUES (1281422, 0.0001);
INSERT 0 1
testdb=# EXPLAIN SELECT * FROM rucc WHERE user_id < 1000;
  QUERY PLAN

 Index Scan using rucc_id_date_idx on rucc  (cost=0.00..634342.50
rows=139802 width=64)
   Index Cond: (user_id < 1000)
(2 rows)

testdb=# UPDATE pg_indexstat SET istcorrel=0.1 WHERE istindex=1281422;
testdb=# EXPLAIN SELECT ...

 istcorrel |      cost
-----------+-----------
    0.0001 | 634342.50
    0.1    | 514678.48
    0.2    | 407497.85
    0.5    | 161612.89
    0.9    |  10299.07
    1.0    |   3994.32

Actually the table is clustered on rucc_id_date_idx, so index
correlation is 1.0, but there is no way to know that when we only
have the column correlations for user_id (1.0) and utc_date (0.59).
The current code guesses the index correlation to be 0.5, which gives
a cost estimate that is far too high.

For comparison:
seq scan estimated cost ~ 21000, actual ~ 11500,
index scan   actual ~ 4000


If you are going to test this patch, please be aware that I created it
on top of another one of my experimental patches
(http://www.pivot.at/pg/16d-correlation_74b1.diff).  If you don't want
to apply this one, one hunk of the IndexCorrel patch will fail in
selfuncs.c.  Should be no problem to apply it manually.

And those who are still experimenting with 7.3.4 performance can use
http://www.pivot.at/pg/16d-correlation_734.diff and
http://www.pivot.at/pg/22-IndexCorrel_74b1.diff.

ToDo:

.. Move get_index_correlation from contrib into the backend.

.. ANALYZE <table> computes index correlation for all indexes.

.. New command  ANALYZE <index>?

.. System cache invalidation?
   syscache.c: reloidattr = Anum_pg_indexstat_istindex ?

.. Dependency?

.. Remove GUC variable enable_indexstat

.. Remove old method in 

Re: [HACKERS] Correlation in cost_index()

2003-08-14 Thread Manfred Koizar
On Fri, 8 Aug 2003 11:06:56 -0700, Sean Chittenden
[EMAIL PROTECTED] wrote:
[...] it'd seem as though an avg depth of
nodes in index * tuples_fetched * (random_io_cost * indexCorrelation)
would be closer than where we are now...

Index depth does not belong here because we walk down the index only
once per index scan not once per tuple.  It might be part of the
startup cost.

The rest of your formula doesn't seem right either, because you get
higher costs for better correlation.  Did you mean
random_io_cost * (1 - abs(indexCorrelation))?

FWIW, for small effective_cache_size max_IO_cost is almost equal to
tuples_fetched * random_page_cost.  So your formula (with the
corrections presumed above) boils down to ignoring
effective_cache_size and linear interpolation between 0 and
max_IO_cost.

It's very possible that cost_index() is wrong, but it seems as though
after some testing as if PostgreSQL _overly_ _favors_ the use of
indexes:

Was this an unpatched backend?  What were the values of
effective_cache_size and random_page_cost?

# SET enable_seqscan = true; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date  
'2002-10-01'::TIMESTAMP WITH TIME ZONE;
INFO:  cost_seqscan: run_cost: 21472.687500
startup_cost: 0.00

INFO:  cost_index: run_cost: 21154.308116
startup_cost: 0.00
indexCorrelation: 0.999729

 QUERY PLAN
---
 Index Scan using report_user_cat_count_utc_date_id_idx on report_user_cat_count rucc 
  (cost=0.00..21154.31 rows=705954 width=64) (actual time=91.36..6625.79 rows=704840 
 loops=1)
   Index Cond: (utc_date  '2002-10-01 00:00:00-07'::timestamp with time zone)
 Total runtime: 11292.68 msec
(3 rows)

actual time=91.36..6625.79 but Total runtime: 11292.68 msec!
Where did those 4.7 seconds go?

# SET enable_seqscan = true; SET enable_indexscan = false;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date  
'2002-10-01'::TIMESTAMP WITH TIME ZONE;
INFO:  cost_seqscan: run_cost: 21472.687500
startup_cost: 0.00

INFO:  cost_index: run_cost: 21154.308116
startup_cost: 1.00
indexCorrelation: 0.999729
  QUERY PLAN
---
 Seq Scan on report_user_cat_count rucc  (cost=0.00..21472.69 rows=705954 width=64) 
 (actual time=1091.45..7441.19 rows=704840 loops=1)
   Filter: (utc_date  '2002-10-01 00:00:00-07'::timestamp with time zone)
 Total runtime: 10506.44 msec
(3 rows)

Same here: actual time=1091.45..7441.19 but Total runtime: 10506.44
msec  -  more than 3 seconds lost.

When we ignore total runtime and look at actual time we get

              seq    idx
estimated   21473  21154
actual       7441   6626

This doesn't look too bad, IMHO.

BTW, I believe that with your example (single-column index, almost
perfect correlation, index cond selects almost all tuples) all
interpolation methods give an index cost estimation that is very close
to seq scan cost, and the actual runtimes show that this is correct.

Which I find surprising and humorous, given that popular belief, mine
included, is contrary to those results.

How many tuples are in report_user_cat_count?  What are the stats for
report_user_cat_count.utc_date?

  I can say with pretty high
confidence, after having done real-world testing, that the patch to
use a geometric mean isn't correct: its break-even point is vastly
off, and it only uses an index when there are fewer than 9,000 rows
to fetch, a far cry from the 490K break-even I found while testing.

Could you elaborate, please.  The intention of my patch was to favour
index scans more than the current implementation.  If it does not, you
have found a bug in my patch.  Did you test the other interpolation
methods?

What I did find interesting, however, was that it does work better at
determining the use of multi-column indexes,

Yes, because it computes the correlation for a two-column index as
correlation_of_first_index_column * 0.95
instead of
correlation_of_first_index_column / 2
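
As a toy calculation only (a hypothetical first-column correlation of
1.0, e.g. a freshly clustered table, and two key columns), the two
rules give:

/* Toy numbers, just to contrast the two guesses stated above. */
#include <stdio.h>

int
main(void)
{
    double first_col_corr = 1.0;
    int    nkeys = 2;

    printf("current guess: %.2f\n", first_col_corr / nkeys);    /* 0.50 */
    printf("patched guess: %.2f\n", first_col_corr * 0.95);     /* 0.95 */
    return 0;
}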

 but I think that's
because the geometric mean pessimizes the value of indexCorrelation,
which gets pretty skewed when using a multi-column index.

I don't understand this.

# CREATE INDEX report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count 
(user_id,utc_date);
# CLUSTER report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count;
# ANALYZE report_user_cat_count;
# SET enable_seqscan = true; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM 

Re: [HACKERS] Correlation in cost_index()

2003-08-11 Thread Manfred Koizar
On Fri, 08 Aug 2003 18:25:41 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
  Two examples: [...]

One more example:
X   Y

A   A
a   B
A   C
b   A
B   B
b   C
C   A
c   B
C   C

Correlation for column X is something less than 1.0, OTOH correlation
for an index on upper(X) is 1.0.

I don't really see
a way to do this without actually examining the multi-column ordering
relationship during ANALYZE.

So did we reach consensus to add a TODO item?

* Compute index correlation on CREATE INDEX and ANALYZE,
  use it for index scan cost estimation

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Correlation in cost_index()

2003-08-10 Thread Manfred Koizar
On Thu, 7 Aug 2003 13:44:19 -0700, Sean Chittenden
[EMAIL PROTECTED] wrote:
 The indexCorrelation^2 algorithm was only a quick hack with no theory
 behind it :-(.  I've wanted to find some better method to put in there,
 but have not had any time to research the problem.

Could we quick hack it to a geometric mean instead since a mean
seemed to yield better results than indexCorrelation^2?

Linear interpolation on (1-indexCorrelation)^2 (algorithm 3 in
http://members.aon.at/pivot/pg/16-correlation-732.diff) is almost as
good as geometric interpolation (algorithm 4 in the patch, proposal 3
in this thread), and its computation is much cheaper because it does
not call exp() and log().  Download
http://members.aon.at/pivot/pg/cost_index.sxc and play around with
your own numbers to get a feeling.
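
If you would rather see numbers without opening the spreadsheet, here
is a small standalone sketch, with toy cost endpoints of my own
choosing (not planner output), that tabulates how I read the three
interpolation rules side by side; compile with -lm:

#include <stdio.h>
#include <math.h>

int
main(void)
{
    const double min_IO = 4000.0;       /* assumed cost at perfect correlation */
    const double max_IO = 634000.0;     /* assumed cost at zero correlation */
    double c;

    printf("%6s %12s %12s %12s\n", "corr", "corr^2", "(1-corr)^2", "geometric");
    for (c = 0.0; c <= 1.0001; c += 0.25)
    {
        /* indexCorrelation^2, the current quick hack */
        double csq = max_IO + c * c * (min_IO - max_IO);
        /* linear interpolation on (1 - indexCorrelation)^2 */
        double omc = min_IO + (1.0 - c) * (1.0 - c) * (max_IO - min_IO);
        /* geometric interpolation between the two endpoints */
        double geo = exp(c * log(min_IO) + (1.0 - c) * log(max_IO));

        printf("%6.2f %12.0f %12.0f %12.0f\n", c, csq, omc, geo);
    }
    return 0;
}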

(1-indexCorrelation)^2 suffers from the same lack of theory behind it
as indexCorrelation^2.  But the results look much more plausible.
Well, at least to me ;-)

Servus
 Manfred

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Correlation in cost_index()

2003-08-08 Thread Manfred Koizar
On Fri, 8 Aug 2003 16:53:48 -0700, Sean Chittenden
[EMAIL PROTECTED] wrote:
# SHOW effective_cache_size ;
 effective_cache_size
----------------------
 4456
(1 row)

Only 35 MB?  Are you testing on such a small machine?

The stats are attached & bzip2 compressed.

Nothing was attached.  Did you upload it to your web site?

 I can say with pretty high confidence that the patch to use a
 geometric mean isn't correct

... the problem with your patch was
that it picked an index less often than the current code when there
was low correlation.

In cost_index.sxc I get lower estimates for *all* proposed new
interpolation methods.  Either my C code doesn't implement the same
calculations as the spreadsheet, or ... 

I manually applied bits of it [...]

... could this explain the unexpected behaviour?

I'm currently downloading your dump.  Can you post the query you
mentioned above?

Servus
 Manfred

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Proof-of-concept for initdb-time shared_buffers selection

2003-07-31 Thread Manfred Koizar
On Fri, 04 Jul 2003 15:29:37 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
The attached patch shows how initdb can dynamically determine reasonable
shared_buffers and max_connections settings that will work on the
current machine.

Can't this be done on postmaster startup?  I'm thinking of two GUC
variables where there is only one today: min_shared_buffers and
max_shared_buffers.  If allocation for the max_ values fails, the
numbers are decreased in a loop of, say, 10 steps until allocation
succeeds, or ultimately fails even at the min_ values.
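
A minimal sketch of that loop (try_alloc_shmem() and all numbers are
made up; the real postmaster would perform its normal shared-memory
setup at each step):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BLCKSZ 8192

/* hypothetical probe: can a shared segment of this size be created? */
static bool
try_alloc_shmem(size_t bytes)
{
    return bytes <= 64 * 1024 * 1024;   /* pretend only 64 MB are available */
}

static int
choose_shared_buffers(int min_buffers, int max_buffers)
{
    int step = (max_buffers - min_buffers) / 10;
    int nbuffers;

    if (step <= 0)
        step = 1;
    for (nbuffers = max_buffers; nbuffers >= min_buffers; nbuffers -= step)
    {
        if (try_alloc_shmem((size_t) nbuffers * BLCKSZ))
            return nbuffers;    /* report the chosen value, e.g. via NOTICE */
    }
    return -1;                  /* allocation failed even near min_buffers */
}

int
main(void)
{
    printf("chosen shared_buffers: %d\n", choose_shared_buffers(64, 20000));
    return 0;
}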

The actual values chosen are reported as a NOTICE and can be inspected
as readonly GUC variables.

This would make life easier for the folks trying to come up with
default .conf files, e.g.
  min_shared_buffers = 64
  max_shared_buffers = 2000
could cover a fairly large range of low level to mid level machines.

A paranoid dba, who doesn't want the postmaster to do unpredictable
things on startup, can always set min_xxx == max_xxx to get the
current behaviour.

Servus
 Manfred

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[HACKERS] 'out of tree' compile (was: Two weeks to feature freeze)

2003-06-27 Thread Manfred Koizar
On Thu, 26 Jun 2003 22:55:45 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
 I've not tried, but if PostgreSQL can do an 'out of tree' compile it
 could make it much easier.

Yes it can, following the usual procedure for autoconfiscated trees:
just invoke configure from an empty directory, e.g.
   mkdir build
   cd build
   ../pgsql/configure

I do this all the time.  Works pretty well.  The only minor annoyance
is that some files are created in the source tree:

./src/backend/bootstrap/bootparse.c
./src/backend/bootstrap/bootscanner.c
./src/backend/bootstrap/bootstrap_tokens.h
./src/backend/parser/gram.c
./src/backend/parser/parse.h
./src/backend/parser/scan.c
./src/backend/utils/misc/guc-file.c
./src/bin/psql/sql_help.h
./src/interfaces/ecpg/preproc/pgc.c
./src/interfaces/ecpg/preproc/preproc.c
./src/interfaces/ecpg/preproc/preproc.h
./src/pl/plpgsql/src/pl.tab.h
./src/pl/plpgsql/src/pl_gram.c
./src/pl/plpgsql/src/pl_scan.c

This didn't itch enough to make me scratch ;-)

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] MARKED_FOR_UPDATE XMAX_COMMITTED == XMAX_INVALID ?

2003-06-12 Thread Manfred Koizar
On Wed, 11 Jun 2003 09:05:33 -0400, Tom Lane [EMAIL PROTECTED]
wrote:
 If a transaction marks a tuple for update and later commits without
 actually having updated the tuple, [...] can we simply
 set the HEAP_XMAX_INVALID hint bit of the tuple?

AFAICS this is a reasonable thing to do.

Thanks for the confirmation.  Here's a patch which also contains some
more noncritical changes to tqual.c:
 .  make code more readable by introducing local variables for xvac
 .  no longer two separate branches for aborted and crashed.
The actions were the same in all cases.

Servus
 Manfred
diff -rcN ../base/src/backend/utils/time/tqual.c src/backend/utils/time/tqual.c
*** ../base/src/backend/utils/time/tqual.c  Thu May  8 21:45:55 2003
--- src/backend/utils/time/tqual.c  Thu Jun 12 12:10:31 2003
***************
*** 76,86 ****
  
          if (tuple->t_infomask & HEAP_MOVED_OFF)
          {
!             if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXvac(tuple)))
                  return false;
!             if (!TransactionIdIsInProgress(HeapTupleHeaderGetXvac(tuple)))
              {
!                 if (TransactionIdDidCommit(HeapTupleHeaderGetXvac(tuple)))
                  {
                      tuple->t_infomask |= HEAP_XMIN_INVALID;
                      return false;
--- 76,87 ----
  
          if (tuple->t_infomask & HEAP_MOVED_OFF)
          {
!             TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
!             if (TransactionIdIsCurrentTransactionId(xvac))
                  return false;
!             if (!TransactionIdIsInProgress(xvac))
              {
!                 if (TransactionIdDidCommit(xvac))
                  {
                      tuple->t_infomask |= HEAP_XMIN_INVALID;
                      return false;
***************
*** 90,100 ****
          }
          else if (tuple->t_infomask & HEAP_MOVED_IN)
          {
!             if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXvac(tuple)))
              {
!                 if (TransactionIdIsInProgress(HeapTupleHeaderGetXvac(tuple)))
                      return false;
!                 if (TransactionIdDidCommit(HeapTupleHeaderGetXvac(tuple)))
                      tuple->t_infomask |= HEAP_XMIN_COMMITTED;
                  else
                  {
--- 91,102 ----
          }
          else if (tuple->t_infomask & HEAP_MOVED_IN)
          {
!             TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
!             if (!TransactionIdIsCurrentTransactionId(xvac))
              {
!                 if (TransactionIdIsInProgress(xvac))
                      return false;
!                 if (TransactionIdDidCommit(xvac))
                      tuple->t_infomask |= HEAP_XMIN_COMMITTED;
                  else
                  {
***************
*** 152,162 ****
      }
  
      /* xmax transaction committed */
-     tuple->t_infomask |= HEAP_XMAX_COMMITTED;
  
      if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
          return true;
  
      return false;
  }
  
--- 154,167 ----
      }
  
      /* xmax transaction committed */
  
      if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
+     {
+         tuple->t_infomask |= HEAP_XMAX_INVALID;
          return true;
+     }
  
+     tuple->t_infomask |= HEAP_XMAX_COMMITTED;
      return false;
  }
  
***************
*** 212,222 ****
  
          if (tuple->t_infomask & HEAP_MOVED_OFF)
          {
!             if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXvac(tuple)))
                  return false;
!             if (!TransactionIdIsInProgress(HeapTupleHeaderGetXvac(tuple)))
              {
!                 if (TransactionIdDidCommit(HeapTupleHeaderGetXvac(tuple)))
                  {
                      tuple->t_infomask |= HEAP_XMIN_INVALID;
                      return false;
--- 217,228 ----
  
          if (tuple->t_infomask & HEAP_MOVED_OFF)
          {
!             TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
!             if (TransactionIdIsCurrentTransactionId(xvac))
                  return false;
!             if (!TransactionIdIsInProgress(xvac))
              {
!                 if (TransactionIdDidCommit(xvac))
                  {

[HACKERS] MARKED_FOR_UPDATE XMAX_COMMITTED == XMAX_INVALID ?

2003-06-11 Thread Manfred Koizar
If a transaction marks a tuple for update and later commits without
actually having updated the tuple, do we still need the information
that the tuple has once been reserved for an update or can we simply
set the HEAP_XMAX_INVALID hint bit of the tuple?

In other words, is this snippet from a patch I'm working on a valid
modification to HeapTupleSatisfiesVacuum in tqual.c?

{
if (TransactionIdIsInProgress(HeapTupleHeaderGetXmax(tuple)))
return HEAPTUPLE_LIVE;
-   if (TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple)))
-       tuple->t_infomask |= HEAP_XMAX_COMMITTED;
-   else
-       /* it's either aborted or crashed */
-       tuple->t_infomask |= HEAP_XMAX_INVALID;
+   /*
+    * We don't really care whether xmax did commit, abort or
+    * crash. We know that xmax did mark the tuple for update,
+    * but it did not and will never actually update it.
+    */
+   tuple->t_infomask |= HEAP_XMAX_INVALID;
}
return HEAPTUPLE_LIVE;

There are a few more places in tqual.c which could be simplified like
that.

Servus
 Manfred

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html


[HACKERS] Separate build directory

2003-04-05 Thread Manfred Koizar
When I configure and make in a separate build directory tree, I get

createdb.c:15: dumputils.h: No such file or directory
and
print.c:9: common.h: No such file or directory

in src/bin/scripts.  I don't know whether the attached change to the
Makefile is the preferred way to fix this; at least it works for me.

BTW, make creates these files in the source tree:
./src/backend/bootstrap/bootparse.c
./src/backend/bootstrap/bootscanner.c
./src/backend/bootstrap/bootstrap_tokens.h
./src/backend/parser/gram.c
./src/backend/parser/parse.h
./src/backend/parser/scan.c
./src/backend/utils/misc/guc-file.c
./src/bin/psql/sql_help.h
./src/interfaces/ecpg/preproc/pgc.c
./src/interfaces/ecpg/preproc/preproc.c
./src/interfaces/ecpg/preproc/preproc.h
./src/pl/plpgsql/src/pl.tab.h
./src/pl/plpgsql/src/pl_gram.c
./src/pl/plpgsql/src/pl_scan.c

Shouldn't they be created in the build tree instead?

Servus
 Manfred
diff -ruN ../base/src/bin/scripts/Makefile src/bin/scripts/Makefile
--- ../base/src/bin/scripts/Makefile2003-04-04 15:45:47.0 +0200
+++ src/bin/scripts/Makefile2003-04-05 14:42:15.0 +0200
@@ -16,7 +16,7 @@
 SCRIPTS := vacuumdb clusterdb
 PROGRAMS = createdb createlang createuser dropdb droplang dropuser
 
-override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+override CPPFLAGS := -I$(libpq_srcdir) -I. -I$(top_srcdir)/src/bin/scripts $(CPPFLAGS)
 
 
 all: submake-libpq submake-backend $(PROGRAMS)

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly

