Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Hannu Krosing
On a fine day, Sat, 2006-06-24 at 19:36, Bruce Momjian wrote:
> Hannu Krosing wrote:
> > On a fine day, Fri, 2006-06-23 at 13:08, Tom Lane wrote:

> > > 
> > > Bottom line: there's still lots of low-hanging fruit.  Why are people
> > > feeling that we need to abandon or massively complicate our basic
> > > architecture to make progress?
> > 
> > Maybe we could start from reusing the index tuples which point to
> > invisible tuples ? The index is not MVCC anyway, so maybe it is easier
> > to do in-place replacement there ?
> > 
> > This probably has the same obstacles which have prevented us from
> > removing those in the first place (removing instead of marking as
> > invisible). Does it cause some locking issues ? Or does it go against
> > some other constraints of our index lookups ?
> > 
> > I think that just setting the invisible bit in an index leaf node causes
> > nearly as much disk io as removing the node.
> > 
> > If we could delete/reuse old index tuples, it would solve a sizable
> > chunk of index-growth problem, especially for cases where referenced key
> > value does not change.
> 
> I think heap _and_ index reuse is the only useful direction.  Index or
> heap reuse alone seems too marginal for the added complexity.

Sure, but index reuse seems a lot easier, as there is nothing additional
to remember or clean out when doing it.

When reusing a heap tuple you have to clean out all index entries
pointing to it.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] vacuum row?

2006-06-24 Thread Alvaro Herrera
Mark Woodward wrote:
> I originally suggested a methodology for preserving MVCC and everyone is
> confusing it with update "in place"; this is not what I intended.

Vacuuming a single row doesn't make sense, but maybe vacuuming a page
would.  Naturally, it would need to scan all the indexes in full to
clean them up, so it's probably not a good idea in general.

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] [CORE] GPL Source and Copyright Questions

2006-06-24 Thread Bruce Momjian
Tom Lane wrote:
> [ redirecting to -hackers, as I see no need for this to be a core issue ]
> 
> Charles Comiskey <[EMAIL PROTECTED]> writes:
> > Hello,
> > I've recently looked through the PostgreSQL code and a couple of questions 
> > surfaced.  I was hoping someone here may be able to answer them.  Two have 
> > links to possible GPL sources and the third is just a contribution 
> > question. 
> 
> > item #1: Does the geo_ops.c file contain GPL code?
> > Embedded within the geo_ops.c file is a John Franks copyright statement 
> > referring to wn/image.c file from WN Server 1.15.1.  WN Server appears to 
> > have been under the GPL license since 0.94 and continues to be offered 
> > under the GPL license today.  John's letter to Linux Journal seems to only 
> > point the user to his WN Server distribution vs granting any specific 
> > license.

The comment is:

/* poly_contain_pt()
 * Test to see if the point is inside the polygon.
 * Code adapted from integer-based routines in
 *  Wn: A Server for the HTTP
 *  File: wn/image.c
 *  Version 1.15.1
 *  Copyright (C) 1995  
 * (code offered for use by J. Franks in Linux Journal letter.)
 */

That term "adapted from" isn't something Thomas would idly type, I
think.  I bet it means he looked at John Franks' code and used it as a
base for our code.  I am not concerned.

> > Questions:
> > 1) Is any John Franks code really in this file?
> > 2) Did John provide a separate license for PostgreSQL to license it under 
> > the BSD license?
> 
> This code seems to have been inserted by Tom Lockhart on 1997-07-29
> (geo_ops.c rev 1.13).  Tom, any info on the copyright status?
> 
> > References:
> > - 1994 e-mail with GPL reference to WN Server v0.94: 
> > http://1997.webhistory.org/www.lists/www-talk.1994q4/1080.html
> > - 1995 e-mail from John with GPL license text reference: 
> > http://1997.webhistory.org/www.lists/www-talk.1995q1/0482.html
> > - WN Server url today: http://hopf.math.northwestern.edu/
> > - Link to Linux Journal article: http://www.linuxjournal.com/article/2197
> 
> 
> > item #2: Is dllinit.c GPL code?
> > The file dllinit.c, located in the src/utils directory documents the 
> > author as Mumit Khan.  Did Mumit Khan contribute this code and did he 
> > contribute it for distribution under the PostgreSQL license?  If I read 
> > correctly, the name stamp in CVS does not indicate that Mumit Khan 
> > directly contributed this file.  I ask because this question has surfaced 
> > as a forum item for a different project and Mumit Khan directly answered 
> > their forum posting (http://curl.haxx.se/mail/lib-2002-11/0061.html).
> 
> Per the comments in that thread, it would be pretty trivial to either
> rewrite or remove this file.  I don't think there is anything there that
> amounts to protectable content (and Mumit evidently agrees, see link)
> but let's do something about it anyway.  Can some of the Windows folk
> check whether we can just remove it?

File removed.

> > item #3: Carsten Wolff copyright in informix.c file
> > The file informix.c contains a copyright from Carsten Wolff.  Did Carsten 
> > directly contribute this file to the PostgreSQL project?
> 
> This code was added by Michael Meskes in informix.c rev 1.6
> (2003-05-06).  Michael, any info on the exact copyright status?

Fixed to remove copyright.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] [CORE] GPL Source and Copyright Questions

2006-06-24 Thread Bruce Momjian
Michael Meskes wrote:
> On Thu, Jun 22, 2006 at 11:37:08AM -0400, Tom Lane wrote:
> > > item #3: Carsten Wolff copyright in informix.c file
> > > The file informix.c contains a copyright from Carsten Wolff.  Did Carsten 
> > > directly contribute this file to the PostgreSQL project?
> > 
> > This code was added by Michael Meskes in informix.c rev 1.6
> > (2003-05-06).  Michael, any info on the exact copyright status?
> 
> Yes. In fact the copyright belongs to credativ GmbH, the company that
> paid Carsten for his work. As you may or may not know, I'm the CEO of
> that company and can assure you that his work was contributed to the
> PostgreSQL project.

Michael, I saw your patch stating that the copyright was assigned to
PGDG.  However, once that happens, our policy is to remove copyright
notices naming individual contributors because they confuse things.

Therefore, I have updated your applied patch to mention just the
author's name, email address, and date.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: src/interfaces/ecpg/compatlib/informix.c
===
RCS file: /cvsroot/pgsql/src/interfaces/ecpg/compatlib/informix.c,v
retrieving revision 1.44
diff -c -c -r1.44 informix.c
*** src/interfaces/ecpg/compatlib/informix.c	23 Jun 2006 14:50:01 -	1.44
--- src/interfaces/ecpg/compatlib/informix.c	25 Jun 2006 01:41:06 -
***
*** 666,679 
  	return 0;
  }
  
! /***
! 		  rfmt.c  -  description
! 			 ---
! 	begin : Wed Apr 2 2003
! 	copyright			 : (C) 2003 by Carsten Wolff
! 	email : [EMAIL PROTECTED]
! 	Contributed under the PostgreSQL License by credativ GmbH
!  ***/
  
  static struct
  {
--- 666,675 
  	return 0;
  }
  
! /*
!  *	rfmt.c  -  description
!  *	by Carsten Wolff <[EMAIL PROTECTED]>, Wed Apr 2 2003
!  */
  
  static struct
  {



Re: [HACKERS] Gist does not build with VC++ anymore

2006-06-24 Thread Bruce Momjian

Patch applied.  It seems offsetof() under MSVC can only take plain
structure members, not array elements within structures.  Anyway,
offsetof() was computing the offset of the first array element, so the
"[0]" is not needed.
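
For anyone hitting this later, a minimal standalone illustration of the
difference (a reconstruction from the errors above, not code from the
tree):

#include <stddef.h>

typedef struct
{
    int     n;
    double  vector[1];          /* variable-length trailer */
} Vec;

/* accepted everywhere as an integer constant expression */
#define HDRSZ_OK    (offsetof(Vec, vector))

/* rejected by MSVC in constant contexts such as array sizes, even
   though vector[0] begins exactly where vector does, so both forms
   denote the same offset */
#define HDRSZ_BAD   (offsetof(Vec, vector[0]))

char    storage[HDRSZ_OK + 4 * sizeof(double)];     /* builds with _OK */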

---

Magnus Hagander wrote:
> > > I've updated my VC++ build env with latest CVS, and it no longer 
> > > builds because of changes to GIST:
> > > 
> > > src\backend\access\gist\gistutil.c(237) : error C2057: 
> > > expected constant expression
> > > src\backend\access\gist\gistutil.c(237) : error C2466: 
> > cannot allocate 
> > > an array of constant size 0
> > > src\backend\access\gist\gistutil.c(237) : error C2133: 'storage' :
> > > unknown size
> > > 
> > > 
> > > The problem appears to come from:
> > > #define GEVHDRSZ  (offsetof(GistEntryVector, vector[0]))
> > > 
> > > Which can't be used in this context. 
> > > 
> > > What would be the proper fix for that?
> > 
> > Hmm. Now that I look at it more clearly, it seems Hiroshi has 
> > a fix for this in his submitted patch (that still had a lot 
> > of other problems in the rest of it). I'm not sure if it's 
> > the proper fix, but it's there.
> 
> While I'm spamming everybody anyway, here's another thing that might fix
> it? This one compiles and tests, and I *think* it does the right
> thing... If it's correct, I think it looks like a cleaner solution.
> 
> //Magnus
> 
> 
> RCS file: /projects/cvsroot/pgsql/src/include/access/gist.h,v
> retrieving revision 1.52
> diff -c -r1.52 gist.h
> *** src/include/access/gist.h   5 Mar 2006 15:58:53 -   1.52
> --- src/include/access/gist.h   24 Jun 2006 17:20:28 -
> ***
> *** 142,148 
> GISTENTRY   vector[1];
>   } GistEntryVector;
> 
> ! #define GEVHDRSZ  (offsetof(GistEntryVector, vector[0]))
> 
>   /*
>* macro to initialize a GISTENTRY
> --- 142,148 
> GISTENTRY   vector[1];
>   } GistEntryVector;
> 
> ! #define GEVHDRSZ  (offsetof(GistEntryVector, vector))
> 
>   /*
>* macro to initialize a GISTENTRY
> 

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Bruce Momjian
Hannu Krosing wrote:
> On a fine day, Fri, 2006-06-23 at 13:08, Tom Lane wrote:
> > Csaba Nagy <[EMAIL PROTECTED]> writes:
> > >> Surprisingly it's mostly WAL traffic, the heap/index pages themselves are
> > >> often not yet synced to disk by time of vacuum, so no additional traffic
> > >> there. If you had made 5 updates per page and then vacuum it, then you
> > >> make effectively 1 extra WAL write meaning 20% increase in WAL traffic. 
> > 
> > > Is this also holding about read traffic ? I thought vacuum will make a
> > > full table scan... for big tables a full table scan is always badly
> > > influencing the performance of the box. If the full table scan would be
> > > avoided, then I wouldn't mind running vacuum in a loop... 
> > 
> > If you're doing heavy updates of a big table then it's likely to end up
> > visiting most of the table anyway, no?  There is talk of keeping a map
> > of dirty pages, but I think it'd be a win for infrequently-updated
> > tables, not ones that need constant vacuuming.
> > 
> > I think a lot of our problems in this area could be solved with fairly
> > straightforward tuning efforts on the existing autovacuum
> > infrastructure.  In particular, someone should be looking into
> > recommendable default vacuum-cost-delay settings so that a background
> > vacuum doesn't affect performance too much.
> 
> One thing that would help updates quite a lot in some scenarios is
> keeping the pages only partially-filled, so that most updates could keep
> the new version in the same page. I think that has also been discussed
> as an option to vacuum and maybe as part of initial inserts. Maybe some
> of it even ended up as a todo item.

We have a patch in the queue for index fillfactor, which will be in 8.2.
I am also hoping that frequently updated rows will migrate out to the
empty pages.

> > Another problem with the
> > current autovac infrastructure is that it doesn't respond very well to
> > the case where there are individual tables that need constant attention
> > as well as many that don't.  If you have N databases then you can visit
> > a particular table at most once every N*autovacuum_naptime seconds, and
> > *every* table in the entire cluster gets reconsidered at that same rate.
> > I'm not sure if we need the ability to have multiple autovac daemons
> > running at the same time, 
> 
> My patch enabling effective continuous vacuum of fast-update tables,
> while still being able to vacuum huge slowly changing ones, is still not
> applied. Without that patch there is no reason to vacuum the small and
> fast-changing tables while vacuum on bigger tables is running, as it
> won't clean out dead tuples anyway.

I think it will be applied, but I am looking for someone else to eyeball
it, since Tom has some concerns.

> > but we definitely could use something with a
> > more flexible table-visiting pattern.  Perhaps it would be enough to
> > look through the per-table stats for each database before selecting the
> > database to autovacuum in each cycle, instead of going by "least
> > recently autovacuumed".
> > 
> > Bottom line: there's still lots of low-hanging fruit.  Why are people
> > feeling that we need to abandon or massively complicate our basic
> > architecture to make progress?
> 
> Maybe we could start from reusing the index tuples which point to
> invisible tuples ? The index is not MVCC anyway, so maybe it is easier
> to do in-place replacement there ?
> 
> This probably has the same obstacles which have prevented us from
> removing those in the first place (removing instead of marking as
> invisible). Does it cause some locking issues ? Or does it go against
> some other constraints of our index lookups ?
> 
> I think that just setting the invisible bit in an index leaf node causes
> nearly as much disk io as removing the node.
> 
> If we could delete/reuse old index tuples, it would solve a sizable
> chunk of index-growth problem, especially for cases where referenced key
> value does not change.

I think heap _and_ index reuse is the only useful direction.  Index or
heap reuse alone seems too marginal for the added complexity.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Alvaro Herrera
Mark Woodward wrote:

> The update behavior of PostgreSQL is probably the *last* serious issue.
> Debate all you want, vacuum mitigates the problem to varying levels,
> fixing the problem will be a huge win. If the update behavior gets fixed,
> I can't think of a single issue with postgresql that would be a show
> stopper.

Nah, it's just *your* pet peeve.  Everyone has theirs.  Some people may
share yours, of course.  I agree it's a problem, but from there to
saying "it's _the last_ issue" there's a lot of distance.


Your idea of reusing a tuple's self pointer (t_ctid) does not work, BTW,
because the self pointer must point to self.  In the case where the
pointer does not point to exactly the same tuple, it must point to a
newer version.  If you change that invariant, a lot of things break; see
for example heap_get_latest_tid.
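
For readers following the pointer-chain argument, a toy sketch of that
invariant (illustrative names only; the real definitions are
HeapTupleHeaderData in access/htup.h, and heap_get_latest_tid is the
code that walks the chain):

typedef struct { unsigned blk; unsigned off; } ItemPtr;

typedef struct Tuple
{
    ItemPtr     t_self;         /* where this version lives */
    ItemPtr     t_ctid;         /* == t_self, or the next-newer version */
} Tuple;

static int
same(ItemPtr a, ItemPtr b)
{
    return a.blk == b.blk && a.off == b.off;
}

/* fetch() stands in for reading a tuple by its item pointer */
static Tuple *
latest_version(Tuple *t, Tuple *(*fetch)(ItemPtr))
{
    while (!same(t->t_ctid, t->t_self))     /* not the newest yet? */
        t = fetch(t->t_ctid);               /* hop to the newer version */
    return t;
}

Reusing t_ctid for anything else breaks every caller that relies on the
"self or newer" rule.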

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Bruce Momjian
Heikki Linnakangas wrote:
> On Sat, 24 Jun 2006, Bruce Momjian wrote:
> 
> > OK, I have an idea.  Right now, an UPDATE where the old and new rows are
> > on the same page has two index entries.  What if we made only one index
> > entry for both?  We already have UPDATE chaining, where the old row
> > points to the new one.  If no key columns change, we set a bit in the
> > heap that the chaining points to the old and new row (both on the same
> > page), so an index scan uses one index entry to see the old and new row,
> > and once the old row is no longer visible, the page item id can be
> > updated to point to the new row and the old row can be marked free and
> > perhaps used for a future UPDATE.  (UPDATE already tries to keep
> > updates on the same heap page.)
> 
> In fact, that's what I originally thought Mark was suggesting. A couple of 
> points:
> 
> Why the limitation of old and new row being on the same page?

Because having them be on the same page is the only way you can update
the page item pointer so that, when you recycle the row, the indexes are
pointing to the new version.  Pages look like:

[marker][item1][item2][item3]...[tuple1][tuple2][tuple3]

and indexes only point to items, not to tuples.  This allows tuples to
be compacted on the page without affecting the indexes.

If tuple1 is updated to tuple2, once tuple1 is no longer visible to any
backends, you can modify item1 to point to tuple2, and you can mark the
space used by tuple1 as reusable:

[marker][item1(tuple2)][item2][item3]...[free][tuple2][tuple3]
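
A rough sketch of that item-pointer redirect (illustrative structures
only; the real line pointers are ItemIdData in storage/itemid.h).
Because index entries reference item slots rather than tuple storage,
changing one slot retargets every index at once:

typedef struct
{
    unsigned short  offset;     /* where the tuple's bytes begin */
    unsigned short  len;        /* 0 means the storage is free */
} ItemSlot;

/* item1 points at tuple1; tuple2 is the newer version on the same
   page.  Once tuple1 is invisible to every backend: */
static void
redirect_item(ItemSlot *item1, unsigned short tuple2_off,
              unsigned short tuple2_len)
{
    item1->offset = tuple2_off;     /* all indexes now resolve to tuple2 */
    item1->len = tuple2_len;
    /* tuple1's old storage can now be compacted or reused for a
       future same-page UPDATE */
}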

> This only works if none of the updated columns are indexed. That's a bit 
> annoying. It would be nice to be able to create new index tuples in those 

The hope is that a commonly updated tuple will eventually land on a page
where there is sufficient free space for the updated version to stay on
the same page.  For an active server, there might be several updated
versions of rows on the same page.

> indexes that contain one of the changed columns, but not in others.

If you can't expire the old row because one of the indexed columns was
modified, I see no reason to try to reduce the additional index entries.

> What happens if you create a new index that contains one of 
> the changed columns?

Uh, I had not thought of that.  You could easily create two index
entries for the old and new rows, but then the heap bit saying there is
only one index row would be inaccurate for the new index.  I suppose you
could create new rows in all indexes and clear the heap bit.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Christopher Browne
Martha Stewart called it a Good Thing when [EMAIL PROTECTED] (Jan Wieck) wrote:
> On 6/22/2006 2:37 PM, Alvaro Herrera wrote:
>
>> Adding back pgsql-hackers.
>> Mark Woodward wrote:
>>> > Mark Woodward wrote:
>>> >
>>> >> Hmm, OK, then the problem is more serious than I suspected.
>>> >> This means that every index on a row has to be updated on every
>>> >> transaction that modifies that row. Is that correct?
>>> >
>>> > Add an index entry, yes.
>>> >
>>> >> I am attaching some code that shows the problem with regard to
>>> >> applications such as web server session management, when run, each
>>> >> second
>>> >> the system can handle fewer and fewer connections. Here is a brief
>>> >> output:
>>> >> [...]
>>> >> There has to be a more linear way of handling this scenario.
>>> >
>>> > So vacuum the table often.
>>> That fixes the symptom, not the problem. The problem is performance
>>> steadily degrades over time.
>> No, you got it backwards.  The performance degradation is the
>> symptom.
>> The problem is that there are too many dead tuples in the table.  There
>> is one way to solve that problem -- remove them, which is done by
>> running vacuum.
>
> Precisely.
>
>> There are some problems with vacuum itself, that I agree with.  For
>> example it would be good if a long-running vacuum wouldn't affect a
>> vacuum running in another table because of the long-running transaction
>> effect it has.  It would be good if vacuum could be run partially over a
>> table.  It would be good if there was a way to speed up vacuum by using
>> a dead space map or something.
>
> It would be good if vacuum wouldn't waste time on blocks that don't
> have any possible work in them. Vacuum has two main purposes. A)
> remove dead rows and B) freeze xids. Once a block has zero deleted
> rows and all xids are frozen, there is nothing to do with this block
> and vacuum should skip it until a transaction updates that block.
>
> This requires 2 bits per block, which is 32K per 1G segment of a
> heap. Clearing the bits is done when the block is marked dirty. This
> way vacuum would not waste any time and IO on huge slow changing
> tables. That part, sequentially scanning huge tables that didn't
> change much is what keeps us from running vacuum every couple of
> seconds.

This is, in effect, the "VACUUM Space Map."

I see one unfortunate thing about that representation of it, namely
that it would in effect require that non-frozen pages be kept on the
VSM for potentially a long time.

Based on *present* VACUUM strategy, at least.

Would it not be the case, here, that any time a page could be
"frozen," it would have to be?  In effect, we are always trying to run
VACUUM FREEZE?
-- 
output = ("cbbrowne" "@" "gmail.com")
http://cbbrowne.com/info/finances.html
Rules  of the  Evil  Overlord #72.  "If  all the  heroes are  standing
together around  a strange device and  begin to taunt me,  I will pull
out a conventional weapon  instead of using my unstoppable superweapon
on them."



Re: [HACKERS] Buffer for inner and outer table

2006-06-24 Thread Alvaro Herrera
Daniel Xavier de Sousa wrote:

>   Can somebody tell me where postgres controls the buffer for the
>   inner and outer table when it executes Nest_loop_join? I would like
>   to know how to change the size of this buffer and see all
>   statistics about this

There is no such buffer.  Buffers used in scans are kept in
shared_buffers, just like for everything else.

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Buffer for inner and outer table

2006-06-24 Thread Daniel Xavier de Sousa
Hi,

Can somebody tell me where postgres controls the buffer for the inner
and outer table when it executes Nest_loop_join? I would like to know
how to change the size of this buffer and see all statistics about
this.

There is another doubt: how can I see the page accesses (in RAM and on
HD) when it executes Nest-loop-join?

Thank you,
Daniel Xavier de Sousa

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Martijn van Oosterhout
On Sat, Jun 24, 2006 at 10:04:43PM +0300, Hannu Krosing wrote:
> Maybe we could start from reusing the index tuples which point to
> invisible tuples ? The index is not MVCC anyway, so maybe it is easier
> to do in-place replacement there ?
> 
> This probably has the same obstacles which have prevented us from
> removing those in the first place (removing instead of marking as
> invisible). Does it cause some locking issues ? Or does it go against
> some other constraints of our index lookups ?

The problem with updating an index is that you have to do it in a way
that concurrent scans (forwards and backwards) don't get confused
because the tuple they stopped on vanished.

AIUI, the current approach is two-step. The first time round you mark
it deleted but don't actually delete it. Thus, any scan currently
stopped on that tuple won't have a problem. Sometime later you remove
the actual tuple, once you know there's no scan stopped on it (because
no scan will deliberately stop on a deleted tuple).

I forget the actual locking steps that ensure this though.
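
To make the two-step idea concrete, a toy model (illustrative names,
not the real nbtree routines, and the actual locking is considerably
more involved):

#define INDEX_TUPLE_DEAD 0x01

typedef struct
{
    int         key;
    unsigned    flags;
} ToyIndexTuple;

/* pass 1: mark, don't remove -- a paused scan keeps a valid anchor */
static void
mark_dead(ToyIndexTuple *itup)
{
    itup->flags |= INDEX_TUPLE_DEAD;
}

/* pass 2: physical removal is legal once no scan is stopped here,
   because no scan deliberately stops on an already-dead tuple */
static int
can_remove(const ToyIndexTuple *itup, int scans_paused_here)
{
    return (itup->flags & INDEX_TUPLE_DEAD) && scans_paused_here == 0;
}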

> If we could delete/reuse old index tuples, it would solve a sizable
> chunk of index-growth problem, especially for cases where referenced key
> value does not change.

I think we recently changed the code to always scan an index a page at
a time so maybe scans no longer stop in the middle of a page anymore...
Or perhaps that was VACUUM only.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.




Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Hannu Krosing
On a fine day, Sat, 2006-06-24 at 15:44, Jan Wieck wrote:

> >> That fixes the symptom, not the problem. The problem is performance
> >> steadily degrades over time.
> > 
> > No, you got it backwards.  The performance degradation is the symptom.
> > The problem is that there are too many dead tuples in the table.  There
> > is one way to solve that problem -- remove them, which is done by
> > running vacuum.
> 
> Precisely.
> 
> > There are some problems with vacuum itself, that I agree with.  For
> > example it would be good if a long-running vacuum wouldn't affect a
> > vacuum running in another table because of the long-running transaction
> > effect it has.  It would be good if vacuum could be run partially over a
> > table.  It would be good if there was a way to speed up vacuum by using
> > a dead space map or something.
> 
> It would be good if vacuum wouldn't waste time on blocks that don't have 
> any possible work in them. Vacuum has two main purposes. A) remove dead 
> rows and B) freeze xids. Once a block has zero deleted rows and all xids 
> are frozen, there is nothing to do with this block and vacuum should 
> skip it until a transaction updates that block.
> 
> This requires 2 bits per block, which is 32K per 1G segment of a heap. 
> Clearing the bits is done when the block is marked dirty. This way 
> vacuum would not waste any time and IO on huge slow changing tables. 
> That part, sequentially scanning huge tables that didn't change much is 
> what keeps us from running vacuum every couple of seconds.

Seems like a plan. 

Still, there is another problem which is not solved by the map approach
alone, at least with the current implementation of vacuum.

This is the fact that we need to do a full scan over the index(es) to
clean up pointers to removed tuples. And huge tables tend to have huge
indexes.

As indexes have no MVCC info inside them, it may be possible to start
reusing index entries pointing to rows that are invisible to all running
transactions. Currently we just mark these index entries as dead, but
maybe there is a way to reuse them. This could solve the index bloat
problem for many cases.

Another possible solution for indexes with mostly dead pointers is doing
a reindex, but this will become possible only after we have implemented
a concurrent, non-blocking CREATE INDEX.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: Anyone still care about Cygwin? (was Re: [HACKERS] [CORE] GPL

2006-06-24 Thread Robert Treat
On Friday 23 June 2006 14:30, Andrew Dunstan wrote:
> Tom Lane wrote:
> > Andrew Dunstan <[EMAIL PROTECTED]> writes:
> >> There are several supported platforms not represented on the buildfarm -
> >> e.g. the one HPUX member has never actually reported any results.
> >
> > Yeah, and this is not a good thing.  Eventually I'd like to get to a
> > point where every platform we consider "supported" has regular buildfarm
> > reports.  No more calls for port reports during beta periods --- beta
> > work should focus on functionality testing, not getting it to build.
>
> Then people who have access to people who own or can provide access to
> machines in classes not covered need to do a bit of begging ;-)
>
> The requirements are (deliberately) very modest:
>
> OS and toolset required to build postgres from CVS
> A modern perl installation (>=5.6 is adequate)
> Anonymous read access to a CVS repository - either the one at
> postgresql.org or a replica
> Outbound HTTP port 80 access to www.pgbuildfarm.org, possibly via a proxy.
>
> Once it is set up it is close to hands free - you just set up the cron
> job(s) or equivalent.
>

Dave, 

wasn't someone just trying to donate a machine to us for the website but we 
weren't sure what to do with it?  One that could do VM's?  Seems we could use 
that for some buildfarm members maybe. 

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jan Wieck

On 6/22/2006 2:37 PM, Alvaro Herrera wrote:

> Adding back pgsql-hackers.
> 
> Mark Woodward wrote:
> 
> > > Mark Woodward wrote:
> > >
> > >> Hmm, OK, then the problem is more serious than I suspected.
> > >> This means that every index on a row has to be updated on every
> > >> transaction that modifies that row. Is that correct?
> > >
> > > Add an index entry, yes.
> > >
> > >> I am attaching some code that shows the problem with regard to
> > >> applications such as web server session management, when run, each
> > >> second
> > >> the system can handle fewer and fewer connections. Here is a brief
> > >> output:
> > >> [...]
> > >> There has to be a more linear way of handling this scenario.
> > >
> > > So vacuum the table often.
> > 
> > That fixes the symptom, not the problem. The problem is performance
> > steadily degrades over time.
> 
> No, you got it backwards.  The performance degradation is the symptom.
> The problem is that there are too many dead tuples in the table.  There
> is one way to solve that problem -- remove them, which is done by
> running vacuum.

Precisely.

> There are some problems with vacuum itself, that I agree with.  For
> example it would be good if a long-running vacuum wouldn't affect a
> vacuum running in another table because of the long-running transaction
> effect it has.  It would be good if vacuum could be run partially over a
> table.  It would be good if there was a way to speed up vacuum by using
> a dead space map or something.

It would be good if vacuum wouldn't waste time on blocks that don't have
any possible work in them. Vacuum has two main purposes. A) remove dead
rows and B) freeze xids. Once a block has zero deleted rows and all xids
are frozen, there is nothing to do with this block and vacuum should
skip it until a transaction updates that block.

This requires 2 bits per block, which is 32K per 1G segment of a heap.
Clearing the bits is done when the block is marked dirty. This way
vacuum would not waste any time and IO on huge slow changing tables.
That part, sequentially scanning huge tables that didn't change much is
what keeps us from running vacuum every couple of seconds.
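
The arithmetic checks out with the default 8K block size: a 1 GB segment
is 131072 blocks, and at 2 bits per block the map is 262144 bits = 32K.
A minimal sketch of that bookkeeping (an illustration, not a patch):

#define BLCKSZ          8192
#define BLOCKS_PER_SEG  (1024 * 1024 * 1024 / BLCKSZ)   /* 131072 */

typedef struct
{
    /* bit 0 set: block has no dead rows; bit 1 set: all xids frozen.
       Both set means vacuum can skip the block entirely. */
    unsigned char bits[BLOCKS_PER_SEG / 4];             /* 32K */
} VacMap;

/* a writer dirties the block: clear both bits */
static void
block_dirtied(VacMap *m, unsigned blk)
{
    m->bits[blk / 4] &= (unsigned char) ~(3 << ((blk % 4) * 2));
}

/* vacuum found nothing left to do: set both bits */
static void
block_done(VacMap *m, unsigned blk)
{
    m->bits[blk / 4] |= 3 << ((blk % 4) * 2);
}

static int
vacuum_can_skip(const VacMap *m, unsigned blk)
{
    return ((m->bits[blk / 4] >> ((blk % 4) * 2)) & 3) == 3;
}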



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Hannu Krosing
On a fine day, Fri, 2006-06-23 at 17:27, Bruce Momjian wrote:
> Jonah H. Harris wrote:
> > On 6/23/06, Tom Lane <[EMAIL PROTECTED]> wrote:
> > > What I see in this discussion is a huge amount of "the grass must be
> > > greener on the other side" syndrome, and hardly any recognition that
> > > every technique has its downsides and complications.
> > 
> > I'm being totally objective.  I don't think we should abandon
> > PostgreSQL's overall design at all, because we do perform INSERTs and
> > DELETEs much better than most systems.  However, I've looked at many
> > systems and how they implement UPDATE so that it is a scalable
> > operation.  Sure, there are costs and benefits to each implementation,
> > but I think we have some pretty brilliant people in this community and
> > can come up with an elegant design for scalable UPDATEs.
> 
> I think the UPDATE case is similar to the bitmap index scan or perhaps
> bitmap indexes on disk --- there are cases we know can not be handled
> well by our existing code, so we have added (or might add) these
> features to try to address those difficult cases.

Not really. Bitmap index scan and bitmap indexes are both new additions
that work well within the existing framework.

While the problem of slowdown on frequent updates is real, the suggested
fix is just plain wrong, as it is based on someone's faulty assumptions
about how index lookups work, and a very much simplified view of how the
different parts of the system work together to implement MVCC.

The original fix he "suggests" addressed that imagined behaviour and
thus ignored all the real problems of such a change.

All the subsequent suggestions were variations on the first ones, and
failed to address or even research any of the problems brought up.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Hannu Krosing
On a fine day, Fri, 2006-06-23 at 13:08, Tom Lane wrote:
> Csaba Nagy <[EMAIL PROTECTED]> writes:
> >> Surprisingly it's mostly WAL traffic, the heap/index pages themselves are
> >> often not yet synced to disk by time of vacuum, so no additional traffic
> >> there. If you had made 5 updates per page and then vacuum it, then you
> >> make effectively 1 extra WAL write meaning 20% increase in WAL traffic. 
> 
> > Is this also holding about read traffic ? I thought vacuum will make a
> > full table scan... for big tables a full table scan is always badly
> > influencing the performance of the box. If the full table scan would be
> > avoided, then I wouldn't mind running vacuum in a loop... 
> 
> If you're doing heavy updates of a big table then it's likely to end up
> visiting most of the table anyway, no?  There is talk of keeping a map
> of dirty pages, but I think it'd be a win for infrequently-updated
> tables, not ones that need constant vacuuming.
> 
> I think a lot of our problems in this area could be solved with fairly
> straightforward tuning efforts on the existing autovacuum
> infrastructure.  In particular, someone should be looking into
> recommendable default vacuum-cost-delay settings so that a background
> vacuum doesn't affect performance too much.

One thing that would help updates quite a lot in some scenarios is
keeping the pages only partially-filled, so that most updates could keep
the new version in the same page. I think that has also been discussed
as an option to vacuum and maybe as part of initial inserts. Maybe some
of it even ended up as a todo item.

> Another problem with the
> current autovac infrastructure is that it doesn't respond very well to
> the case where there are individual tables that need constant attention
> as well as many that don't.  If you have N databases then you can visit
> a particular table at most once every N*autovacuum_naptime seconds, and
> *every* table in the entire cluster gets reconsidered at that same rate.
> I'm not sure if we need the ability to have multiple autovac daemons
> running at the same time, 

My patch enabling effective continuous vacuum of fast-update tables,
while still being able to vacuum huge slowly changing ones, is still not
applied. Without that patch there is no reason to vacuum the small and
fast-changing tables while vacuum on bigger tables is running, as it
won't clean out dead tuples anyway.

> but we definitely could use something with a
> more flexible table-visiting pattern.  Perhaps it would be enough to
> look through the per-table stats for each database before selecting the
> database to autovacuum in each cycle, instead of going by "least
> recently autovacuumed".
> 
> Bottom line: there's still lots of low-hanging fruit.  Why are people
> feeling that we need to abandon or massively complicate our basic
> architecture to make progress?

Maybe we could start from reusing the index tuples which point to
invisible tuples ? The index is not MVCC anyway, so maybe it is easier
to do in-place replacement there ?

This probably has the same obstacles which have prevented us from
removing those in the first place (removing instead of marking as
invisible). Does it cause some locking issues ? Or does it go against
some other constraints of our index lookups ?

I think that just setting the invisible bit in an index leaf node causes
nearly as much disk io as removing the node.

If we could delete/reuse old index tuples, it would solve a sizable
chunk of index-growth problem, especially for cases where referenced key
value does not change.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jan Wieck

On 6/24/2006 9:23 AM, Mark Woodward wrote:

> > On Sat, 24 Jun 2006, Mark Woodward wrote:
> > 
> > > I'm probably mistaken, but aren't there already forward references in
> > > tuples to later versions? If so, I'm only suggesting reversing the order
> > > and referencing the latest version.
> > 
> > I thought I understood your idea, but now you lost me again. I thought
> > what you want is that the older heap tuple has a pointer to the
> > newer one. Which it already has, it's called t_ctid.
> 
> Perfect!
> 
> > Can you try to explain more carefully how the whole thing would work?
> > What would an index tuple point to? What pointers would a heap tuple
> > have? What would an index scan do to find the row version it's interested
> > in? What exactly would an update do?
> 
> Since we already allocate space for some notion of linked list, then all
> I'm suggesting is we reverse the order, sort of. Currently it looks like
> this:
> 
> ver001->ver002->ver003->...->verN
> 
> That's what t_ctid does now, right? Well, that's sort of stupid. Why not
> have it do this:
> 
> ver001->verN->...->ver003->ver002->|
>  ^-/
> 
> This will speed up almost *all* queries when there are more than two
> versions of rows.
> 
> OK, here is the behavior of an update:
> (1) Find the latest version of the row
> (2) Duplicate row and modify as per plan
> (3) Set the t_ctid of the new row to the last "latest"
> (4) Set the t_ctid of the first row to that of the new row
> (5) Attempt to index the row
> (6) If the first version of the row is in the index already (ver001) don't
> modify the index, otherwise, add the new version (just as before)
> 
> When you vacuum, simply make the latest version (verN) the key row (ver001).

This isn't done "simply". Currently, vacuum collects a trivial array of
ctid's it is removing and every now and then does a bulk remove of the
index tuples pointing to them. Now let's consider a table with two
indexed columns with the following row versions resulting from an insert
and 3 updates to that same row:

  v1:  a,b
  v2:  a,c
  v3:  a,d
  v4:  b,d

In your new scheme, there would be two index tuples for column 1 (one
pointing to v1, one pointing to v4) and 3 index tuples for column 2 (one
for each different value pointing to v1, v2 and v3). Did I get that
right so far?

If vacuum now can remove v1, it has to update index 1 to point to v2 and
remove the pointer to v1 from index 2. If it can remove v1 and v2, it
has to update index 1 to point to v3 and remove v1 and v2 from index 2.
If it can remove v1, v2 and v3 it must delete the index 1 tuple pointing
to v1, delete the index 2 entries pointing to v1 and v2 and update the
index 2 entry for v3 to point to v4. Figuring out which index tuples to
remove and which ones to update can only be done by comparing each and
every indexed column's old and new values. To do so, vacuum will have to
fetch all the row versions, which can be scattered all over the place,
with all possible locking issues including but not limited to deadlocks.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Heikki Linnakangas

On Sat, 24 Jun 2006, Bruce Momjian wrote:

> OK, I have an idea.  Right now, an UPDATE where the old and new rows are
> on the same page has two index entries.  What if we made only one index
> entry for both?  We already have UPDATE chaining, where the old row
> points to the new one.  If no key columns change, we set a bit in the
> heap that the chaining points to the old and new row (both on the same
> page), so an index scan uses one index entry to see the old and new row,
> and once the old row is no longer visible, the page item id can be
> updated to point to the new row and the old row can be marked free and
> perhaps used for a future UPDATE.  (UPDATE already tries to keep
> updates on the same heap page.)

In fact, that's what I originally thought Mark was suggesting. A couple of
points:

Why the limitation of old and new row being on the same page?

This only works if none of the updated columns are indexed. That's a bit
annoying. It would be nice to be able to create new index tuples in those
indexes that contain one of the changed columns, but not in others.

What happens if you create a new index that contains one of
the changed columns?

- Heikki



Re: [HACKERS] Crash on initdb in MSVC++

2006-06-24 Thread Magnus Hagander
Never mind this whole mail. My .bki files weren't updated (I installed
the new version of them in the wrong directory. Oops.) Updating that to
the correct files for HEAD made it pass again.

//Magnus


> HEAD built with msvc++ crashes on initdb. I'd appreciate any 
> pointers as to where to start looking...
> 
> Unhandled exception at 0x0046b9c7 in postgres.exe: 
> 0xC005: Access violation reading location 0x06b3.
> 
> Backtrace:
> > postgres.exe!_bt_start_vacuum(RelationData * rel=0x067f)
> Line 1000 + 0x3 bytes C
>   postgres.exe!btbulkdelete(FunctionCallInfoData *
> fcinfo=0x00d2f778)  Line 527 + 0x9 bytes  C
>   postgres.exe!OidFunctionCall3(unsigned int 
> functionId=332, unsigned long arg1=23780584, unsigned long 
> arg2=64453888, unsigned long
> arg3=62953160)  Line 1460 + 0xf bytes C
>   postgres.exe!index_build(RelationData * 
> heapRelation=0x016adce8, RelationData * 
> indexRelation=0x03d77d00, IndexInfo * indexInfo=0x03c096c8, 
> char isprimary=0, char istoast=0)  Line 1301 +
> 0x15 bytesC
>   postgres.exe!build_indices()  Line 1249 + 0x1a bytesC
>   postgres.exe!boot_yyparse()  Line 285   C
>   postgres.exe!BootstrapMain(int argc=4, char * * argv=0x0121241c)
> Line 487  C
>   postgres.exe!main(int argc=5, char * * argv=0x01212418)  Line
> 181 + 0xd bytes   C
> 
> 
> 
> The crash is on the line:
>   vac->relid = rel->rd_lockInfo.lockRelId;
> 
> in _bt_start_vacuum().
> 
> From what I can tell, the entire rel pointer is invalid at 
> this point, because the debugger says "Expression cannot be 
> evaluated" for all fields in it, including simple ints.
> 
> 
> //Magnus
> 



[HACKERS] vacuum row?

2006-06-24 Thread Mark Woodward
I originally suggested a methodology for preserving MVCC and everyone is
confusing it with update "in place"; this is not what I intended.

How about a form of vacuum that targets a particular row? Is this
possible? Would it have to be by transaction?



[HACKERS] Crash on initdb in MSVC++

2006-06-24 Thread Magnus Hagander
HEAD built with msvc++ crashes on initdb. I'd appreciate any pointers as
to where to start looking...

Unhandled exception at 0x0046b9c7 in postgres.exe: 0xC005: Access
violation reading location 0x06b3.

Backtrace:
>   postgres.exe!_bt_start_vacuum(RelationData * rel=0x067f)
Line 1000 + 0x3 bytes   C
postgres.exe!btbulkdelete(FunctionCallInfoData *
fcinfo=0x00d2f778)  Line 527 + 0x9 bytesC
postgres.exe!OidFunctionCall3(unsigned int functionId=332,
unsigned long arg1=23780584, unsigned long arg2=64453888, unsigned long
arg3=62953160)  Line 1460 + 0xf bytes   C
postgres.exe!index_build(RelationData * heapRelation=0x016adce8,
RelationData * indexRelation=0x03d77d00, IndexInfo *
indexInfo=0x03c096c8, char isprimary=0, char istoast=0)  Line 1301 +
0x15 bytes  C
postgres.exe!build_indices()  Line 1249 + 0x1a bytesC
postgres.exe!boot_yyparse()  Line 285   C
postgres.exe!BootstrapMain(int argc=4, char * * argv=0x0121241c)
Line 487C
postgres.exe!main(int argc=5, char * * argv=0x01212418)  Line
181 + 0xd bytes C



The crash is on the line:
vac->relid = rel->rd_lockInfo.lockRelId;

in _bt_start_vacuum().

From what I can tell, the entire rel pointer is invalid at this point,
because the debugger says "Expression cannot be evaluated" for all
fields in it, including simple ints.


//Magnus



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Mark Woodward
> On 6/24/06, Mark Woodward <[EMAIL PROTECTED]> wrote:
>> > On 6/24/06, Mark Woodward <[EMAIL PROTECTED]> wrote:
>> >> In the scenario, as previously outlined:
>> >>
>> >> ver001->verN->...->ver003->ver2->|
>> >>   ^-/
>> >
>> > So you want to always keep an old version around?
>>
>> Prior to vacuum, it will be there anyway, and after vacuum, the new
>> version will become ver001.
>
> So you do intend to move verN into ver001's slot?  What about the
> other conditions you had mentioned where you have to follow
> PostgreSQL's current behavior?  How are you going to have a pointer
> chain in that case?

Who said anything about moving anything? When vacuum comes along, it
cleans out previous versions of rows. Very little will change.



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Bruce Momjian
bruce wrote:
> Tom Lane wrote:
> > Bruce Momjian <[EMAIL PROTECTED]> writes:
> > > I think at some point we have to admit that _polling_ the tables, which
> > > is what autovacuum does, just isn't going to work well, no matter how
> > > much it is tweeked, and another approach should be considered for
> > > certain workload cases.
> > 
> > Autovacuum polls in its current, first-generation implementation;
> > what I said upthread was it needs to be smarter than that.  I am not
> > sure how you get from that to the conclusion that the very next step
> > is to abandon the vacuuming approach altogether.
> 
> I am not ready to abandon autovacuum, but as I stated later, the UPDATE
> with no key change case is common enough that it could be handled
> better without involving autovacuum and its limitations.
> 
> As I remember, most databases have problem with DELETE/INSERT cycles,
> but we seem to be hit by UPDATE performance more than most, and more
> than is wise.

In an attempt to get something concrete on these ideas...  ;-)

I think the major complexity in doing an in-place UPDATE when no key
columns change is allowing rollback on transaction abort (or backend
crash), and properly handling visibility for transactions in progress.

If the old and new rows are on the same heap page (perhaps a necessary
limitation), you can easily update the heap item id to point to the new
heap row.  All indexes will then point to the new row, and sequential
scans will still see both rows (which is good).  That leaves the rollback
issue (undoing the item id change), and having index scans for current
backends still see the old row.

OK, I have an idea.  Right now, an UPDATE where the old and new rows are
on the same page has two index entries.  What if we made only one index
entry for both?  We already have UPDATE chaining, where the old row
points to the new one.  If no key columns change, we set a bit in the
heap that the chaining points to the old and new row (both on the same
page), so an index scan uses one index entry to see the old and new row,
and once the old row is no longer visible, the page item id can be
updated to point to the new row and the old row can be marked free and
perhaps used for a future UPDATE.  (UPDATE already tries to keep
updates on the same heap page.)

FYI, the reason heap cleanup is possible once you go with a single index
entry for old and new rows is because there is no index cleanup (and
single-row index cleanup is very expensive).

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Gist does not build with VC++ anymore

2006-06-24 Thread Magnus Hagander
> > I've updated my VC++ build env with latest CVS, and it no longer 
> > builds because of changes to GIST:
> > 
> > src\backend\access\gist\gistutil.c(237) : error C2057: 
> > expected constant expression
> > src\backend\access\gist\gistutil.c(237) : error C2466: 
> cannot allocate 
> > an array of constant size 0
> > src\backend\access\gist\gistutil.c(237) : error C2133: 'storage' :
> > unknown size
> > 
> > 
> > The problem appears to come from:
> > #define GEVHDRSZ  (offsetof(GistEntryVector, vector[0]))
> > 
> > Which can't be used in this context. 
> > 
> > What would be the proper fix for that?
> 
> Hmm. Now that I look at it more clearly, it seems Hiroshi has 
> a fix for this in his submitted patch (that still had a lot 
> of other problems in the rest of it). I'm not sure if it's 
> the proper fix, but it's there.

While I'm spamming everybody anyway, here's another thing that might fix
it? This one compiles and tests, and I *think* it does the right
thing... If it's correct, I think it looks like a cleaner solution.

//Magnus


RCS file: /projects/cvsroot/pgsql/src/include/access/gist.h,v
retrieving revision 1.52
diff -c -r1.52 gist.h
*** src/include/access/gist.h   5 Mar 2006 15:58:53 -   1.52
--- src/include/access/gist.h   24 Jun 2006 17:20:28 -
***
*** 142,148 
GISTENTRY   vector[1];
  } GistEntryVector;

! #define GEVHDRSZ  (offsetof(GistEntryVector, vector[0]))

  /*
   * macro to initialize a GISTENTRY
--- 142,148 
GISTENTRY   vector[1];
  } GistEntryVector;

! #define GEVHDRSZ  (offsetof(GistEntryVector, vector))

  /*
   * macro to initialize a GISTENTRY



Re: [HACKERS] Gist does not build with VC++ anymore

2006-06-24 Thread Magnus Hagander
> I've updated my VC++ build env with latest CVS, and it no 
> longer builds because of changes to GIST:
> 
> src\backend\access\gist\gistutil.c(237) : error C2057: 
> expected constant expression
> src\backend\access\gist\gistutil.c(237) : error C2466: cannot 
> allocate an array of constant size 0
> src\backend\access\gist\gistutil.c(237) : error C2133: 'storage' :
> unknown size
> 
> 
> The problem appears to come from:
> #define GEVHDRSZ  (offsetof(GistEntryVector, vector[0]))
> 
> Which can't be used in this context. 
> 
> What would be the proper fix for that?

Hmm. Now that I look at it more clearly, it seems Hiroshi has a fix for
this in his submitted patch (that still had a lot of other problems in
the rest of it). I'm not sure if it's the proper fix, but it's there.
Comments on it?


--- src/backend/access/gist/gistutil.c.orig Thu Jun 22 20:46:55 2006
+++ src/backend/access/gist/gistutil.c  Thu Jun 22 20:47:22 2006
@@ -228,6 +228,12 @@
 /* 
  * makes union of two key
  */
+
+#ifdef _MSC_VER
+#undef GEVHDRSZ
+#define GEVHDRSZ  (offsetof(GistEntryVector, vector))
+#endif
+
 static void
 gistMakeUnionKey( GISTSTATE *giststate, int attno,
GISTENTRY   *entry1, bool
isnull1,



//Magnus



Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread PFC



> What I see in this discussion is a huge amount of "the grass must be
> greener on the other side" syndrome, and hardly any recognition that
> every technique has its downsides and complications.

Sure ;)

	MVCC generates dead rows, by its very nature; however, I see two trends
in this:


1* A large transaction that updates/deletes many rows.
	For instance suppose you UPDATE an entire table whose size is larger than  
memory.


	Old row versions have to be kept somewhere until commit, be it in the  
table itself or in some accessory undo-log.
	So, there will be a lot of harddisk grinding anyway, be it MVCC or  
Oracle-style, or whatever. MVCC will bloat the table and indexes, then  
VACUUM will shrink them. Update-in-place systems will bloat an undo log.


	It seems to me the current MVCC+VACUUM is the right tool for this job,
requiring about the same amount of I/O as the others.
	Vacuum scans sequentially, so it's the best way to process large volumes  
of data.


2* Many short transactions update rows in a table
Like the sessions problem, for instance.

Current VACUUM sucks for this case, I guess that's known.

---

	So, we have two different problems, and one tool which is adapted to one  
problem only. Should the tool (Vacuum) be fixed to handle both problems,  
making it more complex and difficult to maintain, or should another tool  
be created specifically for the second problem ?
	Problem 2 is very different from problem 1. The only case when they meet  
is when there is a continuous stream of small updates running concurrently  
with a long transaction.

So, what is the ideal tool for case 2 ?

	We'd want a vacuuming machine that can be run very often, or even better,  
continuously.
	The table can be large, larger than the disk cache, so scanning it is not  
an option.
	The updates are probably randomly distributed into the table. Therefore,  
VACUUMing a long time after these transactions are committed and the pages
are no longer in cache would require a lot of random seeks, which is also  
bad.

Besides, we want this vacuum to be continuous and transparent.

	The best time to vacuum pages is, thus, when they are still in the  
background writer's memory, or the disk cache, waiting to be flushed to  
disk. There, they can be re-read, vacuumed and re-written with no seek  
penalty, only additional WAL traffic. However the amount of WAL traffic in  
bytes/s is less important than the frequency of WAL syncs. Emitting more
WAL data shouldn't be a problem if those sync writes are coalesced with
the sync writes of current transactions.


	So, I guess the best solution for case 2 is to have the background writer  
perform on-the-fly VACUUM :


	An UPDATE or DELETE transaction hands its dirty pages over to the
bgwriter to be written. It also tells the bgwriter the ID of the current
transaction, and flags specifying whether the pages contain candidate dead
rows.
	The bgwriter should have a sufficiently large buffer, in order to be
able to keep these pages in memory until all the transactions that can see
the dead rows in these pages have finished.

Then, the pages are vacuumed and written.

	The key is the size of the buffer. It should be large enough to contain  
enough pages so that it is actually possible to vacuum something out of  
them before writing them. However if the buffer capacity is exceeded (for  
instance, because there is a long-running transaction), this is not a
problem: the pages are simply written to disk normally; they will contain
dead rows, which will need to be handled later by the standard VACUUM.


	I think this would maximize code reuse by using the current bgwriter's  
architecture... did I miss something ?
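
	As a very rough sketch, only to make the hand-over concrete (BgwPage,
OldestVisibleXid(), vacuum_page_in_place() and write_page() are invented
names, and all locking and WAL details are ignored):

#include <stdbool.h>

typedef char *Page;                     /* stand-ins for the real types */
typedef unsigned int TransactionId;

extern bool TransactionIdPrecedes(TransactionId a, TransactionId b);
extern TransactionId OldestVisibleXid(void);
extern void vacuum_page_in_place(Page page);
extern void write_page(Page page);

typedef struct BgwPage
{
    Page          page;                 /* the dirty page handed over */
    TransactionId dirtier_xid;          /* xid of the updating transaction */
    bool          has_dead_candidates;  /* flagged by the UPDATE/DELETE */
} BgwPage;

static void
bgwriter_write_page(BgwPage *p)
{
    /*
     * If every transaction that could still see the dead rows has
     * finished, reclaim them while the page is still in memory: no
     * seek penalty, only some additional WAL traffic.
     */
    if (p->has_dead_candidates &&
        TransactionIdPrecedes(p->dirtier_xid, OldestVisibleXid()))
        vacuum_page_in_place(p->page);

    /*
     * The page is then written out either way; if it could not be
     * vacuumed (for instance, a long-running transaction overflowed
     * the buffer), its dead rows are handled later by standard VACUUM.
     */
    write_page(p->page);
}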

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


[HACKERS] Gist does not build with VC++ anymore

2006-06-24 Thread Magnus Hagander
I've updated my VC++ build env with latest CVS, and it no longer builds
because of changes to GIST:

src\backend\access\gist\gistutil.c(237) : error C2057: expected constant
expression
src\backend\access\gist\gistutil.c(237) : error C2466: cannot allocate
an array of constant size 0
src\backend\access\gist\gistutil.c(237) : error C2133: 'storage' :
unknown size


The problem appears to come from:
#define GEVHDRSZ   (offsetof(GistEntryVector, vector[0]))

Which can't be used in this context. 

What would be the proper fix for that?

//Magnus

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Exporting type OID macros in a cleaner fashion

2006-06-24 Thread Greg Sabino Mullane

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


> The alternative I'm currently thinking about is to build and install an
> auto-generated file comparable to fmgroids.h, containing *only* the type
> OID macro #defines extracted from pg_type.h.  This would require just a
> trivial amount of sed hacking.

FWIW, that's exactly what we currently do in DBD::Pg to generate an up to
date list of types:

## $file = "pg_type.h"
open(F, $file) or die qq{Could not open file "$file": $!\n};
my %oid;
my $maxlen = 1;
while(<F>) {
next unless /^#define\s+([A-Z0-9_]*OID)\s+(\d+)/o;
$oid{$1} = $2;
length($1) > $maxlen and $maxlen = length($1);
}
close(F);

We actually go on from there to build a bunch of custom C structs. The maxlen
is simply there to make our final "types.c" line up pretty.

Installing it at the top-level sounds good, although I don't think
we'd see a need to switch to it since we already parse pg_type.h ourselves. ;)
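
As a sketch of the kind of client code such an installed header serves
(the header name here is hypothetical; INT4OID and TEXTOID are the real
macro names from pg_type.h):

#include <stdio.h>
#include <libpq-fe.h>
#include "pg_type_oids.h"   /* hypothetical extracted header */

/* Branch on a result column's type using the exported OID macros. */
static void
describe_first_column(const PGresult *res)
{
    switch (PQftype(res, 0))
    {
        case INT4OID:
            printf("integer column\n");
            break;
        case TEXTOID:
            printf("text column\n");
            break;
        default:
            printf("some other type: %u\n", PQftype(res, 0));
    }
}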

- --
Greg Sabino Mullane [EMAIL PROTECTED]
PGP Key: 0x14964AC8 200606241151
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iD8DBQFEnWAfvJuQZxSWSsgRAjgqAKClEDrahFG5NCSrK47Ae7P13QCD+ACg1KHe
XS2uYrmI5wf6i8Mpttpi1U8=
=wlUx
-END PGP SIGNATURE-



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jonah H. Harris

On 6/24/06, Mark Woodward <[EMAIL PROTECTED]> wrote:

> On 6/24/06, Mark Woodward <[EMAIL PROTECTED]> wrote:
>> In the scenario, as previously outlined:
>>
>> ver001->verN->...->ver003->ver2->|
>>   ^-/
>
> So you want to always keep an old version around?

Prior to vacuum, it will be there anyway, and after vacuum, the new
version will become ver001.


So you do intend to move verN into ver001's slot?  What about the
other conditions you had mentioned where you have to follow
PostgreSQL's current behavior?  How are you going to have a pointer
chain in that case?


--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 2nd Floor            | [EMAIL PROTECTED]
Iselin, New Jersey 08830            | http://www.enterprisedb.com/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] cygwin breakage (was: GPL Source and Copyright Questions)

2006-06-24 Thread Magnus Hagander
> > Attached simple patch reverts this, as it clearly broke cygwin.
> 
> Applied ... hopefully it didn't also break mingw ;-)

Oh, I tested that. It also didn't break msvc.

//Magnus

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] cygwin breakage (was: GPL Source and Copyright Questions)

2006-06-24 Thread Tom Lane
"Magnus Hagander" <[EMAIL PROTECTED]> writes:
> Attached simple patch reverts this, as it clearly broke cygwin.

Applied ... hopefully it didn't also break mingw ;-)

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Mark Woodward
> On 6/24/06, Mark Woodward <[EMAIL PROTECTED]> wrote:
>> In the scenario, as previously outlined:
>>
>> ver001->verN->...->ver003->ver2->|
>>   ^-/
>
> So you want to always keep an old version around?

Prior to vacuum, it will be there anyway, and after vacuum, the new
version will become ver001.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jonah H. Harris

On 6/24/06, Mark Woodward <[EMAIL PROTECTED]> wrote:

In the scenario, as previously outlined:

ver001->verN->...->ver003->ver2->|
  ^-/


So you want to always keep an old version around?

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 2nd Floor            | [EMAIL PROTECTED]
Iselin, New Jersey 08830            | http://www.enterprisedb.com/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Martijn van Oosterhout
On Sat, Jun 24, 2006 at 09:23:28AM -0400, Mark Woodward wrote:
> > Can you try to explain more carefully how the whole thing would work?
> > What would an index tuple point to? What pointers would a heap tuple
> > have? What would an index scan do to find the row version it's interested
> > in? What exactly would an update do?
> 
> 
> Since we already allocate space for some notion of linked list, then all
> I'm suggesting is we reverse the order, sort of. Currently it looks like
> this:
> 
> ver001->ver002->ver003->...-verN
> 
> That's what t_ctid does now, right? Well, that's sort of stupid. Why not
> have it do this:
> 
> ver001->verN->...->ver003->ver002->|
>  ^-/

You don't say where the index points or the order, but I'm assuming
from your diagram that ver1 is the oldest, verN is the newest.
Currently there is an index entry for each version, but in your version
there is only an index entry for ver1, right?

> This will speed up almost *all* queries when there are more than two
> versions of rows.
> 
> OK, here is the behavior of an update:
> (1) Find the latest version of the row

You mean, find the version of the row which satisfies your snapshot. If
the version pointed to by the index is it, you're done. Otherwise you
follow the chain. The most common case being one step, because ver001
is likely to be invisible.

> (2) Duplicate row and modify as per plan
> (3) Set the t_ctid of the new row to the last "latest"
> (4) Set the t_ctid of the first row to that of the new row
> (5) Attempt to index the row
> (6) If the first version of the row is in the index already (ver001) Don't
> modify the index, otherwise, add the new version (just as before)

This looks OK, I guess. I wouldn't know about locking...
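
To spell that lookup out, a hedged sketch with invented stand-ins for the
real routines (and glossing over how the chain terminates):

typedef struct HeapTupleData *HeapTuple;
typedef struct RelationData *Relation;
typedef struct ItemPointerData *ItemPointer;
typedef struct SnapshotData *Snapshot;

extern HeapTuple fetch_tuple(Relation rel, ItemPointer tid);
extern int tuple_visible_to(HeapTuple tup, Snapshot snap);
extern ItemPointer tuple_ctid(HeapTuple tup);

/*
 * The index points at the root version (ver001); the root's t_ctid
 * points at the newest version, and each version points at the next
 * older one.
 */
static HeapTuple
fetch_visible_version(Relation rel, ItemPointer root_tid, Snapshot snap)
{
    HeapTuple   tup = fetch_tuple(rel, root_tid);

    /*
     * Usually one hop: the root is invisible, and the version it
     * points at (the newest) is the one we want.
     */
    while (tup != NULL && !tuple_visible_to(tup, snap))
        tup = fetch_tuple(rel, tuple_ctid(tup));

    return tup;                 /* NULL if no version is visible */
}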

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jochem van Dieten

On 6/24/06, Mark Woodward wrote:


ver001->verN->...->ver003->ver002->|
 ^-/

This will speed up almost *all* queries when there are more than two
versions of rows.

OK, here is the behavior of an update:
(1) Find the latest version of the row
(2) Duplicate row and modify as per plan
(3) Set the t_ctid of the new row to the last "latest"
(4) Set the t_ctid of the first row to that of the new row
(5) Attempt to index the row
(6) If the first version of the row is in the index already (ver001) Don't
modify the index, otherwise, add the new version (just as before)


I am not sure I understand your algorithm. If we take as a starting
point the following situation of a fresh tuple, in very schematic form
it looks like:

Heap:
TID     T_CTID   XMIN   XMAX   Col1   Col2
xxx1    xxx1     ttt1   null   1      1

Index on Col1:
1    xxx1

Index on Col2:
1    xxx1



Now, after an update to this tuple changing the Col2 field, in what
state should the heap, index 1 and index 2 be? If I understand you
correctly, you want it to be:

Heap:
TID     T_CTID   XMIN   XMAX   Col1   Col2
xxx1    xxx2     ttt1   ttt2   1      1
xxx2    xxx1     ttt2   null   1      2

Index on Col1:
1    xxx2

Index on Col2:
1    xxx1
2    xxx2


Three questions:
1. Do I understand your intention correctly?
2. Could you extend this for an update that increments Col2 (because
the T_CTID logic gets a bit unclear for me there)?
3. The net benefit of this would be 1 less entry in the index on Col1?

Jochem

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Mark Woodward
> On 6/24/06, Mark Woodward <[EMAIL PROTECTED]> wrote:
>> Currently it looks like this:
>>
>> ver001->ver002->ver003->...-verN
>>
>> That's what t_ctid does now, right? Well, that's sort of stupid. Why not
>> have it do this:
>>
>> ver001->verN->...->ver003->ver002->|
>
> Heh, because that's crazy.  The first time you insert a key into the
> index it will point to v1 of a tuple... say after 5 updates you have
> v2,v3,v4,v5... your c_tid pointer chain looks like v1
> (original)->v2->v3->v4->v5 (newest).  However, your whole idea is based
> on not having to do another index insert for unchanged keys, so the
> index still points to v1... which means you have to follow the c_tid
> chain to get to the newest version just like a sequential scan.  I
> don't see how you think you can reverse pointer it.

In the scenario, as previously outlined:

ver001->verN->...->ver003->ver2->|
  ^-/

The index points to version 1 (ver001) which points to the latest version
(verN).



>
>> This will speed up almost *all* queries when there are more than two
>> versions of rows.
>
> Nope.

Of course it will.

>
>> When you vacuum, simply make the latest version (verN) the key row
>> (ver001).
>
> How are you going to do this without a ton of locking... remember, the
> index is pointing to v1 with a tid... so you'll have to physically
> move the newest version v5 to v1's tid from wherever it was... like a
> vacuum full on steroids.  Unless of course, you rebuild the index...
> but that's not a solution either.

I don't understand how you can assume this. In fact, it will probably reduce
locking and disk IO by not having to modify indexes.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] cygwin breakage (was: GPL Source and Copyright Questions)

2006-06-24 Thread Magnus Hagander
> > And why do win32 and cygwin now not include at all pg_config_os.h?
> 
> It's a way to avoid the step to copy win32\port.h in msvc. 
> configure copies it to pg_config_os.h. Since for win32 
> platforms (unfortunately, at this point it considers cygwin 
> win32...) that will always be port/win32.h, it explicitly 
> includes that one instead.

Attached simple patch reverts this, as it clearly broke cygwin. 

Still can't get it to build on cygwin though, but I doubt it's the fault
of the win32 patch... With ./configure, I get:
checking for random... yes
checking for rint... yes
checking for srandom... yes

But if I look in the generated pg_config.h I have:
/* Define to 1 if you have the `random' function. */
/* #undef HAVE_RANDOM */

and similar for SRANDOM. This gives a "conflicting types for random"
between port.h line 314 and stdlib.h line 24.

Hopefully that's something broken in my cygwin environment only (a fresh
one installed, but I really don't know cygwin enough to comment on if I
broke something :-P), in which case someone with an already working
cygwin environment should be able to build again after this one.

//Magnus


cygwin.patch
Description: cygwin.patch

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jonah H. Harris

On 6/24/06, Jonah H. Harris <[EMAIL PROTECTED]> wrote:

Grr... need coffee... s/c_tid/ctid/g

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 2nd Floor            | [EMAIL PROTECTED]
Iselin, New Jersey 08830            | http://www.enterprisedb.com/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jonah H. Harris

On 6/24/06, Mark Woodward <[EMAIL PROTECTED]> wrote:

Currently it looks like this:

ver001->ver002->ver003->...-verN

That's what t_ctid does now, right? Well, that's sort of stupid. Why not
have it do this:

ver001->verN->...->ver003->ver002->|


Heh, because that's crazy.  The first time you insert a key into the
index it will point to v1 of a tuple... say after 5 updates you have
v2,v3,v4,v5... your c_tid pointer chain looks like v1
(original)->v2->v3->v4->v5 (newest).  However, your whole idea is based
on not having to do another index insert for unchanged keys, so the
index still points to v1... which means you have to follow the c_tid
chain to get to the newest version just like a sequential scan.  I
don't see how you think you can reverse pointer it.


This will speed up almost *all* queries when there are more than two
versions of rows.


Nope.


When you vacuum, simply make the latest version (verN) the key row (ver001).


How are you going to do this without a ton of locking... remember, the
index is pointing to v1 with a tid... so you'll have to physically
move the newest version v5 to v1's tid from wherever it was... like a
vacuum full on steroids.  Unless of course, you rebuild the index...
but that's not a solution either.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 2nd Floor            | [EMAIL PROTECTED]
Iselin, New Jersey 08830            | http://www.enterprisedb.com/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Mark Woodward
> On Sat, 24 Jun 2006, Mark Woodward wrote:
>
>> I'm probably mistaken, but aren't there already forward references in
>> tuples to later versions? If so, I'm only suggesting reversing the order
>> and referencing the latest version.
>
> I thought I understood your idea, but now you lost me again. I thought
> what you want is that the older heap tuple has a pointer to the
> newer one. Which it already has, it's called t_ctid.

Perfect!
>
> Can you try to explain more carefully how the whole thing would work?
> What would an index tuple point to? What pointers would a heap tuple
> have? What would an index scan do to find the row version it's interested
> in? What exactly would an update do?


Since we already allocate space for some notion of linked list, then all
I'm suggesting is we reverse the order, sort of. Currently it looks like
this:

ver001->ver002->ver003->...-verN

That's what t_ctid does now, right? Well, that's sort of stupid. Why not
have it do this:

ver001->verN->...->ver003->ver002->|
 ^-/

This will speed up almost *all* queries when there are more than two
versions of rows.

OK, here is the behavior of an update:
(1) Find the latest version of the row
(2) Duplicate row and modify as per plan
(3) Set the t_ctid of the new row to the last "latest"
(4) Set the t_ctid of the first row to that of the new row
(5) Attempt to index the row
(6) If the first version of the row is in the index already (ver001) Don't
modify the index, otherwise, add the new version (just as before)

When you vacuum, simply make the latest version (verN) the key row (ver001).

There are, no doubt, issues that need to be resolved (I can think of a
couple off the top of my head), but overall I think it is workable. I
don't think this will affect performance in the simple case, and it should
improve performance in the cases where there are more than one or two
versions of a row.
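
To make steps (1) through (6) concrete, a rough sketch with invented
helper names (locking, WAL and error handling are all ignored):

typedef struct HeapTupleData *HeapTuple;
typedef struct RelationData *Relation;
typedef struct ItemPointerData *ItemPointer;

extern HeapTuple fetch_tuple(Relation rel, ItemPointer tid);
extern ItemPointer tuple_ctid(HeapTuple tup);
extern ItemPointer tuple_self(HeapTuple tup);
extern void set_tuple_ctid(HeapTuple tup, ItemPointer tid);
extern ItemPointer heap_insert_version(Relation rel, HeapTuple tup);
extern int index_has_root_entry(Relation rel, ItemPointer root,
                                HeapTuple tup);
extern void index_add_entry(Relation rel, HeapTuple tup, ItemPointer root);

static void
update_reversed_chain(Relation rel, ItemPointer root_tid, HeapTuple newtup)
{
    /* (1) The root (ver001) points at the latest version (verN). */
    HeapTuple   root = fetch_tuple(rel, root_tid);
    HeapTuple   latest = fetch_tuple(rel, tuple_ctid(root));

    /* (2)+(3) Insert the new version, pointing it at the old "latest". */
    set_tuple_ctid(newtup, tuple_self(latest));
    ItemPointer new_tid = heap_insert_version(rel, newtup);

    /* (4) The root now points directly at the newest version. */
    set_tuple_ctid(root, new_tid);

    /* (5)+(6) Add an index entry only if this key value is not already
     * indexed under the root tuple. */
    if (!index_has_root_entry(rel, root_tid, newtup))
        index_add_entry(rel, newtup, root_tid);
}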

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Heikki Linnakangas

On Sat, 24 Jun 2006, Mark Woodward wrote:


I'm probably mistaken, but aren't there already forward references in
tuples to later versions? If so, I'm only suggesting reversing the order
and referencing the latest version.


I thought I understood your idea, but now you lost me again. I thought 
what you want is that the older heap tuple has a pointer to the 
newer one. Which it already has, it's called t_ctid.


Can you try to explain more carefully how the whole thing would work? 
What would an index tuple point to? What pointers would a heap tuple 
have? What would an index scan do to find the row version it's interested 
in? What exactly would an update do?


- Heikki

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Martijn van Oosterhout
On Fri, Jun 23, 2006 at 03:10:39PM -0400, Mark Woodward wrote:
> This is NOT an "in-place" update. The whole MVCC strategy of keeping old
> versions around doesn't change. The only thing that does change is one
> level of indirection. Rather than keep references to all versions of all
> rows in indexes, keep only a reference to the first or "key" row of each
> row, and have the first version of a row form the head of a linked list to
> subsequent versions of each row. The list will be in descending order.

I thought of another issue with this. If you move away from storing
each row in the indexes, you can pretty much forget bitmap index scans.
They pretty much rely on every row being represented, not just a
subset. Until you go to the heap you don't know if a tuple will match,
which is precisely what the bitmap scan is trying to avoid. You could
follow the links, but that destroys the nice sequential access
properties.

> In the vast majority of cases, the overhead of this action will be
> trivial. In an unmodified row, you're there. In a modified row, you have
> one extra lookup. In extreme cases, you may have to go back a few
> versions, but I don't see that as a common behavior.

I wonder if looking at old versions is really all that uncommon. A
large reporting query which runs for hours will probably be looking at
a lot of old versions. These are the queries that will be hit the
hardest.

If you're trying to avoid index bloat, I wonder if it wouldn't be
better to tackle this from the other end. In indexes, allow a row to
carry multiple CTIDs. That way a new version only requires adding six
more bytes, rather than a whole new tuple.
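
Schematically, with an invented layout (this is not how index tuples are
actually stored on disk):

#include <stdint.h>

typedef struct                  /* the real ItemPointerData is 6 bytes */
{
    uint16_t blkhi, blklo;      /* block number, split in two halves */
    uint16_t offset;            /* line pointer offset within the page */
} ItemPointerData;

typedef struct
{
    /* ...the usual index tuple header and key value... */
    uint16_t        nctids;     /* how many row versions are listed */
    ItemPointerData ctids[1];   /* actually variable length: each new
                                 * version costs 6 more bytes here
                                 * instead of a whole new index tuple */
} MultiCtidIndexTuple;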

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Martijn van Oosterhout
On Sat, Jun 24, 2006 at 08:14:10AM -0400, Mark Woodward wrote:
> > On 6/23/2006 3:10 PM, Mark Woodward wrote:
> >
> >> This is NOT an "in-place" update. The whole MVCC strategy of keeping old
> >> versions around doesn't change. The only thing that does change is one
> >> level of indirection. Rather than keep references to all versions of all
> >> rows in indexes, keep only a reference to the first or "key" row of each
> >> row, and have the first version of a row form the head of a linked list
> >> to
> >> subsequent versions of each row. The list will be in descending order.
> >
> > Where exactly do you intend to keep all those links (for a table with N
> > indexes)?
> >
> 
> I'm probably mistaken, but aren't there already forward references in
> tuples to later versions? If so, I'm only suggesting reversing the order
> and referencing the latest version.

You can't do that. The links exist so that in READ COMMITTED mode you
can always find the newest version. You would need to add additional
links to go backwards.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Mark Woodward
> On 6/23/2006 3:10 PM, Mark Woodward wrote:
>
>> This is NOT an "in-place" update. The whole MVCC strategy of keeping old
>> versions around doesn't change. The only thing that does change is one
>> level of indirection. Rather than keep references to all versions of all
>> rows in indexes, keep only a reference to the first or "key" row of each
>> row, and have the first version of a row form the head of a linked list
>> to
>> subsequent versions of each row. The list will be in descending order.
>
> Where exactly do you intend to keep all those links (for a table with N
> indexes)?
>

I'm probably mistaken, but aren't there already forward references in
tuples to later versions? If so, I'm only suggesting reversing the order
and referencing the latest version.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] libpq Describe Extension [WAS: Bytea and perl]

2006-06-24 Thread Volkan YAZICI
On Jun 16 08:21, Tom Lane wrote:
> Bruce Momjian  writes:
> > Volkan YAZICI wrote:
> >> The problem is, AFAICS, it's not possible to distinguish between a tuple
> >> returning query (T, ..., C, Z or T, E) and a description of a portal (T,
> >> Z). Therefore, I've created a global flag (parsing_row_desc) which is
> >> turned on when we receive a 'T' and turned off if we receive a 'C' or
> >> 'E'. It's a kind of ugly method but the only solution I could come up
> >> with.
> 
> > The problem with this solution is that it is not thread-safe.  Perhaps
> > you can use a per-PGconn boolean?

I've replaced the static flag with a conn->queryclass value using
PGQueryClass as Tom suggested. Also updated the patch to be compatible
with the exports.txt stuff.
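
Schematically, with illustrative names rather than the actual libpq
declarations, the thread-safety change amounts to:

#include <stdbool.h>

/* Before: one file-scope flag for the whole process; not thread-safe. */
static bool parsing_row_desc = false;

/* After: track the state per connection instead. */
typedef enum
{
    PGQUERY_SIMPLE,             /* simple query protocol */
    PGQUERY_EXTENDED,           /* extended query protocol */
    PGQUERY_PREPARE,            /* Parse-only */
    PGQUERY_DESCRIBE            /* Describe, the case at hand */
} PGQueryClass;

struct pg_conn
{
    /* ...existing connection state... */
    PGQueryClass queryclass;    /* what kind of query is in flight */
};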

> The whole thing sounds like brute force to me.  Shouldn't you be adding
> states to enum PGQueryClass, if you need to track what sort of Describe
> you're doing?

I totally agree that the style being followed is ugly. But IMHO the
recursive parsing of received data (as done in pqParseInputN()) is the main
problem behind this. I think it will only get harder every time somebody
tries to add another type of message parsing capability to that loop. For
instance, isn't the pollution of PGQueryClass with state variables (like
PGQUERY_PREPARE or PGQUERY_DESCRIBE) one of the proofs of this?

While playing with the pqParseInputN loops, I feel like I'm coding Lisp
recursions in C syntax; it's quite ridiculous.


Regards.
Index: src/backend/tcop/postgres.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/tcop/postgres.c,v
retrieving revision 1.489
diff -c -r1.489 postgres.c
*** src/backend/tcop/postgres.c 20 Jun 2006 22:52:00 -  1.489
--- src/backend/tcop/postgres.c 24 Jun 2006 11:31:10 -
***
*** 1853,1858 
--- 1853,1859 
  static void
  exec_describe_statement_message(const char *stmt_name)
  {
+   MemoryContext   oldContext;
PreparedStatement *pstmt;
TupleDesc   tupdesc;
ListCell   *l;
***
*** 1865,1871 
start_xact_command();
  
/* Switch back to message context */
!   MemoryContextSwitchTo(MessageContext);
  
/* Find prepared statement */
if (stmt_name[0] != '\0')
--- 1866,1872 
start_xact_command();
  
/* Switch back to message context */
!   oldContext = MemoryContextSwitchTo(MessageContext);
  
/* Find prepared statement */
if (stmt_name[0] != '\0')
***
*** 1923,1929 
--- 1924,1933 
  NULL);
else
pq_putemptymessage('n');/* NoData */
+   
+   MemoryContextSwitchTo(oldContext);
  
+   finish_xact_command();
  }
  
  /*
***
*** 1934,1939 
--- 1938,1944 
  static void
  exec_describe_portal_message(const char *portal_name)
  {
+   MemoryContext   oldContext;
Portal  portal;
  
/*
***
*** 1943,1949 
start_xact_command();
  
/* Switch back to message context */
!   MemoryContextSwitchTo(MessageContext);
  
portal = GetPortalByName(portal_name);
if (!PortalIsValid(portal))
--- 1948,1954 
start_xact_command();
  
/* Switch back to message context */
!   oldContext = MemoryContextSwitchTo(MessageContext);
  
portal = GetPortalByName(portal_name);
if (!PortalIsValid(portal))
***
*** 1975,1980 
--- 1980,1989 
  
portal->formats);
else
pq_putemptymessage('n');/* NoData */
+   
+   MemoryContextSwitchTo(oldContext);
+ 
+   finish_xact_command();
  }
  
  
Index: src/interfaces/libpq/exports.txt
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/exports.txt,v
retrieving revision 1.11
diff -c -r1.11 exports.txt
*** src/interfaces/libpq/exports.txt28 May 2006 22:42:05 -  1.11
--- src/interfaces/libpq/exports.txt24 Jun 2006 11:31:10 -
***
*** 130,132 
--- 130,134 
  PQencryptPassword 128
  PQisthreadsafe129
  enlargePQExpBuffer130
+ PQdescribePrepared131
+ PQdescribePortal  132
Index: src/interfaces/libpq/fe-exec.c
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/fe-exec.c,v
retrieving revision 1.186
diff -c -r1.186 fe-exec.c
*** src/interfaces/libpq/fe-exec.c  28 May 2006 21:13:54 -  1.186
--- src/interfaces/libpq/fe-exec.c  24 Jun 2006 11:31:12 -
***
*** 61,66 
--- 61,68 
  static void parseInput(PGconn *conn);
  static bool PQexecStart(PGconn *conn);
  static PGresult *PQexecFinish(PGconn *conn);
+ static int pqDescribe(PGconn

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Heikki Linnakangas

On Fri, 23 Jun 2006, Jonah H. Harris wrote:


On 6/23/06, Mark Woodward <[EMAIL PROTECTED]> wrote:

Rather than keep references to all versions of all
rows in indexes, keep only a reference to the first or "key" row of each
row, and have the first version of a row form the head of a linked list to
subsequent versions of each row. The list will be in descending order.


By all means, please go ahead and try it because it's not quite that
easy.  You're going to run into serious locking and contention issues
this way.  In the end, it's not much better than running a sequential
scan to query a row that's been updated several thousand times on a
table that hasn't been vacuumed... follow that pointer :)


Can you elaborate what kind of locking and contention issues you're 
thinking of?


You could update the index tuple to point to a newer version of the row, 
when an index scan determines that the heap tuple it points to is not 
visible to anyone. We already check that to update the XMAX_COMMITTED hint 
bit. Updating the index tuple comes with a cost, of course, but 
alleviates the "follow that pointer" issue.
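
A hedged sketch of that retargeting step, with invented stand-ins (the
visibility test abbreviates the hint-bit bookkeeping mentioned above):

typedef struct IndexTupleData *IndexTuple;
typedef struct HeapTupleData *HeapTuple;
typedef struct RelationData *Relation;
typedef struct ItemPointerData *ItemPointer;

extern int dead_to_all_transactions(HeapTuple tup);
extern ItemPointer tuple_ctid(HeapTuple tup);
extern void set_index_tid(IndexTuple itup, ItemPointer tid);
extern void mark_index_buffer_dirty(Relation idx, IndexTuple itup);

static void
maybe_retarget_index_entry(Relation idx, IndexTuple itup, HeapTuple heaptup)
{
    if (dead_to_all_transactions(heaptup))
    {
        /* Point the index entry at the next version in the t_ctid
         * chain instead of leaving the dead hop for every scan. */
        set_index_tid(itup, tuple_ctid(heaptup));
        mark_index_buffer_dirty(idx, itup);
    }
}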


The biggest challenge that I see is that an index scan would somehow 
need to know when to follow the t_ctid chain and when not. If you follow 
the pointer and there's another index tuple for the row, the scan could 
see the same tuple twice. Some kind of bookkeeping would be needed to 
solve that.


Also, vacuuming would become a bit more complex, since it would need to 
update the index tuples to point to newer row versions instead of just 
removing them.


All in all, I think this solution to the "an update needs to update all 
indexes, even when none of the indexed columns changed" issue requires 
less changes than implementing Oracle style rollback segments and/or an 
undo log.


- Heikki

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread mark
On Sat, Jun 24, 2006 at 03:29:47AM -0400, Jan Wieck wrote:
> >It sounds like everybody agrees that things need to be fixed, and genuinely
> >motivated people are trying to offer what they have to the table.
> One single core team member responds vaguely in a way you feel is 
> supportive of your case, and you conclude that "everybody agrees"? 
> Sorry, x'use me?

> There are a couple of side effects on this "update in place" issue that 
> aren't even mentioned yet. Nobody with any significant in-depth 
> knowledge of the Postgres non-overwriting storage engine concept seems 
> to suggest any move towards a storage system that does updates in place 
> and requires "undo" operations in case of a process/system failure. 
> You're ready to "fix" all those places to support the undo you need? You 
> must have a big table.

Jan: Who on the list has claimed that nothing is broken and nothing needs
to be improved? Are you making this claim?

Shutting down ideas is not constructive. Encouraging those with ideas to
step up and do more than talk could be.

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jan Wieck

On 6/23/2006 9:56 PM, [EMAIL PROTECTED] wrote:


On Fri, Jun 23, 2006 at 03:08:34PM -0400, Bruce Momjian wrote:

Tom Lane wrote:
> ...
> suggesting.  We're having a hard enough time debugging and optimizing
> *one* storage model.  I think the correct path forward is to stick with
> the same basic storage model and vacuuming concept, and address the
> known performance issues with better-optimized vacuuming.  No, it will
> never be perfect for every scenario, but we can certainly make it much
> better than it is now, without risking killing the project by
> introducing undebuggable, unmaintainable complexity.
Well, are you suggesting we just stop improving the database?  I am sure
not.  But, your suggestion is that we can't do better without incurring
more complexity (true), and that complexity will not be worth it.  I
don't agree with that until I see some proposals, and shutting down
discussion because they will add complexity or are fruitless seems
unwise.


It sounds like everybody agrees that things need to be fixed, and genuinely
motivated people are trying to offer what they have to the table.


One single core team member responds vaguely in a way you feel is 
supportive of your case, and you conclude that "everybody agrees"? 
Sorry, x'use me?


There are a couple of side effects on this "update in place" issue that 
aren't even mentioned yet. Nobody with any significant in-depth 
knowledge of the Postgres non-overwriting storage engine concept seems 
to suggest any move towards a storage system that does updates in place 
and requires "undo" operations in case of a process/system failure. 
You're ready to "fix" all those places to support the undo you need? You 
must have a big table.



Jan



Tom already has enough on his plate, as do most others here - so unless
a competent champion can take up the challenge, discussion is all we have.

I'm not liking the "we should do it this way," "no, we should do it that way."
My read of the situation is that both may be useful, and that both should
be pursued. But one set of people can't pursue both.

Is anyone who is able, able to take up this challenge? Perhaps more than one,
from both major directions? (vacuum on one side, and improved storage on
the other) Somebody with the time and skill, who can work through the
design discussions on one of the aspects?

I want to contribute soon, and this is the sort of thing that interests me -
but I still don't have time yet, and there would be no guarantee that I
succeeded. Somebody else? :-)

Cheers,
mark




--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly