Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 09:03 +0100, Heikki Linnakangas wrote:

  As I understand it, there are two HOT features:
  
  Single-chain pruning, which trims HOT chains but doesn't reuse
  the space
  
  Defragementation, which prunes the entire page and reuses space
  and handles deleted rows, etc.
 
 I'm repeating myself, but I would split that second operation into two
 parts while we think about this: pruning the entire page, and
 defragmenting (PageRepairFragmentation). Tom wondered why they're
 separated in the patch. As the patch stands, there is no reason, but I
 feel that separating them and doing them at different times might be an
 important piece in the puzzle.

Yes, it does seem important to keep that separation, since it is
possible to do these things at different times, even if, for now, we
choose not to.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Heikki Linnakangas
Bruce Momjian wrote:
 Tom Lane wrote:
 The we-already-pinned-the-page problem is a bit nasty but may not be
 insurmountable.

If you find a way, that would be great.

 As I understand it, there are two HOT features:
 
   Single-chain pruning, which trims HOT chains but doesn't reuse
   the space
 
   Defragementation, which prunes the entire page and reuses space
   and handles deleted rows, etc.

I'm repeating myself, but I would split that second operation into two
parts while we think about this: pruning the entire page, and
defragmenting (PageRepairFragmentation). Tom wondered why they're
separated in the patch. As the patch stands, there is no reason, but I
feel that separating them and doing them at different times might be an
important piece in the puzzle.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/10/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:



 I'm repeating myself, but I would split that second operation into two
 parts while we think about this: pruning the entire page, and
 defragmenting (PageRepairFragmentation). Tom wondered why they're
 separated in the patch. As the patch stands, there is no reason, but I
 feel that separating them and doing them at different times might be an
 important piece in the puzzle.



I think we all agree that defragmentation should be done less aggressively
and ideally when we are going to insert a tuple in the page and running
low on free space in the page. If we could really do it in heap_update(),
after
we figure out that there is not enough free space available in the page,
that will be great. We already have exclusive lock on the page and hopefully
we also have the vacuum-lock. (This is how earlier version of the patch used
to work - before we took out all the complexity associated with tracking
LP_DELETEd tuples,
tracking fragmented tuples and replaced them with simple
PageRepairFragmentation
Since in the earlier version, we never used to call PageRepairFragmentation,
we could easily do pruning in heap_update())

If we can't find a way to defragment inside heap_update(), then we have
three
choices:

1. Pass a hint to heap_fetch that a UPDATE may follow. If we are running low
on free space, we try to prune and defragment at that point.

2. Move some of the work to bgwriter.

3. Use some heuristic to decide whether to defragment or not.

I am not sure how easy it is to know apriori that we are fetching a tuple
for UPDATE. Assuming we can, this seems like a good idea.

Eventually we may want to move some work to bgwriter or some other
background process. But my take would be to save that for 8.4

As a starting point, we are doing 3 in the patch. We always try to
keep just one tuple worth of space free in the page. So if an UPDATE
takes up  remaining free space in the page, the page will be
pruned/defraged in the subsequent lookup (index or seq). In read-mostly
scenario,
only the first lookup would trigger prune/defrag, but next many lookups
would be quick. In update-mostly scenario, most of the heap_fetches
would anyways end up doing update, so doing pruning/defragmentation
in heap_fetch shouldn't be too bad. For a balanced scenario, we might
be moving cost of pruning/defragmentation to the SELECT path, but
UPDATEs would be quick.


The other issue is when to prune the HOT chain. I tend to agree that
the chances of having long HOT chains are less and the cost of following the
chain won't be too significant. At the same we probably don't want to live
with very long chains forever. An example would be: a tuple gets HOT updated
several times in a transaction. If the page is never updated again, the long
HOT
chain would never be pruned. I like Simon's idea of pruning the chain if it
goes beyond a limit. Instead of adding any significant complexity to
track length of each HOT chain, we can just mark the page with a flag
if heap_hot_fetch() detects a long chain. The next lookup (index or seq)
of the page will try to prune the page and clear the flag if successful.

I also agree that we should avoid taking exclusive lock before checking
free space in heap_page_prune_defrag(). That might be unsafe, but we
don't care if you occasionally skip the maintenance work (or do it a
little early)

Thanks,
Pavan

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Heikki Linnakangas
Bruce Momjian wrote:
 (Can someone time the access time for following a chain that fills an
 entire page (the worst case) vs. having a single tuple on the page?)

Here is some results, on my laptop. The test case is a table with a
single integer column, and a single row in the table. The row is updated
250 times to make a 251 tuples long chain. Page can hold 255 tuples, so
this is pretty much the worst case. After that, the row is fetched
100 times, in a PL/pgSQL function. I ran the tests with
enable_seqscan on and off, see seqscan and idxscan rows below.
Numbers are the times spent doing the fetches, in seconds. I repeated
each test a few times to make sure that the results are repeatable, they
seem to be repeatable to ~0.1s precision. The test script used is
attached. Autovacuum was disabled, and shared_buffers=320MB, otherwise
all settings were left to defaults.

HEADHOT HOT-opt HOT-pruned
seqscan 19.921.120.111.5
idxscan 27.831.430.413.7

Explanations of the columns:

HEAD:   CVS HEAD
HOT-pruned: CVS HEAD + HOT patch v15
HOT:CVS HEAD + HOT patch v15, but with heap_page_prune_defrag short
circuited to do nothing
HOT-opt:CVS HEAD + HOT patch v15, but with static XidInMVCCSnapshot
like in CVS HEAD

I didn't expect a difference in seqscan performance between HEAD and
HOT. I oprofiled it, and figured out that it's because HOT patch removed
the static-qualifier XidInMVCCSnapshot, because it's needed in
plancat.c. I changed it back to static, dummying out the call in
plancat.c, and the results are now closer to each other (HOT-opt column).

Comparing the idxscan columns, it looks like following the chain *is*
more expensive than having to go through killed index pointers. Pruning
clearly does help.

Given that this test is pretty much the worst case scenario, I'm ok with
not pruning for the purpose of keeping chains short.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
set enable_seqscan=on;

DROP TABLE hottest;

CREATE TABLE hottest (id integer);

INSERT INTO hottest VALUES (1);

CREATE INDEX i_hottest ON hottest(id);

-- Fill the table 

CREATE OR REPLACE FUNCTION longchain (n integer) RETURNS void LANGUAGE PLPGSQL AS 
$$
DECLARE
  i integer;
BEGIN
  FOR i IN 1..n LOOP
UPDATE hottest SET id=id WHERE id = 1;
  END LOOP;
END;
$$;

SELECT longchain(250);

CREATE OR REPLACE FUNCTION fetchchain (n integer) RETURNS void LANGUAGE PLPGSQL AS 
$$
DECLARE
  i integer;
  foo integer;
BEGIN
  FOR i IN 1..n LOOP
SELECT id INTO foo FROM hottest WHERE id = 1;
  END LOOP;
END;
$$;

\timing
SELECT fetchchain(100);
\timing

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Heikki Linnakangas
Heikki Linnakangas wrote:
 Explanations of the columns:
 
 HEAD: CVS HEAD
 HOT-pruned:   CVS HEAD + HOT patch v15
 HOT:  CVS HEAD + HOT patch v15, but with heap_page_prune_defrag short
 circuited to do nothing
 HOT-opt:  CVS HEAD + HOT patch v15, but with static XidInMVCCSnapshot
 like in CVS HEAD

One correction: HOT-opt was also short circuited like HOT. IOW, HOT and
HOT-opt are the same, except for the static qualifier in XidInMVCCSnapshot

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/6/07, Tom Lane [EMAIL PROTECTED] wrote:

 Heikki Linnakangas [EMAIL PROTECTED] writes:
  When I suggested that we get rid of the LP_DELETE flag for heap tuples,
  the tuple-level fragmentation and all that, and just take the vacuum
  lock and call PageRepairFragmentation, I was thinking that we'd do it in
  heap_update and only when we run out of space on the page. But as Greg
  said, it doesn't work because you're already holding a reference to at
  least one tuple on the page, the one you're updating, by the time you
  get to heap_update. That's why I put the pruning code to heap_fetch
  instead. Yes, though the amortized cost is the same, it does push the
  pruning work to the foreground query path.

 The amortized cost is only the same if every heap_fetch is associated
 with a heap update.  I feel pretty urgently unhappy about this choice.
 Have you tested the impact of the patch on read-mostly workloads?


For read-mostly workloads, only the first SELECT after an UPDATE
would trigger pruning/defragmentation. heap_page_prune_defrag()
would be a no-op for subsequent SELECTs  (PageIsPrunable() would
be false until the next UPDATE)

I think Heikki's recent test confirms this.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/8/07, Bruce Momjian [EMAIL PROTECTED] wrote:



 Would someone tell us exactly when pruning and defragmenting happens in
 the current version of the patch?  If we don't nail this issue down soon
 PostgreSQL 8.3 is going to sail without this patch.




I guess you already have the answers by now, but let me repeat here:

Until patch version 14, defragmentation and pruning were separate
operations, though we used to invoke both at the same time. The only
difference was that pruning can work with a simple exclusive lock
whereas defragmentation needs vacuum-strength lock. Tom argued
that its not worth the complexity, so I changed that and clubbed
both the operations together. What it means: pruning also now needs
vacuum-strength lock and is always followed by defragmentation.

So as the patch stands (version 15), we call heap_page_prune_defrag()
inside heap_release_fetch() and heapgetpage(), apart from the vacuum
loop. It checks for PageIsPrunable() before proceeding. PageIsPrunable
is set when a tuple is updated/deleted. So for read-mostly workloads
heap_page_prune_defrag will mostly be no-op.

If PageIsPrunable is set, cleanup lock is available (exclusive lock is tried
conditionally, so we don't wait if there is contention) and we are running
low on free space (right now if free space is less than 1.2 times the
average
tuple size or less than BLCKSZ/8), the entire page is pruned and
fragmentation
is repaired.

We also prune-defrag is vacuum lazy and vacuum full. But I assume we
are not worried about these maintenance activities.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/10/07, Simon Riggs [EMAIL PROTECTED] wrote:

 On Mon, 2007-09-10 at 12:17 +0100, Heikki Linnakangas wrote:
  Bruce Momjian wrote:
   (Can someone time the access time for following a chain that fills an
   entire page (the worst case) vs. having a single tuple on the page?)
 
  Here is some results, on my laptop.

HEADHOT HOT-opt HOT-pruned
  seqscan   19.921.120.111.5
  idxscan   27.831.430.413.7
 

  Comparing the idxscan columns, it looks like following the chain *is*
  more expensive than having to go through killed index pointers. Pruning
  clearly does help.
 
  Given that this test is pretty much the worst case scenario, I'm ok with
  not pruning for the purpose of keeping chains short.

 I wasn't expecting that result and had accepted the counter argument.



Yes, I go with Simon. I am also surprised that HOT-pruned did
so well in this setup. I always thought that HOT would do well
in update-intensive scenarios, but from the results it seems that
HOT is also doing well for read-mostly queries.

In this particular example, the first SELECT after the 250 UPDATEs
would have pruned all the dead tuples and reduced HOT chain
to a single tuple. Hence the total time for subsequent SELECTs improved
tremendously.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] [HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 10:21 -0400, Tom Lane wrote:
 Oleg Bartunov [EMAIL PROTECTED] writes:
  On Mon, 10 Sep 2007, Simon Riggs wrote:
  Can we include that functionality now?
 
  This could be realized very easyly using dict_strict, which returns
  only known words, and mapping contains only this dictionary. So, 
  feel free to write it and submit.
 
 ... for 8.4.

I've coded a small patch to allow CaseSensitive synonyms.

  CREATE TEXT SEARCH DICTIONARY my_diction (
 TEMPLATE = biglist,
 DictFile = words,
 CaseSensitive = true
  );

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com
Index: src/backend/tsearch/dict_synonym.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/dict_synonym.c,v
retrieving revision 1.4
diff -c -r1.4 dict_synonym.c
*** src/backend/tsearch/dict_synonym.c	25 Aug 2007 02:29:45 -	1.4
--- src/backend/tsearch/dict_synonym.c	10 Sep 2007 15:14:21 -
***
*** 29,34 
--- 29,35 
  typedef struct
  {
  	int			len;	/* length of syn array */
+ 	bool		case_sensitive;
  	Syn		   *syn;
  } DictSyn;
  
***
*** 83,88 
--- 84,90 
  			   *end = NULL;
  	int			cur = 0;
  	char	   *line = NULL;
+ 	bool		case_sensitive = false;
  
  	foreach(l, dictoptions)
  	{
***
*** 95,100 
--- 97,107 
  	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  	 errmsg(unrecognized synonym parameter: \%s\,
  			defel-defname)));
+ 
+ 		if (pg_strcasecmp(CaseSensitive, defel-defname) == 0 
+ 			pg_strcasecmp(True, defGetString(defel)) == 0)
+ 			case_sensitive = true;
+ 
  	}
  
  	if (!filename)
***
*** 168,173 
--- 175,182 
  	d-len = cur;
  	qsort(d-syn, d-len, sizeof(Syn), compareSyn);
  
+ 	d-case_sensitive = case_sensitive;
+ 
  	PG_RETURN_POINTER(d);
  }
  
***
*** 180,195 
  	Syn			key,
  			   *found;
  	TSLexeme   *res;
  
  	/* note: d-len test protects against Solaris bsearch-of-no-items bug */
  	if (len = 0 || d-len = 0)
  		PG_RETURN_POINTER(NULL);
  
! 	key.in = lowerstr_with_len(in, len);
  	key.out = NULL;
  
  	found = (Syn *) bsearch(key, d-syn, d-len, sizeof(Syn), compareSyn);
! 	pfree(key.in);
  
  	if (!found)
  		PG_RETURN_POINTER(NULL);
--- 189,214 
  	Syn			key,
  			   *found;
  	TSLexeme   *res;
+ 	bool		need_pfree = false;
  
  	/* note: d-len test protects against Solaris bsearch-of-no-items bug */
  	if (len = 0 || d-len = 0)
  		PG_RETURN_POINTER(NULL);
  
! 	if (d-case_sensitive)
! 		key.in = in;
! 	else
! 	{
! 		key.in = lowerstr_with_len(in, len);
! 		need_pfree = true;
! 	}
! 
  	key.out = NULL;
  
  	found = (Syn *) bsearch(key, d-syn, d-len, sizeof(Syn), compareSyn);
! 
! 	if (need_pfree)
! 		pfree(key.in);
  
  	if (!found)
  		PG_RETURN_POINTER(NULL);

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[PATCHES] Yet more tsearch refactoring

2007-09-10 Thread Heikki Linnakangas
* Defined new struct WordEntryPosVector that holds a uint16 length and a
variable size array of WordEntries. This replaces the previous
convention of a variable size uint16 array, with the first element
implying the length. WordEntryPosVector has the same layout in memory,
but is more readable in source code. The POSDATAPTR and POSDATALEN
macros are still used, though it would now be more readable to access
the fields in WordEntryPosVector directly.

* Removed needfree field from DocRepresentation. It was always set to false.

* Miscellaneous other commenting and refactoring

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/utils/adt/tsginidx.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/adt/tsginidx.c,v
retrieving revision 1.3
diff -c -r1.3 tsginidx.c
*** src/backend/utils/adt/tsginidx.c	7 Sep 2007 16:03:40 -	1.3
--- src/backend/utils/adt/tsginidx.c	10 Sep 2007 12:04:14 -
***
*** 25,37 
  	int32	   *nentries = (int32 *) PG_GETARG_POINTER(1);
  	Datum	   *entries = NULL;
  
! 	*nentries = 0;
  	if (vector-size  0)
  	{
  		int			i;
  		WordEntry  *we = ARRPTR(vector);
  
- 		*nentries = (uint32) vector-size;
  		entries = (Datum *) palloc(sizeof(Datum) * vector-size);
  
  		for (i = 0; i  vector-size; i++)
--- 25,36 
  	int32	   *nentries = (int32 *) PG_GETARG_POINTER(1);
  	Datum	   *entries = NULL;
  
! 	*nentries = vector-size;
  	if (vector-size  0)
  	{
  		int			i;
  		WordEntry  *we = ARRPTR(vector);
  
  		entries = (Datum *) palloc(sizeof(Datum) * vector-size);
  
  		for (i = 0; i  vector-size; i++)
***
*** 134,144 
  
  	if (query-size  0)
  	{
! 		int4		i,
  	j = 0;
  		QueryItem  *item;
  		GinChkVal	gcv;
  
  		gcv.frst = item = GETQUERY(query);
  		gcv.mapped_check = (bool *) palloc(sizeof(bool) * query-size);
  
--- 133,151 
  
  	if (query-size  0)
  	{
! 		int			i,
  	j = 0;
  		QueryItem  *item;
  		GinChkVal	gcv;
  
+ 		/*
+ 		 * check-parameter array has one entry for each value (operand) in the
+ 		 * query. We expand that array into mapped_check, so that there's one
+ 		 * entry in mapped_check for every node in the query, including 
+ 		 * operators, to allow quick lookups in checkcondition_gin. Only the 
+ 		 * entries corresponding operands are actually used.
+ 		 */
+ 
  		gcv.frst = item = GETQUERY(query);
  		gcv.mapped_check = (bool *) palloc(sizeof(bool) * query-size);
  
Index: src/backend/utils/adt/tsgistidx.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/adt/tsgistidx.c,v
retrieving revision 1.3
diff -c -r1.3 tsgistidx.c
*** src/backend/utils/adt/tsgistidx.c	7 Sep 2007 15:09:56 -	1.3
--- src/backend/utils/adt/tsgistidx.c	10 Sep 2007 15:56:09 -
***
*** 133,152 
  }
  
  static int
! compareint(const void *a, const void *b)
  {
! 	if (*((int4 *) a) == *((int4 *) b))
  		return 0;
! 	return (*((int4 *) a)  *((int4 *) b)) ? 1 : -1;
  }
  
  static int
  uniqueint(int4 *a, int4 l)
  {
  	int4	   *ptr,
  			   *res;
  
! 	if (l == 1)
  		return l;
  
  	ptr = res = a;
--- 133,159 
  }
  
  static int
! compareint(const void *va, const void *vb)
  {
! 	int4 a = *((int4 *) va);
! 	int4 b = *((int4 *) vb);
! 
! 	if (a == b)
  		return 0;
! 	return (a  b) ? 1 : -1;
  }
  
+ /*
+  * Removes duplicates from an array of int4. 'l' is
+  * size of the input array. Returns the new size of the array.
+  */
  static int
  uniqueint(int4 *a, int4 l)
  {
  	int4	   *ptr,
  			   *res;
  
! 	if (l = 1)
  		return l;
  
  	ptr = res = a;
***
*** 570,581 
  } SPLITCOST;
  
  static int
! comparecost(const void *a, const void *b)
  {
! 	if (((SPLITCOST *) a)-cost == ((SPLITCOST *) b)-cost)
  		return 0;
  	else
! 		return (((SPLITCOST *) a)-cost  ((SPLITCOST *) b)-cost) ? 1 : -1;
  }
  
  
--- 577,591 
  } SPLITCOST;
  
  static int
! comparecost(const void *va, const void *vb)
  {
! 	SPLITCOST *a = (SPLITCOST *) va;
! 	SPLITCOST *b = (SPLITCOST *) vb;
! 
! 	if (a-cost == b-cost)
  		return 0;
  	else
! 		return (a-cost  b-cost) ? 1 : -1;
  }
  
  
Index: src/backend/utils/adt/tsrank.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/adt/tsrank.c,v
retrieving revision 1.4
diff -c -r1.4 tsrank.c
*** src/backend/utils/adt/tsrank.c	7 Sep 2007 16:03:40 -	1.4
--- src/backend/utils/adt/tsrank.c	10 Sep 2007 15:56:51 -
***
*** 53,74 
  {
  	WordEntry  *ptr = ARRPTR(t),
  			   *end = (WordEntry *) STRPTR(t);
! 	int			len = 0,
! clen;
  
  	while (ptr  end)
  	{
! 		if ((clen = POSDATALEN(t, ptr)) == 0)
  			len += 1;
  		else
  			len += clen;
  		ptr++;
  	}
  
  	return len;
  }
  
! static int4
  WordECompareQueryItem(char *eval, char *qval, WordEntry *ptr, QueryOperand 

Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Heikki Linnakangas
Bruce Momjian wrote:
 My guess is that once you prune a tuple
 you no longer have to check its visibility, and that is where the win is
 coming from.  

Yes, I think that's right.

Oh, one more thing occured to me. Without HOT, we not only mark index
tuples pointing to dead tuples as killed, we remove them altogether if
the index page gets full. If you modify the test case so that after
doing the updates, you insert a bunch of tuples with a different key to
fill the index page, you should see CVS HEAD winning HOT without pruning
hands down.

 If we check a tuple in a chain and the tuple is dead is it possible the
 pruning operation is cheaper than having to check the tuple again for
 visibility the next time we see it?  If so, we can just always prune
 when we see a dead tuple in the chain, which I believe was the original
 design.  Pruning becomes an operation similar to marking an index entry
 as dead.  (I assuming pruing does not generate WAL traffic.)

Pruning does generate a WAL record at the moment. Maybe you could do
some kind of a quick pruning without a WAL record. Like just modify the
ctid of the oldest dead tuple in the chain, or the redirecting line
pointer if there is one, to point to the latest live tuple, without
removing the dead tuples or the line pointers.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 18:23 +0100, Heikki Linnakangas wrote:

  If we check a tuple in a chain and the tuple is dead is it possible the
  pruning operation is cheaper than having to check the tuple again for
  visibility the next time we see it?  If so, we can just always prune
  when we see a dead tuple in the chain, which I believe was the original
  design.  Pruning becomes an operation similar to marking an index entry
  as dead.  (I assuming pruing does not generate WAL traffic.)
 
 Pruning does generate a WAL record at the moment. Maybe you could do
 some kind of a quick pruning without a WAL record. Like just modify the
 ctid of the oldest dead tuple in the chain, or the redirecting line
 pointer if there is one, to point to the latest live tuple, without
 removing the dead tuples or the line pointers.

Sounds like a great idea.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/10/07, Florian Pflug [EMAIL PROTECTED] wrote:


 Maybe that risk could be lowered if instead of a flag, we stored the
 minimal global xmin needed to prune at least one tuple.



I like the idea. The question is whether the chances of a Prunable
page being looked up again and again in presence of a long
running transaction are high enough to justify adding 4 bytes
to page header.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/10/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:


 Oh, one more thing occured to me. Without HOT, we not only mark index
 tuples pointing to dead tuples as killed, we remove them altogether if
 the index page gets full. If you modify the test case so that after
 doing the updates, you insert a bunch of tuples with a different key to
 fill the index page, you should see CVS HEAD winning HOT without pruning
 hands down.


Um. I am not sure I follow that. With HOT even if the HOT chain is long,
there is still just one index entry for the entire chain. So I don't see
why CVS HEAD would do better than HOT in the above case. Net-net
there will be equal number of index keys after the inserts.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/11/07, Bruce Momjian [EMAIL PROTECTED] wrote:

 Heikki Linnakangas wrote:

 
  Pruning does generate a WAL record at the moment. Maybe you could do
  some kind of a quick pruning without a WAL record. Like just modify the
  ctid of the oldest dead tuple in the chain, or the redirecting line
  pointer if there is one, to point to the latest live tuple, without
  removing the dead tuples or the line pointers.

 I am wondering what you even mean by removing the dead tuples when
 pruning.  I thought only defragmentation removed tuples.


Pruning removes intermediate dead tuples by marking their line pointers
~LP_USED and redirecting the root line pointer to the first
live/recently_dead
tuple in the chain. OTOH defragmentation moves tuples around to repair
fragmentation. IOW defragmentation is nothing but calling
PageRepairFragmentation on the page.

For example, if we have a HOT chain of 5 tuples:

1 -- 2 -- 3 -- 4 -- 5

of which, 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE

At the end of pruning:

- line pointer of 1 is redirected to 4
- line pointers of 2 and 3 are marked ~LP_USED
- the offset of 4 and 5 is unchanged

At the end of defragmentation:

- 4 and 5 are moved to the end of the page to create a larger
contiguous free space in the page.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Bruce Momjian
Heikki Linnakangas wrote:
 Bruce Momjian wrote:
  My guess is that once you prune a tuple
  you no longer have to check its visibility, and that is where the win is
  coming from.  
 
 Yes, I think that's right.
 
 Oh, one more thing occured to me. Without HOT, we not only mark index
 tuples pointing to dead tuples as killed, we remove them altogether if
 the index page gets full. If you modify the test case so that after
 doing the updates, you insert a bunch of tuples with a different key to
 fill the index page, you should see CVS HEAD winning HOT without pruning
 hands down.
 
  If we check a tuple in a chain and the tuple is dead is it possible the
  pruning operation is cheaper than having to check the tuple again for
  visibility the next time we see it?  If so, we can just always prune
  when we see a dead tuple in the chain, which I believe was the original
  design.  Pruning becomes an operation similar to marking an index entry
  as dead.  (I assuming pruing does not generate WAL traffic.)
 
 Pruning does generate a WAL record at the moment. Maybe you could do
 some kind of a quick pruning without a WAL record. Like just modify the
 ctid of the oldest dead tuple in the chain, or the redirecting line
 pointer if there is one, to point to the latest live tuple, without
 removing the dead tuples or the line pointers.

I am wondering what you even mean by removing the dead tuples when
pruning.  I thought only defragmentation removed tuples.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Bruce Momjian
Pavan Deolasee wrote:
   Pruning does generate a WAL record at the moment. Maybe you could do
   some kind of a quick pruning without a WAL record. Like just modify the
   ctid of the oldest dead tuple in the chain, or the redirecting line
   pointer if there is one, to point to the latest live tuple, without
   removing the dead tuples or the line pointers.
 
  I am wondering what you even mean by removing the dead tuples when
  pruning.  I thought only defragmentation removed tuples.
 
 
 Pruning removes intermediate dead tuples by marking their line pointers
 ~LP_USED and redirecting the root line pointer to the first
 live/recently_dead
 tuple in the chain. OTOH defragmentation moves tuples around to repair
 fragmentation. IOW defragmentation is nothing but calling
 PageRepairFragmentation on the page.
 
 For example, if we have a HOT chain of 5 tuples:
 
 1 -- 2 -- 3 -- 4 -- 5
 
 of which, 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE
 
 At the end of pruning:
 
 - line pointer of 1 is redirected to 4
 - line pointers of 2 and 3 are marked ~LP_USED
 - the offset of 4 and 5 is unchanged
 
 At the end of defragmentation:
 
 - 4 and 5 are moved to the end of the page to create a larger
 contiguous free space in the page.

Right.  My point is that pruning only modifies item pointers.  It does
not change the actual heap tuples.  In the quote above, how is Heikki's
quick pruning different from the pruning you are describing?

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Heikki Linnakangas
Bruce Momjian wrote:
 Pavan Deolasee wrote:
 Pruning does generate a WAL record at the moment. Maybe you could do
 some kind of a quick pruning without a WAL record. Like just modify the
 ctid of the oldest dead tuple in the chain, or the redirecting line
 pointer if there is one, to point to the latest live tuple, without
 removing the dead tuples or the line pointers.
 I am wondering what you even mean by removing the dead tuples when
 pruning.  I thought only defragmentation removed tuples.


 Pruning removes intermediate dead tuples by marking their line pointers
 ~LP_USED and redirecting the root line pointer to the first
 live/recently_dead
 tuple in the chain. OTOH defragmentation moves tuples around to repair
 fragmentation. IOW defragmentation is nothing but calling
 PageRepairFragmentation on the page.

 For example, if we have a HOT chain of 5 tuples:

 1 -- 2 -- 3 -- 4 -- 5

 of which, 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE

 At the end of pruning:

 - line pointer of 1 is redirected to 4
 - line pointers of 2 and 3 are marked ~LP_USED
 - the offset of 4 and 5 is unchanged

 At the end of defragmentation:

 - 4 and 5 are moved to the end of the page to create a larger
 contiguous free space in the page.
 
 Right.  My point is that pruning only modifies item pointers.  It does
 not change the actual heap tuples.  In the quote above, how is Heikki's
 quick pruning different from the pruning you are describing?

In quick pruning, you don't mark the line pointer as not used.
Otherwise someone might reuse the line pointer for another tuple, which
is a problem if you then crash. WAL replay would see an insert to a line
pointer that's already in use (the quick pruning wouldn't be WAL
logged), which would raise an error.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/11/07, Bruce Momjian [EMAIL PROTECTED] wrote:



 Right.  My point is that pruning only modifies item pointers.  It does
 not change the actual heap tuples.  In the quote above, how is Heikki's
 quick pruning different from the pruning you are describing?



I think the only difference is that the quick pruning does not mark
intermediate tuples ~LP_USED and hence we may avoid WAL logging.

Btw, I am not too enthusiastic about quick pruning because it leaves
behind dead heap-only tuples which are not reachable from any root
heap tuples. Not that we can't handle them, but I am worried about
making such changes right now unless we are sure about the benefits.
We can always tune and tweak in 8.4

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 20:46 +0100, Heikki Linnakangas wrote:
 Bruce Momjian wrote:
  Pavan Deolasee wrote:
  At the end of pruning:
 
  - line pointer of 1 is redirected to 4
  - line pointers of 2 and 3 are marked ~LP_USED
  - the offset of 4 and 5 is unchanged
 

  In the quote above, how is Heikki's
  quick pruning different from the pruning you are describing?

 In quick pruning, you don't mark the line pointer as not used.
 Otherwise someone might reuse the line pointer for another tuple, which
 is a problem if you then crash. WAL replay would see an insert to a line
 pointer that's already in use (the quick pruning wouldn't be WAL
 logged), which would raise an error.

That seems like the best way forward to me. It's just a hint that says
where the first live tuple is in the chain.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Bruce Momjian
Gregory Stark wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
 
  Looking at the patch I see:
 
  +   /*
  +* Mark the page as clear of prunable tuples. If we find a tuple 
  which
  +* may become prunable, we shall set the hint again.
  +*/
  +   PageClearPrunable(page);
 
  I like the idea of the page hint bit, but my question is if there is a
  long-running transaction, isn't each SELECT going to try to defragment a
  page over and over again because there is still something prunable on
  the page?
 
 Well it'll try to prune the chains over and over again. If it doesn't find
 anything it won't defragment, but yes.
 
 I think we could tackle that by storing on the page the oldest xmax which
 would allow us to prune a tuple. Then you could compare RecentGlobalXmin
 against that and not bother looking at any other chains if it hasn't been
 passed yet. 

Yea, that might work if we have space on the page.

 It would be hard to do that with single-chain pruning though, once you the
 limiting tuple you would then wouldn't know what the next limiting xmax is
 until the next time you do a full-page prune. Still that gets us down to at
 most two full-page prunes per update instead of a potentially unbounded number
 of prunes.
 
 This seems like a further optimization to think about after we have a place to
 trigger the pruning where it'll do the most good.

I was thinking if we could only try defragementing when we are doing an
INSERT or UPDATE, then the defragment would either free enough space so
we could stay on the page or the new row would go on another page and we
would probably not return to that page again anytime soon.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[PATCHES] External builds need tsearch include files

2007-09-10 Thread Heikki Linnakangas
I tried to compile a custom tsearch dictionary as an external module,
but it couldn't find the tsearch include files in the server
installation. tsearch subdirectory needs to be added to
src/include/Makefile, otherwise they don't get installed:

*** src/include/Makefile9 Dec 2005 21:19:35 -   1.21
--- src/include/Makefile10 Sep 2007 19:53:54 -
***
*** 18,24 

  # Subdirectories containing headers for server-side dev
  SUBDIRS = access bootstrap catalog commands executor lib libpq mb \
!   nodes optimizer parser regex rewrite storage tcop utils \
port port/win32 port/win32/arpa port/win32/netinet port/win32/sys

  # Install all headers
--- 18,24 

  # Subdirectories containing headers for server-side dev
  SUBDIRS = access bootstrap catalog commands executor lib libpq mb \
!   nodes optimizer parser regex rewrite storage tcop tsearch utils \
port port/win32 port/win32/arpa port/win32/netinet port/win32/sys

  # Install all headers


-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Simon Riggs
On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote:

 I think the only difference is that the quick pruning does not mark
 intermediate tuples ~LP_USED and hence we may avoid WAL logging.

Sounds great.

 Btw, I am not too enthusiastic about quick pruning because it leaves 
 behind dead heap-only tuples which are not reachable from any root
 heap tuples. Not that we can't handle them, but I am worried about
 making such changes right now unless we are sure about the benefits.
 We can always tune and tweak in 8.4

...but we don't want to reach them. They are waiting to be removed.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Gregory Stark

Pavan Deolasee [EMAIL PROTECTED] writes:

 On 9/11/07, Bruce Momjian [EMAIL PROTECTED] wrote:

 Right.  My point is that pruning only modifies item pointers.  It does
 not change the actual heap tuples.  In the quote above, how is Heikki's
 quick pruning different from the pruning you are describing?

 I think the only difference is that the quick pruning does not mark
 intermediate tuples ~LP_USED and hence we may avoid WAL logging.

 Btw, I am not too enthusiastic about quick pruning because it leaves
 behind dead heap-only tuples which are not reachable from any root
 heap tuples. Not that we can't handle them, but I am worried about
 making such changes right now unless we are sure about the benefits.
 We can always tune and tweak in 8.4

You could mark such tuples with LP_DELETE. That would also let other
transactions quickly tot up how much space would be available if they were to
run PageRepairFragmentation.

But if you don't WAL log setting LP_DELETE then you would still have to deal
with unreachable HOT tuples who lost their LP_DELETE flag. I suppose that as
long as PageRepairFragmentation first looks for any dead HOT tuples without
depending on LP_DELETE that would be enough.

I wonder if you could do the quick prune without even taking the exclusive
page lock? You're overwriting 16 bits and you know nobody else will be
modifying any of the line pointers in question to anything else. (because your
pin prevents the vacuum lock from coming in and trying to mark it unused).

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PATCHES] Yet more tsearch refactoring

2007-09-10 Thread Teodor Sigaev



Heikki Linnakangas wrote:

* Defined new struct WordEntryPosVector that holds a uint16 length and a
variable size array of WordEntries. This replaces the previous
convention of a variable size uint16 array, with the first element
implying the length. WordEntryPosVector has the same layout in memory,
but is more readable in source code. The POSDATAPTR and POSDATALEN
macros are still used, though it would now be more readable to access
the fields in WordEntryPosVector directly.


Did you check it on 64-bit boxes with strict alignment? I remember that was a 
headache for me.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Yet more tsearch refactoring

2007-09-10 Thread Heikki Linnakangas
Teodor Sigaev wrote:
 Heikki Linnakangas wrote:
 * Defined new struct WordEntryPosVector that holds a uint16 length and a
 variable size array of WordEntries. This replaces the previous
 convention of a variable size uint16 array, with the first element
 implying the length. WordEntryPosVector has the same layout in memory,
 but is more readable in source code. The POSDATAPTR and POSDATALEN
 macros are still used, though it would now be more readable to access
 the fields in WordEntryPosVector directly.
 
 Did you check it on 64-bit boxes with strict alignment? I remember that
 was a headache for me.

No, I didn't. But I don't see how it would make a difference, the
resulting memory layout is the same AFAICS.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote:
 I think the only difference is that the quick pruning does not mark
 intermediate tuples ~LP_USED and hence we may avoid WAL logging.

 Sounds great.

What it sounds is utterly unsafe.  You can get away with not WAL-logging
individual bit flips (that is, hint-bit-setting) because either state of
the page is valid.  If I read this proposal correctly it is to change
t_ctid without WAL-logging, which means that a partial page write (torn
page syndrome) could leave the page undetectably corrupted --- t_ctid
is 6 bytes and could easily cross a hardware sector boundary.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Bruce Momjian
Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote:
  I think the only difference is that the quick pruning does not mark
  intermediate tuples ~LP_USED and hence we may avoid WAL logging.
 
  Sounds great.
 
 What it sounds is utterly unsafe.  You can get away with not WAL-logging
 individual bit flips (that is, hint-bit-setting) because either state of
 the page is valid.  If I read this proposal correctly it is to change
 t_ctid without WAL-logging, which means that a partial page write (torn
 page syndrome) could leave the page undetectably corrupted --- t_ctid
 is 6 bytes and could easily cross a hardware sector boundary.

OK, we certainly want to avoid WAL writes for pruning a single chain if
possible.  One idea would be to just mark the item pointer as dead but
not repoint the first item pointer to point to the first live tuple.

Heikki, would that give us speedups similar to what you saw in your
earlier tests?  It would if the majority of the cost of scanning the
page looking for the first live tuple was in doing the visibility tests
on every tuple.

Basically, when we are walking the chain via an index and we see the
tuple is dead we mark it as dead but don't repoint.  We would repoint
and reclaim on a defragementation.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/11/07, Gregory Stark [EMAIL PROTECTED] wrote:




 You could mark such tuples with LP_DELETE. That would also let other
 transactions quickly tot up how much space would be available if they were
 to
 run PageRepairFragmentation.





IMHO we are making full circles here. We have already tried LP_DELETE
techniques
and moved away to simplify things. We also tried reusing dead space without
running PageRepairFragmentation. Each of these techniques worked just fine
with slightly different performance characteristics. What we now have is a
simplified algorithm which is much easier to follow and is safer, yet giving
us a very good performance boost. I am not sure if this is the right time to
throw new ideas because we would never be sure as what we are doing
would be the most optimal solution. Would it help if we go with some
solution right now, get rest of the review process done and then use
the feedback during beta testing to tune things ? We may have far more
data points at that time to choose one technique over other. And we
would also know what areas to focus on.

I am also worried that by focusing too much on this issue we may
overlook some other correctness issue in the patch.

From whatever we have discussed so far, IMHO we should do the following
things and let rest of the review process proceed

- Defragment a page only when the free space left in the page is not
enough to accommodate even a single tuple (use average tuple length for
this decision). This would mean we might be defragmenting even though
there is no immediate UPDATE to the page. But we can treat this as
fillfactor which allows us to provision for the next UPDATE coming to
the page. Since we are defragmenting when the page is almost full
hopefully we would reclaim good amount of space in the page and
won't call defragmentation for next few UPDATEs.

We already have mechanism to track average tuple request size in
relcache. May be we can have some relcache invalidation to keep the
information in sync (send invalidation when the average request size
changes by say 5%)

- Avoid pruning chains in every index or seq lookup. But if the chain
becomes longer than X tuples, mark the page to be pruned in the
next lookup. We can choose to separate prune and defragmentation
and only do pruning in this case. But I would prefer to keep them
together for now.

- Track the minimum xmin in the page header to avoid repeated
(wasted) attempts to prune a Prunable page in the presence of long running
transactions.

We can save rest of the techniques for beta testing period or 8.4.


Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Bruce Momjian
Pavan Deolasee wrote:
 IMHO we are making full circles here. We have already tried LP_DELETE
 techniques
 and moved away to simplify things. We also tried reusing dead space without
 running PageRepairFragmentation. Each of these techniques worked just fine
 with slightly different performance characteristics. What we now have is a
 simplified algorithm which is much easier to follow and is safer, yet giving
 us a very good performance boost. I am not sure if this is the right time to
 throw new ideas because we would never be sure as what we are doing
 would be the most optimal solution. Would it help if we go with some
 solution right now, get rest of the review process done and then use
 the feedback during beta testing to tune things ? We may have far more
 data points at that time to choose one technique over other. And we
 would also know what areas to focus on.

I totally understand.  I was just throwing out ideas to make sure I
fully understood it and to explore various options.  I do agree we
should just wait for the review, knowing we have these trade-offs.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match