Re: [PATCHES] Explain XML patch v2

2008-07-04 Thread Tom Raney

Quoting daveg [EMAIL PROTECTED]:


> On Wed, Jul 02, 2008 at 09:01:18AM -0700, David Fetter wrote:
>
> > On Wed, Jul 02, 2008 at 05:57:29PM +0200, Peter Eisentraut wrote:
> > > It would also be interesting if EXPLAIN could optionally be a
> > > function that returns a datum of type XML, to allow further
> > > processing.
> >
> > It would be better to have a function which allows people to plug in
> > their own serialization.  A JSON or YAML one, for example, would be
> > much lighter weight on both ends.
>
> +1 for either of these.
>
> -dg



So, this leads me to the idea of assembling the EXPLAIN data
internally in an output-neutral data structure.  At the very end of
processing, a single decision point would select which output plugin to
use.  Sprinkling XML print statements throughout the code (as the patch
currently does), while functional, is not ideal.  And the escaping of
XML content should ideally be done in the serializer anyway.
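
Roughly, I am picturing something like this (just a sketch to illustrate
the idea -- the type and function names here are invented, not what the
patch currently uses):

    /* Sketch only: hypothetical names, not the patch's actual code. */
    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "nodes/pg_list.h"

    typedef enum ExplainFormat
    {
        EXPLAIN_FORMAT_TEXT,
        EXPLAIN_FORMAT_XML
    } ExplainFormat;

    /* One output-neutral node per plan node, filled in while walking the plan. */
    typedef struct ExplainNode
    {
        const char *node_name;      /* e.g. "Seq Scan" */
        double      startup_cost;
        double      total_cost;
        double      plan_rows;
        int         plan_width;
        List       *children;       /* list of ExplainNode * */
    } ExplainNode;

    /*
     * At the very end of EXPLAIN, one decision picks the serializer;
     * nothing upstream knows or cares about the output format.
     */
    static void
    explain_serialize(ExplainNode *root, ExplainFormat fmt, StringInfo buf)
    {
        switch (fmt)
        {
            case EXPLAIN_FORMAT_XML:
                /* walk root recursively, emitting XML; escaping happens here */
                appendStringInfo(buf, "<plan name=\"%s\">...</plan>\n",
                                 root->node_name);
                break;
            case EXPLAIN_FORMAT_TEXT:
            default:
                /* the existing one-line-per-node text output */
                appendStringInfo(buf, "%s  (cost=%.2f..%.2f rows=%.0f width=%d)\n",
                                 root->node_name, root->startup_cost,
                                 root->total_cost, root->plan_rows,
                                 root->plan_width);
                break;
        }
    }

A JSON or YAML plugin would then just be another case in that switch.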


Of course, this realization didn't occur to me until *after* I had  
spent a bit of time coding up the patch in its current form.  Oh well.


Thoughts?

Regarding the XML datum: in order to support that, would all users need
to compile with libxml?  Are there any lighter-weight ways to serialize
than libxml?


-Tom Raney






Re: [PATCHES] Explain XML patch v2

2008-07-02 Thread Tom Raney

Peter Eisentraut wrote:

> On Wednesday, July 2, 2008, Tom Raney wrote:
> > This is an update to my EXPLAIN XML patch submitted a few days ago.
>
> Could you explain how you came up with the XML schema design?  I suppose you
> just made something up that went along with the existing XML output.

Yes, it is based on the existing output.

> I would like to see more integration with the spirit of the existing XML
> functionality.  For example, instead of things like
>
>     "<runtime ms=\"%.3f\" />\n"
>
> we ought to be using XML Schema data types for time intervals and so on.

The DTD provides only rudimentary document validation but it has no
support for type checking.  So, it may make sense to move to the more
rigorous XML Schema.  There is a 'duration' data type that could be used
for the instance listed above.  Or, we could define our own.

> We might also want to use an XML namespace.

Taking the 'ms' field listed above:
xmlns="http://www.postgresql.org/v8.4/ms" or something like this?

> Table and index names should be escaped using the existing escape mechanism
> for identifiers.  There might also be encoding issues.

That's a good point.  Or, wrap them with CDATA.
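
To make that concrete, the sort of escaping I have in mind for the
serializer would look something like this (a hypothetical helper to
illustrate, not code from the patch; it just uses the backend's
StringInfo routines):

    /* Hypothetical sketch only -- not part of the patch. */
    #include "postgres.h"
    #include "lib/stringinfo.h"

    static void
    append_xml_escaped(StringInfo buf, const char *s)
    {
        for (; *s; s++)
        {
            switch (*s)
            {
                case '&':
                    appendStringInfoString(buf, "&amp;");
                    break;
                case '<':
                    appendStringInfoString(buf, "&lt;");
                    break;
                case '>':
                    appendStringInfoString(buf, "&gt;");
                    break;
                case '"':
                    appendStringInfoString(buf, "&quot;");
                    break;
                default:
                    appendStringInfoChar(buf, *s);
                    break;
            }
        }
    }

CDATA would avoid most of that, although a name containing "]]>" would
still need special handling.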
> It would also be interesting if EXPLAIN could optionally be a function that
> returns a datum of type XML, to allow further processing.
>
> Any thoughts on these issues?

I am in the process of writing a parser of this XML output for the Red
Hat Visual Explain tool.  I want to see what surprises come up during
implementation.






[PATCHES] Explain XML patch v2

2008-07-01 Thread Tom Raney

This is an update to my EXPLAIN XML patch submitted a few days ago.

I've added a documentation patch and modified some of the code per  
comments by Gregory Stark.


Because the main consumer of output generated by this patch will  
presumably be a machine, I didn't  clutter up the documentation with  
matching XML output for every standard example query.  But, I added  
enough to hopefully give the user an idea of what to expect.


Regards,

Tom Raney



*** doc/src/sgml/perform.sgml.orig  2008-07-01 20:27:19.0 -0700
--- doc/src/sgml/perform.sgml   2008-07-01 20:34:44.0 -0700
***
*** 47,59 
  operations on the raw rows, then there will be additional nodes
  <quote>atop</quote> the scan nodes to perform these operations.  Again,
  there is usually more than one possible way to do these operations,
! so different node types can appear here too.  The output
  of <command>EXPLAIN</command> has one line for each node in the plan
  tree, showing the basic node type plus the cost estimates that the planner
  made for the execution of that plan node.  The first line (topmost node)
  has the estimated total execution cost for the plan; it is this number
  that the planner seeks to minimize.
 </para>
  
 <para>
  Here is a trivial example, just to show what the output looks like.
--- 47,62 
  operations on the raw rows, then there will be additional nodes
  <quote>atop</quote> the scan nodes to perform these operations.  Again,
  there is usually more than one possible way to do these operations,
! so different node types can appear here too.  The standard output
  of <command>EXPLAIN</command> has one line for each node in the plan
  tree, showing the basic node type plus the cost estimates that the planner
  made for the execution of that plan node.  The first line (topmost node)
  has the estimated total execution cost for the plan; it is this number
  that the planner seeks to minimize.
 </para>
+<para>
+ For examples of XML output, see the bottom of this page.
+</para>
  
 <para>
  Here is a trivial example, just to show what the output looks like.
***
*** 448,453 
--- 451,513 
  process the table in any case, so there's no value in expending additional
  page reads to look at an index.
 </para>
+ 
+<para>
+ Examples of XML output:
+ <programlisting>
+ EXPLAIN XML SELECT * FROM tenk1;
+ 
+    QUERY PLAN
+ --------------------------------------------------------
+ <![CDATA[ <?xml version="1.0"?>
+ 
+  <explain version="x.x">
+  <plan name="Seq Scan" indent="0">
+    <table name="tenk1"/>
+    <cost startup="0.00" total="458.00" rows="1" width="244" />
+  </plan>
+  </explain>
+ (8 rows)]]>
+ </programlisting>
+ </para>
+ <para>
+ <programlisting>
+ EXPLAIN ANALYZE XML SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
+ 
+   QUERY PLAN
+ --------------------------------------------------------
+ <![CDATA[ <?xml version="1.0"?>
+ 
+  <explain version="x.x">
+  <plan name="Nested Loop" indent="0">
+    <cost startup="5.03" total="693.17" rows="100" width="488" />
+    <analyze time_start="0.981" time_end="6.118" rows="100" loops="1" />
+  </plan>
+  <plan name="Bitmap Heap Scan" indent="3">
+    <table name="tenk1" alias="t1"/>
+    <cost startup="5.03" total="221.07" rows="100" width="244" />
+    <analyze time_start="0.826" time_end="2.226" rows="100" loops="1" />
+    <qualifier type="Recheck Cond" value="(unique1 < 100)" />
+  </plan>
+  <plan name="Bitmap Index Scan" indent="6">
+    <index name="tenk1_unique1" />
+    <cost startup="0.00" total="5.00" rows="100" width="0" />
+    <analyze time_start="0.663" time_end="0.663" rows="100" loops="1" />
+    <qualifier type="Index Cond" value="(unique1 < 100)" />
+  </plan>
+  <plan name="Index Scan" indent="3">
+    <index name="tenk2_unique2" />
+    <table name="tenk2" alias="t2"/>
+    <cost startup="0.00" total="4.71" rows="1" width="244" />
+    <analyze time_start="0.020" time_end="0.022" rows="1" loops="100" />
+    <qualifier type="Index Cond" value="(t2.unique2 = t1.unique2)" />
+  </plan>
+  <runtime ms="7.204" />
+  </explain>
+ (28 rows)]]>
+ </programlisting>
+ </para>
+ 
   </sect1>
  
   <sect1 id="planner-stats">
*** doc/src/sgml/ref/explain.sgml.orig  2008-07-01 20:29:14.0 -0700
--- doc/src/sgml/ref/explain.sgml   2008-06-29 20:05:37.0 -0700
***
*** 30,36 
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
  
--- 30,36 
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ANALYZE ] [ VERBOSE ] [ XML [ DTD ] ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
  
***
*** 112,117 
--- 112,135 
 </varlistentry>
  
 <varlistentry>
+ <term><literal>XML</literal></term>
+ <listitem>
+  <para>
+   Emit XML output instead of the standard output.
+  </para>
+ </listitem>
+</varlistentry>
+ 
+<varlistentry>
+ <term><literal>DTD</literal>

[PATCHES] Hash Index Build Patch v2

2007-10-21 Thread Tom Raney
This revised version of our patch uses the function estimate_rel_size() 
from plancat.c to estimate the number of tuples in the parent relation.  
This method is an alternative to scanning the parent relation to 
estimate the number of tuples, as we did in the first version of the patch.
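
In outline, the call is along these lines (a sketch only -- see the
attached patch for the exact code; in particular, passing NULL for the
attr_widths argument here is just for illustration):

    /* estimate_rel_size() is declared in optimizer/plancat.h */
    BlockNumber pages;
    double      tuples;

    estimate_rel_size(heap, NULL, &pages, &tuples);

    /* "tuples" then drives how many bucket pages get pre-allocated. */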


-Tom
#include <stdio.h>
#include <stdlib.h>


extern int main(int argc, char **argv) {
	char *p;
	long i, tups, seed;

	if (argc < 3) {
		printf("Too few args.  <tuples> <seed>\n");
		return 0;
	}

	tups = strtol(argv[1], &p, 0);
	seed = strtol(argv[2], &p, 0);
	srand(seed);

	for (i = 0; i < tups; i++) {
		printf("%d\n", rand());
	}

	return 0;
}

*** ./backend/access/hash/hash.c.orig   2007-09-23 19:01:09.0 -0700
--- ./backend/access/hash/hash.c        2007-10-21 12:07:48.455594000 -0700
***
*** 22,33 
  #include "access/hash.h"
  #include "catalog/index.h"
  #include "commands/vacuum.h"
  
  
  /* Working state for hashbuild and its callback */
  typedef struct
  {
!   double  indtuples;
  } HashBuildState;
  
  static void hashbuildCallback(Relation index,
--- 22,36 
  #include "access/hash.h"
  #include "catalog/index.h"
  #include "commands/vacuum.h"
+ #include "optimizer/plancat.h"
  
  
  /* Working state for hashbuild and its callback */
  typedef struct
  {
!   double      indtuples;  /* The current number of index tuples */
!   Relation    heapRel;    /* The index covers this heap relation */
!   HSpool     *spool;      /* Used to sort the index tuples before insertion into the index */
  } HashBuildState;
  
  static void hashbuildCallback(Relation index,
***
*** 40,46 
--- 43,80 
  
  /*
   *hashbuild() -- build a new hash index.
+  *
+  *
+  *   The algorithm:
+  *(1) Initialize the build state
+  *(2) Retrieve estimate of tuple count from estimate_rel_size(); 
+  *(3) Transform the heap file tuples into index tuples (itups),
+  *while inserting them into a spool.  If the spool overflows
+  *memory, sort it into runs and spill it to disk
+  *(4) Finish sorting the spool
+  *(5) Pre-initialize all the buckets of the final index
+  *(6) Insert itups from the spool into the index
+  *
+  *   Sorting the tuples before inserting them into the index is a classical
+  * bulk-load technique, also used in the BTree code.  The sort is done in
+  * hash value order.
+  *   Pre-allocating the buckets minimizes the number of overflow pages.
+  *   The reason for step (2) is that in order to sort, in step (3), one must
+  * know the hash value, which depends on the number of buckets, which in turn
+  * depends on the number of itups = the number of rows in the heap file.
+  *   Steps (3),(4) and (6) parallel similar steps in the BTree code.
+  *
+  *   Here is an alternative algorithm:
+  *(1') Same as (1)
+  *(2') Scan the heap file, counting the number of rows, forming index
+  * tuples and inserting them into a spool (the spool is not 
presorted).
+  *(3') Sort the spool
+  *(4') same as (5)
+  *(5') same as (6)
+  *Although this algorithm would be somewhat faster, we prefer the existing
+  * algorithm because it reuses existing BTree code.
   */
+ 
  Datum
  hashbuild(PG_FUNCTION_ARGS)
  {
***
*** 50,55 
--- 84,94 
IndexBuildResult *result;
double  reltuples;
HashBuildState buildstate;
+   double      tuples;
+   BlockNumber pages;
+   HashMetaPage    metap;
+   Buffer      metabuf;
+   uint32      num_bkt;    /* Estimated number of buckets in the final index */
  
/*
 * We expect to be called exactly once for any index relation. If that's
***
*** 59,81 
	elog(ERROR, "index \"%s\" already contains data",
 RelationGetRelationName(index));
  
!   /* initialize the hash index metadata page */
!   _hash_metapinit(index);
! 
!   /* build the index */
buildstate.indtuples = 0;
  
!   /* do the heap scan */
!   reltuples = IndexBuildHeapScan(heap, index, indexInfo,
!                                  hashbuildCallback, (void *) &buildstate);
  
-   /*
-    * Return statistics
-    */
-   result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
  
!   result->heap_tuples = reltuples;
!   result->index_tuples = buildstate.indtuples;
  
PG_RETURN_POINTER(result);
  }
--- 98,158 
	elog(ERROR, "index \"%s\" already contains data",
 RelationGetRelationName(index));
  
!   /* initialize the build state */
buildstate.indtuples = 0;
+   buildstate.heapRel = heap;
+   buildstate.spool = h_spoolinit(index);
  
!   /*
!    * Retrieve an estimate of the number of rows
!    */
!   tuples = 0;
!   

Re: [PATCHES] Hash Index Build Patch

2007-09-26 Thread Tom Raney

Alvaro Herrera wrote:

> Hi Tom,
>
> Tom Raney wrote:
>
> > We used spool functions from the BTree code to sort the index
> > tuples. Sorting is done on the hash value of the tuples.  The hash
> > value depends on the number of primary bucket pages (henceforth
> > just bucket pages) that will be required to fit all the index
> > tuples. So, before sorting, the base relation is scanned to get
> > the total number of tuples.
>
> Just wondering, wouldn't it be enough to obtain a tuple count estimate
> by using reltuples / relpages * RelationGetNumberOfBlocks, like the
> planner does?

Hello Alvaro,

We thought of that, and the verdict is still out on whether it is more
costly to scan the entire relation to get the accurate count, or to use
the estimate and hope for the best, with the possibility of splits
occurring during the build.  If we use the estimate and it is completely
wrong (with the actual tuple count being much higher), the sort will
provide no benefit and the build will behave as the original code did.


But, to be honest, I don't know exactly when the catalog is updated and
how accurate the estimate is.  If you have any information there (or
anyone else does), please let me know.  It would be great to eliminate
that extra pass!
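
If I follow the suggestion, the estimate would be computed along these
lines (just a sketch, not code from the patch; whether reltuples and
relpages are fresh enough is exactly the question above):

    /* Sketch of the planner-style estimate; "heap" is the open parent relation. */
    BlockNumber curpages = RelationGetNumberOfBlocks(heap);
    double      density = 0.0;
    double      est_tuples;

    if (heap->rd_rel->relpages > 0)
        density = heap->rd_rel->reltuples / (double) heap->rd_rel->relpages;

    /* density stays 0 if the table has no stats yet */
    est_tuples = density * curpages;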


Sincerely,
Tom Raney





[PATCHES] Hash Index Build Patch

2007-09-25 Thread Tom Raney

Hello All,

We have prepared a patch (against CVS HEAD) for the TODO item:
* Hash
  -During index creation, pre-sort the tuples to improve build speed
 http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php

Details of this patch's performance improvements can be found at 
http://web.cecs.pdx.edu/~raneyt/pg/.


For example, in one 12 million tuple example, the 8.2.4 hash index
build required over 2.8 hours while this patch's build required 80
seconds. Hash build times are now comparable to BTree build times,
for example, for a 60 million tuple example the patched hash index
build time is 8.1 minutes and the BTree build time is 4.6 minutes.

We used spool functions from the BTree code to sort the index
tuples. Sorting is done on the hash value of the tuples.  The hash
value depends on the number of primary bucket pages (henceforth
just bucket pages) that will be required to fit all the index
tuples. So, before sorting, the base relation is scanned to get
the total number of tuples. Then, the fill factor (either default
or user defined, as applicable) is used to get the estimate of the
number of bucket pages.
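
For instance (with made-up numbers, just to illustrate): if the scan
finds 1,000,000 tuples and the fill factor works out to 220 index tuples
per bucket page, the estimate is 1,000,000 / 220 + 1 = 4,546 bucket
pages.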

The 8.2.4 code builds the index by inserting one tuple at a time
and splitting buckets as needed.  We pre-allocate the estimated
number of bucket pages all at once. This avoids bucket page splits
and thus redistribution of index tuples between the bucket page
that caused the split and the newly created bucket page.

We changed the values of low_mask, high_mask and max_bucket, used
by hashkey2bucket () while inserting index tuples to bucket pages.
They are set as follows:
Low_mask = (power-of-2 value - 1) less than the estimate.
High_mask = (power-of-2 value - 1) greater than the estimate.
Max_bucket = estimated number of buckets - 1 (because bucket numbers
start from 0).
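
For example, with an estimated 4,555 buckets (the figure used in the
hashsort.c comments below), low_mask = 4,095 (0x0FFF), high_mask = 8,191
(0x1FFF), and max_bucket = 4,554.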

(Please note: hashsort.c is a new file and resides in 
src/backend/access/hash/)


We have also attached the simple data generator we used in the
tests.  Our SQL statements were as follows:

DROP TABLE IF EXISTS test;

CREATE TABLE test (
  value  INTEGER
);

COPY test FROM 'data';
\timing
CREATE INDEX hash ON test USING HASH(value);

Regards,
Shreya Bhargava [EMAIL PROTECTED]
Tom Raney [EMAIL PROTECTED]

/*-------------------------------------------------------------------------
 * hashsort.c
 *-------------------------------------------------------------------------*/

/* 
 * Functions needed to support initializing and sorting tuples in the spool
 * in hash.c
 */


#include "postgres.h"

#include "access/hash.h"
#include "miscadmin.h"
#include "storage/smgr.h"
#include "utils/tuplesort.h"

struct HSpool
{
/* State data for tuplesort.c */
Tuplesortstate *sortstate; 
Relation index;
};

/* create and initialize the spool structure */
HSpool *h_spoolinit(Relation index)
{
HSpool *hspool;
int hKbytes;

hspool = (HSpool *) palloc0(sizeof(HSpool));

hspool->index = index;

/* hKbytes is the amount of memory we are going to use
 * to sort the spool.  
 */

hKbytes = maintenance_work_mem;
hspool->sortstate = tuplesort_begin_index(index, false, hKbytes, false);

/* Substitute the default compare call-back function
 * for a specific hash index build compare function.
 */

tuplesort_set_hashindex(hspool->sortstate);

return hspool;

}

void h_spool(IndexTuple itup, HSpool *hspool)
{
tuplesort_putindextuple(hspool->sortstate, itup);
}

/* 
 * h_bkt_num() estimates the number of buckets
 * in the final hash table.  
 *
 */

uint32 h_bkt_num(uint32 tuples, Relation rel)
{
int32 ffactor;
int32 data_width;
int32 item_width;
uint32 bkt_num;

/*
 * Calculate the fill factor.  We could get this from the meta page
 * but at the point this method is called, the meta page has not been
 * created.
 *
 */

data_width = get_typavgwidth(RelationGetDescr(rel)->attrs[0]->atttypid,
                             RelationGetDescr(rel)->attrs[0]->atttypmod);
item_width = MAXALIGN(sizeof(IndexTupleData)) + MAXALIGN(data_width) +
             sizeof(ItemIdData);

ffactor = RelationGetTargetPageUsage(rel, HASH_DEFAULT_FILLFACTOR) / item_width;
if (ffactor < 10)
    ffactor = 10;

bkt_num = tuples / ffactor;
bkt_num = bkt_num + 1;

return bkt_num;
}

/*  In order to define an order for the index tuples, there must be a mask
 *  applied to the 32 bit hash value of the index key during the sort 
 *  process.
 *  For example, if there are 4,555 buckets approximated, the mask (or
 *  modulo) would be 8,191 (hex value 1FFF).
 *
 */

void h_set_bkt_mask(HSpool *spool, uint32 bkts) {
uint32 bkt_pwr2, bkt_mask;

bkt_pwr2 = _hash_log2(bkts);
bkt_mask = (1 << (bkt_pwr2)) - 1;