Re: [PATCHES] Explain XML patch v2
Quoting daveg [EMAIL PROTECTED]:
> On Wed, Jul 02, 2008 at 09:01:18AM -0700, David Fetter wrote:
>> On Wed, Jul 02, 2008 at 05:57:29PM +0200, Peter Eisentraut wrote:
>>> It would also be interesting if EXPLAIN could optionally be a function
>>> that returns a datum of type XML, to allow further processing.
>>
>> It would be better to have a function which allows people to plug in
>> their own serialization. A JSON or YAML one, for example, would be much
>> lighter weight on both ends.
>
> +1 for either of these.
>
> -dg

So, this leads me to the idea of assembling the EXPLAIN data internally in an output-neutral data structure. At the very end of processing, one decision statement would choose which output plugin to use. Sprinkling XML print statements throughout the code (as currently done in the patch), while functional, is not ideal. And the escaping of XML content should ideally be done in the serializer anyway. Of course, this realization didn't occur to me until *after* I had spent a bit of time coding up the patch in its current form. Oh well. Thoughts?

Regarding the XML datum: in order to support that, will all users need to compile with libxml? Are there any lighter-weight ways to serialize other than libxml?

-Tom Raney

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] Explain XML patch v2
Peter Eisentraut wrote:
> On Wednesday, July 2, 2008, Tom Raney wrote:
>> This is an update to my EXPLAIN XML patch submitted a few days ago.
>
> Could you explain how you came up with the XML schema design? I suppose
> you just made something up that went along with the existing XML output.

Yes, it is based on the existing output.

> I would like to see more integration with the spirit of the existing XML
> functionality. For example, instead of things like
>
> "<runtime ms=\"%.3f\" />\n"
>
> we ought to be using XML Schema data types for time intervals and so on.

The DTD provides only rudimentary document validation; it has no support for type checking. So, it may make sense to move to the more rigorous XML Schema. There is a 'duration' data type that could be used for the instance listed above. Or, we could define our own.

We might also want to use an XML namespace. Taking the 'ms' field listed above:

xmlns="http://www.postgresql.org/v8.4/ms"

or something like this?

> Table and index names should be escaped using the existing escape
> mechanism for identifiers. There might also be encoding issues.

That's a good point. Or, wrap them with CDATA.

> It would also be interesting if EXPLAIN could optionally be a function
> that returns a datum of type XML, to allow further processing.

Any thoughts on these issues? I am in the process of writing a parser for this XML output for the Red Hat Visual Explain tool. I want to see what surprises come up during implementation.
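As a rough sketch of the schema-typed alternative (hypothetical only: the element names, namespace URIs, and the use of xs:duration are illustrative, not part of the patch), the runtime element might look like:

```xml
<!-- Hypothetical sketch: 7.204 ms expressed as an XML Schema duration
     (PT0.007204S) instead of a bare "ms" attribute. -->
<explain xmlns="http://www.postgresql.org/explain"
         xmlns:xs="http://www.w3.org/2001/XMLSchema"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <runtime xsi:type="xs:duration">PT0.007204S</runtime>
</explain>
```

The advantage is that any schema-aware consumer can validate and parse the value without knowing PostgreSQL's unit conventions.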
[PATCHES] Explain XML patch v2
This is an update to my EXPLAIN XML patch submitted a few days ago. I've added a documentation patch and modified some of the code per comments by Gregory Stark.

Because the main consumer of output generated by this patch will presumably be a machine, I didn't clutter up the documentation with matching XML output for every standard example query. But I added enough to, hopefully, give the user an idea of what to expect.

Regards,
Tom Raney

*** doc/src/sgml/perform.sgml.orig	2008-07-01 20:27:19.0 -0700
--- doc/src/sgml/perform.sgml	2008-07-01 20:34:44.0 -0700
***************
*** 47,59 ****
     operations on the raw rows, then there will be additional nodes
     <quote>atop</quote> the scan nodes to perform these operations.
     Again, there is usually more than one possible way to do these
!    operations, so different node types can appear here too.  The output
     of <command>EXPLAIN</command> has one line for each node in the plan
     tree, showing the basic node type plus the cost estimates that the
     planner made for the execution of that plan node.  The first line
     (topmost node) has the estimated total execution cost for the plan;
     it is this number that the planner seeks to minimize.
    </para>

    <para>
     Here is a trivial example, just to show what the output looks like.
--- 47,62 ----
     operations on the raw rows, then there will be additional nodes
     <quote>atop</quote> the scan nodes to perform these operations.
     Again, there is usually more than one possible way to do these
!    operations, so different node types can appear here too.  The standard
!    output of <command>EXPLAIN</command> has one line for each node in the
!    plan tree, showing the basic node type plus the cost estimates that the
     planner made for the execution of that plan node.  The first line
     (topmost node) has the estimated total execution cost for the plan;
     it is this number that the planner seeks to minimize.
    </para>
+   <para>
+    For examples of XML output, see the bottom of this page.
+   </para>

    <para>
     Here is a trivial example, just to show what the output looks like.
***************
*** 448,453 ****
--- 451,513 ----
     process the table in any case, so there's no value in expending
     additional page reads to look at an index.
    </para>
+
+   <para>
+    Examples of XML output:
+ <programlisting>
+ EXPLAIN XML SELECT * FROM tenk1;
+
+                 QUERY PLAN
+ ----------------------------------------------------
+ <![CDATA[ <?xml version="1.0"?>
+
+ <explain version="x.x">
+  <plan name="Seq Scan" indent="0">
+    <table name="tenk1"/>
+    <cost startup="0.00" total="458.00" rows="1" width="244" />
+  </plan>
+ </explain>
+ (8 rows)]]>
+ </programlisting>
+   </para>
+   <para>
+ <programlisting>
+ EXPLAIN ANALYZE XML SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
+
+                 QUERY PLAN
+ ----------------------------------------------------
+ <![CDATA[ <?xml version="1.0"?>
+
+ <explain version="x.x">
+  <plan name="Nested Loop" indent="0">
+    <cost startup="5.03" total="693.17" rows="100" width="488" />
+    <analyze time_start="0.981" time_end="6.118" rows="100" loops="1" />
+  </plan>
+  <plan name="Bitmap Heap Scan" indent="3">
+    <table name="tenk1" alias="t1"/>
+    <cost startup="5.03" total="221.07" rows="100" width="244" />
+    <analyze time_start="0.826" time_end="2.226" rows="100" loops="1" />
+    <qualifier type="Recheck Cond" value="(unique1 < 100)" />
+  </plan>
+  <plan name="Bitmap Index Scan" indent="6">
+    <index name="tenk1_unique1" />
+    <cost startup="0.00" total="5.00" rows="100" width="0" />
+    <analyze time_start="0.663" time_end="0.663" rows="100" loops="1" />
+    <qualifier type="Index Cond" value="(unique1 < 100)" />
+  </plan>
+  <plan name="Index Scan" indent="3">
+    <index name="tenk2_unique2" />
+    <table name="tenk2" alias="t2"/>
+    <cost startup="0.00" total="4.71" rows="1" width="244" />
+    <analyze time_start="0.020" time_end="0.022" rows="1" loops="100" />
+    <qualifier type="Index Cond" value="(t2.unique2 = t1.unique2)" />
+  </plan>
+  <runtime ms="7.204" />
+ </explain>
+ (28 rows)]]>
+ </programlisting>
+   </para>
+
  </sect1>

  <sect1 id="planner-stats">
*** doc/src/sgml/ref/explain.sgml.orig	2008-07-01 20:29:14.0 -0700
--- doc/src/sgml/ref/explain.sgml	2008-06-29 20:05:37.0 -0700
***************
*** 30,36 ****
  <refsynopsisdiv>
 <synopsis>
! EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
 </synopsis>
  </refsynopsisdiv>
--- 30,36 ----
  <refsynopsisdiv>
 <synopsis>
! EXPLAIN [ ANALYZE ] [ VERBOSE ] [ XML [ DTD ] ] <replaceable class="parameter">statement</replaceable>
 </synopsis>
  </refsynopsisdiv>
***************
*** 112,117 ****
--- 112,135 ----
     </varlistentry>

+    <varlistentry>
+     <term><literal>XML</literal></term>
+     <listitem>
+      <para>
+       Emit XML output instead of the standard output.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><literal>DTD</literal>
[PATCHES] Hash Index Build Patch v2
This revised version of our patch uses the function estimate_rel_size() from plancat.c to estimate the number of tuples in the parent relation. This method is an alternative to scanning the parent relation to estimate the number of tuples, as we did in the first version of the patch.

-Tom

#include <stdio.h>
#include <stdlib.h>

extern int
main(int argc, char **argv)
{
	char	   *p;
	long		i, tups, seed;

	if (argc < 3)
	{
		printf("Too few args. tuples seed\n");
		return 0;
	}
	tups = strtol(argv[1], &p, 0);
	seed = strtol(argv[2], &p, 0);
	srand(seed);
	for (i = 0; i < tups; i++)
		printf("%d\n", rand());
	return 0;
}

*** ./backend/access/hash/hash.c.orig	2007-09-23 19:01:09.0 -0700
--- ./backend/access/hash/hash.c	2007-10-21 12:07:48.455594000 -0700
***************
*** 22,33 ****
  #include "access/hash.h"
  #include "catalog/index.h"
  #include "commands/vacuum.h"

  /* Working state for hashbuild and its callback */
  typedef struct
  {
! 	double		indtuples;
  } HashBuildState;

  static void hashbuildCallback(Relation index,
--- 22,36 ----
  #include "access/hash.h"
  #include "catalog/index.h"
  #include "commands/vacuum.h"
+ #include "optimizer/plancat.h"

  /* Working state for hashbuild and its callback */
  typedef struct
  {
! 	double		indtuples;	/* The current number of index tuples */
! 	Relation	heapRel;	/* The index covers this heap relation */
! 	HSpool	   *spool;		/* Used to sort the index tuples before
! 							 * insertion into the index */
  } HashBuildState;

  static void hashbuildCallback(Relation index,
***************
*** 40,46 ****
--- 43,80 ----
  /*
   *	hashbuild() -- build a new hash index.
+  *
+  * The algorithm:
+  *	(1) Initialize the build state
+  *	(2) Retrieve estimate of tuple count from estimate_rel_size();
+  *	(3) Transform the heap file tuples into index tuples (itups),
+  *	    while inserting them into a spool.  If the spool overflows
+  *	    memory, sort it into runs and spill it to disk
+  *	(4) Finish sorting the spool
+  *	(5) Pre-initialize all the buckets of the final index
+  *	(6) Insert itups from the spool into the index
+  *
+  * Sorting the tuples before inserting them into the index is a classical
+  * bulk-load technique, also used in the BTree code.  The sort is done in
+  * hash value order.  Pre-allocating the buckets minimizes the number of
+  * overflow pages.
+  * The reason for step (2) is that in order to sort, in step (3), one must
+  * know the hash value, which depends on the number of buckets, which in
+  * turn depends on the number of itups = the number of rows in the heap
+  * file.  Steps (3), (4) and (6) parallel similar steps in the BTree code.
+  *
+  * Here is an alternative algorithm:
+  *	(1') Same as (1)
+  *	(2') Scan the heap file, counting the number of rows, forming index
+  *	     tuples and inserting them into a spool (the spool is not
+  *	     presorted).
+  *	(3') Sort the spool
+  *	(4') Same as (5)
+  *	(5') Same as (6)
+  * Although this algorithm would be somewhat faster, we prefer the existing
+  * algorithm because it reuses existing BTree code.
   */
+
  Datum
  hashbuild(PG_FUNCTION_ARGS)
  {
***************
*** 50,55 ****
--- 84,94 ----
  	IndexBuildResult *result;
  	double		reltuples;
  	HashBuildState buildstate;
+ 	double		tuples;
+ 	BlockNumber	pages;
+ 	HashMetaPage metap;
+ 	Buffer		metabuf;
+ 	uint32		num_bkt;	/* Estimated number of buckets in the final index */

  	/*
  	 * We expect to be called exactly once for any index relation. If that's
***************
*** 59,81 ****
  		elog(ERROR, "index \"%s\" already contains data",
  			 RelationGetRelationName(index));

! 	/* initialize the hash index metadata page */
! 	_hash_metapinit(index);
!
! 	/* build the index */
  	buildstate.indtuples = 0;

! 	/* do the heap scan */
! 	reltuples = IndexBuildHeapScan(heap, index, indexInfo,
! 								   hashbuildCallback, (void *) &buildstate);

- 	/*
- 	 * Return statistics
- 	 */
- 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
! 	result->heap_tuples = reltuples;
! 	result->index_tuples = buildstate.indtuples;

  	PG_RETURN_POINTER(result);
  }
--- 98,158 ----
  		elog(ERROR, "index \"%s\" already contains data",
  			 RelationGetRelationName(index));

! 	/* initialize the build state */
  	buildstate.indtuples = 0;
+ 	buildstate.heapRel = heap;
+ 	buildstate.spool = h_spoolinit(index);

! 	/*
! 	 * Retrieve an estimate of the number of rows
! 	 */
! 	tuples = 0;
!
Re: [PATCHES] Hash Index Build Patch
Alvaro Herrera wrote:
> Hi Tom,
>
> Tom Raney wrote:
>> We used spool functions from the BTree code to sort the index tuples.
>> Sorting is done on the hash value of the tuples. The hash value depends
>> on the number of primary bucket pages (henceforth just bucket pages)
>> that will be required to fit all the index tuples. So, before sorting,
>> the base relation is scanned to get the total number of tuples.
>
> Just wondering, wouldn't it be enough to obtain a tuple count estimate
> by using reltuples / relpages * RelationGetNumberOfBlocks, like the
> planner does?

Hello Alvaro,

We thought of that, and the verdict is still out on whether it is more costly to scan the entire relation to get an accurate count, or to use the estimate and hope for the best despite the possibility of splits occurring during the build. If we use the estimate and it is completely wrong (with the actual tuple count being much higher), the sort will provide no benefit and the build will behave as the original code did. But, to be honest, I don't know exactly when the catalog is updated or how accurate the estimate is. If you (or anyone else) have any information there, please let me know. It would be great to eliminate that extra pass!

Sincerely,
Tom Raney

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
[PATCHES] Hash Index Build Patch
Hello All,

We have prepared a patch (against CVS HEAD) for the TODO item:

* Hash - During index creation, pre-sort the tuples to improve build speed
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php

Details of this patch's performance improvements can be found at http://web.cecs.pdx.edu/~raneyt/pg/. For example, in one 12 million tuple example, the 8.2.4 hash index build required over 2.8 hours while this patch's build required 80 seconds. Hash build times are now comparable to BTree build times; for example, for a 60 million tuple example the patched hash index build time is 8.1 minutes and the BTree build time is 4.6 minutes.

We used spool functions from the BTree code to sort the index tuples. Sorting is done on the hash value of the tuples. The hash value depends on the number of primary bucket pages (henceforth just bucket pages) that will be required to fit all the index tuples. So, before sorting, the base relation is scanned to get the total number of tuples. Then, the fill factor (either default or user defined, as applicable) is used to get an estimate of the number of bucket pages.

The 8.2.4 code builds the index by inserting one tuple at a time and splitting buckets as needed. We pre-allocate the estimated number of bucket pages all at once. This avoids bucket page splits and thus the redistribution of index tuples between the bucket page that caused the split and the newly created bucket page.

We changed the values of low_mask, high_mask and max_bucket, used by hashkey2bucket() while inserting index tuples into bucket pages. They are set as follows:

Low_mask = (power of 2 value - 1) less than the estimate.
High_mask = (power of 2 value - 1) greater than the estimate.
Max_bucket = estimated number of buckets - 1 (because bucket numbers start from 0).

(Please note: hashsort.c is a new file and resides in src/backend/access/hash/)

We have also attached the simple data generator we used in the tests.
Our SQL statements were as follows:

DROP TABLE IF EXISTS test;
CREATE TABLE test ( value INTEGER );
COPY test FROM 'data';
\timing
CREATE INDEX hash ON test USING HASH(value);

Regards,
Shreya Bhargava [EMAIL PROTECTED]
Tom Raney [EMAIL PROTECTED]

/*-------------------------------------------------------------------------
 * hashsort.c
 *-------------------------------------------------------------------------*/

/*
 * Functions needed to support initializing and sorting tuples in the spool
 * in hash.c
 */

#include "postgres.h"

#include "access/hash.h"
#include "miscadmin.h"
#include "storage/smgr.h"
#include "utils/tuplesort.h"

struct HSpool
{
	/* State data for tuplesort.c */
	Tuplesortstate *sortstate;
	Relation	index;
};

/* create and initialize the spool structure */
HSpool *
h_spoolinit(Relation index)
{
	HSpool	   *hspool;
	int			hKbytes;

	hspool = (HSpool *) palloc0(sizeof(HSpool));
	hspool->index = index;

	/*
	 * hKbytes is the amount of memory we are going to use to sort the
	 * spool.
	 */
	hKbytes = maintenance_work_mem;
	hspool->sortstate = tuplesort_begin_index(index, false, hKbytes, false);

	/*
	 * Substitute the default compare call-back function for a specific
	 * hash index build compare function.
	 */
	tuplesort_set_hashindex(hspool->sortstate);

	return hspool;
}

void
h_spool(IndexTuple itup, HSpool *hspool)
{
	tuplesort_putindextuple(hspool->sortstate, itup);
}

/*
 * h_bkt_num() estimates the number of buckets in the final hash table.
 */
uint32
h_bkt_num(uint32 tuples, Relation rel)
{
	int32		ffactor;
	int32		data_width;
	int32		item_width;
	uint32		bkt_num;

	/*
	 * Calculate the fill factor.  We could get this from the meta page,
	 * but at the point this method is called, the meta page has not been
	 * created.
	 */
	data_width = get_typavgwidth(RelationGetDescr(rel)->attrs[0]->atttypid,
								 RelationGetDescr(rel)->attrs[0]->atttypmod);
	item_width = MAXALIGN(sizeof(IndexTupleData)) + MAXALIGN(data_width) +
		sizeof(ItemIdData);
	ffactor = RelationGetTargetPageUsage(rel, HASH_DEFAULT_FILLFACTOR) / item_width;
	if (ffactor < 10)
		ffactor = 10;

	bkt_num = tuples / ffactor;
	bkt_num = bkt_num + 1;
	return bkt_num;
}

/*
 * In order to define an order for the index tuples, there must be a mask
 * applied to the 32 bit hash value of the index key during the sort
 * process.  For example, if there are 4,555 buckets approximated, the mask
 * (or modulo) would be 8,191 (hex value 1FFF).
 */
void
h_set_bkt_mask(HSpool *spool, uint32 bkts)
{
	uint32		bkt_pwr2,
				bkt_mask;

	bkt_pwr2 = _hash_log2(bkts);
	bkt_mask = (1 << bkt_pwr2) - 1;