Re: [HACKERS] ScanKey representation for RowCompare index
On Sun, 2006-01-15 at 18:03 -0500, Tom Lane wrote: There's one nontrivial decision still to make about how to implement proper per-spec row-comparison operations, namely: how a row comparison ought to be represented in the index access method API. I'm not sure I understand why. Surely a row wise comparison can take advantage of indexes on some of its columns? When would it be advantageous to create an index on the whole row anyway? Wouldn't the key likely be excessively long? So why would we support this at all? I'm currently leaning to plan B, on the grounds that: 1. It would require no changes in _bt_checkkeys(), which is the only user of the data structure that is particularly performance-critical. With plan A we'd be adding at least a few cycles to _bt_checkkeys in all cases. Plan B avoids that at the cost of an extra level of function call to do a row comparison. Given the relative frequency of the two sorts of index conditions, this has to be the better tradeoff to make. 2. There's quite a bit of logic in btree indexscan setup that would find it convenient to treat a row comparison as if it were a single condition on just its leftmost column. This may end up being nearly a wash, but I think that plan A would make the setup code a shade more complex than plan B would. In particular, the rules about valid orderings of the ScanKey entries would be complicated under plan A, whereas under plan B it's still clear where everything belongs. Any thoughts? Anyone see a good plan C, or a serious flaw that I'm missing in either of these ideas? That looks sound. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)
On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote: A couple of days ago I found myself wanting to aggregate 3 Billion tuples down to 100 Million tuples based on an integer key with six integer values -- six sum()'s. PostgreSQL ran out of memory with its Hash Aggregator and doing an old style Sort Sum took a fair amount of time to complete (cancelled the process after 24 hours -- small machine). Spilling to disk would be nice but I suspect the obvious method would thrash quite badly with non-sorted input. There is already hash table overflow (spill to disk) logic in HashJoins, so this should be possible by reusing that code for HashAggs. That's on my todo list, but I'd welcome any assistance. A question: Are the rows in your 3 B row table clumped together based upon the 100M row key? (or *mostly* so) We might also be able to pre-aggregate the rows using a plan like HashAgg SortedAgg or SortedAgg Sort SortedAgg The first SortedAgg seems superfluous, buy would reduce the row volume considerably if incoming rows were frequently naturally adjacent, even if the values were not actually sorted. (This could also be done during sorting, but its much easier to slot the extra executor step into the plan). That might then reduce the size of the later sort, or allow it to become a HashAgg. I could make that manually enabled using enable_pre_agg to allow us to measure the effectiveness of that technique and decide what cost model we'd use to make it automatic. Would that help? I've written something similar using a client and COPY with temporary tables. Even with the Export/Import copy I still beat the SortSum method PostgreSQL falls back to. You can get round this now by chopping the larger table into pieces with a WHERE clause and then putting them back together with a UNION. If the table is partitioned, then do this by partitions. This should also help when it comes to recalculating the sums again in the future, since you'll only need to rescan the rows that have been added since the last summation. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Coding standards? Recommendations?
On Sun, 2006-01-15 at 11:14 -0500, korry wrote: I've noticed a variety of coding styles in the PostgreSQL source code. In particular, I see a mix of naming conventions. Some variables use camelCase (or CamelCase), others use under_score_style. I just follow the style of the module I'm working in. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] source documentation tool doxygen
I've created a browsable source tree documentation, it's done with the doxygen tool. http://www.mcknight.de/pgsql-doxygen/cvshead/html/ There was a discussion about this some time ago, Jonathan Gardner proposed it here: http://archives.postgresql.org/pgsql-hackers/2004-03/msg00748.php quite a few people found it useful but it somehow got stuck. The reason was apparently that you'd have to tweak your comments in order to profit from it as much as possible. However I still think it's a nice-to-have among the online documentation and it gives people quite new to the code the possibility to easily check what is where defined and how... What do you think? doxygen can also produce a pdf but I haven't succeeded in doing that so far, pdflatex keeps bailing out. Has anybody else succeeded building this yet? Joachim -- Joachim Wieland [EMAIL PROTECTED] C/ Usandizaga 12 1°B ICQ: 37225940 20002 Donostia / San Sebastian (Spain) GPG key available ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] source documentation tool doxygen
I wish I've had this when I started working with PostgreSQL. This looks really good. Very useful indeed, even without the comments. What kind of changes are needed in order to get the comments in? Regards, Thomas Hallgren Joachim Wieland wrote: I've created a browsable source tree documentation, it's done with the doxygen tool. http://www.mcknight.de/pgsql-doxygen/cvshead/html/ There was a discussion about this some time ago, Jonathan Gardner proposed it here: http://archives.postgresql.org/pgsql-hackers/2004-03/msg00748.php quite a few people found it useful but it somehow got stuck. The reason was apparently that you'd have to tweak your comments in order to profit from it as much as possible. However I still think it's a nice-to-have among the online documentation and it gives people quite new to the code the possibility to easily check what is where defined and how... What do you think? doxygen can also produce a pdf but I haven't succeeded in doing that so far, pdflatex keeps bailing out. Has anybody else succeeded building this yet? Joachim ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] source documentation tool doxygen
Try following the link (the Doxygen icon) - it has both a tutorial and extensive doc. Regards, Kim Bisgaard Thomas Hallgren wrote: I wish I've had this when I started working with PostgreSQL. This looks really good. Very useful indeed, even without the comments. What kind of changes are needed in order to get the comments in? Regards, Thomas Hallgren Joachim Wieland wrote: I've created a browsable source tree documentation, it's done with the doxygen tool. http://www.mcknight.de/pgsql-doxygen/cvshead/html/ There was a discussion about this some time ago, Jonathan Gardner proposed it here: http://archives.postgresql.org/pgsql-hackers/2004-03/msg00748.php quite a few people found it useful but it somehow got stuck. The reason was apparently that you'd have to tweak your comments in order to profit from it as much as possible. However I still think it's a nice-to-have among the online documentation and it gives people quite new to the code the possibility to easily check what is where defined and how... What do you think? doxygen can also produce a pdf but I haven't succeeded in doing that so far, pdflatex keeps bailing out. Has anybody else succeeded building this yet? Joachim ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] source documentation tool doxygen
Thomas Hallgren said: I wish I've had this when I started working with PostgreSQL. This looks really good. Very useful indeed, even without the comments. What kind of changes are needed in order to get the comments in? I too have done this. But retrofitting Doxygen style comments to the PostgreSQL source code would be a big undertaking. Maintaining it, which would be another task for reviewers/committers, would also be a pain unless there were some automated checking tool. cheers andrew ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] [PATCH] Better way to check for getaddrinfo function.
Title: [PATCH] Better way to check for getaddrinfo function. Just thought that the following patch might improve checking for the getaddrinfo function (in configure.in). I was forced to write it because getaddrinfo went unnoticed on Tru64 Unix. (displaying attached patch)
$ diff -r configure.in configure.in.1
920c920,944
< AC_REPLACE_FUNCS([getaddrinfo])
---
> AC_CACHE_CHECK([for getaddrinfo], ac_cv_func_getaddrinfo,
> [AC_TRY_LINK([#include <netdb.h>],
> [struct addrinfo *g,h;g=h;getaddrinfo(,,g,g);],
> AC_TRY_RUN([
> #include <assert.h>
> #include <netdb.h>
> #include <sys/types.h>
> #ifndef AF_INET
> # include <sys/socket.h>
> #endif
> #ifdef __cplusplus
> extern "C"
> #endif
> char (*f) ();
> int main(void) { f = getaddrinfo; return 0; }
> ], ac_cv_func_getaddrinfo=yes, ac_cv_func_getaddrinfo=no),
> ac_cv_func_getaddrinfo=no)])
> if test $ac_cv_func_getaddrinfo = yes; then
> AC_DEFINE(HAVE_GETADDRINFO,1,[Define if you have the getaddrinfo function])
> fi
Rajesh R -- This space intentionally left non-blank. configure-in.patch Description: configure-in.patch ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)
On Mon, 2006-01-16 at 08:32 +, Simon Riggs wrote: On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote: A couple of days ago I found myself wanting to aggregate 3 Billion tuples down to 100 Million tuples based on an integer key with six integer values -- six sum()'s. PostgreSQL ran out of memory with its Hash Aggregator and doing an old style Sort Sum took a fair amount of time to complete (cancelled the process after 24 hours -- small machine). Spilling to disk would be nice but I suspect the obvious method would thrash quite badly with non-sorted input. There is already hash table overflow (spill to disk) logic in HashJoins, so this should be possible by reusing that code for HashAggs. That's on my todo list, but I'd welcome any assistance. A question: Are the rows in your 3 B row table clumped together based upon the 100M row key? (or *mostly* so) We might also be able to They are randomly distributed. Fully sorting the data is quite painful. pre-aggregate the rows using a plan like HashAgg SortedAgg or SortedAgg Sort SortedAgg The first SortedAgg seems superfluous, buy would reduce the row volume considerably if incoming rows were frequently naturally adjacent, even if the values were not actually sorted. (This could also be done during sorting, but its much easier to slot the extra executor step into the plan). That might then reduce the size of the later sort, or allow it to become a HashAgg. I could make that manually enabled using enable_pre_agg to allow us to measure the effectiveness of that technique and decide what cost model we'd use to make it automatic. Would that help? I don't understand how this helps. The problem isn't the 3B data source rows but rather the 100M destination keys that are being aggregated against. The memory constraints of HashAgg are a result of the large number of target keys and should be the same if it was 100M rows or 10B rows. I think I need something closer to: HashAgg - HashSort (to disk) HashSort would create a number of files on disk with similar data. Grouping all similar keys into a single temporary file which HashAgg can deal with individually (100 loops by 1M target keys instead of 1 loop by 100M target keys). The results would be the same as partitioning by keyblock and running a HashAgg on each partition, but it would be handled by the Executor rather than by client side code. I've written something similar using a client and COPY with temporary tables. Even with the Export/Import copy I still beat the SortSum method PostgreSQL falls back to. You can get round this now by chopping the larger table into pieces with a WHERE clause and then putting them back together with a UNION. If the table is partitioned, then do this by partitions. True, except this results in several sequential scans over the source data. I can extract and sort in a single pass at client side but it would be far better if I could get PostgreSQL to do the same. I could probably write a plpgsql function to do that logic but it would be quite messy. This should also help when it comes to recalculating the sums again in the future, since you'll only need to rescan the rows that have been added since the last summation. We store the aggregated results and never do this type of calculation on that dataset again. The original dataset comes from about 300 partitions (time and source) and they are removed upon completion. While this calculation is being performed additional partitions are added. 
I suppose I could store source data in 300 * 1000 partitions (Approx 300 batches times 1000 segments) but that would probably run into other problems. PostgreSQL probably has issues with that many tables. -- ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] ScanKey representation for RowCompare index conditions
On Sun, Jan 15, 2006 at 06:03:12PM -0500, Tom Lane wrote: There's one nontrivial decision still to make about how to implement proper per-spec row-comparison operations, namely: how a row comparison ought to be represented in the index access method API. The current representation of index conditions in the AM API is an array of ScanKey structs, one per indexcol op constant index condition, with implicit ANDing across all the conditions. There are various bits of code that require the ScanKeys to appear in order by index column, though this isn't inherent in the data structure itself. ISTM that row-wise comparisons, as far as indexes are concerned, are actually simpler than normal scan-keys. For example, if you have the condition (a,b) >= (5,1) then once the index has found that point, every subsequent entry in the index matches (except possibly NULLs). So you really don't actually need a row-comparison at any point after that. Now, if you have bracketing conditions like (a,b) BETWEEN (5,1) and (7,0) it gets more interesting. Ideally you'd want to pass both of these to the index so it knows when to stop. This really suggests using your plan B since you then have the possibility of passing both, which you don't really have with plan A. The latter also allows you to give more auxiliary conditions like b < 4. So it seems like a good idea. No plan C I can think of... Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them. signature.asc Description: Digital signature
Re: [HACKERS] ScanKey representation for RowCompare index conditions
Simon Riggs [EMAIL PROTECTED] writes: On Sun, 2006-01-15 at 18:03 -0500, Tom Lane wrote: There's one nontrivial decision still to make about how to implement proper per-spec row-comparison operations, namely: how a row comparison ought to be represented in the index access method API. I'm not sure I understand why. Surely a row wise comparison can take advantage of indexes on some of its columns? Not until we write the code to let it do so. Currently, what the system can match to an index on (a,b) is conditions like WHERE a >= x AND b >= y, which is quite a different animal both semantically and representationally from WHERE (a,b) >= (x,y). (At least, assuming that one imputes the correct SQL-spec meaning to the latter, viz a > x OR (a = x AND b >= y).) Because of the semantic difference, we have to preserve the difference when passing the condition to the index AM, thus the need for my question. So why would we support this at all? Example applications here and here: http://archives.postgresql.org/pgsql-performance/2004-07/msg00188.php http://archives.postgresql.org/pgsql-performance/2005-12/msg00590.php (which demonstrate BTW the reasons for not trying to handle this by just expanding out the OR/AND equivalents). regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] Docs off on ILIKE indexing?
http://www.postgresql.org/docs/8.1/static/indexes-types.html says: The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE, ILIKE, ~, and ~*, if the pattern is a constant and is anchored to the beginning of the string - for example, col LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. But really, does it use indexes for ILIKE? (And I assume the same holds for case-insensitive regexp matching) (If it does, can someone enlighten me on what I have to do - I have a system with C locale that refuses to do it for ILIKE, but works just fine for LIKE. My workaround for now is to create an index on lower(foo) and then use WHERE lower(foo) LIKE 'bar%' which works fine - but it does require an extra index..) So. Am I off, or are the docs? Or is it just me who can't read ;-) //Magnus ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] ScanKey representation for RowCompare index conditions
Martijn van Oosterhout kleptog@svana.org writes: ISTM that row-wise comparisons, as far as indexes are concerned, are actually simpler than normal scan-keys. For example, if you have the condition (a,b) >= (5,1) then once the index has found that point, every subsequent entry in the index matches (except possibly NULLs). So you really don't actually need a row-comparison at any point after that. Well, yeah you do, eg if the caller starts demanding backward fetch; you need to be able to tell where to stop. In any case it's certainly necessary to pass down the whole row-condition into the index AM; without that, the closest you could get would be an indexscan on a >= 5 which might have to scan an undesirably large number of rows before reaching the first one with b >= 1. Now, if you have bracketing conditions like (a,b) BETWEEN (5,1) and (7,0) it gets more interesting. Ideally you'd want to pass both of these to the index so it knows when to stop. This really suggests using your plan B since you then have the possibility of passing both, which you don't really have with plan A. The latter also allows you to give more auxiliary conditions like b < 4. No, you misunderstood my plan A --- it's not giving up any of these options. But it's paying for it in complexity. I was envisioning a case like the above as being represented this way:
sk_flags      sk_attno    sk_func    sk_datum
ROW_START     a           >=         5
ROW_END       b           >=         1
ROW_START     a           <=         7
ROW_END       b           <=         0
normal        b           <          4
Hence my comments about weird rules for the ordering of entries under plan A --- the normal and ROW_START entries would be in order by attno, but the row continuation/end entries wouldn't appear to follow this global ordering. They'd follow a local order within each row condition, instead. Since you didn't understand what I was saying, I suspect that plan A is too confusing ... regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
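[Editor's illustration] To make the layout and the intended semantics concrete, here is a small self-contained toy in C. This is not PostgreSQL code; every type, flag and function name is invented for the illustration. It encodes the example above as a flat key array with ROW_START/ROW_END markers and evaluates it with the SQL-spec row-comparison semantics described earlier in the thread (the first unequal column decides, using the strict form of the operator); NULLs are ignored in this toy.

/*
 * Toy only (not PostgreSQL code; all names are made up).
 * A flat array of key entries; a row comparison spans the entries from a
 * ROW_START flag up to the matching ROW_END flag.
 */
#include <stdbool.h>
#include <stdio.h>

enum { ROW_START = 1, ROW_END = 2 };

typedef struct
{
    int  flags;                 /* 0 = ordinary single-column key */
    int  attno;                 /* 1-based column number */
    char op;                    /* '<', '>', 'l' (<=), 'g' (>=), '=' */
    int  value;
} Key;

static bool cmp(int a, char op, int b)
{
    switch (op)
    {
        case '<': return a < b;
        case '>': return a > b;
        case 'l': return a <= b;
        case 'g': return a >= b;
        default:  return a == b;
    }
}

/* Evaluate the row comparison starting at keys[i]; return the index just past it. */
static int check_row(const Key *keys, int i, const int *tup, bool *result)
{
    for (;; i++)
    {
        int  col = tup[keys[i].attno - 1];
        bool last = (keys[i].flags & ROW_END) != 0;

        if (col != keys[i].value)       /* first unequal column decides, strict form */
            *result = cmp(col, keys[i].op == 'l' ? '<' :
                               keys[i].op == 'g' ? '>' : keys[i].op,
                          keys[i].value);
        else if (last)                  /* all columns equal: apply the operator itself */
            *result = cmp(col, keys[i].op, keys[i].value);
        else
            continue;                   /* equal so far, look at the next member */

        while (!(keys[i].flags & ROW_END))
            i++;                        /* skip the rest of this row condition */
        return i + 1;
    }
}

int main(void)
{
    /* (a,b) >= (5,1) AND (a,b) <= (7,0) AND b < 4, as in the table above */
    Key  keys[] = {
        {ROW_START, 1, 'g', 5}, {ROW_END, 2, 'g', 1},
        {ROW_START, 1, 'l', 7}, {ROW_END, 2, 'l', 0},
        {0,         2, '<', 4},
    };
    int  nkeys = 5;
    int  tup[2] = {6, 3};       /* a = 6, b = 3 */
    bool ok = true, r = true;
    int  i = 0;

    while (ok && i < nkeys)
    {
        if (keys[i].flags & ROW_START)
            i = check_row(keys, i, tup, &r);
        else
        {
            r = cmp(tup[keys[i].attno - 1], keys[i].op, keys[i].value);
            i++;
        }
        ok = ok && r;
    }
    printf("tuple %s\n", ok ? "matches" : "does not match");
    return 0;
}

For instance, the tuple (a=6, b=3) passes all three conditions, while (a=5, b=0) fails because (5,0) >= (5,1) is false under the spec semantics.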
Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)
Simon Riggs [EMAIL PROTECTED] writes: On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote: A couple of days ago I found myself wanting to aggregate 3 Billion tuples down to 100 Million tuples based on an integer key with six integer values -- six sum()'s. There is already hash table overflow (spill to disk) logic in HashJoins, so this should be possible by reusing that code for HashAggs. That's on my todo list, but I'd welcome any assistance. Yeah, I proposed something similar awhile back in conjunction with fixing the spill logic for hash joins (which was always there, but was not adaptive until recently). I got the join part done but got distracted before fixing HashAgg :-( In principle, you just reduce the range of currently-in-memory hash codes whenever you run low on memory. The so-far-accumulated working state for aggregates that are not in the range anymore goes into a temp file, and subsequently any incoming tuples that hash outside the range go into another temp file. After you've completed the scan, you finalize and emit the aggregates that are still in memory, then you pick up the first set of dropped aggregates, rescan the associated TODO file of unprocessed tuples, lather rinse repeat till done. The tricky part is to preserve the existing guarantee that tuples are merged into their aggregate in arrival order. (This does not matter for the standard aggregates but it definitely does for custom aggregates, and there will be unhappy villagers appearing on our doorsteps if we break it.) I think this can work correctly under the above sketch but it needs to be verified. It might require different handling of the TODO files than what hashjoin does. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
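[Editor's illustration] For anyone wanting to play with the idea, here is a deliberately simplified, self-contained sketch in C of the batching scheme: a fixed number of batches chosen up front rather than the adaptive range-shrinking described above, toy data structures throughout, and none of it is executor code.

/*
 * Toy spill-to-disk aggregation: sum(val) grouped by key, holding only one
 * batch of groups in memory at a time.  Rows whose key hashes to a later
 * batch are appended to that batch's temp file and processed in a later pass.
 */
#include <stdio.h>
#include <stdlib.h>

#define NBATCH 4                 /* number of batches / extra passes */
#define NKEYS  100000            /* toy grouping-key domain */
#define NROWS  1000000

typedef struct { int key; int val; } Row;

int main(void)
{
    static long sum[NKEYS], cnt[NKEYS];   /* per-group state for ONE batch */
    FILE *todo[NBATCH] = {NULL};
    long  ngroups;
    int   b, i;

    for (b = 1; b < NBATCH; b++)
        if ((todo[b] = tmpfile()) == NULL)
            return 1;

    srand(42);

    /* pass 0: read the input once; aggregate batch 0 in memory, spill the rest */
    for (i = 0; i < NROWS; i++)
    {
        Row r = { rand() % NKEYS, rand() % 100 };
        int rb = (unsigned) r.key % NBATCH;

        if (rb == 0)
        {
            sum[r.key] += r.val;
            cnt[r.key]++;
        }
        else
            fwrite(&r, sizeof(r), 1, todo[rb]);   /* written in arrival order */
    }

    /* finalize batch 0, then re-read and aggregate each spilled batch in turn */
    for (b = 0; b < NBATCH; b++)
    {
        if (b > 0)
        {
            Row r;

            for (i = 0; i < NKEYS; i++)
                sum[i] = cnt[i] = 0;
            rewind(todo[b]);
            while (fread(&r, sizeof(r), 1, todo[b]) == 1)
            {
                sum[r.key] += r.val;
                cnt[r.key]++;
            }
        }
        ngroups = 0;
        for (i = 0; i < NKEYS; i++)
            if (cnt[i] > 0)
                ngroups++;       /* here one would emit (key, sum[key]) */
        printf("batch %d: %ld groups\n", b, ngroups);
    }
    return 0;
}

The point to notice is that each spill file is written and re-read strictly sequentially, which is what preserves arrival order within every group, the guarantee discussed above.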
Re: [HACKERS] [GENERAL] [PATCH] Better way to check for getaddrinfo function.
R, Rajesh (STSD) [EMAIL PROTECTED] writes: Just thought that the following patch might improve checking for getaddrinfo function (in configure.in) Since AC_TRY_RUN tests cannot work in cross-compilation scenarios, you need an *extremely* good reason to put one in. "I thought this might improve things" doesn't qualify. Exactly what problem are you trying to solve and why is a run-time test necessary? Why doesn't the existing coding work for you? regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Docs off on ILIKE indexing?
Magnus Hagander [EMAIL PROTECTED] writes: http://www.postgresql.org/docs/8.1/static/indexes-types.html says: The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE, ILIKE, ~, and ~*, if the pattern is a constant and is anchored to the beginning of the string - for example, col LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. But really, does it use indexes for ILIKE? That's pretty poorly phrased. For ILIKE it'll only work if there's a prefix of the pattern that's not letters (and hence is unaffected by the case-folding issue). regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Warm-up cache may have its virtue
On Sat, Jan 14, 2006 at 04:13:56PM -0500, Qingqing Zhou wrote: Qingqing Zhou [EMAIL PROTECTED] wrote I wonder if we should really implement file-system-cache-warmup strategy which we have discussed before. There are two natural good places to do this: (1) sequentail scan (2) bitmap index scan For the sake of memory, there is a third place a warm-up cache or pre-read is beneficial (OS won't help us): (3) xlog recovery Wouldn't it be better to improve pre-reading data instead, ie, making sure things like seqscan and bitmap scan always keep the IO system busy? -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Fri, Jan 13, 2006 at 11:37:38PM -0500, Tom Lane wrote: Josh Berkus josh@agliodbs.com writes: It's also worth mentioning that for datatypes that only have an = operator the performance of compute_minimal_stats is O(N^2) when values are unique, so increasing sample size is a very bad idea in that case. Hmmm ... does ANALYZE check for UNIQUE constraints? Our only implementation of UNIQUE constraints is btree indexes, which require more than an = operator, so this seems irrelevant. IIRC, the point was that if we know a field has to be unique, there's no sense in doing that part of the analysis on it; you'd only care about correlation. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] source documentation tool doxygen
On Mon, 2006-01-16 at 07:57 -0600, Andrew Dunstan wrote: I too have done this. But retrofitting Doxygen style comments to the PostgreSQL source code would be a big undertaking. Maintaining it, which would be another task for reviewers/committers, would also be a pain unless there were some automated checking tool. I don't think it would be all that painful. There would be no need to convert the entire source tree to use proper Doxygen-style comments in one fell swoop: individual files and modules can be converted whenever anyone gets the inclination to do so. I don't think the maintenance burden would be very substantial, either. -Neil ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] source documentation tool doxygen
Neil Conway [EMAIL PROTECTED] writes: I don't think it would be all that painful. There would be no need to convert the entire source tree to use proper Doxygen-style comments in one fell swoop: individual files and modules can be converted whenever anyone gets the inclination to do so. I don't think the maintenance burden would be very substantial, either. In the previous go-round on this topic, I seem to recall some concern about side-effects, to wit reducing the readability of the comments for ordinary non-doxygen code browsing. I'd be quite against taking any noticeable hit in that direction. A quick look through the doxygen manual doesn't make it sound too invasive, but I am worried about how well it will coexist with pgindent. It seems both tools think they can dictate the meaning of the characters immediately after /* of a comment block. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
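[Editor's illustration] For what it's worth, the per-comment delta is usually tiny. A sketch of the difference follows; the function and parameter names are invented, @param/@return are standard Doxygen tags, and the exact brief-description behaviour depends on the Doxyfile settings.

/*
 * Existing style: handled by pgindent, invisible to Doxygen.
 *
 * _bt_example -- check whether an index tuple satisfies the scan keys
 */

/**
 * Doxygen style: the second '*' turns the block into a documentation
 * comment; everything else can stay as it is.
 *
 * @param scan   the index scan descriptor (illustrative parameter names)
 * @param tuple  the index tuple to test
 * @return true if the tuple satisfies all scan keys
 */

Whether pgindent and Doxygen agree about what may follow the opening /* is exactly the coexistence question raised above.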
Re: [HACKERS] Surrogate keys (Was: enums)
On Sat, Jan 14, 2006 at 07:28:21PM +0900, Michael Glaesemann wrote: On Jan 13, 2006, at 21:42 , Leandro Guimar?es Faria Corcete DUTRA wrote: If you still declare the natural key(s) as UNIQUEs, you have just made performance worse. Now there are two keys to be checked on UPDATEs and INSERTs, two indexes to be updated, and probably a SEQUENCE too. For UPDATEs and INSERTs, the proper primary key also needs to be checked, but keys are used for more than just checking uniqueness: they're also often used in JOINs. Joining against a single integer I'd think it quite a different proposition (I'd think faster in terms of performance) than joining against, say, a text column or a composite key. a) the optimizer does a really poor job on multi-column index statistics b) If each parent record will have many children, the space savings from using a surrogate key can be quite large c) depending on how you view things, putting actual keys all over the place is denormalized Generally, I just use surrogate keys for everything unless performance dictates something else. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)
On Mon, 2006-01-16 at 09:42 -0500, Rod Taylor wrote: On Mon, 2006-01-16 at 08:32 +, Simon Riggs wrote: On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote: A question: Are the rows in your 3 B row table clumped together based upon the 100M row key? (or *mostly* so) We might also be able to They are randomly distributed. Fully sorting the data is quite painful. ... I don't understand how this helps. It wouldn't since your rows are randomly distributed. The idea was not related to improving HashAgg, but to improving Aggregation for the case of naturally grouped data. I think I need something closer to: HashAgg - HashSort (to disk) HashSort would create a number of files on disk with similar data. Grouping all similar keys into a single temporary file which HashAgg can deal with individually (100 loops by 1M target keys instead of 1 loop by 100M target keys). The results would be the same as partitioning by keyblock and running a HashAgg on each partition, but it would be handled by the Executor rather than by client side code. I've written something similar using a client and COPY with temporary tables. Even with the Export/Import copy I still beat the SortSum method PostgreSQL falls back to. That is exactly how the spill to disk logic works for HashJoin (and incidentally, identical to an Oracle one-pass hash join since both are based upon the hybrid hash join algorithm). Multi-pass would only be required to handle very skewed hash distributions, which HJ doesn't do yet. So yes, this can be done. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)
On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote: A couple of days ago I found myself wanting to aggregate 3 Billion tuples down to 100 Million tuples based on an integer key with six integer values -- six sum()'s. There is already hash table overflow (spill to disk) logic in HashJoins, so this should be possible by reusing that code for HashAggs. That's on my todo list, but I'd welcome any assistance. Yeah, I proposed something similar awhile back in conjunction with fixing the spill logic for hash joins (which was always there, but was not adaptive until recently). I got the join part done but got distracted before fixing HashAgg :-( You've done the main work. :-) The tricky part is to preserve the existing guarantee that tuples are merged into their aggregate in arrival order. (This does not matter for the standard aggregates but it definitely does for custom aggregates, and there will be unhappy villagers appearing on our doorsteps if we break it.) I think this can work correctly under the above sketch but it needs to be verified. It might require different handling of the TODO files than what hashjoin does. For HJ we write each outer tuple to its own file-per-batch in the order they arrive. Reading them back in preserves the original ordering. So yes, caution required, but I see no difficulty, just reworking the HJ code (nodeHashjoin and nodeHash). What else do you see? Best Regards, Simon Riggs ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] PostgreSQL win32 NT4
[EMAIL PROTECTED] wrote: NT4 is officially dead, IMHO no need for PostgreSQL to officially support it, let's leave place for companies offering commercial postgresql versions to work on it if they have enough customer requests. BTW Win 2000 is more or less 6 years old now ... Agreed. If they are using NT4 they certainly should be happy using PostgreSQL 8.1.X. So, we will have an NT4 solution, it just won't be our most recent major release. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] PostgreSQL win32 NT4
Bruce Momjian wrote: [EMAIL PROTECTED] wrote: NT4 is officially dead, IMHO no need for PostgreSQL to officially support it, let's leave place for companies offering commercial postgresql versions to work on it if they have enough customer requests. BTW Win 2000 is more or less 6 years old now ... I believe Microsoft has an official 10 year support policy. Although they don't backport all features (which makes sense). Agreed. If they are using NT4 they certainly should be happy using PostgreSQL 8.1.X. So, we will have an NT4 solution, it just won't be our most recent major release. -- The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564 PostgreSQL Replication, Consulting, Custom Development, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: PLphp, PLperl - http://www.commandprompt.com/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] PostgreSQL win32 NT4
NT4 is officially dead, IMHO no need for PostgreSQL to officially support it, let's leave place for companies offering commercial postgresql versions to work on it if they have enough customer requests. BTW Win 2000 is more or less 6 years old now ... I believe Microsoft has an official 10 year support policy. Although the don't backport all features (which makes sense). They're 10 years for extended support. Only 5 years for mainstream support, meaning it has already passed. The difference? You can no longer get a hotfix unless you have a specific hotfix agreement, that you have to buy separate. You don't get any no-charge incident support (that would be support-because-of-a-bug). You don't get to make any feature requests (though seriously, they hardly ever listen to you unless you're really big anyway). And you don't get any warranty claims, whatever that means in reality. You do, however, get paid support and security hotfixes. //Magnus ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)
Simon Riggs [EMAIL PROTECTED] writes: For HJ we write each outer tuple to its own file-per-batch in the order they arrive. Reading them back in preserves the original ordering. So yes, caution required, but I see no difficulty, just reworking the HJ code (nodeHashjoin and nodeHash). What else do you see? With dynamic adjustment of the hash partitioning, some tuples will go through multiple temp files before they ultimately get eaten, and different tuples destined for the same aggregate may take different paths through the temp files depending on when they arrive. It's not immediately obvious that ordering is preserved when that happens. I think it can be made to work but it may take different management of the temp files than hashjoin uses. (Worst case, we could use just a single temp file for all unprocessed tuples, but this would result in extra I/O.) regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Mon, 2006-01-16 at 12:26 -0600, Jim C. Nasby wrote: On Fri, Jan 13, 2006 at 11:37:38PM -0500, Tom Lane wrote: Josh Berkus josh@agliodbs.com writes: It's also worth mentioning that for datatypes that only have an = operator the performance of compute_minimal_stats is O(N^2) when values are unique, so increasing sample size is a very bad idea in that case. Hmmm ... does ANALYZE check for UNIQUE constraints? Our only implementation of UNIQUE constraints is btree indexes, which require more than an = operator, so this seems irrelevant. IIRC, the point was that if we know a field has to be unique, there's no sense in doing that part of the analysis on it; you'd only care about correlation. An interesting point: if we know the cardinality by definition there is no need to ANALYZE for that column. An argument in favour of the definitional rather than the discovery approach. As a designer I very often already know the cardinality by definition, so I would appreciate the opportunity to specify that to the database. Tom has not spoken against checking for UNIQUE constraints: he is just pointing out that there never could be a constraint in the case I was identifying. For other datatypes we could check for them rather than analyzing the sample, but the effect would be the same either way. My original point was that behaviour is O(N^2) when unique; it is still O(N^2) when nearly unique, albeit O(N^2) - O(N). Reducing stats_target does not help there - the effect varies according to number of rows in the sample, not num slots in MCV list. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Fri, 13 Jan 2006 19:18:29 +, Simon Riggs [EMAIL PROTECTED] wrote: I enclose a patch for checking out block sampling. Can't comment on the merits of block sampling and your implementation thereof. Just some nitpicking:
|! * Row Sampling: As of May 2004, we use the Vitter algorithm to create
Linking the use of the Vitter algorithm to May 2004 is not quite appropriate. We introduced two stage sampling at that time.
| * a random sample of targrows rows (or less, if there are less in the
|! * sample of blocks). In this case, targblocks is always the same as
|! * targrows, so we always read one row per block.
This is just wrong, unless you add "on average". Even then it is a bit misleading, because in most cases we *read* more tuples than we use.
| * Although every row has an equal chance of ending up in the final
| * sample, this sampling method is not perfect: not every possible
| * sample has an equal chance of being selected. For large relations
| * the number of different blocks represented by the sample tends to be
|! * too small. In that case, block sampling should be used.
Is the last sentence a fact or personal opinion?
| * block. The previous sampling method put too much credence in the row
| * density near the start of the table.
FYI, "previous" refers to the date mentioned above: previous == before May 2004 == before two stage sampling.
|+ /* Assume that we will have rows at least 64 bytes wide
|+  * Currently we're very unlikely to overflow availMem here
|+  */
|+ if ((allocrows * sizeof(HeapTuple)) > (allowedMem >> 4))
This is a funny way of saying if (allocrows * (sizeof(Heaptuple) + 60) > allowedMem). It doesn't match the comment above; and it is something different if the size of a pointer is not four bytes.
|- if (bs.m > 0)
|+ if (bs.m > 0 )
Oops.
|+ ereport(DEBUG2,
|+ (errmsg("ANALYZE attr %u sample: n=%u nmultiple=%u f1=%u d=%u",
|+ stats->tupattnum,samplerows, nmultiple, f1, d)));
^ missing space here and in some more places
I haven't been following the discussion too closely but didn't you say that a block sampling algorithm somehow compensates for the fact that the sample is not random? Servus Manfred ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
Simon Riggs [EMAIL PROTECTED] writes: Tom has not spoken against checking for UNIQUE constraints: he is just pointing out that there never could be a constraint in the case I was identifying. More generally, arguing for or against any system-wide change on the basis of performance of compute_minimal_stats() is folly, because that function is not used for any common datatypes. (Ideally it'd never be used at all.) regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] Anyone see a need for BTItem/HashItem?
I'm considering getting rid of the BTItem/BTItemData and HashItem/HashItemData struct definitions and just referencing IndexTuple(Data) directly in the btree and hash AMs. It appears that at one time in the forgotten past, there was some access-method-specific data in index entries in addition to the common IndexTuple struct, but that's been gone for a long time and I can't see a reason why either of these AMs would resurrect it. So this just seems like extra notational cruft to me, as well as an extra layer of palloc overhead (see eg _bt_formitem()). GIST already got rid of this concept, or never had it. Does anyone see a reason to keep this layer of struct definitions? regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
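[Editor's note] For readers without the tree at hand, the wrapper in question is essentially an empty shell; roughly the following, paraphrased rather than quoted, so see src/include/access/nbtree.h and hash.h for the authoritative definitions:

/* roughly what access/nbtree.h declares at this point */
typedef struct BTItemData
{
    IndexTupleData bti_itup;    /* no btree-specific fields remain */
} BTItemData;

typedef BTItemData *BTItem;

/* access/hash.h carries the analogous HashItemData / HashItem pair */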
Re: [HACKERS] Anyone see a need for BTItem/HashItem?
On Mon, Jan 16, 2006 at 03:52:01PM -0500, Tom Lane wrote: I'm considering getting rid of the BTItem/BTItemData and HashItem/HashItemData struct definitions and just referencing IndexTuple(Data) directly in the btree and hash AMs. It appears that at one time in the forgotten past, there was some access-method-specific data in index entries in addition to the common IndexTuple struct, but that's been gone for a long time and I can't see a reason why either of these AMs would resurrect it. So this just seems like extra notational cruft to me, as well as an extra layer of palloc overhead (see eg _bt_formitem()). GIST already got rid of this concept, or never had it. Does anyone see a reason to keep this layer of struct definitions? If you cut it out, what will the heap and index access methods needed for SQL/MED use? Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778 Remember to vote! ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Anyone see a need for BTItem/HashItem?
David Fetter [EMAIL PROTECTED] writes: On Mon, Jan 16, 2006 at 03:52:01PM -0500, Tom Lane wrote: Does anyone see a reason to keep this layer of struct definitions? If you cut it out, what will the heap and index access methods needed for SQL/MED use? What's that have to do with this? regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Anyone see a need for BTItem/HashItem?
From what I've seen, I don't think we need to keep them around. On 1/16/06, Tom Lane [EMAIL PROTECTED] wrote: I'm considering getting rid of the BTItem/BTItemData and HashItem/HashItemData struct definitions and just referencing IndexTuple(Data) directly in the btree and hash AMs. It appears that at one time in the forgotten past, there was some access-method-specific data in index entries in addition to the common IndexTuple struct, but that's been gone for a long time and I can't see a reason why either of these AMs would resurrect it. So this just seems like extra notational cruft to me, as well as an extra layer of palloc overhead (see eg _bt_formitem()). GIST already got rid of this concept, or never had it. Does anyone see a reason to keep this layer of struct definitions? regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Anyone see a need for BTItem/HashItem?
On Mon, Jan 16, 2006 at 04:02:07PM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: On Mon, Jan 16, 2006 at 03:52:01PM -0500, Tom Lane wrote: Does anyone see a reason to keep this layer of struct definitions? If you cut it out, what will the heap and index access methods needed for SQL/MED use? What's that have to do with this? I'm sure you'll correct me if I'm mistaken, but this is a candidate for the spot where such interfaces--think of Informix's Virtual (Table|Index) Interface--would go. Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778 Remember to vote! ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] equivalence class not working?
not sure if this is the right place to post... I am using postgres 8.1. In indxpath.c, it says Note: if Postgres tried to optimize queries by forming equivalence classes over equi-joined attributes (i.e., if it recognized that a qualification such as where a.b=c.d and a.b=5 could make use of an index on c.d), then we could use that equivalence class info here with joininfo_list to do more complete tests for the usability of a partial index. XXX as of 7.1, equivalence class info *is* available. Now I have the following two queries on TPC-H, where there is an index built on o_totalprice. explain select * from lineitem, orders where o_totalprice=l_extendedprice and l_extendedprice > 2000; explain select * from lineitem, orders where o_totalprice=l_extendedprice and l_extendedprice > 2000 and o_totalprice > 2000; The second query uses the index while the first does not. It seems to me that both queries are the same (the o_totalprice > 2000 in the second query can be inferred). Is there something that needs to be tuned or ...? thanks a lot!
Re: [HACKERS] Anyone see a need for BTItem/HashItem?
David Fetter [EMAIL PROTECTED] writes: On Mon, Jan 16, 2006 at 04:02:07PM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: If you cut it out, what will the heap and index access methods needed for SQL/MED use? What's that have to do with this? I'm sure you'll correct me if I'm mistaken, but this is a candidate for the spot where such interfaces--think of Informix's Virtual (Table|Index) Interface--would go. Can't imagine putting anything related to external-database access inside either the btree or hash AMs; it'd only make sense to handle it at higher levels. It's barely conceivable that external access would make sense as a specialized AM in its own right, but I don't see managing external links exclusively within the indexes. IOW, if we did need extra stuff in IndexTuples for external access, we'd want to put it inside IndexTuple, not in a place where it could only be seen by these AMs. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] equivalence class not working?
uwcssa [EMAIL PROTECTED] writes: I am using postgres 8.1. In indxpath.c, it says Note: if Postgres tried to optimize queries by forming equivalence classes over equi-joined attributes (i.e., if it recognized that a qualification such as where a.b=c.d and a.b=5 could make use of an index on c.d), then we could use that equivalence class info here with joininfo_list to do more complete tests for the usability of a partial index. XXX as of 7.1, equivalence class info *is* available. Are you deliberately ignoring the rest of the comment? regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] ScanKey representation for RowCompare index conditions
On Mon, Jan 16, 2006 at 12:07:44PM -0500, Tom Lane wrote: Since you didn't understand what I was saying, I suspect that plan A is too confusing ... Umm, yeah. Now you've explained it I think it should be excluded on the basis that it'll be a source of bugs. For all the places that matter a row-condition needs to be treated as a whole and storing parts in a top level ScanKey array just seems like asking for trouble. Given no plan C, I guess plan B is the way to go...? -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them. signature.asc Description: Digital signature
Re: [HACKERS] Anyone see a need for BTItem/HashItem?
On Mon, Jan 16, 2006 at 04:21:50PM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: On Mon, Jan 16, 2006 at 04:02:07PM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: If you cut it out, what will the heap and index access methods needed for SQL/MED use? What's that have to do with this? I'm sure you'll correct me if I'm mistaken, but this is a candidate for the spot where such interfaces--think of Informix's Virtual (Table|Index) Interface--would go. Can't imagine putting anything related to external-database access inside either the btree or hash AMs; it'd only make sense to handle it at higher levels. It's barely conceivable that external access would make sense as a specialized AM in its own right, but I don't see managing external links exclusively within the indexes. IOW, if we did need extra stuff in IndexTuples for external access, we'd want to put it inside IndexTuple, not in a place where it could only be seen by these AMs. Thanks for the explanation :) Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778 Remember to vote! ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] source documentation tool doxygen
On Jan 17, 2006, at 3:51 , Tom Lane wrote: A quick look through the doxygen manual doesn't make it sound too invasive, but I am worried about how well it will coexist with pgindent. It seems both tools think they can dictate the meaning of the characters immediately after /* of a comment block. I haven't looked at it yet, but might there not be a way to have a preprocessing step where the current comment format is converted to something doxygen-friendly? pg_indent2doxygen or something? Then the current comment style could be maintained and doxygen could get what it likes. Michael Glaesemann grzm myrealbox com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] ScanKey representation for RowCompare index conditions
Martijn van Oosterhout kleptog@svana.org writes: On Mon, Jan 16, 2006 at 12:07:44PM -0500, Tom Lane wrote: Since you didn't understand what I was saying, I suspect that plan A is too confusing ... Umm, yeah. Now you've explained it I think it should be excluded on the basis that it'll be a source of bugs. For all the places that matter a row-condition needs to be treated as a whole and storing parts in a top level ScanKey array just seems like asking for trouble. Actually, I'd just been coming back to plan A, after realizing that it's not possible to do this with zero change in _bt_checkkeys() after all. The amount of information passed down to an ordinary sk_func isn't sufficient to do what the row-compare subroutine would need to do: it needs access to the whole indextuple, not only the first column's value, plus access to the direction and continuescan variables. We could add extra parameters to what gets passed, but that's certainly no cheaper than adding a test on some sk_flags bits. And having the row comparison logic inside bt_checkkeys is probably more understandable than having a separate subroutine with strange calling conventions. I think given adequate documentation in skey.h, either representation is about the same as far as the source-of-bugs argument goes. The main risk you get from a Plan-B approach is that it's easier to fail to notice bugs of omission, where a piece of code looking at a ScanKey needs to do something special with a row-compare key and omits to do so. At least that's what I'm guessing at the moment --- I haven't actually tried to code up any of the changes yet. In any case, since we are only intending to support this for searches of btree indexes, there are not that many pieces of code that will be concerned with it. I think it's just _bt_first, _bt_preprocess_keys, _bt_checkkeys, and ExecIndexBuildScanKeys. Everyplace else will be dealing only in simple ScanKeys with no row comparisons. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] source documentation tool doxygen
On Tue, Jan 17, 2006 at 06:51:15AM +0900, Michael Glaesemann wrote: I haven't looked at it yet, but might there not be a way to have a preprocessing step where the current comment format is converted to something doxygen-friendly? pg_indent2doxygen or something? Then the current comment style could be maintained and doxygen could get what it likes. It's not about getting in all the comments, it's mostly about comments describing functions / macros / typedefs / structures / global variables and the like. Comments in function bodies will just stay, I guess. However we could try to find some de facto rules about comments to get most of the matches: for example, a comment starting on column 0 of the line is probably one you're looking for, whereas one that has some whitespace at the beginning is not. Of course this would still need manual review for every file. Joachim -- Joachim Wieland [EMAIL PROTECTED] C/ Usandizaga 12 1°B ICQ: 37225940 20002 Donostia / San Sebastian (Spain) GPG key available ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] source documentation tool doxygen
Wow, looks great. Is that URL stable? Can we link to it from the PostgreSQL developers page? http://www.postgresql.org/developer/coding --- Joachim Wieland wrote: I've created a browsable source tree documentation, it's done with the doxygen tool. http://www.mcknight.de/pgsql-doxygen/cvshead/html/ There was a discussion about this some time ago, Jonathan Gardner proposed it here: http://archives.postgresql.org/pgsql-hackers/2004-03/msg00748.php quite a few people found it useful but it somehow got stuck. The reason was apparently that you'd have to tweak your comments in order to profit from it as much as possible. However I still think it's a nice-to-have among the online documentation and it gives people quite new to the code the possibility to easily check what is where defined and how... What do you think? doxygen can also produce a pdf but I haven't succeeded in doing that so far, pdflatex keeps bailing out. Has anybody else succeeded building this yet? Joachim -- Joachim Wieland [EMAIL PROTECTED] C/ Usandizaga 12 1?B ICQ: 37225940 20002 Donostia / San Sebastian (Spain) GPG key available ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [pgsql-www] [HACKERS] source documentation tool doxygen
the only question I have ... is there any way of improving that right index? Love the 'detail pages', but somehow making the right index more 'tree like' might make navigation a bit easier ... On Mon, 16 Jan 2006, Bruce Momjian wrote: Wow, looks great. Is that URL stable? Can we link to it from the PostgreSQL developers page? http://www.postgresql.org/developer/coding --- Joachim Wieland wrote: I've created a browsable source tree documentation, it's done with the doxygen tool. http://www.mcknight.de/pgsql-doxygen/cvshead/html/ There was a discussion about this some time ago, Jonathan Gardner proposed it here: http://archives.postgresql.org/pgsql-hackers/2004-03/msg00748.php quite a few people found it useful but it somehow got stuck. The reason was apparently that you'd have to tweak your comments in order to profit from it as much as possible. However I still think it's a nice-to-have among the online documentation and it gives people quite new to the code the possibility to easily check what is where defined and how... What do you think? doxygen can also produce a pdf but I haven't succeeded in doing that so far, pdflatex keeps bailing out. Has anybody else succeeded building this yet? Joachim -- Joachim Wieland [EMAIL PROTECTED] C/ Usandizaga 12 1?B ICQ: 37225940 20002 Donostia / San Sebastian (Spain) GPG key available ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] equivalence class not working?
Fine. The rest of the comment says: For now, the test only uses restriction clauses (those in restrictinfo_list). --Nels, Dec '92. However, I understand it as being overridden by the follow-up, which is: XXX as of 7.1, equivalence class info *is* available. Consider improving this code as foreseen by Nels. Therefore, equivalence class info should be detected and used for index selection... or could anyone tell me whether after 7.1 PostgreSQL has decided not to use equi-joins for index selection? On 1/16/06, Tom Lane [EMAIL PROTECTED] wrote: uwcssa [EMAIL PROTECTED] writes: I am using postgres 8.1. In indxpath.c, it says: Note: if Postgres tried to optimize queries by forming equivalence classes over equi-joined attributes (i.e., if it recognized that a qualification such as where a.b=c.d and a.b=5 could make use of an index on c.d), then we could use that equivalence class info here with joininfo_list to do more complete tests for the usability of a partial index. ... XXX as of 7.1, equivalence class info *is* available. Are you deliberately ignoring the rest of the comment? regards, tom lane
Re: [pgsql-www] [HACKERS] source documentation tool doxygen
On Jan 17, 2006, at 8:40 , Marc G. Fournier wrote: the only question I have ... is there any way of improving that right index? Love the 'detail pages', but somehow making the right index more 'tree like' might make navigation a bit easier ... Along those lines, I wonder if a CSS couldn't be worked up to integrate the look with the rest of the site. Michael Glaesemann grzm myrealbox com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Improving N-Distinct estimation by ANALYZE
On Mon, 2006-01-16 at 21:24 +0100, Manfred Koizar wrote: On Fri, 13 Jan 2006 19:18:29 +, Simon Riggs [EMAIL PROTECTED] wrote: I enclose a patch for checking out block sampling. Can't comment on the merits of block sampling and your implementation thereof. But your thoughts are welcome... |! * Row Sampling: As of May 2004, we use the Vitter algorithm to create Linking the use of the Vitter algorithm to May 2004 is not quite appropriate. We introduced two stage sampling at that time. I'll redo some of the comments as you suggest and review coding points. | * a random sample of targrows rows (or less, if there are less in the |! * sample of blocks). In this case, targblocks is always the same as |! * targrows, so we always read one row per block. This is just wrong, unless you add on average. Even then it is a bit misleading, because in most cases we *read* more tuples than we use. The current code reads a number of blocks == targrows, hence one per block, but I'll change the comment. | * Although every row has an equal chance of ending up in the final | * sample, this sampling method is not perfect: not every possible | * sample has an equal chance of being selected. For large relations | * the number of different blocks represented by the sample tends to be |! * too small. In that case, block sampling should be used. Is the last sentence a fact or personal opinion? That is my contention, based upon research. The existing code clearly identifies that case as requiring a solution. We use Chaudhuri et al's 1998 result for the sufficiency of sample size, yet overlook their estimators and their ideas for random block selection. My first point is that we need a proportional sample size, not a fixed size. [I show that for large tables we currently use a sample that is 2-5 times too small, even based on Chaudhuri's work.] ANALYZE is very I/O bound: random row sampling would need to scan the whole table to take a 2-3% sample in most cases. Block sampling is a realistic way of getting a larger sample size without loss of performance, and also a way of getting a reasonable sample when accessing very few blocks. We would keep random row sampling for smaller relations where the performance effects of sample collection are not noticeable. I haven't been following the discussion too closely but didn't you say that a block sampling algorithm somehow compensates for the fact that the sample is not random? I haven't said that block sampling compensates for the sample being non-random. There is a danger of bias in the sample and I made that clear. I've tried to mitigate that by the use of random blocks, reusing your code and taking note of the earlier problems you solved. However, block sampling allows us to spot cases where rows are naturally grouped together, which row sampling doesn't spot until very high sample sizes. I explored in detail a common case where this causes the current method to fall down very badly. Brutlag & Richardson [2000] give a good rundown on block sampling and I'm looking to implement some of their block estimators to complement the current patch. (Credit to Andrew Dunstan for locating the paper...) http://www.stat.washington.edu/www/research/reports/1999/tr355.ps [I did briefly explore the possibility of hybrid row/block sampling, using row sampling as well as some block sampling to identify grouping, with an attempt to avoid swamping the sample with biased rows; but I don't have a strong basis for that, so I've ignored that for now.]
This patch allows us to explore the possible bias that block sampling gives, allowing us to make a later decision to include it, or not. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
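To make the block-selection idea above concrete, here is a toy C sketch. It is not the ANALYZE code or the patch; the relation size, sample size, and the use of rand() are invented for illustration. The point is simply that the sampler picks a set of distinct random block numbers up front and then reads every row in just those blocks, instead of probing one random row per visited block.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Toy sketch only: choose "targblocks" distinct random block numbers out
     * of "nblocks" with a partial Fisher-Yates shuffle, then pretend to read
     * every row in each chosen block. */
    int
    main(void)
    {
        int     nblocks = 10000;    /* hypothetical relation size in blocks */
        int     targblocks = 300;   /* hypothetical sample size in blocks */
        int    *blocks = malloc(nblocks * sizeof(int));
        int     i;

        if (blocks == NULL)
            return 1;
        srand((unsigned) time(NULL));
        for (i = 0; i < nblocks; i++)
            blocks[i] = i;

        /* after this loop, blocks[0..targblocks-1] is a random subset */
        for (i = 0; i < targblocks; i++)
        {
            int     j = i + rand() % (nblocks - i);
            int     tmp = blocks[i];

            blocks[i] = blocks[j];
            blocks[j] = tmp;
        }

        for (i = 0; i < targblocks; i++)
            printf("would read every row in block %d\n", blocks[i]);

        free(blocks);
        return 0;
    }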
Re: [pgsql-www] [HACKERS] source documentation tool doxygen
This was my plan all along, just been waiting for someone to make it work with the postgresql code and then send instructions to the postgresql web team on how to set it up. Robert Treat On Monday 16 January 2006 18:32, Bruce Momjian wrote: Wow, looks great. Is that URL stable? Can we link to it from the PostgreSQL developers page? http://www.postgresql.org/developer/coding --- Joachim Wieland wrote: I've created a browsable source tree documentation, it's done with the doxygen tool. http://www.mcknight.de/pgsql-doxygen/cvshead/html/ There was a discussion about this some time ago, Jonathan Gardner proposed it here: http://archives.postgresql.org/pgsql-hackers/2004-03/msg00748.php quite a few people found it useful but it somehow got stuck. The reason was apparently that you'd have to tweak your comments in order to profit from it as much as possible. However I still think it's a nice-to-have among the online documentation and it gives people quite new to the code the possibility to easily check what is where defined and how... What do you think? doxygen can also produce a pdf but I haven't succeeded in doing that so far, pdflatex keeps bailing out. Has anybody else succeeded building this yet? Joachim -- Joachim Wieland [EMAIL PROTECTED] C/ Usandizaga 12 1?B ICQ: 37225940 20002 Donostia / San Sebastian (Spain) GPG key available ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq -- Robert Treat Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)
On Mon, 2006-01-16 at 14:43 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: For HJ we write each outer tuple to its own file-per-batch in the order they arrive. Reading them back in preserves the original ordering. So yes, caution required, but I see no difficulty, just reworking the HJ code (nodeHashjoin and nodeHash). What else do you see? With dynamic adjustment of the hash partitioning, some tuples will go through multiple temp files before they ultimately get eaten, and different tuples destined for the same aggregate may take different paths through the temp files depending on when they arrive. It's not immediately obvious that ordering is preserved when that happens. I think it can be made to work but it may take different management of the temp files than hashjoin uses. (Worst case, we could use just a single temp file for all unprocessed tuples, but this would result in extra I/O.) Sure hash table is dynamic, but we read all inner rows to create the hash table (nodeHash) before we get the outer rows (nodeHJ). Why would we continue to dynamically build the hash table after the start of the outer scan? (I see that we do this, as you say, but I am surprised). Best Regards, Simon Riggs ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)
Simon Riggs [EMAIL PROTECTED] writes: Sure hash table is dynamic, but we read all inner rows to create the hash table (nodeHash) before we get the outer rows (nodeHJ). But our idea of the number of batches needed can change during that process, resulting in some inner tuples being initially assigned to the wrong temp file. This would also be true for hashagg. Why would we continue to dynamically build the hash table after the start of the outer scan? The number of tuples written to a temp file might exceed what we want to hold in memory; we won't detect this until the batch is read back in, and in that case we have to split the batch at that time. For hashagg this point would apply to the aggregate states not the input tuples, but it's still a live problem (especially if the aggregate states aren't fixed-size values ... consider a concat aggregate for instance). regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
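The re-splitting described here leans on how a tuple's batch number is derived from its hash value. Assuming (as hashjoin does) a power-of-two batch count taken from the hash bits, doubling the batch count only ever keeps a tuple in its batch or moves it into a newly created later batch, never an earlier one, which is why a batch that turns out too big can simply be re-split when it is read back in. A toy sketch, with made-up hash values:

    #include <stdio.h>
    #include <stdint.h>

    /* Toy sketch (not the hashjoin code): with a power-of-two number of
     * batches, batchno = hash & (nbatch - 1).  When nbatch doubles, a tuple
     * either keeps its batch number or moves to batchno + old_nbatch, never
     * to an earlier batch. */
    static int
    batchno(uint32_t hash, int nbatch)
    {
        return (int) (hash & (uint32_t) (nbatch - 1));
    }

    int
    main(void)
    {
        uint32_t hashes[] = {0x1234, 0xBEEF, 0xCAFE, 0xF00D, 0x0042};
        int     n = (int) (sizeof(hashes) / sizeof(hashes[0]));
        int     i;

        for (i = 0; i < n; i++)
            printf("hash %#x: nbatch=4 -> batch %d, nbatch=8 -> batch %d\n",
                   (unsigned) hashes[i],
                   batchno(hashes[i], 4), batchno(hashes[i], 8));
        return 0;
    }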
Re: [HACKERS] source documentation tool doxygen
On Monday 16 January 2006 10:51, Tom Lane wrote: Neil Conway [EMAIL PROTECTED] writes: I don't think it would be all that painful. There would be no need to convert the entire source tree to use proper Doxygen-style comments in one fell swoop: individual files and modules can be converted whenever anyone gets the inclination to do so. I don't think the maintenance burden would be very substantial, either. In the previous go-round on this topic, I seem to recall some concern about side-effects, to wit reducing the readability of the comments for ordinary non-doxygen code browsing. I'd be quite against taking any noticeable hit in that direction. A quick look through the doxygen manual doesn't make it sound too invasive, but I am worried about how well it will coexist with pgindent. It seems both tools think they can dictate the meaning of the characters immediately after /* of a comment block. In my experimentation, I found that there are really three paths for using Doxygen on code. (1) Do nothing to the code. Doxygen in this state works pretty well. You can browse the code and you can find references to things. Some of the comments are picked up correctly as-is. It does a really good job of formatting the code for HTML. (2) Do minimally invasive changes to make Doxygen recognize comments better. Example: Move comments from behind to in front, or vice versa. Put an extra '*' after '/*' to denote special documentation. This makes Doxygen especially useful, because most things already have proper documentation and bringing that out to the person who browses Doxygen's output is very good. All of a sudden, we have a manual that reads like a book describing how PostgreSQL is put together PLUS we still have the marked-up code. (3) Do maximally invasive changes to the comments. This includes things like using Doxygen-specific markup, putting in LaTeX for formulas, and putting in cross-references. This is a huge amount of work (initial + maintenance) and doesn't add much benefit beyond what (2) has. It makes the book read more like a book, but it was already pretty readable. It's not a question of WHETHER to use Doxygen. You can use it whether or not anyone else wants to. It's really a question of how far we bend backwards to accommodate Doxygen. I'd say we just stick to moving comments around and putting in the extra '*'s. Let's not encourage people to put in Doxygen markup that isn't obvious. If someone contributes code with Doxygen mark-up, we won't turn them down but we won't promise we'll keep their markup as pretty as it was when they submitted it. -- Jonathan Gardner [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: explain analyze is your friend
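For what option (2) looks like in practice, here is a small made-up C fragment. The only doxygen-specific changes are the extra '*' after '/*' and the trailing '/**< ... */' form for struct members; the struct and function names are invented purely for illustration.

    #include <stdio.h>

    /**
     * Per-scan bookkeeping for a toy sequential scan (invented example).
     */
    typedef struct ToyScanState
    {
        int     cur_block;      /**< block currently being scanned */
        int     rows_seen;      /**< number of rows returned so far */
    } ToyScanState;

    /**
     * toy_scan_next - return the next row number, or -1 when the scan is done.
     */
    static int
    toy_scan_next(ToyScanState *state, int total_rows)
    {
        if (state->rows_seen >= total_rows)
            return -1;
        return state->rows_seen++;
    }

    int
    main(void)
    {
        ToyScanState state = {0, 0};
        int     row;

        while ((row = toy_scan_next(&state, 3)) != -1)
            printf("row %d\n", row);
        return 0;
    }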
[HACKERS] FW: PGBuildfarm member firefly Branch HEAD Failed at Stage Check
PG Build Farm wrote: The PGBuildfarm member firefly had the following event on branch HEAD: Failed at Stage: Check The snapshot timestamp for the build that triggered this notification is: 2006-01-17 03:27:00 The specs of this machine are: OS: UnixWare / 7.1.4 Arch: i386 Comp: cc (SCO UDK) / 4.2 For more information, see http://www.pgbuildfarm.org/cgi-bin/show_history.pl?nm=firefly&br=HEAD This is a Stats failure. I thought the latest change Tom put in would fix it. I guess not. How can I help? LER -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 512-248-2683 E-Mail: ler@lerctr.org US Mail: 430 Valona Loop, Round Rock, TX 78681-3683 US ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)
Tom Lane [EMAIL PROTECTED] writes: Why would we continue to dynamically build the hash table after the start of the outer scan? The number of tuples written to a temp file might exceed what we want to hold in memory; we won't detect this until the batch is read back in, and in that case we have to split the batch at that time. For hashagg this point would apply to the aggregate states not the input tuples, but it's still a live problem (especially if the aggregate states aren't fixed-size values ... consider a concat aggregate for instance). For a hash aggregate would it be possible to rescan the original table instead of spilling to temporary files? Then when you run out of working memory you simply throw out half the hash table and ignore subsequent tuples that fall in those hash buckets. Then you rescan for the discarded hash bucket regions. This avoids having to do any disk writes at the expense possibly of additional reads. I think in terms of i/o it would be much faster in most cases. The downsides are: a) volatile aggregates or aggregates with side-effects would be confused by being executed twice. I'm not clear that volatile aggregate functions make any sense anyways though. b) I'm unclear whether rescanning the table could potentially find tuples in a different state than previous scans. If so then the idea doesn't work at all. But I don't think that's possible is it? The main problem is c) it may lose in terms of i/o for cases where the cardinality is low (ie, it's overflowing despite having low cardinality because the table is really really big too). But most cases will be spilling because the cardinality is high. So the table may be big but the spill files are nearly as big anyways and having to write and then read them means double the i/o. The upside of not having to write out temporary files is big. I find queries that require temporary sort files get hit with a *huge* performance penalty. Often an order of magnitude. Part of that could probably be mitigated by having the sort files on a separate spindle but I think it's always going to hurt especially if there are multiple operations spilling to disk simultaneously. -- greg ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
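A rough sketch of the rescan idea, purely illustrative: the input "table", the number of passes, and the key-range partitioning below are all invented. The shape of it is to aggregate in several passes over the input, each pass keeping in memory only the keys that fall in its share of the key space and simply skipping the rest instead of spilling them to disk.

    #include <stdio.h>

    #define NKEYS   8       /* pretend this is all that fits in work_mem */
    #define NPASSES 2       /* invented number of rescan passes */

    /* Toy sketch of "rescan instead of spill": each pass re-reads the input
     * and aggregates only the keys assigned to that pass. */
    int
    main(void)
    {
        int     keys[]   = {3, 7, 3, 2, 7, 3, 2, 5, 5, 3};
        int     values[] = {1, 2, 1, 5, 2, 1, 5, 4, 4, 1};
        int     ntuples = 10;
        int     pass;

        for (pass = 0; pass < NPASSES; pass++)
        {
            long    sums[NKEYS] = {0};
            int     seen[NKEYS] = {0};
            int     i;

            for (i = 0; i < ntuples; i++)
            {
                if (keys[i] % NPASSES != pass)
                    continue;       /* not this pass's key range: skip, no spill */
                sums[keys[i]] += values[i];
                seen[keys[i]] = 1;
            }

            for (i = 0; i < NKEYS; i++)
                if (seen[i])
                    printf("pass %d: key %d sum %ld\n", pass, i, sums[i]);
        }
        return 0;
    }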
Re: [HACKERS] source documentation tool doxygen
Bruce Momjian pgman@candle.pha.pa.us writes: Wow, looks great. Is that URL stable? Can we link to it from the PostgreSQL developers page? The thing seems to have only the very vaguest grasp on whether it is parsing C or C++ ... or should I say that it is convinced it is parsing C++ despite all evidence to the contrary? I'd be happier with the pretty pictures if they had some connection to reality. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)
Greg Stark [EMAIL PROTECTED] writes: For a hash aggregate would it be possible to rescan the original table instead of spilling to temporary files? Sure, but the possible performance gain is finite and the possible performance loss is not. The original table could be an extremely expensive join. We'd like to think that the planner gets the input size estimate approximately right and so the amount of extra I/O caused by hash table resizing should normally be minimal. The cases where it is not right are *especially* not likely to be a trivial table scan as you are supposing. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)
On Mon, 2006-01-16 at 20:02 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: Sure hash table is dynamic, but we read all inner rows to create the hash table (nodeHash) before we get the outer rows (nodeHJ). But our idea of the number of batches needed can change during that process, resulting in some inner tuples being initially assigned to the wrong temp file. This would also be true for hashagg. So we correct that before we start reading the outer table. Why would we continue to dynamically build the hash table after the start of the outer scan? The number of tuples written to a temp file might exceed what we want to hold in memory; we won't detect this until the batch is read back in, and in that case we have to split the batch at that time. For hashagg this point would apply to the aggregate states not the input tuples, but it's still a live problem (especially if the aggregate states aren't fixed-size values ... consider a concat aggregate for instance). OK, I see what you mean. Sounds like we should have a new definition for aggregates, "sort insensitive", that allows them to work when the input ordering does not affect the result, since that case can be optimised much better when using HashAgg. Since we know that applies to the common cases of SUM, AVG, etc. this will certainly help people. For sort-sensitive aggregates it sounds like we either: 1. Write to a single file, while we remember the start offset of the first row of each batch. 2. Write to multiple files, adding a globally incrementing sequenceid. Batches are then resorted on the sequenceid before processing. 3. We give up, delete the existing batches and restart the scan from the beginning of the outer table. Sounds like (1) is best, since the overflow just becomes a SortedAgg. But all of them sound ugly. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 6: explain analyze is your friend
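Option (1) amounts to remembering where each batch starts in a single spill file and seeking back to it later. A minimal sketch of that bookkeeping, with an invented fixed-size record format and batch sizes, using tmpfile/ftell/fseek:

    #include <stdio.h>

    /* Toy sketch of option (1): spill everything to one temp file, remembering
     * the offset at which each batch starts so a batch can be re-read later in
     * its original order. */
    int
    main(void)
    {
        FILE   *spill = tmpfile();
        long    batch_start[3];
        int     batch, i, value;

        if (spill == NULL)
            return 1;

        /* write three "batches" of fixed-size integer records */
        for (batch = 0; batch < 3; batch++)
        {
            batch_start[batch] = ftell(spill);
            for (i = 0; i < 4; i++)
            {
                value = batch * 100 + i;
                fwrite(&value, sizeof(value), 1, spill);
            }
        }

        /* later: seek back to batch 1 and re-read it in its original order */
        fseek(spill, batch_start[1], SEEK_SET);
        for (i = 0; i < 4; i++)
        {
            if (fread(&value, sizeof(value), 1, spill) == 1)
                printf("batch 1 record: %d\n", value);
        }

        fclose(spill);
        return 0;
    }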
Re: [HACKERS] [GENERAL] [PATCH] Better way to check for getaddrinfo function.
That was very much situation-specific. But the bottom line is that the default test does not include netdb.h in the test code, so pg uses getaddrinfo.c. And the getaddrinfo.c does not work for me: IPv6 client authentication fails. I have modified the patch. $ diff -r configure.in configure.in.new 918a919 AC_MSG_CHECKING([for getaddrinfo]) 920c921,926 AC_REPLACE_FUNCS([getaddrinfo]) --- AC_TRY_LINK([#include <netdb.h> #include <assert.h>], [char (*f)();f=getaddrinfo;], ac_cv_func_getaddrinfo=yes, ac_cv_func_getaddrinfo=no) if test x$ac_cv_func_getaddrinfo = xyes; then AC_DEFINE(HAVE_GETADDRINFO,1,[Define if you have the getaddrinfo function]) fi 923a930 AC_MSG_RESULT([$ac_cv_func_getaddrinfo]) Rajesh R -- This space intentionally left non-blank. -Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED]] Sent: Monday, January 16, 2006 11:28 PM To: R, Rajesh (STSD) Cc: pgsql-hackers@postgresql.org; pgsql-general@postgresql.org Subject: Re: [GENERAL] [PATCH] Better way to check for getaddrinfo function. R, Rajesh (STSD) [EMAIL PROTECTED] writes: Just thought that the following patch might improve checking for getaddrinfo function (in configure.in) Since AC_TRY_RUN tests cannot work in cross-compilation scenarios, you need an *extremely* good reason to put one in. "I thought this might improve things" doesn't qualify. Exactly what problem are you trying to solve and why is a run-time test necessary? Why doesn't the existing coding work for you? regards, tom lane configure-in.patch Description: configure-in.patch ---(end of broadcast)--- TIP 6: explain analyze is your friend
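Separately from the configure-time test, one way to see whether the system getaddrinfo (rather than the fallback getaddrinfo.c) handles IPv6 at all is a tiny standalone program like the one below. This is not part of the patch, and the ::1 lookup is only an example.

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>

    /* Minimal check of the system getaddrinfo(): ask for an IPv6 stream
     * address for ::1 and report what comes back. */
    int
    main(void)
    {
        struct addrinfo hints;
        struct addrinfo *res = NULL;
        int     rc;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET6;
        hints.ai_socktype = SOCK_STREAM;

        rc = getaddrinfo("::1", NULL, &hints, &res);
        if (rc != 0)
        {
            fprintf(stderr, "getaddrinfo failed: %s\n", gai_strerror(rc));
            return 1;
        }
        printf("getaddrinfo resolved ::1 (ai_family %d)\n", res->ai_family);
        freeaddrinfo(res);
        return 0;
    }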