Re: [HACKERS] modules

2008-04-03 Thread Ron Mayer

Joshua D. Drake wrote:

On Thu,  3 Apr 2008 13:54:11 -
Greg Sabino Mullane [EMAIL PROTECTED] wrote:

Right now contrib is a real catch-all of various things; it would
be nice to categorize them somehow. And by categorize, I
emphatically do NOT mean move to pgfoundry, which is pretty much a
kiss of death.


Pgfoundry is not a kiss of death except that you spread falsehoods like
that. PGfoundry is a very alive project that is constantly adding
content and has continuing and very active projects.


+1.
This modules/packages concept seems 100% orthogonal to pgfoundry to me.

Pgfoundry seems like a bug-tracking / development infrastructure
somewhat like sourceforge.   Some modules might want to use it,
some might not (no doubt postgis would stay with refractions, etc).

This hypothetical modules project is more focused on installers,
and perhaps ways to find and run the module installers whether
from pgfoundry or elsewhere.

   Ron

PS: Regarding pgfoundry and credibility; it seems the stature
and image of pgfoundry would go up a lot if postgresql itself
were hosted there.   But no, I'm not advocating that.




Re: [HACKERS] [GENERAL] SHA1 on postgres 8.3

2008-04-03 Thread Ron Mayer

Sam Mason wrote:

On Thu, Apr 03, 2008 at 07:07:56PM +0200, Svenne Krap wrote:


ID serial
Username varchar
Password_md5 varchar
Password_sha1 varchar
... 
Why not just use SHA-512, you get many more quality bits that way.


Or if he just wanted to use built-in tools and reduce accidental
collisions almost exactly as much as he proposes, he could use

   password_md5_with_salt_xxx varchar
   password_md5_with_salt_yyy varchar

but I also can't see the point.  Won't he have more
collisions from cosmic rays turning the results of his comparisons
from false to true anyway?
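
For what it's worth, a minimal sketch of that salted-md5 idea, using only
the built-in md5() function (the column names and salt values are made up):

   -- purely illustrative; 'xxx' and 'yyy' stand in for whatever salts he picks
   UPDATE users
      SET password_md5_with_salt_xxx = md5('xxx' || password),
          password_md5_with_salt_yyy = md5('yyy' || password);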

I would drop md5 totally and use sha1 and ripemd-160 if possible.. but 
currently i use only md5 as it is the only available one..  Loading 
pgcrypto is overkill for something as simple as hash-functions.


Sounds like a good reason for moving the current md5 function out into
pgcrypto as well! :)


I'd rephrase that as a good reason for making it less
intimidating to install modules.   +1 to all the approaches
that make this less scary.

For Perl, Digest-SHA1 is in CPAN: http://search.cpan.org/dist/Digest-SHA1/





[HACKERS] modules

2008-04-02 Thread Ron Mayer

Andrew Dunstan wrote:

Tom Lane wrote:

as having better system support for packages or modules or whatever
you want to call them; and maybe we also need some marketing-type


...re-raise the question of getting rid of contrib...
The PostgreSQL Standard Modules. 


While renaming, could we go one step further and come up with a
clear definition of what it takes for something to qualify as
a module?   In particular I think standardizing the installation
would go a long way to letting packagers automate the installation
of modules from pgfoundry.

I think it'd be especially cool if one could one-day have a command

  pg_install_module  [modulename] -d [databasename]

and it would magically get (or verify that it had) the latest
version from pgfoundry; compile it (if needed) and install it
in the specified database.

The closest analogy to what I'm thinking is the perl CPAN or ruby gems.



Re: [HACKERS] modules

2008-04-02 Thread Ron Mayer

Andrew Dunstan wrote:

I think it'd be especially cool if one could one-day have a command

  pg_install_module  [modulename] -d [databasename]


Yes, the CPAN analogy has been in several minds, but it only 
goes so far. Perl and Ruby are languages - Postgres is a very different 
animal.


Sure - but the benefits of standardizing installers for optional
components seem to apply equally to both.

We do in fact have some support for building / installing some modules 
in a standard way. It's called pgxs and it is used by quite a number of 
existing modules.


Cool.  Seems to handle at least quite a bit of the building part of
standardized modules.

One thing that might be worth looking at is an install command at the 
SQL level, so the INSTALL foo would run the install script for the foo 
module in the current database, assuming it's in the standard location.


I'm guessing it would be harder to add to this the various
options (install or ignore dependencies; specify a different source
web site) that a standard installer would like to have.


We don't have a central repository of non-standard modules, like CPAN, 
and so of course no facility for fetching / building / installing them.


Seems that could easily be improved in a number of ways.

  * The installer could specify the source.  For example
  pg_install_module postgis -source http://www.refractions.net
in exactly the same way ruby uses
  gem install rails --source http://gems.rubyonrails.org

  * pgfoundry could provide a repository of installable modules
for projects hosted there.

  * perhaps pgfoundry could even have a section where it points
to installers on third party sites?

Not all modules fit a single pattern, either. There are addon languages, 
types, and function libraries, as well as utilities that are not 
installed in the database at all.


Agreed.  Such a mechanism would only really apply for things
that are installed in the database.   But from an end user's
point of view, installing functions, index types, languages,
data types, etc. all seem to fit the pg_install postgis -d mydb,
pg_install pl_ruby -d mydb, etc. pattern pretty well.

Finally, setting up modules so they can be built for Windows, especially 
using MSVC, will probably be quite a challenge.


Indeed.   Seems ruby gems give you the option of installing a ruby
version or a windows version that I'm guessing has pre-compiled
object files.

So if someone wants to make a start on any of this I'm sure we would all 
listen up.

I'm happy to try, though might need pointing in the right directions.




Re: [HACKERS] pg_dump additional options for performance

2008-02-26 Thread Ron Mayer

Magnus Hagander wrote:

On Tue, Feb 26, 2008 at 08:28:11AM -0500, Andrew Dunstan wrote:

Simon Riggs wrote:

Separate files seems much simpler...

Yes, We need to stick to the KISS principle.
ISTM that we could simply invent a new archive format of d for directory.

Yeah, you can always ZIP (or whatever) the resulting directory when you're
done..


I could imagine this being very rsync friendly too.



Re: [HACKERS] Permanent settings

2008-02-21 Thread Ron Mayer
Magnus Hagander wrote:
 If they don't have an actual database, it's fairly common to use SQLite or
 similar just to get proper database storage for it.

With all the concern about parsing in this thread, perhaps it's best
if this config-overrides file not be of the same syntax as postgresql.conf
at all.

If the interactive form of these overrides will be
SET PERMANENTLY work_mem TO 65MB;, why not make the override
config file use the same syntax, since a parser for it will have
to exist anyway?
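
A rough sketch of what such an override file might then look like --
the SET PERMANENTLY syntax here is the hypothetical one from this thread,
not an existing command, and the settings and values are arbitrary:

   -- hypothetical override file, reusing the proposed interactive syntax
   SET PERMANENTLY work_mem TO '64MB';
   SET PERMANENTLY log_min_duration_statement TO 250;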

Maybe some XML bloat.  Or, since you mentioned it, perhaps SQLite
itself, since some people on the thread seem to want sql-like
syntaxes to maintain it?



[Personally, we maintain perl scripts that apply patches to
the default postgresql.conf; and check those in to source
control.  I don't think I'd use this override file feature.]



Re: [HACKERS] Feature Freeze Date for Next Release

2008-02-05 Thread Ron Mayer
Wouldn't seeing which patches are trickling in during the first months
of 8.4 development give a better indication of when it should be
freezable?   I'm all in favor of having lots of advance notice and
predictable schedules --- but it seems in the next month or so we'll
have a lot more insight into whether 8.4 will be getting big features or
little ones, based on what work-in-progress patches start coming in.



Simon Riggs wrote:
 Can I ask when the Feature Freeze for next release will be?

This gives me flashbacks to this earlier thread:

http://archives.postgresql.org/pgsql-hackers/2006-09/msg01979.php
Simon Riggs in 2006 wrote:
 David Page wrote:
 Following the recent discussion on this list and another on pgsql-core,
 we have decided that we would like to aim to meet the following schedule
 for the release of PostgreSQL 8.3:

 April 1st 2007 - Feature freeze
 June 1st 2007 - Release

 I'm very happy to have clearly stated dates. That means we can all plan
 what we'll be able to achieve in that time, which is important when some
 of the largest or most complex features are being considered.


Quoting Simon in 2008 again:
 We've long expressed the wish to move development onto a cycle that ends
 in the Spring, so next alternative would appear to be end-March-2009,
 which is 14 months away now.

In light of the 2006 thread, I'd say April 1st, as in the previous
schedule, would be better than end-of-March -- at least for those with a sense of irony.






Re: [HACKERS] configurability of OOM killer

2008-02-05 Thread Ron Mayer
Decibel! wrote:
 
 Yes, this problem goes way beyond OOM. Just try and configure
 work_memory aggressively on a server that might see 50 database
 connections, and do it in such a way that you won't swap. Good luck.

That sounds like an even broader and more difficult problem
than managing memory.

If you have 50 connections that all want to perform large sorts,
what do you want to have happen?

  a) they each do their sorts in parallel with small amounts
 of memory for each; probably all spilling to disk?
  b) they each get a big chunk of memory but some have to
 wait for each other?
  c) something else?

Seems (a)'s already possible today with setting small work_mem.
Seems (b)'s already possible today with a larger work_mem and
pg_pool.
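
To make (a) and (b) concrete, a rough sketch of the knobs involved today
(the values are arbitrary):

   -- (a) many concurrent sorts, each with a small budget (likely spilling to disk)
   SET work_mem = '8MB';

   -- (b) a big per-sort budget, relying on an external pooler such as pgpool
   --     to cap how many sessions run at once
   SET work_mem = '512MB';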

Stepping back from the technical details, what do you think
should happen?   (Though perhaps that should be taken to a different
thread.)



Re: [HACKERS] Feature Freeze Date for Next Release

2008-02-05 Thread Ron Mayer
Simon Riggs wrote:
 Can I ask when the Feature Freeze for next release will be?

Also, from  http://www.postgresql.org/about/press/faq

  Q: When will 8.4 come out?
   A: Historically, PostgreSQL has released approximately
  every 12 months and there is no desire in the community
  to change from that pattern. So expect 8.4 sometime in
  the fourth quarter of 2008.

So you can count back from then to guess a freeze.



Re: [HACKERS] configurability of OOM killer

2008-02-04 Thread Ron Mayer

Tom Lane wrote:

Alvaro Herrera [EMAIL PROTECTED] writes:
  

... OOM_Killer



Egad.  Whoever thought *this* was a good idea should be taken out
and shot:
  

If I read this right, http://lkml.org/lkml/2007/2/9/275 says even the
shared memory is counted many times (once per child) against the
parent process - even though it's (obviously) not copy-on-write,
so the shared memory is unlikely to be contributing to the problem.

I wonder if postgres startup should write something (warning?
at least log?) in the log file if the OOM killer is enabled.  I assume
most people who care deeply about their database dying would notice a
warning in the log files, while most people who don't mind the OOM killer
wouldn't be too bothered by extra noise in the file.





Re: [HACKERS] configurability of OOM killer

2008-02-04 Thread Ron Mayer
Alvaro Herrera wrote:
 Yeah, the only way to improve the OOM problem would be to harass the
 Linux developers to tweak badness() so that it considers the postmaster
 to be an essential process rather than the one to preferentially kill.

Wouldn't the more general rule that Jeff Davis pointed out upthread
make more sense?

That the shared memory of the children should not be added to the size
of the parent process multiple times, regardless of whether something's
an essential process or not.  Since those bytes are shared, it
seems they should only be added to the badness once, no?


(assuming I understood Jeff correctly)




Re: [HACKERS] [PATCHES] Proposed patch: synchronized_scanning GUCvariable

2008-01-29 Thread Ron Mayer

Tom Lane wrote:

Kevin Grittner [EMAIL PROTECTED] writes:
  
Tom Lane [EMAIL PROTECTED] wrote: 


Or is someone prepared to argue that there are no applications out
there that will be broken if the same query, against the same unchanging
table, yields different results from one trial to the next? 
  

Won't even autovacuum analyze cause this too if the
new stats change the plan?




Re: [HACKERS] [PATCHES] Proposed patch: synchronized_scanning GUCvariable

2008-01-28 Thread Ron Mayer
Jeff Davis wrote:
 On Mon, 2008-01-28 at 23:13 +, Heikki Linnakangas wrote:
   
 clusteredness didn't get screwed up by a table that looks like this: 
 5 6 7 8 9 1 2 3 4
 
 ...test table with a similar
 distribution to your example, and it shows a correlation of about -0.5,
 but it should be as good as something near -1 or +1.

 I am not a statistics expert, but it seems like a better measurement
 would be: what is the chance that, if the tuples are close together in
 index order, the corresponding heap tuples are close together?.
   
The same applies to data clustered by zip code.

All rows for any State or City or County or SchoolZone
are close together on the same pages; yet postgres's
stats think they're totally unclustered.
 The answer to that question in your example is very likely, so there
 would be no problem.
 Is there a reason we don't do this?
   
I've been tempted to do things like

   update pg_statistic set stanumbers3='{1.0}' where starelid=2617 and
staattnum=7;

after every analyze when I have data like this from tables clustered
by zip.  Seems it'd help more plans than it hurts, but haven't been
brave enough to try in production.




Re: [HACKERS] Postgresql Materialized views

2008-01-14 Thread Ron Mayer
Mark Mielke wrote:
 Mark Mielke wrote:
 Counts, because as we all know, PostgreSQL count(*) is slow, and in
 any case, my count(*) is not on the whole table, but on a subset.
 Doing this in a general way seems complex to me as it would need to be
 able to evaluate whether a given INSERT or UPDATE or one of the
 dependent tables would impact the WHERE clause for the materialized
 view, and it still wouldn't know which rows to add/update/remove
 without detailed analysis, so it would either be throwing out the
 entire materialized view and recreating it on INSERT or UPDATE (or
 deferring until the next query?) in which case it may be very slow, or
 it may be very complex.
 
 Bah. I forgot to add: The feature I've been wondering about (and not
 necessarily looking for somebody else to do, although I don't think I
 know the code well enough to do it at this point):
 
 Web applications often make the same queries over and over. While
 memcache can be used to cache results, the memcache interface is
 different from the web application interface, requiring complex code, and
 as well, one loses the transaction guarantees as the memcache results
 are not guaranteed to be up-to-date with the database. 

Regarding up-to-dateness, note that there is a pgfoundry project that
helps there:   http://pgfoundry.org/projects/pgmemcache/   The other
advantages of doing the caching outside the database are that (a) the
memory for the cached results doesn't have to sit on the database machine,
and (b) you can cache post-processed fragments (rendered into HTML or GIFs)
rather than raw data.



Re: [HACKERS] Declarative partitioning grammar

2008-01-11 Thread Ron Mayer
Gavin Sherry wrote:
 CREATE TABLE is modified to accept a PARTITION BY clause. This clause
 contains one or more partition declarations. The syntax is as follows:
 PARTITION BY {partition_type} (column_name[, column_name...])
 [PARTITIONS number]
 (
    partition_declaration[, partition_declaration...]
 )
 The partition type can be one of HASH, RANGE or LIST.

What would be the drawbacks of
  CREATE TABLE tablename(...)
  PARTITION BY function_taking_row_returning_partition_name
instead of the explicit types?


It seems that with my own function I could pretty easily emulate
the HASH, RANGE, or LIST types.   It seems a function returning a
partition name would largely avoid the need for the sub-partition stuff
too -- at least for the cases when the only reason you wanted
sub-partitions was for composite partition support.

I'm not sure if a function would give more flexibility, but
it sure seems it'd be easier for me to remember than the various
PARTITION BY LIST (a) (
 VALUES ('L') SUBPARTITION BY RANGE (b) (VALUES('x'),VALUES('y')),
 VALUES ('M') SUBPARTITION BY RANGE (b) (VALUES('u'),VALUES('t')))
or however it'd look.
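
To make that concrete, here's the sort of row-to-partition-name function
I have in mind -- the table, columns, and rules are invented purely for
illustration; the grammar question is just whether CREATE TABLE could
accept such a function in place of the explicit HASH/RANGE/LIST
declarations:

   -- hypothetical: maps a row's key columns to a partition name,
   -- emulating a LIST partition on region composed with a RANGE on date
   CREATE FUNCTION orders_partition_name(region text, order_date date)
   RETURNS text AS $$
     SELECT CASE
              WHEN $1 = 'US' AND $2 >= date '2008-01-01' THEN 'orders_us_2008'
              WHEN $1 = 'US'                             THEN 'orders_us_old'
              ELSE                                            'orders_other'
            END;
   $$ LANGUAGE sql IMMUTABLE;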





Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-10 Thread Ron Mayer
Chris Browne wrote:
 _On The Other Hand_, there will be attributes that are *NOT* set in a
 more-or-less chronological order, and Segment Exclusion will be pretty
 useless for these attributes.

Short summary:

  With the appropriate clustering, ISTM Segment Exclusion
  can be useful on all columns in a table.

  Cluster the table by one-bit-of-column1 || one-bit-of-column-2...
  That way any col2=value query could exclude at least about
  half the table, regardless of whether it's monotonically increasing
  or even totally random.

It seems one could make Segment Exclusion even useful for
multiple unrelated columns in a table.    Consider a large
table of people where one might want segment exclusion to
help with both first name and last name.

One could cluster the table by first-letter-of-last-name ||
first-letter-of-first-name.   Then a query for last name Smith
could exclude all but the consecutive segments of S's; while
the query for first name John would only have to look in the
26 runs of segments with AJ, BJ, CJ, ...
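
In SQL terms that's something like the following (table and column names
made up; on releases before 8.3 the CLUSTER syntax is CLUSTER indexname
ON tablename):

   -- order the heap by last-initial || first-initial so both columns
   -- end up with long runs of similar values
   CREATE INDEX people_seg_idx
       ON people ((substr(last_name, 1, 1) || substr(first_name, 1, 1)));
   CLUSTER people USING people_seg_idx;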

Perhaps better - hash each column and interleave the bits
col1bit1, col2bit1, col3bit1, col1bit2, col2bit2, col3bit2.
If I understand segment exclusion right, that way on any
table bigger than 8 segments any query of col1=val,
col2=val, or col3=val would scan at most half the table;
on a table bigger than 64 segments any such query would
scan at most 1/4 of the table.

Obviously this only benefits the rarely changing parts of
tables; and new and updated values would be in a very hot
segment at the end of the table that wouldn't be segment
excluded much.



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-09 Thread Ron Mayer
Chris Browne wrote:
 _On The Other Hand_, there will be attributes that are *NOT* set in a
 more-or-less chronological order, and Segment Exclusion will be pretty
 useless for these attributes.  

Really?    I was hoping that it'd be useful for any data
with long runs of the same value repeated - regardless of ordering.

My biggest tables are clustered by zip/postal-code -- which means that
while the City, State, Country attributes aren't monotonically increasing
or decreasing, they are grouped tightly together.   I'd expect all queries
for San Francisco to be able to come from at most 2 segments, and all queries
for Texas to be able to come from only a fraction of the whole.


If the segment sizes are configurable, I imagine this would even
be useful for other data - like a people table organized
by last_name, first_name.   Johns may be scattered throughout
the table -- but at least the John Smiths would all be on one
segment, while the Aaron-through-Jim Smith segments might get excluded.

Or am I missing something?



Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps

2008-01-07 Thread Ron Mayer
Andrew Sullivan wrote:
 On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote:
 ...the requirements: no single tuple in the segment may be 
 significantly out of the average bounds. Otherwise, the min/max gets 
 pretty useless and the segment can never be excluded.
 
 The segment can never be excluded in a search on that key, yes.  But
 consider the likely cases we're looking at: 
 ...you're searching for data in a given
 date range or for primary (sometimes artificial) identifiers in a range,
 and the source data increases (almost) monotonically.  

It seems to me the idea discussed elsewhere in the thread[1]
about configurable segment sizes would make this limitation
much less problematic for some types of data.

Some of my biggest tables are clustered by zip code and are insert-mostly.
Common queries are where state_provence='TX' or
where city='Dallas'.  While I doubt I have enough data to fill
a 1GB segment for any but the largest cities, I certainly have
runs of many consecutive blocks - since clustering by zip tends
to cluster cities as well.  Even though the table's not monotonically
increasing or decreasing, like values for cities and states are
clustered together.

Is my understanding right that these Segment Visibility Maps could
help this case as well?

[1] http://archives.postgresql.org/pgsql-hackers/2008-01/msg00065.php




Re: [HACKERS] Sorting Improvements for 8.4

2007-12-20 Thread Ron Mayer
Gregory Stark wrote:
 Note that speeding up a query from 20s to 5s isn't terribly useful. 

I disagree totally with that.

That is the difference between no chance of someone waiting for a web
page to load and a good chance they'd wait.   And 2s vs. 0.5s is the
difference between a web site that feels responsive and one that doesn't.

 If it's OLTP you can't be using all your cores for each user anyways.

Even so, I'd much rather keep each response time lower.   If web page
requests are coming in at 1 a second, it's much nicer to respond to
each of them in 1 second than in 4 seconds -- even if the overall
throughput is identical.




Re: [HACKERS] Sorting Improvements for 8.4

2007-12-19 Thread Ron Mayer
Mark Mielke wrote:
 I am curious - what algorithms exist to efficiently do a parallel sort?
 Do you mean if sorting 1 million items, it is possible to separate this
 into  2 sets of 500 thousand each, execute them in separate threads
 (with task administration and synchronization overhead) , combine the
 results, and complete the task in significantly less time than doing it
 in one thread? I am skeptical that this is possible...

The links at the beginning of the thread point to articles
that seem to describe one such algorithm, along with benchmarks
(http://tinyurl.com/3bvu4u, http://tinyurl.com/32wg2m).
The improvements were pretty consistent for set sizes ranging
from very small (hundreds) to quite large (hundreds of thousands).

Interestingly, even multi-threading helped a lot.

   Our tests correlate well with previous research that showed
Intel’s implementation of SMT (Hyper-Threading) to be
adept at hiding this latency [6, 20, 12].  Table 4 shows that by
having two threads access memory at the same time, performance
improved over 80% when compared to the single-threaded version.

It uses both quicksort and merge phases; for the merge phase,
using 2 CPUs (no hyperthreading) apparently gave more than a 2X speed
improvement, apparently because it could overlap memory access
with CPU work.

 Or do you mean being able to perform parts of the query plan fully in
 parallel? If this, then one would need a lot more than ParallelSort...

I wouldn't recommend that - it seems like a Hard Problem.

My guess is that the best way to use multiple threads in one backend
would be to find specific algorithms like sorting that  would be
easier to isolate.



Re: [HACKERS] Sorting Improvements for 8.4

2007-12-19 Thread Ron Mayer
Tom Lane wrote:
 ...I can believe that suitable test cases would show
 2X improvement for 2 threads,

One other thing I found interesting is that their test case
showed a near-2X improvement from hyperthreading; I haven't
heard of many other ways to get hyperthreading to show improvements
for postgresql.


 but it doesn't follow that you will get
 10X improvement with 10 threads, or even 4X with 4.

Yeah - unless those 10 cores have additional I/O bandwidth to
memory compared to a 1-core system (which I'd hope
would be the case, or else I'd expect many apps to
run into memory bottlenecks on such systems, no?).






Re: [HACKERS] Sorting Improvements for 8.4

2007-12-17 Thread Ron Mayer
Has anyone looked into sorting algorithms that could use
more than one CPU or core at a time?

Benchmarks I see[1][2] suggest that sorting is an area that
improves greatly with multiple processors and even with
multi-threading on a single core processor.

   For 1-processor and 2-threads (1p2t), the algorithm sorts
   the relation about 48% faster than the single-threaded version
   with a speedup of 31% during the quicksort and 58% during the
   mergesort. The dual-processor (2p2t) version provides an even
   faster total speedup of 86% over the single-threaded version
   with a speedup of 60% during the quicksort and 100% during
   the merge sort.
[from the acm paper on link 2 below]

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code,
and I wonder if it'd be isolatable enough that multiple
CPUs could be used there.

[1] 
http://www.cs.cmu.edu/~damon2005/damonpdf/4%20best%20paper%20-%20multithreaded%20architectures%20and%20the%20sort%20benchmark.pdf
[2] 
http://delivery.acm.org/10.1145/112/1114254/DaMoN_103.pdf?key1=1114254key2=5713023711coll=dl=ACMCFID=15151515CFTOKEN=6184618





Re: [HACKERS] [GENERAL] Stored procedure issue

2007-12-01 Thread Ron Johnson

On 12/01/07 20:40, Dragan Zubac wrote:
 Hello
 
 I have a stored procedure which does the billing stuff
 in our system. It works ok, but if I put it in
 production, where there are some 5-10 billing events per
 second, the whole database slows down. It won't even
 drop some test table, reindex, or vacuum - things which were
 done before in the blink of an eye. If I stop the
 application which calls the procedure, all is back to
 normal.
 
 We didn't implement any special locking mechanism in
 the procedure; all is default. The procedure is
 updating a user's balance in table 'users'. On the other
 hand a couple of 'heavy load' tables have foreign keys
 pointing to table 'users'.
 
 Is it a matter of concurrency and some locking issue,
 or maybe the existence of all those foreign keys
 pointing to table 'users', or maybe something else
 which we're not aware of at the moment?

Are you using transactions?

Are the tables properly indexed?

Are the queries in the SP using the indexes properly?

Did you do all the testing on a tiny database?

Is the SP written as efficiently as it could be?  (Think of ways to refactor it in
order to get the same results with less effort.)
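
For the third question, running the statements the procedure executes by
hand under EXPLAIN ANALYZE will show whether the indexes are actually
being used -- the table and column names below are just guesses:

   -- EXPLAIN ANALYZE actually runs the statement, so roll it back
   BEGIN;
   EXPLAIN ANALYZE
   UPDATE users SET balance = balance - 0.05 WHERE user_id = 12345;
   ROLLBACK;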

--
Ron Johnson, Jr.
Jefferson LA  USA

%SYSTEM-F-FISH, my hovercraft is full of eels



Re: [HACKERS] PG 7.3 is five years old today

2007-11-29 Thread Ron Mayer
Robert Treat wrote:
 On Tuesday 27 November 2007 15:07, Simon Riggs wrote:
 On Tue, 2007-11-27 at 14:02 -0500, Tom Lane wrote:
 There has been some discussion of making a project policy of dropping
 support for old releases after five years.  Should we consider formally
 instituting that?
 ...
 Perhaps we should ask for volunteers to maintain that branch? ...
 
 +1 to see if anyone else wants to take over management of the branch. I also 
 think we should be a bit more generous on the EOL notice.

One thing that could soften the blow is if the EOL notice mentions
which commercial organizations will provide paid support for longer
than the community does.

I assume that's one of the benefits of going with the commercial
support organizations?



Re: [HACKERS] How to keep a table in memory?

2007-11-13 Thread Ron Mayer
Heikki Linnakangas wrote:
 Luke Lonergan wrote:
 Vacuum is a better thing to run, much less CPU usage.
 
 Vacuum is actually not good for this purpose, because it's been
 special-cased to not bump the usage count.

Though the OS's page cache will still see it as accesses, no?



Re: [HACKERS] Feature Freeze date for 8.4

2007-10-23 Thread Ron Mayer
Joshua D. Drake wrote:
 
 We develop and commit like normal *until* the community feels there is
 enough for release. Then we announce a feature freeze.

I think you just described what will happen in reality regardless
of whatever is decided to be an official plan.  :)  I don't
think that's necessarily a bad thing.   IMHO 8.3 is looking like
a better product because it's late, and DBAs concerned about
stability will likely stay on 8.2 for a while anyway.



Re: [HACKERS] Release notes introductory text

2007-10-12 Thread Ron Mayer
Tom Lane wrote:
 Joshua D. Drake [EMAIL PROTECTED] writes:
 With respect to you Kevin, your managers should wait. You don't
 install .0 releases of any software into production without months
 of testing. At which point, normally a .1 release has come out anyway.
 
 How exactly do you expect the software to get from a .0 to a .1 release,
 or to have addressed the bugs that might bite you when it does get to .1,
 if you aren't helping to test it?

In most environments I've seen, developer and QA systems don't hesitate
to move to .0 releases (or even beta).   I agree with Joshua that it's
nerve wracking to move _production_ systems to .0 releases from most
software vendors.

 Now I realize that you did say test above, but way too often I see
 this sort of argument as a justification for doing nothing and expecting
 somebody else to fix it.



Re: [HACKERS] Use of postmaster

2007-10-03 Thread Ron Mayer
Brendan Jurd wrote:
 Seems it would be best to apply this
 nomenclature consistently, and simply drop the name postmaster from
 use.
 

+1.  I agree the postmaster references in the docs, etc. should
go away - with perhaps the exception of one FAQ entry saying that
postmaster is a deprecated name, in case anyone encounters it
on other web sites or in old docs.

PS: Oh it's so hard to resist saying it should be renamed
postmaSQLter.



Re: [HACKERS] Why is there a tsquery data type?

2007-08-30 Thread Ron Mayer
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
 Why does text search need a tsquery data type?  I realize it needs
 tsvector so it can create indexes and updated trigger columns, but it
 seems tsquery could instead just be a simple text string.
 
 By that logic, we don't need any data types other than text.
 

Could similar logic argue that we'd want special types for regular
expressions too? That seems quite parallel to the tsquery type to me.



Re: [HACKERS] SQL feature requests

2007-08-24 Thread Ron Mayer
Tom Lane wrote:
 Part of the reason for being conservative about changing here
 is that we've got a mix of standard and nonstandard behaviors
 
 A lot of this is legacy behavior that would never have passed muster
 if it had been newly proposed in the last few years --- we have gotten
 *far* stricter about SQL compliance than we used to be.  But at this
 point backwards compatibility also has to weigh heavily.

Has any thought been given to eventually phasing them out?

Perhaps a GUC to give warnings in the log file when
they're encountered.   I guess we'd need three levels of
warnings: off, reasonable, and pedantic.   When set
to the reasonable level it would only give targeted warning
messages like
  Warning: Use of frivolous nonstandard behavior XXX.
  Hint: Use the standard YYY instead.
and when set to pedantic it would point out every
non-standard SQL statement - useful mainly for someone
who wants to know how much postgresql-dependent behavior
they might have.
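
If such a thing existed, usage might look something like this -- the GUC
name and values are entirely made up:

   -- hypothetical; no such setting exists today
   SET sql_compliance_warnings = 'reasonable';   -- off | reasonable | pedantic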

Then a later release could deprecate the
frivolous non-standard pieces, presumably leading to
simpler code in the long run.



Re: [HACKERS] tsearch2 patch status report

2007-08-22 Thread Ron Mayer
Merlin Moncure wrote:
 On 8/21/07, Magnus Hagander [EMAIL PROTECTED] wrote:
 OTOH, if we do it as a compat package, we need to set a firm end-date on
 it, so we don't have to maintain it forever. 
 
 I would suggest making a pgfoundry project...that's what was done with
 userlocks.  I'm pretty certain no one besides me has ever used the
 wrappers I created...a lot more people use tsearch2 than userlocks
 though.
 

Hmm..  In that case I'd think people should ask if anyone would use
the tsearch2 compatibility layer before even doing pgfoundry.
Speaking for myself, I expect migrating to the core text search
APIs is something we'd do as part of our 8.3 migration even if
such a compatibility layer existed.



Re: [HACKERS] tsearch2 in PostgreSQL 8.3?

2007-08-17 Thread Ron Mayer
Joshua D. Drake wrote:
 Tom Lane wrote:
 Robert Treat [EMAIL PROTECTED] writes:
 What exactly does default_text_search_config buy us?  I think it is 
 supposed 
 to simplify things, but it sounds like it adds a bunch of corner cases, 
 Well, the main thing we'd lose if we remove it is all trace of upward
 compatibility from the contrib version of tsearch.
 
 I don't think this is all that big of a deal. In fact I would expect it
 going from contrib to core and never had any illusion to the effect that
 I would be able to just upgrade from 8.2 (8.1) Tsearch2 to 8.3.

FWIW, I also would _not_ have expected compatibility between contrib
and core.   In fact, I would have expected contrib tsearch to be a
place where experimental APIs existed and that the single
biggest difference between contrib and core was that the
core APIs removed any cruft that might have been in contrib.

If default_text_search_config makes things more confusing or more
fragile, I'd rather see it gone than kept around for
backward-compatibility-to-pre-core reasons.



Re: [HACKERS] tsearch2 in PostgreSQL 8.3?

2007-08-15 Thread Ron Mayer
Magnus Hagander wrote:
 I don't use the functional index part, but for new users I can see how
 that's certainly a *lot* easier. 

Can someone with modern hardware test to see if it's
still quite a bit slower than the extra column?  I
tried it two years ago and found the functional index
to be quite a bit slower:
http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php
but it'd be interesting to see if faster CPUs have changed this.
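
For anyone who wants to re-run that comparison on 8.3's built-in text
search, the two setups are roughly as follows (table and column names
invented):

   -- functional (expression) index approach
   CREATE INDEX docs_body_fts
       ON docs USING gin (to_tsvector('english', body));

   -- extra-column-plus-trigger approach
   ALTER TABLE docs ADD COLUMN body_tsv tsvector;
   UPDATE docs SET body_tsv = to_tsvector('english', body);
   CREATE TRIGGER docs_tsv_update BEFORE INSERT OR UPDATE ON docs
       FOR EACH ROW EXECUTE PROCEDURE
       tsvector_update_trigger(body_tsv, 'pg_catalog.english', body);
   CREATE INDEX docs_body_tsv_idx ON docs USING gin (body_tsv);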



Re: [HACKERS] default_text_search_config and expression indexes

2007-08-01 Thread Ron Mayer
Bruce Momjian wrote:
 Oleg Bartunov wrote:
 What is a basis of your assumption ? In my opinion, it's very limited
 use of text search, because it doesn't support ranking. For 4-5 years
 of tsearch2 usage I never used it and never saw it mentioned in the mailing lists.
 This is very user-oriented feature and we could probably ask 
 -general people for their opinion.

I think I asked about this kind of usage a couple of years back,
and Oleg pointed out other reasons why it wasn't as good an
idea either:

http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php
http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php

The particular question I asked was why the functional index was
slower than maintaining the extra column; the explanation was
that the lossy index having to call the function (including
parsing, dictionary lookup, etc.) to re-check the data made
it inadvisable to avoid the extra column anyway.

 I doubt 'general' is going to understand the details of merging this
 into the backend.  I assume we have enough people on hackers to decide
 this.
 
 Are you saying the majority of users have a separate column with a
 trigger?

I think so.   At least when I was using it in 2005 the second
column with the trigger was faster than using a functional index.

 We need more feedback from users.
 
 Well, I am waiting for other hackers to get involved, but if they don't,
 I have to evaluate it myself on the email lists.

Personally, I think documentation changes would be an OK way
to handle it.   Something that makes it extremely clear to the
user the advantages of having the extra column and the risks
of avoiding it.



Re: [HACKERS] default_text_search_config and expression indexes

2007-08-01 Thread Ron Mayer
Bruce Momjian wrote:
 Ron Mayer wrote:
 Bruce Momjian wrote:
 Oleg Bartunov wrote:
 What is a basis of your assumption ? 
 I think I asked about this kind of usage a couple years back;...

 http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php
 http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php

 ...why the functional index was
 slower than maintaining the extra column; with the explanation
 that the lossy index having to call the function (including
 parsing, dictionary lookup, etc) for re-checking the data ...
 ...

 Are you saying the majority of users have a separate column with a
 trigger?
 I think so.   At least when I was using it in 2005 the second
 column with the trigger was faster than using a functional index.
 
 OK, it is good you measured it.  I wonder how GIN would behave because
 it is not lossy.

Too bad I don't have the same database around anymore.
It seems the re-parsing needed for re-checking with the lossy index was very
expensive, though.
In the end, I suspect it depends greatly on what fraction of rows match.

 We need more feedback from users.
 Well, I am waiting for other hackers to get involved, but if they don't,
 I have to evaluate it myself on the email lists.
 Personally, I think documentation changes would be an OK way to
 to handle it.   Something that makes it extremely clear to the
 user the advantages of having the extra column and the risks
 of avoiding them.
 
 Sure, but you have to make sure you use the right configuration in the
 trigger, no?  Does the tsquery have to use the same configuration?

I wish I knew this myself. :-)   Whatever I had done happened to work
but that was largely through people on IRC walking me through it.



Re: [HACKERS] default_text_search_config and expression indexes

2007-08-01 Thread Ron Mayer
Bruce Momjian wrote:
 Ron Mayer wrote:
 I wish I knew this myself. :-)   Whatever I had done happened to work
 but that was largely through people on IRC walking me through it.
 
 This illustrates the major issue --- that this has to be simple for
 people to get started, while keeping the capabilities for experienced
 users.
 
 I am now thinking that making users always specify the configuration
 name and not allowing :: casting is going to be the best approach.  We
 can always add more in 8.4 after it is in wide use.

That's fair.   Either the docs need to make it totally obvious or
the software should force people to do something safe.



[HACKERS] Idea that might inspire more patch reviewing.

2007-05-18 Thread Ron Mayer
Bruce Momjian wrote:
 In talking to people who are assigned to review patches or could review
 patches, I often get the reply, Oh, yea, I need to do that.

Would it inspire more people to learn enough to become patch
reviewers if patch authors scheduled walkthroughs of their
patches with question and answer sessions over IRC or maybe
even some voice conferencing system like skype?

While it might not be of immediate value, I imagine a number
of aspiring hackers might find such walkthroughs
enlightening, and if actual qualified reviewers participate
in the Q&A during those walkthroughs, seeing the kinds of
questions raised would be quite educational as well.

I don't know if this would help - I guess it needs 3 things:
patch authors willing to do such walkthroughs, qualified
people willing to participate in them, and enough wannabe
hackers wanting to listen in.

Do people think this would help - or is it just a
clunkier way of doing what's already done via email
on the patches list?



Re: [HACKERS] Not ready for 8.3

2007-05-16 Thread Ron Mayer
Andrew Dunstan wrote:
 Josh Berkus wrote:
 I think that may be where we're heading.  In that case, we may need to
 talk about branching earlier so that developers can work on the new
 version who are frozen out of the in-process one.
 
 I've argued this in the past. But be aware that it will make more work
 for committers. It is very far from cost-free.

I hate to bring up the SCM flamewar again, but IMHO the one advantage
the distributed systems (Git, Monotone, Darcs, etc.) have over CVS and
Subversion is that they push this branch-management cost down to the
developer who wants a branch - rather than onto the people maintaining the
official repository.



Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-15 Thread Ron Mayer
Josh Berkus wrote:
 And then what? dynamically construct all your SQL queries?
 Sure, sounds like a simple solution to me...
 
 Not to mention DB security issues.  How do you secure your database when 
 your web client has DDL access?
 
 So, Edward, the really *interesting* idea would be to come up with a 
 secure, normalized way to do UDFs *without* EAV tables.  People would be 
 very impressed.
 

I have a system with many essentially user-defined fields, and was
thinking of creating something similar to an Array type and writing
some GIST indexes for it.

My current workaround is to store them as a YAML document and use
tsearch to index it (with application logic to further refine the
results) - but an EAV datatype that could be put in tables and
effectively indexed would be of quite a bit of interest here.
And yes, a better way to do UDFs would be even cooler.
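
Roughly what the workaround looks like, sketched with to_tsvector /
to_tsquery syntax for brevity (table and column names invented; I'm
glossing over the exact tsearch2 incantations):

   -- user-defined fields serialized into a text blob and indexed with tsearch;
   -- application code does the finer-grained filtering on the matches
   ALTER TABLE items ADD COLUMN udf_doc text;    -- e.g. 'color: red  size: large'
   CREATE INDEX items_udf_fts
       ON items USING gin (to_tsvector('simple', udf_doc));

   SELECT * FROM items
    WHERE to_tsvector('simple', udf_doc) @@ to_tsquery('simple', 'red');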



Re: [HACKERS] SCMS question

2007-02-23 Thread Ron Mayer
Bruce Momjian wrote:
 
 My typical cycle is to take the patch, apply it to my tree, then cvs
 diff and look at the diff, adjust the source, and rerun until I like the
 diff and apply.  How do I do that with this setup?

The closest thing to what you're doing would be to
merge the patch's branch into yours.   It's about
the same amount of work as applying a
patch (a one-liner if there are no conflicts).

From that point you could continue exactly as you are now - with the
additional benefit(?) that the checkin history of the branch should (I
hope) be preserved through the merge process so the SCM's history would
let you see which changes from the patch came from the submitter and which
changes came from the modifications in your tree.



(I think this SCM requirements list
  
http://changelog.complete.org/posts/528-Whose-Distributed-VCS-Is-The-Most-Distributed.html
 is one of the more interesting ones.
 The two features I like about the distributed systems are
   # 5. Branching preserves full history
   # 6. Merging preserves full history.
 so the history of the branch (including which changes
 came from the submitter and which were modifications
 in your tree) is preserved when the branch is eventually
 committed to head.)



Re: [HACKERS] SCMS question

2007-02-23 Thread Ron Mayer
Alvaro Herrera wrote:

 Yes, it's nice.  Consider this: Andrew develops some changes to PL/perl
 in his branch.  Neil doesn't like something in those changes, so he
 commits a fix there.  

If I understand right, another advantage is that the SCM will keep
track of which of those changes came from Neil and which from Andrew
and that this history will be preserved when eventually merged with head.
In contrast, with CVS I think the detailed history of changes within Andrew's
branch would be lost and recorded only as one large checkin.
Right?



Re: [HACKERS] autovacuum next steps

2007-02-16 Thread Ron Mayer
Alvaro Herrera wrote:
 
 Once autovacuum_naptime... autovacuum_max_workers...
 How does this sound?

The knobs exposed on autovacuum feel kinda tangential to
what I think I'd really want to control.

IMHO vacuum_mbytes_per_second would be quite a bit more
intuitive than cost_delay, naptime, etc.


ISTM I can relatively easily estimate and/or spec out how
much extra I/O bandwidth I have per device for vacuum;
and would pretty much want vacuum to be constantly
running on whichever table needs it the most, so
long as it can stay under that bandwidth limit.

Could vacuum have a tunable that says X MBytes/second
(perhaps per device) and have it measure how much I/O
it's actually doing and try to stay under that limit?

For more fine-grained control a cron job could go
around setting different MBytes/second limits during
peak times vs idle times.
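
For contrast, the knobs we have today are cost-based rather than
bandwidth-based; below, the real GUCs are set to arbitrary values and the
wished-for one is shown commented out because it doesn't exist:

   -- today's cost-based throttling (real settings, arbitrary values)
   SET vacuum_cost_delay = 20;    -- sleep 20ms each time the cost limit is reached
   SET vacuum_cost_limit = 200;   -- accumulated page-cost budget between sleeps

   -- the kind of knob wished for above (hypothetical):
   -- SET vacuum_mbytes_per_second = 10;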


If people are concerned about CPU-intensive vacuums
instead of I/O-intensive ones (does anyone experience
that?), another tunable, vacuum_percent_of_cpu, would
be more straightforward than cost_delay, cost_page_hit,
etc.   But I'd be a bit surprised if CPU-intensive
vacuums are common.



Re: [HACKERS] A more general approach (Re: Data archiving/warehousing idea)

2007-02-01 Thread Ron Mayer
Hannu Krosing wrote:
 ...is storing all tuple visibility info in a separate file.
 
 At first it seems to just add overhead, but for lots (most ? ) usecases
 the separately stored visibility should be highly compressible, so for
 example for bulk-loaded tables you could end up with one bit per page
 saying that all tuples on this page are visible.

Seems you could do one-bit-per-page compression even with the
visibility data stored in the pages themselves.

I could imagine a table-rewriting vacuum freeze that finds
pages with all data visible to everyone and packs them with
a single bit saying everything here is frozen.

The penalty would be an expensive splitting of the page (and
who knows what evil locks would be needed) if an update is
ever done on those wholly frozen pages -- but we're talking
about read-mostly tables here so that tradeoff might not be bad.



Re: [HACKERS] tsearch in core patch, for inclusion

2007-01-24 Thread Ron Mayer
Andrew Dunstan wrote:
 Joshua D. Drake wrote:
 Where on the website can I see what plugins are included with
 PostgreSQL?

YES!  That's IMHO a more fundamental problem.  The specific
question about Text Search seems more like a symptom.  While
I don't mind Text Search in core, it seems an even bigger deal
that it's hard to find information on extensions (whether
from contrib or from gborg or from external places like postgis).

A web page with a table easily visible on the
postgresql web site that had
   Extension (e.g. tsearch2, postgis)
   Project Maturity  (e.g. alpha/beta/stable)
   Compatibility (e.g. extension 1.0 works with postgresql 8.2)
   Description   (e.g. full text search)
   URL
would be a partial fix.

 contrib is a horrible misnomer. Can we maybe bite the bullet and call it
 something else?

+1
How about plugins, extensions, or optional libraries?






Re: [HACKERS] [GENERAL] Autovacuum Improvements

2007-01-22 Thread Ron Mayer
Gregory Stark wrote:
 
 Actually no. A while back I did experiments to see how fast reading a file
 sequentially was compared to reading the same file sequentially but skipping
 x% of the blocks randomly. The results were surprising (to me) and depressing.
 The breakeven point was about 7%. [...]
 
 The theory online was that as long as you're reading one page from each disk
 track you're going to pay the same seek overhead as reading the entire track.

Could one take advantage of this observation in designing the DSM?

Instead of a separate bit representing every page, having each bit
represent 20 or so pages might be a more useful unit.  It sounds
like the time spent reading would be similar; while the bitmap
would be significantly smaller.



Re: [HACKERS] [GENERAL] Autovacuum Improvements

2007-01-17 Thread Ron Mayer
Matthew T. O'Connor wrote:
 Alvaro Herrera wrote:
 I'd like to hear other people's opinions on Darcy Buskermolen proposal
 to have a log table, on which we'd register what did we run, at what
 time, how long did it last, [...] 
 
 I think most people would just be happy if we could get autovacuum to
 log it's actions at a much higher log level.  I think that autovacuum
 vacuumed table x is important and shouldn't be all the way down at the
 debug level.

+1 here.  Even more than autovacuum vacuumed table x, I'd like to see
vacuum starting table x and vacuum done table x.  The reason I say
that is because the speculation autovacuum might have been running
is now a frequent phrase I hear when performance questions are asked.
If vacuum start and end times were logged at a more prominent level,
that feature plus log_min_duration_statement could easily disprove
the vacuum might have been running hypothesis.



Re: [HACKERS] Function execution costs 'n all that

2007-01-17 Thread Ron Mayer
Tom Lane wrote:
 
 BTW, I'm thinking that a cost constant probably ought to be measured
 in units of cpu_operator_cost.  The default for built-in functions would
 thus be 1, at least till such time as someone wants to refine the
 estimates.  We'd probably want the default for PL and SQL functions to
 be 10 or 100 or so.

Any chance that costs could eventually change to real-world units?

It's hard for me to guess how many cpu_operator_cost units
something might take; but relatively easy for me to measure
or estimate in fractions-of-a-seconds how long something takes.

I could imagine having the other planner costs be measured in seconds
too - perhaps with the goal of eventually writing some auto-config
code that tries to measure values like cpu_tuple_cost on a given
piece of hardware.




Re: [HACKERS] InitPostgres and flatfiles question

2007-01-05 Thread Ron Mayer
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
 What value is allowing multiple queries via PQexec()
 
 The only argument I can think of is that it allows applications to be
 sloppy about parsing a SQL script into individual commands before they
 send it.  (I think initdb may be guilty of exactly that BTW...)  At the
 same time you could argue that such sloppiness is inherently a Bad Idea.

Doesn't it also avoid some network(?) overhead when you have
a large number of small inserts or updates?

I seem to recall that at a previous company we got a major performance
improvement by concatenating a bunch of updates with ;s in between and sending
them to postgresql as a single command.



Re: [HACKERS] Load distributed checkpoint

2006-12-28 Thread Ron Mayer
Gregory Stark wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
 
 I have a new idea.  ...the BSD kernel...similar issue...to smooth writes:
 Linux has a more complex solution to this (of course) which has undergone a
 few generations over time. Older kernels had a user space daemon called
 bdflush which called an undocumented syscall every 5s. More recent ones have a
 kernel thread called pdflush. I think both have various mostly undocumented
 tuning knobs but neither makes any sort of guarantee about the amount of time
 a dirty buffer might live before being synced.

Earlier in this thread (around the 7th) was a discussion of
/proc/sys/vm/dirty_expire_centisecs and /proc/vm/dirty_writeback_centisecs
which seem to be the tunables that matter here.  Googling suggests that

  dirty_expire_centisecs specifies that data which has been dirty in memory
  for longer than this interval will be written out next time a pdflush
  daemon wakes up

and

  dirty_writeback_centisecs expresses the interval between those wakeups

It seems to me that the sum of the two times does determine the maximum
time before the kernel will start syncing a dirtied page.



Bottom line, though, is that it seems both postgresql and the OS are
trying to delay writes in the hope of collapsing them, and that the
actual delay is the sum of the OS's delay and postgresql's delay.  I
think Kevin Grittner's experimentation earlier in the thread did indeed
suggest that getting writes to the OS faster and letting it handle the
collapsing of the writes was an effective method of reducing painful
checkpoints.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PERFORM] EXPLAIN ANALYZE on 8.2

2006-12-15 Thread Ron

At 10:45 AM 12/15/2006, Tom Lane wrote:

Gregory Stark [EMAIL PROTECTED] writes:
 There are various attempts at providing better timing infrastructure at low
 overhead but I'm not sure what's out there currently. I expect to do this what
 we'll have to do is invent a pg_* abstraction that has various implementations
 on different architectures.

You've got to be kidding.  Surely it's glibc's responsibility, not ours,
to implement gettimeofday correctly for the hardware.

regards, tom lane


I agree with Tom on this.  Perhaps the best compromise is for the pg 
community to make thoughtful suggestions to the glibc community?
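
For what it's worth, if such a pg_* abstraction were ever wanted it could
presumably stay a very thin wrapper over what the platform provides; a
rough sketch (the function name is invented, and older glibc needs -lrt
for clock_gettime):

  #include <stdint.h>
  #include <time.h>
  #include <sys/time.h>

  /* Hypothetical pg_-style helper: elapsed microseconds from an arbitrary
   * starting point.  Uses the monotonic clock where the platform has one,
   * otherwise falls back to gettimeofday(). */
  static uint64_t pg_elapsed_usec(void)
  {
  #if defined(CLOCK_MONOTONIC)
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return (uint64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
  #else
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return (uint64_t) tv.tv_sec * 1000000 + tv.tv_usec;
  #endif
  }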


Ron Peacetree 



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Load distributed checkpoint

2006-12-11 Thread Ron Mayer
ITAGAKI Takahiro wrote:
 Kevin Grittner [EMAIL PROTECTED] wrote:
 
 ...the file system cache will collapse repeated writes to the
 same location until things settle ...
 If we just push dirty pages out to the OS as soon as possible, 
 and let the file system do its job, I think we're in better
 shape...
 
 Maybe we have two entirely different tuning approaches:
   1. Retain dirty buffers in database, and keep OS buffers clean.
   2. Keep database clean, and entrust OS to manage dirty buffers.
 
 I suggested the 1st one, and you did the 2nd. Bottle-neck in checkpoints
 vary in the approaches; write() will be worse in 1st, fsync() in 2nd.

The fsync() won't necessarily be worse in the second approach.  OS's have
quite a few tunable parameters that can encourage the system to physically
write the pending write()s before the fsync() - either in the background
or by the process doing the write() itself when there are too many
dirty pages.

For checkpoints, I think the main question is whether postgresql's
background writer is smarter or less smart than pdflush or the
equivalent on your system for database workloads.

 Also, database has own access-frequency information for its buffers,
 so I think 1st approach behaves better in handling re-dirty of buffers.

I'm curious what access-frequency info the OS and the database have.

One thing I do worry about is whether postgresql and the OS
are both delaying write()s in the hopes of collapsing them
at the same time.  If so, this would cause both to experience
bigger delays than expected, and make checkpoints worse.

I'm guessing if you use approach 1, you might be better off
turning down the amount of buffering that the OS does with
dirty pages - and if you use approach 2, you might be better
off turning down the amount of delays that postgresql adds.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Grouped Index Tuples

2006-12-11 Thread Ron Mayer
Jim C. Nasby wrote:
 On usage, ISTM it would be better to turn on GIT only for a clustered
 index and not the PK? I'm guessing your automatic case is intended for
 SERIAL PKs, but maybe it would be better to just make that explicit.

Not necessarily; since often (in my tables at least) the data for
some columns has some local grouping of similar values even though
it's not the clustered index.

The obvious example is addresses (city, state, street, etc may not appear
clustered - but if you cluster your table by zip code, GIT will work
well for them).

Other examples would be tables containing dates and product names - even
though there's no total monotonic ordering, if your product families
get refreshed every couple of years you'll find that one range of product
names shows up mostly in old years, and others in new years - so GIT
could prove useful there - despite not being a clustered index.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Proposal: syntax of operation with tsearch's configuration

2006-11-17 Thread Ron Mayer
Tom Lane wrote:
 Peter Eisentraut [EMAIL PROTECTED] writes:
 I don't see any comparable arguments about this full-text search stuff.  
 In particular I don't see any arguments why a change would necessary at 
 all, including why moving to core would be necessary in the first 
 place.
 
 AFAIR the only argument in favor of that is basically a marketing one:
 users perceive a feature as more real, or more supported, if it's in
 core.  I don't find this argument especially compelling myself.

On the flip side of that argument - the more non-SQL-standard pieces
are in core, the less real the pieces outside of core appear.
People seem to have little doubt regarding CPAN or Ruby Gems;
I believe that's largely because a lot of very
important and well supported functionality exists outside of their
core distributions.  The less that's pre-baked into core, I think
the more people will be aware of the rich set of extensions postgresql
enables.


From a marketing point of view (should I have moved this to .advocacy),
it seems to me the biggest problem is the name contrib.  If it were
called optional or advanced or extra I think it'd be seen less
suspiciously by hosting companies (who seem to have the biggest problem
with contrib) and we wouldn't need as many discussions of which contribs
to move into core.

  Ron M


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [SPAM?] Re: Asynchronous I/O Support

2006-10-24 Thread Ron Mayer
Zeugswetter Andreas ADI SD wrote:
 POSIX_FADV_WILLNEED definitely sounds very interesting, but:
 
 I think this interface was intended to hint larger areas (megabytes).
 But the wishful thinking was not to hint seq scans, but to advise
 single 8k pages.

Surely POSIX_FADV_SEQUENTIAL is the one intended to hint seq scans,
and POSIX_FADV_RANDOM to hint random access.  No?

ISTM, _WILLNEED seems just right for small random-access blocks.
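
As a rough sketch of what hinting a single 8k block might look like
(the helper name is invented):

  #define _XOPEN_SOURCE 600
  #include <sys/types.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>

  /* Hint that one 8kB block at 'offset' will be wanted soon, so the
   * kernel can start fetching it before the actual read() arrives. */
  static void hint_block(int fd, off_t offset)
  {
  #ifdef POSIX_FADV_WILLNEED
      int rc = posix_fadvise(fd, offset, 8192, POSIX_FADV_WILLNEED);
      if (rc != 0)
          fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
  #endif
  }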



Anyway, for those who want to see what they do in Linux,
  http://www.gelato.unsw.edu.au/lxr/source/mm/fadvise.c
Pretty scary that Bruce said it could make older linuxes
dump core - there isn't a lot of code there.


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Hints WAS: Index Tuning Features

2006-10-12 Thread Ron Mayer
Mark Woodward wrote:
 
 Exactly. IMHO, it is a frustrating environment. PostgreSQL is a great
 system, and while I completely respect the individuals involved, I think
 the management for lack of a better term, is difficult.

'course you're welcome to fork the project as well if your style
and/or priorities are different than the postgresql core team's.

If your approach is that much less frustrating, your project
would gain that much more momentum from developers joining you.

If more developers like your style and/or priorities, they'll
migrate to your project.  I think Bizgres, Mammoth, EnterpriseDB,
RedHat DB, and Gentoo's occasional bizarro patches are proof both
that it can work and that it's difficult.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Index Tuning Features

2006-10-11 Thread Ron Mayer
Andrew Sullivan wrote:
 Just because I'm one of those statistics true believers, what sort of
 information do you think it is possible for the DBA to take into
 consideration, when building a hint, that could not in principle be
 gathered efficiently by a statistics system?  It seems to me that
 you're claiming that DBAs can have magic knowledge.

Isn't one example the table of addresses clustered by zip code,
with indexes on State, City, County, etc.?

The current statistics systems at least see no correlation between
these fields (since the alphabetical ordering of cities and
numbering of postal codes is quite different).   This makes the
planner under-use the indexes because it sees no correlation and
overestimates the number of pages read and the random accesses
needed.

However since San Francisco, CA data happens to be tightly packed
on a few pages (since it shares the same few zip codes), few
pages are needed and mostly sequential access could be used
when querying SF data -- though the optimizer guesses most pages
in the table may be hit, so often ignores the indexes.


Now I'm not saying that a more advanced statistics system
couldn't one day be written that sees these patterns in the
data -- but it doesn't seem likely in the near term.  DBA-based
hints could be a useful interim work-around.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Fwd: Is the fsync() fake on FreeBSD6.1?

2006-09-24 Thread Ron Mayer
Andrew - Supernews wrote:
 
 Whether the underlying device lies about the write completion is another
 matter. All current SCSI disks have WCE enabled by default, which means
 that they will lie about write completion if FUA was not set in the
 request, which FreeBSD never sets. (It's not possible to get correct
 results by having fsync() somehow selectively set FUA, because that would
 leave previously-completed requests in the cache.)
 
 WCE can be disabled on either a temporary or permanent basis by changing
 the appropriate modepage. It's possible that Linux does this automatically,
 or sets FUA on all writes, though that would surprise me considerably;
 however I disclaim any knowledge of Linux internals.


The Linux SATA driver author Jeff Garzik suggests [note 1] that
The ability of a filesystem or fsync(2) to cause a [FLUSH|SYNC] CACHE
 command to be generated has only been present in the most recent [as of
 mid 2005] 2.6.x kernels.  See the write barrier stuff that people
 have been discussing.  Furthermore, read-after-write implies nothing
 at all.  The only way to you can be assured that your data has hit
 the platter is
   (1) issuing [FLUSH|SYNC] CACHE, or
   (2) using FUA-style disk commands
 It sounds like your test (or reasoning) is invalid.



Before those mid-2005 2.6.x kernels, apparently fsync on Linux didn't
really try to flush drive caches even when drives supported it (which
apparently most actually do if the requests are actually sent).

[note 1] http://lkml.org/lkml/2005/5/15/82

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Optimizer improvements: to do or not to do?

2006-09-13 Thread Ron Mayer
Simon Riggs wrote:
 On Mon, 2006-09-11 at 06:20 -0700, Say42 wrote:
 That's what I want to do:
 1. Replace not very useful indexCorrelation with indexClustering.
 
 An opinion such as not very useful isn't considered sufficient
 explanation or justification for a change around here.

"Not sufficient for some types of data" would have been more fair.

I speculate that a new additional stat of
  average # of unique values for a column within a block
would go a long way to helping my worst queries.


It's common here for the planner to vastly overestimate the
number of pages that would need to be read because
postgresql's guess at the correlation is practically 0
despite the fact that the distinct values for any given
column are closely packed on a few pages.

Our biggest tables (180G or so) are mostly spatial data with columns
like City, State, Zip, County, Street, School District, Police
Beat, lat/long, etc.; and we cluster the table on (zip, street).

Note that practically all the rows for any single value of any
of the columns will lie in the same few blocks.  However, the
calculated correlation is low because the total ordering
of the other values doesn't match that of zip codes.  This
makes the optimizer vastly overestimate the cost of index
scans because it guesses that most of the table will need
to be read, even though in reality just a few pages are needed.


If someone does look at the correlation calculations, I hope
this type of data gets considered as well.


I speculate that a new stat of
  average # of unique values for a column within a block
could be useful here in addition to correlation.  For almost
all the columns in my big table, this stat would be 1 or 2,
which I think would be a useful hint that despite a low
correlation, the distinct values are indeed packed together
in blocks.   That way the optimizer can see that a
smaller number of pages would need to be accessed than
correlation alone would suggest.

Does this make sense, or am I missing something?

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Optimizer improvements: to do or not to do?

2006-09-13 Thread Ron Mayer
Gregory Stark wrote:
 Ron Mayer [EMAIL PROTECTED] writes:
 ...vastly overestimate the number of pages .. because postgresql's guess 
 at the correlation being practically 0 despite the fact that the distinct 
 values for any given column are closely packed on a few pages.
 
 I think we need a serious statistics jock to pipe up with some standard
 metrics that do what we need. Otherwise we'll never have a solid footing for
 the predictions we make and will never know how much we can trust them.

Do we know if any such people participate/lurk on this list, or
if the conversation should go elsewhere?

 That said I'm now going to do exactly what I just said we should stop doing
 and brain storm about an ad-hoc metric that might help:
 
 I wonder if what we need is something like: sort the sampled values by value
 and count up the average number of distinct blocks per value. That might let
 us predict how many pages a fetch of a specific value would retrieve. Or
 perhaps we need a second histogram where the quantities are of distinct pages
 rather than total records.

Either of these sounds like it might be an improvement over correlation
itself for estimating the number of pages it'd need to read.  Would it be
relatively easy or hard for a programmer not too familiar with the code
to experiment with these ideas?  Where would be a good place to look?

 We might also need a separate average number of n-block spans per value
 metric to predict how sequential the i/o will be in addition to how many pages
 will be fetched.

I'm wildly guessing that the # of pages itself is
a bigger factor than the sequential/random nature.  For example,
if I do a query for data from a particular small city, I'd only
need dozens of pages, not many thousands.

OTOH, it'd be neat to know if this were true.  Is there any
good way to make something like explain analyze show both
the expected and actual # of pages and # of seeks?

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] 8.2 features status

2006-08-05 Thread Ron Mayer

Tom Lane wrote:

But a quick troll through the CVS logs shows ...

multi-row VALUES, not only for INSERT but everywhere SELECT is allowed ...
multi-argument aggregates, including SQL2003-standard statistical aggregates ...
standard_conforming_strings can be turned on (HUGE deal for some people) ...
support SQL-compliant row comparisons; they can be indexscan quals


ISTM this could be spun as a standards-focused release as well (at least
partial implementations of a number of optional(?) SQL2003 features).

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] 8.2 features status

2006-08-05 Thread Ron Mayer

Tom Lane wrote:

I tend to agree --- I don't see much value in trying to institute a
formalized process.


One more problem: a formalized process of claiming features
in advance may stop what I suspect is a significant source of
contributions -- people who add features/patches for internal
work in their company and only after the fact find that they
are something they'd contribute back.

The small contribution I made (to help admins know when FSM
settings were too low by monitoring log files instead
of manual checks[1]) was done because we wanted it internally.

Only after it proved useful to us, it was mentioned to the lists.

Thanks in part to the BSD nature of postgresql, I suspect
there are many internal-and-not-yet-released useful patches
lurking around in industry.  If I'm right, I'd wonder what
the advocacy guys could do to get corporations to volunteer
to contribute changes back that they've found useful
internally.



We have not had that many cases where lack of
communication was a problem.


One could say too much communication was the problem this time.

I get the impression people implied they'd do something on a TODO
and didn't.  Arguably the project would have been better off if no one
had claimed the TODO, so if another company/team/whatever needed
the feature badly, they could have worked on it themselves rather
than waiting in hope of the feature.   Of course they could have
done this anyway - but if they see it on an implied roadmap document
for the next release they're more likely to wait.

   Ron

[1] http://archives.postgresql.org/pgsql-patches/2005-02/msg00171.php

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] 8.2 features status

2006-08-05 Thread Ron Mayer

[EMAIL PROTECTED] wrote:

Ron Mayer wrote:

We have not had that many cases where lack of
communication was a problem.

One could say too much communication was the problem this time.

I get the impression people implied they'd do something on a TODO
and didn't.  Arguably the project would have been better off if no one
had claimed the TODO, so if another company/team/whatever needed
the feature badly, they could have worked on it themselves rather
than waiting in hope of the feature.


This is just perverse. Surely you are not seriously suggesting that we
should all develop in secret and then spring miracles fully grown on the
community?


Of course not.   What I'm suggesting is two things.

(1) That misleading information is worse than no information; and
that speculative information next to TODOs can do as much harm
discouraging others as the good it does for communication.  Perhaps
a name/assignment/claim on a todo might be nice if someone wanted a
private conversation with someone who knows about a feature; but
even there wouldn't a public discussion on the lists likely be better?

(2) That much corporate development on BSD projects is indeed
done in secret, although it may end up being contributed later -
either because the company decides it's no longer a trade secret
or gets tired of maintaining its own fork.   Sure, such patches
might need even more discussion and revision than if they were
designed with core - but I think it's a reality that such work
exists.


We have bumped patches before because they have done things
without discussing them, and were found not to be acceptable. The more
complex features get, the more communication is needed.


Agreed, of course.  This makes me think that ongoing discussion
on hackers & patches is the only way to judge progress on a
todo; and anything like assigned names & estimated dates & releases
are less likely to be meaningful than what one could infer from
discussions on the lists.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] O_NOATIME

2006-08-03 Thread Ron Mayer
Tom Lane wrote:
 Ron Mayer [EMAIL PROTECTED] writes:
 Would people be interested in a trivial patch that adds O_NOATIME
 to open() for platforms that support it?  (apparently Linux 2.6.8
 and better).
 
 Isn't that usually, and more portably, handled in the filesystem
 mount options?

Yes to both.  I could imagine that for small systems/workstations
you might have some files that want access time, and others that
wanted NOATIME -- it seems the new flag lets you choose on a
file-by-file basis.

That's why I asked.  I imagine it won't help on any well-administered
production server since they'd probably mount the whole filesystem
that way; but might help a bit on out-of-the-box-default-config
benchmarks done by naive users who don't tweak filesystem settings.

Don't know if we'd care about such an audience or not.
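
A minimal sketch of the sort of open() wrapper I had in mind (the helper
name is invented; note that O_NOATIME is also refused with EPERM unless
the caller owns the file, hence the retry):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <errno.h>

  /* Open a data file without updating its access time where the platform
   * supports it; fall back silently everywhere else. */
  static int open_noatime(const char *path)
  {
      int fd;
  #ifdef O_NOATIME
      fd = open(path, O_RDWR | O_NOATIME);
      if (fd < 0 && errno == EPERM)
          fd = open(path, O_RDWR);
  #else
      fd = open(path, O_RDWR);
  #endif
      return fd;
  }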

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] O_NOATIME

2006-08-03 Thread Ron Mayer
Tom Lane wrote:
 Ron Mayer [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Isn't that usually, and more portably, handled in the filesystem
 mount options?
 
 Yes to both.  I could imagine that for small systems/workstations
 you might have some files that want access time, and others that
 wanted NOATIME -- it seems the new flag lets you choose on a
 file-by-file bases.
 
 Personally, if I were an admin who wanted access times, I'd regard
 the existence of such a flag as a security hole.

I'm not sure I see that.  I'd have thought since postgresql
already caches stuff in shared buffers, the atime of a postgresql
file isn't reliable anyway; and outside of postgresql O_NOATIME
doesn't seem to me to affect admins any worse than the existence of utime().


OTOH, I'm not going to argue for the patch either.  I think it'd
be fair to say adding a linuxism and only benefiting novice/casual
users isn't that exciting.


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] On-disk bitmap index patch

2006-08-02 Thread Ron Mayer
Tom Lane wrote:
 Both of these pages say up front that they are considering read-only
 data.  

Can I assume read-mostly partitions could use the read-I/O-efficient
indexes while update-intensive partitions of the same
table could use b-tree indexes?

All of my larger (90GB+) tables can have partitions that are either
almost read-only (spatial data updated one state/country at
a time, about quarterly) or are totally read-only (previous
months and years historical data).

 So one of the questions that has to be answered (and the
 submitters have been entirely mum about) is exactly how bad is the
 update performance?  If it's really awful that's going to constrain
 the use cases quite a lot, whereas merely a bit slower than btree
 wouldn't be such a problem.

Once we have easy-to-use partitioning, would it be the case that
many larger tables will have near-read-only partitions that could
use I/O friendlier indexes (GIN?  Hash? Bitmap?)?  Any examples
of a large data set that can't be partitioned into hot areas and
read-mostly areas?


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] GUC with units, details

2006-07-27 Thread Ron Mayer

Peter Eisentraut wrote:

Jim Nasby wrote:

The truth is, virtually no one, even highly technical people, ever
picks nits between kB vs KiB vs KB.


The question isn't so much whether to allow KiB and such -- that would 
obviously be trivial.  The question is whether we want to have kB mean 
1000 bytes instead of 1024 bytes.


Would it satisfy everyone if the documentation states that
specifying a value of N kB means that *at least* N*1000 bytes
are allocated; and perhaps even documenting that in the current
implementation it happens to be 24 extra bytes?

In my mind, that goes against current practice.  The only argument 
raised in favor was that international standards require such use.  I'm 
as much a fan of measurement standards as anyone, but I'm also a 
practitioner of current practice.


With the spec reading "at least N kB", even the most pedantic
spec reader can't complain, because it is true.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] On-disk bitmap index patch

2006-07-25 Thread Ron Mayer

Tom Lane wrote:

[EMAIL PROTECTED] writes:

Reading 1/4, for a larger table, has a good chance of being faster than
reading 4/4 of the table. :-)


Really?

If you have to hit one tuple out of four, it's pretty much guaranteed
that you will need to fetch every heap page.  


I think it's not uncommon to have data that is more clustered than expected.

In my case it shows up often in GIS data; where often a query
that only accesses 1/4 of the rows in the table really will
only access about 1/4 of the pages in the table -- for example
when analyzing Texas or the Southwest US.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Units in postgresql.conf

2006-07-20 Thread Ron Mayer

Peter Eisentraut wrote:
I think it would be useful to allow units to be added to these settings, for 
example...

shared_buffers = 512MB
which is a bit cumbersome to calculate right now (you'd need = 65536).

I haven't thought yet how to parse or implement this, but would people find 
this useful?


Would this extend to things like random_page_cost and similar?

If the random_page_cost were specifiable in seconds or ms it might be easier
to someday write a program to measure such values on particular hardware
platforms.   (though I guess for that to work, the config file would also
need to add the reference cost (is it a non-random page access) as well...)
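
And on the parsing side of Peter's proposal, a rough sketch of how a
memory-unit suffix might be handled (the helper name, the 1024-based
multipliers, and treating a bare number as 8kB pages are all just
illustrative assumptions):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Parse strings like "512MB", "8kB" or "16384" into a number of bytes.
   * Returns -1 on an unrecognized suffix. */
  static long long parse_memory_setting(const char *val)
  {
      char *end;
      long long n = strtoll(val, &end, 10);

      while (*end == ' ')
          end++;

      if (*end == '\0')
          return n * 8192;             /* bare number: assume 8kB pages */
      if (strcmp(end, "kB") == 0)
          return n * 1024LL;
      if (strcmp(end, "MB") == 0)
          return n * 1024LL * 1024;
      if (strcmp(end, "GB") == 0)
          return n * 1024LL * 1024 * 1024;
      return -1;
  }

  int main(void)
  {
      printf("%lld\n", parse_memory_setting("512MB"));   /* 536870912 */
      return 0;
  }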


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] plPHP and plRuby

2006-07-19 Thread Ron Mayer

Peter Eisentraut wrote:

Hannu Krosing wrote:

So we would have
src/pl/pljava/README.TXT

and anybody looking for pl-s would find the info in a logical place


Right.  When was the last time any user looked under src/pl in the first 
place?  Or even under src?  If you're looking for pljava, it's the 
first hit in Google.


The difference is that I will have reasonable confidence that
the README.TXT under src/pl will give instructions that match
the version of PostgreSQL that I have.   I assume that README
will call out the version of PL/R or PL/Ruby that I want that
was tested with the release of PostgreSQL I have.

The first hit on Google will probably give me the most
recently blogged about version; which does nothing to help
me find what I need.

The organization of the source code is controlled by exactly two 
factors:

2. convenience of development


I thought convenience of development included addressing
the problem that PLs are annoyingly deeply tied to specific
versions of Core.

I would imagine with this README.TXT proposal, it's the responsibility
of the PL/XXX developer to port their PL to PostgreSQL during the Beta,
and if they did and tested it, the release will point to the version
of the PL supported by the PL maintainer for that version.   If they
don't do this testing during the beta, the README.TXT may merely say
the PL/Haskell team did not complete testing during the 8.2 beta; so
good luck.

This aids the convenience of development of PostgreSQL and the PLs
because it defines the process and responsibility for integration
testing the PLs with the Core releases; and puts some pressure on
synchronizing the releases.


Anything else is between you and your packager.

And if that didn't convince you, I still got PL/sh in the wait ...


With which versions of PostgreSQL is this version of PL/sh supported?
I see that PL/sh on http://pgfoundry.org/projects/plsh/ is version 1.1?
Does that mean it goes with PostgreSQL 1.1?   The Projects page
for PL/SH (http://plsh.projects.postgresql.org/) suggests it was
last modified in May 2005 - does that mean I need the May 2005 backend
of PostgreSQL to compile it?  Oh.  The download page says older
releases are also supported.  Does that include 7.1?

Putting this info in the README.TXT is one way to help users
know what's supported.   If I saw a README.TXT for pl/sh I'd
have some confidence I'd be directly pointed to the version I need.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] plPHP and plRuby

2006-07-19 Thread Ron Mayer

Tom Lane wrote:

The difference is that I will have reasonable confidence that
the README.TXT under src/pl will give instructions that match
the version of PostgreSQL that I have.   I assume that README
will call out the version of PL/R or PL/Ruby that I want that
was tested with the release of PostgreSQL I have.


On what do you base that assumption?  A README file laying about in an
otherwise unused part of the source tree is the very definition of out
of sight, out of mind.  


I was hoping it would say something like

  PostgreSQL 8.2.0 has been tested with version XX.YY.01 of PL/Whatever
  You can install it by getting that release and doing the following.

with specific version numbers rather than links to URLs that would change.

If that wasn't the intent of the README.TXT, I'm not sure what is.


I can pretty much guarantee you that it will
NOT get updated, especially not during minor releases.  Even if it is up
to date at the instant we put out a release, it'll be obsolete as soon
as the external project makes an update release.  ISTM links like this
are far better kept on project websites ...


I was hoping that this README.TXT would point to the specific old version
that was tested, in much the same way that the old 8.0.0 source tree
continues to have the same PL/pgsql that has always been there.

If the external project updates their release and breaks compatibility,
I think they should be encouraged to update the README.TXT to say
something like
  PostgreSQL 8.2.1 has been tested with PL/Whatever version XX.YY.99
If they don't make that update, the README would still say
  PostgreSQL 8.2.0 has been tested with PL/Whatever version XX.YY.00



I would imagine with this README.TXT proposal, it's the responsibility
of the PL/XXX developer to port their PL to PostgreSQL during the Beta,
and if they did and tested it, the release will point to the version
of the PL supported by the PL maintainer for that version.


And if they didn't?  I was just noticing that the current release of
plruby contains installation instructions that appear to date to 7.3.
If he can't be bothered to update his own docs, what are the odds that
he'll submit timely updates for a README in the main source tree?


Yeah.  Good point.   I guess the alternatives are that the README
still say
  PostgreSQL 7.3.0 has been tested with PL/Ruby version X.Y.Z
or
  We are unaware of up-to-date instructions for PL/Ruby.  Good Luck.
Though if you'd welcome people in the community to submit patches
to that README, I suspect they'd be updated even more regularly
than 2002 or whenever 7.3 came out.

If I spent time figuring it out, I wouldn't mind submitting a patch
for such a README; and I suspect the other guys who blogged about
PL/Ruby installation instructions in late 2005 would be happy to
submit such patches as well.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] Running a query twice to ensure cached results.

2006-06-08 Thread Ron Mayer

Tom Lane wrote:

-- do it again to ensure fully cached
bench=# select count(*) from accounts;


Short summary:

 Does running a query only twice really ensure that a result is cached?
 It seems not to be the case for seq-scans on Linux.

 I think this may matter to the discussions about a readahead
 thread/process that come up on this list occasionally.

Experimental results here suggest that for larger tables Linux seems
to detect a seq-scan and not bother caching.   It's very reproducible
for me here to do a reboot and not see the full speedup on a seq_scan
until the third time I run a query.  The example below [1] shows
that the third run of a query is faster than the second run.  The
output of a 'vmstat 5' [2] while these queries were happening agrees
that significant I/O was still happening on the second run, but
no I/O happened the third time.   The table comfortably fits in
memory (700MB table on a 2GB ram machine) and the machine was
otherwise idle so no one else wiped out the cache between the
first and second runs.

Why do I think this is worth mentioning here?
  * I think it impacts the occasional thread about wanting
to include logic in postgresql for readahead [3] or for
the threads suggesting hinting to the OS through madvise
or similar to avoid caching seq-scans.   It seems that
Linux is detecting and at least somewhat reacting
to seq scans even with no hinting.  Anything added
to postgresql might end up being a duplicated effort.
I think Bruce suggested that Solaris does this free-behind
automatically [4], but this is the first I've noticed
that Linux seems to do similar.

  * I think it matters to people who post explain analyze
twice without running it often enough to get stable results.
(I note that this was not a problem for Tom since the
timing of his first and second runs were the same, so
I assume he was just saying that he observed that the
query was cached rather than that the first run forced
the second run to be cached.)

Ron


=
== [note 1] the repeated queries showing the speedup after 3 runs.
== Running the same select count(*) 4 times after a clean reboot.
== Seems the OS's caching logic decided that the first seq_scan
== wasn't 'interesting' enough
=
fli=# select count(*) from facets_s;
  count
--
 15976558
(1 row)

Time: 29788.047 ms
fli=# select count(*) from facets_s;
  count
--
 15976558
(1 row)

Time: 19344.573 ms
fli=# select count(*) from facets_s;
  count
--
 15976558
(1 row)

Time: 13411.272 ms
fli=# select count(*) from facets_s;
  count
--
 15976558
(1 row)

Time: 13107.856 ms


# [note 2] vmstat 5 while the above queries were being run


procs ---memory-- ---swap-- -io --system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   incs us sy id wa
 1  1140  62140  71256 713360004731   9284  7  1 92  0
*** the first time
 1  0140  50860  31912 80830402 2521529 1147  2612 49 15  0 36
 1  0360  54420  2 85524000 23934 7 1139  2553 47 14  0 39
 0  1360  54008  11100 87870800 2370425 1149  2467 46 12  0 41
 0  1360  52512  11140 89659200 24062 6 1135  2460 47 11  0 41
*** the second time
 0  0360  52688  11172 90691600 1335719 1085  1989 31  7 38 24
 1  0360  53976  11076 9125400   44 1427357 1113  2102 32  7 29 32
 2  0360  54788  10908 92378800 2450954 1171  2474 46 12  0 42
 1  0360  54944   3096 93994800 1118039 1093  1976 65 13  0 22
*** the third time
 3  0360  54280   3872 94050800   26414 1041  1560 85 15  0  0
 1  0360  53852   3904 940940008829 1022  1505 53  9 36  2
 2  0360  51616   4052 94306800   44354 1037  1552 82 15  0  4
 1  0360  51488   4060 9431800022 2 1013  1522 84 16  0  0

#
[3] http://archives.postgresql.org/pgsql-hackers/2005-11/msg01449.php
[4] http://archives.postgresql.org/pgsql-performance/2003-10/msg00188.php

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-06 Thread Ron Mayer

Tom Lane wrote:


One objection to this is that after moving off the gold standard of
1.0 = one page fetch, there is no longer any clear meaning to the
cost estimate units; you're faced with the fact that they're just an
arbitrary scale.  I'm not sure that's such a bad thing, though.


It seems to me the appropriate gold standard is Time, in microseconds
or milliseconds.

The default postgresql.conf can come with a set of hardcoded
values that reasonably approximate some real-world system; and
if that's documented in the file someone reading it can say
 hey, my CPU's about the same but my disk subsystem is much
 faster, so I know in which direction to change things.
And another person may say ooh, now I know that my 4GHz
machines should have about twice the number here as my 2GHz
box.

For people who *really* care a lot (HW vendors?), they could
eventually make measurements on their systems.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Ron Mayer

Jim C. Nasby wrote:


There's an fadvise that tells the OS to compress the data if it actually
makes it to disk?


Compressed-filesystem extensions (like e2compr, and I think either
FAT or NTFS) can do that.

I think the reasons against adding this feature to postgresql are
largely the same as the reasons why compressed filesystems aren't
very popular.

Has anyone tried running postgresql on a compressing file-system?
I'd expect the penalties to outweigh the benefits (or they'd be
more common); but if it gives impressive results, it might add
weight to this feature idea.


  Ron M

I think the real reason Oracle and others practically re-wrote
their own VM-system and filesystems is that at the time it was
important for them to run under Windows98; where it was rather
easy to write better filesystems than your customer's OS was
bundled with.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Automatic free space map filling

2006-03-04 Thread Ron Mayer

Jim C. Nasby wrote:

... how many pages per bit ...



Are we trying to set up a complex solution to a problem
that'll be mostly moot once partitioning is easier and
partitioned tables are common?

In many cases I can think of the bulk of the data would be in
old partitions that are practically never written to (so would
need no vacuuming and could always use index-only lookups);
while the hot parts of large tables would be on partitions
that would need frequent vacuuming and wouldn't benefit
from index-only lookups.

In these cases, 1 bit per partition would work well,
and seems a lot easier to keep track of than bits-per-page.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


[HACKERS] Looking for a tool to * pg tables as ERDs

2006-02-23 Thread Ron Peacetree
Where * == 
{print | save to PDF | save to mumble format | display on screen}

Anyone know of one?

TiA
Ron

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-18 Thread Ron

At 08:37 PM 2/15/2006, Dann Corbit wrote:

Adding some randomness to the selection of the pivot is a known 
technique to fix the oddball partitions problem.


True, but it makes QuickSort slower than say MergeSort because of the 
expense of the PRNG being called ~O(lgN) times during a sort.



However, Bentley and Sedgewick proved that every quick sort 
algorithm has some input set that makes it go quadratic


Yep.  OTOH, that input set can be so specific and so unusual as to 
require astronomically unlikely bad luck or hostile hacking in order 
for it to actually occur.



 (hence the recent popularity of introspective sort, which switches 
to heapsort if quadratic behavior is detected.  The C++ template I 
submitted was an example of introspective sort, but PostgreSQL does 
not use C++ so it was not helpful).
...and there are other QuickSort+Other hybrids that address the issue 
as well.  MergeSort, RadixExchangeSort, and BucketSort all come to 
mind.  See Gonnet and Baeza-Yates, etc.




Here are some cases known to make qsort go quadratic:
1. Data already sorted


Only if one element is used to choose the pivot; _and_ only if the 
pivot is the first or last element of each pass.
Even just always using the middle element as the pivot avoids this 
problem.  See Sedgewick or Knuth.
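
For reference, the textbook median-of-three choice looks something like
this (shown here for plain ints; the array layout is illustrative):

  #include <stddef.h>

  /* Return the index of the median of a[lo], a[mid], a[hi]; using it as
   * the pivot keeps already-sorted and reverse-sorted input at O(N lg N). */
  static size_t median_of_three(const int a[], size_t lo, size_t hi)
  {
      size_t mid = lo + (hi - lo) / 2;

      if (a[mid] < a[lo])
      {
          if (a[hi] < a[mid])                /* hi < mid < lo */
              return mid;
          return (a[hi] < a[lo]) ? hi : lo;  /* mid is smallest */
      }
      else
      {
          if (a[hi] < a[lo])                 /* hi < lo <= mid */
              return lo;
          return (a[hi] < a[mid]) ? hi : mid; /* lo is smallest */
      }
  }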




2. Data reverse sorted


Ditto above.



3. Data organ-pipe sorted or ramp


Not sure what this means?  Regardless, median of n partitioning that 
includes samples from each of the 1st 1/3, 2nd 1/3, and final 3rd of 
the data is usually enough to guarantee O(NlgN) behavior unless the 
_specific_ distribution known to be pessimal to that sampling 
algorithm is encountered.  The only times I've ever seen it ITRW was 
as a result of hostile activity: purposely arranging the data in such 
a manner is essentially a DoS attack.




4. Almost all data of the same value


Well known fixes to inner loop available to avoid this problem.


There are probably other cases.  Randomizing the pivot helps some, 
as does check for in-order or reverse order partitions.
Randomizing the choice of pivot essentially guarantees O(NlgN) 
behavior no matter what the distribution of the data at the price of 
increasing the cost of each pass by a constant factor (the generation 
of a random number or numbers).



In sum, QuickSort gets all sorts of bad press that is far more FUD 
than fact ITRW.
Ron.  




---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Ron

At 04:24 AM 2/17/2006, Ragnar wrote:

On fös, 2006-02-17 at 01:20 -0500, Ron wrote:

 OK, so here's _a_ way (there are others) to obtain a mapping such that
   if a < b then f(a) < f(b) and
   if a == b then f(a) == f(b)

 By scanning the table once, we can map say 001h (Hex used to ease
 typing) to the row with the minimum value and 111h to the row
 with the maximum value as well as mapping everything in between to
 their appropriate keys.  That same scan can be used to assign a
 pointer to each record's location.

This step is just as expensive as the original 
sort you want to replace/improve.


Why do you think that?  External sorts involve 
the equivalent of multiple scans of the table to 
be sorted, sometimes more than lgN (where N is 
the number of items in the table to be 
sorted).  Since this is physical IO we are 
talking about, each scan is very expensive, and 
therefore 1 scan is going to take considerably 
less time than >= lgN scans will be.



If you want to keep this mapping saved as a sort 
of an index, or as part ot each row data, this 
will make the cost of inserts and updates enormous.


Not sure you've got this right either.  Looks to 
me like we are adding a <= 32b quantity to each 
row.  Once we know the mapping, incrementally 
updating it upon insert or update would seem to 
be simple matter of a fast search for the correct 
ranking [Interpolation search, which we have all 
the needed data for, is O(lglgN).  Hash based 
search is O(1)]; plus an increment/decrement of 
the key values greater/less than the key value of 
the row being inserted / updated.  Given than we 
are updating all the keys in a specific range 
within a tree structure, that update can be done 
in O(lgm) (where m is the number of records affected).



  We can now sort the key+pointer pairs instead of the actual data and
 use an optional final pass to rearrange the actual rows if we wish.

How are you suggesting this mapping be accessed? 
If the mapping is kept separate from the tuple 
data, as in an index, then how will you look up the key?
???  We've effectively created a data set where 
each record is a pointer to a DB row plus its 
key.  We can now sort the data set by key and 
then do an optional final pass to rearrange the 
actual DB rows if we so wish.  Since that final 
pass is very expensive, it is good that not all 
use scenarios will need that final pass.


The amount of storage required to sort this 
representation of the table rather than the 
actual table is so much less that it turns an 
external sorting problem into a internal sorting 
problem with an optional final pass that is =1= 
scan (albeit one scan with a lot of seeks and 
data movement).  This is a big win.  It is a 
variation of a well known technique.  See Sedgewick, Knuth, etc.
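
A bare-bones sketch of that key+pointer representation (the struct and
names are invented): sort the small fixed-size entries and leave the
wide rows alone until the optional final pass.

  #include <stdlib.h>
  #include <stdint.h>

  typedef struct
  {
      uint32_t key;     /* rank / mapped value computed in the initial scan */
      void    *row;     /* pointer to the (much larger) row it describes    */
  } SortEntry;

  static int cmp_entry(const void *a, const void *b)
  {
      uint32_t ka = ((const SortEntry *) a)->key;
      uint32_t kb = ((const SortEntry *) b)->key;
      return (ka > kb) - (ka < kb);
  }

  /* Sorting n of these small entries stays in RAM long after sorting
   * the 2-4KB rows themselves would have spilled to disk. */
  static void sort_entries(SortEntry *entries, size_t n)
  {
      qsort(entries, n, sizeof(SortEntry), cmp_entry);
  }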




 That initial scan to set up the keys is expensive, but if we wish
 that cost can be amortized over the life of the table so we don't
 have to pay it all at once.  In addition, once we have created those
 keys, then can be saved for later searches and sorts.

What is the use case where this would work better than a
regular btree index ?
Again, ???  btree indexes address different 
issues.  They do not in any way help create a 
compact data representation of the original data 
that saves enough space so as to turn an external 
ranking or sorting problem into an internal one.



Ron 




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Ron

At 05:19 AM 2/17/2006, Markus Schaber wrote:

Hi, Ron,

Ron schrieb:

 OK, so here's _a_ way (there are others) to obtain a mapping such that
  if a < b then f(a) < f(b) and
  if a == b then f(a) == f(b)

 Pretend each row is a integer of row size (so a 2KB row becomes a 16Kb
 integer; a 4KB row becomes a 32Kb integer; etc)
 Since even a 1TB table made of such rows can only have 256M - 512M
 possible values even if each row is unique, a 28b or 29b key is large
 enough to represent each row's value and relative rank compared to all
 of the others even if all row values are unique.

 By scanning the table once, we can map say 001h (Hex used to ease
 typing) to the row with the minimum value and 111h to the row with
 the maximum value as well as mapping everything in between to their
 appropriate keys.  That same scan can be used to assign a pointer to
 each record's location.

But with a single linear scan, this cannot be accomplished, as the table
contents are neither sorted nor distributed linearly between the minimum
and the maximum.
So what?  We are talking about key assignment here, not anything that 
requires physically manipulating the actual DB rows.

One physical IO pass should be all that's needed.



For this mapping, you need a full table sort.
One physical IO pass should be all that's needed.  However, let's 
pretend you are correct and that we do need to sort the table to get 
the key mapping.  Even so, we would only need to do it =once= and 
then we would be able to use and incrementally update the results 
forever afterward.  Even under this assumption, one external sort to 
save all subsequent such sorts seems well worth it.


IOW, even if I'm wrong about the initial cost to do this; it is still 
worth doing ;-)




 That initial scan to set up the keys is expensive, but if we wish that
 cost can be amortized over the life of the table so we don't have to pay
 it all at once.  In addition, once we have created those keys, then can
 be saved for later searches and sorts.

But for every update or insert, you have to resort the keys, which is
_very_ expensive as it basically needs to update a huge part of the table.


??? You do not need to resort already ordered data to insert a new 
element into it such that the data stays ordered!  Once we have done 
the key ordering operation once, we should not ever need to do it 
again on the original data.  Else most sorting algorithms wouldn't work ;-)



Ron 




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Ron

At 10:53 AM 2/17/2006, Martijn van Oosterhout wrote:

On Fri, Feb 17, 2006 at 08:23:40AM -0500, Ron wrote:
 For this mapping, you need a full table sort.
 One physical IO pass should be all that's needed.  However, let's
 pretend you are correct and that we do need to sort the table to get
 the key mapping.  Even so, we would only need to do it =once= and
 then we would be able to use and incrementally update the results
 forever afterward.  Even under this assumption, one external sort to
 save all subsequent such sorts seems well worth it.

 IOW, even if I'm wrong about the initial cost to do this; it is still
 worth doing ;-)

I think you're talking about something different here. You're thinking
of having the whole table sorted and when you add a new value you add
it in such a way to keep it sorted. The problem is, what do you sort it
by? If you've sorted the table by col1, then when the user does ORDER
BY col2 it's useless.

No, I'm thinking about how to represent DB row data in such a way that
a= we use a compact enough representation that we can sort internally 
rather than externally.
b= we do the sort once and avoid most of the overhead upon subsequent 
similar requests.


I used the example of sorting on the entire row to show that the 
approach works even when the original record being sorted by is very large.
All my previous comments on this topic hold for the case where we are 
sorting on only part of a row as well.


If all you are doing is sorting on a column or a few columns, what 
I'm discussing is even easier since treating the columns actually 
being used as sort criteria as integers rather than the whole row as 
an atomic unit eats less resources during the key creation and 
mapping process.  If the row is 2-4KB in size, but we only care about 
some combination of columns that only takes on <= 2^8 or <= 2^16 
different values, then what I've described will be even better than 
the original example I gave.


Basically, the range of a key is going to be restricted by how
a= big the field is that represents the key (columns and such are 
usually kept narrow for performance reasons) or
b= big each row is (the more space each row takes, the fewer rows fit 
into any given amount of storage)

c= many rows there are in the table
Between these conditions, the range of a key tends to be severely 
restricted and therefore use much less space than sorting the actual 
DB records would.  ...and that gives us something we can take advantage of.




Indeed, this is what btrees do, you store the order of the table
seperate from the data. And you can store multiple orders. But even
then, when someone does ORDER BY lower(col1), it's still useless.

And you're right, we still need to do the single mass sort in the
beginning, which is precisely what we're trying to optimise here.

Sigh.  My points were:
1= we have information available to us that allows us to map the rows 
in such a way as to turn most external sorts into internal sorts, 
thereby avoiding the entire external sorting problem in those 
cases.  This is a huge performance improvement.


2= that an external sort is =NOT= required for initial key 
assignment, but even if it was it would be worth it.


3= that this is a variation of a well known technique so I'm not 
suggesting heresy here.



Ron 




---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


[HACKERS] Need pointers to standard pg database(s) for testing

2006-02-17 Thread Ron

I assume we have such?

Ron



---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Ron

At 06:35 AM 2/16/2006, Steinar H. Gunderson wrote:

On Wed, Feb 15, 2006 at 11:30:54PM -0500, Ron wrote:
 Even better (and more easily scaled as the number of GPR's in the CPU
 changes) is to use
 the set {L; L+1; L+2; t1; R-2; R-1; R}
 This means that instead of 7 random memory accesses, we have 3; two
 of which result in a burst access for three elements each.

Isn't that improvement going to disappear completely if you choose a bad
pivot?
Only if you _consistently_ (read: the vast majority of the time; 
quicksort is actually darn robust) choose a _pessimal_, not just 
bad, pivot will quicksort degenerate to the O(N^2) behavior 
everyone worries about.  See Cormen & Rivest for a proof of this.


Even then, doing things as above has benefits:
1= The worst case is less bad since the guaranteed O(lgs!) pivot 
choosing algorithm puts s elements into final position.

Worst case becomes better than O(N^2/(s-1)).

2=  The overhead of pivot choosing can overshadow the benefits using 
more traditional methods for even moderate values of s.  See 
discussions on the quicksort variant known as samplesort and 
Sedgewick's PhD thesis for details.  Using a pivot choosing algorithm 
that actually does some of the partitioning (and does it more 
efficiently than the usual partitioning algorithm does) plus using 
partition-in-place (rather then Lomuto's method) reduces overhead 
very effectively (at the cost of more complicated / delicate to get 
right partitioning code).  The above reduces the number of moves used 
in a quicksort pass considerably regardless of the number of compares used.


3= Especially in modern systems where the gap between internal CPU 
bandwidth and memory bandwidth is so great, the overhead of memory 
accesses for comparisons and moves is the majority of the overhead 
for both the pivot choosing and the partitioning algorithms within 
quicksort.  Particularly random memory accesses.  The reason (#GPRs - 
1) is a magic constant is that it's the most you can compare and move 
using only register-to-register operations.


In addition, replacing as many of the memory accesses you must do 
with sequential rather than random memory accesses is a big deal: 
sequential memory access is measured in 10's of CPU cycles while 
random memory access is measured in hundreds of CPU cycles.  It's no 
accident that the advances in Gray's sorting contest have involved 
algorithms that are both register and cache friendly, minimizing 
overall memory access and using sequential memory access as much as 
possible when said access can not be avoided.  As caches grow larger 
and memory accesses more expensive, it's often worth it to use a 
BucketSort+QuickSort hybrid rather than just QuickSort.


...and of course if you know enough about the data to be sorted so as 
to constrain it appropriately, one should use a non comparison based 
O(N) sorting algorithm rather than any of the general comparison 
based O(NlgN) methods.




 SIDE NOTE: IIRC glibc's qsort is actually merge sort.  Merge sort
 performance is insensitive to all inputs, and there are ways to 
 optimize it as well.

glibc-2.3.5/stdlib/qsort.c:

  /* Order size using quicksort.  This implementation incorporates
 four optimizations discussed in Sedgewick:

I can't see any references to merge sort in there at all.

Well, then I'm not the only person on the lists whose memory is faulty ;-)

The up side of MergeSort is that its performance is always O(NlgN).
The down sides are that it is far more memory hungry than QuickSort and slower.


Ron




---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] [HACKERS] qsort again

2006-02-16 Thread Ron

At 07:10 AM 2/16/2006, Florian Weimer wrote:

* Neil Conway:

 On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote:
 It seems clear that our qsort.c is doing a pretty awful job of picking
 qsort pivots, while glibc is mostly managing not to make that mistake.
 I haven't looked at the glibc code yet to see what they are doing
 differently.

 glibc qsort is actually merge sort, so I'm not surprised it avoids this
 problem.

qsort also performs twice as many key comparisons as the theoretical
minimum.


The theoretical minimum number of comparisons for a general purpose 
comparison based sort is O(lgN!).
QuickSort uses 2NlnN ~= 1.38NlgN or ~1.38x the optimum without tuning 
(see Knuth, Sedgewick, Cormen, ... etc)
OTOH, QuickSort uses ~2x as many =moves= as the theoretical minimum 
unless tuned, and moves are more expensive than compares in modern systems.


See my other posts for QuickSort tuning methods that attempt to 
directly address both issues.



Ron 




---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Ron

At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote:

On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote:
 3= Especially in modern systems where the gap between internal CPU
 bandwidth and memory bandwidth is so great, the overhead of memory
 accesses for comparisons and moves is the majority of the overhead
 for both the pivot choosing and the partitioning algorithms within
 quicksort.  Particularly random memory accesses.  The reason (#GPRs -
 1) is a magic constant is that it's the most you can compare and move
 using only register-to-register operations.

But how much of this applies to us? We're not sorting arrays of
integers, we're sorting pointers to tuples. So while moves cost very
little, a comparison costs hundreds, maybe thousands of cycles. A tuple
can easily be two or three cachelines and you're probably going to
access all of it, not to mention the Fmgr structures and the Datums
themselves.
Pointers are simply fixed size 32b or 64b quantities.  They are 
essentially integers.  Comparing and moving pointers or fixed size 
keys to those pointers is exactly the same problem as comparing and 
moving integers.


Comparing =or= moving the actual data structures is a much more 
expensive and variable cost proposition.  I'm sure that pg's sort 
functionality is written intelligently enough that the only real data 
moves are done in a final pass after the exact desired order has been 
found using pointer compares and (re)assignments during the sorting 
process.  That's a standard technique for sorting data whose key or 
pointer is much smaller than a datum.
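As a sketch of that standard technique (illustrative names only, not pg's
actual tuplesort code): build an array of small key+pointer entries, sort just
those with qsort(), and leave the wide rows where they are until an optional
final pass.

#include <stdlib.h>

typedef struct {
    int         key;   /* small fixed-size sort key extracted from the row */
    const void *row;   /* pointer to the (possibly multi-KB) row itself    */
} SortEntry;

static int cmp_entry(const void *a, const void *b)
{
    const SortEntry *x = a, *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

/* Only the small entries are compared and moved; the rows never move.
   An optional final pass can then copy rows out in entries[i].row order. */
void sort_by_key(SortEntry *entries, size_t n)
{
    qsort(entries, n, sizeof(SortEntry), cmp_entry);
}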


Your cost comment basically agrees with mine regarding the cost of 
random memory accesses.  The good news is that the number of datums 
to be examined during the pivot choosing process is small enough that 
the datums can fit into CPU cache while the pointers to them can be 
assigned to registers: making pivot choosing +very+ fast when done correctly.


As you've noted, actual partitioning is going to be more expensive 
since it involves accessing enough actual datums that they can't all 
fit into CPU cache.  The good news is that QuickSort has a very 
sequential access pattern within its inner loop.  So while we must go 
to memory for compares, we are at least keeping the cost for it down 
to a minimum.  In addition, said access is nice enough to be very 
prefetch and CPU cache hierarchy friendly.




None of this is cache friendly. The actual tuples themselves could be
spread all over memory (I don't think any particular effort is expended
trying to minimize fragmentation).
It probably would be worth it to spend some effort on memory layout 
just as we do for HD layout.




Do these algorithms discuss the case where a comparison is more than
1000 times the cost of a move?
A move is always more expensive than a compare when the datum is 
larger than its pointer or key.  A move is always more expensive than 
a compare when it involves memory to memory movement rather than CPU 
location to CPU location movement.  A move is especially more 
expensive than a compare when it involves both factors.  Most moves 
do involve both.


What I suspect you meant is that a key comparison that involves 
accessing the data in memory is more expensive than reassigning the 
pointers associated with those keys.   That is certainly true.


Yes.  The problem has been extensively studied. ;-)



Where this does become interesting is where we can convert a datum to
an integer such that if f(A) < f(B) then A < B. Then we can sort on
f(X) first with just integer comparisons and then do a full tuple
comparison only if f(A) = f(B). This would be much more cache-coherent
and make these algorithms much more applicable in my mind.

In fact we can do better.
Using hash codes or what-not to map datums to keys and then sorting 
just the keys and the pointers to those datums followed by an 
optional final pass where we do the actual data movement is also a 
standard technique for handling large data structures.



Regardless of what tweaks beyond the basic algorithms we use, the 
algorithms themselves have been well studied and their performance 
well established.  QuickSort is the best performing of the O(nlgn) 
comparison based sorts and it uses fewer resources than HeapSort or MergeSort.


Ron



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Ron

At 10:52 AM 2/16/2006, Ron wrote:

At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote:


Where this does become interesting is where we can convert a datum to
an integer such that if f(A) < f(B) then A < B. Then we can sort on
f(X) first with just integer comparisons and then do a full tuple
comparison only if f(A) = f(B). This would be much more cache-coherent
and make these algorithms much more applicable in my mind.

In fact we can do better.
Using hash codes or what-not to map datums to keys and then sorting 
just the keys and the pointers to those datums followed by an 
optional final pass where we do the actual data movement is also a 
standard technique for handling large data structures.

I thought some follow up might be in order here.

Let's pretend that we have the typical DB table where rows are ~2-4KB 
apiece.  1TB of storage will let us have 256M-512M rows in such a table.


A 32b hash code can be assigned to each row value such that only 
exactly equal rows will have the same hash code.

A 32b pointer can locate any of the 256M-512M rows.

Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b+32b= 
64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an optional 
pass to rearrange the actual rows if we so wish.


We get the same result while only examining and manipulating 1/50 to 
1/25 as much data during the sort.


If we want to spend more CPU time in order to save more space, we can 
compress the key+pointer representation.  That usually reduces the 
amount of data to be manipulated to ~1/4 the original key+pointer 
representation, reducing things to ~512M-1GB worth of compressed 
pointers+keys.  Or ~1/200 - ~1/100 the original amount of data we 
were discussing.


Either representation is small enough to fit within RAM rather than 
requiring HD IO, so we solve the HD IO bottleneck in the best 
possible way: we avoid ever doing it.


Ron   




---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Ron

At 12:19 PM 2/16/2006, Scott Lamb wrote:

On Feb 16, 2006, at 8:32 AM, Ron wrote:

Let's pretend that we have the typical DB table where rows are
~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
such a table.

A 32b hash code can be assigned to each row value such that only
exactly equal rows will have the same hash code.
A 32b pointer can locate any of the 256M-512M rows.

Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b 
+32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an

optional pass to rearrange the actual rows if we so wish.


I don't understand this.

This is a true statement: (H(x) != H(y)) => (x != y)
This is not: (H(x) < H(y)) => (x < y)

Hash keys can tell you there's an inequality, but they can't tell you
how the values compare. If you want 32-bit keys that compare in the
same order as the original values, here's how you have to get them:
For most hash codes, you are correct.  There is a class of hash or 
hash-like codes that maintains the mapping to support that second statement.


More later when I can get more time.
Ron 




---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Ron

At 01:47 PM 2/16/2006, Ron wrote:

At 12:19 PM 2/16/2006, Scott Lamb wrote:

On Feb 16, 2006, at 8:32 AM, Ron wrote:

Let's pretend that we have the typical DB table where rows are
~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
such a table.

A 32b hash code can be assigned to each row value such that only
exactly equal rows will have the same hash code.
A 32b pointer can locate any of the 256M-512M rows.

Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b 
+32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an

optional pass to rearrange the actual rows if we so wish.


I don't understand this.

This is a true statement: (H(x) != H(y)) => (x != y)
This is not: (H(x) < H(y)) => (x < y)

Hash keys can tell you there's an inequality, but they can't tell you
how the values compare. If you want 32-bit keys that compare in the
same order as the original values, here's how you have to get them:
For most hash codes, you are correct.  There is a class of hash or 
hash-like codes that maintains the mapping to support that second statement.


More later when I can get more time.
Ron


OK, so here's _a_ way (there are others) to obtain a mapping such that
 if a < b then f(a) < f(b) and
 if a == b then f(a) == f(b)

Pretend each row is an integer of row size (so a 2KB row becomes a 
16Kb integer; a 4KB row becomes a 32Kb integer; etc)
Since even a 1TB table made of such rows can only have 256M - 512M 
possible values even if each row is unique, a 28b or 29b key is large 
enough to represent each row's value and relative rank compared to 
all of the others even if all row values are unique.


By scanning the table once, we can map say 001h (Hex used to ease 
typing) to the row with the minimum value and 111h to the row 
with the maximum value as well as mapping everything in between to 
their appropriate keys.  That same scan can be used to assign a 
pointer to each record's location.


We can now sort the key+pointer pairs instead of the actual data and 
use an optional final pass to rearrange the actual rows if we wish.


That initial scan to set up the keys is expensive, but if we wish 
that cost can be amortized over the life of the table so we don't 
have to pay it all at once.  In addition, once we have created those 
keys, they can be saved for later searches and sorts.
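A minimal sketch of that key-building scan (cmp_row() and the in-memory rows[]
array are stand-ins for whatever full-row comparison and storage the table
really uses): order the row indices once by full comparison, then hand each row
its rank as a small order-preserving key.

#include <stdlib.h>

/* Stand-ins for the real table access and full-row comparison. */
extern const void *rows[];
extern int cmp_row(const void *a, const void *b);

static int cmp_row_idx(const void *a, const void *b)
{
    return cmp_row(rows[*(const unsigned *)a], rows[*(const unsigned *)b]);
}

/* One expensive pass: sort row indices by full-row comparison, then assign
   each row its rank.  Afterwards key order matches row order, so later sorts
   can work on the small keys alone.  (To keep f(a) == f(b) for equal rows,
   ties would be given the same rank; omitted here for brevity.) */
void build_rank_keys(unsigned *key, unsigned nrows)
{
    unsigned *idx = malloc(nrows * sizeof(unsigned));
    unsigned i;

    for (i = 0; i < nrows; i++)
        idx[i] = i;
    qsort(idx, nrows, sizeof(unsigned), cmp_row_idx);
    for (i = 0; i < nrows; i++)
        key[idx[i]] = i;       /* rank of row idx[i] is i */
    free(idx);
}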


Further space savings can be obtained whenever there are duplicate 
keys and/or when compression methods are used on the Key+pointer pairs.


Ron







---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-15 Thread Ron
This behavior is consistent with the pivot choosing algorithm 
assuming certain distribution(s) for the data.  For instance, 
median-of-three partitioning is known to be pessimal when the data is 
geometrically or hyper-geometrically distributed.  Also, care must be 
taken (and sometimes is not) when there are many equal values in the 
data.  Even pseudo-random number generator based pivot choosing 
algorithms are not immune if the PRNG is flawed in some way.


How are we choosing our pivots?


At 06:28 PM 2/15/2006, Tom Lane wrote:


I did some experimentation comparing the qsort from Fedora Core 4
(glibc-2.3.5-10.3) with our src/port/qsort.c.  For those who weren't
following the pgsql-performance thread, the test case is just this
repeated a lot of times:

create table atest(i int4, r int4);
insert into atest (i,r) select generate_series(1,10), 0;
insert into atest (i,r) select generate_series(1,10), random()*10;
\timing
create index idx on atest(r);
\timing
drop table atest;

I did this 100 times and sorted the reported runtimes.  (Investigation
with trace_sort = on confirms that the runtime is almost entirely spent
in qsort() called from our performsort --- the Postgres overhead is
about 100msec on this machine.)  Results are below.

It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.

I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
We need to fix our qsort.c before pushing ahead with that idea.

regards, tom lane


100 runtimes for glibc qsort, sorted ascending:

Time: 459.860 ms
snip
Time: 488.852 ms
Time: 514.639 ms
Time: 529.287 ms
Time: 612.185 ms
Time: 660.748 ms
Time: 742.227 ms
Time: 866.814 ms
Time: 1234.848 ms
Time: 1267.398 ms


100 runtimes for port/qsort.c, sorted ascending:

Time: 418.905 ms
snip
Time: 20865.979 ms
Time: 21000.907 ms
Time: 21297.585 ms
Time: 21714.518 ms
Time: 25423.235 ms
Time: 27543.052 ms
Time: 28314.182 ms
Time: 29400.278 ms
Time: 34142.534 ms




---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-15 Thread Ron

At 08:21 PM 2/15/2006, Tom Lane wrote:

Ron [EMAIL PROTECTED] writes:
 How are we choosing our pivots?

See qsort.c: it looks like median of nine equally spaced inputs (ie,
the 1/8th points of the initial input array, plus the end points),
implemented as two rounds of median-of-three choices.


OK, this is a bad way to do median-of-n partitioning for a few 
reasons.  See Sedgewick's PhD thesis for details.


Basically, if one is using median-of-n partitioning to choose a 
pivot, one should do it in =one= pass, and n for that pass should be 
<= the number of registers in the CPU.  Since the x86 ISA has 8 
GPRs, n should be <= 8; 7 for instance.


Special purposing the median-of-n code so that the minimal number of 
comparisons and moves is used to sort the sample and then 
partitioning in place is the best way to do it.  In addition, care 
must be taken to deal with the possibility that many of the keys may be equal.


The (pseudo) code looks something like this:

qs(a[],L,R){
   if((R-L) > SAMPLE_SIZE){ // Not worth using qs for too few elements
      SortSample(SAMPLE_SIZE,a[],L,R);
      // Sorts SAMPLE_SIZE = n elements and does median-of-n
      // partitioning for small n using the minimal number of
      // comparisons and moves.  In the process it ends up
      // partitioning the first n/2 and last n/2 elements.
      // SAMPLE_SIZE is a constant chosen to work best for a given CPU.
      // #GPRs - 1 is a good initial guess.
      // For the x86 ISA, #GPRs - 1 = 7. For native x86-64, it's 15.
      // For most RISC CPUs it's 31 or 63.  For Itanium, it's 127 (!)
      pivot= a[(L+R)>>1]; i= L+(SAMPLE_SIZE>>1); j= R-(SAMPLE_SIZE>>1);
      for(;;){
         while(a[++i] < pivot);
         while(a[--j] > pivot);
         if(i >= j) break;
         if(a[i] > a[j]) swap(a[i],a[j]);
      }
      if((i-R) >= (j-L)){qs(a,L,i-1);}
      else{qs(a,i,R);}
      // (a complete version also recurses into the other partition)
   }
   else{OofN^2_Sort(a,L,R);}
   // SelectSort may be better than InsertSort if KeySize in bits <<
   // RecordSize in bits
} // End of qs

Given that the most common CPU ISA in existence has 8 GPRs, 
SAMPLE_SIZE= 7 is probably optimal:

t= (L+R);
the set would be {L; t/8; t/4; t/2; 3*t/4; 7*t/8; R;}
== {L; t>>3; t>>2; t>>1; (3*t)>>2; (7*t)>>3; R} as the locations.
Even better (and more easily scaled as the number of GPRs in the CPU 
changes) is to use the set {L; L+1; L+2; t>>1; R-2; R-1; R}.
This means that instead of 7 random memory accesses, we have 3; two 
of which result in a burst access for three elements each.
That's much faster; _and_ using a sample of 9, 15, 31, 63, etc (to 
max of ~GPRs -1) elements is more easily done.


It also means that the work we do sorting the sample can be taken 
advantage of when starting the inner loop of quicksort: items L..L+2, 
t>>1, and R-2..R are already partitioned by SortSample().


Ensuring that the minimum number of comparisons and moves is done in 
SortSample can be done by using a code generator to create a 
comparison tree that identifies which permutation(s) of n we are 
dealing with and then moving them into place with the minimal number of moves.


SIDE NOTE: IIRC glibc's qsort is actually merge sort.  Merge sort 
performance is insensitive to all inputs, and there are ways to 
optimize it as well.


I'll leave the actual coding to someone who knows the pg source 
better than I do.
Ron 




---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


[HACKERS] What do the Windows pg hackers out there like for dev tools?

2006-02-10 Thread Ron
Subject line says it all.  I'm going to be testing changes under both 
Linux and WinXP, so I'm hoping those of you that do M$ hacking will 
pass along your list of suggestions and/or favorite (and hated so I 
know what to avoid) tools.


TiA,
Ron



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] ISO 8601 Intervals

2006-01-09 Thread Ron Mayer

Larry Rosenman wrote:

Michael Glaesemann wrote:

On Jan 8, 2006, at 12:12 , Larry Rosenman wrote:

  I was thinking of handling the TODO for ISO8601 Interval output.


Just to be clear, you're talking about the ISO8601 duration syntax
(PnYnMnDTnHnMnS), correct? (The SQL standard made the unfortunate
choice to call durations, i.e., lengths of time, intervals.)  


Back in 2003 I submitted such a patch [1,1b] that resulted in a fair
amount of discussion including some still (AFAIK) open issues
about the naming of the datestyle settings to control it [2,3,4].

There was also some discussion of the range of ISO 8601 durations
to support (ISO 8601 Basic Format, ISO  8601 Alternative Format,
and ISO 8601 Extended Format (which is more human-readable)) [5].

Finally, there is a similar, but different syntax currently supported
by postgresql (where '1Y1M' means 1 year 1 minute, while ISO 'P1Y1M'
would mean 1 year 1 month) and Tom recommended ripping that code
out[7] and at one point said my patch was looking cleaner than
the existing code [8].  My patch does not yet rip that out.


I still use the patch myself, but don't have it updated to CVS tip.
I'd be happy to do so if people want that as a starting point.

   Ron


[1] http://archives.postgresql.org/pgsql-patches/2003-09/msg00103.php
[1b] http://archives.postgresql.org/pgsql-patches/2003-09/msg00286.php
[2] http://archives.postgresql.org/pgsql-patches/2003-09/msg00122.php
[3] http://archives.postgresql.org/pgsql-patches/2003-09/msg00129.php
[4] http://archives.postgresql.org/pgsql-patches/2003-09/msg00130.php
[5] http://archives.postgresql.org/pgsql-patches/2003-09/msg00133.php
[6] http://archives.postgresql.org/pgsql-patches/2003-09/msg00134.php
[7] http://archives.postgresql.org/pgsql-patches/2003-09/msg00134.php

[8] http://archives.postgresql.org/pgsql-patches/2003-09/msg00121.php

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] ISO 8601 Intervals

2006-01-09 Thread Ron Mayer

One more link...
this http://archives.postgresql.org/pgsql-patches/2003-12/msg00049.php
was the final draft of the patch I submitted, with docs patches, that
did not break backward compatibility (did not rip out the old syntax)
and supported both input and output of ISO-8601 compliant intervals
by setting the datestyle to iso8601basic as discussed in the thread
linked in the quoted article below.

It was applied http://archives.postgresql.org/pgsql-patches/2003-12/msg00253.php
and then debated 
http://archives.postgresql.org/pgsql-patches/2003-12/msg00202.php
and then unapplied 
http://archives.postgresql.org/pgsql-patches/2003-12/msg00030.php
on Peter Eisentraut's recommendation to implement SQL standard intervals first.



Ron Mayer wrote:

Larry Rosenman wrote:


Michael Glaesemann wrote:


On Jan 8, 2006, at 12:12 , Larry Rosenman wrote:


  I was thinking of handling the TODO for ISO8601 Interval output.



Just to be clear, you're talking about the ISO8601 duration syntax
(PnYnMnDTnHnMnS), correct? (The SQL standard made the unfortunate
choice to call durations, i.e., lengths of time, intervals.)  



Back in 2003 I submitted such a patch [1,1b] that resulted in a fair
amount of discussion including some still (AFAIK) open issues
about the naming of the datestyle settings to control it [2,3,4].

There was also some discussion of the range of ISO 8601 durations
to support (ISO 8601 Basic Format, ISO  8601 Alternative Format,
and ISO 8601 Extended Format (which is more human-readable)) [5].

Finally, there is a similar, but different syntax currently supported
by postgresql (where '1Y1M' means 1 year 1 minute, while ISO 'P1Y1M'
would mean 1 year 1 month) and Tom recommended ripping that code
out[7] and at one point said my patch was looking cleaner than
the existing code [8].  My patch does not yet rip that out.


I still use the patch myself, but don't have it updated to CVS tip.
I'd be happy to do so if people want that as a starting point.

   Ron


[1] http://archives.postgresql.org/pgsql-patches/2003-09/msg00103.php
[1b] http://archives.postgresql.org/pgsql-patches/2003-09/msg00286.php
[2] http://archives.postgresql.org/pgsql-patches/2003-09/msg00122.php
[3] http://archives.postgresql.org/pgsql-patches/2003-09/msg00129.php
[4] http://archives.postgresql.org/pgsql-patches/2003-09/msg00130.php
[5] http://archives.postgresql.org/pgsql-patches/2003-09/msg00133.php
[6] http://archives.postgresql.org/pgsql-patches/2003-09/msg00134.php
[7] http://archives.postgresql.org/pgsql-patches/2003-09/msg00134.php

[8] http://archives.postgresql.org/pgsql-patches/2003-09/msg00121.php


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Supporting NULL elements in arrays

2005-11-09 Thread Ron Mayer

Joe Conway wrote:


Last time I thought about this problem, that's what I concluded. I don't 
think there is a reasonable and backward compatible solution.


I also think the best non-compatible solution is to require non-numeric 
elements to be delimited (double quotes, configurable?), and use NULL 
unadorned to represent NULL.



If we're going non-compatible, would something that's a superset
of the SQL Standard's array value constructor be useful; or is
the standard's arrays so limited as to not even address the
things that cause issues for postgresql arrays?

If I read the confusing thing right, I think the standard
does use NULL for nulls in arrays, single-quotes for strings,
etc. like ARRAY['FOO',null,'BAR'] and unadorned numbers
for numbers in arrays.   That's similar to what I think Joe
suggested, but with single rather than double quotes?

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Ron Peacetree
I've now gotten verification from multiple working DBA's that DB2, Oracle, and
SQL Server can achieve ~250MBps ASTR (with as much as ~500MBps ASTR in
setups akin to Oracle RAC) when attached to a decent (not outrageous, but
decent) HD subsystem...

I've not yet had any RW DBA verify Jeff Baker's supposition that ~1GBps ASTR is
attainable.  Cache based bursts that high, yes.  ASTR, no.

The DBA's in question run RW installations that include Solaris, M$, and Linux 
OS's
for companies that just about everyone on these lists are likely to recognize.

Also, the implication of these pg IO limits is that money spent on even 
moderately
priced 300MBps SATA II based RAID HW is wasted $'s.

In total, this situation is a recipe for driving potential pg users to other 
DBMS. 
  
25MBps in and 15MBps out is =BAD=.

Have we instrumented the code in enough detail that we can tell _exactly_ where
the performance drainage is?

We have to fix this.
Ron  


-Original Message-
From: Luke Lonergan [EMAIL PROTECTED]
Sent: Oct 5, 2005 11:24 AM
To: Michael Stone [EMAIL PROTECTED], Martijn van Oosterhout 
kleptog@svana.org
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Nope - it would be disk wait.

COPY is CPU bound on I/O subsystems faster than 50 MB/s on COPY (in) and about 
15 MB/s (out).

- Luke

 -Original Message-
From:   Michael Stone [mailto:[EMAIL PROTECTED]
Sent:   Wed Oct 05 09:58:41 2005
To: Martijn van Oosterhout
Cc: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
Subject:Re: [HACKERS] [PERFORM] A Better External Sort?

On Sat, Oct 01, 2005 at 06:19:41PM +0200, Martijn van Oosterhout wrote:
COPY TO /dev/null WITH binary
13MB/s  55% user 45% system  (ergo, CPU bound)
[snip]
the most expensive. But it does point out that the whole process is
probably CPU bound more than anything else.

Note that 45% of that cpu usage is system--which is where IO overhead
would end up being counted. Until you profile where you system time is
going it's premature to say it isn't an IO problem.

Mike Stone


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster



---(end of broadcast)---
TIP 6: explain analyze is your friend


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Ron Peacetree
First I wanted to verify that pg's IO rates were inferior to The Competition.
Now there's at least an indication that someone else has solved similar
problems.  Existence proofs make some things easier ;-)

Is there any detailed programmer level architectural doc set for pg?  I know
the best doc is the code, but the code in isolation is often the Slow Path to
understanding with systems as complex as a DBMS IO layer.

Ron
 

-Original Message-
From: Joshua D. Drake [EMAIL PROTECTED]
Sent: Oct 5, 2005 1:18 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?


The source is freely available for your perusal. Please feel free to
point us in specific directions in the code where you may see some
benefit. I am positive all of us that can, would put resources into
fixing the issue had we a specific direction to attack.

Sincerely,

Joshua D. Drake

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Ron Peacetree
I'm putting in as much time as I can afford thinking about pg related
performance issues.  I'm doing it because of a sincere desire to help
understand and solve them, not to annoy people.

If I didn't believe in pg, I would't be posting thoughts about how to
make it better.  

It's probably worth some review (suggestions marked with a +):

+I came to the table with a possibly better way to deal with external
sorts (that now has branched into 2 efforts: short term improvements
to the existing code, and the original from-the-ground-up idea).  That
suggestion was based on a great deal of prior thought and research,
despite what some others might think.

Then we were told that our IO limit was lower than I thought.

+I suggested that as a Quick Fix we try making sure we do IO
transfers in large enough chunks based on the average access time
of the physical device in question so as to achieve the device's
ASTR (ie at least 600KB per access for a 50MBps ASTR device with
a 12ms average access time.) whenever circumstances allowed us.
As far as I know, this experiment hasn't been tried yet.

I asked some questions about physical layout and format translation
overhead being possibly suboptimal that seemed to be agreed to, but
specifics as to where we are taking the hit don't seem to have been
made explicit yet.

+I made the from left field suggestion that perhaps a pg native fs
format would be worth consideration.  This is a major project, so
the suggestion was to at least some extent tongue-in-cheek.

+I then made some suggestions about better code instrumentation
so that we can more accurately characterize where the bottlenecks are. 

We were also told that evidently we are CPU bound far before one
would naively expect to be based on the performance specifications
of the components involved.

Double checking among the pg developer community led to some
differing opinions as to what the actual figures were and under what
circumstances they were achieved.  Further discussion seems to have
converged on both accurate values and a better understanding as to
the HW and SW  needed; _and_ we've gotten some RW confirmation
as to what current reasonable expectations are within this problem
domain from outside the pg community.

+Others have made some good suggestions in this thread as well.
Since I seem to need to defend my tone here, I'm not detailing them
here.  That should not be construed as a lack of appreciation of them.

Now I've asked for the quickest path to detailed understanding of the
pg IO subsystem.  The goal being to get more up to speed on its
coding details.  Certainly not to annoy you or anyone else.

At least from my perspective, this for the most part seems to have
been a useful and reasonable engineering discussion that has
exposed a number of important things.
  
Regards,
Ron

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread Ron Peacetree
The constants related to inlining involve pcode, not actual assembly 
instructions, and are compiler version dependent as well as subject to change 
without notice by the GNU folks...

from:
http://gcc.gnu.org/onlinedocs/gcc-3.3.5/gcc/Optimize-Options.html#Optimize-Options

-finline-limit=n
By default, gcc limits the size of functions that can be inlined. This flag 
allows the control of this limit for functions that are explicitly marked as 
inline (i.e., marked with the inline keyword or defined within the class 
definition in c++). n is the size of functions that can be inlined in number of 
pseudo instructions (not counting parameter handling). The default value of n 
is 600. Increasing this value can result in more inlined code at the cost of 
compilation time and memory consumption. Decreasing usually makes the 
compilation faster and less code will be inlined (which presumably means slower 
programs). This option is particularly useful for programs that use inlining 
heavily such as those based on recursive templates with C++.

Inlining is actually controlled by a number of parameters, which may be 
specified individually by using --param name=value. The -finline-limit=n option 
sets some of these parameters as follows:

max-inline-insns
is set to n.
max-inline-insns-single
is set to n/2.
max-inline-insns-auto
is set to n/2.
min-inline-insns
is set to 130 or n/4, whichever is smaller.
max-inline-insns-rtl
is set to n. 

Using -finline-limit=600 thus results in the default settings for these 
parameters. See below for a documentation of the individual parameters 
controlling inlining.

Note: pseudo instruction represents, in this particular context, an abstract 
measurement of function's size. In no way, it represents a count of assembly 
instructions and as such its exact meaning might change from one release to an 
another. 

Further Down It Says...

--param name=value
In some places, GCC uses various constants to control the amount of 
optimization that is done. For example, GCC will not inline functions that 
contain more that a certain number of instructions. You can control some of 
these constants on the command-line using the --param option.

The names of specific parameters, and the meaning of the values, are tied to 
the internals of the compiler, and are subject to change without notice in 
future releases.

In each case, the value is an integer. The allowable choices for name are given 
in the following table:

snip

max-inline-insns-single
Several parameters control the tree inliner used in gcc. This number sets the 
maximum number of instructions (counted in gcc's internal representation) in a 
single function that the tree inliner will consider for inlining. This only 
affects functions declared inline and methods implemented in a class 
declaration (C++). The default value is 300.

max-inline-insns-auto
When you use -finline-functions (included in -O3), a lot of functions that 
would otherwise not be considered for inlining by the compiler will be 
investigated. To those functions, a different (more restrictive) limit compared 
to functions declared inline can be applied. The default value is 300.

max-inline-insns
The tree inliner does decrease the allowable size for single functions to be 
inlined after we already inlined the number of instructions given here by 
repeated inlining. This number should be a factor of two or more larger than 
the single function limit. Higher numbers result in better runtime performance, 
but incur higher compile-time resource (CPU time, memory) requirements and 
result in larger binaries. Very high values are not advisable, as too large 
binaries may adversely affect runtime performance. The default value is 600.

max-inline-slope
After exceeding the maximum number of inlined instructions by repeated 
inlining, a linear function is used to decrease the allowable size for single 
functions. The slope of that function is the negative reciprocal of the number 
specified here. The default value is 32.

min-inline-insns
The repeated inlining is throttled more and more by the linear function after 
exceeding the limit. To avoid too much throttling, a minimum for this function 
is specified here to allow repeated inlining for very small functions even when 
a lot of repeated inlining already has been done. The default value is 130.

max-inline-insns-rtl
For languages that use the RTL inliner (this happens at a later stage than tree 
inlining), you can set the maximum allowable size (counted in RTL instructions) 
for the RTL inliner with this parameter. The default value is 600.
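A small illustration of how those knobs are applied (the option names are
exactly the ones documented above for gcc 3.3.x, and the numbers shown are just
the documented defaults; whether anything actually gets inlined depends on the
compiler release, per the "subject to change" caveat above):

/* inline_demo.c
 *
 * Example invocations, using only the options quoted above:
 *
 *   gcc -O3 -finline-limit=600 -c inline_demo.c
 *   gcc -O3 --param max-inline-insns-single=300 \
 *           --param max-inline-insns-auto=300 -c inline_demo.c
 *
 * Whether cmp() is inlined into sum_cmp() depends on these limits.
 */
static inline int cmp(int a, int b)
{
    return (a > b) - (a < b);
}

int sum_cmp(const int *x, const int *y, int n)
{
    int i, s = 0;

    for (i = 0; i < n; i++)
        s += cmp(x[i], y[i]);
    return s;
}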


-Original Message-
From: Martijn van Oosterhout kleptog@svana.org
Sent: Oct 4, 2005 8:24 AM
To: Simon Riggs [EMAIL PROTECTED]
Cc: Tom Lane [EMAIL PROTECTED], Ron Peacetree [EMAIL PROTECTED], 
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort

Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Ron Peacetree
Jeff, are those _burst_ rates from HD buffer or _sustained_ rates from
actual HD media?  Rates from IO subsystem buffer or cache are
usually considerably higher than Average Sustained Transfer Rate.

Also, are you measuring _raw_ HD IO (bits straight off the platters, no
FS or other overhead) or _cooked_ HD IO (actual FS or pg IO)?

BTW, it would seem Useful to measure all of raw HD IO, FS HD IO,
and pg HD IO as this would give us an idea of just how much overhead
each layer is imposing on the process.

We may be able to get better IO than we currently are for things like
sorts by the simple expedient of making sure we read enough data per
seek.

For instance, a HD with a 12ms average access time and an ASTR of
50MBps should always read _at least_ 600KB/access or it is impossible
for it to achieve its rated ASTR.

This number will vary according to the average access time and the
ASTR of your physical IO subsystem, but the concept is valid for _any_
physical IO subsystem.
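A small worked example of that arithmetic (the 50MBps and 12ms figures are just
the example values above): the 600KB floor is simply ASTR times average access
time, and at exactly that size half of each request is still positioning
overhead, so even larger chunks are needed to approach the rated ASTR.

#include <stdio.h>

int main(void)
{
    double astr   = 50e6;    /* bytes/second of sustained transfer (example) */
    double access = 0.012;   /* seconds of positioning overhead per request  */
    double floor_bytes = astr * access;   /* = 600 KB, the figure in the text */
    double sizes[] = { floor_bytes, 4 * floor_bytes, 16 * floor_bytes };
    int i;

    for (i = 0; i < 3; i++) {
        double effective = sizes[i] / (sizes[i] / astr + access);
        printf("%8.0f KB per access -> %5.1f MB/s effective\n",
               sizes[i] / 1e3, effective / 1e6);   /* 600KB -> 25MB/s, etc. */
    }
    return 0;
}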
 

-Original Message-
From: Jeffrey W. Baker [EMAIL PROTECTED]
Sent: Oct 3, 2005 4:42 PM
To: josh@agliodbs.com
Cc: 
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Mon, 2005-10-03 at 13:34 -0700, Josh Berkus wrote:
 Michael,
 
  Realistically, you can't do better than about 25MB/s on a
   single-threaded I/O on current Linux machines,
 
  What on earth gives you that idea? Did you drop a zero?
 
 Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A 
 Big-Name Proprietary Database doesn't get much more than that either.

I find this claim very suspicious.  I get single-threaded reads in
excess of 1GB/sec with XFS and > 250MB/sec with ext3.  

-jwb

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Ron Peacetree
Let's pretend we get a 24HD HW RAID solution like that J Baker
says he has access to and set it up as a RAID 10.  Assuming
it uses two 64b 133MHz PCI-X busses and has the fastest HDs
available on it,  Jeff says he can hit ~1GBps of XFS FS IO rate
with that set up (12*83.3MBps= 1GBps).

Josh says that pg can't do more than 25MBps of DB level IO
regardless of how fast the physical IO subsystem is because at
25MBps, pg is CPU bound.  

Just how bad is this CPU bound condition?  How powerful a CPU is
needed to attain a DB IO rate of 25MBps?
 
If we replace said CPU with one 2x, 10x, etc faster than that, do we
see any performance increase?

If a modest CPU can drive a DB IO rate of 25MBps, but that rate
does not go up regardless of how much extra CPU we throw at
it...

Ron 

-Original Message-
From: Josh Berkus josh@agliodbs.com
Sent: Oct 3, 2005 6:03 PM
To: Jeffrey W. Baker [EMAIL PROTECTED]
Cc: 
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Jeffrey,

 I guess database reads are different, but I remain unconvinced that they
 are *fundamentally* different.  After all, a tab-delimited file (my sort
 workload) is a kind of database.

Unfortunately, they are ... because of CPU overheads.   I'm basing what's 
reasonable for data writes on the rates which other high-end DBs can 
make.   From that, 25mb/s or even 40mb/s for sorts should be achievable 
but doing 120mb/s would require some kind of breakthrough.

 On a single disk you wouldn't notice, but XFS scales much better when
 you throw disks at it.  I get a 50MB/sec boost from the 24th disk,
 whereas ext3 stops scaling after 16 disks.  For writes both XFS and ext3
 top out around 8 disks, but in this case XFS tops out at 500MB/sec while
 ext3 can't break 350MB/sec.

That would explain it.  I seldom get more than 6 disks (and 2 channels) to 
test with.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Ron Peacetree
OK, change performance to single thread performance and we
still have a valid starting point for a discussion.

Ron


-Original Message-
From: Gregory Maxwell [EMAIL PROTECTED]
Sent: Oct 3, 2005 8:19 PM
To: Ron Peacetree [EMAIL PROTECTED]
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On 10/3/05, Ron Peacetree [EMAIL PROTECTED] wrote:
[snip]
 Just how bad is this CPU bound condition?  How powerful a CPU is
 needed to attain a DB IO rate of 25MBps?

 If we replace said CPU with one 2x, 10x, etc faster than that, do we
 see any performance increase?

 If a modest CPU can drive a DB IO rate of 25MBps, but that rate
 does not go up regardless of how much extra CPU we throw at
 it...

Single threaded was mentioned.
Plus even if it's purely cpu bound, it's seldom as trivial as throwing
CPU at it, consider the locking in both the application, in the
filesystem, and elsewhere in the kernel.


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Ron Peacetree
*blink* Tapes?!  I thought that was a typo...
If our sort code is based on sorting tapes, we've made a mistake.  HDs
are not tapes, and Polyphase Merge Sort and its brethren are not the
best choices for HD based sorts.

Useful references to this point:
Knuth, Vol 3 section 5.4.9, (starts p356 of 2ed)   
Tharp, ISBN 0-471-60521-2, starting p352
Folk, Zoellick, and Riccardi, ISBN 0-201-87401-6, chapter 8 (starts p289)

The winners of the Daytona version of Jim Gray's sorting contest, for
general purpose external sorting algorithms that are of high enough quality
to be offered commercially, also demonstrate a number of better ways to
attack external sorting using HDs.

The big take aways from all this are:
1= As in Polyphase Merge Sort, optimum External HD Merge Sort
performance is obtained by using Replacement Selection and creating
runs of different lengths for later merging.  The values are different.
(A minimal sketch of Replacement Selection follows this list.)

2= Using multiple HDs split into different functions, IOW _not_ simply
as RAIDs, is a big win.
A big enough win that we should probably consider having a config option
to pg that allows the use of HD(s) or RAID set(s) dedicated as temporary
work area(s).
 
3= If the Key is small compared to the record size, Radix or Distribution
Counting based algorithms are worth considering.
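For point 1 above, here is a minimal Replacement Selection sketch (plain ints
read from stdin stand in for tuples, and HEAP_SIZE stands in for work_mem):
each emitted run is ascending, run lengths naturally vary, and on random input
a run averages about twice the heap size.

#include <stdio.h>

#define HEAP_SIZE 1024   /* stands in for "work_mem worth of tuples" */

static int heap[HEAP_SIZE], heap_len;

static void sift_down(int i)             /* restore min-heap order from i down */
{
    for (;;) {
        int c = 2 * i + 1, t;
        if (c >= heap_len) break;
        if (c + 1 < heap_len && heap[c + 1] < heap[c]) c++;
        if (heap[i] <= heap[c]) break;
        t = heap[i]; heap[i] = heap[c]; heap[c] = t;
        i = c;
    }
}

int main(void)
{
    int pending[HEAP_SIZE], npending = 0, v, got, run = 1, i;

    while (heap_len < HEAP_SIZE && scanf("%d", &v) == 1)   /* initial fill */
        heap[heap_len++] = v;
    for (i = heap_len / 2 - 1; i >= 0; i--) sift_down(i);

    while (heap_len > 0) {
        int out = heap[0];                 /* smallest value still eligible */
        printf("run %d: %d\n", run, out);

        got = scanf("%d", &v);
        if (got == 1 && v >= out) {
            heap[0] = v;                   /* still fits the current run */
        } else {
            if (got == 1)
                pending[npending++] = v;   /* too small: belongs to the next run */
            heap[0] = heap[--heap_len];    /* shrink the heap */
        }
        sift_down(0);

        if (heap_len == 0 && npending > 0) {   /* current run done: start next */
            run++;
            while (npending > 0) heap[heap_len++] = pending[--npending];
            for (i = heap_len / 2 - 1; i >= 0; i--) sift_down(i);
        }
    }
    return 0;
}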

The good news is all this means it's easy to demonstrate that we can
improve the performance of our sorting functionality.

Assuming we get the abysmal physical IO performance fixed...
(because until we do, _nothing_ is going to help us as much)

Ron 


-Original Message-
From: Tom Lane [EMAIL PROTECTED]
Sent: Oct 1, 2005 2:01 AM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort? 

Jeffrey W. Baker [EMAIL PROTECTED] writes:
 I think the largest speedup will be to dump the multiphase merge and
 merge all tapes in one pass, no matter how large M.  Currently M is
 capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
 the tape.  It could be done in a single pass heap merge with N*log(M)
 comparisons, and, more importantly, far less input and output.

I had more or less despaired of this thread yielding any usable ideas
:-( but I think you have one here.  The reason the current code uses a
six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
edition) shows that there's not much incremental gain from using more
tapes ... if you are in the regime where number of runs is much greater
than number of tape drives.  But if you can stay in the regime where
only one merge pass is needed, that is obviously a win.

I don't believe we can simply legislate that there be only one merge
pass.  That would mean that, if we end up with N runs after the initial
run-forming phase, we need to fit N tuples in memory --- no matter how
large N is, or how small work_mem is.  But it seems like a good idea to
try to use an N-way merge where N is as large as work_mem will allow.
We'd not have to decide on the value of N until after we've completed
the run-forming phase, at which time we've already seen every tuple
once, and so we can compute a safe value for N as work_mem divided by
largest_tuple_size.  (Tape I/O buffers would have to be counted too
of course.)
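A sketch of that run-time computation (names are illustrative, not
tuplesort.c's actual code): after the run-forming phase we know the largest
tuple seen, so the merge order is just what work_mem can accommodate.

/* Pick the merge order N after the run-forming phase: as many input runs as
   work_mem can hold one largest-seen tuple plus one tape I/O buffer for,
   but never less than a two-way merge. */
static int choose_merge_order(long work_mem_bytes,
                              long largest_tuple_bytes,
                              long tape_buffer_bytes)
{
    long per_input = largest_tuple_bytes + tape_buffer_bytes;
    long n = work_mem_bytes / per_input;

    return (int) (n < 2 ? 2 : n);
}

With, say, 1GB of work_mem, 8KB tuples and 256KB tape buffers this allows a
merge of several thousand runs at once, so the 60GB example above would easily
finish in a single merge pass.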

It's been a good while since I looked at the sort code, and so I don't
recall if there are any fundamental reasons for having a compile-time-
constant value of the merge order rather than choosing it at runtime.
My guess is that any inefficiencies added by making it variable would
be well repaid by the potential savings in I/O.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

