Re: [HACKERS] bitmap AM design

2005-03-04 Thread Pailloncy Jean-Gerard
On 2 March 2005, at 21:17, Hannu Krosing wrote:
On a fine day (Tuesday, 1 March 2005, 14:54 -0500), [EMAIL PROTECTED] wrote:
Now, it occurs to me that if my document reference table can refer to
something other than an indexed primary key, I can save a lot of index
processing time in PostgreSQL if I can have a safe analogy to CTID.
I guess you could work on making hash indexes better (for concurrent
access).
You should have a look at this thread:
http://archives.postgresql.org/pgsql-hackers/2005-02/msg00263.php
Take a look at this paper about a lock-free parallel hash table:
http://www.cs.rug.nl/~wim/mechver/hashtable/
'a safe analogy to CTID' looks remarkably like hash index
--
Hannu Krosing [EMAIL PROTECTED]
Regards,
Jean-Gérard Pailloncy


Re: [HACKERS] bitmap AM design

2005-03-04 Thread Neil Conway
Pailloncy Jean-Gerard wrote:
You should have a look to this thread
http://archives.postgresql.org/pgsql-hackers/2005-02/msg00263.php
Take a look at this paper about lock-free parallel hash table
http://www.cs.rug.nl/~wim/mechver/hashtable/
Is this relevant? Hash indexes are on-disk data structures, so ISTM 
lock-free algorithms aren't really applicable.

(BTW, is poor concurrency really the biggest issue with hash indexes? If 
so, there is some low-hanging fruit that I noticed a few years ago, but 
never got around to fixing: _hash_doinsert() write-locks the hash 
metapage on every insertion merely to increment a tuple counter. This 
could be improved by just acquiring the lock with probability 1/k, and 
incrementing the counter k times -- or some similar statistical 
approximation. IMHO there are bigger issues with hash indexes, like the 
lack of WAL safety, the very slow index build times, and their 
relatively poor performance -- i.e. no better than btree for any 
workload I've seen. If someone wants to step up to the plate and fix 
some of that, I'll improve hash index concurrency -- any takers? :-) )
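
For what it's worth, here is a tiny standalone sketch of that statistical
trick; nothing below is backend code -- the lock functions, the counter and
the sampling factor K are all made-up stand-ins:

#include <stdio.h>
#include <stdlib.h>

#define K 16                       /* hypothetical sampling factor */

static long ntuples = 0;           /* stands in for the metapage tuple count */

static void take_lock(void)    { /* acquire the (stand-in) metapage lock */ }
static void release_lock(void) { /* release it */ }

static void
note_insertion(void)
{
    /* take the lock only about 1 time in K, but add K when we do,
     * so the counter's expected value matches exact counting */
    if (rand() % K == 0)
    {
        take_lock();
        ntuples += K;
        release_lock();
    }
}

int
main(void)
{
    int     i;

    for (i = 0; i < 1000000; i++)
        note_insertion();
    printf("approximate count: %ld (exact: 1000000)\n", ntuples);
    return 0;
}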

-Neil


Re: [HACKERS] bitmap AM design

2005-03-04 Thread pgsql
 Pailloncy Jean-Gerard wrote:
 You should have a look to this thread
 http://archives.postgresql.org/pgsql-hackers/2005-02/msg00263.php

 Take a look at this paper about lock-free parallel hash table
 http://www.cs.rug.nl/~wim/mechver/hashtable/

 Is this relevant? Hash indexes are on-disk data structures, so ISTM
 lock-free algorithms aren't really applicable.

 (BTW, is poor concurrency really the biggest issue with hash indexes? If
 so, there is some low-hanging fruit that I noticed a few years ago, but
 never got around to fixing: _hash_doinsert() write-locks the hash
 metapage on every insertion merely to increment a tuple counter. This
 could be improved by just acquiring the lock with probability 1/k, and
 incrementing the counter k times -- or some similar statistical
 approximation. IMHO there are bigger issues with hash indexes, like the
 lack of WAL safety, the very slow index build times, and their
 relatively poor performance -- i.e. no better than btree for any
 workload I've seen. If someone wants to step up to the plate and fix
 some of that, I'll improve hash index concurrency -- any takers? :-) )


As always, I'd love to have the time to do this stuff, but as always, I'm
not in the position to spend any time on it. It's frustrating.

Anyway, IMHO, hash indexes would be dramatically improved if you could
specify your own hashing function and declare an initial table size. If you
could do that, and work on the assumption that the hash index was for
fairly static data, it could handle many needs. As it stands, hash indexes
are virtually useless.




Re: [HACKERS] refactoring fork() and EXEC_BACKEND

2005-03-04 Thread Magnus Hagander
While going through the usual motions needed to fork a child process of
the postmaster, it occurred to me that there's a fair bit of duplicated
code involved. There are also #ifdefs for various situations (BeOS,
LINUX_PROFILE, and EXEC_BACKEND), which makes the code yet more ugly. I
think we could make this a lot cleaner.

I'd like to define an API like so:

This is a lot like what I was planning to work towards with the
refactoring of the forkexec code I promised to do for 8.1. Glad to hear
you think in the same direction.


pid_t fork_process(int proc_type);
pid_t fork_backend(Port *port);

If the process needs to add a lot of private information to the argv in
the case of EXEC_BACKEND, they could invoke a third variant:

#ifdef EXEC_BACKEND
pid_t forkexec_process(int proc_type, int argc, char **argv);
#endif

(Or possibly using varargs, if that is cleaner for most call-sites). 
Hopefully most call sites could just use fork_process().

I was actually thinking of not passing these on the commandline at all,
in order to avoid possible quoting issues etc (recall all the problems
with the stupid commandline processing on win32). Instead, I'd move it into
a struct that is appended to the end of the backend variable file/shared
memory.


[snip]

So, most call sites would be quite nice:

pid_t result = fork_process(PROC_TYPE_FOO);
if (result == -1) { /* fork failed, in parent */ }
else if (result == 0) { /* in child */ }
else { /* in parent, `result' is pid of child */ }

You're not going to be able to get the "in child" case there for an exec'ed
process, are you? It has to be called somewhere in the new process, and
thus it would have to be a function, wouldn't it?
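
To make that concrete, here is a rough, purely illustrative sketch of the
shape such a wrapper could take. fork_exec_child() below is a made-up
stand-in (not the real internal_forkexec), and the point is exactly the one
above: under EXEC_BACKEND the child never returns to this call site, so the
child-side code has to live in its own entry function.

#include <sys/types.h>
#include <unistd.h>

#ifdef EXEC_BACKEND
/* made-up stand-in: the real thing would build an argv (or a parameter
 * struct) for the new process, fork and exec it, and the child would be
 * dispatched from the new executable's main() */
static pid_t
fork_exec_child(int proc_type)
{
    (void) proc_type;
    return -1;                  /* placeholder; parent would get the child pid */
}
#endif

pid_t
fork_process(int proc_type)
{
#ifdef EXEC_BACKEND
    /* the caller's "in child" branch can never run here */
    return fork_exec_child(proc_type);
#else
    pid_t pid = fork();

    (void) proc_type;           /* plain fork: nothing extra to pass along */
    /* pid == 0 means we are the child and simply fall through into the
     * caller's "in child" branch */
    return pid;
#endif
}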


I'd also like to move the implementation of fork_process() and friends,
as well as internal_forkexec(), into a separate file -- I'd rather not
clutter up postmaster.c with it.

That was also what I was thinking. Let me know if you want to split the
load somewhere :-)


//Magnus



Re: [HACKERS] Where to see the patch queue (was Re: [PATCHES] Patch

2005-03-04 Thread Robert Treat
On Thursday 03 March 2005 19:08, Neil Conway wrote:
 Thomas F.O'Connell wrote:
 committers, myself included, deserve some blame for not making more
 rapid progress on the queue of unapplied patches for 8.1. In the
 meanwhile, the queue should be easier for folks to find (why is the
 pgpatches queue the only one linked from postgresql.org, but it is
 almost empty?)


Wow, I hadn't realized just how big that queue had gotten. I've got no qualms
adding another link to the website, but can we get a scheme wrt links for
the various patch queues?  Is pgpatches for the current stable branch and
pgpatches2 for the current development branch?  (Well, that can't be true
since before the 8.0 release they were both for unstable branches...)

Incidentally, I think DEV_FAQ 1.4 should be modified to mention the
patch queues.

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL



Re: [HACKERS] SQL99 Hierarchical queries

2005-03-04 Thread Evgen Potemkin
Hello hackers,
 Thanks very much for doing this work.  Is there some way you can
 summarize what you did so others can join you in working on it?  If it
 is easier for you to write this in some language other than English,
 please do, and we'll find translators :)
 
 I noticed that the patch touches the parser, the optimizer and the
 executor.  What does it do with each?  What did I miss?
 
Here is some information about the implementation.

Parser.
    The WITH-aliased queries are stored as a list of SUBQUERY nodes, and each
of them is parsed as usual. The already-analyzed part of the WITH query list
is kept in pstate->withClause. When the next WITH subquery is analyzed, it is
added to the subquery list inside the withClause node, so any WITH subquery
can see all previous WITH subqueries.
    In the FROM clause, all WITH aliases are represented by a special type of
RangeTblEntry, RTE_WITH_SUBQUERY. It stores a reference to the WithClause node
and its own index in the WITH subquery list.
    Var analysis gets an extra step that searches the WITH aliases, if any are
present.
    Recursiveness support. Before a WITH subquery is analyzed, its alias is
stored in pstate->withClause->calias. So when a FROM-clause item is
transformed and its relation name is found only in pstate->withClause->calias,
the query is marked as recursive. SQL99 recursive queries are written using
UNION, and the first UNION subquery should be non-recursive (this isn't
currently checked). Thus, when a set-operation statement is transformed, the
RTE of its first statement is stored in pstate->withClause->cRTE, and all
checks for Vars in this WITH subquery are made against that RTE.

Optimizer.
    The WithClause node is transformed into a With node, and scans of RTEs of
type RTE_WITH_SUBQUERY are turned into WithScan nodes. Each WITH subquery is
wrapped in a WithSubPlan node, which stores the result and working tuplestores
and some flags (used to prevent double initialization and execution) for each
subquery.
    Nothing extraordinary is done here.

Executor.
    When the executor tries to fetch the first tuple from any WithScan node,
that node checks whether the With node has been executed; if not, it executes
it and then fetches all its tuples from its result tuplestore.
    Executing the With node simply calls ExecNode on each subplan in its list
and stores the tuples in the result tuplestore.
    Recursiveness support. This is all done in the Append node. If it is
marked as recursive, its behaviour changes a little.
    Tuples fetched from a subplan are stored in the working table. When Append
reaches the end of its list of subqueries, it calls nodeWithSwitchTables. For
the query being executed, this function appends the result table to the final
table, moves the working table to the result table, and clears the working
table. After this, Append begins the next loop of subquery execution, starting
from the second subquery; thus the first Append subquery is executed only
once. Execution ends when an iteration fetches no tuples from any subquery.
    This approach lets the WithScan nodes fetch the data that Append fetched
in the previous loop.
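
For illustration only, here is a tiny standalone C sketch of that iteration
scheme (the working table fed back into the recursive part until an iteration
produces nothing new); it computes the nodes reachable in a small hard-coded
graph, and none of the names below come from the patch:

#include <stdio.h>
#include <string.h>

#define NNODES 6

/* edge[i][j] != 0 means there is an edge i -> j in the toy graph */
static const int edge[NNODES][NNODES] = {
    {0,1,0,0,0,0}, {0,0,1,0,0,0}, {0,0,0,1,0,0},
    {0,0,0,0,0,0}, {0,0,0,0,0,1}, {0,0,0,0,0,0},
};

int
main(void)
{
    int final[NNODES] = {0};    /* everything returned so far (UNION dedups) */
    int working[NNODES] = {0};  /* rows produced by the current iteration */
    int result[NNODES];         /* previous iteration's rows, fed back in */
    int i, j, added;

    working[0] = 1;             /* the non-recursive first subquery: seed row */
    do
    {
        /* roughly nodeWithSwitchTables: result <- working, clear working */
        memcpy(result, working, sizeof(result));
        memset(working, 0, sizeof(working));

        added = 0;
        for (i = 0; i < NNODES; i++)
        {
            if (!result[i])
                continue;
            if (!final[i])
            {
                final[i] = 1;   /* append to the final table */
                added = 1;
            }
            /* the recursive second subquery: join previous rows to edges */
            for (j = 0; j < NNODES; j++)
                if (edge[i][j] && !final[j])
                    working[j] = 1;
        }
    } while (added);            /* stop when an iteration adds nothing new */

    for (i = 0; i < NNODES; i++)
        if (final[i])
            printf("node %d is reachable from node 0\n", i);
    return 0;
}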

Regards, Evgen.



Re: [HACKERS] Solving hash table overrun problems

2005-03-04 Thread Tom Lane
Aaron Birkland [EMAIL PROTECTED] writes:
 This also brings up a line of thought I had a while ago on a related
 topic.  Something like a HashDistinct might be useful, if it had no
 startup cost.  It would basically be a plan node in the executor that
 would dynamically build a hashtable so that it can pull rows from its
 child node (discarding if they appear in the hashtable) until it can
 pass on a novel row.   I have some reservations about it, though.

We already have that: the planner will use a HashAgg node in this
fashion in some contexts (I think just as one of the ways to do IN,
at the moment).  It's not yet bright enough to consider doing it for
SELECT DISTINCT.  The DISTINCT planning code is old and crufty and
pretty tightly intertwined with ORDER BY ... it needs work.
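
As a toy illustration of the streaming idea quoted above (this has nothing to
do with the real HashAgg code; it's a fixed-size open-addressing set over
ints, just to show the pull / discard-duplicate / emit loop):

#include <stdio.h>

#define TABSIZE 1024            /* toy table; assumes < 1024 distinct keys */

static int keys[TABSIZE];
static int used[TABSIZE];

/* return 1 if key was seen before, otherwise remember it and return 0 */
static int
seen_before(int key)
{
    unsigned int h = (unsigned int) key * 2654435761u % TABSIZE;

    while (used[h])
    {
        if (keys[h] == key)
            return 1;
        h = (h + 1) % TABSIZE;  /* linear probing; no overflow handling here */
    }
    used[h] = 1;
    keys[h] = key;
    return 0;
}

int
main(void)
{
    int child_rows[] = {3, 1, 3, 2, 1, 5, 2};   /* the "child node" input */
    int n = sizeof(child_rows) / sizeof(child_rows[0]);
    int i;

    /* pull rows, discard ones already seen, pass novel rows on immediately */
    for (i = 0; i < n; i++)
        if (!seen_before(child_rows[i]))
            printf("emit %d\n", child_rows[i]);
    return 0;
}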

regards, tom lane



Re: [HACKERS] bitmap AM design

2005-03-04 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 (BTW, is poor concurrency really the biggest issue with hash indexes? If 
 so, there is some low-hanging fruit that I noticed a few years ago, but 
 never got around to fixing: _hash_doinsert() write-locks the hash 
 metapage on every insertion merely to increment a tuple counter. 

Given the short amount of time that lock is held, this wouldn't
win anything worth noticing.  Also, it's not merely to increment a
counter --- the counter drives decisions about whether to split buckets,
so any decrease in accuracy would lead directly to losses in overall
performance.

The lack of WAL support is a much bigger issue.

regards, tom lane



[HACKERS] db cluster ?

2005-03-04 Thread Bostjan Potocnik
Hello hackers,

I'm wondering if it is possible to somehow spread a pretty big database
(approx. 50 GB) over a few boxes to get more speed?
If anyone has done that, I'd be glad to get some pointers in the right direction.

Thanks and best regards,
Bostjan




Re: [HACKERS] bitmap AM design

2005-03-04 Thread Tom Lane
[EMAIL PROTECTED] writes:
 Anyway, IMHO, hash indexes would be dramatically improved if you could
 specify your own hashing function

That's called a custom operator class.

 and declare initial table size.

It would be interesting to see if setting up the hashtable with about
the right number of buckets initially would make CREATE INDEX enough
faster to be a win ... but that doesn't mean I want to make the user
deal with it.  We could probably hack hashbuild() to estimate the
size of the parent table using the same code that the planner is now
using (ie, actual size in pages times a possibly-dead-reckoning rows
per page estimate).
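
Just to put numbers on that, a back-of-the-envelope sketch (illustrative C
only; the page count, the rows-per-page figure and the target
tuples-per-bucket are made-up assumptions, not anything taken from
hashbuild()):

#include <stdio.h>

int
main(void)
{
    double rel_pages = 12000.0;         /* hypothetical heap size in pages */
    double rows_per_page = 60.0;        /* possibly-dead-reckoning estimate */
    double tuples_per_bucket = 10.0;    /* assumed target fill factor */

    double est_rows = rel_pages * rows_per_page;
    unsigned long nbuckets = 1;

    /* round the initial bucket count up to a power of two */
    while (nbuckets < est_rows / tuples_per_bucket)
        nbuckets <<= 1;

    printf("estimated rows: %.0f, initial buckets: %lu\n", est_rows, nbuckets);
    return 0;
}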

regards, tom lane



Re: [HACKERS] Solving hash table overrun problems

2005-03-04 Thread Bruno Wolff III
On Thu, Mar 03, 2005 at 17:05:40 -0500,
  Tom Lane [EMAIL PROTECTED] wrote:
 
 * Estimate the number of batches N using the planner's estimate.
 We will always choose N a power of 2.  A tuple's batch number is
 ((H div K) mod N).

If K is way low this could be very slow. Is there a way to do something
similar changing the hash function to H div KN? If you went down this
road you would probably want to use distinct primes for each new N.

 * Now begin scanning the outer join input.  Tuples of batch number
 zero (according to the current calculation) can be matched to the
 current hashtable contents.  Tuples of higher batch numbers are dropped
 into holding files for the outer input, one per batch.

For new keys at this step do you know their final disposition so that making
new hash entries won't be necessary?



Re: [HACKERS] Solving hash table overrun problems

2005-03-04 Thread Tom Lane
Bruno Wolff III [EMAIL PROTECTED] writes:
   Tom Lane [EMAIL PROTECTED] wrote:
 * Estimate the number of batches N using the planner's estimate.
 We will always choose N a power of 2.  A tuple's batch number is
 ((H div K) mod N).

 If K is way low this could be very slow.

How so?  You're not concerned about the time to do the division itself
are you?

 * Now begin scanning the outer join input.  Tuples of batch number
 zero (according to the current calculation) can be matched to the
 current hashtable contents.  Tuples of higher batch numbers are dropped
 into holding files for the outer input, one per batch.

 For new keys at this step do you know their final disposition so that making
 new hash entries won't be necessary?

Well, we probably have a pretty fair idea of N at this point, so it'll
usually be right --- but we reserve the right to increase N again later
in case we have to do so because one of the later inner batches is much
bigger than the zero batch we are currently processing.

regards, tom lane



Re: [HACKERS] Solving hash table overrun problems

2005-03-04 Thread Bruno Wolff III
On Fri, Mar 04, 2005 at 10:42:08 -0500,
  Tom Lane [EMAIL PROTECTED] wrote:
 Bruno Wolff III [EMAIL PROTECTED] writes:
Tom Lane [EMAIL PROTECTED] wrote:
  * Estimate the number of batches N using the planner's estimate.
  We will always choose N a power of 2.  A tuple's batch number is
  ((H div K) mod N).
 
  If K is way low this could be very slow.
 
 How so?  You're not concerned about the time to do the division itself
 are you?

No, rather having lots of entries in the same hash buckets. I was thinking
about recent discussions where there was a large number of rows with almost
all of the keys having just a few values, but also a lot of unique keys,
and analyze doesn't see enough of the unique ones to make a good
estimate for K.

  * Now begin scanning the outer join input.  Tuples of batch number
  zero (according to the current calculation) can be matched to the
  current hashtable contents.  Tuples of higher batch numbers are dropped
  into holding files for the outer input, one per batch.
 
  For new keys at this step do you know their final disposition so that making
  new hash entries won't be necessary?
 
 Well, we probably have a pretty fair idea of N at this point, so it'll
 usually be right --- but we reserve the right to increase N again later
 in case we have to do so because one of the later inner batches is much
 bigger than the zero batch we are currently processing.

I just noticed that it wasn't mentioned that an overflow could occur at this
step. I didn't think it would be hard to handle one if needed, but was wondering
if knowing a key couldn't match (because it was in the current batch 0 and
didn't match an existing key in that batch) was enough to emit or discard
the row.



Re: [HACKERS] Solving hash table overrun problems

2005-03-04 Thread Tom Lane
Bruno Wolff III [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] wrote:
 Bruno Wolff III [EMAIL PROTECTED] writes:
 If K is way low this could be very slow.
 
 How so?  You're not concerned about the time to do the division itself
 are you?

 No, rather having lots of entries in the same hash buckets.

That won't happen because we are going to set K with an eye to the
maximum number of rows we intend to hold in memory (given work_mem).
With the addition of the dynamic batch splitting logic, that number
of rows is actually reasonably accurate.

The only way this scheme can really lose badly is if there are large
numbers of tuples with exactly the same hash code, so that no matter how
much we increase N we can't split up the bucketload.  This is a risk for
*any* hashing scheme, however.  In practice we have to rely on the
planner to not choose hashing when there are only a few distinct values
for the key.
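
For concreteness, the arithmetic under discussion looks something like this
(a sketch only, assuming -- as the thread suggests -- that K is the number of
in-memory buckets and N the number of batches; it is not actual hashjoin
code):

#include <stdio.h>

int
main(void)
{
    unsigned int H = 0x9e3779b9u;   /* some tuple's hash value */
    unsigned int K = 8192;          /* in-memory buckets, sized from work_mem */
    unsigned int N = 4;             /* number of batches, always a power of 2 */

    unsigned int bucket = H % K;        /* where the tuple lands in memory */
    unsigned int batch = (H / K) % N;   /* which batch it belongs to */

    printf("bucket %u, batch %u\n", bucket, batch);

    /* doubling N later: a tuple's batch number either stays the same or
     * becomes old_batch + old_N, so existing batches split cleanly in two */
    printf("after doubling N: batch %u\n", (H / K) % (N * 2));
    return 0;
}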

 I just noticed that it wasn't mentioned that an overflow could occur at this
 step.

It can't, because we aren't loading the outer tuples into the hash
table.  We are just considering them one at a time and probing for
matches.

regards, tom lane



[HACKERS] I am in Copenhagen

2005-03-04 Thread Bruce Momjian
I am in Copenhagen and am speaking tomorrow and will return on Sunday.

When I return I will get back to the patches queue.  I have been delayed
working on my own patches.  :-)

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] bitmap AM design

2005-03-04 Thread pgsql
 [EMAIL PROTECTED] writes:
 Anyway, IMHO, hash indexes would be dramatically improved if you could
 specify your own hashing function

 That's called a custom operator class.

Would I also be able to query the bucket size and all that?


 and declare initial table size.

 It would be interesting to see if setting up the hashtable with about
 the right number of buckets initially would make CREATE INDEX enough
 faster to be a win ... but that doesn't mean I want to make the user
 deal with it.  We could probably hack hashbuild() to estimate the
 size of the parent table using the same code that the planner is now
 using (ie, actual size in pages times a possibly-dead-reckoning rows
 per page estimate).


I know a linear hash is different from a classic simple hash table, but a
classic simple hash table has some great advantages at the expense of disk
space. IMHO, being able to use the hash index in a way that is closer to the
classic theoretical hash table, falling back to the linear behavior only if
the table grows beyond the initial estimates, would be a big win. It could
actually get to a 1:1 operation-to-retrieval ratio on properly estimated
tables.

Estimation is a great idea; something like the first prime after 2*NROWS
(with a GUC limit, I guess) would probably make hash indexes the fastest,
most disk-hogging index around.




[HACKERS] buildfarm issues

2005-03-04 Thread Andrew Dunstan
Now that we've been running for a while there are a few buildfarm issues 
that I need to address.

First, do we keep the right data about the members? Essentially, we 
keep: operating_system, os_version, compiler, compiler_version, 
architecture. For Linux, we generally ask for the
Distribution/distro-version instead of the OS/os-version. However, that
leads to interesting situations - Gentoo for example is so flexible
that in version 2004.03 you might easily be using kernel version 2.4.x 
or 2.6.x ... in fact it's almost impossible to tell what might be 
installed on a Gentoo system, or how it was compiled. So I'm really not 
sure how we should treat such systems.

Second is the fact that systems change over time. People upgrade their 
machines. I'm considering a facility to allow people to change the 
os-version,compiler-version aspects of their registered personality - 
these will become essentially timestamped pieces of information, so 
we'll still be able to tie a set of values to a history item.

Third, what can be done to improve the usefulness of the server / web 
site?  I already have the following items on my list of TODOs:
. allow sorting of member list by personality feature (e.g. see all the 
OS/X boxes together)
. filter dashboard page to only show certain members (this already works 
but you have to construct the list manually, e.g. 
http://www.pgbuildfarm.org/cgi-bin/show_status.pl?member=cat&member=dog&member=loris&member=gibbon
)
. an email alerting system for when a system changes state

Fourth, there are some significant gaps in the buildfarm coverage. I am 
currently personally supplying the Windows and Cygwin members, but since 
that is on my laptop which is frequently required for other work it's 
not ideal, and there can be large gaps. Other gaps include HPUX, for 
example. If you run a machine that should be covered and can help, 
please do. After the initial setup there should be very little work 
involved.

Finally, a volunteer or two to help me with both running and 
administering this would be great. Knowledge of perl and Template
Toolkit is a decided advantage.

Buildfarm was created to assist the principal hackers - so any and all 
(constructive) input is welcome.

cheers
andrew


Re: [HACKERS] buildfarm issues

2005-03-04 Thread Darcy Buskermolen
On Friday 04 March 2005 10:11, Andrew Dunstan wrote:
 Now that we've been running for a while there are a few buildfarm issues
 that I need to address.

 First, do we keep the right data about the members? Essentially, we
 keep: operating_system, os_version, compiler, compiler_version,
 architecture. For Linux, we genarlly ask for the
 Distribution/distro-version instead of the OS/os-version. However, that
 lead to interesting situations - Gentoo for example is so flexible
 that in version 2004.03 you might easily be using kernel version 2.4.x
 or 2.6.x ... in fact it's almost impossible to tell what might be
 installed on a Gentoo system, or how it was compiled. So I'm really not
 sure how we should treat such systems.

 Second is the fact that systems change over time. People upgrade their
 machines. I'm considering a facility to allow people to change the
 os-version,compiler-version aspects of their registered personality -
 these will become essentially timestamped pieces of information, so
 we'll still be able to tie a set of values to a history item.

What about using uname(1), cc -v, etc. to glean this information and post it
with each event logged?  I believe you already have all this stuff in the
config.log that is used?


 Third, what can be done to improve the usefulness of the server / web
 site?  I already have the following items on my list of TODOs:
 . allow sorting of member list by personality feature (e.g. see all the
 OS/X boxes together)
 . filter dashboard page to only show certain members (this already works
 but you have to construct the list manually, e.g.
 http://www.pgbuildfarm.org/cgi-bin/show_status.pl?member=catmember=dogmem
ber=lorismember=gibbon )
 . an email alerting system for when a system changes state

 Fourth, there are some significant gaps in the buildfarm coverage. I am
 currently personally supplying the Windows and Cygwin members, but since
 that is on my laptop which is frequently required for other work it's
 not ideal, and there can be large gaps. Other gaps include HPUX, for
 example. If you run a machine that should be covered and can help,
 please do. After the initial setup there should be very little work
 involved.


 Finally, a volunteer or two to help me with both running and
 administering this would be great. Knowledhge of perl and Template
 Toolkit a decided advantage.

 Buildfarm was created to assist the principal hackers - so any and all
 (constructive) input is welcome.

 cheers

 andrew



Re: [HACKERS] buildfarm issues

2005-03-04 Thread Andrew Dunstan

Darcy Buskermolen wrote:
On Friday 04 March 2005 10:11, Andrew Dunstan wrote:
 

Now that we've been running for a while there are a few buildfarm issues
that I need to address.
First, do we keep the right data about the members? Essentially, we
keep: operating_system, os_version, compiler, compiler_version,
architecture. For Linux, we genarlly ask for the
Distribution/distro-version instead of the OS/os-version. However, that
lead to interesting situations - Gentoo for example is so flexible
that in version 2004.03 you might easily be using kernel version 2.4.x
or 2.6.x ... in fact it's almost impossible to tell what might be
installed on a Gentoo system, or how it was compiled. So I'm really not
sure how we should treat such systems.
Second is the fact that systems change over time. People upgrade their
machines. I'm considering a facility to allow people to change the
os-version,compiler-version aspects of their registered personality -
these will become essentially timestamped pieces of information, so
we'll still be able to tie a set of values to a history item.
   

What about using uname(1), cc -v,  etc to glean this information and post it 
with each event logged?  I belive you have all this stuff already in the 
config.log that is used already ?
 

See previous para - on Linux we want the distro name and version, not 
Linux plus kernel version. uname doesn't seem to help much there. 
Also, I have no idea how portable cc -v is. Can we guarantee to have the 
compiler version properly identified on every platform?

cheers
andrew


Re: [HACKERS] 8.0.X and the ARC patent

2005-03-04 Thread Simon Riggs
On Wed, 2005-03-02 at 20:55 -0500, Tom Lane wrote:
 Michael Adler [EMAIL PROTECTED] writes:
  Looking at the Response Time Charts 
 
  8.0.1/ARC
  http://www.osdl.org/projects/dbt2dev/results/dev4-010/309/rt.html
 
  20050301 with 2Q patch
  http://www.osdl.org/projects/dbt2dev/results/dev4-010/313/rt.html
 
  It seems like the average response time has gone down, but the worse
  case ceiling has raised about 35%.
 
 The worst cases are associated with checkpoints.  I'm not sure why a
 checkpoint would have a greater effect on the 2Q system than an ARC
 system --- checkpoint doesn't request any new buffers so you'd think
 it'd be independent.  Maybe this says that the bgwriter is less
 effective with 2Q, so that there are more dirty buffers remaining to
 be written at the checkpoint?  But why?

The pattern seems familiar. Reduced average response time increases
total throughput, which on this test means we have more dirty buffers to
write at checkpoint time.

I would not necessarily suspect 2Q over ARC, at least initially.

The pattern of behaviour is similar across ARC, 2Q and Clock, though the
checkpoint points differ in intensity. The latter makes me suspect
BufMgrLock contention or similar.

There is a two-level effect at Checkpoint time...first we have the write
from PostgreSQL buffers to OS cache, then we have the write from OS
cache to disk by the pdflush daemon. At this point, I'm not certain
whether the delay is caused by the checkpointing or the pdflush daemons.
Mark and I had discussed some investigations around that. This behaviour
is new in the 2.6 kernel, so it is possible there is an unpleasant
interaction there, though I do not wish to cast random blame.

Checkpoint doesn't request new buffers, but it does require the
BufMgrLock in order to write all of the dirty buffers. It could be that
the I/Os map directly to OS cache, so that the tight loop to write out
dirty buffers causes such an extreme backlog for the BufMgrLock that it
takes more than a minute to clear and return to normal contention.

It could be that at checkpoint time, the number of writes exceeds the
dirty_ratio and the kernel forces the checkpoint process to bypass the
cache and pdflush daemons altogether and perform the I/O itself.
Single-threaded, this would display the scalability profile we see. Some
kernel level questions in there...

There is no documented event-state model for LWlock acquisition, so it
might be possible that there is a complex bottleneck in amongst them.

Amdahl's Law tells me that looking at the checkpoints is the next best
action for tuning, since they add considerably to the average response
time. Looking at the oprofile for the run as a whole is missing out the
delayed transaction behaviour that occurs during checkpoints.

I would like to try and catch an oprofile of the system while performing
a checkpoint, as a way to give us some clues. Perhaps that could be
achieved by forcing a manual checkpoint as superuser, and making that
interaction cause a switch to a new oprofile output file.

Best Regards, Simon Riggs






Re: [HACKERS] 8.0.X and the ARC patent

2005-03-04 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Checkpoint doesn't request new buffers, but it does require the
 BufMgrLock in order to write all of the dirty buffers.

There is no BufMgrLock anymore in the clock-algorithm code, so I'm not
sure how much of this analysis is still relevant.

regards, tom lane



Re: [HACKERS] db cluster ?

2005-03-04 Thread pgsql
 Hello hackers,

 I'm wondering if it is possible to somehow spread a pretty big database
 (approx. 50 GB) over a few boxes to get more speed?
 If anyone has done that, I'd be glad to get some pointers in the right direction.


I have done different elements of clustering with PostgreSQL on a per-task
basis, but not a fully comprehensive generic distributed cluster. There
are a couple of tools you can use if you are an engineering sort of fellow.

Slony is a replication cluster; all the data is on all the machines.

There is a project that shows promise as a distributed data system:
contrib/dblink. One could segment their database into a number of logical
data managers and use dblink to incorporate the data in one database into
the queries on another. It won't be transparent, but it could be fairly
manageable if you use views to implement the links.

I guess the real question is what performance do you need to improve? If
it is just read performance, then Slony is probably your best bet. Put a
number of redundant machines behind a load balancer and you are all set.

If you need to increase write performance, well, that can be problematic.

What is it that your want to accomplish?



Re: [HACKERS] buildfarm issues

2005-03-04 Thread Jim Buttafuoco
Andrew,

A couple of things:

1. We need to develop a matrix of systems/os/compiler to see what coverage we
do have, and compare it to the INSTALL guide.
2. run_build.pl should be changed to keep the information on the system up to
date (and have the matrix in item 1 change with it).
3. Have the run_build.pl script check the buildfarm CVS to see if there is a
new version of the buildfarm code, and either download it and/or alert the
owner of the system.

I am able and willing to help out with all of the above.  I am also willing to
help fill the gaps in systems if needed. I have been using perl for over 10
years now (since the perl 4 days) and have been doing web/postgres coding for
the last 6 years.

Let me know what I can do to help

Jim


-- Original Message ---
From: Andrew Dunstan [EMAIL PROTECTED]
To: Darcy Buskermolen [EMAIL PROTECTED]
Cc: PostgreSQL-development pgsql-hackers@postgresql.org
Sent: Fri, 04 Mar 2005 14:28:09 -0500
Subject: Re: [HACKERS] buildfarm issues

 Darcy Buskermolen wrote:
 
 On Friday 04 March 2005 10:11, Andrew Dunstan wrote:
   
 
 Now that we've been running for a while there are a few buildfarm issues
 that I need to address.
 
 First, do we keep the right data about the members? Essentially, we
 keep: operating_system, os_version, compiler, compiler_version,
 architecture. For Linux, we genarlly ask for the
 Distribution/distro-version instead of the OS/os-version. However, that
 lead to interesting situations - Gentoo for example is so flexible
 that in version 2004.03 you might easily be using kernel version 2.4.x
 or 2.6.x ... in fact it's almost impossible to tell what might be
 installed on a Gentoo system, or how it was compiled. So I'm really not
 sure how we should treat such systems.
 
 Second is the fact that systems change over time. People upgrade their
 machines. I'm considering a facility to allow people to change the
 os-version,compiler-version aspects of their registered personality -
 these will become essentially timestamped pieces of information, so
 we'll still be able to tie a set of values to a history item.
 
 
 
 What about using uname(1), cc -v,  etc to glean this information and post it 
 with each event logged?  I belive you have all this stuff already in the 
 config.log that is used already ?
   
 
 
 See previous para - on Linux we want the distro name and version, not 
 Linux plus kernel version. uname doesn't seem to help much there. 
 Also, I have no idea how portable cc -v is. Can we guarantee to have the 
 compiler version properly identified on every platform?
 
 cheers
 
 andrew
 
--- End of Original Message ---




Re: [HACKERS] buildfarm issues

2005-03-04 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Also, I have no idea how portable cc -v is.

Not at all.

regards, tom lane



Re: [HACKERS] hi all

2005-03-04 Thread Christopher Browne
A long time ago, in a galaxy far, far away, [EMAIL PROTECTED] (Qu Tianlian) 
wrote:
 Hi all:
 I have a question.
 How do I add a table in Slony?
 I tried to add a table to an existing database that is using Slony, but the
 table is not being replicated.
 I used postgresql version 7.4.2 and slony version 1.0

 Can you help me?  Thanks

You should take this to the Slony-I mailing list, as that is where
people who know hang out.

By the way, there was never a release of something called "slony version
1.0".

Versions of Slony-I that have been released include 1.0.1, 1.0.2, and
1.0.5, but there was no 1.0 version.
-- 
output = (cbbrowne @ gmail.com)
http://linuxdatabases.info/info/spreadsheets.html
The social dynamics  of the net are a direct  consequence of the fact
that nobody  has yet developed  a Remote Strangulation  Protocol.  
-- Larry Wall



Re: [HACKERS] bitmap AM design

2005-03-04 Thread Victor Y. Yegorov
Some more thoughts and questions.

The main idea behind bitmaps is to narrow a data set's possible values down to
"yes" and "no" (i.e. 1 and 0) and then organize the data as a series of bits,
where a bit's position determines the value it refers to. In cases where
several indexed attributes are used in the WHERE clause, it's possible to do a
logical AND/OR on the bitmaps before returning any results to the caller. For
large tables with a high number of low-cardinality attributes, using bitmaps
can result in a certain speed-up.
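
As a toy illustration of that AND step (plain C, unrelated to the actual
patch): two per-value bitmaps are combined word by word, and the positions of
the surviving 1-bits would then be looked up in the shared CTID table
described below.

#include <stdio.h>

#define NWORDS 4                /* 4 * 32 = 128 rows in this toy example */

int
main(void)
{
    /* made-up bitmaps: rows where color = 'red' and rows where size = 'small' */
    unsigned int color_red[NWORDS]  = {0x00000013u, 0u, 0x80000000u, 0u};
    unsigned int size_small[NWORDS] = {0x00000011u, 0u, 0x80000001u, 0u};
    unsigned int i, b;

    for (i = 0; i < NWORDS; i++)
    {
        unsigned int both = color_red[i] & size_small[i];   /* AND the bitmaps */

        for (b = 0; b < 32; b++)
            if (both & (1u << b))
                /* a real index would map this position to a CTID here */
                printf("match at bit position %u\n", i * 32 + b);
    }
    return 0;
}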

For the on-disk bitmaps I'm working on, each value of each indexed attribute
has its own bitmap (i.e. a series of bits, with a bit set to 1 for each row
whose corresponding field has that bitmap's value). Scanning the bitmap, we
end up with an array of 1-bit positions and need to convert those positions to
CTIDs, as the executor expects. So the index should also keep a CTID table
that corresponds to the bitmap's data. This CTID table will be the same for
all bitmap indexes created on one table, so having 2 bitmap indexes means
wasting some amount of disk space storing absolutely identical data.
So, to save space, we have 2 possibilities:
1) create a new relkind for the CTID table (maybe used not only for on-disk
   bitmaps);
2) make all "create index ... using bitmap" statements actually create/extend
   an existing bitmap index. This also implies that the planner/executor
   should try using the multicolumn bitmap index when at least one indexed
   field is present in the WHERE clause.

I'm working on the 2nd case, because the 1st one requires more work, and not
only in the access method + executor + planner area. It is also possible to
keep things as they are and make a note in the documentation that it is better
to have 1 multicolumn bitmap index than several single-column ones, and that
the planner will still use the multicolumn index even if not all columns are
involved.

Any comments/ideas here?


After implementing the bitmap index access method, it'll be necessary to
teach the planner and executor to use multicolumn bitmaps for any number of
scan attributes. Also, I cannot say in what circumstances the planner should
prefer a bitmap scan to a seqscan; I thought of cases where it estimates the
return set to be about 60% of the relation. What does the community have to
say here?


Also, as Tom is planning to work on in-memory bitmaps (maybe something is
done already, I don't know), I thought it might be possible to cooperate.
As I have no clue at the moment how the in-memory bitmaps are going to work,
would it be possible to hear from you a draft of the forthcoming
implementation? And what prerequisites would be required to join the two
kinds of bitmaps somehow?

Waiting for your thoughts/comments.


-- 

Victor Y. Yegorov



Re: [HACKERS] 8.0.X and the ARC patent

2005-03-04 Thread Greg Stark

Simon Riggs [EMAIL PROTECTED] writes:

 Amdahl's Law tells me that looking at the checkpoints is the next best
 action for tuning, since they add considerably to the average response
 time. Looking at the oprofile for the run as a whole is missing out the
 delayed transaction behaviour that occurs during checkpoints.

Even aside from the effect it has on average response time, I would point out
that many applications are governed by the worst case more than by the average
throughput.

For a web server, for example (or any OLTP application in general), it doesn't
matter if the database can handle x transactions/s on average. What matters is
that, 100% of the time, latency stays low enough to keep up with the actual
rate of requests. If every 30 minutes latency suddenly spikes beyond that, even
for only a minute, then the server falls behind on requests. The user will
effectively see a completely unresponsive web server.

So I would really urge you to focus your attention on the maximum latency
figure. It's at least as important as, if not more important than, the average
throughput number.



PS That's why I was pushing before for the idea that the server should try to
spread the I/O from one checkpoint out over more or less the time interval
between checkpoints. If it's been 30m since the last checkpoint then you have
about 30m to do the I/O for this checkpoint. (Though I would suggest a safety
factor of aiming to be finished within 50% of the time.)
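
As a crude sketch of that pacing arithmetic (illustrative C only; the
interval, the 50% safety factor and the page count are just example numbers):

#include <stdio.h>

int
main(void)
{
    double interval_s = 30.0 * 60.0;    /* time between checkpoints */
    double safety = 0.5;                /* aim to finish in 50% of that */
    long   dirty_pages = 200000;        /* hypothetical pages to flush */

    double budget_s = interval_s * safety;
    double pages_per_s = dirty_pages / budget_s;
    double delay_ms = 1000.0 / pages_per_s;     /* pause between page writes */

    printf("write %.0f pages/s (about %.2f ms between writes)\n",
           pages_per_s, delay_ms);
    return 0;
}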

-- 
greg

