Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-14 Thread Michael Paesold

Andrew Dunstan wrote:

Albe Laurenz wrote:
A fix could be either that the server checks escape sequences for
validity

This strikes me as essential. If the db has a certain encoding, ISTM we
are promising that all the text data is valid for that encoding.


The question in my mind is how we help people to recover from the fact 
that we haven't done that.


I would also say that it's a bug that escape sequences can get characters
into the database that are not valid in the specified encoding. If you
compare the encoding to table constraints, there is no way to simply
bypass a constraint check.
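For illustration, here is a minimal way such bytes can get in (assuming a
UTF8 database and the pre-fix behaviour, where backslash escapes in string
literals bypass the encoding check):

CREATE TABLE t (s text);
INSERT INTO t VALUES (E'\241\241');  -- bytes 0xA1 0xA1: not valid UTF-8
-- The INSERT is accepted, yet pg_dump writes the raw bytes out, and a
-- later restore into a UTF8 database rejects them.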


This seems to violate the principle of consistency in ACID. Additionally,
if you include pg_dump in the picture, it also violates durability, since
pg_dump cannot restore what it wrote itself.

Is there anything in the SQL spec that asks for such a behaviour? I guess not.

A DBA will usually not even learn about this issue until they are presented 
with a failing restore.


Best Regards,
Michael Paesold



Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-14 Thread Michael Fuhr
On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
 On Tuesday, March 13, 2007 16:38, Joshua D. Drake wrote:
  Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where
  we had to use iconv?
 
 What issues? I've upgraded several 8.0 databases to 8.1 without having to
 use iconv. Did I miss something?

http://www.postgresql.org/docs/8.1/interactive/release-8-1.html

Some users are having problems loading UTF-8 data into 8.1.X.  This
is because previous versions allowed invalid UTF-8 byte sequences
to be entered into the database, and this release properly accepts
only valid UTF-8 sequences. One way to correct a dumpfile is to run
the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql.

-- 
Michael Fuhr



Re: [HACKERS] Bug in VACUUM FULL ?

2007-03-14 Thread Pavan Deolasee

Tom Lane wrote:

 I've developed the attached patch against HEAD, and no longer see any
 funny behavior.  Would appreciate it if you'd test some more, though.


The patch works for me. With the patch applied, I don't see the
weird errors in the pgbench and other customized tests that I
used to see earlier.

I looked at the patch as well. ISTM that we are now moving chains
in pieces, where each piece is terminated by a DEAD tuple. That
implies that the MOVED_OFF chain is actually broken. This
should not be a problem as long as our assumption holds that all
RECENTLY_DEAD tuples preceding a DEAD tuple must also be DEAD,
and it's only the way OldestXmin is calculated that makes us see
them as RECENTLY_DEAD.

If that assumption is true (and it must be true for us to move
the chain in pieces), doesn't that mean we don't really need to
move the RECENTLY_DEAD tuples preceding a DEAD tuple? One way
to do so would be to collect the target tuple (the tuple from
where we started following the t_ctid chain) in the free_offsets
if a DEAD or INSERT_IN_PROGRESS tuple is found while following
the t_ctid chain. One by one we would collect all the
RECENTLY_DEAD tuples preceding a DEAD tuple in the truncated
pages.

Not that I am suggesting we do this, just wanted to check if
there is a flaw in my thinking. I agree that we should not be
spending too much time on fixing this corner case and the
patch that you have developed is good enough.


Thanks,
Pavan


--
EnterpriseDB    http://www.enterprisedb.com




Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread tomas

On Tue, Mar 13, 2007 at 05:39:05PM +0300, Teodor Sigaev wrote:
 Hmm, hstore + (optionally) functional indexes. Is that an answer?

I have used it in a test system so far. It works surprisingly well.
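A minimal sketch of that combination (assuming contrib/hstore is installed;
the table and attribute names are made up):

CREATE TABLE item (id serial PRIMARY KEY, props hstore);
-- functional (expression) index on one attribute:
CREATE INDEX item_color_idx ON item ((props -> 'color'));
-- equality lookups on that attribute can then use the index:
SELECT id FROM item WHERE props -> 'color' = 'red';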

Thanks
-- tomás




Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-14 Thread Peter Eisentraut
On Wednesday, March 14, 2007 08:01, Michael Paesold wrote:
 Is there anything in the SQL spec that asks for such a behaviour? I guess
 not.

I think that the octal escapes are a holdover from the single-byte days, when
they were simply a way to enter characters that are difficult to find on a
keyboard.  In today's multi-encoding world, it would make more sense if there
were an escape sequence for a *codepoint*, which is then converted to the
actual encoding (if possible and valid) in the server.  The meaning of a
codepoint is, however, character-set dependent as well.

The SQL standard supports escape sequences for Unicode codepoints, which I 
think would be a very useful feature (try entering a UTF-8 character 
bytewise ...), but it's a bit weird to implement and it's not clear how to 
handle character sets other than Unicode.
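For reference, the SQL-standard syntax in question looks like this (not
supported by PostgreSQL at the time of writing):

SELECT U&'d\0061t\+000061';              -- 'data'; \0061 and \+000061 are codepoints
SELECT U&'d!0061t!+000061' UESCAPE '!';  -- same, with an alternate escape character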

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] Synchronized Scan update

2007-03-14 Thread Zeugswetter Andreas ADI SD

  The advantage of sync_scan_offset is that, in some situations, a
  second scan can actually finish faster than if it were the only query
  executing, because a previous scan has already caused some blocks to
  be cached. However, 16 is a small number because that benefit would
  only be realized once per scan, and sync scans are only helpful on
  large tables.

Agreed.

 Alright, understood. That last part is actually something I 
 now want to avoid because it's using the current 
 cache-spoiling behaviour of seqscans to advantage. I'd like 
 to remove that behaviour, but it sounds like we can have both
 - SeqScans that don't spoil cache
 - Synch scans by setting sync_scan_offset to zero.
 
I like the idea of reducing tuning parameters, but we should, at a
minimum, still allow an on/off button for sync scans. My tests
revealed that the wrong combination of
OS/FS/IO-Scheduler/Controller could result in bad I/O behavior.

   Agreed

  Do you have an opinion about sync_scan_threshold versus a simple
  sync_scan_enable?

 enable_sync_scan?

The suggested GUCs seem very related. IIRC the agreed suggestion
was to use NBuffers (or a percentage thereof?) to decide whether to
spoil the buffer cache for a seq scan. It seems this same metric should
be used to decide whether to sync a scan when sync scans are enabled. So
when the table size is below NBuffers (or a percentage thereof) you
neither recycle buffers nor sync the seq scans.

Andreas



Re: [HACKERS] [PATCHES] Bitmapscan changes

2007-03-14 Thread Heikki Linnakangas

Tom Lane wrote:

At this point I'm feeling unconvinced that we want it at all.  It's
sounding like a large increase in complexity (both implementation-wise
and in terms of API ugliness) for a fairly narrow use-case --- just how
much territory is going to be left for this between HOT and bitmap indexes?


I'm in an awkward situation right now. I've done my best to describe the
use cases for clustered indexes. I know the patch needs refactoring;
I've refrained from making API changes and tried to keep all the
ugliness inside the b-tree, knowing that there are changes to the indexam
API coming from the bitmap index patch as well.


I've been seeking comments on the design since November, knowing
that this is a non-trivial change. I have not wanted to spend too much
time polishing the patch, in case I need to rewrite it from scratch
because of some major design flaw or because someone comes up with a
much better idea.


It's frustrating to have the patch dismissed at this late stage on the
grounds that it's not worth it. As I said in February, I have the time
to work on this, but if major changes are required to the current
design, I need to know.


Just to recap the general idea: reduce index size taking advantage of 
clustering in the heap.


Clustered indexes have roughly the same performance effect and use cases
as clustered indexes on MS SQL Server, and Index-Organized Tables on
Oracle, but the way I've implemented them is significantly different. On
other DBMSs, the index and heap are combined into a single b-tree
structure. The way I've implemented them is less invasive; there are no
changes to the heap, for example, and it doesn't require moving live tuples.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[HACKERS] need help in understanding gist function

2007-03-14 Thread sharath kumar

I need help in understanding certain things in gist:
1. What is the functionality of the gistplacetopage() function in gist.c? Can
you please give me a detailed description of this function?
2. I have left and right buffers with values that I have filled. Now, how do
I write these changes (buffers) permanently into gistentry pages?

Thanks
Sharat


Re: [HACKERS] Daylight Saving Time question PostgreSQL 8.1.4

2007-03-14 Thread Zdenek Kotala

Tom Lane wrote:

Josh Berkus josh@agliodbs.com writes:

Zdenec,

I have the following idea:
1) add a GUC variable which enables usage of OS time zone files
2) add extra parameters into the ./configure script which enable OS TZ
support in the code and get the path to OS TZ files.


If we're adding it as a configure-time variable, there's no reason to have 
a GUC.


I see zero reason to have either.  It would only make sense to do this
in the context of a platform-specific distribution such as an RPM, and
in that context the simplest solution is to let the RPM specfile make
the substitution (ie, after "make install" and before packaging,
rm -rf PG's timezone tree and insert a symlink).  Then it's on the RPM
packager's head whether it's the right thing to do or not.  A configure
switch strikes me as mostly a foot-gun, because the average user of
Postgres won't have any way to know whether the files are compatible.


I don't think making a symlink is a good solution. It generates a lot of
future problems with package updates or patching. A configure switch is much
more comfortable for packagers/patch makers.  In case an average user wants
to compile their own postgres, we can offer a regression test focused on TZ
validation. (By the way, the average user is surprised that postgres has its
own zone files.)


I also think using the system's timezone files should be the default, with
the configure script determining the zonefile location based on the OS.
Another location could be set by a switch.  If for some platform it is
necessary to use our own copy, a special switch (e.g.
--enable-internal-tzfiles) would set up postgres to use its own timezone
copy.



Zdenek

PS: For information there are TZ locations on several OS:

/usr/share/lib/zoneinfo   Solaris
/usr/share/zoneinfo       Red Hat
/opt/dce/lib/zoneinfo     HP-UX (no TZif magic word)
/etc/zoneinfo/            Tru64 (no TZif magic word)
/usr/share/zoneinfo/      MacOS
registry                  MS Windows







Re: [HACKERS] Daylight Saving Time question PostgreSQL 8.1.4

2007-03-14 Thread Naz Gassiep




Granted, but a configure switch would allow users to use the OS TZ
files in conjunction with a compiled-from-source installation. Many
users of OSes with package managers such as Debian or RedHat may, for
whatever reason, want to use a source tarball to install and also use
the OS TZ list.


That being said, this user group may be small enough to ignore. Just
throwing it in for thought.


Tom Lane wrote:

 Josh Berkus josh@agliodbs.com writes:

  Zdenec,

   I have the following idea:
   1) add a GUC variable which enables usage of OS time zone files
   2) add extra parameters into the ./configure script which enable OS TZ
   support in the code and get the path to OS TZ files.

  If we're adding it as a configure-time variable, there's no reason to have
  a GUC.

 I see zero reason to have either.  It would only make sense to do this
 in the context of a platform-specific distribution such as an RPM, and
 in that context the simplest solution is to let the RPM specfile make
 the substitution (ie, after "make install" and before packaging,
 rm -rf PG's timezone tree and insert a symlink).  Then it's on the RPM
 packager's head whether it's the right thing to do or not.  A configure
 switch strikes me as mostly a foot-gun, because the average user of
 Postgres won't have any way to know whether the files are compatible.

 			regards, tom lane

Re: [HACKERS] need help in understanding gist function

2007-03-14 Thread Teodor Sigaev
1. What is the functionality of the gistplacetopage() function in gist.c? Can
you please give me a detailed description of this function?


It tries to place an index tuple on a page. If the page doesn't have enough
space, gistplacetopage splits the page into two or more pages. The number of
pages to split into depends, basically, on the quality of the user-defined
picksplit and on the distribution of tuple lengths.


2. I have left and right buffers with values that I have filled. Now,
how do I write these changes (buffers) permanently into gistentry pages?

Sorry, I don't understand your question. Where do you have left and right
buffers? Real writes are produced by the bgwriter process; in the backend we
should just mark the buffer as dirty with the help of a MarkBufferDirty call.




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread David Fetter
On Wed, Mar 14, 2007 at 12:07:45PM +1300, Edward Stanley wrote:
 On Wed, 14 Mar 2007, David Fetter wrote:
  On Tue, Mar 13, 2007 at 05:54:34PM +, Richard Huxton wrote:
   David Fetter wrote:
   On Tue, Mar 13, 2007 at 02:21:37PM +, Richard Huxton wrote:
   David Fetter wrote:
   On Tue, Mar 13, 2007 at 09:31:45AM +, Richard Huxton wrote:
   * Another good example is the questionnaire.
   
   With all due respect, this is a solved problem *without EAV or
   run-time DDL*.  The URL below has one excellent approach to this.
   
   http://www.varlena.com/GeneralBits/110.php
   
   Which broadly speaking was the solution I used for my
   questionnaire, except I had a restricted set of types so basically
   just coerced them to text and side-stepped the inheritance issue.
   To the extent that it's dynamic, it's still just EAV though.
   
   That's precisely the difference between the above solution and
   yours, and it's the difference between a good design and one that
   will come up and bit you on the as^Hnkle.
  
   It's still basically EAV (either approach).  The key fault with EAV
   is that the tables have no semantic meaning - answer_int contains
   number of oranges, days since birth and the price of a tube ticket
   in pennies.
 
  Stuffing all of those into an answer_int is *precisely* what the end
  user must not do.  That's pilot error.
 
   Now, with a questionnaire that might not matter because everything
   is an answer and you're not necessarily going to do much more than
   count/aggregate it.
 
  See above.
 
   It doesn't remove the need for run-time DDL if you allow users to
   add their own questions.
   
   Sure it does.   When a user, who should be talking with you, wants
   to ask a new kind of question, that's the start of a discussion
   about what new kind(s) of questions would be generally applicable
   in the questionnaire schema.  Then, when you come to an agreement,
   you roll it into the new schema, and the whole system gets an
   improvement.
  
   Fine, but if you're not letting the user extend the system, then
   it's not really addressing Edward's original posting, is it?
 
  It's my contention that Edward's original idea is ill-posed.  SQL is
  just fine for doing this kind of thing, and it's *not that hard*.
 
   If the user's talking to me, I might as well just write the DDL
   myself - it's the talk that'll take the time, not writing a dozen
   lines of SQL.
 
  It's the talk that's the important part.  Machines are really bad at
  seeing the broader picture.  In the attempt to save a few minutes'
  discussion, he's trying to borrow that time from a system asked to do
  things that computers are inherently bad at doing, and every end user
  will pay that time back at a very high rate of interest.  This is
  precisely the kind of false economy that so plagues software
  development and maintenance these days.
 
   The interesting part of the problem (from a Comp-Sci point of view)
   is precisely in automating part of that discussion.  It's providing
   an abstraction so that you don't end up with a mass of attributes
   while still providing freedom to the user.
 
  This freedom and efficiency you're talking about is better supplied,
  IMHO, by putting a standard DDL for questionnaires up on a pgfoundry
  or an SF.net.  That way, improvements to the DDL get spread all over
  the world, and a very large amount of wheel reinvention gets avoided.
  Reusable components are a big chunk of both freedom and efficiency. :)
 
  Cheers,
  D
 
 Maybe I should rethink the problem a bit - from the very brief
 initial research I've done, it seems EAV schemas have two common
 uses: 

 1) When new attributes have to be created on-the-fly 
 2) When the number of possible properties for an entity greatly (orders of 
 magnitude) exceeds the number of properties any one entity is likely to have. 

Um, no.  The first use case is bad coding practice, and the second is
a classic case for a join table, which is the standard way to handle
M:N relationships.

 I'm not sure about solving the first problem - there seems to be a lot of 
 debate around this. I can see reasons for and against allowing this. However 
 I think the second is a very real problem. One such example is a patient 
 record system.
 
 For each patient we have a table of common data (dob, sex, height, weight 
 etc) 
 but as well as this a patient can present with many symptoms. This might be a 
 table of 40,000 possible symptoms. 

Here's how I'd do that:

CREATE TABLE patient (
patient_id SERIAL PRIMARY KEY, /* for simplicity.  Some
  combination of columns in the
  table would also have a UNIQUE
  NOT NULL constraint on it.
*/
...
);

CREATE TABLE symptom (
symptom_id SERIAL PRIMARY KEY, /* See above. */
...
);

CREATE TABLE patient_presents_with (
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
    UNIQUE(patient_id, symptom_id)
);
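Looking up a given patient's symptoms is then a plain join through the
link table (an illustrative query; the id value is made up):

SELECT s.*
FROM patient_presents_with ppw
JOIN symptom s USING (symptom_id)
WHERE ppw.patient_id = 42;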

Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread Gregory Stark
David Fetter [EMAIL PROTECTED] writes:

 CREATE TABLE symptom (
 symptom_id SERIAL PRIMARY KEY, /* See above. */
 ...
 );

 CREATE TABLE patient_presents_with (
 patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
 symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
 UNIQUE(patient_id, symptom_id)
 );

I'm just glad I don't have your doctor. I hope mine doesn't think symptoms are
all boolean values.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Daylight Saving Time question PostgreSQL 8.1.4

2007-03-14 Thread Martijn van Oosterhout
On Wed, Mar 14, 2007 at 01:13:58PM +0100, Zdenek Kotala wrote:
 I don't think to make a symlink is good solution. It generates a lot of 
 future problem with package update or patching. Configure switch is much 
 comfortable for packagers/patch makers.  In case when average user want 
 to compile own postgres we can offer regression test focused on TZ 
 validation. (By the way average user is surprise, that postgres has own 
 zone files)

What is the actual problem being solved here? That people expected the
timezone changes to be picked up automatically?  I think if you weigh it
up, that problem is less significant than:

1. You do a minor system upgrade and now postgres crashes because the
file format changed or the files moved.
2. You run a replication system and get different results on different
machine.

I think that from a data integrity point of view the current system is
the best. At the very least what you propose is a modularity violation:
Postgres depending on undocumented private data of another system
component.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Daylight Saving Time question PostgreSQL 8.1.4

2007-03-14 Thread Andrew Dunstan

Martijn van Oosterhout wrote:

I think that from a data integrity point of view the current system is
the best. At the very least what you propose is a modularity violation:
Postgres depending on undocumented private data of another system
component.

I don't think you can reasonably describe the system timezone database
as undocumented private data. Plenty of other systems rely on it, as we
used to do.


But I take Tom's point about most users not knowing if their TZ database 
is usable or not. Maybe we need a tool (maybe on pgfoundry) that will do 
some analysis to find out, if such a thing is possible.


cheers

andrew




Re: [HACKERS] Daylight Saving Time question PostgreSQL 8.1.4

2007-03-14 Thread Zdenek Kotala

Martijn van Oosterhout wrote:

On Wed, Mar 14, 2007 at 01:13:58PM +0100, Zdenek Kotala wrote:
I don't think making a symlink is a good solution. It generates a lot of
future problems with package updates or patching. A configure switch is much
more comfortable for packagers/patch makers.  In case an average user wants
to compile their own postgres, we can offer a regression test focused on TZ
validation. (By the way, the average user is surprised that postgres has its
own zone files.)


What is the actual problem being solved here? That people expected the
timezone changes to be picked up automatically?  I think if you weigh it
up, that problem is less significant than:


People expect consistent timezone settings for all applications on one
machine.



1. You do a minor system upgrade and now postgres crashes because the
file format changed or the files moved.


When you perform a minor system upgrade which delivers a new TZ file
format, then a new version of libc must be delivered anyway, and you
probably must recompile postgres on the upgraded system - you can check
whether the TZ files work fine, and if not, you can compile with the
built-in ones.


If a file is moved, postgres raises an error. But I don't see a problem
there. If you compare the changes between 8.1.5 and 8.1.6, you can see a
lot of removed files.



2. You run a replication system and get different results on different
machine.


However, from another point of view, you very often have the application
and postgres on one machine. And if you have different tz files for the
application and for postgres, the results can be really strange. This
case is more common than the replication issue.




I think that from a data integrity point of view the current system is
the best. At the very least what you propose is a modularity violation:
Postgres depending on undocumented private data of another system
component.


Yes, it is true that depending on private data is not good. But it is a
smaller evil than having multiple different timezones on one system.



Zdenek



Re: [HACKERS] Bug in VACUUM FULL ?

2007-03-14 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 If that assumption is true (and it must be true for us to move
 the chain in pieces), doesn't that mean we don't really need to
 move the RECENTLY_DEAD tuples preceding a DEAD tuple ?

As I've already said several times: they are dead, but at least for
VACUUM FULL's purposes it seems unreasonably difficult to determine that
and remove them.  The point at which we'd figure this out is after we've
already performed dead-tuple removal (at least for some of the pages
involved).  It would make for a significant increase in logical
complexity, and that code is too damn complicated already.  Since we
know this is a seldom-seen corner case, I'm not going to risk
introducing new bugs to recycle a few tuples a bit sooner.

regards, tom lane



Re: [HACKERS] [PATCHES] Bitmapscan changes

2007-03-14 Thread Hannu Krosing
On Wed, 2007-03-14 at 10:22, Heikki Linnakangas wrote:
 Tom Lane wrote:
  At this point I'm feeling unconvinced that we want it at all.  It's
  sounding like a large increase in complexity (both implementation-wise
  and in terms of API ugliness) for a fairly narrow use-case --- just how
  much territory is going to be left for this between HOT and bitmap indexes?
 
 I'm in a awkward situation right now. I've done my best to describe the 
 use cases for clustered indexes. 

...

 Just to recap the general idea: reduce index size taking advantage of 
 clustering in the heap.
 
 Clustered indexes have roughly the same performance effect and use cases
 as clustered indexes on MS SQL Server, and Index-Organized Tables on
 Oracle, but the way I've implemented them is significantly different. On
 other DBMSs, the index and heap are combined into a single b-tree
 structure. The way I've implemented them is less invasive; there are no
 changes to the heap, for example, and it doesn't require moving live tuples.

Do you keep visibility info in the index?

How does this info get updated when visibility data changes in the
heap?

If there is no visibility data in the index, then I can't see how it gets
the same performance effect as Index-Organized Tables, as a lot of random
heap access is still needed.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] Daylight Saving Time question PostgreSQL 8.1.4

2007-03-14 Thread Zdenek Kotala

Andrew Dunstan wrote:

Martijn van Oosterhout wrote:

I think that from a data integrity point of view the current system is
the best. At the very least what you propose is a modularity violation:
Postgres depending on undocumented private data of another system
component.

I don't think you can reasonably describe the system timezone database
as undocumented private data. Plenty of other systems rely on it, as we
used to do.

But I take Tom's point about most users not knowing if their TZ database
is usable or not. Maybe we need a tool (maybe on pgfoundry) that will do
some analysis to find out, if such a thing is possible.


I guess some regression test should test TZ validity?

Zdenek



Re: [HACKERS] Daylight Saving Time question PostgreSQL 8.1.4

2007-03-14 Thread Alvaro Herrera
Andrew Dunstan wrote:
 Martijn van Oosterhout wrote:
 I think that from a data integrity point of view the current system is
 the best. At the very least what you propose is a modularity violation:
 Postgres depending on undocumented private data of another system
 component.
 
 I don't think you can reasonably describe the system timezone database 
 as undocumented private data. Plenty of other systems rely on it, as we 
 used to do.

No -- we relied on the libc's TZ API, which is not the same.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread mark
On Wed, Mar 14, 2007 at 02:28:03PM +, Gregory Stark wrote:
 David Fetter [EMAIL PROTECTED] writes:
  CREATE TABLE symptom (
  symptom_id SERIAL PRIMARY KEY, /* See above. */
  ...
  );
 
  CREATE TABLE patient_presents_with (
  patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
  symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
  UNIQUE(patient_id, symptom_id)
  );
 
 I'm just glad I don't have your doctor. I hope mine doesn't think symptoms are
 all boolean values.

Where is the boolean above? It is M:N, with each having whatever data
is required.

The issue I have with the above is that it seems unnecessarily
inefficient.  Whenever mapping from a patient to a symptom, or a
symptom to a patient, it requires searching indexes for three tables.
Perhaps this would work well if there was heavy overlap of symptoms
for different patients. For the cases I have hit this problem,
however, there may be overlap, but it is not easy to detect, and even
if it was detected, we would end with some sort of garbage collection
requirements where symptoms are removed once all references to the
symptoms are removed.

The case most familiar to me is a set of viewing preferences for web
pages. Some users specify no preferences, while others have dozens of
preferences. As I have no requirements to search for users with a
particular preference, I chose to solve this by packing many of the
preferences together into a TEXT field, and having the application
pack/unpack the data. I still have tables that map object id to
attribute/value, but they are used for the data that can require
longer queries. Without clustering the data, searching for a dozen
of these attributes requires either querying all attributes, where
the attributes could be scattered throughout the table, or querying
them one by one, which is worse.
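A sketch of the two layouts described above (table and column names are
made up):

CREATE TABLE user_prefs_packed (
    user_id integer PRIMARY KEY,
    prefs   text  -- many small preferences packed/unpacked by the application
);

CREATE TABLE user_prefs_eav (
    user_id   integer NOT NULL,
    attribute text NOT NULL,
    value     text,
    UNIQUE (user_id, attribute)
);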

If there was an efficient way to do this for both of my use cases,
I would be strongly tempted to use it. :-)

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/




Re: [HACKERS] [PATCHES] Bitmapscan changes

2007-03-14 Thread Joshua D. Drake
Hannu Krosing wrote:
 On Wed, 2007-03-14 at 10:22, Heikki Linnakangas wrote:
 Tom Lane wrote:
 At this point I'm feeling unconvinced that we want it at all.  It's
 sounding like a large increase in complexity (both implementation-wise
 and in terms of API ugliness) for a fairly narrow use-case --- just how
 much territory is going to be left for this between HOT and bitmap indexes?
 I'm in a awkward situation right now. I've done my best to describe the 
 use cases for clustered indexes. 
 
 ...
 
 Just to recap the general idea: reduce index size taking advantage of 
 clustering in the heap.

This is what I suggest:

Provide a tarball of -head with the patch applied.

Provide a couple of use cases that can be run, with an explanation of how
to verify them.

Allow the community to drive the inclusion by making it as easy as
possible for a proactive argument to take place among the people
actually using the product.

Proving that users could and would use the feature is a very powerful
argument.

Sincerely,

Joshua D. Drake





-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [HACKERS] Daylight Saving Time question PostgreSQL 8.1.4

2007-03-14 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 What is the actual problem being solved here? That people expected the
 timezone changes to be picked up automatically?  I think if you weigh it
 up, that problem is less significant than: ...

One other point is that symlinking to system timezone info will not
cause Postgres to pick up changes automatically.  You'll probably
still need a postmaster restart to get the new settings loaded in.
If the timezone info update is delivered as part of a Postgres update,
the need for this is fairly obvious, but if it's happening as part
of an update of a seemingly unrelated package, not so much.

To me, the idea of linking to system timezone info is a convenience
for packagers, and it is (and should be) the packagers' responsibility
to determine whether it's a sane idea for their environment.  If they
are competent to determine that, they are certainly competent enough
to make it happen without assistance from our configure script.

regards, tom lane



Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread Joshua D. Drake
Gregory Stark wrote:
 David Fetter [EMAIL PROTECTED] writes:
 
 CREATE TABLE symptom (
 symptom_id SERIAL PRIMARY KEY, /* See above. */
 ...
 );

 CREATE TABLE patient_presents_with (
 patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
 symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
 UNIQUE(patient_id, symptom_id)
 );
 
 I'm just glad I don't have your doctor. I hope mine doesn't think symptoms are
 all boolean values.

With a unique across (patient_id,symptom_id) you could have a single
patient with as many unique symptoms as could be listed.



Joshua D. Drake




-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread Gregory Stark
[EMAIL PROTECTED] writes:

 On Wed, Mar 14, 2007 at 02:28:03PM +, Gregory Stark wrote:
 David Fetter [EMAIL PROTECTED] writes:
  CREATE TABLE symptom (
  symptom_id SERIAL PRIMARY KEY, /* See above. */
  ...
  );
 
  CREATE TABLE patient_presents_with (
  patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
  symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
  UNIQUE(patient_id, symptom_id)
  );
 
 I'm just glad I don't have your doctor. I hope mine doesn't think symptoms
 are all boolean values.

 Where is the boolean above? It is M:N, with each having whatever data
 is required.

No, the above schema can only show whether a patient has or doesn't have a
symptom. There is nowhere to store *where* the pain, inflammation, swelling,
aneurism, etc is, or how severe it is, or when it occurred, etc.

In any case the above arguably *is* an EA schema anyways. Your symptom is
just as much an abstract meaningless concept from a database point of view as
the questionnaire's answer or the bug tracker's tag. Especially once you
start actually having to record information *about* the symptom.

This is a silly argument. The only reasonable conclusion is that any dogmatic
principle that doesn't take into account the application requirements is
wrong. In some cases you want a flexible abstract schema because the
application is flexible and abstract, in others you need the database schema
to understand your specific data structures so it can help you manipulate it.
You have to pick which is more useful for your application, you can't have
your cake and eat it too.

And all of a sudden I have a craving for cake...

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread Csaba Nagy
On Wed, 2007-03-14 at 16:08, [EMAIL PROTECTED] wrote:
 On Wed, Mar 14, 2007 at 02:28:03PM +, Gregory Stark wrote:
  David Fetter [EMAIL PROTECTED] writes:
   CREATE TABLE symptom (
   symptom_id SERIAL PRIMARY KEY, /* See above. */
   ...
   );
  
   CREATE TABLE patient_presents_with (
   patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
   symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
   UNIQUE(patient_id, symptom_id)
   );
  
  I'm just glad I don't have your doctor. I hope mine doesn't think symptoms 
  are
  all boolean values.
 
 Where is the boolean above? It is M:N, with each having whatever data
 is required.

The boolean is assumed in the symptoms table. In any case, even if it's
not a boolean value, even if the symptoms table is a complex one
on its own, it still is one single type for all symptoms of all
patients. The real problem is that in some real-world applications you
have a mix of wildly varying types of attributes a user might want to
use, and you can't know what those will be beforehand... the symptoms
thing is simple to solve in the way David did it, but there really are
other situations which a simple m:n can't easily cover. How would you
handle a database of user settings for 10K different applications and
100M different users, where each application must be able to store its
own (type-safe!) settings in the same structure, and applications come
and go with their own settings? Come up with a good solution to this,
combined with queries like "give me all the users who have this set of
settings set to these values" running fast, and then you're talking.

Cheers,
Csaba.





Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread Andrew Dunstan

Joshua D. Drake wrote:

 Gregory Stark wrote:

  David Fetter [EMAIL PROTECTED] writes:

   CREATE TABLE symptom (
   symptom_id SERIAL PRIMARY KEY, /* See above. */
   ...
   );

   CREATE TABLE patient_presents_with (
   patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
   symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
   UNIQUE(patient_id, symptom_id)
   );

  I'm just glad I don't have your doctor. I hope mine doesn't think symptoms
  are all boolean values.

 With a unique across (patient_id,symptom_id) you could have a single
 patient with as many unique symptoms as could be listed.

I think Greg is suggesting that the table needs to be augmented with,
say, a severity field.
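Something like this, perhaps (a sketch; severity is the hypothetical
addition):

CREATE TABLE patient_presents_with (
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
    severity   INTEGER,  -- per-occurrence detail, not a property of the symptom
    UNIQUE(patient_id, symptom_id)
);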


cheers

andrew



Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread mark
On Wed, Mar 14, 2007 at 03:25:48PM +, Gregory Stark wrote:
 [EMAIL PROTECTED] writes:
  On Wed, Mar 14, 2007 at 02:28:03PM +, Gregory Stark wrote:
  David Fetter [EMAIL PROTECTED] writes:
   CREATE TABLE symptom (
   symptom_id SERIAL PRIMARY KEY, /* See above. */
   ...
   );
   CREATE TABLE patient_presents_with (
   patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
   symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
   UNIQUE(patient_id, symptom_id)
   );
  I'm just glad I don't have your doctor. I hope mine doesn't think 
  symptoms are all boolean values.
  Where is the boolean above? It is M:N, with each having whatever data
  is required.
 No, the above schema can only show whether a patient has or doesn't have a
 symptom. There is nowhere to store *where* the pain, inflammation, swelling,
 aneurism, etc is, or how severe it is, or when it occurred, etc.

What do you think the '...' is in the symptom table? :-)

I'm not saying I would do it that way - but it seems to me that this
is a bit of a religious debate, without people listening to each
other. For one person to suggest a workable solution, and another
person to write it off as quickly as you did, misunderstanding, or
misrepresenting it, is what I would call not listening. :-)

 In any case the above arguably *is* an EA schema anyways. Your symptom is
 just as much an abstract meaningless concept from a database point of view as
 the questionnaire's answer or the bug tracker's tag. Especially once you
 start actually having to record information *about* the symptom.

 This is a silly argument. The only reasonable conclusion is that any dogmatic
 principle that doesn't take into account the application requirements is
 wrong. In some cases you want a flexible abstract schema because the
 application is flexible and abstract, in others you need the database schema
 to understand your specific data structures so it can help you manipulate it.
 You have to pick which is more useful for your application, you can't have
 your cake and eat it too.

No. Another reasonable conclusion is that the answer is not simple. This
doesn't mean the answer is undesirable. It means that people need more
time. :-)

I prefer abstractions, especially if they are more efficient than if I were
to roll my own each time. Relational databases do a good job today. They can
do a better job tomorrow.

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/




Re: [HACKERS] Daylight Saving Time question PostgreSQL 8.1.4

2007-03-14 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 But I take Tom's point about most users not knowing if their TZ database 
 is usable or not. Maybe we need a tool (maybe on pgfoundry) that will do 
 some analysis to find out, if such a thing is possible.

It's not really *that* hard: diff between our timezone files and the
system files will make it pretty clear.  For instance, diffing a CVS
HEAD installation against a not-too-up-to-date Fedora Core 5 system
shows only a few different files, reflecting the fact that they're
different snapshots of the zic database:

$ diff -r --brief . /usr/share/zoneinfo
Files ./America/Havana and /usr/share/zoneinfo/America/Havana differ
Files ./America/Santiago and /usr/share/zoneinfo/America/Santiago differ
Files ./Antarctica/Palmer and /usr/share/zoneinfo/Antarctica/Palmer differ
Files ./Australia/Perth and /usr/share/zoneinfo/Australia/Perth differ
Files ./Australia/West and /usr/share/zoneinfo/Australia/West differ
Files ./Chile/Continental and /usr/share/zoneinfo/Chile/Continental differ
Files ./Chile/EasterIsland and /usr/share/zoneinfo/Chile/EasterIsland differ
Files ./Cuba and /usr/share/zoneinfo/Cuba differ
Files ./Pacific/Easter and /usr/share/zoneinfo/Pacific/Easter differ
Only in ./US: Pacific-New
Only in /usr/share/zoneinfo: iso3166.tab
Only in /usr/share/zoneinfo: posix
Only in /usr/share/zoneinfo: posixrules
Only in /usr/share/zoneinfo: right
Only in /usr/share/zoneinfo: zone.tab
$

But IMHO the thing that you need to know to make an informed decision
is what the future update path for that system is likely to be.  In the
case of me packaging Postgres for Red Hat, I feel pretty comfortable
that there will be no major surgery on glibc's timezone code within
any single RHEL release series, so if it works today it'll still work
tomorrow.  A Sun engineer packaging Postgres for Solaris may be able
to make the same kind of determination.  But I think Joe Average User
is sticking his neck out if he assumes such a thing for say a Gentoo
box ...

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 3

2007-03-14 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  1. Grab the AutovacSchedule LWLock exclusively.
  2. Check to see if another worker is currently processing
  that table; if so drop LWLock and go to next list entry.
  3. Recompute whether table needs vacuuming; if not,
  drop LWLock and go to next entry.  (This test covers the
  case where someone vacuumed the table since you made your
  list.)
  4. Put table OID into shared memory, drop LWLock, then
  vacuum table.
  5. Clear current-table OID from shared memory, then
  repeat for next list entry.
 
  The point I'm not very sure about is that this proposal means we need to
  do I/O with the AutovacSchedule LWLock grabbed, to obtain up-to-date
  stats.
 
 True.  You could probably drop the lock while rechecking stats, at the
 cost of having to recheck for collision (repeat step 2) afterwards.
 Or recheck stats before you start, but if collisions are likely then
 that's a waste of time.  But on the third hand, does it matter?
 Rechecking the stats should be much cheaper than a vacuum operation,
 so I'm not seeing that there's going to be a problem.  It's not like
 there are going to be hundreds of workers contending for that lock...

Turns out that it does matter, because not only do we need to read pgstats,
but we also need to fetch the pg_autovacuum and pg_class rows again for
the table.  So we must release the AutovacuumSchedule lock before trying
to open pg_class etc.

Unless we are prepared to cache (keep a private copy of) the contents
of said tuples between the first check (i.e. when building the initial
table list) and the recheck?  This is possible as well, but it gives me
an uneasy feeling.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread Gregory Stark
[EMAIL PROTECTED] writes:

 On Wed, Mar 14, 2007 at 03:25:48PM +, Gregory Stark wrote:
 [EMAIL PROTECTED] writes:
  On Wed, Mar 14, 2007 at 02:28:03PM +, Gregory Stark wrote:
  David Fetter [EMAIL PROTECTED] writes:
   CREATE TABLE symptom (
   symptom_id SERIAL PRIMARY KEY, /* See above. */
   ...
   );
   CREATE TABLE patient_presents_with (
   patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
   symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
   UNIQUE(patient_id, symptom_id)
   );
  I'm just glad I don't have your doctor. I hope mine doesn't think 
  symptoms are all boolean values.
  Where is the boolean above? It is M:N, with each having whatever data
  is required.
 No, the above schema can only show whether a patient has or doesn't have a
 symptom. There is nowhere to store *where* the pain, inflammation, swelling,
 aneurism, etc is, or how severe it is, or when it occurred, etc.

 What do you think the '...' is in the symptom table? :-)

Ah, I did in fact miss that. So then this is just a standard EA schema. I
thought you were on the other side of the debate?


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread David Fetter
On Wed, Mar 14, 2007 at 02:28:03PM +, Gregory Stark wrote:
 David Fetter [EMAIL PROTECTED] writes:
 
  CREATE TABLE symptom (
  symptom_id SERIAL PRIMARY KEY, /* See above. */
  ...
  );
 
  CREATE TABLE patient_presents_with (
  patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
  symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
  UNIQUE(patient_id, symptom_id)
  );
 
 I'm just glad I don't have your doctor.  I hope mine doesn't think
 symptoms are all boolean values.

What's in the symptom table is up to the doctor.

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778AIM: dfetter666
  Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate



Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread Richard Huxton

David Fetter wrote:

On Wed, Mar 14, 2007 at 02:28:03PM +, Gregory Stark wrote:

David Fetter [EMAIL PROTECTED] writes:


CREATE TABLE symptom (
symptom_id SERIAL PRIMARY KEY, /* See above. */
...
);

CREATE TABLE patient_presents_with (
patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
UNIQUE(patient_id, symptom_id)
);

I'm just glad I don't have your doctor.  I hope mine doesn't think
symptoms are all boolean values.


What's in the symptom table is up to the doctor.


Surely the point is that for "high temperature" I need a temperature
value, whereas for "tooth decay" I'll want a tooth number (or whatever
they use). Which brings us back to where we started...


--
  Richard Huxton
  Archonet Ltd



Re: [HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-14 Thread Csaba Nagy
On Wed, 2007-03-14 at 16:50, David Fetter wrote:
 On Wed, Mar 14, 2007 at 02:28:03PM +, Gregory Stark wrote:
  David Fetter [EMAIL PROTECTED] writes:
  
   CREATE TABLE symptom (
   symptom_id SERIAL PRIMARY KEY, /* See above. */
   ...
   );
  
   CREATE TABLE patient_presents_with (
   patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
   symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
   UNIQUE(patient_id, symptom_id)
   );
  
  I'm just glad I don't have your doctor.  I hope mine doesn't think
  symptoms are all boolean values.
 
 What's in the symptom table is up to the doctor.

OK, and here's the problem: each doctor might want to put something else
in the symptom table. Each doctor might want to do it in a type-safe
way, e.g. so that the application enforces an enumeration of
high/moderate/low for the symptom "fever" (or maybe another doctor wants
it in exact degrees)... you can always stuff it in a string field, but you
know how reliable that will be.
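For instance, one doctor's type-safe variant might look like this (a sketch
with made-up names):

CREATE TABLE symptom_fever (
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    severity   TEXT NOT NULL CHECK (severity IN ('high', 'moderate', 'low'))
);

-- ...while another doctor wants exact degrees instead:
CREATE TABLE symptom_fever_degrees (
    patient_id  INTEGER NOT NULL REFERENCES patient(patient_id),
    temperature NUMERIC(4,1) NOT NULL CHECK (temperature BETWEEN 30 AND 45)
);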

Cheers,
Csaba.





Re: [HACKERS] [PATCHES] Bitmapscan changes

2007-03-14 Thread Alvaro Herrera
Joshua D. Drake wrote:

 Allow the community to drive the inclusion by making it as easy as
 possible to allow a proactive argument to take place by the people
 actually using the product.

This seems to be a rather poor decision-making process: "Are the users
happy with the new feature?  If so, then apply the patch."  It leads to
unmanageable code.

Which is why we don't do things that way.  The code must fit within the
general architecture before application -- particularly if it's an
internal API change.  That's what the review process is for.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



[HACKERS] hash join hashtable size and work_mem

2007-03-14 Thread Timothy J. Kordas


in nodeHash.c, the function ExecChooseHashTableSize() uses two different 
methods for determining the number of buckets to use.


The current code looks something like:

if (ntuples * tupsize > work_mem * 1024L)
    buckets = (work_mem * 1024L) / (tupsize * 10);
else
    buckets = ntuples / 10;

So for the case where a spill is expected, we use work_mem to decide on our
hash size. For the case where a spill isn't expected, we rely on the row
estimate alone -- and make no provision for speeding up the join by using
the memory that we're allowed to use.


When profiling large hash-joins, it often is the case that scanning the 
hash-buckets is a bottleneck; it would be nice for the user to be able to 
throw memory at a join to improve performance.


Am I missing something about the current implementation? I would expect
that the bucket count would be calculated something like:

buckets = (work_mem * 1024L) / (tupsize * NTUP_PER_BUCKET)

for both cases.

Making this change appears to improve hash-join performance substantially in
some cases, and as far as I can tell doesn't hurt anything (apart from using
memory that it is allowed to use given a particular work_mem setting).


-Tim
--
[EMAIL PROTECTED]




Re: [HACKERS] [PATCHES] Bitmapscan changes

2007-03-14 Thread Joshua D. Drake
Alvaro Herrera wrote:
 Joshua D. Drake wrote:
 
 Allow the community to drive the inclusion by making it as easy as
 possible to allow a proactive argument to take place by the people
 actually using the product.
 
 This seems to be a rather poor decision making process: Are the users
 happy with the new feature?  If so, then apply the patch.  It leads to
 unmanageable code.

Perhaps reading my message again is in order. I think it is pretty
obvious that a user shouldn't determine whether a patch should be applied.

My whole point was that if people are clamoring for the feature, it
could drive that feature to be more aggressively reviewed.

I can't even count how many times I see:

"This seems like a corner-case feature, I don't think we should add it."

So I am suggesting a way to ensure that the feature is not considered a
corner case (if it is indeed not a corner case).

Sincerely,

Joshua D. Drake

-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [HACKERS] hash join hashtable size and work_mem

2007-03-14 Thread Tom Lane
Timothy J. Kordas [EMAIL PROTECTED] writes:
 Am I missing something about the current implementation ?

If the planner has correctly predicted the number of rows, the table
loading should be about NTUP_PER_BUCKET in either regime.  Are you
sure you aren't just wishing that NTUP_PER_BUCKET were smaller?
I don't see that making the hashtable much larger than ntuples
is a good idea --- that just spreads out the live entries over more
cache lines, resulting in more cache thrashing.

regards, tom lane



Re: [HACKERS] hash join hashtable size and work_mem

2007-03-14 Thread Timothy J. Kordas

Tom Lane wrote:

If the planner has correctly predicted the number of rows, the table
loading should be about NTUP_PER_BUCKET in either regime.  Are you
sure you aren't just wishing that NTUP_PER_BUCKET were smaller?


Maybe I wish NTUP_PER_BUCKET was smaller. But I don't think that's the whole 
story.


The planner estimates definitely play a role in my concern here. For 
mis-estimated inner relations, the current calculation may over-subscribe 
the hash-table even if more work_mem was available (that is, there are too 
many hash collisions *and* memory isn't being used to the fullest extent 
allowed).


I've been tracking the number of tuples which land in each bucket, and I'd 
like to see that number go down as I increase work_mem.


I would expect, for the same data, a hash-join with a work_mem of 256MB to
run faster than one run with 32MB, even if the inner relation is only 30MB.


The implementation I've been experimenting with actually takes the average
of the current implementation (ntuples/10) and the spill version
(work_mem * 1024 / (tupsize * 10)).



-Tim




Re: [HACKERS] hash join hashtable size and work_mem

2007-03-14 Thread Tom Lane
Timothy J. Kordas [EMAIL PROTECTED] writes:
 I would expect for the same data a hash-join with a work_mem of 256MB to run 
 faster than one run with 32MB; even if the inner relation is only 30MB.

Once you get to the point where each tuple is in a different bucket, it
is clearly impossible for further increases in hashtable size to improve
matters.  All you can do is waste RAM and cache lines.

Now if we set NTUP_PER_BUCKET = 1 we would not be exactly at that critical
point because of uneven bucket loading and other factors ... but I
question whether there's enough incremental improvement available to
justify making the hashtable much larger than that.

regards, tom lane



Re: [HACKERS] hash join hashtable size and work_mem

2007-03-14 Thread Simon Riggs
On Wed, 2007-03-14 at 10:28 -0700, Timothy J. Kordas wrote:
 I would expect for the same data a hash-join with a work_mem of 256MB
 to run faster than one run with 32MB; even if the inner relation is
 only 30MB. 

Certainly not for all data, but for some distributions yes, probably.

The easiest thing to do is prove that's true and then work out how to
spot that case ahead of time, or at least find a place where you can
adjust your assumptions cheaply enough to improve things.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





[HACKERS] how to add seconds to a TimestampTz

2007-03-14 Thread Alvaro Herrera
Is there a better way than going to time_t and back?  I am currently
using this:

db->next_worker =
time_t_to_timestamptz(timestamptz_to_time_t(current_time) +
  autovacuum_naptime);

(db->next_worker is a TimestampTz, as is current_time;
autovacuum_naptime is an integer number of seconds)

but it doesn't seem clean, and the comments in the functions more or
less say that their use is discouraged.

I thought about doing it via the interval input/output functions, but
that's an awful lot of work ...

Is this the first time this is going to be done in the backend?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] how to add seconds to a TimestampTz

2007-03-14 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Is there a better way than going to time_t and back?

Isn't the standard SQL-level locution
timestamptz + numeric_value * '1 second'::interval
?  I'm not sure what would be the most convenient realization
of this at the C level, but please stay away from time_t ...
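At the SQL level that locution looks like, e.g. (a quick illustration):

SELECT now() + 60 * '1 second'::interval;  -- now plus 60 seconds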

regards, tom lane



[HACKERS] UPDATE using sub selects

2007-03-14 Thread NikhilS

Hi,

I have coded up a patch which solves the following TODO. I will submit
the patch soon:

- UPDATE
  - Allow UPDATE tab SET ROW (col, ...) = (SELECT...)
    http://archives.postgresql.org/pgsql-hackers/2006-07/msg01306.php


The question is that since this enhances the UPDATE syntax, what changes
need to be made, and where, with respect to the documentation?

Regards,
Nikhils
--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] UPDATE using sub selects

2007-03-14 Thread Tom Lane
NikhilS [EMAIL PROTECTED] writes:
 I have coded up a patch which solves the following TODO. I will submit a
 patch for this soon:
   http://archives.postgresql.org/pgsql-hackers/2006-07/msg01306.php

Cool...

 The question is that since this enhances the UPDATE syntax, what changes
 need to be made, and where, with respect to the documentation?

Documentation is the very least of your worries.  What exactly is your
implementation plan?  If this were a simple or narrow fix it would
have been done already.

regards, tom lane



Re: [HACKERS] UPDATE using sub selects

2007-03-14 Thread NikhilS

Hi,




 The question is that since this enhances the UPDATE syntax, what changes
 need to be made, and where, with respect to the documentation?

Documentation is the very least of your worries.  What exactly is your
implementation plan?  If this were a simple or narrow fix it would
have been done already.




The implementation that I have planned is pretty similar to the way INSERT
INTO ... SELECT has been implemented.

Along with the grammar changes in gram.y, the changes are localized in the
transformUpdateStmt code path. The SELECT clause ends up becoming a subquery
of the update query, with the target column expressions transformed properly
to include the subquery expressions. Does this sound ok?
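In other words (a rough sketch only, ignoring that the subselect should be
evaluated just once rather than per column), a statement like the one below
would behave roughly as if each target column pulled its value from the
subquery:

UPDATE update_test
   SET a = (SELECT a FROM update_test WHERE c = 'foo'),
       b = (SELECT b FROM update_test WHERE c = 'foo')
 WHERE a = 10;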

I have tried some update-subselect variations and they seem to work. For
example, the case in src/test/regress/sql/update.sql, which used to fail
until now, works:

UPDATE update_test SET (a,b) = (select a,b FROM update_test where c = 'foo')
 WHERE a = 10;

Will try testing out some other variations too.

Regards,
Nikhils
--
EnterpriseDB   http://www.enterprisedb.com