Re: [HACKERS] How can I use large object on PostgreSQL Linux

2005-02-08 Thread Premsun Choltanwanich


Thank you, Michael Fuhr.

I have now installed the contrib/lo and contrib/dbsize modules from source, as you suggested.

From my first report, I got 3 errors: /lo, /dbsize, and /admin. That means /admin is now the only one I cannot find anywhere.

I am still trying to find it. If anybody knows more about /admin (on Windows it is admin.dll), please tell me.


 Michael Fuhr [EMAIL PROTECTED] 08-Feb-05 11:07:10 AM 

On Mon, Feb 07, 2005 at 02:33:24PM +0700, Premsun Choltanwanich wrote:
  I'd guess that you haven't installed some third-party modules that
  you need on the Linux box, or that you've installed them in the wrong
  place.
 I don't make sure about third-party information cause all of module I
 use for Windows Version is install already on the time I install
 PostgreSQL RC1 on Windows.

If you built PostgreSQL from source then you can install the
contrib/lo and contrib/dbsize modules from source as well. For
more information see contrib/README, contrib/lo/README.lo, and
contrib/dbsize/README.dbsize.

-- 
Michael Fuhr
http://www.fuhr.org/~mfuhr/


Re: [HACKERS] pg_dump bug in 7.3.9 with sequences

2005-02-08 Thread Christopher Kings-Lynne
 Not that I can see.  You still have to get the sequence object from
 someplace, and having it be a distinct datatype is zero help for finding
 a column-specific sequence.

What about allowing sequence qualifiers in the SERIAL definition?
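
For illustration only, a purely hypothetical sketch of what such
qualifiers might look like (this is not accepted syntax in any PostgreSQL
release), next to the two-step form that works today:

  -- hypothetical: sequence options attached directly to SERIAL
  CREATE TABLE orders (
      id SERIAL (MINVALUE 1000 INCREMENT BY 10) PRIMARY KEY,
      placed_at timestamp
  );

  -- today's equivalent:
  CREATE TABLE orders (id SERIAL PRIMARY KEY, placed_at timestamp);
  ALTER SEQUENCE orders_id_seq MINVALUE 1000 INCREMENT BY 10
    RESTART WITH 1000;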
Chris


[HACKERS] AT TIME ZONE

2005-02-08 Thread Christopher Kings-Lynne
Hi,
Just a quick check that the extension to AT TIME ZONE to allow 
specifying intervals as well as country/city is on the list for 8.1.

I believe it was a fairly simple thing to do now that we have our own 
time zone library...
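
For reference, the two forms in question look like this; which of them a
given release accepts is exactly the open question (examples illustrative):

  SELECT TIMESTAMP '2005-02-08 10:00' AT TIME ZONE INTERVAL '-08:00';
  SELECT TIMESTAMP '2005-02-08 10:00' AT TIME ZONE 'America/Los_Angeles';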

Chris



Re: [HACKERS] Cross column statistics

2005-02-08 Thread Christopher Kings-Lynne
 B) gather a full matrix of the level of correlation between each column and
    each other column. If this were a single floating point number per pair
    then it might be feasible. It would still obviously be n^2 in the number of
    columns though, so there would have to be some way to limit on how many
    columns would be analyzed this way.

Use foreign keys to just record those cross-correlations.
Chris


Re: [HACKERS] AT TIME ZONE

2005-02-08 Thread Magnus Hagander
 Hi,
 
 Just a quick check that the extension to AT TIME ZONE to 
 allow specifying intervals as well as country/city is on the 
 list for 8.1.
 
 I believe it was a fairly simple thing to do now that we have 
 our own time zone library...

Yeah, this is on my personal hope-to-do-for-8.1 list. At least the
country/city part; haven't really thought about the other one.

//Magnus



Re: [HACKERS] AT TIME ZONE

2005-02-08 Thread Christopher Kings-Lynne
 Yeah, this is on my personal hope-to-do-for-8.1 list. At least the
 country/city part; haven't really thought about the other one.

One of the two forms already works...can't quite remember which...


Re: [HACKERS] Is there a way to make VACUUM run completely outside

2005-02-08 Thread Hannu Krosing
On a fine day (Monday, 7 February 2005, 19:01 -0300), Alvaro Herrera
wrote:

  Also, why must it be run outside of a transaction block if it can be
  rolled back?
 
 A vacuum actually uses several transactions, so it wouldn't work as the
 user would expect if run in a transaction.  The first one is committed
 rather early and new ones are opened and closed.  (One per table, IIRC.)

So I guess that making it commit and open a new transaction at a regular
interval (say, every minute) while vacuuming a single table would not
alter its visible behaviour. That would solve my problem of long-running
vacuums on large tables polluting unrelated small but heavily updated
tables with dead tuples.

I'll take a peek at the code and try to come up with a naive proposal you
can shoot down ;)

-- 
Hannu Krosing [EMAIL PROTECTED]



Re: [HACKERS] AT TIME ZONE

2005-02-08 Thread Michael Glaesemann
On Feb 8, 2005, at 20:43, Christopher Kings-Lynne wrote:
  Yeah, this is on my personal hope-to-do-for-8.1 list. At least the
  country/city part; haven't really thought about the other one.
 One of the two forms already works...can't quite remember which...
I think this is perhaps what you were trying to remember:
http://archives.postgresql.org/pgsql-hackers/2004-10/msg00870.php
Michael Glaesemann
grzm myrealbox com


Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread pgsql
 [EMAIL PROTECTED] wrote:

 In this case, the behavior observed could be changed by altering the
 sample size for a table. I submit that an arbitrary fixed sample size
 is not a good base for the analyzer, but that the sample size should
 be based on the size of the table or some calculation of its deviation.


Mark,

 Do you have any evidence that the Sample Size had anything to do
 with the performance problem you're seeing?

I have evidence: if you look through some of the messages in this thread,
you'll see how a sample size of 1 provides enough data points to
create stats the planner can use.


 I also do a lot with the complete Census/TIGER database.

Cool, have any code for Mapserver?


 Every problem I have with the optimizer comes down to the
 fact that the data is loaded (and ordered on disk) by
 State/County FIPS codes, and then queried by zip-code
 or by city name.  Like this:

  Alabama36101 [hundreds of pages with zip's in 36***]
  Alaska 99686 [hundreds of pages with zip's in 9]
  Arizona85701 [hundreds of pages with zip's in 855**]

 Note that the zip codes are *NOT* sequential.

 The correlation statistic sees that the Zip codes are not
 sequential; so it makes the *HORRIBLE* assumption that they
 are scattered randomly across the disk.

It is my theory that this is because there are too few data points with
which to properly characterize the nature of the data.


 In reality, even though there's no total ordering of the
 zip codes; any given zip code only exists on a couple
 disk pages; so index scans would be the right choice.
I totally agree.


 But the single correlation parameter is not sufficient
 to let the optimizer known this.

 No matter how large a sample size you choose, ANALYZE
 will correctly see that Zip codes and State FIPS codes
 are non-correlated, and the optimizer will overestimate
 the # of pages an index scan will need.


I tried to create an analogy in another post, and TIGER is a perfect
example of the analogy.

Think of the difference between an oscilloscope and a spectrum analyzer.
The current sampling code works more like an oscilloscope. It assumes a
fairly normalized distribution of data. Given this, it works perfectly
fine.

When a scope is presented with an audio signal, it looks more like
gibberish showing almost no correlation. When you view it in frequency
domain, as with a spectrum analyzer, you can see clear patterns in the
signal.

Now, fortunately, we don't need any sort of absolute visualization of the
data in TIGER; we only need to see that the data has many subtle trends
rather than one fairly evenly distributed one. That's why more samples
work.

If we could do anything, I would add more statistics to the database: a
standard deviation and maybe a sliding-window deviation. A standard
deviation might be pretty high, whereas a sliding window would show less
localized deviation. Less localized deviation would favor index scans.

Anyway, like I said, I think the expectation that the data is fairly
normalized or evenly distributed works very well for data acquired over
time. It is data like TIGER, which is in a multiple-field order (i.e.
state, zipr, zipl), that has complex patterns in the secondary sorts that
can't be detected with too small a sample.
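
As a concrete illustration, the per-column knob for this already exists;
something like the following raises the sample for one column and
re-gathers stats (the table and column names here are hypothetical,
borrowed from the example above):

  ALTER TABLE tiger_lines ALTER COLUMN zipl SET STATISTICS 1000;
  ANALYZE tiger_lines;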

 PS: I pointed out workarounds in my earlier posting
 in this thread.  Yes, I'm using the same TIGER data
 you are.


Cool.





Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread Stephen Frost
* Mark Kirkwood ([EMAIL PROTECTED]) wrote:
 I can see your point, however I wonder if the issue is that the default
 stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and
 maybe we should consider making a higher value (say '100') the default.

Personally, I think that'd be reasonable.

 The idea of either automatically increasing sample size for large
 tables, or doing a few more samplings with different sizes and examining
 the stability of the estimates is rather nice, provided we can keep the
 runtime for ANALYZE to reasonable limits, I guess :-)

I also agree with this and personally don't mind *too* much if analyze
takes a little while on a large table to get decent statistics for it.
One thing I was wondering about, though, is whether we could use the
index to get some of the statistics information. Would it be possible, or
reasonable? Do we already? I dunno, just some thoughts; I keep hearing
about the number of rows that are sampled, and I would have thought it'd
make sense to scan the index for things like the number of distinct
values...
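
For reference, the global knob under discussion can be tried per session
before changing any default (a sketch; 100 is the value suggested above):

  SET default_statistics_target = 100;
  ANALYZE;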

Stephen




Fw: Re: [HACKERS] float4 regression test failed on linux parisc

2005-02-08 Thread Jim Buttafuoco
Tom,

I'm back with this issue. I have compared src/backend/utils/adt/float.c
from 7.4.6 against CVS HEAD. There was some work done on the infinity
handling (I don't know by whom; I am NOT a CVS expert/user). The problem
I see is that float4in does a check to see if the value is infinity
BEFORE calling CheckFloat4Val (this was added for 8.0), but float4div
(and friends) doesn't. All I want to do is add a check in CheckFloat4Val
for infinity (and remove the individual checks before the CheckFloat4Val
call in other routines).

I hope I have explained my problem and solution. 
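
For reference, the case in question (from the forwarded message below) is:

  select 'Infinity'::float4 / 'Infinity'::float4;  -- should return NaN
  select 'Infinity'::float4 / 1::float4;           -- infinite input; should
                                                   -- not be reported as overflow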

Jim
-- Forwarded Message ---
From: Jim Buttafuoco [EMAIL PROTECTED]
To: Tom Lane [EMAIL PROTECTED]
Cc: pgsql-hackers pgsql-hackers@postgresql.org
Sent: Tue, 1 Feb 2005 17:20:17 -0500
Subject: Re: [HACKERS] float4 regression test failed on linux parisc 

Tom,

The issue is with select 'Infinity'::float4/'Infinity'::float4; which
should return NaN. Without the cast I get the overflow message from
CheckFloat4Val; with the cast I get NaN (as expected). How about testing
for isnan() inside CheckFloat4Val (just for PARISC / Linux)?

I am trying to get this system working for the buildfarm, as there are NO
other HP PARISC systems on the farm.

Jim

-- Original Message ---
From: Tom Lane [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: pgsql-hackers pgsql-hackers@postgresql.org
Sent: Tue, 01 Feb 2005 17:13:52 -0500
Subject: Re: [HACKERS] float4 regression test failed on linux parisc

 Jim Buttafuoco [EMAIL PROTECTED] writes:
  Change:
  CheckFloat4Val(result);
  To:
  CheckFloat4Val((float4)result);
 
 CheckFloat4Val is defined to take a double, so whatever the above is
 accomplishing is wrong: probably it's masking an out-of-range result.
 I think you've hit a bug in Debian's version of gcc for PA-RISC.
 
   regards, tom lane
--- End of Original Message ---
--- End of Forwarded Message ---




Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread pgsql
 [EMAIL PROTECTED] writes:

 The basic problem with a fixed sample is that it assumes a normal
 distribution.

 That's sort of true, but not in the way you think it is.

[snip]

Greg, I think you have an excellent ability to articulate stats, but I
think that the view that this is like election polling is incorrect.

Election polling assumes a very simple outcome: some standard distribution
of a limited number of options. I don't think it applies to this.


 When you look at a sine wave on an oscilloscope, you can see it clear as
 day. When you look at music on the scope, you know there are many waves
 there, but it is difficult to make heads or tails of it. (use xmms or
 winamp to see for yourself) The waves change in frequency, amplitude,
 and
 duration over a very large scale. That's why you use a spectrum analyzer
 to go from time domain to frequency domain. In frequency domain, you can
 see the trends better.

 That's not a bad analogy to many problems where you're measuring data
 that has non-randomness in it but that is not visible in the domain of
 the statistics being analyzed. This seems to happen a lot with
 geographic data, for instance.

EXACTLY!!!


 If you find that increasing the stats targets improves things then this
 isn't true. If you find that it doesn't then what's really needed is a
 cleverer set of statistics to look for.

I will be the first one to say that increasing the samples is not perfect,
but it is a methodology that will help without major changes in postgres.
Simply increasing the samples to a percentage of the estimated number of
rows (with some upper and lower limits of course) will increase the
accuracy of the n_distinct and correlation settings (at least a little
bit), and that will make a huge impact with very little work.

If we want to discuss improved statistics, then we should include a
standard deviation and a sliding-window deviation, or something like that.
Hell, maybe even FFT.

The basic problem, I think, is that the sampling mechanism is more like an
oscilloscope looking for large trends instead of a spectrum analyzer
looking for the smaller ones.

We have to be able to tell the planner that adjacent values are less
random even though, as a whole, they are seemingly random.



Re: [HACKERS] float4 regression test failed on linux parisc

2005-02-08 Thread Tom Lane
Jim Buttafuoco [EMAIL PROTECTED] writes:
 All I want to do is add a check in CheckFloat4Val for infinity (and
 remove the individual checks before the CheckFloat4Val call in other
 routines).

That's not at all what you proposed before, and it would have vastly
more side-effects than just removing the platform-dependent behavior
you are on about.  If we did that then this would work:

regression=# select ('infinity'::float4) / (1::float4);
ERROR:  type real value out of range: overflow

... which arguably it ought to, but you'd be changing the behavior
everywhere not just for your broken compiler.

I think the real question we ought to face up to sometime is what it is
we are trying to accomplish with CheckFloat4Val and CheckFloat8Val in
the first place.  The latter routine in particular seems pretty
ill-advised to me: if something can be represented as a double then why
don't we just allow it?

ISTM that what we really want is to reject out-of-range results, as in
these examples:

regression=# select (1e37::float4) / (1e-37::float4);
ERROR:  type real value out of range: overflow
regression=# select (1e300::float8) / (1e-37::float8);
ERROR:  type double precision value out of range: overflow
regression=#

On machines that have IEEE infinity, I think it would work to report
overflow if the result is infinity when neither input is.  But I dunno
how well that works on non-IEEE hardware.  Also, what about rejecting
NaN results?  Thoughts anyone?

regards, tom lane



Re: [HACKERS] problems after crash

2005-02-08 Thread Tom Lane
ohp@pyrenet.fr writes:
 I had a crash last night, and since then, while vacuuming databases
 (either full or lazy), I get this error:

 duplicate key violates unique constraint pg_statistic_relid_att_index

Reindexing pg_statistic would probably make that go away ...

regards, tom lane



Re: [HACKERS] float4 regression test failed on linux parisc

2005-02-08 Thread Jim Buttafuoco
Tom,

The other option is to note that on older systems (and I mean real old
systems where the FP unit is sub-par) this test is likely to fail. I have
now seen this on my real old Alpha and now HP PARISC systems. Is there a
way to just modify the regression test to skip these tests on these
platforms?

Jim


-- Original Message ---
From: Tom Lane [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: pgsql-hackers pgsql-hackers@postgresql.org
Sent: Tue, 08 Feb 2005 10:25:26 -0500
Subject: Re: [HACKERS] float4 regression test failed on linux parisc 

 Jim Buttafuoco [EMAIL PROTECTED] writes:
  All I want to do is add a check in CheckFloat4Val for infinity (and remove 
  the 
  individual checks before the CheckFloat4Val call in other routines).
 
 That's not at all what you proposed before, and it would have vastly
 more side-effects than just removing the platform-dependent behavior
 you are on about.  If we did that then this would work:
 
 regression=# select ('infinity'::float4) / (1::float4);
 ERROR:  type real value out of range: overflow
 
 ... which arguably it ought to, but you'd be changing the behavior
 everywhere not just for your broken compiler.
 
 I think the real question we ought to face up to sometime is what it is
 we are trying to accomplish with CheckFloat4Val and CheckFloat8Val in
 the first place.  The latter routine in particular seems pretty
 ill-advised to me: if something can be represented as a double then why
 don't we just allow it?
 
 ISTM that what we really want is to reject out-of-range results, as in
 these examples:
 
 regression=# select (1e37::float4) / (1e-37::float4);
 ERROR:  type real value out of range: overflow
 regression=# select (1e300::float8) / (1e-37::float8);
 ERROR:  type double precision value out of range: overflow
 regression=#
 
 On machines that have IEEE infinity, I think it would work to report
 overflow if the result is infinity when neither input is.  But I dunno
 how well that works on non-IEEE hardware.  Also, what about rejecting
 NaN results?  Thoughts anyone?
 
   regards, tom lane
--- End of Original Message ---




[HACKERS] One Big trend vs multiple smaller trends in table statistics

2005-02-08 Thread pgsql
A couple of us using the US Census TIGER database have noticed something
about the statistics gathering of analyze. If you follow the thread Query
Optimizer 8.0.1 you'll see the progression of the debate.

To summarize what I think we've seen:

The current implementation of analyze is designed around sampling a table
to characterize the basic trend of the data. The problem with the approach
is that it assumes that the data has a singular trend behavior.

Greg Stark's post Cross column statistics touches on the general problem.

The best analogy so far is the difference between an oscilloscope and a
spectrum analyzer. The current statistics gathering is like a sampling
oscilloscope trying to display a single wave form.

Some data trends are more like audio signals, where the data has many
smaller trends in a seemingly random stream. With a spectrum analyzer you
can see the various components. Use Winamp or XMMS for a visualization.

Let's assume the data is in a multiple-column sort order. Let's assume it
is a set of street addresses sorted by:

state, streetname, streettype, address

MA, ABBOT, RD, 100
MA, ABBOT, RD, 200
MA, ABBOT, RD, 300
MA, ABBOT, ST, 100
MA, ABBOT, ST, 200
MA, MAPLE, RD, 100
MA, MAPLE, RD, 200
MA, MAPLE, ST, 100
...
...
WY, ABBOT, RD, 100
etc.


This table has MILLIONS of rows, every single address in the country. The
trend of state is clearly an increasing step ramp over the entire table.
The trend of streetname can be imagined as a waveform of a series of ramps
for each state. The trend of streettype, similarly, is a series of ramps
per street name, and the wave form for address is a ramp for each
streettype.

The statistics PostgreSQL currently employs will work great for state,
but much less so for streetname.

A query of select * from addresses where streetname = 'ABBOT' will be
seen as more expensive than it really is. Most of the ABBOTs will be
together in about 50 clusters (one for each state, assuming every state
has at least one ABBOT), but the current stats are not designed to detect
this.
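
A quick way to see the misestimate on such a table (illustrative; the
table and column are the ones from the example above):

  EXPLAIN ANALYZE SELECT * FROM addresses WHERE streetname = 'ABBOT';
  -- compare the planner's estimated rows/cost for the index scan
  -- against the actual pages it ends up touching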

Yes, eventually, if the sub-trends are small enough, the index scans
become more expensive than table scans, but the current stats can't tell
where that point is. Clearly it is not at the secondary sort (or
streetname) level.

I've found that increasing the sample size in analyze.c can help in
specific cases, but the overall problem remains.

The question is: Is this really a problem? If so, what can we do?

I was thinking of trying to compute a sliding-window standard deviation,
which should be able to detect smaller trends in an overall table; this
would require a lot of work in analyze.c.

If the sliding window deviation is low, then the correlation of the table
should be increased, telling the planner that an index scan is a better
choice. The actual math behind the values has to be worked out, of course,
but what do you think about the idea?
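
One rough way to eyeball the idea on an existing table is to group a
column's values by the heap page they live on, using the textual form of
the ctid system column (a sketch only; assumes a server that can cast
ctid to text, and a table t with an integer column c):

  SELECT split_part(ltrim(ctid::text, '('), ',', 1)::int AS page,
         stddev(c) AS local_stddev
  FROM t
  GROUP BY 1
  ORDER BY 1;

A low per-page deviation combined with a low whole-table correlation is
exactly the clumping pattern described above.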



Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread Bruno Wolff III
On Mon, Feb 07, 2005 at 17:45:23 -0500,
  Greg Stark [EMAIL PROTECTED] wrote:
 
 However for discrete values like the top ten most common values and the
 total number of distinct values it's not so clear at all that you can
 extrapolate from a sample at all. And it's certainly not clear that a fixed
 size sample gives you at all the same confidence for a large population as it
 does for a small population.

If you were to keep a complete histogram for the sample points (which may
or may not be practical), you should be able to estimate the number of
distinct values under some assumptions, such as that all values outside
of the top N values have the same likelihood. I don't think this is
unreasonable.
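
For comparison, the estimate ANALYZE already stores can be checked
against reality (a sketch; assumes a table addresses with a column zip):

  SELECT n_distinct FROM pg_stats
   WHERE tablename = 'addresses' AND attname = 'zip';
  SELECT count(DISTINCT zip) FROM addresses;  -- exact, but needs a full scan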

 
 If you sampled across the country and found your sample of 600 people had 4
 different top choices for president, how do you extrapolate that to calculate
 the total number of top choices for president the 300 million+ people will
 have across the country? You could multiply by 500k but that's not going to be
 right. Hell you're probably better off just going with 4 but that's clearly
 not right either.

Well, you can put some bounds on the number. Since no 5th candidate was
picked by any of the 600 people, you would expect that the number of
people preferring other candidates is on the order of 50 or less, so
that the number of distinct values is going to be 50 or less.

I think the histogram idea will work well for estimating the number of
rows with particular values, but if you are interested in the number
of unique values, you are going to have problems with some data sets.
(Ones with a few very common values and lots of extremely rare items.)
In this case there may be some way to use information from indexes on
the data to get better results.



Re: [HACKERS] float4 regression test failed on linux parisc

2005-02-08 Thread Tom Lane
Jim Buttafuoco [EMAIL PROTECTED] writes:
 this test is likely to fail. I have now seen this on my real old Alpha
 and now HP PARISC systems.

It works fine on PARISC, and has ever since I've been associated with
this project --- I run these tests multiple times a day on old HP
hardware, and they have always passed with every compiler I've used
(both gcc and HP's).  Lots of people have reported clean passes on Alpha
as well.  One more time: you have a compiler bug, and you really ought
to be griping to the gcc people not us.

regards, tom lane



Re: [HACKERS] Is there a way to make VACUUM run completely outside

2005-02-08 Thread Alvaro Herrera
On Tue, Feb 08, 2005 at 01:55:47PM +0200, Hannu Krosing wrote:

  So I guess that making it commit and open a new transaction at a regular
  interval (say, every minute) while vacuuming a single table would not
  alter its visible behaviour. That would solve my problem of long-running
  vacuums on large tables polluting unrelated small but heavily updated
  tables with dead tuples.

Interesting.  The problem is that a long running VACUUM on a single
table will keep in PGPROC a TransactionId that will last very long,
which will pollute every concurrent Snapshot; so smaller tables can't
be cleaned up because the tuples are visible for the transaction running
the vacuum -- except that that transaction cannot possibly want to look
at them.
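
The effect is easy to reproduce with any long-lived transaction standing
in for the VACUUM (a sketch; the table name is hypothetical):

  -- session 1: hold an old snapshot open
  BEGIN;
  SELECT 1;
  -- session 2: churn a small table, then try to clean it
  UPDATE busy_table SET v = v + 1;
  VACUUM VERBOSE busy_table;
  -- VERBOSE will report dead row versions that cannot be removed yet,
  -- until session 1 commits or aborts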

-- 
Alvaro Herrera ([EMAIL PROTECTED])
Sadness is a wall between two gardens (Khalil Gibran)



Re: [HACKERS] float4 regression test failed on linux parisc

2005-02-08 Thread Jim Buttafuoco
Except isinf() works just fine on my system. It's just when CheckFloat4Val
is called with infinity as the val that you get the overflow message. If I
move the isinf check into CheckFloat4Val, all is fine.

If you don't want to fix this, it's fine with me. I am just reporting
problems and trying to fix them. I will shut up now and put my energy into
other causes!

Jim



-- Original Message ---
From: Tom Lane [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: pgsql-hackers pgsql-hackers@postgresql.org
Sent: Tue, 08 Feb 2005 11:42:11 -0500
Subject: Re: [HACKERS] float4 regression test failed on linux parisc 

 Jim Buttafuoco [EMAIL PROTECTED] writes:
  this test is likely to fail. I have now seen this on my real old Alpha
  and now HP PARISC systems.
 
 It works fine on PARISC, and has ever since I've been associated with
 this project --- I run these tests multiple times a day on old HP
 hardware, and they have always passed with every compiler I've used
 (both gcc and HP's).  Lots of people have reported clean passes on Alpha
 as well.  One more time: you have a compiler bug, and you really ought
 to be griping to the gcc people not us.
 
   regards, tom lane
--- End of Original Message ---




[HACKERS] correlation in pg_stats

2005-02-08 Thread Ron Mayer

Short summary:

  * It looks to me like the planner vastly overestimates 
the # of pages read by index scan in quite a few of my
tables even though stats collected by ANALYZE are correct.

  * The problem happens any time you have multiple columns
that have a number of repeated values in them, and
you CLUSTER the table by a sort using both columns
(like city,state,zip,phone# or firstname,lastname).

  * I think this is the problem that Mark Kirkwood is seeing
in his threads Query optimizer 8.0.1 and One Big trend
vs multiple smaller trends in hackers.

  * A test script demonstrating the issue also follows.

  * I think keeping one more stat per attribute in
    pg_statistic could describe this behavior.


Longer:  


  If I understand the optimizer correctly, correlation is used both to
  guess how much random disk access will be required in a query and to
  estimate how many pages will be read.

  Unfortunately, many tables in my larger databases have
  columns with values that are tightly packed on a few pages;
  even though there is no total-ordering across the whole table.
  Stephan Szabo described this as a clumping effect:
  http://archives.postgresql.org/pgsql-performance/2003-01/msg00286.php


  The test script below shows a table with 6 columns and 
  1 million rows.

  Note that ANALYZE correctly observes that the correlation
  for all the columns except A is near zero.

  However, also note that columns A, B, C, and even D will have
  extremely few UNIQUE values on any given page of data.  And
  conversely, an index scan on any of those columns would be a good
  choice (as seen by the EXPLAIN ANALYZE output below).


  Instead of just storing correlation, I would also like
  to store a variation of correlation that pretends that
  the disk-pages for a particular column were sorted by the
  min value for that column.  Wherever the optimizer would
  use the existing 'correlation' to estimate the number
  of pages that would be accessed, it could use this
  'correlation discounting the order of blocks' value instead.
  All the math/statistics/theory that suggests correlation is
  good at estimating # of pages would remain intact.
  Anyway... I've talked too much...

The test script showing 
   1) the tables
   2) the values in pg_stats
   3) EXPLAIN ANALYZE of columns showing the problem
follows:

fli=# create temporary table tmp1mil as
  select * from
 (select generate_series as a from generate_series(0,9)) as a,
 (select generate_series as b from generate_series(0,9)) as b,
 (select generate_series as c from generate_series(0,9)) as c,
 (select generate_series as d from generate_series(0,9)) as d,
 (select generate_series as e from generate_series(0,9)) as e,
 (select generate_series as f from generate_series(0,9)) as f
  order by a,b,c,d,e,f;
SELECT
fli=# vacuum analyze tmp1mil;
VACUUM
fli=# select * from pg_stats where tablename='tmp1mil';

 attname | null_frac | avg_width | n_distinct | most_common_vals      | correlation
---------+-----------+-----------+------------+-----------------------+-------------
 a       |         0 |         4 |         10 | {7,8,5,0,6,3,4,1,2,9} |           1
 b       |         0 |         4 |         10 | {1,4,0,5,6,2,9,7,3,8} |    0.229754
 c       |         0 |         4 |         10 | {9,5,0,1,8,6,4,7,3,2} |    0.142119
 d       |         0 |         4 |         10 | {4,3,2,8,7,9,5,6,1,0} |   0.0930835
 e       |         0 |         4 |         10 | {0,5,1,9,4,7,8,2,3,6} |

(schemaname is pg_temp_4 and tablename is tmp1mil for every row; the
most_common_freqs are all close to 0.1 for every attribute, and the
histogram_bounds are empty)

Re: [HACKERS] Cross column statistics

2005-02-08 Thread Greg Stark
Christopher Kings-Lynne [EMAIL PROTECTED] writes:

  B) gather a full matrix of the level of correlation between each column
     and each other column. If this were a single floating point number
     per pair then it might be feasible. It would still obviously be n^2
     in the number of columns though, so there would have to be some way
     to limit on how many columns would be analyzed this way.

 Use foreign keys to just record those cross-correlations.

My email was about cross-column intra-table correlations; inter-table
correlations are a whole other ball of wax.

-- 
greg




Re: [HACKERS] correlation in pg_stats

2005-02-08 Thread pgsql

 Short summary:

   * It looks to me like the planner vastly overestimates
 the # of pages read by index scan in quite a few of my
 tables even though stats collected by ANALYZE are correct.

   * The problem happens any time you have multiple columns
 that have a number of repeated values in them, and
 you CLUSTER the table by a sort using both columns
 (like city,state,zip,phone# or firstname,lastname).

   * I think this is the problem that Mark Kirkwood is seeing
 in his threads Query optimizer 8.0.1 and One Big trend
 vs multiple smaller trends in hackers.

Actually, [EMAIL PROTECTED] is Mark Woodward. Pleased to meet you.
:)

(I hate using my name on lists like this because of spammers)


   * A test script demonstrating the issue also follows.

   * I think keeping one more stat per attribute in
     pg_statistic could describe this behavior.


 Longer:


   If I understand the optimizer correctly,  correlation is used
   to both guess how much random disk access will be required in
   a query; as well as estimate how many pages will be read.

   Unfortunately, many tables in my larger databases have
   columns with values that are tightly packed on a few pages;
   even though there is no total-ordering across the whole table.
   Stephan Szabo described this as a clumping effect:
   http://archives.postgresql.org/pgsql-performance/2003-01/msg00286.php

Yes.

I think we are describing the exact same issue.




Re: [HACKERS] PHP/PDO Database Abstraction Layer

2005-02-08 Thread Joshua D. Drake
  Anyone up for it? This is our chance to get a really top notch PHP
  driver for PostgreSQL that supports all the appropriate goodies.

 What exactly do you want people to do?

My thought process was to beat the heck out of it. Find things that
don't work, should work, or aren't implemented, and hopefully make them
work and get them implemented.
Sincerely,
Joshua D. Drake




--
Command Prompt, Inc., your source for PostgreSQL replication,
professional support, programming, managed services, shared
and dedicated hosting. Home of the Open Source Projects plPHP,
plPerlNG, pgManage,  and pgPHPtoolkit.
Contact us now at: +1-503-667-4564 - http://www.commandprompt.com




Re: [HACKERS] PHP/PDO Database Abstraction Layer

2005-02-08 Thread Alvaro Herrera
On Tue, Feb 08, 2005 at 10:10:58AM -0800, Joshua D. Drake wrote:
 Anyone up for it? This is our chance to get a really top notch PHP
 driver for PostgreSQL that supports all the appropriate goodies.
 
 What exactly do you want people to do?
 
 My thought process was to beat the heck out of it. Find things that
 don't work, should work, aren't implemented and hopefully make them
 work and get them implemented.

The get last inserted Id business is likely to be hard to implement.
It doesn't even take parameters ...
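
For context, the server-side idiom such a call would have to map onto is
per-session and sequence-specific (names illustrative):

  INSERT INTO orders (customer) VALUES ('bob');
  SELECT currval('orders_id_seq');  -- last id this session generated
                                    -- from that particular sequence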

-- 
Alvaro Herrera ([EMAIL PROTECTED])
Tulio: Oh, what might this button do, Juan Carlos?
Policarpo: No, get back, don't touch the console!
Juan Carlos: I'll press it again and again.



[HACKERS] external indices ...

2005-02-08 Thread Marc G. Fournier
I believe that this is what Oleg et al tap into with the tsearch2 stuff, 
no?  I have someone asking me about it, and want to make sure that I'm 
telling him the right answer ... is this what GiST is?  And, if so, what 
is a *good* doc for me to point them at to get up to speed with it?  Is 
what we have in the docs the best place, or is there something someone 
else has written that gets into it even more?

Thanks ...

Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664


Re: [HACKERS] problems after crash

2005-02-08 Thread ohp
Hi Tom,

The problem was worse than that: the same row was actually in the table
twice, and deleting those twins helped. One of my customers suffered the
same problem in 2 of his tables, and even the OID was the same!

How can that be possible?

Again, deleting those rows helped.
On Tue, 8 Feb 2005, Tom Lane wrote:

 Date: Tue, 08 Feb 2005 10:40:33 -0500
 From: Tom Lane [EMAIL PROTECTED]
 To: ohp@pyrenet.fr
 Cc: pgsql-hackers list pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] problems after crash

 ohp@pyrenet.fr writes:
  I had a crash last night, and since then, while vacuuming databases
  (either full or lazy), I get this error:

  duplicate key violates unique constraint pg_statistic_relid_att_index

 Reindexing pg_statistic would probably make that go away ...

   regards, tom lane


-- 
Olivier PRENANT Tel: +33-5-61-50-97-00 (Work)
6, Chemin d'Harraud Turrou   +33-5-61-50-97-01 (Fax)
31190 AUTERIVE   +33-6-07-63-80-64 (GSM)
FRANCE  Email: ohp@pyrenet.fr
--
Make your life a dream, make your dream a reality. (St Exupery)



Re: [HACKERS] external indices ...

2005-02-08 Thread pgsql
I haven't worked with GiST, although I have been curious from time to
time. Just never had the time to sit, read, and try out the GiST system.

On my text search system (FTSS) I use functions that return sets of data.
It may be easier to implement that than GiST.

Basically, I create a unique ID index on a table of data. Create my
external index. When a search is done, I return a set of unique IDs and
ranks.



 I believe that this is what Oleg et al tap into with the tsearch2 stuff,
 no?  I have someone asking me about it, and want to make sure that I'm
 telling him the right answer ... is this what GiST is?  And, if so, what
 is a *good* doc for me to point them at to get up to speed with it?  Is
 what we have in the docs the best place, or is there something someone
 else has written that gets into it even more?

 Thanks ...

 
 Marc G. Fournier   Hub.Org Networking Services
 (http://www.hub.org)
 Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ:
 7615664






[HACKERS] Fast reference without an index?

2005-02-08 Thread pgsql
A question to the hackers:

Is there a way, and if I'm being stupid please tell me, to use something
like a row ID to reference a row in a PostgreSQL database? Allowing the
database to find a specific row without using an index?

I mean, an index has to return something like a row ID for the database to
find the row, right? Granted it has to find the version with the right
transaction ID, but still, you get the idea. In the old days of ISAM
stuff, it would just be the offset into the file.

In my text search system, rather than create an additional indexed column
for row ID, wouldn't it  be more efficient if I could use PostgreSQL's
data for this? This would greatly improve the performance of FTSS as it
would eliminate the extra index operation per row returned.





Re: [HACKERS] correlation in pg_stats

2005-02-08 Thread Mark Kirkwood
[EMAIL PROTECTED] wrote:
actually [EMAIL PROTECTED], is Mark Woodward. Pleased to meet you.
:)
(I hate using my name on lists like this because of spammers)
Not to be confused with me :-)


Re: [HACKERS] external indices ...

2005-02-08 Thread Oleg Bartunov
Marc,
On Tue, 8 Feb 2005, Marc G. Fournier wrote:
I believe that this is what Oleg et al tap into with the tsearch2 stuff, no? 
I have someone asking me about it, and want to make sure that I'm telling him 
the right answer ... is this what GiST is?  And, if so, what is a *good* doc 
for me to point them at to get up to speed with it?  Is what we have in the 
docs the best place, or is there something someone else has written that gets 
into it even more?
I'm a little bit confused, because I couldn't catch your meaning :)
Thanks ...

Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664
---(end of broadcast)---
TIP 6: Have you searched our list archives?
 http://archives.postgresql.org
Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: [HACKERS] external indices ...

2005-02-08 Thread Marc G. Fournier
On Tue, 8 Feb 2005, Oleg Bartunov wrote:
Marc,
On Tue, 8 Feb 2005, Marc G. Fournier wrote:
I believe that this is what Oleg et al tap into with the tsearch2 stuff, 
no? I have someone asking me about it, and want to make sure that I'm 
telling him the right answer ... is this what GiST is?  And, if so, what is 
a *good* doc for me to point them at to get up to speed with it?  Is what 
we have in the docs the best place, or is there something someone else has 
written that gets into it even more?
I'm a little bit confused, because I couldn't catch your meaning :)
tsearch2 is an external indexing method, no?  External in that it isn't
integrated as part of PostgreSQL, but is added on ... with the 'add on'
being through the GiST framework ... right ?
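
For reference, a sketch of how tsearch2 typically plugs in: the module
adds a type and operator classes, and the index itself goes through the
regular GiST access method (table and column names illustrative):

  -- after installing contrib/tsearch2:
  ALTER TABLE docs ADD COLUMN idxfti tsvector;
  UPDATE docs SET idxfti = to_tsvector('default', body);
  CREATE INDEX docs_fti_idx ON docs USING gist (idxfti);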


Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664


Re: [HACKERS] Fast reference without an index?

2005-02-08 Thread Tom Lane
[EMAIL PROTECTED] writes:
 Is there a way, and if I'm being stupid please tell me, to use something
 like a row ID to reference a row in a PostgreSQL database? Allowing the
 database to find a specific row without using an index?

ctid ... which changes on every update ...
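
For illustration, the heap location is exposed as a system column, and a
row can be re-fetched by it without any index (values illustrative):

  SELECT ctid, name FROM mytable WHERE name = 'foo';
  -- suppose that returns (0,3); this fetches the same row directly:
  SELECT * FROM mytable WHERE ctid = '(0,3)';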

regards, tom lane



Re: [HACKERS] external indices ...

2005-02-08 Thread Tom Lane
Marc G. Fournier [EMAIL PROTECTED] writes:
 I believe that this is what Oleg et al tap into with the tsearch2 stuff, 
 no?  I have someone asking me about it, and want to make sure that I'm 
 telling him the right answer ... is this what GiST is?

Better ask what he means by external exactly.  I think I've heard some
people use the phrase external index to mean any index that's not
actually part of the table itself; which makes all our index types
external.  (As a counterexample, imagine a table that is itself
organized as a btree.)

regards, tom lane



Re: [HACKERS] external indices ...

2005-02-08 Thread Marc G. Fournier
Summary:
Currently used with Oracle databases through their Extensible
Indexing Cartridge.
What we are interested in is finding out whether or not we can use our
technology directly with PostgreSQL.  That would involve, from an overly
simplistic view, allowing Postgres to use external indexes.
Does that help any?
On Tue, 8 Feb 2005, Tom Lane wrote:
Marc G. Fournier [EMAIL PROTECTED] writes:
I believe that this is what Oleg et al tap into with the tsearch2 stuff,
no?  I have someone asking me about it, and want to make sure that I'm
telling him the right answer ... is this what GiST is?
Better ask what he means by external exactly.  I think I've heard some
people use the phrase external index to mean any index that's not
actually part of the table itself; which makes all our index types
external.  (As a counterexample, imagine a table that is itself
organized as a btree.)
regards, tom lane

Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664


Re: [HACKERS] external indices ...

2005-02-08 Thread Oleg Bartunov
On Tue, 8 Feb 2005, Marc G. Fournier wrote:
On Tue, 8 Feb 2005, Oleg Bartunov wrote:
Marc,
On Tue, 8 Feb 2005, Marc G. Fournier wrote:
I believe that this is what Oleg et al tap into with the tsearch2 stuff, 
no? I have someone asking me about it, and want to make sure that I'm 
telling him the right answer ... is this what GiST is?  And, if so, what 
is a *good* doc for me to point them at to get up to speed with it?  Is 
what we have in the docs the best place, or is there something someone 
else has written that gets into it even more?
I'm a little bit confused, because I couldn't catch your meaning :)
tsearch2 is an external indexing method, no?  external in that it isn't 
intergrated as part of PostgreSQL, but is added on ... with the 'add on' 
being through the GiST framework ... right ?
I think that guy asked whether tsearch2 is an external program
running as a separate process.

Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664
Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: [HACKERS] Fast reference without an index?

2005-02-08 Thread pgsql
 [EMAIL PROTECTED] writes:
 Is there a way, and if I'm being stupid please tell me, to use something
 like a row ID to reference a row in a PostgreSQL database? Allowing the
 database to find a specific row without using an index?

 ctid ... which changes on every update ...

Well, how does an index do it? Say this:

select * from mytable where name = 'foo';

The index must return something. Say I have a row that is constantly
being updated, or has just had an original item inserted: an item which
is valid within my transaction, and an item which has just been inserted
but has a transaction id greater than mine.

Doesn't the index have some base number which points to the first valid
occurrence of the row, with the valid row then found based on the
transaction ID, or has PG changed? Is that the ctid?





Re: [HACKERS] correlation in pg_stats

2005-02-08 Thread Simon Riggs
[EMAIL PROTECTED] wrote:
  Short summary:
 
* It looks to me like the planner vastly overestimates
  the # of pages read by index scan in quite a few of my
  tables even though stats collected by ANALYZE are correct.
 
* The problem happens any time you have multiple columns
  that have a number of repeated values in them, and
  you CLUSTER the table by a sort using both columns
  (like city,state,zip,phone# or firstname,lastname).
 
* I think this is the problem that Mark Kirkwood is seeing
  in his threads Query optimizer 8.0.1 and One Big trend
  vs multiple smaller trends in hackers.

 Actually, [EMAIL PROTECTED] is Mark Woodward. Pleased to meet you.
 :)

 (I hate using my name on lists like this because of spammers)

 
* A test script demonstrating the issue also follows.
 
   * I think keeping one more stat per attribute in
     pg_statistic could describe this behavior.
 
 
  Longer:
 
 
If I understand the optimizer correctly,  correlation is used
to both guess how much random disk access will be required in
a query; as well as estimate how many pages will be read.
 
Unfortunately, many tables in my larger databases have
columns with values that are tightly packed on a few pages;
even though there is no total-ordering across the whole table.
Stephan Szabo described this as a clumping effect:
 
 http://archives.postgresql.org/pgsql-performance/2003-01/msg00286.php

 Yes.

 I think we are describing the exact same issue.


I believe that is a very common case.

Many natural value distributions follow the Zipf distribution, which is
commonly known as the 80-20 curve.

Best Regards, Simon Riggs




Re: [HACKERS] Fast reference without an index?

2005-02-08 Thread Alvaro Herrera
On Tue, Feb 08, 2005 at 04:29:34PM -0500, [EMAIL PROTECTED] wrote:
  [EMAIL PROTECTED] writes:
  Is there a way, and if I'm being stupid please tell me, to use something
  like a row ID to reference a row in a PostgreSQL database? Allowing the
  database to find a specific row without using an index?
 
  ctid ... which changes on every update ...
 
 Well, how does an index do it?

The index _is_ updated on every update.  And on vacuum too, of course;
so you'd also need to hack VACUUM to make sure your reference is
updated.

Maybe you could store your data in a GiST index ...


-- 
Alvaro Herrera ([EMAIL PROTECTED])
[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere.(Lamar Owen)



[HACKERS] Interpretation of TRUSTED

2005-02-08 Thread Thomas Hallgren
Hi,
A TRUSTED language specifies that ordinary users can use the language. 
It also implies that access to the file system should be prevented. In 
essence, ordinary users can never access the filesystem.

Is it OK to design a trusted language so that it allows access to the 
filesystem provided that the session user is a super-user?

Regards,
Thomas Hallgren



Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread Josh Berkus
Mark, Stephen, etc:

  I can see your point, however I wonder if the issue is that the default
  stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and
  maybe we should consider making a higher value (say '100') the default.

 Personally, I think that'd be reasonable.

I don't really think that we want to increase sampling for *all* columns.
Each column that doesn't get queried but has increased stats adds to the
size of the pg_stats table and the amount of vacuuming it needs,
decreasing performance of stats lookups, especially when very large text
and BYTEA columns are involved.

Neil and I talked about setting a separate GUC, default_index_stats_target,
for 8.0 that would dramatically increase the stats sampling for indexed
columns only (say, to 250). Unfortunately, I was unable to come up with a
good test case for setting this value.

Perhaps the Tiger database would allow us to set up a good test case.  Does 
anyone have a well-indexed PG Tiger I could play with?

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] Interpretation of TRUSTED

2005-02-08 Thread David Fetter
On Tue, Feb 08, 2005 at 11:12:07PM +0100, Thomas Hallgren wrote:
 Hi,
 A TRUSTED language specifies that ordinary users can use the
 language.  It also implies that access to the file system should be
 prevented. In essence, ordinary users can never access the
 filesystem.
 
 Is it OK to design a trusted language so that it allows access to
 the filesystem provided that the session user is a super-user?

I believe that that is what UNTRUSTED languages are for.  Only the
super-user may create functions in them, although there is no inherent
restriction on other users' calling those functions.

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 510 893 6100   mobile: +1 415 235 3778

Remember to vote!



Re: [HACKERS] Interpretation of TRUSTED

2005-02-08 Thread elein
An untrusted language is so because of what it can do and cannot do.
An untrusted language cannot access the filesystem, for example.

WHO writes and runs untrusted and trusted procedures is a different
thing.  It is the security mechanism meant to restrict writing
and running the functions written in a trusted/untrusted language.

--elein

On Tue, Feb 08, 2005 at 11:12:07PM +0100, Thomas Hallgren wrote:
 Hi,
 A TRUSTED language specifies that ordinary users can use the language. 
 It also implies that access to the file system should be prevented. In 
 essence, ordinary users can never access the filesystem.
 
 Is it OK to design a trusted language so that it allows access to the 
 filesystem provided that the session user is a super-user?
 
 Regards,
 Thomas Hallgren
 
 
 
 



Re: [HACKERS] Interpretation of TRUSTED

2005-02-08 Thread Thomas Hallgren
elein wrote:
An untrusted language is so because of what it can do and cannot do.
An untrusted language cannot access the filesystem, for example.
WHO writes and runs untrusted and trusted procedures is a different
thing.  It is the security mechanism meant to restrict writing
and running the functions written in a trusted/untrusted language.
 

Ok, that makes a lot of sense. I read the documentation (3.4.1,
Installing Procedural Languages). It is a bit vague on what a trusted
language should be prevented from doing; do not allow access to
database server internals or the file system is what it says. How about
network access? Is there a more canonical definition somewhere that I
can read?

Regards,
Thomas Hallgren



Re: [HACKERS] Interpretation of TRUSTED

2005-02-08 Thread Tom Lane
David Fetter [EMAIL PROTECTED] writes:
 On Tue, Feb 08, 2005 at 11:12:07PM +0100, Thomas Hallgren wrote:
 Is it OK to design a trusted language so that it allows access to
 the filesystem provided that the session user is a super-user?

 I believe that that is what UNTRUSTED languages are for.  Only the
 super-user may create functions in them, although there is no inherent
 restriction on other users' calling those functions.

AFAICS, what Thomas proposes would be exactly equivalent to root running
scripts owned by non-root users --- in this case, if session user is
root then functions written by other people would be allowed to do
things they normally shouldn't be able to do.  It strikes me as a great
loophole for Trojan-horse functions.  Not that a sane superuser would
run functions controlled by other people in the first place.

regards, tom lane



Re: [HACKERS] Interpretation of TRUSTED

2005-02-08 Thread Andrew Dunstan

Tom Lane wrote:
 On Tue, Feb 08, 2005 at 11:12:07PM +0100, Thomas Hallgren wrote:
  Is it OK to design a trusted language so that it allows access to
  the filesystem provided that the session user is a super-user?

 AFAICS, what Thomas proposes would be exactly equivalent to root running
 scripts owned by non-root users --- in this case, if session user is
 root then functions written by other people would be allowed to do
 things they normally shouldn't be able to do.  It strikes me as a great
 loophole for Trojan-horse functions.  Not that a sane superuser would
 run functions controlled by other people in the first place.

Agreed.
It's also not how other PLs work. I don't think this definition should 
be up to the individual language. So my answer to his question above 
would be No.

cheers
andrew


Re: [HACKERS] Interpretation of TRUSTED

2005-02-08 Thread Jeff Davis
On Tue, 2005-02-08 at 14:51 -0800, elein wrote:
 An untrusted language is so classified because of what it can and cannot do.
 An untrusted language cannot access the filesystem, for example.
 

There is a bit of confusion in the terminology. Trusted means that the
language has been deemed safe, i.e. it cannot do things like access
the filesystem. An untrusted language is not safe, i.e. it can do things
that you would not want an ordinary user to do.

I can see how someone would be easily confused about the terminology,
and I have to think about it each time. Trusted vs. untrusted refers to
the language, not the user of the language.
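
As a minimal illustration of the distinction (my sketch, assuming the
plperl/plperlu pair is installed; the function name and body are
hypothetical): the trusted variant refuses filesystem access for every
user, while only a superuser may create functions in the untrusted one.

    -- Requires a superuser, because plperlu is untrusted:
    CREATE FUNCTION read_passwd() RETURNS text AS $$
        open(my $fh, '<', '/etc/passwd') or die $!;
        local $/; my $t = <$fh>; close($fh); return $t;
    $$ LANGUAGE plperlu;

    -- The same body declared LANGUAGE plperl (trusted) is rejected when
    -- the function runs, with something like:
    -- ERROR: 'open' trapped by operation mask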

Regards,
Jeff Davis




Re: [HACKERS] Connect By for 8.0

2005-02-08 Thread Bruce Momjian
Hans-Jürgen Schönig wrote:
 Neil Conway wrote:
  Robert Treat wrote:
  
  Actually I believe people want both syntaxes, as the former is used by 
  Oracle and the latter by DB2 (IIRC)
  
  
  I think the past consensus has been to adopt the SQL standard syntax. Is 
  there any reason to also support the Oracle syntax other than for 
  compatibility? (And if that is it, I think it's a pretty flimsy reason.)
  
  -Neil
  
 
 
 Neil,
 
 Compatibility is the ONLY reason and it is related to money (the only 
 language people understand).
 We have done a lot of migration here and I can tell you that support for 
 Oracle style CONNECT BY would make more people happy than the SQL style 
 syntax.
 The reason for that is very simple: Code can be migrated without any 
 changes (= without introducing potential bugs).
 I know that SQL standards are tremendously important but if there is an 
 easy way to support Oracle syntax as well this is definitely the 
 preferred way to go.
 I think it is important not to think in dogmas (in this case this means 
 SQL syntax is always better) - there should be a reasonable compromise 
 between compatibility and standard.
 My compromise would be: Support both syntaxes if possible.

I can see your point, but imagine if we had Oracle compatibility for
lots of cases --- our system would have either non-standard or duplicate
ways of doing things, and that would be quite confusing.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] How can I use large object on PostgreSQL Linux

2005-02-08 Thread Michael Fuhr
On Tue, Feb 08, 2005 at 02:55:01PM +0700, Premsun Choltanwanich wrote:

 From my first information, I got 3 errors: /lo, /dbsize and /admin. That
 means now only /admin is the one I cannot find anywhere.
 
 I am still trying to find it. If anybody knows more about /admin
 (on Windows it is admin.dll), please tell me.

I wonder if admin is pgadmin-tools.

http://cvs.pgadmin.org/cgi-bin/viewcvs.cgi/pgadmin-tools/

What does the query below show on the Windows box?  If it shows the same
functions listed in pgadmin-tools/support/admin.sql.in then you might
need to install pgadmin-tools to make the systems identical (although
you might not actually need pgadmin-tools for its functionality).

SELECT proname, oidvectortypes(proargtypes) AS argtypes
FROM pg_proc
WHERE probin LIKE '%admin%';

-- 
Michael Fuhr
http://www.fuhr.org/~mfuhr/



Re: [HACKERS] How can I use large object on PostgreSQL Linux

2005-02-08 Thread Premsun Choltanwanich


I'm not sure, but it may well be possible, since I use PostgreSQL 8 RC1 with pgAdmin.

 Michael Fuhr [EMAIL PROTECTED] 09-Feb-05 11:22 AM 


Re: [HACKERS] Connect By for 8.0

2005-02-08 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 Hans-Jürgen Schönig wrote:
 My compromise would be: Support both syntaxes if possible.

 I can see your point, but imagine if we had Oracle compatibility for
 lots of cases --- our system would have either non-standard or duplicate
 ways of doing things, and that would be quite confusing.

Oracle has the resources to outtalk, outshout, and outlast everyone else
on the SQL standards committee.  Despite that, their syntax was not
adopted as the standard.  This should give you some clue about how badly
their syntax sucks.  Now why exactly should we adopt it?

regards, tom lane



Re: [HACKERS] Connect By for 8.0

2005-02-08 Thread Joshua D. Drake
Tom Lane wrote:
 Bruce Momjian pgman@candle.pha.pa.us writes:
  Hans-Jürgen Schönig wrote:
   My compromise would be: Support both syntaxes if possible.

Great... code away. I would suggest an
oracle_compat = true/false in postgresql.conf.
Yes, I am kidding. The differences between Oracle and PostgreSQL
are, for most things, not that great. If the small syntactical
differences are the only thing holding them back from using PostgreSQL,
they were not that serious in the first place.

Sincerely,
Joshua D. Drake

 I can see your point, but imagine if we had Oracle compatibility for
 lots of cases --- our system would have either non-standard or duplicate
 ways of doing things, and that would be quite confusing.

 Oracle has the resources to outtalk, outshout, and outlast everyone else
 on the SQL standards committee.  Despite that, their syntax was not
 adopted as the standard.  This should give you some clue about how badly
 their syntax sucks.  Now why exactly should we adopt it?

 regards, tom lane


--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL




Re: [HACKERS] libpq API incompatibility between 7.4 and 8.0

2005-02-08 Thread Martin Pitt
Hi!

Tom Lane [2005-02-04 10:27 -0500]:
 This problem isn't worth spending more development time on than it takes
 to change SO_MAJOR_VERSION (we have lots of higher-priority issues).

I just did that:

--- postgresql-8.0.1-old/src/interfaces/libpq/Makefile  2005-01-26 20:24:19.0 +0100
+++ postgresql-8.0.1/src/interfaces/libpq/Makefile  2005-02-07 20:09:38.719686432 +0100
@@ -16,8 +16,8 @@
 
 # shared library parameters
 NAME= pq
-SO_MAJOR_VERSION= 3
-SO_MINOR_VERSION= 2
+SO_MAJOR_VERSION= 4
+SO_MINOR_VERSION= 0
 DLTYPE= library
 
 override CPPFLAGS :=  -DFRONTEND -I$(srcdir) $(CPPFLAGS) -I$(top_builddir)/src/port

Is that right? Or will you use SO_MINOR_VERSION=1?

However, I just stumbled across another problem: libpq3 and the new
libpq4 use the same translation domain libpq, thus they cannot be
installed in parallel. Can you please change the domain to libpq4 as
well? This should generally be done anyway, but at least we can live
with one breakage of this rule (for libpq3), so that we don't need to
mess with the old libpq3 package.

Thanks,

Martin

-- 
Martin Pitt   http://www.piware.de
Ubuntu Developerhttp://www.ubuntulinux.org
Debian GNU/Linux Developer   http://www.debian.org




Re: [HACKERS] external indices ...

2005-02-08 Thread Tom Lane
Marc G. Fournier [EMAIL PROTECTED] writes:
 Currently used with Oracle databases thru their Extensible
 Indexing Cartridge.
 What we are interested in is finding out whether or not we can use our
 technology directly with PostgreSQL.  That would involve, from an overly
 simplistic view, allowing Postgres to use external indexes.

 Does that help any?

Well, I spent a little time googling on this, and dug through Oracle
documents that seem to have been written by marketroids who think
"componentize" is a verb :-( ... but by the time my brain went numb,
I hadn't actually seen anything that hasn't been in Postgres for years.

What you need to make clear to them is that their existing code base
isn't likely to be worth a dime --- if they don't mind recoding, the
functionality is probably all there, but there will not be any
compatibility of details.  There are no standards covering this stuff.

regards, tom lane



Re: [HACKERS] Thinking about breaking up the BufMgrLock

2005-02-08 Thread Kenneth Marshall
On Sun, Feb 06, 2005 at 07:30:37PM -0500, Tom Lane wrote:
 
 ReadBuffer needs to do a lookup to map the page ID to a buffer ID,
 which in principle requires only a shared lock on the page-to-buffer
 mapping (embodied in the buf_table hash table).  Assuming success, it
 also needs to mark the buffer pinned and update the LRU-list position
 of the buffer.  Marking pinned is certainly a buffer-local change,
 so we could imagine splitting up the BufMgrLock like this:
 
 1. A global LWLock for the page-to-buffer mapping, call it say
 BufMappingLock.  Share lock on this is sufficient to allow reading the
 hash table; must get exclusive lock when reassigning any buffer's page
 identity.
 
 2. A global LWLock for the LRU strategy information, say BufLRULock
 or BufStrategyLock.
 
 3. Per-buffer LWLocks (or maybe just spinlocks) protecting the state of
 each buffer header; you need this lock to examine or change a buffer
 header.
 
 ReleaseBuffer has no locking problems in this formulation: it just grabs
 the per-buffer lock, decrements its refcount, releases the lock.
 
For the per-buffer lock, a latch would provide a lightweight method of
updating the contents of the buffer without hampering read-only access.
A latch comprises a latch bit and a sequence number that can be set in
an atomic action. The flow for the two cases is simple:

Write:
 1. Get latch.
 2. Update the buffer.
 3. Increment the sequence number.
 4. Release the latch.

Read:
 1. Read the sequence number.
 2. Read the buffer.
 3. Check the latch. If latched, go to 1.
 4. If the sequence number has changed, go to 1.

By using this process, readers will only see a consistent state of
the buffer. Also, since the read does not entail a write operation,
it will not cause a cache line update and contribute to a cache
update storm. The get latch operation can be implemented using
an atomic operation such as TAS (test-and-set) or CAS (compare-and-swap).
This would provide readers with extremely lightweight access to the
buffer - no cache line update hit. If you need sequenced access
to the buffer, then you would need to use LWLocks, but in many cases,
such as item 3 in Tom's list, a latch would work well.
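
As a concrete sketch of such a latch (a seqlock-style construction; I
use C11 atomics purely for brevity, and all names are hypothetical
rather than PostgreSQL internals - the low bit of the counter is the
latch bit, the remaining bits the sequence number):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_uint seq;            /* odd while a writer holds the latch */
    } buf_latch_t;

    /* Write flow: 1. get latch (counter becomes odd), 2. caller updates
     * the buffer, 3./4. bump sequence and release (counter even again). */
    static void latch_write_begin(buf_latch_t *l) {
        unsigned s;
        for (;;) {
            s = atomic_load_explicit(&l->seq, memory_order_relaxed);
            if ((s & 1) == 0 &&
                atomic_compare_exchange_weak_explicit(&l->seq, &s, s + 1,
                        memory_order_acquire, memory_order_relaxed))
                return;             /* latch acquired atomically (CAS) */
        }
    }

    static void latch_write_end(buf_latch_t *l) {
        atomic_fetch_add_explicit(&l->seq, 1, memory_order_release);
    }

    /* Read flow: 1. read sequence number (spinning past a held latch),
     * 2. caller reads the buffer, 3./4. retry if the number changed. */
    static unsigned latch_read_begin(buf_latch_t *l) {
        unsigned s;
        do {
            s = atomic_load_explicit(&l->seq, memory_order_acquire);
        } while (s & 1);            /* latched: writer in progress */
        return s;
    }

    static bool latch_read_retry(buf_latch_t *l, unsigned s) {
        atomic_thread_fence(memory_order_acquire);
        return atomic_load_explicit(&l->seq, memory_order_relaxed) != s;
    }

Since a reader never writes the latch word, it does not bounce the
cache line between processors, which is the whole point.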

 ReadBuffer looks like:
 
   * Acquire share lock on BufMappingLock.
   * Search hash table for desired ID.  (we assume success)
   * acquire per-buffer lock.
   * increment buffer refcount.
   * release per-buffer lock.
   * release share lock on BufMappingLock.
   * update the LRU state.
 
 (We have to bump the pin count on the target buffer before releasing the
 BufMappingLock, otherwise someone could reassign the buffer as soon as
 we release BufMappingLock.)
 
 This would be pretty good from a locking point of view, except that
 update the LRU state seems to require taking an exclusive lock on a
 global data structure, which puts us about back where we were.
 Contention for a global BufLRULock might be a bit better than for the
 existing overall BufMgrLock, but it'll still cripple the performance
 of ReadBuffer.
 
 Perhaps someone knows of a neat data structure that can maintain an LRU
 list with only local updates?  I don't though.
 
The clock algorithm is pretty close to this and provides an approximation
to LRU that eliminates the need to move buffers to the MRU position by
using a reference bit.
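
A sketch of the sweep (hypothetical names; real buffer headers carry
more state, the hand would be protected by a lock, and the sketch
assumes at least one unpinned buffer exists):

    #define NBUFFERS 1024

    typedef struct {
        int page_id;                /* which page the buffer holds */
        int refbit;                 /* set on access, cleared by the sweep */
        int pincount;               /* pinned buffers are not evictable */
    } ClockBuf;

    static ClockBuf buffers[NBUFFERS];
    static int clock_hand = 0;

    /* ReadBuffer marks usage with a buffer-local write: no global
     * LRU list to update, hence no global lock to take. */
    static void mark_used(int buf_id) {
        buffers[buf_id].refbit = 1;
    }

    /* Victim search: advance the hand, giving each recently used
     * buffer a "second chance" by clearing its reference bit. */
    static int clock_sweep(void) {
        for (;;) {
            ClockBuf *b = &buffers[clock_hand];
            clock_hand = (clock_hand + 1) % NBUFFERS;
            if (b->pincount > 0)
                continue;           /* pinned: skip */
            if (b->refbit) {
                b->refbit = 0;      /* used recently: spare it this pass */
                continue;
            }
            return (int)(b - buffers);  /* unreferenced victim found */
        }
    }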

 This would convert the existing strict LRU behavior into an
 approximate LRU.  I'm worried that the change might be penny-wise and
 pound-foolish: if a poorer cache management algorithm causes us to have
 to do more I/O, it's probably a net loss to save some lock contention.
 But without a smart idea about data structures I don't see how to do
 better.
 

One alternative to an approximate LRU, such as the clock algorithm, would
be to have multiple buffer pools, as we discussed in the previous thread.
The contention would be reduced by 1/N, where N is the number of pools.
Of course, buffers should be allocated in a fashion that maximizes
locality and minimizes the effect of scan-cache pollution.
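
A minimal sketch of the partitioning (hypothetical names; it assumes
page identity hashes evenly across pools, and a plain mutex stands in
for the per-pool lock, initialized with pthread_mutex_init() at startup):

    #include <pthread.h>

    #define NPOOLS 16

    typedef struct {
        pthread_mutex_t lock;       /* per-pool lock replaces BufMgrLock */
        /* ... per-pool page-to-buffer hash table and LRU state ... */
    } BufferPool;

    static BufferPool pools[NPOOLS];

    /* Route each page to a pool by hashing its identity; contention
     * on any one lock drops by roughly a factor of NPOOLS. */
    static BufferPool *pool_for_page(unsigned rel_id, unsigned block) {
        unsigned h = rel_id * 2654435761u + block;  /* multiplicative hash */
        return &pools[h % NPOOLS];
    }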

More food for thought.

Ken



Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread Ron Mayer
[EMAIL PROTECTED] wrote:
In this case, the behavior observed could be changed by altering the
sample size for a table. I submit that an arbitrary fixed sample size is
not a good base for the analyzer, but that the sample size should be based
on the size of the table or some calculation of its deviation.
  Mark,
Do you have any evidence that the Sample Size had anything to do
with the performance problem you're seeing?
I also do a lot with the complete Census/TIGER database.
Every problem I have with the optimizer comes down to the
fact that the data is loaded (and ordered on disk) by
State/County FIPS codes, and then queried by zip-code
or by city name.  Like this:
Alabama 36101 [hundreds of pages with zip's in 36***]
Alaska  99686 [hundreds of pages with zip's in 9]
Arizona 85701 [hundreds of pages with zip's in 855**]
Note that the zip codes are *NOT* sequential.
The correlation statistic sees that the Zip codes are not
sequential; so it makes the *HORRIBLE* assumption that they
are scattered randomly across the disk.
In reality, even though there's no total ordering of the
zip codes, any given zip code only exists on a couple of
disk pages, so index scans would be the right choice.
But the single correlation parameter is not sufficient
to let the optimizer know this.
No matter how large a sample size you choose, ANALYZE
will correctly see that Zip codes and State FIPS codes
are non-correlated, and the optimizer will overestimate
the # of pages an index scan will need.
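
One workaround of this general kind (an assumption on my part, not
necessarily the one from the earlier posting, and with hypothetical
table/index names) is to physically reorder the table so the
correlation statistic matches the query pattern:

    -- Rewrite the table in zip-code order, then refresh statistics:
    CREATE INDEX tiger_zip_idx ON tiger_places (zip);
    CLUSTER tiger_zip_idx ON tiger_places;
    ANALYZE tiger_places;
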
  Ron
PS: I pointed out workarounds in my earlier posting
in this thread.  Yes, I'm using the same TIGER data
you are.



Re: [HACKERS] libpq API incompatibility between 7.4 and 8.0

2005-02-08 Thread Martin Pitt
Hi!

Tom Lane [2005-02-03 11:12 -0500]:
 Martin Pitt [EMAIL PROTECTED] writes:
  I am thinking the easiest solution will be to re-add get_progname() to
  8.0.X and bump the major for 8.1.
 
  Seconded. Then we don't need another library version, and it is still
  not necessary to drag this get_progname accident forever.
 
 We're going to stop dragging the accident now, not a year from now.
 
 Putting get_progname back into libpq would require re-adding all of
 path.c, which creates many more issues than just adjusting
 SO_MAJOR_VERSION would do.

The consensus seems to be to just bump the SONAME. The next one will
definitely be 4 then? I would like to fix this in my packages
before I announce them for public testing, but this issue is the only
blocker.

Will there be a new bugfix release soon?

Thanks for your work and have a nice day!

Martin

-- 
Martin Pitt   http://www.piware.de
Ubuntu Developerhttp://www.ubuntulinux.org
Debian GNU/Linux Developer   http://www.debian.org




Re: [HACKERS] Connect By for 8.0

2005-02-08 Thread Christopher Browne
A long time ago, in a galaxy far, far away, [EMAIL PROTECTED] (Hans-Jürgen 
Schönig) wrote:
 I think it is important not to think in dogmas (in this case this
 means SQL syntax is always better) - there should be a reasonable
 compromise between compatibility and standard.
 My compromise would be: Support both syntaxes if possible.

As long as they can co-exist, that seems plausible.

I'd personally rather see the standard form, as it's got a more
diverse set of uses.  If introducing the CONNECT BY form did anything
to prevent implementing it the RIGHT way, I'm pretty sure I'm not the
only one that would be most displeased.
-- 
output = (cbbrowne @ gmail.com)
http://linuxfinances.info/info/slony.html
If anyone ever  markets  a really  well-documented Unix that   doesn't
require  babysitting by a phalanx of  provincial Unix clones, there'll
be a  lot of unemployable,  twinky-braindamaged misfits out deservedly
pounding the pavements.



Re: [HACKERS] How can I use large object on PostgreSQL Linux

2005-02-08 Thread Dave Page



-Original Message-
From: [EMAIL PROTECTED] on behalf of Michael Fuhr
Sent: Wed 2/9/2005 4:22 AM
To: Premsun Choltanwanich
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] How can I use large object on PostgreSQL Linux
 
 I wonder if admin is pgadmin-tools.

Yes, it is.

Regards, Dave
