Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Simon Riggs
On Thu, 2006-01-05 at 00:33 -0500, Greg Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 
  The approach I suggested uses the existing technique for selecting
  random blocks, then either an exhaustive check on all of the rows in a
  block or the existing random row approach, depending upon available
  memory. We need to check all of the rows in a reasonable sample of
  blocks otherwise we might miss clusters of rows in large tables - which
  is the source of the problems identified.
  
  The other reason was to increase the sample size, which is a win in any
  form of statistics.
 
 Only if your sample is random and independent. The existing mechanism tries
 fairly hard to ensure that every record has an equal chance of being selected.
 If you read the entire block and not appropriate samples then you'll introduce
 systematic sampling errors. For example, if you read an entire block you'll be
 biasing towards smaller records.

Yes, I discussed that, following Brutlag & Richardson [2000]. The bottom
line is if there is no clustering, block sampling is random, which is
good; if there is clustering, then you spot it, which is good.
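As a sketch of the bias in question (standard one-stage cluster-sampling arithmetic; b, B and n_j are notation assumed here, not taken from the thread): if b of B blocks are chosen and every row in a chosen block is kept, then for a row in block j containing n_j rows,

    P(\text{row sampled}) = b/B, \qquad E[\text{rows contributed by block } j] = n_j \, b / B

so every row has the same inclusion probability, but blocks packed with many small rows contribute proportionally more rows to the sample; per-row statistics stay unbiased only if rows are re-weighted by 1/n_j, or if a second-stage per-block row draw is used, as the existing two-stage sampler does.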

 I think it would be useful to have a knob to increase the sample size
 separately from the knob for the amount of data retained in the statistics
 tables. Though I think you'll be disappointed and find you have to read an
 unreasonably large sample out of the table before you get more useful distinct
 estimates.

OK, I'll look at doing that.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Heads up: upcoming back-branch re-releases

2006-01-05 Thread Marko Kreen
+<listitem><para>Fix bug in <filename>/contrib/pgcrypto</> Openwall
+gen_salt processing (Marko Kreen)</para></listitem>

I guess it should be a bit more explicit:

<listitem><para>Fix bug in <filename>/contrib/pgcrypto</> gen_salt,
which caused it not to use all available salt space for md5 and
xdes algorithms (Marko Kreen, Solar Designer)</para>
<para>Salts for blowfish and standard des are unaffected</para></listitem>

This hopefully makes it clear who is affected and how important it is to
upgrade.  Also, the 'Openwall' is confusing; better to credit the fix author.

--
marko

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Rod Taylor
  Do you *really* want the median estimate in these case?  Are you certain 
  you 
  do not want something with the opposite behavior of Chaudhuri's estimate so 
  that for small sample sizes the bias is toward a high estimate of D? 
  (Converges on D from the right instead of the left.)
  
  Chaudhuri's ----------D---------- needed
  Estimate                          estimate
 
 Hmmm.  Yeah, I see what you mean.  True, the ideal approach would be to 
 determine for each query operation whether a too-low D or a too-high D 
 was more risky, and then use the more conservative number.   However, 
 that would complicate the query planner enough that I think Tom would 
 leave us. :-p

You could have some specific functions vote themselves out if their cost
is shaky. We know that the cost of a miscalculated nestloop is huge, so
after calculating the common case it might apply a multiplier for the
risk involved.

There have been lots of requests for a way to achieve more consistent
plans that have a determined worst case performance, even if they never
perform as well in the best case as another algorithm might. Perhaps
this could be a GUC.

PlanCost + PlanCost * Risk * RiskGUC

Risk is a number that indicates how badly things can go wrong.

RiskGUC is an integer multiplier. Someone who is risk averse (wants a
predictable execution time rather than the best possible time) would set
this value high. Others who want the best possible plan in most cases
even if it performs poorly once in a while will set the value very low,
possibly 0.
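A minimal sketch of the arithmetic being proposed (hypothetical names such as risk_factor and risk_guc; this is not planner code):

    #include <stdio.h>

    /* PlanCost + PlanCost * Risk * RiskGUC, as described above */
    static double
    risk_adjusted_cost(double plan_cost, double risk_factor, int risk_guc)
    {
        return plan_cost + plan_cost * risk_factor * risk_guc;
    }

    int
    main(void)
    {
        /* A nestloop whose row estimate could be badly off gets a high risk factor. */
        printf("%.0f\n", risk_adjusted_cost(1000.0, 5.0, 0));  /* risk-tolerant: 1000 */
        printf("%.0f\n", risk_adjusted_cost(1000.0, 5.0, 2));  /* risk-averse: 11000 */
        return 0;
    }

With the GUC at 0 the planner behaves as it does today; raising it makes risky plans look progressively more expensive.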

-- 


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PATCHES] TRUNCATE, VACUUM, ANALYZE privileges

2006-01-05 Thread Stephen Frost
Tom, et al,

Sorry for the longish email; if you're most interested in a change to
the ACL system to allow more privileges then skip to the bottom where
I worked up a change to give us more options without much of a
performance impact (I don't think anyway).  Personally, I'd prefer that
to overloading the bits we have (except perhaps for vacuum/analyze).

* Tom Lane ([EMAIL PROTECTED]) wrote:
 Stephen Frost [EMAIL PROTECTED] writes:
  GRANT ADD COLUMN, etc, aren't maintenance commands, they're DDL
  statements and as such should be the purview of the owner.  TRUNCATE,
  VACUUM and ANALYZE are DML commands and are commands a user of
  the table would use through the normal course of inserting, updating or
  deleting data in the table.
 
 I find this reasoning fairly dubious.  In particular, it's hard to argue
 that there is no DDL component to TRUNCATE when it effectively does an
 implicit disable-triggers operation.  Another thing setting TRUNCATE
 apart from run-of-the-mill DDL operations is that it inherently violates
 MVCC rules (by deleting rows that should still be visible to concurrent
 transactions).

Kind of makes one wish you could know what tables were going to be
touched for a given transaction at the start of the transaction.  That's
not really here nor there tho.  I could see limiting truncate privileges
to tables which don't have on-delete triggers, that doesn't help with
the MVCC problem though and ends up being why we can't just use 'delete'
privileges for it.

Could we base vacuum and analyze rights off of other privileges though?
vacuum allowed if you have 'delete' privileges, analyze if you have
'insert', 'update', or 'delete'?  And for 'truncate' permissions...
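A self-contained sketch of that mapping (hypothetical types and names, not the PostgreSQL ACL code):

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { PRIV_SELECT = 1, PRIV_INSERT = 2, PRIV_UPDATE = 4, PRIV_DELETE = 8 } Priv;

    /* VACUUM reclaims dead rows, so tie it to DELETE privilege. */
    static bool can_vacuum(unsigned granted)  { return (granted & PRIV_DELETE) != 0; }

    /* ANALYZE only gathers statistics, so any DML right would do. */
    static bool can_analyze(unsigned granted) { return (granted & (PRIV_INSERT | PRIV_UPDATE | PRIV_DELETE)) != 0; }

    int
    main(void)
    {
        unsigned granted = PRIV_INSERT | PRIV_UPDATE;   /* e.g. a loader role */
        printf("vacuum: %d, analyze: %d\n", can_vacuum(granted), can_analyze(granted));
        return 0;
    }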

 But my real problem with the approach is that I don't see where it
 stops.  If you're allowed to do ANALYZE, why not ALTER TABLE SET
 STATISTICS?  If you're allowed to do TRUNCATE, why not the
 recently-discussed ALTER TABLE SET RELIABILITY?  And how about CLUSTER?
 All of these could be pretty useful for some applications not too far
 removed from yours.  And there will be someone wanting a bit for
 DISABLE/ENABLE TRIGGER coming along right afterwards.  Must we implement
 a separate nonstandard privilege bit for every operation that someone
 comes up and wants a bit for, if they have the necessary cut-and-paste
 skill to submit a patch for it?

I think analyze is distinct from set statistics.  SET STATISTICS, if
used improperly (perhaps by mistake or misunderstanding), could cause 
serious havoc on the system as potentially very poor plans are chosen 
because the statistics aren't at all correct.  I don't see how running 
analyze a couple times would have a detrimental effect (except for the 
effort of running the analyze itself, but that's really not all that 
much and if they want to DoS the box in that way there are other 
things they can do).  SET STATISTICS is also hopefully something you're
not having to change or do every time you add/remove/update data.

SET RELIABILITY is a more interesting question since it could be used in
a situation similar to why truncate's popular (mass data
loading/reloading).  The same is true for disable/enable triggers.

 I'd feel happier about an approach that adds *one* privilege bit
 covering a range of operations that we agree to be useful.  This will
 avoid chewing a disproportionate amount of ACL storage space, and it
 will force us to confront the decision about which operations are out
 as well as which are in.

If we modify VACUUM/ANALYZE to be based off what I suggested above, then
we could add just one 'BYPASS' permission bit which would allow
TRUNCATE right now and then SET RELIABILITY, and DISABLE/ENABLE TRIGGER
later.  I'm not a particularly big fan of this though because, while I'd
like to be able to give TRUNCATE permissions I'm not a big fan of SET
RELIABILITY because it would affect PITR backups.  I suppose a user
would still have to intentionally do that though and so they'd have only
themselves to blame if the data they loaded wasn't part of the backup.
DISABLE/ENABLE TRIGGER has a similar issue though because that could
probably be used to bypass CHECK and REFERENCES constraints, which I
wouldn't want to allow.

A BYPASS bit would be better than having to give ownership rights
though.  We could also look at restructuring the ACL system to be able
to handle more permissions better.  As was suggested elsewhere, one
option would be to have a separate set of ACLs (at least internally)
such that a given set of commands/permissions would be associated 
with one set or the other set (and hopefully rarely, both).  These could
be divided up by either level or frequency (which I think would actually 
result in the same set).  Level would be row/table/database, frequency
would be estimation of usage in the real world.  This would separate
SELECT, INSERT, UPDATE, DELETE into one set of permissions, and then
REFERENCES, RULE, TRIGGER, TRUNCATE, SET RELIABILITY, DISABLE/ENABLE

Re: [HACKERS] [PATCHES] TRUNCATE, VACUUM, ANALYZE privileges

2006-01-05 Thread Michael Paesold

Stephen Frost wrote:

I'm not a particularly big fan of this though because, while I'd
like to be able to give TRUNCATE permissions I'm not a big fan of SET
RELIABILITY because it would affect PITR backups.


As far as I have understood the discussion... with WAL archiving turned on, 
the whole RELIABILITY changes would be no-ops, no?
Just as the CTAS optimization etc. only skip WAL if WAL archiving is turned 
off.


Best Regards,
Michael Paesold 




---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] [PATCHES] TRUNCATE, VACUUM, ANALYZE privileges

2006-01-05 Thread Stephen Frost
* Michael Paesold ([EMAIL PROTECTED]) wrote:
 Stephen Frost wrote:
 I'm not a particularly big fan of this though because, while I'd
 like to be able to give TRUNCATE permissions I'm not a big fan of SET
 RELIABILITY because it would affect PITR backups.
 
 As far as I have understood the discussion... with WAL archiving turned on, 
 the whole RELIABILITY changes would be no-ops, no?
 Just as the CTAS optimization etc. only skip WAL if WAL archiving is turned 
 off.

Oh, I thought the reliability bit would bypass WAL even with archiving
turned on (which could be fine in some cases, just not all cases :).  Of
course, all of this is still up in the air somewhat. :)  If it's a no-op
in that case then the 'bypass' bit might be alright to have control of SET
RELIABILITY.  I'd rather have the flexibility to have them be separately
grantable though.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

  Only if your sample is random and independent. The existing mechanism tries
  fairly hard to ensure that every record has an equal chance of being 
  selected.
  If you read the entire block and not appropriate samples then you'll 
  introduce
  systematic sampling errors. For example, if you read an entire block you'll 
  be
  biasing towards smaller records.
 
 Did you read any of the papers on block-based sampling?   These sorts of 
 issues
 are specifically addressed in the algorithms.

We *currently* use a block based sampling algorithm that addresses this issue
by taking care to select rows within the selected blocks in an unbiased way.
You were proposing reading *all* the records from the selected blocks, which
throws away that feature.

-- 
greg


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

  These statements are at odds with my admittedly basic understanding of
  statistics.  Isn't the power of a sample more related to the absolute size 
  of
  the sample than the sample as fraction of the population?  Why not just pick
  a smallish sample size, say about 3000, and apply it to all the tables, even
  the ones with just a single row (modify appropriately from block sampling).
 
 Nope, it's definitely proportional.   As a simple example, a sample of 500 
 rows
 in a table of 1000 rows should yield stats estimates with 90%+ accuracy.  But 
 a
 sample of 500 rows in a 600,000,000 row table is so small as to be nearly
 useless; it's quite possible to get all the same value in a random sample of 
 0.1% even on a column with a D/N of 0.001.   If you look at the papers cited,
 almost all researchers more recent than Chaudhuri use a proportional sample
 size.

To be clear, Josh is talking specifically about the estimate of how many
distinct values a query will see, not the more usual estimate of how many
records the query will see.

For estimating how many records a query like 

  SELECT * FROM tab WHERE x BETWEEN ? AND ?

will return, the fixed-size sample is on fairly solid ground. A sample of 600 gives (iirc)
+/- 2% 19 times out of 20. That's the same sample size most major opinion
polls use...

However, this same logic doesn't work for estimating distinct values. Since a
single occurrence of a distinct value is just as important as hundreds of
occurrences, and your chances of finding that single occurrence are
proportional to the percentage of the overall table you sample, maintaining a
given accuracy means maintaining a sample that is a fixed percentage of the
overall table.
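A sketch of why the two estimates behave differently (standard sampling formulas; n is the sample size, N the table size, p the true selectivity, k the number of occurrences of a value — notation assumed here, not from the thread):

    \text{selectivity error} \approx z \sqrt{p(1-p)/n}
        — shrinks with the absolute sample size n, independent of N

    P(\text{value with } k \text{ occurrences seen at least once}) \approx 1 - (1 - n/N)^k
        — depends on the sampling fraction n/N

So the margin of error for a selectivity estimate is governed by how many rows you look at, while the chance of ever seeing a rare distinct value is governed by what fraction of the table you look at.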

Worse, my recollection from the paper I mentioned earlier was that sampling
small percentages like 3-5% didn't get you an acceptable accuracy. Before you
got anything reliable you found you were sampling very large percentages of
the table. And note that if you have to sample anything over 10-20% you may as
well just read the whole table. Random access reads are that much slower.

-- 
greg


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Heads up: upcoming back-branch re-releases

2006-01-05 Thread Bruce Momjian

Thanks. Updated.

---

Marko Kreen wrote:
 +<listitem><para>Fix bug in <filename>/contrib/pgcrypto</> Openwall
 +gen_salt processing (Marko Kreen)</para></listitem>
 
 I guess it should be a bit more explicit:
 
 <listitem><para>Fix bug in <filename>/contrib/pgcrypto</> gen_salt,
 which caused it not to use all available salt space for md5 and
 xdes algorithms (Marko Kreen, Solar Designer)</para>
 <para>Salts for blowfish and standard des are unaffected</para></listitem>
 
 This hopefully makes it clear who is affected and how important it is to
 upgrade.  Also, the 'Openwall' is confusing; better to credit the fix author.
 
 --
 marko
 
 ---(end of broadcast)---
 TIP 4: Have you searched our list archives?
 
http://archives.postgresql.org
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] nicer error out in initdb?

2006-01-05 Thread Andrew Dunstan


If we find at the bottom of test_config_settings() that we have not been
able to run successfully at all (i.e. status != 0 at about line 1183 of
initdb.c) is there any point in continuing? Don't we know that we are
bound to fail at the template1 creation stage? Maybe we should just exit
nicely when we do know this.

cheers

andrew


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PATCHES] TRUNCATE, VACUUM, ANALYZE privileges

2006-01-05 Thread Rod Taylor
On Thu, 2006-01-05 at 09:41 -0500, Stephen Frost wrote:
 * Michael Paesold ([EMAIL PROTECTED]) wrote:
  Stephen Frost wrote:
  I'm not a particularly big fan of this though because, while I'd
  like to be able to give TRUNCATE permissions I'm not a big fan of SET
  RELIABILITY because it would affect PITR backups.
  
  As far as I have understood the discussion... with WAL archiving turned on, 
  the whole RELIABILITY changes would be no-ops, no?
  Just as the CTAS optimization etc. only skip WAL if WAL archiving is turned 
  off.
 
 Oh, I thought the reliability bit would bypass WAL even with archiving
 turned on (which could be fine in some cases, just not all cases :).  Of

It might be better if this was a setting in postgresql.conf requiring a
restart to change, off by default.

I don't like the thought of a table owner or even a super-user being
able to throw away data because they failed to investigate the full
impact of the backup strategy. I.e. Someone missed the memo that backups
were changing from pg_dumps to PITR for database environment H.

-- 


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] nicer error out in initdb?

2006-01-05 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 If we find at the bottom of test_config_settings() that we have not been
 able to run successfully at all (i.e. status != 0 at about line 1183 of
 initdb.c) is there any point in continuing? Don't we know that we are
 bound to fail at the template1 creation stage? Maybe we should just exit
 nicely when we do know this.

Nope: since we've been suppressing stderr, exit nicely here would
translate as refuse to provide any information about the problem.
The existing behavior is just fine.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] postmaster/postgres options assimilation plan

2006-01-05 Thread Bruce Momjian

FYI, with the options merged, we still have this TODO item:

* %Remove behavior of postmaster -o

---

Peter Eisentraut wrote:
 Here's the plan for assimilating the command-line options of the postmaster 
 and postgres options.  I reported earlier on a couple of conflict areas; here 
 is the full plan:
 
 * Remove: postmaster -a -b -m -M
 
 These options have done nothing forever.
 
 * postmaster options added to postgres: -h -i -k -l -n
 
 These options will not have any useful effects, but their behavior is 
 consistent if you do, say, SHOW listen_addresses.
 
 * postgres options added to postmaster: -e -E -f -O -P -t -W
 
 Using -E with postmaster does nothing, though.
 
 * Renamed options (because of conflicts):
 
 postgres -N is now postgres -j (mostly internal use)
 
 postgres -o is now postgres -r (mostly internal use)
 
 (postmaster -o is obsolete but still works for compatibility; postgres -o 
 will 
 get you an error.)
 
 postgres -p is now postgres -y (internal use only)
 
 postmaster -S now sets work_mem, like postgres -S does.  The (deprecated) 
 silent mode can be obtained using the long option --silent-mode=on.
 
 postmaster -s is now postmaster -T (expert/hacker use only)
 
 
 For the options -O, -P, and -W I have added three new GUC variables 
 allow_system_table_mods (PGC_SIGHUP), ignore_system_indexes (PGC_BACKEND), 
 connection_startup_delay (PGC_BACKEND); mostly to simplify the 
 postmaster-postgres communication.
 
 Except for a few odd exceptions, all command line arguments now map to 
 setting 
 a GUC variable.
 
 Comments?
 
 ---(end of broadcast)---
 TIP 2: Don't 'kill -9' the postmaster
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [Bizgres-general] WAL bypass for INSERT, UPDATE and

2006-01-05 Thread Bruce Momjian
Simon Riggs wrote:
  So, we need a name for EXCLUSIVE mode that suggests how it is different
  from TRUNCATE, and in this case, the difference is that EXCLUSIVE
  preserves the previous contents of the table on recovery, while TRUNCATE
  does not.  Do you want to call the mode PRESERVE, or EXCLUSIVE WRITER?
  Anyway, the keywords are easy to modify, even after the patch is
  submitted.  FYI, I usually go through keywords.c looking for a keyword
  we already use.
 
 I'm very happy for suggestions on what these new modes are called.
 
So, to summarize, I think we should add DROP/TRUNCATE, and use that by
default (or optionally off?) in pg_dump, and, assuming we want EXCLUSIVE
for more than just COPY, we need to add ALTER TABLE EXCLUSIVE.
   
   Would you mind stating again what you mean, just so I can understand
   this? Your summary isn't enough.
  
  New ALTER TABLE mode, perhaps call it PERSISTENCE:
  
  ALTER TABLE tab PERSISTENCE DROP ON RECOVERY
  ALTER TABLE tab PERSISTENCE TRUNCATE ON RECOVERY
  
  These would drop or truncate all tables with this flag on a non-clean
  start of the postmaster, and write something in the server logs. 
  However, I don't know that we have the code in place to DROP/TRUNCATE in
  recovery mode, and it would affect all databases, so it could be quite
  complex to implement.  In this mode, no WAL logs would be written for
  table modifications, though DDL commands would have to be logged.
 
 Right now, this will be a TODO item... it looks like it will take some
 thought to implement correctly.

OK, I know my suggestions have made it more complicated.

TODO added:

* Allow control over which tables are WAL-logged

  Allow tables to bypass WAL writes and just fsync() dirty pages on
  commit.  To do this, only a single writer can modify the table, and
  writes must happen only on new pages.  Readers can continue accessing
  the table.  This would affect COPY, and perhaps INSERT/UPDATE too.
  Another option is to avoid transaction logging entirely and truncate
  or drop the table on crash recovery.  These should be implemented
  using ALTER TABLE, e.g. ALTER TABLE PERSISTENCE [ DROP | TRUNCATE |
  STABLE | DEFAULT ].  Tables using non-default logging should not use
  referential integrity with default-logging tables, and tables using
  stable logging probably can not have indexes.  [walcontrol]


  ALTER TABLE tab PERSISTENCE PRESERVE (or STABLE?)
  
  Table contents are preserved across recoveries, but data modifications
  can happen only one at a time.  I don't think we have a lock mode that
  does this, so I am worried a new lock mode will have to be created.  A
  simplified solution at this stage would be to take an exclusive lock on
  the table, but really we just need a single-writer table lock, which I
  don't think we have. Initially this can be implemented to affect only
  COPY, but later it can be done for other commands.
 
 ExclusiveLock locks out everything apart from readers, no new lock mode
 AFAICS. Implementing that is little additional work for COPY.

Nice.

 Tom had a concern about setting this for I, U, D commands via the
 executor. Not sure what the details of that are, as yet.

That is much more complicated than the COPY-only idea, for sure.  I am
thinking we could add the ALTER syntax and just do COPY at this stage,
meaning that I/U/D still do full logging until we get to improving them.
The big benefit is that the user API doesn't need to change when we
improve the code.  In fact I think we could do the TRUNCATE/DROP easily
for I/U/D, but the STABLE option would require work and we don't need to
implement it in the first patch.

 We can use either of the unlogged modes for pg_dump, so I'd suggest it's
 this one. Everybody happy with this being the new default in pg_dump, or
 should it be an option?
 
  ALTER TABLE tab PERSISTENCE DEFAULT
  
  This would be our current default mode, which is full concurrency and
  persistence.
 
 I'm thinking whether the ALTER TABLE statement might be better with two
 bool flags rather than a 3-state char.
 
 flag 1: ENABLE LOGGING | DISABLE LOGGING
 
 flag 2: FULL RECOVERY | TRUNCATE ON RECOVERY
 
 Giving 3 possible sets of options:
 
 -- the default
 ALTER TABLE mytable ENABLE LOGGING FULL RECOVERY; (default)
 
 -- EXCLUSIVE mode
 ALTER TABLE mytable DISABLE LOGGING FULL RECOVERY;
 ...which would be used like this
   ALTER TABLE mytable DISABLE LOGGING;
   COPY or other bulk data manipulation SQL
   ALTER TABLE mytable ENABLE LOGGING;
 ...since FULL RECOVERY is the default.
 
 -- multiuser temp table mode
 ALTER TABLE mytable DISABLE LOGGING TRUNCATE ON RECOVERY;
 ...which would usually be left on all the time
 
 which only uses one new keyword LOGGING and yet all the modes are fairly
 explicit as to what they do.
 
 An alternative might be the slightly more verbose:
   ALTER TABLE mytable DISABLE LOGGING FORCE EXCLUSIVE TABLE LOCK;
 which would be turned off by
   ALTER TABLE 

Re: [HACKERS] [Bizgres-general] WAL bypass for INSERT, UPDATE and

2006-01-05 Thread Bruce Momjian
Simon Riggs wrote:
 On Thu, 2005-12-29 at 11:37 -0500, Bruce Momjian wrote:
  Having COPY behave differently because it is
  in a transaction is fine as long as it is user-invisible, but once you
  require users to do that to get the speedup, it isn't user-invisible
  anymore.
 
 Since we're agreed on adding ALTER TABLE rather than COPY LOCK, we have
 our explicit mechanism for speedup.
 
 However, it costs a single line of code and very very little execution
 time to add in the optimization to COPY to make it bypass WAL when
 executed in the same transaction that created the table. Everything else
 is already there.
 
 As part of the use_wal test:
 + if (resultRelInfo->ri_NumIndices == 0 &&
 +     !XLogArchivingActive() &&
 +     (cstate->rel->rd_createSubid != InvalidSubTransactionId))
 +     use_wal = false;
 
 the value is already retrieved from cache...
 
 Can anyone see a reason *not* to put that change in also? We just don't
 advertise it as the suggested route to gaining performance, nor would
 we rely on it for pg_dump/restore performance. 

Seems like a nice optimization.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Inconsistent syntax in GRANT

2006-01-05 Thread Bruce Momjian
Josh Berkus wrote:
 Euler,
 
  It should but it's not implemented yet. There is no difficulty in doing
  it. But I want to propose the following idea: if some object depends on
  another object and its type is 'DEPENDENCY_INTERNAL' we could
  grant/revoke privileges automagically to it. Or maybe create another
  type of dependency to do so.
  Comments?
 
 I think this would be difficult to work out.  Personally, the only 
 clear-cut case I can think of is SERIAL columns; other dependencies would 
 require a lot of conditional logic.

Added to TODO:

* Allow SERIAL sequences to inherit permissions from the base table?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Josh Berkus
Greg,

 We *currently* use a block based sampling algorithm that addresses this
 issue by taking care to select rows within the selected blocks in an
 unbiased way. You were proposing reading *all* the records from the
 selected blocks, which throws away that feature.

The block-based algorithms have specific math to cope with this.   Stuff 
which is better grounded in statistical analysis than our code.   Please 
read the papers before you judge the solution.

 Worse, my recollection from the paper I mentioned earlier was that
 sampling small percentages like 3-5% didn't get you an acceptable
 accuracy. Before you got anything reliable you found you were sampling
 very large percentages of the table. And note that if you have to sample
 anything over 10-20% you may as well just read the whole table. Random
 access reads are that much slower.

Right, which is why researchers are currently looking for something better.  
The Brutlag & Richardson paper claims to be able to produce estimates which are 
within +/- 3x 90% of the time using a 5% sample, which is far better than 
our current accuracy.  Nobody claims to be able to estimate based on < 
0.1% of the table, which is what Postgres tries to do for large tables.

5% based on block-based sampling is reasonable; that means a straight 5% of 
the on-disk size of the table, so 5gb for a 100gb table.  With random-row 
sampling, that would require as much as 25% of the table, making it easier 
to just scan the whole thing.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Jim C. Nasby
On Thu, Jan 05, 2006 at 10:12:29AM -0500, Greg Stark wrote:
 Worse, my recollection from the paper I mentioned earlier was that sampling
 small percentages like 3-5% didn't get you an acceptable accuracy. Before you
 got anything reliable you found you were sampling very large percentages of
 the table. And note that if you have to sample anything over 10-20% you may as
 well just read the whole table. Random access reads are that much slower.

If I'm reading backend/commands/analyze.c right, the heap is accessed
linearly, only reading blocks that get selected but reading them in heap
order, which shouldn't be anywhere near as bad as random access.
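A minimal sketch of that access pattern (illustrative only, not analyze.c): choose random block numbers, then visit them in ascending order so the reads sweep the file once rather than seeking back and forth.

    #include <stdio.h>
    #include <stdlib.h>

    static int
    cmp_uint(const void *a, const void *b)
    {
        unsigned x = *(const unsigned *) a, y = *(const unsigned *) b;
        return (x > y) - (x < y);
    }

    int
    main(void)
    {
        unsigned nblocks = 10000, sample[300];
        size_t   i;

        srand(42);
        for (i = 0; i < 300; i++)
            sample[i] = (unsigned) rand() % nblocks;        /* pick blocks at random */

        qsort(sample, 300, sizeof(unsigned), cmp_uint);     /* then read them in heap order */

        for (i = 0; i < 300; i++)
            printf("read block %u\n", sample[i]);
        return 0;
    }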
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


[HACKERS] when can we get better partitioning?

2006-01-05 Thread hubert depesz lubaczewski
hi
i was really more than happy when i saw table partitioning in release
info for 8.1. then i tried to use it, and hit some serious problem
(described on pgsql-general). basically the question is - is anybody at
the moment working on improving partitioning capabilities? like improving
queries to partitioned tables with ORDER BY and LIMIT statements?
if yes, when can we start thinking about release with this functionality?
i, my own, do not have enough c knowledge to do it by myself, yet i would
*really* like to have this feature improved, as it would save me and my
coworkers a lot of work and hassle that we face right now with rewriting
queries to ask directly partitions instead of master table, that got
partitioned.
best regards,
depesz


Re: [HACKERS] when can we get better partitioning?

2006-01-05 Thread Joshua D. Drake

hubert depesz lubaczewski wrote:

hi
i was really more than happy when i saw table partitioning in release 
info for 8.1.
then i tried to use it, and hit some serious problem (described on 
pgsql-general).
basically the question is - is anybody at the moment working on 
improving partitioning capabilities? like improving queries to 
partitioned tables with ORDER BY and LIMIT statements?

if yes, when can we start thinking about release with this functionality?

i, my own, do not have enough c knowledge to do it by myself, yet i 
would *really* like to have this feature improved, as it would save me 
and my coworkers a lot of work and hassle that we face right now with 
rewriting queries to ask directly partitions instead of master table, 
that got partitioned.


You could consider sponsoring one of the developers to enhance the feature.

Joshua D. Drake



best regards,

depesz




--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] [Bizgres-general] WAL bypass for INSERT, UPDATE and

2006-01-05 Thread Simon Riggs
On Thu, 2006-01-05 at 12:27 -0500, Bruce Momjian wrote:

 Seems like a nice optimization.

Negative thoughts: Toast tables have a toast index on them, yes? We have
agreed that we cannot use the optimization if we have indexes on the
main table. It follows that we cannot use the optimization if we have
*any* toasted data, since that would require a pointer between two
blocks, which would not be correctly recovered following a crash. If we
log the toast table then there could be a mismatch between heap and
toast table; if we don't log the toast table there could be a mismatch
between toast table and toast index.

We can test to see if the toast table is empty when we do ALTER TABLE,
but loading operations may try to create toasted data rows.

Presumably that means we must either:
i) abort a COPY if we get a toastable value
ii) if we get a toastable value, insert the row into a new block, which
we do logging of, then also log the toast insert and the toast index
insert - i.e. some blocks we log, others not

This is still useful for many applications, IMHO, but the list of
restrictions seems to be growing. Worse, we wouldn't know that the toast
tables were empty until after we did the COPY TO for a pg_dump, so we
wouldn't be able to retrospectively add an ALTER TABLE command ahead of
the COPY. 

Thoughts? Hopefully there are some flaws in my thinking here,

Best Regards, Simon Riggs




---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] catalog corruption bug

2006-01-05 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 We have encountered a very nasty but apparently rare bug which appears to
 result in catalog corruption.

I've been fooling around with this report today.  In several hours of
trying, I've been able to get one Assert failure from running Jeremy's
example on CVS tip.  (I would've given up long ago, except the Assert
happened very soon after I started trying...)  The assert was from this
line in hio.c:

Assert(PageIsNew((PageHeader) pageHeader));

which we've seen before in connection with the
vacuum-vs-relation-extension race condition found last May.  It seems
we still have an issue of that sort :-(.

While fruitlessly waiting for the test to fail again, I've been combing
through the code looking for possible failure paths, and I've found
something that might explain it.  I think this is definitely a bug even
if it isn't what's happening in Jeremy's test:

mdread() is defined to not fail, but silently return a page of zeroes,
if asked to read a page that's at or beyond the end of the table file.
(As noted therein, this seems like pretty sucky design, but there are
various reasons that make it hard to change the behavior.)

Therefore, if for some reason a process tries to read the page just at
EOF, it will leave behind a buffer pool entry that is marked BM_VALID
but contains zeroes.  There are a number of scenarios that could cause
this, but all seem rather low-probability.  One way is if a process'
rd_targblock field for a relation is pointing at the last page of the
file and then VACUUM comes along and truncates off that page because
it's empty.  The next insertion attempt by the process will try to
fetch that page, obtain all-zeroes, decide the page has no free space
(PageGetFreeSpace is carefully coded to be sure that happens), and
go looking for free space elsewhere.

Now suppose someone tries to obtain a new page in the relation by
calling ReadBuffer(rel, P_NEW).  The location of the new page is
determined by asking lseek() how long the file is.  ReadBuffer then
obtains a buffer for that file offset --- and it is going to get a
hit on the all-zeroes buffer previously left behind.  Since the buffer
is marked BM_VALID, the test "if it was already in the buffer pool,
we're done" succeeds and the buffer is returned as-is.  This is fine
as far as the caller knows: it's expecting to get back an all-zero
page, so it goes merrily along.  The problem is that if that code
path is taken, we *have not extended the file on disk*.

That means, until we get around to writing the dirty buffer to disk
(eg via checkpoint), the kernel thinks the file doesn't contain that
block yet.  So if someone else comes along and again does
ReadBuffer(rel, P_NEW), the lseek computation will return the same
offset as before, and we'll wind up handing back the very same buffer
as before.  Now we get the above-mentioned Assert, if we are lucky
enough to be running an assert-enabled build.  Otherwise the code
in hio.c will just wipe and reinitialize the page, leading to loss
of whatever rows were previously placed in it.

Based on this analysis, the logic in ReadBuffer is wrong: if it finds
an existing buffer in the P_NEW case, it still has to zero the page
and do smgrextend() to be sure that the kernel thinks the page has
been added to the file.
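A sketch of that fix, with hypothetical helper names standing in for the real bufmgr/smgr calls (this is not the committed patch):

    /* In the P_NEW path, a pre-existing buffer is not proof that the block
     * exists on disk: zero it and physically extend the relation anyway, so
     * the kernel's notion of the file length catches up with the buffer pool. */
    if (is_extend)
    {
        zero_page(buf_block);                               /* hypothetical helper */
        extend_relation_on_disk(rel, block_num, buf_block); /* hypothetical helper */
    }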

I'm also thinking that the test for empty page in hio.c ought to be
an actual test and elog, not just an Assert.  Now that we've seen
two different bugs allowing the "can't happen" case to happen, I'm no
longer satisfied with not having any check there in a non-assert build.
The consequences of not detecting an overwrite are too severe.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


[HACKERS] Questions on printtup()

2006-01-05 Thread Qingqing Zhou

I did some profiling related to printtup() with a simple libpq SELECT *
test program (revised from the libpq programming sample in the documentation,
without retrieving the results). There are 260k or so records in table
test(i int).

/* original version - prepare tuple and send */
SELECT * TIMING: 0.63 sec

/* Prepare but not send
In printtup():
- pq_endmessage(&buf);
+ pfree(buf.data);
+ buf.data = NULL;
 */
SELECT * TIMING: 0.46 sec

/* No prepare no send
In ExecSelect():
- (*dest->receiveSlot) (slot, dest);
*/
SELECT * TIMING: 0.08 sec

So we spend a portion of time in printtup() preparing tuples by
converting the tuple format to a network format. I am not quite familiar
with that part, so I wonder: is it possible to send the original
tuple format with a minimal preparing job (say, handling toast)? That
might increase the amount of data in the network communication, but may reduce
the CPU on the server side.

If this is not a non-starter, I am happy to look into details,

Regards,
Qingqing

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[HACKERS] Warm-up cache may have its virtue

2006-01-05 Thread Qingqing Zhou

Hinted by this thread:

http://archives.postgresql.org/pgsql-performance/2006-01/msg00016.php

I wonder if we should really implement a file-system-cache-warmup strategy,
which we have discussed before. There are two natural good places to do
this:

(1) sequential scan
(2) bitmap index scan

We can consider (2) as a generalized version of (1). For (1), we have
mentioned several heuristics, like keeping a scan interval to avoid competition.
These strategies are also applicable to (2).

Question: why file-system level, instead of buffer pool level?  For two
reasons:  (1) Notice that in the above thread, the user just uses
shared_buffers = 8192, which suggests that the file-system level is already
good enough; (2) it is easy to implement.

Use t*h*r*e*a*d? Well, I am a little bit afraid of mentioning this word.
But we can have some dedicated backends to do this - like bgwriter.

Let's dirty our hands!

Comments?

Regards,
Qingqing


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Questions on printtup()

2006-01-05 Thread Tom Lane
Qingqing Zhou [EMAIL PROTECTED] writes:
 So we spend a portion of time in printtup() preparing tuples by
 converting the tuple format to a network format. I am not quite familiar
 with that part, so I wonder: is it possible to send the original
 tuple format with a minimal preparing job (say, handling toast)? That
 might increase the amount of data in the network communication, but may reduce
 the CPU on the server side.

It's called retrieving in binary format ...
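For reference, a minimal libpq sketch of asking for binary-format results (assuming the test(i int) table from the test program above; the connection string is illustrative):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn   *conn = PQconnectdb("dbname=postgres");
        PGresult *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            PQfinish(conn);
            return 1;
        }

        /* resultFormat = 1 asks the server for binary-format values,
         * skipping the per-column text conversion done for text results. */
        res = PQexecParams(conn, "SELECT i FROM test",
                           0, NULL, NULL, NULL, NULL, 1);

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("%d rows retrieved in binary format\n", PQntuples(res));

        PQclear(res);
        PQfinish(conn);
        return 0;
    }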

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Warm-up cache may have its virtue

2006-01-05 Thread Tom Lane
Qingqing Zhou [EMAIL PROTECTED] writes:
 Hinted by this thread:
   http://archives.postgresql.org/pgsql-performance/2006-01/msg00016.php
 I wonder if we should really implement file-system-cache-warmup strategy
 which we have discussed before.

The difference between the cached and non-cached states is that the
kernel has seen fit to remove those pages from its cache.  It is
reasonable to suppose that it did so because there was a more immediate
use for the memory.  Trying to override that behavior will therefore
result in de-optimizing the global performance of the machine.

If the machine is actually dedicated to Postgres, I'd expect disk pages
to stay in cache without our taking any heroic measures to keep them
there.  If they don't, that's a matter for kernel configuration tuning,
not warmup processes.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Warm-up cache may have its virtue

2006-01-05 Thread Qingqing Zhou


On Thu, 5 Jan 2006, Tom Lane wrote:

 The difference between the cached and non-cached states is that the
 kernel has seen fit to remove those pages from its cache.  It is
 reasonable to suppose that it did so because there was a more immediate
 use for the memory.  Trying to override that behavior will therefore
 result in de-optimizing the global performance of the machine.


Yeah, so in other words, a warm-up cache is just a waste of time if the
pages are already in the OS cache. I agree with this. But does this mean it
may be worth experimenting with another strategy: a big-stomach Postgres, i.e.,
with a big shared_buffers value? With this strategy, (1) almost all the buffers
are in our control, and we will know when a pre-read is needed; (2) we avoid
double-buffering: though people are advised not to use a very big
shared_buffers value, in practice I see people gain performance by
increasing it to 20 or more.

Feasibility: Our bufmgr lock rewrite already makes this possible. But to
enable it, we may need more work: (w1) make the buffer pool relation-aware,
which makes our estimation of data page residence easier and more reliable;
(w2) add aggressive pre-read at the buffer pool level. Also, another benefit of
w1 is that our query planner could estimate query costs more precisely.

Regards,
Qingqing

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] catalog corruption bug

2006-01-05 Thread Jeremy Drake
Here is some additional information that I have managed to gather today
regarding this.  It is not really what causes it, so much as what does
not.

I removed all plperl from the loading processes.  I did a VACUUM FULL
ANALYZE, and then I reindexed everything in the database (Including
starting the backend in standalone mode and running REINDEX SYSTEM
dbname).  They still failed.

So it is apparently not that plperl issue which was being discussed
earlier.

Also, what I said about the corruption not having persisted into other
backends was not quite correct.  It was leaving behind types in pg_type
which were in some of the pg_temp* schemas which corresponded to some of
the temp tables.  But I took those out and still had issues (slightly
different).

Here is some interesting stuff too.  I just stopped my processes to start
up a batch again to copy the error message I got now, but before doing so
I was doing a VACUUM FULL ANALYZE VERBOSE so I could hopefully start from
a relatively clean state.  I got a few warnings I don't remember seeing
before.

INFO:  vacuuming pg_catalog.pg_shdepend
INFO:  pg_shdepend: found 108 removable, 440 nonremovable row versions
in 15 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 53 to 53 bytes long.
There were 1492 unused item pointers.
Total free space (including removable row versions) is 89780 bytes.
7 pages are or will become empty, including 0 at the end of the table.
12 pages containing 89744 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index pg_shdepend_depender_index now contains 448 row versions in
33 pages
DETAIL:  108 index row versions were removed.
23 index pages have been deleted, 23 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.10 sec.
WARNING:  index pg_shdepend_depender_index contains 448 row versions,
but table contains 440 row versions
HINT:  Rebuild the index with REINDEX.
INFO:  index pg_shdepend_reference_index now contains 448 row versions
in 12 pages
DETAIL:  108 index row versions were removed.
3 index pages have been deleted, 3 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING:  index pg_shdepend_reference_index contains 448 row versions,
but table contains 440 row versions
HINT:  Rebuild the index with REINDEX.
INFO:  pg_shdepend: moved 4 row versions, truncated 15 to 4 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index pg_shdepend_depender_index now contains 448 row versions in
33 pages
DETAIL:  4 index row versions were removed.
23 index pages have been deleted, 23 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING:  index pg_shdepend_depender_index contains 448 row versions,
but table contains 440 row versions
HINT:  Rebuild the index with REINDEX.
INFO:  index pg_shdepend_reference_index now contains 448 row versions
in 12 pages
DETAIL:  4 index row versions were removed.
4 index pages have been deleted, 4 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING:  index pg_shdepend_reference_index contains 448 row versions,
but table contains 440 row versions
HINT:  Rebuild the index with REINDEX.
INFO:  analyzing pg_catalog.pg_shdepend
INFO:  pg_shdepend: scanned 4 of 4 pages, containing 440 live rows and 0
dead rows; 440 rows in sample, 440 estimated total rows



Similar for pg_type, there being 248 index row versions vs 244 row
versions in the table.

1631 vs 1619 in pg_attribute
95 vs 94 in pg_index


Looks like it may be time to start a standalone backend and REINDEX
again...

-- 
Don't smoke the next cigarette.  Repeat.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] catalog corruption bug

2006-01-05 Thread Tom Lane
Jeremy Drake [EMAIL PROTECTED] writes:
 Here is some additional information that I have managed to gather today
 regarding this.  It is not really what causes it, so much as what does
 not.
 ...
 Similar for pg_type, there being 248 index row versions vs 244 row
 versions in the table.

The ReadBuffer bug I just fixed could result in disappearance of catalog
rows, so this observation is consistent with the theory that that's
what's biting you.  It's not proof though...

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Warm-up cache may have its virtue

2006-01-05 Thread Qingqing Zhou


On Thu, 5 Jan 2006, Qingqing Zhou wrote:

 Feasibility: Our bufmgr lock rewrite already makes this possible. But to
 enable it, we may need more work: (w1) make the buffer pool relation-aware,
 which makes our estimation of data page residence easier and more reliable;
 (w2) add aggressive pre-read at the buffer pool level. Also, another benefit of
 w1 is that our query planner could estimate query costs more precisely.


w1 is doable by introducing a shared-memory bitmap indicating which
pages of a relation are in the buffer pool (we may want to add a hash to
manage the relations). Theoretically, O(shared_buffers) bits are enough, so
this will not use a lot of space.

When we maintain the SharedBufHash, we maintain this bitmap. When we do
query cost estimation or pre-read, we just need a rough number, so this can
be done by scanning the bitmap without a lock. Thus there is also almost no
extra cost.
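A self-contained sketch of such a bitmap (hypothetical names, sized here for shared_buffers = 8192; not backend code):

    #include <stdint.h>
    #include <stdio.h>

    #define NBUFFERS 8192
    static uint8_t page_bitmap[(NBUFFERS + 7) / 8];   /* one bit per buffer slot */

    static void bitmap_set(int slot)   { page_bitmap[slot >> 3] |=  (uint8_t) (1u << (slot & 7)); }
    static void bitmap_clear(int slot) { page_bitmap[slot >> 3] &= (uint8_t) ~(1u << (slot & 7)); }
    static int  bitmap_test(int slot)  { return (page_bitmap[slot >> 3] >> (slot & 7)) & 1; }

    int
    main(void)
    {
        bitmap_set(42);                   /* page enters the buffer pool */
        printf("%d\n", bitmap_test(42));  /* 1 */
        bitmap_clear(42);                 /* page is evicted */
        printf("%d\n", bitmap_test(42));  /* 0 */
        return 0;
    }

A lock-free scan of this array only gives an approximate count, which is all the cost estimation and pre-read decisions above would need.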

Regards,
Qingqing



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Greg Stark

Josh Berkus josh@agliodbs.com writes:

 Greg,
 
  We *currently* use a block based sampling algorithm that addresses this
  issue by taking care to select rows within the selected blocks in an
  unbiased way. You were proposing reading *all* the records from the
  selected blocks, which throws away that feature.
 
 The block-based algorithms have specific math to cope with this.   Stuff 
 which is better grounded in statistical analysis than our code.   Please 
 read the papers before you judge the solution.

I'm confused since Postgres's current setup *was* based on papers on just this
topic. I haven't read any of the papers myself, I'm just repeating what was
previously discussed when the current block sampling code went in. There was
extensive discussion of the pros and cons of different algorithms taken from
various papers and how they related to Postgres. I don't recall any of them
suggesting that there was any way to take a sample which included every row
from a sampled block and then somehow unbias the sample after the fact.

 Right, which is why researchers are currently looking for something better.  
 The Brutlag & Richardson paper claims to be able to produce estimates which are 
 within +/- 3x 90% of the time using a 5% sample, which is far better than 
 our current accuracy.  Nobody claims to be able to estimate based on < 
 0.1% of the table, which is what Postgres tries to do for large tables.
 
 5% based on block-based sampling is reasonable; that means a straight 5% of 
 the on-disk size of the table, so 5gb for a 100gb table.  With random-row 
 sampling, that would require as much as 25% of the table, making it easier 
 to just scan the whole thing.

Postgres's current sample sizes are clearly geared towards the histograms
where they are entirely realistic. All of the distinct estimates are clearly
just ad hoc attempts based on the existing sampling. 

Is a mechanism that is only 5x faster than reading the whole table (assuming
random_page_cost of 4) and is off by more than a factor of three 10% of the
time really worth it?

-- 
greg


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match