Re: [HACKERS] Multi-Dimensional Histograms

2009-07-05 Thread Gregory Maxwell
On Mon, Jun 29, 2009 at 10:22 PM, Robert Haas <robertmh...@gmail.com> wrote:
 I'm finding myself unable to follow all the terminology on this thread.
  What's dimension reduction?  What's PCA?
[snip]

Imagine you have a dataset with two variables, say height in inches
and age in years. For the purpose of discussion let's pretend for a
moment that all the people in your sample have a height equal to
their age.

You could create a 2d histogram of your data:

 |0002
 |0060
a|0300
g|4000
e|0003
 |0010
 |0100
 |
  height

You could store this 2d histogram as-is and use it for all the things
you'd normally use histograms for, or you could take note of the
structure, apply a rotation, and flatten the data into a 1d histogram:

[0113426200...]  which is far more compact.

Data often has significant correlation, so it is frequently possible
to reduce the dimensionality without greatly reducing the selectivity
of the histogram.

This becomes tremendously important as the number of dimensions goes
up, because the volume of an N-dimensional space grows incredibly
fast as the number of dimensions increases.

PCA (principal component analysis) is one method of dimensionality
reduction. In PCA you find a linear transformation (scaling and
rotation) that aligns the data so that the axes cut through the
data space along the orientations with the greatest variance.

I have no clue how you would apply PCA to postgresql histograms, since
to build the PCA transform you need to do some non-trivial operations
with the data.  Perhaps PCA could be done on a random sample of a
table, then that transformation could be stored and used to compute
the histograms. I'm sure there has been a lot of research on this.
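
For concreteness, here is a rough, purely illustrative sketch (Python/numpy,
nothing that exists in the backend) of what the transform buys you: build the
PCA rotation from a correlated two-column sample, project onto the first
principal component, and histogram the result. The sample data and bin count
are made up.

  import numpy as np

  # Toy sample: two strongly correlated columns, like the height/age example.
  rng = np.random.default_rng(0)
  age = rng.uniform(1, 80, size=1000)
  height = age + rng.normal(0, 2, size=1000)
  sample = np.column_stack([age, height])

  # PCA: center the sample and take the eigenvectors of its covariance matrix.
  centered = sample - sample.mean(axis=0)
  eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
  first_pc = eigvecs[:, np.argmax(eigvals)]     # direction of greatest variance

  # Project the 2d cloud onto that direction; a 1d histogram of the projection
  # stands in for the full 2d histogram with very little selectivity lost.
  projected = centered @ first_pc
  counts, edges = np.histogram(projected, bins=10)
  print(counts)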



Re: [HACKERS] Significantly larger toast tables on 8.4?

2009-01-07 Thread Gregory Maxwell
On Fri, Jan 2, 2009 at 5:48 PM, Martijn van Oosterhout
klep...@svana.org wrote:
 So you compromise. You split the data into say 1MB blobs and compress
 each individually. Then if someone does a substring at offset 3MB you
 can find it quickly. This barely costs you anything in the compression
 ratio mostly.

 Implementation though, that's harder. The size of the blobs is tunable
 also. I imagine the optimal value will probably be around 100KB. (12
 blocks uncompressed).

Or have the database do that internally: with the available fast
compression algorithms (zlib, LZO, LZF, etc.) the diminishing returns
from larger compression block sizes kick in rather quickly. Other
algorithms like LZMA or bzip2 gain more from bigger block sizes, but I
expect all of them are too slow to ever consider using in PostgreSQL.

So I expect that the compression loss from compressing in chunks of
64 kB would be minimal. The database could then include a list of
offsets for the 64 kB chunks at the beginning of the field, or
something like that.  A short substring would then require
decompressing just one or two blocks, far less overhead than
decompressing everything.
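
A toy sketch of that layout (Python, with zlib standing in for whatever
compressor the database would actually use; the 64 kB chunk size and the
function names are just illustrative):

  import zlib

  CHUNK = 64 * 1024     # uncompressed chunk size

  def compress_chunked(data):
      """Compress in fixed-size chunks, recording where each compressed
      chunk starts so a substring can touch only the chunks it needs."""
      offsets, blobs, pos = [], [], 0
      for i in range(0, len(data), CHUNK):
          blob = zlib.compress(data[i:i + CHUNK])
          offsets.append(pos)
          blobs.append(blob)
          pos += len(blob)
      return offsets, b"".join(blobs)

  def substring_chunked(offsets, blob, start, length):
      """Return data[start:start+length] by decompressing only the
      chunks that overlap the requested range."""
      first, last = start // CHUNK, (start + length - 1) // CHUNK
      out = bytearray()
      for i in range(first, last + 1):
          end = offsets[i + 1] if i + 1 < len(offsets) else len(blob)
          out += zlib.decompress(blob[offsets[i]:end])
      skip = start - first * CHUNK
      return bytes(out[skip:skip + length])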

It would probably be worthwhile to graph compression ratio vs block
size for some reasonable input.  I'd offer to do it; but I doubt I
have a reasonable test set for this.



Re: [HACKERS] Spinlock backoff algorithm

2007-11-14 Thread Gregory Maxwell
On Nov 14, 2007 10:12 PM, Joshua D. Drake [EMAIL PROTECTED] wrote:
 http://www.intel.com/performance/server/xeon/intspd.htm
 http://www.intel.com/performance/server/xeon/fpspeed.htm

That says precisely nothing about the matter at hand. Someone should
simply change it and benchmark it in pgsql. I doubt you'll see a
difference on regular AMD/Intel hardware ... and if it makes the Sun
hyperthreaded CPU happier...



[HACKERS] GIST and GIN indexes on varchar[] aren't working in CVS.

2007-09-01 Thread Gregory Maxwell
There seems to be some behavior change in current CVS with respect to
gist and gin indexes on varchar[]. Some side effect of the tsearch2
merge?


\d search_pages
 Table public.search_pages
  Column   |Type | Modifiers
---+-+---
 page_name | character varying   |
 cats  | character varying[] |
Indexes:
search_pages_page UNIQUE, btree (page_name)

create index search_pages_cats on search_pages using gin (cats);
ERROR:  missing support function 1 for attribute 1 of index search_pages_cats

create index search_pages_cats on search_pages using gist (cats);
ERROR:  data type character varying[] has no default operator class
for access method gist
HINT:  You must specify an operator class for the index or define a
default operator class for the data type.

This works fine in 8.2, for example:
 \d search_pages
 Table public.search_pages
  Column   |Type | Modifiers
---+-+---
 page_name | character varying   |
 cats  | character varying[] |
Indexes:
search_pages_page UNIQUE, btree (page_name)
search_pages_cats gin (cats)



Re: [HACKERS] Load Distributed Checkpoints test results

2007-06-15 Thread Gregory Maxwell

On 6/15/07, Gregory Stark [EMAIL PROTECTED] wrote:

While in theory spreading out the writes could have a detrimental effect I
think we should wait until we see actual numbers. I have a pretty strong
suspicion that the effect would be pretty minimal. We're still doing the same
amount of i/o total, just with a slightly less chance for the elevator
algorithm to optimize the pattern.


..and the sorted-writes patch suggests that the OS's elevator isn't
doing a great job for large flushes in any case. I wouldn't be shocked
to see load-distributed checkpoints cause an unconditional improvement,
since they may do better at avoiding the huge burst behavior that is
overrunning the OS elevator.



Re: [HACKERS] Sorted writes in checkpoint

2007-06-14 Thread Gregory Maxwell

On 6/14/07, Simon Riggs [EMAIL PROTECTED] wrote:

On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
 Greg Smith [EMAIL PROTECTED] wrote:

  On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
   If the kernel can treat sequential writes better than random writes, is
   it worth sorting dirty buffers in block order per file at the start of
   checkpoints?

 I wrote and tested the attached sorted-writes patch base on Heikki's
 ldc-justwrites-1.patch. There was obvious performance win on OLTP workload.

  tests                     | pgbench | DBT-2 response time (avg/90%/max)
 ---------------------------+---------+-----------------------------------
  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s

 (*) Don't write buffers that were dirtied after starting the checkpoint.

 machine : 2GB-ram, SCSI*4 RAID-5
 pgbench : -s400 -t4 -c10  (about 5GB of database)
 DBT-2   : 60WH (about 6GB of database)

I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
of writes has been saved by doing that? We would expect a small
percentage of blocks only and so that shouldn't make a significant
difference. I thought we discussed this before, about a year ago. It
would be easy to get that wrong and to avoid writing a block that had
been re-dirtied after the start of checkpoint, but was already dirty
beforehand. How long was the write phase of the checkpoint, how long
between checkpoints?

I can see the sorted writes having an effect because the OS may not
receive blocks within a sufficient time window to fully optimise them.
That effect would grow with increasing sizes of shared_buffers and
decrease with size of controller cache. How big was the shared buffers
setting? What OS scheduler are you using? The effect would be greatest
when using Deadline.


Linux has some instrumentation that might be useful for this testing:

echo 1 > /proc/sys/vm/block_dump

will have the kernel log all physical I/O (disable syslog writing to
disk before turning it on if you don't want the system to blow up).

Certainly the OS elevator should be working well enough that we would
not see that much of an improvement. Perhaps the frequent fsync
behavior is having an unintended interaction with the elevator?  ... It
might be worthwhile to contact some Linux kernel developers and see if
there is some misunderstanding.



Re: [HACKERS] [GENERAL] Index greater than 8k

2006-11-01 Thread Gregory Maxwell

On 11/1/06, Teodor Sigaev [EMAIL PROTECTED] wrote:
[snip]

Brain storm method:

Develop a dictionary which returns all substrings of a lexeme; for example, for the
word foobar it will be 'foobar fooba foob foo fo oobar ooba oob oo obar oba ob
bar ba ar'. And make a GIN functional index over your column (to save disk space).

[snip]

Search time in GIN depends only weakly on the number of words (opposite to
tsearch2/GiST), but insertion of a row may be slow enough


With the right folding, the number of possible trigrams for ASCII text
is fairly small.. much smaller than the number of words used in a
large corpus of text, so GIN performance for searches should be
pretty good.
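
To make the "small trigram universe" point concrete, a rough Python sketch
with a deliberately crude fold down to lower-case a-z plus space (this is not
how tsearch or any contrib module actually tokenizes):

  import re

  def fold(text):
      # Case-fold and collapse everything outside a-z to a single space.
      return re.sub(r"[^a-z]+", " ", text.lower()).strip()

  def trigrams(text):
      padded = "  " + fold(text) + " "    # pad so word edges form trigrams too
      return {padded[i:i + 3] for i in range(len(padded) - 2)}

  print(trigrams("Foobar!"))   # 7 folded trigrams for one word
  print(27 ** 3)               # 19683: the entire folded trigram universe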

Real magic would be to teach the regex operator to transparently make
use of such an index. ;)



Re: [HACKERS] New CRC algorithm: Slicing by 8

2006-10-24 Thread Gregory Maxwell

On 10/24/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

I wasn't aware that a system could protect against this. :-)

I write 8 Kbytes - how can I guarantee that the underlying disk writes
all 8 Kbytes before it loses power? And why isn't the CRC a valid means
of dealing with this? :-)


[snip]

A file system with an appropriate transaction method could do this..
In *theory* reiser4 write()s are atomic. No one has verified, however,
that there is no torn-page risk introduced in some other part of the
kernel.

I'm not aware of any other system which can guarantee the atomicity
of 8k writes.



Re: [HACKERS] New CRC algorithm: Slicing by 8

2006-10-21 Thread Gregory Maxwell

On 10/21/06, Tom Lane [EMAIL PROTECTED] wrote:
[snip]

It hasn't even been tested.  One thing I'd want to know about is the
performance effect on non-Intel machines.


On an Opteron 265 his test code shows SB8 (the Intel algorithm) is 2.48x faster
for checksum and 1.95x faster for verify on the 800 * 1024 blocks of
1 KB each workload. For 10 blocks of 8k I got similar results as
well.

It looks like the new code may have a larger cache footprint, so
actual performance may differ from the microbenchmark.



Re: [HACKERS] Replication

2006-08-21 Thread Gregory Maxwell

On 8/21/06, Alvaro Herrera [EMAIL PROTECTED] wrote:

But the confirmation that needs to come is that the WAL changes have
been applied (fsync'ed), so the performance will be terrible.  So bad,
that I don't think anyone will want to use such a replication system ...


Okay, I give up... Why is waiting for an fsync over a fast local network
which takes 15us to send a message (InfiniBand is cheap..) an
unimaginable delay, when we tolerate a local 8ms fsync delay on systems
without a writeback cache?
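
Back-of-the-envelope, with the two rough figures quoted above (Python, just to
make the ratio explicit):

  local_fsync = 8e-3    # ~8 ms commit fsync on a disk without writeback cache
  network_rtt = 15e-6   # ~15 us message round trip on a fast local interconnect
  print(f"{100 * network_rtt / local_fsync:.2f}% added to the commit path")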



Re: [HACKERS] How does the planner deal with multiple possible indexes?

2006-07-19 Thread Gregory Maxwell

On 7/19/06, Jim C. Nasby [EMAIL PROTECTED] wrote:
[snip]

\d does list bdata__ident_filed_departure before bdata_ident; I'm
wondering if the planner is finding the first index with ident_id in it
and stopping there?



From my own experience it was grabbing the first index that has the
requested field as its first member.. I haven't looked at the code to
see if that is the intended behavior.



Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Gregory Maxwell

Oh come on. Sorry to troll, but this is too easy.

On 5/15/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

You guys have to kill your Windows hate - in jest or otherwise. It's
zealous, and blinding.

[snip]

Why would it
be assumed, that a file system designed for use from a desktop, would be
optimal at all for database style loads?


It wouldn't.
Why would someone use a desktop OS for a database?
Why would you call the result of answering the previous question
zealous and blinding?

PG's use of the OS's block cache is a good move because it makes PG
tend to 'just work' where the alternatives require non-trivial tuning
(sizing their caches not to push the OS into swap).  The advantages of
this are great enough that if additional smarts are needed in the OS
cache it might well be worth the work to add it there and to ask for
new fadvise flags to get the desired behavior.

That's something that would be easy enough for a dedicated hacker to
do, or easy enough to collaborate with the OS developers if the need
could be demonstrated clearly enough.

What reasonable OS couldn't you do that with?

:)



Re: [HACKERS] Support Parallel Query Execution in Executor

2006-04-09 Thread Gregory Maxwell
On 4/9/06, Tom Lane [EMAIL PROTECTED] wrote:
 Gregory Maxwell [EMAIL PROTECTED] writes:
  For example, one case made in this thread involved bursty performance
  with seqscans presumably because the I/O was stalling while processing
  was being performed.

 Actually, the question that that raised in my mind is why isn't the
 kernel doing read-ahead properly?  When we're doing nonsequential
 access like an indexscan, it's unsurprising that the kernel can't guess
 which block we need next, but in a plain seqscan you'd certainly expect
 the read-ahead algorithm to kick in and ensure that the next block is
 fetched before we need it.

 So before we go inventing complicated bits of code with lots of added
 overhead, we should first find out exactly why the system doesn't
 already work the way it's supposed to.

But is that really the behavior we should expect?

How much memory can we expect the OS to spend on opportunistic
read-in? How much disk access should be spent on a guess? There is an
intrinsic tradeoff here, applications tend to be bursty so just
because you're reading a lot now doesn't mean you'll continue... and
the filesystem will have fragmentation, so a failed guess can
translate into a lot of pointless seeking.

As I recall, in Linux 2.6 you have something like a max of 128KB of
readahead. Given that and a disk subsystem that reads at 200MB/sec, you
can't spend more than about 600us processing before requesting enough
additional blocks to put the disk back into readahead, or you will stall
the disk.  Stalling the disk costs more than you'd expect: due to FS
fragmentation there can be terrific gains from allowing the OS and
disk to issue reads out of order from a large request queue.
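
The arithmetic behind that ~600us figure (rough numbers only):

  readahead  = 128 * 1024          # bytes the kernel will keep in flight
  throughput = 200 * 1024 * 1024   # bytes/sec the disk subsystem can deliver
  budget_us = readahead / throughput * 1e6
  print(f"{budget_us:.0f} us of processing before the readahead window drains")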

It would probably be reasonable to say that the OS should be using
much larger readahead buffers, especially on systems with fast disk
subsystems... But that doesn't come for free and can slaughter
performance for many workloads (consider what would happen if it triggered
5MB of file-oriented read-ahead for every index scan seek we did).
There is an adaptive readahead patch for Linux which should improve
things (http://lwn.net/Articles/176279/ and if you google around there
are some benchmarks) but I doubt that even that would be able to keep
a 200MB/sec+ disk subsystem saturated with the sort of access patterns
PG has...

Addressing this in a cross-platform way will be a challenge. I doubt
Linux is alone in having skimpy readahead (because big readahead
translates into huge losses if you get it wrong).

Given this information, a stupid 'fault-in' process should probably
give huge gains for seqscans... but I think the extra work required to
find a solution which is also useful for index operations is probably
worth it as well.



Re: [HACKERS] Support Parallel Query Execution in Executor

2006-04-09 Thread Gregory Maxwell
On 4/9/06, Tom Lane [EMAIL PROTECTED] wrote:
 Certainly.  If the OS has readahead logic at all, it ought to think that
 a seqscan of a large table qualifies.  Your arguments seem to question
 whether readahead is useful at all --- but they would apply *just as
 well* to an app doing its own readahead, which is what is really
 getting proposed in this thread.

We know we're going to read the whole table; the OS doesn't. We can be
confident that we will not use our read-ahead when we're really
doing random accesses.

The OS has to deal with many applications with many workloads running
on a wider spectrum of hardware. It's important that it does the right
thing, but probably more important that it doesn't do the wrong thing.
This encourages erring on the side of small readahead.

 Before we go replacing a standard OS-level facility with our own
 version, we need to have a much clearer idea of why the OS isn't getting
 the job done for us.  Otherwise we're likely to write a large amount of
 code and find out that it doesn't work very well either.

That's a fair position...  It would be useful to know how much
readahead PG needs in order to keep a high-speed disk subsystem
saturated. This would involve profiling how frequently PG requests
data and how much it requests when running out of a hot cache. We could
then say that the OS would need to read ahead x MB to keep a y MB/s disk
subsystem saturated.

It would be good to know how much FBSD will readahead...

It might also be interesting for someone with the right testing rig on
linux to try the adaptive
readahead patch to see if that improves PG's ability to keep the disk busy.



Re: [HACKERS] Support Parallel Query Execution in Executor

2006-04-09 Thread Gregory Maxwell
On 4/9/06, Luke Lonergan [EMAIL PROTECTED] wrote:
 Gregory,

 On 4/9/06 1:36 PM, Gregory Maxwell [EMAIL PROTECTED] wrote:

  It might also be interesting for someone with the right testing rig on
  linux to try the adaptive
  readahead patch to see if that improves PG's ability to keep the disk busy.

 the adaptive readahead patch?  Did I miss one?

 We will happily test experimental patches that improve I/O utilitization.
 We have an assortment of gear with high speed I/O, mostly Linux now.

It's a Linux kernel patch; I mentioned it in a prior post.

http://www.vanheusden.com/ara/

It increases Linux's maximum readahead from 128K to 1MB .. and it
should be smart enough that you could crank it up further without too
much risk of hurting performance elsewhere.

If PG is bottlenecked on seqscans due to insufficient readahead, I'd
expect this to show an improvement...  although I am still somewhat
doubtful that it'll be enough to keep the disk saturated if PG's
request behavior is highly suboptimal.



Re: [HACKERS] Support Parallel Query Execution in Executor

2006-04-08 Thread Gregory Maxwell
On 4/8/06, Tom Lane [EMAIL PROTECTED] wrote:
 This is exactly the bit of optimism I was questioning.  We've already
 been sweating blood trying to reduce multiprocessor contention on data
 structures in which collisions ought to be avoidable (ie, buffer arrays
 where you hope not everyone is hitting the same buffer at once).  I
 think passing large volumes of data between different processes is going
 to incur quite a lot of locking overhead, pipeline stalls for cache line
 transfers, etc, etc, because heavy contention for the transfer buffer is
 simply not going to be avoidable.

We should consider true parallel execution and overlapping execution
with I/O as distinct cases.

For example, one case made in this thread involved bursty performance
with seqscans presumably because the I/O was stalling while processing
was being performed.  In general this can be avoided without parallel
execution through the use of non-blocking I/O and making an effort to
keep the request pipeline full.

There are other cases where it is useful to perform parallel I/O
without parallel processing.. for example: a query that will perform
an index lookup per row can benefit from running some number of those
lookups in parallel in order to hide the lookup latency and give the
OS and disk elevators a chance to make the random accesses a little
more orderly. This can be accomplished without true parallel
processing. (Perhaps PG does this already?)
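
A rough user-space sketch of that second case (Python on a POSIX system; the
path, block list, and function name are made up, and the point is only the
"hint a whole batch, then read" pattern, not how PG would actually implement
it):

  import os

  def prefetch_then_read(path, offsets, block_size=8192):
      """Hint the kernel about a batch of upcoming random reads before
      performing them, so the OS and disk elevator can reorder the seeks."""
      fd = os.open(path, os.O_RDONLY)
      try:
          for off in offsets:                  # issue all the hints up front
              os.posix_fadvise(fd, off, block_size, os.POSIX_FADV_WILLNEED)
          return [os.pread(fd, block_size, off) for off in offsets]
      finally:
          os.close(fd)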

Parallel execution to get access to more CPU and memory bandwidth is a
fine thing, and worth the costs in many cases... but it shouldn't be
used as an easy way to get parallel IO without careful consideration.



[HACKERS] [GENERAL] A real currency type

2006-03-21 Thread Gregory Maxwell
On 3/21/06, Jim C. Nasby [EMAIL PROTECTED] wrote:
 ISTM that having a currency type is pretty common for most databases; I
 don't really see any reason not to just include it. Likewise for a type
 that actually stores timezone info with a timestamp.

This really should be generalized to work with all the base types
because there are strong use cases for each. (timezones with
timestamps, currencies with numeric, physical units with floats and
ints)



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Gregory Maxwell
On 2/17/06, Ragnar [EMAIL PROTECTED] wrote:
 Say again ?
 Let us say you have 1 billion rows, where the
 column in question contains strings like
 baaaaaa
 baaaaab
 baaaaac
 ...
 not necessarily in this order on disc of course

 The minimum value would be keyed as 0001h,
 the next one as 0002h and so on.

 Now insert new value 'a'

 Not only will you have to update 1 billion records,
 but also all the values in your map.

 please explain

No comment on the usefulness of the idea overall.. but the solution
would be to insert the new value with a key colliding with that of the
nearest existing value below it..

The keys will then falsely claim equality, which you must fix with a
second, local sort; that should be fast because you only need to sort the
duplicates/false dupes.  If you insert too much then this obviously
becomes completely useless.
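
A small sketch of that fix-up pass (Python; the key map is contrived to force
exactly the collision described above):

  from itertools import groupby

  def keyed_sort(rows, key_of):
      # Cheap sort on the (possibly colliding) integer keys first...
      rows = sorted(rows, key=key_of)
      out = []
      # ...then re-sort each run of equal keys by the real value to repair
      # any falsely-equal entries.
      for _, run in groupby(rows, key=key_of):
          out.extend(sorted(run))
      return out

  keys = {"baaaaaa": 1, "baaaaab": 2, "baaaaac": 3}
  keys["a"] = 1                                # new value reuses an existing key
  print(keyed_sort(["baaaaac", "a", "baaaaab", "baaaaaa"], keys.__getitem__))
  # ['a', 'baaaaaa', 'baaaaab', 'baaaaac']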



Re: [HACKERS] Why don't we allow DNS names in pg_hba.conf?

2006-02-13 Thread Gregory Maxwell
On 2/13/06, Joshua D. Drake [EMAIL PROTECTED] wrote:
 Well as one of the people that deploys and managees many, many
 postgresql installations I can say I have never run into the need to
 have dns names and the thought of dns names honestly seems silly. It
 will increase overhead and dependencies that I just wouldn't want in my
 installations.

It is not uncommon for an environment that has already suffered
through one forced renumbering to forbid the use of hard-coded IPs in
application software.

With IPv6 we will just see more and more of that.



[HACKERS] Fixing row comparison semantics

2005-12-26 Thread Gregory Maxwell
On 12/26/05, Pavel Stehule [EMAIL PROTECTED] wrote:
 (1,1) * (1,2) = true
 (1,2) * (2,1) is NULL
 (2,3) * (1,2) = false

 it's usefull for multicriterial optimalisation

This is indeed a sane and useful function which should be adopted by
the SQL standard.. in PostgreSQL it could easily enough be
implemented as a user function, so I'm not sure we need special support
for it.

The idea is that in a multidimensional comparison you can only sometimes
say that one tuple is strictly less than (or greater than) another,
because differing dimensions are incomparable.  So, as in his example,
we cannot say whether (1,2) is less than or greater than (2,1), because
saying so would require some priority among the dimensions which may not
be known or may not exist; it is only clear that they are not equal..
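
A sketch of that three-valued comparison (Python, with None standing in for
SQL NULL; this is only the semantics, not the syntax Pavel proposed):

  def row_less_than(a, b):
      """True if every field of a is <= the matching field of b and the rows
      differ, False for the reverse, None when the rows are incomparable."""
      if a == b:
          return False
      if all(x <= y for x, y in zip(a, b)):
          return True
      if all(x >= y for x, y in zip(a, b)):
          return False
      return None                    # e.g. (1,2) vs (2,1): neither dominates

  print(row_less_than((1, 1), (1, 2)))   # True
  print(row_less_than((1, 2), (2, 1)))   # None  ("NULL" in the example above)
  print(row_less_than((2, 3), (1, 2)))   # False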



Re: [HACKERS] Upcoming PG re-releases

2005-12-08 Thread Gregory Maxwell
On 12/8/05, Bruce Momjian <pgman@candle.pha.pa.us> wrote:
A script which identifies non-utf-8 characters and provides some
context, line numbers, etc, will greatly speed up the process of
remedying the situation.

 I think the best we can do is the iconv -c with the diff idea, which
 is already in the release notes.  I suppose we could merge the iconv and
 diff into a single command, but I don't see a portable way to output the
 iconv output to stdout, /dev/stdin not being portable.

No, what is needed for people who care about fixing their data is a
loadable strip_invalid_utf8() that works in older versions.. then just
select * from bar where foo != strip_invalid_utf8(foo);  The function
would be useful in general, for example, if you have an application
which doesn't already have much utf8 logic, you want to use a text
field, and stripping is the behaviour you want. For example, lots of
simple web applications.
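
The stripping itself is trivial; roughly this, sketched in Python for
illustration (the real thing would have to be a C function loadable into the
older server):

  def strip_invalid_utf8(raw: bytes) -> bytes:
      # Drop any byte sequences that are not valid UTF-8 and keep the rest.
      return raw.decode("utf-8", errors="ignore").encode("utf-8")

  # A latin-1 0xE9 embedded in otherwise valid UTF-8 text gets dropped:
  print(strip_invalid_utf8(b"caf\xe9 au lait"))   # b'caf au lait'

With something like that installed server-side, the "where foo !=
strip_invalid_utf8(foo)" query above finds the affected rows.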



Re: [HACKERS] Replication on the backend

2005-12-06 Thread Gregory Maxwell
On 12/6/05, Jan Wieck [EMAIL PROTECTED] wrote:
  IMO this is not true. You can get affordable 10GBit network adapters, so 
  you can have plenty of bandwith in a db server pool (if they are located in 
  the same area). Even 1GBit Ethernet greatly helps here, and would make it 
  possible to balance read-intensive (and not write intensive) applications. 
  We using linux bonding interface with 2 gbit NICs, and 200 MBytes/sec 
  throughput is something you need to have a quite some harddisks to reach 
  that. Latency is not bad too.

 It's not so much the bandwidth but more the roundtrips that limit your
 maximum transaction throughput. Remember, whatever the priority, you
 can't increase the speed of light.

Eh, why would a light-speed-limited delay be any slower than a disk on FC the
same distance away? :)

In any case, the performance of PG on iSCSI is just fine. You can't blame
the network... Doing multimaster replication is hard because the
locking primitives that are fine on a simple multiprocessor system
(with a VERY high-bandwidth, very low-latency interconnect between
processors) just don't work across a network, so you're left finding
other methods and making them work...

But again, multimaster isn't hard because of some inherently
slow property of networks.



Re: [HACKERS] Reduce NUMERIC size by 2 bytes, reduce max length to 508 digits

2005-12-05 Thread Gregory Maxwell
On 12/5/05, Tom Lane [EMAIL PROTECTED] wrote:
  Not only does 4000! not work, but 400! doesn't even work.  I just lost
  demo wow factor points!

 It looks like the limit would be about factorial(256).

 The question remains, though, is this computational range good for
 anything except demos?

I've hesitated commenting, because I think it might be a silly reason,
but perhaps it's one other people share.  ...  I use PG as a
calculator for big numbers because it's the only user-friendly thing
on my system that can do factorial(300) - factorial(280). I'd rather
use something like Octave, but I've found it's pretty easy to escape
its range.   If the range for computation is changed, then I'll
probably keep an old copy around just for this, though I'm not quite
sure how much I'd be affected..



[HACKERS] Upcoming PG re-releases

2005-12-04 Thread Gregory Maxwell
On 12/4/05, Tom Lane [EMAIL PROTECTED] wrote:
 Paul Lindner [EMAIL PROTECTED] writes:
  On Sun, Dec 04, 2005 at 11:34:16AM -0500, Tom Lane wrote:
  Paul Lindner [EMAIL PROTECTED] writes:
  iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql
 
  Is that really a one-size-fits-all solution?  Especially with -c?

  I'd say yes, and the -c flag is needed so iconv strips out the
  invalid characters.

 That's exactly what's bothering me about it.  If we recommend that,
 we had better put a large THIS WILL DESTROY YOUR DATA warning first.
 The problem is that the data is not invalid from the user's point
 of view --- more likely, it's in some non-UTF8 encoding --- and so
 just throwing away some of the characters is unlikely to make people
 happy.

Nor is it even guaranteed to make the data load: if the column is
unique-constrained and the removal of the non-UTF-8 characters makes two
rows have the same data where they didn't before...

The way to preserve the data is to switch the column to bytea.



Re: [HACKERS] Reducing relation locking overhead

2005-12-02 Thread Gregory Maxwell
On 02 Dec 2005 15:25:58 -0500, Greg Stark [EMAIL PROTECTED] wrote:
 I suspect this comes out of a very different storage model from Postgres's.

 Postgres would have no trouble building an index of the existing data using
 only shared locks. The problem is that any newly inserted (or updated) records
 could be missing from such an index.

 To do it you would then have to gather up all those newly inserted records.
 And of course while you're doing that new records could be inserted. And so
 on. There's no guarantee it would ever finish, though I suppose you could
 detect the situation if the size of the new batch wasn't converging to 0 and
 throw an error.

After you're mostly caught up, change the locking behavior to block
further updates while the final catch-up happens. This could be driven
by a heuristic that says: make up to N attempts to catch up without
blocking; after that, just take a lock and finish the job. Presumably
the catch-up would be short compared to the rest of the work.

Are there environments which could not tolerate even this minimal hit?
Probably, which leaves the choice of telling them 'don't reindex then'
or providing a knob which would tell it to never block (it would just try
N times and then give up, failing the reindex).
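
Very roughly, the loop I have in mind (a toy Python sketch against a made-up
in-memory change log; none of these names correspond to anything in the
backend, and a "never block" knob would simply skip the final locked pass and
fail instead):

  import threading

  table_lock = threading.Lock()   # stands in for blocking further updates
  change_log = []                 # append-only log of (key, value) changes

  def concurrent_reindex(max_nonblocking_passes=3):
      index, seen = {}, 0
      # Initial build plus catch-up passes, all without blocking writers.
      for _ in range(max_nonblocking_passes):
          delta = change_log[seen:]
          if not delta:
              break                             # looks caught up already
          for key, value in delta:
              index[value] = key
          seen += len(delta)
      # Final, hopefully tiny pass with writers blocked so nothing slips in.
      with table_lock:
          for key, value in change_log[seen:]:
              index[value] = key
      return index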



Re: [HACKERS] generalizing the planner knobs

2005-12-02 Thread Gregory Maxwell
On 02 Dec 2005 15:49:02 -0500, Greg Stark [EMAIL PROTECTED] wrote:
 Rod Taylor [EMAIL PROTECTED] writes:
  The missing capability in this case is to be able to provide or generate
  (self learning?) statistics for a function that describe a typical result
  and the cost of getting that result.

 Ok, try WHERE radius_authenticate(user, (select ...), ?)

 The point is that you can improve the estimates the planner gets. But you can
 never make them omniscient. There will always be cases where the user knows
 his data more than the planner. And those hints are still valid when a new
 optimizer has new plans available.

Actually...  If a statistics engine stores the entire query as well
and uses that as a key, I don't see why it couldn't figure this out.
I.e., in queries that look like Z, operation X has historically had
selectivity Y.

The instruction to the user is then simple: 'make sure that queries
with different results look different'. This is often naturally the
case.

The challenge becomes how do you group together queries which are
mostly the same so that you get enough data, but not falsely cluster
queries with different statistics.

The simplest way: check the statistics list for the most similar query
match, and use that information. If the result is similar to what was
expected, use it to update the statistics record. If the measured
selectivity is too different, make a new record, which will then attract
similar queries.

Sounds like a good research project for someone.
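
A toy sketch of that matching-and-update loop (Python; the similarity measure,
thresholds, and update rule are arbitrary placeholders with no relation to the
real planner):

  import difflib

  class SelectivityMemory:
      def __init__(self, similarity=0.9, tolerance=0.5):
          self.records = {}             # query text -> running selectivity
          self.similarity = similarity
          self.tolerance = tolerance

      def _closest(self, query):
          hits = difflib.get_close_matches(query, self.records, n=1,
                                           cutoff=self.similarity)
          return hits[0] if hits else None

      def estimate(self, query):
          match = self._closest(query)
          return self.records[match] if match else None

      def observe(self, query, measured):
          match = self._closest(query)
          if match is not None:
              prior = self.records[match]
              # Close enough: fold the observation into the existing record.
              if abs(measured - prior) <= self.tolerance * max(prior, 1e-9):
                  self.records[match] = 0.9 * prior + 0.1 * measured
                  return
          # Too different, or nothing similar yet: start a new record that
          # will attract similar queries from now on.
          self.records[query] = measured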



Re: [HACKERS] generalizing the planner knobs

2005-12-01 Thread Gregory Maxwell
On 12/1/05, Pollard, Mike [EMAIL PROTECTED] wrote:
 Optimizer hints were added because some databases just don't have a very
 smart optimizer.  But you are much better served tracking down cases in
 which the optimizer makes a bad choice, and teaching the optimizer how
 to make a better one.  That way, all users get the benefit of the fix.
 Remember, the purpose of SQL is to isolate the end user from having to
 care about how the data is retrieved; that is the RDBMS' problem.  (the
 other thing forgotten was that it was supposed to be a natural language.
 NVL.  Bah.)

The flipside there is that a good set of hinting options  may increase
the amount of detailed feedback we get from users on improvements
needed in the optimizer.  The current knobs are pretty blunt and don't
do as much as I'd like when trying to track down exactly where the
optimiser has gone wrong.

If we'd really like to avoid people using the knobs to rig queries,
how about making them only work with EXPLAIN ANALYZE: useful for
debugging but not so useful for actual queries.



Re: [HACKERS] Improving count(*)

2005-11-21 Thread Gregory Maxwell
On 11/21/05, Jim C. Nasby [EMAIL PROTECTED] wrote:
 What about Greg Stark's idea of combining Simon's idea of storing
 per-heap-block xmin/xmax with using that information in an index scan?
 ISTM that's the best of everything that's been presented: it allows for
 faster index scans without adding a lot of visibility overhead to the
 index heap, and it also allows VACUUM to hit only pages that need
 vacuuming. Presumably this could also be used as the on-disk backing for
 the FSM, or it could potentially replace the FSM.

This should be a big win all around, especially now that in-memory
bitmaps make it more likely that some classes of queries will be pure
index scans.  I still think it would be useful to have an estimated_count()
which switches to whatever method is needed to get a reasonably
accurate count quickly (stats when there are no WHERE clauses we can't
predict, sampling otherwise if the involved tables are large, and a
normal count in other cases).



Re: [HACKERS] Improving count(*)

2005-11-18 Thread Gregory Maxwell
On 11/18/05, Merlin Moncure [EMAIL PROTECTED] wrote:
  In Sybase ASE (and I'm pretty sure the same is true in Microsoft SQL
  Server) the leaf level of the narrowest index on the table is scanned,
  following a linked list of leaf pages.  Leaf pages can be pretty dense
  under Sybase, because they do use prefix compression.  A count(*)
  on a table with 100 million rows is going to take a few minutes, but
 it
  is going to be at least an order of magnitude faster than a data page
  scan -- maybe two orders of magnitude faster.

 MS SQL server (pre 2005) is not an MVCC database, so it's not apples to
 apples with pg.  Many of the people who wander on this list and complain
 about count(*) either come from one of those or some other non-MVCC
 database or worse, a flat-file xbase type system.  A performance
 comparison between MS 2005 and pg would be much more interesting.
 Personally, I don't know what all the fuss is about [although I wouldn't
 complain about an optimization ;)].

count(*) WHERE 1 is indeed a corner case that few to no real
applications should care about... If we had to choose between
improving that case and preserving the performance and maintainability
of PG then I think the discussion would already be over.

However, some great ideas have been proposed here which would not only
help in that case but would otherwise be quite useful.

* Inclusion of an 'MVCC in-flight' bit in indexes, which would allow
skipping MVCC checks in clumps of an index scan which have no pending
changes. This would further close the performance gap between PG and
non-MVCC databases for some workloads.
* Introduction of high-performance table sampling, which would be
useful in many applications (including counting where there is a WHERE
clause) as well as for testing and ad hoc queries.
* An estimate_count() that provides the planner estimate, which would
return right away and provide what is really needed most of the time
people try to count(*) on a large table.

So, while this is a silly case to optimize for, it's one where it
appears that the proposed solutions will make PG better all around.



Re: Réf. : RE: [HACKERS] Running PostGre on DVD

2005-11-15 Thread Gregory Maxwell
On 11/15/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 I don't understand why an user can't WILLINGLY (by EXPLICITLY setting an
 OPTION) allow a privileged administrator to run PostGre.
 It is a MAJOR problem for me, that will force me to use another database
 because my database will be on a DVD and I'm not sure that on the PC on
 which it will be executed, the user isn't an admin or that I can create an
 unprivileged user. To resume, I don't want my user to be unable to run my
 application for that.
 The persons specifying this option would know perfectly well the risks
 linked to it.
 I'm starting to think the PostGre developpers think the users are children.
 I'm deeply disappointed to be forced to compile my own PostGre and I will
 not.

You can do it.
Modify the source, it's a one line change.  Be grateful that you have
this privilege that you would lack with a proprietary database.

Running as an administrator isn't a matter of taste, it's
fundamentally broken from a security perspective. Just as you are
(usually) asked to jump through hoops to break the normal promises
that the database provides, you will be asked to do so on this one.

If you are unable to make a one line change to the source and rebuild
the application then you probably are unable to understand the
security implications of your decision. I wouldn't call this treating
you like a child, I'd call this expecting you to be an adult.



Re: [HACKERS] MERGE vs REPLACE

2005-11-13 Thread Gregory Maxwell
On 11/13/05, Robert Treat [EMAIL PROTECTED] wrote:
 On Saturday 12 November 2005 04:06, Matteo Beccati wrote:
  | 1 |1 | NULL |
 Wow, that seems ugly maybe there's a reason for it, but I'm not sure we
 could deviate from my$ql's behavior on this even if we wanted... they are the
 standard here.

I don't think that's ugly; I think that's working exactly as
advertised. REPLACE behaves exactly like deleting the record with the
matching primary key and inserting the provided input ... not merging
the old data together with the new.



Re: [HACKERS] SIGSEGV taken on 8.1 during dump/reload

2005-11-09 Thread Gregory Maxwell
On 11/8/05, Tom Lane [EMAIL PROTECTED] wrote:
 Teodor Sigaev [EMAIL PROTECTED] writes:
  Layout of GIST_SPLITVEC struct has been changed from 8.0, I'm afraid that 
  old
  .so is used.  spl_(right|left)valid fields was added to GIST_SPLITVEC.

 Does look a bit suspicious ... Robert, are you *sure* you've got the
 right version of pgsphere linked in?  Did you compile it against the
 right set of Postgres header files?

So it turned out that he didn't... Is this a sign that we need to
include a versioning symbol in SOs so we can give a nice clear error
message: "module foo compiled for PostgreSQL 8.0.2, this is PostgreSQL
8.1"? Is there ever a case where we want people using modules compiled
against an old version? Are there cases where users can't recompile
their modules but the old ones would work?



Re: [HACKERS] Interval aggregate regression failure (expected seems

2005-11-07 Thread Gregory Maxwell
On 07 Nov 2005 14:22:37 -0500, Greg Stark [EMAIL PROTECTED] wrote:
 IIRC, floating point registers are actually longer than a double so if the
 entire calculation is done in registers and then the result rounded off to
 store in memory it may get the right answer. Whereas if it loses the extra
 bits on the intermediate values (the infinite repeating fractions) that might
 be where you get the imprecise results.

Hm. I thought -march=pentium4 -mcpu=pentium4 implied -mfpmath=sse.
SSE is a much better choice on P4 for performance reasons, and never
has excess precision. I'm guessing from the above that I'm incorrect,
in which case we should always compile with -mfpmath=sse -msse2
when we are compiling with -march=pentium4; this should remove problems
caused by excess precision. The same behavior can be had on non-SSE
platforms with -ffloat-store.



Re: [HACKERS] Reducing the overhead of NUMERIC data

2005-11-04 Thread Gregory Maxwell
On 11/4/05, Martijn van Oosterhout kleptog@svana.org wrote:
 Yeah, and while one way of removing that dependance is to use ICU, that
 library wants everything in UTF-16. So we replace copying to add NULL
 to string with converting UTF-8 to UTF-16 on each call. Ugh! The
 argument for UTF-16 is that if you're using a language that doesn't use
 ASCII at all, UTF-8 gets inefficient pretty quickly.

Is this really the case? Only Unicode values U+0800 through U+FFFF are
smaller in UTF-16 than in UTF-8, and in their case it's three bytes vs
two. Cyrillic, Arabic, Greek, Latin, etc. are all two bytes in both.

So, yes, in some cases UTF-8 will use three bytes where UTF-16 would be
two, but that's a smaller penalty than the one UTF-16 imposes on ASCII,
which many people find acceptable.
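
A quick check of those byte counts (Python; the sample strings are arbitrary):

  samples = {
      "ASCII":           "hello",
      "Cyrillic":        "привет",
      "Greek":           "γειά",
      "CJK (>= U+0800)": "你好",
  }
  for name, text in samples.items():
      print(name, len(text.encode("utf-8")), len(text.encode("utf-16-le")))
  # ASCII doubles in UTF-16; Cyrillic and Greek are the same size in both;
  # only the CJK sample is smaller in UTF-16 (2 bytes vs 3 per character).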

 Locale sensitive, efficient storage, fast comparisons, pick any two!

I don't know that the choices are that limited. As I indicated earlier
in the thread, I think it's useful to think of all of these encodings
as just different compression algorithms. If our desire was to have
all three, the backend could be made safe for embedded NULs and we could
use the locale-sensitive and fast representation (probably UTF-16 or
UTF-32) in memory, and store on disk whatever is most efficient for
storage (LZ-compressed UTF-whatever for fat fields, UTF-8 for
mostly-ASCII small fields, SCSU for non-ASCII short fields
(http://www.unicode.org/reports/tr6/), etc.).

 My guess is that in the long run there would be two basic string
 datatypes, one UTF-8, null terminated string used in the backend code
 as a standard C string, default collation strcmp. The other UTF-16 for
 user data that wants to be able to collate in a locale dependant way.

So if we need locale-dependent collation we suffer 2x inflation for
many texts, and multibyte complexity is still required if we are to
collate correctly when there are characters outside of the BMP. Yuck.

Disk storage type, memory storage type, user API type, and collation
should be decoupled.



Re: [HACKERS] Reducing the overhead of NUMERIC data

2005-11-04 Thread Gregory Maxwell
On 11/4/05, Tom Lane [EMAIL PROTECTED] wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  Yeah, and while one way of removing that dependance is to use ICU, that
  library wants everything in UTF-16.

 Really?  Can't it do UCS4 (UTF-32)?  There's a nontrivial population
 of our users that isn't satisfied with UTF-16 anyway, so if that really
 is a restriction of ICU then we're going to have to look elsewhere :-(

The correct question to ask is something like "Does it support non-BMP
characters?" or "Does it really support UTF-16, or just UCS-2?"

UTF-16 is (now) a variable-width encoding, a strict superset of UCS-2,
which allows the representation of all Unicode characters. UCS-2 is
fixed-width and only supports characters from the Basic Multilingual
Plane.  UTF-32 and UCS-4 are (now) effectively the same thing and can
represent all Unicode characters with a fixed-length 4-byte word.

The code can demand UTF-16 but still be fine for non-BMP characters.
However, many things which claim to support UTF-16 really only support
UCS-2, or at least have bugs in their handling of non-BMP characters.
Software that supports UTF-8 is somewhat more likely to support
non-BMP characters correctly, since the variable-length code paths get
more of a workout in many environments. :)
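
A quick way to see the difference (Python; U+1D11E is just a convenient
non-BMP character):

  ch = "\U0001D11E"               # MUSICAL SYMBOL G CLEF, outside the BMP
  utf16 = ch.encode("utf-16-le")
  print(len(utf16), utf16.hex())  # 4 '34d81edd': the D834/DD1E surrogate pair
  print(len(ch.encode("utf-8")))  # 4 bytes in UTF-8 as well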



Re: [HACKERS] Reducing the overhead of NUMERIC data

2005-11-04 Thread Gregory Maxwell
On 11/4/05, Martijn van Oosterhout kleptog@svana.org wrote:
[snip]
 : ICU does not use UCS-2. UCS-2 is a subset of UTF-16. UCS-2 does not
 : support surrogates, and UTF-16 does support surrogates. This means
 : that UCS-2 only supports UTF-16's Base Multilingual Plane (BMP). The
 : notion of UCS-2 is deprecated and dead. Unicode 2.0 in 1996 changed
 : its default encoding to UTF-16.
 snip

This means it's fine.. ICU's use of UTF-16 will not break our support
for all of Unicode. Conversion to and from UTF-16 isn't cheap,
however, if you're doing it all the time. Storing ASCII in UTF-16 is
pretty lame. Widespread use of UTF-16 tends to hide bugs in the
handling of non-BMP characters. ...  I would be somewhat surprised to
see a substantial performance difference in working with UTF-16 data
over UTF-8, but then again ... they'd know and I wouldn't.

Another lame aspect of using Unicode encodings other than UTF-8
internally is that it's harder to figure out what is text in GDB
output and the like.. it can make debugging more difficult.



Re: [HACKERS] Reducing the overhead of NUMERIC data

2005-11-03 Thread Gregory Maxwell
On 11/3/05, Martijn van Oosterhout kleptog@svana.org wrote:
 That's called UTF-16 and is currently not supported by PostgreSQL at
 all. That may change, since the locale library ICU requires UTF-16 for
 everything.

UTF-16 doesn't get us out of the variable-length character game; for
that we need UTF-32... unless we were to only support UCS-2, which is
what some databases do for their Unicode support.  I think that would
be a huge step back, and as you pointed out below, it is not efficient.
:)

 The question is, if someone declares a field CHAR(20), do they really
 mean to fix 40 bytes of storage for each and every row? I doubt it,
 that's even more wasteful of space than a varlena header.

 Which puts you right back to variable length fields.

Another way to look at this is in the context of compression: with
Unicode, characters are really 32-bit values... but only a small range
of these values is common.  So we store and work with them in a
compressed format, UTF-8.

The cost of compression is that fixed-width fields can not be fixed
width, and some operations are much more expensive than they would
be otherwise.

As such it might be more interesting to ask some other questions, like:
are we using the best compression algorithm for the application, and
why do we sometimes stack two compression algorithms? For longer
fields would we be better off working with UTF-32 and being more
aggressive about where we LZ-compress the fields?

  I dunno... no opinion on the matter here, but I did want to point out
  that the field can be fixed length without a header. Those proposing such
  a change, however, should accept that this may result in an overall
  expense.

 The only time this may be useful is for *very* short fields, in the
 order of 4 characters or less. Else the overhead swamps the varlena
 header...

Not even 4 characters if we are to support all of Unicode... Length +
UTF-8 is a win vs. UTF-32 in most cases for fields with more than one
character.



Re: slru.c race condition (was Re: [HACKERS] TRAP: FailedAssertion(!((itemid)-lp_flags

2005-10-31 Thread Gregory Maxwell
On 10/31/05, Jim C. Nasby [EMAIL PROTECTED] wrote:
 On Mon, Oct 31, 2005 at 01:34:17PM -0500, Bruce Momjian wrote:
  There is no way if the system has some incorrect value whether that
  would later corrupt the data or not.  Anything the system does that it
  shouldn't do is a potential corruption problem.
 But is it safe to say that there are areas where a failed assert is far
 more likely to result in data corruption? And that there's also areas
 where there's likely to be difficult/impossible to find bugs, such as
 race conditions? ISTM that it would be valuable to do some additional
 checking in these critical areas.

There are, no doubt, also places where an assert has minimal to no
performance impact. I'd wager a guess that the intersection of
low-impact asserts and asserts which check high-risk activities is
small enough to be uninteresting.



Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-30 Thread Gregory Maxwell
On 10/26/05, Christopher Kings-Lynne [EMAIL PROTECTED] wrote:
iconv -c -f UTF8 -t UTF8
 recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql

I've got a file with characters that pg won't accept that recode does
not fix but iconv does. Iconv is fine for my application, so I'm just
posting to the list so that anyone looking for why recode didn't work
for them will find the suggestion to use iconv.



Re: [HACKERS] enums

2005-10-28 Thread Gregory Maxwell
On 10/27/05, Andrew Dunstan [EMAIL PROTECTED] wrote:
 Yes, MySQL is broken in some regards, as usual. However, the API isn't
 bad (except for the fact that it doesn't care what invalid crap you
 throw at it), and more importantly there are thousands of apps and
 developers who think around that interface. We should copy it without
 the brokenness as much as possible unless we have good cause
 otherwise.
 

 mmm ... no. It is too broken. We should do enums orthogonally with other
 type definitions in PostgreSQL. Where I would like to get to is that we
 have a flavor of CREATE TYPE that will create the enum type for us,
 including all the support that I build into my little kit. And if you
 want to change the enumeration set on a column, you would use ALTER
 TABLE foo ALTER COLUMN  bar  TYPE newtype USING ...

Eh, well, that we have a reasonable user-extensible type system is
reason enough.  What I was mostly objecting to was the use of
lexical collation; the 'don't mess with what people already expect'
argument was just the most handy strawman available. :)

And in doing so you could insert an enum value in the middle of the existing
list without breaking the values already in the table?  If so, that
would be very useful.

 Inline declarations of enums does not strike me as good.

You're right, it's a property of a type.



Re: [HACKERS] enums

2005-10-27 Thread Gregory Maxwell
On 10/27/05, Jim Nasby [EMAIL PROTECTED] wrote:
 Adding -hackers back to the list...
  You could as equally say that it's ordering it by the order of the
  enum declaration, which seems quite reasonable to me.

 I don't really see why that's considered reasonable, especially as a default. 
 I could maybe see an argument for having a means to see this ordering, but 
 IMO anything depending on that is broken. I don't think we should be making 
 any guarantees about how enums are stored in the database (including 
 ordering).

 Your examples show why I don't think it's a good idea to use MySQL as a guide 
 for how to do enums.

Yes, MySQL is broken in some regards, as usual. However, the API isn't
bad (except for the fact that it doesn't care what invalid crap you
throw at it), and more importantly there are thousands of apps and
developers who think around that interface. We should copy it without
the brokenness as much as possible unless we have good cause
otherwise.

 If we do decide to include the concept of ordering in enums, then it should 
 be fully supported and not just an artifact of our storage mechanism. This 
 means supporting things like being able to re-order the accepted values in an 
 enum. But like I said, I just don't see the use case for doing that.

So what do you propose we do for a default ordering?  I hope you don't
think we should force a sort as though the enum labels were text...
That is almost certainly incorrect for most applications of enums, which
are used to make opaque labels more human-compatible.

MySQL's behavior of allowing the user to specify the collation in the
typedef makes a lot of sense to me; it doesn't matter that it actually
works as an artifact of the storage backend. I'd argue that it would
make sense to sort by the specification order even if we changed the
backend to use varchars rather than numbers.



Re: [HACKERS] enums

2005-10-27 Thread Gregory Maxwell
On 10/27/05, Andrew Dunstan [EMAIL PROTECTED] wrote:
 That seems counter-intuitive. It's also exposing an implimentation
 detail (that the enum is stored internally as a number).

 No it is not. Not in the slightest. It is honoring the enumeration order
 defined for the type. That is the ONLY correct behaviour, IMNSHO.
 Otherwise, you could just as easily use a domain with a check constraint.

 In fact, mysql's behaviour is laughably, even ludicrously, inconsistent:
[snip]
 So for order by it honors the enumeration order, but for  it uses the
 lexical ordering. Lovely, eh?

Oh wow. That is broken; I didn't try that case because I figured it
would do it right (i.e. use the enum order).



[HACKERS] On externals sorts and other IO bottlenecks in postgresql.

2005-10-23 Thread Gregory Maxwell
I don't recall this being mentioned in the prior threads:
http://www.cs.duke.edu/TPIE/

GPLed, but perhaps it has some good ideas.



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Gregory Maxwell
On 10/3/05, Ron Peacetree [EMAIL PROTECTED] wrote:
[snip]
 Just how bad is this CPU bound condition?  How powerful a CPU is
 needed to attain a DB IO rate of 25MBps?

 If we replace said CPU with one 2x, 10x, etc faster than that, do we
 see any performance increase?

 If a modest CPU can drive a DB IO rate of 25MBps, but that rate
 does not go up regardless of how much extra CPU we throw at
 it...

Single-threaded was mentioned.
Plus, even if it's purely CPU-bound, it's seldom as trivial as throwing
CPU at it; consider the locking in the application, in the
filesystem, and elsewhere in the kernel.



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Gregory Maxwell
On 9/30/05, Ron Peacetree [EMAIL PROTECTED] wrote:
 4= I'm sure we are paying all sorts of nasty overhead for essentially
 emulating the pg filesystem inside another filesystem.  That means
 ~2x as much overhead to access a particular piece of data.

 The simplest solution is for us to implement a new VFS compatible
 filesystem tuned to exactly our needs: pgfs.

 We may be able to avoid that by some amount of hacking or
 modifying of the current FSs we use, but I suspect it would be more
 work for less ROI.

On this point, the Reiser4 fs already implements a number of things
which would be desirable for PostgreSQL. For example, write()s to
reiser4 filesystems are atomic, so there is no risk of torn pages (this
is possible because reiser4 uses WAFL-like logging, where data is not
overwritten but rather relocated). The filesystem is modular and
extensible, so it should be easy to add whatever additional semantics
are needed. I would imagine that all that would be required is some
more atomicity operations (single writes are already atomic, but I'm
sure it would be useful to batch many writes into a transaction), some
layout and packing controls, and some flush controls. A step further
would be to integrate multiversioning directly into the FS (the
wandering-log system already provides the write side of multiversioning;
a little read-side work would be required). More importantly, the
filesystem was intended to be extensible for exactly this sort of
application.

It might make a good 'summer of code' project for someone next year...
presumably reiser4 will have made it into the mainline kernel by then. :)

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Gregory Maxwell
On 9/28/05, Ron Peacetree [EMAIL PROTECTED] wrote:
 2= We use my method to sort two different tables.  We now have these
 very efficient representations of a specific ordering on these tables.  A
 join operation can now be done using these Btrees rather than the
 original data tables that involves less overhead than many current
 methods.

If we want to make joins very fast we should implement them using RD
trees. For the example cases where a join against a very large table
will produce a much smaller output, an RD tree will provide pretty much
the optimal behavior at a very low memory cost.

On the subject of high-speed tree code for in-core applications, you
should check out http://judy.sourceforge.net/ . The performance
(insert, remove, lookup, AND storage) is really quite impressive.
Producing cache-friendly code is harder than one might expect, and it
appears the Judy library has already done a lot of the hard work.
Though it is *L*GPLed, so perhaps that might scare some here away from
it. :) And good luck directly doing joins with an LC-trie. ;)

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-15 Thread Gregory Maxwell
On 9/15/05, Tom Lane [EMAIL PROTECTED] wrote:
 Yesterday's CVS tip:
 1 32s   2 46s   4 88s   8 168s
 plus no-cmpb and spindelay2:
 1 32s   2 48s   4 100s  8 177s
 plus just-committed code to pad LWLock to 32:
 1 33s   2 50s   4 98s   8 179s
 alter to pad to 64:
 1 33s   2 38s   4 108s  8 180s
 
 I don't know what to make of the 2-process time going down while
 4-process goes up; that seems just weird.  But both numbers are
 repeatable.

It is odd.

In the two-process case there is, assuming random behavior, a 1/2
chance that you've already got the right line, but in the four-process
case only a 1/4 chance (since we're on a 4-way box). This would
explain why we don't see as much cost in the intentionally misaligned
case. You'd expect a similar pattern of improvement with the 64-byte
alignment (some in the two-process case, but more in the four-process
case), yet here we see more improvement in the two-way case.

If I had to guess, I might say that the 64-byte alignment is removing
much of the unneeded line bouncing in the two-process case but is
at the same time creating more risk of bouncing caused by aliasing.
Since two processes have a 1/2 chance, the aliasing isn't a problem and
the change is a win; but in the four-process case it's no longer a win,
because with aliasing there is still a lot of fighting over the cache
lines even if you pack well, and the looser packing makes odd
aliasing somewhat more likely. This might also explain why the
misaligned case performed so poorly with four processes: the
misalignment didn't just increase the cost 2x, it also increased the
likelihood of a bogus bounce due to aliasing.

If this is the case, then it may be possible, through very careful
memory layout, to make sure that no two high-contention locks that are
likely to be contended at the same time end up on the same cache line,
whether through aliasing or by being packed directly into the same line.
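
To illustrate the kind of layout I mean, here's a minimal C sketch. It
assumes a 64-byte line and GCC's aligned attribute; the struct is
invented for the example and is not the real LWLock definition:

  #define CACHE_LINE_SIZE 64

  typedef union
  {
      struct
      {
          volatile int lock;      /* the spinlock word itself */
          int          nwaiters;  /* stand-in for the other per-lock fields */
      } l;
      char pad[CACHE_LINE_SIZE];  /* rounds each entry up to a full line */
  } PaddedLock;

  /* With the array itself line-aligned, locks[i] and locks[i+1] can never
     share a cache line, so a store by one CPU invalidates at most one
     lock's line in the other CPUs' caches. */
  static PaddedLock locks[128] __attribute__((aligned(CACHE_LINE_SIZE)));

  int
  main(void)
  {
      locks[0].l.lock = 1;        /* touches only lock 0's line */
      return locks[0].l.nwaiters;
  }

Going beyond this to avoid aliasing conflicts would mean choosing which
locks land at which line-sized slots, not just padding them.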

Then again, I could be completely wrong; my understanding of
multiprocessor cache coherency is very limited, and I have no clue how
cache aliasing fits into it... So the above is just uninformed
conjecture.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] pl/Ruby, deprecating plPython and Core

2005-08-16 Thread Gregory Maxwell
On 8/16/05, Joshua D. Drake [EMAIL PROTECTED] wrote:
 Sure... it hasn't been found. We can play the it might have or might
 not have game all day long but it won't get us anywhere. Today, and
 yesterday pl/Ruby can be run trusted/untrusted, pl/python can not.
  Both of these things could be said about Python when it was about the
  same age Ruby is now.
 
 But they can't be said about Python now. Again I love Python but I can't
 use it the way I want to in the database.
 
 I believe that unless plPython can either be fixed
 
 
  Fixed how ?
 
 Be able to be trusted.

Really, a lot of your points seem to rest either on the fad appeal of
Ruby or on misinformation about Python. It's silliness. The inclusion
of pl/ruby should be considered independently of pl/python; they are
separate matters. I promise that the aggregate work required for all
the coders who know Python to switch to Ruby is far, far greater than
the work required to fix the issues with pl/python. :)

I'd like to propose a more useful goal for consideration:  PostgreSQL
users should be able to use whatever language they write their
frontend in to write their PL code.

This doesn't mean it would be reasonable to include everything under
the sun in the main distro, just as Linux distros don't normally ship
Ada or Haskell compilers. But rather, any PL language which meets a
certain level of capability (and yes, I'd propose having trusted
support as one of those requirements in languages where it makes
sense) and has a sufficiently large user base that we can reasonably
expect it to be well supported should either be included in the main
distro, or at least in a side-car PostgreSQL-PL package if driven
there by licensing concerns.

Obviously there are costs in maintaining many PLs, but at the same
time it doesn't really make sense to say that we're going to include
PL/bar and PL/baz but not PL/foo if all of them have comparable
abilities and user bases.

I see two rational paths: 1) support only one PL (or perhaps two,
where one is C and the other is something with trusted support) and
claim that developers need to learn this PL in addition to whatever
they write their frontends in; or 2) support a wealth of PLs with the
intention of allowing developers to use the same language for their
frontends as for their database PL code. Any other position creates
silly arguments, like replacing PL/Python with PL/Ruby.

In terms of PostgreSQL's competitiveness in the marketplace of
databases, my position would serve us well: other databases will have
a more difficult time providing broad PL support, since PG already has
a good head start there, and the joe-random application developer who
doesn't care which database he uses will feel a lot more comfortable
knowing he can use the language he's already comfortable with for both
the front end and the back end.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] pl/Ruby, deprecating plPython and Core

2005-08-16 Thread Gregory Maxwell
On 8/16/05, David Fetter [EMAIL PROTECTED] wrote:
 It's not.  In PL/parlance, trusted means prevented from ever
 opening a filehandle or a socket, and PL/PythonU is called
 PL/Python*U* (U for *un*trusted) because it cannot be so prevented.
 
 If somebody has figured out a way to make a PL/Python (without the U),
 that's great, but nothing has happened on this front in a couple of
 years, and Guido said that it was a problem with the language that he
 wasn't going to fix.

It's not a problem in the *language*; it's a problem in the
implementation. There are other implementations of Python, including
one that runs inside the Java VM.

It's also a problem that could be worked around with the existing
Python implementation by completely sandboxing the process running the
Python code (e.g. via seccomp on Linux).
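
As a rough sketch of that approach (assuming a kernel where seccomp's
strict mode is reachable through prctl(); everything around the prctl()
call here is hypothetical scaffolding), the sandboxed helper process
would do something like:

  #include <stdio.h>
  #include <sys/prctl.h>
  #include <linux/seccomp.h>

  int
  main(void)
  {
      /* Set up pipes back to the backend and load the untrusted code
         here, before locking the process down. */

      if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0)
      {
          perror("prctl(PR_SET_SECCOMP)");
          return 1;
      }

      /* From this point on the kernel allows only read(), write(),
         _exit() and sigreturn(); any attempt to open a file or a socket
         kills the process. Run the interpreter loop here. */
      return 0;
  }

The interpreter then talks to the backend only over the already-open
pipes, which is the whole point of the exercise.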

Yes, it's a problem, and yes, it should be fixed. But it is BS to claim
that Python fundamentally has a problem and needs to be removed
because of it, just as much as it would be BS to claim that Ruby
should be forbidden because it permits the same sort of unmaintainable
syntax that has plagued Perl for years. :)

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Gregory Maxwell
On 6/23/05, Gavin Sherry [EMAIL PROTECTED] wrote:

  inertia) but seeking to a lot of new tracks to write randomly-positioned
  dirty sectors would require significant energy that just ain't there
  once the power drops.  I seem to recall reading that the seek actuators
  eat the largest share of power in a running drive...
 
 I've seen discussion about disks behaving this way. There's no magic:
 they're battery backed.

Nah, this isn't always the case. For example, some of the IBM Deskstars
had a few tracks at the start of the disk reserved: if the power
failed, the head retracted all the way and the drive used the
rotational energy to power itself long enough to write out the cache.
At startup the drive would read it back in and finish flushing it.

Unfortunately, firmware bugs meant it didn't always wait until the
head had returned to the start before it began writing...

I'm not sure which other drives do this (er, well, do it correctly :) ).

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] LGPL

2005-06-17 Thread Gregory Maxwell
On 6/18/05, Tom Lane [EMAIL PROTECTED] wrote:
 What is important is that it is possible, and useful, to build Postgres
 in a completely non-GPL environment.  If that were not so then I think
 we'd have some license issues.  But the fact that building PG in a
 GPL-ized environment creates a GPL-ized binary is not a problem from my
 point of view.  You've already bought into the GPL if you're using that
 environment.

Put another way: linking to a GPLed library creates a GPLed result,
but being GPLed is completely and totally irrelevant to *users*,
because the GPL places no restrictions on use whatsoever.

... But is it really the case that PostgreSQL developers are being
paid to code because PG is BSDed and proprietary forks are possible?
... There is no harm in being BSDed, but I question whether the users
of PostgreSQL gain enough advantage to justify so much paranoia about
making sure the code is as easy as possible to make proprietary forks
of...

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] User/Group Quotas Revisited

2005-06-11 Thread Gregory Maxwell
 - Who has permissions to set the user's quota per tablespace, the
 superuser and the tablespace owner?

It would be nice if this were nestable; that is, if the sysadmin could
carve out a tablespace for a user, the user could then carve that up
into separately quota'd sub-tables.

The idea being that a user may have several tables, some of which are
likely to get big and fat and accumulate lots of crud, while others
will never grow very large but really shouldn't fail just because
someone flooded the other table and used up the quota. It would be
nice if the user could manage that subassignment as he saw fit,
without assistance from the admin.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[HACKERS] Bloom Filter indexes?

2005-05-28 Thread Gregory Maxwell
Has any thought been given to adding bloom filter indexes to PostgreSQL?

A bloom index would be created on a column, and could then be used to
accelerate exact matches in cases where users commonly query for a
value that doesn't exist. For example, with the query select userid
from user_table where name='notauser', the miss could be reported
instantly in most cases.

A bloom filter index could also be used to accelerate joins, especially
full outer joins.

Insertions into a bloom filter are very cheap. Updates could be handled
as an insert of the new value. Deletes are expensive (either you keep a
counting/refcounted filter or you regenerate the filter). However,
since bloom filters only ever produce false positives, never false
negatives, it is acceptable to leave stale entries in place and simply
regenerate the filter during a vacuum if entries have been deleted or
updated. The filter could be resized at vacuum time based on statistics
gathered during execution.
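
To make the mechanics concrete, here's a minimal standalone C sketch of
the filter itself. The sizes, hash function, and names are invented for
illustration and say nothing about how an actual index access method
would store it:

  #include <stdint.h>

  #define FILTER_BITS (1u << 20)   /* m: bits in the filter (128 KB) */
  #define NUM_HASHES  4            /* k: hash probes per key */

  static uint8_t filter[FILTER_BITS / 8];   /* zero-initialized */

  /* FNV-1a style hash, salted per probe; any decent hash family works. */
  static uint32_t
  bloom_hash(const char *key, uint32_t seed)
  {
      uint32_t h = 2166136261u ^ seed;
      while (*key)
          h = (h ^ (uint8_t) *key++) * 16777619u;
      return h % FILTER_BITS;
  }

  static void
  bloom_insert(const char *key)
  {
      for (uint32_t i = 0; i < NUM_HASHES; i++)
      {
          uint32_t bit = bloom_hash(key, i);
          filter[bit / 8] |= (uint8_t) (1u << (bit % 8));
      }
  }

  /* Returns 0 only when the key is definitely absent. */
  static int
  bloom_maybe_contains(const char *key)
  {
      for (uint32_t i = 0; i < NUM_HASHES; i++)
      {
          uint32_t bit = bloom_hash(key, i);
          if (!(filter[bit / 8] & (1u << (bit % 8))))
              return 0;    /* definite miss: no need to touch the heap */
      }
      return 1;            /* maybe present: do the real lookup */
  }

  int
  main(void)
  {
      bloom_insert("gmaxwell");
      /* "gmaxwell" is reported as maybe-present; "notauser" is almost
         certainly reported as a definite miss at these sizes. */
      return !(bloom_maybe_contains("gmaxwell") && !bloom_maybe_contains("notauser"));
  }

A lookup that finds any probed bit clear can answer "definitely absent"
without touching the heap at all; only "maybe present" answers fall
through to the real lookup.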

It would also be useful to have an array bloom index: store a bloom
filter per record for an arrayed field, as well as the bloom filter
for all records. This would allow membership tests for a field
containing large arrays to happen very quickly. Perhaps useful for GIS
and full text indexing applications.

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

