Re: [HACKERS] Column storage positions

2007-04-01 Thread Phil Currier

On 4/1/07, Guillaume Smet [EMAIL PROTECTED] wrote:

Phil, did you make any progress with your patch?  Your results seemed
very encouraging and your implementation interesting.
IIRC, the problem was that you weren't interested in working on the
visual/mysqlish column ordering. As the plan was to decouple column
ordering in three different orderings, I don't think it's really a
problem if your implementation doesn't support one of them (at least
if it doesn't prevent us from having the visual one someday).


I haven't done much with it since February, largely because my
available free time evaporated.  But I do intend to get back to it
when I have a chance.  But you're right, the storage position stuff
I've worked on is completely independent from display positions, and
certainly wouldn't prevent that being added separately.


Is there any chance you keep us posted with your progress and post a
preliminary patch exposing your design choices? This could allow other
people to see if there are interesting results with their particular
database and workload.


Yeah, I'll try to clean things up and post a patch eventually.  And if
anyone feels like working on the display position piece, let me know;
perhaps we could pool our efforts for 8.4.

phil

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Column storage positions

2007-02-22 Thread Phil Currier

On 2/22/07, Tom Lane [EMAIL PROTECTED] wrote:

Andrew Dunstan [EMAIL PROTECTED] writes:
 Alvaro Herrera wrote:
 Right, I'm not advocating not doing that -- I'm just saying that the
 first step to that could be decoupling physical position with attr id
 :-) Logical column ordering (the order in which SELECT * expands to)
 seems to me to be a different feature.

 Except in the sense that divorcing the id from the storage order makes
 it possible to do sanely. :-)

They are different features, but they are going to hit all the same
code, because the hardest part of this remains making sure that every
piece of the code is using the right kind of column number.  The
suggestion I posted awhile ago amounts to saying that we might be able
to solve that by default, by making sure that only one definition of
column number is relevant to the majority of the backend and we can
figure out exactly where the other definitions need to apply.  But
that's handwaving until someone actually does it :-(


I don't really think it's just handwaving at this point because I've
done a lot of it :).  I'm not saying the work is done, or that a lot
more testing isn't required, but at the moment I have a working system
that seems to do what it needs to do to separate storage position from
permanent ID/display position.  And the changes to accomplish this
were quite localized - namely the tuple access routines in
heaptuple.c, and the small handful of places that need to construct
tuple descriptors.  That's pretty much it - the rest of the codebase
is remains untouched.


In any case I think it's foolish not to tackle both issues at once.
We know we'd like to have both features and we know that all the same
bits of code need to be looked at to implement either.


I guess I disagree with that sentiment.  I don't think it's necessary
to bundle these two features together, even if some analysis will be
duplicated between them, since they are completely distinct in a
functional sense and will touch different places in the code.
Smaller, more incremental changes make more sense to me.

But if both-features-at-once is what the community wants, that's fine,
no worries.  I'll just pull my own personal hat out of the ring until
someone comes along who's interested in implementing them both at the
same time.

phil

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Column storage positions

2007-02-21 Thread Phil Currier

On 2/21/07, Alvaro Herrera [EMAIL PROTECTED] wrote:

I'd expect the system being able to reoder the columns to the most
efficient order possible (performance-wise and padding-saving-wise),
automatically.  When you create a table, sort the columns to the most
efficient order; ALTER TABLE ADD COLUMN just puts the new columns at the
end of the tuple; and anything that requires a rewrite of the table
(ALTER TABLE ... ALTER TYPE for example; would be cool to have CLUSTER
do it as well; and do it on TRUNCATE also) again recomputes the most
efficient order.


That's exactly what I'm proposing.  On table creation, the system
chooses an efficient column order for you.  The next time an ALTER
TABLE operation forces a rewrite, the system would recompute the
column storage order.  I hadn't thought of having CLUSTER also redo
the storage order, but that seems safe since it takes an exclusive
lock on the table.  I'm less sure about whether it's safe to do this
during a TRUNCATE.

phil

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Column storage positions

2007-02-21 Thread Phil Currier

On 2/21/07, Bruce Momjian [EMAIL PROTECTED] wrote:

Keep in mind we have a patch in process to reduce the varlena length and
reduce alignment requirements, so once that is in, reordering columns
will not be as important.


Well, as I understand it, that patch isn't really addressing the same
problem.  Consider this table:
create table foo (a varchar(10), b int, c smallint, d int, e smallint, );

There are two problems here:

1) On my machine, each int/smallint column pair takes up 8 bytes.  2
of those 8 bytes are alignment padding wasted on the smallint field.
If we grouped all the smallint fields together within the tuple, that
space would not be lost.

2) Each time you access any of the int/smallint fields, you have to
peek inside the varchar field to figure out its length.  If we stored
the varchar field at the end of the tuple instead, the access times
for all the other fields would be measurably improved, by a factor
that greatly outweighs the small penalty imposed on the varchar field
itself.

My understanding is that the varlena headers patch would potentially
reduce the size of the varchar header (which is definitely worthwhile
by itself), but it wouldn't help much for either of these problems.
Or am I misunderstanding what that patch does?

phil

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Column storage positions

2007-02-21 Thread Phil Currier

On 2/21/07, Martijn van Oosterhout kleptog@svana.org wrote:

On Wed, Feb 21, 2007 at 03:59:12PM +0100, Florian G. Pflug wrote:
 I think you'd want to have a flag per field that tell you if the user
 has overridden the storage pos for that specific field. Otherwise,
 the next time you have to chance to optimize the ordering, you might
 throw away changes that the admin has done on purpose.

Why would you want to let the admin have any say at all about the
storage order?


Well, for two reasons:

1) If you have a table with one very-frequently-accessed varchar()
column and several not-frequently-accessed int columns, it might
actually make sense to put the varchar column first.  The system won't
always be able to make the most intelligent decision about table
layout.

2) As I described in my original email, without this capability, I
don't see any good way to perform an upgrade between PG versions
without rewriting each table's data.  Maybe most people aren't doing
upgrades like this right now, but it seems like it will only become
more common in the future.  In my opinion, this is more important than
#1.

But I understand that it's a potential foot-gun, so I'm happy to drop
it.  It would be nice though if there were some ideas about how to
address problem #2 at least.

phil

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Column storage positions

2007-02-21 Thread Phil Currier

On 2/21/07, Martijn van Oosterhout kleptog@svana.org wrote:

 don't see any good way to perform an upgrade between PG versions
 without rewriting each table's data.  Maybe most people aren't doing
 upgrades like this right now, but it seems like it will only become
 more common in the future.  In my opinion, this is more important than
 #1.

I don't see this either. For all current tables, the storage position
is the attribute number, no exception. You say:

 because the version X table could
 have dropped columns that might or might not be present in any given
 tuple on disk.

Whether they're there or not is irrelevent. Drop columns are not
necesarily empty, but in any case they occupy a storage position until
the table is rewritten. A dump/restore doesn't need to preserve this,
but pg_migrator will need some smarts to handle it. The system will
need to create a column of the appropriate type and drop it to get to
the right state.


I agree, a dump/restore that rewrites all the table datafiles doesn't
need to handle this.  And I agree that the system will need to create
dropped columns and then drop them again, that's exactly what I
suggested in fact.  We're talking about pg_migrator-style upgrades
only here.

Say we do this in 8.2:

create table foo (a varchar(10), b int);
insert into foo 
alter table foo add column c int;

At this point, the column storage order is (a, b, c) because 8.2 never
changes storage order.  Then you upgrade to 8.3.  pg_dump now wants to
write out some DDL that will create a table matching the existing
table datafile, since we don't want to have to rewrite it.  pg_dump
prints out:

create table foo (a varchar(10), b int, c int);

The 8.3 system will try to create the table with column order (b, c,
a), since it's trying to optimize storage order, and that won't match
the existing table datafile.  What we need is a way to make sure that
the table matches the original datafile.

Now say that it's not an 8.2 - 8.3 upgrade, say it's an 8.3 - 8.4
upgrade.  In this case, 8.3 would have the table with storage order
(b, a, c).  (Column c would have been added at the end since it was
added without a default, and didn't force a table rewrite.)  How do
you get pg_dump to print out table creation DDL that will result in a
table matching the existing (b, a, c) table datafile?

This is why I think pg_dump needs to be able to print an ALTER TABLE
statement that will explicitly assign storage positions.  This happens
to have the side-effect of being potentially useful to admins who
might want control over that.

If this only affected 8.2 - 8.3 upgrades, then maybe it's not as
important an issue.  But I think it affects *all* future upgrades,
which is why I'm trying to raise the issue now.



If you really want to use pg_dump I'd suggest an option to pg_dump
--dump-dropped-columns which will include the dropped columns in the
CREATE TABLE but drop them immediatly after. It's really more a corner
case than anything else.


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Column storage positions

2007-02-21 Thread Phil Currier

On 2/21/07, Gregory Stark [EMAIL PROTECTED] wrote:

So yes, there would be a use case for specifying the physical column layout
when pg_migrator is doing the pg_dump/restore. But pg_migrator could probably
just update the physical column numbers itself. It's not like updating system
catalog tables directly is any more of an abstraction violation than swapping
files out from under the database...


If people are ok with that answer, then I'll gladly stop suggesting
that ALTER TABLE be able to explicitly set storage positions.  I was
just trying to avoid forcing a tool like pg_migrator to muck with the
system catalogs.

phil

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] Column storage positions

2007-02-20 Thread Phil Currier

Inspired by this thread [1], and in particular by the idea of storing
three numbers (permanent ID, on-disk storage position, display
position) for each column, I spent a little time messing around with a
prototype implementation of column storage positions to see what kind
of difference it would make.  The results were encouraging: on a table
with 20 columns of alternating smallint and varchar(10) datatypes,
selecting the max() of one of the rightmost int columns across 1
million rows ran around 3 times faster.  The same query on the
leftmost varchar column (which should suffer the most from this
change) predictably got a little slower (about 10%); I couldn't
measure a performance drop on the rightmost varchar columns.  The
table's size didn't drop much in this case, but a different table of
20 alternating int and smallint columns showed a 20% slimmer disk
footprint, pretty much as expected.  Pgbenching showed no measurable
difference, which isn't surprising since the pgbench test tables
consist of just int values with char filler at the end.

So here is a proposal for separating a column's storage position from
its permanent ID.  I've ignored the display position piece of the
original thread because display positions don't do much other than
save you the hassle of creating a view on top of your table, while
storage positions have demonstrable, tangible benefits.  And there is
no reason to connect the two features; display positions can easily be
added separately at a later point.

We want to decouple a column's on-disk storage position from its
permanent ID for two reasons: to minimize the space lost to alignment
padding between fields, and to speed up access to individual fields.
The system will automatically assign new storage positions when a
table is created, and when a table alteration requires a rewrite
(currently just adding a column with a default, or changing a column
datatype).  To allow users to optimize tables based on the fields they
know will be frequently accessed, I think we should extend ALTER TABLE
to accept user-assigned storage positions (something like ALTER TABLE
ALTER col SET STORAGE POSITION X).  This command would also be useful
for another reason discussed below.

In my prototype, I used these rules to determine columns' storage order:
1) fixed-width fields before variable-width, dropped columns always last
2) fixed-width fields ordered by increasing size
3) not-null fields before nullable fields
There are other approaches worth considering - for example, you could
imagine swapping the priority of rules 2 and 3.  Resultant tables
would generally have more alignment waste, but would tend to have
slightly faster field access.  I'm really not sure what the optimal
strategy is since every user will have a slightly different metric for
optimal.  In any event, either of these approaches is better than
the current situation.

To implement this, we'll need a field (perhaps attstoragepos?) in
pg_attribute to hold the storage position.  It will equal attnum until
it is explicitly reassigned.  The routines in heaptuple.c need to
quickly loop through the fields of a tuple in storage order rather
than attnum order, so I propose extending TupleDesc to hold an
attrspos array that sits alongside the attrs array.  In the
prototype I used an array of int2 indices into the attrs array,
ordered by storage position.

These changes cause a problem in ExecTypeFromTLInternal: this function
calls CreateTemplateTupleDesc followed by TupleDescInitEntry, assuming
that attnum == attstoragepos for all tuples.  With the introduction of
storage positions, this of course will no longer be true.  I got
around this by having expand_targetlist, build_physical_tlist, and
build_relation_tlist make sure each TargetEntry (for targetlists
corresponding to either insert/update tuples, or base tuples pulled
straight from the heap) gets a correct resorigtbl and resname.  Then
ExecTypeFromTLInternal first tries calling a new function
TupleDescInitEntryAttr, which hands off to TupleDescInitEntry and then
performs a syscache lookup to update the storage position using the
resorigtbl.  This is a little ugly because ExecTypeFromTLInternal
doesn't know in advance what kind of tupledesc it's building, so it
needs to retreat to the old method whenever the syscache lookup fails,
but it was enough to pass the regression tests.  I could use some
advice on this - there's probably a better way to do it.

Another problem relates to upgrades.  With tools like pg_migrator now
on pgfoundry, people will eventually expect quick upgrades that don't
require rewriting each table's data.  Storage positions would cause a
problem for every version X - version Y upgrade with Y = 8.3, even
when X is also = 8.3, because a version X table could always have
been altered without a rewrite into a structure different from what
Y's CREATE TABLE will choose.  I don't think it's as simple as just
using the above-mentioned ALTER TABLE extension to assign